Methods and systems for proximity enhanced sequence assembly
By aligning sequence reads to a reference and utilizing flow cell proximity information, the method addresses the loss of connectivity in traditional sequencing, enhancing the assembly of complex genomic samples with repeat regions or multiple genomes.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ILLUMINA INC
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-25
AI Technical Summary
Traditional nucleic acid sequencing methods, particularly shotgun sequencing, lose information about the original position and connectivity of sequence fragments, complicating assembly, especially in samples with repeat regions or multiple genomes, leading to difficulties in reconstructing the original genomic sequence.
The method involves obtaining sequence reads from clusters on a flow cell, aligning them to a reference sequence, generating a linking model based on flow cell locations, determining linking information, recruiting additional reads, and scaffolding the assembly to reconstruct the genomic sequence accurately, using proximity and linking information to resolve repeat regions and misassemblies.
This approach enhances the accuracy and completeness of genomic sequence assembly by maintaining connectivity information, improving the assembly process even in complex samples with repeat regions or multiple genomes.
Description
ILLINC.860WO / / IP-2877-PCT PATENT
METHODS AND SYSTEMS FOR PROXIMITY ENHANCED SEQUENCE ASSEMBLY
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application Nos. 63/737,528 filed December 20, 2024, 63/747,850 filed January 21, 2025, and 63/857,956 filed August 5, 2025, the content of each of which is incorporated by reference in its entirety.
Field
[0001] The present disclosure relates to nucleic acid sequencing systems and methods. In particular, this disclosure relates to systems and methods for nucleic acid sequence assembly.
Description
[0002] Traditional nucleic acid sequencing methods, and several types of next-generation sequencing methods, including Sequencing by Synthesis (SBS), use a shotgun approach to sequence large genomic DNA fragments, sometimes called template genomic sequences. During SBS sequencing the template genomic sequences are first fragmented into smaller pieces that are amenable to next-generation sequencing methods on a flow cell. One of the difficulties of this approach is that by the time the smaller sequence fragments from the template genomic sequences have been sequenced, knowledge of their original position in the genome, and their connectivity and proximity to each other in the original template genomic sequence is lost. The loss of this information makes sequence assembly more complicated, especially when there are repeat regions within a genomic nucleic acid sample and/or when a genomic nucleic acid sample includes more than one genome, such as a metagenomic sample, and/or includes one or more polyploid genomes.
SUMMARY
[0003] The methods disclosed herein each have several aspects, no single one of which is solely responsible for their desirable attributes. Without limiting the scope of the claims, some prominent features will now be discussed briefly. Numerous other embodiments are also contemplated, including embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits, and advantages. The components, aspects, and steps may also be arranged and ordered differently. After considering this discussion, and particularly after reading the section entitled “Detailed Description”, one will understand how the features of the devices and methods disclosed herein provide advantages over other known devices and methods.
[0004] Disclosed herein are methods for assembling sequence reads from a genomic nucleic acid sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods comprise obtaining flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids; aligning the sequence reads to a first assembly or to a reference sequence; generating a linking model based on the alignment of the sequence reads to the first assembly or to the reference sequence, and based on the proximity of clusters determined from their flow cell locations; determining linking information between the sequence reads based on the linking model; recruiting sequence reads based on the alignment to the first assembly or to the reference sequence; generating a second assembly comprising recruited sequence reads; and scaffolding the second assembly by ordering contiguous sequences of the second assembly, thereby obtaining a final assembly; wherein the recruiting, assembling and/or scaffolding includes the linking information.
[0005] In some embodiments, the method comprises assembling the sequence reads into a first assembly. In some embodiments, assembling the sequence reads into a first assembly comprises: determining frequencies of k-mers within the sequence reads; selecting k-mers with a frequency within a threshold; and assembling the sequence reads containing the selected k-mers. In some embodiments, assembling the sequence reads into a first assembly comprises: analyzing k-mers from the sequence reads and sequence read flow cell proximities; grouping selected k-mers based on flow cell proximity; and assembling each group of selected k-mers into a contiguous sequence. In some embodiments, the method comprises assembling the selected k-mers or groups of selected k-mers using an assembly graph. In some embodiments, the assembly graph comprises a de Bruijn graph or an overlap-layout-consensus graph.
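As an illustration only, the k-mer frequency selection described in this paragraph can be sketched in Python. The function names (`count_kmers`, `select_kmers`, `reads_with_selected_kmers`), the inclusive frequency window, and the toy reads are assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def select_kmers(counts, min_count, max_count):
    """Keep k-mers whose frequency lies within [min_count, max_count] (assumed window)."""
    return {kmer for kmer, n in counts.items() if min_count <= n <= max_count}

def reads_with_selected_kmers(reads, selected, k):
    """Return the reads that contain at least one selected k-mer."""
    kept = []
    for read in reads:
        if any(read[i:i + k] in selected for i in range(len(read) - k + 1)):
            kept.append(read)
    return kept

# Toy data: "TTTTTT" is over-represented at k=3, "GGGACC" is seen only once.
reads = ["ACGTAC", "CGTACG", "TTTTTT", "GGGACC"]
counts = count_kmers(reads, k=3)
selected = select_kmers(counts, min_count=2, max_count=3)
kept = reads_with_selected_kmers(reads, selected, k=3)
print(kept)
```

The reads kept by this filter would then be passed to whichever assembler builds the first assembly; the lower bound tends to drop likely sequencing errors and the upper bound tends to drop highly repetitive k-mers.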
[0006] In some embodiments, the reference sequence is from a different species or strain than the genomic nucleic acid sample.
[0007] In some embodiments, the linking model is generated based on the genomic proximity of sequence reads aligned to at least one contiguous sequence of the first assembly or to the reference sequence, and based on flow cell proximity.
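One possible reading of such a linking model, offered as a hedged sketch and not as the disclosed implementation, is an empirical estimate of how often two reads at a given flow cell distance also align close together. The record fields (`contig`, `pos`, `x`, `y`), the distance binning, the `max_genomic_gap` parameter, and the function names are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def flowcell_distance(a, b):
    """Euclidean distance between two clusters on the flow cell."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])

def fit_linking_model(reads, bin_width, max_genomic_gap):
    """Per flow cell distance bin, the fraction of read pairs that align
    to the same contig within max_genomic_gap (assumed definition)."""
    near, total = Counter(), Counter()
    for a, b in combinations(reads, 2):
        d_bin = int(flowcell_distance(a, b) // bin_width)
        total[d_bin] += 1
        same_contig = a["contig"] == b["contig"]
        if same_contig and abs(a["pos"] - b["pos"]) <= max_genomic_gap:
            near[d_bin] += 1
    return {d_bin: near[d_bin] / total[d_bin] for d_bin in total}

def link_score(model, a, b, bin_width):
    """Look up the modelled link probability for one read pair."""
    return model.get(int(flowcell_distance(a, b) // bin_width), 0.0)

reads = [
    {"id": "r1", "contig": "c1", "pos": 100, "x": 0.0, "y": 0.0},
    {"id": "r2", "contig": "c1", "pos": 400, "x": 1.0, "y": 0.0},
    {"id": "r3", "contig": "c1", "pos": 90000, "x": 50.0, "y": 50.0},
    {"id": "r4", "contig": "c2", "pos": 10, "x": 51.0, "y": 50.0},
]
model = fit_linking_model(reads, bin_width=5, max_genomic_gap=1000)
score = link_score(model, reads[0], reads[1], bin_width=5)
print(model, score)
```

The all-pairs loop is quadratic and is only suitable for a toy example; a spatial index over cluster coordinates would be needed at realistic read counts.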
[0008] In some embodiments, generating the second assembly comprises resolving placement of sequence reads based on the linking information. In some embodiments, the method comprises resolving assembly of repeat regions based on the linking information. In some embodiments, the repeat regions comprise one or more of structural variants, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), tandem duplications, tandem repeats, short tandem repeats (STRs), and paralogous regions.
[0009] In some embodiments, scaffolding comprises ordering and/or orienting contiguous sequences based on linking information between sequence reads spanning two contiguous sequences.
[0010] In some embodiments, the method comprises recruiting sequence reads based on sequence similarity to the first assembly or to the reference sequence. In some embodiments, the method comprises recruiting sequence reads based on flow cell proximity to sequence reads aligned to the reference sequence or to the first assembly. In some embodiments, the method comprises filtering sequence reads based on k-mer frequency.
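Recruitment by flow cell proximity might look like the following minimal sketch, which assumes each read is a tuple of (identifier, x, y) cluster coordinates and that a fixed radius decides proximity. The tuple layout, the `radius` parameter, and the function name `recruit_by_proximity` are illustrative assumptions.

```python
import math

def recruit_by_proximity(aligned, unaligned, radius):
    """Recruit unaligned reads whose cluster lies within `radius`
    of the cluster of at least one aligned read."""
    recruited = []
    for read_id, x, y in unaligned:
        for _, ax, ay in aligned:
            if math.hypot(x - ax, y - ay) <= radius:
                recruited.append(read_id)
                break  # one nearby aligned read is enough
    return recruited

aligned = [("a1", 0.0, 0.0), ("a2", 10.0, 10.0)]
unaligned = [("u1", 0.5, 0.5), ("u2", 30.0, 30.0), ("u3", 10.0, 11.0)]
recruited = recruit_by_proximity(aligned, unaligned, radius=2.0)
print(recruited)
```

In practice the radius would presumably come from the linking model and not from a hard-coded constant.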
[0011] In some embodiments, the method comprises excluding sequence reads from a region of the assembly based on flow cell proximity information. In some embodiments, the region of the assembly comprises a repeat region.
[0012] In some embodiments, the method comprises iteratively updating the linking model, recruiting additional sequence reads, and updating an assembly.
[0013] In some embodiments, the genomic nucleic acid sample comprises one or more genomes. In some embodiments, the one or more genomes comprise at least one polyploid genome. In some embodiments, the genomic nucleic acid sample comprises two or more genomes.
[0014] In some embodiments, the method comprises phasing the sequence reads based on heterozygous sites, based on k-mer frequencies, and/or based on linking information. In some embodiments, the method comprises generating a haplotype-resolved sequence assembly.
[0015] Further disclosed herein are systems for assembling sequence reads from a genomic nucleic acid sample. In some embodiments, the system comprises a processor configured to perform a method comprising: obtaining flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids; aligning the sequence reads to a first assembly or to a reference sequence; generating a linking model based on the alignment of the sequence reads to the first assembly or to the reference sequence, and based on the proximity of clusters determined from their flow cell locations; determining linking information between the sequence reads based on the linking model; recruiting sequence reads based on the alignment to the first assembly or to the reference sequence; generating a second assembly comprising recruited sequence reads; and scaffolding the second assembly by ordering contiguous sequences of the second assembly, thereby obtaining a final assembly; wherein the recruiting, assembling and/or scaffolding includes the linking information.
[0016] Further disclosed herein are non-transitory computer-readable media. In some embodiments, the non-transitory computer-readable medium comprises a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids; align the sequence reads to a first assembly or to a reference sequence; generate a linking model based on the alignment of the sequence reads to the first assembly or to the reference sequence, and based on the proximity of clusters determined from their flow cell locations; determine linking information between the sequence reads based on the linking model; recruit sequence reads based on the alignment to the first assembly or to the reference sequence; generate a second assembly comprising recruited sequence reads; and scaffold the second assembly by ordering contiguous sequences of the second assembly, thereby obtaining a final assembly; wherein the recruiting, assembling and/or scaffolding includes the linking information.
[0017] Further disclosed herein are methods for assembling sequence reads from a genomic DNA sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods include obtaining flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; aligning the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads; assembling the sequence reads to obtain a first assembly; obtaining linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identifying a region of syntenic abnormality; and reassembling the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality. A syntenic abnormality may be one or more disruptions in the order of genes on two homologous chromosomes. These disruptions can include gene deletions, gene duplications, gene insertions and translocations.
[0018] In some embodiments, reassembling the sequence reads comprises updating scaffolding in the region of syntenic abnormality based on linking information. In some embodiments, updating scaffolding comprises reordering or reorienting one or more contiguous sequences in the region of syntenic abnormality based on linking information between sequence reads spanning two contiguous sequences. In some embodiments, reassembling the sequence reads comprises updating the assembly of a contiguous sequence in the region of syntenic abnormality based on linking information. In some embodiments, reassembling the sequence reads comprises recruiting or excluding sequence reads in the region of syntenic abnormality based on flow cell proximity information.
[0019] In some embodiments, the region of syntenic abnormality is identified based on linking information. In some embodiments, the method comprises generating colocation data based on the linking information. In some embodiments, the colocation data comprises a colocation plot or colocation matrix. In some embodiments, the method comprises identifying a syntenic abnormality based on off-diagonal signal in the colocation data. In some embodiments, identifying a syntenic abnormality based on the linking information comprises identifying a visual feature on the colocation plot.
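A colocation matrix and a simple off-diagonal check could be sketched as below. This assumes linking information is available as pairs of assembly coordinates for linked reads; the bin size, the `min_offset` and `min_count` thresholds, and the function names are hypothetical choices for illustration.

```python
def colocation_matrix(linked_positions, bin_size, n_bins):
    """Count linked read pairs between genomic bins (symmetric matrix)."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in linked_positions:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix

def off_diagonal_hits(matrix, min_offset, min_count):
    """Report upper-triangle cells at least min_offset bins away from
    the diagonal that hold at least min_count linked pairs."""
    hits = []
    n = len(matrix)
    for i in range(n):
        for j in range(i + min_offset, n):
            if matrix[i][j] >= min_count:
                hits.append((i, j, matrix[i][j]))
    return hits

# Three links stay near the diagonal; three join bin 0 to bin 4.
links = [(100, 150), (1200, 1900), (2100, 2500),
         (300, 4200), (350, 4100), (400, 4300)]
matrix = colocation_matrix(links, bin_size=1000, n_bins=5)
hits = off_diagonal_hits(matrix, min_offset=2, min_count=2)
print(hits)
```

Under these assumptions, a correctly ordered assembly concentrates counts near the diagonal, so a strong cell far from it marks a candidate region for closer inspection.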
[0020] In some embodiments, the genomic DNA sample comprises one or more genomes. In some embodiments, the method comprises phasing the sequence reads based on heterozygous sites or based on linking information. In some embodiments, the method comprises generating a haplotype-resolved sequence assembly based on the phasing of the sequence reads.
[0021] In some embodiments, the method comprises assembling the sequence using an assembly graph. In some embodiments, the assembly graph comprises a de Bruijn graph or an overlap-layout-consensus graph.
[0022] In some embodiments, the linking information is obtained based on a linking model. In some embodiments, the method comprises resolving assembly of repeat regions based on the linking information. In some embodiments, the repeat regions comprise one or more of structural variants, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), tandem duplications, tandem repeats, short tandem repeats (STRs), and paralogous regions.
[0023] Further disclosed herein are systems for assembling sequence reads from a genomic nucleic acid sample. In some embodiments, the systems comprise a processor configured to perform a method comprising: obtaining flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; aligning the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads; assembling the sequence reads to obtain a first assembly; obtaining linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identifying a region of syntenic abnormality; and reassembling the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
[0024] Further disclosed herein are non-transitory computer-readable media. In some embodiments, the non-transitory computer-readable medium comprises a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; align the sequence reads or the initial assembly to a reference sequence to obtain a genomic location of the sequence reads; assemble the sequence reads to obtain a first assembly; obtain linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identify a region of syntenic abnormality; and reassemble the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
[0025] In some embodiments, disclosed herein are methods for assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods comprise obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffolding the contigs from the first assembly to produce a second assembly, wherein the second assembly has undefined contiguous sequences; and mapping additional contigs from the first assembly to the undefined contiguous sequences on the second assembly by referencing the linking information to create a final assembly.
[0026] In some embodiments, the method comprises assembling the sequence reads to produce a first assembly, including mapping the sequence reads to contiguous sequences of the first assembly to produce contigs. In some embodiments, producing the first assembly comprises producing a hybrid assembly of a plurality of de novo assemblers. In some embodiments, the plurality of de novo assemblers comprises using strategic k-mer extension for scrupulous assemblies (SKESA), the St. Petersburg genome assembler (SPAdes), or a combination thereof.
[0027] In some embodiments, assembling the sequence reads comprises determining at least two or more sequence reads that exceed a mapping quality threshold, and adding the at least two or more sequence reads to a plurality of bins. In some embodiments, adding the at least two or more sequence reads to a plurality of bins is based on a k-mer signature threshold.
[0028] In some embodiments, the k-mer signature threshold is based on a number of sequence reads, which are genomically adjacent to the sequence reads in the plurality of bins. In some embodiments, the plurality of bins have a bin connection strength, wherein the bin connection strength is based on at least two sequence reads, which are sorted into different bins and are genomically adjacent to one another. In some embodiments, the bin connection strength is based upon proximity information for clusters determined from flow cell locations of the clusters.
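A bin connection strength of this kind might be tallied as in the sketch below. It assumes that each read has already been assigned to a bin identified by a (contig, bin index) tuple and that proximity information is supplied as a list of read pairs that are neighbors on the flow cell. The data layout and the name `bin_connection_strength` are assumptions, and counting neighbor pairs is only one plausible definition of the strength.

```python
from collections import Counter

def bin_connection_strength(read_bins, neighbor_pairs):
    """Count flow cell neighbor pairs whose two reads fall in different bins.
    Keys are unordered bin pairs stored as sorted tuples."""
    strength = Counter()
    for read_a, read_b in neighbor_pairs:
        if read_a not in read_bins or read_b not in read_bins:
            continue  # skip reads that were never binned
        bin_a, bin_b = read_bins[read_a], read_bins[read_b]
        if bin_a != bin_b:
            strength[tuple(sorted((bin_a, bin_b)))] += 1
    return strength

read_bins = {
    "r1": ("C1", 9), "r2": ("C2", 0), "r3": ("C1", 9),
    "r4": ("C2", 0), "r5": ("C3", 4),
}
neighbor_pairs = [("r1", "r2"), ("r3", "r4"), ("r1", "r3"),
                  ("r4", "r5"), ("r5", "r9")]
strength = bin_connection_strength(read_bins, neighbor_pairs)
print(strength)
```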
[0029] In some embodiments, scaffolding comprises selecting contigs from the first assembly based upon a sequence length and a bin connection strength. In some embodiments, the sequence length is less than 500 bases. In some embodiments, the sequence length is between 500 bases and 10 kilobases. In some embodiments, the sequence length is greater than 10 kilobases.
[0030] In some embodiments, each of the contigs has an orientation, wherein the orientation is associated with a head terminal or a tail terminal of a contig. In some embodiments, the method further comprises determining the orientation of a first contig in the plurality of contigs in relation to the orientation of a second contig in the plurality of contigs based on a threshold number of connections between the first contig and the second contig.
[0031] In some embodiments, the method further comprises generating a first graph, comprising nodes each corresponding to a terminal of a contig and edges connecting the nodes, wherein each edge has a weight based on the bin connection strength between the nodes connected by the edge. In some embodiments, the method further comprises pruning the edges from the first graph based upon a unique maximum weight amongst all edges in the graph and a reciprocity metric to generate a second graph; and traversing the second graph to map the contigs to the second assembly. In some embodiments, the edges in the second graph are bidirectional. In some embodiments, the first and second graphs are cyclic.
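The construction, pruning, and traversal of such a terminal graph can be sketched as follows. The pruning rule used here, keeping an edge only when each terminal is the other's unique strongest neighbor, is one assumed interpretation of "unique maximum weight" plus "reciprocity" and is not confirmed by the text. Terminals are modelled as (contig, "head" or "tail") tuples, and all function names are hypothetical.

```python
def best_neighbor(weights, node):
    """Return node's strongest neighbor, or None if the maximum is tied."""
    ranked = sorted(weights.get(node, {}).items(),
                    key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def prune_reciprocal(edges):
    """Keep only edges where both terminals pick each other as best."""
    weights = {}
    for u, v, w in edges:
        weights.setdefault(u, {})[v] = w
        weights.setdefault(v, {})[u] = w
    kept = {}
    for u in weights:
        v = best_neighbor(weights, u)
        if v is not None and best_neighbor(weights, v) == u:
            kept[u] = v
    return kept

def other_end(terminal):
    contig, end = terminal
    return (contig, "tail" if end == "head" else "head")

def traverse(kept, start):
    """Walk contig to contig, entering the first contig at `start`.
    Entering at the head is reported as '+', at the tail as '-'."""
    path, seen, terminal = [], set(), start
    while terminal is not None and terminal[0] not in seen:
        seen.add(terminal[0])  # guards against cycles
        path.append((terminal[0], "+" if terminal[1] == "head" else "-"))
        terminal = kept.get(other_end(terminal))
    return path

edges = [
    (("A", "tail"), ("B", "head"), 9),
    (("B", "tail"), ("C", "tail"), 7),
    (("A", "tail"), ("C", "head"), 2),
    (("B", "tail"), ("A", "head"), 1),
]
kept = prune_reciprocal(edges)
path = traverse(kept, ("A", "head"))
print(path)
```

In this toy graph the two weak edges are discarded because they are not reciprocal best choices, and the walk orders the contigs as A, B, C with C reversed.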
[0032] In some embodiments, scaffolding further comprises selecting contigs based upon a mapping connection strength. In some embodiments, the mapping connection strength is based upon the bin connection strength and orientation of the contigs. In some embodiments, a contig may be connected to more than one contig if the mapping connection strength is above a threshold. In some embodiments, the proximity information and mapping connection strength of a first contig to a second contig can be used to determine the orientation of a first contig relative to a third contig.
[0033] In some embodiments, the mapping of additional contigs to the sections of the second assembly having undefined contiguous sequences is based upon a mapping connection strength or a fraction of total reads that have a mapping connection greater than a threshold. In some embodiments, sections having undefined contiguous sequences have a fixed sequence length.
[0034] In some embodiments, the genomic sample includes bacterial, fungal, human or non-human samples.
[0035] Further disclosed herein are systems for assembling sequence reads from a genomic nucleic acid sample. In some embodiments, the system comprises a memory storing instructions and a processor that, when executing the instructions, is configured to perform a method comprising: obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference genome; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffolding the contigs from the first assembly to produce a second assembly, wherein the second assembly has portions having undefined contiguous sequences; and mapping additional contigs from the first assembly to the undefined contiguous sequences to assemble a final assembly by referencing the linking information.
[0036] Further disclosed herein are non-transitory computer-readable media. In some embodiments, the non-transitory computer-readable medium comprises a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assemble the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference genome; determine linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffold the contigs from the first assembly to produce a second assembly, wherein the second assembly has portions having undefined contiguous sequences; and map additional contigs from the first assembly to the undefined contiguous sequences to assemble a final assembly by referencing the linking information.
[0037] Disclosed herein are methods for assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods comprise obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; and detecting misassemblies in the first assembly using the linking information.
[0038] In some embodiments, the method further comprises scaffolding the contigs from the first assembly to produce a second assembly and detecting misassemblies in the second assembly using the linking information. In some embodiments, scaffolding comprises ordering and orienting contigs from the first assembly to produce the second assembly using proximity information. In some embodiments, scaffolding further comprises incorporating additional contigs into the second assembly based on the measure of the strength of the determined linking information.
[0039] In some embodiments, the misassemblies are detected based on identifying an abnormality based on an off-diagonal signal from colocation data based on the linking information. In some embodiments, the colocation data is in the form of a colocation plot or a colocation matrix. In some embodiments, the misassemblies are detected by providing the colocation data to a machine learning model to identify errors in the order and/or orientation of contigs in the first assembly or second assembly. In other embodiments, the misassemblies are detected based on the proximity link size deviation in the linking information.
[0040] In some embodiments, the method further comprises breaking the first assembly or the second assembly into separate contigs at the site of a misassembly in the first assembly or the second assembly.
[0041] In some embodiments, the method further comprises correcting the detected misassemblies using proximity information to create a corrected assembly. In some embodiments, correcting the detected misassemblies comprises correcting the order and/or orientation of contigs in the first assembly or the second assembly using proximity information.
[0042] In some embodiments, the incorporation of additional contigs to the second assembly is further based upon an assembly graph, wherein contigs are represented as outputs of non-branching paths in the assembly graph.
[0043] In some embodiments, the method further comprises correcting the assembly using proximity information. In some embodiments, correcting the assembly comprises using a pile-up comprising a plurality of linked sequence reads. In some embodiments, correcting the assembly comprises identifying positions of discrepancy between the assembly and a consensus sequence built from the linked sequence reads in the pile-up. In some embodiments, the method updates the assembly at the positions of discrepancy based on the consensus sequence from the pile-up. In other embodiments, the method marks the positions of discrepancy in the assembly. In some embodiments, the linked sequence reads comprise local reads having 1-1000 bp. In some embodiments, correcting the assembly uses ploidy-aware proximity information.
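The pile-up based correction can be illustrated with the sketch below, which assumes gap-free placements of linked reads given as (start position, sequence) tuples and a majority vote per column. The `min_depth` cutoff, the fallback to the assembly base at low depth, and the function names are assumptions; indels and base qualities are ignored for brevity.

```python
from collections import Counter

def pileup_consensus(assembly, placed_reads, min_depth=2):
    """Build a per-position majority consensus from placed reads.
    Positions covered by fewer than min_depth reads keep the assembly base."""
    columns = [Counter() for _ in assembly]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(assembly):
                columns[pos][base] += 1
    consensus = []
    for pos, column in enumerate(columns):
        if sum(column.values()) >= min_depth:
            consensus.append(column.most_common(1)[0][0])
        else:
            consensus.append(assembly[pos])
    return "".join(consensus)

def discrepancies(assembly, consensus):
    """Positions where the assembly disagrees with the consensus."""
    return [i for i, (a, c) in enumerate(zip(assembly, consensus)) if a != c]

assembly = "ACGTACGT"
placed_reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (4, "ACGT")]
consensus = pileup_consensus(assembly, placed_reads)
positions = discrepancies(assembly, consensus)
print(consensus, positions)
```

The returned positions could either be overwritten with the consensus base or simply flagged, matching the two alternatives described in the paragraph.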
[0044] In some embodiments, the method determines the circularity of an assembly in a BANDAGE (Bioinformatics Application for Navigating De novo Assembly Graphs Easily) plot. In some embodiments, the method further determines the circularity using proximity information.
[0045] Disclosed herein are methods for assembling sequence reads from a genomic nucleic acid sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods comprise obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; and correcting the assembly using proximity information.
[0046] In some embodiments, correcting the assembly comprises using a pile-up comprising a plurality of linked sequence reads. In some embodiments, correcting the assembly comprises identifying positions of discrepancy between the assembly and a consensus sequence built from the linked sequence reads in the pile-up. In some embodiments, the method updates the assembly at the positions of discrepancy based on the consensus sequence from the pile-up. In other embodiments, the method marks the position of the discrepancy in the assembly. In some embodiments, the linked sequence reads comprise reads having 1-1000 bp.
BRIEF DESCRIPTION OF THE DRAWINGS
[0047] Features of examples of the present disclosure will become apparent by reference to the following detailed description and drawings, in which like reference numerals correspond to similar, though perhaps not identical, components. For the sake of brevity, reference numerals or features having a previously described function may or may not be described in connection with other drawings in which they appear. In addition to the features described herein, additional features and variations will be readily apparent from the following descriptions of the drawings and exemplary embodiments. It is to be understood that these drawings depict typical embodiments and are not intended to be limiting in scope.
[0048] FIG. 1A is a flow diagram that schematically illustrates an overall method of performing a de novo assembly of a genome or other nucleic acid sequence.
[0049] FIG. 1B is a block diagram of an exemplary sequencing system that may be used to perform the disclosed methods.
[0050] FIG. 1C is a block diagram of an exemplary computing device that may be used in connection with the exemplary sequencing system of FIG. 1B.
[0051] FIG. 2 is a flow diagram that schematically illustrates steps for obtaining sequence reads and flow cell locations from a genomic sample placed on a flow cell, for use in sequence assembly.
[0052] FIG. 3 is a flow diagram that schematically illustrates the steps for assembling sequence reads using proximity information.
[0053] FIG. 4 is a statistical analysis of the assembly of the Staphylococcus aureus genome using strategic k-mer extension for scrupulous assemblies (SKESA) assembler, the St. Petersburg genome assembler (Unicycler), and a hybrid assembly approach, along with statistics for the final assembly following scaffolding and gap-filling.
[0054] FIG. 5 is a schematic depiction of an exemplary mapping of sequence reads to bins on contigs together with their flow cell neighbors.
[0055] FIG. 6 is a schematic depiction of the construction, pruning, and traversing of an assembly graph based on linked-bins on terminals of different contigs.
[0056] FIG. 7A is a schematic depiction of a contig G having a bin with a maximum bin-connection strength in the tail of contig G.
[0057] FIG. 7B is a schematic depiction of the use of mapping connection information to determine the orientation of Contig G relative to Contig Ck.
[0058] FIG. 7C is a schematic depiction of the use of mapping information to determine the orientation of a contig G relative to another contig Ct based on the mapping connection of G to contig Gn.
[0059] FIG. 8A is a schematic depiction of gap-filling of tiny contigs into a scaffolded assembly using mapping connection information.
[0060] FIG. 8B shows the results of including tiny contigs and gap-filling on the assembly of the Staphylococcus aureus genome.
[0061] FIG. 9 is a colocation map identifying an inversion of 100 kbp introduced at the 1 Mbp position and a deletion of 60 kbp introduced at the 2 Mbp position, including zoomed-in views of the inversion and deletion signals.
[0062] FIG. 10A is a BANDAGE (Bioinformatics Application for Navigating De novo Assembly Graphs Easily) visualization of a de novo assembly graph for Pseudomonas aeruginosa built using Unicycler.
[0063] FIG. 10B is a BANDAGE visualization for a de novo assembly graph for Pseudomonas aeruginosa following assembly using proximity information.
DETAILED DESCRIPTION
[0064] The foregoing and other aspects of the present disclosure will now be described in more detail with respect to the description and methodologies provided herein. This description is not intended to be a detailed catalogue of all the ways in which the embodiments of the present disclosure may be implemented, or of all the features that may be added to the present disclosure. For example, features illustrated with respect to one embodiment may be incorporated into other embodiments, and features illustrated with respect to a particular embodiment may be deleted from that embodiment. In addition, numerous variations and additions to the various embodiments suggested herein, which do not depart from the instant disclosure, will be apparent to those skilled in the art in light of the instant detailed description, figures and claims. Hence, the following specification is intended to illustrate some particular embodiments, and not to exhaustively specify all permutations, combinations and variations thereof.
[0065] All patents, patent applications, and other publications, including all sequences disclosed within these references, referred to herein are expressly incorporated herein by reference, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference. All documents cited are, in relevant part, incorporated herein by reference in their entireties for the purposes indicated by the context of their citation herein. However, the citation of any document is not to be construed as an admission that it is prior art with respect to the present disclosure.
[0066] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
Overview
[0067] Described herein are methods and systems for assembling sequence reads from a nucleic acid sample into a final assembly of a larger nucleic acid sequence, such as a genomic sequence. The nucleic acid sample may include DNA, RNA or a mixture of DNA and RNA. In one embodiment, the method is performed as shown in a method 100 of FIG. 1A. In this embodiment, the systems and methods begin at a start state 105 and then obtain sequence reads from a nucleic acid sequencing system, such as a Next Generation Sequencing (NGS) system in a state 107.
[0068] During massively parallel sequencing processes, such as those found in NGS systems, relatively short reads of 100-500 nucleotides are typically generated. During this process, original nucleic acids may be fragmented by tagmentation of a sample directly on the flow cell. In an example process, transposons are linked to the flow cell, and when contacted by the original nucleic acids from the sample, they may fragment the original nucleic acids, resulting in many fragments from the same nucleic acid molecule being generated. This example process may result in fragments from the same nucleic acid molecule being bound to the flow cell in geographically adjacent or nearby locations. These fragments then give rise to nucleic acid clusters at nearby locations on the flow cell, and these nucleic acid clusters will participate in sequencing reactions to generate reads or read pairs. Therefore, reads or read pairs that are generated by sequencing these fragments which are located nearby each other on the flow cell may be determined by the disclosed systems or methods to be “connected” or “linked”, or that there are “links” among the reads or read pairs since they may have derived from the same originating nucleic acid molecule.
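By way of non-limiting illustration, the notion of clusters at nearby flow cell locations giving rise to candidate “links” can be sketched as follows. This is a minimal sketch and not the disclosed implementation: the function name `candidate_links`, the `radius` cutoff, the dictionary layout of cluster positions, and the grid-bucketing strategy are all assumptions introduced here for illustration only.

```python
from collections import defaultdict
from itertools import product
from math import hypot


def candidate_links(clusters, radius):
    """Return sorted pairs of cluster ids whose positions lie within `radius`.

    clusters: dict mapping a cluster id to an (x, y) flow cell position
              (hypothetical layout, assumed for illustration).
    radius:   distance cutoff in the same units as the positions (assumed).
    """
    # Bucket clusters into square grid cells of side `radius`, so that only
    # the 3x3 block of cells around each cluster has to be searched.
    grid = defaultdict(list)
    for cid, (x, y) in clusters.items():
        grid[(int(x // radius), int(y // radius))].append(cid)

    pairs = set()
    for cid, (x, y) in clusters.items():
        gx, gy = int(x // radius), int(y // radius)
        for dx, dy in product((-1, 0, 1), repeat=2):
            for other in grid.get((gx + dx, gy + dy), ()):
                if other == cid:
                    continue
                ox, oy = clusters[other]
                if hypot(x - ox, y - oy) <= radius:
                    # Store each pair once, in a canonical order.
                    pairs.add(tuple(sorted((cid, other))))
    return sorted(pairs)
```

Pairs returned by such a search are only candidates; whether two nearby clusters actually derive from the same originating molecule is a separate question.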
[0069] Once the sequence reads have been determined at the state 107, the method 100 then moves to a state 110 wherein the sequence reads are used to assemble a set of larger sequences of contiguous sequence reads, termed “contigs”. These contigs generally are formed from a plurality of sequence reads which are determined to be located next to one another, in a particular order, in the original nucleic acid sequence being assembled. In some embodiments, the assembly of the sequence reads into the contigs is performed without using a reference sequence. This is termed a “de novo” sequence assembly because the sequence reads are assembled into the contigs without comparison to a known reference sequence. The method may include performing a de novo assembly of sequence reads to produce a first assembly. This first assembly may be highly fragmented, meaning that the first assembly includes a series of contigs, but at this point the contigs have not been assembled with one another to form a larger assembly covering the entire original nucleic acid sequence. For example, if the original nucleic acid sequence is a genomic sequence, the initial assembly may be a series of contigs that cover only a portion of the entire genomic sequence.
[0070] Once the method 100 has assembled the contigs from the sequence reads into a first assembly at the state 110, the method 100 moves to a state 112 wherein additional sequence reads are “recruited” to form additional or larger contigs which can be added to the first assembly to form a second assembly. In the state 112, the reads may be recruited by using linking or proximity information taken from the flow cell within the NGS system to recruit additional reads which may be determined to be located on the same contig as sequence reads determined at state 110.
[0071] Linking information may enable an improved assembly of the contigs by way of grouping together linked read pairs coming from nearby locations. However, in some cases, two reads or read pairs that are nearby on the flow cell may not have originated from the same nucleic acid molecule; they may be nearby by chance alone. Therefore, in order to leverage the linking information of read pairs in processes such as contig generation, methods can be used to quantify the relative likelihood that two read pairs obtained from nearby locations on a flow cell exhibited this flow-cell geographical proximity due to linking instead of by chance. This likelihood is referred to as a “linking quality”.
[0072] To quantify the linking quality of two sequence reads more accurately, statistical models, or “linking models” can be developed that describe, for two linked read pairs at a certain genomic distance on the originating nucleic acid molecule, the probability that the two read pairs’ clusters are at a given relative displacement on the flow cell. Each statistical model is fitted to sequencing data in which preliminary alignments of the read pairs to a reference sequence have been made. From the initial proximity information, the method may include generating a linking model based on alignment of the sequence reads to the first assembly or to the reference sequence. The linking model may include a mathematical representation of the locations of each sequence on the flow cell and a strength of the likelihood that two sequence reads were derived from the same original nucleic acid molecule.
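By way of non-limiting illustration, a linking quality can be expressed as a log likelihood ratio between a “linked” hypothesis and a “nearby by chance” hypothesis. The sketch below is not the fitted linking model of the disclosure. It assumes, purely for illustration, that linked clusters show an isotropic two-dimensional Gaussian displacement with spread `sigma`, and that an unlinked cluster is equally likely to sit anywhere in a region of area `tile_area`. The function name and both parameters are hypothetical. A fitted model would let the spread depend on the genomic distance between the two read pairs; here it is simply passed in.

```python
import math


def linking_quality(dx, dy, sigma, tile_area):
    """Log10 likelihood ratio of 'linked' versus 'nearby by chance'.

    Assumptions (illustrative only, not taken from the disclosure):
      - linked clusters: isotropic 2D Gaussian displacement with spread `sigma`;
      - chance clusters: uniform over a region of area `tile_area`.
    """
    r2 = dx * dx + dy * dy
    # Natural log of the Gaussian density at displacement (dx, dy).
    log_p_linked = -r2 / (2.0 * sigma ** 2) - math.log(2.0 * math.pi * sigma ** 2)
    # Natural log of the uniform density 1 / tile_area.
    log_p_chance = -math.log(tile_area)
    # Work in log space so that large displacements do not underflow.
    return (log_p_linked - log_p_chance) / math.log(10.0)
```

Under these assumptions, a positive value favors linking and a negative value favors chance proximity, and the value decreases as the displacement between the two clusters grows.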
[0073] Based on the proximity of flow cell clusters determined from flow cell locations, the method may determine linking information between the sequence reads based on the linking model. From this linking information, the method may recruit additional sequence reads based on the alignment to the first assembly. This location information, along with the known genomic location information of each sequence read may be used to generate a linking model which accurately determines if two sequence reads are part of the same contig. The system and method may use the flow cell location and known genomic information to determine which sequence reads are more likely to be linked with one another, and thus recruit additional sequence reads that can be properly assembled into larger or more complete contigs of the second assembly.
[0074] Once the set of contigs has been assembled into a second assembly at the state 112, the method 100 moves to a state 115 wherein the contigs are scaffolded into a final assembly. This process involves analyzing the contigs and placing them in their correct order so that each contig is located in the final assembly to create an accurate sequence assembly reflecting the sequence of the nucleic acid being determined. For example, the contigs may be scaffolded to create the correct genomic sequence of the nucleic acid sequence that was originally placed into the NGS system. This step can be performed by ordering and orienting each contig into the final assembly. Ordering the contigs is performed by placing the different contigs into the correct order within the final assembly. Orienting the contigs is placing each contig into the correct orientation, so that the contig can be properly ordered into the final assembly.
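By way of non-limiting illustration, once an order and an orientation have been chosen for each contig, the scaffold itself can be written out by reverse-complementing the contigs placed on the opposite strand and joining the results. In the sketch below, the names `build_scaffold` and `layout`, the `"+"`/`"-"` strand labels, and the use of a fixed run of `N` placeholder bases between contigs are assumptions made for illustration and are not taken from the disclosure.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs, layout, gap=10):
    """Join contigs in the given order and orientation.

    contigs: dict mapping a contig name to its sequence.
    layout:  list of (name, strand) tuples, strand being "+" or "-"
             (hypothetical representation of order and orientation).
    gap:     number of N placeholder bases between contigs (assumed).
    """
    parts = []
    for name, strand in layout:
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return ("N" * gap).join(parts)
```

For example, `build_scaffold({"c1": "AAC", "c2": "GGT"}, [("c1", "+"), ("c2", "-")], gap=2)` places `c1` as-is, followed by two placeholder bases and the reverse complement of `c2`.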
[0075] In some embodiments, the process of scaffolding the contigs into the final assembly uses linking or proximity information of the sequence reads within each contig to improve the order and orientation of the contigs in the final assembly. For example, and as shown in more detail with reference to Figures 7A, 7B and 7C, each contig is made from a set of sequence reads. By analyzing the flow cell location and proximity of the sequence reads within two contigs, the method 100 may be able to determine if the two contigs are likely to be next to each other in the final assembly. For example, if a sequence read at the most 3’ end of one contig is linked to a sequence read on the flow cell at the most 5’ end of a second contig, it may be very likely that those contigs are located adjacent each other in the final assembly. Sequence reads located near each other on the flow cell are more likely to have come from an adjacent position in the original nucleic acid sequence. The system may use this proximity and location information to order and orient the contigs into a larger final assembly.
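By way of non-limiting illustration, the idea that links between reads at contig ends suggest adjacency can be sketched by counting links between pairs of contig ends and selecting the best-supported pair. The function name `best_join`, the `"head"`/`"tail"` labels for the 5’ and 3’ ends, and the input format are hypothetical and introduced only for this sketch. The disclosure describes bins and bin-connection strengths, which this simple count does not attempt to reproduce.

```python
from collections import Counter


def best_join(links):
    """Pick the most strongly supported pair of contig ends.

    links: iterable of ((contig_a, end_a), (contig_b, end_b)) tuples, one per
           linked read pair, where each end is "head" (5') or "tail" (3').
    Returns ((contig, end), (contig, end), link_count), or None if no link
    connects two different contigs.
    """
    counts = Counter()
    for a, b in links:
        if a[0] == b[0]:
            continue  # a link within one contig says nothing about adjacency
        # Sort the two ends so that (a, b) and (b, a) are counted together.
        counts[tuple(sorted((a, b)))] += 1
    if not counts:
        return None
    (a, b), n = counts.most_common(1)[0]
    return a, b, n
```

In this sketch, a best-supported tail-to-head pair would be consistent with the two contigs lying in the same orientation, whereas a tail-to-tail or head-to-head pair would suggest that one of them should be reversed.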
[0076] Once the method 100 creates a scaffold from the contigs at the state 120, the method 100 moves to a decision state 122 to determine if any errors are detected in the final assembly. For example, an assembly graph, wherein contigs are represented as outputs of non-branching paths in the assembly graph, may be used to detect misassemblies. In another embodiment, misassemblies of the contigs in the final assembly can be detected based on identifying an abnormality based on an off-diagonal signal from colocation data. The colocation data may be in the form of a colocation plot or colocation matrix as shown in FIG. 9. In one embodiment, colocation data can be provided to a machine learning model, such as a deep learning object detection model, to identify errors in the order and/or orientation of contigs in the first assembly. In some embodiments, misassemblies can be detected based on a proximity link size deviation in the linking or proximity information.
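By way of non-limiting illustration, an off-diagonal signal in a colocation matrix can be flagged with a simple threshold, as a stand-in for the more capable detectors mentioned above. The sketch assumes a square matrix whose entry at row i and column j counts links between assembly bins i and j. The function name `off_diagonal_hits` and the `band` and `threshold` parameters are hypothetical, and a fixed threshold is a simplification rather than the disclosed method.

```python
def off_diagonal_hits(matrix, band, threshold):
    """Return (i, j) bin pairs with i < j that lie more than `band` bins
    apart and whose colocation count is at least `threshold`.

    matrix:    square list of lists of link counts between assembly bins.
    band:      half-width of the diagonal band treated as expected signal.
    threshold: minimum count for an off-diagonal cell to be reported.
    """
    hits = []
    n = len(matrix)
    for i in range(n):
        # Only the upper triangle outside the diagonal band is scanned.
        for j in range(i + band + 1, n):
            if matrix[i][j] >= threshold:
                hits.append((i, j))
    return hits
```

Cells reported by such a scan mark bin pairs that are far apart in the assembly yet frequently colocated on the flow cell, which is the kind of abnormality that may indicate a misassembly.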
[0077] If a determination is made at the decision state 122 that there is a misassembly of certain contigs, the method 100 moves to a state 124 wherein the system attempts to repair the misassembly. To repair the misassembly, the first assembly can be broken into separate contigs at the site of the misassembly. Proximity or linking information can then be used to create a corrected assembly. In some embodiments, correcting the detected misassembly can include correcting the order of the contigs in the first assembly. In other embodiments, correcting the misassembly can include correcting the orientation of contigs relative to one another within the first assembly. In some embodiments, correcting the detected misassembly can include correcting both the order and orientation of contigs in the first assembly using proximity information.
[0078] Once any misassemblies of the contigs have been corrected, the system may create a correct scaffold of the contigs into a corrected final assembly. The method 100 then returns to the state 120 to detect any misassemblies in the corrected final assembly.
[0079] If no misassemblies are detected at the decision state 122, the method 100 then moves to a state 125 to repair any nucleotide base errors that are found in the final assembly. At this state, the final assembly may have the contigs in the correct order and orientation, but the final assembly may still have minor errors in the nucleic acid sequence from minor sequencing errors. At state 125, these sequence errors may be corrected by analyzing a “pile up” of sequence reads taken at each position along a final assembly. This pile up of sequence reads is possible because in many NGS sequencing systems, each position along the final assembly was likely sequenced to a predetermined depth, such that a plurality of sequence reads covers each position. For example, each nucleotide position in the final assembly may have 5, 10, 20, 30 or more sequence reads which cover that position and can be piled up to analyze the proper nucleotides from those positions. The pile up may be a set of all the sequence reads which cover a particular portion of nucleotides in the final assembly. For example, a particular pile up of sequence reads may have 20 sequence reads covering a particular portion of one contig within the final assembly. If, for example, nineteen of the sequence reads have an “A” at position 3 and one sequence read shows a “C” at position 3, the correct nucleotide is very likely to be the “A” nucleotide. Thus, if the current assembly shows a “C” at position 3, the method 100 may update position 3 to contain an “A” at that position. This process of analyzing pile ups of sequence reads to determine if each nucleotide position is likely to be correct may be used to create the final, corrected, assembly of the sequence at a state 127.
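By way of non-limiting illustration, the pile-up correction described above can be sketched as a majority vote at each covered position. The function name `polish`, the `min_depth` cutoff, the strict-majority rule, and the dictionary representation of the pile up are assumptions introduced for this sketch. Positions are 0-based in the code, so “position 3” of the example above corresponds to index 2.

```python
from collections import Counter


def polish(assembly, pileups, min_depth=5):
    """Replace each base with the pile-up majority base when depth allows.

    assembly:  string of bases for the current assembly.
    pileups:   dict mapping a 0-based position to the list of read bases
               covering that position (hypothetical representation).
    min_depth: cutoff below which a position is left unchanged (assumed).
    """
    bases = list(assembly)
    for pos, column in pileups.items():
        if len(column) < min_depth:
            continue  # too few reads to trust a correction
        base, count = Counter(column).most_common(1)[0]
        if count * 2 > len(column):  # require a strict majority
            bases[pos] = base
    return "".join(bases)
```

With nineteen reads showing “A” and one showing “C” at index 2, `polish("GGCT", {2: ["A"] * 19 + ["C"]})` updates the assembly to `"GGAT"`.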
[0080] Once the base errors are corrected in the final assembly and the updated final assembly is created, the method 100 terminates at an end state 130.
[0081] It should be realized that embodiments do not require each of these steps to be performed, nor that each of the steps be performed in any particular order. For example, methods may include only one step of determining sequence reads and generating contigs. It is not necessary for the methods to include a first step of assembling reads into a first assembly of contigs and then performing a second step of recruiting additional reads to create a second assembly of contigs. The method may include only a single step of obtaining sequence reads and preparing contigs into an assembly, which is then scaffolded into a final assembly.
[0082] Similarly, embodiments may include performing only the step of detecting misassemblies of contigs and reordering and/or reorienting those contigs into a corrected assembly. Embodiments may also only include a step of correcting errors in a final assembly by analyzing the sequence reads which make up portions of the final assembly and determining if any nucleotide errors appear within that final assembly.
Systems
[0083] Additional aspects include electronic systems configured to carry out the methods described herein. In some embodiments, the systems are for assembling sequence reads from a genomic nucleic acid sample. In some embodiments, the system includes a processor configured to perform any of the methods described herein. In some embodiments, the method comprises: obtaining flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads into an initial set of contigs without using a reference sequence (de novo assembly of the contigs); generating a linking model based on the alignment of the sequence reads to the first assembly of contigs or to any reference sequence, and based on the proximity of clusters determined from their flow cell locations; determining linking information between the sequence reads based on the linking model; recruiting sequence reads based on the alignment to the first assembly of contigs or to the reference sequence; generating a second assembly comprising recruited sequence reads; and scaffolding the second assembly by ordering contiguous sequences of the second assembly, thereby obtaining a final assembly; wherein the recruiting, assembling and/or scaffolding includes the linking information. The systems can perform a method comprising any of the methods described herein.
[0084] In some embodiments, the method comprises obtaining flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; aligning the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads; assembling the sequence reads to obtain a first assembly; obtaining linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identifying a region of syntenic abnormality; and reassembling the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
[0085] Further disclosed herein are non-transitory computer-readable media. In some embodiments, the non-transitory computer-readable medium includes a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids; align the sequence reads to a first assembly or to a reference sequence; generate a linking model based on the alignment of the sequence reads to the first assembly or to the reference sequence, and based on the proximity of clusters determined from their flow cell locations; determine linking information between the sequence reads based on the linking model; recruit sequence reads based on the alignment to the first assembly or to the reference sequence; generate a second assembly comprising recruited sequence reads; and scaffold the second assembly by ordering contiguous sequences of the second assembly, thereby obtaining a final assembly; wherein the recruiting, assembling and/or scaffolding includes the linking information. The non-transitory computer-readable media can include a plurality of instructions, which when executed by at least one processor, cause the at least one processor to perform any of the methods described herein.
[0086] In some embodiments, the non-transitory computer-readable medium includes a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; align the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads; assemble the sequence reads to obtain a first assembly; obtain linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identify a region of syntenic abnormality; and reassemble the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
[0087] FIG. 1B illustrates a diagram of an environment in which a system for assembling sequence reads can operate in accordance with one or more implementations. The following paragraphs describe the sequence assembly system with respect to illustrative figures that portray example implementations and embodiments. For example, FIG. 1B illustrates a schematic diagram of a computing system 1000 in which a sequence assembly application 1106 operates in accordance with one or more implementations. As illustrated, the computing system 1000 includes one or more server devices 1102 connected to a user client device 1108, a local device 1118, and a sequencing device 1114 via a network 1112. The network 1112 can comprise any suitable network over which computing devices can communicate, for example, the Internet.
[0088] As shown in FIG. 1B, the computing system 1000 includes the server device(s) 1102. In various implementations, the server device(s) 1102 may generate, receive, analyze, store, and transmit digital data, such as data for nucleobase calls or sequenced nucleic-acid polymers. In some implementations, the server device(s) 1102 receive various data from the sequencing device 1114, such as data from a sample genome and/or sequence reads. The server device(s) 1102 may also communicate with the user client device 1108. In particular, the server device(s) 1102 can send data for sequence reads, direct nucleobase calls, nucleobase calls, and/or sequencing metrics to the user client device 1108.
[0089] As shown, the server device(s) 1102 includes a sequencing application 1110. In general, the sequencing application 1110 analyzes the data (such as call data) received from the sequencing device 1114 or elsewhere to determine nucleobase sequences for nucleic-acid polymers. The sequencing application 1110 may perform one or all of the steps outlined in method 100 of FIG. 1A. For example, the sequencing application 1110 can receive raw data from the sequencing device 1114 and determine a nucleobase sequence for a sample genome or a nucleic-acid segment. In some implementations, the sequencing application 1110 determines the sequences of nucleobases in DNA and/or RNA segments or oligonucleotides.
[0090] As also shown, the sequencing application 1110 includes the sequence assembly application 1106. As described herein, in some embodiments, the sequence assembly application 1106 can assemble sequence reads from a genomic nucleic acid sample. For example, in some embodiments, the sequence assembly application 1106 obtains flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids. The application then aligns the sequence reads to a first assembly of contigs or to a reference sequence and generates a linking model based on the alignment of the sequence reads to the first assembly or to the reference sequence. Based on the proximity of the flow cell clusters determined from their flow cell locations, the application may then determine linking information between the sequence reads based on the linking model. Alternatively, rather than generating a linking model, the application may determine colocation information between the plurality of (assembled) contigs in the first assembly based on the proximity of clusters determined from their flow cell locations, and then determine linking information between the sequence reads based on the colocation information. The application may then recruit sequence reads based on the alignment to the first assembly or to the reference sequence and generate a second assembly comprising recruited sequence reads and contigs. After the second assembly of contigs is created, the contigs in the second assembly may be scaffolded by ordering contiguous sequences of the second assembly to obtain a final assembly. It should be realized that the recruiting, assembling and/or scaffolding processes may include the linking information in order to improve each of those processes by incorporating the proximity of each cluster, or related read, to each other.
[0091] In some embodiments, the sequence assembly application 1106 obtains flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell. The sequence assembly application 1106 may then align the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads and assemble the sequence reads to obtain a first assembly. After the first assembly is obtained, the application may obtain linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads. The application may then identify a region of syntenic abnormality and reassemble the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
[0092] As described herein, in some embodiments the sequence assembly application 1106 can assemble sequence reads from a nucleic acid sample. For example, in some embodiments, the sequence assembly application 1106 obtains flow cell data comprising sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample along with flow cell location information on the locations of the clusters of the nucleic acids. The application may then assemble the sequence reads to produce a first assembly comprising a plurality of contigs without comparing those sequence reads to a reference sequence. The application may then determine linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids. Once that linking information is determined, the application may scaffold the contigs from the first assembly to produce a second assembly, wherein the second assembly has undefined contiguous sequences. The application may then map additional contigs from the first assembly to the undefined contiguous sequences on the second assembly by referencing the linking information to create a final assembly.
[0093] While the sequencing application 1110 has been described as including the sequence assembly application 1106, other systems or methods may be included within the sequencing application 1110, such as an application to detect sequence variants or to assemble sequence reads (not illustrated).
[0094] Moreover, while the sequence assembly application 1106 is described as being implemented on the server device(s) 1102, as part of the sequencing application 1110, in some implementations, the sequence assembly application 1106 is implemented by (such as located entirely or in part on) the user client device 1108, the sequencing device 1114, and/or the local device 1118. As mentioned, in some implementations, sequence assembly application 1106 is implemented by one or more other components of the computing system 1000, such as the sequencing device 1114. In particular, the sequence assembly application 1106 can be implemented in a variety of different ways across the server device(s) 1102, the network 1112, the user client device 1108, the local device 1118, and the sequencing device 1114.
[0095] As further shown in FIG. 1B, the computing system 1000 includes the user client device 1108. In various implementations, the user client device 1108 can generate, store, receive, and send digital data. In particular, the user client device 1108 can receive the data from the sequencing device 1114. As further illustrated, the user client device 1108 includes a sequencing application 1110. The sequencing application 1110 may be a web application or a native application stored and executed on the user client device 1108 (for example, a mobile application, desktop application, or web application). The sequencing application 1110 can receive data from the sequencing application 1110 and/or the sequence assembly application 1106. For example, the user client device 1108 can receive variant call files and/or alignment files from the sequencing application 1110.
[0096] The sequencing application 1110 can also include instructions that (when executed) cause the user client device 1108 to receive data from the sequence assembly application 1106 and present data from the sequencing device 1114 and/or the server device(s) 1102. Furthermore, the sequencing application 1110 can instruct the user client device 1108 to display data for variant calls, such as nucleobase calls or an indication of a sequence variant or genotype. Indeed, the user client device 1108 can display nucleobase call results for a genome sample and/or an indication of a predicted variant or genotype.
[0097] As further shown in FIG. IB, the computing system 1000 includes the sequencing device 1114. In various implementations, the sequencing device 1114 can sequence a genomic sample or other nucleic- acid polymer. For example, the sequencing device 1114 analyzes nucleic-acid segments or oligonucleotides extracted from genomic samples to generate data either directly or indirectly on the sequencing device 1114. More particularly,the sequencing device 1114 receives and analyzes, within nucleotide-sample slides (such as flow cells), nucleic-acid sequences extracted from genomic samples. In one or more implementations, the sequencing device 1114 utilizes sequencing by synthesis (SBS) to sequence a genomic sample or other nucleic-acid polymers. In addition to, or in the alternative to communicating across the network 1112, in some implementations, the sequencing device 1114 bypasses the network 1112 and communicates directly with the user client device 1108.
[0098] As further depicted in FIG. IB, in some implementations, the server device(s) 1102 includes a distributed collection of servers, where the server device(s) 1102 include several server devices distributed across the network 1112 and located in the same or different physical locations. For instance, the server device(s) 1102 can be implemented, in whole or in part, on the local device 1118. To illustrate, the local device 1118 may implement the sequencing application 1110 and / or the sequence assembly application 1106. Further, the server device(s) 1102 and / or the local device 1118 can include a content server, an application server, a communication server, a web-hosting server, or another type of server.
[0099] The user client device 1108 is illustrated in FIG. IB can include various types of client devices. For example, in some implementations, the user client device 1108 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In various implementations, the user client device 1108 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones.
[0100] Though FIG. IB illustrates the components of the computing system 1000 communicating via the network 1112, in certain implementations, the components of the computing system 1000 can also communicate directly with each other, bypassing the network 1112. For instance, in some implementations, the user client device 1108 communicates directly with the sequencing device 1114. Additionally, in some implementations, the user client device 1108 communicates directly with the sequence assembly application 1106 and / or the server device(s) 1102. In some implementations, the user client device 1108 communicates directly with the local device 1118. Moreover, the sequence assembly application 1106 can access one or more databases housed on or accessed by the server device(s) 1102 or elsewhere in the computing system 1000.
[0101] FIG. 1C is a block diagram of an exemplary server device 1102 that may be used in connection with the computing system 1000 of FIG. 1B. The server device 1102 may be configured to assemble sequence reads from a genomic nucleic acid sample. The general architecture of the server device 1102 depicted in FIG. 1C includes an arrangement of computer hardware and software components. The server device 1102 may include many more (or fewer) elements than those shown in FIG. 1C. It is not necessary, however, that all of these generally conventional elements be shown in order to provide an enabling disclosure. As illustrated, the server device 1102 includes a processing unit 110, a network interface 120, a computer readable medium drive 130, an input / output device interface 140, a display 150, and an input device 160, all of which may communicate with one another by way of a communication bus. The network interface 120 may provide connectivity to one or more networks or computing systems. The processing unit 110 may thus receive information and instructions from other computing systems or services via a network. The processing unit 110 may also communicate to and from memory 170 and further provide output information for an optional display 150 via the input / output device interface 140. The input / output device interface 140 may also accept input from the optional input device 160, such as a keyboard, mouse, digital pen, microphone, touch screen, gesture recognition system, voice recognition system, gamepad, accelerometer, gyroscope, or other input device.
[0102] The memory 170 may contain computer program instructions (grouped as modules or components in some embodiments) that the processing unit 110 executes in order to implement one or more embodiments. The memory 170 generally includes RAM, ROM and / or other persistent, auxiliary or non-transitory computer readable media. The memory 170 may store an operating system 172 that provides computer program instructions for use by the processing unit 110 in the general administration and operation of the server device 1102. The memory 170 may store a reference sequence 173, such as for use by the sequencing application 1110. The memory 170 may further include computer program instructions and other information for implementing aspects of the present disclosure.
[0103] For example, in one embodiment, the memory 170 includes a sequencing application 1110, which may include a sequence assembly application 1106. The sequence assembly application 1106 can perform the methods disclosed herein. In addition, memory 170 may include or communicate with the data store 190 and / or one or more other data stores that store one or more inputs, one or more outputs, and / or one or more results (including intermediate results) of aligning sequence reads, and / or one or more reference sequences.
[0104] In some embodiments, the disclosed systems and methods may involve approaches for shifting or distributing certain sequence data analysis features and sequence data storage to a cloud computing environment or cloud-based network. User interaction with sequencing data, genome data, or other types of biological data may be mediated via a central hub that stores and controls access to various interactions with the data. In some embodiments, the cloud computing environment may also provide sharing of protocols, analysis methods, libraries, sequence data as well as distributed processing for sequencing, analysis, and reporting. In some embodiments, the cloud computing environment facilitates modification or annotation of sequence data by users. In some embodiments, the systems and methods may be implemented in a computer browser, on-demand or on-line.
[0105] In some embodiments, software written to perform the methods as described herein is stored in some form of computer readable medium, such as memory, CD-ROM, DVD-ROM, memory stick, flash drive, hard drive, SSD hard drive, server, mainframe storage system and the like.
[0106] In some embodiments, the methods may be written in any of various suitable programming languages, for example compiled languages such as C, C#, C++, Fortran, and Java. Other programming languages could be script languages, such as Perl, MatLab, SAS, SPSS, Python, Ruby, Pascal, Delphi, R and PHP. In some embodiments, the methods are written in C, C#, C++, Fortran, Java, Perl, R or Python. In some embodiments, the method may be an independent application with data input and data display modules. Alternatively, the method may be a computer software product and may include classes wherein distributed objects comprise applications including computational methods as described herein.
[0107] In some embodiments, the methods may be incorporated into pre-existing data analysis software, such as that found on sequencing instruments. Software comprising computer implemented methods as described herein is installed either onto a computer system directly, or is indirectly held on a computer readable medium and loaded as needed onto a computer system. Further, the methods may be located on computers that are remote from where the data is being produced, such as software found on servers and the like that are maintained in another location relative to where the data is being produced, such as that provided by a third party service provider.
[0108] An assay instrument, desktop computer, laptop computer, or server may contain a processor in operational communication with accessible memory comprising instructions for implementation of systems and methods. In some embodiments, a desktop computer or a laptop computer is in operational communication with one or more computer readable storage media or devices and / or outputting devices. An assay instrument, desktop computer and a laptop computer may operate under a number of different computer based operational languages, such as those utilized by Apple based computer systems or PC based computer systems. An assay instrument, desktop and / or laptop computers and / or server system may further provide a computer interface for creating or modifying experimental definitions and / or conditions, viewing data results and monitoring experimental progress. In some embodiments, an outputting device may be a graphic user interface such as a computer monitor or a computer screen, a printer, a hand-held device such as a personal digital assistant (such as PDA, Smart Phone, iPhone®), a tablet computer (such as an iPad®), a hard drive, a server, a memory stick, a flash drive and the like.
[0109] A computer readable storage device or medium may be any device such as a server, a mainframe, a supercomputer, a magnetic tape system and the like. In some embodiments, a storage device may be located onsite in a location proximate to the assay instrument, for example adjacent to or in close proximity to, an assay instrument. For example, a storage device may be located in the same room, in the same building, in an adjacent building, on the same floor in a building, on different floors in a building, etc. in relation to the assay instrument. In some embodiments, a storage device may be located off-site, or distal, to the assay instrument. For example, a storage device may be located in a different part of a city, in a different city, in a different state, in a different country, etc. relative to the assay instrument. In embodiments where a storage device is located distal to the assay instrument, communication between the assay instrument and one or more of a desktop, laptop, or server is typically via Internet connection, either wireless or by a network cable through an access point. In some embodiments, a storage device may be maintained and managed by the individual or entity directly associated with an assay instrument, whereas in other embodiments a storage device may be maintained and managed by a third party, typically at a distal location to the individual or entity associated with an assay instrument. In embodiments as described herein, an outputting device may be any device for visualizing data.
[0110] An assay instrument, desktop, laptop and / or server system may be used itself to store and / or retrieve computer implemented software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like. One or more of an assay instrument, desktop, laptop and / or server may comprise one or more computer readable storage media for storing and / or retrieving software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like. Computer readable storage media may include, but is not limited to, one or more of a hard drive, a SSD hard drive, a CD-ROM drive, a DVD-ROM drive, a floppy disk, a tape, a flash memory stick or card, and the like. Further, a network including the Internet may be the computer readable storage media. In some embodiments, computer readable storage media refers to computational resource storage accessible by a computer network via the Internet or a company network offered by a service provider rather than, for example, from a local desktop or laptop computer at a distal location to the assay instrument.
[0111] In some embodiments, computer readable storage media for storing and / or retrieving computer implemented software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like, is operated and maintained by a service provider in operational communication with an assay instrument, desktop, laptop and / or server system via an Internet connection or network connection.
[0112] In some embodiments, a hardware platform for providing a computational environment comprises a processor (such as CPU) wherein processor time and memory layout such as random access memory (such as RAM) are systems considerations. For example, smaller computer systems offer inexpensive, fast processors and large memory and storage capabilities. In some embodiments, graphics processing units (GPUs) can be used. In some embodiments, hardware platforms for performing computational methods as described herein comprise one or more computer systems with one or more processors. In some embodiments, smaller computers are clustered together to yield a supercomputer network.
[0113] In some embodiments, computational methods as described herein are carried out on a collection of inter- or intra-connected computer systems (such as grid technology) which may run a variety of operating systems in a coordinated manner. For example, the CONDOR framework (University of Wisconsin-Madison) and systems available through United Devices are exemplary of the coordination of multiple stand-alone computer systems for the purpose of dealing with large amounts of data. These systems may offer Perl interfaces to submit, monitor and manage large sequence analysis jobs on a cluster in serial or parallel configurations.
Methods
[0114] FIG. 2 is a flow diagram that schematically illustrates an exemplary method 200 of steps for obtaining sequence reads and flow cell locations of clusters from a nucleic acid sample placed on a flow cell. In some embodiments, the nucleic acid sample comprises genomic DNA. For example, a nucleic acid sample 201, such as a DNA sample, may be placed on flow cell 205. In some embodiments, the nucleic acid sample 201 comprises DNA fragments that are greater than 500 bp in length, greater than 1 kbp in length, greater than 5 kbp in length, greater than 10 kbp in length, greater than 100 kbp in length, greater than 200 kbp in length, greater than 250 kbp in length, greater than 300 kbp in length, greater than 500 kbp in length, or more. In some embodiments, the DNA fragments are high molecular weight DNA of about 250 kbp in length or more.
Obtaining Sequence Reads
[0115] Sequence reads can be obtained from sample nucleic acids on the flow cell 205. For example, in some embodiments, the flow cell 205 comprises transposome complexes bound to the flow cell 205. The transposome complexes include a transposase and a first polynucleotide comprising an end sequence and a first tag. An extracted nucleic acid sample is then contacted with the flow cell 205, and thus with the transposome complexes, under conditions that cause the transposomes to fragment the nucleic acid sample. Because the transposomes are bound to the flow cell 205, following cleavage of the genomic DNA sample, the resulting fragments become bound to the flow cell 205. The process may then include amplifying the fragmented nucleic acids to form a plurality of nucleic acid clusters on the flow cell 205. A sequencing by synthesis process may then be started to sequence the nucleic acids in each cluster on the flow cell to generate sequence reads. In some embodiments, the sequence reads comprise paired end sequence reads where each nucleic acid is sequenced by two primers bound in opposite directions to one another on the bound fragment.
[0116] Embodiments of the disclosure relate to systems and methods for sequencing target nucleic acids by fragmenting the target nucleic acids and distributing the fragments onto a flow cell. As the fragments are distributed along the flow cell, they bind capture primers and are then used to create clusters by well-known technologies, such as those provided by Illumina Inc. (San Diego, CA). According to the methods of this disclosure, fragments which were derived from the same nucleic acid sequence are more likely to bind to the flow cell in close physical proximity as compared to fragments that are from different nucleic acid sequences, particularly when the fragmentation is performed directly on the flow cell using immobilized transposome complexes on the surface of the flow cell. In some embodiments, the library preparation steps are performed on the flow cell, which may reduce the complexity and the amount of equipment associated with the systems. In some library preparations with fragmentation happening prior to loading, fragments can land anywhere in the flow cell independently of whether they came from the same nucleic acid molecule. However, when fragmentation is performed directly on the flow cell, proximity information is retained. This flow cell proximity information can be used to help guide assembly and variant calling of the original target nucleic acid sequence, as will be described in more detail below.
[0117] For example, transposome complexes may be provided as part of the sequencing process. In some embodiments, the transposome complexes include a transposase and a first polynucleotide having end sequences which can be used to fragment the target polynucleotides and insert into each fragment an end sequence or tag which can be used to bind to capture probes located on the substrate. The method can include contacting the transposome complexes with the target nucleic acid sequences under conditions to fragment the target nucleic acids and add capture sequences to the ends of each fragment. In some embodiments, the capture sequences include P5 or P7 sequences as provided by Illumina, Inc. In some embodiments, the complexed nucleic acid and transposome is in solution, and is then brought towards a substrate and immobilized thereon. In some embodiments, prior to immobilization of the transposome complexes on the substrate, one or more of the transposome complexes bind the target nucleic acids in solution. In this embodiment, the transposome complexes in solution become immobilized to the substrate.
[0118] While embodiments involving transposome complexes have been described above, a variety of sequencing library preparation methods and sequencing techniques may be used to capture flow cell and nucleic acid proximity information for sequence reads. For example, in some embodiments, the methods herein include rolling circle amplification or DNA nanoball sequencing. In some embodiments, any library preparation method that facilitates the recording and storing of flow cell proximity and nucleic acid proximity information may be used. For example, the methods can include library preparation by a multitude of different methods to prepare the nucleic acid fragments on the flow cell.
[0119] In some embodiments, the method 200 includes generating sequence reads from fragments of the target nucleic acid sample bound to a flow cell. The method 200 may proceed to block 210, wherein sequence reads are obtained by SBS or other methods from each of the clusters generated on the flow cell.
Obtaining Geographic Location Information
[0120] The method 200 may proceed to block 220, wherein geographic location information for each of the sequence reads is obtained. Geographic location information can include locations of sequence read clusters on the flow cell 205, where each cluster corresponds to a nucleic acid fragment that is sequenced at block 210. For example, flow cell locations of the sequence read clusters on the flow cell can be obtained. In some embodiments, the geographic location information comprises spatial coordinates in a Cartesian coordinate system. For example, the dimensions of the flow cell may be mapped with a Cartesian coordinate system, e.g., with x and y dimensions, and locations of clusters on a flow cell may be assigned to a spatial coordinate using this system. The spatial coordinates can be used to determine the flow cell proximity between two or more clusters on the flow cell.
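As a minimal illustrative sketch of how spatial coordinates could be used to determine flow cell proximity, the following Python example computes the Euclidean distance between clusters and lists the pairs that fall within a chosen radius. The function names, the cluster identifiers, the coordinate units, and the radius value are assumptions made for illustration only and are not taken from the disclosure.

```python
import math
from itertools import combinations


def flow_cell_distance(a, b):
    """Euclidean distance between two (x, y) cluster coordinates."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def neighbor_pairs(clusters, radius):
    """Return pairs of cluster ids whose flow cell distance is <= radius.

    clusters: dict mapping a cluster id to its (x, y) coordinate.
    radius:   hypothetical proximity cutoff, in the same units as x and y.
    """
    pairs = []
    for (id_a, xy_a), (id_b, xy_b) in combinations(sorted(clusters.items()), 2):
        if flow_cell_distance(xy_a, xy_b) <= radius:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical coordinates: c1 and c2 are close together, c3 is far away.
clusters = {"c1": (0.0, 0.0), "c2": (3.0, 4.0), "c3": (50.0, 50.0)}
print(neighbor_pairs(clusters, radius=10.0))
```

The all-pairs comparison is adequate for a small example; for the very large number of clusters on a real flow cell, a spatial index (for example a grid or k-d tree) would likely be a more practical choice.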
[0121] For example, once the fragments have been bound to substrate, the bound fragments can be amplified to form a plurality of nucleic acid clusters on the substrate. While block 220 is described as taking place after block 210 in the example of method 200, the location of each cluster on the flow cell can be determined before, during or after performing sequencing by synthesis reactions (SBS) to obtain the nucleotide sequence of each fragment located in each cluster.
Assembling Sequence Reads Using Flow Cell Locations
[0122] The method 200 may proceed to block 230, wherein sequence reads are assembled using flow cell locations. For example, this process can include receiving flow cell data comprising 1) sequence reads from the nucleic acids from the genomic nucleic acid sample and 2) the flow cell locations of the clusters of the nucleic acids, and aligning the sequence reads into contigs, each of which includes a plurality of sequence reads. The method 200 may also generate a linking model based on the alignment of the sequence reads within contigs to the first assembly or to the reference sequence, and based on the proximity of clusters determined from their flow cell locations, and determine linking information between the sequence reads based on the linking model. Alternatively, rather than generating a linking model, method 200 may determine colocation information between the plurality of (assembled) contigs in the first assembly based on the proximity of clusters determined from their flow cell locations, and then determine linking information between the sequence reads based on the colocation information. The method 200 may then recruit additional sequence reads to each contig based on the alignment to the first assembly or to the reference sequence and generate a second assembly comprising contigs which include the recruited sequence reads. The method may also scaffold the second assembly of contigs into a final assembly by ordering the contiguous sequences of the second assembly to obtain the final assembly. In some embodiments, recruiting, assembling and / or scaffolding of the sequence reads includes the linking information, as further described herein.
[0123] Furthermore, the process of block 230 can include receiving flow cell data from the nucleic acids, wherein the flow cell data comprises nucleic acid sequence reads and flow cell locations of the sequence reads on the flow cell; aligning the sequence reads to a reference sequence or the initial assembly to obtain a genomic location of the sequence reads; assembling the sequence reads to obtain a first assembly; obtaining linking information between pairs of sequence reads on the flow cell based on the flow cell location and genomic location of each sequence read in the pairs of sequence reads; identifying a region of syntenic abnormality; and reassembling the sequence reads in the region of syntenic abnormality based on the linking information, thereby determining an updated sequence in the region of syntenic abnormality.
Producing An Assembly of Contigs
[0124] FIG. 3 is a flow diagram that schematically illustrates an exemplary method 300 for producing a final assembly of contigs from the sequence reads. The method 300 can begin by performing a de novo assembly method 310 to produce an assembly 318 (“HYBRID ASM”) comprising a plurality of contigs without comparison to a reference sequence, i.e., performing a method of “de novo assembly”. Typical de novo assemblies produced from relatively short reads, such as from an NGS system, can be highly fragmented because longer assemblies may be difficult to produce from short reads when there is no reference sequence used as a guide to prepare longer fragments. Embodiments allow assembly of sequence reads in the absence of a reference genome which enables accurate and efficient reconstruction of genomic sequences. This de novo assembly may integrate unconventional computational techniques that improve assembly accuracy and scalability of such assembly systems.
[0125] Producing the first assembly can use a plurality of de novo assemblers to produce the hybrid assembly 318 from a set of sequence reads 312. Some assemblers in the plurality of de novo assemblers can produce relatively shorter contigs with decreased error rates, whereas other de novo assemblers can produce relatively longer contigs more rapidly, although they may have less accuracy. Using methodologies from a plurality of assemblers may produce improved first assemblies that align to a higher percentage of a reference genome with potentially shorter compute times and fewer misassemblies compared to assemblies produced using a single de novo assembly method. There are several non-limiting examples of de novo assemblers which may be used, including the assembler 314. For instance, the sequence reads 312 can be passed to both the assembler 314, which may produce assemblies comprising shorter, more fragmented contigs with higher accuracy, and to a second assembler 316 which may recruit additional sequence reads to produce larger, more contiguous coverage. In one embodiment, the process 300 uses both the assembler 314 to produce a first de novo assembly, and the second assembler 316 which may use linking or other information to recruit additional reads and improve on the contigs assembled by the assembler 314. The output of the two assemblers can be used to create contigs in the assembly 318. Of course, it should be realized that other embodiments may only use a single assembler to create contigs for the assembly 318, and aspects are not limited to a particular number of assemblers being used to create the contigs. FIG. 4 shows one non-limiting example of alignment-based statistics from assemblies of the Staphylococcus aureus genome performed using an exemplary SKESA assembler, a Unicycler assembler and a hybrid approach of using both example assemblers. As shown in FIG. 4, the hybrid assembly approach may produce longer alignments with higher accuracy compared to using a single assembly method.
Proximity Detection
[0126] In some embodiments, the sequence reads are read from clusters of nucleic acids on a flow cell. In some embodiments, the methods and systems use information relating to the proximity of clusters determined from their flow cell locations to assemble the sequence reads. According to the methods and systems of the present disclosure, flow cell proximity information can be used to infer genomic proximity.
[0127] Various sequencing library preparation methods and sequencing techniques may be used to capture flow cell and genomic proximity information. For example, in some embodiments, sequence reads are generated from fragments of a genomic nucleic acid (e.g., DNA) sample which are bound to the flow cell. For example, relatively long (e.g., greater than 500 bp) DNA fragments may be flowed across a sequencing flow cell and may attach to transposome complexes that are embedded on the surface of the flow cell. The transposome complexes can fragment the long DNA fragments into shorter fragments (e.g., of less than 500 bp), and attach sequencing adapters. Each shorter fragment becomes bound to the flow cell and can be amplified to form clusters. The sequence of each shorter fragment can be determined by performing SBS on the clusters of fragments. During this process, the physical location of each cluster on the flow cell can be recorded in addition to sequence reads.
[0128] As mentioned above, a linking model may be developed based on the geographic location information for each cluster of sequence reads on the flow cell, and may include a probability that two pairs of sequence reads on a sequencing flow cell are derived from the same original nucleic acid molecule. Thus, flow cell proximity information can be used to assemble sequence reads taken from a genomic nucleic acid sample. For example, flow cell proximity information and / or linking information may be used during steps of generating a first assembly, recruiting sequence reads, generating a second assembly, and / or scaffolding an assembly.
[0129] As used herein, a “syntenic abnormality” can refer to when sequence reads from a genomic nucleic acid sample do not align as expected to a reference sequence or assemble as expected based on alignment to a reference sequence. In some embodiments, the methods and systems identify a syntenic abnormality based on linking information. In some embodiments, there is an expected relationship between the proximity between clusters on a flow cell and the genomic distance between corresponding sequence reads after aligning to a reference genome. For example, clusters which are proximate to one another on the flow cell are expected to correspond to sequence reads which are genomically proximate to one another, and vice-versa. However, some features of a genome in a particular genomic nucleic acid sample can perturb the relationship between genomic proximity and flow cell proximity. For example, if a genomic DNA sample has a structural variant as compared to the reference genome, clusters may be in close proximity to one another on the flow cell because they are derived from the same original long DNA fragment, but when the sequence reads are mapped to the reference genome, their genomic distance is greater or less than would be expected based on the flow cell proximity / linking information because they are being mapped closer or farther apart for the sample than they are on the reference genome, due to the structural variant in the genomic DNA sample.
Scaffolding
[0130] Referring back to FIG. 3, the output of the assembly 318 can be provided to a mapping program 320, for further alignment and mapping of the contigs to a reference sequence. In some cases, the sequence reads 312 can be provided directly to the mapping program 320 without prior de novo assembly. In other cases, the contigs from the hybrid assembly 318 can be fed to the mapping program 320. After performing mapping with the program 320, the method 300 moves to a block 330 to scaffold the contigs into a final assembly with extended contigs and more contiguous coverage of the original nucleic acid.
[0131] As part of the scaffolding process performed at state 330, the contigs from the mapping process 320 can be assembled into larger contigs using proximity information as part of block 332. At this state 332, the process can include determining whether sequence reads are mapped on the same contig based on determining whether at least two or more sequence reads exceed a mapping quality threshold. In one example, assembling the contigs can be based on analyzing the sequence reads within each contig which are flow cell neighbors to look for linked sequence reads where each sequence read is mapped to a contig with a certain mapping quality (MAPQ) threshold. This is explained more fully below with reference to FIG. 5.
[0132] In some embodiments, scaffolding at block 330 further comprises selecting contigs based on proximity information and a mapping connection strength as part of block 334. The contigs may be selected based on their length in some embodiments, with the process first selecting the largest contigs to assemble, and then moving to progressively smaller contigs as they are scaffolded together. For example, the larger contigs of over 50 kbp may be first selected for assembly, followed by medium contigs of between 10 kbp and 50 kbp, followed by smaller contigs of less than 10 kbp. In some embodiments, scaffolding based on proximity information and a mapping connection strength can be particularly useful for small contigs having a length between about 500 bp and about 10 kbp. The mapping connection strength can be based upon the bin connection strength and the orientation of the contigs. More information on scaffolding contigs using bin connection strength and the order and orientation of the contigs is described below with reference to FIGs. 7A, 7B and 7C.
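The length-based selection order described above can be sketched as follows. This is a hedged illustration only: the tier names, the function names, and the handling of contigs that are exactly 10 kbp or exactly 50 kbp long (both treated here as medium) are assumptions, since the disclosure gives the size ranges only as examples.

```python
def size_tier(length_bp):
    """Classify a contig by length using the example cutoffs of 10 kbp and 50 kbp."""
    if length_bp > 50_000:
        return "large"
    if length_bp >= 10_000:
        return "medium"
    return "small"


def group_by_tier(contig_lengths):
    """Group contig ids into tiers, longest contig first within each tier.

    contig_lengths: dict mapping a contig id to its length in base pairs.
    The returned dict preserves the order large, medium, small, so that
    iterating over it visits the largest contigs first.
    """
    tiers = {"large": [], "medium": [], "small": []}
    by_length = sorted(contig_lengths.items(), key=lambda item: -item[1])
    for contig_id, length_bp in by_length:
        tiers[size_tier(length_bp)].append(contig_id)
    return tiers


# Hypothetical contig lengths in base pairs.
lengths = {"a": 120_000, "b": 8_000, "c": 25_000, "d": 60_000}
for tier, contig_ids in group_by_tier(lengths).items():
    print(tier, contig_ids)
```

A scaffolder following this order would attempt to join the "large" group first and then place progressively smaller contigs relative to the scaffolds already built.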
[0133] Following a method of scaffolding at block 330 to produce the second assembly, the second assembly may still include sections having undefined contiguous sequences which do not have contigs or sequence reads assigned to particular positions on the genome. The method 300 can use mapping information to add additional unmapped contigs to create a final assembly. For example, the method may first select larger unmapped contigs of more than 10, 20, 30, 40, or 50 kbp and analyze the flow cell proximity of sequence reads within the unmapped contigs to other sequence reads to determine where the unmapped contig may be properly mapped on the final assembly. The process then may look at the flow cell proximity of sequence reads in relatively smaller unmapped contigs of between 500 bp and 10 kbp as one example, and map those unmapped contigs to a final assembly. Finally, the system may look at sequence reads of approximately 100 bp to approximately 500 bp which are still unmapped as part of block 336 and map them to the final assembly based on their flow cell proximity to other contigs which have been mapped to the final assembly. In some embodiments, these undefined contiguous sequences may have a fixed sequence length. Additional contigs from the first assembly that have not been previously placed can be mapped to the undefined contiguous sequences on the second assembly by referencing linking information to create a final assembly.
[0134] After scaffolding has been completed, the assembly may be further provided to a program 342, such as Gap2Seq, for filling additional small nucleotide gaps in the assembly as part of block 340. These gaps between scaffolds may be filled by comparison of the assembly to a known reference genome in one embodiment. This may also include “polishing” or correcting errors in the final assembly sequence as discussed below. The scaffolds can then be extended in the remaining gaps in block 344 to give the final assembly 350.
Scaffolding Contigs Using Flow Cell Neighbors
[0135] In one embodiment, scaffolding contigs includes binning sequence reads that make up each contig if the sequence reads exceed the mapping quality (MAPQ) threshold. If a sequence read pair does exceed a MAPQ threshold, then the two or more sequence reads can be added to a plurality of bins as shown in FIG. 5. The addition of the reads to a particular bin can be further based on the number of flow cell neighbors of each read in the bin, which are shown as dashed lines in FIG. 5. The number of flow cell neighbors of each read can be found by determining if the number of neighboring reads in the bin with a shared k-mer signature is greater than a k-mer signature threshold, e.g., 3. In the context of mapping sequence reads as described herein, a “k-mer signature” may be a k-mer that appears in more than a threshold number of reads. In an embodiment, a k-mer signature is a k-mer that appears at least n times in a set of reads, wherein n is any integer, for example 2, 3, 4, 5, 10, or more. Optionally, a k-mer signature is a k-mer that appears at least two times, at least three times, at least four times, at least five times, or at least ten times in a set of sequence reads.
[0136] Thus, the k-mers within each read can be determined, and then compared between the reads to determine how’ many of these k-mer signatures are in common. The k-mer signature threshold can be based on a number of sequence reads that are genomically adjacent to the sequence reads m the plurality of bins.
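The k-mer signature comparison described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: it assumes that a k-mer is counted once per read, that `min_reads` plays the role of n, and that a read counts as a neighbor when it shares at least `min_shared` signatures. All function and parameter names are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def kmer_signatures(reads, k, min_reads=2):
    """K-mers that appear in at least `min_reads` reads of the set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))  # each k-mer is counted once per read
    return {kmer for kmer, n in counts.items() if n >= min_reads}


def shared_signatures(read_a, read_b, signatures, k):
    """Number of signature k-mers that two reads have in common."""
    return len(kmers(read_a, k) & kmers(read_b, k) & signatures)


def neighbor_count(i, bin_reads, signatures, k, min_shared=1):
    """Count reads in a bin that share at least `min_shared` signatures with read i."""
    return sum(
        1 for j, other in enumerate(bin_reads)
        if j != i and shared_signatures(bin_reads[i], other, signatures, k) >= min_shared
    )
```

The resulting neighbor count can then be compared against a k-mer signature threshold (for example 3, as in the text) when deciding whether to add a read to a bin.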
[0137] After binning the sequence reads, a bin connection strength can be determined between the plurality of bins based on at least two sequence reads, which are sorted into different bins but are genomically adjacent to one another. For instance, a read in Bin i and its neighbor in Bin k make one connection. The bin-connection strength can be determined from the depth-normalized number of connections between Bin i and Bin k using the depth of Bin k for normalization. Thus, two bins with a higher number of depth-normalized connections will have a higher bin strength when compared to two bins with a lower number of depth-normalized connections between the two bins. As such, the bin connection strength can be based upon proximity information for the clusters of sequence reads determined from the flow cell locations of the clusters. Contigs may be assembled by determining the maximum bin connection strength between bins.
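A minimal Python sketch of the depth-normalized bin connection strength described above is given below. It assumes that neighbor relationships are supplied as pairs of read identifiers and that the depth of each bin is known; the data layout and the function names are illustrative assumptions only.

```python
from collections import Counter


def bin_connection_strength(links, read_to_bin, bin_depth):
    """Depth-normalized connection counts between ordered bin pairs.

    links: iterable of (read, neighbor_read) flow cell neighbor pairs
    read_to_bin: maps read id -> bin id
    bin_depth: maps bin id -> read depth of that bin
    Returns {(bin_i, bin_k): connections / depth(bin_k)}.
    """
    connections = Counter()
    for read, neighbor in links:
        bin_i, bin_k = read_to_bin.get(read), read_to_bin.get(neighbor)
        if bin_i is None or bin_k is None or bin_i == bin_k:
            continue  # only reads sorted into different bins make a connection
        connections[(bin_i, bin_k)] += 1
    return {
        (bin_i, bin_k): n / bin_depth[bin_k]
        for (bin_i, bin_k), n in connections.items()
        if bin_depth.get(bin_k, 0) > 0
    }


def strongest_partner(strengths, bin_i):
    """Bin with the maximum connection strength to bin_i, or None if unconnected."""
    candidates = {k: s for (i, k), s in strengths.items() if i == bin_i}
    return max(candidates, key=candidates.get) if candidates else None
```

Because the count is normalized by the depth of Bin k, a deeply covered bin does not appear strongly connected merely because it contains more reads.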
[0138] Each contig can have two segments, or terminals, a head terminal and a tail terminal, as shown in FIG. 6. Each contig can also have an orientation associated with the order of the head and tail terminals. In some embodiments, the method includes determining the orientation of a first contig in the plurality of contigs in relation to the orientation of a second contig in the plurality of contigs based on a threshold number of bin connections between the first contig and the second contig. For example, a connection at the head of Contig i is strong if the total absolute number of connections of all bins in Head i that are connected to some bin in the head or tail of another contig is more than a threshold, e.g., 10. In FIG. 6, bins that are shaded the same are best connected with each other. Based on the connections between the terminals, a first data structure, such as a graph of connections, can be built, wherein each node on the graph corresponds to a terminal of a contig, and each edge on the graph has a weight based on the bin connection strength between the nodes connected by the edge. In some embodiments, the edge-weight can be the length-normalized sum of the bin connection strengths of all bins for a terminal of the first contig that are connected to a terminal of the second contig. This graph data structure may be stored in a memory, or other storage, of the system, in one embodiment.
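The terminal graph described above can be sketched in Python as a dictionary of weighted edges between (contig, terminal) nodes. The sketch also applies the unique-maximum, reciprocal pruning rule set out in the paragraph that follows. The representation, the function names, and the example weights are assumptions made for illustration.

```python
def build_graph(weighted_edges):
    """Undirected graph as {node: {neighbor: weight}}; nodes are (contig, 'H' or 'T')."""
    graph = {}
    for u, v, w in weighted_edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def unique_best(graph, node):
    """Neighbor joined to `node` by its single heaviest edge, or None on a tie."""
    edges = graph.get(node, {})
    if not edges:
        return None
    top = max(edges.values())
    winners = [nbr for nbr, w in edges.items() if w == top]
    return winners[0] if len(winners) == 1 else None


def prune(graph):
    """Keep only edges that are the unique maximum at both of their end nodes."""
    pruned = {node: {} for node in graph}
    for u in graph:
        v = unique_best(graph, u)
        if v is not None and unique_best(graph, v) == u:  # reciprocity check
            pruned[u][v] = graph[u][v]
            pruned[v][u] = graph[u][v]
    return pruned


# Example mirroring the three-contig chain: T_m-H_i is stronger than H_i-H_k.
edges = [
    (("m", "T"), ("i", "H"), 0.9),
    (("i", "H"), ("k", "H"), 0.4),
    (("i", "T"), ("k", "T"), 0.7),
]
pruned_graph = prune(build_graph(edges))
```

In this example the weaker H_i to H_k edge is removed, so each terminal keeps at most one partner and the pruned graph can be walked terminal by terminal to read off contig order and orientation.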
[0139] For each node in the graph, the graph can be pruned by removing all edges except the strongest edge originating from the node. This type of pruning may increase the computational efficiency of traversing the graph to produce an assembly of contigs since removing edges based on the threshold allows systems described herein to quickly and accurately identify the correct path, enabling efficient mapping of contigs. Pruning can be based upon determining if an edge has the maximum weight amongst all edges originating from the node and that the weight is unique, such that no other edge has the same weight. Edges are reciprocal, and after pruning, an edge between the head / tail of a first contig and the head / tail of a second contig is bidirectional due to the reciprocity. In some embodiments, pruning the edges from the first graph can be based upon a unique maximum weight amongst all edges in the graph and a reciprocity metric to generate a second graph. The second, pruned graph can be traversed to give the order and orientation of the contigs in the second, or final, assembly. Paths in the first and second graphs can be cyclic. For example, FIG. 6 shows a representation of determining a path, or chain, between Contig_m, Contig_i, and Contig_k. The tail of Contig_m (T_m) can have a connection to the head of Contig_i (H_i), which can also have a connection to the head of Contig_k (H_k). The tail of Contig_i (T_i) can have a connection to the tail of Contig_k (T_k). If, for instance, the connection of H_i to T_m is stronger than the connection of H_i to H_k, the edge between H_i and H_k can be pruned, providing a path between the contigs of: H_m – H_i – T_k.
Detecting and Correcting Misassemblies
[0140] In some embodiments, assembling the sequence reads using linking information from flow cell locations to produce an assembly comprising a plurality of contigs without comparison to a reference sequence can be completed with errors in the assembly. In some embodiments, linking information can further be used to detect misassemblies in the first assembly following the hybrid de novo assembly. The first assembly of contigs can be scaffolded to produce a second assembly. Scaffolding can include ordering and orienting contigs from the first assembly to produce the second assembly using proximity information. Scaffolding can further comprise incorporating additional contigs into the second assembly based on a measure of the strength of the determined linking information. In some embodiments, the incorporation of additional contigs to the second assembly can be based upon an assembly graph, wherein contigs are represented as outputs of non-branching paths in the assembly graph. In other embodiments, linking information can be used to detect misassemblies in the second assembly following scaffolding.
[0141] In some embodiments, misassemblies can be detected based on identifying an abnormality based on a signal from colocation data based on the linking information. The colocation data may be in any form indicating the number of links between bins of reads on the scaffolded contigs. For example, the colocation data may be an array or matrix stored in memory and indicating the number of links between bins of sequence reads. Alternatively, the colocation data may be a 2-dimensional plot of the same data, as shown in the colocation plot of FIG. 9. In some embodiments, colocation plots visualize the number of links between sequence reads from a sample. The colocation data may also comprise a graph representing linking relationships between different regions of a reference genome. In some embodiments, the colocation data comprises counts of links between sequence reads in different subsections (e.g., bins) of a reference genome.
[0142] For example, the signal may be an off-diagonal signal from the colocation data. The colocation data may be in the form of a colocation plot or colocation matrix as shown in FIG. 9.
[0143] Colocation maps divide the genome into bins and count reads in neighboring clusters for each possible pair of genomic bins. Large numbers of reads from neighboring clusters occur almost exclusively when those bins are in close genomic proximity. In regions with no structural variants, bins that are nearby in the reference genome are nearby in the sample and appear as a diagonal line in the colocation plot. In regions with structural variants, nearby bins in the reference genome are no longer nearby in the sample and exhibit off-diagonal signals. FIG. 9 shows a colocation map, wherein an inversion of 100 kbp is introduced at the 1 Mbp position and a deletion of 60 kbp is introduced at the 2 Mbp position. The inversion, for example, generates ‘butterfly’ shapes off-diagonal with a location that identifies the event breakpoint. Colocation data can be provided to a machine learning model, such as a deep learning object detection model, to identify errors in the order and / or orientation of contigs in the first assembly or second assembly. In some embodiments, misassemblies can be detected based on the proximity link size deviation in the linking information.
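A colocation matrix of the kind described above can be sketched in plain Python as follows. The sketch counts links between genomic bins and then reports bin pairs that are far apart yet heavily linked. The fixed distance and link-count cutoffs are a simple stand-in chosen for illustration; they are not the machine learning detector mentioned in the text, and all names are hypothetical.

```python
def colocation_matrix(links, read_pos, bin_size, genome_len):
    """Symmetric matrix of link counts between genomic bins.

    links: iterable of (read_a, read_b) pairs from neighboring clusters
    read_pos: maps read id -> mapped genomic position (0-based)
    """
    n_bins = -(-genome_len // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for a, b in links:
        i, j = read_pos[a] // bin_size, read_pos[b] // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix


def off_diagonal_signals(matrix, max_bin_distance, min_links):
    """Bin pairs more than max_bin_distance apart with at least min_links links."""
    hits = []
    n_bins = len(matrix)
    for i in range(n_bins):
        for j in range(i + max_bin_distance + 1, n_bins):
            if matrix[i][j] >= min_links:
                hits.append((i, j, matrix[i][j]))
    return hits
```

Each reported triple gives the two bin indices and their link count, which can be used to mark a candidate breakpoint for review or for breaking the assembly at that site.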
[0144] Upon detection of a misassembly in the first assembly or the second assembly, the first or second assembly can be broken into separate contigs at the site of the misassembly. In other embodiments, misassemblies can be detected without breaking or correcting the assembly and the position of the misassembly in the assembly can be simply marked.
[0145] Proximity information can then be used to create a corrected assembly. In some embodiments, correcting the detected misassembly can include correcting the order of contigs in the first assembly. In other embodiments, correcting the detected misassembly can include correcting the orientation of contigs relative to one another. In some embodiments, correcting the detected misassembly can include correcting the order and orientation of contigs in the first assembly or the second assembly using proximity information.
Determining Mapping Connection Strength to Produce Assemblies
[0146] FIGs. 7A, 7B and 7C illustrate one embodiment of reviewing contigs to determine the sequence reads within each contig and using that information to properly order and orient the contigs in a final assembly. For example, a first contig (Ci) has a head end 710 and a tail end 715. The head end 710 includes the sequence read 718 at the 5’ end of the first contig. The first contig also has internal sequence reads 720 and 722. A second contig (Ck) has a head end 725 and a tail end 728, plus internal sequence reads 730, 732, 734, 736 and 738. As shown in FIGs. 7A, 7B and 7C, sequence reads with similar shading are found to be linked to one another on the flow cell. Thus, sequence reads 718, 730 and 738 are linked. Similarly, sequence reads 732, 734 and 722 are found to be linked to one another. Finally, sequence reads 720 and 736 are found to be linked to one another.
[0147] The first contig Ci can be linked to the second contig Ck if sequence reads at a first end of one contig are found to be linked to sequence reads at a second end of the second contig. Determining the order and orientation of how the contigs may be assembled with each other can be based on which end of Ck should be connected to which end of Ci. This can depend on whether the linked sequence reads lie in the head or tail of each of the contigs. For example, as shown in FIG. 7A, Ck can be connected to the tail of Ci because the bin with the maximum bin-connection strength, indicated with the diagonal pattern, lies in the tail of Ck. Reads mapped at the end of contig Ci that have mates on the inferred terminal of contig Ck that are in opposite directions (or in the same direction for split alignments) can be used to determine the mapping connection strength. Ci will be connected to the head of Ck in the same orientation as mates or reads near its terminal that are in opposite directions on Ck as shown in FIG. 7B. A contig may be connected to more than one contig if the mapping connection strength is above a threshold. In some embodiments, the proximity information and mapping connection strength of a first contig to a second contig can be used to determine the orientation of the first contig relative to a third contig. For instance, in FIG. 7C, the order of Ci and Ck is initially unknown, as is the orientation of Ci with respect to Ck, because no mapping connection between Ci and Ck exists. However, based on the mapping connection between Ci and Cm, it can be determined that the tail of Ci is connected to the head of Cm in the orientation shown in FIG. 7C. Further, because the reads in the tail of Cm indicate a split alignment with the head of Ck, Cm will be connected to the head of Ck in the opposite direction. Thus, the orientation of Ci with respect to Ck can also be determined based on the orientation of Cm and Ck determined from the mapping connection strength.
If mapping connection strength is the same for multiple small contigs placed near a large contig’s terminal, proximity bin order can be used to scaffold the second assembly.
[0148] However, after producing the second assembly, it may still include sections having undefined contiguous sequences. For instance, as shown in FIG. 8A, unplaced contigs Cp and Cq can be aligned to segments around an undefined contiguous sequence, Gap_i. Cp and Cq will be connected to the tail of Scaffold Segment_i and the head of Scaffold Segment_i+1 if the mapping connection strength of their mates and split reads is greater than a threshold (e.g., 100) or the fraction of total reads that make connections is greater than a threshold (e.g., 20%). In some cases, a contig, Cm, may be mapped almost fully in a segment of the scaffold and can be dropped. FIG. 8B shows the results of including tiny contigs in closing gaps in the assembly of the Staphylococcus aureus genome.
Determining Circularity of Assemblies
[0149] Assemblies produced by traditional short-read de novo genome assembly methods are typically too fragmented for genome completion to be practical. FIG. 10A visualizes a short-read assembly from the de novo assembly graph for Pseudomonas aeruginosa that was built using Unicycler. This Bandage (Bioinformatics Application for Navigating De novo Assembly Graphs Easily) plot uses nodes in the graph to represent contigs. The Unicycler de novo assembly produces a graph with 222 nodes. Assembly according to the methods described herein using proximity information enables the circularity of an assembly to be determined using a Bandage plot. As shown in FIG. 10B, using proximity information significantly reduces the number of nodes to just 8 nodes and enables the circularity, completeness, and accuracy of a genome assembly to be determined.
Polishing Assemblies
[0150] Embodiments of the disclosure relate to systems and methods for correcting sequence assemblies by analyzing the pile up of sequence reads at each position along a genome and using that information to correct relatively minor sequence errors. For example, a particular pile up of sequence reads may have 10 sequence reads covering a particular region. Of those 10 sequence reads, nine of them show an adenosine “A” at position 3, whereas one of the reads has a cytosine “C” at position 3. If the current assembly shows a “C” at position 3, the system or method may update that position to have an adenosine nucleotide based on the number of reads showing that nucleotide base at position 3.
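The pile-up correction in the example above can be sketched in Python as a per-position majority vote. This is a minimal illustration that assumes the pile-up is already available as a map from assembly position to observed read bases; the majority cutoff `min_fraction` and the function names are assumptions and are not specified in the disclosure.

```python
from collections import Counter


def consensus_base(column, min_fraction=0.5):
    """Most common base in a pile-up column if it exceeds min_fraction, else None."""
    if not column:
        return None
    base, n = Counter(column).most_common(1)[0]
    return base if n / len(column) > min_fraction else None


def polish(assembly, pileup, min_fraction=0.5):
    """Return (corrected assembly, sorted list of corrected 0-based positions).

    pileup maps a 0-based assembly position to the list of read bases seen there.
    """
    bases = list(assembly)
    corrected = []
    for pos, column in pileup.items():
        call = consensus_base(column, min_fraction)
        if call is not None and call != bases[pos]:
            bases[pos] = call
            corrected.append(pos)
    return "".join(bases), sorted(corrected)


# Nine reads show "A" and one shows "C" at the third base (index 2).
polished, changed = polish("GGCTT", {2: ["A"] * 9 + ["C"]})
```

Returning the list of changed positions also supports the alternative of marking discrepancies for later analysis instead of editing the assembly in place.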
[0151] Thus, the process of de novo assembly may start with assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell. The system may obtain flow cell data that includes sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample as well as the flow cell locations of the clusters of the nucleic acids. The system may then assemble the sequence reads to produce a first de novo assembly with a plurality of contigs without comparison to a reference sequence. The system may then determine linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids. The application may then correct the assembly based on proximity information and the determined links between the sequence reads to discover errors where one contig may be misplaced based on the linking information.
[0152] Correcting the assembly can include using a pile-up comprising a plurality of linked sequence reads. The linked sequence reads can comprise local sequence reads having 1-1000 bp. In some embodiments, correcting the assembly comprises identifying positions of discrepancy between the assembly and a consensus sequence built from the linked sequence reads in the pile-up. In some embodiments, the method can update or correct the assembly at the positions of discrepancy based on the consensus sequence from the pile-up. In other embodiments, the method can mark the position of discrepancy for subsequent analysis.
Definitions
[0153] The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.
[0154] Although the following terms are believed to be well understood by one of skill in the art, the following definitions are set forth to facilitate understanding of the presently disclosed subject matter.
[0155] All technical and scientific terms used herein, unless otherwise defined below, are intended to have the same meaning as commonly understood by one of ordinary skill in the art. References to techniques employed herein are intended to refer to the techniques as commonly understood in the art, including variations on those techniques or substitutions of equivalent techniques that would be apparent to one of skill in the art.
[0156] As used herein, the terms “a” or “an” or “the” may refer to one or more than one. For example, “a” marker can mean one marker or a plurality of markers.
[0157] As used herein, the term “and / or” refers to and encompasses any and all possible combinations of one or more of the associated listed items, as well as the lack of combinations when interpreted in the alternative (“or”).
[0158] Throughout this specification, unless the context requires otherwise, the words “comprise,” “comprises,” and “comprising” will be understood to imply the inclusion of a stated step or element or group of steps or elements but not the exclusion of any other step or element or group of steps or elements.
[0159] As used herein, the term “consists essentially of” (and grammatical variants thereof), as applied to the compositions and methods of the present disclosure, means that the compositions / methods may contain additional components so long as the additional components do not materially alter the composition / method.
[0160] The term “nucleic acid” or “polynucleotide” refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise limited, encompasses known analogs of natural nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides, such as peptide nucleic acids (PNAs) and phosphorothioate DNA. Unless otherwise indicated, a particular nucleic acid sequence includes the complementary sequence thereof. Nucleotides include, but are not limited to, ATP, dATP, CTP, dCTP, GTP, dGTP, UTP, TTP, dUTP, 5-methyl-CTP, 5-methyl-dCTP, ITP, dITP, 2-amino-adenosine-TP, 2-amino-deoxyadenosine-TP, 2-thiothymidine triphosphate, pyrrolo-pyrimidine triphosphate, and 2-thiocytidine, as well as the alphathiotriphosphates for all of the above, and 2'-O-methyl-ribonucleotide triphosphates for all the above bases. Modified bases include, but are not limited to, 5-Br-UTP, 5-Br-dUTP, 5-F-UTP, 5-F-dUTP, 5-propynyl dCTP, and 5-propynyl-dUTP.
[0161] As used herein, the term "fragment," when used in reference to a first nucleic acid, is intended to mean a second nucleic acid having a part or portion of the sequence of the first nucleic acid. Generally, the fragment and the first nucleic acid are separate molecules. The fragment can be derived, for example, by physical removal from the larger nucleic acid, by replication or amplification of a region of the larger nucleic acid, by degradation of other portions of the larger nucleic acid, a combination thereof or the like. The term can be used analogously to describe sequence data or other representations of nucleic acids. As used herein, the term "haplotype" refers to a set of alleles at more than one locus inherited by an individual from one of its parents. A haplotype can include two or more loci from all or part of a chromosome. Alleles include, for example, single nucleotide polymorphisms (SNPs), short tandem repeats (STRs), gene sequences, chromosomal insertions, chromosomal deletions etc. The term "phased alleles" refers to the distribution of the particular alleles from a particular chromosome, or portion thereof. Accordingly, the "phase" of two alleles can refer to a characterization or representation of the relative location of two or more alleles on one or more chromosomes.
[0162] “Fragmentation” as described herein refers to the shearing or fragmenting of nucleic acid into shorter lengths. Fragmentation methods include enzymatic, physical (including sonication, nebulization, needle shearing, microwave, etc.), and chemical (including depurination, hydrolysis, oxidation, etc.). The terms “fragmenting enzymes” or “enzyme-based fragmentation” or “enzyme fragmentation” as used herein refer to enzymes that fragment nucleic acid. The enzymes can be a single enzyme or two or more enzymes that work together to fragment the nucleic acid. Some enzymes work on single stranded nucleic acid whereas others work on double stranded nucleic acid and yet others work on one strand of a double stranded nucleic acid. Fragmenting enzymes can cut randomly or specifically. Non-limiting examples of fragmenting enzymes include transposase, restriction enzymes, Argonaute, CRISPR-associated nuclease (Cas), endonucleases, exonuclease, topoisomerase, Fragmentase™ (New England Biolabs, Ipswich, MA). Preferred fragmentation embodiments include methods that fragment while retaining proximity information of the fragments.
[0163] As used herein, the term "nucleotide sequence" or simply “sequence” is intended to refer to the order and type of nucleotide monomers in a nucleic acid polymer. A nucleotide sequence is a characteristic of a nucleic acid molecule and can be represented in any of a variety of formats including, for example, a depiction, image, electronic medium, series of symbols, series of numbers, series of letters, series of colors, etc. The information can be represented, for example, at single nucleotide resolution, at higher resolution (e.g. indicating molecular structure for nucleotide subunits) or at lower resolution (e.g. indicating chromosomal regions, such as haplotype blocks). A series of "A," "T," "G," and "C" letters is a well-known sequence representation for DNA that can be correlated, at single nucleotide resolution, with the actual sequence of a DNA molecule. A similar representation is used for RNA except that "T" is replaced with "U" in the series.
[0164] As used herein, the term “reference genome” or “reference sequence” refers to any particular known genome sequence, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov. In various embodiments, the reference sequence is significantly larger than the reads that are aligned to it. For example, it may be at least about 100 times larger, or at least about 900 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger. In one example, the reference sequence is that of a full-length genome. Such sequences may be referred to as genomic reference sequences. Other examples of reference sequences include genomes of other species, such as of control organisms as disclosed herein, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species. In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual. Examples of reference genomes include GRCh38 from the Genome Reference Consortium.
[0165] The term “nucleic acid sample” herein may refer to a sample, typically derived from any organism, including but not limited to animals, plants, fungi, and microbes. For example, such samples may be derived from one or more biological fluids, cells, tissues, organs, or organisms, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence. Such samples may include, but are not limited to, sputum / oral fluid, amniotic fluid, blood, a blood fraction, or fine needle biopsy samples (such as surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although the sample is often taken from a human subject (such as a patient), the sample may be from any mammal, including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, etc. Alternatively, the sample may be microbial, such as bacterial, viral, or fungal. The sample may be used directly as obtained from the biological source or following a pretreatment to modify the character of the sample. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids and so forth. Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc. If such methods of pretreatment are employed with respect to the sample, such pretreatment methods are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes at a concentration proportional to that in an untreated test sample (namely, a sample that is not subjected to any such pretreatment method(s)). Such “treated” or “processed” samples are still considered to be biological “test” samples with respect to the methods described herein.
A “nucleic acid sample” may also include nucleic acid sequence information stored in a memory, and which was originally obtained from a source such as one or more biological fluids, cells, tissues, organs, or organisms.
[0166] The sample can include high molecular weight material, such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another implementation, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some implementations, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some implementations, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some implementations, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another implementation, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus, or fungus. In some implementations, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0167] As further used herein, the term “sequencing run” refers to an iterative process on a sequencing device to determine a primary structure of nucleotide sequences from a sample (e.g., a genomic sample). In particular, a sequencing run includes cycles of sequencing chemistry and imaging performed by a sequencing device (including an imaging device, such as a CCD or CMOS) that incorporate nucleobases into growing oligonucleotides to determine nucleotide reads from nucleotide sequences extracted from a sample (or other sequences within a library fragment) and seeded throughout a flow cell or other nucleotide-sample slide. In some cases, a sequencing run includes replicating oligonucleotides derived or extracted from one or more genomic samples seeded in clusters throughout a flow cell. Upon completing a sequencing run, a sequencing device can generate base-call data in a file, such as a binary base call (BCL) sequence file or a fast-all quality (FASTQ) file.
[0168] Relatedly, the term “sequencing cycle” (or “cycle”) refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to a sample’s sequence (e.g., a genomic or transcriptomic sequence from a sample) or a corresponding adapter sequence. In some cases, a sequencing cycle includes an iteration of both incorporating nucleobases into clusters of oligonucleotides using sequencing chemistry and capturing images of such clusters attached to a nucleotide-sample slide (e.g., a flow cell). Accordingly, cycles can be repeated as part of sequencing a nucleic-acid polymer (e.g., a sample genomic sequence). For example, in one or more embodiments, each sequencing cycle involves incorporating nucleobases into either a single nucleotide read in which DNA or RNA strands are read in only a single direction or paired-end reads in which DNA or RNA strands are read from both ends but in different cycles. Further, in certain cases, each sequencing cycle involves a camera taking an image of the nucleotide-sample slide or multiple sections of the nucleotide-sample slide to generate image data for determining a particular nucleobase added or incorporated into particular oligonucleotides. Following the image capture stage, a sequencing system can remove certain fluorescent labels from incorporated nucleobases and perform another sequencing cycle until the nucleic-acid polymer has been completely sequenced. In one or more embodiments, a sequencing cycle includes a cycle within an SBS run. A sequencing cycle can include one or both of an indexing cycle and a genomic sequencing cycle.
For instance, one cluster of oligonucleotides or a set of clusters of oligonucleotides may be undergoing a genomic sequencing cycle in which nucleobases corresponding to a sample genomic sequence are incorporated and another cluster of oligonucleotides or another set of clusters of oligonucleotides may be concurrently undergoing an indexing cycle in which nucleobases corresponding to an indexing sequence for a nucleotide read are incorporated.
[0169] Further, as used herein, the term “nucleotide-sample slide” (or “nucleotide-sample substrate”) refers to a plate or substrate, such as a flow cell, comprising oligonucleotides for sequencing nucleotide sequences from genomic samples or other sample nucleic-acid polymers. In particular, a nucleotide-sample slide can refer to a substrate containing fluidic channels through which reagents and buffers can travel as part of sequencing. For example, in one or more embodiments, a flow cell (e.g., a patterned flow cell or non-patterned flow cell) may comprise small fluidic channels and oligonucleotide samples that can be bound to adapter sequences on the substrate. In other implementations, a nucleotide-sample slide can be an open substrate with one or more regions for oligonucleotide samples to be analyzed and the oligonucleotide samples may be positioned using charged pads or other means. In yet another implementation, the nucleotide-sample slide can be a membrane having a nanopore through which one or more oligonucleotide samples may pass.
[0170] Relatedly, as used herein, the term “region of a nucleotide-sample slide” (or “nucleotide-sample slide region”) refers to an area that is part of a nucleotide-sample slide. In particular, a region of a nucleotide-sample slide can refer to a discrete portion of a nucleotide-sample slide that differs from other portions of the nucleotide-sample slide. For instance, a region of a nucleotide-sample slide can include a subsection of a patterned flow cell comprising one or more wells (e.g., nano-wells) or a discrete subsection of a non-patterned flow cell (e.g., a subsection corresponding to one or more clusters). In some cases, a region (e.g., section) of a nucleotide-sample slide includes a tile or a sub-tile of a flow cell having clusters of oligonucleotides growing in parallel.
[0171] The term “read” or “sequence read” (or sequencing reads) refers to a sequence obtained from a portion of a nucleic acid sample. A read may be represented by a string of nucleotides sequenced from any part or all of a nucleic acid molecule. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, or G) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (such as at least about 25 bp) that can be used to identify a larger sequence or region, for example, that can be aligned and specifically assigned to a chromosome or genomic region or gene. For example, a sequence read may be a short string of nucleotides (such as 20-150 bases) sequenced from a nucleic acid fragment, a short string of nucleotides at one or both ends of a nucleic acid fragment, or the sequencing of the entire nucleic acid fragment that exists in the biological sample. Sequence reads may be obtained by any method known in the art. For example, a sequence read may be obtained in a variety of ways, such as using sequencing techniques or using probes, such as in hybridization arrays or capture probes, or amplification techniques.
[0172] Embodiments described herein can be used with any suitable sequencing chemistry, such as sequencing by synthesis (SBS), sequencing by binding, sequencing by ligation, sequencing-by-expansion, or nanopore sequencing.
[0173] SBS can be with or without the use of reversible terminators. For example, SBS can be initiated by contacting the target nucleic acids with one or more nucleotides (e.g., labelled, synthetic, modified, or a combination thereof), DNA polymerase, etc. Those features where a primer is extended using the target nucleic acid as template will incorporate a labeled nucleotide that can be detected. The incorporation time used in a sequencing run can be significantly reduced using the altered polymerases described herein. Optionally, the labeled nucleotides can further include a reversible termination property that terminates further primer extension once a nucleotide has been added to a primer. For example, a nucleotide analog having a reversible terminator moiety can be added to a primer such that subsequent extension cannot occur until a deblocking agent is delivered to remove the moiety. Thus, for embodiments that use reversible termination, a deblocking reagent can be delivered to the flow cell (before or after detection occurs). Washes can be carried out between the various delivery steps. The cycle can then be repeated n times to extend the primer by n nucleotides, thereby detecting a sequence of length n. Exemplary SBS procedures, fluidic systems, and detection platforms that can be readily adapted for use with an array produced by the methods of the present disclosure are described, for example, in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; WO 91/06678; WO 07/123744; U.S. Pat. Nos. 7,057,026 B2, 7,329,492 B2, 7,211,414 B2, 7,315,019 B2, 7,405,281 B2, and 8,343,746 B2. Sequence reads can be generated using instruments such as MiniSeq™, MiSeq™, NextSeq™, HiSeq™, and NovaSeq™ sequencing instruments from Illumina, Inc. (San Diego, CA).
[0174] One example of SBS is termed sequencing by binding. One implementation of sequencing by binding includes cycles of initiating sequencing of a template with a reversible blocker on the 3’ end to prevent additional bases from incorporating, interrogating the template by flooding the flow cell with fluorescently tagged bases that do not include a blocker and measuring an emitted signal of bound bases, activating the 3’ end via removal of the reversible blocker, and incorporating the complementary base from unlabeled, blocked nucleotides. Reads using sequencing by binding can be generated using instruments such as Onso™ sequencing instruments from Pacific Biosciences of California, Inc. (Menlo Park, CA). Another implementation of sequencing by binding could be sequencing by avidity. In sequencing by avidity, fluorescent dye labeled cores termed avidites are used. One potential cycle of sequencing by avidity includes providing a reagent of polymerase and reversibly terminated nucleotides to templates immobilized on a solid surface, de-blocking the incorporated nucleotides, flowing a set of four types of avidites, washing away unbound avidites, detecting the incorporated bases/nucleotides, and removing the bound avidites. The steps in the cycle of sequencing by avidity may be performed in other orders. Sequencing by avidity is described in Arslan, S., Garcia, F. J., Guo, M. et al. Sequencing by avidity enables high accuracy with low reagent consumption. Nat Biotechnol 42, 132-138 (2024). doi.org/10.1038/s41587-023-01750-7, which is incorporated by reference in its entirety. Reads using sequencing by avidity can be generated using instruments such as Aviti™ sequencing instruments from Element Biosciences (San Diego).
[0175] One example of SBS using an open flow cell and without using reversible terminators is disclosed in Almogy, G. (2022) “Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform” doi.org/10.1101/2022.05.29.493900, which is incorporated by reference in its entirety. Sequence reads using an open flow cell can be generated using instruments such as the UG 100™ Sequencer from Ultima Genomics, Inc. (Fremont, CA).
[0176] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are described in U.S. Pat. Nos. 8,262,900 B2, 7,948,015 B2, 8,349,167 B2, and U.S. Pat. Pub. 2010/0137143 A1, each of which is incorporated by reference in its entirety.
[0177] Sequence reads can be generated using instruments such as DNBSEQ™ sequencing instruments from MGI Tech Co., Ltd. (Shenzhen, China) and SURFSeq™, FASTASeq™, and GenoLab™ sequencing instruments from GeneMind Biosciences Co., Ltd. (Shenzhen, China).
[0178] Some embodiments can use methods involving the real-time monitoring of DNA polymerase activity. For example, nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides, or with zero-mode waveguides. Techniques and reagents for FRET-based sequencing are described, for example, in Levene et al. Science 299, 682-686 (2003); Lundquist et al. Opt. Lett. 33, 1026-1028 (2008); Korlach et al. Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), each of which is incorporated by reference in its entirety. Techniques for sequencing using zero-mode waveguides are described in U.S. Pat. No. 6,917,726 B2, which is incorporated by reference in its entirety.
[0179] As used herein, the terms “aligned,” “alignment,” or “aligning” refer to the process of comparing a read or tag to a reference sequence and thereby determining the likelihood that the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. For example, the alignment of a read to the reference sequence for human chromosome 13 indicates the likelihood that the read is present in the reference sequence for chromosome 13. In some cases, an alignment additionally indicates a location where the read or tag maps to in the reference sequence. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13. A “site” may be a unique position on a polynucleotide sequence or a reference sequence (e.g., chromosome ID, chromosome position and orientation). In some embodiments, a site may provide a position for a residue, a sequence tag, or a segment on a sequence.
[0180] Aligned reads or tags are one or more sequences that are identified as a match in terms of the order of their nucleic acid molecules to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, as it would be impossible to align reads in a reasonable time period for implementing the methods disclosed herein. The matching of a sequence read in aligning can be a 100% sequence match or less than 100% (non-perfect match).
[0181] Alignment may be performed by modifications and/or combinations of methods such as Burrows-Wheeler Aligner (BWA), ISAAC, BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, CASHX, Cloudburst, CUDA-EC, CUSHAW, CUSHAW2, CUSHAW2-GPU, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP and GSNAP, Geneious Assembler, LAST, MAQ, mrFAST and mrsFAST, MOM, MOSAIK, MPscan, Novoalign & NovoalignCS, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PRIMEX, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RT Investigator, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3 and SOAP3-dp, SOCS, SSAHA and SSAHA2, Stampy, SToRM, Subread and Subjunc, Taipan, UGENE, VelociMapper, XpressAlign, and ZOOM.
[0182] The term “mapping” used herein refers to specifically assigning a sequence read to a larger sequence, e.g., a reference sequence, by alignment.
[0183] As used herein, the term “paired-end reads” or “paired end reads” refers to paired reads generated from sequencing the forward and reverse ends of a larger nucleic acid fragment. In some examples, the forward and reverse ends of a larger nucleic acid fragment may share the same name. The paired-end reads may be generated from paired end sequencing that obtains one read from each end of a nucleic acid fragment.
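As a minimal illustrative sketch of the pairing described above, the following Python code groups reads into paired-end pairs by a shared fragment name. The `Read` record, its fields, and the `pair_reads` function are hypothetical names introduced only for illustration and are not part of the disclosure; the sketch assumes each read carries the fragment name and a mate number (1 for the forward end, 2 for the reverse end).

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical read record used only for this sketch."""
    name: str      # fragment name, assumed to be shared by both mates
    mate: int      # 1 = read from the forward end, 2 = read from the reverse end
    sequence: str  # base calls as a string of A, C, G, T


def pair_reads(reads):
    """Group reads by fragment name and keep fragments that have both mates."""
    by_name = defaultdict(dict)
    for read in reads:
        by_name[read.name][read.mate] = read
    # Only fragments with a forward and a reverse read form a complete pair.
    return {
        name: (mates[1], mates[2])
        for name, mates in by_name.items()
        if 1 in mates and 2 in mates
    }
```

In this sketch, a read whose mate is missing is simply left out of the result; other handling (for example, treating it as a single-end read) would be equally plausible.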
[0184] Moreover, as used herein, the term “alignment score” refers to a numeric score, metric, or other quantitative measurement evaluating an accuracy of an alignment between one or more nucleotide reads or a fragment of a nucleotide read and another nucleotide sequence from a reference sequence. In particular, an alignment score includes a metric indicating a degree to which the nucleobases of one or more nucleotide reads (or a fragment thereof) match or are similar to a reference sequence or an alternate contiguous sequence from a reference sequence. In certain implementations, an alignment score takes the form of a Smith-Waterman score or a variation or version of a Smith-Waterman score for local alignment, such as various settings or configurations used by DRAGEN by Illumina, Inc. for Smith-Waterman scoring.
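For illustration only, a Smith-Waterman local alignment score of the kind referred to above can be sketched in Python as follows. The function name `smith_waterman_score` and the scoring parameters (match, mismatch, and a linear gap penalty) are assumptions chosen for this example; they are not the settings used by DRAGEN or by any particular embodiment.

```python
def smith_waterman_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between read and ref.

    Uses the textbook Smith-Waterman recurrence with a linear gap penalty.
    The default scoring values are illustrative assumptions.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            substitution = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                         # start a new local alignment
                prev[j - 1] + substitution,  # align read[i-1] with ref[j-1]
                prev[j] + gap,             # gap in the reference
                curr[j - 1] + gap,         # gap in the read
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the matrix are kept because the sketch reports the score alone; recovering the aligned positions would require retaining the full matrix or traceback pointers.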
[0185] As used herein, a “short sequence read” refers to a sequence read of between 50-500 bp, for example, about 50-100 bp, and includes paired end sequence reads.
[0186] As used herein, a “long sequence read” refers to a sequence read of more than about 500 bp, for example 500-250,000 bp or more. A long sequence read may be obtained from a long-read sequencing technology, or may be synthetically constructed by assembling multiple short sequence reads.
[0187] As used herein, a “file” includes electronic files. In some embodiments, a file is on a computer storage medium (such as a computer hard drive, for example a spinning magnetic disk drive or a solid state drive). In some embodiments, the electronic file is stored in the format of a BAM, FASTQ, SAM, CRAM, JSON, CIGAR, or VCF file.
[0188] The terms “solid support,” “solid surface,” and other grammatical equivalents herein refer to any substrate that is appropriate for or can be modified to be appropriate for the attachment of enzymes, nucleic acids, and complexes thereof. As will be appreciated by those in the art, the number of possible substrates is very large. Possible substrates include, but are not limited to, glass and modified or functionalized glass, polymers (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon™, etc.), polysaccharides, nylon or nitrocellulose, ceramics, resins, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses, plastics, optical fiber bundles, quartz, metal oxides, inorganic oxides, other suitable transparent materials, other suitable non-transparent materials, other suitable translucent materials, and combinations thereof. The composition and geometry of the solid support can vary with its use.
[0189] In some embodiments, the solid support or solid surface is a planar structure, such as a flowcell, slide, chip, microchip, array, microarray, wafer, panel, charge pad, and/or web. The planar structure can be a single surface structure having a single surface of sample/reaction sites. The planar structure can be a dual surface structure. One example of a dual surface structure includes a top substrate having a top surface of sample/reaction sites, a bottom substrate having a bottom surface of sample/reaction sites, and a spacer layer separating the top substrate and the bottom substrate. The solid support or solid surface can be open to direct application of a fluid. One example of an open solid support or open solid surface is an open flow cell having a single surface structure without an inlet port. In some embodiments, the solid support is not necessarily planar, such as, for example, the surface of a well, tube, or other vessel. Nonlimiting examples include the surface of a microcentrifuge tube, a well of a multiwell plate, and the like.
[0190] In some embodiments, the solid support comprises one or more surfaces of a flowcell or flow cell. The term “flowcell” or “flow cell” as used herein refers to a solid surface across which one or more fluid reagents can be flowed. Examples of flowcells and related fluidic systems and detection platforms that can be readily used in the methods of the present disclosure are described, for example, in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; U.S. 7,057,026 B2; WO 91/06678; WO 07/123744; U.S. 7,329,492 B2; U.S. 7,211,414 B2; U.S. 7,315,019 B2; U.S. 7,405,281 B2; and U.S. Pat. Pub. 2008/0108082 A1, each of which is incorporated herein by reference in its entirety. In some embodiments, the flowcells can include one or more flow lanes. For flow cells having a plurality of flow lanes, each of the flow lanes can be independently accessed or two or more flow lanes can be accessed as a group.
[0191] In some embodiments, the solid support or solid surface is a non-planar structure, such as beads, microspheres, and/or an inner and/or outer surface of a tube or vessel. The terms “beads,” “microspheres,” or “particles” or grammatical equivalents herein refer to small discrete particles. Suitable bead compositions include, but are not limited to, plastics, ceramics, glass, polystyrene, methylstyrene, acrylic polymers, acrylamide, paramagnetic materials, thoria sol, carbon graphite, titanium dioxide, latex, polysaccharide (e.g., Dextran™, Sepharose™, cellulose, agarose), nylon, cross-linked micelles, and Teflon™, as well as any other materials outlined herein for solid supports. “Microsphere Detection Guide” from Bangs Laboratories, Fishers, Ind. is a helpful guide. In certain embodiments, the microspheres are magnetic microspheres or beads. The beads need not be spherical; irregular particles may be used. Alternatively or additionally, the beads may be porous. The bead sizes range from nanometers, e.g., 100 nm, to millimeters, e.g., 1 mm, with beads from about 0.2 micron to about 200 microns being preferred, and from about 0.5 to about 5 micron being particularly preferred, although in some embodiments smaller or larger beads may be used.
[0192] In some embodiments, the solid support comprises a patterned surface suitable for immobilization of molecules, such as enzymes, nucleic acids, and complexes thereof, in an ordered pattern. A “patterned surface” refers to an arrangement of different regions in or on an exposed layer of a solid support. The features can be separated by interstitial regions that contribute to the pattern. In some embodiments, the interstitial regions can be a different height, creating wells or raised platform patterns. In other embodiments, the interstitial regions can have different surface charges or surface energies. In yet other embodiments, the interstitial regions can have different attachment moieties. In some embodiments, the pattern can be any suitable pattern, such as grid patterns, radial patterns, and combinations thereof. In some embodiments, a patterned surface can contain predetermined locations of features but the features are not arrayed in a repetitive pattern. Examples of grid patterns include rectangular patterns, hexagonal patterns, triangular patterns, and other suitable grid patterns. The regions for immobilization of molecules may be depressed regions, elevated regions, or planar regions relative to the interstitial regions. The regions may be fabricated as is generally known in the art using a variety of techniques, including, but not limited to, photolithography, stamping techniques, molding techniques, microetching techniques, and combinations thereof. As will be appreciated by those in the art, the technique used will depend on the composition and shape of the regions. For example, the regions for immobilization of molecules of a patterned surface may be wells, pits, channels, posts, pillars, ridges, stripes, swirls, lines, and other suitable topographies. For example, the wells may have any opening in any shape, such as circular, oval, polygonal (e.g., hexagonal, octagonal, square, rectangular, elliptical, etc.). Exemplary patterned surfaces that can be used in the methods and compositions set forth herein are described in U.S. Pat. No. 8,778,849 B2, which is incorporated herein by reference in its entirety.
[0193] In some embodiments, the solid support comprises a surface suitable for immobilization of molecules, such as enzymes, nucleic acids, and complexes thereof, in a random distribution over the solid support. Exemplary random distribution over a solid support is described in U. S. Pat. No. 8,241,573 B2, which is incorporated herein by reference in its entirety.
[0194] As used herein, the term "flow cell" is intended to mean a chamber having a surface across which one or more fluid reagents can be flowed. Generally, a flow cell will have an ingress opening and an egress opening to facilitate flow of fluid. A flow cell can have multiple surfaces. Examples of flow cells and related fluidic systems and detection platforms that can be readily used in the methods of the present disclosure are described, for example, in Bentley et al., Nature 456:53-59 (2008); WO 04/018497; US 7,057,026; WO 91/06678; WO 07/123744; US 7,329,492; US 7,211,414; US 7,315,019; US 7,405,281; and US 2008/0108082, each of which is incorporated herein by reference.
[0195] In many embodiments, a solid support to which nucleic acids are attached in a method set forth herein will have a continuous or monolithic surface. Thus, fragments can attach at spatially random locations wherein the distance between nearest neighbor fragments (or nearest neighbor clusters derived from the fragments) will be variable. The resulting arrays will have a variable or random spatial pattern of features. Alternatively, a solid support used in a method set forth herein can include an array of features that are present in a repeating pattern. In such embodiments, the features provide the locations to which modified nucleic acid polymers, or fragments thereof, can attach. Particularly useful repeating patterns are hexagonal patterns, rectilinear patterns, grid patterns, patterns having reflective symmetry, patterns having rotational symmetry, or the like. The features to which a modified nucleic acid polymer, or fragment thereof, attach can each have an area that is smaller than about 1 mm², 500 μm², 100 μm², 25 μm², 10 μm², 5 μm², 1 μm², 500 nm², or 100 nm². Alternatively, or additionally, each feature can have an area that is larger than about 100 nm², 250 nm², 500 nm², 1 μm², 2.5 μm², 5 μm², 10 μm², 100 μm², or 500 μm². A cluster or colony of nucleic acids that result from amplification of fragments on an array (whether patterned or spatially random) can similarly have an area that is in a range above or between an upper and lower limit selected from those exemplified above.
[0196] As used herein, the term "surface," when used in reference to a material, is intended to mean an external part or external layer of the material. The surface can be in contact with another material such as a gas, liquid, gel, polymer, organic polymer, second surface of a similar or different material, metal, or coat. The surface, or regions thereof, can be substantially flat. The surface can have surface features such as wells, pits, channels, ridges, raised regions, pegs, posts or the like. The material can be, for example, a solid support, gel, or the like.
[0197] As used herein, the term "target," when used in reference to a nucleic acid polymer, is intended to linguistically distinguish the nucleic acid, for example, from other nucleic acids, modified forms of the nucleic acid, fragments of the nucleic acid, and the like. Any of a variety of nucleic acids set forth herein can be identified as target nucleic acids, examples of which include genomic DNA (gDNA), messenger RNA (mRNA), copy or complementary DNA (cDNA), and derivatives or analogs of these nucleic acids.
[0198] As used herein, the term "transposase" is intended to mean an enzyme that is capable of forming a functional complex with a transposon element-containing composition (e.g., transposons, transposon ends, transposon end compositions) and catalyzing insertion or transposition of the transposon element-containing composition into a target DNA with which it is incubated, for example, in an in vitro transposition reaction. The term can also include integrases from retrotransposons and retroviruses. Transposases, transposomes and transposome complexes are generally known to those of skill in the art, as exemplified by the disclosure of U.S. Pat. App. Pub. 2010/0120098, which is incorporated herein by reference in its entirety. Although many embodiments described herein refer to Tn5 transposase and/or hyperactive Tn5 transposase, it will be appreciated that any transposition system that is capable of inserting a transposon element with sufficient efficiency to tag a target nucleic acid can be used. In particular embodiments, a preferred transposition system is capable of inserting the transposon element in a random or in an almost random manner to tag the target nucleic acid. As used herein, the term "transposome" is intended to mean a transposase enzyme bound to a nucleic acid. Typically the nucleic acid is double stranded. For example, the complex can be the product of incubating a transposase enzyme with double-stranded transposon DNA under conditions that support non-covalent complex formation. Transposon DNA can include, without limitation, Tn5 DNA, a portion of Tn5 DNA, a fusion of Tn5 or a portion of Tn5 with one or more auxiliary proteins, a transposon element composition, a mixture of transposon element compositions or other nucleic acids capable of interacting with a transposase such as the hyperactive Tn5 transposase.
[0199] As used herein, the term "transposon element" is intended to mean a nucleic acid molecule, or portion thereof, that includes the nucleotide sequences that form a transposome with a transposase or integrase enzyme. Typically, the nucleic acid molecule is a double stranded DNA molecule. In some embodiments, a transposon element is capable of forming a functional complex with the transposase in a transposition reaction. As non-limiting examples, transposon elements can include the 19-bp outer end ("OE") transposon end, inner end ("IE") transposon end, or "mosaic end" ("ME") transposon end recognized by a wild-type or mutant Tn5 transposase, or the R1 and R2 transposon end as set forth in the disclosure of US Pat. App. Pub. No. 2010/0120098, which is incorporated herein by reference. Transposon elements can comprise any nucleic acid or nucleic acid analogue suitable for forming a functional complex with the transposase or integrase enzyme in an in vitro transposition reaction. For example, the transposon end can comprise DNA, RNA, modified bases, non-natural bases, modified backbone, and can comprise nicks in one or both strands.
[0200] A standard NGS sequencing run yields millions of short sequences that are eventually mapped on a reference sequence. A percentage of good-quality reads (1-5%) are discarded because of ambiguous genomic location. Increasing read length (2x500 or long-read sequencing), designing a specialized algorithm to map reads on specific regions of the genome (targeted callers), using specialized library preparation (e.g., Illumina’s ICLR), or a combination thereof may be implemented to address the need for disambiguating such reads that would normally be discarded. However, such approaches can be costly, laborious, and time intensive. Spatial information (e.g., X and Y coordinates) obtained from a solid support surface can be leveraged to identify fragments that are generated from a single long input fragment and subsequently be used to improve mapping of reads in ambiguous positions.
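As one non-limiting sketch of how X and Y coordinates might be used, the following Python code lists pairs of clusters whose positions on the solid support lie within a chosen radius of one another. The function name `nearby_pairs`, the cluster identifiers, and the radius threshold are hypothetical and chosen only for illustration; the exhaustive pairwise comparison is an assumption made for brevity, and a practical implementation would likely use a spatial index.

```python
import math
from itertools import combinations


def nearby_pairs(clusters, radius):
    """Return pairs of cluster IDs whose (x, y) positions are within radius.

    clusters: dict mapping a cluster ID to an (x, y) coordinate tuple.
    radius:   distance threshold, in the same units as the coordinates.
    """
    pairs = []
    # Compare every pair once; sorting makes the output order deterministic.
    for (id_a, (ax, ay)), (id_b, (bx, by)) in combinations(sorted(clusters.items()), 2):
        if math.hypot(ax - bx, ay - by) <= radius:
            pairs.append((id_a, id_b))
    return pairs
```

Pairs returned by such a search are only candidates: clusters from unrelated fragments can also land close together, so proximity alone does not establish a common origin.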
[0201] In one or more embodiments, the system identifies and/or stores sequencing metrics within one or more sequencing data files. As used herein, the term “sequencing data file” refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, and so forth.
[0202] Moreover, in one or more embodiments, one or more sequencing data files in which the system identifies or stores sequencing metrics include an alignment data file containing information from a read processing and mapping procedure. As used herein, the term “alignment data file” refers to a digital file that indicates mapping and alignment information for nucleotide reads of a sample nucleotide sequence. For example, an alignment data file can include a binary alignment map (BAM) file, a compressed reference-oriented alignment map (CRAM) file, or another file indicating nucleotide reads of a sample nucleotide sequence.
[0203] Moreover, as used herein, the term “cluster of oligonucleotides” (or “cluster” or “oligonucleotide cluster” or “colony”) refers to a localized group or collection of DNA or RNA on a nucleotide-sample support, such as a flow cell, particle, polymer scaffold, or other solid surface. In particular, a cluster includes tens, hundreds, thousands, or more copies of a cloned or the same DNA or RNA segment. For example, in one or more embodiments, a cluster includes a grouping of oligonucleotides immobilized in a section of a flow cell or other nucleotide-sample slide. In some embodiments, the cluster can comprise one or more concatemers, such as, for example, a polony or a nanoball. In some embodiments, clusters are evenly spaced or organized in a systematic structure within a patterned flow cell. By contrast, in some cases, clusters are randomly organized within a non-patterned flow cell. In typical embodiments, a cluster is the product of an amplification reaction. A cluster of oligonucleotides can be imaged utilizing one or more light signals, changes in pH, changes in conductance, and other signals. For instance, an oligonucleotide-cluster image may be captured by a camera during a sequencing cycle of light emitted by irradiated fluorescent labeled nucleotides incorporated into oligonucleotides, fluorescent labeled nucleotides bound but not incorporated into oligonucleotides, and other fluorescent labeled complexes associated with incorporated or bound nucleotides from one or more clusters on a flow cell. Examples of other sequencing procedures are set forth herein. In some embodiments, a cluster can be monoclonal or polyclonal.
[0204] The terms “immobilized,” “affixed,” and “attached” are used interchangeably herein, and each term is intended to encompass direct or indirect, covalent or non-covalent attachment unless indicated otherwise, either explicitly or by context.
[0205] Exemplary covalent attachments include, for example, those that result from the use of click chemistry techniques. Exemplary non-covalent attachments include, but are not limited to, non-specific interactions (e.g., hydrogen bonding, ionic bonding, van der Waals interactions, etc.) or specific interactions (e.g., affinity interactions, receptor-ligand interactions, antibody-epitope interactions, avidin-biotin interactions, streptavidin-biotin interactions, lectin-carbohydrate interactions, etc.). Exemplary attachments are set forth in U.S. Pat. Nos. 6,737,236 B1; 7,259,258 B2; 7,375,234 B2; and 7,427,678 B2; and U.S. Pat. Pub. 2011/0059865 A1, each of which is incorporated herein by reference in its entirety.
[0206] In certain embodiments, the molecules (e.g., nucleic acids, enzymes) remain immobilized or attached to the solid support under the conditions in which it is intended to use the solid support, for example in applications requiring nucleic acid amplification and/or sequencing. In other embodiments, the molecules are reversibly immobilized and can be removed from the solid support through the use of cleavable sites, linkers, and the like.
[0207] Some embodiments further comprise amplifying and/or replicating one or more nucleic acid templates, including fragments thereof. The amplifying and/or replicating comprises use of one or more of a bridge amplification reaction, an isothermal bridge amplification reaction, a rolling circle amplification (RCA) reaction, a modified rolling circle multiple displacement amplification, a helicase-dependent amplification reaction, a recombinase-dependent amplification reaction, a single-stranded DNA binding (SSB) protein mediated isothermal amplification, a PCR reaction, a strand-displacement reaction, a ligase chain reaction, a transcription-mediated reaction, a loop-mediated amplification reaction, other suitable reactions, and combinations thereof. Amplification can occur on the sequencing instrument or separately from the sequencing instrument.
[0208] Some embodiments further comprise rolling circle amplification/replication used to form polonies. The term “polony” or “polonies” used herein refers to a nucleic acid library molecule clonally amplified in-solution or on-support to generate an amplicon that can serve as a template molecule for sequencing. In some aspects, a linear library molecule can be circularized to generate a circularized library molecule, and the circularized library molecule can be clonally amplified in-solution or on-support to generate a concatemer. In some aspects, the concatemer can serve as a nucleic acid template molecule which can be sequenced. The concatemer is sometimes referred to as a polony. In some aspects, a polony includes nucleotide strands.
[0209] Some embodiments further comprise rolling circle amplification/replication used to form nucleic acid nanoballs. A “nucleic acid nanoball” may be a concatemer comprising multiple copies of a target nucleic acid molecule. These nucleic acid copies may be arranged one after another in a continuous linear strand of nucleotides. These nucleic acid copies may result in a nanoball folding configuration. The multiple copies of a target nucleic acid molecule in a nucleic acid nanoball may each contain an adaptor sequence of known sequence to facilitate amplification or sequencing. The adaptor sequence of each target nucleic acid molecule may be the same or different. The nucleic acid nanoball can be loaded on the surface of a solid support. The nanoball can be attached to the surface of the solid support by any suitable method. Non-limiting examples of such methods include nucleic acid hybridization, biotin-streptavidin binding, thiol binding, photoactive binding, covalent binding, antibody-antigen binding, physical constraints via hydrogels or other porous polymers, etc., or combinations thereof. In some cases, the nanoball can be digested with an enzyme (nuclease, etc.) to produce a smaller nanoball or a fragment from the nanoball.
[0210] Embodiments of the present disclosure relate to methods and systems which use “links” or “linkage information” between sequence reads. The “link,” “linking information,” or “linkage information” as discussed herein refers, in some embodiments, to the probability that two pairs of reads on a sequencing flow cell are derived from the same original nucleic acid molecule. In some next generation sequencing (NGS) systems, fragments of long nucleic acids, such as genomic DNA, from a sample are sheared to create shorter fragments which can be sequenced in a single read. The shearing process can create these shorter fragments which land on the flow cell and the flow cell proximity of each fragment may be related to the original nucleic acid molecule from which the fragment was derived. For example, fragments which come from the same nucleic acid molecule have been found to bind closer together on the flow cell as compared to fragments which come from different original nucleic acid molecules. Accordingly, if two clusters of reads on a flow cell are in close proximity and also close together on the genome, the clusters are more likely to have come from the same nucleic acid molecule. However, unrelated fragments may also bind to the flow cell near one another, which leads to an uncertainty in the probability that adjacent clusters originate from the same molecule. A number of factors could affect the probability that unrelated clusters would land in a similar area, and these factors may change based on a variety of experimental conditions. Embodiments of the disclosure provide a statistical method for calculating the probability that two reads are linked, such that on a flow cell the two reads were derived from the same nucleic acid molecule.
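Purely as a hedged illustration of how such a probability could be computed, the following Python sketch applies Bayes' rule to the flow cell distance between two clusters. The model is an assumption made for this example and is not the statistical method of the disclosure: clusters from the same molecule are assumed to be displaced from one another by an isotropic two-dimensional Gaussian with standard deviation `sigma`, unrelated clusters are assumed to be uniformly distributed over a region of area `tile_area`, and `prior_linked` is an assumed prior probability that a pair is linked. The function name and all parameters are hypothetical, and genomic distance between the reads is not modeled.

```python
import math


def link_posterior(distance, prior_linked, sigma, tile_area):
    """Posterior probability that two clusters share an original molecule.

    Assumed model (illustrative only):
      linked   -> 2D isotropic Gaussian displacement with std. dev. sigma
      unlinked -> uniform placement over a region of area tile_area
    """
    # Density (per unit area) of seeing this displacement if the pair is linked.
    linked_density = math.exp(-distance**2 / (2 * sigma**2)) / (2 * math.pi * sigma**2)
    # Density (per unit area) if the second cluster landed uniformly at random.
    unlinked_density = 1.0 / tile_area
    numerator = prior_linked * linked_density
    denominator = numerator + (1.0 - prior_linked) * unlinked_density
    return numerator / denominator
```

Under these assumptions the posterior falls as the distance between clusters grows, and it never reaches certainty because unrelated clusters can also land nearby; the values of `sigma`, `tile_area`, and `prior_linked` would need to be estimated for the experimental conditions at hand.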
[0211] Embodiments of the disclosure relate to systems and methods for sequencing target nucleic acids by fragmenting the target nucleic acid and distributing the fragments onto a flow cell. As the fragments are distributed along the flow cell, they bind capture primers and are then used to create clusters by well-known technologies, such as those provided by Illumina, Inc. (San Diego, CA). As described above, according to the methods of this disclosure, fragments which were derived from the same template genomic sequence are more likely to bind to the flow cell in proximity to one another as compared to fragments that are from different template genomic sequences, particularly when the fragmentation is performed directly on the flow cell using immobilized transposome complexes on the surface of the flow cell. In some library preps with fragmentation happening prior to loading, fragments can land anywhere in the flow cell independently of whether they came from the same molecule. However, in some embodiments when fragmentation is performed directly on the flow cell, flow cell proximity information is retained. This flow cell proximity information can be used to help guide assembly and variant calling of the original template genomic sequence, as will be described in more detail below.
[0212] For example, transposome complexes may be provided as part of the sequencing process. In some embodiments, the transposome complexes include a transposase and a first polynucleotide having end sequences which can be used to fragment the target polynucleotides and insert into each fragment an end sequence or tag which can be used to bind to capture probes located on the substrate. The method can include contacting the transposome complexes with the target polynucleotides under conditions to fragment the target polynucleotides and add capture sequences to the ends of each fragment. In some embodiments, the capture sequences include P5 or P7 sequences as provided by Illumina, Inc. In some embodiments, the complexed strand and transposome is in solution, and is then brought towards a substrate and immobilized thereon. In some embodiments, prior to immobilization of the transposome complexes on the substrate, one or more of the transposome complexes bind the target polynucleotides in solution. In this embodiment, the transposome complexes in solution become immobilized to the substrate.
[0213] Once the fragments have been bound to the substrate, the bound fragments can be amplified to form a plurality of nucleic acid clusters on the substrate. The location of each cluster on the flow cell can then be determined before, during, or after performing sequencing by synthesis (SBS) reactions to obtain the nucleotide sequence of each fragment located in each cluster. Once the nucleotide sequence of each cluster has been determined, the method can start to map those reads to determine the original target polynucleotide from which the read originated. In some embodiments, the mapping process takes into account the flow cell proximity of each cluster, such that clusters which are closer to each other on the flow cell are more likely to have originated from the same target polynucleotide. In some embodiments, the library preparation steps are performed on the flow cell, which may reduce the complexity and the amount of equipment associated with the systems. Furthermore, by mapping the sequenced fragments to target polynucleotides using the flow cell proximity information accompanying each cluster, the method performs more accurate mapping operations as compared to methods that do not take the flow cell proximity of each cluster into account during the mapping process. Therefore, flow cell proximity that includes relative distances between various clusters on a flow cell may be leveraged to adjust mapping information, thereby increasing the mapping quality of previously identified multi-mapped reads. In the past, identified multi-mapped reads may have been discarded. Increasing the mapping quality of these previously discarded reads, by improving the confidence of a read pair’s alignment based on linking information with a high link quality score, may improve the alignment information and quality of information used in certain genomic analysis applications including, but not limited to, variant calling.
Processing DNA samples suitable for high-throughput sequencing that retain information on the original configuration of the DNA samples provides useful information on co-located fragments.
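One way to picture how linking information could raise the mapping quality of a multi-mapped read is the sketch below. It is a hypothetical illustration only, not the method of this disclosure: the function name `rescue_multimapped`, the `(chrom, pos)` tuple layout, and the `max_genomic_bp` window are all assumptions. Each candidate placement is scored by how many linked, uniquely mapped reads lie nearby on the reference, and the read is placed only when one candidate is clearly best.

```python
def rescue_multimapped(candidates, linked_positions, max_genomic_bp=100_000):
    """Choose among candidate placements of one multi-mapped read.

    candidates:       list of (chrom, pos) placements reported by an aligner
    linked_positions: list of (chrom, pos) for uniquely mapped reads whose
                      clusters were judged linked to this read
    Returns the supported candidate, or None if support is absent or tied.
    (All names and the 100 kb default window are illustrative assumptions.)
    """

    def support(candidate):
        chrom, pos = candidate
        # Count linked reads on the same chromosome within the window.
        return sum(
            1
            for linked_chrom, linked_pos in linked_positions
            if linked_chrom == chrom and abs(linked_pos - pos) <= max_genomic_bp
        )

    scored = sorted(((support(c), c) for c in candidates), reverse=True)
    if not scored or scored[0][0] == 0:
        return None  # no linked read supports any placement
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: leave the read multi-mapped
    return scored[0][1]


placement = rescue_multimapped(
    candidates=[("chr1", 5_000), ("chr7", 120_000)],
    linked_positions=[("chr1", 9_000), ("chr1", 30_000), ("chr3", 100)],
)
```

Returning `None` on ties keeps the sketch conservative: a read is only rescued when the linked reads favor a single placement.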
[0214] In some embodiments, linking information is determined by analyzing, for example, statistically analyzing with a model, the genomic distance between two reads and flow cell proximity between the two clusters on the flow cell. In some embodiments, the methods and systems determine whether the genomic distance and / or flow cell proximity is below a threshold. In some embodiments, the methods and systems determine the presence or absence of a link between the two sequence reads. In some embodiments, the methods and systems determine a linking quality score between the two sequence reads. In some embodiments, the methods and systems analyze genomic distance and flow cell proximity for a plurality of pairs of two sequence reads (for example, each possible pair of two sequence reads) in a dataset. Further details regarding sequencing conditions that result in links or downstream analyses utilizing linking information can be found in International Patent Application Nos. PCT / US2024 / 035447 and PCT / US2024 / 045996, International Patent Application Publication Nos. WO2015 / 189636, WO2015 / 095226 and WO2023 / 122755, and U. S. Provisional Patent Application Nos. 63 / 600460, 63 / 614066, 63 / 800,049, 63 / 715,462, and 63 / 800,262, the disclosure of each of which is incorporated herein by reference in its entirety.
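The threshold-based determination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Read` record, the `find_links` function, and both default thresholds are hypothetical, and each read is assumed to carry a single reference position and a single cluster coordinate.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Read:
    name: str
    chrom: str
    pos: int       # leftmost aligned reference position (bp)
    x_um: float    # cluster x coordinate on the flow cell
    y_um: float    # cluster y coordinate on the flow cell


def find_links(reads, max_genomic_bp=100_000, max_flowcell_um=100.0):
    """Report read pairs whose genomic distance AND flow cell distance are
    both below the (assumed) thresholds."""
    links = []
    for a, b in combinations(reads, 2):  # every possible pair of reads
        if a.chrom != b.chrom:
            continue  # genomic distance is undefined across chromosomes
        genomic_bp = abs(a.pos - b.pos)
        flowcell_um = math.hypot(a.x_um - b.x_um, a.y_um - b.y_um)
        if genomic_bp < max_genomic_bp and flowcell_um < max_flowcell_um:
            links.append((a.name, b.name, genomic_bp, flowcell_um))
    return links


reads = [
    Read("r1", "chr1", 1_000, 10.0, 10.0),
    Read("r2", "chr1", 6_000, 13.0, 14.0),
    Read("r3", "chr1", 7_000, 900.0, 900.0),
    Read("r4", "chr2", 1_200, 11.0, 11.0),
]
links = find_links(reads)
```

Enumerating every pair is quadratic in the number of reads, so a real dataset would likely need a spatial index or tile-based binning to restrict comparisons to nearby clusters.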
[0215] In some embodiments, sequence reads comprise a barcode. As used herein, “barcode” refers to a short, unique nucleic acid sequence used to tag or label different nucleic acid samples or different nucleic acid molecules. In some embodiments, long DNA template molecules are fragmented, and a barcode sequence is added to each fragment, with a unique barcode sequence for each long DNA template. The fragments are sequenced to produce short sequence reads. In some embodiments, the sequence reads are analyzed, and the barcodes are used to identify which sequence reads are part of the same long DNA template molecule. Thus, barcodes can help retain connectivity information for short sequence reads. Such embodiments can be considered an alternative form of preserving genomic connectivity information and can be used in the methods described herein.
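The barcode-based grouping can be illustrated with a short sketch. It assumes, hypothetically, that the barcode occupies a fixed-length prefix of each read and that barcodes are read without sequencing errors; the function name and the 8-base barcode length are illustrative choices, not details from this disclosure.

```python
from collections import defaultdict


def group_reads_by_barcode(reads, barcode_length=8):
    """Group (name, sequence) reads by an assumed fixed-length barcode prefix.

    Returns {barcode: [(name, insert_sequence), ...]}, where the insert is
    the read with the barcode removed.
    """
    groups = defaultdict(list)
    for name, sequence in reads:
        if len(sequence) <= barcode_length:
            continue  # too short to contain a barcode plus an insert
        barcode = sequence[:barcode_length]
        insert = sequence[barcode_length:]
        groups[barcode].append((name, insert))
    return dict(groups)


reads = [
    ("read1", "AAAACCCC" + "GATTACA"),
    ("read2", "GGGGTTTT" + "CCGGAAT"),
    ("read3", "AAAACCCC" + "TTAGGCA"),
]
by_template = group_reads_by_barcode(reads)
```

Reads sharing a barcode are treated as fragments of the same long template molecule. Practical pipelines typically also tolerate a small number of barcode mismatches, which this sketch omits.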
[0216] In any of the embodiments summarized herein, the analytes are obtained from a population of cells, a single cell, a population of cell nuclei, or a cell nucleus. In any of the embodiments summarized herein, the analytes are analyzed using various analyses, depending on what the analyte is. For example, analysis may include DNA analysis, RNA analysis, protein analysis, tagmentation, nucleic acid amplification, nucleic acid sequencing, nucleic acid library preparation, assay for transposase accessible chromatin using sequencing (ATAC-seq), contiguity-preserving transposition (CPT-seq), single cell combinatorial indexed sequencing (SCI-seq), or single cell genome amplification, or any combination thereof.
[0217] In some embodiments, a sample includes a single cell, and the single cell is fixed. In some embodiments, the cells can be fixed with a fixative. As used herein, a fixative generally refers to an agent that can fix cells. For example, fixing cells can stabilize protein complexes, nucleic acid complexes, or protein-nucleic acid complexes in the cell. Suitable fixatives and cross-linkers can include alcohol or aldehyde based fixatives, formaldehyde, glutaraldehyde, ethanol-based fixatives, methanol-based fixatives, acetone, acetic acid, osmium tetroxide, potassium dichromate, chromic acid, potassium permanganate, mercurials, picrates, formalin, paraformaldehyde, amine-reactive NHS-ester crosslinkers such as bis[sulfosuccinimidyl] suberate (BS3), 3,3'-dithiobis[sulfosuccinimidylpropionate] (DTSSP), ethylene glycol bis[sulfosuccinimidylsuccinate] (sulfo-EGS), disuccinimidyl glutarate (DSG), dithiobis[succinimidyl propionate] (DSP), disuccinimidyl suberate (DSS), ethylene glycol bis[succinimidylsuccinate] (EGS), and NHS-ester / diazirine crosslinkers such as NHS-diazirine, NHS-LC-diazirine, NHS-SS-diazirine, sulfo-NHS-diazirine, sulfo-NHS-LC-diazirine, and sulfo-NHS-SS-diazirine. In some embodiments, fixing a cell preserves the internal state of the cell, thereby preventing modification of the cell during subsequent analysis or during performance of an assay.
[0218] In some embodiments, the sample includes a nucleic acid source, such as a single cell, a single nucleus, or a population of cells or population of nuclei, and the single cell, single nucleus, population of cells, or population of nuclei is encapsulated within a droplet. In some embodiments, the cell is fixed prior to encapsulation. As used herein, a droplet may include a hydrogel bead, which is a bead for encapsulating a single cell, and composed of a hydrogel composition. In some embodiments, the droplet is a homogeneous droplet of hydrogel material or is a hollow droplet having a polymer hydrogel shell. Whether homogeneous or hollow, a droplet may be capable of encapsulating a single cell. As used herein, the term “hydrogel” refers to a substance formed when an organic polymer (natural or synthetic) is cross-linked via covalent, ionic, or hydrogen bonds to create a three-dimensional open-lattice structure that entraps water molecules to form a gel. In some embodiments, the hydrogel may be a biocompatible hydrogel. As used herein, the term “biocompatible hydrogel” refers to a polymer that forms a gel that is not toxic to living cells and allows sufficient diffusion of oxygen and nutrients to entrapped cells to maintain viability.
In some embodiments, the hydrogel material includes alginate, acrylamide, or polyethylene glycol (PEG), PEG-acrylate, PEG-amine, PEG-carboxylate, PEG-dithiol, PEG-epoxide, PEG-isocyanate, PEG-maleimide, polyacrylic acid (PAA), poly(methyl methacrylate) (PMMA), polystyrene (PS), polystyrene sulfonate (PSS), polyvinylpyrrolidone (PVPON), N, N’-bis(acryloyl)cystamine, polypropylene oxide (PPO), poly(hydroxyethyl methacrylate) (PHEMA), poly(N-isopropylacrylamide) (PNIPAAm), poly(lactic acid) (PLA), poly(lactic-co-glycolic acid) (PLGA), polycaprolactone (PCL), poly(vinylsulfonic acid) (PVSA), poly(L-aspartic acid), poly(L-glutamic acid), polylysine, agar, agarose, heparin, alginate sulfate, dextran sulfate, hyaluronan, pectin, carrageenan, gelatin, chitosan, cellulose, collagen, bisacrylamide, diacrylate, diallylamine, triallylamine, divinyl sulfone, diethyleneglycol diallyl ether, ethyleneglycol diacrylate, polymethyleneglycol diacrylate, polyethyleneglycol diacrylate, trimethylolpropane trimethacrylate, ethoxylated trimethylol triacrylate, or ethoxylated pentaerythritol tetracrylate, or combinations or mixtures thereof. In some embodiments, the hydrogel is an alginate, acrylamide, or PEG based material. In some embodiments, the hydrogel is a PEG based material with acrylate-dithiol, epoxide-amine reaction chemistries. In some embodiments, the hydrogel forms a polymer shell that includes PEG-maleimide / dithiol oil, PEG-epoxide / amine oil, PEG-epoxide / PEG-amine, or PEG-dithiol / PEG-acrylate. In some embodiments, the hydrogel material is selected in order to avoid generation of free radicals that have the potential to damage intracellular biomolecules. In some embodiments, the hydrogel polymer includes 60-90% fluid, such as water, and 10-30% polymer. In certain embodiments, the water content of hydrogel is about 70-80%.
As used herein, the term “about” or “approximately”, when modifying a numerical value, refers to variations that can occur in the numerical value. For example, variations can occur through differences in the manufacture of a particular substrate or component. In one embodiment, the term “about” means within 1%, 5%, or up to 10% of the recited numerical value.
[0219] Hydrogels may be prepared by cross-linking hydrophilic biopolymers or synthetic polymers. Thus, in some embodiments, the hydrogel may include a crosslinker. As used herein, the term “crosslinker” refers to a molecule that can form a three-dimensional network when reacted with the appropriate base monomers. Examples of the hydrogel polymers, which may include one or more crosslinkers, include but are not limited to, hyaluronans, chitosans, agar, heparin, sulfate, cellulose, alginates (including alginate sulfate), collagen, dextrans (including dextran sulfate), pectin, carrageenan, polylysine, gelatins (including gelatin type A), agarose, (meth)acrylate-oligolactide-PEO-oligolactide-(meth)acrylate, PEO-PPO-PEO copolymers (Pluronics), poly(phosphazene), poly(methacrylates), poly(N-vinylpyrrolidone), PL(G)A-PEO-PL(G)A copolymers, poly(ethylene imine), polyethylene glycol (PEG)-thiol, PEG-acrylate, acrylamide, N, N’-bis(acryloyl)cystamine, PEG, polypropylene oxide (PPO), polyacrylic acid, poly(hydroxyethyl methacrylate) (PHEMA), poly(methyl methacrylate) (PMMA), poly(N-isopropylacrylamide) (PNIPAAm), poly(lactic acid) (PLA), poly(lactic-co-glycolic acid) (PLGA), polycaprolactone (PCL), poly(vinylsulfonic acid) (PVSA), poly(L-aspartic acid), poly(L-glutamic acid), bisacrylamide, diacrylate, diallylamine, triallylamine, divinyl sulfone, diethyleneglycol diallyl ether, ethyleneglycol diacrylate, polymethyleneglycol diacrylate, polyethyleneglycol diacrylate, trimethylolpropane trimethacrylate, ethoxylated trimethylol triacrylate, or ethoxylated pentaerythritol tetracrylate, or combinations thereof. Thus, for example, a combination may include a polymer and a crosslinker, for example polyethylene glycol (PEG)-thiol / PEG-acrylate, acrylamide / N, N’-bis(acryloyl)cystamine (BACy), or PEG / polypropylene oxide (PPO). In some embodiments, the polymer shell includes a four-arm polyethylene glycol (PEG).
In some embodiments, the four-arm polyethylene glycol (PEG) is selected from the group consisting of PEG-acrylate, PEG-amine, PEG-carboxylate, PEG-dithiol, PEG-epoxide, PEG-isocyanate, and PEG-maleimide.
[0220] In some embodiments, the crosslinker is an instantaneous crosslinker or a slow crosslinker. An instantaneous crosslinker is a crosslinker that instantly crosslinks the hydrogel polymer, and is referred to herein as click chemistry. Instantaneous crosslinkers may include dithiol oil + PEG-maleimide or PEG-epoxide + amine oil. A slow crosslinker is a crosslinker that slowly crosslinks the hydrogel polymer, and may include PEG-epoxide + PEG-amine or PEG-dithiol + PEG-acrylate. A slow crosslinker may take more than several hours to crosslink, for example more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 hours to crosslink. In some embodiments provided herein, droplets are formulated by an instantaneous crosslinker, and thereby preserve the cell state better compared to a slow crosslinker. Without wishing to be bound by theory, cells may possibly undergo physiological changes by intracellular signaling mechanisms during longer crosslinking times.
[0221] In some embodiments, a crosslinker forms a disulfide bond in the hydrogel polymer, thereby linking hydrogel polymers. In some embodiments, the hydrogel polymers form a hydrogel matrix having pores (for example, a porous hydrogel matrix). These pores are capable of retaining sufficiently large particles, such as a single cell or nucleic acids extracted therefrom within the droplet, but allow other materials, such as reagents, to pass through the pores, thereby passing in and out of the droplets. In some embodiments, the pore size of the droplets is finely tuned by varying the ratio of the concentration of polymer to the concentration of crosslinker. In some embodiments, the ratio of polymer to crosslinker is 30:1, 25:1, 20:1, 19:1, 18:1, 17:1, 16:1, 15:1, 14:1, 13:1, 12:1, 11:1, 10:1, 9:1, 8:1, 7:1, 6:1, 5:1, 4:1, 3:1, 2:1, 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:15, 1:20, or 1:30, or a ratio within a range defined by any two of the aforementioned ratios. In some embodiments, additional functional elements, such as DNA primers or charged chemical groups, can be grafted to the polymer matrix to meet the requirements of different applications.
[0222] As used herein, the term “porosity” means the fractional volume (dimensionless) of a hydrogel that is composed of open space, for example, pores or other openings. Therefore, porosity measures void spaces in a material and is a fraction of volume of voids over the total volume, as a percentage between 0 and 100% (or between 0 and 1). Porosity of the hydrogel may range from 0.5 to 0.99, from about 0.75 to about 0.99, or from about 0.8 to about 0.95.
[0223] In some embodiments, the droplet can have any pore size that allows for sufficient diffusion of reagents while concomitantly retaining the single cell or nucleic acids extracted therefrom. As used herein, the term “pore size” refers to a diameter or an effective diameter of a cross-section of the pores. The term “pore size” can also refer to an average diameter or an average effective diameter of a cross-section of the pores, based on the measurements of a plurality of pores. The effective diameter of a cross-section that is not circular equals the diameter of a circular cross-section that has the same cross-sectional area as that of the non-circular cross-section. In some embodiments, the hydrogel can be swollen when the hydrogel is hydrated. The sizes of the pores can then change depending on the water content in the hydrogel. In some embodiments, the pores of the hydrogel can be of sufficient size to retain the encapsulated cell within the hydrogel but allow reagents to pass through. In some embodiments, the interior of the droplet is an aqueous environment. In some embodiments, the single cell disposed within the droplet is free from interaction with the polymer shell of the droplet and / or is not in contact with the polymer shell. In some embodiments, a polymer shell is formed around a cell, and the cell is in contact with the polymer shell due to the polymer shell being brought to the cell surface due to passive adsorption or in a targeted manner, such as by being attached to an antibody or other specific binding molecule.
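The effective-diameter definition above corresponds to d = 2·sqrt(A / π) for a cross-section of area A. The following sketch shows that calculation and the averaging over a plurality of pores; the function names and the example areas are hypothetical and serve only to illustrate the definition.

```python
import math


def effective_diameter(cross_section_area):
    """Diameter of the circle whose area equals the given cross-sectional
    area (same length unit as the square root of the area unit)."""
    if cross_section_area < 0:
        raise ValueError("area must be non-negative")
    return 2.0 * math.sqrt(cross_section_area / math.pi)


def average_pore_size(cross_section_areas):
    """Average effective diameter over a plurality of measured pores."""
    diameters = [effective_diameter(a) for a in cross_section_areas]
    if not diameters:
        raise ValueError("at least one pore measurement is required")
    return sum(diameters) / len(diameters)


# A square pore with 10 nm sides has area 100 nm^2 and therefore an
# effective diameter slightly larger than its side length.
square_pore_nm = effective_diameter(10.0 * 10.0)
mean_pore_nm = average_pore_size([100.0, 100.0])
```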
[0224] In some embodiments, the droplet is of a sufficient size to encapsulate a single cell. In some embodiments, the droplet has a diameter of about 20 µm to about 200 µm, such as 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 µm, or a diameter within a range defined by any two of the aforementioned values. The size of the droplet may change due to environmental factors. In some embodiments, the droplets expand when they are separated from a continuous oil phase and immersed in an aqueous phase. In some embodiments, expansion of the droplet increases the efficiency of performing assays on the genetic material inside the encapsulated cells. In some embodiments, expansion of the droplet creates a larger environment for indexed inserts to be amplified during PCR, which may otherwise be restricted in current cell based assays.
[0225] In some embodiments, a droplet is prepared by dynamic means, such as by vortex assisted emulsion, microfluidic droplet generation, or valve based microfluidics. In some embodiments, the droplets are formulated in a uniform size distribution. In some embodiments, the size of the droplets is finely tuned by adjusting the size of the microfluidic device, the size of the one or more channels, or the flow rate through the microfluidic channels. In some embodiments, the resulting droplet has a diameter ranging from 20 to 200 µm, for example, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 µm, or a diameter within a range defined by any two of the aforementioned values.
[0226] In some embodiments, analyzing one or more analytes may include various analyses, depending on what the analyte is. For example, analyzing may include DNA analysis, RNA analysis, protein analysis, tagmentation, nucleic acid amplification, nucleic acid sequencing, nucleic acid library preparation, assay for transposase accessible chromatin using sequencing (ATAC-seq), contiguity-preserving transposition (CPT-seq), single cell combinatorial indexed sequencing (SCI-seq), or single cell genome amplification, or any combination thereof.
[0227] DNA analysis refers to any technique used to amplify, sequence, or otherwise analyze DNA. DNA amplification can be accomplished using PCR techniques or pyrosequencing. DNA analysis may also comprise non-targeted, non-PCR based DNA sequencing (e.g., metagenomics) techniques. As a non-limiting example, DNA analysis may include sequencing the hyper-variable region of the 16S rDNA (ribosomal DNA) and using the sequencing for species identification via DNA.
[0228] RNA analysis refers to any technique used to amplify, sequence, or otherwise analyze RNA. The same techniques used to analyze DNA can be used to amplify and sequence RNA. RNA, which is less stable than DNA, is produced by transcription of DNA in response to a stimulus. Therefore, RNA analysis may provide a more accurate picture of the metabolically active members of the community and may be used to provide information about the community function of organisms in a sample. Further, simultaneous analysis of both DNA and RNA may be beneficial for efficient determination of both DNA and RNA related interrogations. Nucleic acid sequencing refers to use of sequencing to determine the order of nucleotides in a sequence of a nucleic acid molecule, such as DNA or RNA.
[0229] As used herein, the term “reagent” describes an agent or a mixture of two or more agents useful for reacting with, interacting with, diluting, or adding to a sample, and may include agents used in assays described herein, including agents for lysis, nucleic acid analysis, nucleic acid amplification reactions, protein analysis, tagmentation reactions, ATAC-seq, CPT-seq, or SCI-seq reactions, or other assays. Thus, reagents may include, for example, buffers, chemicals, enzymes, polymerase, primers having a size of less than 50 base pairs, template nucleic acids, nucleotides, labels, dyes, or nucleases. In some embodiments, the reagent includes lysozyme, proteinase K, random hexamers, polymerase (for example, φ29 DNA polymerase, Taq polymerase, Bsu polymerase), transposase (for example, Tn5), primers (for example, P5 and P7 adaptor sequences), ligase, catalyzing enzyme, deoxynucleotide triphosphates, buffers, or divalent cations.
Other Considerations
[0230] Various embodiments of the present disclosure may be a system, a method, and / or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or mediums) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
[0231] For example, the functionality described herein may be performed as software instructions are executed by, and / or in response to software instructions being executed by, one or more hardware processors and / or any other suitable computing devices. The software instructions and / or other executable code may be read from a computer readable storage medium (or mediums). Computer readable storage mediums may also be referred to herein as computer readable storage or computer readable storage devices.
[0232] The computer readable storage medium can be a tangible device that can retain and store data and / or instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device (including any volatile and / or non-volatile electronic storage devices), a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a solid state drive, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0233] Computer readable program instructions described herein can be downloaded to respective computing / processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and / or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing / processing device.
[0234] Computer readable program instructions (also referred to herein as, for example, “code,” “instructions,” “module,” “application,” “software application,” and / or the like) for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. Computer readable program instructions may be callable from other instructions or from themselves, and / or may be invoked in response to detected events or interrupts. Computer readable program instructions configured for execution on computing devices may be provided on a computer readable storage medium, and / or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution) that may then be stored on a computer readable storage medium. Such computer readable program instructions may be stored, partially or fully, on a memory device (e.g., a computer readable storage medium) of the executing computing device, for execution by the computing device. The computer readable program instructions may execute entirely on a user’s computer (e.g., the executing computing device), partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server.
In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0235] Aspects of the present disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer readable program instructions.
[0236] These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / acts specified in the flowchart and / or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and / or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function / act specified in the flowchart(s) and / or block diagram(s) block or blocks.
[0237] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions / acts specified in the flowchart and / or block diagram block or blocks. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer may load the instructions and / or modules into its dynamic memory and send the instructions over a telephone, cable, or optical line using a modem. A modem local to a server computing system may receive the data on the telephone / cable / optical line and use a converter device including the appropriate circuitry to place the data on a bus. The bus may carry the data to a memory, from which a processor may retrieve and execute the instructions. The instructions received by the memory may optionally be stored on a storage device (e.g., a solid-state drive) either before or after execution by the computer processor.
[0238] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a service, module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In addition, certain blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate.
[0239] It will also be noted that each block of the block diagrams and / or flowchart illustration, and combinations of blocks in the block diagrams and / or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions. For example, any of the processes, methods, algorithms, elements, blocks, applications, or other functionality (or portions of functionality) described in the preceding sections may be embodied in, and / or fully or partially automated via, electronic hardware such application-specific processors (e.g., application-specific integrated circuits (ASICs)), programmable processors (e.g., field programmable gate arrays (FPGAs)), application-specific circuitry, and / or the like (any of which may also combine custom hard-wired logic, logic circuits, ASICs, FPGAs, etc. with custom programming / execution of software instructions to accomplish the techniques).
[0240] Any of the above-mentioned processors, and / or devices incorporating any of the above-mentioned processors, may be referred to herein as, for example, “computers,” “computer devices,” “computing devices,” “hardware computing devices,” “hardware processors,” “processing units,” and / or the like. Computing devices of the above-embodiments may generally (but not necessarily) be controlled and / or coordinated by operating system software, such as Mac OS, iOS, Android, Chrome OS, Windows OS (e.g., Windows XP, Windows Vista, Windows 7, Windows 8, Windows 10, Windows 11, Windows Server, etc.), Windows CE, Unix, Linux, SunOS, Solaris, Blackberry OS, VxWorks, or other suitable operating systems. In other embodiments, the computing devices may be controlled by a proprietary operating system. Conventional operating systems control and schedule computerprocesses for execution, perform memory management, provide file system, networking, I / O services, and provide a user interface functionality, such as a graphical user interface (“GUI”), among other things.
[0241] Reference throughout the specification to “one example”, “another example”, “an example”, and so forth, means that a particular element (e.g., feature, structure, and / or characteristic) described in connection with the example is included in at least one example described herein, and may or may not be present in other examples. In addition, it is to be understood that the described elements for any example may be combined in any suitable manner in the various examples unless the context clearly dictates otherwise.
[0242] It is to be understood that the ranges provided herein include the stated range and any value or sub-range within the stated range, as if such value or sub-range were explicitly recited. For example, a range from about 2 kbp to about 20 kbp should be interpreted to include not only the explicitly recited limits of from about 2 kbp to about 20 kbp, but also to include individual values, such as about 3.5 kbp, about 8 kbp, about 18.2 kbp, etc., and sub-ranges, such as from about 5 kbp to about 10 kbp, etc. Furthermore, when “about” and / or “substantially” are / is utilized to describe a value, this is meant to encompass minor variations (e.g., up to +/- 10%) from the stated value.
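By way of illustration only, the range interpretation above can be sketched in a few lines of Python. The sketch assumes that "about" is read as a symmetric tolerance of up to 10% of the stated value; the function names `about_bounds` and `in_about_range` and the use of kbp values as floating-point numbers are hypothetical and are not part of the disclosure.

```python
def about_bounds(value, tolerance=0.10):
    """Return the (low, high) interval covered by 'about value'.

    Assumption: 'about' means up to +/- tolerance of the stated value.
    """
    delta = abs(value) * tolerance
    return value - delta, value + delta


def in_about_range(x, low, high, tolerance=0.10):
    """True if x lies within 'from about low to about high'."""
    lower, _ = about_bounds(low, tolerance)
    _, upper = about_bounds(high, tolerance)
    return lower <= x <= upper


if __name__ == "__main__":
    # 'From about 2 kbp to about 20 kbp' under a 10% tolerance.
    for kbp in (1.7, 3.5, 18.2, 21.5, 22.5):
        print(kbp, in_about_range(kbp, 2.0, 20.0))
```

Under this reading, a value slightly outside the recited limits (for example 21.5 kbp against an upper limit of about 20 kbp) still falls within the range, while a value beyond the tolerance (22.5 kbp) does not.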
[0243] While several examples have been described in detail, it is to be understood that the disclosed examples may be modified. Therefore, the foregoing description is to be considered non-limiting.
[0244] While certain examples have been described, these examples have been presented by way of example only, and are not intended to limit the scope of the disclosure. Indeed, the novel methods described herein may be embodied in a variety of other forms. Furthermore, various omissions, substitutions and changes in the methods described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure.
[0245] Features, materials, characteristics, or groups described in conjunction with a particular aspect or example are to be understood to be applicable to any other aspect or example described in this section or elsewhere in this specification unless incompatible therewith. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and / or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and / or steps are mutually exclusive. The protection is not restricted to the details of any foregoing examples. The protection extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
[0246] Furthermore, certain features that are described in this disclosure in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations, one or more features from a claimed combination can, in some cases, be excised from the combination, and the combination may be claimed as a sub-combination or variation of a sub-combination.
[0247] Moreover, while operations may be depicted in the drawings or described in the specification in a particular order, such operations need not be performed in the particular order shown or in sequential order, and not all operations need be performed, to achieve desirable results. Other operations that are not depicted or described can be incorporated in the example methods and processes. For example, one or more additional operations can be performed before, after, simultaneously, or between any of the described operations. Further, the operations may be rearranged or reordered in other implementations. Those skilled in the art will appreciate that in some examples, the actual steps taken in the processes illustrated and / or disclosed may differ from those shown in the figures. Depending on the example, certain of the steps described above may be removed or others may be added. Furthermore, the features and attributes of the specific examples disclosed above may be combined in different ways to form additional examples, all of which fall within the scope of the present disclosure.
[0248] For purposes of this disclosure, certain aspects, advantages, and novel features are described herein. Not necessarily all such advantages may be achieved in accordance with any particular example. Thus, for example, those skilled in the art will recognize that the disclosure may be embodied or carried out in a manner that achieves one advantage or a group of advantages as taught herein without necessarily achieving other advantages as may be taught or suggested herein.
[0249] The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.
[0250] Conditional language used herein, such as, among others, “can,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and / or states. Thus, such conditional language is not generally intended to imply that features, elements and / or states are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and / or states are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” “involving,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
[0251] Disjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y or Z, or any combination thereof (such as X, Y and / or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y or at least one of Z to each be present.
[0252] Unless otherwise explicitly stated, articles such as “a” or “an” should generally be interpreted to include one or more described items.
[0253] While the above detailed description has shown, described, and pointed out novel features as applied to illustrative embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated can be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments described herein can be embodied within a form that does not provide all of the features and benefits set forth herein, as some features can be used or practiced separately from others. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
[0254] It should be appreciated that all combinations of the foregoing concepts (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein.
[0255] The scope of the present disclosure is not intended to be limited by the specific disclosures of examples in this section or elsewhere in this specification, and may be defined by claims as presented in this section or elsewhere in this specification or as presented in the future. The language of the claims is to be interpreted broadly based on the language employed in the claims and not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as nonexclusive.
Claims
WHAT IS CLAIMED IS:
1. A method for assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell, the method comprising: obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffolding the contigs from the first assembly to produce a second assembly, wherein the second assembly has undefined contiguous sequences; and mapping additional contigs from the first assembly to the undefined contiguous sequences on the second assembly by referencing the linking information to create a final assembly.
2. The method of claim 1, wherein assembling the sequence reads to produce a first assembly includes mapping the sequence reads to contiguous sequences of the first assembly to produce contigs.
3. The method of claim 1, wherein producing the first assembly comprises producing a hybrid assembly of a plurality of de novo assemblers.
4. The method of claim 1, wherein determining the linking information comprises determining a mapping quality score for the sequence reads.
5. The method of claim 1, wherein assembling the sequence reads comprises: determining at least two or more sequence reads that exceed a mapping quality threshold; and adding the at least two or more sequence reads to a plurality of bins.
6. The method of claim 5, wherein adding the at least two or more sequence reads is based on a k-mer signature threshold.
7. The method of claim 6, wherein the k-mer signature threshold is based on a number of sequence reads, which are genomically adjacent to the sequence reads in the plurality of bins.
8. The method of claim 5, wherein the plurality of bins have a bin connection strength, wherein the bin connection strength is based on at least two sequence reads, which are sorted into different bins and are genomically adjacent to one another.
9. The method of claim 8, wherein the bin connection strength is based upon proximity information for clusters determined from flow cell locations of the clusters.
10. The method of claim 1, wherein scaffolding comprises selecting contigs from the first assembly based upon a sequence length and a bin connection strength.
11. The method of claim 10, wherein the sequence length is less than 500 bases.
12. The method of claim 10, wherein the sequence length is between 500 bases and 10 kilobases.
13. The method of claim 10, wherein the sequence length is greater than 10 kilobases.
14. The method of claim 1, wherein each of the contigs has an orientation, wherein the orientation is associated with a head terminal or a tail terminal of a contig.
15. The method of claim 1, further comprising: determining the orientation of a first contig in the plurality of contigs in relation to the orientation of a second contig in the plurality of contigs based on a threshold number of connections between the first contig and second contig.
16. The method of claim 15, further comprising: generating a first graph, comprising nodes each corresponding to a terminal of a contig and edges connecting the nodes, wherein each edge has a weight based on the bin connection strength between the nodes connected by the edge.
17. The method of claim 16, further comprising: pruning the edges from the first graph based upon a unique maximum weight amongst all edges in the graph and a reciprocity metric to generate a second graph; and traversing the second graph to map the contigs to the second assembly.
18. The method of claim 17, wherein the edges in the second graph are bidirectional.
19. The method of claim 17, wherein the first and second graphs are cyclic.
20. The method of claim 10, wherein scaffolding further comprises selecting contigs based upon a mapping connection strength.
21. The method of claim 20, wherein the mapping connection strength is based upon the bin connection strength and orientation of the contigs.
22. The method of claim 20, wherein a contig may be connected to more than one contig if the mapping connection strength is above a threshold.
23. The method of claim 21, wherein the proximity information and mapping connection strength of a first contig to a second contig can be used to determine the orientation of the first contig relative to a third contig.
24. The method of claim 1, wherein the mapping of additional contigs to the sections of the second assembly having undefined contiguous sequences is based upon a mapping connection strength or a fraction of total reads that have a mapping connection greater than a threshold.
25. The method of claim 1, wherein sections having undefined contiguous sequences have a fixed sequence length.
26. The method of claim 1, wherein the genomic sample includes bacterial, fungal, human or non-human samples.
27. A system for assembling sequence reads from a genomic sample, comprising a memory storing instructions and a processor that, when executing the instructions, is configured to perform a method comprising: obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference genome; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffolding the contigs from the first assembly to produce a second assembly, wherein the second assembly has portions having undefined contiguous sequences; and mapping additional contigs from the first assembly to the undefined contiguous sequences to assemble a final assembly by referencing the linking information.
28. A non-transitory computer-readable medium comprising a plurality of instructions, which when executed by at least one processor, cause the at least one processor to: obtain flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assemble the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference genome; determine linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; scaffold the contigs from the first assembly to produce a second assembly, wherein the second assembly has portions having undefined contiguous sequences; and map additional contigs from the first assembly to the undefined contiguous sequences to assemble a final assembly by referencing the linking information.
29. A method for assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell, the method comprising: obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; and detecting misassemblies in the first assembly using the linking information.
30. The method of claim 29, wherein the method further comprises scaffolding the contigs from the first assembly to produce a second assembly and detecting misassemblies in the second assembly using the linking information.
31. The method of claim 30, wherein scaffolding comprises ordering and orienting contigs from the first assembly to produce the second assembly using proximity information.
32. The method of claim 31, wherein scaffolding further comprises incorporating additional contigs into the second assembly based on a measure of the strength of the determined linking information.
33. The method of claim 30, wherein the misassemblies are detected based on identifying an abnormality based on an off-diagonal signal from colocation data based on the linking information.
34. The method of claim 33, wherein the colocation data is in the form of a colocation plot or colocation matrix.
35. The method of claim 33, wherein the misassemblies are detected by providing the colocation data to a machine learning model to identify errors in the order and / or orientation of contigs in the first assembly or second assembly.
36. The method of claim 20, wherein the misassemblies are detected based on proximity link size deviation in the linking information.
37. The method of claim 30, further comprising breaking the first assembly or the second assembly into separate contigs at the site of a misassembly in the first assembly or the second assembly.
38. The method of claim 30, further comprising correcting the detected misassemblies using proximity information to create a corrected assembly.
39. The method of claim 38, wherein correcting the detected misassemblies comprises correcting the order and / or orientation of contigs in the first assembly or the second assembly using proximity information.
40. The method of claim 32, wherein the incorporation of additional contigs to the second assembly is further based upon an assembly graph, wherein contigs are represented as outputs of non-branching paths in the assembly graph.
41. The method of claim 29, further comprising correcting the assembly using proximity information.
42. The method of claim 41, wherein correcting the assembly comprises using a pile-up comprising a plurality of linked sequence reads.
43. The method of claim 42, wherein correcting the assembly comprises identifying positions of discrepancy between the assembly and a consensus sequence built from the linked sequence reads in the pile-up.
44. The method of claim 43, wherein the method updates the assembly at the positions of discrepancy based on the consensus sequence from the pile-up.
45. The method of claim 43, wherein the method marks the positions of discrepancy in the assembly.
46. The method of claim 42, wherein the linked sequence reads comprise local reads having 1-1000 bp.
47. The method of claim 41, wherein correcting the assembly uses ploidy-aware proximity information.
48. The method of claim 29, wherein the method determines the circularity of an assembly in a BANDAGE (Bioinformatics Application for Navigating De novo Assembly Graphs Easily) plot.
49. The method of claim 48, wherein the method further determines the circularity using proximity information.
50. A method for assembling sequence reads from a genomic sample, wherein the sequence reads are read from clusters of nucleic acids on a flow cell, the method comprising: obtaining flow cell data comprising: 1) sequence reads from the clusters of nucleic acids from the genomic nucleic acid sample; and 2) the flow cell locations of the clusters of the nucleic acids; assembling the sequence reads to produce a first assembly comprising a plurality of contigs without comparison to a reference sequence; determining linking information for the sequence reads based on the flow cell locations of the clusters of nucleic acids; and correcting the assembly using proximity information.
51. The method of claim 50, wherein correcting the assembly comprises using a pile-up comprising a plurality of linked sequence reads.
52. The method of claim 51, wherein correcting the assembly comprises identifying positions of discrepancy between the assembly and a consensus sequence built from the linked sequence reads in the pile-up.
53. The method of claim 52, wherein the method updates the assembly at the positions of discrepancy based on the consensus sequence from the pile-up.
54. The method of claim 52, wherein the method marks the positions of discrepancy in the assembly.
55. The method of claim 51, wherein the linked sequence reads comprise reads having 1-1000 bp.
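For illustration only, and without limiting the claims, the pile-up based correction recited in claims 51 to 54 can be sketched in Python as follows. The sketch is a simplified assumption rather than the disclosed implementation: linked reads are taken to be already placed on the assembly without gaps, the consensus is a simple per-position majority vote, tied or uncovered positions are left untouched, and marking uses the character "N". The function names and the `(start, sequence)` read representation are hypothetical.

```python
from collections import Counter


def pileup_consensus(assembly_len, linked_reads):
    """Per-position majority base from reads placed on the assembly.

    linked_reads: iterable of (start, sequence) pairs, where start is the
    0-based assembly position of the read's first base (ungapped placement
    is an assumption of this sketch). Positions with no coverage or a tied
    vote yield None.
    """
    columns = [Counter() for _ in range(assembly_len)]
    for start, seq in linked_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < assembly_len:
                columns[pos][base] += 1
    consensus = []
    for counts in columns:
        top = counts.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            consensus.append(None)
        else:
            consensus.append(top[0][0])
    return consensus


def find_discrepancies(assembly, consensus):
    """Positions where the assembly disagrees with a defined consensus base."""
    return [
        i
        for i, (asm_base, cons_base) in enumerate(zip(assembly, consensus))
        if cons_base is not None and asm_base != cons_base
    ]


def correct_assembly(assembly, linked_reads, mark_only=False, mark_char="N"):
    """Update (or only mark) the assembly at positions of discrepancy."""
    consensus = pileup_consensus(len(assembly), linked_reads)
    positions = find_discrepancies(assembly, consensus)
    bases = list(assembly)
    for i in positions:
        bases[i] = mark_char if mark_only else consensus[i]
    return "".join(bases), positions


if __name__ == "__main__":
    reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (4, "ACGT")]
    print(correct_assembly("ACGTACGT", reads))
    print(correct_assembly("ACGTACGT", reads, mark_only=True))
```

Calling `correct_assembly` with `mark_only=False` corresponds to updating the assembly from the consensus, while `mark_only=True` corresponds to marking the positions of discrepancy without changing the called base to the consensus.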