Two strand single end (2XSE) DNA sequencing
By sequencing both strands of a DNA fragment in opposite directions, the method addresses the limitations of short read lengths in MPS, achieving longer and more accurate sequence reads with reduced errors.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- MGI TECH CO LTD
- Filing Date
- 2025-12-15
- Publication Date
- 2026-06-18
AI Technical Summary
Existing massively parallel sequencing (MPS) technologies face limitations due to short read lengths, which introduce errors in sequencing samples without a close reference sequence, particularly in areas with high sequence complexity or repetition, affecting the accuracy of variant detection.
A method for simultaneously or consecutively sequencing both strands of a DNA fragment in opposite directions by forming sense and antisense concatemers, using techniques like rolling circle amplification or PCR amplification, to generate longer sequence reads and reduce errors through complementary-UMI pairing.
This approach enables longer sequence reads, reduces sequencing errors, and improves accuracy by generating overlapping reads, allowing for more sequence data per sample molecule with fewer sequencing cycles.
Description
TWO STRAND SINGLE END (2XSE) DNA SEQUENCING
PRIORITY APPLICATION
[0001] This application claims the priority benefit of U.S. provisional patent application 63/733,958, filed December 13, 2024. The priority application is hereby incorporated herein in its entirety for all purposes.
TECHNICAL FIELD
[0002] This disclosure relates generally to the fields of oligonucleotide chemistry and DNA sequencing. It provides technological enhancements for obtaining sequence reads from clonally amplified DNA (DNA nanoballs or PCR clusters) so that fragments from a fragment library can be sequenced in opposite directions on complementary strands.
BACKGROUND
[0003] Massively parallel sequencing (MPS), also known as next-generation sequencing (NGS), has become an important tool in biological research and medicine, providing a means for simultaneously sequencing millions of DNA fragments. MPS has several advantages over traditional sequencing, including faster processing times, lower cost, and greater accuracy, helping scientists identify rare genetic variations and mutations of clinical importance. It is used in disease research, personalized medicine, genetic testing, and disease tracking worldwide, to improve the standard of clinical diagnosis and care.
[0004] Potential limitations of previous methods of MPS can arise from the length of sequence reads, which in previous technology typically range from 100 to 300 base pairs. Short read lengths can introduce errors in sequencing of samples that don’t have a close reference sequence. Errors in base calls and read assembly adversely affect the accuracy of variant detection, particularly in areas with high sequence complexity or repetition.
[0005] The technology put forth in this disclosure advances the length and accuracy of sequence reads that can be obtained in MPS, providing important benefits for the sequencing of genome and expression libraries.
SUMMARY OF THE INVENTION
[0006] This disclosure provides a technology for simultaneously or consecutively sequencing both strands of a DNA fragment in opposite directions. Sense and antisense concatemers are formed, for example: (1) by replicating a double stranded circular DNA in opposite directions; (2) by replicating separate sense and antisense single stranded circular DNAs; or (3) by PCR amplifying single stranded linear DNAs, then separating and circularizing the sense and antisense strands. The sense and antisense circular DNAs are then used to make DNB concatemers for sequence determination. The technology can be used to achieve longer sequence reads, to provide overlapping reads with error checking capability, and/or to read an entire DNA fragment in half the number of sequencing cycles.
[0007] By using 2xSE sequencing, more sequence data can be obtained per sample molecule by pairing of one or more reads from each strand using UMIs and cUMIs. Features may include one or more of the following in any combination:
· reads from sense and antisense strands may be generated on different DNA spots;
· almost all UMI and cUMI barcodes are unique to the population of molecules being sequenced;
· one, two, three, or four times as many reads may be obtained relative to number of different molecules for each strand;
· reads may be generated at the same time for faster paired end reads;
· reads may be generated in the same lane of a sequencing cartridge;
· the circle of complementary strands used to generate the concatemers may be produced in the same tube;
· single strand reads can be obtained that are longer than 300, 400, 500, 600, 700, 800, 900, or 1000 bases in length;
· combined reads from both ends of a DNA fragment may overlap, generating a total read length exceeding 600, 800, 1000, 1200, or 1500 bases;
· overlapping sequence reads may also be used to reduce sequencing errors in the overlapped segment;
· long DNA molecules can be read by massively parallel sequencing (MPS) using complementary-UMI pairing of reads longer than 400, 500, 600, 700, or 800 bases from multiple overlapped templates generated from a DNA molecule longer than 1, 1.5, 2, 3, 4, 5, or 6 kb;
· concatemer templates can be generated by controlled primer extension (CPE) on each strand of a starting double strand DNA molecule (dsDNA), by circularization of one strand before CPE on that strand; or
· 1500 base long 16S amplicon molecules (which help identify and characterize bacterial species within a sample) can be entirely sequenced in all their species variants by way of two 750 base or longer reads, one from each strand, wherein sequencing templates are about 700 to 1000 or 800 to 1100 bases in length.
Definitions and abbreviations
[0008] When referred to in this disclosure, a “concatemer” is a DNA macromolecule in which multiple replicates of a DNA template are present in the same DNA strand adjacent or nearby each other in the same orientation, such as may be achieved by rolling circle amplification (RCA) of the template presented in a single or double stranded DNA. The template comprises a portion of a target nucleic acid or DNA that is being sequenced or otherwise characterized (an “insert”), plus one or more artificial sequences (or “adaptors”), as detailed below. A “DNA nanoball” or DNB is a DNA macromolecule such as a concatemer that adopts a globular structure in a compatible buffer. “Concatemer sequencing” is the use of concatemers to determine the sequence of a target DNA. Sequence reads are obtained from copies of the insert that is replicated in each concatemer, and assembled with sequence reads from other concatemer inserts to obtain at least part of the sequence of the target nucleic acid. Concatemers may be referred to as “sense concatemers” and “antisense concatemers” in comparison with each other if the replicated insert in one is complementary to the replicated insert in the other.
[0009] The terms “2xSE” and “2xSE sequencing” mean directional nucleotide single-end (SE) sequencing of both strands of a DNA macromolecule (or portion thereof), conducted in the same direction (usually from 5’ to 3’) from opposite ends of the DNA duplex. It differs from paired end (PE) sequencing, because the two strands being sequenced in 2xSE sequencing are separate and apart from each other during the sequencing process. In the context of DNA nanoball (DNB) technology, each of the two strands is contiguously replicated in its own separate DNB and sequenced separately. Such complementary DNBs may be interlinked within the same spot on the arrays made from dsDNA circles (Approach A). The sequencing of each of the strands may be done one after another or concurrently, in separate reaction mixtures or a combined mixture. A term such as “2xSE600” means that the length of the sequence reads inwards from both ends of the insert fragments (on separate concatemers) is about 600 bases.
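The read geometry implied by a term such as “2xSE600” can be illustrated with a short calculation. The following Python sketch is illustrative only and is not taken from the disclosure: the function name read_coverage and its return fields are hypothetical, and it assumes that each of the two reads starts exactly at one end of the insert and proceeds inward without interruption.

```python
def read_coverage(insert_len: int, read_len: int) -> dict:
    """Coverage of an insert by two inward reads, one per strand.

    Assumption (not from the disclosure): each read starts exactly at
    one end of the insert and runs inward for read_len bases.
    """
    # A read cannot extend beyond the insert itself.
    r = min(read_len, insert_len)
    overlap = max(0, 2 * r - insert_len)  # bases read on both strands
    gap = max(0, insert_len - 2 * r)      # bases read on neither strand
    covered = insert_len - gap            # bases read at least once
    return {"overlap": overlap, "gap": gap, "covered": covered}


# A 1500 base 16S amplicon read as two 750 base reads is covered end to end.
print(read_coverage(1500, 750))
# A 1000 base insert read as 2xSE600 has a 200 base doubly read middle segment.
print(read_coverage(1000, 600))
```

Under these assumptions, any insert shorter than twice the read length yields an overlapped segment that is available for error checking, while longer inserts leave an unread gap in the middle.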
[0010] When referring to particular implementations of DNA sequencing using DNBs or concatemers, paired end (PE) sequencing typically comprises a first reading of a concatemer template followed by multiple displacement amplification (MDA) to permit reading in the opposite direction. A term such as PE600 means that the length of the sequence read in the first direction is about 600 bases.
[0011] Other terms are defined as they arise in this disclosure. Terms not explicitly defined have their ordinary meaning, adapted to the context in which they are used.
Features and related aspects
[0012] Provided in this disclosure are methods for sequencing or otherwise characterizing the nucleotide sequence of a target polynucleotide. The target polynucleotide may be a single nucleic acid macromolecule or plurality thereof from a single organism or source, or from a plurality thereof: for example, genomic DNA of a human, other mammal, other eukaryotic organism, cell culture, or environmental sample; or an expression library from one or more cells, culture, tissues or environmental samples that has been reverse transcribed into cDNA.
[0013] In general terms, the methods put forth below may comprise the steps of (a) obtaining a library of DNA fragments, each comprising an insert that is a contiguous portion or fragment of the target polynucleotide, ligated with a sequencing adaptor to form a template; (b) forming sense concatemers, each comprising replicated copies of a first one of said templates containing a first insert sequence; and (c) forming antisense concatemers, each comprising replicated copies of a second one of said templates containing a second insert sequence. The term “replicated” in this context means contiguously replicated: i.e., multiple replicates of the sequence of the template appear in the same DNA strand adjacent or nearby each other in the same orientation, such as may be achieved by rolling circle amplification (RCA) of the template presented in a single or double stranded DNA.
[0014] Depending on the context and the user’s objectives, a proportion of sense concatemers prepared in this way will have a corresponding antisense strand in the same reaction mixture, or in a separate reaction mixture. The proportion of sense strands that have corresponding antisense strands may be at least 10% or 20%, typically at one third, one half, or two thirds, and sometimes 80%, 90% or 95% of the prepared sense concatemers. As used in this disclosure, a sense concatemer “corresponds” to an antisense concatemer when the replicated insert in the sense concatemer is complementary to the replicated insert in the antisense concatemer.
[0015] In general, the terms “complementary” or “complementarity” have their regular meaning: Watson-Crick base pairing between nucleotides or nucleic acids. Complementary nucleotides are A paired with T (or A paired with U), and G paired with C, including their corresponding analogs. More particularly in the context of comparing DNA fragment inserts between sense and antisense concatemers (unless stated otherwise), “complementary inserts” or “corresponding inserts” are two inserts that match each other both in nucleotide sequence and in length. “Complementary concatemers” or “corresponding concatemers” are concatemers with complementary or corresponding inserts. Thus, at least 95%, and sometimes at least 98% or 99% of individual bases in one insert will be complementary to a base at the same position in the other insert. In addition, the length of one insert (in terms of number of bases) will be within at least 90%, sometimes within at least 95%, or 98%, or close to 100% of the length of the other insert. Modest deletions, gaps, and interruptions are permitted when comparing the two strands, as long as the total lengths are within the stated limit. The technology of this disclosure may also be implemented such that the inserts don’t match each other in length. Instead, they can be considered to “coincide” if the inserts are complementary and at least one of the two ends of the inserts occur at or around the same place in sequence.
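The two thresholds in this definition (per-base complementarity and relative length) can be expressed as a simple check. The Python sketch below is a simplified illustration, not the method of this disclosure: the names reverse_complement and inserts_correspond are hypothetical, both inserts are assumed to be written 5’ to 3’, and the comparison is position by position without alignment, so the gaps and deletions that the definition permits are not handled.

```python
# Watson-Crick complement table for DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def inserts_correspond(sense: str, antisense: str,
                       min_identity: float = 0.95,
                       min_length_ratio: float = 0.90) -> bool:
    """Ungapped check that two inserts are 'corresponding' inserts.

    Assumption: no alignment is performed, so any insertion or deletion
    shifts every downstream position and is counted as mismatches.
    """
    if not sense or not antisense:
        return False
    longer = max(len(sense), len(antisense))
    shorter = min(len(sense), len(antisense))
    if shorter / longer < min_length_ratio:
        return False
    # Re-express the antisense insert on the sense strand, then compare.
    oriented = reverse_complement(antisense)
    matches = sum(a == b for a, b in zip(sense, oriented))
    # Unpaired bases of the longer insert count against the identity.
    return matches / longer >= min_identity


sense = "ACGTTGCAAGCTTAGGCTAA"
print(inserts_correspond(sense, reverse_complement(sense)))
```

The default thresholds of 95% and 90% are the lower bounds named in the paragraph above; the stricter values (98%, 99%) can be passed as arguments.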
[0016] Referring to opposing strands in a double stranded DNA as “sense” or “antisense” strands, or “top” and “bottom” strands, is an arbitrary assignment for labeling purposes, and does not otherwise imply a constraint on structure, function, or order. When sense and antisense strands or concatemers are prepared and kept in separate reaction mixtures, they are matched against corresponding counterparts in the other mixture. In instances in which sense and antisense strands are prepared and/or pooled together in a single reaction mixture, they are compared with other strands or concatemers in the same mixture.
[0017] After the sense and antisense concatemers are prepared, (d) first sequence reads are obtained of the first inserts in the sense concatemers; and (e) second sequence reads are obtained of the second inserts in the antisense concatemers. Referring to components and actions as “first”, “second”, and so on is also an arbitrary assignment for labeling purposes, and unless otherwise stated or required, does not imply an order of operation or importance.
[0018] Adaptors are used at various places in this technology for various purposes. The “adaptors” are artificial oligonucleotide sequences that constitute a tool kit for manipulation of an adjacent section of DNA on either or both sides. In the context of this disclosure, they may be referred to with an adjective that is non-limiting and used for purposes of labeling and discrimination from other adaptors in other locations. For example, templates that are replicated in DNA concatemers contain an insert sequence from the target polynucleotide, plus an adaptor that has a hybridization site for a sequencing primer, and hence may be referred to as a sequence or sequencing adaptor. It typically has a hybridization site for a sequencing primer used to determine the sequence of adjacent DNA. It may serve other purposes, such as containing a UMI or cUMI (explained below), one or more other primer binding sites, one or more other barcodes, recognition sites for restriction endonucleases, and hybridization sites for structural purposes, such as bridging, condensing, or scaffold oligonucleotides. Circular DNAs typically have ends that are joined by what is referred to herein as an assembly adaptor, which may have other purposes: for example, containing a hybridization site for a primer used for rolling circle amplification or for sequencing by primer extension.
[0019] In some circumstances, the first template in each sense concatemer contains a unique molecule identifier (UMI) and the second template in each antisense concatemer contains a complementary or second unique molecule identifier (cUMI). The methods of preparing sense and antisense DNBs may continue by (f) comparing UMIs for first sequence reads with cUMIs for second sequence reads; and (g) matching first sequence reads with second sequence reads whenever the sequence of a cUMI is complementary to the sequence of a UMI. The UMI is an artificial oligonucleotide sequence or barcode that is ligated to or otherwise associated with an insert sequence that constitutes a fragment or portion of the target DNA. The UMI is typically a contiguous sequence, but it may be divided into two separate portions (a duplex UMI) or more. Other features of UMIs, cUMIs, their preparation and use are explained further in sections that follow.
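Steps (f) and (g) amount to a lookup keyed on the complement of each cUMI. The following Python sketch shows one way this could be done; it is a minimal illustration, not the implementation of this disclosure. The function names are hypothetical, reads are represented as simple (barcode, read) tuples, the cUMI as read 5’ to 3’ is assumed to be the exact reverse complement of the UMI, and barcode sequencing errors are ignored.

```python
from collections import defaultdict

# Watson-Crick complement table for DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def pair_reads_by_umi(sense_reads, antisense_reads):
    """Pair sense and antisense reads whose barcodes are complementary.

    sense_reads:     iterable of (umi, read) tuples
    antisense_reads: iterable of (cumi, read) tuples
    Assumption: exact matching only; no tolerance for barcode errors.
    """
    # Step (f): index the sense reads by UMI for constant-time lookup.
    by_umi = defaultdict(list)
    for umi, read in sense_reads:
        by_umi[umi].append(read)

    # Step (g): a cUMI matches the UMI equal to its reverse complement.
    pairs = []
    for cumi, antisense_read in antisense_reads:
        for sense_read in by_umi.get(reverse_complement(cumi), []):
            pairs.append((sense_read, antisense_read))
    return pairs


sense = [("ACGTAC", "S1"), ("GGGTTT", "S2")]
antisense = [("GTACGT", "A1"), ("CCCCCC", "A3")]
print(pair_reads_by_umi(sense, antisense))
```

Reads whose barcode has no partner on the other strand (S2 and A3 in this example) are simply left unpaired.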
[0020] In the context of this disclosure, the UMI identifies a particular fragment in the DNA library and only that fragment. It is separate and apart from barcodes that may be used to identify adjacent or neighboring fragments in a genome or target DNA, such as would be incorporated by aliquoting and pooling LFR technology (U.S. Patent 8,771,958) or single tube LFR (stLFR) technology (U.S. Patent No. 10,557,166; EP 3918088 B1; EP 3790967 B1). UMIs are also separate and apart from barcodes that may be used to identify DNA from individual cells in a target polynucleotide made from a plurality of different cells, or individual sources in a target polynucleotide made from a plurality of different sources.
[0021] Each UMI and cUMI may be embedded within a sequence read of a complete template that includes the UMI, a primer binding site, and an insert portion from the target polynucleotide. In this case, it is sequenced consecutively on each strand with the insert, referred to herein as two primer sequencing (Example 2). Another option is for each UMI and cUMI to be sequenced separately from the insert it is labeling, using a separate primer. This may be referred to as four primer sequencing (Example 1).
[0022] Following determination of UMI and cUMI sequences, the processing of sense and antisense concatemers may continue by (h) assembling first sequence reads with matching second sequence reads when the first and second sequence reads overlap.
[0023] Alternatively or in addition to UMIs, sequence reads from sense and antisense strands may be matched and assembled by (f) comparing first sequence reads with second sequence reads from sense and antisense strands; and (g) matching first sequence reads with second sequence reads when they are complementary to each other in an overlapping region for a sufficient length. The length of the region being compared is chosen so as to be unique in the target polynucleotide, to avoid mismatching. For a human genome, assuming random distribution of the four bases, the number of bases required for unique sequences is at least about 16 consecutive nucleotides. In practice, the user may decide to use at least about 10, 12, 15, 16, 20, or 25 consecutive nucleotides for the match, or two or more segments with sufficient specificity. “Complementary” in this context means that the oligonucleotide sequences being matched are the opposite of each other in each base pair according to Watson-Crick base pairing over the stated distance.
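The figure of about 16 nucleotides follows from counting: a random k-mer is expected to be unique once the number of possible k-mers (4 to the power k) reaches the genome size. The Python sketch below reproduces that estimate and shows an exact-match overlap test between one sense read and one antisense read. It is illustrative only: the function names are hypothetical, the haploid human genome size of about 3.1 billion bases is an assumption supplied here rather than a value from the disclosure, and real reads would need some tolerance for sequencing errors.

```python
# Watson-Crick complement table for DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_unique_length(genome_size: int, alphabet: int = 4) -> int:
    """Smallest k for which alphabet**k is at least genome_size."""
    k = 1
    while alphabet ** k < genome_size:
        k += 1
    return k


def find_overlap(sense_read: str, antisense_read: str, min_len: int) -> int:
    """Length of the longest exact overlap of at least min_len bases.

    The antisense read is reverse complemented onto the sense strand;
    the 3' end of the sense read is then compared with the start of
    that oriented sequence. Returns 0 when no overlap is found.
    """
    oriented = reverse_complement(antisense_read)
    longest = min(len(sense_read), len(oriented))
    for k in range(longest, min_len - 1, -1):
        if sense_read[-k:] == oriented[:k]:
            return k
    return 0


# Assumed haploid human genome size of roughly 3.1e9 bases.
print(min_unique_length(3_100_000_000))

insert = "ACGTTGCAAGCTTAGGCTAACCGGTTAGCA"
sense_read = insert[:20]                          # read inward from one end
antisense_read = reverse_complement(insert[10:])  # read inward from the other end
print(find_overlap(sense_read, antisense_read, min_len=8))
```

The short toy insert uses a minimum overlap of 8 bases purely for demonstration; for genomic data the minimum would be set at or above the uniqueness estimate.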
[0024] Sense concatemers and corresponding antisense concatemers can be prepared by amplification of double stranded circular DNAs, each containing one of the templates, using primers that hybridize to the sequencing adaptor on opposite strands (Approach A, FIGS. 2A and 2B). For example, the templates being replicated may start by being double stranded, and the double stranded circular DNAs are prepared by ligating assembly adaptors onto each of the templates; then processing the assembly adaptors so that they have staggered ends that hybridize to each other. The assembly adaptors may be bubble adaptors as described in U.S. Patent 10,954,559, whereupon the processing comprises incorporating a uracil into the assembly adaptors and cleaving one strand of the assembly adaptors using a USER (uracil-specific excision reagent) enzyme system to form said staggered ends.
[0025] Alternatively, sense and corresponding antisense concatemers can be prepared from separate complementary single stranded circular DNAs (Approach B, FIGS. 2C and 2D). If the templates being replicated are double stranded, the single stranded circular DNAs may be prepared by ligating assembly adaptors onto each of the templates; followed by separating and circularizing each strand thereof. For example, adaptors on each end of the separate strands may be brought together using a splint oligonucleotide that is complementary to oligonucleotides at the 5’ end and the 3’ end of each strand, thereby bridging the 5’ end to the 3’ end so that they can be ligated together.
[0026] Alternatively, sense and corresponding antisense concatemers can be prepared from PCR amplified single stranded DNAs (Approach C, FIGS. 2E and 2F). Each of the sense concatemers and the antisense concatemer corresponding thereto are prepared from sense and antisense strands respectively of replicated linear DNA. The replicated DNA may be separated into two reaction streams, one of which is circularized using a splint oligonucleotide that is complementary to adaptors at each end of the sense strand of the replicated DNA, the other of which is circularized using a splint oligonucleotide that is complementary to adaptors at each end of the antisense strand of the replicated DNA. The replicated linear DNA is ligated to a first adaptor with a binding site for a first oligonucleotide primer at one end and a second adaptor having a binding site for a different second oligonucleotide primer at its other end, either or both of which may be bubble adaptors. The first primer is used to selectively prime linear amplification of the sense strand, whereas the second primer is used to selectively prime linear amplification of the antisense strand.
[0027] Using any of these approaches, sense concatemers and the antisense concatemers may be pooled and sequenced together, or kept separate (for example, in separate compartments or lanes of a flow cell) and sequenced separately. Sequence reads may be obtained by primer extension, by sequencing by synthesis, by sequencing by ligation (cPAL), or by any other suitable method.
[0028] The first inserts and the second inserts replicated in the sense and antisense concatemers may have any median length that is operable in this system, such as at least 600, 800, 1000, 1200, 1500, 2000, or 3000 bases or more. The length of the sequence reads may be of any length of which the system is capable, for example, having a median length of at least 400, 500, 600, 800, 1000, 1200, or 1500 bases or more.
[0029] This disclosure also provides a technology for forming successive concatemers from front and back segments of DNA fragments above a preselected length. This technology can also be used in combination with formation of corresponding sense and antisense concatemers, as put forth above. Successive concatemers can be made by obtaining a library of DNA fragments, each comprising an insert that is a portion of the target polynucleotide, ligated with a sequencing adaptor to form a template; forming a first single stranded circular DNA (ssDNA), comprising a measured segment of DNA from one end of the insert; forming a second ssDNA, comprising a measured segment of DNA from the other end of the insert; forming concatemers from the first ssDNA and the second ssDNA, thereby obtaining successive concatemers from front and back segments of the template; and then obtaining sequence reads from said concatemers from front and back segments of the template. The measured segments may be 400, 600, 900, or 1200 bases long. The measured segments may be obtained by controlled primer extension (CPE), which comprises forming a complementary strand to a single stranded DNA by extending a primer using a DNA polymerase under reaction conditions and for a time that let the user regulate the length of the extension that is formed.
[0030] The technology of this disclosure may be implemented for any compatible purpose, such as: (1) determining at least part of the sequence of the target polynucleotide by a process that comprises assembling first sequence reads with each other and with second sequence reads; (2) identifying a base at a position in the first read of one of said sense concatemers that is not complementary to a base at the same position in the second read of the corresponding antisense strand, thereby constituting a non-call in both sequence reads at the positions of said bases; and/or (3) identifying a base at a position in the first read of one of said sense concatemers that is complementary to a base at the same position in the second read of the corresponding antisense strand but different from the base at the same position in a reference sequence, thereby identifying possible mutations in the target polynucleotide.
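Purposes (2) and (3) describe a per-position decision rule: disagreement between the two strands is treated as a non-call, while agreement between the strands that differs from the reference flags a possible mutation. The Python sketch below illustrates that rule under simplifying assumptions that are not part of the disclosure: the two reads cover exactly the same segment, the reference is given on the sense strand over the same coordinates, no alignment is needed, and the function and label names are hypothetical.

```python
# Watson-Crick complement table for DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_positions(sense_read: str, antisense_read: str, reference: str):
    """Label each position as 'match', 'no_call', or 'variant'.

    Assumption: all three sequences span the same segment and have the
    same length, so positions can be compared without alignment.
    """
    oriented = reverse_complement(antisense_read)
    if not (len(sense_read) == len(oriented) == len(reference)):
        raise ValueError("sequences must cover the same segment")

    labels = []
    for s, a, r in zip(sense_read, oriented, reference):
        if s != a:
            labels.append("no_call")   # strands disagree: purpose (2)
        elif s != r:
            labels.append("variant")   # strands agree, differ from reference: purpose (3)
        else:
            labels.append("match")
    return labels


reference = "ACGTACGT"
sense_read = "ACGAACCT"                          # differs at positions 3 and 6
antisense_read = reverse_complement("ACGAACGT")  # confirms position 3 only
print(classify_positions(sense_read, antisense_read, reference))
```

In this toy example the change at position 3 is seen on both strands and is labeled a possible variant, whereas the change at position 6 is seen on one strand only and becomes a non-call.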
[0031] This disclosure also provides an array of at least 10^6, 10^7, 10^8, or 10^9 different optically resolvable DNA nanoballs, configured for sequencing of all or part of a target polynucleotide. Each nanoball is a concatemer that comprises replicates of a template that contains an insert fragment from the target polynucleotide ligated to an adaptor. At least about 5%, 10%, 20%, 25%, one third, 40%, or nearly 50% of the nanoballs on the array are each a concatemer of a template that is complementary to the template that is replicated in at least one other nanoball on the array.
[0032] Unless specified otherwise, “complementary” in this context means that the inserts being compared match each other in both nucleotide sequence and length, as defined above. The “array” is a display of DNA nanoballs on a surface in an optically resolvable and biochemically accessible manner in any suitable arrangement, exemplified by but not limited to a grid pattern. The surface may be (for example) a flat surface on a microscope slide or in a flow cell, or a curved surface on a collection of beads. The array may be partitioned or divided: for example, in two or more lanes of a flow cell. DNA nanoballs are “optically resolvable” if they can be visualized and/or measured separately using a data capture device capable of high resolution two-dimensional pixelated measurement, such as a CCD or CMOS camera.
[0033] This disclosure further provides a method of any one or more of the following in the context of sequencing a target polynucleotide using fragments thereof replicated in DNA concatemers: obtaining longer sequence reads of said fragments; reducing the number of sequencing cycles needed to obtain a complete read of a majority of said fragments; improving accuracy of the sequencing; and/or distinguishing inaccuracies or no-calls in the sequencing from mutations in the target polynucleotide, the method comprising obtaining sequence reads of said fragments in the DNA concatemers in both directions, or from sense and antisense inserts or concatemers, as put forth above.
[0034] Also provided in this disclosure are compositions of matter made using any of the preparation methods contained therein, and their use for nucleotide sequencing. Included is a nucleic acid composition comprising a mixture of sense concatemers and antisense concatemers, each containing replicates of a portion of the target DNA to be sequenced or the complement thereof, wherein at least 30%, 50%, 70%, 80% or 90% of the sense concatemers each contain a sequence from the target polynucleotide of at least 250, 500, or 1000 bases in length that is complementary to at least part of a sequence in at least one of the antisense concatemers in the mixture. Also included is a nucleic acid composition comprising a mixture of sense concatemers and antisense concatemers, wherein inserts of the target DNA that are replicated in the sense concatemers in the mixture are each tagged with a unique molecule identifier (UMI), wherein each of at least 30%, 50%, 70%, 80% or 90% of said UMIs in the sense concatemers matches or is complementary to a UMI that is tagging an insert of the target DNA in at least one of the antisense concatemers in the mixture.
[0035] The technology of this disclosure is put forth mostly in the context of sequencing sense and antisense strands as presented in DNA nanoballs or concatemers. Aspects of the technology can be implemented for the purposes of or in the context of paired-end (PE) sequencing, mutatis mutandis. Details of PE sequencing are provided in U.S. Patent 7,767,400 and U.S. Patent 11,319,588, which are hereby incorporated herein in their entireties. Another form of PE sequencing uses multiple displacement amplification (MDA), referred to below, and described further in U.S. Patent 10,227,647.
[0036] These and other aspects of the technology are set forth in more detail in the sections that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] FIGS. 1A to 1D provide data that demonstrate the quality and utility of DNA sequence data that can be obtained using the 2xSE sequencing technology of this disclosure. FIGS. 1A and 1B are graphs of data that show Q30, a metric of DNA sequencing quality obtained from sense (S1) and antisense (S2) concatemers of human genomic DNA. The nucleotide sequence reads retain their accuracy for over 500 sequencing cycles. FIG. 1C is a graph of data showing Rho staining intensity during sequencing of barcodes in the sense and antisense strands, showing that both contribute to the number of sequence reads obtained. FIG. 1D is a stylized Venn diagram that depicts barcode overlap. The number of reads that had matching barcodes and complementary barcodes was 4,110,055, over 80% of all the reads obtained.
[0038] FIGS. 2A and 2B illustrate a method for making sense and antisense concatemers from a common double strand circular DNA (Approach A). Fragments of the target polynucleotide are end-repaired, A-tailed, and ligated on both ends with a bubble adaptor. Uracil residues are added in place of thymine residues, and then removed to create a single strand break. This generates sticky ends that hybridize to each other, forming the double strand circular DNA. Adaptors include different hybridization sites on opposing strands for primer initiated rolling circle amplification (RCA) to make sense and antisense concatemers. A unique molecule identifier (UMI) attached to each DNA fragment will be replicated in one strand with the UMI sequence and in the other strand with the complementary (cUMI) sequence. The UMI and cUMI can be used to match sequence reads from sense and antisense strands during sequence assembly.
[0039] FIGS. 2C and 2D illustrate another method for making sense and antisense concatemers from separate single strand circular DNAs (Approach B). As in FIG. 2A, fragments of the target polynucleotide are end-repaired, A-tailed, and ligated with bubble adaptors. The two strands are denatured, separated, and circularized to form single strand circular DNA. Primers hybridized to sense circular DNA and antisense circular DNA can be used concurrently, yielding a mixture of sense and antisense concatemers, or the primers can be used in separate reactions, yielding separate preparations of sense and antisense concatemers. As before, a UMI in the template will be replicated as the UMI sequence in one concatemer preparation and as cUMI in the other concatemer preparation.
[0040] FIGS. 2E and 2F illustrate a third method for making sense and antisense concatemers from separate single strand circular DNAs (Approach C). Again, fragments of the target polynucleotide are end-repaired, A-tailed, and ligated with bubble adaptors. In this approach, the fragments are amplified as ssDNA, for example, by polymerase chain reaction (PCR). The amplified ssDNA becomes the sense strand. It is hybridized using a top strand specific splint oligo, and hybridized with a primer for rolling circle replication (RCR) to make the top strand or sense concatemer. To make the bottom strand, a primer is hybridized to the 3’ end of the ssDNA and extended. The product is hybridized using a bottom strand specific splint oligo, and used to make the bottom strand or antisense concatemer.
[0041] FIGS. 3A and 3B conceptually depict two procedures by which successive concatemers are made for front and back segments of long fragments. A first concatemer is made from a first measured segment (~900 bp) from the front end of the fragment (upper pathway in FIGS. 3A and 3B). A second concatemer is made from a second measured segment (~900 bp) from the opposite end of the same fragment (lower pathway in FIGS. 3A and 3B). Each of the two measured segments can be used to make its own pair of concatemers of opposite strands, as exemplified in FIGS. 2A to 2F.
[0042] FIG. 4 shows four panels of data representing signal intensity recovery of paired end (PE500) sequencing. The technology used in this demonstration includes additional improvements to PE sequencing described in this disclosure. The data are taken from an experimental demonstration using multiple displacement amplification (MDA) to create a strand that is complementary to part of the concatemer, followed by sequencing the complementary strand. Shown are data from the final 20 min of sequencing on the antisense strand after 520 cycles of sequencing on the first strand.
[0043] FIG. 5 illustrates complementary strand making and pair-end sequencing on the DNBSEQTM platform from MGI Tech. (A) a DNB is hybridized with a primer for first-end sequencing; (B) controlled primer extension generates a plurality of complementary strands; (C) the 5’ ends are displaced by the DNA polymerase generating single stranded DNA (ssDNA) overhangs, creating a branched DNB; and (D) second-end sequencing primer on the newly created branches to generate a second-end read.
[0044] FIG. 6 is a graph that shows the percentage of reads with Q scores greater than 30 from the second strand sequencing. The percentage of reads with Q scores above 30 was around 60% after 400 cycles of the second strand sequencing.
[0045] FIGS. 7 to 8B review general background technology for sequencing a target polynucleotide using DNA concatemers or nanoballs (DNBs) .
[0046] FIG. 7 depicts how an array of concatemers is prepared. Fragments of a target polynucleotide are linked to an adaptor and circularized. Each circular DNA is amplified by rolling circle amplification to produce the DNBs, which are thereafter distributed on a surface. Sequencing can be done by hybridizing a primer to the adaptor, and extending the primer by synthesis or ligation.
[0047] FIG. 8A depicts an implementation of concatemer based sequencing that uses sequencing by synthesis and base detection using antibodies bearing fluorescent labels. FIG. 8B is a detail showing antibodies that are specific for each of the four 3’ blocked nucleotides, bearing fluorescent labels.
[0048] FIG. 9 schematically depicts an arrangement for sequencing opposing strands using four sequencing primers.
[0049] FIG. 10 shows nucleotide sequences (SEQ ID NOS. 1 and 2) of sense and antisense circles taken from an example of sequencing DNA using two primers. The poly-N subsequence in each circle represents the insert DNA fragment that is being sequenced.
[0050] FIG. 11 shows an illustrative library of DNA fragments formed by single-tube LFR on a sizing gel. Depending on the target polynucleotide, such libraries typically have a broad size range of 200 to 2000 bases in length.
DETAILED DESCRIPTION
Overview
[0051] The technology in this disclosure provides reagents and procedures for sequencing a library of DNA fragments by making DNA concatemers (nanoballs) of both the sense and antisense strands of each fragment. The sense and antisense concatemers can be made by rolling circle amplification of both strands of a double stranded circular DNA of each fragment (Approach A) or by rolling circle amplification of separate single stranded circular DNAs (Approach B). In both cases, the sense and antisense strands can be prepared together, loaded onto an array, and sequenced as part of the same mixture. Alternatively, the sense and antisense strands can be prepared, loaded, and sequenced as separate mixtures, for example, in separate lanes or compartments of a flow cell. Unique molecular identifiers (UMIs) in the sense and antisense strands can be used to match sequence reads together for purposes of sequence characterization and assembly.
[0052] FIGS. 1A and 1B illustrate 2xSE sequencing in action. The data were obtained from sense and antisense concatemers of about 1 kb of human genomic DNA. UMI barcode oligonucleotides placed at the 5' end of each template in the concatemers were sequenced first, then the genomic DNA insert in the template was sequenced through over 500 cycles. Q30 sequencing quality data are graphed for sense (S1) and antisense (S2) strands, respectively. The data demonstrate that sense and antisense concatemers prepared according to the technology of this disclosure are structured and configured for accurate and reliable long-read sequencing.
[0053] FIG. 1C shows data taken from an experiment in which barcodes of sense and antisense concatemers were sequenced concurrently in neighboring lanes. The data represent Rho staining intensity of nucleotide analogs for each of the four bases, used for sequencing of barcodes in the sense and antisense strands. The values are virtually identical, showing that sense and antisense strands contribute equally to the pool of sequence reads.
[0054] FIG. 1D depicts barcode overlap. The number of reads that had matching barcodes and complementary barcodes was 4,110,055 out of a total of 5,064,876 reads (over 80%).
[0055] The data in FIGS. 1A to 1D demonstrate that the sequencing performed well in both directions using sense and antisense concatemers, matching data obtained from opposite orientations with precision. The user can circularize and sequence both strands (and as such both sides of the molecule) without introducing substantial differences between the two strands. Over 80% of the barcodes had counterparts in both the sense and antisense concatemer preparations, providing a rich incidence of sequence complementarity and error correction.
Approach A: Make separate concatemers for each strand of double stranded circular DNA
[0056] FIGS. 2A and 2B depict a procedural scheme for making sense and anti-sense concatemers from a common double strand circular DNA.
[0057] Double stranded fragments of the target polynucleotide are end-repaired and treated with T4 Polynucleotide Kinase (PNK) to put a phosphate group from ATP on the 5’ end of both strands. The fragments are then A-tailed by treatment with Klenow exo, a modified version of DNA polymerase I with both polymerase and 3’ to 5’ exonuclease activities, but without 5’ to 3’ exonuclease activity. A bubble adaptor is then ligated onto each end in opposite orientations. This is the precursor of the assembly adaptor used to close the circular DNA. A copy of both strands is made using PfuCX polymerase, a high-fidelity DNA polymerase (a modified version of Pfu polymerase), using dUTP rather than dTTP to insert uracil residues into the copied adaptor.
[0058] The preparation is then treated with USER enzyme (a commercially available combination of uracil-DNA glycosylase (UDG) and endonuclease VIII) to remove the uracil bases, creating a single strand break. This releases a short oligonucleotide from the 5’ end of the adaptors, leaving sticky ends (single strand overhangs). Under suitable conditions, the sticky ends will hybridize to each other, forming the double strand circular DNA, which can be nick sealed using a DNA ligase.
[0059] The double stranded assembly adaptors include different hybridization sites on opposing strands for primer initiated rolling circle amplification (RCA) using Phi29 DNA polymerase, which has strand-displacing activity, high processivity, and proofreading ability. Using each of the primers in separate reactions, DNB concatemers are made of the top and bottom strand separately (the sense concatemer and the antisense concatemer). Alternatively, the primers may be used at the same time to yield a mixture of both sense and antisense concatemers. If the template being replicated contains a unique molecule identifier (UMI), it will be replicated in one strand with the UMI sequence and in the other strand with the complementary (cUMI) sequence.
[0060] Other features and illustrations of Approach A are as follows.
[0061] A double stranded DNA (dsDNA) circle is formed comprising a DNA insert (200 base+, 400 base+, 600 base+, 800 base+, or 1000 base+ in length) and an adapter with two primer binding sites and preferably with a UMI. Upon denaturing, chained ssDNA circles are formed. With optimized concentrations of two DNB-making primers (one for each strand), SSB protein, Mg++, multiple compact oligo-linkers, and other reagents, a double-DNB is formed, preferably 2x ~30 kb+, 2x ~50 kb+, or 2x ~100 kb+ in size. SSB prevents complementary DNA in the double DNB from hybridizing to form dsDNA. A sequence of at least SE200, 300, 400, 500, 700, or more bases is determined in one DNB strand of the original dsDNA next to the adapter using the corresponding sequencing primer. At the end of sequencing, further extension of those strands may be blocked by incorporation of a dideoxy or other 3’ blocked nucleotide. A similar read length is then determined in the DNB of the other strand, from the other end of the insert, using a second sequencing primer. UMI and other sample barcodes / indexes may be read in one of the strands. Both sequencing primers may be hybridized at the same time if one has a reversible 3’ block.
[0062] An advantage of the double-DNB sequencing approach is that it avoids the stochastics of finding separately prepared and arrayed complementary DNBs, especially for PCR-free libraries. A single preparation of dsDNA circles is used, compared with two preparations of ssDNA circles. Two linked DNBs with enough shorter template copies can be used for simultaneous 2c4i (two color four image) sequencing of both shorter (<200 base) reads, with somewhat reduced accuracy. The two reads are differentiated by signal intensity, wherein only ~50% or less of extendable primers are hybridized to one DNB (in each image there are about 50% of DNBs with no signal and ~50% with ~0.5x, ~1x, or ~1.5x intensities). Two consecutive reads require preserving DNBs for twice as many sequencing cycles and twice the sequencing time, as in paired end sequencing. The difference is that there is no second strand making after first strand sequencing. Second strand making is usually affected by the first strand sequencing, which makes 2x400 base+ paired end sequencing difficult, especially on longer inserts.
[0063] Recent improvements in making the second strand on long inserts (for example, 800 to 1200 bases) show good second strand making (1.5x more copies than in the first strand) after 150 bases of first strand sequencing. The improvements use a compact oligo-linker (two copies of a 15-25 base long oligonucleotide sequence linked by a spacer of 3-5 or 2 to 8 bases) complementary to a sequence in the second strand adapter during second strand making (keeping MDA-generated branches together), a random / universal primer, maximally synchronized MDA, and a longer extension time (for example, 60 min or more). This indicates that enough second strand DNA can be generated (for example, 0.5x+ relative to the first strand) after reading 400 or more bases in the first strand. Preferred features for paired 2x400 base+ or 2x600 base+ sequencing are: (1) bigger DNBs (for example, larger than 100 kb, thereby constituting a concatemer of 100 or more copies of 1 kb circles with ~800 base inserts); and (2) 2c4i (two color four image) sequencing.
Approach B: Make corresponding concatemers from two separate single stranded circular DNAs
[0064] FIGS. 2C and 2D depict a procedural scheme for making sense and anti-sense concatemers from separate single strand circular DNAs.
[0065] In a similar fashion to Approach A, fragments of the target polynucleotide are end-repaired, treated with T4 PNK, A-tailed, and ligated with bubble adaptors. In this case, the PCR product does not have sticky (cross-hybridizing) ends. Instead, the two strands are denatured and separated. Splint oligonucleotides having sequences that are complementary to both the 5’ end and 3’ end of each fragment are used to form single strand circular DNA, which is then closed by ligation. The adaptors have hybridization sites for primer initiated rolling circle amplification (RCA) in either direction, depending on which strand is being replicated. A UMI may be positioned in the adapter with sequencing primer binding sites of >20 bases, >30 bases, or >40 bases on both sides. RCA is conducted using a strand displacing polymerase such as Phi29 DNA Polymerase. The two primers can be used concurrently, yielding a mixture of sense and antisense concatemers, or the primers can be used in separate reactions, yielding separate preparations of sense and antisense concatemers. As before, a UMI in the template will be replicated as the UMI sequence in one concatemer preparation and as cUMI in the other concatemer preparation.
[0066] A variant of this procedure is Approach B2: Two separate DNBs are prepared from circular dsDNA with a suitable adapter (preferably including a UMI) in the same manner as Approach A. Complementary DNBs are prepared in separate reactions using respective DNB-making primers after denaturing the circular dsDNA. An advantage of Approach B2 is that substantially every DNA circle has a complementary circle. This increases the efficiency of sequencing from both strands of a dsDNA, which is especially useful for sequencing a non-amplified DNA fragment library (for example, a PCR-free WGS library). Another option is to use a plurality of circular ssDNAs to make complementary circular ssDNAs. DNBs are then made in two separate reactions, one from the original circular ssDNA, and the other from complementary circular ssDNA.
Approach C: Make concatemers from single stranded linear DNA
[0067] FIGS. 2E and 2F depict a procedural scheme for making sense and anti-sense concatemers from PCR amplified linear DNA.
[0068] In 2xSE sequencing, multiple copies of a DNA molecule are sequenced. Approach C uses PCR type amplification to generate the multiple copies of the DNA fragments being sequenced. PCR amplification is done using linear DNA, which helps avoid clonal errors. The user starts by determining the approximate number of ~1 kb fragments (e.g., 0.1 to 1.1 kb) of the genome or target DNA being sequenced that is needed for sufficient coverage (typically at least 10x, preferably at least 20x). Linear amplification is performed to obtain sufficient replicates (10 to 30 to 100 times the original preparation) using one PCR primer, thereby generating ssDNA.
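The fragment count described above can be estimated with simple arithmetic. The following Python sketch is illustrative only: the function name and the 3.1 Gb genome size used in the example are assumptions, not values taken from this disclosure.

```python
import math


def fragments_needed(target_size_bases: int, fold_coverage: float,
                     fragment_length_bases: int = 1000) -> int:
    """Approximate number of fragments required to reach a given fold coverage.

    Assumes every fragment is sequenced in full and fragments are spread
    evenly over the target (an idealization).
    """
    return math.ceil(target_size_bases * fold_coverage / fragment_length_bases)


# Example: an assumed ~3.1 Gb genome covered with ~1 kb fragments.
for coverage in (10, 20):
    print(coverage, fragments_needed(3_100_000_000, coverage))
```

Under these assumptions, 20x coverage corresponds to about 62 million ~1 kb fragments before any amplification.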
[0069] To generate sense and antisense concatemers, the PCR product is divided into two reaction streams. One is used directly as a ssDNA template to make a concatemer for DNA sequencing. The other reaction converts ssDNA into dsDNA by making complementary strand with a second primer, which can be amplified further before making a concatemer of the complementary sequence. Each molecule is preferably sequenced 2-4 times in each reaction (each strand) . ~1 kb sequences are assembled, providing an opportunity for correcting for amplification and sequencing errors. The assembled ~1 kb sequences can be used to detect genetic variants or new sequences in the sample DNA. For 2xSE 800 DNA molecules, 1.2-1.5 kb fragments may be used.
[0070] To match sense and antisense concatemers, a single UMI bar code can be used in forward and reverse orientations. The dsDNA adapter has both complements of the UMI (one for each strand) . Alternatively, the UMI can be installed by 3’ extension after ligating a ssDNA adapter containing the UMI to 5’ end of dsDNA.
[0071] Approach C may be implemented by any suitable experimental protocol, illustrated by but not limited to the following.
[0072] Double stranded fragments of the target polynucleotide are end-repaired and treated with T4 Polynucleotide Kinase (PNK) to put a phosphate group on the 5’ end of both strands from ATP. The fragments are then A-tailed by treatment with Klenow exo, a modified version of DNA polymerase I with both polymerase and 3’ to 5’ exonuclease activities, but without 5’ to 3’ exonuclease activity. A bubble adaptor is then ligated onto each end in opposite orientations.
[0073] The adapter ligation reaction is purified using SPRI bead purification, and the adapter-ligated DNA molecules are then directly subjected to one-primer linear amplification. To produce the single-stranded top strand product, primer 1 is annealed to the primer-binding sequence of the bottom strand in the adaptered genomic fragments, and the primer is extended by a DNA polymerase possessing 3’-exonuclease activity. The number of cycles of linear amplification can vary from 10 to 100, depending on the amount of input DNA template. The efficacy of DNA polymerase synthesis can be controlled by using one or more DNA polymerases with a suitable polymerization rate and processivity, and by additives: for example, DMSO, formamide, betaine, or single-stranded binding proteins (SSBs).
[0074] The product of the one-primer linear amplification is purified using solid phase reversible immobilization (SPRI) magnetic beads. Amplicons are divided into two reactions. The first half of the amplicons (“Reaction A”) are denatured, and “sense” splint oligonucleotides having sequences that are complementary to both the 5’ end and 3’ end of the top strand of the DNA fragment are used to form single strand circular DNA, which is then closed by ligation. The ssDNA circles are purified with SPRI beads and “sense” RCR primers are hybridized. Rolling circle amplification (RCA) to make sense concatemers is then conducted using a strand displacing polymerase such as Phi29 DNA Polymerase.
[0075] To produce bottom-strand DNA circles (“Reaction B”), the second half of the amplicons are subjected to one-primer linear amplification with a primer 2 that is complementary to the primer-binding sequence of the top strand in the adaptered genomic fragments. The number of cycles of linear amplification can vary from 1 to 100, depending on the amount of input DNA template. As in Reaction A, the efficacy of DNA polymerase synthesis can be controlled by using one or more DNA polymerases with a suitable polymerization rate and processivity, and by additives: for example, DMSO, formamide, betaine, or single-stranded binding proteins (SSBs).
[0076] Next, the product of the one-primer linear amplification is purified with SPRI beads. Amplicons are denatured, and anti-sense specific splint oligonucleotides having sequences that are complementary to both the 5’ end and 3’ end of the bottom strand of the DNA fragment are used to form single strand circular DNA, which is then closed by ligation. The ssDNA circles are purified with SPRI beads and anti-sense RCR primers are hybridized. RCA is conducted to make anti-sense concatemers using a strand displacing polymerase such as Phi29 DNA Polymerase.
[0077] Sense and antisense concatemers are then sequenced and matched together for assembly of the sequence reads.
Matching sequence reads from corresponding sense and antisense concatemers using complementary UMIs
[0078] Each DNB is a concatemer generated from a circular DNA comprising an adapter. The adapter may contain a UMI (a unique molecule identifier) oligonucleotide sequence or barcode. Neighboring subfragments of a target DNA to be sequenced can be labeled with the same UMI (or complement thereof, cUMI) to assist in assembly of sequence reads. A UMI differs from other types of oligonucleotide bar codes that can also be present: for example, bar codes that label DNA to identify a single cell or a particular sample source when the target polynucleotide is prepared from a plurality of cells or sources.
[0079] Within an adaptor, a UMI can be surrounded with binding sites for sequencing primer complementary strands. To sequence in the opposite direction, complementary UMIs can be circularized and sequenced with corresponding primers. Reads that have complementary UMIs come from the same original dsDNA or ds-cDNA molecule from each end by reading complementary strands. For each strand, reads are grouped by the same barcode and between strands by the complementary barcode.
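The grouping rule in the preceding paragraph (same barcode within a strand, complementary barcode between strands) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this disclosure. It assumes that each read has already been split into an uppercase barcode and an insert sequence, and that the cUMI is read as the reverse complement of the UMI; the function names are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def pair_reads_by_umi(sense_reads, antisense_reads):
    """Link sense and antisense reads that carry complementary UMIs.

    Each input is an iterable of (barcode, insert_sequence) tuples.
    Returns {umi: (sense_inserts, antisense_inserts)} for every UMI
    found in both strands.
    """
    sense_groups = defaultdict(list)
    for umi, insert in sense_reads:
        sense_groups[umi].append(insert)

    antisense_groups = defaultdict(list)
    for cumi, insert in antisense_reads:
        # Key antisense reads by the UMI their cUMI corresponds to
        # (assumption: cUMI is the reverse complement of the UMI).
        antisense_groups[reverse_complement(cumi)].append(insert)

    shared = sense_groups.keys() & antisense_groups.keys()
    return {umi: (sense_groups[umi], antisense_groups[umi]) for umi in shared}
```

Reads whose barcode has no counterpart in the other strand are simply left unpaired in this sketch; a practical pipeline would also need to tolerate sequencing errors in the barcode itself.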
[0080] This is possible for a large population of molecules because MPS can provide millions to billions of reads per array (lane): for example, the DNBSEQ-G800™ Sequencer from MGI Tech. provides over 400 million high quality SE600+ reads in each of four lanes, generating 3.2 billion reads per run on two flow cells. A population of ~100 to 200 million ~1 kb DNA molecules with UMI (or a length / fraction in the 500 base to 1500 base range) can be amplified by PCR and each strand loaded in a lane of a DNBSEQ-G800 or other sequencer, generating 2x to 4x read coverage per lane per strand. This allows ~80% of molecules to be read in both strands, generating linked reads through complementary UMIs (cUMI-paired reads).
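The read-count and pairing figures above can be checked with a short calculation. The sketch below assumes, purely for illustration, that reads sample molecules at random (Poisson statistics) and that the two strands are sampled independently; neither assumption is stated in this disclosure, and the function names are hypothetical.

```python
import math


def reads_per_run(reads_per_lane: int, lanes_per_flow_cell: int,
                  flow_cells: int) -> int:
    """Total reads per run across all lanes and flow cells."""
    return reads_per_lane * lanes_per_flow_cell * flow_cells


def fraction_read_in_both_strands(mean_coverage_per_strand: float) -> float:
    """Expected fraction of molecules with at least one read in each strand.

    Poisson model: P(strand read at least once) = 1 - exp(-coverage);
    the two strands are treated as independent.
    """
    p_strand = 1.0 - math.exp(-mean_coverage_per_strand)
    return p_strand ** 2


print(reads_per_run(400_000_000, 4, 2))
for coverage in (2.0, 3.0, 4.0):
    print(coverage, round(fraction_read_in_both_strands(coverage), 3))
```

Under this model, 2x to 4x coverage per strand gives roughly 75% to 96% of molecules read in both strands, which brackets the ~80% figure given above.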
[0081] For example, a large population of full length 16S ribosomal RNA molecules (~1500 bases) may be sequenced using the 2xSE sequencing technology of this disclosure in a G800 sequencer. Sequencing each molecule 2x to 5x, 3x to 10x, or more can in most cases be used to correct for sequencing errors in individual reads. The redundant reading enables UMI and cUMI linking of most reads and high accuracy for 0.5 kb+, 1 kb+, or 1.5 kb+ DNA molecules.
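One simple way redundant reads can correct individual read errors is a per-position majority vote over reads that share a UMI. The Python sketch below is an illustration under strong assumptions: the reads are in the same orientation, start at the same position, and contain substitution errors only (no insertions or deletions). The function name is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Per-position majority base for reads from one UMI group.

    Positions without a strict majority are reported as 'N'.
    """
    if not reads:
        return ""
    length = max(len(read) for read in reads)
    bases = []
    for i in range(length):
        column = [read[i] for read in reads if i < len(read)]
        base, count = Counter(column).most_common(1)[0]
        bases.append(base if 2 * count > len(column) else "N")
    return "".join(bases)


print(consensus(["ACGT", "ACGA", "ACGT"]))
```

With three reads, a single substitution error at any position is outvoted; with only two discordant reads the position stays ambiguous.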
[0082] For the UMIs and cUMIs to be mostly unique, 10x+, 30x+, or 100x+ more UMI barcodes are used than there are DNA molecules or strands per reaction (or per barcoded sample). For 50 million barcoded dsDNA molecules, there are 100 million strands, and thus about 1 billion UMI barcodes are needed. This is achieved by N15+ barcodes (an oligonucleotide of 15 bases); N17 provides 160x more UMIs than are used in 100 million strands (50 million UMIs for 50 million dsDNA molecules x 2 complements).
Positional (cLFR) barcoding for sequencing 2 to 10 kb sized DNA by 2xSE600
[0083] Positional cLFR (pcLFR) using CPE (controlled primer extension) may be done as follows. Each sample is split into different aliquots and each aliquot is extended to different lengths using CPE. After CPE, each aliquot is ligated to an adapter that contains a barcode unique to that aliquot (positional barcode) . The positional barcode is the same for all DNA molecules in that aliquot (it is not a UMI) . Information from the positional barcode can then be used to orient reads in order during de novo assembly. This can be especially useful for regions with repetitive sequence.
[0084] For Sanger-size 2xSE reads, fewer aliquots are needed with longer extensions (for example, ~700 base+) and a larger insert size of ~1 kb+. For 2xSE sequencing, both the UMI and the positional barcode are included in both SE reads. By having steps (other than the first CPE) shorter than the insert size, an overlap is created between consecutive sequences: 300 bases on average in this example (1000 bases minus 700 bases). Another example is 1 kb read length and 1.2 kb inserts. For longer DNA molecules (for example, >2 kb or >3 kb in length), two, three, or more than three CPE reactions are done from each strand of starting dsDNA (usually prepared by PCR). To link to the UMIs, CPE reactions on the complementary strand are performed after DNA circularization.
[0085] Using both strands results in more precise shorter CPE reactions on each strand (for example, ~2.5 kb instead of ~5 kb for 4 to 5 kb long DNA molecules). Reactions with longer CPE may have more starting DNA to compensate for lower circularization efficiency, thus providing more even read coverage for 3 kb+ DNA. In addition, DNA can be size-selected using CPE to minimize inserts shorter than a desired length: for example, for CPE of ~2 kb use bead size selection to deplete <1.5 kb or <1.7 kb fragments. Another benefit of pcLFR (especially with 2xSE600+ or PE500+) is that de novo assembly is simpler with 2-10 kb sequence reads (using fewer longer shifted sequences). Each positional sequence segment of each molecule is assembled first (grouping reads with the same UMI and positional barcode per strand), and then complementary strands are compiled using UMIs, or by finding sequence overlaps. Forward and reverse SE reads with complementary UMI + positional barcodes do not need to be from the complements of the same insert. They can come from any of the inserts made from molecule copies with a unique UMI, but sharing a positional barcode. Positional sequence segments are then combined at overlapping regions to assemble full sequences of each molecule.
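The first assembly step described above, grouping reads by UMI and positional barcode and then ordering the segments of each molecule, can be sketched as follows. This is an illustrative outline only. It assumes that positional barcodes have already been decoded to integers that increase with CPE extension length; that encoding and the function name are hypothetical.

```python
from collections import defaultdict


def group_positional_segments(reads):
    """Group reads by molecule (UMI) and positional barcode.

    reads: iterable of (umi, positional_barcode, sequence) tuples for one strand.
    Returns {umi: [(positional_barcode, [sequences]), ...]} with the segments
    of each molecule ordered by positional barcode.
    """
    by_molecule = defaultdict(lambda: defaultdict(list))
    for umi, position, sequence in reads:
        by_molecule[umi][position].append(sequence)
    return {umi: sorted(segments.items())
            for umi, segments in by_molecule.items()}
```

Each ordered list can then be handed to a per-segment consensus step and an overlap-based merge to reconstruct the full molecule.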
[0086] These and all other DNA libraries described in this document are amenable to automation in typical multi-well plates (such as 96-well plates) using instruments that can pipet reagents, move plates, perform purifications or hybrid-capture on magnetic beads, and perform enzymatic reactions including PCR at temperature-controlled stations.
Alternative and additional barcoding and sequencing technologies
[0087] In some circumstances, just sequencing more bases of longer DNA molecules from opposite ends without UMI is useful for matching sense and antisense reads: for example, in fragments prepared using single tube LFR labeling (stLFR), or for whole genome sequencing (WGS) with smaller amounts of DNA. DNA generated by random fragmentation usually has a wide range of sizes: for example, from 200 to 2000 bases. Instead of isolating a single narrower fraction, multiple size fractions may be generated. For example, 200 to 600 base fragments can be sequenced by a single SE600 read, and 600 to 2000 base fragments by 2xSE600 reads from either end. Shorter fragments will generate overlapped reads and longer ones will have a gap. Even 2000 base fragments will have 60% of bases read by 2xSE600.
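The relationship between fragment length, overlap, gap, and fraction of bases read can be worked through with a few lines of Python. The function names are illustrative, and the sketch assumes two reads of equal length taken from opposite ends of the fragment.

```python
def overlap_or_gap(fragment_length: int, read_length: int = 600) -> int:
    """Bases shared by the two end reads (positive) or left unread (negative)."""
    effective = min(read_length, fragment_length)  # a read cannot exceed the fragment
    return 2 * effective - fragment_length


def fraction_covered(fragment_length: int, read_length: int = 600) -> float:
    """Fraction of the fragment's bases read by at least one of the two end reads."""
    effective = min(read_length, fragment_length)
    return min(fragment_length, 2 * effective) / fragment_length


for length in (400, 1000, 1200, 2000):
    print(length, overlap_or_gap(length), fraction_covered(length))
```

For 2xSE600, a 1000 base fragment has a 200 base overlap, a 1200 base fragment is read exactly once end to end, and a 2000 base fragment has an 800 base gap with 60% of its bases read.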
[0088] Sequencing can also be done using one primer for both the UMI and the insert. This may be done for shorter DNA sequencing, where part of the read can be used for the UMI and adapter. Also, one X-primer may be enough because higher DNA loss may be tolerated in shorter reads.
[0089] Sequencing UMI and a DNA insert with two different primers may be preferable for longer reads, for example 2xSE700 base+. Two X-primers (or a Z-primer or combination thereof) are used per strand. UMI (and preferably sample index) sequencing primer is used first and the read is blocked by dideoxy or removed. Then an insert sequencing primer is used. Both primers may be hybridized at the same time if the insert sequencing primer is reversibly blocked (for example by 3’ phosphate) .
[0090] DNA libraries can also be prepared with UMIs using standard PCR. Enough copies are made so that, after circularization, DNB loading, and x-fold sequencing (for example, 3x for each strand), most of the DNA molecules are sequenced from both ends in complementary strands.
[0091] A PCR-free library can be prepared with UMI as follows. In one method, genomic DNA fragments of 800 base+ in length are ligated with an adapter with UMI (adA) and one regular adaptor (adB). About 50% of the DNA has the proper adA-DNA-adB structure. Adapters may be ligated only to the 5’ ends of the DNA (the helper strand of the adapter is not phosphorylated). The 3’ ends of the DNA may be extended by polymerase to copy the ligated adapters including the UMI and form blunt-end dsDNA. Both strands are circularized in the same tube by adding one bridge oligonucleotide first, then the other in higher concentration (or BP before second bridge), or hybrid capture strand separation is used with circularization in separate tubes. High efficiency circularization is preferred to get a larger fraction of DNA fragments sequenced from both ends.
[0092] Sequencing both strands on the same array can be done by hybridizing the primer(s) for one strand, washing unused primers away, and then hybridizing the primers for the second strand, so that complementary primers do not hybridize to each other. Longer adapters with less or no primer complementarity may be used, allowing all primers to be hybridized together.
[0093] Human PCR-free WGS (whole genome sequencing) with 800 to 1400 base long sequences (2xSE500 to 2xSE800+) may provide better sequencing in repeated, low complexity, or highly diverse regions, including more efficient detection of SVs. To have enough 2xSE reads from the same molecule (reading complementary UMIs), a controlled amount of input DNA may be used with a demonstrated process that loads most of the DNBs (>80% or >90%). For example, 60x coverage in circularizable dsDNA provides 80x in loaded DNBs (each strand counted separately) and ~40x sequence coverage (sequencing 60% of each strand on 80% of good reads), with ~15x+ in UMI-paired reads.
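The ~40x figure in this example follows from multiplying the loaded DNB coverage by the two stated fractions. The short Python sketch below reproduces that arithmetic; the function and parameter names are illustrative.

```python
def sequence_coverage(loaded_dnb_coverage: float,
                      fraction_of_strand_read: float,
                      fraction_good_reads: float) -> float:
    """Sequence coverage obtained from loaded DNB coverage and two yield fractions."""
    return loaded_dnb_coverage * fraction_of_strand_read * fraction_good_reads


# 80x in loaded DNBs, 60% of each strand read, 80% good reads.
print(round(sequence_coverage(80, 0.6, 0.8), 1))
```

The product is 38.4x, which rounds to the ~40x stated above.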
[0094] A DNA library prepared this way may be amplified (for example, by PCR). The process will create two sets of amplicons, each originating from one of the two complementary strands but having the same UMI. As in duplex UMI coding (using, for example, a duplex UMI Universal Library Prep Set), sequence consensus of reads from the two sets allows elimination of (i) early PCR errors, which are unlikely to be the same in each strand, and (ii) DNA damage in one strand (for example, C-deamination creating C to T / G to A mutations). A duplex UMI coding approach is also usable for 2xSE sequencing provided that both half-UMIs are sequenced two times in addition to each SE read (as part of a longer SE read or using two or three separate primers). In the duplex UMI approach using special Y-adapters with UMI, about 2x more molecules are amplifiable than in the above described method using two distinct adapters, one with UMI and one without UMI.
[0095] Using a Y or bubble adapter with a UMI code on one adapter strand, each strand of the original dsDNA is treated as an independent molecule with a distinct UMI. PCR created amplicons also treat each original DNA strand as a separate molecule. If the goal is a population representation, like 1500 base 16S sequencing in microbiome samples, the UMI for 2xSE800 sequencing can be added after standard PCR.
Making segmental concatemers from the front and back segments of long fragments
[0096] Another aspect of this disclosure is the idea of using separate concatemers to sequence independently the front and back segments of long fragments. This can be beneficial when the insert fragment being sequenced is beyond the reliable length of sequence reads. Concatemers for either the front or back portion can be made, for example, by using controlled primer extension (CPE) of the template to generate a defined portion of suitable length. A primer can then be introduced, for example, by branch ligation that is positioned to initiate sequencing in the opposite direction, starting part way through the template. The size of the segments can be chosen so that there is a region of overlap, which will facilitate assembly of the sequence reads.
[0097] FIGS. 3A and 3B depict two procedures for preparing segmental concatemers. The procedures can be applied, for example, to a sample of double stranded DNA fragments made from a biological sample that contains 16S ribosomal RNA. The 16S rRNA gene sequences contain conserved regions and regions that are hypervariable between species, which can provide species-specific signature sequences useful for identification of bacteria and archaea.
[0098] FIG. 3A depicts one procedure for making segmental concatemers. The sample is divided into two aliquots, which are used for processing each of the strands separately. The upper pathway shows the procedure for obtaining a concatemer of the first 900 bp segment of the sample. The two strands are denatured. The bottom strand is hybridized at its 3’ end with a first primer. The bottom strand is copied 5’ to 3’ by primer extension under controlled conditions (CPE) until about the first 900 bp are copied. At that point, a branch is introduced by hybridizing a second primer to the 3’ end of the DNA strand just made and extending the second new strand in the opposite direction. This yields a double stranded DNA that can be amplified by PCR and made into a single stranded circular DNA using a splint or bridge oligonucleotide. The ssDNA is used to make a first concatemer that comprises the first ~900 bp segment of the original fragment.
[0099] Meanwhile, starting with the second aliquot (lower pathway) , the top strand is recovered and circularized to get to the other end of the fragment. The aliquot is subject to CPE for ~900 bp, and then subject to branch ligation. The product is optionally amplified by PCR and circularized. The ssDNA is used to make a second concatemer that comprises the terminal ~900 bp segment of the original fragment.
[0100] FIG. 3B depicts an alternative procedure for making segmental concatemers. Processing of the bottom strand (upper pathway) is the same as in FIG. 3A, yielding a first concatemer that comprises the first ~900 bp segment of the original fragment. However, the order of operations for the top strand (the lower pathway) is different. Following full extension of the first primer, branch ligation is performed with a second primer. This introduces a branch at or near the opposite end of the original fragment. CPE is used to make copies of the first ~900 bp from the back end, which is then branched, amplified, and circularized. This generates a second concatemer that comprises ~900 bp from the back end of the original fragment.
[0101] Using either procedure, sequence reads are obtained from opposite ends of the fragment in opposite directions. Depending on the total length of the insert, there may be an overlapping region in the middle.
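The relationship between insert length, segment length, and the overlapping region can be illustrated with a short calculation. The following is a minimal sketch only: the function name `read_geometry` and the example insert lengths of 1500 and 2000 bases are assumptions made for illustration, and the ~900 base segment length is taken from the description of FIGS. 3A and 3B.

```python
def read_geometry(insert_len: int, read_len: int) -> dict:
    """Overlap, gap, and covered bases when an insert is read inward
    from both ends for read_len bases each (illustrative model only)."""
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    front_end = min(read_len, insert_len)      # last base reached from the front
    back_start = max(insert_len - read_len, 0)  # first base reached from the back
    gap = max(0, back_start - front_end)
    return {
        "overlap": max(0, front_end - back_start),
        "gap": gap,
        "covered": insert_len - gap,
    }


# Hypothetical inserts processed as two ~900 base segments:
print(read_geometry(1500, 900))  # overlap of 300 bases in the middle, no gap
print(read_geometry(2000, 900))  # no overlap, 200 bases in the middle unread
```

Under this simple model an overlap exists whenever twice the segment length exceeds the insert length; otherwise the two reads are separated by an unread middle region.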
[0102] The making and sequencing of segmental concatemers can be used on its own or in combination with 2xSE sequencing as described above. When used together, at least four concatemers are formed: Sense and antisense concatemers from the front segment of the insert, and sense and antisense concatemers from the back segment of the insert.
[0103] Advantages of making segmental concatemers over traditional PE sequencing include the following:
· 2xSE sequence reads may be longer (perhaps 700 bases, compared with PE reads of usually <300 bases) and may be more efficiently performed on longer inserts (such as >1 kb, >1.5 kb, or >2 kb inserts);
· about half the number of consecutive sequencing cycles are used, so that the total time needed for sequencing is roughly halved;
· UMI (both the bar code sequence and its complement) can be incorporated into segmental concatemers to facilitate assembly; and
· SE sequencing allows long reads from the same molecule to be linked, providing easier full-length assembly compared with cLFR using typical PE150 sequence reads. For example, DNA molecules that are 1 to 1.5 kb in length can be fully sequenced by assembling just two overlapped SE reads.
Multiple displacement amplification (MDA) for paired end (PE) sequence reads
[0104] Once concatemers are formed and distributed on a surface, sequence reads may be obtained directly from the concatemer. Alternatively or in addition, complementary branches may be formed to sequence in the other direction. Exemplary technology for this approach is controlled multiple displacement amplification (MDA), as described in U.S. Patent No. 10,227,647. After the first read is generated on DNBs, extended products (optionally using an additional primer) are further extended using natural unblocked nucleotides in a controlled and sufficiently synchronized way by a strand displacement polymerase such as Phi29. The process generates single-stranded (ss) DNA branches that are complementary to the original DNBs and still bound to the DNBs through regions that are not displaced.
[0105] FIG. 5 is a conceptual diagram that illustrates complementary strand making and paired-end (PE) sequencing by MDA (multiple displacement amplification). (A) A DNA nanoball (DNB), as a concatemer containing copies of adaptor sequence and inserted genomic DNA, is hybridized with a primer for the first-end sequencing. (B) After generating the first-end read, controlled, continued extension is performed by a strand displacing DNA polymerase to generate a plurality of complementary strands. (C) When the 3’ ends of the newly synthesized strands reach the 5’ ends of the downstream strands, the 5’ ends are displaced by the DNA polymerase, generating ssDNA overhangs and creating a branched DNB. (D) A second-end sequencing primer is hybridized to the adaptor copies in the newly created branches to generate a second-end read. (C) and (D) are exemplary drawings. In other embodiments, not all branches are generated and / or the branches can be different lengths.
[0106] MDA and subsequent sequencing of the antisense branches is a form of paired-end (PE) sequencing. A sequence read is taken along the concatemer strand to a point near the branch, creating the complementary strand in the process. Sequencing from about the branch point along the complementary strand constitutes a read in the opposite sense that is paired with the first read.
[0107] MDA is done by making a complementary copy of part of an insert in a concatemer as follows:
1. Unblocking of the extended sequencing strand
2. Hybridization of additional MDA primers to the first strand
3. Binding of polymerase to 3’ extendable DNA ends
4. Polymerization to form the second strand from 3’ primer ends
[0108] FIG. 4 is a measure of signal intensity recovery, taken from an experimental demonstration of this technology. Rho intensity values represent aggregate DNB intensity values for DNBs of the same base call. Four lanes are represented in each panel. The figures show the final 20 min of sequencing on the antisense strand after 520 cycles of sequencing on the first strand. MDA was performed on first strand sequenced templates, with inclusion of consensus primers and linker oligonucleotide in the extension reaction step. Phi29 was bound to the first strand sequencing product and primer by incubating at 20℃ for approximately 2.5 min, followed by a wash step for 2 min, and then extension for 60 min.
[0109] TABLE 1 shows intensity values at selected cycles and the estimated recovery ratio of second strand intensities to first strand intensities. Average Rho intensity is shown between cycles 6 to 10 and cycles 26 to 30 for each of the four bases of one lane. The aggregate values of all fields are shown. Differences between lanes may represent different fluid flows and reaction conditions.
[0110] FIG. 6 shows the percentage of reads with Q scores greater than 30 from the second strand sequencing. Before the 370 bases of second strand sequencing shown, 520 cycles of first strand sequencing were performed, followed by second strand making on the flow cell and an initial 30 cycles of second strand sequencing. The aggregated results from all fields of a lane are shown. The percentage of reads with Q scores above 30 was around 60% after 400 cycles of the second strand sequencing.
[0111] PE sequencing differs from 2xSE sequencing in several respects. For 2xSE sequencing, typically (1) both strands are prepared before sequencing begins; (2) each of the two strands is presented as a separate concatemer; and (3) in typical embodiments, the inserts in the sense and the antisense concatemers are approximately the same length.
Underlying and related technology: concatemer based nucleic acid sequencing
[0112] Underlying the advances provided in this disclosure is the background science of DNA nanoball (DNB) sequencing. This section provides a brief review.
[0113] Each fragment insert of the target DNA to be sequenced is ligated to an oligonucleotide adaptor to form a template, which is then replicated linearly to form a concatemer. The concatemer is then contacted with a sequencing primer and a DNA polymerase. The primer hybridizes to the adaptor at a site adjacent to the insert. The DNA polymerase extends the primer one base at a time depending on the sequence of the insert. Using labeled and blocked primers, each nucleotide added can be determined iteratively using labeled antibodies, following which the antibody is removed and the nucleotide is unblocked so that the next nucleotide can be added and determined.
[0114] FIG. 7 depicts how an array of concatemers is prepared. The user obtains fragments of a target DNA molecule for sequencing: genomic DNA from cells, reverse-transcribed RNA, or DNA from any other source. A circular DNA is produced with the ends of each fragment joined together via an adaptor sequence. The DNA being sequenced is referred to as an insert, which in combination with the adjacent adaptor constitutes a template.
[0115] Each circular DNA is amplified by rolling circle amplification to produce the concatemer. The concatemer assumes a substantially spherical shape, known as a DNA nanoball (DNB). Concatemers of a plurality of the templates are distributed on a surface, such as by random distribution on a patterned surface of DNA binding regions, thereby forming a patterned array (U.S. Patent 9,944,984). Sequencing can be done by hybridizing a primer to the adaptor, and extending the primer by synthesis or by ligation to form an extension product that is complementary to the insert being sequenced.
[0116] The adaptor is effectively a tool kit for manipulation and analysis of the insert during set-up and sequencing. There is typically a hybridization site for a sequencing primer at or near one or both ends of the adaptor, which anchors the primer extension product produced in the course of sequencing by synthesis or by ligation. There may be a hybridization site for an oligonucleotide that anchors the concatemer to a surface. There may be a binding or recognition site for a sequence specific endonuclease. There may be a hybridization site for a barcode oligonucleotide that is sequenced concurrently with the fragment insert. There may also be hybridization sites for structural oligonucleotides that bridge between adaptors in the same concatemer.
[0117] FIG. 8A depicts an implementation of concatemer based sequencing that uses sequencing by synthesis and base detection using antibodies bearing fluorescent labels. U.S. Patent 10,851,410; US 2022 / 0162693 A1. Primer extension products are extended base-by-base with unlabeled reversibly terminated nucleotides. The identity of each base added is determined in each cycle of sequencing using antibodies that are specific for each of the four 3’ blocked nucleotides (FIG. 8B) . Removal of the bound antibodies and 3’ blocking moieties on the sugar groups of the nucleotide removes the label and regenerates natural nucleotides with no scar on the base.
[0118] The feature of reversion to a natural nucleotide allows further extension of the strand in a new cycle of sequencing without any interference from the prior cycle. Unlabeled RTs are easier and less costly to make, and they can be incorporated more efficiently. The antibodies can carry multiple labels, amplifying the sequencing signal compared with single dye molecule per base on standard labeled RTs. S. Drmanac et al., “CoolMPS” , bioRxiv preprint, 2020.
[0119] Recent advancements in concatemer-based polynucleotide sequencing include the following:
[0120] U.S. Patent 10,351,909 describes single molecule arrays for genetic analysis. U.S. Patent Nos. 10,125,392 and 11,389,779 describe long read fragment (LFR) nucleic acid analysis by barcoded random mixtures of non-overlapping fragments. U.S. Patent No. 10,557,166 describes multiple tagging of individual long DNA fragments. Granted European patent EP 3790967 B1 describes single tube bead-based DNA co-barcoding for accurate and cost-effective sequencing, haplotyping, and assembly.
[0121] Publication US 2021 / 0189483 A1 describes controlled strand-displacement for paired end or single end sequencing. U.S. Patent No. 7,767,400 describes paired-end reads in sequencing by synthesis. European patent EP 4121554 B1 describes restoring phase in massively parallel sequencing. U.S. Patent No. 10,954,559 describes bubble-shaped adaptor elements for constructing a sequencing library. US 2024 / 0240174 A1 describes nick-ligate single tube long fragment read (stLFR) sequencing. US 2024 / 00423924 A1 describes the determination of long DNA sequences using short MPS reads. US 2024 / 0279644 A1 describes template mutagenesis for improved assembly of sequence reads. Application WO 2024 / 022207 describes methods of in-solution positional co-barcoding for sequencing long DNA molecules
[0122] Selective DNA amplification from complex genomes using universal double-sided adapters is described by Callow, M, et al., Nucleic Acids Research, 2004, Vol. 32, No. 2, e21. Drmanac, R. et al. describes human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327 (5961) : 78-81 (2010) . Additionally, Peters, B., et al. describe accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature, 487: 190-195 (2012) .
[0123] Levy, S. et al. provide a general review of Advancements in Next-Generation Sequencing. Annu. Rev. Genom. Hum. Genet. 2016.17: 95-115. Drmanac, S. et al. describe CoolMPS, a method of advanced massively parallel sequencing using antibodies specific to each natural nucleobase. bioRxiv February 2020, DOI: 10.1101 / 2020.02.19.953307. Wang, Q. Drmanac, R. et al. describe efficient and unique co-barcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Research 2019, 29(5) : 798-808. Hahn, O., et al. describe CoolMPS for robust sequencing of single-nuclear RNAs captured by droplet-based method. Nucleic Acids Research, 2021, Vol. 49, No. 2, e11.
[0124] Wang, L. et al. describe 3’ branch ligation as a novel method to ligate non-complementary DNA to recessed or internal 3’ OH ends in DNA or RNA. DNA Research, 2019, 26 (1), 45-53. Mao, Q. et al. describe whole genome sequences and experimentally phased haplotypes of over 100 personal genomes. GigaScience (2016) 5: 42. Ciotlos, S., Drmanac, R. et al. describe whole genome sequence analysis of BT-474 using Complete Genomics’ standard and long fragment read technologies. GigaScience (2016) 5: 8.
[0125] Peters, B. et al. describe co-barcoded sequence reads from long DNA fragments: a cost-effective solution for “perfect genome” sequencing. Front. Genet. January 2015, Vol. 5, Article 486; doi: 10.3389/fgene.2014.00466. Z. Dong, et al. describe the development of coupling controlled polymerizations by adapter-ligation in mate-pair sequencing for detection of various genomic variants in one single assay. DNA Research, 2019, 26 (4), 313-325.
[0126] Fang, CB. et al. describe high-resolution single-molecule long-fragment rRNA gene amplicon sequencing of bacterial and eukaryotic microbial communities. 2023, Cell Reports Methods 3, 100437. Murigneux, V. et al. compare long-read methods for sequencing and assembly of a plant genome. GigaScience (2020), 9, 1-11. Cai, Y. et al. describe assembly and analysis of the genome of Notholithocarpus densiflorus. G3 Genes|Genomes|Genetics, 2024, 14 (5), Article jkae043.
Other technologies that can be combined with 2xSE sequencing to increase sequence read length and accuracy
Large nanoball technology
[0127] The total length of the sequence reads obtained from DNBs is also limited by the size of the fragment insert in the amplified template, and the accuracy of determining the next nucleotide in each cycle of sequencing. In early embodiments in the use of concatemers for sequencing, this was about 200 base pairs (bp) . Increasing the size of the insert potentially increases the maximum read length, but it makes the DNBs larger. This dilutes the size of the adaptors relative to the insert, which decreases the signal obtained from each added nucleotide and also makes plating more difficult.
[0128] The scientists at MGI Tech. and Complete Genomics Inc. have developed a sequencing technology that compensates for large DNB size. The DNBs are made more compact and arrayed closer together. Labeling intensity is increased by putting more labels on the reagent used to detect the terminator nucleotide, using a two label imaging protocol, and other features. Primer extension products are prevented from going out of phase by improving incorporation of bases into the primer extension products, and other actions. When these features are combined, accurate reads of 600 to 1000 base pairs or more can be obtained.
[0129] Patent disclosures for large nanoball technology have been filed separately: for example, international publication WO 2025 / 176158. Said disclosures are hereby incorporated herein by reference in their entireties as enhancements of and embodied combinations with the technology described herein.
Fragment library sizing by controlled primer extension (CPE)
[0130] Making DNBs with large inserts requires a library of fragments of the target polynucleotide of minimum size. Conventional modes of library preparation (enzymatic cleavage, shearing, and other means) typically generate libraries with a wide range of fragment sizes, many of which are too small for long fragment reads.
[0131] The scientists at Complete Genomics Inc. have developed a technology of library preparation wherein fragment size is between a preselected minimum and a preselected maximum. A starting preparation of random fragments is generated from a target DNA to be sequenced. A first primer is hybridized to an adaptor at one end of the fragments, and extended by CPE in such a way that only fragments above the minimum are retained, and other fragments are discarded. A second primer is hybridized to an adaptor at the other end of the fragments, and extended by CPE in such a way that only fragments below the maximum are retained, and other fragments are discarded. The chosen size range can be matched to the desired insert size of the DNA nanoballs.
[0132] Patent disclosures for CPE library preparations are being filed separately in the name of Complete Genomics Inc: U.S. Patent application 63 / 798,748. Said disclosures are hereby incorporated herein by reference in their entireties as enhancements of and embodied combinations with the technology described herein.
Benefits of 2xSE sequencing
[0133] The labeling and sequencing strategies put forth in this disclosure can be used to obtain accurate longer reads or more accurate combined reads with partial or complete overlap, or more non-overlapped sequences from a longer molecule, instead of techniques that use paired-end reads.
[0134] Benefits of having complementary strands of each DNA fragment prepared before beginning sequence reads include the following:
· Longer combined sequence reads: reads begun on opposite strands can be assembled to obtain a combined read that is twice as long;
· Fewer sequencing cycles: even if insert fragments are short enough for a single read, the same sequence information can be obtained in half the number of sequencing cycles;
· Error correction: sequence reads in opposite directions that overlap can be used to verify sequencing and identify missed calls. Sequencing errors can be distinguished from genuine mutations in comparison with a reference sequence;
· No need to make a second strand in situ during sequencing, which may interrupt or interfere with reading of the first strand.
[0135] Because sequencing errors generally increase toward the end of SE500+ reads, reading a 1 kb DNA by 2xSE600 provides an accurate sequence of ~400 bases at each end, while the ~200 bases in the middle are covered by two less accurate sequences. Using consistent base calls (including consistent mismatches) in the two overlapping 200 base reads is one approach to reduce errors. Inconsistent calls may be converted to no-calls. The other approach is to use the base call with the higher quality score. In most PCR amplification approaches, each DNA molecule with a UMI is represented by multiple copies, and by deeper sequencing a consensus sequence of forward and reverse reads linked by UMI / cUMI would be generated. For more accurate longer reads (for example, SE600+) there are other options, for example, performing two antibody bindings (generating four images of two colors, 2c4i) for the last 100 to 400 bases. The second antibody binding may be done using differently labeled antibodies. Another option is to no-call a few percent of the lowest quality base calls. These approaches provide sequencing of lower frequency molecules on standard libraries.
[0136] An advantage of 2xSE over PE (paired end sequencing) is that longer paired-end type sequencing is achievable. Because the reads are on different DNBs, fewer cycles are needed per DNB array, with higher quality and shorter turnaround time (for example, reading 1200 bases in 630 MPS cycles, where 30 cycles are used for the 15 base UMI). There is no interference of the first read with second-strand making, because no second-strand making is needed; this also avoids the need to handle unstable Phi29 polymerase for sequencing. 2xSE reads can be generated on PCR clusters with a UMI at one end surrounded by sequencing primer binding sites, for example, PP-UMI-PP-IIII-PP (wherein each PP is a primer binding site and IIII is the insert). In one strand, the UMI sequencing primer is different from the primer used for sequencing the insert. Each SE strand is sequenced on a separate array. PCR clusters are dsDNA with 5’ ends bound to the support. For 2xSE sequencing, two arrays are prepared from the same amplified library. A different strand is removed on each array, and the remaining strand is sequenced using suitable primers as described above. To remove a specific strand, a cleavable group may be added in the PCR primer used to make that strand. Each array has a different cleavable primer. For 2xSE sequencing on bead arrays, two preparations of emulsion PCR on beads are prepared, each with different PCR primers attached to beads. The beads with clonally amplified DNA are arrayed on one or two arrays for sequencing.
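The two approaches for reconciling the overlapping portion of a pair of reads (converting inconsistent calls to no-calls, or keeping the call with the higher quality score) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this disclosure: the function name `merge_overlap`, the mode names, and the representation of a no-call as "N" are hypothetical, and the second read is assumed to have already been reverse complemented and aligned position by position with the first.

```python
def merge_overlap(calls_a, quals_a, calls_b, quals_b, mode="no_call"):
    """Merge two aligned, same-orientation base-call strings covering the
    same overlap region. Hypothetical helper for illustration only."""
    if mode not in ("no_call", "best_quality"):
        raise ValueError("mode must be 'no_call' or 'best_quality'")
    if not (len(calls_a) == len(calls_b) == len(quals_a) == len(quals_b)):
        raise ValueError("calls and qualities must have equal lengths")
    merged = []
    for a, qa, b, qb in zip(calls_a, quals_a, calls_b, quals_b):
        if a == b:
            merged.append(a)                      # consistent call is kept
        elif mode == "no_call":
            merged.append("N")                    # inconsistent call becomes a no-call
        else:
            merged.append(a if qa >= qb else b)   # keep the higher-quality call
    return "".join(merged)


read_1 = "ACGT"
read_2 = "ACTT"            # disagrees with read_1 at the third position
q1 = [30, 30, 10, 30]
q2 = [30, 30, 35, 30]
print(merge_overlap(read_1, q1, read_2, q2))                       # ACNT
print(merge_overlap(read_1, q1, read_2, q2, mode="best_quality"))  # ACTT
```

The no-call mode trades a small loss of coverage for fewer wrong calls, whereas the quality-based mode retains every position at the cost of trusting the reported quality scores.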
[0137] Sequencing both complementary strands, when reads are overlapping, can allow for error correction of sequencing induced mis-calls. For example, with an insert size of 500 bases and the insert sequenced from each end for 500 cycles, each base-pair position of the insert is read twice. Base calls at positions that do not match the reference in all sequenced copies, and when present in both strands, may be expected to be present in the original double stranded target. If errors occur during amplification of the strands, then a proportion of reads covering that position may be discordant with the other reads. Similarly, errors introduced during sequencing may also not be fully represented in both strands.
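The reasoning in this paragraph can be expressed as a simple decision rule for a single base-pair position. The sketch below is a simplification offered for illustration: the function name `classify_position`, the three labels, and the requirement that every copy agree are assumptions, and antisense calls are assumed to have been converted to the sense orientation beforehand.

```python
def classify_position(ref, sense_calls, antisense_calls):
    """Classify one position from base calls on both strands.
    Hypothetical rule for illustration; real pipelines use thresholds."""
    calls = list(sense_calls) + list(antisense_calls)
    if not sense_calls or not antisense_calls:
        raise ValueError("calls are needed from both strands")
    if all(c == ref for c in calls):
        return "reference"
    alt_sense = {c for c in sense_calls if c != ref}
    alt_antisense = {c for c in antisense_calls if c != ref}
    shared = alt_sense & alt_antisense
    # The same non-reference base in every copy on both strands is
    # expected to have been present in the original duplex.
    if len(shared) == 1 and all(c in shared for c in calls):
        return "variant"
    # Otherwise some copies disagree, as expected for errors introduced
    # during amplification or sequencing.
    return "discordant"


print(classify_position("A", ["G", "G"], ["G", "G"]))  # variant
print(classify_position("A", ["G", "A"], ["A", "A"]))  # discordant
```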
[0138] Sequencing both complementary strands of longer inserts using reads that are not fully overlapping provides longer read lengths for large insert libraries. For example, with an insert size of 1000 bases, 500 to 700 bases could be read from each end with high accuracy. The total read of up to 1400 bases is in two pieces rather than contiguous, but the pieces are known to be nearby in the target sequence, which helps in assembly. In another example, a single read of 1000 bases from one end may produce higher sequencing errors in the latter 500 cycles compared with the complementary second strand read of the same region. More repeated 1000 base reads may be needed to get the same accuracy as 500 to 700 base reads from complementary strands. Some read lengths not achievable by standard MPS (for example, >1000 bases) may be obtained with two reads. A long DNA of 1500 bases cannot be sequenced in full with a single read if read length is limited to 1000 bases, but it can be fully sequenced by two 800 base reads, one from each end.
EXAMPLES
Example 1: Sequencing sense and antisense concatemers using four sequencing primers on two arrays
[0139] Referring to FIG. 9, Strand 1 and Strand 2 sequencing progresses with the hybridization of primers 1 and 4 within alternate lanes of the sequencing flow cell, for the number of cycles necessary to cover the complementary UMI regions 5 and 6 (for example, 5’ CCAG...GTAC and 5’ GTAC....CTGG). At completion of UMI sequencing, the read is terminated with a blocking group such as a dideoxy nucleotide, phosphorylated nucleotide or alternate 3’ blocking group. A second primer is then hybridized to each strand (primers 2 and 3) to continue sequencing of the insert DNA. Alternatively, both primers 1 and 2 for Strand 1 and primers 3 and 4 for Strand 2 could be hybridized concurrently, with primers 2 and 3 reversibly blocked at the 3’ end.
[0140] After initially sequencing from primers 1 and 4 to read the UMI, primers 1 and 4 are permanently blocked and primers 2 and 3 are unblocked so that insert sequencing can start from primers 2 and 3. Reversible blocking could be achieved by a 3’ phosphate group, a 3’ nitrobenzyl group, or alternative 3’ reversible blocking groups. A 3’ phosphate is unblocked with polynucleotide kinase or a phosphatase enzyme; a 3’ nitrobenzyl group could be unblocked by UV light irradiation.
Example 2: Sequencing sense and antisense concatemers using two sequencing primers
[0141] In this example, two complementary circle DNA strands were prepared in separate reactions from a PCR product ~900 bases in length prepared by 5’ phosphorylated primers using strand matching bridge oligonucleotides. In each reaction, any residual non-circularized strands were degraded or otherwise not used for sequencing. Circular DNA was individually replicated to form DNBs in 150 min RCA reactions.
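The relationship between the complementary UMI regions read on the two strands is that of reverse complements, which can be checked with a few lines of code. This is a minimal sketch: the ten-base UMI shown is a made-up sequence chosen only to begin with CCAG and end with GTAC as in the example above, and the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_complementary_umi(umi: str, c_umi: str) -> bool:
    """True if c_umi, read 5' to 3' on the other strand, pairs with umi."""
    return reverse_complement(umi) == c_umi


umi_strand_1 = "CCAGTTGTAC"   # hypothetical UMI read on Strand 1
umi_strand_2 = reverse_complement(umi_strand_1)
print(umi_strand_2)           # GTACAACTGG
print(is_complementary_umi(umi_strand_1, umi_strand_2))  # True
```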
[0142] FIG. 10 shows the sequence of the two circles, with the poly-N segment representing the insert being sequenced.
[0143] The complementary strand DNBs were then loaded into separate flow cell lanes such that one lane contained Strand 1 DNBs and one lane contained Strand 2 DNBs. Following hybridization of sequencing primers to the respective strand 5’ upstream of the UMI (for example, primers 1 and 4 in the above example with 4-primer sequencing), the DNBs were sequenced for 480 cycles. The initial cycles report the sequence of the UMI, followed by 42-44 bases of the adapter (including binding sites for the sequencing primer in the opposite strand) before entering the genomic sequence.
Example 3: Making DNBs that are 100 kb in length using two or three compact oligonucleotide linkers
[0144] An adapter with 2 or 3 compact oligonucleotide binding sites that are 15 to 25 bases in length is used to make a long insert library (for example, 1 kb, or an average size between 500 and 1500 bases). A low but sufficient concentration of each compact oligonucleotide is used in the big DNB making reagent with 20 to 50 mM Mg++ and other additives, for example, SSB and DMSO. Rolling circle amplification time is usually at least about 100, 200, or 400 minutes. After making, DNBs are preferably heated for 20 to 60 seconds at 40 to 50℃ to further condense them. Preferable arraying conditions are at neutral pH.
[0145] Compact oligonucleotides act as staplers keeping the 100 kb+ DNA concatemer tightly together to prevent DNB splitting in the pre-loading, loading or post-loading process. Two or three compact oligonucleotides with 2 copies provide multiple binding of each copy to other copies compared with one compact oligonucleotide with 3-4 copies. Unused two-copy oligonucleotides are shorter, thus reducing impact on DNB loading compared with 3-4 copy oligonucleotides.
[0146] Another way to make 100 kb+ DNBs is to array 20 to 30 kb DNBs on a surface in conditions that preserve Phi29 polymerase bound to the extending strand; then continue the rolling circle amplification in situ on the array.
Example 4: Sequencing longer cDNAs using a 900 to 1300 base cLFR library and 2xSE600
[0147] When preparing cDNA libraries from mRNA, most cDNA are shorter than 1500 bases, with some up to about 8 kbs. Two or more fractions of cDNA may be prepared for direct 1xSE (for example, <600 bases) or 2xSE (for example, 600 to 1200 bases) or cLFR (above 1200 bases) with 2xSE sequencing using Sanger-size MPS reads. A cLFR library with insert size >800 base sequenced by 2xSE500+ MPS reads provides enrichment for longer cDNA sequencing. Thus, 2 or 3 cLFR libraries may be prepared from the same cDNA sample: for example, 300 base inserts, 600 base inserts and 900 base or 1100 base or 1300 base inserts.
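The assignment of cDNA size fractions to sequencing modes described above can be written as a simple lookup. This is a sketch under stated assumptions: the function name `choose_strategy` and the returned labels are hypothetical, and the treatment of the exact boundary values (600 and 1200 bases) is a choice made here, since the text gives only the example ranges.

```python
def choose_strategy(cdna_length: int) -> str:
    """Map a cDNA length in bases to one of the example sequencing modes."""
    if cdna_length <= 0:
        raise ValueError("length must be positive")
    if cdna_length < 600:
        return "1xSE"           # short enough for a single read
    if cdna_length <= 1200:
        return "2xSE"           # read inward from both ends
    return "cLFR with 2xSE"     # longer molecules go to a cLFR library


for length in (450, 900, 3000):
    print(length, choose_strategy(length))
```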
[0148] 2xSE cLFR libraries can be prepared with a UMI in the branch-ligated adapter. Both UMIs (for the starting molecule and the segment of that molecule) are read in both SE reads. In this case, preferably fewer copies of the original cDNA molecule are used to make the cLFR library, thereby having more paired 2xSE reads with both barcodes matching.
Example 5: Increasing fragment coverage of stLFR libraries
[0149] Libraries prepared by single tube LFR have a broad size range of 200 to 2000 bases in length (FIG. 11). With standard sequencing (PE150), many of the longer inserts are not covered or do not make DNBs with high enough signal for sequencing, especially in the current PE-read protocol. Instead, a fraction of ~300 base inserts, ~600 base inserts, and about >700 base inserts may be prepared for sequencing with PE150 or SE300, 1xSE600 or PE300, and 2xSE600+ or PE500-600, respectively.
[0150] Another option is to replace the ~300 base and ~600 base libraries with a single library of about 200 to 700 bases, and sequence with SE600 / PE300 reads, potentially decreasing sequencing efficiency. With these types of libraries, over 20%, over 25%, or even over 30% of the bases in long DNA fragments from stLFR may be sequenced without amplifying or applying MDA on long DNA fragments.
Example 6: Demonstration of sequence concordance between sense and antisense concatemers
[0151] To demonstrate the effectiveness of the technology put forth in this disclosure, a 2xSE sequencing run was performed on sense and antisense DNBs with an identical insert size of 1 kb from HG002 genomic DNA. Sequencing was performed on a Complete Genomics brand G800 sequencer with one lane loaded with DNBs generated from Strand 1 and a second lane loaded with DNBs generated from Strand 2. DNBs were prepared by using strand selective rolling circle replication primers for DNB amplification. X-linked primers as described in WO 2025 / 176158 (H. Shanshan et al., MGI Tech) were hybridized to the barcode primer site of the adapter and the insert primer hybridization site in a single hybridization step; however, the insert primer was 3’ phosphorylated in this example. Barcode and insert sequencing primers were matched to the adapter strand present in each lane. After 15 base sequencing of the barcode region, the read was blocked by dideoxy (ddNTP) nucleotide incorporation. The 3’ phosphate group of the insert primer was then removed by polynucleotide kinase treatment, and sequencing then proceeded from the unblocked insert primer for a further 555 cycles.
[0152] FIGS. 1A and 1B show the Q30 data of Strand 1 and Strand 2, respectively. Q30 is a quality score in sequencing that indicates virtually error-free reads, representing a base call accuracy of 99.9%. Totals and quality metrics are shown in TABLE 2.
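The correspondence between a Q score of 30 and a base call accuracy of 99.9% follows from the standard Phred definition, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The short sketch below illustrates the conversion; the function names are hypothetical, and the use of "at least 30" as the cutoff in `fraction_at_or_above` is an assumption made for illustration.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def fraction_at_or_above(quals, threshold=30):
    """Fraction of base calls whose quality meets the threshold."""
    return sum(q >= threshold for q in quals) / len(quals)


print(phred_to_error(30))                        # about 0.001, i.e. 99.9% accuracy
print(fraction_at_or_above([35, 32, 28, 40]))    # 0.75
```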
[0153] In a second demonstration of 2xSE technology, Strand 1 or Strand 2 DNBs were prepared and loaded into separate lanes of a G800 sequencer. Primers for Strand 1 and Strand 2 barcode sequencing were then mixed and hybridized to the loaded DNBs. Only the primer sequence recognizing its reverse complement DNB strand should hybridize to the DNB in the specific lane. The barcode primers were designed such that they were offset and hybridized to different regions of the adapter to avoid any complementarity during mixing. After 15-base barcode sequencing, the read was terminated by dideoxy nucleotide incorporation; the insert primer was then hybridized, and sequencing of the insert continued for an additional 300 cycles. Again, the insert primers were designed such that they were offset and hybridized to different regions of the adapter to avoid any complementarity during mixing.
[0154] FIG. 1C shows the results. Rho intensity values were averaged and plotted for a field from the Strand 1 (S1) sequencing and Strand 2 (S2) sequencing lanes. Rho intensity is a CG metric that represents the average called base intensity for DNBs in a field of the array. The averaged four-base intensities from a field of either a strand one (S1) DNB lane or a strand two (S2) DNB lane were similar for both strands.
[0155] FIG 1D is a Venn diagram depicting barcode overlap, with each segment showing a number of sequence reads. To remove errors in barcode sequencing, both the first and second strand data were filtered to remove barcodes found in fewer than five reads. The remaining reads from sense and antisense concatemers were then matched with each other. The number of unmatched reads in sense and antisense strands is depicted on the left and right. The number of reads that paired or complementary barcodes was 4,110,055 out of a total of 5,064,876 reads (over 80%) .
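The filtering and matching steps described for FIG. 1D can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually performed: the function names are hypothetical, the barcodes are short made-up sequences, the sketch counts distinct barcodes rather than reads, and antisense barcodes are assumed to pair with sense barcodes as reverse complements. The five-read minimum is taken from the text.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def pair_barcodes(sense_barcodes, antisense_barcodes, min_reads=5):
    """Drop barcodes seen in fewer than min_reads reads, then match
    sense barcodes with reverse-complemented antisense barcodes.
    Returns (paired, sense_only, antisense_only) as sets in sense orientation."""
    sense_kept = {b for b, n in Counter(sense_barcodes).items() if n >= min_reads}
    antisense_kept = {
        reverse_complement(b)
        for b, n in Counter(antisense_barcodes).items()
        if n >= min_reads
    }
    paired = sense_kept & antisense_kept
    return paired, sense_kept - paired, antisense_kept - paired


# Made-up reads: "TTTT" is seen only twice and is filtered out as a likely error.
sense = ["AACG"] * 5 + ["TTTT"] * 2 + ["CCCA"] * 5
antisense = ["CGTT"] * 6 + ["GGGA"] * 5
paired, sense_only, antisense_only = pair_barcodes(sense, antisense)
print(paired, sense_only, antisense_only)
```

For reference, the read counts reported above correspond to 4,110,055 / 5,064,876, or about 81%.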
[0156] These data show that the sequencing performed well in both directions using sense and antisense concatemers, matching data obtained from opposite orientations with precision. The user can circularize and sequence both strands (and as such both sides of the molecule) without introducing substantial differences between the two strands. Over 80% of the barcodes had counterparts in both the sense and antisense concatemer preparations, providing a rich incidence of sequence complementarity and error correction. Trademarks
[0157] The wordmark CoolMPS, the wordmark DNBSEQ, the wordmark MGIEasy and the MGI logo are all registered trademarks of MGI Tech Co., Ltd.
Incorporation by reference
[0158] For all purposes in the United States of America, each and every publication and patent document referred to in this disclosure is hereby incorporated herein by reference in its entirety for all purposes to the same extent as if each such publication or document was specifically and individually indicated to be incorporated herein by reference.
Practice of the invention
[0159] The technology provided in this disclosure and its use are described with non-limiting illustrations within a hypothetical understanding of general principles of oligonucleotide chemistry and DNA sequencing technology. These discussions are provided for the edification and interest of the reader, and are not intended to limit the practice of the claimed invention. All of the products and methods claimed in this application may be used for any suitable purpose without restriction, unless explicitly indicated or otherwise required.
[0160] While this disclosure has been described with reference to the specific embodiments, changes can be made and equivalents can be substituted to adapt the invention to a particular context or intended use as a matter of routine experimentation, thereby achieving benefits of this disclosure without departing from the scope of what is claimed.
Claims
1. A method of characterizing the nucleotide sequence of a target polynucleotide, the method comprising:
(a) obtaining a library of DNA fragments, each comprising an insert that is a portion of the target polynucleotide, ligated with a sequencing adaptor to form a template;
(b) forming sense concatemers, each comprising contiguously replicated copies of a first one of said templates containing a first insert;
(c) forming antisense concatemers, each comprising contiguously replicated copies of a second one of said templates containing a second insert;
wherein a majority of the antisense concatemers each comprises replicated copies of an insert that is complementary to an insert that is replicated in at least one of the sense concatemers, and is thereby characterized as corresponding to said sense concatemer;
(d) obtaining a first sequence read of the first insert in each of the sense concatemers; and
(e) obtaining a second sequence read of the second insert in each of the antisense concatemers.
2. A method of characterizing the nucleotide sequences, the method comprising:
(a) obtaining a library of DNA inserts/amplicons ligated with a sequencing adaptor to form a template;
(b) forming multiple copy senses, each comprising a replicated copy of a first one of said templates containing a first insert;
(c) forming multiple copy antisenses, each comprising a replicated copy of a second one of said templates containing a second insert;
(d) obtaining a first sequence read of the first insert in each of the senses; and
(e) obtaining a second sequence read of the second insert in each of the antisenses;
wherein a majority of the antisenses each comprises replicated copies of an insert that is complementary to an insert that is replicated in at least one of the senses, and is thereby characterized as corresponding to said sense.
3. The method of claim 1 or claim 2, wherein the first template in each sense concatemer contains a first unique molecule identifier (UMI), wherein the second template in each antisense concatemer contains a second UMI; and wherein the method further comprises:
(f) comparing first UMIs of first sequence reads with second UMIs of second sequence reads; and
(g) matching first sequence reads with corresponding second sequence reads whenever the sequence of a first UMI is complementary to the sequence of a second UMI.
4. The method of claim 3, wherein the sequence of each first UMI is embedded within the corresponding first sequence read, and the sequence of each second UMI is embedded within the corresponding second sequence read.
5. The method of claim 1 or 2, wherein the sense concatemer and the antisense concatemer are contained in the same mixture.
6. The method of claim 1 or claim 2, further comprising:
(f) comparing first sequence reads with second sequence reads; and
(g) matching first sequence reads with corresponding second sequence reads when they are complementary to each other within an overlapping region of at least 16 consecutive nucleotides; and
(h) assembling first sequence reads with corresponding second sequence reads to form a contiguous sequence when the first sequence read and corresponding second sequence read overlap.
[Approach A:]
7. The method of any of claims 1 to 6, wherein each of the sense concatemers and the antisense concatemers corresponding thereto are prepared by contiguous replication of both strands of double stranded circular DNAs, each containing one of the templates, using primers that hybridize to the sequencing adaptor on opposite strands.
8. The method of claim 7, wherein the templates are double stranded, and wherein each of the double stranded circular DNAs is prepared by:
ligating an assembly adaptor onto each end of one of the double stranded templates, then processing the assembly adaptors on both ends of said double stranded template so that they have staggered ends that hybridize to each other to form a double stranded circular DNA.
9. The method of claim 8, wherein the assembly adaptors on both ends of each template are bubble adaptors, and the processing comprises:
incorporating a uracil into the assembly adaptor on both ends of each template; and
cleaving one strand of the assembly adaptor on both ends using a USER (uracil-specific excision reagent) enzyme system to form said staggered ends.
[Approach B:]
10. The method of any of claims 1 to 6, wherein each of the sense concatemers and the antisense concatemers corresponding thereto are prepared by contiguous replication of an antisense single stranded circular DNA and by contiguous replication of a corresponding sense single stranded circular DNA, respectively.
11. The method of claim 10, wherein the templates are double stranded, and wherein each antisense single stranded circular DNA and corresponding sense single stranded circular DNA are prepared by ligating assembly adaptors onto one of the double stranded templates; then separating and circularizing each strand thereof.
12. The method of claim 11, wherein the separated strands are each circularized using a splint oligonucleotide that is complementary to adaptors at the 5’ end and the 3’ end of each strand, thereby bridging the 5’ end to the 3’ end so that they can be ligated together.
[Approach C:]
13. The method of any of claims 1 to 6, wherein each of the sense concatemers and the antisense concatemers corresponding thereto are prepared respectively from sense and antisense strands of replicated linear DNA.
14. The method of claim 13, wherein the replicated DNA is separated into two reactions,
wherein in one reaction, sense strands are circularized using a splint oligonucleotide that is complementary to adaptors ligated to both ends of each sense strand of the replicated DNA,
wherein in the other reaction, antisense strands are circularized using a splint oligonucleotide that is complementary to adaptors ligated to both ends of each antisense strand of the replicated DNA.
15. The method of claim 13 or 14, wherein the replicated linear DNA is ligated at one end to a first adaptor having a binding site for a first primer, and at its other end to a second adaptor having a binding site for a second primer that is different from the first;
wherein the first primer is used to selectively prime linear amplification of the sense strand, and the second primer is used to selectively prime linear amplification of the antisense strand.
16. The method of any of claims 1 to 15, wherein the sense concatemers and the antisense concatemers are prepared and sequenced separately.
17. The method of any of claims 1 to 15, wherein the sense concatemers and the antisense concatemers are sequenced together in a single reaction mixture.
18. The method of any preceding claim, wherein the first sequence reads and the second sequence reads are obtained by primer extension or by sequencing by synthesis.
19. The method of any preceding claim, wherein the first inserts and the second inserts have a median length of at least 600 bases.
20. The method of any preceding claim, wherein the first sequence reads and the second sequence reads have a median length of at least 400 bases.
[Segmental concatemers from the front and back of long fragments:]
21. The method of any preceding claim, wherein step (b) comprises forming successive concatemers from front and back segments of templates that are above a preselected length.
22. A method of characterizing the nucleotide sequence of a target polynucleotide, the method comprising:
(a) obtaining a library of DNA fragments, each comprising an insert that is a portion of the target polynucleotide, ligated with a sequencing adaptor to form a template;
(b) for a plurality of templates in the library, forming a first single stranded circular DNA (ssDNA), comprising a segment of predetermined length from one end of the insert, and forming a second ssDNA, comprising a segment of predetermined length from the other end of the insert;
(c) forming a first concatemer from each first circular ssDNA and a second concatemer from each second circular ssDNA, thereby obtaining successive concatemers from front and back segments of each insert; and
(d) obtaining first and second sequence reads from said first and the second concatemers.
23. The method of claim 22, wherein the predetermined length of each segment is 900 bases.
24. The method of claim 22 or claim 23, wherein the segments of predetermined length are obtained by controlled primer extension (CPE).
25. The method of any preceding claim, further comprising determining at least part of the sequence of the target polynucleotide by a process that comprises assembling first sequence reads with each other and with second sequence reads.
26. The method of any preceding claim, further comprising identifying a base at a position in the first sequence read of one of said sense concatemers that is not complementary to a base at the same position in the second sequence read of the corresponding antisense concatemer, thereby constituting a non-call in both sequence reads at the positions of said bases.
27. The method of any preceding claim, further comprising identifying a base at a position in the first sequence read of one of said sense concatemers that is complementary to a base at the same position in the second sequence read of the corresponding antisense concatemer but different from the base at the same position in a reference sequence, thereby identifying true mutations in the target polynucleotide in relation to the reference sequence.
28. An array of at least 108 different optically resolvable DNA nanoballs, configured for sequencing all or part of a target polynucleotide;
wherein each of the DNA nanoballs is a concatemer that comprises replicates of a template that contains an insert fragment from the target polynucleotide ligated to an adaptor; and
wherein at least one third of the DNA nanoballs on the array are each a replicate of a template that corresponds to the template that is replicated in another nanoball on the array.
29. The array of claim 28, wherein at least 95% of the DNA nanoballs on the array are each a replicate of a template that is complementary to the template that is replicated in another nanoball on the array.
30. A method of any one or more of the following in the context of sequencing a target polynucleotide using fragments thereof replicated in DNA nanoballs:
obtaining longer sequence reads of said fragments;
reducing the number of sequencing cycles needed to obtain a complete read of a majority of said fragments;
improving accuracy of the sequencing; and/or
distinguishing inaccuracies or no-calls in the sequencing from mutations in the target polynucleotide;
the method comprising obtaining sequence reads of said fragments in the DNA nanoballs in both directions, as described above.
[Compositions of matter]
31. A nucleic acid composition configured for sequencing a target DNA, comprising a mixture of sense concatemers and antisense concatemers, each containing replicates of a portion of the target DNA to be sequenced or the complement thereof,
wherein at least 80% of the sense concatemers each contain a sequence from the target polynucleotide of at least 500 bases in length that is complementary to at least part of a sequence in at least one of the antisense concatemers in the mixture.
32. A nucleic acid composition configured for sequencing a target DNA, comprising a mixture of sense concatemers and antisense concatemers, each containing replicates of a portion of the target DNA to be sequenced or the complement thereof,
wherein inserts of the target DNA that are replicated in the sense concatemers in the mixture are each tagged with a unique molecule identifier (UMI),
wherein each of at least 80% of said UMIs in the sense concatemers matches or is complementary to a UMI that is tagging an insert of the target DNA in at least one of the antisense concatemers in the mixture.
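By way of non-limiting illustration, the read matching and assembly recited in claim 6 and the strand comparison recited in claims 26 and 27 might be sketched in Python as follows. This is an illustrative sketch only, not an implementation taken from this disclosure. The function names are hypothetical. The overlap is assumed to be an exact match (a practical tool would likely tolerate mismatches), and the position-by-position comparison assumes that both reads and the reference have already been aligned to the same coordinates.

```python
# Translation table for complementing DNA bases; N (unknown) maps to N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_reads(first_read: str, second_read: str, min_overlap: int = 16):
    """Assemble a sense read and an antisense read into one contiguous sequence.

    The second read is converted to sense orientation, then the longest
    exact overlap between the end of the first read and the start of the
    converted second read is sought. Returns None when no overlap of at
    least `min_overlap` nucleotides is found.
    """
    second_as_sense = reverse_complement(second_read)
    longest = min(len(first_read), len(second_as_sense))
    for k in range(longest, min_overlap - 1, -1):
        if first_read[-k:] == second_as_sense[:k]:
            return first_read + second_as_sense[k:]
    return None


def call_positions(first_read: str, second_read: str, reference: str):
    """Classify each position covered by both reads and the reference.

    Assumes the three sequences are pre-aligned and cover the same span.
    - "no-call": the two strands disagree at this position
    - "variant": the strands agree with each other but differ from the reference
    - "ref":     the strands agree with each other and with the reference
    """
    second_as_sense = reverse_complement(second_read)
    calls = []
    for base_1, base_2, ref_base in zip(first_read, second_as_sense, reference):
        if base_1 != base_2:
            calls.append("no-call")
        elif base_1 != ref_base:
            calls.append("variant")
        else:
            calls.append("ref")
    return calls
```

In this sketch, a disagreement between the two strands is treated as a likely sequencing error and is not called, whereas a difference from the reference that is supported by both strands is reported as a candidate mutation.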