Method for filtering erroneous sequence reads caused by barcode hopping
The MRU filter addresses barcode hopping errors by filtering sequence reads based on read thresholds, enhancing the accuracy and precision of spatial transcriptomics data analysis.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ILLUMINA INC
- Filing Date
- 2025-11-11
- Publication Date
- 2026-06-25
AI Technical Summary
Barcode hopping during nucleic acid sequencing leads to erroneous sequence reads, causing incorrect spatial assignment of transcripts and inflating sensitivity, which undermines the accuracy of spatial transcriptomics analyses.
Implement a Multiple Read Supported Unique Molecule (MRU) filter to identify and remove subsets of sequence reads that do not meet a threshold number of reads, specifically targeting barcode-hopped molecules by analyzing subsets with the same spatial barcode, target nucleic acid, and molecular identifier.
Enhances the accuracy of nucleic acid sequencing data by reducing the impact of barcode hopping, improving the signal-to-noise ratio and ensuring precise localization and quantification of RNA molecules in spatial transcriptomics.
Description
33080 / IP-2914
METHOD FOR FILTERING ERRONEOUS SEQUENCE READS CAUSED BY BARCODE HOPPING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to (1) U.S. Provisional Patent Application No. 63 / 737,088, filed December 20, 2024, entitled “Method for Filtering Erroneous Sequence Reads Caused by Barcode Hopping,” and (2) U.S. Provisional Patent Application No. 63 / 809,739, filed May 21, 2025, entitled “Method for Filtering Erroneous Sequence Reads Caused by Barcode Hopping,” the entire disclosure of each of which is hereby expressly incorporated by reference herein.
FIELD OF THE INVENTION
[0002] The present disclosure generally relates to methods and systems for enhancing the accuracy of nucleic acid sequencing data analysis, and more specifically, to techniques for filtering erroneous sequence reads caused by barcode hopping.
BACKGROUND
[0003] The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
[0004] Spatial transcriptomics enables highly multiplexed, spatially located gene expression analysis from biological samples. RNA transcripts are obtained from tissue samples placed on a substrate after tissue permeabilization and subsequent capture using target capture sites on the surface. Spatial barcodes are added to preserve the location of each transcript such that a spatial transcriptomics map may be generated after sequencing.
[0005] Most sequencing technologies require amplification (e.g., by polymerase chain reaction (PCR)) of barcoded cDNA molecules in order to have enough yield for sequencer loading. After sequencing, the sequence reads are demultiplexed and transcripts are assigned a spatial location based on the spatial barcode.
[0006] Unfortunately, barcode hopping sometimes occurs during amplification and sequencing. That is, a cDNA molecule may become associated with a spatial barcode from a different cDNA molecule, dissociating the spatial information from the transcript information. This causes transcripts to be assigned to the incorrect spatial location on the flow cell, thereby adding random noise to the data. Moreover, because unique molecules are determined by a unique combination of spatial barcode (SBC), gene, and molecular identifier (e.g., a unique molecular identifier (UMI)), barcode hopping creates new combinations of SBC, gene, and molecular identifier, inflating the reported sensitivity and interfering with the interpretation of assay readouts. These errors can detract from the overall efficacy of sequencing technologies, particularly when applied to complex analyses such as spatial transcriptomics, where precise localization and quantification of RNA molecules within a tissue are critical.
SUMMARY
[0007] In order to filter erroneous sequence reads caused by barcode hopping, nucleic acid sequencing data is analyzed to identify subsets of sequence reads that correspond to the same spatial barcode, the same target nucleic acid (e.g., cDNA), and the same molecular identifier (e.g., the same UMI). Then, one or more of the subsets that do not comprise more than a threshold number of sequence reads are filtered from the sequencing data.
[0008] When a library is sequenced at deep sequencing depth, most true molecules are supported by more than one sequence read. Because barcode hopped molecules have a much higher chance of arising in later cycles of amplification, such barcode hopped molecules are typically supported by only a low number of reads.
[0009] Thus, in one embodiment, subsets of sequence reads having the same spatial barcode, the same target nucleic acid, and the same molecular identifier are filtered based on a threshold number of one. That is, any unique molecule identified in the sequencing that is only supported by 1 sequence read is filtered from the plurality of sequence reads.
[0010] However, if a sample is not deeply sequenced, it is possible that a molecule that is only supported by a single sequence read is an actual molecule correctly associated with a spatial barcode.
[0011] Thus, in some embodiments, the total number of reads of each specific spatial barcode is determined, and the subsets are only filtered if they comprise a spatial barcode that occurred in a number of sequence reads above a second threshold number (e.g., 2, 3, 4, 5, 10, etc.). The second threshold number can be determined based on the sequencing depth of the amplification reaction.
[0012] The filtered sequence data can then be used to present a spatial transcriptomics map.
[0013] In one aspect, a method for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data includes: (1) obtaining nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier, (2) analyzing the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier, and (3) filtering, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0014] In another aspect, a computing device for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data includes one or more processors and a non-transitory computer-readable memory storing instructions thereon. When executed by the one or more processors, the instructions cause the computing device to (1) obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier, (2) analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier, and (3) filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0015] In yet another aspect, a non-transitory computer-readable memory stores instructions thereon that, when executed by one or more processors, cause the one or more processors to (1) obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier, (2) analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier, and (3) filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0016] Advantages will become more apparent to those of ordinary skill in the art from the following description of the preferred embodiments which have been shown and described by way of illustration. As will be realized, the present embodiments may be capable of other and different embodiments, and their details are capable of modification in various respects. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The figures described below depict various aspects of the system and methods disclosed herein. It should be understood that each figure depicts an embodiment of a particular aspect of the disclosed system and methods, and that each of the figures is intended to accord with a possible embodiment thereof.
[0018] There are shown in the drawings arrangements which are presently discussed, it being understood, however, that the present embodiments are not limited to the precise arrangements and instrumentalities shown, wherein:
[0019] Fig. 1 depicts example nucleotide molecules, each including a spatial barcode, a target nucleic acid, and a molecular identifier, according to some aspects, and illustrates an example of barcode hopping.
[0020] Fig. 2 depicts a block diagram of an example computing environment for filtering erroneous sequence reads caused by barcode hopping, according to some aspects.
[0021] Fig. 3 depicts example spatial transcriptomics maps depicting the number of reads of an NPHS2 gene with and without applying a Multiple Read Supported Unique Molecule (MRU) filter, according to some aspects.
[0022] Fig. 4 is a flow diagram of an example method for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, which may be implemented by a server device, according to some aspects.
[0023] Fig. 5 is a flow diagram of another example method for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, which may be implemented by a server device, according to some aspects.
[0024] Fig. 6 shows the experimental design, analysis, and results of a species mixing experiment to characterize the properties of barcode hopping.
[0025] Fig. 7 illustrates a potential barcode hopping mechanism.
[0026] Fig. 8 illustrates an example application of a Smart MRU filter.
[0027] Fig. 9 illustrates another potential barcode hopping mechanism.
[0028] Fig. 10 illustrates the experimental design to characterize the properties of barcode hopping in an embodiment using a dual nucleotide approach.
[0029] Fig. 11 illustrates the inverse relationship between hopping and overall captured molecules in a dual nucleotide approach.
[0030] Fig. 12 illustrates the reads per transcript analysis of Example 3.
[0031] Fig. 13 illustrates the dual oligo approach of Example 3, including the mRNA capture, cDNA adapterization, bridging, and extension.
[0032] The Figures depict preferred implementations for purposes of illustration only. Alternative implementations of the systems and methods illustrated herein may be employed without departing from the principles of the invention described herein.
DETAILED DESCRIPTION
[0033] Although the following text discloses a detailed description of implementations of methods, apparatuses and / or articles of manufacture, it should be understood that the legal scope of the property right is defined by the words of the claims set forth at the end of this document. Accordingly, the following detailed description is to be construed as examples only and does not describe every possible implementation, as describing every possible implementation would be impractical, if not impossible. Numerous alternative implementations could be implemented, using either current technology or technology developed after the filing date of this patent. It is envisioned that such alternative implementations would still fall within the scope of the claims.
[0034] Spatial transcriptomics allows the mapping of single cell transcriptomes to their spatial locations in a tissue slice. In general, spatially mapping single-cell transcriptomes (called the ex situ approach) involves the use of a surface coated with barcoded oligonucleotides, where the spatial location of each barcode is known. The barcoded oligonucleotides are localized into individual features, where every oligonucleotide in the same feature carries the same spatial barcode. Different implementations of the surface include, for example and without limitation, a bead array, a spotted array, a clustered flow cell, and clustered particles arranged on a surface. The surface also contains capture nucleotide sequences which may be part of the barcoded oligonucleotides or may be separate from the barcoded oligonucleotides.
[0035] A tissue section is then placed on the surface and target nucleic acids (e.g., polyA mRNA molecules) within the tissue diffuse to the features and are captured on the surface. In some embodiments, the captured target nucleic acid is RNA that is reverse transcribed into cDNA, wherein the spatial barcode is linked with the cDNA sequence.
[0036] This is followed by library preparation, which typically involves amplification (e.g., PCR). The sequencing library is then subjected to sequencing (e.g., Next Generation Sequencing). During analysis of the sequencing data, the spatial barcode is used to map the physical location of the molecule from which the sequence read is derived.
[0037] There are various strategies and methods for capturing target nucleic acids, associating spatial barcodes to such captured nucleic acids, and preparing nucleic acid sequencing libraries. Examples of such methods are described in WO 2024 / 138154; U.S. Patent No. 10,913,975; WO 2021 / 247593; WO 2020 / 072380; and U.S. Provisional Patent Application No. 63 / 615,558; each of which is incorporated by reference herein in its entirety.
[0038] As used herein, the term “barcode” is intended to mean a series of nucleotides in an oligonucleotide that can be used to identify the oligonucleotide, a spatial address on a surface (i.e., a “spatial barcode”), a characteristic of the oligonucleotide, or a manipulation that has been carried out on the oligonucleotide. The barcode can be a naturally occurring nucleotide sequence or a nucleotide sequence that does not occur naturally in the organism from which the barcoded nucleic acid was obtained. A barcode sequence can be unique to a single nucleic acid species in a population, or a barcode sequence can be shared by several different nucleic acid species in a population.
[0039] In some embodiments, one or more of the plurality of spatially barcoded oligonucleotides comprises a molecular identifier (MI). In further embodiments, the MI is a unique molecular identifier (UMI).
[0040] As used herein, the term “molecular identifier” or “MI” refers to a molecular tag, either random, non-random, or semi-random, that may be attached to a nucleic acid. In various embodiments, an MI is a unique molecular identifier (UMI). When incorporated into a nucleic acid, an MI can be used to correct for subsequent amplification bias by directly counting molecular identifiers (MIs) that are sequenced after amplification. An MI (e.g., a UMI) can be attached to similar nucleic acids, e.g., adapters, making each nucleic acid unique. MIs (e.g., UMIs) may also be used to uniquely tag individual molecules (e.g., individual mRNA molecules) in a sample (e.g., individual mRNA molecules in a tissue sample, cell sample, or sample library). In some embodiments, a UMI is a random nucleotide sequence (e.g., N9).
[0041] UMIs may be applied to or identified in individual DNA molecules. In some implementations, the UMIs may be applied to the DNA molecules by methods that physically link or bond the UMIs to the DNA molecules, e.g., by ligation or transposition through polymerase, endonuclease, transposases, etc. These “applied” UMIs are therefore also referred to as physical UMIs. In some contexts, they may also be referred to as exogenous UMIs. The UMIs identified within source DNA molecules are referred to as virtual UMIs. In some contexts, virtual UMIs may also be referred to as endogenous UMIs.
[0042] UMIs are uniquely associated with a single DNA fragment in a sample including a source polynucleotide and its complementary strand. A physical UMI is a sequence of an oligonucleotide linked to the source polynucleotide, its complementary strand, or a polynucleotide derived from the source polynucleotide. A virtual UMI is a sequence of an oligonucleotide within the source polynucleotide, its complementary strand, or a polynucleotide derived from the source polynucleotide. Within this scheme, one may also refer to the physical UMI as an extrinsic or exogenous UMI, and the virtual UMI as an intrinsic or endogenous UMI.
[0043] Physical UMIs may be defined in many ways. For example, they may be random, pseudo-random or partially random, or nonrandom nucleotide sequences that are inserted in adapters or otherwise incorporated in source DNA molecules to be sequenced. In some implementations, the physical UMIs may be so unique that each of them is expected to uniquely identify any given source DNA molecule present in a sample. A collection of adapters is generated, each having a physical UMI, and those adapters are attached to fragments or other source DNA molecules to be sequenced, so that each individual sequenced molecule has a UMI that helps distinguish it from all other fragments. In such implementations, a very large number of different physical UMIs (e.g., many thousands to millions) may be used to uniquely identify DNA fragments in a sample.
[0044] A “virtual unique molecular index” or “virtual UMI” is a unique sub-sequence in a source DNA molecule. In some implementations, virtual UMIs are located at or near the ends of the source DNA molecule. One or more such unique end positions may alone or in conjunction with other information uniquely identify a source DNA molecule. Depending on the number of distinct source DNA molecules and the number of nucleotides in the virtual UMI, one or more virtual UMIs can uniquely identify source DNA molecules in a sample. In some cases, a combination of two virtual unique molecular identifiers is required to identify a source DNA molecule. Such combinations may be extremely rare, possibly found only once in a sample. In some cases, one or more virtual UMIs in combination with one or more physical UMIs may together uniquely identify a source DNA molecule.
Multiple Read Supported Unique Molecule (MRU) filter
[0045] During library preparation and sequencing in next generation sequencing processes, barcode hopping sometimes occurs.
[0046] “Barcode hopping” (also called “barcode switching”, “barcode swapping”, “index hopping”, “index swapping”, or “index switching”) occurs when a spatial barcode designating a first location becomes associated with a target nucleic acid from a different location, thus resulting in the erroneous assignment of spatial location to the target nucleic acid.
[0047] Fig. 1 shows an example of barcode hopping. As shown in the upper portion of Fig. 1, spatial barcode SBC1 designates the original location associated with target nucleic acid cDNA1, and spatial barcode SBC2 designates the original location associated with target nucleic acid cDNA2. The lower nucleic acid shows the result of barcode hopping, where spatial barcode SBC1 has become associated with target nucleic acid cDNA2, thus associating the incorrect spatial location to cDNA2. Moreover, as further shown in Fig. 1, the barcode hopped molecule includes UMI2, which results in the creation of a false new unique molecule.
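The effect on molecule counting can be illustrated with a toy example. The following Python sketch assumes that each sequence read has already been reduced to a (spatial barcode, target, molecular identifier) tuple; the read values and the helper name `count_unique_molecules` are hypothetical and simply mirror the labels used in Fig. 1.

```python
from typing import Iterable, Tuple

# Hypothetical read representation: (spatial barcode, target, molecular identifier)
Read = Tuple[str, str, str]


def count_unique_molecules(reads: Iterable[Read]) -> int:
    """Count distinct (SBC, target, UMI) combinations among the reads."""
    return len(set(reads))


# Two true molecules, each supported by two reads.
true_reads = [
    ("SBC1", "cDNA1", "UMI1"),
    ("SBC1", "cDNA1", "UMI1"),
    ("SBC2", "cDNA2", "UMI2"),
    ("SBC2", "cDNA2", "UMI2"),
]

# One barcode hopped read: SBC1 attached to cDNA2, still carrying UMI2.
hopped_read = ("SBC1", "cDNA2", "UMI2")

before = count_unique_molecules(true_reads)
after = count_unique_molecules(true_reads + [hopped_read])
print(before, after)
```

In this toy data set a single hopped read adds one apparent unique molecule at the wrong location, which is the inflation of reported sensitivity described above.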
[0048] Fig. 9 shows another example of barcode hopping as explained in further detail in Example 3 below.
[0049] Without intending to be limited by any particular theory (and as further explained below), barcode hopping appears to have a much higher chance of occurring during the later-stage amplification (e.g., PCR) cycles of library preparation, when, for various reasons, short fragments of DNA carrying only the spatial barcode are generated. The short fragments with the spatial barcodes are then able to hybridize to another molecule carrying a different target DNA, resulting in a molecule with swapped barcode and transcript DNA. Additionally, although barcode hopping appears to mainly occur during such amplification in the library preparation, some barcode hopping may also occur during clonal amplification and sequencing.
[0050] At deep sequencing depths (e.g., above a threshold sequencing depth), most true molecules originating from the original sample (e.g., RNA or cDNA) are supported by greater than one sequence read. However, most barcode hopped molecules are only supported by one sequence read.
[0051] Thus, as further explained below, a Multiple Read Supported Unique Molecule (MRU) filter may be used to filter sequencing data comprising a plurality of sequence reads resulting from nucleic acid sequencing. That is, by identifying subsets of the plurality of sequence reads that correspond to the same unique molecule (e.g., the same spatial barcode, target nucleic acid, and molecular identifier), an MRU filter may be used to remove the subsets that do not include more than a threshold number of sequence reads (e.g., 1, 2, etc.), thereby removing erroneous sequence reads caused by barcode hopping and reducing the impact of such barcode hopping on the sequencing data.
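The grouping and thresholding just described can be sketched in a few lines of Python. This is a minimal illustration only, assuming that reads have already been demultiplexed and annotated as (spatial barcode, target, molecular identifier) tuples; the function name `mru_filter` and the example reads are hypothetical and are not taken from the disclosure.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical read representation: (spatial barcode, target, molecular identifier)
Read = Tuple[str, str, str]


def mru_filter(reads: List[Read], threshold: int = 1) -> List[Read]:
    """Keep only reads whose unique molecule is supported by more than
    `threshold` reads; subsets at or below the threshold are removed."""
    support = Counter(reads)  # reads per (SBC, target, UMI) subset
    return [read for read in reads if support[read] > threshold]


reads = [
    ("SBC1", "geneA", "UMI1"),
    ("SBC1", "geneA", "UMI1"),
    ("SBC1", "geneA", "UMI1"),
    ("SBC2", "geneB", "UMI2"),
    ("SBC2", "geneB", "UMI2"),
    ("SBC1", "geneB", "UMI2"),  # single-read molecule, candidate hopped molecule
]

filtered = mru_filter(reads, threshold=1)
print(len(reads), len(filtered))
```

With a threshold of one, the single-read subset is removed and the two multi-read subsets are retained. Raising the threshold to two would also remove the subset supported by exactly two reads.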
[0052] In other embodiments, the MRU filter may only be applied to certain spatial locations (e.g., clusters) of the original substrate (e.g., flow cell) based on the type of potential barcode hopping that may be occurring. For example, as explained in Example 3 below, and without being limited by theory, in some embodiments using a dual nucleotide approach for capturing and preparing the sequencing library (e.g., as described in U.S. Provisional Patent Application No. 63 / 615,558), there appears to be an inverse correlation between barcode hopping and the total number of transcripts captured at a specific spatial location, cluster, etc.
[0053] Thus, in some embodiments, the MRU filter may be applied to different spatial areas of the original flow cell (i.e., based on the total number of transcripts captured at each location) and not applied to other areas. That is, the MRU filter may be applied to certain spatial areas of a flow cell (e.g., high hopped areas) based on the number of transcripts captured at the area, and thus only applied to subsets having designated barcodes from the desired spatial areas.
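A spatially restricted variant might look like the following Python sketch. It rests on several assumptions that are not stated in the disclosure: the number of distinct molecules observed per barcode is used as a proxy for the number of transcripts captured at a location, low-capture locations are treated as the high hopped areas, and the names `select_low_capture_barcodes`, `mru_filter_in_areas`, and `max_molecules` are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical read representation: (spatial barcode, target, molecular identifier)
Read = Tuple[str, str, str]


def select_low_capture_barcodes(reads: List[Read], max_molecules: int) -> Set[str]:
    """Return barcodes whose count of distinct molecules is at or below
    `max_molecules` (assumed proxy for a low-capture, high hopped area)."""
    molecules_per_sbc = Counter(sbc for sbc, _, _ in set(reads))
    return {sbc for sbc, n in molecules_per_sbc.items() if n <= max_molecules}


def mru_filter_in_areas(
    reads: List[Read], target_barcodes: Set[str], threshold: int = 1
) -> List[Read]:
    """Apply the MRU threshold only to subsets whose barcode is designated;
    reads from all other barcodes pass through unchanged."""
    support = Counter(reads)
    return [
        read
        for read in reads
        if read[0] not in target_barcodes or support[read] > threshold
    ]


reads = [
    ("SBC_low", "geneA", "UMI1"),   # single read in a low-capture area
    ("SBC_high", "geneA", "UMI2"),
    ("SBC_high", "geneA", "UMI2"),
    ("SBC_high", "geneB", "UMI3"),
    ("SBC_high", "geneB", "UMI3"),
    ("SBC_high", "geneC", "UMI4"),  # single read in a high-capture area
]

targets = select_low_capture_barcodes(reads, max_molecules=1)
filtered = mru_filter_in_areas(reads, targets, threshold=1)
print(sorted(targets), len(filtered))
```

Here only the single-read molecule from the designated barcode is removed; the single-read molecule at the other barcode is left in place because the filter is not applied to that area.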
[0054] The MRU filter may be applied to any spatial or single-cell transcriptomics.
Smart MRU filter
[0055] In some embodiments of barcode hopping, and without intending to be limited by any particular theory, it is believed that there is a positively correlated linear relationship between the probability of barcode hopping of a specific barcode and the number of sequence reads including that specific barcode. For example, if a first SBC (SBC1) has 10-fold more reads than a second SBC (SBC2), SBC1 would have a 10-fold higher chance to be associated with a hopped molecule than SBC2.
[0056] If a sample is not deeply sequenced, it is possible that a molecule that is only supported by a single sequence read is an actual, true molecule correctly associated with a spatial barcode (rather than a false barcode hopped molecule not associated with the sample).
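The proportionality in this example can be worked through numerically. The sketch below assumes, purely for illustration, that the chance of a barcode appearing on a hopped molecule is directly proportional to its share of all reads; the function name `relative_hop_weight` and the read counts are hypothetical.

```python
from typing import Dict


def relative_hop_weight(reads_per_sbc: Dict[str, int]) -> Dict[str, float]:
    """Share of all reads carried by each barcode, used here as an assumed
    relative weight for that barcode appearing on a hopped molecule."""
    total = sum(reads_per_sbc.values())
    if total == 0:
        return {}
    return {sbc: n / total for sbc, n in reads_per_sbc.items()}


# SBC1 has 10-fold more reads than SBC2.
reads_per_sbc = {"SBC1": 1000, "SBC2": 100}
weights = relative_hop_weight(reads_per_sbc)
ratio = weights["SBC1"] / weights["SBC2"]
print(round(ratio, 6))
```

Under this assumption the ratio of the two weights equals the ratio of the read counts, matching the 10-fold figure in the example.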
[0057] Thus, in some embodiments (e.g., when a sample is not deeply sequenced, or is sequenced at a low sequencing depth), a Smart MRU filter may be used to filter sequencing data comprising a plurality of sequence reads resulting from nucleic acid sequencing. As with the MRU filter described above, subsets of the plurality of sequence reads that correspond to the same unique molecule (e.g., the same spatial barcode, target nucleic acid, and molecular identifier) are identified. The total number of sequence reads having a specific spatial barcode is determined (e.g., there may be multiple subsets of sequence reads (each subset representing a unique molecule) having the same specific barcode, with one or more of those subsets potentially representing a barcode hopped molecule).
[0058] The MRU filter is then only applied to a subset if the subset comprises a specific spatial barcode that occurred in a total number of sequence reads above a second threshold number (e.g., 2, 3, 4, 5, 10, etc.). All other subsets are maintained as true molecules in the filtered sequencing data.
[0059] In this way, molecules that are only supported by a low sequence read count (e.g., 1 count, 2 counts, etc.) that are actual, true molecules correctly associated with a barcode remain in the filtered sequencing data, while barcode hopped molecules associated with other specific barcodes may be removed.
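The two-threshold logic can be sketched as follows. This is a minimal Python illustration under stated assumptions: reads are already annotated as (spatial barcode, target, molecular identifier) tuples, and the name `smart_mru_filter`, the parameter names, and the example values are hypothetical rather than taken from the disclosure.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical read representation: (spatial barcode, target, molecular identifier)
Read = Tuple[str, str, str]


def smart_mru_filter(
    reads: List[Read], threshold: int = 1, second_threshold: int = 5
) -> List[Read]:
    """Apply the MRU threshold only to subsets whose spatial barcode occurs
    in more than `second_threshold` reads overall; keep all other subsets."""
    support = Counter(reads)                          # reads per unique molecule
    sbc_totals = Counter(sbc for sbc, _, _ in reads)  # reads per spatial barcode
    kept = []
    for read in reads:
        mru_applies = sbc_totals[read[0]] > second_threshold
        if mru_applies and support[read] <= threshold:
            continue  # low-support molecule on a heavily read barcode
        kept.append(read)
    return kept


reads = (
    [("SBC_hot", "geneA", "UMI1")] * 5
    + [("SBC_hot", "geneB", "UMI9")]   # single read on a barcode with 6 reads
    + [("SBC_cold", "geneC", "UMI3")]  # single read on a barcode with 1 read
)

filtered = smart_mru_filter(reads, threshold=1, second_threshold=5)
print(len(reads), len(filtered))
```

In this toy input the single-read molecule on the heavily read barcode is removed, while the single-read molecule on the sparsely read barcode is maintained as a true molecule.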
[0060] An example of an application of such a Smart MRU filter is shown in Fig. 8.
[0061] Thus, in some embodiments, the total number of reads of each specific barcode is determined, and the subsets are only filtered if they comprise a barcode that occurred in a number of sequence reads above a second threshold number (e.g., 2, 3, 4, 5, 10, etc.). The second threshold number may be determined based on the sequencing depth of the amplification reaction.
[0062] In some embodiments, the second threshold number is a predetermined number based on a sequencing depth of the nucleic acid sequencing reaction.
[0063] In some embodiments, the second threshold number is determined based on a sequencing depth of the nucleic acid sequencing reaction.
[0064] In other embodiments, the Smart MRU filter may be adjusted for different spatial locations (e.g., clusters) of the original substrate (e.g., flow cell) based on the type of potential barcode hopping that may be occurring. For example, as explained in Example 3 below, and without being limited by theory, in some embodiments using a dual nucleotide approach for capturing and preparing the sequencing library (e.g., as described in U.S. Provisional Patent Application No. 63 / 615,558), there appears to be an inverse correlation between barcode hopping and the total number of transcripts captured at a specific spatial location, cluster, etc.
[0065] Thus, in some embodiments, the Smart MRU filter may be applied to different spatial areas of the original flow cell based on the total number of transcripts captured at each location. That is, the Smart MRU filter may be applied to certain spatial areas of a flow cell (e.g., high hopped areas) based on the number of transcripts captured at the area, where the Smart MRU filter may be adjusted depending on the number of transcripts captured at each area (i.e., the second threshold number could be adjusted at different spatial areas and / or only applied to subsets having designated barcodes from the desired spatial areas).
[0066] The Smart MRU filter may be applied to any spatial and single-cell transcriptomics.
Computing Environment
[0067] Fig. 2 is a block diagram of an example computing environment 200 for filtering erroneous sequence reads caused by barcode hopping. The example computing environment 200 may include a sequencing device 210, a server device 220, a client device 230, a tissue sample 202, and a network 250.
[0068] The tissue sample 202 may be sectioned and placed in proximity to a slide (e.g., a flow cell) with thousands of barcoded locations, each containing millions of capture oligonucleotides with spatial barcodes unique to that location. During permeabilization, the tissue sample 202 may release mRNA which binds to capture oligonucleotides from a proximal location on the tissue 202. A reverse transcription reaction may occur while the tissue is still in place, generating a cDNA library that incorporates the spatial barcodes and preserves spatial information. Barcoded cDNA libraries are mapped back to a specific location within the flow cell.
[0069] Then the sequencing device 210 may image the flow cell having barcoded nucleotide fragments from the tissue sample 202. The sequencing device 210 may include a computing device, image sensors, and a sequencing application 212 for sequencing a genomic sample or other nucleic-acid polymer. In some versions, by executing the sequencing application 212 using a processor, the sequencing device 210 may analyze nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device 210.
[0070] More particularly, the sequencing device 210 may receive the flow cell with barcoded nucleotide fragments extracted from the tissue sample 202, and the sequencing device 210 may determine the nucleobase sequence of such extracted nucleotide fragments. The sequencing device 210 may be the sequencing system described in U.S. Patent Application No. 18 / 340,795, titled “Split-Read Alignment by Intelligently Identifying and Scoring Candidate Split Groups,” filed on July 23, 2023, which is hereby incorporated by reference in its entirety.
[0071] In some versions, the sequencing device 210 may utilize sequencing-by-synthesis (SBS) to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads. By executing the sequencing application 212, the sequencing device 210 may further store the nucleobase calls as part of base-call data (also referred to herein as “nucleic acid sequencing data”) that is formatted as a binary base call (BCL) file and send the BCL file to the server device 220. The sequencing device 210 may communicate the BCL file and / or other data to the server device 220 via one or more network(s) 250 or directly (e.g., bypassing the one or more network(s) 250).
[0072] As used herein, the term “cluster of oligonucleotides” (or simply “cluster(s)” or “DNA cluster(s)”) refers to a localized group or collection of DNA or RNA molecules on a nucleotide-sample slide, such as a flow cell, or other solid surface. In particular, a cluster includes tens, hundreds, thousands, or more copies of a cloned or the same DNA or RNA segment. For example, in one or more embodiments, a cluster includes a grouping of oligonucleotides immobilized in a section of a flow cell or other nucleotide-sample slide. In some embodiments, clusters are evenly spaced or organized in a systematic structure within a patterned flow cell. By contrast, in some cases, clusters are randomly organized within a non-patterned flow cell. A cluster of oligonucleotides can be imaged utilizing one or more light signals. For instance, an oligonucleotide-cluster image may be captured by a camera during a sequencing cycle of light emitted by irradiated fluorescent tags incorporated into oligonucleotides from one or more clusters on a flow cell.
[0073] The server device 220 receives the nucleic acid sequencing data from the sequencing device 210 including several sequence reads of nucleic acid molecules. Each molecule includes at least a spatial barcode (SBC), a target nucleic acid (e.g., DNA, cDNA, or RNA), and a molecular identifier. The molecular identifier may be a unique molecular identifier (UMI) and may be a physical molecular identifier (e.g., an exogenous or extrinsic molecular identifier) or a virtual molecular identifier (e.g., an endogenous or intrinsic molecular identifier).
[0074] The server device 220 may include a memory and one or more processors (CPUs). The memory can be a non-transitory memory and can include one or several suitable memory modules, such as random access memory (RAM), read-only memory (ROM), flash memory, other types of persistent memory, etc.
[0075] The memory may store an operating system (OS), which can be any type of suitable mobile or general-purpose operating system. The memory also stores a read filtering engine 222 that receives and filters the nucleic acid sequencing data using the MRU filter and / or the smart MRU filter described above.
[0076] More specifically, the read filtering engine 222 may identify subsets of the nucleic acid sequencing data corresponding to the same unique molecule (e.g., the same spatial barcode, target nucleic acid, and molecular identifier). By identifying these subsets, the read filtering engine 222 allows for the filtering of sequence reads that do not meet specific criteria, thereby reducing the impact of barcode hopping on sequencing data. This filtering process is based on the principle that most true molecules, under conditions of deep sequencing depth, may be supported by more than one sequence read. By contrast, barcode-hopped molecules, which typically appear in later cycles of amplification, may only be supported by a low number of reads. This distinction forms the basis for a filtering strategy that can significantly improve the accuracy of sequencing data.
[0077] For each subset, the read filtering engine 222 may apply the MRU filter by filtering the subset from the sequence reads if the subset does not include more than a threshold number of reads (e.g., 1). For example, if there is only one read of a unique molecule (e.g., having the same spatial barcode, target nucleic acid, and molecular identifier), the read filtering engine 222 filters the subset from the sequence reads. If the subset for a unique molecule includes multiple reads, the read filtering engine 222 does not filter the subset from the sequence reads.
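For illustration only, the MRU filter of the preceding paragraph may be sketched in Python as follows. The sketch assumes that each sequence read has already been reduced to its spatial barcode, target nucleic acid, and molecular identifier; the `Read` record, its field names, and the `mru_filter` function are hypothetical names introduced for this example and are not part of the read filtering engine 222 as disclosed.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical record for one sequence read (field names are assumptions)."""
    spatial_barcode: str
    target: str        # e.g., the gene or cDNA the read maps to
    molecular_id: str  # e.g., a UMI


def mru_filter(reads, threshold=1):
    """Keep only reads whose unique molecule has more than `threshold` reads."""
    # Reads with the same (barcode, target, identifier) triple compare equal,
    # so the Counter holds the read support of each unique molecule.
    support = Counter(reads)
    return [r for r in reads if support[r] > threshold]


reads = [
    Read("SBC1", "NPHS2", "UMI_A"),
    Read("SBC1", "NPHS2", "UMI_A"),  # second read of the same unique molecule
    Read("SBC2", "NPHS2", "UMI_B"),  # supported by one read only: filtered out
]
filtered = mru_filter(reads)
```

With the default threshold of 1, the two reads of the first molecule are retained and the single-read molecule is removed.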
[0078] To apply the smart MRU filter, the read filtering engine 222 may determine the sequencing depth of the nucleic acid sequencing reaction and / or the number of reads of each specific spatial barcode. The read filtering engine 222 may apply the MRU filter to molecules including a specific spatial barcode if the number of reads for the specific spatial barcode exceeds a threshold number of reads. Otherwise, the read filtering engine 222 does not apply the MRU filter to molecules including the specific spatial barcode.
[0079] In some implementations, the read filtering engine 222 determines the threshold number based on the sequencing depth of the nucleic acid sequencing reaction. For example, for each sequencing depth, the read filtering engine 222 may have a predetermined threshold number of reads to compare to the number of reads for a specific spatial barcode. In another example, the read filtering engine 222 may determine the threshold number of reads to compare to the number of reads for a specific spatial barcode as a function of the sequencing depth.
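For illustration only, the smart MRU filter may be sketched in Python as follows. The sketch assumes each read is a `(spatial_barcode, target, molecular_id)` tuple and that the per-barcode threshold is supplied by the caller (for example, looked up from a table keyed by sequencing depth); the function name `smart_mru_filter` and its parameters are hypothetical, and the disclosure does not prescribe a particular threshold value or function of sequencing depth.

```python
from collections import Counter


def smart_mru_filter(reads, barcode_threshold, molecule_threshold=1):
    """Apply the MRU filter only to spatial barcodes with enough reads.

    reads: iterable of (spatial_barcode, target, molecular_id) tuples
           (this tuple layout is an assumption of the sketch).
    barcode_threshold: per-barcode read count that must be exceeded before
           the MRU filter is applied to molecules carrying that barcode.
    """
    reads = list(reads)
    reads_per_barcode = Counter(r[0] for r in reads)
    reads_per_molecule = Counter(reads)
    kept = []
    for r in reads:
        if reads_per_barcode[r[0]] > barcode_threshold:
            # Deeply sequenced barcode: require multiple-read support.
            if reads_per_molecule[r] > molecule_threshold:
                kept.append(r)
        else:
            # Shallow barcode: the MRU filter is not applied.
            kept.append(r)
    return kept


reads = (
    [("SBC1", "NPHS2", "UMI_A")] * 3
    + [("SBC1", "NPHS2", "UMI_B")]   # singleton on a deep barcode: removed
    + [("SBC2", "NPHS2", "UMI_C")]   # singleton on a shallow barcode: kept
)
kept = smart_mru_filter(reads, barcode_threshold=2)
```

In this example the barcode SBC1 has four reads, which exceeds the per-barcode threshold of 2, so its single-read molecule is removed, whereas the single read of SBC2 is passed through unfiltered.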
[0080] In other implementations, the read filtering engine 222 does not apply a binary filter to each subset of unique molecules. Instead, the read filtering engine 222 may apply a relative weighting factor to a subset to reflect the relative likelihood of the molecule being a hop. For example, if a subset for a unique molecule includes 4 reads, the read filtering engine 222 may apply a relative weighting factor of one half to the subset to determine there are 2 reads in the subset. The filtered number of reads may be used when presenting the density of transcripts in a spatial transcriptomics map. A specific spatial barcode with a larger number of reads may have a higher probability of being a hop than one with a smaller number.
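For illustration only, the relative weighting described above may be sketched in Python as follows. The function name `weighted_read_counts` and the `weight_for` callable are hypothetical; how the weighting factor is chosen for a given subset is not specified here, so the example simply reuses the factor of one half from the preceding paragraph.

```python
from collections import Counter


def weighted_read_counts(reads, weight_for):
    """Return a weighted read count for each unique molecule.

    reads: iterable of (spatial_barcode, target, molecular_id) tuples
           (this tuple layout is an assumption of the sketch).
    weight_for: callable mapping a subset's raw read count to a relative
           weighting factor (the choice of this function is an assumption).
    """
    support = Counter(reads)
    return {molecule: n * weight_for(n) for molecule, n in support.items()}


# A unique molecule with 4 reads and a weighting factor of one half
# is treated as having 2 reads.
reads = [("SBC1", "NPHS2", "UMI_A")] * 4
weighted = weighted_read_counts(reads, weight_for=lambda n: 0.5)
```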
[0081] Then the server device 220 may generate indications of the filtered sequence reads for display to a user. For example, the server device 220 may generate a spatial transcriptomics map using the filtered sequence reads and their spatial barcode locations.
[0082] The server device 220 may then transmit the spatial transcriptomics map to the client device 230. The client device 230 may include a memory and one or more processors (CPUs). The memory can be a non-transitory memory and can include one or several suitable memory modules, such as random access memory (RAM), read-only memory (ROM), flash memory, other types of persistent memory, etc.
[0083] The memory may store an operating system (OS), which can be any type of suitable mobile or general-purpose operating system. The memory also stores a spatial transcriptomics application 232 which may present the spatial transcriptomics map via a user interface.
[0084] The spatial transcriptomics map may include a heat map of the filtered transcripts - the clusters of RNA molecules produced from DNA within the cells - at various locations within the tissue sample. The spatial transcriptomics map is generated by displaying the density of the filtered transcripts across the sample. The spatial transcriptomics heat map may use colors to indicate the concentration of transcripts, with warmer colors typically representing higher concentrations.
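For illustration only, the density values underlying such a heat map may be computed as sketched below in Python. The sketch assumes a lookup table `barcode_xy` from each spatial barcode to integer grid coordinates and counts each unique molecule once per location; the function name, the tuple layout, and the coordinate table are hypothetical, and rendering the grid with a color scale is left to any plotting tool.

```python
def transcript_density_grid(molecule_reads, barcode_xy, width, height):
    """Count unique molecules at each grid location.

    molecule_reads: iterable of (spatial_barcode, target, molecular_id) tuples
           remaining after filtering (tuple layout is an assumption).
    barcode_xy: dict mapping spatial barcode -> (x, y) grid coordinates
           (a hypothetical lookup table).
    """
    grid = [[0] * width for _ in range(height)]
    # set() collapses repeated reads so each unique molecule counts once.
    for barcode, _target, _molecular_id in set(molecule_reads):
        x, y = barcode_xy[barcode]
        grid[y][x] += 1
    return grid


filtered_reads = [
    ("SBC1", "NPHS2", "UMI_A"),
    ("SBC1", "NPHS2", "UMI_A"),
    ("SBC1", "NPHS2", "UMI_B"),
    ("SBC2", "NPHS2", "UMI_C"),
]
barcode_xy = {"SBC1": (0, 0), "SBC2": (1, 0)}
grid = transcript_density_grid(filtered_reads, barcode_xy, width=2, height=1)
```

Here the location of SBC1 holds two unique molecules and the location of SBC2 holds one, so a heat map would render the first location in a warmer color.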
[0085] In this manner, users can identify which transcript corresponds to which cell, determine cell types based on corresponding transcripts, assess transcript density within cells, identify elevated gene expression, and analyze cell-to-cell interactions. The client device 230 may also perform this analysis and provide an indication of the transcript density within cells, elevated gene expression in particular cells, cell-to-cell interactions, etc. for display to the user.
[0086] By filtering out barcode-hopped molecules, the present embodiments boost the signal-to-noise ratio in spatial transcriptomics and improve the accuracy of the spatial transcriptomics data. As a result, transcripts that are assigned the wrong spatial location on the flow cell are removed from the data set and are not presented to the user, for example in a spatial transcriptomics map.
Example Displays
[0087] Fig. 3 depicts example spatial transcriptomics maps 310-330 which may be presented on the client device 230 via the user interface. The spatial transcriptomics map 310 includes a heat map of the transcripts corresponding to the NPHS2 gene at various locations within the tissue sample, which may be a kidney tissue. The spatial transcriptomics map 310 indicates the density of reads before filtering. The spatial transcriptomics map 320 indicates the density of unique molecules having only 1 read that would be filtered out using the MRU filter. As shown in Fig. 3, 72% of the unique molecules have only 1 read and would be filtered out using the MRU filter. The spatial transcriptomics map 330 indicates the density of unique molecules having more than 1 read (e.g., after applying the MRU filter). As shown in Fig. 3, 28% of the unique molecules remain after applying the MRU filter. The spatial transcriptomics map 330 depicts the density of the transcripts corresponding to the NPHS2 gene after the noise has been filtered out. This is a more accurate depiction of the transcripts in the kidney tissue which allows for a more accurate analysis of cell types in the kidney tissue, the transcript density within cells, elevated gene expression within the tissue, etc.
Example Methods
[0088] Fig. 4 is a flow diagram of an example method 400 for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data. The method 400 utilizes the MRU filter. The method 400 may be implemented by a server device 220, and more specifically, a read filtering engine 222.
[0089] At block 402, the server device 220 obtains nucleic acid sequencing data including sequence reads of nucleic acid molecules. Each molecule includes a spatial barcode, a target nucleic acid, and a molecular identifier (e.g., a UMI).
[0090] At block 404, the server device 220 analyzes the sequence reads to categorize the sequence reads into subsets, where each subset corresponds to a unique molecule having the same spatial barcode, target nucleic acid, and molecular identifier. Then at block 406, the server device 220 determines whether a particular subset has more than a threshold number of reads (e.g., 1). If the particular subset does not have more than a threshold number of reads, the server device 220 filters the particular subset from the set of sequence reads (block 408). Otherwise, the server device 220 does not filter the particular subset from the set of sequence reads and determines whether each subset has been checked (block 410).
[0091] If each subset has not been checked, the server device 220 repeats this process for the next subset until each subset has been analyzed. Once each subset has been analyzed, the server device 220 provides the filtered set of sequence reads. For example, the server device 220 may provide an indication of the filtered set of sequence reads to present on the client device 230. The client device 230 may present a spatial transcriptomics map using the filtered set of sequence reads and their spatial barcode locations.
[0092] Fig. 5 is a flow diagram of another example method 500 for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data. The method 500 utilizes the smart MRU filter. The method 500 may be implemented by a server device 220, and more specifically, a read filtering engine 222.
[0093] At block 502, the server device 220 obtains nucleic acid sequencing data including sequence reads of nucleic acid molecules. Each molecule includes a spatial barcode, a target nucleic acid, and a molecular identifier (e.g., a UMI).
[0094] Then for a specific spatial barcode, the server device 220 determines the number of sequence reads having the specific spatial barcode (block 504). At block 506, the server device 220 determines whether the number of sequence reads for the specific spatial barcode exceeds a threshold number.
[0095] In some implementations, the server device 220 determines the threshold number based on the sequencing depth of the nucleic acid sequencing reaction. For example, for each sequencing depth, the server device 220 may have a predetermined threshold number to compare to the number of reads for a specific spatial barcode. In another example, the server device 220 may determine the threshold number to compare to the number of reads for a specific spatial barcode as a function of the sequencing depth.
[0096] If the number of sequence reads for the specific spatial barcode exceeds the threshold number, the server device 220 applies the MRU filter as described in method steps 406-410 of the method 400 (block 510). Otherwise, the server device 220 does not apply the MRU filter (block 508).
[0097] If each specific spatial barcode has not been checked, the server device 220 repeats this process for the next specific spatial barcode until each specific spatial barcode has been analyzed. Once each specific spatial barcode has been analyzed, the server device 220 provides the filtered set of sequence reads.
EXAMPLES
Example 1
[0098] In order to characterize the properties of barcode hopping, a species mixing experiment was conducted.
[0099] The experimental design is shown in the left insert of Fig. 6. First, a spatial assay was conducted in two wells with different mRNA input (either mouse RNA or RNA encoding GFP). Then, the samples from the two wells were mixed together either before PCR or after PCR. As shown in Fig. 6, for the Mix After PCR experiment, two experiments were sequenced (one with a free adaptor blocker (FAB) and one without a FAB). For the Mix Before PCR experiment, the PCR was either optimal PCR or over-cycle PCR, but FAB was used in each.
[0100] Finally, after sequencing the mixed samples, the results were mapped back to the barcoded cell (either Mouse RNA or GFP). As shown in the center of Fig. 6, GFP reads were mapped to the Mouse well and Mouse reads were mapped to the GFP well, thus indicating barcode hopping.
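For illustration only, one way to quantify barcode hopping in such a species mixing experiment is sketched below in Python. The sketch assumes that the hopping rate is the fraction of reads whose species of origin disagrees with the well of their spatial barcode; this definition, the function name `hopping_rate`, and the `barcode_well` lookup table are assumptions of the example and are not taken from the experiment as reported.

```python
def hopping_rate(reads, barcode_well):
    """Fraction of reads whose species does not match their barcode's well.

    reads: list of (spatial_barcode, species) pairs, where species is the
           reference the read mapped to (layout is an assumption).
    barcode_well: dict mapping spatial barcode -> species loaded in the well
           that barcode belongs to (a hypothetical lookup table).
    """
    if not reads:
        return 0.0
    hopped = sum(
        1 for barcode, species in reads if barcode_well[barcode] != species
    )
    return hopped / len(reads)


barcode_well = {"SBC_M1": "mouse", "SBC_G1": "gfp"}
reads = [
    ("SBC_M1", "mouse"),
    ("SBC_M1", "mouse"),
    ("SBC_G1", "gfp"),
    ("SBC_M1", "gfp"),  # GFP read carrying a mouse-well barcode: counted as hopped
]
rate = hopping_rate(reads, barcode_well)
```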
[0101] The results of the experiments are shown in the right portion of Fig. 6. As shown, using a free adaptor blocker (FAB) decreased the barcode hopping by 50% in the Mixing After PCR sample (i.e., some barcode hopping happened on the sequencer), suggesting that one possible source of barcode hopping is free truncated fragments with SBC.
[0102] As also shown, the Mix Before PCR showed much higher barcode hopping than Mix After PCR, suggesting that barcode hopping mainly happens during the PCR step in the library preparation. Moreover, the fact that barcode hopping occurred during the PCR reaction indicates that barcode hopping likely also happens intra-well (between the molecules originating from the same well). Finally, the optimal PCR showed a lower SBC hopping rate than the over-cycle PCR, suggesting that barcode hopping mainly happened in later PCR cycles.
[0103] Thus, without being bound by theory, barcode hopping appears to occur more frequently in the later-stage amplification (e.g., PCR) cycles of library preparation. For various reasons (e.g., reduced polymerase processivity in homopolymers, low Tm of A / T homopolymers, or depletion of PCR components), the polymerase potentially stalls at a homopolymer region between the barcode and the cDNA (especially a polyA / T region), generating short fragments with only an SBC. As shown in Fig. 7, these short fragments may hybridize to other molecules at a homopolymer region, and then be extended such that they have a new cDNA linked to the wrong barcode, resulting in a molecule with a swapped barcode and cDNA.
Example 2
[0104] Based on various species mixing experiments, it was discovered that at deep sequencing depth, most barcode hopping molecules are only supported by 1 sequence read, while most true molecules are supported by >1 sequence read. This is potentially due to real molecules being present in the original sample (thus having a high chance of being amplified by PCR) versus barcode hopped molecules mainly appearing in later cycles of PCR.
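For illustration only, the read support per unique molecule discussed above may be tabulated as sketched below in Python. The sketch assumes each read is a `(spatial_barcode, target, molecular_id)` tuple; the function names `read_support_histogram` and `singleton_fraction` are hypothetical and introduced only for this example.

```python
from collections import Counter


def read_support_histogram(reads):
    """Map 'reads per unique molecule' -> number of unique molecules."""
    per_molecule = Counter(reads)           # read support of each molecule
    return Counter(per_molecule.values())   # how many molecules share each support


def singleton_fraction(reads):
    """Fraction of unique molecules supported by exactly one read."""
    hist = read_support_histogram(reads)
    total = sum(hist.values())
    return hist[1] / total if total else 0.0


reads = (
    [("SBC1", "NPHS2", "UMI_A")] * 3
    + [("SBC1", "NPHS2", "UMI_B")]
    + [("SBC2", "NPHS2", "UMI_C")]
    + [("SBC3", "NPHS2", "UMI_D")]
)
hist = read_support_histogram(reads)
frac = singleton_fraction(reads)
```

In this toy input, three of the four unique molecules are supported by a single read, so those three are the molecules an MRU filter with a threshold of 1 would remove.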
[0105] A Multiple Reads Supported Unique Molecule (MRU) filter was applied to kidney spatial transcriptomics sequence data, where the filter was set to remove any sequence reads representing unique molecules with only 1 sequence read. Any sequence reads representing unique molecules having more than 1 sequence read were passed into the final filtered sequence data. The data was then used to generate a spatial transcriptomics map.
[0106] The results are shown in Fig. 3. As shown, applying such a filter significantly boosted the signal-to-noise ratio: the unique molecules supported by only 1 read contributed a homogeneous background, while the MRU-filtered data produced the desired spatial pattern.
Example 3
[0107] In order to characterize the properties of barcode hopping in an embodiment using a dual nucleotide approach for capturing and preparing the sequencing library (e.g., 63 / 615,558), an experiment was conducted. As shown in Fig. 10, the transcripts from two flow cells containing different tissue samples (rat liver and mouse brain, respectively) were mixed together before PCR.
[0108] Whole-flow-cell per-tile hopping rates were investigated. As shown in Fig. 11, the analysis revealed an inverse correlation between barcode hopping and the total number of transcripts contained in a specific spatial tile. The spatial area containing the mouse brain contained a much higher number of total transcripts per tile. However, as shown in the heatmap, the hopping rate was inversely correlated with the total transcript level per tile.
[0109] As shown in Fig. 12, the number of reads per transcript analysis revealed that most of the barcode hopped molecules were only supported by a single read. This occurred at both the low hopping rate tile and the high hopping rate tile.
[0110] Fig. 13 illustrates the dual oligo approach, including the mRNA capture and cDNA adapterization and then the bridging and extension. It is possible that some of the short fragments containing barcodes may pass through the SPRI step and then enter the PCR step.
[0111] Based on the results, a barcode hopping mechanism was proposed for this two oligo capture / library preparation approach. As illustrated in Fig. 9, the tiles having lower sensitivity / transcripts would have a higher fraction of unextended short fragments with spatial barcodes. Because the double-stranded hairpin (DHP) bridge has a very low Tm, during the early PCR cycles when the real PCR primer and dNTPs are plentiful, the short unextended fragments would be displaced. However, in the later PCR cycles when the real primer is almost exhausted, the short unextended fragments would have a higher chance to extend to form barcode-hopped molecules (illustrated as SBC2 with Payload 1 in Fig. 9).
Aspects
[0112] Aspects of the techniques described in the present disclosure may include any of the following aspects, either alone or in combination:
[0113] 1. A method for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, the method comprising: obtaining, by one or more processors, nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyzing, by the one or more processors, the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filtering, from the plurality of sequence reads, by the one or more processors, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0114] 2. The method of aspect 1, wherein the threshold number is one.
[0115] 3. The method of any of aspects 1-2, further comprising determining, by the one or more processors, the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
[0116] 4. The method of any of aspects 1-3, wherein the second threshold number is a predetermined number based on a sequencing depth of the nucleic acid sequencing reaction.
[0117] 5. The method of any of aspects 1-3, wherein the second threshold number is determined, by the one or more processors, based on a sequencing depth of the nucleic acid sequencing reaction.
[0118] 6. The method of any of aspects 1-5, further comprising: identifying, by the one or more processors, a sequencing depth of the nucleic acid sequencing reaction; and determining, by the one or more processors, whether to apply filtering to the plurality of sequence reads based on whether the sequencing depth exceeds a threshold sequencing depth.
[0119] 7. The method of any of aspects 1-6, wherein: in a first instance: filtering one or more of the subsets includes filtering the one or more subsets from the plurality of sequence reads in response to determining that the one or more subsets that do not comprise more than the threshold number of sequence reads and the sequencing depth exceeds the threshold sequencing depth; and in a second instance: not filtering the one or more subsets from the plurality of sequence reads in response to determining the sequencing depth does not exceed the threshold sequencing depth.
[0120] 8. The method of any of aspects 1-7, further comprising: presenting, by the one or more processors, indications of the filtered sequence reads for display to a user.
[0121] 9. The method of any of aspects 1-8, wherein presenting the indications of the filtered sequence reads includes: presenting, by the one or more processors, a spatial transcriptomics map using the filtered sequence reads.
[0122] 10. The method of any of aspects 1-9, wherein the spatial transcriptomics map is a heat map indicating a number of reads of target nucleic acid at a plurality of locations within the sample.
[0123] 11. The method of any of aspects 1-10, wherein filtering one or more of the subsets from the plurality of sequence reads includes: applying, by the one or more processors, a relative weighting factor to one of the subsets to filter a portion of the subset from the plurality of sequence reads according to the relative weighting factor.
[0124] 12. The method of any of aspects 1-11, wherein the nucleic acid sequencing data is deoxyribonucleic acid (DNA) sequencing data.
[0125] 13. The method of any of aspects 1-12, wherein the nucleic acid sequencing data is ribonucleic acid (RNA) sequencing data.
[0126] 14. The method of any of aspects 1-13, wherein the target nucleic acid is complementary DNA (cDNA).
[0127] 15. The method of any of aspects 1-14, wherein the sample is a tissue sample.
[0128] 16. The method of any of aspects 1-15, wherein the molecular identifier is a unique molecular identifier (UMI).
[0129] 17. The method of any of aspects 1-16, wherein the molecular identifier is a physical molecular identifier or a virtual molecular identifier.
[0130] 18. A computing device for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, the computing device comprising: one or more processors; and a non-transitory computer-readable memory storing instructions thereon that, when executed by the one or more processors, cause the computing device to: obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0131] 19. The computing device of aspect 18, wherein the instructions further cause the computing device to: determine the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
[0132] 20. The computing device of any of aspects 18-19, wherein the second threshold number is determined based on a sequencing depth of the nucleic acid sequencing reaction.
[0133] 21. The computing device of any of aspects 18-20, wherein the instructions further cause the computing device to: present, via a user interface, a spatial transcriptomics map using the filtered sequence reads.
[0134] 22. A non-transitory computer-readable memory storing instructions thereon that, when executed by one or more processors, cause the one or more processors to: obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
[0135] 23. The computer-readable memory of aspect 22, wherein the instructions further cause the one or more processors to: determine the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
[0136] 24. The computer-readable memory of any of aspects 22-23, wherein the second threshold number is determined based on a sequencing depth of the nucleic acid sequencing reaction.
[0137] 25. The computer-readable memory of any of aspects 22-24, wherein the instructions further cause the one or more processors to: present, via a user interface, a spatial transcriptomics map using the filtered sequence reads.
[0138] 26. The method of any of aspects 1-17, wherein the filtering is only applied to subsets having designated barcodes.
ADDITIONAL CONSIDERATIONS
[0139] Although the disclosure herein sets forth a detailed description of numerous different implementations, it should be understood that the legal scope of the description is defined by the words of the claims set forth at the end of this patent and equivalents. The detailed description is to be construed as exemplary only and does not describe every possible implementation since describing every possible implementation would be impractical. Numerous alternative implementations may be implemented, using either current technology or technology developed after the filing date of this patent, which would still fall within the scope of the claims.
[0140] The following additional considerations apply to the foregoing discussion. Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
[0141] Additionally, certain implementations are described herein as including logic or a number of routines, subroutines, applications, or instructions. These may constitute either software (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware. In hardware, the routines, etc., are tangible units capable of performing certain operations and may be configured or arranged in a certain manner. In example implementations, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.
[0142] The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example implementations, comprise processor-implemented modules.
[0143] Similarly, the methods or routines described herein may be at least partially processor implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented hardware modules. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example implementations, the processor or processors may be located in a single location, while in other implementations the processors may be distributed across a number of locations.
[0144] The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example implementations, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other implementations, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.
[0145] This detailed description is to be construed as exemplary only and does not describe every possible implementation, as describing every possible implementation would be impractical, if not impossible. A person of ordinary skill in the art may implement numerous alternate implementations, using either current technology or technology developed after the filing date of this application.
[0146] Those of ordinary skill in the art will recognize that a wide variety of modifications, alterations, and combinations may be made with respect to the above described implementations without departing from the scope of the invention, and that such modifications, alterations, and combinations are to be viewed as being within the ambit of the inventive concept.
[0147] The patent claims at the end of this patent application are not intended to be construed under 35 U.S.C. § 112(f) unless traditional means-plus-function language is expressly recited, such as “means for” or “step for” language being explicitly recited in the claim(s). The systems and methods described herein are directed to an improvement to computer functionality and improve the functioning of conventional computers.
Claims
CLAIMS
What is claimed is:
1. A method for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, the method comprising: obtaining, by one or more processors, nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyzing, by the one or more processors, the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filtering, from the plurality of sequence reads, by the one or more processors, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
2. The method of claim 1, wherein the threshold number is one.
3. The method of claim 1, further comprising determining, by the one or more processors, the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
4. The method of claim 3, wherein the second threshold number is a predetermined number based on a sequencing depth of the nucleic acid sequencing reaction.
5. The method of claim 3, wherein the second threshold number is determined, by the one or more processors, based on a sequencing depth of the nucleic acid sequencing reaction.
6. The method of claim 1, further comprising: identifying, by the one or more processors, a sequencing depth of the nucleic acid sequencing reaction; and determining, by the one or more processors, whether to apply filtering to the plurality of sequence reads based on whether the sequencing depth exceeds a threshold sequencing depth.
7. The method of claim 6, wherein: in a first instance: filtering one or more of the subsets includes filtering the one or more subsets from the plurality of sequence reads in response to determining that the one or more subsets do not comprise more than the threshold number of sequence reads and the sequencing depth exceeds the threshold sequencing depth; and in a second instance: not filtering the one or more subsets from the plurality of sequence reads in response to determining the sequencing depth does not exceed the threshold sequencing depth.
8. The method of claim 1, further comprising: presenting, by the one or more processors, indications of the filtered sequence reads for display to a user.
9. The method of claim 8, wherein presenting the indications of the filtered sequence reads includes: presenting, by the one or more processors, a spatial transcriptomics map using the filtered sequence reads.
10. The method of claim 9, wherein the spatial transcriptomics map is a heat map indicating a number of reads of target nucleic acid at a plurality of locations within the sample.
11. The method of claim 1, wherein filtering one or more of the subsets from the plurality of sequence reads includes: applying, by the one or more processors, a relative weighting factor to one of the subsets to filter a portion of the subset from the plurality of sequence reads according to the relative weighting factor.
12. The method of claim 1, wherein the nucleic acid sequencing data is deoxyribonucleic acid (DNA) sequencing data.
13. The method of claim 1, wherein the nucleic acid sequencing data is ribonucleic acid (RNA) sequencing data.
14. The method of claim 1, wherein the target nucleic acid is complementary DNA (cDNA).
15. The method of claim 1, wherein the sample is a tissue sample.
16. The method of claim 1, wherein the molecular identifier is a unique molecular identifier (UMI).
17. The method of claim 1, wherein the molecular identifier is a physical molecular identifier or a virtual molecular identifier.
18. A computing device for filtering erroneous sequence reads caused by barcode hopping from nucleic acid sequencing data, the computing device comprising: one or more processors; and a non-transitory computer-readable memory storing instructions thereon that, when executed by the one or more processors, cause the computing device to: obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
19. The computing device of claim 18, wherein the instructions further cause the computing device to: determine the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
20. The computing device of claim 19, wherein the second threshold number is determined based on a sequencing depth of the nucleic acid sequencing reaction.
21. The computing device of claim 18, wherein the instructions further cause the computing device to: present, via a user interface, a spatial transcriptomics map using the filtered sequence reads.
22. A non-transitory computer-readable memory storing instructions thereon that, when executed by one or more processors, cause the one or more processors to: obtain nucleic acid sequencing data comprising a plurality of sequence reads from a nucleic acid sequencing reaction for a sample comprising nucleic acid molecules, each molecule comprising a spatial barcode, a target nucleic acid, and a molecular identifier; analyze the plurality of sequence reads to identify subsets corresponding to the same spatial barcode, the same target nucleic acid, and the same molecular identifier; and filter, from the plurality of sequence reads, one or more of the subsets that do not comprise more than a threshold number of sequence reads.
23. The computer-readable memory of claim 22, wherein the instructions further cause the one or more processors to: determine the total number of sequence reads of each specific spatial barcode, wherein the filtering step is only applied to subsets comprising a specific spatial barcode when the total number of sequence reads for the specific spatial barcode in the plurality of sequence reads exceeds a second threshold number.
24. The computer-readable memory of claim 23, wherein the second threshold number is determined based on a sequencing depth of the nucleic acid sequencing reaction.
25. The computer-readable memory of claim 22, wherein the instructions further cause the one or more processors to: present, via a user interface, a spatial transcriptomics map using the filtered sequence reads.
26. The method of any of claims 1-6, wherein the filtering is only applied to subsets having designated barcodes.
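The filtering recited in claims 1 to 3 and the depth gate of claims 6 and 7 can be pictured with a short Python sketch. It is one illustrative reading of the claim language, not the applicant's implementation. The names `Read`, `mru_filter`, `filter_if_deep`, and their parameters are hypothetical, each read is assumed to have already been reduced to its spatial barcode, target, and molecular identifier, and the sequencing depth is assumed to be supplied as a single number because the claims do not define how it is measured.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional


class Read(NamedTuple):
    """One sequence read reduced to the three keys of claim 1 (hypothetical layout)."""
    spatial_barcode: str
    target: str  # identifier of the target nucleic acid, e.g. a gene name
    umi: str     # molecular identifier


def mru_filter(reads: Iterable[Read],
               threshold: int = 1,
               second_threshold: Optional[int] = None) -> List[Read]:
    """Drop every (barcode, target, UMI) subset with no more than `threshold` reads.

    If `second_threshold` is given, a subset is only eligible for removal when
    its spatial barcode has more than `second_threshold` reads in total (claim 3).
    """
    reads = list(reads)
    subset_counts = Counter(reads)  # reads per (barcode, target, UMI) subset
    barcode_totals = Counter(r.spatial_barcode for r in reads)

    kept = []
    for r in reads:
        eligible = (second_threshold is None
                    or barcode_totals[r.spatial_barcode] > second_threshold)
        if eligible and subset_counts[r] <= threshold:
            continue  # subset does not comprise more than `threshold` reads
        kept.append(r)
    return kept


def filter_if_deep(reads: Iterable[Read],
                   sequencing_depth: float,
                   depth_threshold: float,
                   **filter_kwargs) -> List[Read]:
    """Apply mru_filter only when the depth exceeds the threshold (claims 6 and 7)."""
    if sequencing_depth > depth_threshold:
        return mru_filter(reads, **filter_kwargs)
    return list(reads)  # second instance of claim 7: leave the reads unfiltered


example = [
    Read("A", "g1", "u1"),
    Read("A", "g1", "u1"),  # second read supporting the same molecule
    Read("A", "g1", "u2"),  # singleton subset at barcode A
    Read("B", "g1", "u1"),  # singleton subset at barcode B
]
print(len(mru_filter(example)))                      # 2
print(len(mru_filter(example, second_threshold=2)))  # 3
```

With the default threshold of one (claim 2), only the molecule supported by two reads survives. With a second threshold of two, barcode B has too few total reads to be filtered at all, so its singleton is retained.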