Improvement of Split Read Alignment by Intelligently Identifying and Scoring Candidate Split Groups

JP2025523520A5 · Pending · Publication Date: 2026-07-02 · ILLUMINA INC

Patent Information

Authority / Receiving Office: JP
Patent Type: Applications
Current Assignee / Owner: ILLUMINA INC
Filing Date: 2023-06-23
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing sequencing systems inaccurately identify and align split reads, leading to incorrect variant calls due to the inability to consider the overall fragment alignment geometry and relative positions of nucleotide reads, resulting in increased mismatched alignments and computational inefficiencies.

Method used

A split read alignment system that uses dynamic programming to generate and score candidate split groups, considering fragment alignment scores, break penalties, overlap penalties, and pair scores to select the most likely split group for accurate base calling.

Benefits of technology

Improves alignment accuracy and computational efficiency by accurately identifying split reads and reducing the need for multiple sequencing assays, enhancing the precision of nucleobase and variant calls.


Abstract

The present disclosure relates to a system, a non-transitory computer-readable medium, and a method for efficiently identifying and selecting split groups corresponding to one or more nucleotide reads. Generally, a split group includes a chain of fragments that form a split alignment of one read. The disclosed system utilizes dynamic programming to generate and evaluate candidate split groups. The disclosed system can generate a split group score for each of the candidate split groups. To generate the split group score, the disclosed system considers fragment alignment scores and the geometry of the fragment alignments within the candidate split groups. The disclosed system selects a predicted split group from the candidate split groups based on the split group scores.

Description

Technical Field

[0001] (Cross-Reference to Related Applications) This application claims the benefit and priority of U.S. Provisional Patent Application No. 63/367,002, entitled "IMPROVING SPLIT-READ ALIGNMENT BY INTELLIGENTLY IDENTIFYING AND SCORING CANDIDATE SPLIT GROUPS", filed on June 24, 2022. The above application is hereby incorporated by reference in its entirety into this specification.

Background Art

[0002] In recent years, biotechnology companies and research institutions have improved the hardware and software for sequencing nucleotides and determining nucleotide calls for genomic samples. For example, some existing sequencing machines and sequencing data analysis software (collectively referred to as "existing sequencing systems") predict individual nucleotides within a sequence by using conventional Sanger sequencing or sequencing-by-synthesis (SBS) methods. When using SBS, existing sequencing systems can monitor thousands of oligonucleotides being synthesized in parallel from a template to predict nucleotide calls for growing nucleotide reads. Cameras in many existing sequencing systems capture images of the irradiated fluorescent tags incorporated into the oligonucleotides. After capturing such images, some existing sequencing systems determine nucleotide calls for the nucleotide reads corresponding to the oligonucleotides and transmit the base-call data to a computing device equipped with sequencing data analysis software that aligns the nucleotide reads with a reference genome. Based on the differences between the aligned nucleotide reads and the reference genome, existing systems (e.g., variant callers) determine nucleotide calls for genomic regions and identify mutations in the genomic sample.

[0003] Despite these recent advances, existing sequencing systems often inaccurately identify split reads and align them to a reference genome and, as a result, either fail to determine a mutation or other nucleotide call or determine an inaccurate nucleotide call. Generally, a split read represents a nucleotide read that has one read fragment that maps (or aligns) to one region of the reference genome and one or more other read fragments that map (or align) to different regions of the reference genome. For example, nucleotide reads covering structural variations, different sides of deletions, different sides of gene fusions, or simply random mapping of read fragments can result in split reads. In fact, in a split read, one read fragment from a nucleotide read may best align to a genomic region on one chromosome, and another read fragment from the same nucleotide read may best align to a genomic region on a different chromosome. Such split read alignments on two different chromosomes (or different genomic regions on the same chromosome) can accurately reflect mutations in a genomic sample or can falsely suggest a split for reads that should align to a single genomic region, so existing sequencing systems have developed computational models for recognizing and distinguishing between accurate and inaccurate split read alignments.

[0004] Existing computational models can accurately recognize some split read alignments, but such computational models contain design flaws that routinely result in misidentification of split read alignments. For example, some existing sequencing systems determine a primary alignment for a split read based on the highest-scoring alignment of a single read fragment from candidate alignments of candidate read fragments. However, such existing sequencing systems cannot consider the possibility of split alignments and cannot account for how the alignments of multiple fragments score together against other candidate alignments. Further, many existing sequencing systems clip read fragments (or different ends of a read), thereby determining a primary alignment that leaves gaps between fragment alignments. To fill such gaps, some existing sequencing systems iteratively select additional fragment alignments that overlap the gaps. By simply filling the gaps without considering the fragment alignments together, such existing systems cannot consider the relative fragment positions or orientations of nucleotide reads with respect to the reference genome or other split alignment geometries.

[0005] Due in part to inaccuracies in read fragment alignment, existing sequencing systems often determine inaccurate variant calls or other base calls based on inaccurate split read alignments. For example, by prioritizing primary alignments without considering the overall fragment alignment from nucleotide reads, some existing sequencing systems incorrectly ignore fragment alignments that correctly reflect structural variations and may fill gaps that indicate deletions with other fragment alignments. Conversely, the primary alignment of a read fragment may, by itself, best map to an inaccurate genomic region of the reference genome. By prioritizing the primary alignment, some existing sequencing systems ignore the correct genomic regions better reflected by the alignment of multiple fragments from a nucleotide read, thereby resulting in false negative variant calls or otherwise incorrect variant calls. Thus, existing sequencing systems often misalign reads, inaccurately match reads, or miscall variants for a large number of samples, increasing the likelihood of mismatched alignments with reads from genomic samples.

[0006] To compensate for the inability of some existing sequencing systems to accurately detect split read alignments indicating structural variations, some existing systems perform both whole genome sequencing (WGS) using SBS (or other technologies) and microarrays using genotyping probes targeting specific structural variations. In fact, microarrays are specifically designed to target structural variations that are difficult to detect using existing sequencing devices. Such existing sequencing systems perform both WGS and multiple microarrays, sometimes using different specialized sequencing and microarray devices, increasing the computational processing and time required to accurately determine variant calls for single nucleotide polymorphisms (SNPs) and smaller insertions and deletions (indels) on the one hand, and structural variations on the other.

Summary of the Invention

[0007] The present disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that can solve one or more of the foregoing (or other) problems in the art. For example, the disclosed system can determine a score for the alignments of one or more fragments from nucleotide reads within a candidate split group, select a predicted split group from among the candidates based on such a score, and use it for base calling. In particular, the disclosed system can identify fragment alignments, each of which includes a candidate local alignment of a fragment of a read from a genomic sample with a reference genome. The disclosed system then groups such fragment alignments into candidate split groups and determines a split group score for each of these candidate split groups. Based on the split group scores, the disclosed system identifies a predicted split group from among the candidate split groups for use in base calling.

[0008] Additional features and advantages of one or more embodiments of the present disclosure are described in the following description, some of which will be apparent from the description, or can be learned by practice of such exemplary embodiments.

Brief Description of the Drawings

[0009] The detailed description describes one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.

Figure 1

Figure 2

Figure 3A

Figure 3B

Figure 4

Figure 5

Figure 6A

Figure 6B

Figure 7

Figure 8

Figure 9

Figure 10

Figure 11A

Figure 11B

Figure 11C

Figure 11D

Figure 12A

Figure 12B

Figure 12C

Figure 12D

Figure 13

Figure 14A

Figure 14B

Figure 15

Figure 16

Detailed Description of the Invention

[0010] The present disclosure describes one or more embodiments of a split read alignment system that can select a split group from among candidate split groups for read fragment alignment based on the generation and scoring of such candidate split groups. Generally, a split read alignment system identifies single-end reads or paired-end reads corresponding to a genomic region of a genomic sample and analyzes candidate split groups that together contain the alignments of one or more read fragments, rather than finding an isolated single fragment with the highest alignment score. More specifically, a split read alignment system can identify candidate local alignments of read fragments and chain the fragment alignments into candidate split groups. The split read alignment system scores the candidate split groups and selects a predicted split group for base calling based on the candidate split group scores.

[0011] As described above, the split read alignment system can determine candidate split groups. Generally, candidate split groups can include (i) one or more fragment alignments of single-end nucleotide reads, or (ii) one or more fragment alignments of paired-end nucleotide reads (mates) from a pair of paired-end nucleotide reads. In some embodiments, the split read alignment system efficiently determines candidate split groups by using dynamic programming. Generally, in dynamic programming, instead of explicitly enumerating all possible combinations of fragment alignments, the split read alignment system iterates from the outermost fragment alignment to the innermost fragment alignment to determine split groups and split group scores. By using dynamic programming, the split read alignment system effectively considers all possible or likely combinations of fragment alignments from nucleotide reads.
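
The dynamic programming idea in paragraph [0011] can be illustrated with a minimal sketch. It is not the disclosed implementation: the `FragmentAlignment` record, the flat `break_penalty`, and the overlap cost of one point per overlapping read base are all assumptions made for illustration, and the sketch walks a single read from its start toward its end as a stand-in for the outermost-to-innermost iteration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FragmentAlignment:
    """Hypothetical record: a local alignment of read bases [read_start, read_end)."""
    name: str
    read_start: int
    read_end: int
    score: int


def best_chain(frags: List[FragmentAlignment],
               break_penalty: int = 20) -> Tuple[int, List[str]]:
    """Return (score, names) of the best-scoring chain of fragment alignments.

    best[i] holds the best score of any chain ending at the i-th fragment.
    Each entry is built from earlier entries, so chains are never enumerated.
    """
    if not frags:
        return 0, []
    order = sorted(frags, key=lambda f: (f.read_start, f.read_end))
    best = [f.score for f in order]            # chain of one fragment
    prev: List[Optional[int]] = [None] * len(order)
    for i, cur in enumerate(order):
        for j in range(i):
            before = order[j]
            # Only chain fragments that advance along the read.
            if before.read_start >= cur.read_start or before.read_end >= cur.read_end:
                continue
            # Assumed cost: one point per read base covered by both fragments.
            overlap = max(0, before.read_end - cur.read_start)
            candidate = best[j] + cur.score - break_penalty - overlap
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    end = max(range(len(order)), key=lambda i: best[i])
    chain: List[str] = []
    k: Optional[int] = end
    while k is not None:                       # walk back-pointers to recover the chain
        chain.append(order[k].name)
        k = prev[k]
    return best[end], chain[::-1]
```

With fragments A (read bases 0 to 100, score 100), B (100 to 150, score 50), and C (0 to 150, score 120), the two-fragment chain A, B wins under the default penalty, while a larger `break_penalty` makes the single alignment C the better choice.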

[0012] The split read alignment system can further generate split group scores for the fragment alignments of candidate split groups. Generally, the split group score indicates the likelihood that the fragment alignments in a candidate split group represent a correct alignment with the reference genome. The split group score accounts for the likelihood of split alignments and split alignment geometries. Thus, by determining split group scores rather than just alignment scores for isolated fragment alignments, the split read alignment system improves the likelihood of selecting the correct fragment alignment or combination of fragment alignments to complete the template.

[0013] In some embodiments, the split read alignment system generates a split group score for a candidate split group based on one or more of (i) a fragment alignment score, (ii) a break penalty, (iii) an overlap penalty, or other penalties for fragment alignments within the candidate split group. As part of the split group score, for example, the split read alignment system determines a fragment alignment score for individual fragments of the candidate split group. As an additional part of the split group score, in some embodiments, the split read alignment system determines a break penalty for the relative geometry of fragment alignments within the candidate split group (e.g., to penalize breaks between fragment alignments). As yet another part of the split group score, in certain embodiments, the split read alignment system determines an overlap penalty for overlaps between fragment alignments within the candidate split group. As described below, the split read alignment system can determine the split group score by combining (i), (ii), and (iii).
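
As a hedged illustration of combining components (i), (ii), and (iii), the following sketch assumes the simplest possible combination: the sum of the fragment alignment scores minus the break and overlap penalties at each junction between consecutive fragment alignments. The disclosure does not state this formula in the present section, so the additive form and the function name `split_group_score` are assumptions.

```python
from typing import Sequence


def split_group_score(
    fragment_scores: Sequence[float],
    break_penalties: Sequence[float],
    overlap_penalties: Sequence[float],
) -> float:
    """Combine the three components into one score.

    Assumed form: sum of fragment alignment scores minus all penalties,
    with one break penalty and one overlap penalty per junction between
    consecutive fragment alignments in the group.
    """
    junctions = max(len(fragment_scores) - 1, 0)
    if len(break_penalties) != junctions or len(overlap_penalties) != junctions:
        raise ValueError("expected one break and one overlap penalty per junction")
    return sum(fragment_scores) - sum(break_penalties) - sum(overlap_penalties)
```

For example, two fragment alignments scoring 100 and 95, with a break penalty of 20 and an overlap penalty of 50 at their junction, give a split group score of 125 under this assumed combination.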

[0014] For paired-end nucleotide reads, a split read alignment system can also identify and score candidate pairs of split groups. Generally, in certain embodiments, the split read alignment system further considers and determines a pair score for paired-end mates to identify a likely split group from among candidate split groups of paired-end mates. For example, the split read alignment system can sum split group scores for each candidate pair of split groups from paired-end mates and estimate the insert size between the innermost fragment alignments of candidate pairs of split groups. The split read alignment system can then generate a pair score for candidate pairs of split groups based on the summed split group scores and the estimated insert size. By way of example, the split read alignment system can include a pair score penalty for unlikely estimated insert sizes.
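
A minimal sketch of a pair score follows. It sums the two split group scores and subtracts a penalty that grows with how far the estimated insert size lies from an expected value. The penalty shape (a capped multiple of the distance in standard deviations) and every parameter value (`mean_insert`, `sd_insert`, `weight`, `max_penalty`) are hypothetical and chosen only for illustration.

```python
def pair_score(
    score_mate1: float,
    score_mate2: float,
    insert_size: float,
    mean_insert: float = 400.0,   # hypothetical expected insert size
    sd_insert: float = 100.0,     # hypothetical standard deviation
    weight: float = 10.0,         # hypothetical penalty per standard deviation
    max_penalty: float = 60.0,    # hypothetical cap on the penalty
) -> float:
    """Sum of the mates' split group scores minus an insert-size penalty."""
    deviations = abs(insert_size - mean_insert) / sd_insert
    penalty = min(max_penalty, weight * deviations)
    return score_mate1 + score_mate2 - penalty
```

Under these assumed parameters, a candidate pair with split group scores of 130 and 140 keeps its full sum of 270 at the expected insert size, drops to 250 at an insert size of 600, and bottoms out at 210 for an implausibly large insert size.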

[0015] In addition to scoring and selecting split groups, in some embodiments, the split read alignment system can further identify fragment alignments that align with alternative contiguous sequences in the reference genome and use split groups to report the corresponding split alignments. When the split read alignment system determines, based on split group scoring, that a nucleotide read best aligns to an alternative contiguous sequence, in some embodiments, the split read alignment system reports the split alignment in the primary assembly that corresponds to the alternative contiguous sequence by virtue of a lift-over relationship. For example, in some cases, the split read alignment system determines an alternative contig fragment alignment score for a fragment alignment of a nucleotide read with an alternative contiguous sequence representing a structural variant. The split read alignment system can also determine a split group score for the corresponding split alignment of the fragment alignments with the primary assembly of the reference genome. The split read alignment system can then use the higher-scoring alternative contig fragment alignment score as a replacement split group score to guide the selection of the corresponding split group over other candidate split groups. For example, if the alternative contig fragment alignment score exceeds the split group scores of other candidate split groups, the split read alignment system selects and reports the split alignment in the primary assembly that corresponds to the alternative contiguous sequence rather than the split alignment represented by other candidate split groups that may have scored better in the absence of the alternative contig fragment alignment score.
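
The replacement-score behavior in paragraph [0015] can be sketched as follows. The sketch assumes that each candidate split group has a name, that a lift-over lookup has already associated some candidates with an alternative contig fragment alignment score, and that the higher of the two scores is used for selection. The function name and dictionary layout are hypothetical.

```python
from typing import Dict


def select_split_group(
    split_group_scores: Dict[str, float],
    alt_contig_scores: Dict[str, float],
) -> str:
    """Pick the candidate with the best effective score.

    For a candidate whose split alignment corresponds (by lift-over) to an
    alternative contig, the alternative contig fragment alignment score
    replaces its split group score when it is higher (an assumed rule).
    """
    effective = {
        name: max(score, alt_contig_scores.get(name, score))
        for name, score in split_group_scores.items()
    }
    return max(effective, key=lambda name: effective[name])
```

In this sketch, a candidate that scores 110 as a split alignment in the primary assembly is still selected over a competitor scoring 125 when its corresponding alternative contig alignment scores 140.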

[0016] As described above, based on one or both of the split group score and the pair score, the split read alignment system selects a predicted split group from the candidate split groups for use in nucleobase calling. For example, in some embodiments, the split read alignment system selects a predicted split group that has the highest split group score for each mate of a nucleotide read pair. In another example, the split read alignment system selects a predicted split group for each mate of a nucleotide read pair according to the highest pair score among all pair scores generated from pairs of scored split groups. As a result of selecting the predicted split group, the split read alignment system improves the accuracy of nucleobase calls and predicted variant calls in an output file (e.g., a variant call file).

[0017] As suggested above, the split read alignment system provides several technical advantages and benefits over existing sequencing systems and methods. For example, the split read alignment system improves the alignment accuracy of split reads over existing sequencing systems by considering the possibility of split alignments within various candidate split groups corresponding to nucleotide reads. By determining a split group score for a candidate split group that includes fragment alignments from fragments of nucleotide reads and selecting a predicted split group from among the candidates based on such split group scores, the split read alignment system identifies fragment alignments for split reads with higher accuracy than existing sequencing systems. As shown in FIGS. 11A-11D, for example, the split read alignment system determines better mapping and alignment for transcriptome reads and more accurate true negative variant calls for candidate gene fusion events than existing sequencing systems. As shown in FIGS. 12A-12D, the split read alignment system also determines better mapping and alignment for nucleotide reads on the genomic region of chromosome M for mitochondrial DNA, resulting in improved coverage compared to existing sequencing systems. Instead of simply finding a primary alignment for a single fragment with the highest alignment score, the split read alignment system considers and scores candidate fragment alignments from nucleotide reads together as part of a split group.

[0018] In addition to considering fragment alignments together rather than in isolation, in certain embodiments, a split read alignment system also improves the accuracy of split read alignments through improvements over other computational models. For a given split group, for example, the split read alignment system determines a break penalty for the relative geometry of the fragment alignments in the candidate split group. In some cases, the split read alignment system efficiently identifies and scores such split groups and rapidly identifies likely split read alignments by using dynamic programming to comprehensively consider candidate split groups. For each candidate split group, in some embodiments, the split read alignment system generates a split group score based on the fragment alignment score, the break penalty, and the overlap penalty, thereby globally evaluating the likelihood that a given candidate split group contains correct fragment alignments.

[0019] Due in part to the improved split read alignment, the split read alignment system also improves the accuracy of the corresponding nucleobase calls. Based on the more accurate split read alignment, the split read alignment system can accurately identify and report split alignments when reads align with alternative contiguous sequences. The split read alignment system can report split alignments in the primary assembly corresponding to alternative contiguous sequences to further guide the selection of predicted split groups. Because of the improved alignment, the split read alignment system can also determine more accurate variant calls or other nucleobase calls with a higher confidence rate than existing sequencing systems. As shown in FIGS. 11A-11D, for example, the split read alignment system determines true negative variant calls for candidate gene fusion events more accurately than existing sequencing systems. In addition, as shown in FIGS. 13 and 14A-14B, the split read alignment system determines SNP calls, indel calls, and variant calls that are more accurate than those of existing sequencing systems.

[0020] Beyond improved alignment and improved base calling accuracy, in some embodiments, a split read alignment system improves computational efficiency by reducing the number of sequencing assays and computing devices used to determine structural variant calls. As noted above, some existing sequencing systems consume significant computer processing and time by performing both (i) whole genome sequencing (WGS) on a specialized sequencing device to generate nucleotide reads for a genomic sample, and (ii) multiple genotyping microarrays on a microarray device. By comparing nucleotide reads to a reference genome for WGS and analyzing optical signals from DNA probes in the microarray, existing sequencing systems can determine accurate variant calls for both SNPs and smaller indels based on the reference genome on the one hand, and targeted structural variants from DNA probes on the other hand. In contrast to such existing sequencing systems, in some embodiments, a split read alignment system uses a specialized sequencing device to determine nucleotide reads and uses candidate split groups, with fewer or no genotyping microarrays for targeted structural variants, to determine variant calls corresponding to structural variants or primary assembly regions of the reference genome, facilitating a more computationally efficient approach. Thus, a split read alignment system can obviate some or all of the genotyping microarrays for structural variants by determining split group scores for candidate split groups that include fragment alignments from fragments of nucleotide reads and selecting a predicted split group from among the candidates based on such split group scores.

[0021] As shown by the foregoing discussion, the present disclosure utilizes various terms to describe the features and advantages of the split read alignment system. Here, further details regarding the meaning of such terms are provided. For example, as used herein, the term "nucleotide read" (or simply "read") refers to the inferred sequence of one or more nucleobases (or nucleobase pairs) from all or part of a sample nucleotide sequence (e.g., a sample genomic sequence, cDNA). In particular, a nucleotide read includes the sequence of nucleotide calls determined or predicted for a nucleotide sequence (or a group of monoclonal nucleotide sequences) from a sample library fragment corresponding to a genomic sample. For example, in some cases, a sequencing device determines a nucleotide read by generating nucleotide calls for nucleobases that have passed through nanopores of a nucleotide-sample slide, determined via fluorescent tagging, or determined from clusters within a flow cell.

[0022] A nucleotide read can include both genomic nucleotide reads based on a DNA sequence and transcriptome nucleotide reads based on ribonucleic acid (RNA). As used herein, the term "genomic read" refers to a nucleotide read representing the inferred sequence of nucleobases (or nucleobase pairs) derived from genomic DNA (gDNA) extracted from a sample. For example, a genomic read includes (i) a read of gDNA extracted from a sample, or of a fragment derived from such gDNA, and (ii) a read that includes a portion of a sample library fragment corresponding to the sample. In some cases, a genomic read includes a read that includes an adapter sequence for an assay for transposase-accessible chromatin (ATAC), also called an ATAC read. In some embodiments, a genomic read can include, but is not limited to, DNase I hypersensitive site (DNase) sequencing reads, formaldehyde-assisted isolation of regulatory elements (FAIRE) sequencing reads, or Tet-assisted bisulfite (TAB) sequencing reads.

[0023] Conversely, as used herein, the term "transcriptome read" refers to a nucleotide read that represents a putative sequence of nucleobases (or nucleobase pairs) that complement or represent the RNA extracted from a sample. For example, a transcriptome read can include (i) cDNA synthesized from single-stranded messenger RNA (mRNA) or microRNA (miRNA), or derived from RNA extracted from a sample, and (ii) a read that includes a portion of a sample library fragment corresponding to the sample. As a further example, a transcriptome read can include (i) RNA extracted from or derived from RNA extracted from a sample, and (ii) a read that includes RNA (e.g., mRNA, miRNA, transfer RNA (tRNA)) that is a portion of a sample library fragment corresponding to the sample.

[0024] Furthermore, as used herein, the term "genomic coordinate" refers to a specific location or position of a nucleotide base within a genome (e.g., the genome of an organism or a reference genome). In some cases, a genomic coordinate can include an identifier for a specific chromosome of the genome and an identifier for the position of a nucleotide base within the specific chromosome. For example, a genomic coordinate (singular or plural) can include a number, name, or other identifier of a chromosome (e.g., chr1 or chrX), and a specific position (singular or plural) such as a numbered position following the identifier of the chromosome (e.g., chr1:1234570 or chr1:1234570~1234870). Additionally, in certain implementations, a genomic coordinate can refer to the source of the reference genome (e.g., mt for a mitochondrial DNA reference genome, or SARS-CoV-2 for the reference genome of the SARS-CoV-2 virus), and the position of the nucleotide base within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). In contrast, in certain cases, a genomic coordinate can refer to the position of a nucleotide base within the reference genome without reference to a chromosome or source (e.g., 29727).
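
The coordinate notations in paragraph [0024] can be parsed with a short sketch. The `GenomicCoordinate` type and `parse_coordinate` function are hypothetical names, and the sketch assumes exactly the three textual forms shown in the examples: `source:position`, `source:start~end`, and a bare position.

```python
from typing import NamedTuple, Optional


class GenomicCoordinate(NamedTuple):
    """Hypothetical container: chromosome or source name, plus a position range."""
    source: Optional[str]
    start: int
    end: int


def parse_coordinate(text: str) -> GenomicCoordinate:
    """Parse 'chr1:1234570', 'chr1:1234570~1234870', 'mt:16568', or '29727'."""
    # Split on the last colon so the source name itself may contain hyphens.
    source, sep, position = text.rpartition(":")
    start_text, _, end_text = position.partition("~")
    start = int(start_text)
    end = int(end_text) if end_text else start
    return GenomicCoordinate(source if sep else None, start, end)
```

A single position is represented as a range whose start and end are equal, and a coordinate given without a chromosome or source has `source` set to `None`.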

[0025] As used herein, the term "genomic region" refers to a range of genomic coordinates. Similar to genomic coordinates, in certain embodiments, a genomic region can be identified by an identifier for a chromosome and a specific position(s), e.g., a numbered position following an identifier for a chromosome (e.g., chr1:1234570~1234870). In various embodiments, genomic coordinates include positions within a reference genome. In some cases, genomic coordinates are specific to a particular reference genome.

[0026] Also, as used herein, the term "genomic sample" refers to a target genome or a portion of a genome that is subjected to sequencing. For example, a genomic sample can include the sequence of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a genomic sample includes a nucleic acid polymer that is (in whole or in part) isolated or extracted from a sample organism and that is composed of nitrogenous heterocyclic bases. For example, the nucleic acid polymer can include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or segments of chimeric or hybrid forms of nucleic acids described hereinbelow. In some cases, a genomic sample is prepared or isolated by a kit and is found in a sample received by a sequencing device.

[0027] As used herein, the term "split group" refers to a group of one or more fragment alignments corresponding to a nucleotide read. In particular, a split group includes a chain of one or more fragment alignments that form a split alignment of one nucleotide read with respect to a reference genome. For example, a split group can include fragment alignments of one or more fragments of a nucleotide read. Such fragment alignments can represent read fragments from a single-end nucleotide read or alignments of paired-end nucleotide reads (e.g., mates) from a pair of paired-end nucleotide reads. In this context, the term "candidate split group" refers to a potential split group of fragment alignments for one nucleotide read.

[0028] Furthermore, the term "predicted split group" refers to a split group selected to represent an alignment of nucleotide reads. In particular, the predicted split group includes the split group having the highest split group score among the candidate split groups corresponding to the nucleotide reads. Thus, in some embodiments, the predicted split group represents a prediction that the corresponding split alignment is most likely to represent the true alignment of the nucleotide reads with the reference genome. For example, in certain situations described below, the predicted split group may represent a split read alignment corresponding to a true structural variation in the sequenced genomic sample.

[0029] As used herein, the term "split group score" refers to a numerical score, metric, or other quantitative measure indicating the accuracy of the fragment alignments in a split group. In particular, the split group score indicates the likelihood that a given split alignment of one or more fragment alignments of a candidate split group is correct with respect to the reference genome. For example, as described below, the split group score may reflect a combination of fragment alignment scores, break penalties, overlap penalties, and possibly gap penalties for the fragment alignments within the split group.

[0030] As used herein, the term "fragment alignment" refers to a candidate local alignment of a given fragment of a nucleotide read with respect to the reference genome. For example, the fragment alignment indicates the genomic region or genomic coordinates of the reference genome to which the fragment of the read aligns.

[0031] As further used herein, the term "alignment score" refers to a numerical score, metric, or other quantitative measure that assesses the accuracy of an alignment between a nucleotide read or a fragment of a nucleotide read and another nucleotide sequence from a reference genome. In particular, an alignment score includes a metric that indicates the degree to which the nucleotides of a nucleotide read (or a fragment of a nucleotide read) match or are similar to a reference sequence or an alternative contiguous sequence from the reference genome. In certain implementations, the alignment score takes the form of a Smith-Waterman score for a local alignment, or a variation or version of the Smith-Waterman score (e.g., various settings or configurations used by DRAGEN by Illumina, Inc. for Smith-Waterman scoring). Relatedly, the term "fragment alignment score" refers to an alignment score for a fragment alignment of a nucleotide read. Thus, in a split group that includes multiple fragment alignments, a fragment alignment score can be determined for each fragment alignment within the split group.
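
For readers unfamiliar with Smith-Waterman scoring, the following is a minimal textbook sketch that returns the best local alignment score between a read (or read fragment) and a reference sequence. It uses a linear gap penalty, and the `match`, `mismatch`, and `gap` values are illustrative assumptions; they are not the settings used by DRAGEN or by the disclosed system.

```python
def smith_waterman_score(read: str, ref: str,
                         match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between read and ref (linear gap penalty)."""
    prev = [0] * (len(ref) + 1)   # previous row of the scoring matrix
    best = 0
    for r in read:
        cur = [0]                 # column 0 is always 0 for local alignment
        for j, c in enumerate(ref, start=1):
            diag = prev[j - 1] + (match if r == c else mismatch)
            # A cell never drops below 0, which lets an alignment start anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Only two rows of the matrix are kept because the sketch reports the score alone; recovering the aligned coordinates would require storing the full matrix or back-pointers.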

[0032] In this context, the term "alternative contiguous sequence" (or simply "alternative contig") refers to a contiguous sequence that represents a population haplotype that is added to (e.g., lifted over to) a linear reference genome (or other reference genome) at a particular genomic coordinate or genomic coordinates. In some implementations, a graph reference genome can include alternative contiguous sequences that are mapped to genomic coordinates of a primary assembly for the linear reference genome. For example, an alternative contiguous sequence can represent a population haplotype that includes a structural variant having a lift over to two or more genomic coordinates in a linear reference genome that correspond to two or more sides of a structural variant break end. In some cases, a hash table of the graph reference genome includes identifiers that associate alternative contiguous sequences that represent structural variant haplotypes with genomic coordinates that represent reference haplotypes from a primary assembly of the linear reference genome.

[0033] In this regard, the term "alternative contig fragment alignment score" refers to an alignment score for an alignment between one or more read fragments and an alternative contiguous sequence. In particular, the alternative contig fragment alignment score can include an alignment score for an alignment between one or more inner read fragments and one or more outer read fragments of a nucleotide read and an alternative contiguous sequence. As will be described below, the alternative contig fragment alignment score can, in certain circumstances, replace or function as a split group score.

[0034] As further used herein, the term "break penalty" refers to a numerical score, metric, or other quantitative measure that imposes a penalty on a fragment alignment within a split group that indicates a break between fragment alignments. In particular, the break penalty can include a metric that imposes a penalty on a fragment alignment of a split group to the extent (or in proportion) that the fragment alignment indicates a break in the nucleotides between fragment alignments at a breakpoint. Thus, in some embodiments, the split read alignment system determines a relatively high break penalty for breaks between fragment alignments of relatively large size or distance or within a fragment alignment.

[0035] In this regard, the term "breakpoint" refers to a break or space between a nucleotide read and / or fragments of a nucleotide read where the nucleotide read aligns to different positions within a reference genome. For example, split alignments include breakpoints because fragments of a nucleotide read can indicate a highest score alignment (e.g., highest pair score) with a reference genome when aligning to different positions that have a break or breakpoint between the fragments of the nucleotide read.

[0036] As further used herein, the term "overlap penalty" refers to a numerical score, metric, or other quantitative measure that imposes a penalty on fragment alignments of a split group that overlap within a nucleotide read. In particular, the overlap penalty can include a metric that imposes a penalty on the fragment alignments of a split group to the extent (or in proportion) that the fragment alignments cover the same nucleotide bases within the nucleotide read. For example, a 150-base pair nucleotide read can have at least two fragment alignments. The first fragment alignment can align the leftmost 100 base pairs to one chromosome (e.g., Chr1) within a reference genome, and the second fragment alignment can align the rightmost 100 base pairs to a different chromosome (e.g., Chr2). Although these exemplary fragment alignments do not overlap within the reference genome, the first and second fragment alignments overlap by 50 base pairs within the nucleotide read (100 + 100 - 150 = 50). Thus, the overlap penalty can represent a metric that imposes a penalty on such a 50-base pair overlap (or other overlap of nucleotide bases) within the nucleotide read.

[0037] As further used herein, the term "gap penalty" refers to a numerical score, metric, or other quantitative measure that imposes a penalty on a pair of fragment alignments based on gaps between pairs of fragment alignments within a nucleotide read. In particular, the gap penalty can include a metric that imposes a penalty on fragment alignments of a split group to the extent (or in proportion) to the size of the gap that exists between fragment alignments within a nucleotide read. For example, a 150 base pair nucleotide read can have at least two fragment alignments. A first fragment alignment can align the leftmost 50 base pairs to a first set of genomic coordinates of a reference genome, and a second fragment alignment can align the rightmost 50 base pairs to a second set of genomic coordinates of the reference genome. In contrast to the above overlap example, the nucleotide read can include a 50 base pair gap within the nucleotide read between a first fragment corresponding to the first fragment alignment and a second fragment corresponding to the second fragment alignment. Thus, the gap penalty can represent a metric that imposes a penalty on such a 50 base pair gap between the first and second fragment alignments within the nucleotide read.
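The overlap and gap examples reduce to interval arithmetic in read coordinates. The following is a minimal sketch of that arithmetic only; the function name `read_overlap_and_gap` and the half-open interval convention are illustrative assumptions and are not taken from the disclosure.

```python
def read_overlap_and_gap(first, second):
    """Return (overlap, gap) in bases between two fragments of one read.

    Each fragment is a half-open interval (start, end) in read coordinates.
    The interval convention is an assumption made for this illustration.
    """
    (s1, e1), (s2, e2) = sorted([first, second])
    # Bases of the read covered by both fragments.
    overlap = max(0, min(e1, e2) - s2)
    # Bases of the read lying between the fragments and covered by neither.
    gap = max(0, s2 - e1)
    return overlap, gap


# Overlap example: leftmost 100 and rightmost 100 bases of a 150-base read.
overlap_case = read_overlap_and_gap((0, 100), (50, 150))
# Gap example: leftmost 50 and rightmost 50 bases of a 150-base read.
gap_case = read_overlap_and_gap((0, 50), (100, 150))
print(overlap_case, gap_case)
```

An overlap penalty or gap penalty could then be made proportional to the first or second returned value, respectively.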

[0038] As used herein, the term "split alignment" refers to the alignment of different fragments of a read to different regions in a reference genome. For example, a split alignment can refer to a split read or a chimeric alignment.

[0039] As further used herein, the term "pair score" refers to a numerical score, metric, or other quantitative measure that assesses the accuracy of an alignment between a candidate pair of split groups and another nucleotide sequence from a reference genome. In particular, a pair score includes a metric that indicates the extent to which a candidate pair of split groups is accurately aligned with a nucleotide sequence from a reference genome. More specifically, in some embodiments, the pair score indicates the likelihood that a candidate pair of split groups contains the true mate of a paired-end nucleotide read. Indeed, in some embodiments, the pair score represents the total of the split group scores for the split groups of the candidate pair minus a pairing penalty.

[0040] As used herein, the term "pairing penalty" refers to a numerical score, metric, or other quantitative measure that penalizes pairs of fragment alignments that are unlikely to be mates of paired-end reads. In particular, the term "pairing penalty" refers to a metric that indicates the likelihood or unlikelihood that fragment alignments are correctly paired based on the geometry of two or more fragment alignments with respect to a reference genome. For example, the pairing penalty can represent a log likelihood, or a logP value of the insert size between two innermost fragment alignments based on an empirical insert distribution.
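As a rough illustration of how a pair score could combine split group scores with a pairing penalty, the sketch below derives the penalty from the negative log10 probability of an insert size under an empirical histogram. The function names, the toy histogram, the probability floor, and the assumption that the penalty is in the same units as the split group scores are all hypothetical choices made for this example.

```python
import math


def pairing_penalty(insert_size, insert_counts, floor=1e-6):
    """Negative log10 probability of an insert size under an empirical histogram.

    insert_counts maps insert size -> observed count (toy data here).
    floor is an assumed minimum probability so unseen sizes stay finite.
    """
    total = sum(insert_counts.values())
    probability = insert_counts.get(insert_size, 0) / total
    return -math.log10(max(probability, floor))


def pair_score(split_group_score_r1, split_group_score_r2,
               insert_size, insert_counts):
    """Total of the two split group scores minus the pairing penalty."""
    penalty = pairing_penalty(insert_size, insert_counts)
    return split_group_score_r1 + split_group_score_r2 - penalty


# Toy empirical insert-size distribution (hypothetical counts).
toy_counts = {300: 50, 350: 40, 400: 10}
likely = pair_score(100.0, 90.0, 300, toy_counts)     # common insert size
unlikely = pair_score(100.0, 90.0, 5000, toy_counts)  # never observed
print(likely, unlikely)
```

Under these assumptions, a pair whose innermost fragment alignments imply an implausible insert size receives a lower pair score than an otherwise identical pair with a typical insert size.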

[0041] As used herein, the term "reference genome" refers to a digital nucleic acid sequence assembled as a representative example (or examples) of the genes and other genetic sequences of an organism. Regardless of sequence length, in some cases, the reference genome represents a set of nucleic acid sequences in an exemplary gene set or digital nucleic acid sequence determined to be representative of the organism. For example, the linear human reference genome can be GRCh38 (or other version of the reference genome) from the Genome Reference Consortium. GRCh38 can include alternative contigs representing alternative haplotypes such as SNPs and small indels (e.g., 10 base pairs or fewer, 50 base pairs or fewer), but GRCh38 includes alternative haplotypes with only a limited representation of population structural variants. Indeed, the structural variants represented by GRCh38 include only those represented by the 11 individuals from which the GRCh38 library was constructed. In this context, the term "reference region" refers to a part or fraction of the reference genome. For example, a reference region can be a selected number of nucleotides (e.g., 150 bases) from the reference genome.

[0042] As used herein, the term "variant" refers to a different, or unlike, nucleotide or nucleotides that do not align with the corresponding nucleotide(s) in a reference sequence or reference genome. For example, a variant includes SNPs, indels, or structural variants that indicate a nucleotide in the nucleotide sequence of a sample that is different from the reference nucleotide at the corresponding genomic coordinate of the reference sequence. Along these lines, a "variant nucleotide call" refers to a nucleotide call that includes a variant at a particular genomic coordinate. Conversely, a "non-variant nucleotide call" refers to a nucleotide call that includes a non-variant (or matches the reference base) at a genomic coordinate.

[0043] Furthermore, as used herein, the term "nucleotide call" (or simply "base call") refers to the determination or prediction of a particular nucleotide (or nucleotide pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a sample genome. In particular, a nucleotide call may indicate (i) the determination or prediction of the type of nucleotide incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleotide call), or (ii) the determination or prediction of the type of nucleotide present at a genomic coordinate or region within the genome, including a variant call or non-variant call in a digital output file. In some cases, for a nucleotide read, a nucleotide call may include the determination or prediction of a nucleotide based on intensity values obtained from fluorescently tagged nucleotides added to the oligonucleotides of the nucleotide-sample slide (e.g., within a cluster of a flow cell). Alternatively, a nucleotide call may include the determination or prediction of a nucleotide from a chromatogram peak or current change resulting from a nucleotide passing through a nanopore of a nucleotide-sample slide. In contrast, a nucleotide call may further include the final prediction of a nucleotide at a genomic coordinate of a sample genome for a variant call file (VCF) or other base call output file based on a nucleotide read corresponding to the genomic coordinate. Thus, a nucleotide call may include a base call corresponding to a genomic coordinate and a reference genome, e.g., an indication of a variant or non-variant at a particular position corresponding to the reference genome. Indeed, a nucleotide call may refer to a variant call, including but not limited to single nucleotide variants (SNVs), insertions or deletions (indels), or a base call that is part of a structural variant. As suggested above, a single nucleotide call may be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call.

[0044] As further used herein, the term "alignment file" refers to a digital file that shows the relative alignment or mapping of nucleotide reads to the nucleotide sequence of a reference genome or other reference nucleotide sequence. In particular, an alignment file can contain data indicating the relative mapping positions of nucleotide reads and nucleotide sequences of a reference genome. In some embodiments, the alignment file includes or consists of a Sequence Alignment / Map (SAM) file, a Binary Alignment Map (BAM) file, a FAST-All (FASTA) file, or a FASTQ file.

[0045] As used herein, the term "variant call file" refers to a digital file that indicates or represents one or more nucleotide calls (e.g., variant calls) compared to a reference genome, along with other information regarding the nucleotide calls (e.g., variant calls). For example, a Variant Call Format (VCF) file refers to a text file format that has information regarding variants at specific genomic coordinates, including meta-information lines, header lines, and data lines where each data line has information regarding a single nucleotide call (e.g., a single variant).
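To make the three kinds of VCF lines concrete, the following minimal sketch separates meta-information lines (prefixed with `##`), the header line (prefixed with a single `#`), and tab-delimited data lines. The function name `parse_vcf` and the single example record are made up for illustration; this is not a full VCF parser and performs no validation.

```python
def parse_vcf(text):
    """Split VCF text into meta-information lines, header columns, and records."""
    meta, header, records = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            meta.append(line)                    # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")        # header line with column names
        else:
            # Data line: one nucleotide call (e.g., one variant) per line.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


# Hypothetical one-record file used only to exercise the sketch.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
])
meta_lines, columns, calls = parse_vcf(example_vcf)
print(calls[0]["CHROM"], calls[0]["POS"], calls[0]["REF"], calls[0]["ALT"])
```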

[0046] In some embodiments, a split read alignment system or corresponding sequencing system utilizes a call generation model to determine nucleotide base calls (e.g., variant calls or genotype calls). As used herein, the term "call generation model" refers to a probabilistic model that, along with relevant metrics, generates sequencing data from nucleotide reads of a sample nucleotide sequence, including nucleotide base calls, variant calls, and / or genotype calls. Thus, in some cases, the call generation model can be a variant call generation model. For example, in some cases, the call generation model refers to a Bayesian probability model that generates variant calls based on nucleotide reads of a sample nucleotide sequence. Such a model can process or analyze sequencing metrics corresponding to read pileups (e.g., multiple nucleotide reads corresponding to a single genomic coordinate), including mapping quality, base quality, and various hypotheses such as foreign reads, missing reads, and joint detection. The call generation model can similarly include multiple components, including but not limited to, different software applications or components for mapping and alignment, sorting, duplicate marking, calculation of read pileup depth, and variant calling. In some cases, the call generation model refers to the ILLUMINA DRAGEN model for variant calling functionality as well as mapping and alignment functionality (e.g., DRAGEN variant caller or "DRAGEN VC").

[0047] As used herein, the term "configurable processor" refers to a circuit or chip that can be configured or customized to execute a particular application. For example, a configurable processor includes an integrated circuit chip designed to be configured or customized on-site by an end-user computing device to execute a particular application. Configurable processors include, but are not limited to, ASICs, ASSPs, coarse-grained reconfigurable arrays (CGRAs), or FPGAs. In contrast, configurable processors do not include CPUs or GPUs. In some embodiments, the split read alignment system uses a configurable processor (e.g., an FPGA) or a processor (e.g., a CPU) to execute the various embodiments described herein.

[0048] The following paragraphs describe the split read alignment system with respect to exemplary embodiments and exemplary figures depicting the embodiments. For example, FIG. 1 shows a schematic diagram of a computing system 100 in which a split read alignment system 106 operates, according to one or more embodiments. As shown, the computing system 100 includes one or more server device(s) 102 connected to user client device 108, local device 118, and sequencing device 114 via network 112. Network 112 can include any suitable network over which computing devices can communicate. Exemplary networks are considered in more detail below with respect to FIG. 16.

[0049] As shown in FIG. 1, computing system 100 includes server device(s) 102. In various embodiments, server device(s) 102 can generate, receive, analyze, store, and transmit electronic data such as data for determining nucleobase calls or for sequencing nucleic acid polymers. In some embodiments, server device(s) 102 receive various data from sequencing device 114, such as data from a sample genome and / or nucleotide reads. Server device 102 can also communicate with user client device 108. In particular, server device(s) 102 can transmit data about nucleotide reads, direct nucleobase calls, genomic samples, nucleobase calls, and / or sequencing metrics to user client device 108.

[0050] As shown, server device(s) 102 includes sequencing system 104. Generally, sequencing system 104 analyzes data received from sequencing device 114 (e.g., call data) to determine the nucleobase sequence of a nucleic acid polymer. For example, sequencing system 104 can receive raw data from sequencing device 114 and determine the nucleobase sequence for a sample genome or nucleic acid segment. In some embodiments, sequencing system 104 determines the sequence of nucleobases in a DNA and / or RNA segment or oligonucleotide.

[0051] Also as shown, the sequencing system 104 includes a split read alignment system 106. As described below, the split read alignment system 106 can determine split read alignments of nucleotide reads with a reference genome 116. For example, in some embodiments, the split read alignment system 106 identifies one or more nucleotide reads corresponding to a genomic region of a genomic sample. The split read alignment system 106 further (i) determines a candidate split group including fragment alignments corresponding to the one or more nucleotide reads, and (ii) generates a split group score for the split alignment of the candidate split group with the reference genome 116. Based on the split group score, the split read alignment system 106 selects a predicted split group from among the candidate split groups for use in nucleotide calling.

[0052] As further illustrated in FIG. 1, the computing system 100 includes a user client device 108. In various embodiments, the user client device 108 can generate, store, receive, and transmit digital data. In particular, the user client device 108 can receive sequencing data from a sequencing device 114. As further illustrated, the user client device 108 includes a sequencing application 110. The sequencing application 110 can be a web application or a native application (e.g., a mobile application, a desktop application) stored and executed on the user client device 108. The sequencing application 110 can receive data from the sequencing system 104 and / or the split read alignment system 106. For example, the user client device 108 can receive a variant call file and / or an alignment file from the sequencing system 104.

[0053] The sequencing application 110 can include instructions that, when executed, cause the user client device 108 to receive data from the split read alignment system 106 and present data from the sequencing device 114 and / or the server device(s) 102. Further, the sequencing application 110 can instruct the user client device 108 to display data of nucleobase calls regarding the reference genome 116, such as nucleobase calls or split alignments from a variant call file or an alignment file. Indeed, the user client device 108 can display nucleobase call results and / or instructions for predicted split groups for a genomic sample.

[0054] As further shown in FIG. 1, the computing system 100 includes a sequencing device 114. In various embodiments, the sequencing device 114 sequences a genomic sample or other nucleic acid polymer. For example, the sequencing device 114 generates data by analyzing nucleic acid segments or oligonucleotides extracted from a genomic sample, either directly or indirectly on the sequencing device 114. More specifically, the sequencing device 114 receives and analyzes nucleic acid sequences extracted from a genomic sample within a nucleotide-sample slide (e.g., a flow cell). In one or more embodiments, the sequencing device 114 uses SBS to sequence a genomic sample or other nucleic acid polymer. In some embodiments, in addition to or instead of communicating via the network 112, the sequencing device 114 bypasses the network 112 and communicates directly with the user client device 108.

[0055] As further shown in FIG. 1, in some embodiments, the server device(s) 102 comprises a distributed set of servers, and the server device(s) 102 is distributed across the network 112 and includes a number of server devices located in the same or different physical locations. For example, the server device(s) 102 can be implemented in whole or in part on the local device 118. By way of illustration, the local device 118 can implement the sequencing system 104 and / or the split read alignment system 106. Further, the server device(s) 102 and / or the local device 118 can include a content server, an application server, a communication server, a web hosting server, or another type of server.

[0056] The user client device 108 shown in FIG. 1 can include various types of client devices. For example, in some embodiments, the user client device 108 includes non-mobile devices such as a desktop computer or server, or other types of client devices. In various embodiments, the user client device 108 includes mobile devices such as a laptop, tablet, mobile phone, or smartphone. Additional details of the user client device 108 are described below with respect to FIG. 16.

[0057] Furthermore, although the split read alignment system 106 is shown on the server device(s) 102 as part of the sequencing system 104, in some embodiments, the split read alignment system 106 is implemented (e.g., fully or partially disposed) by the user client device 108, the sequencing device 114, and / or the local device 118. As described above, in some embodiments, the split read alignment system 106 is implemented by one or more other components of the computing system 100 such as the sequencing device 114. In particular, the split read alignment system 106 can be implemented in a variety of different ways across the server device(s) 102, the network 112, the user client device 108, the local device 118, and the sequencing device 114.

[0058] FIG. 1 illustrates the components of the computing system 100 that communicate via the network 112, but in certain embodiments, the components of the computing system 100 can also communicate directly with each other bypassing the network 112. For example, in some embodiments, the user client device 108 can communicate directly with the sequencing device 114. Additionally, in some embodiments, the user client device 108 can communicate directly with the split read alignment system 106 and / or the server device 102. In some embodiments, the user client device 108 can communicate directly with the local device 118. Furthermore, the split read alignment system 106 can access one or more databases housed in or accessed by the server device(s) 102 or elsewhere within the computing system 100.

[0059] FIG. 2 provides an overview of a split read alignment system 106 that determines scores for alignments of one or more fragments from nucleotide reads within a candidate split group according to one or more embodiments, and selects a predicted split group from among the candidates based on such scores for use in base calling. Generally, as shown in FIG. 2, split read alignment system 106 performs a series of operations 200 that include an operation 202 of identifying one or more nucleotide reads. Split read alignment system 106 further performs an operation 204 of determining a candidate split group that includes fragment alignments for fragments of the nucleotide reads. Split read alignment system 106 also performs an operation 206 of generating a split group score for the determined candidate split group and an operation 208 of selecting a predicted split group.

[0060] As shown in FIG. 2, split read alignment system 106 performs an operation 202 of identifying one or more nucleotide reads. In particular, split read alignment system 106 identifies one or more nucleotide reads corresponding to a genomic region of a genomic sample. For example, split read alignment system 106 may identify nucleotide reads corresponding to the template strand or sequence of a genomic sample. More specifically, the template includes the original continuous DNA or RNA fragment that was sequenced by either the single-end method or the paired-end method. In the single-end method, a single read is sequenced from one end of the template. Since a single read is sequenced from one end of the template, the single read represents the complementary sequence of the template. In the paired-end method, a first read (e.g., R1) is sequenced from one end of the template towards the center, and a second read (e.g., R2) is sequenced from the other end. FIG. 2 shows two paired-end reads R1 and R2 that are oriented towards each other. As illustrated, there is a gap between R1 and R2, although an overlap between R1 and R2 is also possible. R1 and R2 may be described as paired-end mates.

[0061] As further shown in FIG. 2, the split read alignment system 106 performs an operation 204 of determining candidate split groups. In particular, the split read alignment system 106 determines candidate split groups that include fragment alignments corresponding to one or more nucleotide reads. Generally, a fragment alignment refers to a candidate local alignment of a fragment of a read. FIG. 2 shows the split read alignment system 106 that determines a candidate split group for R1. R1 can be either a single-end read or one of two paired-end reads. R1 can include one or more different fragments.

[0062] To illustrate fragments and fragment alignments, FIG. 2 shows the split read alignment system 106 that identifies various fragments of a nucleotide read. As shown, the split read alignment system 106 identifies fragments 218, 220, 222, and 224 corresponding to R1. The fragments shown in FIG. 2 are separated by breaks that represent structural variant (or "SV") breakpoints. FIG. 2 shows R1 cut by a single SV breakpoint, but a nucleotide read may have no SV breakpoint or several SV breakpoints. For example, fragment 220 may be further cut into two or more fragments.

[0063] As further shown in FIG. 2, the split read alignment system 106 determines candidate split groups 214a-214c for R1. As part of performing operation 204, the split read alignment system 106 identifies fragment alignments for the identified fragments of the read. Generally, fragments of a read can be aligned to different sequences in the reference genome. For example, fragments 218 and 220 can be aligned to nearby genomic regions of the reference genome on the same chromosome. Conversely, fragment 218 can be aligned to the reference genome on one chromosome and fragment 220 can be aligned to the reference genome on another chromosome.

[0064] FIG. 2 shows candidate fragment alignments corresponding to R1 as part of a split group. More specifically, candidate split groups 214a-214c show candidate local alignments of different combinations of fragments 218-222 of R1 on the reference genome. For example, candidate split group 214a includes candidate fragment alignments of fragment 218 and fragment 220 to the reference genome. FIGS. 3A-4 illustrate the split read alignment system 106 determining candidate split groups for single-end and paired-end nucleotide reads according to one or more embodiments, and the corresponding paragraphs describe this determination in further detail.

[0065] As further shown in FIG. 2, the split read alignment system 106 performs an operation 206 to generate a split group score. Generally, the split read alignment system 106 generates a split group score for the split alignment of a candidate split group and a reference genome. The split read alignment system 106 can generate a split group score for a split group based on a fragment alignment score, a break penalty, and an overlap penalty. As shown, the split read alignment system 106 generates a split group score of 0.98 for the candidate split group 214a and a split group score of 0.73 for the candidate split group 214b. FIG. 5 shows additional details regarding determining a split group score according to one or more embodiments and provides a corresponding description.
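One way to picture how the three named components could combine is the additive sketch below. It assumes, purely for illustration, that the split group score is the sum of fragment alignment scores minus a flat break penalty per junction and a per-base overlap penalty. The additive form, the flat break penalty (rather than one that grows with break size), the default constants, and the names `FragmentAlignment` and `split_group_score` are all hypothetical. The scale also differs from the 0.98 and 0.73 values shown in FIG. 2.

```python
from dataclasses import dataclass


@dataclass
class FragmentAlignment:
    read_start: int  # half-open interval in read coordinates (assumed convention)
    read_end: int
    score: float     # fragment alignment score, e.g., a Smith-Waterman score


def split_group_score(fragments, break_penalty=10.0, overlap_penalty_per_base=1.0):
    """Sum of fragment scores minus break and overlap penalties (illustrative)."""
    ordered = sorted(fragments, key=lambda f: f.read_start)
    total = sum(f.score for f in ordered)
    for left, right in zip(ordered, ordered[1:]):
        total -= break_penalty  # one break between each adjacent pair
        overlap = max(0, min(left.read_end, right.read_end) - right.read_start)
        total -= overlap_penalty_per_base * overlap
    return total


# Three hypothetical candidate split groups for one 150-base read.
unsplit = [FragmentAlignment(0, 150, 110.0)]
adjacent = [FragmentAlignment(0, 75, 70.0), FragmentAlignment(75, 150, 72.0)]
overlapping = [FragmentAlignment(0, 100, 95.0), FragmentAlignment(50, 150, 90.0)]
for group in (unsplit, adjacent, overlapping):
    print(split_group_score(group))
```

With these made-up numbers, the overlapping group has the highest raw fragment scores but loses ground once the 50 overlapping bases are penalized.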

[0066] After determining the split group score, as further shown in FIG. 2, the split read alignment system 106 performs an operation 208 to select a predicted split group. The split read alignment system 106 selects a predicted split group from the candidate split groups based on the split group score. By way of example, in some embodiments, the split read alignment system 106 selects the candidate split group 214a as the predicted split group based on the candidate split group 214a having the highest split group score.
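Selecting the predicted split group amounts to taking the highest-scoring candidate. A minimal sketch, using the two scores given for FIG. 2 and a hypothetical helper name `select_predicted_split_group`:

```python
def select_predicted_split_group(split_group_scores):
    """Return the identifier of the candidate with the highest split group score.

    On ties, Python's max() keeps the first candidate encountered; the
    disclosure does not specify a tie-breaking rule, so this is an assumption.
    """
    return max(split_group_scores, key=split_group_scores.get)


# Scores for candidate split groups 214a and 214b as given for FIG. 2.
fig2_scores = {"214a": 0.98, "214b": 0.73}
predicted = select_predicted_split_group(fig2_scores)
print(predicted)
```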

[0067] As mentioned, the split read alignment system 106 can generate predicted split groups for single-end reads and paired-end reads. In some embodiments, the split read alignment system 106 predicts a split group based in part on a pair score for a pair of candidate split groups. FIGS. 6A-6B show the split read alignment system 106 generating a pair score according to one or more embodiments.

[0068] As described above, the split read alignment system 106 determines candidate split groups for single-end nucleotide reads and paired-end nucleotide reads. FIG. 3A shows the split read alignment system 106 for determining candidate split groups for single-end nucleotide reads according to one or more embodiments, and FIG. 3B shows the split read alignment system 106 for determining candidate split groups for paired-end nucleotide reads.

[0069] FIG. 3A shows the split read alignment system 106 that identifies candidate split groups in single-end nucleotide reads. As described above, single-end read sequencing involves sequencing DNA or RNA from one direction. Generally, the split read alignment system 106 identifies fragments of nucleotide reads. By way of example, the split read alignment system 106 identifies fragment 320, fragment 322, fragment 324, and fragment 326. The illustrated fragments are split by potential breakpoints when aligned with the reference genome 334a.

[0070] The split read alignment system 106 identifies candidate split groups 332a-332c of the identified fragments. Generally, the candidate split groups 332a-332c include all realistic fragment alignments. In other words, the candidate split groups 332a-332c include potential fragment alignments of the read fragments with the reference genome 334b. For example, candidate split group 332a includes the fragment alignments of fragment 320 and fragment 322 with respect to the reference genome 334b. Candidate split group 332b includes the overlapping fragment alignments of fragment 320 and fragment 322. Candidate split group 332c includes the fragment alignments of fragment 320 and fragment 326 with respect to the reference genome 334b.

[0071] FIG. 3A shows candidate split groups 332a-332c, although additional candidate split groups are possible. For example, a candidate split group can include a single fragment alignment of a single fragment of a nucleotide read. In some embodiments, the fragment may be the entire nucleotide read. Further, a candidate split group can include more than two fragment alignments. For example, a candidate split group can include a fragment alignment for each of three or more fragments of a nucleotide read.

[0072] As described above, FIG. 3B shows the split read alignment system 106 determining candidate split groups for nucleotide reads in paired-end sequencing according to one or more embodiments. Generally, paired-end sequencing involves generating paired nucleotide reads that begin at different (and opposite) positions of a library template. Specifically, paired-end sequencing generates two mate reads. For example, R1 and R2 shown in FIG. 3B constitute a pair of mates. As mentioned, there may be a gap between R1 and R2, or the paired-end reads may overlap.

[0073] In some examples, one paired-end mate intersects a breakpoint (e.g., an SV breakpoint), while the other paired-end mate does not. For illustration, R2 may cross a breakpoint, while R1 does not. Thus, R2 can be segmented into fragment 302 and fragment 304, while R1 remains as the entire fragment 316. In this example, the 3’ end of R2 (e.g., the inner end of fragment 302) is in a position properly paired with respect to the mate alignment of the entire fragment 316, while fragment 304 can potentially align to a different genomic region of the reference genome.

[0074] In another example, R1 and R2 may overlap and both may cross a single breakpoint. For illustration, breaks 336a and 336b can represent the same breakpoint. In this example, fragment 318 of R1 overlaps with fragment 302 of R2, and fragment 320 of R1 represents fragment 304 of R2.

[0075] In another example, R1 and R2 cross different breakpoints. For example, break 336a can represent a different breakpoint from break 336b. Thus, R1 is split into fragment 318 and fragment 320, and R2 is split into fragment 310 and fragment 312.

[0076] The split read alignment system 106 contemplates the above scenarios by generating candidate split groups for both R1 and R2. As illustrated in FIG. 3B, the split read alignment system 106 generates candidate split groups 324a-324c corresponding to R1 with respect to the reference genome 327. The split read alignment system 106 also generates candidate split groups 340a, 340b, and 340c corresponding to R2 with respect to the reference genome 314. In some embodiments, the reference genome 327 and the reference genome 314 represent the same reference genome. The candidate split groups 324a-324c and the candidate split groups 340a-340c include fragment alignments corresponding to the relevant nucleotide reads, i.e., either R1 or R2.

[0077] As previously described with respect to FIG. 3A, in some embodiments, a candidate split group includes a chain of fragment alignments for one nucleotide read. As described above, the nucleotide read and the fragment alignments can be of various nucleotide lengths. As shown, the split read alignment system 106 can determine that a candidate split group includes an alignment of the entire nucleotide read. For example, candidate split group 324a includes an alignment of the entire fragment 316, which comprises all of R1, with respect to reference genome 327. In contrast, a candidate split group can also include overlapping fragment alignments. For example, candidate split group 324c for R1 and candidate split group 340c for R2 include overlapping fragment alignments. The split read alignment system 106 can further determine non-overlapping candidate split groups. For example, candidate split group 324b for R1 and candidate split group 340a for R2 include non-overlapping fragment alignments. Further, the split read alignment system 106 can generate candidate split groups that include a chain of more than two fragment alignments. A candidate split group can also include fragment alignments having different orientations with respect to the reference genome.

[0078] FIGS. 3A-3B show the split read alignment system 106 generating candidate split groups for single-end and paired-end nucleotide reads. In some embodiments, the split read alignment system 106 utilizes dynamic programming to efficiently generate and evaluate the possible combinations of fragment alignments. FIG. 4 and the corresponding description illustrate the split read alignment system 106 utilizing dynamic programming to generate and evaluate candidate split groups according to one or more embodiments.

[0079] By using dynamic programming, in some embodiments, the split read alignment system 106 considers only a subset of the possible candidate split groups. More specifically, the split read alignment system 106 identifies this subset by evaluating the fragment alignments in a specific order. By way of illustration, in some embodiments, the split read alignment system 106 determines the candidate split groups by iteratively grouping the individual fragment alignments in order from the outermost fragment alignment to the innermost fragment alignment of the nucleotide read. The split read alignment system 106 further iteratively scores the groupings of the individual fragment alignments according to the order in which the individual fragment alignments are grouped.

[0080] Generally, each read has two ends, a 3' end and a 5' end, where the 3' end is designated as "inner" and the 5' end is designated as "outer". For paired-end reads, the terms inner and outer refer to the expected relative positions in the template. In the case of single-end reads or paired-end reads with forward-reverse (FR) orientation, the 3' end represents the inner end and the 5' end represents the outer end. If reverse-forward (RF) or forward-forward (FF) / reverse-reverse (RR) pair orientations are expected, the split read alignment system 106 dynamically determines the inner and outer read ends. In particular, the split read alignment system 106 designates the innermost and outermost fragment alignments according to the observed geometry of the appropriate pair of fragment alignments with the highest total alignment score.

[0081] Figure 4 shows the process of dynamic programming executed by the split read alignment system 106. As shown for illustrative purposes, the fragment alignments 402-410 are organized in a Smith-Waterman matrix. In addition to showing the positions of the fragment alignments with respect to the nucleotide reads and the reference genome, the Smith-Waterman matrix shows the orientation of the fragment alignments 402-410. For example, as shown in the figure, the fragment alignment 406 represents a forward alignment, and the fragment alignment 408 represents a reverse complement alignment. Figure 4 shows the fragment alignments 402-410 as complete gapless diagonal alignments, although the individual fragment alignments of the fragment alignments 402-410 may contain indels (insertions and / or deletions). In some embodiments, such indels are relatively small mutations (e.g., less than 50 base pairs) as opposed to the size of structural variations (e.g., more than 50 base pairs). Small indels are typically aligned within the fragment alignment, while structural variations are typically described or depicted by multi-fragment split read alignments.

[0082] As shown in Figure 4, the fragment alignment 402 represents the innermost fragment alignment, and the fragment alignment 410 represents the outermost fragment alignment. As shown, the split read alignment system 106 begins by grouping the outermost fragment alignment with the next outermost fragment alignment. For example, the split read alignment system 106 groups the fragment alignment 410 with the fragment alignment 408. The grouping of the fragment alignment 410 and the fragment alignment 408 constitutes a candidate split group 412a.

[0083] After grouping the outermost fragment alignment with the next-outermost fragment alignment (and determining the split group score therefor), the split read alignment system 106 groups the outermost fragment alignment with the fragment alignment that follows the next-outermost one (and determines the split group score therefor). Thus, the split read alignment system 106 groups the fragment alignment 410 with the fragment alignment 406. The grouping of the fragment alignment 410 and the fragment alignment 406 constitutes the candidate split group 412b.

[0084] In some embodiments, as just shown, the split read alignment system 106 generates a split group score by repeatedly scoring the grouping of the individual fragment alignments according to the order in which the individual fragment alignments are grouped. As shown in FIG. 4, the split read alignment system 106 scores the candidate split group 412a and the candidate split group 412b in the order in which they are formed. For example, the split read alignment system 106 determines a split group score 414a for the candidate split group 412a and a split group score 414b for the candidate split group 412b. In some cases, the split group score 414b is greater than the split group score 414a. As shown below, a better split group score can affect the order in which the next candidate split group is determined (and scored).

[0085] In some embodiments, candidate split group 412a and candidate split group 412b represent partial split groups. Generally, a partial split group includes one or more fragment alignments that cover a part of the nucleotide read rather than the entire nucleotide read. The split read alignment system 106 can link additional fragment alignments to the partial split groups. For example, in some embodiments, the split read alignment system 106 links additional fragment alignments to the partial split group with the highest split group score as part of dynamic programming. By linking additional fragment alignments only to the highest-scoring partial split group, the split read alignment system 106 avoids the processing cost of exhaustively generating every candidate split group.

[0086] Although not shown in FIG. 4, after grouping fragment alignment 410 and fragment alignment 406 as candidate split group 412b (and determining its split group score), the split read alignment system 106 forms an additional candidate split group that includes fragment alignment 410, fragment alignment 408, and fragment alignment 406 (and determines its split group score). If the split group score 414b for candidate split group 412b exceeds the split group score for the additional candidate split group, the split read alignment system 106 continues to form (and score) candidate split groups that include fragment alignment 410 and other combinations of fragment alignments. For example, the split read alignment system 106 (i) groups fragment alignment 410, fragment alignment 406, and fragment alignment 404 (and determines the split group score for that grouping), and (ii) groups fragment alignment 410 and fragment alignment 404 (and determines the split group score for that grouping).

[0087] In addition, as part of considering candidate split groups, the split read alignment system 106 can also consider single fragment alignments. As described above, in some embodiments, the split read alignment system 106 also considers single fragment alignments in order from the outermost fragment alignment to the innermost fragment alignment. Before or after considering the candidate split group 412a, for example, the split read alignment system 106 can identify a candidate partial split group that includes the fragment alignment 410. The split read alignment system 106 generates a partial split group score for the fragment alignment 410. The split read alignment system 106 then compares the partial split group score with other split group scores, such as the split group score 414a for the candidate split group 412a. Thus, in addition to candidate split groups that include new or additional fragment alignments, in some embodiments, the split read alignment system 106 also identifies (and determines its split group score) a candidate partial split group that includes new or additional fragment alignments.

[0088] As further shown in FIG. 4, the split read alignment system 106 generates a candidate split group 412n that includes fragment alignment 402, fragment alignment 406, and fragment alignment 410. In this example, since the candidate split group 412b has the highest split group score, i.e., split group score 414b, the split read alignment system 106 adds fragment alignment 402 to the candidate split group 412b. The split read alignment system 106 scores the candidate split group 412n and assigns a split group score 414n. In this way, the split read alignment system 106 iterates from the outermost fragment alignment to the innermost fragment alignment. For each fragment considered, the split read alignment system 106 finds the best next fragment alignment (i.e., the next, outer, highest-scoring fragment alignment).

[0089] If adding the next outer fragment alignment results in an improvement in the split group score, the split read alignment system 106 retains the next outer fragment alignment as part of the candidate split group. If adding the next outer fragment alignment does not result in an improvement in the split group score, the split read alignment system 106 discards the next outer fragment alignment from the candidate split group and proceeds to the next outer fragment alignment. Thus, by performing dynamic programming, the split read alignment system 106 groups the candidate split groups (and determines the split group scores therefor) in order from the outermost fragment alignment to the innermost fragment alignment of the nucleotide read until each candidate split group is considered unable to improve the highest split group score or is excluded.
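
The outermost-to-innermost iteration described in the preceding paragraphs resembles a standard chaining recurrence. The following Python sketch is offered only as an illustration under stated assumptions, not as the disclosed implementation: fragment alignments are assumed to be pre-sorted from outermost to innermost, `scores` holds hypothetical fragment alignment scores, and `link_penalty(j, i)` is a hypothetical callback that returns the combined break and overlap penalty for chaining fragment `i` after fragment `j`.

```python
from typing import Callable, List, Sequence, Tuple


def best_chain(
    scores: Sequence[int],
    link_penalty: Callable[[int, int], int],
) -> Tuple[int, List[int]]:
    """Return (best score, fragment indices) over chains of fragment alignments.

    `scores` is ordered from the outermost to the innermost fragment alignment.
    best[i] is the best score of a partial split group whose innermost
    fragment alignment is i.
    """
    n = len(scores)
    if n == 0:
        return 0, []
    best = [0] * n
    prev = [-1] * n
    for i in range(n):
        # A single fragment alignment is itself a (partial) split group.
        best[i] = scores[i]
        for j in range(i):
            # Extend the best partial group ending at the outer fragment j.
            candidate = best[j] + scores[i] - link_penalty(j, i)
            if candidate > best[i]:
                best[i] = candidate
                prev[i] = j
    end = max(range(n), key=lambda k: best[k])
    chain = []
    k = end
    while k != -1:
        chain.append(k)
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

With the made-up scores `[30, 10, 40, 35]` and a constant link penalty of 15, the sketch skips the low-scoring second fragment and returns the chain `[0, 2, 3]`, which mirrors how candidate split group 412n combines fragment alignments 410, 406, and 402 while omitting 408.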

[0090] As described above, the split read alignment system 106 determines a split group score for a candidate split group. FIG. 5 shows a split read alignment system 106 that determines a split group score for a candidate split group according to one or more embodiments, and the corresponding description will be further detailed. In some embodiments, the split read alignment system 106 determines a split group score for a candidate split group based on a fragment alignment score 502, a break penalty 506, and an overlap penalty 508. For example, the split read alignment system 106 can generate a split group score by combining the fragment alignment scores 502 for the fragment alignments within a candidate split group and subtracting the break penalty 506 and the overlap penalty 508 from the combined fragment alignment scores.
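
As a minimal sketch of the arithmetic just described, the split group score can be written as the summed fragment alignment scores minus the penalties. The function name and the numeric values below are illustrative assumptions; the sketch also assumes one break penalty and one overlap penalty per junction between consecutive fragment alignments.

```python
from typing import Sequence


def split_group_score(
    fragment_scores: Sequence[int],
    break_penalties: Sequence[int],
    overlap_penalties: Sequence[int],
) -> int:
    """Sum the fragment alignment scores, then subtract all penalties."""
    return sum(fragment_scores) - sum(break_penalties) - sum(overlap_penalties)


# Two fragment alignments joined by one break (values are made up).
example = split_group_score([60, 45], [20], [5])  # 60 + 45 - 20 - 5 = 80
```

A split group consisting of a single fragment alignment has no junctions, so its score reduces to the fragment alignment score itself.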

[0091] As described above, the split read alignment system 106 can assign a split group score to each candidate split group. In some embodiments, a candidate split group includes any chain of fragment alignments that follows a specific rule. For example, a candidate split group includes a chain of one or more fragment alignments for the same read from a head fragment to a tail fragment. Under one embodiment of the rule, the head fragment is closest to the inner end of the nucleotide read, and the tail fragment is closest to the outer end of the nucleotide read. The inner gap of a fragment is the distance from the inner end of the nucleotide read, and the outer gap of a fragment is the distance to the outer end of the nucleotide read. For consecutive fragment alignments A and B, for example, the rule can be expressed as follows: i) A.inner_gap ≦ B.inner_gap and ii) A.outer_gap > B.outer_gap. The same fragment alignment may participate in multiple split groups.
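
The ordering rule for consecutive fragment alignments can be checked mechanically. The sketch below is a hedged illustration: the `Fragment` class and the function names are hypothetical, and the gaps are assumed to be measured in read bases.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Fragment:
    inner_gap: int  # distance from the inner end of the read
    outer_gap: int  # distance to the outer end of the read


def may_follow(a: Fragment, b: Fragment) -> bool:
    """True if B may directly follow A in a chain, per rule (i) and (ii)."""
    return a.inner_gap <= b.inner_gap and a.outer_gap > b.outer_gap


def is_valid_chain(chain: Sequence[Fragment]) -> bool:
    """A chain is valid if every consecutive pair satisfies the rule."""
    return all(may_follow(a, b) for a, b in zip(chain, chain[1:]))


# A 150-base read split into a head (inner 70 bases) and a tail (outer 80 bases).
head = Fragment(inner_gap=0, outer_gap=80)
tail = Fragment(inner_gap=70, outer_gap=0)
```

Here `is_valid_chain([head, tail])` holds, while the reversed order fails rule (i). A chain of one fragment alignment is trivially valid.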

[0092] As shown in FIG. 5, the split read alignment system 106 generates fragment alignment scores 502 for fragment alignments A and B. As described above, the fragment alignment score can include a numerical score, metric, or other quantitative measure of the alignment accuracy of a fragment alignment from the nucleotide read. For example, the fragment alignment score can indicate the likelihood that a given alignment of the fragment is correct with respect to the reference genome. As described above, such a fragment alignment score can indicate the degree to which the nucleotides of the nucleotide read fragment match or are similar to a reference sequence (or alternative contiguous sequence) from the reference genome. For example, the split read alignment system 106 can assign a fragment alignment score to individual fragment alignments within the split group by determining a Smith-Waterman score or a variant of the Smith-Waterman score. In other embodiments, the split read alignment system 106 utilizes other forms of fragment alignment scoring. As shown, the split read alignment system 106 combines (e.g., sums) the fragment alignment scores of the two fragment alignments A and B within the split group.

[0093] As further shown in FIG. 5, the split read alignment system 106 determines a break penalty 506. FIG. 5 shows three factors that the split read alignment system 106 analyzes to generate the break penalty 506: the fragment alignment orientation, whether the fragment alignments lie on the same reference sequence, and the effective indel length. As suggested above, in some embodiments, the break penalty 506 represents a metric that penalizes the fragment alignments of the split group to the extent that the relative geometry of the fragment alignments indicates a break in the nucleobase sequence. More specifically, the break penalty 506 reflects the relative geometry of fragment alignments A and B with respect to the reference genome. In some embodiments, as shown in FIG. 5, the split read alignment system 106 determines the break penalty 506 based on the fragment alignment orientation. For example, the fragment alignment orientation refers to whether the fragment alignment has a forward orientation or a reverse orientation. By way of illustration, in some cases, the expected orientation of a paired-end template is two fragment alignments that point towards each other. For example, the split read alignment system 106 determines the break penalty 506 based on whether fragment alignments A and B have opposite orientations or are inverted.

[0094] In some embodiments, the split read alignment system 106 determines an inversion penalty (e.g., represented as "split-inv-pen") when fragment alignments A and B have opposite orientations. When fragment alignments A and B do not have opposite orientations, the split read alignment system 106 does not assign such an inversion penalty.

[0095] Furthermore, as shown in FIG. 5, the split read alignment system 106 determines a break penalty 506 based on whether the fragment alignments are located in the same reference sequence of the reference genome. For purposes of illustration, the split read alignment system 106 may associate a maximum break penalty (e.g., represented as "split-max-pen") if fragment alignments A and B are aligned against different reference sequences of the reference genome. The maximum break penalty may include a predetermined value for DNA and RNA. For example, based on determining that fragment alignments A and B are aligned against different reference sequences, the split read alignment system 106 assigns a 36-point penalty to the DNA fragment alignment and a 20-point penalty to the RNA fragment alignment when determining the split group score. If fragment alignments A and B are aligned against the same reference sequence, in some embodiments, the split read alignment system 106 calculates the effective indel length (indelLen) as the absolute difference between the alignment diagonals of the fragment alignments at their opposing ends.

[0096] As further illustrated in FIG. 5, the split read alignment system 106 determines a break penalty 506 based on the effective indel length. In some embodiments, the split read alignment system 106 reduces the break penalty 506 based on the indel length. For example, the split read alignment system 106 can reduce the overlap penalty by MIN(overlap, FLOOR(Log4(indelLen)), split-olap-ignore). In some embodiments, indelLen is equal to the indel length measured in nucleotide base pairs. The split read alignment system 106 reduces the overlap penalty because (a) the overlap implies similar sequences in the fragment alignments A and B, which is common at SV breaks, and (b) much of the penalty for long-distance breaks derives from the large number of potential break end positions. However, the number of potential break end positions shrinks to a smaller set when considering only break end positions with sufficient sequence similarity to cause fragment overlap.
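
The reduction term MIN(overlap, FLOOR(Log4(indelLen)), split-olap-ignore) can be evaluated with integer arithmetic alone. The sketch below is illustrative: the function names are hypothetical, and returning 0 for an indel length below 1 is an assumption, since the text does not say how that case is handled.

```python
def floor_log4(n: int) -> int:
    """Integer floor(log4(n)) for n >= 1, computed without floating point."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n.bit_length() - 1 equals floor(log2(n)); halving it gives floor(log4(n)).
    return (n.bit_length() - 1) // 2


def overlap_reduction(overlap: int, indel_len: int, split_olap_ignore: int) -> int:
    """Amount by which the overlap penalty is reduced (assumed non-negative)."""
    if indel_len < 1:
        return 0  # assumption: no reduction when there is no effective indel
    return max(0, min(overlap, floor_log4(indel_len), split_olap_ignore))
```

For example, with an overlap of 10 bases, an effective indel length of 1000, and a split-olap-ignore value of 8, the reduction is 4, because FLOOR(Log4(1000)) = 4 is the smallest of the three terms.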

[0097] In some embodiments, the split read alignment system 106 can limit or disable overlap reduction by setting the split-olap-ignore value lower or to 0. When enabling overlap reduction, the split read alignment system 106 can set split-log2-coeff to at least 0.5 so that the penalty for overlapping breaks does not decrease as the break distance increases.
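
One way to motivate the 0.5 threshold is a change of logarithm base. The relation below is a sketch that ignores the FLOOR operations and the MIN caps, and it assumes that the overlap reduction offsets the distance-dependent term split-log2-coeff * Log2(indelLen) of the break penalty. The symbols c (for split-log2-coeff) and L (for indelLen) are shorthand introduced here for illustration.

```latex
\log_4 L = \tfrac{1}{2}\,\log_2 L
\quad\Longrightarrow\quad
c\,\log_2 L \;-\; \log_4 L \;=\; \left(c - \tfrac{1}{2}\right)\log_2 L
```

Under these assumptions, the net distance-dependent term is non-decreasing in L whenever c is at least 1/2, and it would shrink with distance for smaller c.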

[0098] Instead of determining the effective indel length, in some embodiments, the split read alignment system 106 determines the break distance in the chromosome. In one example, the split read alignment system 106 determines the distance between fragment alignment start points in the reference genome and compares the distance between the fragment alignment start points with the predicted break distance. In another example, the split read alignment system 106 determines the distance between the closest endpoints of two fragment alignments and compares that distance with the predicted break distance.

[0099] Furthermore, in the case of a split alignment, the split read alignment system 106 determines an initial break penalty (e.g., represented as "split-open-pen") before considering the effective indel length. In at least one example, the break penalty is equal to the lesser of (i) the maximum break penalty or (ii) the break penalty determined based on the initial break penalty, the inversion penalty (invPen), and the indel length (indelLen). By way of illustration, the break penalty is equal to MIN(split-max-pen, split-open-pen + invPen + FLOOR(split-log2-coeff * Log2(indelLen))).
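
The break penalty expression can be sketched in Python as follows. This is an illustration under assumptions rather than the disclosed implementation: the function signature is hypothetical, the treatment of an indel length below 1 (log term taken as 0) is an assumption, and every numeric value in the examples other than the 36-point DNA maximum is made up.

```python
import math


def break_penalty(
    same_reference: bool,
    inverted: bool,
    indel_len: int,
    split_open_pen: int,
    split_inv_pen: int,
    split_log2_coeff: float,
    split_max_pen: int,
) -> int:
    """MIN(split-max-pen, split-open-pen + invPen + FLOOR(coeff * Log2(indelLen)))."""
    if not same_reference:
        # Fragment alignments on different reference sequences get the maximum.
        return split_max_pen
    inv_pen = split_inv_pen if inverted else 0
    # Assumption: the distance term is 0 when there is no effective indel.
    log_term = math.floor(split_log2_coeff * math.log2(indel_len)) if indel_len >= 1 else 0
    return min(split_max_pen, split_open_pen + inv_pen + log_term)


# Hypothetical settings: open = 10, inversion = 5, coeff = 2.0, max = 36 (DNA).
near = break_penalty(True, False, 1024, 10, 5, 2.0, 36)   # 10 + 0 + 20 = 30
far = break_penalty(True, False, 2**20, 10, 5, 2.0, 36)   # 10 + 0 + 40, capped at 36
```

The cap means that a very distant break on the same reference sequence is penalized no more than a break between different reference sequences.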

[0100] FIG. 5 further shows the split read alignment system 106 determining an overlap penalty 508. As suggested above, in some embodiments, the overlap penalty 508 represents a metric that penalizes fragment alignments of a split group to the extent that the fragment alignments overlap within a nucleotide read. For example, in some embodiments, the overlap penalty 508 is equal to the Smith-Waterman match score multiplied by the amount of overlap within the read between fragment alignments A and B. As noted above, fragment alignment overlap can occur when two fragment alignments include some of the same bases of a nucleotide read (each aligned with the reference genome). By determining an overlap penalty, the split read alignment system 106 avoids double counting read nucleobases that match the reference genome within both fragment alignments of a split group.
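
A short sketch of the overlap penalty follows. It assumes that each fragment alignment covers a half-open interval of read positions; the function names and the match score values are hypothetical.

```python
def read_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of read bases shared by half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlap_penalty(match_score: int, overlap: int) -> int:
    """Match score multiplied by the number of overlapping read bases."""
    return match_score * overlap


# Fragment A covers read bases 0..79 and fragment B covers 70..149: 10 shared bases.
shared = read_overlap(0, 80, 70, 150)
penalty = overlap_penalty(match_score=1, overlap=shared)  # 1 * 10 = 10
```

Subtracting this amount removes the match credit that the shared bases would otherwise contribute twice, once through each fragment alignment score.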

[0101] In some embodiments, the split read alignment system 106 further determines other penalties as part of determining a split group score. By way of illustration, the split read alignment system 106 may determine a gap penalty. The gap penalty is complementary to the overlap penalty 508. More specifically, in some embodiments, the gap penalty represents a numerical score, metric, or other quantitative measure that penalizes fragment alignments of a split group to the extent that a gap exists between fragment alignments. In some embodiments, the gap penalty represents a negative overlap and the overlap penalty represents a negative gap.

[0102] As described above, in some embodiments, the split read alignment system 106 generates and scores split groups by using dynamic programming. Thus, in some embodiments, the split read alignment system 106 generates split group scores for candidate split groups as shown in FIG. 5, according to the order of the outermost fragment alignments towards the innermost fragment alignment as shown in FIG. 4.

[0103] In some embodiments, as described above, the split read alignment system 106 evaluates candidate split groups based on pair scores. More specifically, the split read alignment system 106 evaluates the pair alignments of candidate pairs of the split groups and selects a predicted split group based on the pair scores. FIG. 6A shows the split read alignment system 106 that generates pair scores according to one or more embodiments. FIG. 6B shows the split read alignment system 106 that determines a predicted split group based on pair scores according to one or more embodiments.

[0104] FIG. 6A shows the split read alignment system 106 generating a pair score based on a split group score 602 and a pairing penalty 608. In some embodiments, the split read alignment system 106 identifies, from among the candidate split groups, candidate pairs of split groups that include different fragment alignments for the mates of paired-end nucleotide reads. For example, the split read alignment system 106 identifies a candidate pair of split groups that includes split group 604 and split group 606. More specifically, split group 604 includes fragment alignments A and B, and split group 606 includes fragment alignments C and D. As shown, split group 604 and split group 606 are aligned with a reference genome. More specifically, split group 604 and split group 606 include candidate paired-end mates aligned along the reference genome. For example, split group 604 may represent R1 of the paired-end read, and split group 606 may represent R2.

[0105] As further shown in FIG. 6A, the split read alignment system 106 generates a split group score 602. As suggested above, in some embodiments, the pair score evaluates the accuracy of the pair alignment of the candidate pair of split groups with the reference genome. In some embodiments, the split group score 602 includes the sum of the split group scores of the candidate pair of split groups. By way of example, the split read alignment system 106 sums the split group score for split group 604 and the split group score for split group 606 as part or all of the pair score.

[0106] As further shown in FIG. 6A, the split read alignment system 106 generates a pairing penalty 608 for candidate pairs of split groups. The split read alignment system 106 may determine the pairing penalty 608 based on the estimated insert size between the innermost fragment alignments of a candidate pair of split groups. In some cases, the fragment alignments corresponding to the paired ends are located relatively close to each other in the reference genome. The split read alignment system 106 can determine a known empirical insert size distribution. In some embodiments, the split read alignment system 106 determines the known empirical insert size distribution by analyzing the insert sizes in the sequencing library. The known empirical insert size distribution generally indicates the most likely insert size of the sequencing library. Thus, the split read alignment system 106 may assign a pairing penalty of 0 or a small value when the two innermost fragment alignments are located close to each other or at the distance from each other that is expected based on the empirical insert size distribution.

[0107] For example, the split read alignment system 106 determines an estimated insert size 610 between the innermost fragment alignments B and C. As shown in FIG. 6A, the estimated insert size 610 includes the length of the library template for which mate nucleotide reads were sequenced at each end. The split read alignment system 106 compares the estimated insert size 610 to an expected insert size based on an empirical insert size distribution. The split read alignment system 106 assigns a larger pairing penalty to candidate pairs of split groups, whether the estimated insert size 610 is larger or smaller than the expected insert size. In some embodiments, the split read alignment system 106 determines a fixed pairing penalty for candidate pairs of split groups outside the expected insert size range. In other embodiments, the split read alignment system 106 utilizes a sliding scale and the split read alignment system 106 adjusts the pairing penalty based on the difference between the estimated insert size 610 and the expected insert size.
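
Both variants of the pairing penalty, fixed and sliding scale, can be sketched in one function. Everything below is a hypothetical illustration: the parameter names (`tolerance`, `max_penalty`, `bases_per_point`) and their values are assumptions, and a real system would derive the expected range from the empirical insert size distribution.

```python
def pairing_penalty(
    estimated_insert: int,
    expected_insert: int,
    tolerance: int,
    max_penalty: int,
    sliding: bool = False,
    bases_per_point: int = 20,
) -> int:
    """Penalty for an insert size that deviates from the expected size.

    Deviations in either direction (too large or too small) are penalized.
    """
    deviation = abs(estimated_insert - expected_insert)
    if deviation <= tolerance:
        return 0  # within the expected insert size range
    if not sliding:
        return max_penalty  # fixed penalty outside the expected range
    # Sliding scale: one penalty point per `bases_per_point` bases of excess.
    return min(max_penalty, (deviation - tolerance) // bases_per_point)
```

With an expected insert size of 400, a tolerance of 150, and a maximum penalty of 50, an estimated insert size of 450 is not penalized, whereas an estimate of 1000 receives 50 in the fixed variant and 22 in the sliding variant.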

[0108] In some examples, the estimated insert size is calculated to reflect the estimated full length of the library template strand sequenced at each end to obtain two paired-end nucleotide reads. For example, two paired-end nucleotide reads include fragment alignments A, B, C, and D. In at least one embodiment, the insert size is estimated from the reference positions of the endpoints of the innermost fragment alignments B and C and extrapolated to account for the outer portions of the two paired-end nucleotide reads not covered by fragment alignments B and C. By way of illustration, the split-read alignment system 106 can extrapolate to account for the outer portions that are covered by fragment alignments A and D. However, in the example shown in FIG. 6A, the split-read alignment system 106 does not consider the reference positions of the outer fragment alignments A and D due to the SV breaks between fragment alignment A and B and between fragment alignment C and D. Thus, the positions of fragment alignments A and D provide little information regarding the true insert size.

[0109] In some embodiments, the split-read alignment system 106 further adjusts the pairing penalty 608 based on the split group position and split group orientation. For example, the split-read alignment system 106 can assign a larger pairing penalty to candidate pairs of split groups that are aligned to different chromosomes of the reference genome. As described above, the split-read alignment system 106 can assign a larger pairing penalty based on the orientation of the split group. For example, if the fragment alignments are oriented in the same orientation (e.g., both oriented 3' to 5' of the reference genome) rather than in a complementary orientation (e.g., pointing towards each other), the split-read alignment system 106 assigns a larger pairing penalty to the candidate pair of split groups.

[0110] In one or more embodiments, the split read alignment system 106 determines a pair score based on the split group score 602 and the pairing penalty 608. By way of example, in some embodiments, the split read alignment system 106 generates a pair score by subtracting the pairing penalty 608 from the sum of the split group scores 602.
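
In code form, this pair score is a one-line computation. The function name and the example values are illustrative assumptions.

```python
def pair_score(r1_split_group_score: int, r2_split_group_score: int, pairing_penalty: int) -> int:
    """Sum of the two mates' split group scores minus the pairing penalty."""
    return r1_split_group_score + r2_split_group_score - pairing_penalty


# Made-up values: split groups scoring 90 and 85 with a pairing penalty of 10.
example_pair_score = pair_score(90, 85, 10)  # 90 + 85 - 10 = 165
```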

[0111] As mentioned, in some cases, two paired-end mate reads overlap the same breakpoint (e.g., an SV breakpoint). If the overlapping mates cross the breakpoint in their overlapping zone, each mate can be similarly split aligned as two fragment alignments. In some embodiments, the split read alignment system 106 detects these "quads" as a special case and assigns a pair score that includes only one copy of the break penalty (but both overlap penalties). If such a "quad" of split-overlap alignments yields the highest pair score, the split read alignment system 106 selects the R1 and R2 fragment alignments on the same side of the break as the primary alignments to support the appropriate pairing, i.e., one 5' fragment alignment and one 3' fragment alignment. Generally, the split read alignment system 106 selects the higher-scoring 5' fragment alignment as the primary alignment, along with the 3' fragment alignment of the mate.

[0112] In some embodiments, the detection of quads is somewhat restrictive. The corresponding fragments in both mates need to be split at the SV break at the same position, which is typically the case unless sequencing errors intervene. Gaps or overlaps between fragments in each nucleotide read are allowed, but they must be the same in both mates of the paired-end reads. If the split read alignment system 106 cannot detect a complete quad, the split read alignment system 106 outputs only three fragment alignments and omits the lowest-scoring 3' fragment alignment.

[0113] As described above, in some embodiments, the split read alignment system 106 selects a predicted split group based on the pair score. FIG. 6B and the corresponding paragraphs illustrate the split read alignment system 106 selecting a predicted split group based on the pair score. Briefly, FIG. 6B shows the pair scores 622 of the candidate pairs of split groups 626a-626c. The candidate pair 626a of split groups includes split groups 611 and 612. The empty boxes within the fragmented arrows within split group 612 represent breaks (e.g., SV breaks) between the fragment alignments that make up split group 612. In contrast to the candidate pair 626a of split groups, the candidate pair 626b of split groups includes split groups 614 and 616. Finally, the candidate pair 626c of split groups includes split groups 618 and 620. As will be described below, the split read alignment system 106 (i) selects the candidate pair of split groups having the highest pair score and (ii) for each mate of the nucleotide read pair, selects a predicted split group from that candidate pair.

[0114] In some cases, the candidate split group with the highest split group score may not necessarily indicate an accurate split alignment. For example, a relatively high split group score indicates a way in which nucleotide reads are likely to show a split alignment. However, this relatively high split group score may be associated with a less likely pairing configuration of the two mates from a pair of paired-end nucleotide reads. By generating a pair score in addition to the split group score, the split read alignment system 106 further considers the pairing configuration of the fragment alignments from the mates of the paired-end nucleotide reads when selecting a predicted split group.

[0115] For illustration, split group 614 may have the highest split group score among split groups 611-620. The split read alignment system 106 generates a pair score 622 for each of the candidate pairs of split groups 626a-626c. Based on determining that the pair score for the candidate pair 626a of split groups exceeds the pair score for the candidate pair 626b of split groups, in some cases, the split read alignment system 106 selects split group 611 from the candidate pair 626a as the predicted split group for a particular mate instead of split group 614 from the candidate pair 626b.
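
The selection described for FIG. 6B can be illustrated with a small sketch. The `CandidatePair` type, the function name, and all scores below are made up for illustration; only the reference numerals come from the figure description.

```python
from typing import NamedTuple, Sequence


class CandidatePair(NamedTuple):
    name: str
    r1_split_group: str
    r1_score: int
    r2_split_group: str
    r2_score: int
    pairing_penalty: int

    @property
    def pair_score(self) -> int:
        return self.r1_score + self.r2_score - self.pairing_penalty


def select_best_pair(pairs: Sequence[CandidatePair]) -> CandidatePair:
    """Pick the candidate pair of split groups with the highest pair score."""
    return max(pairs, key=lambda p: p.pair_score)


pairs = [
    CandidatePair("626a", "611", 90, "612", 85, 0),    # pair score 175
    CandidatePair("626b", "614", 110, "616", 80, 40),  # pair score 150
    CandidatePair("626c", "618", 70, "620", 60, 5),    # pair score 125
]
best = select_best_pair(pairs)
```

In this made-up example, split group 614 has the highest individual split group score (110), yet pair 626a wins on pair score, so split groups 611 and 612 would be selected as the predicted split groups for the two mates.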

[0116] In some embodiments, the split read alignment system 106 generates a fragment alignment mapping score (e.g., MAPQ) for a fragment alignment associated with the highest pair score. The fragment alignment mapping score represents the confidence that a given fragment alignment is part of the true alignment from the perspective of a mapping quality metric (e.g., MAPQ). The fragment alignment mapping score for one fragment alignment is independent of other fragment alignments. Specifically, the fragment alignment mapping score is proportional to the difference between the highest pair score and the next highest pair score that does not include the target fragment alignment.
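As a rough sketch of this proportionality, the following function computes a MAPQ-like value from a list of pair scores. The scale factor, the cap of 60, and the representation of each candidate pair as a set of fragment identifiers are assumptions made for illustration and are not taken from the disclosure.

```python
def fragment_mapping_score(pair_scores, target, scale=1.0, cap=60):
    """Toy MAPQ-like score for the fragment alignment `target`.

    pair_scores: list of (pair_score, set of fragment-alignment ids in that pair).
    scale, cap:  hypothetical constants; the text only states proportionality.
    """
    best = max(score for score, _ in pair_scores)
    # Competing pairs are those that do not include the target fragment alignment.
    competitors = [score for score, frags in pair_scores if target not in frags]
    if not competitors:
        return cap  # nothing competes with the target
    diff = best - max(competitors)
    return max(0, min(cap, int(scale * diff)))


candidates = [
    (180.0, {"A", "B"}),
    (170.0, {"A", "E"}),
    (150.0, {"C", "D"}),
]
print(fragment_mapping_score(candidates, "A"))  # uses the difference 180 - 150
print(fragment_mapping_score(candidates, "C"))  # best pair excludes C, difference is 0
```

Because pairs containing the target are excluded from the competitors, a fragment alignment that appears in several high-scoring pairs is not penalized by those pairs.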

[0117] In some embodiments, the split read alignment system 106 may determine a fragment alignment that aligns with an alternative contiguous (or "alternative contig") sequence within the reference genome. FIG. 7 shows a split read alignment system 106 that scores an alternative contig fragment alignment corresponding to a nucleotide read having an alternative contiguous sequence according to one or more embodiments. Briefly, FIG. 7 shows a series of operations 700 including an operation 702 to determine an alternative contig fragment alignment score, an operation 704 to determine a split group score, and an operation 708 to select an alternative contig fragment alignment score. If the alternative contig fragment alignment score exceeds the split group score for the fragment alignment, the split read alignment system 106 reports a split alignment of the fragment alignment with the primary assembly corresponding to the alternative contiguous sequence.

[0118] Generally, the split read alignment system 106 identifies alternative contiguous sequences that represent structural variations. When the split read alignment system 106 determines that a fragment of a nucleotide read exhibits its maximum fragment alignment score against an alternative contiguous sequence, it reports a split alignment in the corresponding primary assembly region. For example, if the split read alignment system 106 determines that a nucleotide read exhibits an alternative contig fragment alignment score against an alternative contiguous sequence, and this alternative contig fragment alignment score exceeds the split group scores of the other candidate split groups for the nucleotide read, the split read alignment system 106 uses the alternative contig fragment alignment score (without a break penalty) for the corresponding lifted-over split alignment instead of the other candidate split group scores. Thus, the alternative contig fragment alignment score can guide the split read alignment system 106 to select and report a given split alignment over other candidate split alignments, represented by other split groups, that might score better in the absence of the alternative contig fragment alignment score.

[0119] When the alternative contiguous sequence represents an SV breakpoint, for example, the split read alignment system 106 can recognize two primary fragment alignments for the same lift-over group as one alternative fragment alignment. In some cases, multiple primary fragments for one lift-over group are treated as duplicates of each other, and only the fragment alignment with the best score is retained. However, in the case of a nucleotide read that matches an alternative contiguous sequence spanning an SV break, the split read alignment system 106 can retain both primary fragment alignments and combine them into a split group that uses the alignment score of the alternative contiguous sequence.

[0120] As shown in FIG. 7, a series of operations 700 shows a split read alignment system 106 that uses a scoring system to identify a split alignment that represents a structural variation when detecting an alternative contig fragment alignment. Generally, the split read alignment system 106 determines when there are two primary fragment alignments (a 5' fragment alignment and a 3' fragment alignment) in the same lift-over group that extend past each other in the nucleotide read. A lift-over group includes fragment alignments that align with either the primary assembly region of the reference genome or an alternative contiguous sequence for the same genomic region. In some embodiments, the split read alignment system 106 determines that the 5' fragment alignment and the 3' fragment alignment exhibit alternative contig characteristics.

[0121] To identify such alternative contig fragment alignments, in some embodiments, the split read alignment system 106 determines a minimum extension, split-alt-min-ext, by which two primary fragment alignments must extend beyond each other in the nucleotide read. The split read alignment system 106 uses the split-alt-min-ext to identify fragment alignments that are eligible as alternative contig fragment alignments. In some embodiments, the split-alt-min-ext is a predetermined value (e.g., 20 bases). In other embodiments, the split read alignment system 106 determines the split-alt-min-ext based on user input. Generally, a higher split-alt-min-ext is more restrictive and reduces the likelihood that the split read alignment system 106 will identify an alternative contig fragment alignment. In some embodiments, the split read alignment system 106 sets the split-alt-min-ext to 0 to disable lift-over-induced split alignments. For example, in some cases, eligibility requires all of the following: (i) the 5' fragment alignment starts within the first split-alt-min-ext bases of the nucleotide read; (ii) the 5' fragment extends at least split-alt-min-ext bases toward the 5' end relative to the 3' fragment; (iii) the 3' fragment extends at least split-alt-min-ext bases toward the 3' end relative to the 5' fragment; and (iv) the best-scoring alignment in the lift-over group is an alternative contig alignment.
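The eligibility conditions above can be sketched as a single predicate. The `ReadSpan` type, the 0-based half-open read coordinates, and the function name are assumptions introduced for illustration; only the parameter split-alt-min-ext and its example value of 20 bases come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadSpan:
    start: int  # 0-based index of the first read base covered (assumed convention)
    end: int    # exclusive end index in read coordinates (assumed convention)


def eligible_alt_contig_split(five_p, three_p, best_is_alt_contig, split_alt_min_ext=20):
    """Check the split-alt-min-ext conditions for a 5'/3' pair of primary fragments."""
    if split_alt_min_ext <= 0:
        return False  # a value of 0 disables lift-over-induced split alignments
    return (
        five_p.start < split_alt_min_ext                        # 5' fragment starts near the read start
        and three_p.start - five_p.start >= split_alt_min_ext   # 5' fragment extends further toward 5' end
        and three_p.end - five_p.end >= split_alt_min_ext       # 3' fragment extends further toward 3' end
        and best_is_alt_contig                                  # best alignment in the group is an alt contig
    )


print(eligible_alt_contig_split(ReadSpan(0, 90), ReadSpan(60, 150), True))
print(eligible_alt_contig_split(ReadSpan(0, 90), ReadSpan(10, 150), True))
```

Raising `split_alt_min_ext` makes each of the first three comparisons harder to satisfy, which matches the statement that a higher value is more restrictive.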

[0122] To determine whether a fragment alignment with an alternative contiguous sequence scores better for a nucleotide read than other candidate split groups, the split read alignment system 106 can use the scoring approach shown in FIG. 7. As shown in FIG. 7, the split read alignment system 106 performs an operation 702 of determining an alternative contig fragment alignment score. The split read alignment system 106 determines an alternative contig fragment alignment score for an inner fragment alignment 712 (3' fragment) and an outer fragment alignment 710 (5' fragment) corresponding to the nucleotide read. As shown in FIG. 7, both the inner fragment alignment 712 and the outer fragment alignment 710 align with an alternative contiguous sequence 714 within the reference genome 718. The alternative contiguous sequence 714 includes an alternative sequence of the primary assembly region 716 of the reference genome 718. In contrast to some existing sequencing systems, the split read alignment system 106 does not treat the inner fragment alignment 712 and the outer fragment alignment 710 as overlapping. Instead, both participate separately in scoring, provided that they satisfy the minimum length by which the two fragment alignments must extend beyond each other in the read (e.g., the split-alt-min-ext requirement).

[0123] In fact, in some embodiments, the split read alignment system 106 determines an alternative contig fragment alignment score for the inner fragment alignment 712 and an alternative contig fragment alignment score for the outer fragment alignment 710 in the same way as the split read alignment system 106 determines the fragment alignment score. For example, the split read alignment system 106 determines the alternative contig fragment alignment score by determining a Smith-Waterman score or a variation of the Smith-Waterman score.

[0124] In addition to determining an alternative contig fragment alignment score for each of the fragment alignments, split read alignment system 106 performs an operation 704 of determining a split group score. In particular, split read alignment system 106 determines split group scores for inner fragment alignment 712 and outer fragment alignment 710 with respect to the primary assembly region 716 of the reference genome 718.

[0125] As further shown in FIG. 7, split read alignment system 106 performs an operation 708 of selecting an alternative contig fragment alignment score. Generally, split read alignment system 106 utilizes the best alignment score of the lift-over group among the alternative contig fragment alignment score(s) and the split group score. Thus, split read alignment system 106 may replace the split group score with the best alternative contig fragment alignment score, which becomes the replacement split group score.

[0126] Based on determining that the alternative contig fragment alignment score exceeds the split group score, split read alignment system 106 utilizes the alternative contig fragment alignment score for the fragment alignment process. In some embodiments, split read alignment system 106 further determines, by comparison, that the alternative contig fragment alignment score exceeds other split group scores of the inner fragment alignment 712 and the outer fragment alignment 710 with respect to other primary assembly regions.

[0127] If the alternative contig fragment alignment score exceeds the split group score of the fragment alignment, the split read alignment system 106 reports the associated split alignment that includes the outer fragment alignment 710 and the inner fragment alignment 712. By reporting the associated split alignment, the split read alignment system 106 effectively reports or indicates the alignment of the nucleotide read with the alternative contig 714 itself. By using the alternative contig fragment alignment score as the replacement split group score, the split read alignment system 106 facilitates the selection of the split group corresponding to the alternative contig 714 over other candidate split groups. In other words, the split read alignment system 106 assigns to the split group of the primary assembly the higher score inherited from the corresponding alternative contig sequence. By using the alternative contig fragment alignment score as the split group score, the split read alignment system 106 further increases the fragment alignment mapping score (e.g., MAPQ) corresponding to the fragment alignment within the split group.
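The score replacement of operation 708 amounts to taking the better of the split group score and the best alternative contig fragment alignment score. The following is a minimal sketch under that reading; the function name and the returned flag are hypothetical.

```python
def effective_split_group_score(split_group_score, alt_contig_scores):
    """Return (score, replaced) for a lifted-over split group.

    split_group_score: score of the split group against the primary assembly.
    alt_contig_scores: alternative contig fragment alignment scores in the
                       same lift-over group (may be empty).
    """
    best_alt = max(alt_contig_scores, default=None)
    if best_alt is not None and best_alt > split_group_score:
        # The alternative contig score becomes the replacement split group score.
        return best_alt, True
    return split_group_score, False


print(effective_split_group_score(120, [145, 130]))
print(effective_split_group_score(120, [100]))
```

The returned flag indicates whether the split group inherited its score from an alternative contig, which a caller could use when deciding how to report the split alignment.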

[0128] In some embodiments, the split read alignment system 106 filters out unreliable fragment alignments by utilizing a threshold fragment alignment score and a minimum alignment score. According to one or more embodiments, FIGS. 8-9 each show a split read alignment system 106 that utilizes a threshold fragment alignment score and a minimum alignment score to remove candidate split groups and identify candidate split groups that do not report an alignment.

[0129] FIG. 8 shows a split read alignment system 106 that removes candidate split groups with abnormal forms using fragment alignment scores according to one or more embodiments. Briefly, FIG. 8 shows a series of operations 800 that include an operation 802 of determining that a fragment alignment score does not meet a threshold fragment alignment score and an operation 804 of removing the fragment alignment.

[0130] As shown in FIG. 8, the series of operations 800 includes an operation 802 of determining that a fragment alignment score does not meet a threshold fragment alignment score. In particular, the split read alignment system 106 determines that the fragment alignment score for the fragment alignment corresponding to the candidate split group does not meet the threshold fragment alignment score. The split read alignment system 106 may determine the threshold fragment alignment score based on user input. Additionally or alternatively, the split read alignment system 106 uses a predetermined threshold fragment alignment score. The threshold fragment alignment score may be the minimum fragment alignment score required for a fragment alignment to participate in split read alignment. For example, the fragment alignment score for fragment alignment A may be below the threshold fragment alignment score.

[0131] As further shown in FIG. 8, the split read alignment system 106 performs an operation 804 to remove fragment alignments. More specifically, the split read alignment system 106 removes sub-threshold fragment alignments from consideration when forming candidate split groups. For example, based on determining that the fragment alignment score for fragment alignment A is below a threshold fragment alignment score, the split read alignment system 106 excludes fragment alignment A from consideration. Thus, the split read alignment system 106 never forms a split group that includes fragment alignment A and fragment alignment B. By removing sub-threshold fragment alignments from consideration, the split read alignment system 106 effectively filters out untrustworthy fragment alignments at input and completely ignores them. The threshold fragment alignment score is mainly useful for low-score inner (3’) fragments that may be included because their properly paired positions obtain a large score benefit through a low pairing penalty. Further, in some embodiments, the split read alignment system 106 also prevents sub-threshold fragment alignments from participating in any generated multi-fragment alignment split group.

[0132] The split read alignment system 106 further reduces noise by utilizing a minimum alignment score. FIG. 9 shows a split read alignment system 106 that utilizes a minimum alignment score to identify candidate split groups that do not report an alignment, according to one or more embodiments. Briefly, FIG. 9 shows a series of operations 900 that include an operation 902 of determining that the alignment score for a candidate split group does not meet the minimum alignment score, and an operation 904 of refraining from reporting the split alignment.

[0133] As shown in FIG. 9, the split read alignment system 106 performs an operation 902 that determines that the alignment score for a candidate split group does not meet the minimum alignment score. The alignment score for a candidate split group refers to the alignment score for the entire split group. In some embodiments, the alignment score for a candidate split group includes the split group score. By way of example, the split read alignment system 106 determines that the split group score for candidate split group 906 is below the minimum alignment score. The split read alignment system 106 may determine the minimum alignment score based on user input or may have a predetermined minimum alignment score.

[0134] In contrast to existing alignment systems, the split read alignment system 106 can report a split alignment even when component fragment alignments have low fragment alignment scores. By way of example, fragment alignment A and/or fragment alignment B can have individual alignment scores that are less than a minimum alignment score. However, the A+B split group score can exceed the minimum alignment score. In this case, the split read alignment system 106 can report the A+B split alignment. In contrast, existing alignment systems would have excluded one or both of fragment alignment A and fragment alignment B for failing to meet the minimum alignment score. In essence, the split read alignment system 106 leverages the generation of split group scores by splitting a threshold score into two separate parameters (a threshold fragment alignment score and a minimum alignment score). The threshold fragment alignment score pre-filters fragment alignments by disqualifying sub-threshold fragment alignments from participation in split alignments. The threshold fragment alignment score utilized by the split read alignment system 106 can be lower and more permissive than the alignment scores utilized by existing alignment systems. In some embodiments, the split read alignment system 106 configures the minimum alignment score such that it filters candidate split groups only after low-score fragment alignments have had an opportunity to participate in candidate split groups that can potentially achieve higher split group scores. Thus, the split read alignment system 106 maintains a final minimum score that achieves a target level of noise filtering similar to existing alignment systems, but does so in a way that provides sensitivity to lower-score constituent fragment alignments that are part of a full read alignment.
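The two-parameter filtering can be sketched as follows. The split group score used here (sum of fragment scores minus one break penalty per break) is a simplified stand-in and not the scoring of the disclosure; the function names and all numeric values are likewise hypothetical.

```python
def prefilter_fragments(fragment_scores, threshold_fragment_score):
    """Drop fragment alignments scoring below the per-fragment threshold."""
    return {name: score for name, score in fragment_scores.items()
            if score >= threshold_fragment_score}


def toy_split_group_score(scores, break_penalty):
    """Hypothetical stand-in: sum of fragment scores minus one penalty per break."""
    scores = list(scores)
    return sum(scores) - break_penalty * (len(scores) - 1)


def is_reportable(split_group_score, min_alignment_score):
    """Apply the minimum alignment score to the whole split group."""
    return split_group_score >= min_alignment_score


fragment_scores = {"A": 30, "B": 35, "C": 12}
kept = prefilter_fragments(fragment_scores, threshold_fragment_score=20)
ab_score = toy_split_group_score([kept["A"], kept["B"]], break_penalty=10)
print(sorted(kept), ab_score, is_reportable(ab_score, min_alignment_score=50))
```

In this example neither A nor B alone reaches the minimum alignment score of 50, so a single-threshold filter would discard both, whereas the A+B split group passes once the per-fragment threshold and the group-level minimum are applied separately.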

[0135] The split read alignment system 106 additionally performs an operation 904 that refrains from reporting split alignments. In particular, the split read alignment system 106 refrains from reporting split alignments of candidate split groups in an alignment file or a variant call file based on the alignment score not meeting the minimum alignment score. For purposes of illustration, the split read alignment system 106 does not report candidate split group 906 as a predicted split group.

[0136] In some embodiments, even if the split read alignment system 106 does not report candidate split group 906, the split read alignment system 106 still considers candidate split group 906 as a competitor to other alignments. If the top pair score is associated with a split group score that is below the minimum alignment score, the split read alignment system 106 reports the reads as unmapped. Conversely, even when another alignment or split group has the top pair score, the split read alignment system 106 may reduce the fragment alignment mapping score (e.g., MAPQ) for the fragment alignment if the pair score of the failed split group was the second best. As described above, the fragment alignment mapping score represents the confidence that a given fragment alignment is part of (or maps to) the true alignment from the perspective of a mapping quality metric (e.g., MAPQ).

[0137] In some embodiments, the split read alignment system 106 generates and stores configuration registers as part of determining split read alignment. In the previous description, register entries including split-log2-coeff, primary-5p, etc. were described. The following table provides an overview of additional configuration register entries defined by the split read alignment system 106 according to one or more embodiments.

[0138]

Table 1

[0139] In some embodiments, the split read alignment system 106 assigns alignment tags to fragment alignments indicating strand orientation. More specifically, the XS tag is defined as the raw competing fragment score. In some embodiments, for a given fragment alignment, the XS is the highest score of any other fragment alignment that mostly overlaps a given fragment alignment from a nucleotide read (and thus is not eligible for a split alignment with the given fragment alignment). In other embodiments, the split read alignment system 106 determines that the XS for all non-secondary fragment alignments (both primary and supplementary) is the highest fragment score that does not participate in a successful or highest-scoring split group. The XS for all secondary alignments (both non-supplementary and supplementary) is the highest fragment score that participates in a successful split group.

[0140] In some embodiments, the split read alignment system 106 determines nucleotide calls for genomic regions based on the alignment of predicted split groups to a reference genome. FIG. 10 shows a split read alignment system 106 that generates nucleotide call and variant call files, according to one or more embodiments. Briefly, FIG. 10 shows a series of operations 1000 that include an operation 1002 to identify nucleotide reads, an operation 1004 to align the nucleotide reads to a reference genome, an operation 1006 to generate nucleotide calls, and a resulting variant call file 1008.

[0141] As shown in FIG. 10, the split read alignment system 106 executes an operation 1002 to identify nucleotide reads. In one or more embodiments, operation 1002 includes identifying nucleotide reads from a genomic sample. In some embodiments, the sequencing device 114 determines nucleotide reads from the sample genome (e.g., by using SBS) and transmits data representing the nucleotide reads to the sequencing system 104 (e.g., in a base call file). In an alternative embodiment, a third-party system determines nucleotide reads from the sample genome and enables the sequencing system 104 to access the nucleotide reads.

[0142] The series of operations 1000 shown in FIG. 10 further includes an operation 1004 to align the nucleotide reads with a reference genome. As shown, the split read alignment system 106 aligns the nucleotide read 1010 with the reference genome. For example, in various embodiments, the sequencing system 104 causes the nucleotide read 1010 to be aligned with the reference genome. As part of executing operation 1004, the split read alignment system 106 determines fragment alignments and determines predicted split groups.

[0143] As further shown in FIG. 10, the split read alignment system 106 executes an operation 1006 to generate nucleotide base calls. Generally, the nucleotide base calls include the final prediction of the nucleotide bases at the genomic coordinates of the sample genome for a variant call file (VCF) 1008 or other base call output file, based on aligning the nucleotide reads to the reference genome. Owing to the accuracy of the predicted split groups, the sequencing system 104 can generate nucleotide base calls with higher accuracy and confidence for genomic coordinates than existing sequencing systems.

[0144] In some examples, the split read alignment system 106 reports split alignments using the BAM/SAM file format. The BAM/SAM file specification provides three different alignment types: primary, supplementary, and secondary. In some examples, the FLAG bits indicate supplementary and/or secondary designations. According to the BAM/SAM specification, exactly one primary alignment is recognized (having neither a supplementary FLAG set nor a secondary FLAG set). Thus, a split alignment having N ≥ 2 fragments is represented as one primary fragment alignment BAM/SAM record and N-1 supplementary fragment alignment BAM/SAM records.

[0145] Thus, typically, the split read alignment system 106 cannot output the entire split group as a primary alignment without using special means or encoding. The split read alignment system 106 identifies which of the N fragment alignments should be selected for primary alignment status, and the remaining N-1 fragment alignments receive supplementary alignment status. In some embodiments, the split read alignment system 106 determines the primary alignment output based on the parameter primary-5p. If primary-5p = 0, the primary fragment alignment is selected to support proper pairing and is typically the most 3' fragment alignment. Additionally or alternatively, the split read alignment system 106 sets primary-5p to 1 to set the most 5' fragment alignment as the primary alignment.
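The primary/supplementary assignment can be sketched with the supplementary FLAG bit (0x800) defined by the SAM specification. The function name is hypothetical, and the case primary-5p = 0 is simplified here to always choose the most 3' fragment alignment, whereas the description says only that this is typically the choice.

```python
SUPPLEMENTARY = 0x800  # SAM FLAG bit for a supplementary alignment


def assign_split_group_flags(n_fragments, primary_5p):
    """Return FLAG bits for N fragment alignments listed in 5'-to-3' read order.

    Simplifying assumption: primary_5p == 0 selects the most 3' fragment
    alignment as primary; primary_5p == 1 selects the most 5' fragment alignment.
    """
    if n_fragments < 1:
        raise ValueError("a split group needs at least one fragment alignment")
    primary_index = 0 if primary_5p else n_fragments - 1
    return [0 if i == primary_index else SUPPLEMENTARY for i in range(n_fragments)]


print(assign_split_group_flags(3, primary_5p=1))
print(assign_split_group_flags(3, primary_5p=0))
```

Either setting yields exactly one primary record and N-1 supplementary records, consistent with the BAM/SAM representation described above.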

[0146] When the split read alignment system 106 determines to output a secondary alignment, the split read alignment system 106 selects secondary fragment alignments in descending order of pair score. Generally, secondary alignments are not related to the primary alignment but include additional alignment records representing alternative alignment candidates. Some of the secondary fragment alignments can themselves be non-trivial split groups. The split read alignment system 106 can determine to output complete split groups for the secondary alignments. Each of the complete split groups mimics the primary/supplementary structure of the successful split groups but has a secondary flag. However, if the fragment alignments of the secondary split groups have already been output (in either the highest-scoring split group or a higher-scoring secondary split group), the split read alignment system 106 blocks the output of the supplementary secondary fragment alignments. More specifically, a supplementary alignment includes additional alignment records that supplement the primary alignment or present additional portions of the split alignment.
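One possible reading of this output rule is sketched below. The data layout (each secondary split group as a pair score plus a list of fragment identifiers, with the primary-designated fragment first) and the function name are assumptions; the FLAG bit values 0x100 (secondary) and 0x800 (supplementary) are those of the SAM specification.

```python
SECONDARY = 0x100      # SAM FLAG bit for a secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit for a supplementary alignment


def plan_secondary_output(reported_fragments, secondary_groups):
    """Return (fragment id, FLAG) records for secondary split groups.

    reported_fragments: fragment ids already output in the highest-scoring split group.
    secondary_groups:   list of (pair_score, [fragment ids]); the first id of each
                        list is the group's primary-designated fragment (assumed layout).
    """
    emitted = set(reported_fragments)
    records = []
    # Secondary split groups are output in descending order of pair score.
    for _, fragments in sorted(secondary_groups, key=lambda g: g[0], reverse=True):
        for i, fragment in enumerate(fragments):
            is_supplementary = i > 0
            if is_supplementary and fragment in emitted:
                continue  # block supplementary records that were already output
            flag = SECONDARY | (SUPPLEMENTARY if is_supplementary else 0)
            records.append((fragment, flag))
            emitted.add(fragment)
    return records


print(plan_secondary_output(["A", "B"], [(90, ["C", "B"]), (120, ["D", "E"])]))
```

Here fragment B is skipped in the lower-scoring secondary group because it was already output as part of the highest-scoring split group.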

[0147] As described above, the split read alignment system 106 improves the alignment of split reads and improves the accuracy of the corresponding nucleotide calls, including structural variant calls. According to one or more embodiments, FIGS. 11A-11D show read pileups of candidate gene fusion events, based on transcriptome reads, in which the split read alignment system 106 produces more accurate mapping and alignment, and consequently more accurate variant calls, than existing sequencing systems. As shown by FIGS. 11A-11D, the split read alignment system 106 determines a split group score for candidate split groups that include fragment alignments from fragments of nucleotide reads (e.g., transcriptome reads), and selects a predicted split group from among the candidates based on such split group scores, thereby (i) identifying fragment alignments for candidate split reads with higher accuracy than existing sequencing systems, and (ii) determining true negative variant calls (here, no gene fusion) at genomic coordinates and breakpoints where existing sequencing systems determine false positive variant calls for gene fusion events.

[0148] FIGS. 11A and 11B complement each other by showing the breakpoint along the chromosome (FIG. 11A), and the different read fragment alignments and mappings determined by the split read alignment system 106 and an existing sequencing system (FIG. 11B) for the same breakpoint. As shown in FIG. 11A, for example, chromosomal segment 1102a for chromosome 11 includes breakpoint 1104a. In particular, breakpoint 1104a shown in FIG. 11A identifies one or more genomic coordinates where nucleotide reads were aligned by an existing sequencing system, and the break between nucleotide read fragments is then shown in FIG. 11B. As further described below, the split alignment of transcriptome reads to breakpoint 1104a may indicate a gene fusion event between the ARL2-SNX15 RNA gene and another gene.

[0149] As shown in FIG. 11B, the user client device 108 presents a graphical user interface 1100a including different read fragment alignments and mappings determined by the split read alignment system 106 and the existing sequencing system with respect to the breakpoint 1104a. For example, the graphical user interface 1100a can represent the graphical user interface of Integrative Genomics Viewer (IGV) including read alignments to the reference genome. For comparison, the graphical user interface 1100a includes an updated alignment window 1106a showing candidate transcriptome read alignments of the split read alignment system 106, a previous alignment window 1108a showing candidate transcriptome read alignments of the existing sequencing system, and a reference genome window 1110a showing reference nucleotide bases of the reference genome. In FIG. 11B, the updated alignment window 1106a also includes a read coverage marker 1120a showing read coverage (e.g., read depth) at genomic coordinates overlapping the breakpoint 1104a.

[0150] As shown in the previous alignment window 1108a, existing sequencing systems map transcriptome read fragments 1114a and align them to the reference genome at genomic coordinates corresponding to (or relatively close to) breakpoint 1104a. As indicated by the light gray shading of transcriptome read fragments 1114a within the previous alignment window 1108a, the called nucleotide bases of transcriptome read fragments 1114a match the reference nucleotide bases of the reference genome within reference genome window 1110a. In contrast to transcriptome read fragments 1114a, existing sequencing systems (i) map and align mismatched transcriptome read fragments 1112a to a genomic region corresponding to the ARL2 contiguous sequence located upstream from breakpoint 1104a, and (ii) map and align mismatched transcriptome read fragments 1112b to a genomic region corresponding to the SNX15 contiguous sequence located downstream from breakpoint 1104a. As indicated by the different gray shading or colors of mismatched transcriptome read fragments 1112a and 1112b within the previous alignment window 1108a, the called nucleotide bases of mismatched transcriptome read fragments 1112a and 1112b do not match the reference nucleotide bases of the reference genome within reference genome window 1110a.

[0151] Because a threshold number of the called nucleotide bases do not match the reference nucleotide bases, existing sequencing systems clip the nucleotide bases within the mismatched transcriptome read fragments 1112a and 1112b (e.g., soft clip or hard clip), thereby ignoring those nucleotide bases for alignment purposes. However, the mismatched transcriptome read fragments 1112a and 1112b indicate a split alignment of the corresponding transcriptome reads with respect to the reference genome. The candidate alignments of both mismatched transcriptome read fragments 1112a and 1112b by existing sequencing systems represent supplementary alignments with a positive mapping quality metric (e.g., positive MAPQ) and correspond to a primary alignment with another gene (e.g., the AKT3 gene). Based on the scoring of the primary and supplementary alignments of such corresponding transcriptome reads shown in the previous alignment window 1108a, existing sequencing systems determine false positive variant calls for gene fusion events for the genomic sample. For example, in some cases, existing sequencing systems realign the mismatched transcriptome read fragments 1112a and 1112b with a genomic region of another gene (e.g., the AKT3 gene on chromosome 1), thereby indicating a gene fusion event.

[0152] As shown in the updated alignment window 1106a, the split read alignment system 106 maps the transcriptome read fragment 1116a and aligns it with the reference genome at the genomic coordinates corresponding to (or relatively close to) the breakpoint 1104a. As shown by the light gray shading of the transcriptome read fragment 1116a, the called nucleotide bases of the transcriptome read fragment 1116a match the reference nucleotide bases of the reference genome within the reference genome window 1110a. In contrast to the transcriptome read fragment 1116a, the split read alignment system 106 maps and aligns the mismatched transcriptome read fragment 1118a with the genomic region corresponding to the SNX15 contiguous sequence located downstream from the breakpoint 1104a, but does not map or align the mismatched transcriptome read fragments upstream from the breakpoint 1104a. As shown by the different gray shading or color of the mismatched transcriptome read fragment 1118a within the updated alignment window 1106a, the called nucleotide bases of the mismatched transcriptome read fragment 1118a do not match the reference nucleotide bases of the reference genome within the reference genome window 1110a.

[0153] As further shown by FIG. 11B, the candidate alignment of the mismatched transcriptome read fragment 1118a by the split read alignment system 106 shows a mapping quality metric of 0 (e.g., MAPQ0), thereby causing the split read alignment system 106 to exclude the candidate alignment of the mismatched transcriptome read fragment 1118a. Since the split read alignment system 106 excludes the mismatched transcriptome read fragment 1118a aligned with the genomic region on one side of the breakpoint 1104a, the split read alignment system 106 avoids determining false positive variant calls for gene fusion events for the same genomic sample (as determined by existing sequencing systems). By generating an improved split group score for candidate split alignments, the split read alignment system 106 avoids the "noisy" split reads indicated by the candidate alignments of existing sequencing systems in the previous alignment window 1108a. By avoiding such noisy split read alignments, the split read alignment system 106 also avoids calling inaccurate gene fusion variants and accurately identifies true negative variant calls for gene fusions.

[0154] FIGS. 11C and 11D complement each other by showing the breakpoint along the chromosome (FIG. 11C), as well as the different read fragment alignments and mappings determined by the split-read alignment system 106 and the existing sequencing system (FIG. 11D) for the same breakpoint. As shown in FIG. 11C, for example, chromosome segment 1102b of chromosome 4 includes breakpoint 1104b. In particular, breakpoint 1104b shown in FIG. 11C identifies one or more genomic coordinates where transcriptome reads were aligned by the existing sequencing system, and the break between read fragments is shown in the subsequent FIG. 11D. As further explained below, the split-read alignment of transcriptome reads for breakpoint 1104b can indicate a gene fusion event between the DCTD gene and another gene.

[0155] As shown in FIG. 11D, user client device 108 presents a graphical user interface 1100b that includes different read fragment alignments and mappings determined by the split-read alignment system 106 and the existing sequencing system for breakpoint 1104b. As described above, for example, graphical user interface 1100b represents the graphical user interface of IGV that includes the transcriptome read alignment to the reference genome. For comparison, graphical user interface 1100b includes an updated alignment window 1106b that shows the transcriptome read alignment of the split-read alignment system 106, a previous alignment window 1108b that shows the transcriptome read alignment of the existing sequencing system, and a reference genome window 1110b that shows the reference nucleotide bases of the reference genome. In FIG. 11D, the updated alignment window 1106b further includes a read coverage marker 1120b that shows the read coverage (e.g., read depth) at the genomic coordinates that overlap breakpoint 1104b.

[0156] As shown in the previous alignment window 1108b, existing sequencing systems map transcriptome read fragments 1114b and align them with the reference genome at genomic coordinates corresponding to (or relatively close to) breakpoint 1104b. Similar to the graphical user interface 1100a of FIG. 11B, the graphical user interface 1100b of FIG. 11D includes a light gray shading indicating that the called nucleotide bases of a transcriptome read fragment (e.g., transcriptome read fragment 1114b) match the reference nucleotide bases of the reference genome, and a different gray shading or color indicating that the called nucleotide bases of a mismatched transcriptome read fragment (e.g., mismatched transcriptome read fragments 1112c, 1112d, and 1118b) do not match the reference nucleotide bases. In contrast to transcriptome read fragment 1114b, existing sequencing systems (i) map and align mismatched transcriptome read fragment 1112c with a genomic region corresponding to a contiguous sequence located upstream from breakpoint 1104b, and (ii) map and align mismatched transcriptome read fragment 1112d with a genomic region corresponding to a contiguous sequence located downstream from breakpoint 1104b.

[0157] Since a threshold number of the called nucleotide bases do not match the reference nucleotide bases, existing sequencing systems clip the nucleotide bases within mismatched transcriptome read fragments 1112c and 1112d, thereby ignoring the nucleotide bases of mismatched transcriptome read fragments 1112c and 1112d for alignment purposes. As shown in FIG. 11D, mismatched transcriptome read fragments 1112c and 1112d show a split alignment of the corresponding transcriptome reads with respect to the reference genome. The candidate alignments of both mismatched transcriptome read fragments 1112c and 1112d by the existing sequencing system represent supplementary alignments with a positive mapping quality metric (e.g., positive MAPQ) and correspond to a primary alignment with another gene (not shown). Based on the scoring of the primary and supplementary alignments of such corresponding transcriptome reads shown in the previous alignment window 1108b, the existing sequencing system determines a false positive variant call for a gene fusion event for the genomic sample. For example, in some cases, the existing sequencing system realigns mismatched transcriptome read fragments 1112c and 1112d with genomic regions of another gene on the same chromosome (e.g., chromosome 4) or another gene on a different chromosome, thereby indicating a gene fusion event.

[0158] In contrast, as shown in the updated alignment window 1106b, the split read alignment system 106 maps and aligns the mismatched transcriptome read fragment 1118b to the genomic region corresponding to the contiguous sequence located upstream from the breakpoint 1104b, but does not map or align any mismatched transcriptome read fragments downstream from the breakpoint 1104b. As further shown by FIG. 11D, the candidate alignment of the mismatched transcriptome read fragment 1118b by the split read alignment system 106 exhibits a relatively low mapping quality metric (e.g., MAPQ0), thereby causing the split read alignment system 106 to exclude the candidate alignment of the mismatched transcriptome read fragment 1118b. Since the split read alignment system 106 excludes the mismatched transcriptome read fragment 1118b aligned to the genomic region on one side of the breakpoint 1104b, the split read alignment system 106 does not determine false positive variant calls for gene fusion events for the same genomic sample. By generating an improved split group score for candidate split alignments, the split read alignment system 106 avoids the "noisy" split reads indicated by the candidate alignments of existing sequencing systems in the previous alignment window 1108b. As described above, by avoiding such noisy split read alignments, the split read alignment system 106 also avoids calling inaccurate gene fusion mutations and accurately identifies true negative mutations with respect to gene fusions.

[0159] In addition to improving accuracy for gene fusion calls, in some embodiments, the split read alignment system 106 also improves nucleotide read coverage and variant calling accuracy for chromosome M (human mitochondrial DNA) by selecting more accurate mappings and alignments based on an improved split group score. According to one or more embodiments, FIGS. 12A-12D show coverage graphs 1200a-1200d showing higher coverage by nucleotide reads mapped and aligned to genomic regions of chromosome M using the split read alignment system 106 as compared to such coverage from nucleotide reads mapped and aligned using an existing sequencing system. As shown in FIGS. 12A-12D, the improved nucleotide read coverage extends to the start and the end of chromosome M, which are genomic regions that are more difficult to cover and call. According to one or more embodiments, FIG. 13 shows a variant call table 1300 showing better accuracy for SNP calls and indel calls by the split read alignment system 106 in the genomic region of chromosome M as compared to such SNP calls and indel calls by an existing sequencing system.

[0160] The terminal genomic regions of chromosome M are known to be difficult for existing sequencing systems to cover and call, due in part to the circular nature of mitochondrial DNA. Because existing models for mapping and alignment represent the circular DNA of chromosome M in a linear format, existing sequencing systems often inaccurately soft-clip nucleotide reads that align with the terminal genomic regions of chromosome M, thereby potentially ignoring valuable nucleotide read data for those regions. In contrast to existing sequencing systems, as shown by FIGS. 12A-12D, the split-read alignment system 106 generates an improved split-group score that penalizes split alignments across different chromosomes, thereby improving the selection of split groups and the mapping and alignment for chromosome M.
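The effect of representing circular chromosome M linearly can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name `linear_fragments` and the 0-based, half-open coordinate convention are assumptions made for illustration, and the contig length of 16,569 base pairs is the chromosome M length stated elsewhere in this description. The sketch shows that a read spanning the origin of the circular contig corresponds to two fragment alignments on the linear representation, one at the end and one at the start.

```python
CHRM_LENGTH = 16_569  # chromosome M length in base pairs, as stated in the description


def linear_fragments(start, read_len, contig_len=CHRM_LENGTH):
    """Return the linear (start, end) intervals covered by a read on a circular contig.

    Coordinates are 0-based and half-open (an assumption for this sketch).
    A read that runs past the end of the linear representation is returned
    as two intervals: one ending at the contig end, one starting at 0.
    """
    if not 0 <= start < contig_len:
        raise ValueError("start must lie within the contig")
    if not 0 < read_len <= contig_len:
        raise ValueError("read_len must be between 1 and the contig length")
    end = start + read_len
    if end <= contig_len:
        return [(start, end)]
    return [(start, contig_len), (0, end - contig_len)]


# A 150-base read starting 69 bases before the end wraps around the origin.
print(linear_fragments(16_500, 150))
```

An aligner that keeps only the first interval and soft-clips the remainder discards the bases that belong to the start of the contig, whereas a split alignment can retain both fragments.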

[0161] To test the nucleotide read coverage for fragment alignment from the split-read alignment system 106, the researchers ran the split-read alignment system 106 and existing sequencing systems on mitochondrial samples from the Fazzini dataset as described in Federica Fazzini et al., “Analyzing Low-Level mtDNA Heteroplasmy - Pitfalls and Challenges from Bench to Benchmarking”, Int’l J. Mol. Sci. January 19, 2021; 22(2):935, which is incorporated herein by reference in its entirety. For example, the researchers sequenced and aligned nucleotide reads from three mtDNA mixtures with different target allele frequencies, where sample mixture M1 contained a 1:2 mixture and a 50% target allele frequency, sample mixture M2 contained a 1:10 mixture and a 10% target allele frequency, and sample mixture M3 contained a 1:50 mixture and a 2% target allele frequency. In some cases, the researchers used different versions of Taq polymerase, including LA Advantage (by Clontech Laboratories), Herculase II Fusion (HERK), and LongAmp Taq polymerase (NEB), for polymerase chain reaction (PCR). The researchers also sequenced nucleotide reads from sample mixtures M1, M2, and M3 using two different protocols: pre-mixing PCR amplification and post-mixing PCR amplification. The researchers further plotted the nucleotide read coverage at the genomic coordinates of the start and end of chromosome M in FIGS. 12A-12D. Additionally, as shown in FIG. 13, the researchers determined false positive and false negative variant calls for SNPs and indels in sample mixtures M1, M2, and M3 using different versions of PCR Taq polymerase and protocols.

[0162] As shown in FIGS. 12A and 12B, coverage graphs 1200a and 1200b show the coverage of nucleotide reads sequenced from sample mixture M1 using HERK, as mapped and aligned by split-read alignment system 106 and an existing sequencing system. In FIGS. 12A and 12B, graph keys 1202a and 1202b display coverage plot lines for split-read alignment system 106 designated as MapperV2 (i.e., MapperV2_All, MapperV2_60, MapperV2_20, and MapperV2_gvcf), and coverage plot lines for an existing sequencing system designated as curMapper (i.e., curMapper_All, curMapper_60, curMapper_20, and curMapper_gvcf). As shown by coverage graph 1200a and graph key 1202a in FIG. 12A, split-read alignment system 106 maps and aligns nucleotide reads that have consistently higher coverage across the starting genomic region of chromosome M (chrM:0-100) relative to the existing sequencing system, including all mapped nucleotide reads as shown by the comparison of the plot lines for MapperV2_All and curMapper_All. As shown by coverage graph 1200b and graph key 1202b in FIG. 12B, beyond the starting genomic region of chromosome M, split-read alignment system 106 also maps and aligns nucleotide reads that have consistently higher coverage across the ending genomic region of chromosome M (chrM:16469-16569) compared to the existing sequencing system, again including all mapped nucleotide reads as shown by the comparison of the plot lines for MapperV2_All and curMapper_All.

[0163] As shown in FIGS. 12C and 12D, coverage graphs 1200c and 1200d also show the coverage of nucleotide reads sequenced from sample mixture M1 using Clontech Taq polymerase, as mapped and aligned by split-read alignment system 106 and an existing sequencing system. In FIGS. 12C and 12D, graph keys 1202c and 1202d display coverage plot lines for split-read alignment system 106 designated as MapperV2 (i.e., MapperV2_All, MapperV2_60, MapperV2_20, and MapperV2_gvcf), and coverage plot lines for an existing sequencing system designated as curMapper (i.e., curMapper_All, curMapper_60, curMapper_20, and curMapper_gvcf). As shown by coverage graph 1200c and graph key 1202c in FIG. 12C, split-read alignment system 106 maps and aligns nucleotide reads that have consistently higher coverage across the starting genomic region of chromosome M (chrM:0-100) relative to the existing sequencing system, including all mapped nucleotide reads as shown by the comparison of the plot lines for MapperV2_All and curMapper_All. As shown by coverage graph 1200d and graph key 1202d in FIG. 12D, beyond the starting genomic region of chromosome M, split-read alignment system 106 also maps and aligns nucleotide reads that have consistently higher coverage across the ending genomic region of chromosome M (chrM:16469-16569) compared to the existing sequencing system, again including all mapped nucleotide reads as shown by the comparison of the plot lines for MapperV2_All and curMapper_All.

[0164] As described above, FIG. 13 shows a variant call table 1300 showing false positive and false negative variant calls for SNPs and indels by the split read alignment system 106 and an existing sequencing system in the genomic region of chromosome M for sample mixtures M1, M2, and M3 using different versions of PCR Taq polymerase and different PCR protocols. On the left side, the variant call table 1300 shows false positive and false negative variant calls for SNPs and indels by the existing sequencing system as shown in the column of "Dataset_jama_REV7169". On the right side, the variant call table 1300 shows false positive and false negative variant calls for SNPs and indels by the split read alignment system 106 as shown in the column of "CGM_mapperV2". As shown in the "Total" and "Difference" columns of the variant call table 1300, the split read alignment system 106 consistently determines fewer total false positive and false negative SNP and indel calls than the existing sequencing system. Although the "Difference" column of the variant call table 1300 shows that the split read alignment system 106 determines only 1 to 8 fewer false positive and false negative SNP and indel calls, such a reduction is significant for a chromosome as short as chromosome M, which has a length of only 16,569 base pairs.

[0165] Beyond improved nucleotide read coverage and improved variant calls for chromosome M, in some embodiments, the split read alignment system 106 also improves the accuracy of structural variant calls. According to one or more embodiments, FIG. 14A shows a table 1400a showing the split read alignment system 106 recovering an insertion call missed by the existing sequencing system for genes affecting acute myeloid leukemia (AML). According to one or more embodiments, FIG. 14B shows a table 1400b showing that the split read alignment system 106 determines more accurate duplication and translocation calls compared to the existing sequencing system.

[0166] As shown in FIG. 14A, for example, Table 1400a compares insertion calls by split-read alignment system 106 with insertion calls by an existing sequencing system for known genomic samples from both normal and tumor tissues within the fms-like tyrosine kinase 3 (FLT3) gene. Mutations in the FLT3 gene are responsible for a significant proportion of AML cases, and internal tandem duplications (ITDs) represent the most common type of FLT3 mutation. As shown in Table 1400a, based on the improved split-group scores and better selection of split groups for variant calling, split-read alignment system 106 (shown in the column of the "New M / A+ Call Generation Model") accurately determines insertion calls of at least 50 base pairs in length at genomic coordinates chr13:28034103 and chr13:28034120 for a pair of known genomic samples with FLT3-ITD mutations, while the existing sequencing system (shown in the column of the "Call Generation Model" alone) misses such insertions. As further shown in Table 1400a, split-read alignment system 106 (shown in the column of the "New M / A+ Call Generation Model") also accurately determines the presence or absence of such insertions at other genomic coordinates for which the existing sequencing system (shown in the column of the "Call Generation Model" alone) also made accurate determinations. Such recovered insertion calls (and retention of previously accurate insertion calls) by split-read alignment system 106 demonstrate a significant improvement and retention of accuracy for structural variant calls within genes important for cancer diagnosis.

[0167] As shown in FIG. 14B, Table 1400b compares the accuracy of somatic structural variant calls by the split read alignment system 106 with the accuracy of somatic structural variant calls by an existing sequencing system from sequencing data for HCC1954. HCC1954 is an epithelial breast cancer cell line. As shown in Table 1400b, based on the improved split group score and the better selection of split groups for variant calling, the split read alignment system 106 (shown in the row of "New M / A+ call generation model") shows better recall, precision, and F-score for duplication calls in HCC1954 than an existing sequencing system (shown in the row of "Call generation model" alone). Also, as shown in Table 1400b, the split read alignment system 106 (shown in the row of "New M / A+ call generation model") shows better precision and F-score for translocation calls in HCC1954 than an existing sequencing system (shown in the row of "Call generation model"). The recall, precision, and F-score reported in Table 1400b were determined without using a ground truth call and may therefore differ from the recall, precision, and F-score for the split read alignment system 106 when determined using a ground truth call.
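For reference, the recall, precision, and F-score metrics named above are conventionally computed from counts of true positive, false positive, and false negative calls. The following is a minimal sketch under the assumption that the F-score is the balanced F1 score; the function name and the example counts are hypothetical and are not values from Table 1400b.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from call counts.

    Returns 0.0 for any metric whose denominator is zero.
    """
    called = true_pos + false_pos    # all calls the caller emitted
    expected = true_pos + false_neg  # all calls that should have been emitted
    precision = true_pos / called if called else 0.0
    recall = true_pos / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct calls, 2 spurious calls, 2 missed calls.
precision, recall, f1 = precision_recall_f1(8, 2, 2)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```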

[0168] FIGS. 1-14B, the corresponding text, and the examples provide several different methods, systems, apparatuses, and non-transitory computer-readable media of the split read alignment system 106. In addition to the above, one or more embodiments can also be described from the perspective of a flowchart including operations for achieving the specific results shown in FIG. 15. FIG. 15 may be executed with more or fewer operations. Further, the operations may be executed in a different order. Further, the operations described herein may be repeated or executed in parallel with each other, or in parallel with different examples of the same or similar operations.

[0169] As described above, FIG. 15 shows a flowchart of a series of operations 1500 for selecting a predicted split group from candidate split groups. FIG. 15 shows operations according to one embodiment, although alternative embodiments may omit, add, reorder, and/or modify any of the operations shown in FIG. 15. The operations of FIG. 15 can be implemented as part of a method. Alternatively, a non-transitory computer-readable medium can include instructions that, when executed by at least one processor, cause a computing device to perform the operations of FIG. 15. In yet further embodiments, a system can include at least one processor and a non-transitory computer-readable medium that includes instructions that, when executed by the at least one processor, cause the system to perform the operations of FIG. 15. In some cases, the at least one processor includes a configurable processor, and execution by the at least one processor includes configuring the configurable processor.

[0170] As shown in FIG. 15, a series of operations 1500 includes an operation 1502 of identifying one or more nucleotide reads. Specifically, operation 1502 includes identifying one or more nucleotide reads corresponding to a genomic region of a genomic sample.

[0171] The series of operations 1500 illustrated in FIG. 15 further includes an operation 1504 of determining candidate split groups. Specifically, operation 1504 includes determining candidate split groups that include fragment alignments corresponding to the one or more nucleotide reads. In some embodiments, determining a candidate split group among the candidate split groups further includes grouping one or more fragment alignments of a single-end nucleotide read into the candidate split group, or grouping one or more fragment alignments of a paired-end nucleotide read from a pair of paired-end nucleotide reads into the candidate split group.
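The grouping of fragment alignments into candidate split groups can be sketched as a brute-force enumeration. This is an illustrative assumption rather than the disclosed procedure: the `FragmentAlignment` fields, the `max_fragments` limit, and the rule that fragments in a chain must advance along the read are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class FragmentAlignment:
    """Hypothetical record for one fragment alignment of a read."""
    read_start: int  # first read base covered (0-based)
    read_end: int    # one past the last read base covered
    contig: str
    ref_start: int
    score: int


def candidate_split_groups(fragments, max_fragments=3):
    """Enumerate chains of fragment alignments that advance along the read.

    A chain is kept only if every fragment starts and ends later in the
    read than the fragment before it (an assumed validity rule).
    """
    ordered = sorted(fragments, key=lambda f: (f.read_start, f.read_end))
    groups = []
    for size in range(1, max_fragments + 1):
        for chain in combinations(ordered, size):
            advances = all(
                a.read_start < b.read_start and a.read_end < b.read_end
                for a, b in zip(chain, chain[1:])
            )
            if advances:
                groups.append(chain)
    return groups


left = FragmentAlignment(0, 60, "chr4", 1_000, 60)
right = FragmentAlignment(55, 100, "chr4", 5_000, 45)
full = FragmentAlignment(0, 100, "chr4", 9_000, 70)
for group in candidate_split_groups([left, right, full]):
    print([(f.read_start, f.read_end) for f in group])
```

With these three hypothetical fragments, the sketch yields each fragment alone plus the single two-fragment chain of `left` followed by `right`.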

[0172] As further illustrated in FIG. 15, the series of operations 1500 includes an operation 1506 of generating split group scores. In particular, operation 1506 includes generating split group scores for split alignments of the candidate split groups with a reference genome. In some embodiments, operation 1506 further includes an additional operation of generating fragment alignment scores for individual fragment alignments of a candidate split group with the reference genome, and an additional operation of generating a split group score for the candidate split group based on the fragment alignment scores. Additionally, in some embodiments, operation 1506 further includes generating a break penalty for the relative geometry of a first fragment alignment and a second fragment alignment with respect to the reference genome for a candidate split group of the candidate split groups, and generating a split group score for the candidate split group based on the break penalty. Further, in some embodiments, operation 1506 includes generating an overlap penalty for an overlap within a nucleotide read between a first fragment alignment and a second fragment alignment for a candidate split group of the candidate split groups, and generating a split group score for the candidate split group based on the overlap penalty.

[0173] In some embodiments, operation 1506 further includes generating a split group score for a candidate split group among the candidate split groups by combining fragment alignment scores for the fragment alignments of the candidate split group and subtracting the break penalty and the overlap penalty from the combined fragment alignment score. In some embodiments, operation 1506 further includes determining candidate split groups by iteratively grouping individual fragment alignments in order from the outermost fragment alignment to the innermost fragment alignment of the nucleotide reads, and generating split group scores by iteratively scoring the grouping of the individual fragment alignments in the order in which the individual fragment alignments are grouped.
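The scoring described in operation 1506 (combined fragment alignment scores minus break and overlap penalties, accumulated as fragment alignments are grouped in order) can be sketched as a small dynamic program. The sketch below is an assumption-laden illustration, not the disclosed scoring: the `Frag` record, the penalty values of 10 and 30, the one-point-per-base overlap penalty, and the use of read position as the outermost-to-innermost order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frag:
    """Hypothetical fragment alignment record."""
    read_start: int
    read_end: int
    contig: str
    ref_start: int
    score: int


def break_penalty(prev, cur):
    # Assumed values: a larger penalty when the break joins different contigs.
    return 10 if prev.contig == cur.contig else 30


def overlap_penalty(prev, cur):
    # Assumed rule: one point per read base claimed by both fragments.
    return max(0, prev.read_end - cur.read_start)


def best_split_group(frags):
    """Return (score, chain) for the highest-scoring chain of fragments.

    best[i] holds the best chain ending at the i-th fragment in read order,
    so each fragment is scored once against every earlier fragment.
    """
    if not frags:
        raise ValueError("at least one fragment alignment is required")
    ordered = sorted(frags, key=lambda f: (f.read_start, f.read_end))
    best = []
    for i, cur in enumerate(ordered):
        top_score, top_chain = cur.score, (cur,)
        for j in range(i):
            prev = ordered[j]
            if prev.read_start >= cur.read_start or prev.read_end >= cur.read_end:
                continue  # prev must lie strictly earlier in the read
            score = (best[j][0] + cur.score
                     - break_penalty(prev, cur) - overlap_penalty(prev, cur))
            if score > top_score:
                top_score, top_chain = score, best[j][1] + (cur,)
        best.append((top_score, top_chain))
    return max(best, key=lambda entry: entry[0])


left = Frag(0, 60, "chr4", 1_000, 60)
right = Frag(55, 100, "chr4", 5_000, 45)
full = Frag(0, 100, "chr4", 9_000, 70)
score, chain = best_split_group([left, right, full])
print(score, [(f.read_start, f.read_end) for f in chain])
```

In this hypothetical example, the two-fragment chain scores 60 + 45 - 10 - 5 = 90 and is preferred over the single full-length fragment scoring 70.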

[0174] The series of operations 1500 illustrated in FIG. 15 includes an operation 1508 of selecting a predicted split group. In particular, operation 1508 includes selecting a predicted split group from the candidate split groups, based on the split group scores, for nucleobase calling of the genomic region. In some embodiments, operation 1508 includes identifying, from the candidate split groups, candidate pairs of split groups that include different fragment alignments for mates of paired-end nucleotide reads, generating pair scores for evaluating the pair alignments of the candidate pairs of split groups with the reference genome, and selecting a predicted split group for each mate of the paired-end nucleotide reads based further on the pair scores. Further, in some embodiments, operation 1508 further includes determining the sum of the split group scores for each candidate pair of split groups, generating a pairing penalty based on the estimated insert size between the innermost fragment alignments of the candidate pairs of split groups, and generating a pair score for the candidate pairs of split groups based on the sum of the split group scores and the pairing penalty.
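The pair score of operation 1508 can be sketched as the sum of the two split group scores minus a pairing penalty derived from the estimated insert size. The penalty function below is purely an assumption for illustration (a capped linear penalty on deviation from an expected insert size), as are the parameter defaults and the tuple layout of the candidates.

```python
def pairing_penalty(insert_size, expected=300, tolerance=100,
                    per_base=0.05, cap=25.0):
    """Assumed penalty: linear in the deviation beyond a tolerance, capped."""
    excess = max(0, abs(insert_size - expected) - tolerance)
    return min(cap, excess * per_base)


def pair_score(score_mate1, score_mate2, insert_size):
    """Sum of the mates' split group scores minus the pairing penalty."""
    return score_mate1 + score_mate2 - pairing_penalty(insert_size)


def select_pair(candidates):
    """Pick the candidate pair with the highest pair score.

    Each candidate is (label, score_mate1, score_mate2, insert_size),
    where insert_size is estimated between the innermost fragments.
    """
    return max(candidates, key=lambda c: pair_score(c[1], c[2], c[3]))


candidates = [
    ("proper pair", 90, 80, 320),
    ("distant pair", 95, 80, 5_000),
]
print(select_pair(candidates)[0])
```

Under these assumed parameters, the pair with the slightly lower summed split group score is selected because its estimated insert size incurs no pairing penalty.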

[0175] In some embodiments, the series of operations 1500 includes additional operations of determining an alternative contig fragment alignment score for an alignment of inner and outer fragment alignments, corresponding to nucleotide reads, with an alternative contiguous sequence within the reference genome; determining a split group score for an alignment of the inner and outer fragment alignments with the primary assembly of the reference genome; and selecting the alternative contig fragment alignment score as a replacement split group score based on a determination that the alternative contig fragment alignment score exceeds the split group score.
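The replacement rule for alternative contiguous sequences reduces to a comparison of two scores. A minimal sketch follows, with a hypothetical function name and with `None` used (as an assumption) to indicate that no alternative contig alignment exists.

```python
def final_split_group_score(primary_score, alt_contig_score=None):
    """Use the alternative contig score only when it exceeds the primary score."""
    if alt_contig_score is not None and alt_contig_score > primary_score:
        return alt_contig_score
    return primary_score


print(final_split_group_score(80, 95))  # alternative contig score replaces 80
print(final_split_group_score(80, 70))  # primary assembly score is kept
```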

[0176] Furthermore, in one or more embodiments, the series of operations 1500 includes additional operations for determining nucleobase calls for genomic regions based on the alignment of the predicted split groups with the reference genome.

[0177] The series of operations 1500 may also include additional operations for determining that the fragment alignment score of a fragment alignment does not meet a threshold fragment alignment score, and additional operations for removing the fragment alignment from consideration in forming candidate split groups.

[0178] The series of operations 1500 may include additional operations for determining that the alignment score for a candidate split group does not meet a minimum alignment score, and additional operations for refraining from reporting the split alignment of the candidate split group in the alignment file or variant call file based on the alignment score not meeting the minimum alignment score.
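The two thresholding operations described above (removing low-scoring fragment alignments before candidate split groups are formed, and declining to report split alignments whose score falls below a minimum) can be sketched as two filters. The function names, the dictionary and tuple layouts, and the threshold values are assumptions for illustration; a score is treated as meeting a threshold when it is greater than or equal to it.

```python
def filter_fragments(fragment_scores, threshold):
    """Keep only fragment alignments whose score meets the threshold.

    fragment_scores maps a hypothetical fragment label to its alignment score.
    """
    return {name: score for name, score in fragment_scores.items()
            if score >= threshold}


def reportable_groups(scored_groups, minimum_score):
    """Keep only (label, score) split groups whose score meets the minimum."""
    return [group for group in scored_groups if group[1] >= minimum_score]


fragments = {"frag_a": 60, "frag_b": 45, "frag_c": 12}
print(filter_fragments(fragments, threshold=20))

groups = [("group_1", 90), ("group_2", 18)]
print(reportable_groups(groups, minimum_score=25))
```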

[0179] The methods described herein can be used in conjunction with various nucleic acid sequencing technologies. Particularly applicable technologies are those in which nucleic acids are attached to fixed positions within an array such that their relative positions do not change and the array is repeatedly imaged. For example, embodiments in which images are obtained in different color channels that correspond to different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process of determining the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing by synthesis (SBS) technologies.

[0180] SBS technology generally involves the enzymatic extension of a nascent nucleic acid strand by iterative addition of nucleotides against a template strand. In conventional methods of SBS, a single nucleotide monomer can be provided to the target nucleic acid in the presence of polymerase in each delivery. However, in the methods described herein, two or more types of nucleotide monomers can be provided to the target nucleic acid in the presence of polymerase during delivery.

[0181] SBS can utilize nucleotide monomers having a terminator moiety or nucleotide monomers lacking any terminator moiety. Examples of methods that utilize nucleotide monomers lacking a terminator include, for example, pyrosequencing and sequencing using γ-phosphate labeled nucleotides as described in more detail below. In methods that use nucleotide monomers lacking a terminator, the number of nucleotides added in each cycle is generally variable and depends on the template sequence and the mode of nucleotide delivery. In SBS technology that utilizes nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under sequencing conditions as used in conventional Sanger sequencing that utilizes dideoxynucleotides, or the terminator can be reversible as in the case of the sequencing method developed by Solexa (now Illumina, Inc.).

[0182] The SBS technique can use nucleotide monomers with a label moiety or nucleotide monomers lacking a label moiety. Thus, incorporation events can be detected based on properties of the label, such as fluorescence of the label; properties of the nucleotide monomer, such as molecular weight or charge; or by-products of nucleotide incorporation, such as release of pyrophosphate. In embodiments where two or more different nucleotides are present in the sequencing reagent, the different nucleotides can be distinguishable from one another or, alternatively, the two or more different labels can be indistinguishable under the detection technique used. For example, different nucleotides present in the sequencing reagent can have different labels, and they can be distinguished using an appropriate optical system as exemplified by the sequencing method developed by Solexa (now Illumina, Inc.).

[0183] Preferred embodiments include pyrosequencing technology. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) when a specific nucleotide is incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) “Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) “Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Patent No. 6,210,891; U.S. Patent No. 6,258,568; and U.S. Patent No. 6,274,320, the entire disclosures of which are incorporated herein by reference). In pyrosequencing, the released PPi can be detected by its immediate conversion to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of the generated ATP is detected via photons generated by luciferase. The nucleic acid to be sequenced can be attached to features in an array, and the array can be imaged to capture the chemiluminescent signal generated by incorporating nucleotides into the features of the array. An image can be obtained after treating the array with a specific nucleotide type (e.g., A, T, C, or G). The images obtained after the addition of each nucleotide type differ with respect to which features in the array are detected. These differences in the images reflect the different sequence contents of the features on the array. However, the relative positions of each feature remain unchanged within the image. The images can be stored, processed, and analyzed using the methods described herein. For example, the images obtained after treating the array with each different nucleotide type can be processed in the same manner as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.

[0184] In another exemplary type of SBS, cycle sequencing is achieved, for example, by stepwise addition of reversible terminator nucleotides containing a cleavable or photo-bleachable dye label, as described in International Publication No. WO 04/018497 and U.S. Patent No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach has been commercialized by Solexa (now Illumina, Inc.) and is also described in International Publication No. WO 91/06678 and International Publication No. WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label can be cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

[0185] Preferably, in a reversible terminator-based sequencing embodiment, the label does not substantially inhibit extension under SBS reaction conditions. However, the detection label can be removable, for example, by cleavage or degradation. Images can be captured after incorporation of the label into the arrayed nucleic acid features. In certain embodiments, each cycle involves the simultaneous delivery of four different nucleotide types to the array, and each nucleotide type has spectrally distinct labels. Next, four images can be obtained, each using a detection channel selective for one of the four different labels. Alternatively, the different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image shows nucleic acid features incorporating a particular type of nucleotide. Because the sequence content of each feature is different, different features will be present in, or absent from, different images. However, the relative positions of the features remain unchanged within the image. Images obtained from such reversible terminator-SBS methods can be stored, processed, and analyzed as described herein. Following the imaging step, the label can be removed, and the reversible terminator moiety can be removed for subsequent cycles of nucleotide addition and detection. Removing the label after detection in a particular cycle and prior to subsequent cycles has the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are described below.
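The four-image case described above, in which each nucleotide type carries a spectrally distinct label and the features keep fixed positions across images, can be illustrated with a simplified sketch. It is not a description of any actual base-calling pipeline: it assumes one intensity value per feature per detection channel and simply assigns each feature the base whose channel is brightest. The function name and the intensity values are hypothetical.

```python
def call_bases_for_cycle(channel_intensities):
    """Assign one base per feature from four per-channel intensity lists.

    channel_intensities maps a base ("A", "C", "G", "T") to a list of
    intensities, where index i refers to the same array feature in every
    channel because feature positions do not change between images.
    """
    bases = sorted(channel_intensities)
    lengths = {len(values) for values in channel_intensities.values()}
    if len(lengths) != 1:
        raise ValueError("every channel must cover the same features")
    n_features = lengths.pop()
    return "".join(
        max(bases, key=lambda base: channel_intensities[base][i])
        for i in range(n_features)
    )


# Hypothetical intensities for three features in one sequencing cycle.
cycle = {
    "A": [0.90, 0.10, 0.20],
    "C": [0.10, 0.80, 0.10],
    "G": [0.00, 0.05, 0.10],
    "T": [0.00, 0.05, 0.70],
}
print(call_bases_for_cycle(cycle))
```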

[0186] In certain embodiments, some or all of the nucleotide monomers can include reversible terminators. In such embodiments, the reversible terminator / cleavable fluor can include a fluor attached to the ribose moiety via a 3'-ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches separate the chemistry of the terminator from the cleavage of the fluorescent label (Ruparel et al., Proc Natl Acad Sci USA 102:5932-7 (2005), which is incorporated herein in its entirety by reference). Ruparel et al. describe the development of reversible terminators that use a small 3'-allyl group to block extension but can be easily deblocked by a short treatment with a palladium catalyst. The fluor is attached to the base via a photocleavable linker that can be easily cleaved by a 30-second exposure to long-wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as the cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and / or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Patent No. 7,427,673 and U.S. Patent No. 7,057,026, the disclosures of which are incorporated herein in their entirety by reference.

[0187] Additional exemplary SBS systems and methods that can be used with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007 / 0166705, U.S. Patent Application Publication No. 2006 / 0188901, U.S. Patent No. 7,057,026, U.S. Patent Application Publication No. 2006 / 0240439, U.S. Patent Application Publication No. 2006 / 0281109, International Publication No. WO 05 / 065814, U.S. Patent Application Publication No. 2005 / 0100900, International Publication No. WO 06 / 064199, International Publication No. WO 07 / 010,251, U.S. Patent Application Publication No. 2012 / 0270305, and U.S. Patent Application Publication No. 2013 / 0260372, the disclosures of which are incorporated herein by reference in their entireties.

[0188] Some embodiments can utilize the detection of four different nucleotides using fewer than four different labels. For example, SBS can be implemented using the methods and systems described in the incorporated reference U.S. Patent Application Publication No. 2013 / 0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength but distinguished based on a difference in intensity for one member of the pair, or based on a change (e.g., via chemical modification, photochemical modification, or physical modification) to one member of the pair that results in the appearance or disappearance of a distinct signal as compared to the signal detected for the other member of the pair. As a second example, three of the four different nucleotide types can be detected under specific conditions, while the fourth nucleotide type has no detectable label or is minimally detected under those conditions (e.g., minimal detection by background fluorescence). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on the presence of their respective signals, and incorporation of the fourth nucleotide type into a nucleic acid can be determined based on the absence or minimal detection of any signal. As a third example, one nucleotide type can include a label that is detected in two different channels, while the other nucleotide types are detected in no more than one of the channels. The three exemplary configurations described above are not mutually exclusive and can be used in various combinations. An exemplary embodiment combining all three examples is a fluorescence-based SBS method that uses a first nucleotide type detected in the first channel (e.g., dATP having a label detected in the first channel when excited by the first excitation wavelength), a second nucleotide type detected in the second channel (e.g., dCTP having a label detected in the second channel when excited by the second excitation wavelength), a third nucleotide type detected in both the first and second channels (e.g., dTTP having at least one label detected in both channels when excited by the first and / or second excitation wavelength), and a fourth nucleotide type that lacks a label and is not detected, or is only minimally detected, in either channel (e.g., unlabeled dGTP).
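The combined two-channel example above can be sketched as a small decoding table. This is a minimal sketch only: the function name `call_base_two_channel`, the normalized intensities, and the fixed `threshold` of 0.5 are hypothetical choices for illustration, while the base-to-channel assignment (dATP in the first channel, dCTP in the second, dTTP in both, unlabeled dGTP in neither) follows the example in the text.

```python
from typing import Iterable, Tuple


def call_base_two_channel(ch1: float, ch2: float, threshold: float = 0.5) -> str:
    """Decode one feature from its signals in two detection channels.

    `threshold` is an assumed cutoff separating "detected" from
    "not detected or minimally detected"; it is not specified in the text.
    """
    on1 = ch1 >= threshold
    on2 = ch2 >= threshold
    if on1 and on2:
        return "T"  # label detected in both channels
    if on1:
        return "A"  # label detected in the first channel only
    if on2:
        return "C"  # label detected in the second channel only
    return "G"      # unlabeled: no (or minimal) signal in either channel


def call_read(signals: Iterable[Tuple[float, float]]) -> str:
    """Decode a sequence of per-cycle (channel 1, channel 2) intensity pairs."""
    return "".join(call_base_two_channel(c1, c2) for c1, c2 in signals)


# Toy per-cycle intensities for one feature (illustrative values only).
cycles = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.05, 0.02)]
print(call_read(cycles))  # ACTG
```

Two binary observations per cycle give four distinguishable states, which is why two channels suffice for four nucleotide types.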

[0189] Furthermore, as described in U.S. Patent Application Publication No. 2013 / 0079232, which is incorporated herein by reference, sequencing data can be obtained using a single channel. In such a so-called one-dye sequencing method, the first nucleotide type is labeled, but the label is removed after the first image is generated, and the second nucleotide type is labeled only after the first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.

[0190] Some embodiments can utilize ligation-based sequencing techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. Oligonucleotides typically have different labels that correlate with the identity of specific nucleotides in the sequence to which the oligonucleotide hybridizes. As with other SBS methods, an image can be obtained after processing an array of nucleic acid features with labeled sequencing reagents. Each image shows nucleic acid features incorporating a particular type of label. Since the sequence content of each feature is different, different features may or may not be present in different images, but the relative positions of the features remain unchanged within the image. Images obtained from ligation-based sequencing methods can be stored, processed, and analyzed as described herein. Exemplary SBS systems and methods that can be utilized with the methods and systems described herein are described in U.S. Patent No. 6,969,488, U.S. Patent No. 6,172,218, and U.S. Patent No. 6,306,597, the disclosures of which are hereby incorporated by reference in their entirety.

[0191] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M., "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope." Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entirety). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or a biological membrane protein such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base pair can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Patent No. 7,001,792; Soni, G. V. & Meller, A., "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K., "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R., "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entirety). Data obtained from nanopore sequencing can be stored, processed, and analyzed as described herein. Specifically, the data can be processed as an image according to the exemplary processing of the optical and other images described herein.

[0192] Some embodiments can utilize methods involving real-time monitoring of DNA polymerase activity. Incorporation of nucleotides can be detected, for example, via fluorescence resonance energy transfer (FRET) interactions between a fluorophore-containing polymerase and a γ-phosphate-labeled nucleotide, as described in, for example, U.S. Patent No. 7,329,492 and U.S. Patent No. 7,211,414, each of which is incorporated herein by reference. Alternatively, incorporation of nucleotides can be detected using zero-mode waveguides as described in U.S. Patent No. 7,315,019, which is incorporated herein by reference, and using fluorescent nucleotide analogs and engineered polymerases as described in, for example, U.S. Patent No. 7,405,281 and U.S. Patent Application Publication No. 2008 / 0108082, each of which is incorporated herein by reference. Illumination can be restricted to zeptoliter-scale volumes around surface-tethered polymerases such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science, 299, 682-686 (2003); Lundquist, P. M. et al. “Parallel confocal detection of single molecules in real time.” Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures.” Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed, and analyzed as described herein.

[0193] Some SBS embodiments include the detection of protons released upon incorporation of nucleotides into the extension product. For example, sequencing based on the detection of released protons can use an electrical detector and related technology commercially available from Ion Torrent (Guilford, CT, a subsidiary of Life Technologies), or the sequencing methods and systems described in U.S. Patent Application Publication Nos. 2009 / 0026082 (A1), 2009 / 0127589 (A1), 2010 / 0137143 (A1), or 2010 / 0282617 (A1), each of which is incorporated herein by reference. The methods described herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to the substrates used for detecting protons. More specifically, the methods described herein can be used to generate a clonal population of amplicons for use in detecting protons.

[0194] The above-described SBS method can be advantageously implemented in a multiplex format such that multiple different target nucleic acids are manipulated simultaneously. In certain embodiments, the different target nucleic acids can be processed in a common reaction vessel or on the surface of a particular substrate. This enables the facile delivery of sequencing reagents, removal of unreacted reagents, and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can typically be bound to the surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to beads or other particles, or attachment to a polymerase or other molecule bound to the surface. The array can contain a single copy of the target nucleic acid at each site (also referred to as a feature), or multiple copies having the same sequence can be present at each site or feature. The multiple copies can be generated by amplification methods such as bridge amplification or emulsion PCR, which are described in more detail below.

[0195] The methods described herein can use arrays having features at any of a variety of densities, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or more.

[0196] An advantage of the methods described herein is that they provide for the rapid and efficient detection of multiple target nucleic acids in parallel. Accordingly, the present disclosure provides an integrated system that can prepare and detect nucleic acids using techniques known in the art such as those exemplified above. The integrated system of the present disclosure can include fluid components capable of delivering amplification reagents and / or sequencing reagents to one or more immobilized DNA fragments, and the system can include components such as pumps, valves, reservoirs, fluid lines, and the like. A flow cell can be configured and / or used in the integrated system for detecting target nucleic acids. Exemplary flow cells are described, for example, in U.S. Patent Application Publication No. 2010 / 0111768 (A1) and U.S. Patent Application No. 13 / 273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluid components of the integrated system can be used for both amplification methods and detection methods. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluid components of the integrated system can be used for the amplification methods described herein and for delivering sequencing reagents in sequencing methods such as those exemplified above. Alternatively, the integrated system can include separate fluid systems for performing amplification methods and detection methods. Examples of integrated sequencing systems that can create amplified nucleic acids and also determine the sequence of nucleic acids include, but are not limited to, the MiSeq™ platform (Illumina Inc., San Diego, CA), and the apparatus described in U.S. Patent Application No. 13 / 273,666, which is incorporated herein by reference.

[0197] The above-described array determination system determines the sequence of nucleic acid polymers present in a sample received by an array determination apparatus. As defined herein, "sample" and its derivatives are used in the broadest sense and include any sample, culture, etc. that is suspected of containing a target. In some embodiments, the sample includes nucleic acids in the form of DNA, RNA, PNA, LNA, chimeras, or hybrids. The sample can include any biological sample, clinical sample, surgical sample, agricultural sample, atmospheric sample, or water sample that contains one or more nucleic acids. The term also includes any isolated nucleic acid sample, e.g., genomic DNA, fresh frozen or formalin-fixed paraffin-embedded nucleic acid samples. The sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, matched nucleic acid samples from a single individual such as a tumor sample and a normal tissue sample, or a sample from a single source that contains two distinct forms of genetic material, such as maternal and fetal DNA obtained from a maternal subject or contaminating bacterial DNA present in a sample that contains plant or animal DNA. In some embodiments, the source of the nucleic acid material can include nucleic acids obtained from a neonate, such as those typically used in neonatal screening.

[0198] A nucleic acid sample can contain high molecular weight substances such as genomic DNA (gDNA). The sample can contain low molecular weight substances such as nucleic acid molecules obtained from FFPE or stored DNA samples. In another embodiment, the low molecular weight substance contains enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some embodiments, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture microdissection, surgical excisions, and other clinical or laboratory-obtained samples. In some embodiments, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some embodiments, the sample can contain nucleic acid molecules obtained from animals such as human or mammalian sources. In another embodiment, the sample can include nucleic acid molecules obtained from non-mammalian sources such as plants, bacteria, viruses, or fungi. In some embodiments, the source of the nucleic acid molecule can be a preserved or extinct sample or species.

[0199] Furthermore, the methods and compositions disclosed herein may be useful for amplifying nucleic acid samples having low-quality nucleic acid molecules such as degraded and / or fragmented genomic DNA from forensic samples. In one embodiment, a forensic sample can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation, or a forensic sample obtained by a law enforcement agency, one or more military forces, or such personnel. The nucleic acid sample can be, for example, a purified sample or a crude DNA lysate derived from a buccal swab, paper, cloth, or other substrate impregnated with saliva, blood, or other body fluid. As such, in some embodiments, the nucleic acid sample can contain small amounts or fragmented portions of DNA such as genomic DNA. In some embodiments, the target sequence can be present in one or more body fluids including, but not limited to, blood, sputum, plasma, semen, urine, and serum. In some embodiments, the target sequence can be obtained from hair, skin, tissue samples, autopsies, or the remains of a victim. In some embodiments, the nucleic acid containing one or more target sequences can be obtained from a deceased animal or human. In some embodiments, the target sequence can include nucleic acids obtained from non-human DNA such as microbial, plant, or entomological DNA. In some embodiments, the target sequence or amplified target sequence is for human identification. In some embodiments, the present disclosure generally relates to methods for identifying the characteristics of forensic samples. In some embodiments, the present disclosure generally relates to methods of human identification using one or more of the target-specific primers disclosed herein, or one or more target-specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic sample or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein, or using the primer criteria outlined herein.

[0200] The components of the split read alignment system 106 can include software, hardware, or both. For example, the components of the split read alignment system 106 can be stored on a computer-readable storage medium and include one or more instructions executable by a processor of one or more computing devices (e.g., user client device 108). When executed by one or more processors, the computer-executable instructions of the split read alignment system 106 can cause the computing device to implement the split read alignment methods described herein. Alternatively, the components of the split read alignment system 106 can include hardware such as a dedicated processing device for implementing a particular function or group of functions. Additionally, or alternatively, the components of the split read alignment system 106 can include a combination of computer-executable instructions and hardware.

[0201] Furthermore, the components of the split read alignment system 106 that implement the functionality described herein with respect to the split read alignment system 106 can be implemented, for example, as part of a stand-alone application, as a module of an application, as a plug-in of an application, as a library function that can be called by another application, and / or as a cloud computing model. Thus, the components of the split read alignment system 106 can be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the split read alignment system 106 can be implemented in any application that provides a sequencing service, including but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. "Illumina", "BaseSpace", "DRAGEN", and "TruSight" are registered trademarks or trademarks of Illumina, Inc. in the United States and / or other countries.

[0202] Embodiments of the present disclosure may include, or utilize, a special purpose or general purpose computer including, for example, one or more processors and system memory, as will be discussed in more detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and / or data structures. In particular, one or more of the processes described herein may be embodied in a non-transitory computer-readable medium and at least partially implemented as executable instructions by one or more computing devices (e.g., any of the media content access devices described herein). Generally, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., memory, etc.), executes those instructions, and thereby performs one or more processes including one or more of the processes described herein.

[0203] A computer-readable medium can be any available medium that can be accessed by a general-purpose computer system or a dedicated computer system. A computer-readable medium storing computer-executable instructions is a non-transitory computer-readable storage medium (device). A computer-readable medium carrying computer-executable instructions is a transmission medium. Thus, by way of example and not limitation, embodiments of the present disclosure can include at least two distinctly different types of computer-readable media, namely non-transitory computer-readable storage media (devices) and transmission media.

[0204] Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs, e.g., based on RAM), flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions or data structures and that can be accessed by a general-purpose or dedicated computer.

[0205] "Network" is defined as one or more data links that enable the transfer of electronic data between computer systems and / or modules and / or other electronic devices. When information is transferred or provided to a computer via a network or another communication connection (either hardwired, wireless, or a combination of hardwired or wireless), the computer properly recognizes the connection as a transmission medium. Transmission media can be used to carry desired program code means in the form of computer-executable instructions or data structures and can include networks and / or data links that can be accessed by a general-purpose or dedicated computer. Combinations of the above should also be included within the scope of computer-readable media.

[0206] Furthermore, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be automatically transferred from the transmission medium to a non-transitory computer-readable storage medium (device) (or vice versa). For example, computer-executable instructions or data structures received via a network or data link are buffered in RAM within a network interface module (e.g., NIC) and then ultimately transferred to the computer system RAM and / or a less volatile computer storage medium (device) in the computer system. Accordingly, it should be understood that a non-transitory computer-readable storage medium (device) can be included in computer system components that utilize the transmission medium as well (or even primarily).

[0207] Computer-executable instructions, for example, when executed on a processor, include instructions and data that cause a general-purpose computer, a special-purpose computer, or a special-purpose processing device to perform a particular function or group of functions. In some embodiments, the computer-executable instructions are executed on a general-purpose computer, converting the general-purpose computer into a special-purpose computer implementing the elements of the present disclosure. The computer-executable instructions can be, for example, binary, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and / or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts above. Rather, the described features and acts are disclosed as exemplary forms for implementing the claims.

[0208] Those skilled in the art will understand that the present disclosure can be implemented in a network computing environment having many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, cellular telephones, PDAs, tablets, pagers, routers, switches, and the like. The present disclosure can also be implemented in a distributed system environment where local and remote computer systems, linked via a network (either by a hardwired data link, a wireless data link, or a combination of hardwired and wireless data links), both perform tasks. In a distributed system environment, program modules can be located in both local and remote memory storage devices.

[0209] Embodiments of the present disclosure can also be implemented in a cloud computing environment. As used herein, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be adopted in the marketplace to provide ubiquitous and convenient on-demand access to a shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization, exposed with low management effort or service provider interaction, and then scaled accordingly.

[0210] The cloud computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, etc. The cloud computing model can also expose various service models such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). The cloud computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, etc. In this specification and the claims, a "cloud computing environment" is an environment in which cloud computing is adopted.

[0211] FIG. 16 shows a block diagram of a computing device 1600 that can be configured to perform one or more of the above processes. It will be understood that one or more computing devices, such as computing device 1600, can implement split read alignment system 106 and array determination system 104. As shown by FIG. 16, computing device 1600 can include a processor 1602, a memory 1604, a storage device 1606, an I / O interface 1608, and a communication interface 1610, which can be communicatively coupled by a communication infrastructure 1612. In certain embodiments, computing device 1600 can include fewer or more components than those shown in FIG. 16. The following paragraphs describe the components of computing device 1600 shown in FIG. 16 in more detail.

[0212] In one or more embodiments, processor 1602 includes hardware for executing instructions, such as instructions that make up a computer program. By way of example and not limitation, to execute instructions for dynamically modifying a workflow, processor 1602 can retrieve (or fetch) instructions from internal registers, internal caches, memory 1604, or storage device 1606, decode them, and execute them. Memory 1604 may be volatile or non-volatile memory used to store data, metadata, and programs for execution by the processor. Storage device 1606 includes storage such as a hard disk, flash disk drive, or other digital storage device for storing data or instructions for implementing the methods described herein.

[0213] I / O interface 1608 enables a user to provide input to, receive output from, transfer data to, or receive data from computing device 1600. I / O interface 1608 can include a mouse, keypad or keyboard, touch screen, camera, optical scanner, network interface, modem, other known I / O devices, or a combination of such I / O interfaces. I / O interface 1608 can include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In certain embodiments, I / O interface 1608 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and / or any other graphical content that may be useful in a particular implementation.

[0214] The communication interface 1610 can include hardware, software, or both. In any case, the communication interface 1610 can provide one or more interfaces for communication (e.g., packet-based communication, etc.) between the computing device 1600 and one or more other computing devices or networks. By way of example and not limitation, the communication interface 1610 can include a network interface controller (NIC) or network adapter for communicating with Ethernet or other wired-based networks, or a wireless NIC (WNIC) or wireless adapter for communicating with wireless networks such as WI-FI.

[0215] Additionally, the communication interface 1610 can facilitate communication with various types of wired or wireless networks. The communication interface 1610 can also facilitate communication using various communication protocols. The communication infrastructure 1612 can also include hardware, software, or both that couple the components of the computing device 1600 to each other. For example, the communication interface 1610 can enable multiple computing devices connected by a particular infrastructure to communicate with each other to implement one or more aspects of the processes described herein using one or more networks and / or protocols. By way of illustration, the sequencing process can enable multiple devices (e.g., client devices, sequencing devices, and server devices) to exchange information such as sequencing data and error notifications.

[0216] In the foregoing specification, the present disclosure has been described with reference to its specific exemplary embodiments. Various embodiments and aspects of the present disclosure are described with reference to the details considered herein, and the accompanying drawings illustrate the various embodiments. The above description and drawings are examples of the present disclosure and should not be construed as limiting the present disclosure. A number of specific details are described to provide a complete understanding of the various embodiments of the present disclosure.

[0217] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be implemented with fewer or more steps/operations, or the steps/operations may be performed in a different order. Additionally, the steps/operations described herein may be repeated or performed in parallel with each other, or in parallel with different occurrences of the same or similar steps/operations. Accordingly, the scope of the present application is indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

1. A computer-implemented method comprising: identifying one or more paired-end nucleotide reads corresponding to a genomic region of a genomic sample; determining candidate split groups comprising fragment alignments of the one or more paired-end nucleotide reads; identifying, from the candidate split groups, candidate pairs of split groups comprising different fragment alignments of the one or more paired-end nucleotide reads for mates of the paired-end nucleotide reads; generating split group scores for split alignments of the candidate split groups, wherein a split group score of the split group scores measures the accuracy of the fragment alignments in a split group relative to a reference genome; generating, for the candidate pairs of split groups and based on the split group scores, pair scores that evaluate pair alignments of the candidate pairs of split groups with respect to the reference genome; and selecting, based on the pair scores, a predicted split group from the candidate split groups for nucleobase calling of the genomic region.

2. The computer-implemented method according to claim 1, further comprising determining a candidate split group of the candidate split groups by grouping one or more fragment alignments of a paired-end nucleotide read, from a pair of paired-end nucleotide reads, into the candidate split group.

3. The computer-implemented method according to claim 1 or 2, further comprising: generating fragment alignment scores for individual fragment alignments of a candidate split group with respect to the reference genome, wherein a fragment alignment score of the fragment alignment scores measures the accuracy of a fragment alignment with respect to the reference genome; and generating a split group score for the candidate split group based on the fragment alignment scores.

4. The computer-implemented method according to claim 1 or 2, further comprising: generating, for a candidate split group of the candidate split groups, a penalty for the relative geometry of a first fragment alignment having a first alignment orientation with respect to the reference genome and a second fragment alignment having a second alignment orientation with respect to the reference genome; and generating a split group score for the candidate split group based on the penalty for the relative geometry of the first fragment alignment and the second fragment alignment.

5. The computer-implemented method according to claim 1 or 2, further comprising: generating, for a candidate split group of the candidate split groups, an overlap penalty for an overlap in a nucleotide read between a first fragment alignment and a second fragment alignment; and generating a split group score for the candidate split group based on the overlap penalty.

6. The computer-implemented method according to claim 1 or 2, further comprising generating a split group score for a candidate split group of the candidate split groups by: generating fragment alignment scores, a penalty for relative geometry, and an overlap penalty for the fragment alignments of the candidate split group; combining the fragment alignment scores; and subtracting the penalty for relative geometry and the overlap penalty from the combined fragment alignment scores.

7. The computer-implemented method according to claim 1 or 2, further comprising: determining the candidate split groups by iteratively grouping possible fragment alignment sequences in order from an outermost fragment alignment to an innermost fragment alignment of a nucleotide read; and generating the split group scores by iteratively scoring the groupings of possible fragment alignment sequences in the order in which the possible fragment alignment sequences were grouped.

8. The computer-implemented method according to claim 1 or 2, further comprising selecting the predicted split group from the candidate split groups by: selecting, from the candidate pairs of split groups, a candidate pair of split groups having the highest pair score; and selecting, for each mate of a nucleotide read, the predicted split group from the selected candidate pair of split groups.

9. The computer-implemented method according to claim 8, further comprising: determining a sum of the split group scores for a candidate pair of split groups; generating a pairing penalty based on an estimated insert size between the innermost fragment alignments of the candidate pair of split groups; and generating the pair score for the candidate pair of split groups based on the sum of the split group scores and the pairing penalty.

10. The computer-implemented method according to claim 8, further comprising: determining an alternative contig fragment alignment score for inner and outer fragment alignments corresponding to a nucleotide read having an alternative contig sequence within the reference genome; determining a split group score for the inner and outer fragment alignments with respect to a primary assembly region of the reference genome; and selecting the alternative contig fragment alignment score as a replacement split group score based on determining that the alternative contig fragment alignment score exceeds the split group score.

11. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions that, when executed by the at least one processor, cause the system to: identify one or more paired-end nucleotide reads corresponding to a genomic region of a genomic sample; determine candidate split groups comprising fragment alignments of the one or more paired-end nucleotide reads; identify, from the candidate split groups, candidate pairs of split groups comprising different fragment alignments of the one or more paired-end nucleotide reads for mates of the paired-end nucleotide reads; generate split group scores for split alignments of the candidate split groups with respect to a reference genome, wherein a split group score of the split group scores measures the accuracy of the fragment alignments in a split group relative to the reference genome; generate, for the candidate pairs of split groups and based on the split group scores, pair scores that evaluate pair alignments of the candidate pairs of split groups with respect to the reference genome; and select, based on the pair scores, a predicted split group from the candidate split groups for nucleobase calling of the genomic region.

12. The system according to claim 11, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, for a candidate split group of the candidate split groups, an overlap penalty for an overlap in a nucleotide read between a first fragment alignment and a second fragment alignment; and generate a split group score for the candidate split group based on the overlap penalty.

13. The system according to claim 11 or 12, further comprising instructions that, when executed by the at least one processor, cause the system to generate a split group score for a candidate split group of the candidate split groups by: generating fragment alignment scores, a penalty for relative geometry, and an overlap penalty for the fragment alignments of the candidate split group; combining the fragment alignment scores; and subtracting the penalty for relative geometry and the overlap penalty from the combined fragment alignment scores.

14. The system according to claim 11 or 12, further comprising instructions that, when executed by the at least one processor, cause the system to select the predicted split group from the candidate split groups by: selecting, from the candidate pairs of split groups, a candidate pair of split groups having the highest pair score; and selecting, for each mate of a nucleotide read, the predicted split group from the selected candidate pair of split groups.

15. The system according to claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to: determine an alternative contig fragment alignment score for inner and outer fragment alignments corresponding to a nucleotide read having an alternative contig sequence within the reference genome; determine a split group score for the inner and outer fragment alignments with respect to a primary assembly region of the reference genome; and select the alternative contig fragment alignment score as a replacement split group score based on determining that the alternative contig fragment alignment score exceeds the split group score.

16. The system according to claim 11 or 12, further comprising instructions that, when executed by the at least one processor, cause the system to: determine the candidate split groups by iteratively grouping possible fragment alignment sequences in order from an outermost fragment alignment to an innermost fragment alignment of a nucleotide read; and generate the split group scores by iteratively scoring the groupings of possible fragment alignment sequences in the order in which the possible fragment alignment sequences were grouped.

17. A non-transitory computer-readable medium comprising instructions that, when executed by at least one processor, cause a computing device to: identify one or more paired-end nucleotide reads corresponding to a genomic region of a genomic sample; determine candidate split groups comprising fragment alignments of the one or more paired-end nucleotide reads; identify, from the candidate split groups, candidate pairs of split groups comprising different fragment alignments of the one or more paired-end nucleotide reads for mates of the paired-end nucleotide reads; generate split group scores for split alignments of the candidate split groups, wherein a split group score of the split group scores measures the accuracy of the fragment alignments in a split group relative to a reference genome; generate, for the candidate pairs of split groups and based on the split group scores, pair scores that evaluate pair alignments of the candidate pairs of split groups with respect to the reference genome; and select, based on the pair scores, a predicted split group from the candidate split groups for nucleobase calling of the genomic region.

18. The non-transitory computer-readable medium according to claim 17, further comprising instructions that, when executed by the at least one processor, cause the computing device to determine a candidate split group of the candidate split groups by grouping one or more fragment alignments of a paired-end nucleotide read, from a pair of paired-end nucleotide reads, into the candidate split group.

19. The non-transitory computer-readable medium according to claim 17 or 18, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate fragment alignment scores for individual fragment alignments of a candidate split group with respect to the reference genome, wherein a fragment alignment score of the fragment alignment scores measures the accuracy of a fragment alignment with respect to the reference genome; and generate a split group score for the candidate split group based on the fragment alignment scores.

20. The non-transitory computer-readable medium according to claim 17 or 18, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate, for a candidate split group of the candidate split groups, a penalty for the relative geometry of a first fragment alignment having a first alignment orientation with respect to the reference genome and a second fragment alignment having a second alignment orientation with respect to the reference genome; and generate a split group score for the candidate split group based on the penalty for the relative geometry of the first fragment alignment and the second fragment alignment.
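
By way of illustration and not limitation, the scoring recited in claims 6, 8, and 9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation. The names `SplitGroup`, `split_group_score`, `toy_pairing_penalty`, `pair_score`, and `select_predicted_pair` are hypothetical. The claims state only that the pair score is based on the sum of the split group scores and the pairing penalty; subtracting the penalty from the sum is an assumption made here. The insert-size penalty formula and all numeric values are likewise illustrative placeholders.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class SplitGroup:
    """Hypothetical container for one candidate split group of a single mate."""
    fragment_scores: Tuple[int, ...]  # one alignment score per fragment alignment
    innermost_pos: int                # reference position of the innermost fragment
    geometry_penalty: int = 0         # penalty for relative geometry of fragments
    overlap_penalty: int = 0          # penalty for read bases covered by two fragments


def split_group_score(group: SplitGroup) -> int:
    """Combine fragment scores, then subtract geometry and overlap penalties."""
    return sum(group.fragment_scores) - group.geometry_penalty - group.overlap_penalty


def toy_pairing_penalty(g1: SplitGroup, g2: SplitGroup,
                        expected_insert: int = 300, scale: int = 10) -> int:
    """Illustrative penalty (assumption): grows with the deviation of the
    distance between innermost fragments from an expected insert size."""
    observed = abs(g2.innermost_pos - g1.innermost_pos)
    return abs(observed - expected_insert) // scale


def pair_score(g1: SplitGroup, g2: SplitGroup, pairing_penalty: int) -> int:
    """Assumed combination: sum of split group scores minus the pairing penalty."""
    return split_group_score(g1) + split_group_score(g2) - pairing_penalty


def select_predicted_pair(
    mate1_groups: Sequence[SplitGroup],
    mate2_groups: Sequence[SplitGroup],
    penalty_fn: Callable[[SplitGroup, SplitGroup], int] = toy_pairing_penalty,
) -> Tuple[SplitGroup, SplitGroup, int]:
    """Return the candidate pair with the highest pair score, and that score."""
    best = None
    for g1, g2 in product(mate1_groups, mate2_groups):
        score = pair_score(g1, g2, penalty_fn(g1, g2))
        if best is None or score > best[2]:
            best = (g1, g2, score)
    if best is None:
        raise ValueError("each mate needs at least one candidate split group")
    return best


# Toy candidates: two split groups per mate (all values are made up).
a = SplitGroup((60, 40), innermost_pos=1000, geometry_penalty=10, overlap_penalty=5)
b = SplitGroup((90,), innermost_pos=1000)
c = SplitGroup((100,), innermost_pos=1300)
d = SplitGroup((70, 50), innermost_pos=5000, geometry_penalty=15)

best_g1, best_g2, best_score = select_predicted_pair([a, b], [c, d])
print(best_g1, best_g2, best_score)
```

In this toy input, group `d` has the highest individual split group score for its mate, yet the pair containing it is heavily penalized because its innermost fragment lies far from the other mate. The selected pair is therefore `(b, c)`, which illustrates why the predicted split group for each mate is chosen from the best-scoring pair rather than from each mate independently.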