Variant calling based on copy-number-aware multi-region joint detection
The MRJD method addresses the challenge of genotyping regions with high homology by aligning reads across multiple genomic regions, determining copy numbers, and using uniform and haplotype-based priors to enhance variant detection accuracy and reduce errors.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ILLUMINA INC
- Filing Date
- 2025-12-11
- Publication Date
- 2026-06-25
AI Technical Summary
Conventional variant callers struggle with accurately genotyping regions with high homology, such as segmental duplications, leading to discarded reads and reduced accuracy in variant detection.
A method for multi-region joint detection (MRJD) that aligns sequence reads to multiple genomic regions, determines copy numbers, and scores joint genotype candidates using haplotypes, employing a uniform prior score and haplotype-based priors to improve variant calling, particularly in paralogous regions.
Enhances the detection of small variants in paralogous regions by retaining ambiguously mapped reads, improving recall and precision, and reducing false positives and negatives through iterative candidate searching and informed haplotype placement.
Smart Images

Figure US2025059169_25062026_PF_FP_ABST
Abstract
Description
VARIANT CALLING BASED ON COPY-NUMBER- AW ARE MULTI-REGIONJOINT DETECTIONBACKGROUND
[0001] Genomic sequencing describes a method of identifying nucleotides or other component parts of genomic data. A nucleic acid sequencing device, also referred to as a sequencer, generates data as base calls, for instance ones corresponding to, or representing, nucleotides of a ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) fragment sequenced by the nucleic acid sequencing device. A read sequence includes data that corresponds to a series of these nucleotide base calls as well as data describing quality scores for the series of nucleotides. This data is usually output from the sequencing device as records (‘sequence’ or ‘sequencing’ data) for analysis / processing by a computer system, for instance to correlate component parts, such as nucleotides, with respective positions in a given reference genome.
[0002] The human genome encodes 200-500 disease-associated genes that can be difficult to genotype due to high homology with other genomic regions. Conventional variant callers rely on an aligner to first map unique sequence reads to their original genomic location before checking for differences that indicate the presence of a variant. This approach can be problematic when the reads match two or more genomic regions similarly well. This issue frequently arises when the two or more genomic regions are segmental duplication regions (‘segdups’, also referred to as paralogous regions), referring to regions in the reference sequence that look exactly the same as, or very similar to, each other in terms of sequence identity. In these cases, a conventional variant caller might discard the reads even though they contain useful information.SUMMARY
[0003] Shortcomings of the prior art are overcome and additional advantages are provided through the provision of a computer-implemented method.
[0004] In accordance with aspects described herein, a method is provided that obtains a collection of sequence reads of a biological sample. The method also alignsIP-2920-PCT Page 1 of 56the collection of sequence reads to two or more regions in a reference sequence, and identifies candidate variant positions and candidate variant alleles. The method additionally determines a total copy number of the two or more regions. The method also generates, using the determined total copy numberjoint genotype candidates with haplotypes placed in respective regions of the two or more regions, in which at least some of the joint genotype candidates include different numbers of haplotypes per region than others of the joint genotype candidates. The method also scores the joint genotype candidates for genotyping the sample.
[0005] In some embodiments, generating the joint genotype candidates includes generating and scoring haplotype set candidates based on the determined copy number, each haplotype set candidate including a respective set of haplotypes, where the generating and scoring is performed iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, and where, at each iteration of the iterations, a respective set of haplotype set candidates is generated. Additionally, in some examples, and based on the iterative performance of the generating and scoring, the method obtains a final set of haplotype set candidates having haplotypes that cover the identified candidate variant positions, the final set of haplotype set candidates yielding the joint genotype candidates for performing the scoring of the joint genotype candidates.
[0006] In some embodiments, each haplotype set candidate of the generated haplotype set candidates is a joint haplotype set (JHS) candidate that includes a respective set of haplotypes determined based on the identified candidate variant alleles and with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed, the respective set of haplotypes having a cardinality based on the determined copy number. Additionally, in some examples scoring a JHS candidate of the JHS candidates uses a prior probability value based on a uniform probability distribution. Additionally or alternatively, in some examples scoring a JHS candidate of the JHS candidates uses a prior probability value based on probability distribution that is based on the determined total copy number.IP-2920-PCT Page 2 of 56
[0007] In some embodiments, the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration to produce a resulting set of candidates to carry forward to a next iteration of the iterations. The pruning includes using an odds-ratio based pruning approach in which a score threshold is dynamically set based on a function of (i) a score of a highest-scoring haplotype set candidate generated at that iteration and (ii) a prespecified odds-ratio value, and any haplotype set candidate generated at that iteration and being scored below the score threshold is eliminated from the respective set of haplotype set candidates. The pruning using the odds-ratio based pruning method produces a resulting collection of candidates, each having been scored to meet or exceed the score threshold.
[0008] In some embodiments, the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration using a haplotype-based priors pruning approach to produce a set of joint haplotype candidates to carry forward to a next iteration of the iterations, where the haplotype-based priors pruning approach includes determining, based on the respective set of haplotype set candidates, a collection of proposed candidates having haplotypes with proposed placements, scoring the proposed candidates based on haplotype-based prior scores of the haplotypes with the proposed placements, and pruning the collection of proposed candidates based on scoring the proposed candidates to obtain a resulting set of candidates. Additionally, in some examples the method further includes activating the haplotype-based priors pruning approach based on determining that an initial pruning approach at that iteration, and applied to the respective set of haplotype set candidates generated at that iteration, produces a resulting collection of candidates having a total number of candidates that exceeds a pruning threshold.
[0009] In some embodiments, the method further includes, based on the scoring the joint genotype candidates, genotyping the biological sample based on one or more best-scoring joint genotype candidates, and generating a variant call file that identifies candidate variants based on the genotyping.
[0010] Additional aspects of the present disclosure are directed to systems configured to perform the methods described above and herein, and computerIP-2920-PCT Page 3 of 56program products having storage media readable by a processing circuit for execution to perform methods described above and herein. Additional features and advantages are realized through the concepts described herein.BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Aspects described herein are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosure are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
[0012] FIG. 1 illustrates an example of a multi -region joint detection (MRJD) variant calling approach;
[0013] FIG. 2 depicts example performance of an MRJD caller in high-sensitivity mode;
[0014] FIG. 3 depicts example performance of an MRJD caller for uniquely placed variants;
[0015] FIGS. 4A-4C illustrate an example of iterative candidate searching of an example MRJD approach;
[0016] FIG. 5 conceptually illustrates the position-wise nature of the determination of priors scores;
[0017] FIG. 6 depicts example performance of an MRJD caller for uniquely placed variants under the three priors models noted above;
[0018] FIG. 7 conceptually illustrates an example of information not currently captured by existing prior models in MRJD approaches;
[0019] FIG. 8 depicts an example of scoring proposed, placed haplotypes in accordance with aspects described herein;
[0020] FIG. 9 depicts an example of scoring a proposal haplotype, in accordance with aspects described herein;IP-2920-PCT Page 4 of 56
[0021] FIG. 10 depicts an example of scoring a joint genotype proposal;
[0022] FIG. 11 depicts example performance of variant calling under differing variant call technologies, in accordance with aspects described herein;
[0023] FIG. 12 depicts example performances of MRJD and non-MRJD variant callers under different copy number configurations;
[0024] FIGS. 13A-13B illustrates examples in which a diploid assumption used on non-diploid samples results in false positive and false negative errors;
[0025] FIG. 14 conceptually illustrates an example process of copy-number-aware multi-region joint detection in accordance with aspects described herein;
[0026] FIG. 15 depicts example regional copy number configurations when total copy number is three in a two-copy paralog;
[0027] FIG. 16 depicts example performances of MRJD variant calling by a copynumber-aware approach in accordance with aspects described herein and a forced- diploid approach under different copy number configurations;
[0028] FIGS. 17A-17C depict example processes for performing MRJD variant calling using haplotype-based prior scores, in accordance with aspects described herein;
[0029] FIGS. 18A-18C depict example processes for performing MRJD variant calling using a copy-number-aware model, in accordance with aspects described herein;
[0030] FIG. 19 depicts one example of a computer system and associated devices to incorporate and / or use aspects described herein.DETAILED DESCRIPTION
[0031] Described herein are approaches for improved variant calling in multiregion joint detection (MRJD) scenarios. MRJD is an approach that provides haplotype-based variant calling designed to detect small variants in paralogous regions of the genome. It can be helpful in addressing challenges including those noted above regarding segdups. Instead of considering each region in isolation andIP-2920-PCT Page 5 of 56genotyping it individually, the MRJD caller considers all locations from which a group of reads may have originated and assesses the underlying sequences jointly. For instance, the MRJD caller takes into consideration primary alignments in all paralogous regions, regardless of mapping quality, evaluates support for candidate haplotypes and their placement based on combined set of reads and prior knowledge, and computes joint genotypes to call small variants. This approach retains ambiguously mapped reads, making it a sensitive variant caller for paralogous regions.
[0032] FIG. 1 illustrates a basic example of the multi -region joint detection variant calling approach. This example considers two segmental duplication regions 102 (PMS2 gene) and 104 (PMS2CL pseudogene) of a reference sequence and corresponding initial read alignments 106a and 106b for these regions. The initial read alignments contain ambiguous and / or false mappings. That is, under the two regions, read alignment(s) may be ambiguous and / or misaligned with a high quality score (e.g., MAPQ). The MRJD approach collects all reads in the segmental duplication regions to create a joint pileup 108 - a collection of all of the reads regardless of their quality scores. MRJD also identifies active sites (also referred to as ‘active positions’ or ‘candidate variant positions’) of the segmental duplication regions. At sites 110a and 110c, at least some reads of the joint pileup support alternative alleles (‘A’ instead of ‘G’ observed in the segmental duplication regions for site 110a, and ‘T’ instead of ‘A’ observed in the segmental duplication regions for site 110a). At site 110b, the regions differ in their alleles (e.g., ‘C’ versus ‘T’). Both types of sites will be considered ‘active’ sites and candidate variant positions.
[0033] The MRJD approach uses the active sites to build and place haplotypes 112. In this example, four haplotypes are built - two for each of the two regions - and placed into the two regions. Haplotypes can be placed with varying degrees of confidence. Accordingly, the haplotypes can be scored, and variant calling is made based on the scores.
[0034] The MRJD approach can be performed under different calling modes of varying sensitivity. For instance, a default calling mode could call uniquely placed variants (“Type 1” variants) and region-ambiguous variants (“Type 2” variants), and a high-sensitivity mode could additionally call potential complex events (“Type 3”IP-2920-PCT Page 6 of 56variants) and alternative locations for uniquely placed variants (“Type 4” variants), as examples. The default calling mode can provide a balance between precision and recall, while the high-sensitivity mode can provide maximum recall at the expense of precision. The high-sensitivity mode might be preferred when identification of all potentially pathogenic variants is critical and orthogonal confirmation of the placement of such variants is feasible (e.g., with long-range PCR-based tests).
[0035] FIG. 2 depicts example performance of an MRJD caller in high-sensitivity mode, with performance metrics being reflected by recall and precision. Recall reaches nearly 100% in this example, but precision is lost relative to other mode(s) owing to the fact that high-sensitivity mode reports all variants in all regions. Thus, the MRJD approach can work reliably well by capturing most variants (e.g., SNPs and indels) with high-sensitivity and without having to place them uniquely. This can be applied to clinically-relevant genes, such as PMS2 by way of example.
[0036] FIG. 3 depicts example performance of an MRJD caller for uniquely placed variants (again SNP and indels), and a resulting gap is observed in unique placement performance. Under this mode, what is reported is only the confidently, uniquely placed variants rather than all possible variants in all regions. While precision is much higher compared to the high-sensitivity mode, recall is much lower.
[0037] Described herein are approaches for improved variant calling in multiregion joint detection scenarios. As used herein, a “joint diplotype” (JD) refers to the phased diplotypes across one locus or multiple loci from two or more paralogous regions within the genome. Each paralogous region (with one locus or more loci) has its own diplotype, which is a pair of two ordered haplotypes. The joint diplotype encompasses the diplotypes from all paralogous regions. A “joint haplotype set” (JHS) refers to a collection of haplotypes across multiple loci from two or more paralogous regions within the genome. Note that in accordance with aspects described herein, these haplotypes are not necessarily placed into regions and thus the JHS is a set of unplaced haplotypes. The JHS serves as a candidate for a given sample as explained herein.
[0038] There are various sources of error in MRJD-based variant placement. The fact that MRJD high-sensitivity mode has a high recall indicates that nearly all candidate variant events can be reliably discovered. However, false-positives (FPs)IP-2920-PCT Page 7 of 56and false-negatives (FNs) may result. One reason is due to incorrect haplotype candidates, which, as described herein, may be influenced by poor prior scores (prior scores or values are sometimes referred to as just “priors”). In other words, poor priors might result in construction of poor haplotypes that do not exist in the real sample, and thus will be incorrect no matter how they are placed. Another reason false positives and negatives is the influence that poor priors have on haplotype placement into regions. Poor priors might result in placement of a haplotype (even a correct one) into an incorrect region, for instance.
[0039] In an existing MRJD approach, active positions are included and considered iteratively (adding one active position at a time or sometimes several active positions at a time). After a new active position is included, all possible JD candidates JDi, JD2, ..., JDi are proposed and scored. A JD candidate JDi is scored by calculating the posterior probability of the given JD candidate given all reads P(JDi\R), which is based on read likelihood P(R\JDt) and prior score P(JDi). After posterior probabilities of all JD candidates have been calculated, genotyping at each of the included active positions is performed. Then, a pruning step is performed that retains the n number of JD candidates with highest posterior probabilities, n may be 50, for instance, though it could be any desired number.
[0040] To illustrate, refer to FIGS. 4A-4C, which provide an example of iterative candidate searching of an example MRJD approach. MRJD attempts to jointly genotype the regions (two regions in these examples) using iterative steps that involve an iterative search and consideration of candidates. Referring initially to FIG. 4A, in sequencing a new sample, there are three identified candidate variant positions, with each position having possible alternative alleles. The first position 402a has C or T alternative alleles, the second position 402b has A or T alternative alleles, and the third position 402c has A or G alternative alleles. The MRJD candidate searching first considers the first position 402a and generates all possible joint genotypes in the two regions, which in this case generates all possible distributions of the C and / or T alleles in the four haplotypes. This results in a collection of candidates 404. Three such candidates are depicted in FIG. 4A - candidate 1 being a C allele in each of the four positions, candidate 2 being two C alleles in region 1 and each of a C and T allele in region 2, and candidate 3 being two C alleles in region 2 and each of a C and T alleleIP-2920-PCT Page 8 of 56in region 1. There are 16 possible candidates because each of the four haplotypes for placement can be either of two different alleles.
[0041] 406 of FIG. 4 A indicates an example equation for scoring each candidate.The equation is set forth as Eq. 1 :where P(Gm\R) is the posterior probability of the joint genotype M (Gm) given all of the reads (R). This is equal to the fraction in which, in the numerator, P(R\Gm) is the likelihood of the read given that this joint genotype is true, P(Gm) is the ‘prior’ likelihood of the joint genotype based on empirical knowledge, and the denominator is the marginalized sum of the likelihood of the read given that a joint genotype multiplied by the prior likelihood of a joint genotype based on empirical knowledge for all possible joint genotypes.
[0042] The posterior probability determined from Eq. 1 provides a score for each candidate. A pruning step may be considered after scoring all of the candidates as that iteration. Depending on the pruning approach or parameter(s) thereof, the number of candidates may need to be pruned based on the scores. In this example, the number of candidates remaining after each iteration is to be 50 or fewer, and therefore no pruning is needed at this point because only 16 candidates exist.
[0043] The collection of candidates may be used in a next iteration as depicted in FIG. 4B, in which the second position 402b has been introduced. The second position, like the first position, has two alternative alleles (A and T) which results in 16 unique placements at the second position. Thus, for each of the 16 candidates resulting from the first iteration (the ‘first iteration’ being considered the processing with respect to the first position), there are 16 possibilities for the second position, resulting in 16 * 16 = 256 uniquely placed candidate solutions on the second iteration after the second position 402b is introduced. Thus, the number of candidate solutions grows exponentially as additional positions are included. For efficiency or other reasons, a pruning step is taken to prune the total number of candidates to a lesser number of candidates. The candidates are scored as described above, and the 50 (in this example) highest-scoring candidates are carried forward to a next iteration. On the third iteration, as depicted in FIG. 4C, a total of 50 * 16 = 800 candidate solutions areIP-2920-PCT Page 9 of 56proposed and scored, then the resulting collection is pruned, and this process continues.
[0044] After each of the candidate variant positions have been introduced, the result is a final set of candidates that have been scored. These candidates can also possibly be pruned. Joint genotyping can then proceed based on whatever candidates remain.
[0045] In an extreme case in which one of the placed candidates has a very high posterior probability score relative to the other candidates, then that placed candidate will dominate what genotype is ultimate called at the end of the processing because of it tendency to continue being carried through in the ‘winning’ solutions across the iterations.
[0046] There are issues with the example method that can directly impact variant calling performance. Pruning is based on the read likelihoods and the prior scores, as those influence the scoring on which the pruning is based. Thus, one issue is that uninformative or misinforming priors could lead to the true ID candidate being pruned out during the position-by-position JD candidate search, which could lead to incorrect genotyping results. Additionally, if two or more JD candidates share the same posterior probability as the 50th best-ranking candidates (e.g., highest-scoring, if a higher score indicates a better candidate), then in some examples the process still prune-out all candidates but one, possibly leading to the loss of true JD candidate during the position-by position JD candidate search and scoring process. In addition, uninformative or misinforming priors could lead to a haplotype being placed into the wrong region, which also could lead to incorrect genotyping results.
[0047] Poor priors, meaning uninformative or misinforming prior values, may be a result of the prior model being used. Example such models are the ‘independent and identically distributed’ (IID) model, the population-average model, and the imputation-based model, all of which are position-wise models. FIG. 5 conceptually illustrates the position-wise nature of prior scores determined under these models. The prior score is determined in a position-wise manner, with the score calculation looking at each position independently to provide a particular prior score for that position in that region. This is done for each of the positions, and the probabilities are multiplied together into the final priors score.IP-2920-PCT Page 10 of 56
[0048] Under the IID model, the prior value aims to minimize differences between the reference sequence and the haplotype in the proposed JD. This method could be misleading when a sample has a gene conversion or crossover event and the reference alleles are different among paralogous regions, or a sample has high divergence between its haplotype and the reference sequence. The population-based prior improves on this by considering the frequency of the diplotype appearing in the population at that position and that region (i.e., there is a predefined likelihood for the diplotype at that particular position based on population modeling). The imputationbased (or ‘sample-specific’) model improves further by considering the likelihood of the diplotype in the position based on nearby context such as flanking regions. This can help with situations when the sample diplotype is minor in the population (the minor allele or diplotype in the imputation-based prior model could get a high prior value, whereas in the population-based prior model it could still get a low prior value).
[0049] Thus, for population-average or imputation-based prior models, the prior is encoded as the expected allele frequency of each diplotype combination in the population. The population-based prior model takes into account the variability in the allele frequency distribution across polymorphic sites. Additionally, the reference allele diplotype can be in the minority, and diplotypes including the alterative alleles can be in the majority for some polymorphic sites.
[0050] FIG. 6 depicts example performance of an MRJD caller for uniquely placed variants under the three priors models noted above. The performance illustrates that population-based priors result in better performance than IID priors, and that sample-specific priors result in better performance than population-based priors. However, none of the three models necessarily provide desirable or satisfactory recall and precision rates.
[0051] Thus, uninformative or misinforming priors could lead to incorrect haplotypes and / or could result in misplacement of even correct haplotypes into the wrong regions. Either scenario can lead to incorrect genotyping. All position-wise prior models derive position-wise diplotype priors and multiply the priors under multiple loci into a final JD candidate prior. This process fails to directly take into account the frequency of haplotypes, or cooccurrence frequency of alleles in multipleIP-2920-PCT Page 11 of 56loci. Given that the JD candidate is a set of placed haplotypes, and in accordance with aspects described herein, a more biologically meaningful prior is provided to take advantage of haplotype frequency in the population, which is impacted by linkagedisequilibrium (non-random association between alleles at different positions on the chromosome in a given population). It is noted that the position-wise imputationbased prior model operates on a diplotype level instead of haplotype level. Accordingly, aspects described herein present a haplotype-based prior model for MRJD.
[0052] FIG. 7 conceptually illustrates an example of information that is not currently captured by existing prior models. Shown are paralogous regions 1 and 2 (702) of a reference sequence, and placed haplotypes 704 for four samples in the population. This data would inform a database from which prior values can be determined. In region 1, a T allele in the 2nd position and a C allele in the 4th position are statistically likely to appear together in the same haplotype, rather than appear individually. However, that tendency for co-occurrence is information that positionwise prior models do not capture because by nature they consider single positions at a time.
[0053] The importance of this information was evidenced through hierarchical clustering on PMS2 and PMS2CL haplotypes from a long read dataset, in which different haplotypes were stacked by aligning them in terms of base positions. Haplotypes of the PMS2CL region were clustered this way and haplotypes of the PMS2 region were separately clustered this way. A natural correspondence was observed between the two different groups corresponding to the two regions: the haplotypes at different segmental duplication regions are generally distinguishable from each other - each region has respective haplotypes that generally correspond to each other and not to those of the other region. Therefore, moving to haplotype-based prior holds can deliver more informative prior knowledge based on comprehensive population haplotype profiling under the regions of interest.
[0054] Accordingly, approaches described herein can move away from using position-wise priors when progressively building joint diplotypes position-by- position. This can avoid pruning-out correct joint diplotype candidates along the way. Additionally, moving from a position-wise prior model to a haplotype-based priorIP-2920-PCT Page 12 of 56model provides more accurate haplotype placement, and a haplotype-based pruning strategy can optionally be used to efficiently prune truly poor joint diplotype candidates and facilitate more feasible, manageable software computation undertaken. This improves efficiency and reduces resource use, improving operation of the processing / computer systems involved.
[0055] An initial aspect of MRJD is event identification. A conventional MRJD approach breaks the input region into overlapping windows of 900 base pairs (bp) each. Within each window, an alignment strategy, such as Smith-Waterman alignment, is performed between the two or multiple reference sequences to derive a universal coordinate system. Then, De Bruijn Graph assembly is performed, followed by candidate event identification. Limiting region size in this manner facilitates processing by making the data size manageable. However, longer haplotypes provide better results, and thus it may be desirable to operate at a whole-gene copy level (on the order of 30,000 bp or more for instance). Accordingly, event identification under approaches described herein may first perform a global alignment (such as Smith- Waterman alignment) between the entirety of the two regions, identify reference difference sites between the two regions and the corresponding positions in the regions, and then perform event identification under each of the limited-size windows (e.g., of 900 bp each).
[0056] More specifically, as an example, global Smith -Waterman alignment is performed between the two or multiple reference sequences of the entire paralogous regions (without breaking into small windows). Then, the region is broken into small windows, followed by candidate event detection using, e.g., De Bruijn Graph assembly.
[0057] A process obtains a collection of sequence reads of a biological sample and aligns the collection of sequence reads to two or more regions in a reference sequence, before identifying candidate variant positions and candidate variant alleles. By way of example, aligning the collection of reads to the two or more regions and identifying the candidate variant positions can proceed as follows:
[0058] -Given the input pairs of regions, perform global Smith-Waterman alignment of the two or multiple regions. This aligns the two or more regions to obtain a full alignment of the two or more regions and a universal coordinate system.IP-2920-PCT Page 13 of 56
[0059] -Break the large region (full alignment) into smaller overlapping windows (for example, 900 bp window with 300 bp overlap on each end). Since these windows overlap, a variant span in the middle of each window is introduced to avoid getting duplicate calls. Only events within variant spans can be eventually considered for genotyping. An example variant span could be 300 bp based on an overlapping size of 300 bp (i.e., 300 bp on each end of a 900 bp length).
[0060] -For each window pair, extract and map the reads to the universal coordinate system to provide a joint pileup. The process sets up the universal coordinate system by parsing the CIGAR string of the global Smith-Waterman alignment output. Optionally, the process could skip large relative INDEL regions. Not skipping may be preferred because it could avoid breaking the joint haplotype set construction at relative INDEL regions and the relative INDELs could help the process better place haplotypes. In any case, the process extracts all reads and maps them to the universal coordinate system. The result will be a joint pileup. The process can then perform De Bruijn Graph assembly and extract candidate variants from the joint pileup.
[0061] -Event identification under each window pair provides the events in the 900 bp windows, and therefore these can be expanded to the full region to derive all active positions and events for downstream processing.
[0062] To help address issues noted above relative to incorrect and misplaced haplotypes, a new joint haplotype set construction and candidate scoring approach is provided as part of an updated iterative candidate searching. In particular, to avoid pruning correct haplotype candidates during the iterative processing, a uniform prior score is used in conjunction with read likelihood to score candidates, and optionally a haplotype-based prior score is used in an example pruning approach. To avoid malplacement of haplotypes, a haplotype-based prior model is leveraged as described herein.
[0063] Thus, in some embodiments, a uniform prior score is used. This refers to a prior probability value that is based on a uniform probability distribution, meaning that each JHS candidate will be regarded as having the same probability. It could, in examples, be a fixed value, for instance 1. In any case, use of a uniform prior differs from conventional approaches in which different candidates have differentIP-2920-PCT Page 14 of 56probabilities. A uniform prior model by nature does not boost the score of any haplotype based on empirical knowledge.
[0064] Instead of proposing and scoring JD candidates during the candidate searching process, a method as described herein proposes and scores JHS candidates by generating a collection JHS candidates, where each JHS candidate includes a respective set of haplotypes, determined based on the identified candidate variant alleles, with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed. Referring back to FIG. 4A, candidates 2 and 3 each include three placed C alleles and one placed T allele. An analogous JHS candidate may be generated to describe inclusion of three C alleles (unplaced) and one T allele (unplaced), which would capture the two possibilities of resulting JDs. Thus, a difference between a JHS candidate and a JD candidate is that the JHS candidate does not specify placement of haplotypes. This avoids a potentially negative impact on candidate scoring that was seen when uninformative or misinforming priors were relied upon in placing haplotypes.
[0065] Instead of scoring JHS candidates based on a position-wise prior model (and multiplying the position-wise prior across multiple positions), the scoring uses a uniform JHS prior, and thus the scoring will be dominated by read likelihood, denoted P(R\JHS), which is a composite likelihood of observing the collection of sequence reads (e.g., as read pairs if using paired-end sequencing) given the respective set of haplotypes of the JHS candidate. Again, this avoids a potentially negative impact on candidate scoring owing to uninformative or misinforming priors.
[0066] For pruning, instead of selecting only n candidates with the highest posterior probabilities (e.g., 50, even if there are multiple candidates that have the same probability and some are pruned anyway), a process in accordance with aspects described herein can use an odds-ratio based pruning approach, in which after scoring the candidates, the highest posterior probability value, Posterior max, is determined, and a threshold value for pruning is derived by calculating Posterior max I odds ratio. All candidates with scores at or above that threshold may be retained, while the rest are pruned-out. An example odds-ratio is 100, though it could be any other desired value. Odds-ratio-based pruning lowers the chance of correct candidatesIP-2920-PCT Page 15 of 56being pruned out because the posterior probabilities of correct candidates should be toward the top of the ranking, if not the highest.
[0067] Under this odds-ratio based approach, there may be situations in which relatively many JD candidates with the same or similar high scores remain, since the number of candidates will grow exponentially every time a new active position is included and there is no read phasing information. Therefore, in these situations, a pruning approach based on placed haplotype scoring (using haplotype-based prior values) can be utilized to efficiently reduce the number of candidate and control computation burden. This approach can be activated when the number of candidates remaining after odds-ratio-based pruning is greater than a set threshold. This alternative pruning approach is also described in further detail herein.
[0068] After the candidate searching - that is, a final set of JHS candidates is obtained having haplotypes that span all active positions, these candidates can be expanded into candidates with unique haplotype placements and the candidates can be scored using haplotype-based priors.
[0069] The following presents a specific example of candidate searching steps in accordance with aspects described herein. The processing order of active position can be from left to right (or additionally from right to left), which is different from having multiple processing orders that seed from each of the active positions as in previous approaches.
[0070] -Add a new active position (or positions), based on the set of unplaced haplotype sets in the previous positions (unless at the first iteration considering the first-introduced active position(s)) and the alleles at the new position, and populate all possible joint haplotype sets {JHSi. . . JHSm}.
[0071] -Each joint haplotype set JHSmincludes a set of unplaced haplotypesLet be a candidate paired read, i = 1, ..., / . The likelihood of observing each paired read given each candidate haplotype P(rt\Hk) can be calculated using pairHMM (forward algorithm) discussed below.
[0072] -Then, the likelihood of observing all paired reads given each candidate haplotype (denoted P(R\Hk){ can be calculated by summing the log read likelihood for all reads in, R = {ri. in other words summing the P(rt\Hk) for all I = 1, . . . , I. ForIP-2920-PCT Page 16 of 56each joint haplotype set candidate JHSm, the likelihood of observing all reads given joint haplotype set (denoted P(R\JHSm)) can be calculated by summing the P(R\Hk) for the set of H in JHSm, then dividing the value by K (the number of haplotypes in JHSm).
[0073] -Next, the posterior probability (‘score’) of each JHS candidate given all reads can be calculated as P(JHSm\R) = Inaccordance withaspects described herein, a uniform distribution is used for the prior term, i.e., P(JHSm). In examples, this is set to 1 or 1 / number of JHS candidates. In any case, this is “uniform”, meaning the same for each of the JHS candidates. In this manner, scoring the JHS candidate multiplies the read likelihood of the JHS by the prior of that JHS, which is taken as a uniform prior, and therefore the prior has no negative influence on the scoring. This is unlike the conventional situation in which bad prior values could be present and used in the scoring.
[0074] -Use odds-ratio based pruning to prune out JHS candidates with posterior values lower than a threshold. The highest score, Posterior max, of all of the JHS candidates determined at that iteration is divided by a selected odds-ratio, such as 100, to determine a score threshold for pruning. All JHS candidates with a score (posterior value) lower than the threshold can be pruned-out. In this manner, the collection of JHS candidates generated at the iteration is pruned using an odds-ratio based pruning approach that dynamically sets a score threshold based on a function of (i) the score of highest-scoring JHS candidate generated at that iteration and (ii) a prespecified odds-ratio value. Any JHS candidate generated at that iteration and being scored below the score threshold is eliminated from the collection of JHS candidates to produce a resulting collection of JHS candidates (each having been scored to meet or exceed the score threshold) that may be carried into a next iteration if there are more candidate variant positions to incorporate. If the number of JHS candidates after the odds-ratio pruning approach exceeds some threshold (for example, 50,000 candidates), then the process can switch to an alternative pruning approach that is based on placed haplotype scoring, detailed below.
[0075] -At this point, the process can return to the aspect above where the next one (or more) active position(s) is / are added and updated JHS candidates are generated. In this manner, generating and scoring JHS candidates is performedIP-2920-PCT Page 17 of 56iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions. At each iteration of the iterations, a respective collection of JHS candidates is generated.
[0076] Based on obtaining the set of JHS candidates that cover the identified candidate variant positions, then the process can generate joint diplotype (JD) candidates from the set of JHS candidates. Each JHS candidate, which has a respective set of haplotypes, informs several JD candidates, and each JD candidate of the JD candidates has each haplotype of the respective set of haplotypes of the given JHS candidate placed in a respective region of the two or more regions being considered. Those JD candidates can be scored based on haplotype-based prior scores of the placed haplotypes in accordance with aspects described herein. FIG. 8 depicts an example of scoring proposed, placed haplotypes, in accordance with aspects described herein. A dataset or similar collection of population-based data 802 (“hapDB”) is obtained an indicates population haplotypes in the paralogous regions. In examples, the database takes the form of a multi-haplotype variant call file in the VCF file format, which contains, among other data, a column with genotype information for each haplotype. Haplotypes can be derived from multiple sources, examples of which include fully phased diploid assemblies or human genome assemblies, as examples.
[0077] Continuing with FIG. 8, the example data 802 is a simplified version for illustrative purposes, and its representation uses different colors to present different haplotypes. In region 1, haplotypes 804a and 804c are each observed twice, and haplotype 804b is observed once, while in region 2, haplotype 804d is observed four times and haplotype 804c is observed once.
[0078] A haplotype-based prior score of a haplotype placed into a region refers to the frequency at which the placed haplotype appears in the region in the selected population. These scores are also referred to herein as ‘haplotype-based priors’. Consider a set of haplotypes Hl and H2 that are equally supported by a sample read. Hl, which is haplotype 804c, has a relatively high score compared to H2 because Hl is observed in the population database and H2 is not. During the candidate searching process discussed above (which does not place the haplotypes), Hl will have a higherIP-2920-PCT Page 18 of 56tendency to survive in terms of its appearance in JHS candidates that move through the iterations as compared to H2. Additionally, the haplotype-based prior score for Hl in region 1 is higher than the haplotype-based prior score for Hl in region 2 since Hl occurs with a higher probability in region 1 than it does in region 2.
[0079] The following presents an example process for JD candidate scoring using haplotype-based prior scores in accordance with aspects described herein.
[0080] -When candidate searching concludes with a set of JHS candidates having haplotypes that cover the identified candidate variant positions, the process proposes all possible JDs based on this set of JHS candidates. That is, for each JHS candidate JHSm, the process proposes all possible placed JD candidates JDS, s = 1, ..., S by generating all possible ways to place the haplotypes within JHSm. For example, if there are four unique haplotypes within a given JHS candidate, then the number of possible JD candidates is six.
[0081] -Score each of the JD candidates in {JDi, JDS}. For a given JD candidate JDS, the posterior probability of the candidate given all reads can be calculated . Note that fK I JDS) = P(R I JHSk), whichhas been calculated already for all JHS candidates during candidate searching. The JD candidate scoring is thus based (in part) on read likelihood - a composite likelihood of observing the collection of sequence reads given the respective set of haplotypes of the JD candidate. The term P(JDS) refers to the haplotype-based prior score of the placed haplotypes of the given JHS candidate. Assume that one JD candidate JDmconsists of four placed haplotypes. Using Q to represent phase and n to represent region, then each haplotype placed into region and phase can be represented as Hg,n. Then P(JDS), the haplotype-based prior score, is equal to ^e,n P H0 n) [Note: The individual probabilities are summed because computations are performed in log space; they would be multiplied if performing the computations in non-log space], P(He,n) may be determined as follows. Let Dx,nbe a unique haplotype in the hapDB in region n, x =The prior score for a candidate placed haplotype, P(Hg,n), can be calculated as the weighted sum of the alignment score in log likelihood scale between the MRJD-proposed haplotype and each of the unique haplotypes in the hapDB, with the weight being the frequency of thex.nin the hapDB. For example, if n = 1, then the prior for P(Ho,n=i) can be calculated as P(Ho,n=i) =x(S(Ho,n=i,Dx,n=i) *IP-2920-PCT Page 19 of 56Freq(Dx,n=i)) . The S(He,n=i,Dx,n=i) is similar to P(R\H) and represents the similarity between the proposed haplotype in a given region Ho,nand a hapDB haplotypex.nNote that, here too, the individual probabilities are summed because computations are performed in log space; they would be multiplied if performing the computations in non-log space.
[0082] The scoring can be calculated using several options. As an example, assume that a proposed haplotype Hl in region 1 is A-C-G-G-T, and that there are four different population haplotypes in region 1. The proposed haplotype can be considered pairwise with each of the four population haplotypes, and for each proposed-population pair, the process can determine two key pieces of information. The first is a likelihood, meaning how likely the proposed haplotype and the population haplotype match. This is akin to a confidence that that the proposed haplotype appears in the population and is based on similarity, for instance identifying SNP (single nucleotide polymorphism) differences to indicate how similar they are to each other. The more SNPs that exist between the two, the lower the likelihood value. In examples, a likelihood model has a maximum value of 150 indicating a perfect likelihood, though any desired likelihood model could be used. The second key piece of information is frequency - it can be determined from hapDB how frequently the population haplotype appears at that region in the population haplotype database. The frequency can represent a weight. If one population haplotype occurs at 90% frequency, then the weight could be .9. The total frequency of all haplotypes observed in the population is expected to be 100%, or 1. For proposed haplotype Hl, P(H1) is the sum of the four (in this example) results of the pairwise considerations, and this is a weighted sum as described. Then, the probability of each of the placed haplotypes Hi may be the log sum of each P(Hi).
[0083] An example scoring of a given proposal haplotype hapA and hapDB haplotypes is presented. One option is using a forward algorithm termed pairHMM. For this, it is not required to know what is the best alignment, but instead the total probability given all possible paths may be used. A process can perform pairHMM between proposal haplotype hapA and each of the unique hapDB haplotypes, and get a weighted probability (of proposal haplotype hapA given hapDB haplotypes) in which the weight is determined by the frequency of each unique hapDB haplotype. This weighted probability of proposal haplotype hapA can then be normalized acrossIP-2920-PCT Page 20 of 56weighted probabilities of all proposal haplotypes into a normalized weighted probability of proposal haplotype hapA. The final haplotype-based prior for a given JD candidate is calculated by multiplying the normalized weighted probabilities of all proposal haplotypes in a given JD candidate.
[0084] In an optimized approach (FIG. 9, showing an example of scoring a proposal haplotype), the probability of proposal haplotype hapA given hapDB haplotypes is calculated differently. First, a list of candidate positions are defined as the union set of variant positions between the proposal haplotype hapA and the reference, and all variant positions between all hapDB haplotypes and the reference. Then, between proposal haplotype hapA and each of the hapDB haplotypes, the process considers each candidate position and calculates a distance or penalty score between proposed haplotype hapA and the given hapDB haplotype by comparing the allele in the proposed haplotype hapA and the given hapDB haplotype at the given candidate position. This distance or penalty score is essentially an edit distance, though other approaches to calculate the distance can be applied that further considers differences between single nucleotide allele mismatches and multi-nucleotides allele mismatches (e.g., an affine gap penalty can be applied for multi-nucleotides allele mismatches). After a respective distance or penalty score is calculated between the proposed haplotype hapA and each of the given hapDB haplotypes, a weighted distance can be calculated between proposed haplotype hapA and hapDB haplotypes, in which the weight is based on the frequency of each unique haplotype in the given hapDB haplotypes. In this example, the weighted distance between proposed hapA and hapDB haps is 2*0.4 + 1*0.4 + 0*0.2 = 1.2. After the weighted distance is calculated, a quality score Q (for instance in phred scale) can be calculated by multiplying the weighed distance by a factor, which represents average cost of each distance unit in phred scale (e.g., 20). Using the prior example, quality in phred scale is equal to 1.2 * factor (e.g., 20) = 24. Then, an estimated probability of proposed haplotype hapA given hapDB haplotypes can be calculated by converting the quality score Q in phred scale to probability scale using this formula 10A(-Q / l 0) (.004 in this example). Then, the prior of proposed haplotype hapA can be calculated by normalizing the probability of proposed hapl across probabilities of all proposed haplotypes. Finally, the haplotype-based prior for a given JD candidate may be calculated by multiplying the normalized probabilities of all proposal haplotypes in aIP-2920-PCT Page 21 of 56given JD candidate. This example approach saves performing pairHMM or an alignment between the proposed haplotype and every haplotype in the hapDB.
[0085] For a long proposed haplotype, this can be broken down to small windows, scores can be determined for each window, and then the scores for the windows can be summed in the log space. In an optimized approach, between proposal haplotype H and each of the hapDB haplotypes D, the process calculates a score S between the hapDB haplotype D and the sequence in the reference genome once. Then, scores between a given proposal haplotype H and hapDB haplotype D can be done by going through positions that are variant positions either between the proposal haplotype H and the reference or between the hapDB haplotype D and the reference, checking the allele concordance between the proposal haplotype and the hapDB haplotype D at each of those positions, and adjusting score S accordingly. Referring to FIG. 9, the alignment score of the hapDB haplotype D (for example hapl in FIG. 9) to the reference produces an initial score of 135. To obtain the alignment score between the proposal haplotype and one of the hapDB haplotypes (and using hapl of FIG. 9 by way of example), each variant in the proposal haplotype relative to the reference sequence and each variant in the hapDB haplotype hapl relative to the reference is considered, and the allele of the proposal haplotype is compared to the corresponding allele of hapl at the same position based on alignment between hapl and the reference sequence. The initial score of 135 is adjusted based on this comparison. For example, if the allele is present in the proposal haplotype and also the hapDB haplotype (hapl in this example) at the same position, then a score is added (+5 in this example) to the initial score. The resulting scores are 135 between the proposal haplotype and hapDB haplotype hapl, 140 between the proposal haplotype and hapDB haplotype hap2, and 145 between the proposal haplotype and hapDB haplotype hap3, as examples. These scores are exponentiated to determine a likelihood for each, and then scaled by the respective haplotype frequency in the hapDB. The sum of the results is a weighted sum. The logarithm of the weighted sum can be taken to provide an adjusted score. This example approach saves performing pairHMM or an alignment between the proposed haplotype and every haplotype in the hapDB.
[0086] In accordance with some aspects described herein, candidates may be scored based on haplotype-based prior scores. Further, a pruning approach may be implemented based on these haplotype-based prior scores. Scoring based onIP-2920-PCT Page 22 of 56haplotype priors introduces a non-uniform prior score for haplotypes and does so based on placing the haplotypes. In examples, a pruning approach based on this scoring is undertaken in the context of an iteration during candidate searching if an initial pruning approach used at that iteration, which could be the odds-ratio based approach described above or any other pruning approach, results in a number of remaining candidates that exceeds a threshold. Thus, if an initial pruning approach provided a resulting collection of JHS candidates that has a total number of JHS candidates that exceeds a pruning threshold, the process can switch, for that iteration, to an alternative pruning approach, different from the initial pruning approach, for instance to a haplotype-based priors pruning approach .
[0087] In embodiments, the haplotype-based priors pruning performs the following:
[0088] -Propose all possible JDs based on the JHS candidates after candidate searching. The JHS candidates is the collection of JHS candidates generated at the iteration after adding the candidate variant position(s) of that iteration (rather than the set of JHS candidates that resulted from the initial pruning at that iteration and that exceed the pruning threshold). In proposing the possible JDs, the process expands the collection of JHS candidates into a collection of proposed JD candidates having haplotypes with proposed placements.
[0089] -Score all possible JD candidates of the collection, which scoring uses read likelihood and haplotype-based prior scores of the haplotypes with the proposed placements as described above.
[0090] -Prune the collection of proposed JD candidates based on the scores and using a desired pruning approach (which could be odds-ratio pruning or any other approach). This prunes out all JD candidates below a threshold, for instance, and produces a resulting set of JD candidates.
[0091] -Collapse the resulting set of JD candidates back to JHS candidates to obtain an updated collection of JHS candidates. This updated collection of JHS candidates is the set of JHS candidates to carry forward to the next iteration, if present.IP-2920-PCT Page 23 of 56
[0092] Thus, under haplotype-based pruning, JD candidates are proposed from the JHS candidates, the JD candidates are scored based on read likelihood and haplotypebased priors, and the scored JD candidates are pruned to provide a resulting set. This set can then be collapsed into a set of JHS candidates for use in a next iteration. The next iteration will propose new JHS candidates based on the addition of one (or more) candidate variant positions, and the pruning approach taken at that point can be any pruning approach desired. The haplotype-based pruning approach could again be activated if needed if an initial pruning approach results in too many JHS candidates to carry forward.
[0093] After the iterative candidate searching has covered the candidate variant positions and the JD candidates have been generated and scored, this set of JD candidates could be pruned if desired. In any case, a resulting set of JD candidates and their scores can be used to perform genotyping at each candidate locus to determine, at each candidate variant position, what variant is to be reported and the quality of that variant. As part of this, the process can sum the JDs supporting each variant scenario to obtain the best supported variant scenario. The process can then generate variant call data, for instance a variant call file, such as one in the VCF file format, that identifies candidate variants based on the genotyping.
[0094] FIG. 10 depicts an example of scoring a joint genotype proposal. The example joint genotype proposal 1 of FIG. 10 proposes haplotypes Hl and H2 in region 1 and haplotypes H3 and H4 in region 2. The score of the proposal is equal to the sum (under log scale, in examples) of the proposal of Hl in region 1, H2 in region 1, H3 in region 2, and H4 in region 2.
[0095] Equation 2 below describes how an example MRJD approach described herein determines the relative probability of a candidate variant in position j in a given region.
[0096] By the above, the likelihood of a variant genotype is dependent on the ratio to the homozygous reference genotype. The equation can then be expanded into the summed posterior probability of joint diplotypes supporting the variant genotype (inIP-2920-PCT Page 24 of 56position j in a given region) versus the summed posterior probability of joint genotypes supporting the homozygous reference genotype.
[0097] The genotyping of region-ambiguous variants can also be performed. However, in contrast to conventional approaches that consider whether the variant is in one of the paralogous copies (regions), the MRJD region-ambiguous-variant detection mode as described herein considers whether the variant is in any of the paralogous copies. The quality can be computed using the Equation 3:
[0098] If the quality for a variant genotype (either heterozygous or homozygous) is greater than a threshold, for example 3, the variant can be reported with a PASS filter, in examples.
[0099] FIG. 11 depicts example performance of variant calling under differing variant call technologies, in accordance with aspects described herein. Specifically, aggregated SNP performance in PMS2 and PMS2CL high homology regions across 133 cell line samples for four callers is compared. The four callers are a conventional non-MRJD-based variant caller (with results 1102), a conventional MRJD caller in high-sensitivity calling mode (with results 1104), a conventional MRJD caller in default calling mode (with results 1106), and an MRJD caller 1108 utilizing the updated candidate search and haplotype-based prior scores to genotype in accordance with aspects described herein. The orthogonal callset is derived from a long read dataset. It is seen that the MRJD caller utilizing the updated candidate search and haplotype-based prior scores achieves substantially better concordance with orthogonal results compared to the other three callers.
[0100] In accordance with additional aspects described herein, a copy-number- aware multi -region joint detection approach and genotyping is provided, which accounts for the possibility of varying copy numbers. The concept of accounting for varying copy numbers as described herein could be used alone or in combination with one or more other aspects discussed, such as those discussed above relating to updated candidate searching based on unplaced haplotypes, use of uniform priors in scoringIP-2920-PCT Page 25 of 56JHS candidates, and use of haplotype-based priors in scoring candidates with placed haplotypes.
[0101] Under a copy -number-aware approach, the model determines a total copy number and examines different region copy number configurations to figure out the best configuration. Specifically, candidate generation and / or scoring can use the determined total copy number to place haplotypes in regions such that different candidates can include different numbers of haplotypes per region than other candidates. This enables more accurate small variant calling and also enables regionspecific copy number calling. As noted, approaches described above relating to updated candidate generation using joint haplotype sets, uniform and haplotype-based prior scores, and pruning approaches such as odds-ratio and haplotype-based pruning, can optionally be combined with the copy-number-aware approach.
[0102] Conventional MRJD approaches assume a diploid copy number in all paralogous regions. This is also assumed in examples above when proposing the joint haplotype set candidates and joint genotype candidates. However, copy number change is common in segmental duplication regions, meaning there will not always be two haplotypes per region - a region could have 0, 1, 2, 3, etc. haplotypes. In regions with copy number changes, the diploid model could, in fact, perform worse than other variant callers due to an incorrect ploidy assumption. FIG. 12 depicts example performances of MRJD and non-MRJD variant callers under different copy number configurations. As depicted, the MRJD calling performance under several copy number configurations is worse than the performance of the conventional non-MRJD caller.
[0103] FIGS. 13A-13B illustrates examples in which a diploid assumption used on non-diploid samples results in false positive and false negative errors. FIG. 13 A depicts an example of a false positive in the G-G call in region 2 in a 2-0 copy number scenario. The diploid assumption by MRJD results in a 2-2 copy number scenario with a false positive in region 2. FIG. 13B depicts an example also under a 2-0 copy number scenario of calls with a false positive in either region. The upper call also exhibits a false negative in the failure to call the precise genotype; the genotype called in region 1 is 1 / 1 (homozygous alternative allele), but the correct genotype is 0 / 1 (heterozygous alternative allele).IP-2920-PCT Page 26 of 56
[0104] A joint genotype (JG) refers to the phased genotypes across one locus or multiple loci from two or more paralogous regions within the genome. Each paralogous region (with one locus or more loci) has its own genotype, which is an ordered collection of zero, one, two, or more haplotypes. The joint genotype encompasses the genotypes from all of the paralogous regions. The main difference between joint genotype and joint diplotype is that joint genotype allows different ploidy, whereas joint diplotype allows only diploidy in each region.
[0105] FIG. 14 conceptually illustrates an example process of copy -number-aware multi -region joint detection in accordance with aspects described herein. This particular example also incorporates the updated candidate searching approach detailed above that uses joint haplotype sets (with unplaced haplotypes), though it is not required that the JHS approach be taken for candidate generation.
[0106] Initially, a total copy number 1402 of the paralogous regions is derived. The total copy number is then used as guidance for the number of haplotypes to include in candidates when generating candidates 1404. The candidates are JHS candidates in this example. JHS candidates in the diploid case described above involved two regions and four haplotypes (two haplotypes per each of the two regions). Here, the total copy number informs the number of haplotypes, which could differ from four.
[0107] The total copy number can be derived, in examples, by counting all reads in all paralogous regions and normalizing the count by the unique region in the genome. In a diploid case in which there are two haplotypes per region and four haplotypes in total, then unique region coverage would be about half of the reads in the two regions. As an example, unique region coverage for 40 reads would be expected to be 20 in the diploid case of four haplotypes across the two regions. However, if 80 reads are observed instead of 40, then this indicates there are about eight copies across the two regions. The foregoing are just examples. There are various ways to determine a likely copy number.
[0108] Total copy number influences the position-by-position candidate construction (construction of JHS candidates in this example) and pruning. For instance, the number of haplotypes in each joint haplotype set will be the total copy number of the paralogous regions. The total copy number in the example of FIG. 14 isIP-2920-PCT Page 27 of 56three, and thus JHS candidates 1404 of FIG. 14 include JHSi with haplotypes Hl, H2, and H3, JHS2 with haplotypes Hl, H3, and H4, JHSs with haplotypes Hl, H2, and H5, etc.. The number of JHS candidates generated may be the combination Q where n is the number of unique haplotypes for JHS construction and p is the total copy number. JHS candidates may be generated, scored, and pruned iteratively as described above.
[0109] In accordance with aspects of the copy-number-aware model, after all active positions have been included and the candidate searching steps are done, then an example process proposes, for each JHS candidate JHSm, all possible placed joint genotype candidates JGS, s = 1, 5 by generating all possible ways to place the haplotypes of JHSm. The diploid case described above proposes placed joint genotype candidates that have two haplotypes in each region. The updated copy -number-aware model will propose placed joint genotype candidates that include potentially other (different than two) numbers of haplotypes per region, under the constraint that the total number of haplotypes is the total copy number.
[0110] It is noted that if candidate generation did not use the JHS approach, meaning the candidates included placed haplotypes, then this proposal of joint genotype candidates after incorporating all active positions would not be needed, as the set of candidates after the active positions are included would be joint genotype candidates with proposed placements.
[0111] 1406 of FIG. 14 illustrates the JGs {JGi. . . JGi} for JHSi that includes haplotypes Hl, H2 and H3. The JGs include a JG in which all three haplotypes are placed in region 1, a JG in which all three haplotypes are placed in region 2, and six JGs covering the varying situations of two haplotypes in one region and one haplotype in the other region.
[0112] After all possible JG for all possible JHS have been proposed, the process scores each of the JG candidates in {JGi, . , ., JGi}. For a candidate JGi, the posterior probability of the candidate given all reads can be calculated as:
[0113] By equation 4, the read likelihood given a joint genotype candidate,P(R\JGi), is the same as the read likelihood given the corresponding joint haplotypeIP-2920-PCT Page 28 of 56set candidate -P(R | JGt) = P(R\JHSm). The prior for a joint genotype candidate, P(JGi), may, in some embodiments, be based on the haplotype-based prior and the region copy number configuration prior. If using Q to represent phase and n to represent region, then each haplotype placed into region and phase can be represented as He,n. And if using RCNj to represent region copy number configuration under P(JGi), then P(JGi) = Ze,nP(He,n) *P(RCNj). In examples, the calculation of P(He,n) is based on the haplotype-based prior method discussed previously. The region copy number configuration prior can either be derived from prior knowledge from a large and diverse population, a uniform distribution as discussed previously, or an empirical distribution based on the number of copies deviating from the normal diploid copy number, as examples. In this manner, the JG prior used could be, or be based on, a position-wise prior, a haplotype-based prior, a uniform prior, and / or a copy-numberbased prior, as examples.
[0114] Table 1 below presents example region copy number prior values based on population knowledge, and was generated by applying an example targeted caller on a dataset.IP-2920-PCT Page 29 of 56Table 1
[0115] After determining the scores of the JG candidates for each JHS, i.e., after all P(JG) is calculated for all JG candidates, the process can perform small variant genotyping at each candidate active position with a difference relative to the description above in that the genotype can now be non-diploid, for example. The process could also perform region copy number calling using P(JG). Based on the total copy number, the process can propose all region copy number configuration candidates {RCNi, RCN2, ..., RCNk} based on the possible ways to assign regionalIP-2920-PCT Page 30 of 56copy number. For example, if the total copy number is three in a two-copy paralog, there will be four RCN candidates with the configuration shown in FIG. 15. The process can calculate the posterior probability of a region copy number configuration candidate RCNi by summing the posterior probabilities of P(JG\R) for all JGs that support
[0116] In some embodiments, and based scoring the joint genotype candidates, a process genotypes the biological sample based on one or more best-scoring joint genotype candidates, and generates a variant call file that identifies candidate variants based on the genotyping. Additionally or alternatively, a process can detect and report regional copy numbers of the regions based on the scoring. As yet another example, a process could report, for a region, a variant genotype based on ploidy that follows a regional copy number of the region. For example, under a diploid case, the genotype can be 0 / 1, but under a triploid case (where a variant is in the region with copy number three), the process can determine and report a genotype that follows this ploidy of three, for example a polyploid genotype of 0 / 0 / 1.
[0117] As noted previously, the copy -number-aware model may or may not incorporate aspects of the JD scenario described above. These aspects include the updated candidate searching described herein in which candidates are constructed as JHS’s (with unplaced haplotypes) rather than JDs / JGs (placed haplotypes). In other words, the process at each iteration of incorporating next candidate variant position(s) could use the copy number to produce JG candidates, rather than JHS candidates, based on the total copy number, if desired. Similarly, a pruning approach(es) used could be conventional pruning approach(s), the odds ratio approach described herein, and / or the haplotype-based priors pruning approach described herein, if desired. And, scoring may or may not be based on haplotype-based priors, if desired. In this manner, various combinations of aspects described are possible.
[0118] FIG. 16 depicts example performances of MRJD variant calling by a copynumber-aware approach in accordance with aspects described herein and a forced- diploid approach under different copy number configurations. In FIG. 16, the performance differences (indicated by the y-axis scale) between the copy-number- aware MRJD, denoted as prototype in the figure, and non-copy-number-aware MRJD, denoted as prototype_force_diploid, in the SMN1 / 2 region under different copyIP-2920-PCT Page 31 of 56number configurations are apparent. The copy-number-aware mode achieves significantly better performance over the non-copy -number-aware mode, especially on samples with non-diploid copy number configurations.
[0119] Multiple technological improvements and advantages are provided by aspects described herein to region-ambiguous variant calling methods, including accurate detection of variants relative to paralogous regions of a reference genome requiring performance of the operations with reference to hundreds, thousands, millions, or tens of millions reference sequence locations. In some cases, aspects of the present disclosure enable performance of operations that could not reasonably or practicably be performed in a human mind in a reasonable amount of time given the complexity of the operations performed for set of, e.g., tens of millions of reference sequence locations. Further, improvements to variant call accuracy achieved by aspects described herein reduce downstream processing that might otherwise need to be performed, and improve the accuracy thereof. For example, a downstream process such as tertiary analysis performed using the identified variants may be more accurate and efficient, as aspects described herein provide more accurate variant sets on which such downstream processing operates. Tertiary analysis on more accurate variant sets can be executed, and the results trusted in a more routine manner than conventional tertiary analysis operations that may be executed on less accurate variant sets and could thus lead to analysis that needs to be performed again, results that cannot be trusted as much as those results generated by the present disclosure, or a combination of both, as examples.
[0120] FIGS. 17A-17C and 18A-18C depict example processes for MRJD variant calling, in accordance with aspects described herein. The processes may be executed, in one or more examples, by a processor or processing circuitry of one or more computers / computer systems, such as those described herein. In one example, code or instructions implementing process(es) of FIGS. 17A-17C and / or 18A-18C are part of a module, such as code module. In other examples, the code may be included in one or more code modules and / or in one or more code sub-modules of the one or more modules. Different modules or sub-modules, if present, could be executed by the same system or a combination of different systems that may be co-located or remote from each other. Various options are available.IP-2920-PCT Page 32 of 56
[0121] FIGS. 17A-17C depict example processes for performing MRJD variant calling using haplotype-based prior scores, in accordance with aspects described herein. Referring initially to FIG. 17A, the process obtains (1702) a collection of sequence reads of a biological sample, then aligns (1704) the collection of sequence reads to two or more regions in a reference sequence and identifies candidate variant positions and candidate variant alleles. In particular examples, the aligning the collection of reads to the two or more regions and the identifying the candidate variant positions includes (i) aligning the two or more regions to obtain a full alignment of the two or more regions and a universal coordinate system, (ii) breaking the full alignment into windows of a smaller length than the full alignment (such as windows of 900 bp length), (iii) for each window of the windows, extracting and mapping reads, of the collection of reads, to the universal coordinate system to provide a joint pileup, and (iv) extracting the candidate variant positions based on the joint pileup.
[0122] At this point, the process enters into an iterative candidate search, which generates and scores JHS candidates, and optionally prunes the collection of candidates along the way. Entering into the ‘loop’, the process introduces (1706) one or more next candidate variant positions and then generates (1708) a respective collection of JHS candidates. Generating the collection generates the JHS candidates. Each such JHS candidate includes a respective set of haplotypes, determined based on the identified candidate variant alleles, with unspecified placement as to which region of the two or more regions of the reference sequence each haplotype of the respective set of haplotypes is to be placed. The process then scores (1710) the JHS candidates of the collection. The scoring of a JHS candidate of the JHS candidates uses a composite likelihood of observing the collection of sequence reads (which reads could be read pairs if using paired-end sequencing) given the respective set of haplotypes of that JHS candidate. The scoring could also use a uniform prior score are described herein. This scoring can be done for each such JHS candidate generated.
[0123] The process proceeds with optionally pruning the collection of JHS candidates that have now been scored. Whether the pruning is trigged / performed may be based on a total number of candidates generated at that iteration, which could be dependent on the current iteration of the process. For instance, on the first iteration (which in the context of this disclosure is taken to be the first progression through 1706, 1708, 1710, 1712) the number of JHS candidates produced may be sufficientlyIP-2920-PCT Page 33 of 56low that it is below a pruning threshold such that pruning need not be undertaken on that iteration. On later iterations, the number of JHS candidates might exceed the pruning threshold, thereby triggering pruning at 1712 of those iterations. In any case, pruning (1712) the collection of JHS candidates generated at an iteration produces a set of JHS candidates to carry forward to a next iteration of the iterations of the processing. FIG. 17B, described below, depicts one example of a pruning approach, specifically an odds-ratio-based pruning approach in accordance with aspects described herein, that may be performed at 1712. Any given pruning approach can produce a resulting collection of JHS candidates. That resulting set of JHS candidates may be the set of JHS candidates to carry forward to the next iteration. Alternatively, there may be instances in which an initially-attempted pruning approach does not satisfactorily prune the collection of candidates. In that case, the process could trigger activation of a different pruning approach. Thus, based on a resulting collection of JHS candidates (from the initial pruning approach) having a total number of JHS candidates that still exceeds a pruning threshold, the process could switch to / activate an alternative pruning approach, different from the initial pruning approach.
[0124] Another pruning approach in accordance with aspects herein is haplotypebased priors pruning, a process for which is described below and depicted with reference to FIG. 17C. The haplotype-based prior pruning approach may be activated if an initial pruning approach, such as the odds-ratio based pruning approach, does not satisfactorily prune down the collection of candidates.
[0125] In a situation in which an alternative pruning approach is used, then the collection of JHS candidates that results from that alternative approach may be the candidates that are carried forward to the next iteration. In some examples, more than two pruning approaches may be needed to achieve a satisfactory number of candidates with which to move forward.
[0126] Continuing with FIG. 17A, the process determines (1714) whether there are additional candidate variant positions that have not been introduced. If so, (1714, Y), the process iterates by returning to 1706 to introduce a next one or more candidate variant positions and repeat though 1712.
[0127] In this manner, the processing of FIG. 17A generates JHS candidates, each JHS candidate including a respective set of haplotypes, determined based on theIP-2920-PCT Page 34 of 56identified candidate variant alleles, with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed. The process also scores the JHS candidates using a composite likelihood of observing the collection of sequence reads given the respective set of haplotypes of the JHS candidate. The generating and scoring can be performed iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, where, at each iteration of the iterations, a respective collection of JHS candidates is generated, scored, and potentially pruned.
[0128] In examples, scoring a JHS candidate can further use prior probability values based on a uniform probability distribution as described herein. So called ‘uniform priors’ uses a prior model that follows a probability model that is a uniform distribution, meaning that each JHS candidate will be regarded as having the same probability. This could be simplified to a uniform value. In any case, this differs from prior approaches where different candidates could, and often do, have varying prior scores.
[0129] If instead at 1714 it is determined that there are no additional positions to incorporate (1714, N), i.e., the process has obtained a set of JHS candidates having haplotypes that cover the identified candidate variant positions, the process proceeds by generating (1716) joint diplotype (JD) candidates as described herein from the set of JHS candidates, where each JD candidate of the JD candidates has each haplotype (of the respective set of haplotypes of a given JHS candidate of the set of JHS candidates) placed in a respective region of the two or more regions. The process continues by scoring (1718) the JD candidates. Scoring a JD candidate of the JD candidates is based on haplotype-based prior scores of the placed haplotypes of the respective set of haplotypes of the given JHS candidate, as described herein. This scoring based on haplotype-based prior scores can be performed for each such JD candidate of the JD candidates. Based on the scoring of the JD candidates, the process genotypes (1720) the biological sample based on one or more JD candidates, for instance one or more that are among the highest-scoring candidates, and generates a variant call file that identifies candidate variants based on the genotyping. In examples, the genotyping identifies, at each candidate variant position, what variant is to be reported and the quality to be reported for that variant, among other information.IP-2920-PCT Page 35 of 56This is based on the JD scores. The variant call file may, which could be in any desired format, for instance the VCF format, as an output of the genotyping.
[0130] FIG. 17B depicts an example pruning approach, specifically an odds-ratio- based pruning approach in accordance with aspects described herein. Under this approach, the process determines (1730) an odds-ratio-based score threshold. This may be dynamically set based on a function of (i) the score of the highest-scoring JHS candidate generated at that iteration (Posterior max) and (ii) a prespecified odds-ratio value (odds ratio). In examples, the function is the fraction Posterior max / odds- ratio. Using this score threshold, the process eliminates (1732) any and each JHS candidate generated at that iteration that is scored below the score threshold. This produces a resulting collection of JHS candidates, each having been scored to meet or exceed the score threshold. As noted above, this may or may not be the set of candidates carried to the next iteration of the candidate searching of FIG. 17A - it can depend on whether the resulting set is sufficiently small, for instance, to enable efficient subsequent processing.
[0131] FIG. 17C depicts another example pruning approach, specifically a haplotype-based priors pruning approach in accordance with aspects described herein. This approach relies on placed haplotypes, and consequently if the candidates are JHS candidates as in the example of FIG. 17A, the process initially expands (1740) the collection of JHS candidates generated at the iteration into a collection of proposed JD candidates having haplotypes with proposed placements. If instead the candidates had placed haplotypes, then those candidates can be used as-is. The process continues by scoring (1742) the proposed JD candidates based on haplotype-based prior scores of the haplotypes with the proposed placements, as described herein. The process prunes (1744) the collection of proposed JD candidates based on the scoring of those proposed JD candidates to obtain a resulting set of JD candidates. The particular pruning approach used in conjunction with the haplotype-based priors scoring could be any desired approach, for instance a conventional approach of taking the top n highest-scoring candidates, or an odds-ratio based approach applied to this set, as examples. In any case, the process collapses (1746) the resulting set of JD candidates to obtain an updated collection of JHS candidates. This updated collection of JHS candidates may or may not be the set of JHS candidates carried to the next iteration of the candidate searching of FIG. 17A - it can depend on whether the resulting set isIP-2920-PCT Page 36 of 56sufficiently small, for instance, to enable efficient subsequent processing. It is noted that in some examples, the collapsing (1746) need not be performed if the candidate searching of FIG. 17A were using candidates with placed haplotypes instead of JHS candidates, since the resulting set of candidates after pruning (1744) could be used as- is.
[0132] FIGS. 18A-18C depict example processes for performing MRJD variant calling using a copy-number-aware model, in accordance with aspects described herein. Referring initially to FIG. 18A, the process obtains (1802) a collection of sequence reads of a biological sample, then aligns (1804) the collection of sequence reads to two or more regions in a reference sequence and identifies candidate variant positions and candidate variant alleles. The aligning could be performed using any desired approach, for instance an approach as described herein. The process also determines (1806) a total copy number of the two or more regions. Determination of the copy number could be performed using any of various methods, either known or later developed.
[0133] The process of FIG. 18A proceeds at this point with aspects to ultimately generate, using the determined total copy numberjoint genotype candidates with haplotypes placed in respective regions of the two or more regions. The particular candidate searching and pruning approaches used to accomplish this could be any approaches desired, including conventional approach(es) and / or those described herein. In any case, due to use and incorporation of the total copy number, at least some of the resulting joint genotype candidates include different numbers of haplotypes per region than others of the joint genotype candidates. These can be scored for then genotyping the sample.
[0134] Accordingly, and proceeding with FIG. 18 A, the process enters into an iterative candidate search, which iteratively generates, scores, and optionally prunes, collections of candidates along the way. Entering the ‘loop’, the process introduces (1808) one or more next candidate variant positions, then generates (1810) haplotype set candidates. Each haplotype set candidate includes a set of haplotypes that are either placed into regions or are unplaced (i.e., as JHS candidates). A respective set of haplotype set candidates is generated at each iteration, with the haplotypes in a given haplotype set candidate being determined based on the identified candidate variantIP-2920-PCT Page 37 of 56alleles that have been introduced to that point of the processing. The process then scores (1812) the haplotype set candidates of the set of haplotype set candidates generated at that iteration. The scoring of a haplotype set candidate can follow any scoring approach desired, including but not limited to scoring approaches described herein, for instance one based on a composite likelihood of observing the collection of sequence reads (which reads could be read pairs if using paired-end sequencing) given the respective haplotypes of that haplotype set candidate. The scoring could also use a uniform prior approach, if desired. The scoring can be performed for each such haplotype set candidate generated.
[0135] The process proceeds with optionally pruning (1814) the respective set / collection of haplotype set candidates generated at that generation that have now been scored at 1812. Whether the pruning is trigged / performed may be based on a total number of candidates generated at that iteration, which could be dependent on the current iteration of the process. For instance, on the first iteration (which in the context of this disclosure is taken to be the first progression through 1808, 1810, 1812, 1814) the number of candidates produced may be sufficiently low that it is below a pruning threshold such that pruning need not be undertaken on that iteration. On later iterations, the number of candidates might exceed the pruning threshold, therefore triggering pruning at 1814 of those iterations. In any case, pruning (1814) the respective set of haplotype set candidates generated at an iteration produces a resulting set of haplotype set candidates to carry forward to a next iteration of the iterations of the processing. FIG. 18B, described below, depicts one example of a pruning approach, specifically an odds-ratio-based pruning approach in accordance with aspects described herein, that may be performed at 1814. Any given pruning approach can produce a resulting collection of haplotype set candidates. That resulting set of haplotype set candidates may be the set of haplotype set candidates to carry forward to the next iteration. Alternatively, there may be instances in which an initially-attempted pruning approach does not satisfactorily prune the collection of candidates. In that case, the process could trigger activation of a different pruning approach. Thus, based on a resulting collection of haplotype set candidates (from the initial pruning approach) having a total number of haplotype set candidates that exceeds a pruning threshold, the process could switch to / activate an alternative pruning approach, different from the initial pruning approach.IP-2920-PCT Page 38 of 56
[0136] Another pruning approach in accordance with aspects herein is haplotypebased priors pruning, a process for which is described below and depicted with reference to FIG. 18C. The haplotype-based prior pruning approach may be activated if an initial pruning approach, such as the odds-ratio based pruning approach, does not satisfactorily prune down the collection of candidates.
[0137] In a situation in which an alternative pruning approach is used, then the collection of haplotype set candidates that results from that alternative may be the candidates that are carried forward to the next iteration. In some examples, more than two pruning approaches may be needed to achieve a satisfactory number of candidates with which to move forward.
[0138] Continuing with FIG. 18 A, the process determines (1816) whether there are additional candidate variant positions that have not been introduced. If so, (1816, Y), the process iterates by returning to 1808 to introduce a next one or more candidate variant positions and repeating though 1814.
[0139] In this manner, the processing of FIG. 18A generates and scores iteratively, in iterations, based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, where, at each iteration of the iterations, a respective set of haplotype set candidates is generated, scored, and potentially pruned.
[0140] If at 1816 it is determined that there are no additional positions to incorporate (1816, N), then the iterative performance of the generating and scoring has produced a final set of haplotype set candidates having haplotypes that cover the identified candidate variant positions. This set could be pruned, if desired, using any desired pruning approach. In any case, a final set of haplotype set candidates yields the joint genotype candidates for scoring. In this regard, the haplotype set candidates could already have haplotypes placed into regions, in which case the joint genotypes can be the haplotype set candidates themselves. Alternatively, in a situation where the haplotype set candidates have unplaced haplotypes (i.e., they are JHS candidates), then the process can generate (1818) the JG candidates from the haplotype set candidates. The JGs are scored (1820) and then, based on the scoring of the JG candidates, the process genotypes (1822) the biological sample based on one or more best-scoring JG candidates and generates a variant call file that identifies candidateIP-2920-PCT Page 39 of 56variants based on the genotyping. In examples, the genotyping identifies, at each candidate variant position, what variant is to be reported and the quality to be reported for that variant, among other information. This is based on the JG scores. The variant call file may, which could be in any desired format, for instance the VCF format, as an output of the genotyping.
[0141] As noted, each haplotype set candidate of the generated haplotype set candidates could be a JHS candidate that includes a respective set of haplotypes determined based on the identified candidate variant alleles and with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed. In this copynumber-aware approach, the respective set of haplotypes has a cardinality based on (equal to) the determined copy number, as described herein.
[0142] In examples, scoring a JHS candidate can further use prior probability values based on a uniform probability distribution as described herein. So called ‘uniform priors’ uses a prior model that follows a probability model that is a uniform distribution, meaning that each JHS candidate will be regarded as having the same probability. This could be simplified to a uniform value. In any case, this differs from prior approaches where different candidates could, and often do, have varying prior scores.
[0143] FIG. 18B depicts an example pruning approach, specifically an odds-ratio- based pruning approach in accordance with aspects described herein. Under this approach, the process determines (1830) an odds-ratio-based score threshold. This sets a score threshold based on a function of (i) the score of the highest-scoring haplotype set candidate generated at that iteration (Posterior max) and (ii) a prespecified odds-ratio value (odds ratio). In examples, the function is the fraction Posterior max / odds-ratio. Using this score threshold, the process eliminates (1832) from the respective set of haplotype set candidates any and each haplotype set candidate generated at that iteration that is scored below the score threshold. This produces a resulting collection of haplotype set candidates, each having been scored to meet or exceed the score threshold. As noted above, this may or may not be the set of candidates carried to the next iteration of the candidate searching of FIG. 18A - itIP-2920-PCT Page 40 of 56can depend on whether the resulting set is sufficiently small, for instance, to enable efficient subsequent processing.
[0144] FIG. 18C depicts another example pruning approach, specifically a haplotype-based priors pruning approach in accordance with aspects described herein. This approach can be used at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration using the haplotype-based priors pruning approach to produce a set of joint haplotype candidates to carry forward to a next iteration of the iterations. The process includes determining (1840), based on the respective set of haplotype set candidates, a collection of proposed candidates having haplotypes with proposed placements. In the iterative candidate searching of FIG.18A the haplotype set candidates might or might not have placed haplotypes as described herein. Since the haplotype-based prior pruning approach relies on placed haplotypes, the determining 1840 determines the particular proposed candidates with which to proceed. If the haplotype set candidates already have placed haplotypes, then the determined collection of proposed candidates can be the existing set of candidates generated at the iteration. If instead the candidates have unplaced haplotypes, for instance they are JHS candidates as described herein, the determining 1840 can expand the set of haplotype set candidates generated at the iteration into a collection of proposed candidates having haplotypes with proposed placements.
[0145] The process continues by scoring (1842) the proposed candidates based on haplotype-based prior scores of the haplotypes with the proposed placements, as described herein. The process then prunes (1844) the collection of proposed candidates based on the scoring of those proposed candidates to obtain a resulting set of candidates. The particular pruning approach used in conjunction with the haplotype-based priors pruning approach could be any desired approach, for instance a conventional approach of taking the top n highest-scoring candidates, or an odds- ratio based approach applied to this set, as examples.
[0146] For similar reasons as those discussed with reference to 1840 above (i.e., the need to expand the haplotype set candidates if they have unplaced haplotypes), if the candidate searching works with candidates with unplaced haplotypes, then it may be necessary after pruning (1844) to collapse the set of candidates that results from the pruning 1844, in order to obtain an updated collection of candidates. That updatedIP-2920-PCT Page 41 of 56collection of JHS candidates may or may not be the set of haplotype set candidates to carry forward to the next iteration of the candidate searching of FIG. 18A - it can depend on whether the resulting set is sufficiently small, for instance, to enable efficient subsequent processing.
[0147] If instead the candidate searching uses placed haplotypes, then this collapsing need not be performed and the resulting set of haplotype set candidates after pruning (1844) could be used as-is.
[0148] Processes described herein may be performed singly or collectively by one or more computer systems, such as one or more computer systems of, or in communication with, a sequencing / sequencer device, or any other computer system(s), as examples. FIG. 19 depicts one example of such a computer system and associated devices to incorporate and / or use aspects described herein. A computer system may also be referred to herein as a data processing device / system, computing device / system / node, or simply a computer. The computer system may be based on one or more of various system architectures and / or instruction set architectures.
[0149] FIG. 19 shows a computer system 1900 in communication with external device(s) 1912. Computer system 1900 includes one or more processor(s) 1902, for instance central processing unit(s) (CPUs). A processor can include functional components used in the execution of instructions, such as functional components to fetch program instructions from locations such as cache or main memory, decode program instructions, and execute program instructions, access memory for instruction execution, and write results of the executed instructions. A processor 1902 can also include register(s) to be used by one or more of the functional components. Computer system 1900 also includes memory 1904, input / output (VO) devices 1908, and I / O interfaces 1910, which may be coupled to processor(s) 1902 and each other via one or more buses and / or other connections. Bus connections represent one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA), the Micro Channel Architecture (MCA), the Enhanced ISA (EISA), the Video ElectronicsIP-2920-PCT Page 42 of 56Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI).
[0150] Memory 1904 can be or include main or system memory (e.g. Random Access Memory) used in the execution of program instructions, storage device(s) such as hard drive(s), flash media, or optical media as examples, and / or cache memory, as examples. Memory 1904 can include, for instance, a cache, such as a shared cache, which may be coupled to local caches (examples include LI cache, L2 cache, etc.) of processor(s) 1902. Additionally, memory 1904 may be or include at least one computer program product having a set (e.g., at least one) of program modules, instructions, code or the like that is / are configured to carry out functions of embodiments described herein when executed by one or more processors.
[0151] Memory 1904 can store an operating system 1905 and other computer programs 1906, such as one or more computer programs / applications that execute to perform aspects described herein. Specifically, programs / applications can include computer readable program instructions that may be configured to carry out functions of embodiments of aspects described herein.
[0152] Examples of I / O devices 1908 include but are not limited to microphones, speakers, Global Positioning System (GPS) devices, cameras, lights, accelerometers, gyroscopes, magnetometers, sensor devices configured to sense light, proximity, heart rate, body and / or ambient temperature, blood pressure, and / or skin resistance, and activity monitors. An I / O device may be incorporated into the computer system as shown, though in some embodiments an EO device may be regarded as an external device (1912) coupled to the computer system through one or more I / O interfaces 1910.
[0153] Computer system 1900 may communicate with one or more external devices 1912 via one or more EO interfaces 1910. Example external devices include a keyboard, a pointing device, a display, a sequencing instrument, and / or any other devices that enable a user to interact with computer system 1900. Other example external devices include any device that enables computer system 1900 to communicate with one or more other computing systems or peripheral devices such as a printer. A network interface / adapter is an example EO interface that enables computer system 1900 to communicate with one or more networks, such as a localIP-2920-PCT Page 43 of 56area network (LAN), a general wide area network (WAN), and / or a public network (e.g., the Internet), providing communication with other computing devices or systems, storage devices, or the like. Ethernet-based (such as Wi-Fi) interfaces and Bluetooth® adapters are just examples of the currently available types of network adapters used in computer systems (BLUETOOTH is a registered trademark of Bluetooth SIG, Inc., Kirkland, Washington, U.S.A.).
[0154] The communication between I / O interfaces 1910 and external devices 1912 can occur across wired and / or wireless communications link(s) 1911, such as Ethernet-based wired or wireless connections. Example wireless connections include cellular, Wi-Fi, Bluetooth®, proximity-based, near-field, or other types of wireless connections. More generally, communications link(s) 1911 may be any appropriate wireless and / or wired communication link(s) for communicating data.
[0155] Particular external device(s) 1912 may include one or more data storage devices, which may store one or more programs, one or more computer readable program instructions, and / or data, etc. Computer system 1900 may include and / or be coupled to and in communication with (e.g. as an external device of the computer system) removable / non-removable, volatile / non-volatile computer system storage media. For example, it may include and / or be coupled to a non-removable, nonvolatile magnetic media (typically called a "hard drive"), a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and / or an optical disk drive for reading from or writing to a removable, nonvolatile optical disk, such as a CD-ROM, DVD-ROM or other optical media.
[0156] Computer system 1900 may be operational with numerous other general purpose or special purpose computing system environments or configurations. Computer system 1900 may take any of various forms, well-known examples of which include, but are not limited to, personal computer (PC) system(s), server computer system(s), such as messaging server(s), thin client(s), thick client(s), workstation(s), laptop(s), handheld device(s), mobile device(s) / computer(s) such as smartphone(s), tablet(s), and wearable device(s), multiprocessor system(s), microprocessor-based system(s), telephony device(s), network appliance(s) (such as edge appliance(s)), virtualization device(s), storage controller(s), set top box(es), programmable consumer electronic(s), network PC(s), minicomputer system(s),IP-2920-PCT Page 44 of 56mainframe computer system(s), and distributed cloud computing environment(s) that include any of the above systems or devices, and the like.
[0157] Aspects of the present invention may be a system, a method, and / or a computer program product, any of which may be configured to perform or facilitate aspects described herein.
[0158] In some embodiments, aspects may take the form of a computer program product, which may be embodied as computer readable medium(s). A computer readable medium may be a tangible storage device / medium having computer readable program code / instructions stored thereon. Example computer readable medium(s) include, but are not limited to, electronic, magnetic, optical, or semiconductor storage devices or systems, or any combination of the foregoing. Example embodiments of a computer readable medium include a hard drive or other mass-storage device, an electrical connection having wires, random access memory (RAM), read-only memory (ROM), erasable-programmable read-only memory such as EPROM or flash memory, an optical fiber, a portable computer disk / diskette, such as a compact disc read-only memory (CD-ROM) or Digital Versatile Disc (DVD), an optical storage device, a magnetic storage device, or any combination of the foregoing. The computer readable medium may be readable by a processor, processing unit, or the like, to obtain data (e.g. instructions) from the medium for execution. In a particular example, a computer program product is or includes one or more computer readable media that includes / stores computer readable program code to provide and facilitate one or more aspects described herein.
[0159] As noted, program instruction contained or stored in / on a computer readable medium can be obtained and executed by any of various suitable components such as a processor of a computer system to cause the computer system to behave and function in a particular manner. Such program instructions for carrying out operations to perform, achieve, or facilitate aspects described herein may be written in, or compiled from code written in, any desired programming language. In some embodiments, such programming language includes object-oriented and / or procedural programming languages such as C, C++, C#, Java, etc.
[0160] Program code can include one or more program instructions obtained for execution by one or more processors. Computer program instructions may beIP-2920-PCT Page 45 of 56provided to one or more processors of, e.g., one or more computer systems, to produce a machine, such that the program instructions, when executed by the one or more processors, perform, achieve, or facilitate aspects of the present invention, such as actions or functions described in flowcharts and / or block diagrams described herein. Thus, each block, or combinations of blocks, of the flowchart illustrations and / or block diagrams depicted and described herein can be implemented, in some embodiments, by computer program instructions.
[0161] Although various embodiments are described above, these are only examples.
[0162] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and / or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and / or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or groups thereof.
[0163] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of one or more embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain various aspects and the practical application, and to enable others of ordinary skill in the art to understand various embodiments with various modifications as are suited to the particular use contemplated.IP-2920-PCT Page 46 of 56
Claims
CLAIMSWhat is claimed is:
1. A computer-implemented method including: obtaining a collection of sequence reads of a biological sample; aligning the collection of sequence reads to two or more regions in a reference sequence, and identifying candidate variant positions and candidate variant alleles; determining a total copy number of the two or more regions; generating, using the determined total copy numberjoint genotype candidates with haplotypes placed in respective regions of the two or more regions, in which at least some of the joint genotype candidates include different numbers of haplotypes per region than others of the joint genotype candidates; and scoring the joint genotype candidates for genotyping the sample.
2. The method of claim 1, wherein the generating the joint genotype candidates includes generating and scoring haplotype set candidates based on the determined copy number, each haplotype set candidate including a respective set of haplotypes, wherein the generating and scoring is performed iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, wherein, at each iteration of the iterations, a respective set of haplotype set candidates is generated.
3. The method of claim 2, further including, based on the iterative performance of the generating and scoring, obtaining a final set of haplotype set candidates having haplotypes that cover the identified candidate variant positions, the final set of haplotype set candidates yielding the joint genotype candidates for performing the scoring of the joint genotype candidates.IP-2920-PCT Page 47 of 564. The method of claim 2, wherein each haplotype set candidate of the generated haplotype set candidates is a joint haplotype set (JHS) candidate that includes a respective set of haplotypes determined based on the identified candidate variant alleles and with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed, the respective set of haplotypes having a cardinality based on the determined copy number.
5. The method of claim 4, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on a uniform probability distribution.
6. The method of claim 4, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on probability distribution that is based on the determined total copy number.
7. The method of claim 2, further including, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration to produce a resulting set of candidates to carry forward to a next iteration of the iterations, wherein the pruning includes using an odds-ratio based pruning approach in which a score threshold is dynamically set based on a function of (i) a score of a highest-scoring haplotype set candidate generated at that iteration and (ii) a prespecified odds-ratio value, and any haplotype set candidate generated at that iteration and being scored below the score threshold is eliminated from the respective set of haplotype set candidates, wherein the pruning using the odds-ratio based pruning method produces a resulting collection of candidates, each having been scored to meet or exceed the score threshold.
8. The method of claim 2, further including, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration using a haplotype-based priors pruning approach to produce a set of joint haplotype candidates to carry forward to a next iteration of the iterations, wherein the haplotype-based priors pruning approach includes: determining, based on the respective set of haplotype set candidates, a collection of proposed candidates having haplotypes with proposed placements;IP-2920-PCT Page 48 of 56scoring the proposed candidates based on haplotype-based prior scores of the haplotypes with the proposed placements; and pruning the collection of proposed candidates based on scoring the proposed candidates to obtain a resulting set of candidates.
9. The method of claim 8, further including activating the haplotypebased priors pruning approach based on determining that an initial pruning approach at that iteration, and applied to the respective set of haplotype set candidates generated at that iteration, produces a resulting collection of candidates having a total number of candidates that exceeds a pruning threshold.
10. The method of claim 1, further including, based on the scoring the joint genotype candidates, performing at least one of: genotyping the biological sample based on one or more best-scoring joint genotype candidates, and generating a variant call file that identifies candidate variants based on the genotyping; detecting and reporting regional copy numbers of the two or more regions; and reporting, for a region of the two or more regions, a variant genotype based on ploidy that follows a regional copy number of the region.
11. A computer system including: a memory; and a processing circuit in communication with the memory, wherein the computer system is configured to perform a method that includes: obtaining a collection of sequence reads of a biological sample; aligning the collection of sequence reads to two or more regions in a reference sequence, and identifying candidate variant positions and candidate variant alleles; determining a total copy number of the two or more regions;IP-2920-PCT Page 49 of 56generating, using the determined total copy numberjoint genotype candidates with haplotypes placed in respective regions of the two or more regions, in which at least some of the joint genotype candidates include different numbers of haplotypes per region than others of the joint genotype candidates; and scoring the joint genotype candidates for genotyping the sample.
12. The computer system of claim 11, wherein the generating the joint genotype candidates includes generating and scoring haplotype set candidates based on the determined copy number, each haplotype set candidate including a respective set of haplotypes, wherein the generating and scoring is performed iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, wherein, at each iteration of the iterations, a respective set of haplotype set candidates is generated.
13. The computer system of claim 12, further including, based on the iterative performance of the generating and scoring, obtaining a final set of haplotype set candidates having haplotypes that cover the identified candidate variant positions, the final set of haplotype set candidates yielding the joint genotype candidates for performing the scoring of the joint genotype candidates.
14. The computer system of claim 12, wherein each haplotype set candidate of the generated haplotype set candidates is a joint haplotype set (JHS) candidate that includes a respective set of haplotypes determined based on the identified candidate variant alleles and with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed, the respective set of haplotypes having a cardinality based on the determined copy number.
15. The computer system of claim 14, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on a uniform probability distribution.IP-2920-PCT Page 50 of 5616. The computer system of claim 14, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on probability distribution that is based on the determined total copy number.
17. The computer system of claim 12, wherein the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration to produce a resulting set of candidates to carry forward to a next iteration of the iterations, wherein the pruning includes using an odds-ratio based pruning approach in which a score threshold is dynamically set based on a function of (i) a score of a highest-scoring haplotype set candidate generated at that iteration and (ii) a prespecified odds-ratio value, and any haplotype set candidate generated at that iteration and being scored below the score threshold is eliminated from the respective set of haplotype set candidates, wherein the pruning using the odds-ratio based pruning method produces a resulting collection of candidates, each having been scored to meet or exceed the score threshold.
18. The computer system of claim 12, wherein the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration using a haplotype-based priors pruning approach to produce a set of joint haplotype candidates to carry forward to a next iteration of the iterations, wherein the haplotype-based priors pruning approach includes: determining, based on the respective set of haplotype set candidates, a collection of proposed candidates having haplotypes with proposed placements; scoring the proposed candidates based on haplotype-based prior scores of the haplotypes with the proposed placements; and pruning the collection of proposed candidates based on scoring the proposed candidates to obtain a resulting set of candidates.
19. The computer system of claim 18, wherein the method further includes activating the haplotype-based priors pruning approach based on determining that an initial pruning approach at that iteration, and applied to the respective set of haplotype set candidates generated at that iteration, produces a resulting collection of candidates having a total number of candidates that exceeds a pruning threshold.IP-2920-PCT Page 51 of 5620. The computer system of claim 11, wherein the method further includes, based on the scoring the joint genotype candidates, performing at least one of: genotyping the biological sample based on one or more best-scoring joint genotype candidates, and generating a variant call file that identifies candidate variants based on the genotyping; detecting and reporting regional copy numbers of the two or more regions; and reporting, for a region of the two or more regions, a variant genotype based on ploidy that follows a regional copy number of the region.
21. A computer program product including: a computer readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit to perform a method that includes: obtaining a collection of sequence reads of a biological sample; aligning the collection of sequence reads to two or more regions in a reference sequence, and identifying candidate variant positions and candidate variant alleles; determining a total copy number of the two or more regions; generating, using the determined total copy numberjoint genotype candidates with haplotypes placed in respective regions of the two or more regions, in which at least some of the joint genotype candidates include different numbers of haplotypes per region than others of the joint genotype candidates; and scoring the joint genotype candidates for genotyping the sample.
22. The computer program product of claim 21, wherein the generating the joint genotype candidates includes generating and scoring haplotype set candidatesIP-2920-PCT Page 52 of 56based on the determined copy number, each haplotype set candidate including a respective set of haplotypes, wherein the generating and scoring is performed iteratively in iterations based on introduction, at each iteration of the iterations, of at least one additional candidate variant position of the identified candidate variant positions, wherein, at each iteration of the iterations, a respective set of haplotype set candidates is generated.
23. The computer program product of claim 22, further including, based on the iterative performance of the generating and scoring, obtaining a final set of haplotype set candidates having haplotypes that cover the identified candidate variant positions, the final set of haplotype set candidates yielding the joint genotype candidates for performing the scoring of the joint genotype candidates.
24. The computer program product of claim 22, wherein each haplotype set candidate of the generated haplotype set candidates is a joint haplotype set (JHS) candidate that includes a respective set of haplotypes determined based on the identified candidate variant alleles and with unspecified placement as to which region, of the two or more regions of the reference sequence, each haplotype of the respective set of haplotypes is to be placed, the respective set of haplotypes having a cardinality based on the determined copy number.
25. The computer program product of claim 24, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on a uniform probability distribution.
26. The computer program product of claim 24, wherein scoring a JHS candidate of the JHS candidates uses a prior probability value based on probability distribution that is based on the determined total copy number.
27. The computer program product of claim 22, wherein the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration to produce a resulting set of candidates to carry forward to a next iteration of the iterations, wherein the pruning includes using an odds-ratio based pruning approach in which a score threshold is dynamically set based on a function of (i) a score of a highest-scoring haplotype set candidate generated at that iteration and (ii) a prespecified odds-ratio value, and anyIP-2920-PCT Page 53 of 56haplotype set candidate generated at that iteration and being scored below the score threshold is eliminated from the respective set of haplotype set candidates, wherein the pruning using the odds-ratio based pruning method produces a resulting collection of candidates, each having been scored to meet or exceed the score threshold.
28. The computer program product of claim 22, wherein the method further includes, at an iteration of the iterations, pruning the respective set of haplotype set candidates generated at that iteration using a haplotype-based priors pruning approach to produce a set of joint haplotype candidates to carry forward to a next iteration of the iterations, wherein the haplotype-based priors pruning approach includes: determining, based on the respective set of haplotype set candidates, a collection of proposed candidates having haplotypes with proposed placements; scoring the proposed candidates based on haplotype-based prior scores of the haplotypes with the proposed placements; and pruning the collection of proposed candidates based on scoring the proposed candidates to obtain a resulting set of candidates.
29. The computer program product of claim 28, wherein the method further includes activating the haplotype-based priors pruning approach based on determining that an initial pruning approach at that iteration, and applied to the respective set of haplotype set candidates generated at that iteration, produces a resulting collection of candidates having a total number of candidates that exceeds a pruning threshold.
30. The computer program product of claim 21, wherein the method further includes, based on the scoring the joint genotype candidates, performing at least one of: genotyping the biological sample based on one or more best-scoring joint genotype candidates, and generating a variant call file that identifies candidate variants based on the genotyping;IP-2920-PCT Page 54 of 56detecting and reporting regional copy numbers of the two or more regions; and reporting, for a region of the two or more regions, a variant genotype based on ploidy that follows a regional copy number of the region.IP-2920-PCT Page 55 of 56