Method for re-alignment of sequencing data reads

By using a computer-implemented method to re-align sequencing data reads, drawing on bilateral positional information and candidate insertions/deletions, the invention addresses the problem of accurately calling insertions/deletions and variants in next-generation sequencing data analysis, improving the accuracy and efficiency of that analysis.

CN117457074B · Active · Publication Date: 2026-06-19 · ILLUMINA INC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ILLUMINA INC
Filing Date: 2017-11-15
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify insertions, deletions, and variants in next-generation sequencing data analysis, especially in repetitive regions and at the ends of reads, leading to false positive and false negative calls.

Method used

By providing a computer-implemented method, sequencing data reads are re-aligned: insertions and deletions indicated by the initial alignment are removed to produce a flattened aligned read, and candidate insertions and deletions are then iteratively introduced into the flattened aligned read using bilateral positional information. The best re-alignment is then selected to reduce mismatches and improve accuracy.

Benefits of technology

It improves the accurate calling of insertions and deletions, reduces false positives and false negatives, and improves the accuracy and efficiency of sequencing data analysis.



Abstract

This invention relates to a method for re-aligning sequencing data reads. One method involves obtaining an initial alignment of a read sequence with a reference sequence from a sequence alignment dataset and performing a re-alignment process on the initial alignment. The re-alignment process includes identifying candidate insertions / deletions, which include zero or more insertions / deletions in the aligned read and zero or more insertions / deletions aligned proximal to the aligned read, as indicated by the sequence alignment dataset; creating a flattened aligned read based at least on removing any insertions / deletions indicated by the initial alignment from the aligned read; and determining candidate re-alignments between the read sequence and the reference sequence based on introducing, for each candidate re-alignment of the candidate re-alignments, at least one corresponding candidate insertion / deletion from the candidate insertions / deletions into the flattened aligned read. The method further provides, based on selection criteria, the initial alignment or a candidate re-alignment selected from the candidate re-alignments.

Description

[0001] This application is a divisional application of the invention patent application with application number 201780077066.2, application date November 15, 2017, and invention title "Method for Re-alignment of Sequencing Data Reads".

Technical Field

[0002] This application relates to a method for re-aligning sequencing data reads.

Background Technology

[0003] A persistent challenge in next-generation sequencing data analysis is the accurate calling of insertions and deletions (“indels”). This difficulty stems from factors including low incidence rates, difficulty in mapping to the correct locations within the genome, and the presence of repetitive regions in the genome that prevent unique mapping. Another reason is that current alignment tools fail to correctly identify variants at the ends of reads, or do so with insufficient accuracy, owing to the lack of two-sided context information for calling variants.

Summary of the Invention

[0004] By providing computer-implemented methods, computer systems, and computer program products, the shortcomings of the prior art are overcome and additional advantages are provided.

[0005] According to one embodiment, a computer-implemented method for re-aligning sequencing data reads includes: obtaining an initial alignment of a read sequence with a reference sequence from a sequence alignment dataset, the initial alignment including an aligned read; performing a re-alignment process on the initial alignment, the re-alignment process re-aligning the read sequence with the reference sequence to generate one or more candidate re-alignments, the re-alignment process including: identifying one or more candidate insertions / deletions, the one or more candidate insertions / deletions including zero or more insertions / deletions in the aligned read and zero or more insertions / deletions aligned proximal to the aligned read, as indicated by the sequence alignment dataset; creating a flattened aligned read based at least on removing any insertions / deletions indicated by the initial alignment from the aligned read; and determining the one or more candidate re-alignments of the read sequence with the reference sequence based on introducing, for each candidate re-alignment of the one or more candidate re-alignments, at least one corresponding candidate insertion / deletion from the one or more candidate insertions / deletions into the flattened aligned read; and providing, based on one or more selection criteria, the initial alignment or a candidate re-alignment selected from the one or more candidate re-alignments.
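The flattening step can be illustrated with a minimal sketch, assuming alignments carry SAM-style CIGAR strings. The helper name `flatten_cigar` and the choice to preserve clipping operations unchanged are illustrative assumptions and are not taken from the text.

```python
import re

# Hypothetical helper (not from the source): collapse a SAM CIGAR string
# into an indel-free form.
# Assumptions: inserted read bases are kept as aligned (M) bases; deletions (D),
# skipped regions (N) and padding (P) are dropped; clips (S, H) are preserved.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def flatten_cigar(cigar: str) -> str:
    merged = []  # list of [length, op], adjacent equal ops merged
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XI":
            op = "M"          # every read base becomes an aligned base
        elif op in "DNP":
            continue          # consumes no read bases: removed when flattening
        if merged and merged[-1][1] == op:
            merged[-1][0] += n
        else:
            merged.append([n, op])
    return "".join(f"{n}{op}" for n, op in merged)


print(flatten_cigar("3M2I4M1D5M"))  # 14M
print(flatten_cigar("2S3M2D4M"))    # 2S7M
```

The flattened read keeps its original length, so candidate insertions / deletions can later be introduced into it one set at a time.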

[0006] The one or more candidate insertions / deletions may include a plurality of candidate insertions / deletions, and determining the one or more candidate re-alignments may include iteratively introducing the plurality of candidate insertions / deletions into the flattened aligned read, wherein each iteration introduces a corresponding at least one candidate insertion / deletion into the flattened aligned read and provides one candidate re-alignment of the one or more candidate re-alignments.

[0007] The iterative introduction can introduce one or more permutations of candidate insertions / deletions from the plurality of candidate insertions / deletions into the flattened aligned read, so as to obtain a different candidate re-alignment for each of the one or more permutations.

[0008] The realignment process may further include: examining a provided candidate realignment of the one or more candidate realignments to determine whether its aligned read (i.e., the flattened aligned read with the one or more corresponding candidate insertions / deletions introduced) aligns with the reference sequence with no mismatched bases between the aligned read and the reference sequence; stopping the iterative introduction based on determining that the aligned read of the provided candidate realignment aligns with the reference sequence with no mismatched bases; and selecting the provided candidate realignment as the selected candidate realignment, wherein the selected candidate realignment is output based on its aligned read aligning with the reference sequence.
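The iterative introduction with early stopping can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the read is left-anchored at a fixed start, candidates are tried as unordered combinations of up to `max_indels` (rather than ordered permutations), and the names `Indel`, `count_mismatches`, `realign`, and the toy reference string are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple, Optional, Sequence, Tuple


class Indel(NamedTuple):
    pos: int   # 0-based reference index where the indel starts
    kind: str  # "D" for deletion, "I" for insertion
    seq: str   # deleted reference bases, or inserted bases


def count_mismatches(read: str, ref: str, start: int,
                     indels: Sequence[Indel]) -> Optional[int]:
    """Walk a left-anchored flattened read along ref while applying indels.

    Returns the mismatch count, or None if the indel set cannot be applied.
    """
    events = {i.pos: i for i in indels}
    if len(events) != len(indels):
        return None                   # two indels at one position: reject
    r, g, mismatches = 0, start, 0    # r: read index, g: reference index
    while r < len(read):
        ev = events.pop(g, None)
        if ev is not None and ev.kind == "D":
            g += len(ev.seq)          # skip the deleted reference bases
            continue
        if ev is not None and ev.kind == "I":
            ins = read[r:r + len(ev.seq)]
            mismatches += sum(a != b for a, b in zip(ins, ev.seq))
            r += len(ins)             # consume the inserted read bases
            continue
        if g >= len(ref):
            return None               # read runs off the reference
        mismatches += read[r] != ref[g]
        r += 1
        g += 1
    return None if events else mismatches  # unused indels: not introduced


def realign(read: str, ref: str, start: int, candidates: Sequence[Indel],
            max_indels: int = 2) -> Tuple[Tuple[Indel, ...], Optional[int]]:
    """Try candidate sets in order; stop at the first zero-mismatch result."""
    best_combo: Tuple[Indel, ...] = ()
    best = count_mismatches(read, ref, start, ())
    for k in range(1, max_indels + 1):
        for combo in combinations(candidates, k):
            m = count_mismatches(read, ref, start, combo)
            if m is None:
                continue
            if m == 0:
                return combo, 0       # perfect re-alignment: stop iterating
            if best is None or m < best:
                best_combo, best = combo, m
    return best_combo, best


ref = "ATCGTATTCGAA"                  # toy reference, invented for this sketch
print(realign("TCGTACG", ref, 1, [Indel(6, "D", "TT")]))
```

In this toy case the ungapped read has two mismatches at its end, and introducing the two-base deletion removes both, so the loop stops at the first candidate.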

[0009] The re-alignment process may further include prioritizing the plurality of candidate insertions / deletions for the iterative introduction, wherein the iterative introduction introduces the plurality of candidate insertions / deletions in order of priority.

[0010] The prioritization can prioritize insertions / deletions that are indicated as expected by a reference insertion / deletion dataset over insertions / deletions that are not so indicated. Alternatively, the prioritization can prioritize longer insertions / deletions over shorter ones. Alternatively, the prioritization can prioritize insertions / deletions indicated by a larger number of aligned reads in the sequence alignment dataset over those indicated by a smaller number of aligned reads. Alternatively, the prioritization can prioritize insertions / deletions indicated by a larger proportion of the aligned reads that correspond to the position of the insertion / deletion relative to the reference sequence over those indicated by a smaller proportion. Alternatively, among different insertions / deletions indicated by the same number of aligned reads in the sequence alignment dataset, the prioritization can prioritize the insertion / deletion whose indicated position relative to the reference genome sequence is upstream of the indicated position of the other insertion / deletion.
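One way to picture such a prioritization is as a sort key. The text presents these rules as alternatives, so chaining them into a single key in this particular order is an assumption made only for illustration; the `Candidate` fields and the `prioritize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    pos: int        # position relative to the reference sequence
    length: int     # number of inserted or deleted bases
    support: int    # number of aligned reads indicating this indel
    in_prior: bool  # listed in the reference indel dataset (e.g. a prior VCF)


def prioritize(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Order candidates for iterative introduction (highest priority first).

    Assumed key order: expected indels first, then more supporting reads,
    then longer indels, then the more upstream position as a tie-break.
    """
    return sorted(
        candidates,
        key=lambda c: (not c.in_prior, -c.support, -c.length, c.pos),
    )


a = Candidate(pos=100, length=2, support=5, in_prior=False)
b = Candidate(pos=90, length=1, support=5, in_prior=False)
c = Candidate(pos=200, length=1, support=1, in_prior=True)
d = Candidate(pos=80, length=2, support=5, in_prior=False)
print([x.pos for x in prioritize([a, b, c, d])])  # [200, 80, 100, 90]
```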

[0011] The selection criteria may be based, at least in part, on one or more of the following: the number of mismatched bases, the number of insertions / deletions, the position of the insertions / deletions relative to the reference genome sequence indicated by the sequence alignment dataset, and the number of soft-clipped bases.

[0012] The selection criteria may apply one or more of the following preferences when choosing the alignment to provide: an alignment with no insertions / deletions and only one mismatched base is preferred over an alignment with one or more insertions / deletions; an alignment with fewer mismatched bases is preferred over an alignment with more mismatched bases; among different alignments with the same number of mismatched bases, an alignment with fewer soft clips of a specific type is preferred over an alignment with more soft clips of that type; and among different alignments with the same number of mismatched bases, an alignment with fewer insertions / deletions is preferred over an alignment with more insertions / deletions.
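These preferences can be sketched as a pairwise comparison. The text does not state how the two tie-breaks among equal mismatch counts are ordered relative to each other, so applying soft clips before insertions / deletions is an assumption; `Summary`, `rank`, and `prefer` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Summary(NamedTuple):
    mismatches: int  # mismatched bases against the reference
    softclips: int   # soft clips of the type being compared
    indels: int      # insertions / deletions in the alignment


def rank(s: Summary) -> Tuple[int, int, int]:
    # Fewer mismatches first; then (assumed order) fewer soft clips, fewer indels.
    return (s.mismatches, s.softclips, s.indels)


def prefer(a: Summary, b: Summary) -> Summary:
    """Return the preferred alignment summary; ties keep `a`."""
    # Special case: an indel-free alignment with exactly one mismatch
    # is preferred over any alignment that needs an indel.
    for x, y in ((a, b), (b, a)):
        if x.indels == 0 and x.mismatches == 1 and y.indels >= 1:
            return x
    return b if rank(b) < rank(a) else a


initial = Summary(mismatches=1, softclips=0, indels=0)
candidate = Summary(mismatches=0, softclips=0, indels=1)
print(prefer(initial, candidate) is initial)  # True
```

Passing the initial alignment as `a` means that a candidate re-alignment replaces it only when the candidate is strictly better under these rules.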

[0013] The re-alignment process may further include selecting the best candidate re-alignment among the one or more candidate re-alignments based on a first criterion among the one or more selection criteria, wherein the selected candidate re-alignment is the selected best candidate re-alignment, and wherein the output is selected between the initial alignment and the best re-alignment candidate based on a second criterion among the one or more selection criteria.

[0014] An embodiment of the computer-implemented method may further include determining whether the obtained initial alignment is eligible for re-alignment, the determination being based at least in part on one or more of the following: identifying whether there are one or more mismatched bases between the aligned read of the initial alignment and the reference sequence; identifying whether the aligned read includes soft clips; identifying whether the initial alignment is not a secondary alignment; and identifying whether there are candidate insertions / deletions around the aligned read in the base region of the reference genome sequence of the sequence alignment dataset.

[0015] An embodiment of the computer implementation method may further include determining whether the obtained initial alignment is suitable for realignment, and performing the realignment process and providing the initial alignment or selected candidate realignment based on determining that the obtained initial alignment is suitable for realignment; repeating the process of obtaining and determining whether the obtained additional initial alignment is suitable for realignment for each additional initial alignment in one or more additional initial alignments of the sequence alignment dataset; and processing for each additional initial alignment in one or more additional initial alignments, the processing including (i) providing the additional initial alignment as is without performing the realignment process, or (ii) performing the realignment process and providing the additional initial alignment or selected candidate realignment.

[0016] Furthermore, a computer system for re-aligning sequencing data reads includes a memory and at least one processor, which can be configured to execute program instructions to perform methods according to the various aspects described herein.

[0017] In addition, a computer program product for re-aligning sequencing data reads includes a tangible storage medium storing program instructions for execution to perform methods described herein.

[0018] Additional features and advantages are achieved through the concepts described herein. Numerous inventive aspects and features are disclosed herein, and unless inconsistent, each disclosed aspect or feature may be combined with any other disclosed aspects or features desired for a particular application, such as to facilitate the detection of image obstacles.

Attached Figure Description

[0019] In the claims at the conclusion of the specification, the various aspects described herein are specifically pointed out and clearly claimed as examples. The foregoing and other objects, features, and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:

[0020] Figures 1A-1D illustrate how bilateral positional information can be used to interpret variant bases toward the end of a read;

[0021] Figure 2 illustrates the clearing of reads for processing according to aspects described herein;

[0022] Figure 3 depicts an exemplary method for processing an initial alignment according to aspects described herein;

[0023] Figures 4A and 4B depict exemplary position maps of reads containing soft clips, insertions, and deletions according to aspects described herein;

[0024] Figures 5A-5C depict the flattening of aligned reads according to aspects described herein;

[0025] Figures 6A-6D depict how candidate insertions / deletions are introduced into flattened aligned reads according to aspects described herein;

[0026] Figure 7 depicts an example of re-alignment processing of a read according to aspects described herein;

[0027] Figure 8 depicts an exemplary process for selecting the best candidate re-alignment according to aspects described herein;

[0028] Figure 9 depicts an exemplary process for re-alignment to targets according to aspects described herein;

[0029] Figures 10A and 10B depict an exemplary process for obtaining left-anchored and right-anchored re-alignment results according to aspects described herein;

[0030] Figure 11 depicts an exemplary process for adding insertions / deletions to obtain a re-alignment result according to aspects described herein;

[0031] Figure 12 depicts the distribution of variant lengths used in the simulation analysis according to aspects described herein;

[0032] Figure 13 depicts possible outcomes of the evaluation of truth variants according to aspects described herein;

[0033] Figure 14 depicts true and false positive rates for simulated BAMs generated by iSAAC using priors, with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein;

[0034] Figure 15 depicts true and false positive rates for simulated BAMs generated by iSAAC using priors, with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein;

[0035] Figure 16 depicts the total somatic mutation count per sample for samples that were not realigned, realigned using GATK, or realigned according to aspects of the realignment methods disclosed herein;

[0036] Figure 17 depicts somatic mutation counts per sample, broken down by mutation type, for samples that were not realigned, realigned using GATK, or realigned according to aspects of the realignment methods disclosed herein;

[0037] Figure 18 depicts the realignment time per million alignments for GATK and for the realignment method according to aspects described herein;

[0038] Figure 19 depicts an exemplary process for sequence alignment processing according to aspects described herein;

[0039] Figure 20 depicts an exemplary process for realigning sequencing data reads according to aspects described herein;

[0040] Figure 21 depicts an exemplary process for determining the eligibility of an initial alignment to undergo sequencing data read realignment processing according to aspects described herein;

[0041] Figure 22 depicts an example of a computer system and related apparatus that incorporate and / or use aspects described herein;

[0042] Figure 23 depicts an example of a sequencing device that can be used in conjunction with aspects described herein; and

[0043] Figure 24 depicts an example of a cloud computing environment according to aspects described herein.

Detailed Implementation

[0044] The development of next-generation sequencing (NGS) technology has revolutionized gene sequencing, allowing the generation of numerous copies of gene sequences (e.g., from an organism's genome) and the alignment of these copies to presumptively reconstruct the nucleotide sequence from which the copies were created. By identifying the sequence of nucleotide base pairs in the aligned copies, the nucleotide sequence in the original sequence can be determined. One use of this technology is for the identification, understanding, prevention, treatment, or cure of diseases. For example, NGS can be used to identify an individual's genome sequence to determine whether the individual possesses nucleotide sequences that are considered to be the basis of a specific disease or that indicate a susceptibility to a specific disease, or to identify sequences that might do so, or to determine whether a given pharmacological or other treatment might be beneficial in treating the individual's condition.

[0045] The sheer volume of sequence information that must be processed to derive nucleotide sequences from the alignment of its copies is often substantial. For example, the human genome contains approximately three billion base pairs. The ability to determine such a large nucleotide sequence requires advanced computer processing techniques. For instance, synthesizing numerous copies of certain overlapping and / or adjacent portions of a large gene sequence (e.g., billions of nucleotides in the entire reference genome, tens or hundreds of millions of nucleotides in one or more chromosomes, or long portions of chromosomes or other genomic sequences) via high-throughput processing and subsequently aligning them to reproduce and identify the nucleotide sequences of the copies typically requires processing vast amounts of data by computer.

[0046] In many cases, errors may exist, leading to inaccurate representations of genomic sequences in the alignments they create. A crucial component of NGS technology is the ability to identify and correct these errors. In the case of sequencing large gene sequences, the number of potential errors can also be substantial. Therefore, computer technology is expected to identify the location of these potential errors, determine whether they are errors, and, if so, determine what the correct sequence should be, often requiring selection among several potentially correct sequences. Given the potentially large number of these potential errors spanning vast gene sequences, the automated process of identifying and correcting these errors is essential as an integral part of the computer processing used in NGS.

[0047] For example, nucleotide sequences within chromosomes, as present in most of a population, may be known. An individual's sequence can then be determined and compared to these known sequences. Differences between the individual's sequence and known sequences can be significant medically, genealogically, or otherwise. However, errors or potential errors in the aligned sequences determined by NGS for an individual complicate the identification of differences between the individual's gene sequence and known sequences; for example, when errors exist but are not identified, or when differences between the individual's sequence and known sequences erroneously go undetected. This disclosure describes computer techniques for automating the identification and correction of certain types of errors that may exist in NGS, as well as related informatics processing for generating sequence alignments. Advantages include reduced processing time and increased error identification and correction, thereby improving the availability of NGS tools and related technologies.

[0048] Specifically, the aspects described herein address the problem of false positive variant calls (typically single nucleotide variants) and false negative variant calls (typically insertions / deletions) caused by inappropriate alignments of sequencing data reads containing insertions / deletions to a reference genome. The process described herein allows reads to be realigned in a manner that respects the existing representation of true insertions / deletions and rejects low-frequency “noisy” variants, all within a short run time. Typically, one or more reads or read sequences can correspond to positions in a gene sequence sequenced by NGS. With the generation of multiple reads, in general, all positions in the sequence are sequenced, the reads are aligned in order from the position corresponding to one end of the sequenced sequence to the position corresponding to the other end, and the sequence of nucleotides they represent is identified, so that the complete sequence can be determined. Since each of the one or more reads corresponding to a position in the sequenced gene sequence is identified as corresponding to that position, they can be considered aligned reads. However, errors can occur in identifying or calling the presence of insertions / deletions in the alignment due to the difficulty in definitively identifying insertions / deletions indicated by the aligned reads.

[0049] The lack of bilateral positional information poses a challenge to the accurate calling of insertions / deletions. When calling insertions / deletions, bilateral positional information can help indicate the start and end points of the variant. Figures 1A-1D illustrate how bilateral positional information can be used to interpret variant bases toward the end of a read. Figure 1A depicts an initial alignment 100 of read sequence 102 (also referred to herein as the "read") to reference sequence 104 (also referred to herein as the "reference"), producing an aligned read. In practice, the "read" and "reference" can be portions of longer nucleotide sequences, which can also be referred to as the read sequence and the reference sequence. Nucleotide base positions 1 to 12 are marked above the reference sequence 104. The first five bases of the aligned read 102 (seven nucleotides long in this example) match the reference 104. That is, the sequence TCGTA at base positions 2 to 6 matches between the aligned read 102 and the reference 104. The sequences begin to diverge at base position 7, where the sequence CG is observed in the read at positions 7 and 8. Figures 1B to 1D depict three alternative ways of explaining this variant by showing the variant with different bilateral positional information. In Figure 1B, additional positional information provided by downstream sequence (e.g., another aligned read 103b in this example) identifies the variant bases CG at positions 7 and 8 as point mutations. In Figure 1C, additional positional information provided by 103c indicates that the variation is explained by a deletion of two bases at positions 7 and 8. In Figure 1D, additional positional information provided by 103d indicates that the variation is explained by an insertion of two bases between positions 6 and 7. Differences in the bilateral positional information available for a given aligned read 102 may result in different alignments of the read sequence (e.g., 102', 102" or 102"').

[0050] Sequencing data collected as part of sequence analysis is stored in sequence alignment datasets. Common file types used to store sequence alignment data are the SAM (.sam) and BAM (.bam) file formats. Sequence alignment software (“alignment tools”) outputs sequence alignment dataset files, such as BAM files, which indicate the alignment of read sequences with a reference genome and indicate evidence that insertions or deletions may be present in these aligned reads. Alignment tools typically assign a higher penalty for opening “gaps” (insertions or deletions) than for assigning mismatches, which becomes particularly noticeable at the ends of reads. Therefore, many sequence variations may be incorrectly labeled as mismatches or soft clips even when other read evidence indicates the possible presence of insertions or deletions.

[0051] The aspects described herein reprocess sequence alignment dataset files, such as source / raw / input sequence alignment datasets, to extract information from nearby aligned reads to form surrounding positional information. The method collects existing insertion / deletion observations from the initial alignments, as indicated in the input sequence alignment dataset, and processes them by attempting to realign imperfectly aligned reads around the observed insertions / deletions, thus minimizing mismatches. In some instances, reads initially not indicating any insertion / deletion are realigned so that they do indicate an insertion / deletion relative to the reference. A sequence alignment dataset may initially have no evidence that a particular read contains any insertion / deletion. However, the aspects described herein can “rescue” reads when the presence of an insertion / deletion is more appropriately indicated by realignment. As a concrete example, it is possible that only one read aligned to a region of the reference genome sequence indicated in the input sequence alignment dataset reflects an insertion / deletion, but after processing the initial alignments as described herein, several reads in the output sequence alignment dataset support the presence of the insertion / deletion.

[0052] In addition to reducing false negatives as described above, the method described herein can also reduce false positives by eliminating some mismatches or some insertions / deletions initially indicated in one or more reads of the input sequence alignment dataset.

[0053] The process described herein provides a local insertion / deletion re-alignment algorithm. It helps minimize mismatches by re-aligning input reads around insertions / deletions, such as those observed in the input sequence alignment dataset file and / or indicated in a reference insertion / deletion dataset, for example a 'prior' variant call format (.VCF) file. The VCF prior can be provided as input to the algorithm and indicates insertions / deletions expected in the source sequence alignment dataset file.

[0054] At a high level, the computer system can receive an input sequence alignment dataset as input and execute an algorithm to read through the input dataset, collect existing insertion / deletion observations, and process one or more initial alignments from the sequence alignment dataset by attempting to realign the read of each initial alignment around the observed insertions / deletions. The algorithm can provide a new 'realigned', sorted and indexed sequence alignment dataset, for example as an output BAM or other sequence alignment dataset file. If a realignment of a read with the reference is better than the initial alignment of the read with the reference, the realignment can be output instead of the initial alignment. Otherwise, the initial alignment can be output from the input sequence alignment dataset as is. The output sequence alignment dataset can be a separate file from the original sequence alignment dataset, or it can be a modified version of the input sequence alignment dataset, where the algorithm directly modifies / rewrites the original sequence alignment dataset.

[0055] In a specific instance, the algorithm iteratively collects existing insertion / deletion observations from the input sequence alignment dataset and adds them to a set of candidate insertions / deletions for realignment processing of a particular initial alignment. Whether an observed insertion / deletion is considered a candidate can depend on any desired parameter, such as the allele frequency of the observed insertion / deletion. In some instances, a user-configurable threshold allele frequency is provided as a parameter of the algorithm or other input, such as a command-line argument or an option specified in the software settings. Observed insertions / deletions that occur at least as frequently as the threshold can be considered candidate insertions / deletions. The frequency can be the total number of reads aligned to a given position in the reference sequence that indicate the presence of a given insertion / deletion at that position. Alternatively, the frequency can be the fraction of the total number of reads aligned to a given position in the reference sequence that indicate the presence of a given insertion / deletion at that position. The configurable threshold can be set as low as 1, indicating that the presence of an insertion / deletion in only one read aligned to a given position in the reference sequence constitutes sufficient evidence to consider the insertion / deletion a candidate. Alternatively, the configurable threshold can be a predetermined proportion, between 0 and 1, of the number of reads aligned to a given position in the reference sequence, such that the presence of an insertion / deletion in at least that proportion of the reads constitutes sufficient evidence to consider it a candidate. In practice, noise and other considerations may necessitate setting the threshold higher. Furthermore, any insertion / deletion provided in the optional prior VCF reference insertion / deletion dataset can be considered a candidate insertion / deletion.
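The thresholding described above can be sketched as follows, assuming each observation is recorded once per supporting read. The function `collect_candidates`, its `as_fraction` switch, and the `(pos, kind, seq)` tuple layout are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

IndelKey = Tuple[int, str, str]  # (reference position, "I" or "D", bases)


def collect_candidates(observations: Iterable[IndelKey],
                       depth: Dict[int, int],
                       threshold: float,
                       as_fraction: bool = False,
                       priors: Iterable[IndelKey] = ()) -> Set[IndelKey]:
    """Select candidate indels from per-read observations.

    observations: one entry per read that indicates the indel.
    depth: number of reads aligned at each reference position.
    threshold: minimum read count, or minimum fraction if as_fraction is True.
    priors: indels from an optional prior dataset; always kept as candidates.
    """
    counts = Counter(observations)
    candidates = set(priors)
    for indel, n in counts.items():
        frequency = n / depth[indel[0]] if as_fraction else n
        if frequency >= threshold:
            candidates.add(indel)
    return candidates


obs = [(10, "D", "TT")] * 3 + [(25, "I", "A")]
depth = {10: 30, 25: 20}
print(collect_candidates(obs, depth, threshold=2))
print(collect_candidates(obs, depth, threshold=0.08, as_fraction=True))
```

With a count threshold of 1, the single-read insertion at position 25 would also qualify, which matches the lowest setting the text describes.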

[0056] A computer system that reads through a sequence alignment dataset can typically proceed from the beginning to the end of the mapped reference genome sequence. Candidate insertions and deletions associated with an individual alignment can exist before or after the original position of the alignment (i.e., upstream or downstream relative to the reference genome sequence). Subsequent reads to be processed can provide further support for candidate insertions and deletions. Therefore, the algorithm can retain encountered initial alignments in memory until they are considered cleared for processing, rather than processing them immediately without having read alignments further down the reference genome sequence. Cleared alignments are those whose positions, as indicated in the sequence alignment dataset, are upstream of the end of the window (with a configurable window size). This allows candidate insertions and deletions for a given read to be collected from regions before and after it. The genome window size relates to the number of bases beyond the initial alignment that must be read before the algorithm is satisfied that it has collected the information deemed potentially relevant to the alignment. The window size can be configurable, for example as a command-line parameter. Larger window sizes allow for the consideration of larger and more distant insertions and deletions, but setting the window size too large may negatively impact the performance of the computer system because of greater resource demands. In certain instances, a window size of 250-1000 bases may be sufficient for general use.
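A minimal sketch of this delayed clearing follows, assuming alignments arrive sorted by start position and are represented only by `(start, end)` coordinates. Treating an alignment as cleared once the reader has advanced at least one window past its end is an interpretation of the text, and `cleared_batches` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Alignment = Tuple[int, int]  # (start, end) on the reference, sorted by start


def cleared_batches(alignments: Iterable[Alignment],
                    window: int) -> Iterator[List[Alignment]]:
    """Hold alignments in memory and release them in cleared batches.

    An alignment is released once the reader has reached a position at least
    `window` bases past its end, so downstream indel evidence inside the
    window has already been seen.
    """
    pending: Deque[Alignment] = deque()
    for aln in alignments:
        start = aln[0]
        batch = []
        while pending and pending[0][1] + window <= start:
            batch.append(pending.popleft())
        if batch:
            yield batch
        pending.append(aln)
    if pending:
        yield list(pending)  # end of input: everything left is cleared


reads = [(0, 100), (50, 150), (400, 500), (900, 1000)]
for batch in cleared_batches(reads, window=250):
    print(batch)
```

Because the queue is released strictly in arrival order, a long alignment at the head can hold back shorter ones behind it; that is a conservative simplification of this sketch.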

[0057] Figure 2 illustrates the clearing of reads for processing according to various aspects described herein. The genome block or window size is represented by 206. Reads 202 are each drawn horizontally (in this example) at their corresponding positions relative to a reference genome sequence (not shown). Insertions / deletions 208a-208d are indicated in various reads. Point 210 is where the first group of reads (the top 8 reads in Figure 2) is cleared for processing; it lies one window size past the end of the last read (202a) of the group. Delaying alignment processing by the configurable window ensures that when processing, for example, the initial alignment of read 202b, not only insertion / deletion 208b (which is part of that initial alignment) and upstream insertion / deletion 208a are considered, but also downstream insertions / deletions 208c and 208d, because they are located within window 206 upstream of point 210, where the alignments of the first 8 reads are cleared for processing.
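The delayed, window-based clearing can be illustrated with a small buffering sketch. It assumes alignments arrive sorted by start position as hypothetical `(name, start, end)` tuples, and it treats an alignment as cleared once the reader has advanced at least `window_size` bases past its end; the function name and the tuple layout are assumptions made for the example, not details from the described method.

```python
from collections import deque


def stream_cleared(alignments, window_size):
    """Yield (alignment, reader_position) pairs as alignments are cleared.

    alignments: iterable of (name, start, end) tuples sorted by start (assumed layout).
    reader_position is the start of the alignment being read when clearing happened,
    or None for alignments flushed at the end of the input.
    """
    pending = deque()
    for aln in alignments:
        reader_position = aln[1]
        # Release buffered alignments that lie a full window upstream of the reader.
        while pending and pending[0][2] + window_size <= reader_position:
            yield pending.popleft(), reader_position
        pending.append(aln)
    # End of input: no later read can add support, so flush what remains.
    while pending:
        yield pending.popleft(), None
```

Because an alignment is held until the reader is a full window past it, insertions / deletions observed in later reads within that window are already known when the alignment is processed.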

[0058] An initial alignment that has been cleared for processing can be processed by an exemplary method described and depicted with reference to Figure 3. The method of Figure 3 is a process that can be performed by one or more computer systems. The process initially determines whether the alignment is suitable for inclusion in the output sequence alignment dataset (302) (in this example, a BAM file). In this regard, the software performing the processing may have a configuration setting that allows the processing to skip and remove certain alignments, such as PCR duplicate alignments, so that these are ignored if the setting is enabled. If the initial alignment is not suitable for inclusion, processing of the initial alignment ends without outputting the alignment. Otherwise, processing continues by determining whether the initial alignment is suitable for realignment processing (304). Eligibility can be determined based on any desired factors. As an example, it can be determined (i) whether the alignment is perfectly aligned, for example, whether there are any mismatched bases between the aligned read and the reference sequence to which it is aligned, (ii) whether the aligned read contains soft clips, (iii) whether the initial alignment is a secondary alignment, and / or (iv) whether there are candidate insertions / deletions around the aligned read in the region of the reference genome sequence indicated in the sequence alignment dataset. In a particular instance, if the alignment is perfectly aligned with no soft clips, is a secondary alignment, or has no candidate insertions / deletions in the region, the alignment is determined to be unsuitable for realignment processing (304-No), and the process outputs the alignment as is (306), for example by buffering it for direct output to the output sequence alignment dataset.

[0059] Conversely, if at 304 it is determined that the alignment is suitable for realignment processing (304-Yes), for example, if it is not perfectly aligned, there is a soft clip, it is not a secondary alignment, and / or there is a candidate insertion / deletion in the region, then the process continues by attempting realignment processing to realign the initial alignment (308). This realignment is described in further detail below as part of the read realignment procedure. The realignment procedure provides a realignment that is considered the “best” realignment. After the realignment processing, it is determined whether the best realignment is at least as good as the original initial alignment (310). If not, the initial alignment is output as is (306). Otherwise, the best realignment is output (312). Thus, whenever an initial alignment is processed, an alignment of the given read with the reference is output, either as the initial alignment (306) or as the realignment (312).
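The decision flow of Figure 3 can be summarized in a short sketch. The `Alignment` fields, the function names, and the `realign` and `at_least_as_good` callables are hypothetical stand-ins introduced for illustration; the numbers in the comments refer to the steps named in the text.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    mismatches: int
    soft_clips: int
    is_secondary: bool = False
    is_duplicate: bool = False


def eligible_for_realignment(aln, has_nearby_candidates):
    """Step 304: decide whether realignment should be attempted."""
    if aln.is_secondary or not has_nearby_candidates:
        return False
    perfectly_aligned = aln.mismatches == 0 and aln.soft_clips == 0
    return not perfectly_aligned


def process(aln, has_nearby_candidates, realign, at_least_as_good, skip_duplicates=True):
    """Return the alignment to output, or None if nothing is output."""
    if skip_duplicates and aln.is_duplicate:
        return None                                    # 302: not suitable for inclusion
    if not eligible_for_realignment(aln, has_nearby_candidates):
        return aln                                     # 306: output as is
    best = realign(aln)                                # 308: attempt realignment
    if best is not None and at_least_as_good(best, aln):   # 310
        return best                                    # 312: output the realignment
    return aln                                         # 306: keep the initial alignment
```

Passing `realign` and `at_least_as_good` as parameters keeps the control flow separate from the realignment procedure and selection criteria, which the text describes separately.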

[0060] Before realignment of the alignment is attempted (308), all observed and potentially impactful candidate insertions / deletions for the read (including any insertions / deletions from the original alignment itself, surrounding insertions / deletions, and any "priors") have been collected to form a set of candidate insertions / deletions, which are introduced as candidates to provide a realignment of the read with the reference. An iterative process begins, introducing each candidate insertion / deletion (and in some instances, combinations of two or more of these candidate insertions / deletions) into a flattened version of the aligned read. In some instances, insertions / deletions are introduced from the left (i.e., from the upstream or 5' direction) and from the right (i.e., from the downstream or 3' direction) of the flattened aligned read. Each iteration provides a resulting 'candidate realignment', which is evaluated to determine the quality of the realignment. The evaluation may consider any desired quality metrics, such as the number of mismatched bases between the aligned read and the reference in the candidate realignment, the number of insertions / deletions, the positions of the insertions / deletions, and / or the number of soft-clipped bases.

[0061] One concept described herein is a position map, which is an array of chromosome coordinates, one for each base in a read. The position map is a data structure used to represent reads in a sequence alignment dataset. Figure 4A depicts an exemplary position map of an aligned read containing a soft clip and a deletion, according to various aspects described herein. Figure 4B depicts an exemplary position map of an aligned read containing an insertion, according to various aspects described herein. Referring first to Figure 4A, the alignment of read 402a with reference 404 is shown, with the corresponding CIGAR operations 412a shown below the read. Below CIGAR operations 412a is position map 414a. A position map typically reflects the base positions shown above reference 404, except that soft-clipped or inserted bases, which are not mapped to the reference genome, are given a position of "-1" in the position map, and deleted bases (shown at positions 7 and 8) are not present in the read and so have no entries of their own. Instead, a deletion is indicated by a jump in position between two consecutive bases of the read; for example, as shown in position map 414a, a jump from 6 to 9 indicates a 2-base pair (bp) deletion. Figure 4A therefore depicts an exemplary position map reflecting the read chrN:2(1S5M2D2M), with a soft clip and a 2bp deletion. The soft clip indicated in Figure 4A is an N-type soft clip: the initial alignment tool has soft-clipped a portion of the read, assigned it an "N," and thereby indicated that it cannot tell which bases are present. "N" is a special type of soft clip; other types of soft clips may have identified bases but are still considered soft clips.

[0062] Figure 4B depicts another exemplary position map 414b, reflecting the read chrN:2(5M2I2M) with a 2bp insertion (between positions 6 and 7). The insertion is shown relative to reference 404 and is reflected in aligned read 402b, CIGAR operations 412b, and position map 414b.
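To make the position map concrete, the following sketch derives one from a start position and a CIGAR string, using the two reads named in the text. The function name `position_map` is hypothetical, and only the M, I, D, and S operations are handled; this is an illustrative assumption, not a full CIGAR parser.

```python
import re


def position_map(start, cigar):
    """Build a position map: one reference coordinate per read base.

    Soft-clipped (S) and inserted (I) bases get -1; deleted (D) bases have no
    entry and only advance the reference coordinate.
    """
    positions = []
    ref_pos = start
    for length, op in re.findall(r"(\d+)([MIDS])", cigar):
        n = int(length)
        if op == "M":
            positions.extend(range(ref_pos, ref_pos + n))
            ref_pos += n
        elif op == "D":
            ref_pos += n
        else:  # "I" or "S": bases present in the read but not mapped to the reference
            positions.extend([-1] * n)
    return positions


print(position_map(2, "1S5M2D2M"))  # [-1, 2, 3, 4, 5, 6, 9, 10]
print(position_map(2, "5M2I2M"))    # [2, 3, 4, 5, 6, -1, -1, 7, 8]
```

The jump from 6 to 9 in the first result encodes the 2bp deletion, and the two -1 entries in the second encode the 2bp insertion between positions 6 and 7.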

[0063] Read realignment can involve manipulation of the position map and subsequent comparison of nucleotide-position pairs with the reference genome. Each read to be realigned can first be stripped of its existing insertions, deletions, and non-N-type soft clips to create a “blank slate.” This provides a read that starts out presumed to have no insertions or deletions. Reads without insertions or deletions are referred to herein as “flattened” read sequences or flattened aligned reads (flattened versions of the reads in the initial alignment). Candidate insertions / deletions are then iteratively introduced into the flattened aligned reads, and their agreement with the reference is evaluated. This introduction can be accomplished by manipulating the position map. The resulting nucleotide-position pairs can then be compared with the reference genome.
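Comparing nucleotide-position pairs with the reference amounts to a simple count, sketched below. The function name, the 1-based `ref_start` convention, and the toy sequences in the usage lines are assumptions for illustration only.

```python
def count_mismatches(read, pos_map, reference, ref_start=1):
    """Count read bases that disagree with the reference at their mapped positions.

    read:      read bases as a string.
    pos_map:   one reference coordinate per read base, -1 for unmapped bases.
    reference: reference bases as a string whose first base has coordinate ref_start.
    """
    return sum(
        1
        for base, pos in zip(read, pos_map)
        if pos != -1 and reference[pos - ref_start] != base
    )


reference = "ACGTACGTAC"  # toy reference, coordinates 1..10
print(count_mismatches("CGTTC", [2, 3, 4, 5, 6], reference))  # 1
print(count_mismatches("NCGT", [-1, 2, 3, 4], reference))     # 0
```

Bases with position -1 are skipped because they are not placed against the reference.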

[0064] Figures 5A-5C depict the flattening of aligned reads according to various aspects described herein. Figure 5A shows initially aligned read 502a against reference 504, along with the corresponding CIGAR operations 512a. Position map 514a indicates the N-type soft clip at position 1 and the 2bp deletion at positions 7 and 8. Figure 5B shows flattened aligned read 502b, which is left-anchored, meaning that it is flattened so as to shift the bases to the left (i.e., upstream or in the 5' direction). CIGAR operations 512b and the corresponding position map 514b are updated as indicated. Figure 5C shows flattened aligned read 502c, which is right-anchored, meaning that it is flattened so as to shift the bases to the right. CIGAR operations 512c and the corresponding position map 514c are updated as indicated in Figure 5C.
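Flattening can be sketched as a rewrite of the position map in which every base other than an N-type soft-clipped base receives consecutive coordinates, anchored at either the first or the last mapped base. The function below is an illustrative assumption rather than the described implementation: the `is_n` mask, the inclusive end coordinate, and the requirement that at least one base is mapped are choices made for the example.

```python
def flatten(pos_map, is_n, anchor="left"):
    """Return a flattened position map with no insertions, deletions, or non-N soft clips.

    pos_map: reference coordinate per read base, -1 for unmapped bases.
    is_n:    list of bools marking N-type soft-clipped bases, which stay at -1.
    anchor:  "left" keeps the first mapped base in place, "right" keeps the last.
    Assumes at least one non-N base is mapped.
    """
    keep = [i for i in range(len(pos_map)) if not is_n[i]]
    mapped = [i for i in keep if pos_map[i] != -1]
    if anchor == "left":
        first = mapped[0]
        leading = sum(1 for i in keep if i < first)   # clipped / inserted prefix bases
        start = pos_map[first] - leading
    else:
        last = mapped[-1]
        trailing = sum(1 for i in keep if i > last)   # clipped / inserted suffix bases
        end = pos_map[last] + trailing                # inclusive end coordinate
        start = end - len(keep) + 1
    flat = [-1] * len(pos_map)
    for offset, i in enumerate(keep):
        flat[i] = start + offset
    return flat


original = [-1, 2, 3, 4, 5, 6, 9, 10]      # chrN:2(1S5M2D2M), first base is an N soft clip
n_mask = [True] + [False] * 7
print(flatten(original, n_mask, "left"))   # [-1, 2, 3, 4, 5, 6, 7, 8]
print(flatten(original, n_mask, "right"))  # [-1, 4, 5, 6, 7, 8, 9, 10]
```

Left anchoring closes the deletion by pulling the last two bases upstream, while right anchoring closes it by pushing the first five mapped bases downstream.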

[0065] Figures 6A-6D depict the introduction, or 'injection', of candidate insertions / deletions into flattened aligned reads. Candidate insertions / deletions can be those found within the genomic proximity of the read being processed, as well as any 'priors' indicated by a reference insertion / deletion dataset, if used. In the example of Figures 6A-6D, the proximal candidate insertions / deletions include: chrN:6ATC>A, chrN:6A>ACG, and chrN:10GA>G.

[0066] Figure 6A depicts a candidate realignment in which realigned read 602a, flattened and left-anchored, is aligned with reference 604. No insertions or deletions are introduced, and the result is four mismatched bases between candidate realigned read 602a and reference sequence 604 (see positions 7-10).

[0067] Figure 6B depicts another candidate realignment, with realigned read 602b, in which the chrN:6ATC>A deletion is introduced at positions 7 and 8. The last four bases of the read, CGTC, are shifted downstream by two positions to introduce the deletion. The result is two mismatched bases between candidate realigned read 602b and reference sequence 604 (see positions 11 and 12).

[0068] Figure 6C depicts another candidate realignment, with realigned read 602c, in which the chrN:6A>ACG insertion is added between positions 6 and 7. The result is that there are no mismatched bases between candidate realigned read 602c and reference sequence 604. As further described below, once this candidate realignment is determined, the iterative introduction of insertions / deletions into the flattened aligned read may be halted because the candidate perfectly matches the reference, and this candidate realignment, which would be considered a perfect alignment, is returned.

[0069] Figure 6D depicts a candidate realignment, with read 602d, in which the chrN:6ATC>A deletion is introduced at positions 7 and 8 and the chrN:10GA>G deletion is introduced at position 11. This example illustrates the injection of two deletions into a flattened aligned read. Here, too, the result is that there are no mismatched bases between candidate realigned read 602d and reference sequence 604. Figure 6D is the same alignment as Figure 6B, but with an additional deletion.

[0070] One objective in finding the desired realignment is first to minimize mismatches and then to minimize the number of insertions / deletions. A realignment with a single insertion / deletion and no mismatches can be considered the best, in which case the realignment process can stop and return that realignment. It can then be compared with the initial alignment to determine which is the better alignment to output. Alternatively, when no 'perfect alignment' is encountered during the realignment process, the 'best' candidate realignment from the combinations considered can be compared with the original alignment, and the better one can be selected for output as described below.

[0071] When determining the "best" candidate alignment, rules can be applied in order of priority. In some instances, the current best candidate realignment is stored and compared with the next candidate realignment that is determined. The two are compared according to the rules, and if the new candidate is better, it becomes the new best candidate realignment, replacing the old one. An example of these rules and their priority ordering is:

[0072] (i) If one alignment of a read has only one mismatched base relative to the reference and no insertions or deletions, while another alignment of the read has one or more insertions or deletions, prefer the alignment with the single mismatch. An alignment with no insertions or deletions and only one mismatched base is preferred over a candidate alignment with one or more insertions or deletions;

[0073] (ii) Regardless of the number of insertions or deletions, minimize the number of mismatched bases (i.e., mismatches between the aligned read and the reference). Alignments with a smaller number of mismatched bases are preferred over alignments with a larger number of mismatched bases;

[0074] (iii) Given the same number of mismatched bases relative to the reference, prefer fewer non-N soft clips. Between alignments with the same number of mismatched bases, an alignment with fewer soft clips of a specified type (e.g., N) is preferred over an alignment with more soft clips of the specified type; and

[0075] (iv) Given the same number of mismatched bases relative to the reference, prefer fewer insertions or deletions. Among different alignments with the same number of mismatched bases, an alignment with fewer insertions or deletions is preferred over an alignment with more insertions or deletions.
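Rules (i) through (iv) can be read as a prioritized comparison. The sketch below is one possible encoding, offered under assumptions: the `Score` fields and the function name `is_better` are hypothetical, the soft-clip count is taken to be whichever soft-clip type rule (iii) is configured to penalize, and ties return False so that an existing best candidate is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    mismatches: int
    indels: int
    soft_clips: int


def is_better(a, b):
    """Return True if alignment a is preferred over alignment b."""
    # Rule (i): one mismatch and no indels beats any alignment containing indels.
    a_single = a.mismatches == 1 and a.indels == 0
    b_single = b.mismatches == 1 and b.indels == 0
    if a_single and b.indels > 0:
        return True
    if b_single and a.indels > 0:
        return False
    # Rule (ii): fewer mismatched bases, regardless of indel count.
    if a.mismatches != b.mismatches:
        return a.mismatches < b.mismatches
    # Rule (iii): same mismatches, fewer soft clips of the penalized type.
    if a.soft_clips != b.soft_clips:
        return a.soft_clips < b.soft_clips
    # Rule (iv): same mismatches, fewer indels.
    return a.indels < b.indels
```

Note that under rule (i) a single-mismatch alignment without insertions or deletions outranks even a zero-mismatch alignment that needs an insertion / deletion, which reflects the caution the rules place on introducing insertions / deletions.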

[0076] The input sequence alignment dataset can be processed by a computer system that reads the data one alignment at a time. These initial alignments are read into memory, and each initial alignment is eventually cleared for processing based on a sliding window as described above. If the process determines that a cleared initial alignment is suitable for realignment processing, realignment processing is performed for that initial alignment as described and depicted with reference to Figure 7. The processing of Figure 7 can be performed by one or more computer systems.

[0077] The process begins by obtaining all proximal candidate insertions / deletions of the read in this initial alignment, i.e., the insertions / deletions observed in the region (702). Proximal insertions / deletions can be those within the region or window considered relevant to the read alignment, and thus can be any insertions / deletions seen in any of the several different alignments indicated in the sequence alignment dataset. This set of insertions / deletions, optionally together with insertions / deletions indicated as 'prior' in the reference insertion / deletion dataset or known / assumed to exist, forms a candidate insertion / deletion set.

[0078] The process then sorts the candidate insertions / deletions relevant to the initial alignment (704). This sorting, or prioritization, can be based on any desired rules, such as the following example rules, applied in the following order:

[0079] (i) "Known" / prior insertions / deletions take priority (if used): the prioritization can rank insertions / deletions that the reference insertion / deletion dataset indicates as known above insertions / deletions that it does not, even if the latter are indicated in large numbers in the sequence alignment dataset;

[0080] (ii) Larger / longer insertions / deletions take priority: the prioritization can rank longer insertions / deletions above shorter ones, even shorter ones that appear more frequently in the sequence alignment dataset;

[0081] (iii) Higher frequency takes priority, that is, insertions / deletions indicated at a given position by a higher number of reads: the prioritization can rank an insertion / deletion indicated by a larger total number, or a larger proportion, of the aligned reads in the sequence alignment dataset corresponding to a given position in the reference sequence above one indicated by a smaller number or proportion; and

[0082] (iv) Given equal frequency, the leftmost insertion / deletion takes priority: among different insertions / deletions indicated by the same number of aligned reads in the sequence alignment dataset, the prioritization can rank an insertion / deletion whose position in the reference genome sequence, as indicated by the sequence alignment dataset, is upstream of that of another insertion / deletion above that other insertion / deletion. As an example, when attempting to realign read 202b, insertion / deletion 208a may actually be ranked higher than insertion / deletion 208b.

[0083] The prioritization indicates which insertions / deletions are weighted more heavily, in terms of probability of presence, than other insertions / deletions. If two possible candidate insertions / deletions would provide two distinct candidate realignments with the same number (zero or more) of mismatched bases, the prioritization indicates which insertion / deletion is more likely to be present. The exemplary prioritization rules above favor known, longer, and more frequent insertions / deletions. The prioritization thus reflects which insertions / deletions are more likely to be real.
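The four prioritization rules map naturally onto a compound sort key. In this sketch the `Indel` class, its field names, and the `prioritize` function are hypothetical, and length is taken as the difference between the ref and alt allele lengths; these are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indel:
    pos: int            # reference position of the anchor base
    ref: str
    alt: str
    support: int = 0    # number of reads indicating this insertion / deletion
    is_prior: bool = False

    @property
    def length(self):
        return abs(len(self.alt) - len(self.ref))


def prioritize(candidates):
    """Sort by: priors first, then longer, then better supported, then leftmost."""
    return sorted(
        candidates,
        key=lambda c: (not c.is_prior, -c.length, -c.support, c.pos),
    )


a = Indel(6, "ATC", "A", support=5)
b = Indel(6, "A", "ACG", support=9)
c = Indel(10, "GA", "G", support=20)
d = Indel(3, "T", "TA", support=1, is_prior=True)
print([x.pos for x in prioritize([a, b, c, d])])  # [3, 6, 6, 10]
```

In the usage lines, the prior `d` ranks first despite its low support, the two 2bp candidates follow in order of support, and the well-supported but shorter `c` comes last.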

[0084] The process of Figure 7 continues by obtaining the 'best' realignment (706), which is described in further detail with reference to Figure 8. Obtaining the best realignment is a process of iteratively introducing candidate insertions / deletions into a flattened version of the originally aligned read to produce candidate realignments. The iteration realigns the read with permutations of one insertion / deletion, two insertions / deletions, and so on (up to and including n insertions / deletions), taken in the sorted order of the candidate insertions / deletions. In some embodiments, n is 3. Each iteration produces a candidate realignment. Selection criteria can be used to select the 'best' among these realignments. One objective may be to achieve as few mismatches as possible between the reference and the modified read with the injected insertions / deletions.

[0085] Continuing with the process of Figure 7, after obtaining the best candidate realignment, the process determines whether the realigned read aligns perfectly with the reference (708), that is, whether the candidate realigned read (with one or more insertions or deletions introduced therein) aligns with the reference sequence with no mismatched bases between the realigned read and the reference. One example is shown in Figure 6C: with the CG insertion introduced between positions 6 and 7, aligned read 602c is perfectly aligned with reference 604. If the selected best realignment is a perfect realignment (708-Y), the process outputs the selected best realignment (710) instead of the original alignment from the input sequence alignment dataset.

[0086] Otherwise, if there is a mismatch between the realigned read and the reference (708-N), the process proceeds by comparing the best candidate realignment with the original alignment (712). Ultimately, the goal is to output the better alignment of the two. Therefore, based on the comparison, the process determines whether the best candidate realignment provided by 706 is better than the original alignment (714). If so, the process outputs this best realignment (710). In a particular instance, if the best candidate realignment is better than or as good as the original alignment, the mapping quality is adjusted as appropriate (e.g., set to 40 if the original quality is 20 or lower and the realignment has no mismatches), and the process outputs the best candidate realignment to the output sequence alignment dataset after this mapping quality adjustment. Returning to query 714, if the best candidate realignment is not better than or as good as the original alignment (714-N), the process outputs the original alignment (716).
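The output choice and the example mapping quality adjustment can be sketched as below. The dictionary keys `mismatches` and `mapq`, the function name, and the `at_least_as_good` callable are hypothetical; the values 20 and 40 are the example values given in the text.

```python
def choose_output(original, best, at_least_as_good):
    """Pick the alignment to output and adjust mapping quality where applicable.

    original, best: dicts with 'mismatches' and 'mapq' keys (assumed layout).
    at_least_as_good(best, original): comparison supplied by the caller.
    """
    if best is None or not at_least_as_good(best, original):
        return original                      # 716: keep the original alignment
    output = dict(best)                      # copy so the inputs are not modified
    if original["mapq"] <= 20 and best["mismatches"] == 0:
        output["mapq"] = 40                  # example adjustment from the text
    return output                            # 710: output the best realignment
```

Copying `best` before adjusting keeps the candidate itself unchanged, which is convenient if it is still referenced elsewhere.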

[0087] The selection criteria used for choosing the better of the original alignment and the best candidate realignment can be the same as or different from the selection criteria used at 706 to determine the best candidate realignment. In a specific instance, the selection criteria for choosing the best candidate realignment and / or the better of the original alignment and the best candidate realignment can be based on: the number of mismatched bases between the aligned read and the reference sequence, the number of insertions or deletions indicated by the alignment, the position of the insertion or deletion relative to the reference genome sequence indicated in the sequence alignment dataset, and / or the number of soft-clipped bases indicated by the alignment. The term "alignment" as used above encompasses both an alignment (as in the original alignment) and a realignment (as in a candidate realignment), since both present an alignment of the corresponding read with the reference sequence.

[0088] As an example, the selection criteria may apply one or more of the following preferences: an alignment with no insertions or deletions and only a single mismatched base (between the read and the reference to which it is aligned) is preferred over an alignment with one or more insertions or deletions; an alignment with a smaller number of mismatched bases is preferred over an alignment with a larger number of mismatched bases; among alignments with the same number of mismatched bases, an alignment with a smaller number of soft clips of a specified type (e.g., N) is preferred over an alignment with a larger number of soft clips of the specified type; and / or among different alignments with the same number of mismatched bases, an alignment with a smaller number of insertions or deletions is preferred over an alignment with a larger number of insertions or deletions.

[0089] Figure 8 depicts an exemplary process for selecting the best candidate realignment according to aspects described herein. The processing of Figure 8 can be performed by one or more computer systems. At a high level, the process introduces permutations of one or more insertions / deletions into a modified (e.g., flattened) version of the aligned read undergoing realignment processing. Each introduction generates a candidate realignment. The process first introduces each insertion / deletion individually into the flattened aligned read, providing candidate realignments, and then introduces each combination of two insertions / deletions into the read to provide additional candidate realignments. This can be repeated for larger numbers of insertions / deletions, such as 3 or 4, until some configurable threshold is met. In some instances, this threshold is met after permutations of 3 insertions / deletions have been introduced into the read. The order in which insertions / deletions are introduced in the realignment processing follows the prioritized ordering of the insertions / deletions described above. Moreover, in some instances, the processing is configured to break (exit / stop) whenever a perfect alignment is determined.

[0090] As a specific example, assume there are n candidate insertions / deletions {I1,I2,I3,…,In} ordered by priority, and that the iteration proceeds through permutations of one insertion / deletion, then two, then three. The insertions / deletions are then introduced into the flattened aligned read in the following order, where each iteration provides a candidate realignment:

[0091] -[Iterations with one insertion / deletion:] Introduce I1, then I2, then I3, ..., then In; then

[0092] -[Iterations with two insertions / deletions:] Introduce I1+I2, then I1+I3, ..., then I1+In, then I2+I3, then I2+I4, ..., then I2+In, ..., then In-1+In; then

[0093] -[Iterations with three insertions / deletions:] Introduce I1+I2+I3, ..., then In-2+In-1+In.
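The ordering listed above is the order in which Python's `itertools.combinations` enumerates subsets of a priority-sorted list, so it can be reproduced in a few lines. The function name `indel_combinations` and the default `max_size` of 3 are assumptions for illustration.

```python
from itertools import combinations


def indel_combinations(sorted_candidates, max_size=3):
    """Yield groups of 1, then 2, ... up to max_size candidates, in priority order."""
    for size in range(1, max_size + 1):
        yield from combinations(sorted_candidates, size)


for combo in indel_combinations(["I1", "I2", "I3"], max_size=2):
    print("+".join(combo))
# I1
# I2
# I3
# I1+I2
# I1+I3
# I2+I3
```

Using a generator means later, larger groups are never built if the caller stops early, for example on finding a perfect alignment.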

[0094] Introducing insertions / deletions means injecting them into the flattened aligned read and checking how the modified read aligns with the reference genome, a check that can be aided by the resulting modified position map.

[0095] As noted, if a candidate realignment that perfectly matches the reference is provided at any point during the iteration, the process can break and said candidate can be selected as the best candidate realignment to be provided (Figure 7, 706).

[0096] Referring to Figure 8, the process begins by initializing a best realignment. In one instance, this is initially empty or defaults to the original alignment as a placeholder, but it is replaced, as the processing of Figure 8 continues, by the current best realignment encountered for the subject original alignment. The process enters a loop, which begins by determining whether there are more insertion / deletion permutations to try (804). If so, the process obtains / identifies the next permutation to try (806). Then, when the next permutation to try contains multiple insertions / deletions, an optional determination 808 is performed. Some insertions / deletions cannot coexist, in which case it is not meaningful to introduce them together into the flattened aligned read to provide a candidate realignment. Determination 808 may not be performed during the initial iterations, when only a single insertion / deletion is introduced. Eventually, if the realignment processing reaches permutations with two or more insertions / deletions, determination 808 may be performed at each iteration. If it is determined at 808 that the insertions / deletions to be introduced into the flattened read cannot coexist, the process proceeds to the next iteration by returning to 804 to determine whether there are more insertion / deletion permutations to try. Otherwise, or if determination 808 is not made because only a single insertion / deletion is considered in the current iteration, the process proceeds by performing a 're-alignment with target' process to obtain a result (810). This process is described in further detail with reference to Figure 9.

[0097] The result obtained from 810 is a candidate realignment. The process of Figure 8 then determines whether the result is better than the current best realignment (812). If so, the result becomes the new current best realignment (814). In one instance, the result replaces the previously stored best realignment, which is discarded. Because the result has been determined to be better than any prior candidate realignment obtained in this process, the process proceeds by determining whether the result (the new best realignment) is a perfect alignment (i.e., there are no mismatched bases between the realigned read and the reference) (816). If so, the process ends, and the best realignment is used as the selected best realignment. In some instances, it is output to the output sequence alignment dataset as the best alignment (Figure 7, 710).

[0098] If at 816 it is determined that the new best realignment is not a perfect alignment, or if at 812 it is determined that the obtained result is not better than the current best realignment, the process returns to 804 to determine whether there are any other insertion / deletion permutations to try. If not, the process returns the current best realignment (818). It can be seen that this process continues iterating until there are no more insertion / deletion permutations to try (804-No), or until a determined candidate realignment provides a perfect alignment (816-Yes).
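The loop of Figure 8 can be sketched as a search with early exit. Everything passed in here is a hypothetical stand-in: `realign_to_target` plays the role of step 810 and is assumed to return None when a combination cannot be applied, while `is_better`, `is_perfect`, and `can_coexist` stand for the comparisons at 812, 816, and 808.

```python
from itertools import combinations


def best_realignment(sorted_candidates, realign_to_target, is_better, is_perfect,
                     can_coexist=lambda combo: True, max_size=3, initial=None):
    """Return the best candidate realignment found, stopping early on a perfect one."""
    best = initial
    for size in range(1, max_size + 1):
        for combo in combinations(sorted_candidates, size):   # 804 / 806
            if size > 1 and not can_coexist(combo):            # 808
                continue
            result = realign_to_target(combo)                  # 810
            if result is None:                                 # combination not applicable
                continue
            if best is None or is_better(result, best):        # 812
                best = result                                  # 814
                if is_perfect(best):                           # 816
                    return best
    return best                                                # 818
```

Because candidates are visited in priority order and only a strictly better result replaces the current best, the higher-priority combination wins when two candidates are equally good.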

[0099] Figure 9 depicts exemplary 're-alignment with target' processing (Figure 8, 810) according to various aspects described herein. The processing of Figure 9 can be performed by one or more computer systems. The process obtains a left-anchored result (902), described with reference to Figure 10A, obtains a right-anchored result (904), described with reference to Figure 10B, and returns the better of the two (906). The selection criteria used to choose between the two can be any desired selection criteria, such as those described above. In a modified instance, instead of obtaining candidate realignments from both the left and right anchoring processes, the realignment provided by the left-anchored result is evaluated by the processing of Figure 8 (812-816), and the processing of Figure 10B to obtain the right-anchored result is performed only if it is determined that the left-anchored result is not a perfect alignment (816-No).

[0100] Figures 10A-10B depict exemplary processes for obtaining left-anchored and right-anchored realignment results, respectively, according to aspects described herein. The processing of Figures 10A and 10B can be performed by one or more computer systems. Left or right anchoring reflects an assumption about which side of the read has the more accurately identified bases. If one end of the read is more reliable than the other, the read is anchored from that end, and insertion / deletion injection is performed from that end. In left-anchored realignment, the left side is assumed to be more reliable than the right side. Referring to the left-anchored processing of Figure 10A, the process obtains an adjusted start position (1002) by shifting the read position to the left by the length of any prefix soft clip (in some instances, excluding N-type soft clips). If the read begins with (i) an insertion or (ii) a soft clip and an insertion, the read position is shifted to the left by the length of the insertion. The anchor corresponds to the outermost matching nucleotide. The process then creates a flattened read, sequence, and position map (1004); an example is depicted in Figure 5B. Then, for each of the one or more insertions / deletions in the current permutation, taken in sorted order, the process adds the insertion / deletion to obtain the resulting realignment (1006). Figure 6C depicts an example of one insertion / deletion introduced into a flattened, left-anchored read. Figure 6D depicts an example of two insertions / deletions introduced into a flattened read.

[0101] Referring to the right-anchored processing of Figure 10B, the process obtains an adjusted end position (1008) by finding the maximum position in the position map and adding the number of inserted / soft-clipped bases present at the end of the read. The adjusted start position of the read will be the maximum position minus the length of the read, excluding N-type soft clips. The process then creates a flattened aligned read, sequence, and position map (1010); an example is depicted in Figure 5C. Then, for each of the one or more insertions / deletions in the current permutation, taken in order from right (downstream, toward the 3' end) to left (upstream, toward the 5' end), the process adds the insertion / deletion to obtain the resulting realignment (1012). For example, if there are three insertions / deletions to be introduced, the process introduces them from right to left, first adding the most downstream of the three, then the more downstream of the other two, and then the third.

[0102] The 'add insertion / deletion and get result' process (Figure 10A, 1006; Figure 10B, 1012) is described with reference to Figure 11. The processing of Figure 11 can be performed by one or more computer systems and is carried out on a per-insertion / deletion basis. Where multiple insertions / deletions are to be added, the realignment resulting from adding one insertion / deletion is modified by adding the next (the resulting realignment is layered with each successive insertion / deletion). Ultimately, the candidate realignment produced by the processing of Figure 10A or 10B, and returned by 810 of Figure 8, is the final realignment obtained after each insertion / deletion in the combination has been added according to Figure 11.

[0103] Figure 11 The process assumes some initial realignment candidates, which will initially be flattened reads without introduced insertions, but are replaced with the updated realignments as each insertion is added. The process begins by determining whether the location map allows the introduction of insertions (1102). If not, for example, if the reference position of the insertion to be introduced is off-target or the final position in the location map, the insertion cannot be added, and the process returns NULL (1114) or some other expected result, and then terminates.

[0104] If the position map allows the indel to be introduced, the process determines whether the new position map (with the indel) is valid (1104). If not, the indel cannot be added, and the process returns NULL (1114) or some other expected result, and then terminates. Otherwise, the process proceeds by determining whether the candidate indel is an insertion (1106). If so, it determines whether the bases of the read sequence match the putative insertion (1108). The bases of the read sequence match the putative insertion if the bases in the read sequence at the putative insertion position are the same as those specified in the putative insertion. As an illustrative example, if the read sequence ATCTGA is anchored at position 10 (i.e., the 5' A is at chrN:10) and the putative insertion is chrN:12 C>CTG, it is considered a match because the next two bases in the read sequence after the C at chrN:12 are TG. Conversely, if the putative insertion is chrN:12 C>CAA, it does not match because the next two bases in the read sequence after the C at chrN:12 are not AA. If the bases in the read sequence do not match the putative insertion, the indel cannot be added, and the process returns NULL (1114) or some other expected result, and then terminates.
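The match check at 1108 can be sketched in Python using the ATCTGA example from the paragraph above. This is a minimal sketch, not the patented implementation: the function name `insertion_matches_read` is hypothetical, and it assumes a flattened, left-anchored read and a VCF-style insertion whose alternate allele begins with the reference allele.

```python
def insertion_matches_read(read_seq, anchor_pos, ins_pos, ref, alt):
    """Check whether the read carries the inserted bases of a putative insertion.

    Assumptions: read_seq is a flattened, left-anchored read whose first base
    sits at reference position anchor_pos, and the insertion is written
    VCF-style (alt starts with ref, e.g. C>CTG).
    """
    if len(alt) <= len(ref) or not alt.startswith(ref):
        raise ValueError("not an insertion")
    if ins_pos < anchor_pos:
        return False  # insertion lies before the first read base
    inserted = alt[len(ref):]
    # Index of the first read base after the reference allele.
    offset = ins_pos - anchor_pos + len(ref)
    # A slice that runs past the read end is shorter and so never matches.
    return read_seq[offset:offset + len(inserted)] == inserted


matches_ctg = insertion_matches_read("ATCTGA", 10, 12, "C", "CTG")
matches_caa = insertion_matches_read("ATCTGA", 10, 12, "C", "CAA")
```

Here `matches_ctg` is true and `matches_caa` is false, in line with the two illustrative cases. Treating an insertion that runs past the end of the read as a non-match is a design choice of this sketch only.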

[0105] Conversely, if the bases of the read sequence are determined at 1108 to match the putative insertion (1108, yes), or if the indel is determined at 1106 not to be an insertion (e.g., it is a deletion), the process proceeds by determining a new CIGAR string and start position based on the adjusted position map (1110). It then returns the resulting realignment with the added indel (1112) and terminates.
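To make step 1110 concrete, the sketch below builds a CIGAR string for a flattened, left-anchored read after a single indel is introduced. It is an assumption-laden simplification: the function name `cigar_with_indel` is hypothetical, only one indel is handled, the start position is taken to be unchanged under left anchoring, and the position map is reduced to simple arithmetic.

```python
def cigar_with_indel(read_len, start, indel_pos, ref, alt):
    """Return the CIGAR after adding one indel to an all-match read, or None.

    Assumptions: left anchoring, VCF-style alleles, and indel_pos is the
    reference position of the last base shared by ref and alt.
    """
    if len(ref) == len(alt):
        raise ValueError("not an indel")
    # Read bases matched up to and including the base at indel_pos.
    lead = indel_pos - start + 1
    # The indel cannot be added off the read or at its final position.
    if lead < 1 or lead >= read_len:
        return None
    if len(alt) > len(ref):
        # Insertion: consumes read bases but no reference bases.
        n = min(len(alt) - len(ref), read_len - lead)
        ops = [(lead, "M"), (n, "I"), (read_len - lead - n, "M")]
    else:
        # Deletion: consumes reference bases but no read bases.
        n = len(ref) - len(alt)
        ops = [(lead, "M"), (n, "D"), (read_len - lead, "M")]
    return "".join(f"{length}{op}" for length, op in ops if length > 0)


ins_cigar = cigar_with_indel(6, 10, 12, "C", "CTG")
del_cigar = cigar_with_indel(6, 10, 12, "CAG", "C")
edge = cigar_with_indel(6, 10, 15, "C", "CTG")
```

Returning `None` plays the role of the NULL result at 1114 when the indel falls outside the read or on its final position.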

[0106] The following provides pseudocode for a sample GetBestAlignment routine (corresponding to Figure 8) and for the exemplary subroutine RealignToTargets (corresponding to Figure 9).

[0107] GetBestAlignment is a routine performed on a list of candidate indels to be introduced into a flattened read. In this process, RealignToTargets is performed for each individual candidate indel and for each candidate indel combined with other candidate indels. If at any time introducing a single indel results in no mismatches in the read, the process can exit, and that realignment is taken as the best candidate realignment. Otherwise, the process returns the "best" realignment, as measured by the rules/selection criteria described above, across all evaluated combinations of one to n indels, where n is the maximum number of indels to be introduced.

[0108] Pseudocode for the GetBestAlignment routine:

[0109] Initialize BestResultSoFar to empty;

[0110] For each candidate indel A, in sorted order:

[0111] // Attempt to realign with one indel:

[0112] Perform the RealignToTargets routine to obtain ResultA;

[0113] If ResultA is better than BestResultSoFar, then ResultA becomes BestResultSoFar;

[0114] If BestResultSoFar has 1 indel and 0 mismatches, break and keep it as the best alignment;

[0115] // Attempt to realign with two indels:

[0116] For each additional candidate indel B:

[0117] If indels A and B cannot coexist, skip this pair;

[0118] Perform the RealignToTargets routine to obtain ResultAB;

[0119] If ResultAB is better than BestResultSoFar, then ResultAB becomes BestResultSoFar;

[0120] // Attempt to realign with three indels:

[0121] If configured to try combinations of three, for each additional candidate indel C:

[0122] If indels A, B, and C cannot coexist, skip this group of three;

[0123] If BestResultSoFar has >0 mismatches:

[0124] Perform the RealignToTargets routine to obtain ResultABC;

[0125] If ResultABC is better than BestResultSoFar, then ResultABC becomes BestResultSoFar;

[0126] Return BestResultSoFar;
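The GetBestAlignment pseudocode can be rendered as the following Python sketch. Several parts are assumptions made for illustration: `Result`, `is_better`, `make_realign`, and the toy mismatch tables are hypothetical; the `realign` and `can_coexist` callables stand in for RealignToTargets and the coexistence check; "each additional candidate" is read as "each later candidate in the sorted list"; and `is_better` is a simplified stand-in for the selection criteria.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Result:
    indels: Tuple[str, ...]  # labels of the indels that were introduced
    mismatches: int          # mismatched bases remaining after realignment


def is_better(new: Optional[Result], best: Optional[Result]) -> bool:
    """Simplified stand-in for the selection criteria (an assumption)."""
    if new is None:
        return False
    if best is None:
        return True
    return (new.mismatches, len(new.indels)) < (best.mismatches, len(best.indels))


def get_best_alignment(candidates, realign, can_coexist, try_three=False):
    best = None
    for i, a in enumerate(candidates):
        # Attempt to realign with one indel.
        result_a = realign((a,))
        if is_better(result_a, best):
            best = result_a
        if best is not None and len(best.indels) == 1 and best.mismatches == 0:
            break  # a single indel explains the read with no mismatches
        # Attempt to realign with two indels.
        for j in range(i + 1, len(candidates)):
            b = candidates[j]
            if not can_coexist((a, b)):
                continue
            result_ab = realign((a, b))
            if is_better(result_ab, best):
                best = result_ab
            if not try_three:
                continue
            # Attempt to realign with three indels.
            for c in candidates[j + 1:]:
                if not can_coexist((a, b, c)):
                    continue
                if best is None or best.mismatches > 0:
                    result_abc = realign((a, b, c))
                    if is_better(result_abc, best):
                        best = result_abc
    return best


def make_realign(table, calls):
    """Toy RealignToTargets: looks up a mismatch count, None means failure."""
    def realign(combo):
        calls.append(combo)
        mismatches = table.get(combo)
        return None if mismatches is None else Result(combo, mismatches)
    return realign


calls = []
table = {("d1",): 3, ("d2",): 2, ("d3",): 1, ("d1", "d2"): 1, ("d2", "d3"): 0}
best = get_best_alignment(
    ["d1", "d2", "d3"], make_realign(table, calls), lambda combo: True
)
```

With the toy table, the pair of `d2` and `d3` is returned because it is the only combination with zero mismatches. When the first single indel already yields zero mismatches, the loop exits after a single call to `realign`.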

[0127] Pseudocode for the RealignToTargets routine:

[0128] Given: CombinationIndels, which is, for example, a list of one to three candidate indels to be evaluated in combination:

[0129] // Obtain the left-anchored result:

[0130] Determine the adjusted position: shift the read position to the left by the length of the prefix soft clip (except for N soft clips). If the read begins with an insertion or with a soft clip followed by an insertion, shift the read position to the left by the length of the insertion;

[0131] Create a flattened read with CIGAR, sequence, and position map (except for terminal Ns), assuming all matches. The resulting read will have a CIGAR of "M" for each base in the read (except for terminal Ns);

[0132] Initialize ResultLeftAnchored;

[0133] For each indel X in CombinationIndels, sorted by position in ascending order:

[0134] Perform the AddIndelAndGetResult routine (Figure 11) and modify ResultAlignment (to layer on each successive indel);

[0135] // Obtain the right-anchored result:

[0136] Determine the adjusted position: find the maximum position in the position map and add the number of inserted or soft-clipped bases present at the end of the read. The adjusted start position of the read will be the maximum position minus the read length;

[0137] Create a flattened read with CIGAR, sequence, and position map (except for terminal Ns), assuming all matches. The resulting read will have a CIGAR of "M" for each base in the read (except for terminal Ns);

[0138] Initialize ResultRightAnchored;

[0139] For each indel X in CombinationIndels, sorted by position in descending order:

[0140] Perform the AddIndelAndGetResult routine and modify ResultAlignment (to layer on each successive indel);

[0141] Return the better of ResultLeftAnchored and ResultRightAnchored. In the event of a tie, return ResultLeftAnchored.
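A compact Python sketch of the RealignToTargets control flow follows. It shows only the ordering of indel introduction and the tie-break toward the left-anchored result. The `Indel` tuple, `layer_indels`, the `add_indel` and `mismatches` callables, and the toy alignment representation are all hypothetical; how a failed anchoring is handled (here, the other anchoring's result is returned) is an assumption, since the pseudocode does not specify it.

```python
from typing import NamedTuple


class Indel(NamedTuple):
    pos: int  # reference position of the indel (assumed convention)
    ref: str
    alt: str


def layer_indels(flattened, indels, add_indel):
    """Add indels one at a time, layering each on the previous result."""
    result = flattened
    for indel in indels:
        result = add_indel(result, indel)
        if result is None:  # the indel could not be added
            return None
    return result


def realign_to_targets(combination, flat_left, flat_right, add_indel, mismatches):
    # Left anchoring: introduce indels in ascending position order.
    left = layer_indels(
        flat_left, sorted(combination, key=lambda x: x.pos), add_indel
    )
    # Right anchoring: introduce indels in descending position order.
    right = layer_indels(
        flat_right, sorted(combination, key=lambda x: x.pos, reverse=True), add_indel
    )
    if left is None:
        return right
    if right is None:
        return left
    # Ties go to the left-anchored result.
    return right if mismatches(right) < mismatches(left) else left


# Toy usage: an "alignment" is just (anchor_label, tuple_of_positions_added).
order = []


def toy_add_indel(aln, indel):
    order.append((aln[0], indel.pos))
    return (aln[0], aln[1] + (indel.pos,))


combo = [Indel(30, "A", "AT"), Indel(10, "GC", "G"), Indel(20, "T", "TA")]
tied = realign_to_targets(combo, ("L", ()), ("R", ()), toy_add_indel, lambda aln: 0)
right_wins = realign_to_targets(
    combo, ("L", ()), ("R", ()), toy_add_indel,
    lambda aln: 0 if aln[0] == "R" else 1,
)
```

Passing the two flattened reads in as arguments keeps the sketch independent of how the adjusted positions and position maps are computed.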

[0142] The various aspects described herein can be used to adjust and improve sequencing data alignments from the output of an initial alignment tool. The alignment tool can output an initial sequence alignment dataset, which is provided as input to software configured to perform the various aspects described herein. The software outputs a sequence alignment dataset containing realignments of one or more initial alignments.

[0143] The following compares indel realignment according to aspects described herein (hereinafter "Realigner") with indel realignment by the GATK indel realignment tool (offered by the Eli and Edythe L. Broad Institute of MIT and Harvard ("Broad Institute"), Cambridge, Massachusetts, USA).

[0144] Realigner is distinguished in that it can accurately realign reads around observed variants and can do so in less time than existing methods. To demonstrate this, Realigner is compared, on both accuracy and run time, with perhaps the best-known local indel realignment tool in the bioinformatics community, the GATK indel realignment tool (see, e.g., DePristo, M., Banks, E., Poplin, R., Garimella, K., Maguire, J., and Hartl, C., A framework for variation discovery and genotyping using next-generation DNA sequencing data, Nature Genetics, 43(5), 491-498 (2011)).

[0145] Sensitivity and specificity on simulated variant data:

[0146] Method

[0147] To evaluate sensitivity, the following experiment was conducted:

[0148] 1. Simulate FASTQ files, each with a single variant, for 200 insertions and 200 deletions 4-25 bp in length (400 simulated FASTQ files in total).

[0149] 2. Align the simulated FASTQ files using the iSAAC alignment tool (offered by Illumina, Inc., San Diego, CA, USA). Two conditions were evaluated: with and without "priors". Providing iSAAC with a list of priors allows indels in that list to be favored over a series of mismatches at the stated location.

[0150] 3. For each of the above conditions (with or without priors), realign using Realigner, realign using GATK, or perform no realignment.

[0151] 4. Call variants using the Pisces variant caller offered by Illumina, Inc.

[0152] 5. Evaluate the sensitivity and specificity of the called variants.

[0153] Samples used in the analysis:

[0154] 200 insertions and 200 deletions were randomly selected from a library of approximately 2000 medium-length (4-25 bp) insertions and deletions. Figure 12 depicts the distribution of variant lengths used in the simulation analysis in accordance with aspects described herein.

[0155] Evaluation of called variants

[0156] The goal is to have exactly one called variant per simulated sample. To evaluate the sensitivity and specificity of the results, all called variants are extracted from the VCF (yielding zero or more variants, of which zero or one will match the expected variant). The resulting variants are compared with the expected "truth" variant to obtain one of the results listed in Figure 13, which describes the possible outcomes of the evaluation against truth variants in accordance with aspects described herein.

[0157] Results

[0158] Using priors in the initial iSAAC alignment improved sensitivity across all conditions. Without realignment, 48.5% of variants were successfully called with no false positives. With GATK realignment this rose to 48.8%, while Realigner achieved 75.3%. In all cases, if a variant was correctly called and passed, no other variants passed. In some instances, Realigner used with priors produced fewer false negatives and fewer false positives than GATK realignment.

[0159] Figure 14 depicts the true and false positive rates for simulated BAMs generated by iSAAC using priors, with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein. Figure 15 depicts the true and false positive rates for simulated BAMs generated by iSAAC using priors, with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein. It should be noted that these results are based on the specific expected representation of the indel, which may not always be a left-aligned representation. The GATK representation of an indel will always be left-aligned, while Realigner maintains fidelity to the original representation of the indel that it sees in the input BAM.

[0160] Specificity for normal FFPE samples:

[0161] Method

[0162] To evaluate specificity for real-world samples, normal (non-disease) samples were used. To fully challenge the realignment tool, FFPE samples were used, which typically have poor DNA quality, resulting in a large number of low-frequency "noisy" variants. Especially for the Realigner, each of these low-frequency variants represents an opportunity to introduce spurious variants.

[0163] Because these are normal, non-cancerous samples, all true variants are assumed to occur at diploid frequencies (approximately 50% for heterozygous variants and approximately 100% for homozygous variants). Therefore, any call falling within the "somatic" range (<20% VAF) can be considered a false positive. Furthermore, all else being equal, the lower the resulting somatic mutation count, the more accurate the realignment method can be considered.
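The false positive proxy described above reduces to counting calls whose variant allele frequency (VAF) falls below the 20% threshold. The sketch below illustrates that count; the function name and the example VAF values are hypothetical and are not data from the study.

```python
def count_presumed_false_positives(vafs, somatic_threshold=0.20):
    """Count calls in the 'somatic' range, presumed false in normal samples."""
    return sum(1 for vaf in vafs if vaf < somatic_threshold)


# Hypothetical VAFs for one normal sample (illustrative values only).
example_vafs = [0.03, 0.48, 0.97, 0.12, 0.51]
fp_count = count_presumed_false_positives(example_vafs)
```

In this toy sample, the calls at 3% and 12% VAF are counted, while the calls near 50% and 100% are treated as germline heterozygous or homozygous variants.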

[0164] The following experiments were conducted:

[0165] 1. Run the iSAAC alignment tool using a priors VCF containing targeted variants from the Catalogue of Somatic Mutations in Cancer (COSMIC) online database.

[0166] 2. Realign the BAM files using Realigner or GATK.

[0167] 3. Call variants using the Pisces variant caller.

[0168] 4. Assess the somatic mutation rate.

[0169] The analysis was performed on 20 normal FFPE samples prepared and sequenced using the TruSight Tumor 170 assay offered by Illumina, Inc., and processed through the TruSight Tumor 170 informatics pipeline up to the alignment step.

[0170] Results

[0171] Compared with no realignment or GATK realignment, Realigner consistently showed a lower somatic mutation rate (representing the false positive rate in non-cancer samples); of the twenty samples, Realigner had more false positives (FP) than GATK in only three, and all three were very close. Realigner also appeared to call deletions more aggressively than no realignment or GATK realignment (see Figure 16). In general, indel realignment significantly reduces false positives, especially with Realigner.

[0172] Figure 16 depicts the total per-sample somatic mutation count (representing the false positive count in non-cancer samples) for samples with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein. Figure 17 depicts the per-sample somatic mutation count (representing the false positive count in non-cancer samples), broken down by mutation type, for samples with no realignment, GATK realignment, or realignment according to aspects of the realignment methods disclosed herein.

[0173] Runtime evaluation:

[0174] Method

[0175] The evaluation was performed on the same 20 samples used for the normal FFPE evaluation and covers the computation time from the input BAM to the realigned output BAM. Each input BAM file contained approximately 60 million reads.

[0176] Results

[0177] In all cases, Realigner was significantly faster than GATK on medium-sized BAMs. Figure 18 depicts the realignment time per million alignments for GATK and for the realignment methods according to aspects described herein. On the test computer system, the realignment time per million alignments ranged from approximately 1.5 to 5 minutes for GATK, while Realigner consistently remained below 10 seconds per million alignments.
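As a rough worked example, the per-million figures above can be scaled to a BAM of approximately 60 million reads. This is back-of-the-envelope arithmetic only: it assumes one alignment per read and that the quoted per-million rates hold uniformly across the file, neither of which is stated in the text. The variable names are illustrative.

```python
reads_millions = 60  # approximate reads per input BAM, from the text

# GATK: approximately 1.5 to 5 minutes per million alignments.
gatk_minutes_low = 1.5 * reads_millions
gatk_minutes_high = 5 * reads_millions

# Realigner: below 10 seconds per million alignments (upper bound).
realigner_minutes_max = 10 * reads_millions / 60

# Lower bound on the speedup implied by the quoted per-million rates.
min_speedup = (1.5 * 60) / 10
```

Under these assumptions, GATK would take roughly 90 to 300 minutes per BAM and Realigner under 10 minutes, a speedup of at least about ninefold.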

[0178] Realigner is a fast and accurate indel realignment algorithm that preserves the fidelity of existing indel representations. It relies on the presence of existing signal in the input sequence alignment dataset to realign around indels. In the examples above, Realigner performs particularly well when used with BAM files generated by iSAAC with priors, as this maximizes the probability that the input BAM contains at least one read with the indel.

[0179] The expected gold standard for local realignment would involve a pileup-based approach with consensus generation and local realignment against the consensus. However, consensus-based solutions have proven expensive in time and computation. In contrast, Realigner processes each read individually, using information about the positions of indels observed nearby, to achieve a simpler candidate-based approach.

[0180] Accordingly, processes for sequence alignment are described herein. Figure 19 depicts an exemplary process for sequence alignment processing in accordance with aspects described herein. The processing of Figure 19 can be performed by one or more computer systems. In a particular instance, software running on the computer system opens the input sequence alignment dataset file and reads its contents, which, as an example, contain a binary representation of alignments of read sequences to a reference sequence. The process begins by determining whether there is a next initial alignment to be processed (1902). If not, the process ends. If there is, the process continues by obtaining, from the sequence alignment dataset, an initial alignment of a read sequence with the reference sequence (if not already read into memory) (1904). This initial alignment is then processed. First, the process determines whether the obtained initial alignment is eligible for realignment (1906). If not, the process provides the initial alignment as is, without realignment processing (1908). Otherwise, the process continues by performing realignment processing on the initial alignment (1910), which realigns the read sequence to the reference sequence. An exemplary realignment process is described and depicted with reference to Figure 20. As part of this processing, one or more candidate realignments are generated. The process of Figure 19 then provides, based on one or more selection criteria, either the initial alignment or a selected candidate realignment of the one or more candidate realignments (1912).

[0181] The selection criteria may be based at least in part on: the number of mismatched bases, the number of indels, the positions of the indels relative to the reference genomic sequence as indicated by the sequence alignment dataset, and/or the number of soft-clipped bases. In some instances, the selection criteria are prioritized such that, for the providing: (i) an alignment with no indels and only one mismatched base is preferred over an alignment with one or more indels; (ii) an alignment with fewer mismatched bases is preferred over an alignment with more mismatched bases; (iii) among different alignments with the same number of mismatched bases, an alignment with fewer soft-clipped bases of a particular type is preferred over an alignment with more soft-clipped bases of that type; and/or (iv) among different alignments with the same number of mismatched bases, an alignment with fewer indels is preferred over an alignment with more indels.
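One way to read these prioritized criteria is as a lexicographic sort key, sketched below. The `Alignment` fields, the function names, and the exact ordering of the soft-clip and indel tie-breakers are assumptions; the text lists the criteria with "and/or" and does not fix a single ordering. The sketch also treats an indel-free alignment with zero or one mismatch as favored, which slightly generalizes criterion (i).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    name: str
    mismatches: int
    indels: int
    soft_clipped: int


def selection_key(aln):
    """Smaller keys are preferred (an assumed ordering of the criteria)."""
    # (i) Prefer an indel-free alignment with at most one mismatch.
    favored = aln.indels == 0 and aln.mismatches <= 1
    # (ii) fewer mismatches, (iii) fewer soft-clipped bases, (iv) fewer indels.
    return (0 if favored else 1, aln.mismatches, aln.soft_clipped, aln.indels)


def select(alignments):
    return min(alignments, key=selection_key)


keeps_initial = select([
    Alignment("initial", mismatches=1, indels=0, soft_clipped=0),
    Alignment("realigned", mismatches=0, indels=1, soft_clipped=0),
])
picks_realigned = select([
    Alignment("initial", mismatches=3, indels=0, soft_clipped=0),
    Alignment("r1", mismatches=1, indels=1, soft_clipped=2),
    Alignment("r2", mismatches=1, indels=2, soft_clipped=0),
])
```

The first case keeps the initial alignment because a lone mismatch is preferred over introducing an indel. The second picks `r2`, since both realignments tie on mismatches and `r2` has fewer soft-clipped bases.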

[0182] Referring again to Figure 19, after the appropriate alignment is provided (1908, 1912), the process repeats by returning to 1902. This can be repeated for each additional initial alignment of several initial alignments (e.g., several initial alignments that have been cleared for processing). That is, for each additional initial alignment of one or more additional initial alignments in the sequence alignment dataset, the process repeats the obtaining and the determining of whether the obtained additional initial alignment is eligible for realignment.

[0183] Figure 20 depicts an exemplary process for realignment processing in accordance with aspects described herein. The processing of Figure 20 can be performed by one or more computer systems. The process begins by identifying one or more candidate indels (2002). The candidate indels can be any indels in the aligned read and, possibly, other indels aligned near or proximal to the aligned read. There may be zero or more indels indicated in the initial read alignment and zero or more in the vicinity of the aligned read, so the candidate indels may include zero or more indels in the aligned read and zero or more indels aligned proximal to the aligned read, as indicated in the sequence alignment dataset. Additionally and optionally, a reference indel dataset may be used to supply one or more indels to the set of candidate indels for introduction.

[0184] The process of Figure 20 then prioritizes the candidate indels (2004). The prioritization may use any desired method to prioritize or rank the candidate indels. For example, the prioritization ranks indels indicated as expected by a reference indel dataset above indels not indicated as expected by the reference dataset. Alternatively, the prioritization ranks longer indels above shorter ones. Alternatively, the prioritization ranks indels indicated by a larger number of aligned reads in the sequence alignment dataset above indels indicated by a smaller number of aligned reads. Alternatively, among different indels indicated by the same number of aligned reads in the sequence alignment dataset, the prioritization ranks higher an indel whose position in the reference genomic sequence, as indicated by the sequence alignment dataset, is upstream of the position of another indel.
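The prioritization rules can be combined into a single sort key, as in the sketch below. Combining them in this particular order (known in a reference dataset, then longer, then better supported, then upstream) is an assumption made for illustration; the text presents the rules as alternatives and does not prescribe how they interact. The `Candidate` fields and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    pos: int       # position in the reference genomic sequence
    length: int    # indel length in bases
    support: int   # number of aligned reads indicating the indel
    known: bool    # present in a reference indel dataset


def prioritize(candidates):
    """Sort candidates from highest to lowest priority (assumed ordering)."""
    return sorted(
        candidates,
        key=lambda c: (not c.known, -c.length, -c.support, c.pos),
    )


candidates = [
    Candidate("A", pos=200, length=3, support=5, known=False),
    Candidate("B", pos=150, length=3, support=5, known=False),
    Candidate("C", pos=180, length=8, support=1, known=False),
    Candidate("D", pos=300, length=1, support=2, known=True),
    Candidate("E", pos=120, length=3, support=9, known=False),
]
ranked_names = [c.name for c in prioritize(candidates)]
```

Here `D` ranks first because it is in the reference dataset, `C` follows on length, `E` on read support, and `B` precedes `A` because it lies upstream with equal support.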

[0185] The process of Figure 20 creates a flattened aligned read based at least on removing from the aligned read any indels indicated by the initial alignment (2006), and then determines one or more candidate realignments of the read sequence to the reference sequence (2008). Each candidate realignment is determined by introducing at least one corresponding candidate indel of the one or more candidate indels into the flattened aligned read. The one or more candidate indels may comprise a plurality of candidate indels, and determining the one or more candidate realignments may include iteratively introducing the plurality of candidate indels into the flattened aligned read, with each iteration providing one candidate realignment. The iterative introduction may follow the order of priority of the plurality of indels.

[0186] The iterative introduction may introduce permutations of one or more candidate indels of the plurality of candidate indels into the flattened aligned read, to obtain, for each permutation, a different candidate realignment of the one or more candidate realignments.

[0187] The realignment processing (Figure 20) then selects, based on selection criteria, the best candidate realignment from the one or more candidate realignments (2010). The criteria used for this selection may be the same as or different from the criteria used in Figure 19 to select between the initial alignment and the best candidate realignment. Thus, the selection of the best candidate realignment can be based on first criteria of the one or more selection criteria, where the selected candidate realignment is the selected best candidate realignment, and where the providing (Figure 19, 1912) selects between the initial alignment and the best candidate realignment based on second criteria of the one or more selection criteria.

[0188] Selecting the best candidate realignment may include checking a given candidate realignment to determine whether its aligned read (i.e., the aligned read with one or more corresponding candidate indels introduced) matches the reference sequence, with no mismatched bases between that aligned read and the reference sequence. Based on determining that the aligned read of the given candidate realignment matches the reference sequence with no mismatched bases, the iterative introduction of candidate indels into the flattened aligned read may cease, and that candidate realignment may be provided as the selected candidate realignment (2010). In these cases, the providing (Figure 19, 1912) can output the selected candidate realignment based on its aligned read matching the reference sequence.

[0189] Figure 21 depicts an exemplary process for determining whether an initial alignment is eligible to undergo realignment processing, in accordance with aspects described herein. This eligibility determination is performed at 1906 of Figure 19. The processing of Figure 21 can be performed by one or more computer systems. The process begins by determining whether there are any (e.g., one or more) mismatched bases between the aligned read of the initial alignment and the reference sequence, or whether the aligned read contains a soft clip (2102). If neither is true, the process determines that the alignment is not eligible for realignment (2108). Otherwise, one or more mismatched bases and/or a soft clip are present, and the process continues by determining whether the alignment is a secondary alignment (2104). In one instance, whether an alignment is a secondary alignment is indicated in the sequence alignment dataset. If the alignment is a secondary alignment, the process determines that it is not eligible for realignment (2108). Otherwise, the process continues by determining whether the sequence alignment dataset indicates any candidate indels in a region of bases of the reference genomic sequence around the aligned read (2106). If there are none, the process determines that the alignment is not eligible for realignment (2108). Otherwise, the process determines that the initial alignment is eligible for realignment processing (2110), and the process ends.

[0190] The criteria shown in the example of Figure 21 are just some of the possible criteria for determining whether an alignment is eligible for realignment processing. The same or other criteria may be used alone or in combination with one or more other criteria.
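The three checks of Figure 21 amount to a short predicate, sketched below. The function and parameter names are hypothetical, and the inputs are assumed to have been derived already from the sequence alignment dataset (for example, the secondary-alignment indicator and the count of nearby candidate indels).

```python
def is_eligible_for_realignment(
    mismatches, has_soft_clip, is_secondary, nearby_candidate_indels
):
    """Apply the checks at 2102, 2104, and 2106 in order."""
    # 2102: nothing to improve without a mismatch or a soft clip.
    if mismatches == 0 and not has_soft_clip:
        return False
    # 2104: secondary alignments are not realigned.
    if is_secondary:
        return False
    # 2106: realignment needs at least one candidate indel nearby.
    if nearby_candidate_indels == 0:
        return False
    return True


eligible = is_eligible_for_realignment(2, False, False, 1)
clean_read = is_eligible_for_realignment(0, False, False, 3)
```

A read with mismatches, a primary alignment, and a nearby candidate indel is eligible, while a read with no mismatches and no soft clip is passed through unchanged.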

[0191] The processes described herein can be performed individually or jointly by one or more computer systems. Figure 22 depicts an example of such a computer system and associated devices that incorporate and/or use aspects described herein. A computer system may also be referred to herein as a data processing device/system or a computing device/system/node, or simply as a computer. The computer system 2200 of Figure 22 can be implemented as one or more of the following: a personal computer system, server computer system, thin client, thick client, handheld or laptop device, mobile device, multiprocessor system, microprocessor-based system, set-top box, programmable consumer electronics product, network PC, minicomputer system, mainframe computer system, and/or a distributed cloud computing environment containing any of the above systems or devices.

[0192] System 2200 includes one or more processors or processing units 2250 and memory 2252, which includes volatile memory 2254 (e.g., random access memory, RAM) and non-volatile memory 2256. Memory 2252 may further include removable/non-removable, volatile/non-volatile computer system storage media. Furthermore, memory 2252 may include one or more readers for reading from and writing to non-removable, non-volatile magnetic media (e.g., hard disk drives), disk drives for reading from and writing to removable non-volatile disks, and/or optical disc drives for reading from or writing to removable non-volatile optical discs (e.g., CD-ROM, DVD-ROM). System 2200 may also include various computer-readable tangible storage media. These media can be any available media, such as volatile and non-volatile media, as well as removable and non-removable media.

[0193] Memory 2252 may contain at least one program product having a group of program modules (e.g., at least one) implemented as executable instructions that, when executed, perform the functions described herein. Executable instructions 2258 may include an operating system, one or more application programs, other program modules, and program data or other types of software. Typically, program modules may contain routines, programs, objects, components, logic, data structures, etc., that perform a specific task or implement a specific abstract data type. Program modules may perform the functions, procedures, methods, etc., described herein, including but not limited to re-alignment of sequencing data reads.

[0194] The components of the computer system 2200 can be connected via an internal bus 2260, which can be implemented as any one or more of several types of bus architectures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of the various bus architectures.

[0195] Computer system 2200 can also communicate with one or more external devices (e.g., a keyboard, a pointing device, a display 2262, etc.) and/or with any device (e.g., a network interface card, modem, etc.) that enables computer system 2200 to communicate with one or more other computer systems (e.g., servers or other systems hosted in a cloud computing environment). This communication can be performed via I/O interface 2264, which may include a network interface for connection to one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a suitable network adapter.

[0196] Other aspects of using computer systems for sequencing are now described. Figure 23 is a schematic diagram of a sequencing device 2300 that can be used in conjunction with, for example, the cloud computing environment described with reference to Figure 24. The sequencing device 2300 can be implemented according to any sequencing technology, such as those incorporating sequencing-by-synthesis methods or sequencing-by-ligation techniques. Some embodiments may utilize nanopore sequencing, in which target nucleic acid strands, or nucleotides excised from target nucleic acids, pass through a nanopore. Each type of base can be identified by measuring fluctuations in the electrical conductance of the pore as the target nucleic acid or nucleotide passes through it. Other embodiments include detecting protons released during the incorporation of nucleotides into an extension product. For example, sequencing based on the detection of released protons can use electrical detectors and related techniques. Specific embodiments may utilize methods involving real-time monitoring of DNA polymerase activity. Nucleotide incorporation can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides, or with zero-mode waveguides. Other suitable alternative technologies include, for example, fluorescent in situ sequencing (FISSEQ) and massively parallel signature sequencing (MPSS). In specific embodiments, the sequencing device 2300 may be a HiSeq, MiSeq, or HiScanSQ from Illumina, Inc.

[0197] In the depicted embodiment, the sequencing device 2300 includes a separate sample processing unit 2318 and an associated computer system 2320. However, as noted, these can be implemented as a single device. Furthermore, the associated computer 2320 may be local to, or networked with, the sample processing unit 2318 (e.g., provided in a cloud or other remote location). In some embodiments, the computer 2320 may be a cloud computing device located remotely from the sequencing device 2300. That is, the computer 2320 may communicate with the sequencing device 2300 via a cloud computing environment. In the depicted embodiment, a biological sample may be loaded into the sample processing unit 2318 as a sample slide 2370, which is then imaged to generate sequence data. For example, reagents that interact with the biological sample fluoresce at particular wavelengths in response to an excitation beam generated by the imaging module 2372, thereby returning radiation for imaging. For instance, the fluorescent components may be generated by fluorescently labeled nucleic acids that hybridize to complementary molecules of the components, or by fluorescently labeled nucleotides that are incorporated into an oligonucleotide using a polymerase. As those skilled in the art will understand, the wavelength at which the dyes of the sample are excited and the wavelength at which they fluoresce depend on the absorption and emission spectra of the specific dyes. This returned radiation can propagate back through the directing optics. The returned beam can generally be directed toward the detection optics of the imaging module 2372.

[0198] The detection optics of the imaging module can be based on any suitable technology and can be, for example, a charge-coupled device (CCD) sensor that generates pixelated image data based on photons striking locations in the device. However, it should be understood that any of a variety of other detectors can also be used, including, but not limited to, detector arrays configured for time-delay integration (TDI) operation, complementary metal-oxide-semiconductor (CMOS) detectors, avalanche photodiode (APD) detectors, Geiger-mode photon counters, or any other suitable detector. TDI-mode detection can be combined with line scanning. Other useful detectors are described, for example, in the references provided earlier in this document in the context of various nucleic acid sequencing methods.

[0199] The imaging module 2372 can be under processor control (e.g., via processor 2374), and the sample processing unit 2318 may also include I / O control 2376, an internal bus 2378, non-volatile memory 2380, RAM 2382, and any other memory structure such that the memory can store executable instructions, as well as other suitable hardware components that may be similar to those described above with reference to Figure 22. Furthermore, the associated computer 2320 may also include a processor 2384, I / O control 2386, a communication module 2387, and a memory architecture including RAM 2388 and non-volatile memory 2390, such that the memory architecture can store executable instructions 2392. The hardware components can be connected via an internal bus 2394, which can also be connected to a display 2396. In embodiments where the sequencing device is implemented as an integrated device, certain redundant hardware elements can be eliminated.

[0200] Turning now to Figure 24, the diagram illustrates a cloud computing environment 2410 for biological data. As used herein, the term "cloud" or "cloud computing environment" can refer to various evolving arrangements, infrastructures, networks, etc., typically based on the Internet. The term can refer to any type of cloud, including client clouds, application clouds, platform clouds, infrastructure clouds, server clouds, etc. As those skilled in the art will understand, these arrangements will typically allow use by the owner or user of a sequencing device, providing Software as a Service (SaaS), various aspects of Platform as a Service (PaaS), various types of Infrastructure as a Service (IaaS), etc. Furthermore, the term should encompass the various types and business arrangements of these products and services, including public clouds, community clouds, hybrid clouds, and private clouds. Any or all of these can be provided by third-party entities. However, in some embodiments, a private cloud or hybrid cloud may allow the sharing of sequence data and services among authorized users.

[0201] Cloud facility 2412 comprises multiple computer systems / nodes 2414. The computing resources of nodes 2414 can be pooled to serve multiple consumers, where different physical and virtual resources are dynamically allocated and reallocated based on consumer demand. Examples of resources include storage, processing power, memory, network bandwidth, and virtual machines. Nodes 2414 can communicate with each other to allocate resources, and this communication and the management of resource allocation can be controlled by a cloud management module residing in one or more nodes 2414. Nodes 2414 can communicate via any suitable arrangement and protocol. Furthermore, nodes 2414 may include servers associated with one or more vendors. For example, certain programs or software platforms may be accessed via a set of nodes 2414 provided by the program owner, while other nodes 2414 are provided by a data storage company. Some nodes 2414 may also be overflow nodes used during periods of high load.

[0202] In one embodiment, the cloud management module is responsible for load management and cloud resources. Load management can take various factors into account, including user access levels and / or total load in the cloud computing environment (peak versus average load times). Project type can also be considered. In one embodiment, public health emergencies can be prioritized over other types of projects. Furthermore, users can manage costs by submitting certain runs at lower priority, in which case those runs are held until cloud utilization falls below a certain threshold.
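
By way of non-limiting illustration only, the load management described above can be sketched as a priority queue. The class name `CloudScheduler`, the priority levels in `PRIORITY`, and the parameter `low_priority_threshold` are hypothetical and are not taken from the disclosure; the sketch only assumes that emergency projects run first and that low-priority runs are held while utilization is at or above a threshold.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority levels (lower value runs first).
PRIORITY = {"public_health_emergency": 0, "standard": 1, "low": 2}


@dataclass(order=True)
class Job:
    priority: int
    seq: int                      # submission order, breaks ties within a priority
    name: str = field(compare=False)


class CloudScheduler:
    def __init__(self, low_priority_threshold=0.5):
        self._heap = []
        self._counter = count()
        self.low_priority_threshold = low_priority_threshold

    def submit(self, name, project_type="standard"):
        job = Job(PRIORITY[project_type], next(self._counter), name)
        heapq.heappush(self._heap, job)

    def next_job(self, utilization):
        """Return the name of the next run to start, or None if none is eligible."""
        if not self._heap:
            return None
        job = self._heap[0]
        # Low-priority runs are held until utilization falls below the threshold.
        if job.priority == PRIORITY["low"] and utilization >= self.low_priority_threshold:
            return None
        return heapq.heappop(self._heap).name
```

Because the heap is ordered by priority, a held low-priority run at the top of the heap implies that only low-priority runs remain, so returning `None` in that case does not starve any higher-priority run.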

[0203] Cloud facility 2412 is configured to communicate with various users (e.g., user computer systems) for generating biological data. This data may include sequence data generated via sequencing device 2416, which in certain embodiments may include sequencing device 2418 (which includes modules for receiving biological samples and generating sequence data) and associated computer 2420 (which includes executable instructions for analyzing the sequence data or transmitting it to cloud facility 2412). It should be understood that in some embodiments, sequencing device 2416 may also be implemented as an integrated device. Sequencing device 2416 is configured to communicate with cloud facility 2412 via a suitable communication link 2424. Communication with cloud facility 2412 via communication link 2424 may take place over a local area network (LAN), a general wide area network (WAN), and / or a public network (e.g., the Internet). In particular, communication link 2424 transmits sequence data 2426 and, in some embodiments, authentication information 2428 to cloud facility 2412. The authentication information confirms that sequencing device 2416 is a client of cloud facility 2412.

[0204] As described above, cloud facility 2412 can serve multiple users or clients with associated devices (e.g., devices 2416a, 2416b, and 2416c). Furthermore, cloud facility 2412 can also be accessed by other types of clients, such as secondary users 2430 or third-party software holders. Therefore, cloud facility 2412 can provide different types of services depending on the access level of a particular client. Sequencing clients can access storage and data analysis services, while secondary users 2430 can access only shared or public sequences. Third-party software holders can negotiate with sequencing clients to determine appropriate access permissions. For example, open-source software can be provided free of charge or under a limited license, while other types of software can be provided on various fee or subscription bases.

[0205] Furthermore, primary users (or secondary users) can also interact with cloud facility 2412 via any suitable access device (e.g., a mobile device or other computer system containing components similar to those described with respect to computer 2420). That is, once sequence data has been transferred to cloud facility 2412, further interaction with and access to the sequence data need not be tied to sequencing device 2416. Such embodiments may be advantageous where the owner of the biological sample and / or sequence data has contracted out the sequencing (e.g., to a core laboratory facility). In these embodiments, the primary user can be the owner, while the core laboratory facility associated with sequencing device 2416 is at most a secondary user after the sequence data has been transferred to cloud facility 2412. In some embodiments, access to the sequence data can be controlled via security parameters (e.g., password-protected client accounts in cloud facility 2412 or association with a specific institution or IP address). The sequence data can be accessed by downloading one or more files from cloud facility 2412 or by logging into a web-based interface or software program that provides a graphical user display in which the sequence data is depicted as text, images, and / or hyperlinks. In such an embodiment, the sequence data can be provided to primary or secondary users in the form of data packets transmitted via a communication link or network.

[0206] Cloud facility 2412 can run user-interactive software (e.g., via a web-based interface or application platform) that provides a graphical user interface and facilitates access to sequence data, as well as user selection of researcher groups, data analysis programs, available third-party software, load balancing, and instrument settings. For example, in a particular embodiment, settings for sequencing runs on sequencing device 2416 can be configured via cloud facility 2412. Therefore, cloud facility 2412 and an individual sequencing device 2416 can communicate bidirectionally. This embodiment may be particularly useful for controlling parameters of remote sequencing runs.

[0207] As examples, the results of sequencing runs and various analyses can be stored as FASTQ files, binary alignment files (*.bam), *.bcl, *.vcf, and / or *.csv files. Output files can be in formats compatible with software for viewing, modifying, annotating, manipulating, aligning, and realigning sequence data. Therefore, the accessible sequence alignment datasets provided herein can be in the form of raw data, partially processed or manipulated data, and / or data files compatible with specific software programs. In this regard, as examples, a computer system (e.g., the computer system of a sequencing facility, a computer system communicating with a sequencing facility, or a cloud facility computer system) can obtain a bam file or other sequence alignment dataset and process it by, for example, reading its data and performing operations to carry out the various aspects described herein. The computer system can then output a file containing sequence alignment data (e.g., another bam file). Furthermore, the output file can be compatible with other data sharing platforms or third-party software.
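
By way of non-limiting illustration only, the read-process-write flow described above might be sketched as follows. Because bam files are compressed binary files that would ordinarily be read with a dedicated library, this sketch assumes the alignment data is available as tab-separated SAM text instead. The function `process_sam` and its `realign` callback are hypothetical names introduced here for illustration.

```python
def process_sam(lines, realign):
    """Yield SAM lines: headers pass through, records get a new POS and CIGAR.

    `realign` is a hypothetical callback that receives a dict describing one
    alignment record and returns a (pos, cigar) pair for the output record.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):       # header lines are copied unchanged
            yield line
            continue
        fields = line.split("\t")
        # Mandatory SAM columns:
        # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
        record = {
            "qname": fields[0],
            "flag": int(fields[1]),
            "rname": fields[2],
            "pos": int(fields[3]),     # 1-based leftmost mapping position
            "cigar": fields[5],
            "seq": fields[9],
        }
        new_pos, new_cigar = realign(record)
        fields[3], fields[5] = str(new_pos), new_cigar
        yield "\t".join(fields)
```

A callback that returns the record's own position and CIGAR string leaves the dataset unchanged, which corresponds to providing the initial alignment as is.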

[0208] Although various embodiments have been described above, these are merely examples. For instance, computing environments with other architectures can incorporate and use one or more embodiments.

[0209] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and / or “comprising”, when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and / or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof.

[0210] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the following claims, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. One or more embodiments have been described for purposes of illustration and description, but the description is not intended to be exhaustive or limited to the forms disclosed. Various modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the various aspects and the practical application, and to enable those skilled in the art to understand the various embodiments with various modifications suited to the particular use contemplated.

Claims

1. A computer-implemented method for re-aligning sequencing data reads, the method comprising: obtaining, from a sequence alignment dataset, an initial alignment of a read sequence with a reference sequence, wherein the initial alignment includes an aligned read; performing a re-alignment process on the initial alignment, wherein the re-alignment process re-aligns the read sequence with the reference sequence to generate one or more candidate re-alignments that differ from the initial alignment, and the re-alignment process includes: identifying one or more candidate insertions / deletions, which include one or more of the following: (i) one or more insertions / deletions in the aligned read and (ii) one or more insertions / deletions aligned proximal to the aligned read, as indicated by the sequence alignment dataset; creating a flattened aligned read based at least on removing from the aligned read any insertions / deletions indicated by the initial alignment; and determining the one or more candidate re-alignments of the read sequence with the reference sequence based on iteratively introducing at least some of the one or more candidate insertions / deletions into the flattened aligned read, wherein each iteration of the iterative introduction provides a candidate re-alignment of the one or more candidate re-alignments by introducing at least one corresponding candidate insertion / deletion of the one or more candidate insertions / deletions into the flattened aligned read; and providing, based on one or more selection criteria, the initial alignment or a selected candidate re-alignment of the one or more candidate re-alignments.
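
By way of non-limiting illustration only, the flattening and introduction steps recited above can be sketched on CIGAR strings. The names `Indel`, `flatten`, and `introduce` are hypothetical. The sketch assumes 0-based reference coordinates, a read anchored at its left end, CIGAR strings containing only M, I, and D operations, and a single insertion / deletion per candidate; none of these simplifications is required by the claim.

```python
import re
from typing import NamedTuple


class Indel(NamedTuple):
    pos: int      # 0-based reference coordinate where the indel begins
    kind: str     # "I" for insertion, "D" for deletion
    length: int


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def flatten(cigar):
    """Remove I and D operations, returning a gap-free CIGAR for the same read bases."""
    ops = CIGAR_OP.findall(cigar)
    if any(op not in "MID" for _, op in ops):
        raise ValueError("this sketch only handles M, I and D operations")
    read_len = sum(int(n) for n, op in ops if op in "MI")  # M and I consume read bases
    return f"{read_len}M"


def introduce(start, read_len, indel):
    """Return a candidate CIGAR with one indel introduced into the flattened read.

    Returns None when the indel cannot be placed inside the left-anchored read.
    """
    left = indel.pos - start               # read bases aligned before the indel
    if not 0 < left < read_len:
        return None
    if indel.kind == "D":
        return f"{left}M{indel.length}D{read_len - left}M"
    right = read_len - left - indel.length
    if right <= 0:
        return None                        # insertion would run past the read end
    return f"{left}M{indel.length}I{right}M"
```

For a ten-base read anchored at reference position 2, introducing a two-base deletion at position 8 gives `6M2D4M`, and introducing a one-base insertion at position 5 gives `3M1I6M`. Each such string represents one candidate re-alignment to be scored against the selection criteria.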

2. The method according to claim 1, wherein the iterative introduction introduces permutations of the one or more candidate insertions / deletions into the flattened aligned read, so as to obtain, for each permutation of the permutations, a different candidate re-alignment of the one or more candidate re-alignments.

3. The method according to claim 2, wherein the re-alignment process further comprises: checking a provided candidate re-alignment of the one or more candidate re-alignments to determine whether the aligned read in the provided candidate re-alignment aligns with the reference sequence with no mismatched bases between the aligned read in the provided candidate re-alignment and the reference sequence; stopping the iterative introduction based on determining that the aligned read in the provided candidate re-alignment aligns with the reference sequence with no mismatched bases; and selecting the provided candidate re-alignment as the selected candidate re-alignment, wherein the providing outputs the selected candidate re-alignment based on the aligned read in the provided candidate re-alignment aligning with the reference sequence.
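
By way of non-limiting illustration only, the iteration over permutations and the early stop on a mismatch-free candidate might be sketched as follows. The function names `mismatches` and `realign` and the parameter `max_indels` are hypothetical. The sketch treats each "permutation" as a subset of the candidates tried in the order given, which is an assumption, and it anchors the read at its left end with 0-based coordinates. Candidates are plain `(pos, kind, length)` tuples.

```python
from itertools import combinations


def mismatches(read, ref, start, indels=()):
    """Count mismatched bases for a left-anchored read with the given indels applied."""
    pending = {pos: (kind, length) for pos, kind, length in indels}
    q, r, mm = 0, start, 0                 # read index, reference index, mismatch count
    while q < len(read):
        if r in pending:
            kind, length = pending.pop(r)
            if kind == "I":
                q += length                # inserted read bases have no reference partner
            else:
                r += length                # deleted reference bases are skipped
            continue
        if r >= len(ref) or read[q] != ref[r]:
            mm += 1
        q += 1
        r += 1
    return mm


def realign(read, ref, start, candidates, max_indels=2):
    """Return (indels, mismatch_count) for the best candidate found."""
    best = ((), mismatches(read, ref, start))      # the flattened read itself
    if best[1] == 0:
        return best
    for k in range(1, max_indels + 1):
        for combo in combinations(candidates, k):
            mm = mismatches(read, ref, start, combo)
            if mm < best[1]:
                best = (combo, mm)
            if mm == 0:
                return best                        # perfect match: stop iterating
    return best


ref = "ACGTACGTTTGACCA"
read = "GTACGTGACC"                                # ref[2:8] + ref[10:14]
print(realign(read, ref, 2, [(5, "I", 1), (8, "D", 2)]))
# prints (((8, 'D', 2),), 0)
```

In this example the flattened read has four mismatches at its right end, and introducing the two-base deletion at position 8 removes all of them, so the iteration stops before any two-indel combination is tried.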

4. The method of claim 1, wherein the re-alignment process further comprises prioritizing the one or more candidate insertions / deletions for the iterative introduction, wherein the iterative introduction is based on the prioritization and introduces at least some of the one or more candidate insertions / deletions in priority order.

5. The method of claim 4, wherein the prioritization includes one or more of the following: insertions / deletions indicated in a reference insertion / deletion dataset as predicted insertions / deletions are prioritized over insertions / deletions not indicated in the reference insertion / deletion dataset as predicted insertions / deletions; longer insertions / deletions are prioritized over shorter insertions / deletions; insertions / deletions indicated in a larger number of aligned reads of the sequence alignment dataset are prioritized over insertions / deletions indicated in a smaller number of aligned reads of the sequence alignment dataset; insertions / deletions indicated in a larger proportion of the aligned reads of the sequence alignment dataset that correspond to the position of the insertion / deletion relative to the reference sequence are prioritized over insertions / deletions indicated in a smaller proportion of those aligned reads; and, among different insertions / deletions indicated in the same number of aligned reads of the sequence alignment dataset, an insertion / deletion whose position relative to the reference genome sequence, as indicated in the sequence alignment dataset, is upstream of the position indicated for another insertion / deletion is prioritized over that other insertion / deletion.
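
By way of non-limiting illustration only, the prioritization recited above can be expressed as a sort key. The names `Candidate`, `prioritize`, and `known` are hypothetical, and applying all five rules together in the order listed is an assumption, since the claim recites "one or more" of them.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    pos: int                # position relative to the reference sequence
    kind: str               # "I" or "D"
    length: int
    supporting_reads: int   # aligned reads that indicate this indel
    depth: int              # aligned reads covering this position


def prioritize(candidates, known=frozenset()):
    """Sort candidates so that higher-priority indels come first.

    `known` is a hypothetical set of (pos, kind, length) tuples standing in
    for a reference insertion / deletion dataset.
    """
    def key(c):
        return (
            (c.pos, c.kind, c.length) not in known,                # known indels first
            -c.length,                                             # longer first
            -c.supporting_reads,                                   # more supporting reads first
            -(c.supporting_reads / c.depth) if c.depth else 0.0,   # larger proportion first
            c.pos,                                                 # upstream first
        )
    return sorted(candidates, key=key)
```

Because Python compares tuples element by element, each later rule acts only as a tie-breaker for the rules before it.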

6. The method of claim 1, wherein the selection criteria are based at least in part on one or more of the following: the number of mismatched bases, the number of insertions / deletions, the position of the insertions / deletions relative to the reference genome sequence indicated by the sequence alignment dataset, and the number of soft-clipped bases.

7. The method according to claim 1, 2, 3, 4 or 6, wherein the selection criteria prioritize one or more of the following: for the providing, an alignment with no insertions / deletions and only one mismatched base over an alignment with one or more insertions / deletions; for the providing, an alignment with fewer mismatched bases over an alignment with more mismatched bases; among different alignments with the same number of mismatched bases, for the providing, an alignment with fewer soft clips of a particular type over an alignment with more soft clips of that particular type; and among different alignments with the same number of mismatched bases, for the providing, an alignment with fewer insertions / deletions over an alignment with more insertions / deletions.
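
By way of non-limiting illustration only, the selection criteria recited above might be combined into a single ordering as sketched below. The names `Alignment`, `selection_key`, and `select` are hypothetical. The first rule is interpreted here as a tier that ranks any indel-free alignment with at most one mismatched base ahead of all other alignments; that interpretation, and the order in which the remaining rules are applied, are assumptions.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str
    mismatches: int
    indels: int
    soft_clipped: int


def selection_key(a):
    """Smaller keys are preferred."""
    # Tier 0: no indels and at most one mismatched base.
    preferred_tier = a.indels == 0 and a.mismatches <= 1
    return (0 if preferred_tier else 1, a.mismatches, a.soft_clipped, a.indels)


def select(alignments):
    """Return the preferred alignment; on ties the earliest in the list wins."""
    return min(alignments, key=selection_key)


initial = Alignment("initial", mismatches=3, indels=0, soft_clipped=0)
one_deletion = Alignment("one_deletion", mismatches=0, indels=1, soft_clipped=0)
two_indels = Alignment("two_indels", mismatches=0, indels=2, soft_clipped=0)
print(select([initial, one_deletion, two_indels]).name)   # prints one_deletion
```

Listing the initial alignment first means it is kept whenever no candidate re-alignment is strictly better under the key.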

8. The method according to claim 1, 2, 3, 4 or 6, wherein the re-alignment process further comprises selecting, based on a first criterion of the one or more selection criteria, a best candidate re-alignment from the one or more candidate re-alignments, wherein the selected candidate re-alignment is the selected best candidate re-alignment, and wherein the providing selects between the initial alignment and the best candidate re-alignment based on a second criterion of the one or more selection criteria.

9. The method according to claim 1, 2, 3, 4 or 6, further comprising determining whether the obtained initial alignment is suitable for re-alignment, the determination being based at least in part on one or more of the following: determining whether there are one or more mismatched bases between the aligned read in the initial alignment and the reference sequence; determining whether the aligned read includes soft clips; determining whether the initial alignment is not a secondary alignment; and determining whether the sequence alignment dataset indicates candidate insertions / deletions within a region of bases of the reference genome sequence surrounding the aligned read.
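
By way of non-limiting illustration only, the suitability determination recited above might be sketched as follows for alignment records in SAM / BAM form, where bit 0x100 of the FLAG field marks a secondary alignment and the CIGAR operation `S` marks soft-clipped bases. The function name `suitable_for_realignment` and the way the four checks are combined are assumptions.

```python
SECONDARY = 0x100   # SAM FLAG bit for a secondary alignment


def suitable_for_realignment(flag, cigar, mismatch_count, nearby_candidates):
    """Decide whether an initial alignment should go through re-alignment.

    Assumed combination of the checks: the alignment must not be secondary,
    at least one candidate indel must lie near the aligned read, and the
    aligned read must show a mismatch or a soft clip.
    """
    if flag & SECONDARY:
        return False
    if not nearby_candidates:
        return False
    return mismatch_count > 0 or "S" in cigar
```

An alignment that fails this check would be provided as is, without performing the re-alignment process.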

10. The method according to claim 1, 2, 3, 4 or 6, further comprising: determining whether the obtained initial alignment is suitable for re-alignment and, based on determining that the obtained initial alignment is suitable for re-alignment, performing the re-alignment process on the initial alignment and the providing of the initial alignment or the selected candidate re-alignment; repeating, for each additional initial alignment of one or more additional initial alignments in the sequence alignment dataset, the obtaining and the determining of whether the obtained additional initial alignment is suitable for re-alignment; and processing each of the one or more additional initial alignments, the processing comprising (i) providing the additional initial alignment as is without performing the re-alignment process, or (ii) performing the re-alignment process and providing the additional initial alignment or a selected candidate re-alignment.

11. A computer system for re-aligning sequencing data reads, the computer system comprising a memory and at least one processor, the computer system being configured to execute program instructions to perform a method comprising: obtaining, from a sequence alignment dataset, an initial alignment of a read sequence with a reference sequence, wherein the initial alignment includes an aligned read; performing a re-alignment process on the initial alignment, wherein the re-alignment process re-aligns the read sequence with the reference sequence to generate one or more candidate re-alignments that differ from the initial alignment, and the re-alignment process includes: identifying one or more candidate insertions / deletions, which include one or more of the following: (i) one or more insertions / deletions in the aligned read and (ii) one or more insertions / deletions aligned proximal to the aligned read, as indicated by the sequence alignment dataset; creating a flattened aligned read based at least on removing from the aligned read any insertions / deletions indicated by the initial alignment; and determining the one or more candidate re-alignments of the read sequence with the reference sequence based on iteratively introducing at least some of the one or more candidate insertions / deletions into the flattened aligned read, wherein each iteration of the iterative introduction provides a candidate re-alignment of the one or more candidate re-alignments by introducing at least one corresponding candidate insertion / deletion of the one or more candidate insertions / deletions into the flattened aligned read; and providing, based on one or more selection criteria, the initial alignment or a selected candidate re-alignment of the one or more candidate re-alignments.

12. The computer system of claim 11, wherein the iterative introduction introduces permutations of the one or more candidate insertions / deletions into the flattened aligned read, so as to obtain, for each permutation of the permutations, a different candidate re-alignment of the one or more candidate re-alignments.

13. The computer system of claim 12, wherein the re-alignment process further comprises: checking a provided candidate re-alignment of the one or more candidate re-alignments to determine whether the aligned read in the provided candidate re-alignment aligns with the reference sequence with no mismatched bases between the aligned read in the provided candidate re-alignment and the reference sequence; stopping the iterative introduction based on determining that the aligned read in the provided candidate re-alignment aligns with the reference sequence with no mismatched bases; and selecting the provided candidate re-alignment as the selected candidate re-alignment, wherein the providing outputs the selected candidate re-alignment based on the aligned read in the provided candidate re-alignment aligning with the reference sequence.

14. The computer system of claim 12 or 13, wherein the re-alignment process further comprises prioritizing the one or more candidate insertions / deletions for the iterative introduction, wherein the iterative introduction is based on the prioritization and introduces at least some of the one or more candidate insertions / deletions in priority order.

15. The computer system of claim 14, wherein the prioritization includes one or more of the following: insertions / deletions indicated in a reference insertion / deletion dataset as predicted insertions / deletions are prioritized over insertions / deletions not indicated in the reference insertion / deletion dataset as predicted insertions / deletions; longer insertions / deletions are prioritized over shorter insertions / deletions; insertions / deletions indicated in a larger number of aligned reads of the sequence alignment dataset are prioritized over insertions / deletions indicated in a smaller number of aligned reads of the sequence alignment dataset; insertions / deletions indicated in a larger proportion of the aligned reads of the sequence alignment dataset that correspond to the position of the insertion / deletion relative to the reference sequence are prioritized over insertions / deletions indicated in a smaller proportion of those aligned reads; and, among different insertions / deletions indicated in the same number of aligned reads of the sequence alignment dataset, an insertion / deletion whose position relative to the reference genome sequence, as indicated in the sequence alignment dataset, is upstream of the position indicated for another insertion / deletion is prioritized over that other insertion / deletion.