Machine learning approaches to general cancer screening in whole-genome sequencing
A machine learning approach using whole-genome sequencing data from NIPT samples enhances cancer detection sensitivity by leveraging a synthetic training dataset, addressing the limitations of existing NIPT methods in identifying cancer-related genomic instability.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- LABORATORY CORPORATION OF AMERICA HOLDINGS INC
- Filing Date
- 2024-05-15
- Publication Date
- 2026-07-01
AI Technical Summary
Existing non-invasive prenatal testing (NIPT) methods for detecting fetal chromosomal abnormalities are limited in their ability to accurately and cost-effectively identify cancer-related genomic instability, particularly in early stages, due to low sensitivity of expert analysis and statistical measures.
A machine learning approach uses whole-genome sequencing data from NIPT samples, combined with a synthetic training dataset derived from cancer patients, to classify samples as negative or positive for cancer, enhancing sensitivity while maintaining a low computational load.
The method improves cancer detection sensitivity beyond traditional methods, providing a cost-effective and accurate means to identify cancer-related genomic instability through machine learning techniques.
Abstract
Description
Technical Field
[0001] Cross-Reference to Related Applications: This application claims priority to and the benefit of U.S. Provisional Application No. 63/502,289, filed on May 15, 2023, the entire content of which is incorporated herein by reference for all purposes.
[0002] This disclosure relates to pan-cancer screening, and specifically, to machine learning techniques for pan-cancer screening in whole-genome sequencing from non-invasive prenatal testing (NIPT) procedures.
Background Art
[0003] Traditional tissue biopsies in prenatal testing procedures involve taking placental samples for analysis in highly invasive procedures such as amniocentesis and chorionic villus sampling (CVS). While the safety of invasive procedures has improved significantly since their introduction, a well-recognized risk of iatrogenic fetal loss remains. In contrast to tissue biopsies from the placenta, a newer approach for prenatal testing procedures has been developed by analyzing fetal DNA in the pregnant woman's bloodstream. Fetuses release DNA into the mother's bloodstream, and clinical testing can therefore detect whether the fetus has a genetic disorder by taking a blood sample from the mother (referred to as a liquid biopsy for non-invasive prenatal testing (NIPT)). More specifically, circulating cell-free (ccf) fetal (ccff) DNA is present in the pregnant woman's plasma. Fetal-origin DNA ranges from 2% to 40%, averaging about 10% of total ccf DNA across various gestational ages. Unlike invasive amniocentesis and CVS, which present a low risk of iatrogenic fetal loss, liquid biopsy is non-invasive and completely safe in this respect. By detecting small amounts of DNA from the placenta in the mother's bloodstream, NIPT can help identify whether there is a high probability of certain chromosomal abnormalities that could affect the baby's health and development. It can also indicate whether the mother is likely to have a boy, a girl, or both. Many NIPT procedures screen for specific chromosomal abnormalities called trisomies. These include trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome), and trisomy 13 (Patau syndrome).
[0004] Next-generation sequencing (NGS) improves the flexibility and results of NIPT procedures, providing a highly sensitive, accurate, and high-throughput platform for large-scale genomic testing. Whole-genome sequencing (WGS) is a comprehensive NGS method for analyzing the entire genome, that is, reading all or virtually all of the 3 billion DNA base pairs that make up the genome by determining the order of the nucleotides (A, C, G, T). The purpose of whole-genome sequencing is typically to look for genetic abnormalities (e.g., single-nucleotide variants, deletions, insertions, and copy number variants). Because the entire genome is sequenced, it is also possible to detect changes in introns, the non-coding segments of DNA within genes. Under normal conditions, introns are removed by RNA splicing during post-transcriptional processing, and changes in these regions may affect whether the DNA is transcribed into RNA or may result in truncated, non-functional proteins.
[0005] An alternative approach is to sequence only the exomes, called whole exome sequencing (WES). Exomes are the parts of the genome formed by exons, or coding regions, which, when transcribed and translated, will be expressed as proteins. Exomes make up only about 2% of the entire genome. Because the genome is so large, exomes can be sequenced to a much greater depth (the number of times a given nucleotide is sequenced) for lower costs. This greater depth provides greater confidence in low-frequency variations. In contrast to WGS and WES, targeted genome sequencing (TS) focuses on a panel of genes or targets known to have a strong association with the pathogenesis and / or clinical relevance of a disease, which means that its depth can be even greater due to reduced cost and data load. This means that targeted sequencing can identify low-frequency variants within a target region with high confidence and is therefore suitable for profiling low-quality and fragmented clinical DNA samples. However, because WES and targeted panels focus on a reduced region of the genome, they only see part of the story. Therefore, WGS may be advantageous for some research projects and genetic testing.
[0006] Due to advances in next-generation sequencing technology, the cost of WGS has decreased significantly in recent years. Nevertheless, the cost of performing high-coverage NIPT using WGS can still be exorbitant. To overcome these challenges, low-coverage WGS (1× to 10×) and ultra-low-coverage WGS (less than 1× coverage) have been developed for NIPT. Coverage (or depth) in nucleic acid sequencing is the number of unique reads containing a given nucleotide in the reconstructed sequence. Low-coverage sequencing refers to the general concept of aiming for a small number of unique reads for each region of the sequence. By sampling across the entire genome at a small depth, it is possible to reliably detect and predict common variants in a sample. Low-coverage and ultra-low-coverage WGS have been demonstrated to accurately assess common genetic mutations. For example, the MaterniT® GENOME WGS test uses relatively low (approximately 0.4×) sequencing coverage, which is sufficient to detect large subchromosomal and whole-chromosomal events.
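As a rough illustration of the coverage figures above, mean coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The following minimal sketch assumes a genome of 3 billion base pairs (the figure cited earlier in this description); the read count, read length, function names, and the "high" tier label are hypothetical and are not taken from any specific assay.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size cited in the text


def mean_coverage(num_reads: int, read_length_bp: int,
                  genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Estimate mean coverage as total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def coverage_tier(coverage: float) -> str:
    """Label coverage with the tiers named in the text (<1x, 1x to 10x).

    The "high" label for anything above 10x is an assumption of this sketch.
    """
    if coverage < 1.0:
        return "ultra-low"
    if coverage <= 10.0:
        return "low"
    return "high"


# Hypothetical run: 32 million unique reads of 36 bp each.
example = mean_coverage(32_000_000, 36)
print(f"{example:.3f}x -> {coverage_tier(example)}")
```

Under these assumed inputs the estimate falls below 1×, which is the ultra-low-coverage regime described above.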
[0007] Next-generation sequencing (NGS) of cell-free DNA (cfDNA) has also been demonstrated to detect tumor-specific copy number alterations in subjects with cancer. Cancer is a disease of the genome, characterized by several types of changes, including point mutations, balanced and unbalanced chromosomal rearrangements, and copy number alterations (CNAs). Researchers have described genomic profiles of tumor tissue obtained from numerous primary and metastatic tumors. In addition, tumor-mediated release of cfDNA into the bloodstream has been described for decades and has recently been utilized in both research and clinical settings to facilitate therapy selection, identify drug resistance, and monitor treatment response by detecting tumor signals through the measurement of genomic instability (e.g., measuring genomic instability numbers (GINs)). GINs are indices intended to capture the autosomal deviations of the entire genome from the empirically derived amount of euploidy in the circulating genome. Depending on the width of the target genomic locus, this liquid biopsy process is typically performed using digital polymerase chain reaction (PCR) or ultra-deep next-generation sequencing (>8,000–10,000× raw coverage). While these assays are useful in guiding therapy through the detection of all types of mutations associated with cancer, cost factors currently require these tests to narrow their content to changes within the target region. In contrast, detection of CNAs via low-coverage and ultra-low-coverage WGS of cfDNA has been validated for NIPT, which is routinely applied in clinical practice. Since tumor aneuploidy (and therefore the presence of CNAs) is a fundamental feature of cancer, NIPT procedures occasionally detect CNAs in patients with either known or unknown neoplasms.
Taking an approach that leverages information from the whole genome, disclosed herein are machine learning techniques that build on the utility of NIPT procedures for detecting and monitoring CNAs in patients with known cancers (pan-cancer screening). These machine learning techniques demonstrate that the coverage levels of low-coverage and ultra-low-coverage WGS are sufficient for tumor screening and, advantageously, can keep the cost of liquid biopsy testing relatively low.
Summary of Invention
[0008] In various embodiments, a computer-aided method is provided that includes: accessing NIPT sequence read data for a first sample group, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile, which includes sequence read counts for each bin associated with a segment of a reference genome; generating a first training data subset and a second training data subset based on the NIPT sequence read data, wherein each example in the second training data subset is classified as negative for cancer; accessing copy number variant data for a second sample group, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events from cancer patients, the genomic events being copy counts for one or more segments that deviate from a reference profile; generating a synthetic training dataset based on the copy number variant data obtained from cancer patients, wherein generating the synthetic training dataset includes generating an empirical population of genomic events characteristic of the copy number variant data obtained from cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores to the bin count profile for the second training data subset, wherein each example in the synthetic training dataset is classified as positive for cancer; generating an augmented training dataset by augmenting the first training data subset with the synthetic training dataset; and training a machine learning model using the augmented training dataset to classify samples as negative for cancer or positive for cancer, wherein the training includes performing iterative operations to find a set of parameters for the machine learning model that minimizes a loss function or error function of the machine learning model, wherein each iteration includes finding a set of parameters for the machine learning model such that the value of the loss function or error function using that set of parameters is smaller than the value of the loss function or error function using a different set of parameters in a previous iteration, and wherein the loss function or error function is configured to measure the difference between (i) the estimated class output for each example in the augmented training dataset and (ii) a label that provides ground truth information for each example in the augmented training dataset, the ground truth information identifying whether an example is classified as negative for cancer or positive for cancer.
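The iterative training described above, in which each iteration finds parameters giving a smaller loss than the previous iteration, can be sketched as follows. This is only an illustrative sketch: the disclosure does not fix the model type at this point, so logistic regression with batch gradient descent and a binary cross-entropy loss is assumed, and the toy feature values, learning rate, and function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, examples, labels) -> float:
    """Mean binary cross-entropy between estimated class outputs and labels."""
    total = 0.0
    for x, y in zip(examples, labels):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(examples)


def train(examples, labels, learning_rate=0.1, iterations=200):
    """Batch gradient descent; returns the parameters and the loss history."""
    weights, bias = [0.0] * len(examples[0]), 0.0
    history = [log_loss(weights, bias, examples, labels)]
    n = len(examples)
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(examples, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        history.append(log_loss(weights, bias, examples, labels))
    return weights, bias, history


# Toy augmented dataset (hypothetical normalized bin deviations).
# Label 0 = negative for cancer, label 1 = synthetic positive for cancer.
negatives = [[0.0, 0.1, -0.1], [0.1, -0.1, 0.0], [-0.1, 0.0, 0.1]]
synthetic_positives = [[1.2, 1.0, 0.1], [0.9, 1.3, -0.1], [1.1, 1.1, 0.0]]
X = negatives + synthetic_positives
y = [0, 0, 0, 1, 1, 1]

weights, bias, history = train(X, y)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

With a sufficiently small learning rate, the recorded loss does not increase from one iteration to the next, which mirrors the per-iteration condition stated in the paragraph above.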
[0009] In some embodiments, generating a first training data subset and a second training data subset involves normalizing the NIPT sequence read data by (i) smoothing the bin count profile using LOESS, (ii) performing population-based correction of the bin count profile using principal component analysis, or (iii) both.
[0010] In some embodiments, the computer-aided method further includes reducing the dimensionality of the augmented training dataset using another principal component analysis, the other principal component analysis including mapping the augmented training dataset to the first k principal components while distinguishing between the negative and positive classes, wherein the mapping reduces data redundancy and reduces the size n of the feature space.
[0011] In some embodiments, the computer-aided method further includes using an oncogene feature selection process to reduce the dimensionality of the augmented training dataset, the oncogene feature selection process comprising generating a subset of a predetermined number of bins that overlap with a predetermined number of oncogenes, and using the bin count profile from the normalized NIPT sequence read data within that subset of bins as new model attributes.
[0012] In some embodiments, the computer-aided method further includes determining an indicator of systematic anomaly for each example of the augmented training dataset, wherein the indicator is a genomic instability number, tumor content, or both, and adding the indicator of systematic anomaly to each example of the augmented training dataset, wherein the machine learning model is trained to classify samples as negative for cancer or positive for cancer using the augmented training dataset with the added indicator of systematic anomaly, and the machine learning model uses the indicator of systematic anomaly as an additional model feature during training to find the parameter set.
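The oncogene feature selection described above amounts to keeping only those bins whose genomic interval overlaps an oncogene. A minimal sketch follows, assuming half-open intervals and a 50 kb bin width; the bin layout, the oncogene coordinates, and all function names are hypothetical placeholders rather than real annotations.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def oncogene_bin_indices(bins: Sequence[Interval],
                         oncogenes: Sequence[Interval]) -> List[int]:
    """Indices of bins that overlap at least one oncogene interval."""
    return [i for i, bin_ in enumerate(bins)
            if any(overlaps(bin_, gene) for gene in oncogenes)]


def select_features(profile: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Reduce a normalized bin count profile to the selected bins."""
    return [profile[i] for i in indices]


# Hypothetical 50 kb bins and placeholder oncogene coordinates.
bins = [("chr1", 0, 50_000), ("chr1", 50_000, 100_000),
        ("chr1", 100_000, 150_000), ("chr2", 0, 50_000)]
oncogenes = [("chr1", 60_000, 110_000), ("chr2", 10_000, 20_000)]
profile = [1.01, 0.97, 1.20, 0.99]  # normalized bin counts for one sample

indices = oncogene_bin_indices(bins, oncogenes)
features = select_features(profile, indices)
print(indices, features)
```

The reduced vector would then serve as the model input in place of the full bin count profile.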
[0013] In some embodiments, the genomic instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
[0014] In some embodiments, the synthetic training dataset presents the same statistical distribution of genomic events as presented in the copy number variant data.
[0015] In various embodiments, a computer-aided method is provided, comprising: accessing NIPT sequence read data for a sample, wherein the sequence read data is generated as part of performing a whole-genome sequencing assay on the sample, and the NIPT sequence read data includes a bin count profile, which includes sequence read counts for each bin associated with a segment of a reference genome; determining, based on the NIPT sequence read data, an indicator of systematic abnormality for the sample, which is either a genomic instability number, tumor content, or both; inputting the NIPT sequence read data and the indicator of systematic abnormality into a machine learning model constructed as a binary classifier; classifying the sample as either negative or positive for cancer using the machine learning model, wherein the machine learning model uses the indicator of systematic abnormality as an additional model feature for predicting the negative or positive class; and outputting the negative or positive class for cancer using the machine learning model.
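The inference flow above, in which the indicator of systematic abnormality is supplied as one more model feature alongside the bin count profile, can be illustrated with the sketch below. It assumes a logistic binary classifier whose parameters were already trained; the weights, bias, threshold, and sample values are hypothetical and chosen only to make the example concrete.

```python
import math
from typing import Sequence, Tuple


def classify(bin_profile: Sequence[float], indicator: float,
             weights: Sequence[float], bias: float,
             threshold: float = 0.5) -> Tuple[str, float]:
    """Append the indicator (e.g., a GIN) to the bin features and classify."""
    features = list(bin_profile) + [indicator]
    if len(features) != len(weights):
        raise ValueError("expected one weight per bin plus one for the indicator")
    score = sum(w * v for w, v in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    label = "positive" if probability >= threshold else "negative"
    return label, probability


# Hypothetical trained parameters: three bin weights plus one indicator weight.
WEIGHTS = [0.8, 0.8, 0.8, 1.5]
BIAS = -2.0

quiet_sample = classify([0.0, 0.1, -0.1], indicator=0.2, weights=WEIGHTS, bias=BIAS)
aberrant_sample = classify([1.0, 1.2, 0.1], indicator=2.3, weights=WEIGHTS, bias=BIAS)
print(quiet_sample[0], aberrant_sample[0])
```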
[0016] In some embodiments, the machine learning model includes a number of parameters trained using an augmented training dataset, the augmented training dataset including an original training dataset containing NIPT sequence read data for a sample group, the NIPT sequence read data being generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data including a bin count profile containing sequence read counts for each bin associated with a segment of the reference genome; and a synthetic training dataset generated based on copy number variant data obtained from cancer patients, the copy number variant data being generated as part of performing a copy number polymorphism assay, and the copy number variant data including information on cancer patient-derived genomic events, where the copy number variant data are copy counts for one or more segments that deviate from the reference profile.
[0017] In some embodiments, the synthetic training dataset is generated by generating an empirical population of genomic events characteristic of copy number variant data obtained from cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores to a bin count profile for a second training data subset to generate the synthetic training dataset, where each example in the synthetic training dataset is classified as positive for cancer.
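One way to picture the synthetic dataset generation above is as resampling observed events and stamping them onto cancer-negative bin count profiles. The sketch below is an assumption-laden illustration, not the disclosed mapping function: it represents each event by a start bin, a length in bins, and a standard score, and it shifts each affected bin by the standard score times a per-bin standard deviation. The `Event` class, `apply_event`, `synthesize`, and all numeric values are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    start_bin: int    # first bin affected by the event
    length_bins: int  # number of consecutive bins affected
    z_score: float    # standard score of the deviation (sign gives gain or loss)


def apply_event(profile: Sequence[float], bin_sd: Sequence[float],
                event: Event) -> List[float]:
    """Shift the affected bins by z_score * per-bin standard deviation."""
    out = list(profile)
    stop = min(event.start_bin + event.length_bins, len(out))
    for i in range(event.start_bin, stop):
        out[i] += event.z_score * bin_sd[i]
    return out


def synthesize(negatives: Sequence[Sequence[float]], bin_sd: Sequence[float],
               empirical_events: Sequence[Event], n_examples: int,
               seed: int = 0) -> List[Tuple[List[float], int]]:
    """Resample observed events onto negative profiles; label each result 1."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        base = rng.choice(negatives)
        event = rng.choice(empirical_events)
        examples.append((apply_event(base, bin_sd, event), 1))
    return examples


# Hypothetical inputs: two negative profiles, a per-bin SD, and two observed events.
negatives = [[1.0, 1.0, 1.0, 1.0, 1.0], [1.02, 0.98, 1.0, 1.01, 0.99]]
bin_sd = [0.1] * 5
events = [Event(1, 2, 3.0), Event(3, 2, -4.0)]
synthetic = synthesize(negatives, bin_sd, events, n_examples=4, seed=1)
```

Because events are drawn directly from the observed population, the resampled events follow the empirical distribution of the source data in expectation.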
[0018] In some embodiments, the genomic instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
[0019] In some embodiments, the computer-aided method further includes generating a report that includes negative or positive classes for cancer and the results of a whole-genome sequencing assay.
[0020] In some embodiments, the machine learning model is deployed to a callable endpoint within a cloud infrastructure.
[0021] In some embodiments, the computer-aided method further includes using the callable endpoint to invoke the machine learning model via an application programming interface.
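Invoking a deployed model through a callable endpoint could look like the following sketch, which only builds the request and parses a response without contacting any server. The endpoint URL, the JSON field names, and the response schema are all hypothetical; the disclosure does not specify an API format.

```python
import json
import urllib.request
from typing import Sequence

# Hypothetical endpoint; ".invalid" is a reserved, non-resolvable domain.
ENDPOINT_URL = "https://example.invalid/pan-cancer/classify"


def build_request(bin_counts: Sequence[float], gin: float,
                  url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Package a sample's features as a JSON POST request."""
    payload = {"bin_counts": list(bin_counts), "genomic_instability_number": gin}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body: bytes) -> str:
    """Extract the class label from an assumed JSON response body."""
    label = json.loads(body)["classification"]
    if label not in ("negative", "positive"):
        raise ValueError(f"unexpected classification: {label!r}")
    return label


request = build_request([1.0, 0.9, 1.1], gin=0.3)
# Sending would be: urllib.request.urlopen(request), omitted here on purpose.
print(request.get_method(), request.full_url)
```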
[0022] In some embodiments, the computer-aided method further includes providing recommendations for whole-genome sequencing assays based on a classification of cancer.
[0023] In some embodiments, the computer-aided method further includes performing a whole-genome sequencing assay on a sample from a subject to obtain analytical results for clinical diagnostic testing.
[0024] In some embodiments, the computer-aided method further includes diagnosing and / or treating a subject based on the results of an analysis of clinical diagnostic tests and / or a classification for cancer.
[0025] In some embodiments, a system is provided that includes one or more data processors and a non-transitory computer-readable medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform some or all of the methods or processes disclosed herein.
[0026] In some embodiments, a computer program product is provided which includes instructions tangibly embodied in a non-transitory machine-readable medium and configured to cause one or more data processors to carry out some or all of the methods disclosed herein.
[0027] The terms and expressions used are used as terms of explanation rather than limitation, and the use of such terms and expressions is not intended to exclude any equivalents of the features shown and described, but it is recognized that various modifications are possible within the scope of the invention as set forth in the claims. Therefore, although the invention is specifically disclosed by embodiments and optional features, modifications and variations of the concepts disclosed herein may be adopted by those skilled in the art, and it should be understood that such modifications and variations are considered to be within the scope of the invention as defined by the appended claims.
Brief Description of Drawings
[0028] The present invention will be better understood in consideration of the following non-limiting drawings.
[0029]
[Figure 1] shows a block diagram of a clinical testing pipeline implementing an NIPT procedure or assay, such as MaterniT® GENOME, for analyzing nucleic acids according to various embodiments.
[Figure 2] shows an overview of a pan-cancer screening approach that receives sequence read data (e.g., WGS sequence read data) processed by a clinical testing pipeline and implements a pan-cancer screening procedure or assay to predict whether a subject has cancer, according to various embodiments.
[Figure 3] shows a block diagram of a system for preprocessing WGS sequence read data for a subject, training a machine learning model using the preprocessed WGS sequence read data, and predicting a cancer classification for the subject from the WGS sequence read data using the trained model, according to various embodiments.
[Figure 4] shows a flowchart of a process for training a machine learning model to predict a cancer classification for a subject, according to various embodiments.
[Figure 5] shows a flowchart of a process for predicting a cancer classification for a subject using a machine learning model, according to various embodiments.
[Figure 6] shows a block diagram of a computing environment in which various systems, methods, algorithms, machine learning models, and data structures can be implemented, according to various embodiments.
[Figure 7] shows a bin count augmentation mapping function, according to various embodiments.
[Figures 8A-8B] show a comparison of event feature distributions, according to various embodiments.
[Figures 9A-9D] show the start location of the first event on a chromosome for both actual and simulated tumor data, according to various embodiments.
[Figures 10A-10B] show the event lengths on chromosomes with one and two events, according to various embodiments.
[Figures 11A-11D] show the event lengths on chromosomes with three or more events, according to various embodiments.
[Figures 12A-12C] show the gaps between events on a single chromosome, according to various embodiments.
[Figures 13A-13D] show the standard deviation of the SPCA, conditional on event length, according to various embodiments.
[Figures 14A-14D] show a comparison of event feature distributions, according to various embodiments.
[Figure 15A] shows machine learning results for all positive and negative MaterniT® GENOME samples, according to various embodiments.
[Figure 15B] shows machine learning results for negative MaterniT® GENOME samples, according to various embodiments.
[0030] In the appended figures, similar components and/or features may have the same reference label. Furthermore, various components of the same type may be distinguished by following the reference label with a dash and a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label, regardless of the second reference label.
Description of Embodiments
[0031] The following description provides only preferred exemplary embodiments and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of preferred exemplary embodiments will provide those skilled in the art with an enabling description for implementing various embodiments. It will be understood that various modifications can be made to the function and arrangement of the elements without departing from the spirit and scope set forth in the appended claims.
[0032] Specific details are given in the following description to provide a complete understanding of the embodiments. However, it will be understood that embodiments may be carried out without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in the form of block diagrams so as not to obscure the embodiments with unnecessary details. In other cases, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary details to avoid obscuring the embodiments.
[0033] Furthermore, it should be noted that individual embodiments may be described as processes depicted as flowcharts, flow diagrams, data flow diagrams, structural diagrams, or block diagrams. While flowcharts or diagrams may describe operation as a sequential process, many of the operations may be performed in parallel or simultaneously. In addition, the order of operations may be rearranged. A process terminates when its operations are complete, but it may have additional steps not shown in the diagram. A process may correspond to a method, function, procedure, subroutine, subprogram, etc. When a process corresponds to a function, its termination may correspond to the function's return to the calling function or the main function.
I. Introduction
[0034] MaterniT® GENOME is one of Sequenom's blood-based NIPT tests that predict fetal genetic syndromes from fetal DNA information. Chromosomes are the way cells transmit genetic information as a baby develops, and extra or missing parts of chromosomes, or changes in entire chromosomes, can have serious impacts on a baby's health. Most NIPTs analyze information from selected chromosomes, and like these NIPTs, MaterniT® GENOME can screen for trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome), and trisomy 13 (Patau syndrome), as well as whether a mother will have a boy or a girl. However, changes can be found in any chromosome, and MaterniT® GENOME was developed to analyze all of them using WGS, displaying both maternal and fetal DNA profiles across the entire human genome. From these NIPTs, Sequenom has found that some mothers tested using MaterniT® GENOME display abnormal genomic profiles unrelated to fetal DNA that may indicate undiagnosed cancer. These findings are consistent with the general understanding in the field of genomic testing that NIPT for fetal aneuploidy screening using cell-free DNA derived from maternal plasma may incidentally raise suspicion of maternal cancer. Therefore, it is hypothesized that sequencing data typical of NIPT tests (including the WGS used in MaterniT® GENOME) carry hidden cancer signals. The challenge lies in finding a way to extract these signals in a cost-effective manner with sufficient specificity and sensitivity.
[0035] This signal has traditionally been extracted by field experts, such as lab directors, based on classification reports summarizing WGS findings, or using statistical measures of abnormality such as GIN. GIN is an indicator of systematic abnormality in the patient's genomic composition and is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value. More specifically, genomic profile variability is smoothed with a LOESS fit for each chromosome, and the absolute deviations of the LOESS fits from the predicted value are summed to calculate an area-under-the-curve (AUC) index, the GIN. An empirical threshold is then used to predict the tumor status of the sample. This method is described in more detail in Jensen TJ, Goodman AM, Kato S, Ellison CK, Daniels GA, Kim L, Nakashe P, McCarthy E, Mazloom AR, McLennan G, Grosu DS, Ehrich M, Kurzrock R. Genome-Wide Sequencing of Cell-Free DNA Identifies Copy-Number Alterations That Can Be Used for Monitoring Response to Immunotherapy in Cancer Patients. Mol Cancer Ther. 2019 Feb;18(2):448-458. doi: 10.1158/1535-7163.MCT-18-0535. Epub 2018 Dec 6. PMID: 30523049, the entire contents of which are incorporated herein by reference for all purposes.
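The GIN calculation described above (a per-chromosome LOESS fit, followed by summing the absolute deviations of the fit from the predicted value) can be sketched as follows. This is a simplified illustration rather than the published implementation: the smoother is a basic local linear regression with tricube weights over a fixed window, the predicted value is assumed to be 1.0 for a normalized profile, the integral is approximated by a sum over unit-width bins, and the function names, span, and profiles are hypothetical.

```python
from typing import Dict, List, Sequence


def loess(values: Sequence[float], span: float = 0.5) -> List[float]:
    """Simplified LOESS: tricube-weighted local linear fit at each bin."""
    n = len(values)
    k = min(n, max(3, int(round(span * n))))  # bins used in each local fit
    smoothed = []
    for i in range(n):
        lo = max(0, min(i - k // 2, n - k))   # window clamped to the profile
        xs = list(range(lo, lo + k))
        dmax = max(abs(x - i) for x in xs)
        ws = [(1.0 - (abs(x - i) / (dmax + 1.0)) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * values[x] for w, x in zip(ws, xs)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (values[x] - my) for w, x in zip(ws, xs))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (i - mx))
    return smoothed


def genomic_instability_number(profiles: Dict[str, Sequence[float]],
                               expected: float = 1.0,
                               span: float = 0.5) -> float:
    """Sum of |LOESS fit - expected| over all bins of all autosomes."""
    return sum(abs(v - expected)
               for values in profiles.values()
               for v in loess(values, span))


# Hypothetical normalized bin count profiles (1.0 = euploid expectation).
euploid = {"chr1": [1.0] * 20, "chr2": [1.0] * 20}
aberrant = {"chr1": [1.0] * 5 + [1.1] * 10 + [1.0] * 5, "chr2": [1.0] * 20}

gin_euploid = genomic_instability_number(euploid)
gin_aberrant = genomic_instability_number(aberrant)
print(gin_euploid, gin_aberrant)
```

A flat profile yields a GIN of zero under these assumptions, while a profile with a gained segment yields a positive value that an empirical threshold could then act on.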
[0036] Nevertheless, expert analysis and statistical measures of anomalies such as GIN typically demonstrate low sensitivity (the ability to correctly classify an individual who has cancer as positive), particularly in the early stages of cancer (e.g., the sensitivity of the GIN method itself is typically less than 60%, increasing in patients in later stages of the disease). This is presumably because expert analysis and statistical measures, alone or in combination, cannot explore all features or attributes that may or may not be associated with different pathological or physiological conditions. To overcome this and other challenges, aspects of this disclosure relate to machine learning (ML) techniques that can explore variability between WGS profiles and reveal previously unknown features or attributes that may form distinct patterns associated with different pathological or physiological conditions.
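For reference, sensitivity and specificity follow directly from the counts in a confusion matrix. The short sketch below uses purely hypothetical counts, chosen only to illustrate what a sensitivity below 60% means; they are not results from this disclosure or from any study.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of individuals with cancer who are classified as positive."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no individuals with cancer in the sample")
    return true_positives / total


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of individuals without cancer who are classified as negative."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no individuals without cancer in the sample")
    return true_negatives / total


# Hypothetical counts: 100 individuals with cancer, 58 of them flagged;
# 1,000 individuals without cancer, 10 of them flagged.
print(f"sensitivity = {sensitivity(58, 42):.2f}")
print(f"specificity = {specificity(990, 10):.2f}")
```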
[0037] ML is having a significant impact on many areas of modern society. For example, it is used to filter spam messages from text documents such as emails, to analyze images and distinguish differences between them, and to extract important data from large datasets through data mining. By learning from training data, ML enables the discovery of patterns, the construction of models, and the making of predictions. ML algorithms are used in a wide range of domains, including biology and genomics. Deep learning (DL) is a subset of ML that differs in many ways from other ML processes. Most ML models perform well because of custom-designed representations and input features. Given input data prepared in this way, an ML algorithm learns by optimizing the weight of each feature to improve the final prediction. DL instead attempts to learn multiple levels of representation using a hierarchy of multiple layers. In recent years, DL has outperformed other ML approaches in many areas, including speech, vision, and natural language processing. DL and ML are also increasingly used in the medical field, mainly in image analysis, pharmaceutical research and development, data mining from medical documents, and speech processing. In addition to image and text data from medical charts generated in hospitals, various types of laboratory data, including sequencing data, can also be analyzed to detect various signals associated with the subject's health. The primary objective supporting this disclosure was to develop an accurate and inexpensive machine learning approach for tumor detection in an NIPT setting that would improve sensitivity compared to other methods while maintaining a low computational load. This method would utilize data from NIPT procedures or assays such as the MaterniT® GENOME test pipeline (e.g., WGS data) to provide a separate clinical test/assay in addition to existing ones.
[0038] In exemplary embodiments, a computer-implemented method is provided, comprising: accessing NIPT sequence read data for a first sample group, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes bin count profiles, each of which includes sequence read counts for each bin associated with a segment of a reference genome; generating a first training data subset and a second training data subset based on the NIPT sequence read data, wherein each example in the second training data subset is classified as negative for cancer; accessing copy number variant data for a second sample group, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events from cancer patients, the genomic events being copy numbers for one or more segments that deviate from a reference profile; generating a synthetic training dataset based on the copy number variant data obtained from the cancer patients, wherein generating the synthetic training dataset includes generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores onto the bin count profiles of the second training data subset, wherein each example in the synthetic training dataset is classified as positive for cancer; augmenting the first training data subset with the synthetic training dataset to generate an augmented training dataset; and training a machine learning model using the augmented training dataset to classify samples as negative for cancer or positive for cancer, wherein the training includes iterative operations to find a set of parameters for the machine learning model that minimizes a loss function or error function of the machine learning model, wherein each iteration includes finding a set of parameters for the machine learning model such that the value of the loss function or error function using that set of parameters is smaller than the value of the loss function or error function using a different set of parameters in a previous iteration, and wherein the loss function or error function is configured to measure the difference between (i) the estimated class output for each example in the augmented training dataset and (ii) a label that provides ground truth information for each example in the augmented training dataset, the ground truth information identifying whether an example is classified as negative for cancer or positive for cancer.
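The augmentation and training steps described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the logistic model, the `inject_event` helper, the simulated bin profiles, the `events` list standing in for the empirical event population, and the step-halving rule are all assumptions chosen only to show (a) synthetic positives built by mapping event lengths and standard scores onto cancer-negative profiles and (b) an iteration that accepts a new parameter set only when its loss is smaller than in the previous iteration.

```python
import math
import random

def inject_event(profile, start, end, z_shift):
    """Copy a cancer-negative bin profile and add a synthetic gain or loss
    (z_shift, in standard-score units) to bins start..end-1."""
    out = list(profile)
    for i in range(start, end):
        out[i] += z_shift
    return out

def predict(weights, bias, x):
    """Logistic model: estimated probability that profile x is cancer-positive."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    s = max(-30.0, min(30.0, s))  # clamp to keep exp() and log() well behaved
    return 1.0 / (1.0 + math.exp(-s))

def log_loss(weights, bias, xs, ys):
    """Mean cross-entropy between estimated classes and ground-truth labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def train(xs, ys, steps=200, lr=0.5):
    """Gradient descent that keeps a parameter set only if it lowers the loss."""
    n = len(xs[0])
    weights, bias = [0.0] * n, 0.0
    history = [log_loss(weights, bias, xs, ys)]
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        new_w = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
        new_b = bias - lr * grad_b / len(xs)
        new_loss = log_loss(new_w, new_b, xs, ys)
        if new_loss >= history[-1]:
            lr *= 0.5  # reject the step and retry with a smaller one
            continue
        weights, bias = new_w, new_b
        history.append(new_loss)
    return weights, bias, history

random.seed(0)
N_BINS = 20
# Simulated, already-normalized cancer-negative profiles (illustrative data only).
negatives = [[random.gauss(0.0, 1.0) for _ in range(N_BINS)] for _ in range(60)]
first_subset, second_subset = negatives[:30], negatives[30:]

# Hypothetical empirical event population: (length in bins, standard score).
events = [(4, 3.0), (6, 2.5), (5, 4.0)]
synthetic = []
for profile in second_subset:
    length, z = random.choice(events)
    start = random.randrange(N_BINS - length + 1)
    synthetic.append(inject_event(profile, start, start + length, z))

# Augmented training dataset: real negatives (label 0) plus synthetic positives (label 1).
xs = first_subset + synthetic
ys = [0] * len(first_subset) + [1] * len(synthetic)
weights, bias, history = train(xs, ys)
```

Here `history` records the loss of each accepted parameter set, so it decreases strictly from one accepted iteration to the next, mirroring the condition stated in the paragraph above.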
[0039] In another exemplary embodiment, a computer-aided method is provided, comprising: accessing NIPT sequence read data for a sample, wherein the sequence read data is generated as part of performing a whole-genome sequencing assay on the sample, and the NIPT sequence read data includes a bin count profile, which includes sequence read counts for each bin associated with a segment of a reference genome; determining, based on the NIPT sequence read data, an indicator of systematic abnormality for the sample, which is either a genomic instability number, tumor content, or both; inputting the NIPT sequence read data and the indicator of systematic abnormality into a machine learning model constructed as a binary classifier; classifying the sample as either negative or positive for cancer using the machine learning model, wherein the machine learning model uses the indicator of systematic abnormality as an additional model feature for predicting the negative or positive class; and outputting the negative or positive class for cancer using the machine learning model.
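As a rough illustration of using the indicator of systematic abnormality as an additional model feature, the sketch below appends a genomic instability number and / or tumor content value to the bin count profile before scoring. The function names, the logistic scoring, and the weights are hypothetical; how the indicators themselves are computed is not shown here and is assumed to happen upstream.

```python
import math

def build_features(bin_counts, instability_number=None, tumor_content=None):
    """Append the available systematic-abnormality indicators to the bin profile."""
    features = list(bin_counts)
    if instability_number is not None:
        features.append(instability_number)
    if tumor_content is not None:
        features.append(tumor_content)
    if len(features) == len(bin_counts):
        raise ValueError("at least one indicator is required")
    return features

def classify(features, weights, bias, threshold=0.5):
    """Binary classifier sketch: returns ('positive' | 'negative', probability)."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors differ in length")
    score = bias + sum(w * f for w, f in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-score))
    return ("positive" if prob >= threshold else "negative"), prob

# Hypothetical trained parameters: four bin weights, then one weight per indicator.
weights = [0.0, 0.0, 0.0, 0.0, 0.02, 5.0]
bias = -3.0
bins = [0.1, -0.2, 0.0, 0.3]

unstable = classify(build_features(bins, 150.0, 0.10), weights, bias)
quiet = classify(build_features(bins, 20.0, 0.0), weights, bias)
```

With these made-up parameters the sample with the higher instability number and tumor content is called positive and the other negative; real weights would come from training.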
[0040] As used herein, the terms “substantially,” “approximately,” and “about” are defined as being largely, but not necessarily wholly, what is specified (and include wholly what is specified), as understood by those skilled in the art. In any disclosed embodiment, the term “substantially,” “approximately,” or “about” may be replaced by “within [a percentage] of” what is specified, where the percentage includes 0.1, 1, 5, and 10 percent. As used herein, when an action is “based on” something, this means that the action is based at least in part on at least a part of that something.
II. NIPT Pipeline
[0041] Figure 1 is a block diagram of a clinical laboratory pipeline 100 that implements a NIPT procedure or assay such as MaterniT® GENOME for nucleic acid analysis. In some embodiments, nucleic acid fragments in a mixture of nucleic acid fragments are analyzed. Nucleic acid fragments may be referred to as nucleic acid templates, and these terms may be used interchangeably herein. A mixture of nucleic acids may contain two or more nucleic acid fragment species having the same or different nucleotide sequences, different fragment lengths, different origins (e.g., genome-derived, fetal vs. maternal, cell or tissue-derived, cancer vs. non-cancerous, tumor vs. non-tumor-derived, sample-derived, subject-derived, etc.), or combinations thereof.
[0042] The terms “nucleic acid,” “nucleic acid molecule,” “nucleic acid fragment,” and “nucleic acid template” may be used interchangeably throughout this disclosure. These terms refer to nucleic acids of any composition, such as DNA (e.g., complementary DNA (cDNA), genomic DNA (gDNA), etc.), RNA (e.g., messenger RNA (mRNA), short interfering RNA (siRNA), ribosomal RNA (rRNA), tRNA, microRNA, RNA highly expressed by the fetus or placenta, etc.), and / or DNA or RNA analogs (e.g., containing base analogs, sugar analogs and / or non-natural backbones, etc.), RNA / DNA hybrids and polyamide nucleic acids (PNA), all of which may be in single-stranded or double-stranded form and may encompass known analogs of natural nucleotides that can function in a manner similar to naturally occurring nucleotides, unless otherwise limited. Nucleic acids, in certain embodiments, may be, or may be derived from, plasmids, phages, viruses, bacteria, autonomously replicating sequences (ARS), mitochondria, centromeres, artificial chromosomes, chromosomes, or other nucleic acids capable of replicating or being replicated in vitro or within host cells, cells, cell nuclei, or cytoplasm. In some embodiments, the template nucleic acid may be derived from a single chromosome (e.g., a nucleic acid sample may be derived from one chromosome of a sample obtained from a diploid organism). Unless otherwise specified, the term encompasses nucleic acids including known analogs of native nucleotides that have similar binding properties to the reference nucleic acid and are metabolized in a similar manner to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly includes its conservatively modified variants (e.g., degenerate codon substitutions), alleles, orthologues, single nucleotide polymorphisms (SNPs), and complementary sequences, as well as explicitly indicated sequences. 
Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed bases and / or deoxyinosine residues. The term nucleic acid is used interchangeably with locus, gene, cDNA, and mRNA encoded by a gene. This term may also include, as equivalents, derivatives, variants, and analogues of RNA or DNA synthesized from nucleotide analogs, single-stranded ("sense" or "antisense," "plus" or "minus" strands, "forward" reading frames or "reverse" reading frames) and double-stranded polynucleotides. The term "gene" refers to a segment of DNA involved in the production of polypeptide chains and generally includes the regions before and after the coding regions (leaders and trailers) involved in the transcription / translation of gene products, as well as the intervening sequences (introns) between individual coding regions (exons). A nucleotide or base generally refers to the purine and pyrimidine molecular units of nucleic acids (e.g., adenine (A), thymine (T), guanine (G), and cytosine (C)). In the case of RNA, the base thymine is replaced by uracil. The length or size of a nucleic acid may be expressed as the number of bases.
[0043] In block 105, nucleic acids or nucleic acid mixtures used in the systems, methods, and products described herein can be isolated from samples obtained from a subject (e.g., a test subject). In some embodiments, nucleic acids are extracted from cells using cell lysis procedures. Cell lysis procedures and reagents are known in the art and can generally be carried out by chemical (e.g., detergents, hypotonic solutions, enzymatic procedures, etc., or combinations thereof), physical (e.g., French press, sonication, etc.), or electrolytic lysis methods. Any suitable lysis procedure can be utilized. For example, chemical methods generally use a lysing agent to disrupt cells and extract nucleic acids from them, followed by treatment with chaotropic salts. Physical methods such as freezing / thawing followed by grinding, and the use of cell presses, are also useful. In some cases, high-salt and / or alkaline lysis procedures may be utilized.
[0044] In some embodiments, nucleic acids may include extracellular nucleic acids. As used herein, the term “extracellular nucleic acid” can refer to nucleic acids isolated from a substantially cell-free source and may also be referred to as “cell-free” nucleic acids, “circulating cell-free nucleic acids” (e.g., CCF fragments, ccf DNA) and / or “cell-free circulating nucleic acids.” Extracellular nucleic acids are present in blood and can be obtained from blood (e.g., from the blood of a human subject). Extracellular nucleic acids often do not contain detectable cells and may contain cellular elements or cellular remnants. Non-limiting examples of cell-free sources for extracellular nucleic acids are blood, plasma, serum, and urine. As used herein, the term “obtaining a cell-free circulating sample nucleic acid” includes obtaining a sample directly (e.g., collecting a sample, e.g., a test sample) or obtaining a sample from another person who has collected a sample. Without being limited by theory, extracellular nucleic acids may be products of cell apoptosis and cytodegradation, which often provides a basis for extracellular nucleic acids having a series of lengths across the spectrum (e.g., “ladders”). In some embodiments, the sample nucleic acid derived from the test subject is a circulating cell-free nucleic acid. In some embodiments, the circulating cell-free nucleic acids are derived from plasma or serum from the subject being tested.
[0045] The subjects may include, but are not limited to, any living or non-living organisms, including humans, non-human animals, plants, bacteria, fungi, protists, or pathogens. Any human or non-human animal may be selected, and may include, for example, mammals, reptiles, birds, amphibians, fish, ungulates, ruminants, bovines (e.g., cattle), equines (e.g., horses), caprines and ovines (e.g., sheep, goats), swine (e.g., pigs), camelids (e.g., camels, llamas, alpacas), monkeys, apes (e.g., gorillas, chimpanzees), ursids (e.g., bears), poultry, dogs, cats, mice, rats, dolphins, whales, and sharks. The subjects may be male or female (e.g., women, pregnant women). The subjects may be of any age (e.g., embryos, fetuses, infants, children, adults). The subjects may include cancer patients, patients suspected of having cancer, patients in remission, patients with a family history of cancer, and / or individuals undergoing cancer screening.
[0046] Nucleic acids can be isolated from any type of suitable biological specimen or sample (e.g., test specimen). A specimen or test specimen can be any specimen isolated from or obtained from a subject or a part thereof (e.g., a human subject, a pregnant woman, a cancer patient, a fetus, a tumor). The specimen may be from a pregnant subject carrying a fetus at any stage of pregnancy (e.g., the first, second, or third trimester for a human subject), and possibly from a postnatal subject. The specimen may be from a pregnant subject with a fetus that is euploid for all chromosomes, and possibly from a pregnant subject with a fetus that has a chromosomal aneuploidy (e.g., one, three (i.e., trisomy (e.g., T21, T18, T13)), or four copies of a chromosome) or other genetic mutations. Non-limiting examples of specimens include fluids or tissues from a subject, including blood or blood products (e.g., serum, plasma, etc.), umbilical cord blood, chorionic villi, amniotic fluid, cerebrospinal fluid, lavage fluids (e.g., bronchoalveolar, gastric, peritoneal, tubal, ear, arthroscopic), biopsy specimens (e.g., from preimplantation embryos, cancer biopsies), paracentesis specimens, cells (blood cells, placental cells, germ cells or fetal cells, fetal nucleated cells or fetal cell remnants, normal cells, abnormal cells (e.g., cancer cells)) or parts thereof (e.g., mitochondria, nuclei, extracts, etc.), lavage fluids of the female reproductive system, urine, feces, sputum, saliva, nasal mucosa, prostatic fluid, perfusion fluid, semen, lymph, bile, tears, sweat, breast milk, mammary gland fluid, etc., or combinations thereof. In some embodiments, the biological specimen is a cervical swab from the subject. The fluid or tissue specimen from which nucleic acids are extracted may be non-cellular (e.g., cell-free). In some embodiments, the fluid or tissue sample may contain cellular elements or cellular remnants. 
In some embodiments, fetal cells or cancer cells may be included in the sample.
[0047] The sample may be a liquid sample. A liquid sample may contain extracellular nucleic acids (e.g., circulating cell-free DNA). Non-limiting examples of liquid samples include blood or blood products (e.g., serum, plasma, etc.), urine, biopsy samples (e.g., liquid biopsy for cancer detection), the liquid samples described above, or combinations thereof. In certain embodiments, the sample is a liquid biopsy, which generally refers to the evaluation of a liquid sample from a subject for the presence, absence, progression, or remission of a disease (e.g., cancer). Liquid biopsies can be used in conjunction with, or as a substitute for, conventional biopsies (e.g., tumor biopsies). In certain cases, extracellular nucleic acids are analyzed in the liquid biopsy.
[0048] In some embodiments, the biological sample may be blood, plasma, or serum. The term “blood” encompasses whole blood, blood products, or any fraction of blood, such as serum, plasma, or buffy coat, as conventionally defined. Blood or its fractions often contain nucleosomes. Nucleosomes contain nucleic acids and may be cell-free or intracellular. Blood also contains a buffy coat. The buffy coat may be isolated by utilizing a Ficoll gradient. The buffy coat may contain leukocytes (e.g., white blood cells, T cells, B cells, platelets, etc.). Plasma refers to the fraction of whole blood resulting from the centrifugation of blood treated with an anticoagulant. Serum refers to the aqueous portion of the fluid remaining after a blood sample has clotted. Fluid or tissue samples are often collected according to standard protocols commonly followed by hospitals or clinics. For blood, an appropriate amount of peripheral blood (e.g., 3-40 ml, 5-50 ml) is often collected and can be stored according to standard procedures before or after preparation.
[0049] Analysis of nucleic acids found in the blood of a subject may be performed using, for example, whole blood, serum, or plasma. For example, analysis of fetal DNA found in maternal blood may be performed using whole blood, serum, or plasma. Likewise, analysis of tumor DNA found in a patient's blood may be performed using whole blood, serum, or plasma. Methods for preparing serum or plasma from blood obtained from a subject (e.g., maternal subject, cancer patient) are known. For example, the blood of a subject (e.g., blood of a pregnant woman, blood of a cancer patient) can be placed in a tube containing EDTA or a specialized commercial product such as Vacutainer SST (Becton Dickinson, Franklin Lakes, NJ) to prevent blood coagulation, and plasma can then be obtained from the whole blood by centrifugation. Serum can be obtained with or without centrifugation after the blood has clotted. When centrifugation is used, it is typically, although not exclusively, performed at an appropriate speed, e.g., 1,500-3,000 × g. Plasma or serum may be subjected to an additional centrifugation step before being transferred to a fresh tube for nucleic acid extraction. In addition to the non-cellular portion of whole blood, nucleic acids can also be recovered from the cellular fraction concentrated in the buffy coat portion, which can be obtained after centrifugation of the whole blood sample from the subject and removal of plasma.
[0050] In some embodiments, nucleic acids or nucleic acid mixtures (e.g., extracellular nucleic acids) are enriched or relatively enriched for a subpopulation or species of nucleic acids. Examples of nucleic acid subpopulations include fetal nucleic acids, maternal nucleic acids, cancer nucleic acids, patient nucleic acids, nucleic acids containing fragments of a specific length or range of lengths, or nucleic acids derived from a specific genomic region (e.g., a single chromosome, a set of chromosomes, and / or a specific chromosomal region). Such enriched samples can be used in conjunction with the methods provided herein. Thus, in certain embodiments, the methods of the present technology include an additional step of enriching a subpopulation of nucleic acids in a sample, such as cancer or fetal nucleic acids. In certain embodiments, a method for determining a fraction of cancer cell nucleic acids or a fetal fraction can also be used to enrich cancer or fetal nucleic acids. In certain embodiments, nucleic acids from normal tissue (e.g., non-cancer cells) are selectively (partially, substantially, almost completely, or completely) removed from the sample. In certain embodiments, maternal nucleic acids are selectively (partially, substantially, almost completely, or completely) removed from the sample. In certain embodiments, enriching specific low-copy nucleic acids (e.g., cancer or fetal nucleic acids) can improve quantitative sensitivity. Methods for enriching samples with respect to specific types of nucleic acids are described, for example, in U.S. Patent No. 6,927,028, International Patent Application Publication No. 2007 / 140417, International Patent Application Publication No. 2007 / 147063, International Patent Application Publication No. 2009 / 032779, International Patent Application Publication No. 2009 / 032781, International Patent Application Publication No. 2010 / 033639, International Patent Application Publication No. 2011 / 034631, International Patent Application Publication No. 2006 / 056480, and International Patent Application Publication No. 2011 / 143659, the entire contents of each of which, including all text, tables, equations, and drawings, are incorporated herein by reference for all purposes.
[0051] The amount of nucleic acids in a sample (e.g., concentration, relative amount, absolute amount, copy number, etc.) can be determined. In some embodiments, the amount of minority nucleic acids in the nucleic acids (e.g., concentration, relative amount, absolute amount, copy number, etc.) is determined. In certain embodiments, the amount of minority nucleic acid species in a sample is referred to as the “minority species fraction.” In some embodiments, the “minority species fraction” refers to the fraction of minority nucleic acid species in circulating cell-free nucleic acids in a sample obtained from a subject (e.g., blood sample, serum sample, plasma sample, urine sample). The amount of minority nucleic acids in extracellular nucleic acids can be quantified and used in conjunction with the methods provided herein. Therefore, in certain embodiments, the methods described herein include an additional step of determining the amount of minority nucleic acids. The amount of minority nucleic acids can be determined in a sample from a subject before or after processing for preparing the sample nucleic acid. In certain embodiments, the amount of minority nucleic acids is determined in the sample after the sample nucleic acid has been processed and prepared, and the amount is used for further evaluation. In some embodiments, the results include taking into account minority species fractions in the sample nucleic acid (e.g., adjusting the count, removing the sample, calling, or not calling).
[0052] The determination of minority species fractions can be performed before, during, or at any point in time in the clinical testing pipeline 100 described herein, or after a specific process described herein (e.g., detection of gene mutations or genetic alterations). For example, a minority nucleic acid quantification method may be implemented before, during, or after the determination of gene mutations / gene alterations in order to implement a gene mutation / gene alteration determination method having a specific sensitivity or specificity, in order to identify samples having minority nucleic acids greater than approximately 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25%, or more. In some embodiments, samples determined to have a specific threshold amount of minority nucleic acid (e.g., approximately 15% or more of minority nucleic acid, approximately 4% or more of minority nucleic acid) are further analyzed, for example, for the presence or absence of gene mutations / genetic alterations. In certain embodiments, for example, the determination of gene mutations or genetic alterations is selected (e.g., selected and communicated to the patient) only for samples having a specific threshold amount of minority nucleic acid (e.g., approximately 15% or more of minority nucleic acid, approximately 4% or more of minority nucleic acid).
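The threshold-based selection described above can be sketched in a few lines. The sample records, the `minority_fraction` field name, and the `select_for_analysis` helper are hypothetical; the 4% and 15% values are simply the example thresholds mentioned in the paragraph.

```python
def select_for_analysis(samples, min_fraction=0.04):
    """Keep samples whose estimated minority species fraction meets the threshold."""
    return [s for s in samples if s["minority_fraction"] >= min_fraction]

# Illustrative records only; real fractions would come from one of the
# quantification methods described in this section.
samples = [
    {"id": "S1", "minority_fraction": 0.02},
    {"id": "S2", "minority_fraction": 0.04},
    {"id": "S3", "minority_fraction": 0.15},
]

selected_ids = [s["id"] for s in select_for_analysis(samples)]
strict_ids = [s["id"] for s in select_for_analysis(samples, min_fraction=0.15)]
```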
[0053] For example, the amount of cancer cell nucleic acid in nucleic acids (e.g., concentration, relative amount, absolute amount, copy number, etc.) can be determined. In certain cases, the amount of cancer cell nucleic acid in a sample is referred to as the "cancer cell nucleic acid fraction," and in some cases as the "cancer fraction" or "tumor fraction." In some embodiments, the "cancer cell nucleic acid fraction" refers to the fraction of cancer cell nucleic acid in the circulating cell-free nucleic acid in a sample obtained from a subject (e.g., blood sample, serum sample, plasma sample, urine sample). Additionally or alternatively, the amount of fetal nucleic acid in nucleic acids (e.g., concentration, relative amount, absolute amount, copy number, etc.) can be determined. In certain embodiments, the amount of fetal nucleic acid in a sample is referred to as the "fetal fraction." In some embodiments, the "fetal fraction" refers to the fraction of fetal nucleic acid in the circulating cell-free nucleic acid in a sample obtained from a pregnant woman (e.g., blood sample, serum sample, plasma sample, urine sample). Certain methods for determining fetal fractions, described herein or known in the art, can be used to determine cancer cell nucleic acid fractions and / or minority species fractions.
[0054] In certain cases, the fetal fraction may be determined according to markers specific to male fetuses (e.g., Y-chromosome STR markers (e.g., DYS19, DYS385, DYS392 markers), RhD markers in RhD-negative women), allele ratios of polymorphic sequences, or one or more markers specific to fetal nucleic acids rather than maternal nucleic acids (e.g., different epigenetic biomarkers between mother and fetus (e.g., methylation), or fetal RNA markers in maternal plasma (e.g., Lo, 2005, Journal of Histochemistry and Cytochemistry 53(3):293-296)). Determination of the fetal fraction may, in some cases, be carried out using a fetal quantitative assay (FQA), as described, for example, in U.S. Patent Application Publication No. 2010 / 0105049, the entirety of which is incorporated herein by reference for all purposes. This type of assay allows for the detection and quantification of fetal nucleic acids in a maternal sample based on the methylation status of nucleic acids in the sample.
[0055] In certain embodiments, minority species fractions can be determined based on allele ratios of polymorphic sequences (e.g., single nucleotide polymorphisms (SNPs)) using, for example, the method described in U.S. Patent Application Publication No. 2011 / 0224087, the entire contents of which are incorporated herein by reference for all purposes. In such a method for determining fetal fractions, for example, nucleotide sequence reads are obtained for a maternal sample, and the fetal fraction is determined by comparing the total number of nucleotide sequence reads that map to a first allele with the total number of nucleotide sequence reads that map to a second allele at a useful polymorphic site (e.g., SNP) in a reference genome.
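A minimal sketch of the allele-count comparison follows. It assumes, beyond what the paragraph states, that every site passed in is one where the mother is homozygous and the fetus is heterozygous, so that reads carrying the fetal-specific allele make up roughly half of the fetal fraction; the function name and the read counts are illustrative only.

```python
def fetal_fraction_from_alleles(site_counts):
    """Estimate fetal fraction from (maternal_allele_reads, fetal_specific_reads)
    pairs, one pair per polymorphic site.

    Assumption: the mother is homozygous and the fetus heterozygous at each
    site, so fetal-specific reads / total reads is about fetal_fraction / 2.
    """
    counts = list(site_counts)
    maternal = sum(a for a, _ in counts)
    fetal_specific = sum(b for _, b in counts)
    total = maternal + fetal_specific
    if total == 0:
        raise ValueError("no reads at the supplied sites")
    return 2.0 * fetal_specific / total

# Two hypothetical sites: 150 fetal-specific reads out of 3,000 in total.
ff = fetal_fraction_from_alleles([(950, 50), (1900, 100)])
```

Pooling reads across sites before taking the ratio, as done here, is one simple design choice; per-site estimates could instead be averaged or modeled separately.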
[0056] Minority species fractions can be determined in some embodiments using methods that incorporate information derived from chromosomal abnormalities, for example, as described in International Patent Application Publication 2014 / 055774, the entirety of which is incorporated herein by reference for all purposes. Minority species fractions can be determined in some embodiments using methods that incorporate information derived from sex chromosomes, for example, as described in U.S. Patent Application Publication 2013 / 0288244 and U.S. Patent Application Publication 2013 / 0338933, the entirety of which is incorporated herein by reference for all purposes.
[0057] Minority species fractions can be determined in some embodiments using methods that incorporate fragment length information (e.g., fragment length ratio (FLR) analysis, fetal ratio statistics (FRS) analysis, described in International Patent Application Publication 2013 / 177086, the entire content of which is incorporated herein by reference for all purposes). Cell-free fetal nucleic acid fragments are generally shorter than maternal nucleic acid fragments (e.g., see Chan et al. (2004) Clin. Chem. 50:88-92; Lo et al. (2010) Sci. Transl. Med. 2:61ra91). Therefore, in some embodiments, the fetal fraction can be determined by counting fragments below a certain length threshold and comparing the count to, for example, the count from fragments above a certain length threshold and / or the total amount of nucleic acid in the sample. Methods for counting nucleic acid fragments of a specific length are described in further detail in International Patent Application Publication 2013 / 177086.
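The counting step described above can be sketched as follows. The 150-base threshold and the `short_fragment_ratio` name are assumptions made for illustration; the paragraph does not fix a threshold, and converting the ratio into a fetal fraction would need a calibration that is not shown.

```python
def short_fragment_ratio(lengths, threshold=150):
    """Fraction of fragments shorter than `threshold` bases.

    The threshold value is an illustrative assumption, not a value taken
    from the disclosure.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no fragment lengths supplied")
    short = sum(1 for n in lengths if n < threshold)
    return short / len(lengths)

# Hypothetical fragment lengths (in bases) for one sample.
fragment_lengths = [120, 140, 166, 170, 145, 180, 165, 130]
ratio = short_fragment_ratio(fragment_lengths)
```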
[0058] Minority species fractions can be determined in some embodiments according to portion-specific fraction estimates (e.g., as described in International Patent Application Publication 2014 / 205401, the entirety of which is incorporated herein by reference for all purposes). Without being limited by theory, the amount of reads from fetal CCF fragments (e.g., fragments of a particular length or range of lengths) often maps to portions at different frequencies (e.g., within the same sample, e.g., within the same sequencing run). Also without being limited by theory, particular portions tend to have similar representations of reads from fetal CCF fragments (e.g., fragments of a particular length or range of lengths) when compared across multiple samples, and these representations correlate with portion-specific fetal fractions (e.g., the relative amount, percentage, or ratio of CCF fragments derived from the fetus). Portion-specific fetal fraction estimates are generally determined according to portion-specific parameters and their relationship to the fetal fraction.
[0059] In some embodiments, the determination of minority species fractions (e.g., cancer cell nucleic acid fractions, fetal fractions) is not required or essential for identifying the presence or absence of gene mutations or genetic alterations. In some embodiments, identifying the presence or absence of gene mutations or genetic alterations does not require the distinction between minority nucleic acid sequences and majority nucleic acid sequences. In certain embodiments, this is because the sum of contributions from both minority and majority sequences in a particular chromosome, chromosomal portion, or part thereof is analyzed. In some embodiments, identifying the presence or absence of gene mutations or genetic alterations does not rely on a priori sequence information that would distinguish minority nucleic acids from majority nucleic acids.
[0060] In block 110, a nucleic acid library is prepared from the isolated nucleic acids or mixture of nucleic acids. In some embodiments, the nucleic acid library is a plurality of polynucleotide molecules (e.g., a sample of nucleic acids) that are prepared, assembled, and / or modified for a particular process, non-limiting examples of which include immobilization on a solid phase (e.g., a solid support, flow cell, beads), enrichment, amplification, cloning, detection, and / or nucleic acid sequencing. In certain embodiments, the nucleic acid library is prepared before or during the sequencing process. The nucleic acid library (e.g., a sequencing library) can be prepared by any suitable method known in the art. The nucleic acid library can be prepared by a targeted preparation process or a non-targeted preparation process.
[0061] In some embodiments, the nucleic acid library is modified to include chemical moieties (e.g., functional groups) configured for the immobilization of nucleic acids to a solid support. In certain embodiments, the nucleic acid library is modified to include biomolecules (e.g., functional groups) and / or members of binding pairs configured to immobilize the library to a solid support, non-limiting examples of which include thyroxine-binding globulins, steroid-binding proteins, antibodies, antigens, haptens, enzymes, lectins, nucleic acids, repressors, protein A, protein G, avidin, streptavidin, biotin, complement component C1q, nucleic acid-binding proteins, receptors, carbohydrates, oligonucleotides, polynucleotides, complementary nucleic acid sequences, and combinations thereof. Some examples of specific binding pairs include, but are not limited to, avidin and biotin moieties, antigen epitopes and their antibodies or immunoreactive fragments, antibodies and haptens, digoxigenin moieties and anti-digoxigenin antibodies, fluorescein moieties and anti-fluorescein antibodies, operators and repressors, nucleases and nucleotides, lectins and polysaccharides, steroids and steroid-binding proteins, active compounds and active compound receptors, hormones and hormone receptors, enzymes and substrates, immunoglobulins and protein A, oligonucleotides or polynucleotides and their corresponding complements, or combinations thereof.
[0062] In some embodiments, a nucleic acid library is modified to include one or more polynucleotides of known composition, non-limiting examples of which include identifiers (e.g., tags, indexing tags), capture sequences, labels, adapters, restriction enzyme sites, promoters, enhancers, origins of replication, stem-loops, complementary sequences (e.g., primer-binding sites, annealing sites), preferred integration sites (e.g., transposons, viral integration sites), modified nucleotides, or combinations thereof. Polynucleotides of known sequences can be added to preferred locations, e.g., the 5' end, the 3' end, or within the nucleic acid sequence. The polynucleotides of known sequences can be the same or different sequences. In some embodiments, polynucleotides of known sequences are configured to hybridize to one or more oligonucleotides immobilized on a surface (e.g., a surface in a flow cell). For example, a nucleic acid molecule containing a 5' known sequence may hybridize to a first plurality of oligonucleotides, while a 3' known sequence may hybridize to a second plurality of oligonucleotides. In some embodiments, the nucleic acid library may include chromosome-specific tags, capture sequences, labels, and / or adapters. In some embodiments, the nucleic acid library includes one or more detectable labels. In some embodiments, one or more detectable labels may be incorporated into the nucleic acid library at the 5' end, 3' end, and / or any nucleotide position of the nucleic acids in the library. In some embodiments, the nucleic acid library includes hybridized oligonucleotides. In certain embodiments, the hybridized oligonucleotides are labeled probes. In some embodiments, the nucleic acid library includes hybridized oligonucleotide probes before immobilization on a solid phase.
[0063] In certain embodiments, ligation-based library preparation methods are used (e.g., ILLUMINA TRUSEQ, Illumina, San Diego, Calif.). Ligation-based library preparation methods often utilize adapter (e.g., methylated adapter) designs that can incorporate index sequences (e.g., sample index sequences for identifying the sample origin of nucleic acid sequences) in the first ligation step, and can often be used to prepare samples for single-read sequencing, paired-end sequencing, and multiplex sequencing. For example, nucleic acids (e.g., fragmented nucleic acids or cell-free DNA) can be end-repaired by fill-in reactions, exonuclease reactions, or a combination thereof. In some embodiments, the resulting blunt-end repaired nucleic acid can then be extended by a single nucleotide that is complementary to a single-nucleotide overhang at the 3' end of the adapter / primer. Any nucleotide can be used as the extension / overhang nucleotide.
[0064] In some embodiments, nucleic acid library preparation involves ligating an adapter oligonucleotide (e.g., to a sample nucleic acid, a sample nucleic acid fragment, or a template nucleic acid). The adapter oligonucleotide is often complementary to a flow cell anchor and, in some cases, is used to immobilize the nucleic acid library on a solid support, such as the inner surface of a flow cell. In some embodiments, the adapter oligonucleotide includes an identifier, one or more sequencing primer hybridization sites (e.g., sequences complementary to a universal sequencing primer, a single-ended sequencing primer, a paired-ended sequencing primer, a multiplex sequencing primer, etc.), or a combination thereof (e.g., adapter / sequencing, adapter / identifier, adapter / identifier / sequencing). In some embodiments, the adapter oligonucleotide includes one or more of the following: primer annealing polynucleotides (e.g., for annealing to oligonucleotides attached to a flow cell and / or free amplification primers), index polynucleotides (e.g., sample index sequences for tracking nucleic acids from different samples, also referred to as sample IDs), and barcode polynucleotides (e.g., single-molecule barcodes (SMBs) for tracking individual molecules of sample nucleic acids to be amplified before sequencing, also referred to as molecular barcodes). In some embodiments, the primer annealing component of the adapter oligonucleotide includes one or more universal sequences (e.g., sequences complementary to one or more universal amplification primers). In some embodiments, the index polynucleotide (e.g., sample index, sample ID) is a component of the adapter oligonucleotide. In some embodiments, the index polynucleotide (e.g., sample index, sample ID) is a component of the universal amplification primer sequence.
[0065] In some embodiments, when used in combination with amplification primers (e.g., universal amplification primers), the adapter oligonucleotide is designed to generate a library construct containing one or more of the universal sequence, molecular barcode, sample ID sequence, spacer sequence, and sample nucleic acid sequence. In some embodiments, when used in combination with universal amplification primers, the adapter oligonucleotide is designed to generate a library construct containing one or more ordered combinations of the universal sequence, molecular barcode, sample ID sequence, spacer sequence, and sample nucleic acid sequence. For example, the library construct may include a first universal sequence, followed by a second universal sequence, followed by a first molecular barcode, followed by a spacer sequence, followed by a template sequence (e.g., sample nucleic acid sequence), followed by a spacer sequence, followed by a second molecular barcode, followed by a third universal sequence, followed by a sample ID, followed by a fourth universal sequence. In some embodiments, when used in combination with amplification primers (e.g., universal amplification primers), the adapter oligonucleotide is designed to generate a library construct for each strand of the template molecule (e.g., sample nucleic acid molecule). In some embodiments, the adapter oligonucleotide is a double-stranded adapter oligonucleotide.
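The ordered library construct given as an example above can be sketched as a simple concatenation. This is an illustrative sketch only: the element names (`universal_1`, `spacer_1`, and so on) and all sequences are hypothetical placeholders, not real adapter, barcode, or sample ID sequences, and only the element order is taken from the text.

```python
from typing import Dict, List

# Element order follows the example in the text; the names are hypothetical labels.
CONSTRUCT_ORDER: List[str] = [
    "universal_1", "universal_2", "barcode_1", "spacer_1", "template",
    "spacer_2", "barcode_2", "universal_3", "sample_id", "universal_4",
]


def build_construct(parts: Dict[str, str]) -> str:
    """Concatenate the construct elements in the order listed in CONSTRUCT_ORDER."""
    missing = [name for name in CONSTRUCT_ORDER if name not in parts]
    if missing:
        raise KeyError(f"missing construct elements: {missing}")
    return "".join(parts[name] for name in CONSTRUCT_ORDER)


# Placeholder sequences, chosen only to make the element boundaries easy to see.
example_parts = {
    "universal_1": "AAAA", "universal_2": "CCCC", "barcode_1": "ACGT",
    "spacer_1": "TT", "template": "GATTACA", "spacer_2": "TT",
    "barcode_2": "TGCA", "universal_3": "GGGG", "sample_id": "CATG",
    "universal_4": "TTTT",
}
construct = build_construct(example_parts)
print(construct)
```

Keeping the order in a single list makes it straightforward to represent other ordered combinations of the same elements.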
[0066] A universal sequence is a specific nucleotide sequence that is incorporated into two or more nucleic acid molecules or subsets of nucleic acid molecules, and it is the same for all molecules or subsets into which it is incorporated. A universal sequence is often designed to hybridize to and / or amplify multiple different sequences using a single universal primer that is complementary to the universal sequence. In some embodiments, two or more universal sequences and / or universal primers (e.g., a pair) are used. Universal primers often contain universal sequences. In some embodiments, adapters (e.g., universal adapters) contain universal sequences. In some embodiments, one or more universal sequences are used to capture, identify, and / or detect multiple species or subsets of nucleic acids.
[0067] The identifier can be a suitable detectable label incorporated into or attached to a nucleic acid (e.g., a polynucleotide) that enables detection and / or identification of the nucleic acid containing the identifier. In some embodiments, the identifier is incorporated into or attached to the nucleic acid during a sequencing method (e.g., by a polymerase). Non-limiting examples of identifiers include nucleic acid tags, nucleic acid indices or barcodes, radioactive labels (e.g., isotopes), metallic labels, fluorescent labels, chemiluminescent labels, phosphorescent labels, fluorescent quenchers, dyes, proteins (e.g., enzymes, antibodies or parts thereof, linkers, members of binding pairs), or combinations thereof. In some embodiments, the identifier (e.g., nucleic acid index or barcode) is a unique, known, and / or identifiable sequence of nucleotides or nucleotide analogues. In some embodiments, the identifier is six or more contiguous nucleotides. Numerous fluorophores with a variety of different excitation and emission spectra are available. Any suitable type and / or number of fluorophores can be used as identifiers. In some embodiments, one or more, two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, twenty or more, thirty or more, or fifty or more different identifiers are used in the methods described herein (e.g., nucleic acid detection and / or sequencing methods). In some embodiments, one or two types of identifiers (e.g., fluorescent labels) are linked to each nucleic acid in the library.
Identifier detection and / or quantification can be carried out by suitable methods, apparatus, or machines, including but not limited to flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, luminometers, fluorometers, spectrophotometers, suitable gene chip or microarray analysis, Western blotting, mass spectrometry, chromatography, cell fluorescence analysis, fluorescence microscopy, suitable fluorescence or digital imaging methods, confocal laser scanning microscopes, laser scanning cytometry, affinity chromatography, manual batch mode separation, electrostatic suspension, suitable nucleic acid sequencing methods, and / or nucleic acid sequencing apparatus, as well as combinations thereof.
[0068] In other embodiments, transposon-based library preparation methods are used (e.g., EPICENTRE NEXTERA, Epicentre, Madison, Wis.). Transposon-based methods typically use in vitro transposition to simultaneously fragment and tag DNA in a single-tube reaction (often allowing for the incorporation of platform-specific tags and optional barcodes) to prepare a sequencer-ready library.
[0069] In some embodiments, a nucleic acid library or a portion thereof is amplified (e.g., by a PCR-based method). For example, a sequencing method may include amplification of a nucleic acid library. The nucleic acid library can be amplified before or after immobilization on a solid support (e.g., a solid support in a flow cell). Nucleic acid amplification involves a process of amplifying or increasing the number of nucleic acid templates and / or their complements present (e.g., in the nucleic acid library) by generating one or more copies of the template and / or its complements. Amplification can be carried out by a preferred method. The nucleic acid library can be amplified by thermal cycling or isothermal amplification. In some embodiments, rolling circle amplification is used. In some embodiments, amplification is carried out on a solid support (e.g., in a flow cell) on which the nucleic acid library or a portion thereof is immobilized. In certain sequencing methods, the nucleic acid library is added to a flow cell and immobilized by hybridization to an anchor under preferred conditions. This type of nucleic acid amplification is often referred to as solid-phase amplification. In some embodiments of solid-phase amplification, all or part of the amplification product is synthesized by extension starting from an immobilized primer. Solid-phase amplification reactions are similar to standard solution-phase amplification, except that at least one of the amplification oligonucleotides (e.g., primers) is immobilized on a solid support. In some embodiments, modified nucleic acids (e.g., nucleic acids modified by the addition of adapters) are amplified.
[0070] In some embodiments, solid-phase amplification includes a nucleic acid amplification reaction involving only one species of oligonucleotide primer immobilized on a surface. In certain embodiments, solid-phase amplification includes multiple different species of immobilized oligonucleotide primers. In some embodiments, solid-phase amplification may include a nucleic acid amplification reaction involving one species of oligonucleotide primer immobilized on a solid surface and a second different species of oligonucleotide primer in solution. Multiple different species of immobilized or solution-based primers can be used. Non-limiting examples of solid-phase nucleic acid amplification reactions include interfacial amplification, bridge amplification, emulsion PCR, WildFire amplification (e.g., U.S. Patent Application Publication 2013 / 0012399), or combinations thereof.
[0071] In block 115, nucleic acids from a nucleic acid library (e.g., nucleic acid fragments, sample nucleic acids, cell-free nucleic acids) are sequenced using a sequencing subsystem. In certain cases, a complete or substantially complete sequence is obtained, and in some cases, a partial sequence is obtained. Nucleic acid sequencing generally produces a collection of sequence reads. As used herein, “reads” (e.g., “a read,” “a sequence read”) are short nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of a nucleic acid fragment (“single-end reads”), and in some cases, from both ends of a nucleic acid fragment (e.g., paired-end reads, double-end reads).
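The difference between reading one end and both ends of a fragment can be illustrated with a toy sketch. It assumes, as a common convention that the text does not state, that the second read of a pair is reported as the reverse complement of the fragment's far end. The function names and the example fragment are hypothetical.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def single_end_read(fragment: str, read_len: int) -> str:
    """Read the first read_len bases from one end of the fragment."""
    return fragment[:read_len]


def paired_end_reads(fragment: str, read_len: int) -> Tuple[str, str]:
    """Read read_len bases from each end; the second read comes from the opposite strand."""
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]  # assumption: opposite-strand convention
    return read1, read2


fragment = "ATGCCGTAAGCTTGAC"  # made-up 16-base fragment
r1, r2 = paired_end_reads(fragment, 5)
print(r1, r2)
```

When the read length is at least the fragment length, both reads span the whole fragment, which is consistent with the short read lengths discussed for cell-free DNA.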
[0072] The length of a sequence read is often associated with a particular sequencing technique. High-throughput methods can provide sequence reads that can vary in size from tens to hundreds of base pairs (bp), for example. Nanopore sequencing can provide sequence reads that can vary in size from tens to hundreds or thousands of base pairs, for example. In some embodiments, the sequence read has an average, median, mean, or absolute length of about 15 bp to about 900 bp. In certain embodiments, the sequence read has an average, median, mean, or absolute length of about 1000 bp or more. In some embodiments, the sequence read has an average, median, mean, or absolute length of about 1500, 2000, 2500, 3000, 3500, 4000, 4500, or 5000 bp or more. In some embodiments, the sequence read has an average, median, mean, or absolute length of about 100 bp to about 200 bp. In some embodiments, the sequence reads have an average, median, mean, or absolute length of approximately 140 bp to approximately 160 bp. For example, the sequence reads may have an average, median, mean, or absolute length of approximately 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, or 160 bp.
[0073] In some embodiments, the nominal, mean, average, or absolute length of a single-end read is, in some cases, about 10 contiguous nucleotides to about 250 or more contiguous nucleotides, about 15 contiguous nucleotides to about 200 or more contiguous nucleotides, about 15 contiguous nucleotides to about 150 or more contiguous nucleotides, about 15 contiguous nucleotides to about 125 or more contiguous nucleotides, about 15 contiguous nucleotides to about 100 or more contiguous nucleotides, about 15 contiguous nucleotides to about 75 or more contiguous nucleotides, about 15 contiguous nucleotides to about 60 or more contiguous nucleotides, about 15 contiguous nucleotides to about 50 or more contiguous nucleotides, about 15 contiguous nucleotides to about 40 or more contiguous nucleotides, and, in some embodiments, about 15 contiguous nucleotides to about 36 or more contiguous nucleotides. In certain embodiments, the nominal, mean, average, or absolute length of a single-end read is about 20 to about 30 bases, or about 24 to about 28 bases. In certain embodiments, the nominal, mean, average, or absolute length of a single-end read is approximately 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, or approximately 29 bases or longer. In certain embodiments, the nominal, mean, average, or absolute length of a single-end read is approximately 20 to approximately 200 bases, approximately 100 to approximately 200 bases, or approximately 140 to approximately 160 bases. In certain embodiments, the nominal, mean, average, or absolute length of a single-end read is approximately 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or approximately 200 bases or longer.
In certain embodiments, the nominal, mean, average, or absolute length of a paired-end read is, in some cases, approximately 10 to approximately 25 contiguous nucleotides or longer (e.g., approximately 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25 nucleotides or longer), approximately 15 to approximately 20 contiguous nucleotides or longer, and in some cases, approximately 17 or approximately 18 contiguous nucleotides. In certain embodiments, the nominal, mean, average, or absolute length of a paired-end read may be approximately 25 to 400 or more contiguous nucleotides (e.g., approximately 25, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, or 400 or more nucleotides), approximately 50 to 350 contiguous nucleotides, approximately 100 to 325 contiguous nucleotides, approximately 150 to 325 contiguous nucleotides, approximately 200 to 325 contiguous nucleotides, approximately 275 to 310 contiguous nucleotides, approximately 100 to 200 contiguous nucleotides, approximately 100 to 175 contiguous nucleotides, approximately 125 to 175 contiguous nucleotides, and in some cases, approximately 140 to 160 contiguous nucleotides. In certain embodiments, the nominal, mean, average, or absolute length of a paired-end read is approximately 150 contiguous nucleotides, and in some cases, 150 contiguous nucleotides.
[0074] In some embodiments, the nucleotide sequence reads obtained from a sample are partial nucleotide sequence reads. As used herein, “partial nucleotide sequence read” refers to a sequence read of any length with incomplete sequence information, which is also referred to as sequence ambiguity. Partial nucleotide sequence reads may lack information about nucleic acid base identity and / or the position or order of nucleic acid bases. Partial nucleotide sequence reads generally do not include sequence reads in which the incomplete sequence information (i.e., fewer than all of the bases being sequenced or determined) is due only to inadvertent or unintentional sequencing errors. Such sequencing errors are specific to a particular sequencing process and may include, for example, incorrect calls of nucleic acid base identity and missing or extra nucleic acid bases. Therefore, with respect to partial nucleotide sequence reads herein, certain information about the sequence is often intentionally excluded; that is, sequence information is intentionally obtained for fewer than all of the nucleic acid bases, in a way that might otherwise be characterized as, or be, a sequencing error. In some embodiments, partial nucleotide sequence reads may extend to a portion of a nucleic acid fragment. In some embodiments, partial nucleotide sequence reads can extend to the entire length of a nucleic acid fragment. Partial nucleotide sequence reads are described, for example, in International Patent Application Publication No. 2013 / 052907, the entirety of which is incorporated herein by reference for all purposes.
[0075] A read is generally a representation of a nucleotide sequence in a physical nucleic acid. For example, in a read containing an ATGC description of a sequence, in the physical nucleic acid, "A" represents an adenine nucleotide, "T" represents a thymine nucleotide, "G" represents a guanine nucleotide, and "C" represents a cytosine nucleotide. Sequence reads obtained from a sample from a subject can be reads from a mixture of minority and majority nucleic acids. For example, a sequence read obtained from the blood of a cancer patient can be a read from a mixture of cancer and non-cancer nucleic acids. In another example, a sequence read obtained from the blood of a pregnant woman can be a read from a mixture of fetal and maternal nucleic acids. A mixture of relatively short reads can be converted by the processes described herein to a representation of the genomic nucleic acid present in the subject and / or a genomic nucleic acid present in a tumor or fetus. In certain cases, a mixture of relatively short reads can be converted, for example, to a representation of copy number variation, gene mutation / gene alteration, or aneuploidy. In one example, reads of a mixture of cancer and non-cancerous nucleic acids can be converted into a composite chromosome or a portion thereof that contains features of one or both of the chromosomes of cancer and non-cancerous cells. In another example, reads of a mixture of maternal and fetal nucleic acids can be converted into a composite chromosome or a portion thereof that contains features of one or both of the chromosomes of the maternal and fetal cells.
[0076] In some cases, circulating cell-free nucleic acid fragments (CCF fragments) obtained from cancer patients include nucleic acid fragments derived from normal cells (i.e., non-cancerous fragments) and nucleic acid fragments derived from cancer cells (i.e., cancerous fragments). Sequence reads derived from CCF fragments originating from normal cells (i.e., non-cancerous cells) are referred to herein as “non-cancerous reads.” Sequence reads derived from CCF fragments originating from cancer cells are referred to herein as “cancerous reads.” CCF fragments from which non-cancerous reads are obtained may be referred to herein as non-cancerous templates, and CCF fragments from which cancerous reads are obtained may be referred to herein as cancerous templates.
[0077] In some cases, circulating cell-free nucleic acid fragments (CCF fragments) obtained from pregnant women include nucleic acid fragments of fetal cell origin (i.e., fetal fragments) and nucleic acid fragments of maternal cell origin (i.e., maternal fragments). Sequence reads derived from fetal-origin CCF fragments are referred to herein as “fetal reads.” Sequence reads derived from CCF fragments of the genome of a pregnant woman with a fetus (e.g., a mother) are referred to herein as “maternal reads.” CCF fragments from which fetal reads are obtained are referred to herein as fetal templates, and CCF fragments from which maternal reads are obtained are referred to herein as maternal templates.
[0078] In certain embodiments, “obtaining” nucleic acid sequence reads of a sample from a subject and / or “obtaining” nucleic acid sequence reads of a biological sample from one or more reference persons may involve directly sequencing the nucleic acid to obtain sequence information. In some embodiments, “obtaining” may involve receiving sequence information obtained directly from the nucleic acid by another.
[0079] In some embodiments, some or all nucleic acids in the sample are enriched and / or amplified (e.g., nonspecifically, e.g., by a PCR-based method) before or during sequencing. In certain embodiments, specific nucleic acid species or subsets in the sample are enriched and / or amplified before or during sequencing. In some embodiments, species or subsets of a pre-selected pool of nucleic acids are randomly sequenced. In some embodiments, nucleic acids in the sample are not enriched and / or amplified before or during sequencing.
[0080] In some embodiments, a representative fraction of the genome is sequenced, which is sometimes referred to as “coverage” or “fold coverage.” For example, 1x coverage indicates that approximately 100% of the genome’s nucleotide sequence is represented by reads. In some cases, fold coverage is referred to as (and is directly proportional to) “sequencing depth.” In some embodiments, “fold coverage” is a relative term referring to a previous sequencing run as a reference. For example, a second sequencing run may have 2-fold less coverage than a first sequencing run. In some embodiments, the genome is sequenced with redundancy, and a given region of the genome can be covered by two or more reads or overlapping reads (e.g., fold coverage greater than 1, e.g., 2x coverage). In some embodiments, the genome or a large portion of the genome (e.g., genome-wide sequencing) is sequenced with coverage of approximately 0.1x to approximately 100x, approximately 0.2x to approximately 20x, or approximately 0.2x to approximately 1x (e.g., coverage of approximately 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90x). In some embodiments, the genome or a large portion of the genome (e.g., genome-wide sequencing) is sequenced with coverage of approximately 0.01x to approximately 100x, approximately 0.1x to approximately 20x, or approximately 0.1x to approximately 1x (e.g., coverage of approximately 0.015, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90x or more). In some embodiments, specific regions of the genome (e.g., genomic regions from targeted methods and / or probe-based methods) are sequenced, and the fold coverage value generally refers to the fraction of the specific genomic region that has been sequenced (i.e., the fold coverage value does not refer to the entire genome).
In some cases, specific genomic regions are sequenced with coverage of 1,000x or more. For example, specific genomic regions can be sequenced with coverage of 2,000x, 5,000x, 10,000x, 20,000x, 30,000x, 40,000x, or 50,000x. In some embodiments, sequencing has coverage of approximately 1,000x to 100,000x. In some embodiments, sequencing has coverage of approximately 10,000x to 70,000x. In some embodiments, sequencing has coverage of approximately 20,000x to 60,000x. In some embodiments, sequencing has coverage of approximately 30,000x to 50,000x.
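The coverage values above can be related to read count and read length with a commonly used approximation: coverage is roughly the total number of sequenced bases divided by the size of the region of interest. The text does not give this formula, so the sketch below is an assumption for illustration, and the genome size and read counts used are hypothetical round numbers.

```python
import math


def fold_coverage(n_reads: int, read_length_bp: int, region_size_bp: int) -> float:
    """Approximate coverage as total sequenced bases divided by region size."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return n_reads * read_length_bp / region_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, region_size_bp: int) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    return math.ceil(target_coverage * region_size_bp / read_length_bp)


# Hypothetical low-coverage whole-genome run: 10 million reads of 150 bp
# against an assumed genome size of 3.1 billion bp.
genome_wide = fold_coverage(10_000_000, 150, 3_100_000_000)
print(f"genome-wide coverage is about {genome_wide:.2f}x")

# Hypothetical targeted panel: the same reads spread over a 50 kb region.
targeted = fold_coverage(10_000_000, 150, 50_000)
print(f"targeted coverage is about {targeted:,.0f}x")
```

The same read budget yields a fraction of 1x across a whole genome but tens of thousands of fold over a small targeted region, which is why the two settings are described with such different coverage ranges.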
[0081] In some embodiments, one nucleic acid sample from one individual is sequenced. In certain embodiments, nucleic acids from each of two or more samples are sequenced, and the samples are from one or different individuals. In certain embodiments, nucleic acid samples from two or more biological samples are pooled, and each biological sample is from one or more individuals, and the pool is sequenced. In the latter embodiments, the nucleic acid samples from each biological sample are often identified by one or more unique identifiers.
[0082] In some embodiments, the sequencing method uses identifiers that enable multiplexing of sequencing reactions in the sequencing process. A larger number of unique identifiers allows, for example, a greater number of samples and / or chromosomes to be multiplexed for detection in the sequencing process. The sequencing process can be carried out using any suitable number of unique identifiers (e.g., 4, 8, 12, 24, 48, 96, 1000, 10000, 108, 1024, or more).
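Assigning pooled reads back to their samples by unique identifier can be sketched as follows. This is a simplified illustration: it assumes the identifier is an inline index occupying the first bases of each read and requires an exact match, whereas real platforms may read the index separately and tolerate mismatches. The index sequences and sample names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(
    reads: Iterable[str], index_to_sample: Dict[str, str], index_len: int
) -> Dict[str, List[str]]:
    """Group reads by the sample whose index matches the first index_len bases."""
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for seq in reads:
        index, insert = seq[:index_len], seq[index_len:]
        # Reads whose index is not recognized are kept apart rather than discarded.
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


index_to_sample = {"ACGT": "sample_A", "TGCA": "sample_B"}  # hypothetical indices
pooled_reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCCAAA", "GGGGAAAATT"]
demuxed = demultiplex(pooled_reads, index_to_sample, index_len=4)
print(demuxed)
```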
[0083] The sequencing process may, in some cases, utilize a solid phase, which may include a flow cell to which nucleic acids from a library can be attached and reagents can flow and come into contact with the attached nucleic acids. The flow cell may, in some cases, include flow cell lanes, and the use of identifiers can facilitate the analysis of several samples in each lane. The flow cell is often a solid support that can be configured to allow the reagent solution to be held and / or passed through in an orderly manner over the bound analyte. The flow cell is often planar, optionally transparent, generally on a millimeter or sub-millimeter scale, and often has channels or lanes where analyte / reagent interactions occur. In some embodiments, the number of samples analyzed in a given flow cell lane depends on the number of unique identifiers used during library preparation and / or probe design. For example, multiplexing using 12 identifiers allows for the simultaneous analysis of 96 samples in an 8-lane flow cell (e.g., equal to the number of wells in a 96-well microwell plate). Similarly, multiplexing using, for example, 48 identifiers allows for the simultaneous analysis of 384 samples in an 8-lane flow cell (e.g., equal to the number of wells in a 384-well microwell plate). Non-limiting examples of commercially available multiplex sequencing kits include Illumina's multiplex sample preparation oligonucleotide kits and multiplex sequencing primers, as well as the PhiX control kit (e.g., Illumina catalog numbers PE-400-1001 and PE-400-1002, respectively).
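The multiplexing arithmetic in the examples above (12 identifiers across 8 lanes giving 96 samples, and 48 identifiers giving 384) can be checked with a short sketch. It assumes each lane carries one sample per identifier; the function names are hypothetical.

```python
def samples_per_flow_cell(n_identifiers: int, n_lanes: int) -> int:
    """Samples analyzable at once when each lane holds one sample per identifier."""
    return n_identifiers * n_lanes


def identifiers_needed(n_samples: int, n_lanes: int) -> int:
    """Smallest number of identifiers per lane that accommodates n_samples."""
    return -(-n_samples // n_lanes)  # ceiling division on integers


print(samples_per_flow_cell(12, 8))
print(samples_per_flow_cell(48, 8))
print(identifiers_needed(100, 8))
```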
[0084] Any suitable method for sequencing nucleic acids can be used, non-limiting examples of which include Maxam-Gilbert sequencing, chain-termination methods, sequencing by synthesis, sequencing by ligation, sequencing by mass spectrometry, microscopy-based techniques, or combinations thereof. In some embodiments, first-generation techniques, such as Sanger sequencing methods, including automated Sanger sequencing methods and microfluidic Sanger sequencing, can be used in the methods provided herein. In some embodiments, sequencing techniques that use nucleic acid imaging techniques (e.g., transmission electron microscopy (TEM) and atomic force microscopy (AFM)) can be used. In some embodiments, high-throughput sequencing methods are used. High-throughput sequencing methods generally involve clonally amplified DNA templates or single DNA molecules that are sequenced in a massively parallel manner, sometimes within a flow cell. Next-generation (e.g., second- and third-generation) sequencing techniques capable of sequencing DNA in a massively parallel manner can be used in the methods described herein and are collectively referred to herein as “massively parallel sequencing” (MPS). In some embodiments, the MPS sequencing method uses a targeted approach in which a specific chromosome, gene, or region of interest is sequenced. In certain embodiments, a non-targeted approach is used in which most or all nucleic acids in a sample are randomly sequenced, amplified, and / or captured.
[0085] In some embodiments, targeted enrichment, amplification, and / or sequencing approaches are used. Targeted approaches often isolate, select, and / or enrich a subset of nucleic acids in a sample for further processing using sequence-specific oligonucleotides. In some embodiments, a library of sequence-specific oligonucleotides is used to target (e.g., hybridize) one or more sets of nucleic acids in a sample. Sequence-specific oligonucleotides and / or primers are often selective for specific sequences (e.g., unique nucleic acid sequences) present in one or more chromosomes, genes, exons, introns, and / or regulatory regions of interest. Any preferred method or combination of methods can be used to enrich, amplify, and / or sequence one or more target nucleic acid subsets. In some embodiments, target sequences are isolated and / or enriched by trapping them on a solid phase (e.g., flow cell, beads) using one or more sequence-specific anchors. In some embodiments, target sequences are enriched and / or amplified by polymerase-based methods (e.g., PCR-based methods, any preferred polymerase-based extension) using sequence-specific primers and / or primer sets. Sequence-specific anchors can often be used as sequence-specific primers.
[0086] MPS sequencing, in some cases, uses sequencing by synthesis and specific imaging processes. Nucleic acid sequencing techniques that may be used in the methods described herein include sequencing by synthesis and reversible terminator-based sequencing (e.g., Illumina's Genome Analyzer, Genome Analyzer II, HISEQ2000, HISEQ2500 (Illumina, San Diego, Calif.)). This technique allows for the parallel sequencing of millions of nucleic acid (e.g., DNA) fragments. An example of this type of sequencing technique uses a flow cell containing an optically transparent slide with eight individual lanes, on which bound oligonucleotide anchors (e.g., adapter primers) are located.
[0087] Sequencing by synthesis is generally performed by iteratively adding nucleotides (e.g., by covalent addition) to a primer or an existing nucleic acid chain in a template-directed manner. Each iterative addition of a nucleotide is detected, and the process is repeated multiple times until the sequence of the nucleic acid chain is obtained. The length of the obtained sequence depends in part on the number of addition and detection steps performed. In some embodiments of sequencing by synthesis, one, two, three, or more nucleotides of the same type (e.g., A, G, C, or T) are added and detected in one round of nucleotide addition. Nucleotides can be added by any suitable method (e.g., enzymatically or chemically). For example, in some embodiments, polymerases or ligases add nucleotides to the primer or an existing nucleic acid chain in a template-directed manner. In some embodiments of sequencing by synthesis, different types of nucleotides, nucleotide analogs, and / or identifiers are used. In some embodiments, reversible terminators and / or removable (e.g., cleavable) identifiers are used. In some embodiments, fluorescently labeled nucleotides and / or nucleotide analogs are used. In certain embodiments, sequencing by synthesis includes a cleavage step (e.g., cleavage and removal of identifiers) and / or a washing step. In some embodiments, the addition of one or more nucleotides is detected by suitable methods described herein or known in the art, non-limiting examples of which include any suitable imaging device, a suitable camera, a digital camera, a CCD (charge-coupled device) based imaging device (e.g., a CCD camera), a CMOS (complementary metal-oxide-semiconductor) based imaging device (e.g., a CMOS camera), a photodiode (e.g., a photomultiplier tube), an electron microscope, a field-effect transistor (e.g., a DNA field-effect transistor), an ISFET ion sensor (e.g., a CHEMFET sensor), or a combination thereof.
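The iterative add-and-detect cycle described above can be illustrated with a toy simulation. It is a deliberately simplified model, not a description of any instrument: it assumes exactly one nucleotide is incorporated and detected per cycle, without errors, and that the template is given in the order the polymerase traverses it. The function name and the template are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_by_synthesis(template: str, n_cycles: int) -> str:
    """Toy model: each cycle adds the base complementary to the next template base."""
    read = []
    for cycle in range(min(n_cycles, len(template))):
        incorporated = COMPLEMENT[template[cycle]]  # template-directed addition
        read.append(incorporated)  # detection step records the added base
    return "".join(read)


template = "TACGGA"  # made-up template
print(sequence_by_synthesis(template, 4))
print(sequence_by_synthesis(template, 10))
```

The second call shows that running more cycles than there are template bases cannot lengthen the read, while the first shows that the read length is otherwise set by the number of addition and detection steps.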
[0088] Any suitable MPS method, system, or technology platform for carrying out the methods described herein can be used to obtain nucleic acid sequence reads. Non-limiting examples of MPS platforms include Illumina / Solexa / HiSeq (e.g., Illumina's Genome Analyzer, Genome Analyzer II, HISEQ2000, HISEQ), SOLiD, Roche / 454, PACBIO and / or SMRT, Helicos True Single Molecule Sequencing, Ion Torrent and Ion semiconductor-based sequencing (e.g., as developed by Life Technologies), WildFire, 5500, 5500xl W, and / or 5500xl W Genetic Analyzer-based technologies (e.g., as developed and marketed by Life Technologies, U.S. Patent Application Publication 2013 / 0012399), polony sequencing, pyrosequencing, Massively Parallel Signature Sequencing (MPSS), RNA polymerase (RNAP) sequencing, LaserGen systems and methods, nanopore-based platforms, chemical-sensitive field-effect transistor (CHEMFET) arrays, electron microscopy-based sequencing (e.g., as developed by ZS Genetics or Halcyon Molecular), nanoball sequencing, or combinations thereof. Other sequencing methods that may be used to carry out the methods described herein include digital PCR, sequencing by hybridization, nanopore sequencing, and chromosome-specific sequencing (e.g., using DANSR (Digital Analysis of Selected Regions) technology).
[0089] In some embodiments, sequence reads are generated, obtained, collected, assembled, manipulated, transformed, processed, and / or provided by the sequencing subsystem. A machine including the sequencing subsystem can be a suitable machine and / or apparatus for determining the sequence of nucleic acids using sequencing techniques known in the art. In some embodiments, the sequencing subsystem can perform alignment, assembly, fragmentation, complement, reverse complement, and / or error checking (e.g., error-corrected sequence reads).
[0090] In block 120, sequence reads are processed using the sequence processing subsystem to obtain sequence read data. Sequence read processing includes read alignment, mapping, and filtering.
Alignment and mapping
[0091] In some embodiments, one or more processing steps include aligning and mapping sequence reads. Sequence reads can be mapped, and the number of reads that map to a specific nucleic acid region (e.g., a chromosome or a portion thereof) is referred to as a count. Any preferred mapping method (e.g., a process, algorithm, program, software, subsystem, etc., or a combination thereof) can be used. Specific aspects of the mapping process are described below. Mapping nucleotide sequence reads (i.e., sequence information from a fragment whose physical genomic location is unknown) can be carried out in several ways, often involving alignment of the obtained sequence reads with matching sequences in a reference genome. In such alignment, the sequence reads are generally aligned to a reference sequence, and the aligned sequences are designated as “mapped,” “mapped sequence reads,” or “mapped reads.” In certain embodiments, mapped sequence reads are referred to as “hits” or “counts.” In some embodiments, mapped sequence reads are grouped together according to various parameters and assigned to specific genomic regions, which are discussed in further detail below.
[0092] The terms “aligned,” “alignment,” or “aligning” generally refer to two or more nucleic acid sequences that can be identified as a match (e.g., 100% identity) or a partial match. Alignment can be performed manually or by computer (e.g., software, program, subsystem, or algorithm), a non-limiting example of which is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. The alignment of a sequence read can be a 100% sequence match. In some cases, the alignment is less than a 100% sequence match (i.e., non-exact match, partial match, partial alignment). In some embodiments, the alignment is approximately a 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76%, or 75% match. In some embodiments, the alignment includes mismatches. In some embodiments, the alignment includes one, two, three, four, or five mismatches. Two or more sequences can be aligned using either strand (e.g., sense strand or antisense strand). In certain embodiments, a nucleic acid sequence is aligned with the reverse complement of another nucleic acid sequence.
[0093] Various computational methods can be used to map each sequence read to a portion. Non-limiting examples of computer algorithms that can be used to align sequences include, without limitation, BLAST, BLITZ, FASTA, BOWTIE1, BOWTIE2, ELAND, MAQ, PROBEMATCH, SOAP, BWA, or SEQMAP, or variations thereof, or combinations thereof. In some embodiments, sequence reads can be aligned with sequences in a reference genome. In some embodiments, sequence reads can be found and / or aligned with sequences in nucleic acid databases known in the art, including, for example, GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory), and DDBJ (DNA Databank of Japan). BLAST or a similar tool can be used to search for identified sequences against a sequence database. The search hits can then be used, for example, to sort the identified sequences into appropriate portions (as described below).
[0094] In some embodiments, reads may map uniquely or non-uniquely to a portion within a reference genome. A read is considered "uniquely mapped" if it aligns with a single sequence in the reference genome. A read is considered "non-uniquely mapped" if it aligns with two or more sequences in the reference genome. In some embodiments, non-uniquely mapped reads are excluded from further analysis (e.g., quantification). In certain embodiments, a certain small degree of mismatch (0-1) may be tolerated to account for possible single nucleotide polymorphisms between reads from individual samples mapped to the reference genome. In some embodiments, no degree of mismatch is tolerated in reads mapped to a reference sequence.
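The exclusion of non-uniquely mapped reads described above can be illustrated with a minimal Python sketch. The input structure (a dictionary from a read name to the list of locations an aligner reported for it) and the function name `keep_uniquely_mapped` are assumptions made for illustration and do not appear in this disclosure.

```python
from typing import Dict, List, Tuple

# Hypothetical input: read name -> list of (chromosome, position) alignments
# reported by an aligner within the tolerated mismatch limit.
Alignments = Dict[str, List[Tuple[str, int]]]


def keep_uniquely_mapped(alignments: Alignments) -> Dict[str, Tuple[str, int]]:
    """Return only the reads that align to exactly one reference location."""
    return {
        read: hits[0]
        for read, hits in alignments.items()
        if len(hits) == 1
    }


example = {
    "read_1": [("chr1", 10_500)],                   # uniquely mapped
    "read_2": [("chr2", 3_000), ("chr7", 88_100)],  # non-uniquely mapped
    "read_3": [],                                   # unmapped
}
unique_reads = keep_uniquely_mapped(example)
```

In this sketch, only `read_1` would be carried forward to quantification.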
[0095] As used herein, the term “reference genome” can refer to any specific known, sequenced, or characterized genome of any organism or virus, whether partial or complete, that can be used to reference identified sequences from a subject. For example, reference genomes used for human subjects and many other organisms can be found at the National Center for Biotechnology Information at the World Wide Web URL ncbi.nlm.nih.gov. “Genome” refers to the complete genetic information of an organism or virus expressed in a nucleic acid sequence. As used herein, a reference sequence or reference genome is often an assembled or partially assembled genome sequence from one or more individuals. In some embodiments, the reference genome is an assembled or partially assembled genome sequence derived from one or more human individuals. In some embodiments, the reference genome includes sequences assigned to chromosomes.
[0096] In certain embodiments, mappability is evaluated for genomic regions (e.g., portions, genomic segments). Mappability is the ability to unambiguously align nucleotide sequence reads to a portion of a reference genome, typically up to a specified number of mismatches, including, for example, 0, 1, 2, or more mismatches. For a given genomic region, the expected mappability can be estimated by using a sliding-window approach with a preset read length and averaging the resulting read-level mappability values. Genomic regions containing stretches of unique nucleotide sequence may, in some cases, have high mappability values.
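As a rough illustration of the sliding-window estimate, the following Python sketch scores each window of a preset read length as 1 if its sequence occurs exactly once in a toy reference and 0 otherwise, then averages those read-level values over a region. It assumes zero tolerated mismatches and the forward strand only; the function names and the toy reference are hypothetical.

```python
from collections import Counter
from typing import List


def kmer_uniqueness(reference: str, read_length: int) -> List[int]:
    """Flag each window start: 1 if its k-mer occurs once in the reference, else 0."""
    kmers = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    occurrences = Counter(kmers)
    return [1 if occurrences[k] == 1 else 0 for k in kmers]


def region_mappability(reference: str, start: int, end: int, read_length: int) -> float:
    """Average the read-level flags for windows starting in [start, end)."""
    flags = kmer_uniqueness(reference, read_length)
    window = flags[start:min(end, len(flags))]
    return sum(window) / len(window) if window else 0.0


# Toy reference in which "ACGT" occurs twice (positions 0 and 4).
toy_reference = "ACGTACGTTTGACCA"
unique_region = region_mappability(toy_reference, 8, 12, read_length=4)
repeat_region = region_mappability(toy_reference, 0, 4, read_length=4)
```

A production implementation would typically work from an aligner's output over a full reference and allow mismatches; the sketch only shows the averaging step.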
[0097] For paired-end sequencing, reads may be mapped to a reference genome using a suitable mapping and/or alignment program, non-limiting examples of which include BWA (Li H. and Durbin R. (2009) Bioinformatics 25, 1754-60), Novoalign [Novocraft (2010)], Bowtie (Langmead B, et al., (2009) Genome Biol. 10: R25), SOAP2 (Li R, et al., (2009) Bioinformatics 25, 1966-67), BFAST (Homer N, et al., (2009) PLoS ONE 4, e7767), GASSST (Rizk, G. and Lavenier, D. (2010) Bioinformatics 26, 2534-2540), and Mpscan (Rivals E., et al. (2009) Lecture Notes in Computer Science 5724, 246-260). Paired-end reads can be mapped and/or aligned using a suitable short-read alignment program. Non-limiting examples of short-read alignment programs include BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, BWA, CASHX, CUDA-EC, CUSHAW, CUSHAW2, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP, Geneious Assembler, iSAAC, LAST, MAQ, mrFAST, mrsFAST, MOSAIK, Mpscan, Novoalign, NovoalignCS, Novocraft, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, Qpalma, RazerS, REAL, cREAL, RMAP, rNA, RTG, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SOAP, SOAP2, SOAP3, SOCS, SSAHA, SSAHA2, Stampy, StoRM, Subread, Subjunc, Taipan, UGENE, VelociMapper, TimeLogic, XpressAlign, ZOOM, or combinations thereof. Paired-end reads are often mapped to opposite ends of the same polynucleotide fragment according to a reference genome. In some embodiments, read mates are mapped independently. In some embodiments, information from both sequence reads (i.e., from each end) is considered in the mapping process. A reference genome is often used to determine and/or infer the sequences of nucleic acids located between paired-end read mates. As used herein, the term “discordant read pair” refers to a paired-end read comprising a pair of read mates, one or both of which fail to unambiguously map to the same region of the reference genome, defined in part by a segment of contiguous nucleotides. In some embodiments, a discordant read pair is a pair of paired-end read mates that map to unexpected locations in the reference genome. Non-limiting examples of unexpected locations in the reference genome include (i) two different chromosomes, (ii) locations separated by more than a given fragment size (e.g., greater than 300 bp, greater than 500 bp, greater than 1000 bp, greater than 5000 bp, or greater than 10,000 bp), (iii) orientations that do not match the reference sequence (e.g., opposite orientation), or combinations thereof. In some embodiments, discordant read mates are identified according to the length (e.g., average length, given fragment size) or expected length of the template polynucleotide fragments in the sample. For example, read mates that map to locations separated by more than the average length or expected length of the polynucleotide fragments in the sample may, in some cases, be identified as discordant read pairs. Read pairs that map in opposite orientations may, in some cases, be identified by taking the reverse complement of one of the reads and comparing the alignment of both reads using the same strand of the reference sequence. Discordant read pairs can be identified by any suitable method and/or algorithm known in the art or described herein (e.g., SVDetect, Lumpy, BreakDancer, BreakDancerMax, CREST, DELLY, etc., or a combination thereof).
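The three kinds of unexpected location listed above (different chromosomes, separation beyond a fragment size, unexpected orientation) can be checked with a short Python sketch. This is not the method of SVDetect, Lumpy, or any other named tool. The `MateAlignment` record, the 500 bp default, and the assumed expected layout (left mate forward, right mate reverse) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MateAlignment:
    chrom: str
    pos: int          # leftmost mapped coordinate of the mate
    is_reverse: bool  # True if the mate aligns to the reverse strand


def classify_pair(m1: MateAlignment, m2: MateAlignment, max_fragment: int = 500) -> str:
    """Return 'concordant' or the reason the pair maps to an unexpected location."""
    if m1.chrom != m2.chrom:
        return "different_chromosomes"
    if abs(m1.pos - m2.pos) > max_fragment:
        return "distance_exceeds_fragment_size"
    left, right = (m1, m2) if m1.pos <= m2.pos else (m2, m1)
    # Assumed expected layout: left mate on the forward strand, right mate on the reverse.
    if left.is_reverse or not right.is_reverse:
        return "unexpected_orientation"
    return "concordant"


ok = classify_pair(MateAlignment("chr1", 100, False), MateAlignment("chr1", 300, True))
split = classify_pair(MateAlignment("chr1", 100, False), MateAlignment("chr5", 300, True))
far = classify_pair(MateAlignment("chr1", 100, False), MateAlignment("chr1", 5_000, True))
same_strand = classify_pair(MateAlignment("chr1", 100, False), MateAlignment("chr1", 300, False))
```

The distance check here compares leftmost coordinates only; a fuller treatment would use the outer span of both mates.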
Partitioning
[0098] In some embodiments, mapped sequence reads are grouped together according to various parameters and assigned to specific genomic regions (e.g., reference genome regions). A “region” may also be referred to herein as a “genomic partition,” “bin,” “partition,” “reference genome region,” “chromosome region,” or “genomic region.” A region is often defined by partitioning the genome according to one or more features. Non-limiting examples of specific partitioning features include length (e.g., fixed length, non-fixed length) and other structural features. Genomic regions may include one or more of the following features: fixed length, non-fixed length, random length, non-random length, equal length, unequal length (e.g., at least two genomic regions are of unequal length), non-overlapping (e.g., the 3' end of a genomic region may, in some cases, be adjacent to the 5' end of an adjacent genomic region), overlapping (e.g., at least two genomic regions overlap), adjacent, continuous, non-adjacent, and discontinuous. A genomic portion can be approximately 1 to 1,000 kilobases long (for example, approximately 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900 kilobases), approximately 5 to 500 kilobases, approximately 10 to 100 kilobases, or approximately 40 to 60 kilobases.
[0099] Partitioning may, in some cases, be based, or partially based, on specific informational features, such as information content and information gain. Non-limiting examples of specific informational features include alignment speed and/or convenience, sequencing coverage variability, GC content (e.g., stratified GC content, specific GC content, high or low GC content), GC content uniformity, other measures of sequence content (e.g., individual nucleotide fractions, pyrimidine or purine fractions, native vs. non-native nucleic acid fractions, methylated nucleotide fractions, and CpG content), methylation status, duplex melting temperature, amenability to sequencing or PCR, uncertainty values assigned to individual portions of the reference genome, and/or targeted searches for specific features. In some embodiments, information content may be quantified using p-value profiles that measure the significance of specific genomic locations for distinguishing between confirmed normal and abnormal subjects (e.g., euploid and trisomic subjects, respectively).
[0100] In some embodiments, partitioning a genome can eliminate similar regions (e.g., identical or homologous regions or sequences) across the genome, retaining only unique regions. Regions removed during partitioning may be within a single chromosome, one or more chromosomes, or span multiple chromosomes. In some embodiments, the partitioned genome is often reduced and optimized for faster alignment, focusing on uniquely identifiable sequences.
[0101] In some embodiments, genomic portions arise from non-overlapping fixed-size partitioning, which results in consecutive non-overlapping fixed-length portions. Such portions are often shorter than chromosomes and often shorter than copy number variation (or copy number alteration) regions (e.g., regions that are duplicated or deleted), the latter of which may be referred to as segments. A “segment” or “genomic segment” often comprises two or more fixed-length genomic portions, and often comprises two or more consecutive fixed-length portions (e.g., about 2 to about 100 such portions (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90 such portions)).
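Non-overlapping fixed-size partitioning reduces to integer division of each mapped position by the portion length. The Python sketch below counts mapped reads per portion under that scheme. The 50 kb portion size is one value within the 40 to 60 kilobase range mentioned above, and the read representation and function name are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

BIN_SIZE = 50_000  # assumed portion length: 50 kb


def bin_counts(reads: Iterable[Tuple[str, int]],
               bin_size: int = BIN_SIZE) -> Dict[Tuple[str, int], int]:
    """Count mapped reads per (chromosome, portion index) for fixed-size portions."""
    counts: Counter = Counter()
    for chrom, pos in reads:
        # Integer division assigns each position to exactly one portion.
        counts[(chrom, pos // bin_size)] += 1
    return dict(counts)


# Hypothetical mapped reads as (chromosome, 0-based start position).
mapped_reads = [("chr1", 10), ("chr1", 49_999), ("chr1", 50_000), ("chr2", 120_000)]
counts = bin_counts(mapped_reads)
```

Consecutive portion indices on the same chromosome can then be grouped into segments by summing their counts.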
[0102] Multiple parts may be analyzed in groups, and in some cases, reads mapped to parts are quantified according to specific genomic subgroups. If parts are partitioned by structural features and correspond to regions within the genome, parts may be grouped into one or more segments and / or one or more regions. Non-limiting examples of regions include subchromosomes (i.e., shorter than chromosomes), chromosomes, autosomes, sex chromosomes, and combinations thereof. One or more subchromosomal regions may include genes, gene fragments, regulatory sequences, introns, exons, segments (e.g., segments extending to copy number variation regions, segments extending to copy number polymorphism regions), microduplications, microdeletions, etc. Regions may be smaller than or the same size as the chromosome of interest, and in some cases, smaller than or the same size as a reference chromosome.
Portion filtering and/or selection
[0103] In some embodiments, one or more processing steps include one or more portion filtering steps and/or portion selection steps. As used herein, the term “filtering” means removing a portion or portions of a reference genome from consideration. In certain embodiments, one or more portions are filtered (e.g., subjected to a filtering process), thereby providing filtered portions. In some embodiments, the filtering process removes certain portions and retains others (e.g., a subset of portions). Following the filtering process, the retained portions are often referred to herein as filtered portions.
[0104] The reference genome segments can be selected for removal based on any preferred criteria, including, but not limited to, redundant data (e.g., redundant or duplicate mapping reads), non-informational data (e.g., segments of the reference genome with a median count of zero), segments of the reference genome with excessive or insufficient sequences, noisy data, or a combination of the above. The filtering process often involves removing one or more segments of the reference genome from consideration and subtracting the counts of the one or more segments of the reference genome selected for removal from the counts or summed counts of the reference genome, chromosomes, or segments of genome under consideration. In some embodiments, segments of the reference genome can be removed sequentially (e.g., one at a time to allow evaluation of the effect of removing each individual segment), and in certain embodiments, all segments of the reference genome marked for removal can be removed simultaneously. In some embodiments, segments of the reference genome characterized by variance above or below a certain level are removed, which is, in some cases, referred to herein as filtering out “noisy” segments of the reference genome. In certain embodiments, the filtering process includes obtaining data points from datasets that deviate from the mean profile level of a portion, chromosome, or chromosomal region by a predetermined multiple of the profile variance, and in certain embodiments, the filtering process includes removing data points from datasets that do not deviate from the mean profile level of a portion, chromosome, or chromosomal region by a predetermined multiple of the profile variance. In some embodiments, the filtering process is used to reduce the number of candidate regions of the reference genome analyzed for the presence or absence of gene mutations / gene alterations and / or copy number alterations (e.g., aneuploidy, microdeletions, microduplications). 
Reducing the number of candidate regions of the reference genome analyzed for the presence or absence of gene mutations / gene alterations and / or copy number alterations often reduces the complexity and / or dimensionality of the dataset and, in some cases, increases the speed of searching for and / or identifying gene mutations / gene alterations and / or copy number alterations by more than two orders of magnitude.
[0105] The portions may be processed (e.g., filtered and/or selected) by any suitable method and according to any suitable parameters. Non-limiting examples of features and/or parameters that can be used to filter and/or select portions include redundant data (e.g., redundant or duplicate mapped reads), non-informative data (e.g., portions of the reference genome with zero mapped counts), portions of the reference genome with over-represented or under-represented sequences, noisy data, counts, count variability, coverage, mappability, variability, reproducibility measures, read density, read density variability, level of uncertainty, guanine-cytosine (GC) content, ccf fragment length and/or read length (e.g., fragment length ratio (FLR), fetal ratio statistics (FRS)), DNaseI sensitivity, methylation status, acetylation, histone distribution, chromatin structure, percent repeats, or combinations thereof. The portions may be filtered and/or selected according to any suitable features or parameters that correlate with the features or parameters listed or described herein. Portions can be filtered and/or selected according to features or parameters specific to a portion (e.g., determined for a single portion according to multiple samples) and/or features or parameters specific to a sample (e.g., determined for multiple portions within a sample). In some embodiments, portions are filtered and/or removed according to relatively low mappability, relatively high variability, a high level of uncertainty, relatively long ccf fragment length (e.g., low FRS, low FLR), a relatively large proportion of repetitive sequences, high GC content, low GC content, low count, zero count, high count, etc., or a combination thereof. In some embodiments, portions (e.g., subsets of portions) are selected according to a suitable level of mappability, variability, level of uncertainty, proportion of repetitive sequences, count, GC content, etc., or a combination thereof. In some embodiments, portions (e.g., subsets of portions) are selected according to relatively short ccf fragment length (e.g., high FRS, high FLR). The counts and/or reads mapped to portions are, in some cases, processed (e.g., normalized) before or after filtering or selecting portions (e.g., subsets of portions). In some embodiments, the counts and/or reads mapped to portions are not processed before or after filtering or selecting portions (e.g., subsets of portions).
[0106] In some embodiments, portions may be filtered according to a measure of error (e.g., standard deviation, standard error, calculated variance, p-value, mean absolute error (MAE), and/or mean absolute deviation (MAD)). In certain cases, the measure of error may refer to count variability. In some embodiments, portions are filtered according to count variability. In certain embodiments, count variability is a measure of error determined for counts mapped to a portion of a reference genome across multiple samples (e.g., multiple samples obtained from multiple subjects, e.g., 50 or more, 100 or more, 500 or more, 1000 or more, 5000 or more, or 10,000 or more subjects). In some embodiments, portions with count variability above a predetermined upper limit are filtered (e.g., excluded from consideration). In some embodiments, portions with count variability below a predetermined lower limit are filtered (e.g., excluded from consideration). In some embodiments, portions with count variability outside a predetermined range are filtered (e.g., excluded from consideration). In some embodiments, portions with count variability within a predetermined range are selected (e.g., used to determine the presence or absence of a copy number change). In some embodiments, the count variability of the portions follows a distribution (e.g., a normal distribution). In some embodiments, portions are selected within a quantile of the distribution. In some embodiments, portions within the 99% quantile of the count variability distribution are selected.
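Filtering by count variability can be sketched as follows: a measure of error is computed for each portion across samples, and portions whose value falls outside a predetermined range are excluded. The sketch assumes the median absolute deviation as the measure of error; the range limits, portion names, and counts are made-up values for illustration.

```python
from statistics import median
from typing import Dict, List


def median_abs_deviation(values: List[float]) -> float:
    """Median absolute deviation of the counts for one portion across samples."""
    center = median(values)
    return median(abs(v - center) for v in values)


def select_portions(counts_by_portion: Dict[str, List[float]],
                    lower: float, upper: float) -> List[str]:
    """Keep portions whose across-sample variability lies within [lower, upper]."""
    return [
        portion
        for portion, counts in counts_by_portion.items()
        if lower <= median_abs_deviation(counts) <= upper
    ]


# Hypothetical counts for three portions across five samples.
counts_by_portion = {
    "bin_a": [100, 102, 98, 101, 99],  # stable counts
    "bin_b": [100, 300, 20, 250, 5],   # highly variable counts
    "bin_c": [0, 0, 0, 0, 0],          # zero counts, no variability
}
selected = select_portions(counts_by_portion, lower=0.5, upper=10.0)
```

With these limits, the highly variable portion exceeds the upper limit and the zero-count portion falls below the lower limit, so only `bin_a` is retained.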
[0107] Sequence reads from any number of suitable samples can be used to identify a subset of portions that satisfy one or more criteria, parameters, and/or features described herein. Sequence reads from a group of samples from multiple subjects may be used in some cases. In some embodiments, the multiple subjects include pregnant women. In some embodiments, the multiple subjects include healthy subjects. In some embodiments, the multiple subjects include cancer patients. One or more samples from each of multiple subjects can be used (for example, 1 to about 20 samples from each subject (e.g., about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 19 samples)), and a suitable number of subjects can be used (for example, about 2 to about 10,000 subjects (e.g., about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000 subjects)). In some embodiments, sequence reads from the same test sample from the same subject are mapped to portions within the reference genome and used to generate a subset of those portions.
[0108] The portions can be selected and/or filtered by any suitable method. In some embodiments, portions are selected according to a visual inspection of data, graphs, plots, and/or charts. In certain embodiments, portions are selected and/or filtered (e.g., in part) by a system or machine including one or more microprocessors and memory. In some embodiments, portions are selected and/or filtered (e.g., in part) by a non-transitory computer-readable storage medium having an executable program stored therein, the program instructing the microprocessor to perform the selection and/or filtering.
[0109] In some embodiments, sequence reads derived from a sample are mapped to all or most of a reference genome, and then a subset of pre-selected regions is chosen. For example, a subset of regions may be selected in which reads from fragments below a certain length threshold preferentially map. Specific methods for pre-selecting a subset of regions are described in U.S. Patent Application Publication 2014 / 0180594, which is incorporated herein by reference in its entirety for all purposes. Reads from the selected subset of regions are often used in further steps, for example, to determine the presence or absence of gene mutations or alterations. Often, reads from unselected regions are not used in further steps to determine the presence or absence of gene mutations or alterations (for example, reads from unselected regions are removed or filtered).
[0110] In some embodiments, portions associated with read densities (e.g., where a read density is for a portion) are removed by a filtering process, and the read densities associated with the removed portions are not included in the determination of the presence or absence of copy number changes (e.g., chromosomal aneuploidy, microduplication, microdeletion). In some embodiments, the read density profile includes and/or consists of the read densities of the filtered portions. The portions are optionally filtered according to the distribution of counts and/or read densities. In some embodiments, the portions are filtered according to the distribution of counts and/or read densities, and the counts and/or read densities are obtained from one or more reference samples. One or more reference samples may be referred to herein as a training set. In some embodiments, the portions are filtered according to the distribution of counts and/or read densities, and the counts and/or read densities are obtained from one or more test samples. In some embodiments, the portions are filtered according to a measure of uncertainty in the read density distribution. In certain embodiments, portions demonstrating large deviations in read density are removed by a filtering process. For example, a distribution of read densities (e.g., a distribution of mean, average, or median read densities) can be determined, where each read density within the distribution is mapped to the same portion. A measure of uncertainty (e.g., MAD) can be determined by comparing the read density distributions of multiple samples, where each portion of the genome is associated with a measure of uncertainty. According to the above example, portions can be filtered according to the measure of uncertainty (e.g., standard deviation (SD), MAD) associated with each portion and a predetermined threshold. In certain cases, portions with MAD values within an acceptable range are retained, and portions with MAD values outside the acceptable range are excluded from consideration by the filtering process. In some embodiments, according to the above example, portions with read density values (e.g., median, mean, or average read density) outside a predetermined measure of uncertainty are often excluded from consideration by the filtering process. In some embodiments, portions with read density values (e.g., median, mean, or average read density) outside the interquartile range of the distribution are excluded from consideration by the filtering process. In some embodiments, portions with read density values that fall outside the distribution by more than two, three, four, or five times the interquartile range are excluded from consideration by the filtering process. In some embodiments, portions with read density values that deviate by more than 2 sigma, 3 sigma, 4 sigma, 5 sigma, 6 sigma, 7 sigma, or 8 sigma (e.g., where sigma is a range defined by the standard deviation) are excluded from consideration by the filtering process.
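One way to read the interquartile-range criterion is as a fence: a portion is excluded when its read density lies more than a chosen multiple of the interquartile range below the first quartile or above the third quartile. The Python sketch below implements that interpretation, which is an assumption made here, not a definition taken from this disclosure. The multiplier of 3, the portion names, and the density values are hypothetical.

```python
from statistics import quantiles
from typing import Dict


def iqr_filter(density_by_portion: Dict[str, float], k: float = 3.0) -> Dict[str, float]:
    """Retain portions whose read density lies within k * IQR of the quartiles."""
    q1, _, q3 = quantiles(density_by_portion.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {p: d for p, d in density_by_portion.items() if low <= d <= high}


# Hypothetical median read densities per portion; bin_g is an outlier.
density_by_portion = {
    "bin_a": 0.90, "bin_b": 0.95, "bin_c": 1.00, "bin_d": 1.00,
    "bin_e": 1.05, "bin_f": 1.10, "bin_g": 5.00,
}
retained = iqr_filter(density_by_portion, k=3.0)
```

A sigma-based variant would replace the quartile fence with the mean plus or minus a multiple of the standard deviation.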
[0111] In block 125, sequence read data is processed using a data processing subsystem to obtain processed sequence read data. Processing of sequence read data includes quantification, level or class assignment, data normalization, or any combination thereof.
Quantification of sequence reads
[0112] In some embodiments, sequence reads mapped or partitioned based on selected features or variables can be quantified to determine the amount or number of reads mapped to one or more parts (e.g., parts of a reference genome). In certain embodiments, the amount of sequence reads mapped to a part or segment is referred to as the count or read density. The count is often associated with a genomic part. In some embodiments, the count is determined from some or all of the sequence reads mapped to a part (i.e., associated with a part). In certain embodiments, the count is determined from some or all of the sequence reads mapped to a subgroup (e.g., a segment or part within a region (as described herein)).
[0113] The count can be determined by a suitable method, calculation, or mathematical process. The count is, in some cases, the direct sum of all sequence reads mapped to a genomic portion or subgroup corresponding to a segment, a subgroup corresponding to a subregion of the genome (e.g., copy number variation region, copy number alteration region, copy number duplication region, copy number deletion region, microduplication region, microdeletion region, chromosomal region, autosomal region, sex chromosome region, or other chromosomal rearrangement), and/or, in some cases, a subgroup corresponding to the genome. Read quantification is, in some cases, a ratio, for example the ratio of the quantification for portions in region a to the quantification for portions in region b. Region a is, in some cases, a single portion, a segment region, a copy number variation region, a copy number alteration region, a copy number duplication region, a copy number deletion region, a microduplication region, a microdeletion region, a chromosomal region, an autosomal region, and/or a sex chromosome region. Region b is, independently, in some cases a single portion, a segment region, a copy number variation region, a copy number alteration region, a copy number duplication region, a copy number deletion region, a microduplication region, a microdeletion region, a chromosomal region, an autosomal region, a sex chromosome region, a region containing all autosomes, a region containing sex chromosomes, and/or a region containing all chromosomes.
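The ratio form of read quantification can be illustrated with a short Python sketch that sums counts over the portions of region a and divides by the sum over region b. The portion names and count values are hypothetical, and region b is taken here to be all portions in the toy dataset.

```python
from typing import Dict, Iterable


def count_ratio(counts: Dict[str, int],
                region_a: Iterable[str],
                region_b: Iterable[str]) -> float:
    """Ratio of summed counts in region a to summed counts in region b."""
    total_a = sum(counts[p] for p in region_a)
    total_b = sum(counts[p] for p in region_b)
    if total_b == 0:
        raise ValueError("region b has no mapped reads")
    return total_a / total_b


# Hypothetical per-portion counts.
counts = {"chr21_bin0": 30, "chr21_bin1": 20, "chr1_bin0": 100, "chr1_bin1": 50}
chr21 = ["chr21_bin0", "chr21_bin1"]
all_portions = list(counts)
ratio = count_ratio(counts, chr21, all_portions)
```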
[0114] In some embodiments, the count is derived from raw and / or filtered sequence reads. In certain embodiments, the count is determined by a mathematical process. In certain embodiments, the count is the mean, average, or sum of sequence reads mapped to a genomic portion or subgroup (e.g., a genomic portion within a region). In some embodiments, the count is associated with an uncertainty value. The count may be adjusted in some cases. The count may be adjusted according to weighted, removed, filtered, normalized, adjusted, averaged, derived as mean, derived as median, added, or a combination thereof of sequence reads associated with a genomic portion or subgroup.
[0115] Sequence read quantification is, in some cases, read density. Read density can be determined and/or generated for one or more segments of a genome. In certain cases, read density can be determined and/or generated for one or more chromosomes. In some embodiments, read density includes a quantitative measure of the count of sequence reads mapped to a segment or portion of a reference genome. Read density can be determined by a suitable process. In some embodiments, read density is determined by a suitable distribution and/or a suitable distribution function. Non-limiting examples of distribution functions include any suitable distribution or combination thereof, such as a probability function, probability distribution function, probability density function (PDF), kernel density function (kernel density estimate), cumulative distribution function, probability mass function, discrete probability distribution, or absolutely continuous univariate distribution. Read density may be a density estimate derived from a suitable probability density function. Density estimation is the construction of an estimate of the underlying probability density function based on observed data. In some embodiments, read density includes density estimates (e.g., probability density estimate, kernel density estimate). Read density may be generated according to a process that includes generating density estimates for each of one or more parts of the genome, each part containing a count of sequence reads. Read density may be generated for normalized and/or weighted counts mapped to parts or segments. In some cases, each read mapped to a part or segment contributes to the read density a value (e.g., a count) equal to its weight obtained from the normalization process described herein. In some embodiments, the read density of one or more parts or segments is adjusted. Read density can be adjusted by suitable methods. For example, the read density of one or more parts can be weighted and/or normalized.
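As a minimal sketch of the weighted read density described above, the following assumes that reads are summarized by their mapped start positions and that each read carries a weight from a prior normalization step. The function names, the Gaussian kernel, the part size, and the bandwidth are illustrative assumptions and are not taken from this disclosure.

```python
import math

def weighted_counts(positions, weights, part_size, n_parts):
    """Sum per-read weights into fixed-size genomic parts."""
    counts = [0.0] * n_parts
    for pos, w in zip(positions, weights):
        idx = pos // part_size
        if 0 <= idx < n_parts:
            counts[idx] += w  # each read contributes its weight, not 1
    return counts

def kernel_density(positions, weights, grid, bandwidth):
    """Weighted Gaussian kernel density estimate evaluated at each grid point."""
    norm = sum(weights) * bandwidth * math.sqrt(2.0 * math.pi)
    density = []
    for x in grid:
        s = sum(w * math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                for p, w in zip(positions, weights))
        density.append(s / norm)
    return density

# Hypothetical reads: mapped positions and normalization-derived weights.
positions = [120, 180, 950, 1010, 1040, 2600]
weights = [1.0, 0.8, 1.2, 1.0, 0.9, 1.1]

counts = weighted_counts(positions, weights, part_size=1000, n_parts=3)
density = kernel_density(positions, weights, grid=[500, 1000, 2500], bandwidth=200.0)
```

Here `counts` is the per-part weighted count, and `density` is a smooth estimate that is highest near position 1000, where the hypothetical reads cluster.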
[0116] Reads quantified for a given portion or segment may come from a single source or from different sources. In one example, reads may be obtained from nucleic acids from a subject who has or is suspected of having cancer. In such a situation, reads mapped to one or more portions often represent both healthy cells (i.e., non-cancerous cells) and cancer cells (e.g., tumor cells). In a particular embodiment, some of the reads mapped to a portion are derived from cancer cell nucleic acids, and some of the reads mapped to the same portion are derived from non-cancerous cell nucleic acids. In another example, reads may be obtained from a nucleic acid sample from a pregnant woman carrying a fetus. In such a situation, reads mapped to one or more portions often represent both the fetus and the mother of the fetus (e.g., a pregnant female subject). In a particular embodiment, some of the reads mapped to a portion are derived from the fetal genome, and some of the reads mapped to the same portion are derived from the maternal genome.
Level
[0117] In some embodiments, a value (e.g., a numerical value, a quantitative value) is assigned to a level. The level can be determined by a suitable method, operation, or mathematical process (e.g., a processed level). Often, the level is a count (e.g., a normalized count) for a set of parts, or is derived therefrom. In some embodiments, the level for a part is substantially equal to the total number of counts mapped to the part (e.g., counts, normalized counts). Often, the level is determined from counts that are processed, transformed, or manipulated by a suitable method, operation, or mathematical process known in the art. In some embodiments, the level is derived from processed counts, and non-limiting examples of processed counts include weighted, removed, filtered, normalized, adjusted, averaged, mean-derived (e.g., mean level), added, subtracted, or transformed counts, or a combination thereof. In some embodiments, the level includes a normalized count (e.g., a normalized count of a part). The level can be for a count normalized by a suitable process, non-limiting examples of which are described herein. The level may include a normalized count or a relative quantity of the count. In some embodiments, the level is for counts or normalized counts of two or more parts that are averaged, and the level is referred to as the average level. In some embodiments, the level is for a set of parts having a mean count or a mean of normalized counts, and is referred to as the mean level. In some embodiments, the level is derived for parts that include raw and/or filtered counts. In some embodiments, the level is based on the raw count. In some embodiments, the level is associated with an uncertainty value (e.g., standard deviation, MAD). In some embodiments, the level is represented by a Z-score or p-value.
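A mean level with an associated uncertainty value can be sketched as follows. The function names and the example counts are hypothetical, and MAD is taken here to be the median absolute deviation, which is an assumption about the abbreviation rather than something this paragraph states.

```python
import statistics

def median_absolute_deviation(values):
    """MAD, interpreted here (as an assumption) as the median absolute deviation."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def level_with_uncertainty(normalized_counts):
    """Return (mean level, standard deviation, MAD) for a set of parts."""
    mean_level = statistics.fmean(normalized_counts)
    sd = statistics.stdev(normalized_counts)
    mad = median_absolute_deviation(normalized_counts)
    return mean_level, sd, mad

# Hypothetical normalized counts for four adjacent parts.
part_counts = [1.02, 0.98, 1.05, 1.01]
mean_level, sd, mad = level_with_uncertainty(part_counts)
```

Either `sd` or `mad` could serve as the uncertainty value attached to `mean_level`; the MAD is less sensitive to a single outlying part.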
[0118] A level for one or more parts is synonymous with “genome partitioning level” as used herein. As used herein, the term “level” is sometimes synonymous with the term “elevation.” The meaning of the term “level” can be determined from the context in which it is used. For example, when the term “level” is used in the context of parts, profiles, reads, and / or counts, it often means elevation. When the term “level” is used in the context of substances or compositions (e.g., RNA level, plexing level), it often refers to a quantity. When the term “level” is used in the context of uncertainty (e.g., error level, confidence level, deviation level, uncertainty level), it often refers to a quantity.
[0119] Normalized or unnormalized counts at two or more levels (e.g., two or more levels in a profile) can be mathematically manipulated according to the levels (e.g., addition, multiplication, averaging, normalization, etc., or a combination thereof). For example, normalized or unnormalized counts at two or more levels can be normalized according to one, some, or all of the levels in the profile. In some embodiments, normalized or unnormalized counts at all levels in a profile are normalized according to one level in the profile. In some embodiments, normalized or unnormalized counts at a first level in a profile are normalized according to normalized or unnormalized counts at a second level in the profile.
[0120] Non-limiting examples of levels (e.g., first level, second level) include levels for a set of parts containing processed counts, levels for a set of parts containing the mean, median, or average of the counts, levels for a set of parts containing normalized counts, or any combination thereof. In some embodiments, the first and second levels in the profile derive from counts of parts mapped to the same chromosome. In some embodiments, the first and second levels in the profile derive from counts of parts mapped to different chromosomes.
[0121] In some embodiments, a level is determined from normalized or unnormalized counts mapped to one or more parts. In some embodiments, a level is determined from normalized or unnormalized counts mapped to two or more parts, where the normalized counts for each part are often nearly identical. Variation can exist in the counts (e.g., normalized counts) within a set of parts for a level. Within a set of parts for a level, there may be one or more parts that have counts significantly different from the other parts of the set (e.g., peaks and / or dips). Any number of normalized or unnormalized counts associated with any number of parts can define a level.
[0122] In some embodiments, one or more levels can be determined from normalized or unnormalized counts of all or some of the parts of a genome. Often, levels can be determined from all or some of the normalized or unnormalized counts of a chromosome. In some embodiments, two or more counts derived from two or more parts (e.g., a set of parts) determine the level. In some embodiments, two or more counts (e.g., counts from two or more parts) determine the level. In some embodiments, counts from 2 to about 100,000 parts determine the level. In some embodiments, counts from 2 to about 50,000, 2 to about 40,000, 2 to about 30,000, 2 to about 20,000, 2 to about 10,000, 2 to about 5,000, 2 to about 2,500, 2 to about 1,250, 2 to about 1,000, 2 to about 500, 2 to about 250, 2 to about 100, or 2 to about 60 parts determine the level. In some embodiments, counts from about 10 to about 50 parts determine the level. In some embodiments, counts from about 20 to about 40 or more parts determine the level. In some embodiments, the level includes counts from about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 45, 50, 55, 60 or more parts. In some embodiments, the level corresponds to a set of parts (e.g., a set of parts of a reference genome, a set of parts of a chromosome, or a set of parts of a segment of a chromosome).
[0123] In some embodiments, the level is determined from the normalized or unnormalized count of adjacent parts. In some embodiments, adjacent parts (e.g., a set of parts) represent adjacent regions of a genome, or adjacent regions of a chromosome or gene. For example, two or more adjacent parts can represent a sequence assembly of DNA sequences longer than each individual part when aligned by merging the parts from end to end. For example, two or more adjacent parts can represent an intact genome, chromosome, gene, intron, exon, or a portion thereof. In some embodiments, the level is determined from a set of adjacent and / or non-adjacent parts.
Data processing and normalization
[0124] Because they represent unmanipulated counts (e.g., raw counts), counted mapped sequence reads are referred to herein as raw data. In some embodiments, sequence read data within a dataset can be further processed (e.g., mathematically and/or statistically manipulated) and/or displayed to facilitate the provision of results. In certain embodiments, datasets, including larger datasets, may benefit from preprocessing to facilitate further analysis. Preprocessing of a dataset may, in some cases, involve the removal of redundant and/or non-informational parts of the reference genome (e.g., parts of the reference genome with non-informational data, redundantly mapped reads, parts with a median count of zero, or over-represented or under-represented sequences). Without being limited by theory, data processing and/or preprocessing may (i) remove noisy data, (ii) remove non-informational data, (iii) remove redundant data, (iv) reduce the complexity of larger datasets, and/or (v) facilitate the conversion of data from one form to one or more other forms. Where used in reference to data or datasets, the terms “preprocessing” and “processing” are collectively referred to herein as “processing.” The processing can make the data more suitable for further analysis and, in some embodiments, can generate results. In some embodiments, one or more processing methods (e.g., normalization methods, filtering of parts, mapping, validation, etc., or combinations thereof) are performed by a processor, a microprocessor, or a computer, in conjunction with memory, and/or by a microprocessor control unit.
[0125] As used herein, the term “noisy data” means (a) data that have significant variance between data points when analyzed or plotted, (b) data that have significant standard deviations (e.g., more than three standard deviations), (c) data that have significant standard errors of the mean, or any combination of the foregoing. Noisy data may, in some cases, arise from the quantity and/or quality of the starting material (e.g., nucleic acid sample) and may occur as part of the process for preparing or replicating the DNA used to generate sequence reads. In certain embodiments, noise may result from the overrepresentation of certain sequences when prepared using PCR-based methods. The methods described herein can reduce or eliminate the contribution of noisy data and thus reduce the impact of noisy data on the results provided.
[0126] As used herein, the terms “non-informational data,” “non-informational portion of the reference genome,” and “non-informational portion” refer to portions of data, or data derived therefrom, that have numerical values that are significantly different from a given threshold or fall outside a given cutoff range of values. The terms “threshold” and “threshold value” as used herein refer to any number calculated using a qualifying dataset that serves as a limit for diagnosing a gene variant or gene change (e.g., copy number variation, aneuploidy, microduplication, microdeletion, chromosomal abnormality, etc.). In certain embodiments, the threshold is exceeded by results obtained by the methods described herein, and the subject is diagnosed with a copy number variation. Thresholds or ranges of values are often calculated by mathematically and/or statistically manipulating sequence read data (e.g., from the reference and/or subject). In some embodiments, an uncertainty value is determined. The uncertainty value is generally a measure of variance or error, and can be any suitable measure of variance or error. In some embodiments, the uncertainty value is the standard deviation, standard error, calculated variance, p-value, or mean absolute deviation (MAD). In some embodiments, the uncertainty value can be calculated according to the formulas described herein.
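The removal of non-informational parts can be sketched as a simple cutoff filter. The example below is an assumption-laden illustration: the function names, the use of the median count across reference samples as the tested value, and the cutoff values are all hypothetical and are not specified by this disclosure.

```python
import statistics

def informative_parts(count_matrix, low_cutoff, high_cutoff):
    """Indices of parts whose median count across samples is nonzero
    and lies inside the cutoff range [low_cutoff, high_cutoff]."""
    n_parts = len(count_matrix[0])
    keep = []
    for j in range(n_parts):
        med = statistics.median(row[j] for row in count_matrix)
        if med > 0 and low_cutoff <= med <= high_cutoff:
            keep.append(j)
    return keep

def apply_filter(counts, keep):
    """Keep only the counts for the retained part indices."""
    return [counts[j] for j in keep]

# Hypothetical raw counts: rows are reference samples, columns are parts.
reference_counts = [
    [100, 0, 98, 900, 105],
    [95, 0, 102, 870, 99],
    [103, 1, 97, 910, 101],
]
keep = informative_parts(reference_counts, low_cutoff=50, high_cutoff=200)
filtered = apply_filter([110, 0, 95, 880, 104], keep)
```

In this toy input, the second part has a median count of zero and the fourth is heavily over-represented, so both are dropped from the test sample before further processing.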
[0127] Any preferred procedure can be used to process the datasets described herein. Non-limiting examples of preferred procedures for processing the datasets include filtering, normalization, weighting, monitoring of peak height, monitoring of peak area, monitoring of peak edges, peak level analysis, peak width analysis, peak edge location analysis, peak lateral tolerance, determination of area ratios, mathematical processing of the data, statistical processing of the data, application of statistical algorithms, analysis by fixed variables, analysis by optimized variables, plotting of data to identify patterns or trends for further processing, and combinations thereof. In some embodiments, the dataset is processed based on various features (e.g., GC content, redundantly mapped reads, centromere regions, telomere regions, etc., and combinations thereof), and / or variables (e.g., sex of the subject, age of the subject, ploidy of the subject, contribution rate of cancer cell nucleic acids, sex of the fetus, maternal age, maternal ploidy, contribution rate of fetal nucleic acids, etc., or combinations thereof). In certain embodiments, processing the dataset as described herein can reduce the complexity and / or dimensionality of large and / or complex datasets. A non-limiting example of a complex dataset includes sequence read data generated from one or more test subjects and multiple reference subjects of different ages and ethnic backgrounds. In some embodiments, the dataset may contain thousands to millions, or even millions to billions, of sequence reads per test and / or reference subject.
[0128] In certain embodiments, data processing can be carried out in any number of steps. For example, in some embodiments, data may be processed using only a single processing procedure, and in certain embodiments, data may be processed using one or more, five or more, ten or more, or twenty or more processing steps (e.g., one or more processing steps, two or more processing steps, three or more processing steps, four or more processing steps, five or more processing steps, six or more processing steps, seven or more processing steps, eight or more processing steps, nine or more processing steps, ten or more processing steps, eleven or more processing steps, twelve or more processing steps, thirteen or more processing steps, fourteen or more processing steps, fifteen or more processing steps, sixteen or more processing steps, seventeen or more processing steps, eighteen or more processing steps, nineteen or more processing steps, or twenty or more processing steps). In some embodiments, the processing step may be the same step repeated two or more times (e.g., two or more filters, two or more normalizations), and in certain embodiments, the processing step may be two or more different processing steps performed simultaneously or sequentially (e.g., filtering, normalization, normalization, monitoring of peak height and edges, filtering, normalization, normalization to reference, statistical operations for determining p-values, etc.). In some embodiments, any preferred number and / or combination of the same or different processing steps can be used to facilitate processing of sequence read data and providing results. In certain embodiments, processing a dataset according to the criteria described herein may reduce the complexity and / or dimensionality of the dataset.
[0129] In some embodiments, one or more processing steps may include one or more normalization steps. Normalization may be carried out by suitable methods described herein or known in the art. In certain embodiments, normalization includes adjusting values measured on different scales to a conceptually common scale. In certain embodiments, normalization includes sophisticated mathematical adjustments to align the probability distributions of the adjusted values. In some embodiments, normalization includes aligning the distributions to a normal distribution. In certain embodiments, normalization includes mathematical adjustments that enable comparison of corresponding normalized values for different datasets in a manner that eliminates the effects of certain gross influences (e.g., errors and anomalies). In certain embodiments, normalization includes scaling. Normalization may optionally include dividing one or more datasets by a given variable or expression. Normalization may optionally include subtracting a given variable or expression from one or more datasets. Non-limiting examples of normalization methods include partwise normalization, GC content normalization, median count (median bin count, median part count) normalization, linear and nonlinear least squares regression, locally estimated scatter plot smoothing (LOESS), GC LOESS, LOWESS (locally weighted scatter plot smoothing), principal component normalization, repeat masking (RM), GC normalization and repeat masking (GCRM), conditional quantile normalization (cQn), and/or combinations thereof. In some embodiments, the determination of the presence or absence of copy number variations (e.g., aneuploidy, microduplications, microdeletions) is made using normalization methods (e.g., partwise normalization, GC content normalization, median count (median bin count, median part count) normalization, linear and nonlinear least squares regression, LOESS, GC LOESS, LOWESS, principal component normalization, RM, GCRM, cQn, normalization methods known in the art, and/or combinations thereof). Specific examples of available normalization processes, such as LOESS normalization, principal component normalization, and hybrid normalization methods, are described in more detail below. Specific aspects of the normalization process are described, for example, in International Patent Application Publication No. 2013/052913 and International Patent Application Publication No. 2015/051163, each of which is incorporated herein by reference in its entirety for all purposes.
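Two of the listed methods, median count normalization and a least squares correction for GC content, can be sketched together. This is a minimal illustration under stated assumptions: a straight-line fit stands in for the LOESS-type smoothers named above, and the function names and example values are hypothetical.

```python
import statistics

def median_normalize(counts):
    """Scale counts so that the median part count becomes 1.0."""
    med = statistics.median(counts)
    return [c / med for c in counts]

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def gc_correct(counts, gc):
    """Divide out the fitted linear GC trend while keeping the overall scale."""
    slope, intercept = fit_line(gc, counts)
    center = statistics.median(counts)
    return [c * center / (slope * g + intercept) for c, g in zip(counts, gc)]

# Hypothetical parts whose raw counts rise linearly with GC fraction.
gc_fraction = [0.35, 0.40, 0.45, 0.50, 0.55]
raw_counts = [80, 90, 100, 110, 120]

scaled = median_normalize(raw_counts)
corrected = gc_correct(scaled, gc_fraction)
```

Because the toy counts depend on GC content alone, the corrected profile is flat at 1.0; in real data, residual deviations after correction are what downstream steps would evaluate.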
[0130] Any suitable number of normalizations can be used. In some embodiments, a dataset can be normalized one or more times, five or more times, ten or more times, or even twenty or more times. A dataset can be normalized to a value (e.g., a normalized value) that represents any suitable feature or variable (e.g., sample data, reference data, or both). Non-limiting examples of types of data normalization that can be used include: normalizing raw count data for one or more selected test or reference parts to the total number of counts mapped to the entire chromosome or genome to which the selected parts or segments are mapped; normalizing raw count data for one or more selected parts to the median of the reference counts for the chromosome to which the selected parts are mapped; normalizing raw count data to previously normalized data or its derivatives; and normalizing previously normalized data to one or more other predetermined normalization variables. Normalizing a dataset may, in some cases, have the effect of isolating statistical error, depending on the feature or characteristic selected as the predetermined normalization variable. Normalizing a dataset also, in some cases, allows data with different scales to be compared by bringing the data to a common scale (e.g., a given normalization variable). In some embodiments, one or more normalizations to statistically derived values can be used to minimize data differences and reduce the importance of outliers. Normalizing a part, or a part of a reference genome, with respect to a normalized value is sometimes referred to as “partwise normalization.”
[0131] In certain embodiments, the processing steps may include one or more mathematical and / or statistical operations. Any suitable mathematical and / or statistical operation may be used alone or in combination to analyze and / or manipulate the datasets described herein. Any number of suitable mathematical and / or statistical operations may be used. In some embodiments, the dataset may be mathematically and / or statistically manipulated one or more, five or more, ten or more, or twenty or more times. Non-limiting examples of mathematical and statistical operations that may be used include addition, subtraction, multiplication, division, algebraic functions, least squares estimators, curve fitting, differential equations, rational polynomials, bipolynomials, orthogonal polynomials, z-scores, p-values, chi-values, phi-values, peak level analysis, peak edge location determination, peak area ratio calculation, chromosome-level median analysis, mean absolute deviation calculation, sum of squared residuals, mean, standard deviation, standard error, or combinations thereof. Mathematical and / or statistical operations may be performed on all or part of the sequence read data or on the processing products thereof. Non-limiting examples of dataset variables or features that can be statistically manipulated include raw counts, filtered counts, normalized counts, peak height, peak width, peak area, peak edges, lateral tolerance, p-value, median level, mean level, count distribution within a genomic region, relative representation of nucleic acid species, or combinations thereof.
[0132] In some embodiments, the processing steps may include the use of one or more statistical algorithms. Any suitable statistical algorithm may be used alone or in combination to analyze and / or manipulate the datasets described herein. Any appropriate number of statistical algorithms may be used. In some embodiments, the dataset may be analyzed using one or more, five or more, ten or more, or twenty or more statistical algorithms. Non-exclusive examples of statistical algorithms suitable for use in the methods described herein include principal component analysis, decision trees, counter nulls, multiple comparisons, omnibus tests, the Behrens-Fisher problem, bootstrap, Fisher's method for combining independent significance tests, null hypotheses, Type I errors, Type II errors, exact probability tests, one-sample Z-tests, two-sample Z-tests, one-sample t-tests, paired t-tests, two-sample pooled t-tests with equal variances, two-sample unpooled t-tests with unequal variances, one-proportion Z-tests, pooled two-proportion Z-tests, unpooled two-proportion Z-tests, one-sample chi-squared tests, two-sample F-tests for equal variances, confidence intervals, credible intervals, significance, meta-analysis, simple linear regression, robust linear regression, or combinations thereof. Non-limiting examples of dataset variables or features that can be analyzed using statistical algorithms include raw counts, filtered counts, normalized counts, peak height, peak width, peak edges, lateral tolerance, p-values, median levels, mean levels, count distribution within a genomic region, relative representation of nucleic acid species, or combinations thereof.
[0133] In certain embodiments, a dataset can be analyzed by utilizing multiple (e.g., two or more) statistical algorithms (e.g., least squares regression, principal component analysis, linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, logistic regression, and/or smoothing), and/or mathematical and/or statistical operations (e.g., referred to herein as operations). The use of multiple operations can, in some embodiments, generate an N-dimensional space that can be used to provide results. In certain embodiments, analysis of a dataset by utilizing multiple operations can reduce the complexity and/or dimensionality of the dataset. For example, the use of multiple operations on a reference dataset can generate an N-dimensional space (e.g., a probability plot) that can be used to represent the presence or absence of gene mutations/genetic changes and/or copy number changes, depending on the state of the reference sample (e.g., positive or negative for a selected copy number change). Analysis of test samples using a substantially similar set of operations can be used to generate an N-dimensional point for each test sample. The complexity and/or dimensionality of the dataset under test is, in some cases, reduced to a single value or N-dimensional point that can be easily compared to the N-dimensional space generated from the reference data. Test sample data that fall within the N-dimensional space populated by the reference data indicate a genetic state substantially similar to that of the reference. Test sample data that fall outside the N-dimensional space populated by the reference data indicate a genetic state substantially different from that of the reference. In some embodiments, the reference is euploid or otherwise lacks gene mutations/genetic alterations and/or copy number changes and/or disease states.
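The comparison of a test point against a reference-populated N-dimensional space can be sketched in a deliberately simplified form. In the example below, each sample is assumed to have already been reduced to N numbers by N operations, and the reference space is approximated by an axis-aligned box of mean plus or minus k standard deviations per dimension. That box, the value of k, and all names are illustrative assumptions; the disclosure does not prescribe this construction.

```python
import statistics

def reference_region(reference_points, k=3.0):
    """Per-dimension (low, high) bounds: mean +/- k standard deviations."""
    bounds = []
    for dim in zip(*reference_points):
        mu = statistics.fmean(dim)
        sd = statistics.stdev(dim)
        bounds.append((mu - k * sd, mu + k * sd))
    return bounds

def inside(point, bounds):
    """True if the point lies within the bounds on every dimension."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, bounds))

# Hypothetical reference samples, each reduced to a 2-dimensional point
# (for example, a Z-score and a mean level).
reference_points = [
    (0.1, 1.00), (-0.2, 0.98), (0.0, 1.01), (0.3, 1.02), (-0.1, 0.99),
]
bounds = reference_region(reference_points)

similar = inside((0.2, 1.00), bounds)     # falls within the reference space
different = inside((4.5, 1.08), bounds)   # falls outside the reference space
```

A point inside the region would be read as substantially similar to the reference state, and a point outside it as substantially different. A covariance-aware distance or a trained classifier could replace the box without changing the overall idea.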
[0134] In some embodiments, after a dataset is counted, optionally filtered, normalized, and optionally weighted, the processed dataset can be further manipulated by one or more filtering and / or normalization and / or weighting steps. In certain embodiments, the dataset further manipulated by one or more filtering and / or normalization and / or weighting steps can be used to generate a profile. In some embodiments, one or more filtering and / or normalization and / or weighting steps can, in some cases, reduce the complexity and / or dimensionality of the dataset. Results can be provided based on the dataset with reduced complexity and / or dimensionality. In some embodiments, for example, a profile plot of the processed data further manipulated by weighting is generated to facilitate classification and / or the provision of results. Results can be provided, for example, based on a profile plot of the weighted data.
[0135] Filtering or weighting of parts can be performed at one or more suitable points in the analysis. For example, parts may be filtered or weighted before or after sequence reads are mapped to parts of the reference genome. In some embodiments, parts may be filtered or weighted before or after experimental bias for individual genome parts is determined. In certain embodiments, parts may be filtered or weighted before or after levels are calculated.
[0136] In some embodiments, after a dataset is counted, optionally filtered, normalized, and optionally weighted, the processed dataset can be manipulated by one or more mathematical and/or statistical operations (e.g., statistical functions or statistical algorithms). In certain embodiments, the processed dataset can be further manipulated by calculating Z-scores for one or more selected parts, chromosomes, or parts of chromosomes. In some embodiments, the processed dataset can be further manipulated by calculating p-values. In certain embodiments, the mathematical and/or statistical operations include one or more assumptions about ploidy and/or minority species fractions (e.g., cancer cell nucleic acid fraction, fetal fraction). In some embodiments, a profile plot of the processed data, further manipulated by one or more statistical and/or mathematical operations, is generated to facilitate classification and/or providing results. Results can be provided based on the profile plot of the statistically and/or mathematically manipulated data. Results provided based on the profile plot of the statistically and/or mathematically manipulated data often include one or more assumptions about ploidy and/or minority species fractions (e.g., cancer cell nucleic acid fraction, fetal fraction).
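A Z-score and p-value calculation for a selected chromosome can be sketched as follows. The sketch assumes that the tested quantity is the fraction of all counts that map to the chromosome and that reference fractions are approximately normally distributed. The function names and all numeric values are hypothetical.

```python
import math
import statistics

def chromosome_fraction(part_counts, chrom_of_part, chrom):
    """Fraction of all counts that map to parts on the given chromosome."""
    total = sum(part_counts)
    on_chrom = sum(c for c, ch in zip(part_counts, chrom_of_part) if ch == chrom)
    return on_chrom / total

def z_score(test_value, reference_values):
    """Standard score of the test value against the reference set."""
    mu = statistics.fmean(reference_values)
    sd = statistics.stdev(reference_values)
    return (test_value - mu) / sd

def two_sided_p(z):
    """Two-sided p-value under a standard normal null (an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical test sample with four parts on three chromosomes.
part_counts = [4900, 4964, 70, 66]
chrom_of_part = ["1", "2", "21", "21"]
test_fraction = chromosome_fraction(part_counts, chrom_of_part, "21")

# Hypothetical chromosome 21 fractions from reference samples.
reference_fractions = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]
z = z_score(test_fraction, reference_fractions)
p = two_sided_p(z)
```

A large positive `z` with a small `p` would flag over-representation of the chosen chromosome relative to the reference set. How such a value is interpreted still depends on the ploidy and minority-fraction assumptions noted above.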
[0137] In some embodiments, data analysis and processing may involve the use of one or more assumptions. A suitable number or type of assumptions may be utilized to analyze or process a dataset. Non-limiting examples of assumptions that may be used for data processing and/or analysis include subject ploidy, cancer cell contribution, maternal ploidy, fetal contribution, prevalence of a particular sequence in a reference population, ethnic background, prevalence of a selected disease in related family members, parallelism between raw count profiles from different patients and/or runs after GC normalization and repeat masking (e.g., GCRM), identical matches representing PCR artifacts (e.g., identical base positions), assumptions specific to nucleic acid quantification assays (e.g., fetal quantification assays (FQA)), assumptions regarding twins (e.g., if there are two fetuses and only one is affected, the effective fetal fraction is only 50% of the total measured fetal fraction, and similarly for triplets, quadruplets, etc.), the assumption that cell-free DNA (e.g., cfDNA) uniformly covers the entire genome, and combinations thereof.
[0138] In cases where the quality and / or depth of the mapped sequence reads does not allow for predicting the presence or absence of gene mutations / gene alterations and / or copy number alterations at a desired confidence level (e.g., a confidence level of 95% or higher), one or more additional mathematical manipulation algorithms and / or statistical prediction algorithms can be used based on the normalized count profile to generate additional numerical values useful for data analysis and / or providing results. As used herein, the term “normalized count profile” refers to a profile generated using a normalized count. Examples of normalized counts and methods that can be used to generate a normalized count profile are described herein. As described above, the counted mapped sequence reads can be normalized with respect to a test sample count or a reference sample count. In some embodiments, the normalized count profile can be presented as a plot.
[0139] The following provides a more detailed description of some non-limiting examples of processing steps and normalization methods that can be used, including normalization to a window (fixed or sliding), weighting, bias relationship determination, LOESS normalization, principal component normalization, hybrid normalization, profile generation, and comparison.
Normalization to windows (fixed or sliding)
[0140] In certain embodiments, the processing step includes normalizing to a fixed window, and in some embodiments, the processing step includes normalizing to a moving or sliding window. As used herein, the term “window” refers to one or more portions selected for analysis, which may be used as a reference for comparison (e.g., for normalization and/or other mathematical or statistical operations). As used herein, the term “normalization to a fixed window” refers to a normalization process that uses one or more portions selected for comparison between the data under test and a reference data set. In some embodiments, the selected portions are used to generate a profile. A fixed window generally includes a predetermined set of portions that do not change during the operation and/or analysis. As used herein, the terms “normalization to a moving window” and “normalization to a sliding window” refer to normalization performed on portions localized to the genomic region of a selected test portion (e.g., immediately surrounding portions, adjacent portions, or segments), where one or more selected test portions are normalized to the portions immediately surrounding them. In certain embodiments, the selected portions are used to generate a profile. Sliding or moving window normalization often involves repeatedly moving or sliding to an adjacent test portion and normalizing the newly selected test portion to the portions immediately surrounding or adjacent to it, where adjacent windows have one or more portions in common. In certain embodiments, multiple selected test portions and/or chromosomes can be analyzed by the sliding window process.
[0141] In some embodiments, normalization to a sliding or moving window can produce one or more values, each representing normalization to a different set of reference parts selected from different regions of the genome (e.g., chromosomes). In certain embodiments, the one or more values produced are cumulative sums (e.g., numerical estimates of the integral of the normalized count profile across selected parts, domains (e.g., parts of chromosomes), or chromosomes). The values produced by the sliding or moving window process can be used to generate profiles and facilitate arriving at results. In some embodiments, the cumulative sums of one or more parts can be displayed as a function of genomic location. Moving or sliding window analysis is sometimes used to analyze the genome for the presence or absence of microdeletions and/or microduplications. In certain embodiments, displaying the cumulative sums of one or more parts is used to identify the presence or absence of regions of copy number variation (e.g., microdeletions, microduplications).
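Sliding window normalization followed by a cumulative sum can be sketched as below. The choice of the median of the surrounding parts as the normalizer, the window half-width, and the function names are assumptions made for illustration. The sketch requires a half-width of at least 1 and at least two parts.

```python
import itertools
import statistics

def sliding_window_normalize(counts, half_width):
    """Divide each part by the median of the parts immediately surrounding it.

    Adjacent windows share parts; the test part itself is excluded from
    its own window. Requires half_width >= 1 and len(counts) >= 2.
    """
    n = len(counts)
    normalized = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        neighbors = counts[lo:i] + counts[i + 1:hi]
        normalized.append(counts[i] / statistics.median(neighbors))
    return normalized

def cumulative_sums(values):
    """Running sum, a numerical estimate of the integral of the profile."""
    return list(itertools.accumulate(values))

# Hypothetical profile: 13 parts with a 3-part region of elevated counts.
counts = [100] * 5 + [150] * 3 + [100] * 5
normalized = sliding_window_normalize(counts, half_width=5)
sums = cumulative_sums(normalized)
```

Plotted against genomic position, `sums` rises with slope 1.0 over the unaffected parts and with slope 1.5 over the elevated region, which is the kind of change in slope that could point to a microduplication. The window must be wider than the event for the surrounding median to stay at the baseline.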
[0142] Weighting In some embodiments, the processing step includes weighting. As used herein, the terms “weighted,” “weighting,” or “weighting function,” or their grammatical derivatives or equivalents, refer to mathematical operations on part or all of a dataset that are used to modify the influence of particular dataset features or variables relative to other dataset features or variables (e.g., increasing or decreasing the significance and / or contribution of data contained in one or more portions of a reference genome based on the quality or usefulness of the data in the selected portions). In some embodiments, a weighting function can be used to increase the influence of data with relatively small measurement variance and / or decrease the influence of data with relatively large measurement variance. For example, a portion of a reference genome with insufficient or low-quality sequence data can be “weighted down” to minimize its influence on the dataset, while a selected portion of the reference genome can be “weighted up” to increase its influence on the dataset. A non-limiting example of a weighting function is [1 / (standard deviation)²]. Weighting portions can, in some cases, eliminate portion dependencies. In some embodiments, one or more portions are weighted by an eigenfunction. In some embodiments, the eigenfunction includes replacing portions with orthogonal eigen-portions. The weighting step is sometimes performed in a manner substantially similar to the normalization step. In some embodiments, the dataset is adjusted (e.g., by dividing, multiplying, adding, or subtracting) by a given variable (e.g., a weighting variable). In some embodiments, the dataset is divided by a given variable (e.g., a weighting variable). The given variable (e.g., a minimized target function, Phi) is often selected to weight different parts of the dataset differently (e.g., increasing the influence of a particular data type while decreasing the influence of other data types).
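By way of a hedged example, the weighting function [1 / (standard deviation)²] could be applied per portion as sketched below. The function names, the use of the population standard deviation across reference samples, and the choice to give zero weight to a portion with zero variance are assumptions of this sketch only.

```python
from statistics import pstdev

def inverse_variance_weights(counts_per_portion):
    """One weight per portion: 1 / (standard deviation)^2 across reference samples.

    Assumption: a portion whose counts show no variance receives weight 0.0
    instead of an infinite weight.
    """
    weights = []
    for portion_counts in counts_per_portion:
        sd = pstdev(portion_counts)
        weights.append(1.0 / sd ** 2 if sd > 0 else 0.0)
    return weights

def weighted_mean(values, weights):
    """Mean in which low-variance portions contribute more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical counts: two portions, each measured in two reference samples.
reference_counts = [[9, 11], [5, 15]]
weights = inverse_variance_weights(reference_counts)   # stable portion is weighted up
level = weighted_mean([10.0, 20.0], weights)
```

In this sketch the first portion (standard deviation 1) dominates the weighted mean, while the noisier second portion (standard deviation 5) is weighted down.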
[0143] Bias relationship In some embodiments, the processing step includes determining bias relationships. For example, one or more relationships may be generated between local genome bias estimates and bias frequencies. As used herein, the term “relationship” refers to a mathematical and / or schematic relationship between two or more variables or values. Relationships can be generated by a preferred mathematical and / or schematic process. Non-limiting examples of relationships include mathematical and / or schematic representations of functions, correlations, distributions, linear or nonlinear equations, lines, regressions, fitted regressions, or combinations thereof. In some cases, a relationship includes a fitted relationship. In some embodiments, a fitted relationship includes a fitted regression. In some cases, a relationship includes two or more weighted variables or values. In some embodiments, a relationship includes a fitted regression in which the values of one or more variables or relationships are weighted. In some cases, the regression is fitted in a weighted manner. In some cases, the regression is fitted unweighted. In certain embodiments, generating a relationship includes plotting or graphing.
[0144] In certain embodiments, a relationship is generated between GC density and GC density frequency. In some embodiments, generating a relationship between (i) GC density and (ii) GC density frequency for a sample provides a sample GC density relationship. In some embodiments, generating a relationship between (i) GC density and (ii) GC density frequency for a reference provides a reference GC density relationship. In some embodiments, the local genome bias estimate is the GC density, the sample bias relationship is the sample GC density relationship, and the reference bias relationship is the reference GC density relationship. The GC density in the reference GC density relationship and / or sample GC density relationship is often a representation of the local GC content (e.g., mathematical or quantitative).
[0145] In some embodiments, the relationship between local genome bias estimates and bias frequencies includes a distribution. In some embodiments, the relationship between local genome bias estimates and bias frequencies includes a fitted relationship (e.g., fitted regression). In some embodiments, the relationship between local genome bias estimates and bias frequencies includes a fitted linear or nonlinear regression (e.g., polynomial regression). In certain embodiments, the relationship between local genome bias estimates and bias frequencies includes a weighted relationship in which the local genome bias estimates and / or bias frequencies are weighted by a preferred process. In some embodiments, a weighted fitted relationship (e.g., weighted fit) can be obtained by a process including quantile regression, parameterized distributions, or empirical distributions with interpolation. In certain embodiments, the relationship between local genome bias estimates and bias frequencies for a test sample, reference, or a portion thereof includes a polynomial regression in which the local genome bias estimates are weighted. In some embodiments, the weighted fitted model includes weighted values of the distribution. The values of the distribution can be weighted by a preferred process. In some embodiments, values located near the tails of the distribution are given smaller weights than values near the median of the distribution. For example, in the case of a distribution between local genome bias estimates (e.g., GC density) and bias frequencies (e.g., GC density frequencies), the weights are determined according to the bias frequencies relative to a given local genome bias estimate, and local genome bias estimates containing bias frequencies close to the mean of the distribution are given greater weights than local genome bias estimates containing bias frequencies far from the mean.
[0146] In some embodiments, the processing step includes normalizing the sequence read count by comparing the local genome bias estimate of the sequence reads of the sample under test with the local genome bias estimate of a reference (e.g., a reference genome, or a portion thereof). In some embodiments, the sequence read count is normalized by comparing the bias frequency of the local genome bias estimate of the sample under test with the bias frequency of the local genome bias estimate of the reference. In some embodiments, the sequence read count is normalized by comparing the sample bias relationship and the reference bias relationship, thereby generating a comparison.
[0147] The count of sequence reads may be normalized according to a comparison of two or more relationships. In certain embodiments, two or more relationships are compared to provide a comparison used to reduce local bias in sequence reads (e.g., normalizing the count). The two or more relationships may be compared in a preferred manner. In some embodiments, the comparison includes adding, subtracting, multiplying, and / or dividing a first relationship from a second relationship. In certain embodiments, comparing two or more relationships includes the use of preferred linear and / or nonlinear regressions. In certain embodiments, comparing two or more relationships includes preferred polynomial regression (e.g., cubic polynomial regression). In some embodiments, the comparison includes adding, subtracting, multiplying, and / or dividing a first regression from a second regression. In some embodiments, the two or more relationships are compared by a process that includes a multiple regression inference framework. In some embodiments, the two or more relationships are compared by a process that includes preferred multivariate analysis. In some embodiments, two or more relationships are compared by a process that includes basis functions (e.g., blend functions, e.g., polynomial basis, Fourier basis, etc.), splines, radial basis functions, and / or wavelets.
[0148] In certain embodiments, the distribution of local genome bias estimates, including bias frequencies for the test sample and the reference, is compared by a process that includes a polynomial regression on which the local genome bias estimates are weighted. In some embodiments, the polynomial regression is generated between (i) ratios, each of which includes the bias frequency of the reference local genome bias estimate and the bias frequency of the sample local genome bias estimate, and (ii) local genome bias estimates. In some embodiments, the polynomial regression is generated between (i) the ratio of the bias frequency of the reference local genome bias estimate to the bias frequency of the sample local genome bias estimate, and (ii) local genome bias estimates. In some embodiments, the comparison of the distribution of local genome bias estimates for the test sample and the reference reads includes determining the logarithmic ratio (e.g., log2 ratio) of the bias frequencies of the local genome bias estimates for the reference and the sample. In some embodiments, comparing the distributions of local genome bias estimates involves dividing the logarithmic ratio (e.g., log2 ratio) of the bias frequencies of the local genome bias estimates for the reference by the logarithmic ratio (e.g., log2 ratio) of the bias frequencies of the local genome bias estimates for the sample.
[0149] Normalizing counts according to a comparison typically involves adjusting some counts while leaving others unadjusted. In some cases, normalization adjusts all counts, and in other cases, it adjusts none of the counts for sequence reads. Counts for sequence reads may be normalized by a process that includes determining weighting coefficients, and in other cases, the process does not involve directly generating and utilizing weighting coefficients. Normalizing counts according to a comparison may, in some cases, involve determining a weighting coefficient for each count of sequence reads. Weighting coefficients are often sequence read-specific and applied to the counts of specific sequence reads. Weighting coefficients are often determined by comparing two or more bias relationships (e.g., a sample bias relationship is compared to a reference bias relationship). Normalized counts are often determined by adjusting count values according to weighting coefficients. Adjusting counts according to weighting coefficients may, in some cases, involve adding, subtracting, multiplying, and / or dividing the counts for sequence reads by the weighting coefficients. Weighting coefficients and / or normalized counts may, in some cases, be determined from regressions (e.g., regression lines). Normalized counts are sometimes obtained directly from a regression line (e.g., a fitted regression line) resulting from a comparison between the bias frequencies of local genome bias estimates for a reference (e.g., a reference genome) and the sample being tested. In some embodiments, each count for a sample read is provided with a normalized count value according to a comparison between (i) the bias frequency of the local genome bias estimate of the read being compared and (ii) the bias frequency of the local genome bias estimate of the reference. In certain embodiments, the counts of sequence reads obtained for a sample are normalized to reduce the bias of the sequence reads.
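The comparison of a sample bias relationship with a reference bias relationship, and the use of the result as read-specific weighting coefficients, might be illustrated as follows. This is a simplified sketch: it replaces the fitted (e.g., polynomial) regression described above with a direct per-bin ratio of empirical GC density frequencies, and the function names, the binning scheme, and the default weight of 1 for bins without a coefficient are all assumptions.

```python
from collections import Counter
from math import log2

def gc_density_frequencies(gc_values, n_bins=10):
    """Fraction of observations in each GC-density bin (bins 0 .. n_bins - 1)."""
    bins = Counter(min(int(gc * n_bins), n_bins - 1) for gc in gc_values)
    total = len(gc_values)
    return {b: n / total for b, n in bins.items()}

def weighting_coefficients(sample_freq, reference_freq):
    """Per-bin ratio (reference frequency / sample frequency).

    Bins absent from either relationship are skipped (assumption).
    """
    return {b: reference_freq[b] / sample_freq[b]
            for b in sample_freq if b in reference_freq}

def log2_ratios(coefficients):
    """log2 of each reference/sample frequency ratio."""
    return {b: log2(c) for b, c in coefficients.items()}

def normalized_count(read_gc_values, coefficients, n_bins=10):
    """Sum of per-read weights; reads in bins without a coefficient keep weight 1."""
    total = 0.0
    for gc in read_gc_values:
        b = min(int(gc * n_bins), n_bins - 1)
        total += coefficients.get(b, 1.0)
    return total

# Hypothetical local GC densities for sample reads and reference positions.
sample_gc = [0.35] * 2 + [0.55] * 6
reference_gc = [0.35] * 4 + [0.55] * 4
coeffs = weighting_coefficients(gc_density_frequencies(sample_gc),
                                gc_density_frequencies(reference_gc))
adjusted_total = normalized_count(sample_gc, coeffs)
```

Here reads from the under-represented GC bin are weighted up and reads from the over-represented bin are weighted down, so the GC distribution of the weighted sample reads matches that of the reference.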
[0150] LOESS normalization In some embodiments, the processing step includes LOESS normalization. LOESS is a regression modeling method known in the art that combines multiple regression models in a k-nearest neighbor-based metamodel. LOESS is sometimes referred to as locally weighted polynomial regression. In some embodiments, GC LOESS applies the LOESS model to the relationship between fragment counts (e.g., sequence reads, counts) and GC composition for a portion of a reference genome. A smooth curve plotted through a set of data points using LOESS is sometimes called a LOESS curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis criterion variable. For each point in the dataset, the LOESS method fits a low-order polynomial to a subset of the data, with explanatory variable values near the point where the response is estimated. The polynomial is fitted using weighted least squares, giving greater weight to points near the point where the response is estimated and less weight to points farther away. The value of the regression function for a point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. LOESS fitting is sometimes considered complete after the regression function value has been calculated for each data point. Many of the details of this method, such as the degree of the polynomial model and the weights, are flexible.
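A minimal sketch of the local fitting procedure described above is given below. It uses a first-degree (linear) local polynomial and tricube weights, which is one common choice among the flexible details noted above and not necessarily the configuration used in any embodiment. The function name `loess`, the `frac` parameter, and the example data are assumptions.

```python
def loess(x, y, frac=0.5):
    """Locally weighted linear regression; returns one smoothed value per point."""
    n = len(x)
    k = max(3, min(n, int(round(frac * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in x:
        # Subset of the data with explanatory values nearest to x0.
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        d_max = max(abs(x[i] - x0) for i in nearest) or 1.0
        # Tricube weights: greater for nearby points, smaller for distant ones.
        w = [(1.0 - (abs(x[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 1e-12 else 0.0  # fall back to weighted mean
        # Evaluate the local polynomial at the point being estimated.
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical GC fractions and counts for eight portions of a reference genome.
gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
counts = [80.0, 92.0, 100.0, 104.0, 105.0, 103.0, 97.0, 88.0]
trend = loess(gc, counts, frac=0.6)
gc_corrected = [c - t for c, t in zip(counts, trend)]
```

In a GC LOESS setting, the explanatory variable is the GC composition of each portion and the response is the count, so subtracting the fitted trend is one way the GC-dependent component of the counts could be reduced.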
[0151] Principal component analysis In some embodiments, the processing step includes principal component analysis (PCA). In some embodiments, the sequence read count (e.g., the sequence read count of the sample under test) is adjusted according to principal component analysis (PCA). In some embodiments, the read density profile (e.g., the read density profile of the sample under test) is adjusted according to principal component analysis (PCA). The read density profiles of one or more reference samples and / or the read density profile of the subject under test can be adjusted according to PCA. Removing bias from the read density profile by a PCA-related process is sometimes referred to herein as profile adjustment. PCA can be carried out by a suitable PCA method or a variation thereof. Non-limiting examples of PCA methods include canonical correlation analysis (CCA), Karhunen-Loève transform (KLT), Hotelling transform, proper orthogonal decomposition (POD), singular value decomposition (SVD) of X, eigenvalue decomposition (EVD) of XᵀX, factor analysis, Eckart-Young theorem, Schmidt-Mirsky theorem, empirical orthogonal functions (EOF), empirical eigenfunction decomposition, empirical component analysis, quasi-harmonic modes, spectral decomposition, empirical modal analysis, and variations or combinations thereof. PCA often identifies and / or adjusts one or more biases in the read density profile. Biases identified and / or adjusted by PCA are sometimes referred to herein as principal components. In some embodiments, one or more biases can be removed by adjusting the read density profile according to one or more principal components using a suitable method. The read density profile can be adjusted by adding, subtracting, multiplying, and / or dividing one or more principal components from the read density profile. In some embodiments, one or more biases can be removed from the read density profile by subtracting one or more principal components from the read density profile. While biases within the read density profile are often identified and / or quantified by PCA of the profile, principal components are often subtracted from the profile at the read density level. PCA often identifies one or more principal components. In some embodiments, PCA identifies first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth or more principal components. In certain embodiments, one, two, three, four, five, six, seven, eight, nine, ten, or more principal components are used to adjust the profile. In certain embodiments, five principal components are used to adjust the profile. Often, the principal components are used to adjust the profile in the order of their appearance in the PCA. For example, if three principal components are subtracted from the read density profile, the first, second, and third principal components are used. In some cases, a bias identified by a principal component includes a profile feature that is not used to adjust the profile. For example, PCA may identify copy number variations (e.g., aneuploidy, microduplication, microdeletion, deletion, translocation, insertion) and / or sex differences as principal components. Therefore, in some embodiments, one or more principal components are not used to adjust the profile. For example, in some cases, the first, second, and fourth principal components are used to adjust the profile, and the third principal component is not used.
[0152] The principal components can be obtained from PCA using any suitable sample or reference. In some embodiments, the principal components are obtained from the sample under test (e.g., the subject under test). In some embodiments, the principal components are obtained from one or more references (e.g., a reference sample, a reference sequence, a reference set). In certain cases, PCA is performed on a median read density profile obtained from a training set containing multiple samples, resulting in the identification of a first and a second principal component. In some embodiments, the principal components are obtained from a set of subjects that do not have the copy number variation in question. In some embodiments, the principal components are obtained from a set of known euploids. The principal components are often identified according to PCA performed using one or more read density profiles of references (e.g., a training set). One or more principal components obtained from references are often subtracted from the read density profile of the subject under test, thereby providing an adjusted profile.
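One way the profile adjustment described above could be sketched is shown below: principal components are estimated from reference read density profiles and then subtracted from the profile of a subject under test. The sketch uses power iteration with deflation in place of a library SVD so that it stays self-contained. The function names, the centring on the per-portion reference mean, the fixed random seed, and the toy data are assumptions.

```python
import math
import random

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def principal_components(profiles, k=1, iters=200, seed=0):
    """Top-k principal axes of reference profiles (rows = samples, columns = portions)."""
    n_portions = len(profiles[0])
    mean = [sum(col) / len(profiles) for col in zip(*profiles)]
    rows = [[v - m for v, m in zip(row, mean)] for row in profiles]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() for _ in range(n_portions)]
        for _ in range(iters):
            # w = X^T (X v), computed without forming the covariance matrix.
            scores = [_dot(row, v) for row in rows]
            w = [sum(s * row[j] for s, row in zip(scores, rows))
                 for j in range(n_portions)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:
                break  # no variance left to explain
            v = [x / norm for x in w]
        components.append(v)
        # Deflation: remove the component just found from every reference row.
        rows = [[x - _dot(row, v) * c for x, c in zip(row, v)] for row in rows]
    return mean, components

def adjust_profile(profile, mean, components):
    """Subtract the projection onto each principal component from a test profile."""
    residual = [v - m for v, m in zip(profile, mean)]
    for c in components:
        score = _dot(residual, c)
        residual = [r - score * x for r, x in zip(residual, c)]
    return residual

# Hypothetical reference set: a shared alternating bias with varying strength.
bias = [1.0, -1.0, 1.0, -1.0]
references = [[10.0 + a * b for b in bias] for a in (-2.0, -1.0, 1.0, 2.0)]
mean, pcs = principal_components(references, k=1)
# Test profile: the same bias (strength 3) plus a signal in the last two portions.
test_profile = [10.0 + 3.0 * b + s for b, s in zip(bias, [0.0, 0.0, 0.5, 0.5])]
adjusted = adjust_profile(test_profile, mean, pcs)
```

In this toy case the adjusted profile retains the elevation in the last two portions while the shared bias pattern is removed. The returned values are deviations from the reference mean.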
[0153] Hybrid normalization In some embodiments, the processing step includes a hybrid normalization method. In certain cases, the hybrid normalization method can reduce bias (e.g., GC bias). In some embodiments, hybrid normalization includes (i) an analysis of the relationship between two variables (e.g., count and GC content), and (ii) the selection and application of a normalization method according to the analysis. In certain embodiments, hybrid normalization includes (i) regression (e.g., regression analysis), and (ii) the selection and application of a normalization method by regression. In some embodiments, the count obtained for a first sample (e.g., a first sample set) is normalized in a different way than the count obtained from another sample (e.g., a second sample set). In some embodiments, the count obtained for a first sample (e.g., a first sample set) is normalized by a first normalization method, and the count obtained from a second sample (e.g., a second sample set) is normalized by a second normalization method. For example, in certain embodiments, the first normalization method includes the use of linear regression, and the second normalization method includes the use of nonlinear regression (e.g., LOESS, GC-LOESS, LOWESS regression, LOESS smoothing).
[0154] In some embodiments, the hybrid normalization method is used to normalize sequence reads mapped to portions of the genome or chromosomes (e.g., counts, mapped counts, mapped reads). In certain embodiments, raw counts are normalized, and in some embodiments, adjusted, weighted, filtered, or previously normalized counts are normalized by the hybrid normalization method. In certain embodiments, levels or Z-scores are normalized. In some embodiments, counts mapped to selected portions of the genome or chromosomes are normalized by the hybrid normalization approach. A count can refer to any suitable measure of sequence reads mapped to a portion of the genome, and non-limiting examples include raw counts (e.g., unprocessed counts), normalized counts (e.g., normalized by LOESS, principal components, or another suitable method), portion levels (e.g., average level, mean level, median level, etc.), Z-scores, etc., or combinations thereof. A count can be a raw count or a processed count from one or more samples (e.g., test samples, samples from pregnant women). In some embodiments, the count is obtained from one or more samples obtained from one or more subjects.
[0155] In some embodiments, the normalization method (e.g., type of normalization method) is selected according to regression (e.g., regression analysis) and / or correlation coefficients. Regression analysis refers to a statistical method for estimating the relationship between variables (e.g., count and GC content). In some embodiments, the regression is generated according to the counts and a measure of GC content for each portion of multiple portions of a reference genome.
[0156] Any suitable measure of GC content can be used, non-limiting examples of which include measures of guanine (G), cytosine (C), adenine (A), thymine (T), purine (GC), or pyrimidine (AT or ATU) content, melting temperature (Tm) (e.g., denaturation temperature, annealing temperature, hybridization temperature), free energy, or combinations thereof. Measures of guanine (G), cytosine (C), adenine (A), thymine (T), purine (GC), or pyrimidine (AT or ATU) content can be expressed as ratios or percentages. In some embodiments, any suitable ratio or percentage is used, non-limiting examples of which include GC / AT, GC / total nucleotides, GC / A, GC / T, AT / total nucleotides, AT / GC, AT / G, AT / C, G / A, C / A, G / T, G / A, G / AT, C / T, or combinations thereof. In some embodiments, the measure of GC content is the ratio or percentage of GC to the total nucleotide content. In some embodiments, the measure of GC content is the ratio or percentage of GC to the total nucleotide content of sequence reads mapped to a portion of a reference genome. In certain embodiments, GC content is determined according to and / or from sequence reads mapped to each portion of a reference genome, and the sequence reads are obtained from a sample. In some embodiments, the measure of GC content is not determined according to and / or from sequence reads. In certain embodiments, the measure of GC content is determined for one or more samples obtained from one or more subjects.
[0157] In some embodiments, generating a regression includes generating a regression analysis or correlation analysis. Any suitable regression can be used, and non-limiting examples include regression analysis (e.g., linear regression analysis), goodness-of-fit analysis, Pearson correlation analysis, rank correlation, fraction of variance unexplained, Nash-Sutcliffe model efficiency analysis, regression model validation, proportional reduction in loss, root mean square deviation, or a combination thereof. In some embodiments, a regression line is generated. In certain embodiments, generating a regression includes generating a linear regression. In certain embodiments, generating a regression includes generating a nonlinear regression (e.g., LOESS regression, LOWESS regression).
[0158] In some embodiments, the regression determines whether there is a correlation (e.g., linear correlation) between, for example, a count and a measure of GC content. In some embodiments, a regression (e.g., linear regression) is generated and a correlation coefficient is determined. In some embodiments, a suitable correlation coefficient is determined, and non-limiting examples include the coefficient of determination, an R² value, and the Pearson correlation coefficient.
[0159] In some embodiments, goodness of fit is determined for a regression (e.g., regression analysis, linear regression). Goodness of fit may be determined by visual or mathematical analysis. The evaluation may include determining whether the goodness of fit is greater for a nonlinear regression or greater for a linear regression. In some embodiments, the correlation coefficient is a measure of goodness of fit. In some embodiments, the evaluation of goodness of fit for a regression is determined according to the correlation coefficient and / or a correlation coefficient cutoff value. In some embodiments, the evaluation of goodness of fit includes comparing the correlation coefficient to a correlation coefficient cutoff value. In some embodiments, the evaluation of goodness of fit for a regression indicates linear regression. For example, in a particular embodiment, the goodness of fit is greater for linear regression than for nonlinear regression, and the evaluation of goodness of fit indicates linear regression. In some embodiments, the evaluation indicates linear regression, and linear regression is used to normalize the count. In some embodiments, the evaluation of goodness of fit for a regression indicates nonlinear regression. For example, in a particular embodiment, the goodness of fit is higher for nonlinear regression than for linear regression, and the evaluation of goodness of fit indicates nonlinear regression. In some embodiments, the evaluation indicates nonlinear regression, and nonlinear regression is used to normalize the count.
[0160] In some embodiments, the goodness-of-fit assessment represents linear regression when the correlation coefficient is greater than or equal to the correlation coefficient cutoff. In some embodiments, the goodness-of-fit assessment represents nonlinear regression when the correlation coefficient is less than the correlation coefficient cutoff. In some embodiments, the correlation coefficient cutoff is determined in advance. In some embodiments, the correlation coefficient cutoff is about 0.5 or greater, about 0.55 or greater, about 0.6 or greater, about 0.65 or greater, about 0.7 or greater, about 0.75 or greater, about 0.8 or greater, about 0.85 or greater, about 0.9 or greater, or about 0.95 or greater.
[0161] In some embodiments, a specific type of regression is selected (e.g., linear or nonlinear regression), and after the regression is generated, the count is normalized by subtracting the regression from the count. In some embodiments, subtracting the regression from the count provides a normalized count with reduced bias (e.g., GC bias). In some embodiments, linear regression is subtracted from the count. In some embodiments, nonlinear regression (e.g., LOESS, GC-LOESS, LOWESS regression) is subtracted from the count. Any suitable method can be used to subtract the regression line from the count. For example, if count x is derived from a portion i containing a GC content of 0.5, and the regression line determines count y at a GC content of 0.5, then x − y = the normalized count for portion i. In some embodiments, the count is normalized before and / or after subtracting the regression. In some embodiments, the count normalized by a hybrid normalization approach is used to generate levels, Z-scores, and / or profiles of the genome or a portion thereof. In certain embodiments, counts normalized by a hybrid normalization approach are analyzed by methods described herein to determine the presence or absence of genetic variations or genetic alterations (e.g., copy number changes).
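The selection between linear and nonlinear regression, followed by subtraction of the selected regression from the counts, might be sketched as follows. The sketch uses R² as the correlation coefficient and 0.6 as the cutoff, both of which appear among the non-limiting options above. The nonlinear branch uses a simple moving average over GC-ordered portions as a stand-in for LOESS to keep the example short. All function names and data are assumptions.

```python
def linear_fit(x, y):
    """Ordinary least squares line; returns fitted values and R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    fitted = [my + slope * (a - mx) for a in x]
    ss_res = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return fitted, r2

def moving_average_fit(x, y, half_width=1):
    """Stand-in nonlinear smoother (NOT LOESS): mean of GC-ordered neighbours."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - half_width), min(len(order), rank + half_width + 1)
        neighbours = [y[order[j]] for j in range(lo, hi)]
        fitted[i] = sum(neighbours) / len(neighbours)
    return fitted

def hybrid_normalize(gc, counts, r2_cutoff=0.6):
    """Pick the regression by goodness of fit, then subtract it (x - y per portion)."""
    fitted, r2 = linear_fit(gc, counts)
    method = "linear"
    if r2 < r2_cutoff:
        fitted, method = moving_average_fit(gc, counts), "nonlinear"
    return [c - f for c, f in zip(counts, fitted)], method

gc = [0.3, 0.4, 0.5, 0.6, 0.7]
linear_counts = [90.0, 95.0, 100.0, 105.0, 110.0]   # strong linear GC dependence
curved_counts = [100.0, 90.0, 80.0, 90.0, 100.0]    # no linear trend
norm_a, method_a = hybrid_normalize(gc, linear_counts)
norm_b, method_b = hybrid_normalize(gc, curved_counts)
```

With these hypothetical inputs the first sample is normalized by linear regression and the second by the nonlinear stand-in, which mirrors the case where two samples are normalized by different methods.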
[0162] In some embodiments, the hybrid normalization method includes filtering or weighting one or more portions before or after normalization. Preferred methods for filtering portions can be used, including methods for filtering portions (e.g., portions of the reference genome) as described herein. In some embodiments, portions (e.g., portions of the reference genome) are filtered before applying the hybrid normalization method. In some embodiments, only the counts of sequencing reads mapped to selected portions (e.g., portions selected according to count variability) are normalized by hybrid normalization. In some embodiments, the counts of sequencing reads mapped to filtered portions of the reference genome (e.g., portions filtered according to count variability) are removed before utilizing the hybrid normalization method. In some embodiments, the hybrid normalization method includes selecting or filtering portions (e.g., portions of the reference genome) according to preferred methods (e.g., methods described herein). In some embodiments, the hybrid normalization method includes selecting or filtering portions (e.g., portions of the reference genome) according to uncertainty values for counts mapped to each of multiple portions of a sample. In some embodiments, the hybrid normalization method includes selecting or filtering portions (e.g., portions of the reference genome) according to count variability. In some embodiments, the hybrid normalization method includes selecting or filtering a portion (e.g., a portion of the reference genome) according to GC content, repeating elements, repeating sequences, introns, exons, or a combination thereof.
[0163] Profile In some embodiments, the processing step includes generating one or more profiles (e.g., profile plots) from various aspects of a dataset or its derivatives (e.g., the product of one or more mathematical and / or statistical data processing steps known in the art and / or described herein). As used herein, the term “profile” refers to the product of mathematical and / or statistical manipulation of data that can facilitate the identification of patterns and / or correlations in large amounts of data. A “profile” often includes values resulting from one or more operations on the data or dataset, based on one or more criteria. A profile often includes multiple data points. Depending on the nature and / or complexity of the dataset, any suitable number of data points may be included in the profile. In certain embodiments, the profile may include two or more data points, three or more data points, five or more data points, ten or more data points, twenty-four or more data points, twenty-five or more data points, fifty or more data points, one hundred or more data points, five hundred or more data points, one thousand or more data points, five thousand or more data points, ten thousand or more data points, or one hundred thousand or more data points.
[0164] In some embodiments, a profile represents the entire dataset, and in certain embodiments, a profile represents a part or subset of the dataset. That is, a profile may include, or be generated from, data points representing data that has not been filtered to remove unwanted data, and may include, or be generated from, data points representing data that has been filtered to remove unwanted data. In some embodiments, data points in a profile represent the results of data manipulation on a portion of the dataset. In certain embodiments, data points in a profile include the results of data manipulation on a group of portions. In some embodiments, groups of portions may be adjacent to each other, and in certain embodiments, groups of portions may originate from different locations on a chromosome or genome.
[0165] The data points in a profile derived from a dataset can represent any preferred data classification. Non-limiting examples of classifications to which data can be grouped to generate profile data points include parts based on size, sequence features (e.g., GC content, AT content, chromosomal location (e.g., short arm, long arm, centromere, telomere)), expression levels, chromosomes, or combinations thereof. In some embodiments, a profile may be generated from data points obtained from another profile (e.g., a normalized data profile, renormalized to different normalization values to generate a renormalized data profile). In certain embodiments, a profile generated from data points obtained from another profile reduces the number of data points and / or the complexity of the dataset. Reducing the number of data points and / or the complexity of the dataset often facilitates data interpretation and / or the provision of results.
[0166] A profile (e.g., a genome profile, a chromosome profile, a profile of a portion of a chromosome) is often a set of normalized or unnormalized counts for two or more portions. A profile often includes at least one level, and often includes two or more levels (e.g., a profile often has multiple levels). Levels are generally for sets of portions having approximately the same count or normalized count. Levels are described in more detail herein. In certain embodiments, a profile includes one or more portions, which can be weighted, removed, filtered, normalized, adjusted, averaged, derived as an average, added, subtracted, processed, or transformed by any combination thereof. A profile often includes normalized counts mapped to portions defining two or more levels, and the counts are further normalized according to one of the levels by a preferred method. Often, the counts of a profile (e.g., profile levels) are associated with uncertainty values.
[0167] Profiles containing one or more levels may, in some cases, be padded (e.g., hole padding). Padding (e.g., hole padding) refers to the process of identifying and adjusting levels in a profile that result from copy number variations (e.g., microduplications or microdeletions in the patient's genome, maternal microduplications or microdeletions). In some embodiments, levels resulting from microduplications or microdeletions in a tumor or fetus are padded. Microduplications or microdeletions in a profile can, in some embodiments, artificially increase or decrease the overall level of the profile (e.g., a chromosomal profile), leading to false-positive or false-negative determinations for chromosomal aneuploidy (e.g., trisomy). In some embodiments, levels in a profile resulting from microduplications and / or microdeletions are identified and adjusted (e.g., padded and / or removed) by a process sometimes referred to as padding or hole padding.
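As a non-limiting illustration, the adjustment described above can be sketched as follows. The disclosure does not specify how a padded level is chosen; this sketch assumes, purely for illustration, that the levels inside an already identified microduplication or microdeletion are replaced with the median of the remaining levels. The function name `pad_profile` and the half-open index range are hypothetical.

```python
from statistics import median


def pad_profile(levels, start, end):
    """Return a copy of `levels` with portions [start, end) padded.

    Assumption (not from the disclosure): the padded value is the median
    of the levels outside the identified region.
    """
    outside = levels[:start] + levels[end:]
    if not outside:
        raise ValueError("region covers the whole profile; nothing to pad with")
    baseline = median(outside)
    # Replace only the identified region; all other levels are unchanged.
    return levels[:start] + [baseline] * (end - start) + levels[end:]


# Hypothetical chromosomal profile with a microdeletion at portions 2 and 3.
profile = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0]
padded = pad_profile(profile, 2, 4)
```

After padding, the overall level of the example profile is no longer pulled down by the two deleted portions.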
[0168] A profile containing one or more levels may include a first level and a second level. In some embodiments, the first level is different from (e.g., significantly different from) the second level. In some embodiments, the first level includes a first set of parts, the second level includes a second set of parts, and the first set of parts is not a subset of the second set of parts. In certain embodiments, the first set of parts is different from the second set of parts from which the first and second levels are determined. In some embodiments, a profile may have multiple first levels that are different from (e.g., significantly different from, e.g., have significantly different values from) the second level in the profile. In some embodiments, a profile includes one or more first levels that are significantly different from the second level in the profile, and one or more of the first levels are adjusted. In some embodiments, the first levels in the profile are removed from the profile or adjusted (e.g., padded). A profile may include multiple levels, including one or more first levels that are significantly different from one or more second levels, and often the majority of the levels in the profile are second levels, and the second levels are approximately equal to each other. In some embodiments, the second levels account for more than 50%, 60%, 70%, 80%, 90%, or 95% of the levels in the profile.
[0169] Profiles may, in some cases, be displayed as plots. For example, one or more levels representing the counts of a portion (e.g., normalized counts) can be plotted and visualized. Non-limiting examples of profile plots that can be generated include raw counts (e.g., raw count profile or raw profile), normalized counts, portion weights, z-scores, p-values, area ratio versus fitted ploidy, median level versus the ratio between fitted and measured minority species fractions, principal components, or combinations thereof. In some embodiments, profile plots enable visualization of the manipulated data. In certain embodiments, profile plots can be used to provide results (e.g., area ratio versus fitted ploidy, median level versus the ratio between fitted and measured minority species fractions, principal components). As used herein, the term “raw count profile plot” or “raw profile plot” refers to a plot of counts in each portion of a region normalized to the total count in the region (e.g., genome, portion, chromosome, chromosomal portion, or part of a chromosome of a reference genome). In some embodiments, the profile can be generated using a fixed window process, and in certain embodiments, the profile can be generated using a sliding window process.
[0170] The profile generated for a test subject is, in some cases, compared to a profile generated for one or more reference subjects to facilitate the interpretation of mathematical and / or statistical operations on the dataset and / or to provide results. In some embodiments, the profile is generated based on one or more initial assumptions, e.g., the assumptions described herein. In certain embodiments, the test profile is often centered on a predetermined value representing the absence of copy number variation, and often deviates from the predetermined value in the region corresponding to the genomic location where the copy number variation is located in the test subject, if the test subject does possess copy number variation. In test subjects at risk of or suffering from a disease associated with copy number variation, the numerical values for the selected portion are expected to deviate significantly from the predetermined values for unaffected genomic locations. Depending on the initial assumptions (e.g., fixed ploidy or optimized ploidy, fixed fraction or optimized fraction of cancer cell nucleic acids, fixed fetal fraction or optimized fetal fraction, or a combination thereof), a predetermined threshold or cutoff value or threshold range for the value indicating the presence or absence of copy number variation can be changed while still providing useful results for determining the presence or absence of copy number variation. In some embodiments, the profile indicates and / or represents the phenotype.
[0171] In some embodiments, one or more reference samples substantially free of the copy number variation in question can be used to generate a reference count profile (e.g., a reference median count profile), which may yield a predetermined value representing the absence of the copy number variation; in many cases, if the test subject carries the copy number variation, the test profile will deviate from the predetermined value in the region corresponding to the genomic location where the copy number variation is located. In test subjects at risk of or suffering from a disease associated with the copy number variation, the numerical values for the selected portion or segment are expected to deviate significantly from the predetermined values for the unaffected genomic location. In certain embodiments, one or more reference samples known to carry the copy number variation in question can be used to generate a reference count profile (e.g., a reference median count profile), which may yield a predetermined value representing the presence of the copy number variation; in many cases, the test profile will deviate from the predetermined value in the region corresponding to the genomic location where the test subject does not carry the copy number variation. In test subjects not at risk of or suffering from a disease associated with the copy number variation, the numerical values for the selected portion or segment are expected to deviate significantly from the predetermined values for the affected genomic location.
[0172] As a non-limiting example, a normalized sample and / or reference count profile can be obtained from raw sequence read data by (a) calculating the reference median count for a chromosome, part or site selected from a set of references known not to carry copy number variations; (b) removing (e.g., filtering) uninformative portions from the raw count of the reference sample; (c) normalizing the reference count for all remaining portions of the reference genome to the total remaining count for the selected chromosome or selected genomic location of the reference sample (e.g., the sum of the remaining counts after removal of uninformative portions of the reference genome), thereby generating a normalized reference subject profile; (d) removing the corresponding portions from the test subject sample; and (e) normalizing the remaining test subject counts for one or more selected genomic locations to the sum of the remaining reference median counts for the chromosome containing the selected genomic location, thereby generating a normalized test subject profile. In certain embodiments, an additional normalization step for the entire genome reduced by the filtered portion in (b) may be included between (c) and (d).
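As a non-limiting illustration, steps (a) through (e) can be sketched for a single chromosome as follows. The function name `normalized_profiles`, the data layout (one count per portion), and the filtering rule (a portion is treated as uninformative when its reference median count falls below a hypothetical `min_median`) are assumptions made for this sketch and are not taken from the disclosure.

```python
from statistics import median


def normalized_profiles(reference_counts, test_counts, min_median=1.0):
    """Sketch of steps (a)-(e) for one chromosome.

    reference_counts: list of per-sample lists, one raw count per portion,
                      from references known not to carry the variation.
    test_counts:      list of raw counts per portion for the test sample.
    min_median:       hypothetical cutoff used to flag uninformative portions.
    """
    n = len(test_counts)
    # (a) reference median count for each portion
    ref_median = [median(sample[i] for sample in reference_counts) for i in range(n)]
    # (b) filter uninformative portions (assumed rule: low median count)
    keep = [i for i in range(n) if ref_median[i] >= min_median]
    # (c) normalize reference counts to the total remaining reference count
    ref_total = sum(ref_median[i] for i in keep)
    ref_profile = {i: ref_median[i] / ref_total for i in keep}
    # (d) drop the same portions from the test sample, and
    # (e) normalize remaining test counts to the sum of remaining reference medians
    test_profile = {i: test_counts[i] / ref_total for i in keep}
    return ref_profile, test_profile


# Hypothetical counts: four portions, two reference samples, one test sample.
reference = [[10, 0, 30, 60], [10, 0, 30, 60]]
test = [20, 5, 30, 50]
ref_profile, test_profile = normalized_profiles(reference, test)
```

In this example, portion 1 is filtered from both profiles, and portion 0 of the test sample sits above its reference value while portion 3 sits below it.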
[0173] In some embodiments, a read density profile is determined. In some embodiments, the read density profile includes at least one read density, and often includes two or more read densities (e.g., a read density profile often includes multiple read densities). In some embodiments, the read density profile includes preferred quantitative values (e.g., mean, median, Z-score, etc.). The read density profile often includes values resulting from one or more read densities. The read density profile may include values resulting from one or more operations on the read density based on one or more adjustments (e.g., normalization). In some embodiments, the read density profile includes unmanipulated read densities. In some embodiments, one or more read density profiles are generated from various aspects of a dataset including read densities or their derivatives (e.g., products of one or more mathematical and / or statistical data processing steps known in the art and / or described herein). In certain embodiments, the read density profile includes normalized read densities. In some embodiments, the read density profile includes adjusted read densities. In certain embodiments, a read density profile may include raw read density (e.g., unmanipulated, unadjusted, or unnormalized), normalized read density, weighted read density, read density of filtered portions, z-score of read density, p-value of read density, integral of read density (e.g., area under the curve), mean or median read density, principal components, or a combination thereof. Often, a read density profile and / or its read densities are associated with a measure of uncertainty (e.g., MAD). In certain embodiments, a read density profile may include a distribution of median read densities. In some embodiments, a read density profile may include relationships between multiple read densities (e.g., fit relationships, regressions, etc.). For example, in some cases, a read density profile may include relationships between read density (e.g., read density values) and genomic locations (e.g., parts, sublocations). In some embodiments, the read density profile can be generated using a fixed window process, and in certain embodiments, the read density profile can be generated using a sliding window process. In some embodiments, the read density profile may be printed and / or displayed (e.g., as a visual representation, e.g., as a plot or graph).
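As a non-limiting illustration, the difference between a fixed window process and a sliding window process can be sketched as follows. This sketch assumes, for simplicity, that read density is the number of read start positions in a window divided by the window length; the disclosure does not limit read density to that definition, and the function names and parameters are hypothetical.

```python
def fixed_window_density(read_starts, region_length, window):
    """Read density in consecutive, non-overlapping windows."""
    n_windows = (region_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // window] += 1
    # Every window is divided by the nominal window length, including a
    # possibly shorter final window.
    return [c / window for c in counts]


def sliding_window_density(read_starts, region_length, window, step):
    """Read density in overlapping windows that advance by `step`."""
    last_start = max(region_length - window, 0)
    densities = []
    for start in range(0, last_start + 1, step):
        in_window = sum(1 for p in read_starts if start <= p < start + window)
        densities.append(in_window / window)
    return densities


# Hypothetical read start positions on a 20-base region.
reads = [0, 1, 5, 9, 10, 15]
fixed = fixed_window_density(reads, region_length=20, window=10)
sliding = sliding_window_density(reads, region_length=20, window=10, step=5)
```

The sliding window version yields an additional, intermediate value for the window that straddles the boundary between the two fixed windows.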
[0174] In some embodiments, the read density profile corresponds to a set of parts (e.g., a set of parts of a reference genome, a set of parts of a chromosome, or a subset of parts of a chromosome). In some embodiments, the read density profile includes the read density and / or count associated with the set of parts (e.g., a set, a subset). In some embodiments, the read density profile is determined for the read density of adjacent parts. In some embodiments, adjacent parts include gaps (e.g., parts removed by filtering) containing regions of reference sequences and / or sequence reads that are not included in the density profile. In some cases, adjacent parts (e.g., a set of parts) represent adjacent regions of a genome or adjacent regions of a chromosome or gene. For example, two or more adjacent parts can represent a sequence assembly of DNA sequences longer than each part when aligned by merging the parts from end to end. For example, two or more adjacent parts can represent an intact genome, chromosome, gene, intron, exon, or a part thereof. In some cases, the read density profile is determined from a set of adjacent parts and / or non-adjacent parts (e.g., a set, a subset). In some cases, a read density profile consists of one or more parts, which can be weighted, removed, filtered, normalized, adjusted, averaged, derived as an average, added, subtracted, processed, or transformed by any combination thereof.
[0175] Read density profiles are often determined for a sample and / or reference (e.g., a reference sample). Read density profiles may be generated for the entire genome, one or more chromosomes, or a portion of the genome or chromosomes. In some embodiments, one or more read density profiles are determined for the genome or a portion thereof. In some embodiments, a read density profile represents the entire set of read densities for a sample, while in certain embodiments, a read density profile represents a portion or subset of the read densities for a sample. That is, in some cases, a read density profile includes or is generated from data points representing data that has not been filtered to remove any unwanted data, while in some cases, a read density profile includes or is generated from data points representing data that has been filtered to remove unwanted data.
[0176] In some embodiments, the read density profile is determined for a reference (e.g., a reference sample, a training set). The read density profile for a reference is, in some cases, referred to herein as the reference profile. In some embodiments, the reference profile includes read densities obtained from one or more references (e.g., a reference sequence, a reference sample). In some embodiments, the reference profile includes read densities determined for one or more known euploid samples (e.g., a set thereof). In some embodiments, the reference profile includes read densities of filtered portions. In some embodiments, the reference profile includes read densities adjusted according to one or more principal components.
[0177] In block 130, the processed sequence read data is analyzed using one or more bioinformatics subsystems. In certain cases, one or more bioinformatics subsystems include one or more machine learning models. For example, the processed sequence read data may be analyzed using one or more bioinformatics subsystems (e.g., chromosomal aberration decision trees (CADET)) to detect genomic events (e.g., presence or absence of gene mutations). One or more bioinformatics subsystems may also include comparators and transformers. In some cases, decision analysis is performed by one or more bioinformatics subsystems.
Conducting a comparison
[0178] In some embodiments, one or more bioinformatics subsystems include comparators for performing comparisons (e.g., comparing a test profile to a reference profile). Two or more datasets, two or more relationships, and / or two or more profiles can be compared in a preferred manner. Non-limiting examples of preferred statistical methods for comparing datasets, relationships, and / or profiles include the Behrens-Fisher method, bootstrap method, Fisher's method for combining independent significance tests, Neyman-Pearson test, confirmatory data analysis, exploratory data analysis, exact probability tests, F-tests, Z-tests, T-tests, calculation and / or comparison of uncertainty measures, null hypotheses, counter-null hypotheses, etc., chi-squared tests, omnibus tests, calculation and / or comparison of significance levels (e.g., statistical significance), meta-analysis, multivariate analysis, regression, simple linear regression, robust linear regression, etc., or combinations thereof. In certain embodiments, comparing two or more datasets, relationships, and / or profiles includes determining and / or comparing uncertainty measures. As used herein, “measure of uncertainty” refers to measures of significance (e.g., statistical significance), measures of error, measures of variance, measures of confidence, etc., or combinations thereof. Measures of uncertainty can be values (e.g., thresholds) or ranges of values (e.g., intervals, confidence intervals, Bayesian confidence intervals, threshold ranges). Non-limiting examples of measures of uncertainty include p-values, preferred deviation measures (e.g., standard deviation, sigma, absolute deviation, mean absolute deviation, etc.), preferred error measures (e.g., standard error, mean squared error, root mean squared error, etc.), preferred variance measures, preferred standard scores (e.g., standard deviation, cumulative percentage, percentile equivalent, Z-score, T-score, R-score, standard nine (stanine), percent in stanine, etc.), or combinations thereof. In some embodiments, determining the level of significance includes determining the measure of uncertainty (e.g., p-value). In certain embodiments, two or more datasets, relationships, and / or profiles can be analyzed and / or compared by utilizing multiple (e.g., two or more) statistical methods (e.g., least squares regression, principal component analysis, linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, logistic regression, and / or loess smoothing) and / or any suitable mathematical and / or statistical operations (e.g., referred to herein as operations).
[0179] In some embodiments, the processing step includes comparing two or more profiles (e.g., two or more read density profiles). Comparing profiles may include comparing the profiles generated for selected regions of the genome. For example, a test profile may be compared to a reference profile where both the test profile and the reference profile have been determined for substantially the same region of the genome (e.g., a reference genome). Comparing profiles may, in some cases, include comparing two or more subsets of parts of a profile (e.g., read density profiles). A subset of parts of a profile may represent a region of the genome (e.g., a chromosome, or a region thereof). A profile (e.g., a read density profile) may contain any number of subsets of parts. In some cases, a profile (e.g., a read density profile) may contain two or more, three or more, four or more, or five or more subsets. In certain embodiments, a profile (e.g., a read density profile) may contain two subsets of parts, each subset representing an adjacent region of the reference genome. In some embodiments, the test profile can be compared to a reference profile in which both the test profile and the reference profile include a first subset of parts and a second subset of parts, where the first and second subsets represent different regions of the genome. Some subsets of parts of the profile may contain copy number variations, while other subsets of parts may, in some cases, be substantially free of copy number variations. In some cases, all subsets of parts of the profile (e.g., the test profile) may be substantially free of copy number variations. In some cases, all subsets of parts of the profile (e.g., the test profile) may contain copy number variations. In some embodiments, the test profile may include a first subset of parts containing copy number variations and a second subset of parts substantially free of copy number variations.
[0180] In certain embodiments, comparing two or more profiles involves determining and / or comparing measures of uncertainty for two or more profiles. Profiles (e.g., read density profiles) and / or associated measures of uncertainty are compared, in some cases, to facilitate the interpretation of mathematical and / or statistical operations on the dataset and / or to provide results. A profile generated for a test subject (e.g., read density profile) is, in some cases, compared to a profile (e.g., read density profile) generated for one or more references (e.g., reference samples, reference subjects, etc.). In some embodiments, results are provided by comparing a profile from a test subject (e.g., read density profile) to a reference profile (e.g., read density profile) for a chromosome, or a part or portion thereof, where the reference profile is obtained from a set of reference subjects known not to possess copy number variations (e.g., references). In some embodiments, the results are provided by comparing a profile from the test subject (e.g., a read density profile) with a reference profile (e.g., a read density profile) for a chromosome, or a part or portion thereof, the reference profile being obtained from a set of reference subjects known to harbor specific copy number variations (e.g., chromosomal aneuploidy, microduplication, microdeletion).
[0181] In certain embodiments, the profile of the subject being tested (e.g., a read density profile) is compared to a predetermined value representing the absence of copy number variation, and, in some cases, deviates from the predetermined value at one or more genomic locations (e.g., segments) corresponding to the genomic locations where the copy number variation is located. For example, in a subject being tested (e.g., a subject at risk of or suffering from a medical condition associated with copy number variation), when the subject being tested contains the copy number variation in question, the profile is expected to be significantly different from the profile of the reference (e.g., a reference sequence, reference subject, reference set) for the selected segment. The profile of the subject being tested (e.g., a read density profile) is often substantially the same as the profile (e.g., a read density profile) of the reference (e.g., a reference sequence, reference subject, reference set) for the selected segment when the subject being tested does not contain the copy number variation in question. The profile (e.g., a read density profile) may be compared to a predetermined threshold and / or threshold range. As used herein, the term “threshold” refers to any number calculated using a qualifying dataset that serves as a limit for the diagnosis of copy number variation (e.g., aneuploidy, microduplication, microdeletion, etc.). In certain embodiments, a threshold is exceeded by results obtained by the methods described herein, and the subject is diagnosed with a copy number variation. In some embodiments, the threshold or range of values may be calculated by mathematically and / or statistically manipulating sequence read data (e.g., from reference and / or subject). A predetermined threshold or threshold range of values indicating the presence or absence of copy number variation can be varied while still providing useful results for determining the presence or absence of copy number variation. In certain embodiments, a profile including normalized read density and / or normalized count (e.g., a read density profile) is generated to facilitate classification and / or providing results. Results may be provided based on (e.g., using) a plot of the profile including the normalized count (e.g., a read density profile).
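As a non-limiting illustration, comparing a test level for a selected segment against reference levels and a threshold can be sketched as follows. This sketch assumes a robust Z-score built from the reference median and the median absolute deviation (MAD), scaled by the constant 1.4826 so that it approximates a standard deviation for normally distributed data. The threshold of 3.0 and the function names are hypothetical and are not values taken from the disclosure.

```python
from statistics import median


def mad(values):
    """Median absolute deviation of a list of values."""
    m = median(values)
    return median(abs(v - m) for v in values)


def robust_z(test_level, reference_levels):
    """Z-score of the test level relative to the reference distribution."""
    scale = 1.4826 * mad(reference_levels)
    if scale == 0:
        raise ValueError("reference levels have zero spread")
    return (test_level - median(reference_levels)) / scale


def classify_level(test_level, reference_levels, threshold=3.0):
    """Compare the Z-score against a (hypothetical) symmetric threshold."""
    z = robust_z(test_level, reference_levels)
    if z > threshold:
        return "gain"
    if z < -threshold:
        return "loss"
    return "no deviation"


# Hypothetical normalized levels for one segment across reference subjects.
reference_levels = [0.98, 0.99, 1.00, 1.01, 1.02]
result_high = classify_level(1.10, reference_levels)
result_same = classify_level(1.00, reference_levels)
result_low = classify_level(0.90, reference_levels)
```

Raising or lowering `threshold` trades sensitivity against specificity, which is consistent with the statement above that the threshold can be varied while still providing useful results.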
Decision analysis
[0182] In some embodiments, the determination of an outcome (e.g., making a call) or the presence or absence of copy number changes (e.g., chromosomal aneuploidy, microduplication, microdeletion) is carried out according to a decision analysis. Specific features of decision analysis are described in International Patent Application Publication No. 2014 / 190286, the entirety of which is incorporated herein by reference for all purposes. For example, a decision analysis may include, in some cases, applying one or more methods that produce one or more outcomes, evaluating the outcomes, and making a series of decisions based on the outcomes, the evaluations, and / or the possible consequences of the decisions, and terminating at some point in the process where a final decision is made. In some embodiments, the decision analysis is a decision tree. In some embodiments, the decision analysis involves the coordinated use of one or more processes (e.g., process steps, e.g., algorithms). A decision analysis can be carried out by a person, a system, a device, software (e.g., a subsystem), a computer, a processor (e.g., a microprocessor), or a combination thereof. In some embodiments, the decision analysis includes a method for determining the presence or absence of copy number changes (e.g., chromosomal aneuploidy, microduplication, or microdeletion) with reduced false-negative and false-positive determinations compared to cases where the decision analysis is not used (e.g., where the determination is made directly from normalized counts). In some embodiments, the decision analysis includes determining the presence or absence of one or more conditions associated with copy number changes.
[0183] In some embodiments, the decision analysis includes generating a profile of the genome or a region of the genome (e.g., a chromosome or a portion thereof). The profile can be generated by any preferred method known or described herein. In some embodiments, the decision analysis includes a segmentation process. Segmentation can modify and / or transform the profile, thereby providing one or more decomposition renderings of the profile. The profile subjected to the segmentation process is often a profile of normalized counts mapped to a portion or part thereof within a reference genome. As addressed herein, the raw counts mapped to a portion can be normalized by one or more preferred normalization processes (e.g., LOESS, GC-LOESS, principal component normalization, or a combination thereof) to generate a profile that is segmented as part of the decision analysis. Decomposition rendering of a profile is often a transformation of the profile. Decomposition rendering of a profile is, in some cases, a transformation of the profile to a representation of the genome, chromosome, or a portion thereof.
[0184] In certain embodiments, the segmentation process identifies and distinguishes one or more levels in the profile that are different (e.g., substantially or significantly different) from one or more other levels in the profile. A level that the segmentation process identifies as different from other levels in the profile, and that has different endpoints from those levels, is referred to herein as a level for a discrete segment. The segmentation process can generate, from a profile of normalized counts or levels, a decomposition rendering in which one or more discrete segments can be identified. A discrete segment generally covers fewer portions than the region that was segmented (e.g., chromosome, autosome).
[0185] In some embodiments, segmentation localizes and identifies the ends of discrete segments within a profile. In certain embodiments, one or both ends of one or more distinct segments are identified. For example, the segmentation process can identify the locations of the right and / or left ends of distinct segments within a profile (e.g., genomic coordinates, e.g., location of a portion). Distinct segments often contain two ends. For example, a discrete segment may include a left end and a right end. In some embodiments, depending on the representation or figure, the left end may be the 5' end of the nucleic acid segment in the profile and the right end may be the 3' end. In some embodiments, the left end may be the 3' end of the nucleic acid segment in the profile and the right end may be the 5' end. Often, the edges of the profile are known before segmentation, and therefore, in some embodiments, the ends of the profile determine which ends of a level are the 5' ends and which are the 3' ends. In some embodiments, one or both ends of the profile and / or distinct segments are chromosome ends.
[0186] In some embodiments, the ends of distinct segments are determined according to a decomposed rendering generated for a reference sample (e.g., a reference profile). In some embodiments, the null edge height distribution is determined according to a decomposed rendering of a reference profile (e.g., a profile of a chromosome or a part thereof). In certain embodiments, the ends of distinct segments in a profile are identified when the level of the distinct segment lies outside the null edge height distribution. In some embodiments, the ends of distinct segments in a profile are identified according to a Z-score calculated according to a decomposed rendering of a reference profile.
[0187] In some cases, segmentation generates two or more distinct segments in the profile (e.g., two or more fragmented levels, two or more fragmented segments). In some embodiments, the decomposed rendering derived from the segmentation process is over-segmented or fragmented and contains multiple distinct segments. In some cases, the distinct segments generated by segmentation are substantially different, and in some cases, the distinct segments generated by segmentation are substantially similar. Substantially similar distinct segments (e.g., substantially similar levels) often refer to two or more adjacent distinct segments in the segmented profile, each having levels that differ by less than a given level of uncertainty. In some embodiments, substantially similar distinct segments are adjacent to each other and are not separated by intervening segments. In some embodiments, substantially similar distinct segments are separated by one or more smaller segments. In some embodiments, substantially similar distinct segments are separated by about 1 to about 20, about 1 to about 15, about 1 to about 10, or about 1 to about 5 parts, and one or more of the intervening parts have levels that are significantly different from each of the levels of the substantially similar distinct segments. In some embodiments, the levels of substantially similar distinct segments differ by less than approximately 3 times, less than approximately 2 times, less than approximately 1 times, or less than approximately 0.5 times the level of uncertainty. In some embodiments, substantially similar distinct segments include median levels that differ by less than 3 MAD (e.g., less than 3 sigma), less than 2 MAD, less than 1 MAD, or less than approximately 0.5 MAD, where MAD is calculated from the median levels of each segment. In some embodiments, substantially different segments are not adjacent or are separated by 10 or more, 15 or more, or 20 or more segments. Substantially different segments generally have substantially different levels. In certain embodiments, substantially different segments include levels that differ by more than approximately 2.5 times, more than approximately 3 times, more than approximately 4 times, more than approximately 5 times, or more than approximately 6 times the level of uncertainty. In some embodiments, substantially different segments include median levels that differ by more than approximately 2.5 MAD (e.g., 2.5 sigma), 3 MAD, 4 MAD, 5 MAD, or 6 MAD, where MAD is calculated from the respective median levels of the distinct segments.
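As a non-limiting illustration, an over-segmented rendering can be simplified by merging adjacent segments whose levels are substantially similar. This sketch assumes that a single uncertainty value (e.g., a MAD) has already been determined and that two neighboring segments are merged when their levels differ by less than `factor` times that value; it handles only directly adjacent segments, not segments separated by small intervening segments. The tuple layout `(start, end, level)` with an exclusive end, the length-weighted merged level, and the default factor are hypothetical.

```python
def merge_similar_segments(segments, uncertainty, factor=3.0):
    """Merge neighboring segments whose levels differ by < factor * uncertainty.

    segments: list of (start, end, level) tuples sorted by start, end exclusive.
    """
    merged = []
    for start, end, level in segments:
        if merged and abs(level - merged[-1][2]) < factor * uncertainty:
            prev_start, prev_end, prev_level = merged[-1]
            prev_len = prev_end - prev_start
            cur_len = end - start
            # Assumed rule: the merged level is the length-weighted mean.
            new_level = (prev_level * prev_len + level * cur_len) / (prev_len + cur_len)
            merged[-1] = (prev_start, end, new_level)
        else:
            merged.append((start, end, level))
    return merged


# Hypothetical fragmented rendering: the first two segments are nearly equal.
fragments = [(0, 10, 0.0), (10, 20, 0.02), (20, 30, 0.5), (30, 40, 0.01)]
simplified = merge_similar_segments(fragments, uncertainty=0.05)
```

With an uncertainty of 0.05 and a factor of 3, the first two fragments are combined, while the segment at level 0.5 remains distinct from both of its neighbors.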
[0188] In some embodiments, the segmentation process includes determining (e.g., calculating) levels (e.g., quantitative values, e.g., mean or median levels), levels of uncertainty (e.g., uncertainty values), Z-scores, Z-values, p-values, etc., or combinations thereof, for one or more distinct segments in a profile or a portion thereof. In some embodiments, levels (e.g., quantitative values, e.g., mean or median levels), levels of uncertainty (e.g., uncertainty values), Z-scores, Z-values, p-values, etc., or combinations thereof, are determined (e.g., calculated) for distinct segments.
[0189] Segmentation can be carried out entirely or partially by one or more decomposition generation processes. A decomposition generation process may, for example, provide a decomposed rendering of a profile. Any decomposition generation process described herein or known in the art may be used. Non-limiting examples of decomposition generation processes include circular binary segmentation (CBS) (e.g., see Olshen et al. (2004) Biostatistics 5(4):557-72; Venkatraman, ES, Olshen, AB (2007) Bioinformatics 23(6):657-63), Haar wavelet segmentation (e.g., see Haar, Alfred (1910) Mathematische Annalen 69(3):331-371), maximum overlap discrete wavelet transform (MODWT) (e.g., see Hsu et al. (2005) Biostatistics 6(2):211-226), stationary wavelet transformation (SWT) (e.g., see Y. Wang and S. Wang (2007) International Journal of Bioinformatics Research and Applications 3(2):206-222), dual-tree complex wavelet transform (DTCWT) (e.g., see Nguyen et al. (2007) Proceedings of the 7th IEEE International Conference, Boston Mass., on Oct. 14-17, 2007, pages 137-144), maximum entropy segmentation, convolution with edge detection kernel, Jensen-Shannon divergence, Kullback-Leibler divergence, binary recursive segmentation, Fourier transform, or combinations thereof.
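As a non-limiting illustration, a decomposition generation process can be sketched with a simplified binary recursive segmentation. This is not circular binary segmentation or any of the wavelet methods cited above; it is a minimal stand-in that repeatedly splits a profile at the point that most reduces the within-segment squared error, and stops when the reduction falls below a hypothetical `min_gain`. All names and default values are assumptions made for this sketch.

```python
from statistics import mean


def best_split(values, min_size):
    """Return (index, gain) of the split that most reduces squared error."""
    n = len(values)
    total = sum(values)
    best_i, best_gain = None, 0.0
    left_sum = sum(values[:min_size - 1])
    for i in range(min_size, n - min_size + 1):
        left_sum += values[i - 1]
        mean_left = left_sum / i
        mean_right = (total - left_sum) / (n - i)
        # Reduction in the sum of squared errors from splitting at i.
        gain = i * (n - i) / n * (mean_left - mean_right) ** 2
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain


def segment(values, offset=0, min_gain=1.0, min_size=2):
    """Return discrete segments as (start, end, mean_level), end exclusive."""
    if len(values) >= 2 * min_size:
        i, gain = best_split(values, min_size)
        if i is not None and gain >= min_gain:
            left = segment(values[:i], offset, min_gain, min_size)
            right = segment(values[i:], offset + i, min_gain, min_size)
            return left + right
    return [(offset, offset + len(values), mean(values))]


# Hypothetical normalized profile with an elevated level at portions 10-19.
profile = [0.0] * 10 + [1.0] * 10 + [0.0] * 5
rendering = segment(profile)
```

The returned list plays the role of a decomposition rendering: each tuple gives the two ends of a discrete segment and its level.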
[0190] In some embodiments, segmentation is achieved by a process comprising one or more subprocesses, non-limiting examples of which include decomposition generation processes, thresholding, leveling, smoothing, polishing, etc., or combinations thereof. Thresholding, leveling, smoothing, polishing, etc., can be performed, for example, in conjunction with the decomposition generation process.
[0191] In some embodiments, decision analysis includes identifying candidate segments in the decomposition rendering. Candidate segments are determined to be the most important discrete segments in the decomposition rendering. Candidate segments may be most important in terms of the number of parts covered by the segment and / or the absolute value of the normalized count level of the segment. Candidate segments may be larger, and in some cases substantially larger, than other discrete segments in the decomposition rendering. Candidate segments can be identified by preferred methods. In some embodiments, candidate segments are identified by area under curve (AUC) analysis. In certain embodiments, if a first distinct segment has a substantially larger level and / or covers a substantially larger number of parts than another distinct segment in the decomposition rendering, then the first segment has a larger AUC. When levels are analyzed for AUC, the absolute value of the level is often used (e.g., a level corresponding to a normalized count may have a negative value for deletions and a positive value for duplications). In certain embodiments, the AUC is determined as the absolute value of the calculated AUC (e.g., the obtained positive value). In certain embodiments, once a candidate segment is identified (e.g., by AUC analysis or by a preferred method) and optionally validated, it is selected for z-score calculation or the like to determine whether the candidate segment represents a gene mutation or gene alteration (e.g., aneuploidy, microdeletion, or microduplication).
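A minimal sketch of candidate selection by AUC follows, assuming each discrete segment is represented as a hypothetical `(start, end, level)` tuple with an exclusive end index. The AUC is taken here as the absolute level multiplied by the number of parts covered, which is one simple reading of the paragraph above.

```python
def segment_auc(start, end, level):
    """Absolute area of a segment: |level| times the number of parts covered."""
    return abs(level) * (end - start)


def candidate_segment(segments):
    """Return the (start, end, level) segment with the largest absolute AUC."""
    return max(segments, key=lambda seg: segment_auc(*seg))


# Hypothetical decomposition rendering: a short, deep loss outweighs a
# longer but shallower gain.
rendering = [(0, 10, 0.02), (10, 16, -0.5), (16, 26, 0.1)]
print(candidate_segment(rendering))  # (10, 16, -0.5)
```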
[0192] In some embodiments, the decision analysis includes comparison. In some embodiments, the comparison includes comparing at least two decomposed renderings. In some embodiments, the comparison includes comparing at least two candidate segments. In certain embodiments, each of the at least two candidate segments is from a different decomposed rendering. For example, the first candidate segment may be from a first decomposed rendering, and the second candidate segment may be from a second decomposed rendering. In some embodiments, the comparison includes determining whether the two decomposed renderings are substantially the same or different. In some embodiments, the comparison includes determining whether the two candidate segments are substantially the same or different. The two candidate segments may be determined to be substantially the same or different by a preferred comparison method, non-limiting examples of which include visual inspection, comparing the levels or Z-scores of the two candidate segments, comparing the ends of the two candidate segments, overlaying the two candidate segments or their corresponding decomposed renderings, or a combination thereof.
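One possible way to compare two candidate segments by their ends and levels is sketched below. The `(start, end, level)` tuple form and the tolerances `max_end_shift` and `max_level_diff` are assumptions introduced for illustration, not values taken from the disclosure.

```python
def substantially_same(seg_a, seg_b, max_end_shift=2, max_level_diff=0.1):
    """Compare two (start, end, level) candidate segments.

    The segments are treated as substantially the same when both ends lie
    within `max_end_shift` parts of each other and the levels differ by
    no more than `max_level_diff`.
    """
    start_a, end_a, level_a = seg_a
    start_b, end_b, level_b = seg_b
    ends_agree = (abs(start_a - start_b) <= max_end_shift
                  and abs(end_a - end_b) <= max_end_shift)
    levels_agree = abs(level_a - level_b) <= max_level_diff
    return ends_agree and levels_agree


# Candidates from two hypothetical decomposed renderings of one profile.
print(substantially_same((10, 16, 0.50), (11, 16, 0.45)))  # True
print(substantially_same((10, 16, 0.50), (30, 40, 0.50)))  # False
```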
Conversion
[0193] As described above, data may be transformed from one form to another. As used herein, the terms “transformed,” “converted,” and their grammatical derivatives or equivalents refer to the transformation of data from a physical starting material (e.g., nucleic acids of the sample under test and / or reference) to a digital representation of the physical starting material (e.g., sequence read data), which in some embodiments includes further transformation of the digital representation to one or more numerical or graphical representations that can be used to provide results. In certain embodiments, one or more numerical and / or graphical representations of the digitally represented data may be used to represent the physical appearance of the genome under test (e.g., representing the presence or absence of insertions, duplications, or deletions in the genome, or representing the presence or absence of variations in physical quantities of sequences associated with a disease condition). The virtual representation may be further transformed to one or more numerical or graphical representations of the digital representation of the starting material. These methods can transform the physical starting material into a numerical or graphical representation, or a representation of the physical appearance of the nucleic acid under test.
[0194] In some embodiments, the transformation of the dataset facilitates the delivery of results by reducing the complexity and / or dimensionality of the data. The complexity of the dataset is sometimes reduced during the process of transforming physical starting materials into virtual representations of those starting materials (e.g., sequence reads representing the physical starting materials). Preferred features or variables can be used to reduce the complexity and / or dimensionality of the dataset. Non-limiting examples of features that can be selected to be used as target features for data processing include GC content, fetal sex prediction, fragment size (e.g., CCF fragment length, read or preferred representation thereof (e.g., FRS)), fragment sequence, copy number variation identification, chromosomal aneuploidy identification, specific gene or protein identification, cancer, disease, gene / trait, chromosomal aberration, biological classification, chemical classification, biochemical classification, gene or protein classification, gene ontology, protein ontology, co-regulated genes, cell signaling genes, cell cycle genes, proteins associated with the above genes, gene variants, protein variants, co-regulated genes, co-regulated proteins, amino acid sequences, nucleotide sequences, protein structure data, and combinations thereof. Non-limiting examples of reducing the complexity and / or dimensionality of a dataset include reducing multiple sequence reads to profile plots, reducing multiple sequence reads to numerical values (e.g., normalized values, Z-scores, p-values), reducing multiple analytical methods to probability plots or single points, principal component analysis of derived quantities, or a combination thereof.
Classification
[0195] The methods described herein can provide results indicating the presence or absence of genomic instability in a test sample. The methods described herein may, in some cases, provide results indicating the presence or absence of genotype and / or gene mutation / alteration in a genomic region of the test sample (e.g., providing results for determining the presence or absence of gene mutation). The methods described herein may, in some cases, provide results indicating the presence or absence of phenotype and / or disease state in the test sample (e.g., providing results for determining the presence or absence of disease state and / or phenotype). The results are often part of a classification process, and the classification (e.g., classification of genomic instability, genotype, phenotype, gene mutation, and / or disease state of the test sample) may, in some cases, be based on and / or include the results. The results and / or classification may, in some cases, include and / or be based on the results of data processing on the test sample that facilitates the determination of the presence or absence of genomic instability, genotype, phenotype, gene mutation, gene alteration, and / or disease status in the classification process (e.g., statistical values (e.g., standard scores (e.g., z-scores))). The results and / or classification may, in some cases, include or be based on scores that determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, gene alteration, and / or disease status in the classification process. In certain embodiments, the results and / or classification include conclusions that predict and / or determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, gene alteration, and / or disease status in the classification process.
In some cases, the results and / or classification are the output of one or more machine learning models trained to predict the presence or absence of genomic instability in the test sample.
[0196] Genotypes and / or gene mutations often involve the acquisition, loss, and / or alteration of regions containing one or more nucleotides (e.g., duplication, deletion, fusion, insertion, short tandem repeat (STR), mutation, single nucleotide change, rearrangement, substitution, or abnormal methylation) that result in a detectable change in the genome or genetic information of the sample being tested. Genotypes and / or gene mutations often reside in specific genomic regions (e.g., chromosomes, parts of chromosomes (i.e., subchromosomal regions), STRs, polymorphic regions, translocated regions, altered nucleotide sequences, or combinations thereof). Gene mutations may, in some cases, be copy number changes in specific regions, such as trisomy or monosomy of a chromosomal region, or microduplication or microdeletion events in specific regions (e.g., acquisition or loss of regions of approximately 10 megabases or less (e.g., approximately 9 megabases or less, 8 megabases or less, 7 megabases or less, 6 megabases or less, 5 megabases or less, 4 megabases or less, 3 megabases or less, 2 megabases or less, or 1 megabase or less)). Copy number changes may sometimes be represented as the absence of copies of a particular region (e.g., a chromosome, subchromosome, STR, microduplication, or microdeletion region), or as having one, two, three, four, or more copies.
[0197] The presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease can be determined by transforming, analyzing, and / or manipulating sequence reads mapped to genomic regions (e.g., counts, counts of genomic regions of a reference genome). In certain embodiments, results and / or classifications are determined according to normalized counts, read density, read density profiles, etc., and can be determined by methods described herein. Results and / or classifications may include one or more scores and / or calls indicating the likelihood of the presence or absence of genomic instability, or a particular genotype, phenotype, gene mutation, or disease in the test sample. The score values can be used, for example, to determine the variation, difference, or ratio of mapped sequence reads that may correspond to genomic instability, genotype, phenotype, gene mutation, or disease. For example, calculating a positive score for genomic instability, or a selected genotype, phenotype, gene mutation, or disease from a dataset with respect to a reference genome can lead to a classification of genomic instability, or genotype, phenotype, gene mutation, or disease in the test sample.
[0198] Any preferred expression of results and / or classifications can be provided. Results and / or classifications may, in some cases, be based on and / or include one or more numerical values generated using the processing methods described herein in the context of one or more probabilistic considerations. Non-limiting examples of values that may be available include sensitivity, specificity, standard deviation, median absolute deviation (MAD), measure of certainty, measure of confidence, measure of certainty or confidence that the values obtained for a test sample are within or outside a specific range of values, measure of uncertainty, measure of uncertainty that the values obtained for a test sample are within or outside a specific range of values, coefficient of variation (CV), confidence level, confidence interval (e.g., about 95% confidence interval), standard score (e.g., z-score), chi value, phi value, t-test result, p-value, ploidy value, fitted minority species fraction, area ratio, median level, etc., or combinations thereof. In some embodiments, results and / or classifications may include read density, read density profile, and / or plot (e.g., profile plot). In certain embodiments, multiple values are analyzed together, and possibly in a profile of such values (e.g., a z-score profile, p-value profile, chi-value profile, phi-value profile, t-test results, value profile, etc., or a combination thereof). Probability considerations can facilitate determining whether a subject is at risk of having, or has, genomic instability, a genotype, a phenotype, a gene mutation, and / or a disease, and the results and / or classifications used to determine the above may, if applicable, include such considerations.
[0199] In certain embodiments, the results and / or classifications include conclusions based on and / or determinations of the risk or probability of genomic instability, genotype, phenotype, gene mutation, and / or disease presence for the test sample. The conclusions may, in some cases, be based on values determined from the data analysis methods described herein (e.g., statistical values indicating probability, certainty, and / or uncertainty (e.g., standard deviation, median absolute deviation (MAD), measure of certainty, measure of confidence, measure of certainty or confidence that the values obtained for a test sample are within or outside a specific range of values, measure of uncertainty, measure of uncertainty that the values obtained for a test sample are within or outside a specific range of values, coefficient of variation (CV), confidence level, confidence interval (e.g., about 95% confidence interval), standard score (e.g., z-score), chi value, phi value, t-test result, p-value, sensitivity, specificity, etc., or a combination thereof)). The results and / or classifications may, in some cases, be based on probabilities (e.g., odds ratio, p-value), likelihood, or risk factors associated with the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease. The results and / or classifications may be expressed in a clinical laboratory report for a specific test sample (described in more detail below). The results and / or classifications for a test sample are, in some cases, provided as “positive” or “negative” with respect to genomic instability, or a specific genotype, phenotype, gene mutation, and / or disease. 
For example, the results and / or classifications may be designated as “positive” in a clinical laboratory report for a specific test sample where the presence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined, and in some cases, the results and / or classifications may be designated as “negative” in a clinical laboratory report for a specific test sample where the absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined. The results and / or classifications may, in some cases, include assumptions that are determined and, in some cases, used in data processing.
[0200] The results and / or classifications may, in some cases, be based on or expressed as values within or outside a cluster, values above or below a threshold, values within a range (e.g., a threshold range), and / or values having a measure of variance or confidence. In some embodiments, the results and / or classifications may be based on or expressed as values above or below a predetermined threshold or cutoff value, and / or measures of uncertainty, confidence level, or confidence interval associated with the value. In certain embodiments, the predetermined threshold or cutoff value is an expected level or expected level range. In some embodiments, the value obtained for a test sample is a standard score (e.g., a z-score), and the presence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined if the absolute value of the score exceeds a certain score threshold (e.g., a threshold of about 2 to about 5, a threshold of about 3 to about 4), and the absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined if the absolute value of the score is below a certain score threshold. In certain embodiments, the results and / or classifications are based on, or expressed as, values that are within or outside a predetermined range of values (e.g., a threshold range), and associated uncertainty or confidence levels regarding whether those values are within or outside the range. In some embodiments, the results and / or classifications include values that are equal to a predetermined value (e.g., equal to 1, equal to zero) or values within a predetermined range, and associated uncertainty or confidence levels regarding whether those values are equal to, within, or outside the range. The results and / or classifications may, in some cases, be represented graphically as plots (e.g., profile plots). 
The results and / or classifications may, in some cases, involve the use of reference values or reference profiles, which may, in some cases, be obtained from one or more reference samples (e.g., euploids of reference samples for selected sites (e.g., regions) of the genome).
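The threshold rule on a standard score described above can be illustrated with the following sketch. The reference values, the function names, and the default threshold of 3 (chosen from within the stated range of about 2 to about 5) are assumptions for this example only.

```python
from statistics import mean, stdev


def z_score(test_value, reference_values):
    """Standard score of a test value against a set of reference values."""
    return (test_value - mean(reference_values)) / stdev(reference_values)


def classify(z, threshold=3.0):
    """Call 'positive' when the absolute z-score exceeds the threshold."""
    return "positive" if abs(z) > threshold else "negative"


# Hypothetical normalized values for one region in euploid reference samples.
reference = [1.00, 1.02, 0.98, 1.01, 0.99]

print(classify(z_score(1.08, reference)))  # positive
print(classify(z_score(1.01, reference)))  # negative
print(classify(z_score(0.92, reference)))  # positive
```

Using the absolute value means that both gains and losses relative to the reference can exceed the threshold.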
[0201] In some embodiments, the results and / or classification are based on or include the use of a measure of uncertainty between the test value or profile and the reference value or profile for a selected region. In some embodiments, the determination of the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is made according to the number of deviations (e.g., sigma) between the test value or profile and the reference value or profile of the selected region (e.g., a chromosome or a portion thereof). The measure of deviation is often the absolute value or absolute scale of the deviation (e.g., mean absolute deviation or median absolute deviation (MAD)). In some embodiments, the presence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined when the number of deviations between the test value or profile and the reference value or profile is approximately 1 or greater (e.g., approximately 1.5, 2, 2.5, 2.6, 2.7, 2.8, 2.9, 3, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 5, or 6 deviations or greater). In certain embodiments, the presence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined when the test value or profile and the reference value or profile differ by a deviation scale of approximately 2 to approximately 5 (e.g., sigma, MAD), or by a deviation scale of more than 3 (e.g., 3 sigma, 3 MAD). A deviation of more than 3 between the test value or profile and the reference value or profile often indicates non-euploidy for the selected region (e.g., the presence of a gene mutation (e.g., trisomy, monosomy, microduplication, microdeletion)). A test value or profile significantly above the reference profile can, when the reference profile is euploid, be determinative of trisomy, subchromosome duplication, or microduplication. 
A test value or profile significantly below the reference profile can, when the reference profile is euploid, be determinative of monosomy, subchromosome deletion, or microdeletion. In some embodiments, the absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined when the deviation between the test value or profile for a selected region of the genome and the reference value or profile is approximately 3.5 or less (e.g., approximately 3.4, 3.3, 3.2, 3.1, 3, 2.9, 2.8, 2.7, 2.6, 2.5, 2.4, 2.3, 2.2, 2.1, 2, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, or 1 or less). In certain embodiments, the absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is determined when the test value or profile differs from the reference value or profile by a deviation of less than 3 (e.g., 3 sigma, 3 MAD). In some embodiments, a measure of deviation less than 3 between the test value or profile and the reference value or profile (e.g., 3 sigma with respect to the standard deviation) often indicates a region that is euploid (e.g., absence of gene mutations). A measure of deviation between the test value or profile for a test sample and the reference value or profile for one or more reference subjects can be plotted and visualized (e.g., a z-score plot).
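A sketch of a direction-aware call using a robust deviation measure follows. It assumes a euploid reference set, expresses the deviation as a signed number of MADs from the reference median, and uses a hypothetical cutoff of 3. The labels "gain", "loss", and "euploid" are illustrative names only.

```python
from statistics import median


def robust_deviation(test_value, reference_values):
    """Signed number of MADs between a test value and the reference median."""
    ref_median = median(reference_values)
    mad = median(abs(v - ref_median) for v in reference_values)
    if mad == 0:
        raise ValueError("reference values have zero MAD")
    return (test_value - ref_median) / mad


def region_call(deviation, cutoff=3.0):
    """Map a signed deviation to a call relative to a euploid reference."""
    if deviation > cutoff:
        return "gain"      # e.g., trisomy or (micro)duplication
    if deviation < -cutoff:
        return "loss"      # e.g., monosomy or (micro)deletion
    return "euploid"


# Hypothetical normalized values for one region in euploid reference samples.
reference = [1.00, 1.02, 0.98, 1.01, 0.99]

print(region_call(robust_deviation(1.05, reference)))  # gain
print(region_call(robust_deviation(0.95, reference)))  # loss
print(region_call(robust_deviation(1.02, reference)))  # euploid
```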
[0202] In some embodiments, results and / or classifications are determined according to a call zone. In certain embodiments, a call is made when a value (e.g., a profile, read density profile, and / or a measure of uncertainty) or set of values falls within a predefined range (e.g., a zone, a call zone) (e.g., a call to determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease). In some embodiments, a call zone is defined according to a set of values (e.g., a profile, read density profile, a measure of probability or determination, and / or a measure of uncertainty) obtained from a particular sample group. In certain embodiments, a call zone is defined according to a set of values originating from the same chromosome or a portion thereof. In some embodiments, the call zone for determining the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease is defined according to a measure of uncertainty determined for the test sample (e.g., a high confidence level or a low uncertainty level) and / or minority nucleic acid species quantification (e.g., minority nucleic acid species of about 1% or more (e.g., about 2, 3, 4, 5, 6, 7, 8, 9, 10% or more)). Minority nucleic acid species quantification may, in some cases, be the fraction or percentage of cancer cell nucleic acids or fetal nucleic acids (i.e., fetal fraction) confirmed for the test sample. In some embodiments, the call zone is defined by a confidence level or confidence interval (e.g., a 95% confidence interval). The call zone is defined in some cases by a confidence level of approximately 90% or higher (e.g., approximately 91, 92, 93, 94, 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9%), or by a confidence interval based on a specific confidence level. In some embodiments, calls are made using the call zone and additional data or information. 
In some embodiments, calls are made without using the call zone. In some embodiments, calls are made based on comparison without using the call zone. In some embodiments, calls are made based on a visual inspection of the profile (e.g., a visual inspection of read density).
[0203] In some embodiments, no classification or call is provided to the test sample when the test value or profile is in the no-call zone. In some embodiments, the no-call zone is defined by a value (e.g., a set of values) or profile that indicates a level of low precision, high risk, high error, low confidence, a measure of high uncertainty, or a combination thereof. In some embodiments, the no-call zone is defined in part by the quantification of minority nucleic acid species (e.g., minority nucleic acid species of about 10% or less (e.g., minority nucleic acid species of about 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1.5%, 1%, or less)). The results and / or classifications generated to determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease state may, in some cases, include null results. A null result may, in some cases, be a data point between two clusters, in some cases be a numerical value with a standard deviation encompassing both values for genomic instability, genotype, phenotype, gene variation, and / or the presence or absence of disease, and in some cases be a dataset having a profile plot that does not resemble the profile plot for subjects under investigation who have or do not have genomic instability, or genotype, phenotype, gene variation, or disease. In some embodiments, results and / or classifications showing a null result are considered definitive results, and the determination may include conclusions regarding the need for additional information to determine the presence or absence of genomic instability, genotype, phenotype, gene variation, and / or disease, and / or the need for repeated data generation and / or analysis.
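The interplay of a call zone and a no-call zone might be sketched as below. The cutoff of 3, the minimum minority fraction of 4%, and the half-unit gray band around the cutoff are hypothetical values selected from or near the ranges mentioned above, and the function name `make_call` is likewise an assumption.

```python
def make_call(z, minority_fraction, z_cutoff=3.0, min_fraction=0.04, gray=0.5):
    """Return 'positive', 'negative', or 'no-call' for a test sample.

    A sample falls in the no-call zone when the minority nucleic acid
    fraction is below `min_fraction`, or when |z| lies within `gray`
    of the cutoff, where the uncertainty of the call is high.
    """
    if minority_fraction < min_fraction:
        return "no-call"
    if abs(abs(z) - z_cutoff) < gray:
        return "no-call"
    return "positive" if abs(z) > z_cutoff else "negative"


print(make_call(5.0, 0.10))  # positive
print(make_call(1.0, 0.10))  # negative
print(make_call(3.2, 0.10))  # no-call (too close to the cutoff)
print(make_call(5.0, 0.02))  # no-call (minority fraction too low)
```

A no-call result of this kind would typically prompt collection of additional information or repeated analysis, consistent with the paragraph above.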
[0204] In some cases, the classification process generates four types of classifications: true positive, false positive, true negative, and false negative. As used herein, the term "true positive" refers to the presence of a genomic instability, genotype, phenotype, gene mutation, or disease that is accurately determined in the test sample. As used herein, the term "false positive" refers to the presence of a genomic instability, genotype, phenotype, gene mutation, or disease that is inaccurately determined in the test sample. As used herein, the term "true negative" refers to the absence of a genomic instability, genotype, phenotype, gene mutation, or disease that is accurately determined in the test sample. As used herein, the term "false negative" refers to the absence of a genomic instability, genotype, phenotype, gene mutation, or disease that is inaccurately determined in the test sample. Two measures of the performance of the classification process can be calculated based on the ratio of these occurrences: (i) sensitivity, which is generally the proportion of actual positives that are accurately identified as positive, and (ii) specificity, which is generally the proportion of actual negatives that are accurately identified as negative.
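For illustration, the two performance measures can be computed from the four classification counts using the standard confusion-matrix definitions, namely sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The counts below are invented for the example and do not represent results of the disclosed method.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that were called positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that were called negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts from a validation set of 1000 samples.
tp, fn, tn, fp = 45, 5, 940, 10

print(f"sensitivity = {sensitivity(tp, fn):.3f}")  # sensitivity = 0.900
print(f"specificity = {specificity(tn, fp):.3f}")  # specificity = 0.989
```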
Genetic Mutations / Genetic Changes and Disease Symptoms
[0205] The presence or absence of a gene mutation can be determined using the methods or apparatus described herein. Gene mutations may also be referred to as gene alterations, and these terms are often used interchangeably herein and in the art. In certain cases, “gene alteration” may be used to describe a somatic alteration (e.g., in tumor or cancer cells) in which the genome in a subset of cells of interest contains that alteration. In certain cases, “gene mutation” may be used to describe a mutation inherited from one or both parents (e.g., a fetal gene mutation).
[0206] In certain embodiments, the presence or absence of one or more gene mutations or genetic alterations is determined according to the results provided by the methods and apparatus described herein. Gene mutations are generally specific genetic phenotypes present in particular individuals, and in many cases, gene mutations are present in statistically significant subpopulations of individuals. In some embodiments, gene mutations or genetic alterations are chromosomal abnormalities or copy number changes (e.g., aneuploidy, duplication of one or more chromosomes, deletion of one or more chromosomes, partial chromosomal abnormalities or mosaicism (e.g., loss or acquisition of one or more regions of a chromosome), translocations, inversions, each of which is described in more detail herein). Non-limiting examples of gene mutations / gene alterations include one or more copy number changes / mutations, deletions (e.g., microdeletions), duplications (e.g., microduplications), insertions, mutations (e.g., single nucleotide mutations, single nucleotide changes), polymorphisms (e.g., single nucleotide polymorphisms), fusions, repeats (e.g., short tandem repeats), distinct methylation sites, distinct methylation patterns, and combinations thereof. Insertions, repeats, deletions, duplications, mutations, or polymorphisms can be of any length, and in some embodiments, they are approximately 1 nucleotide or base pair (bp) to approximately 250 megabases (Mb). In some embodiments, insertions, repeats, deletions, duplications, mutations, or polymorphisms are approximately 1 nucleotide or base pair (bp) to approximately 50,000 kilobases (kb) in length (e.g., approximately 10 bp, 50 bp, 100 bp, 500 bp, 1 kb, 5 kb, 10 kb, 50 kb, 100 kb, 500 kb, 1000 kb, 5000 kb, or 10,000 kb).
[0207] Genetic mutations or alterations are, in some cases, deletions. In certain cases, a deletion is a mutation (e.g., a genetic abnormality) in which a portion of a chromosome or a sequence of DNA is missing. A deletion is often a loss of genetic material. Any number of nucleotides can be deleted. A deletion can include the deletion of one or more entire chromosomes, regions of chromosomes, alleles, genes, introns, exons, any non-coding regions, any coding regions, parts thereof, or combinations thereof. A deletion can include a microdeletion. A deletion can include the deletion of a single base.
[0208] Genetic mutations or genetic alterations are, in some cases, duplications. In certain cases, a duplication is a mutation (e.g., a genetic abnormality) in which a portion of a chromosome or a DNA sequence is copied and inserted back into the genome. In certain embodiments, a genetic duplication (e.g., a duplication) is any duplication of a region of DNA. In some embodiments, a duplication is a nucleic acid sequence that is repeated, often in tandem, within the genome or chromosome. In some embodiments, a duplication can include copies of one or more entire chromosomes, regions of chromosomes, alleles, genes, introns, exons, any non-coding regions, any coding regions, parts thereof, or combinations thereof. A duplication can include microduplications. A duplication may, in some cases, include one or more copies of the duplicated nucleic acid. A duplication may, in some cases, be characterized as a genetic region that is repeated one or more times (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times). In some cases, a duplication can range from a small region (several thousand base pairs) to an entire chromosome. Duplications frequently occur as a result of homologous recombination errors or due to retrotransposon events. Duplications have been associated with certain types of proliferative disorders. Duplications can be characterized using genomic microarrays or comparative genomic hybridization (CGH).
[0209] Genetic mutations or genetic alterations are, in some cases, insertions. Insertions are, in some cases, the addition of one or more nucleotide base pairs to a nucleic acid sequence. Insertions are, in some cases, microinsertions. In certain embodiments, insertions include the addition of a region of a chromosome to a genome, chromosome, or part thereof. In certain embodiments, insertions include the addition of an allele, gene, intron, exon, any non-coding region, any coding region, part thereof, or a combination thereof to a genome or part thereof. In certain embodiments, insertions include the addition (e.g., insertion) of nucleic acids of unknown origin to a genome, chromosome, or part thereof. In certain embodiments, insertions include the addition of a single base (e.g., insertion).
[0210] As used herein, “copy number variation” generally refers to a class or type of gene mutation, gene alteration, or chromosomal abnormality. Copy number variation may also be referred to as copy number polymorphism, and these terms are often used interchangeably herein and in the art. In certain cases, “copy number variation” may be used to describe somatic changes in the genome of a subset of cells of interest (e.g., in tumor or cancer cells). In certain cases, “copy number polymorphism” may be used to describe variations inherited from one or both parents (e.g., fetal copy number polymorphism). Copy number variation can be a deletion (e.g., a microdeletion), a duplication (e.g., a microduplication), or an insertion (e.g., a microinsertion). As used herein, the prefix “micro” often refers to a region of nucleic acid less than 5 Mb in length. Copy number variation may include one or more deletions (e.g., microdeletions), duplications (e.g., microduplications), and / or insertions (e.g., microinsertions) of a portion of a chromosome. In certain embodiments, a duplication includes an insertion. In certain embodiments, the insertion is a duplication. In certain embodiments, the insertion is not a duplication.
[0211] In some embodiments, the copy number variation is a copy number variation from a tumor or cancer cell. In some embodiments, the copy number variation is a copy number variation from a non-cancer cell. In certain embodiments, the copy number variation is a copy number variation within the genome of a subject (e.g., a cancer patient) and / or within the genome of cancer cells or tumors in the subject. The copy number variation can be a heterozygous copy number variation in which the mutation (e.g., duplication or deletion) is present on one allele of the genome. The copy number variation can be a homozygous copy number variation in which the change is present on both alleles of the genome. In some embodiments, the copy number variation is heterozygous or homozygous copy number variation. In some embodiments, the copy number variation is a heterozygous or homozygous copy number variation from cancer cells or non-cancer cells. The copy number variation may, in some cases, be present in the cancer cell genome and the non-cancer cell genome, in the cancer cell genome rather than the non-cancer cell genome, or in the non-cancer cell genome rather than the cancer cell genome.
[0212] In some embodiments, the copy number variation is a fetal copy number variation. Often, a fetal copy number variation is a copy number variation in the fetal genome. In some embodiments, the copy number variation is a maternal and / or fetal copy number variation. In certain embodiments, a maternal and / or fetal copy number variation is a copy number variation in the genome of a pregnant woman (e.g., a woman with a fetus), a woman who has given birth, or a woman capable of bearing a fetus. A copy number variation can be a heterozygous copy number variation in which the variation (e.g., duplication or deletion) is present on one allele of the genome. A copy number variation can be a homozygous copy number variation in which the variation is present on both alleles of the genome. In some embodiments, the copy number variation is a heterozygous or homozygous fetal copy number variation. In some embodiments, the copy number variation is a heterozygous or homozygous maternal and / or fetal copy number variation. The copy number variation may, in some cases, be present in the maternal genome and the fetal genome, in the maternal genome but not the fetal genome, or in the fetal genome but not the maternal genome.
[0213] "Ploidy" refers to the number of chromosomes present in a subject. In certain embodiments, "ploidy" is synonymous with "chromosome ploidy." For example, in humans, autosomes often exist in pairs: in the absence of gene mutations or alterations, most humans have two copies of each autosome (e.g., chromosomes 1-22). The presence of the normal complement of two copies of each autosome in humans is often referred to as euploidy or diploidy. "Microploidy" has a similar meaning to ploidy and often refers to the ploidy of a portion of a chromosome. The term "microploidy" may sometimes refer to the presence or absence of copy number alterations within a chromosome (e.g., deletions, duplications, and / or insertions) (e.g., homozygous or heterozygous deletions, duplications, or insertions, or their absence).
[0214] In certain embodiments, the presence or absence of a gene mutation or genetic alteration in a subject is associated with a medical condition. Therefore, the techniques described herein can be used to identify the presence or absence of one or more gene mutations or genetic alterations associated with a medical condition or disease. Non-limiting examples of medical conditions include those associated with intellectual disability (e.g., Down syndrome), abnormal cell proliferation (e.g., cancer), the presence of microbial nucleic acids (e.g., viruses, bacteria, fungi, yeast), and pre-eclampsia.
[0215] The following are non-limiting examples of gene mutations / genetic alterations, medical conditions, and states.
Chromosomal abnormalities
[0216] In some embodiments, the presence or absence of chromosomal abnormalities can be determined by using the methods and / or apparatus described herein. Chromosomal abnormalities include, but are not limited to, copy number variations and the gain or loss of an entire chromosome or of a region of a chromosome containing one or more genes. Chromosomal abnormalities include monosomy, trisomy, polysomy, loss of heterozygosity, translocation, and deletion and / or duplication of one or more nucleotide sequences (e.g., one or more genes), including deletions and duplications caused by unbalanced translocations. As used herein, the terms “chromosomal abnormality” and “aneuploidy” refer to a deviation between the structure of the chromosome in question and a normal homologous chromosome. The term “normal” refers to the predominant karyotype or banding pattern found in healthy individuals of a particular species, e.g., a euploid genome (e.g., diploid in humans, e.g., 46,XX or 46,XY). Because different organisms have widely varying chromosomal complements, the term “aneuploidy” does not refer to a specific number of chromosomes, but rather to a situation in which the chromosome content within a given cell or cells of an organism is abnormal. In some embodiments, the term “aneuploidy” refers herein to an imbalance of genetic material caused by the loss or gain of an entire chromosome or part of a chromosome. “Aneuploidy” can also refer to the deletion and / or insertion of one or more regions of a chromosome. In some embodiments, the term “euploidy” refers to the normal complement of chromosomes.
[0217] As used herein, the term “monosomy” refers to the absence of one chromosome of the normal complement. Partial monosomy can occur in an unbalanced translocation or deletion in which only a portion of a chromosome is present in a single copy. For example, monosomy of a sex chromosome (45,X) causes Turner syndrome. The term “disomy” refers to the presence of two copies of a chromosome. For organisms such as humans that have two copies of each chromosome (those that are diploid or “euploid”), disomy is the normal condition. For organisms that typically have three or more copies of each chromosome (those that are triploid or greater), disomy is an aneuploid chromosome condition. In uniparental disomy, both copies of the chromosome come from the same parent (without contribution from the other parent).
[0218] As used herein, the term “trisomy” refers to the presence of three copies of a particular chromosome instead of two. The presence of an additional chromosome 21, as seen in human Down syndrome, is referred to as “trisomy 21.” Trisomy 18 and trisomy 13 are two other human autosomal trisomies. Sex chromosome trisomies can be found in females (e.g., 47,XXX in triple X syndrome) or males (e.g., 47,XXY in Klinefelter syndrome, or 47,XYY in Jacobs syndrome). In some embodiments, a trisomy is a duplication of most or all of an autosome. In certain embodiments, a trisomy is a whole-chromosome aneuploidy resulting in three instances (e.g., three copies) of a particular type of chromosome (e.g., instead of the two instances (e.g., a pair) of a particular type of chromosome seen in euploidy).
[0219] As used herein, the terms “tetrasomy” and “pentasomy” refer to the presence of four or five copies of a chromosome, respectively. Although rarely seen in autosomes, sex chromosome tetrasomies and pentasomies have been reported in humans, including XXXX, XXXY, XXYY, XYYY, XXXXX, XXXXY, XXXYY, XXYYY, and XYYYY.
[0220] In some embodiments, the present disclosure provides a method for determining whether a test sample contains a trisomy, e.g., trisomy 21, trisomy 18, or trisomy 13. The method includes providing a set of genomic portions, each associated with a copy number change quantification for the test sample. The genomic portions include portions of a reference genome to which sequence reads obtained for the sample nucleic acid from the test subject are mapped, and the copy number change quantification associated with each genomic portion is determined from the quantification of sequence reads mapped to that genomic portion. The method uses a computing device to filter from the set of genomic portions a first subset of portions associated with copy number changes consistently present in a reference set of samples, and / or a second subset of portions other than representative portions derived from subgroups identified by a clustering process applied to the reference set of samples, thereby generating a filtered set of genomic portions that does not include the first subset and / or the second subset of portions. The method then transforms the filtered set of genomic portions into a parameter-reduced set by performing a principal component transformation of the filtered set of genomic portions, which yields the principal components of the test sample. The generated principal components can be represented in a principal component space containing a common principal component origin. Trisomic samples typically group together and follow distinct patterns or directions; for example, they form vectors in a two-dimensional principal component space, planes in a three-dimensional principal component space, or hyperplanes in an n-dimensional principal component space.
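The filtering and principal component transformation described in the preceding paragraph can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the function name `project_onto_principal_components`, the boolean `keep_mask` used to represent the filtered portions, and the synthetic data are all hypothetical, and NumPy's singular value decomposition stands in for whichever principal component routine is actually used.

```python
import numpy as np

def project_onto_principal_components(reference, test, keep_mask, n_components=2):
    """Fit principal axes on filtered reference portions and project a test sample.

    reference: (n_samples, n_portions) copy number quantifications (hypothetical input)
    test:      (n_portions,) quantifications for one test sample
    keep_mask: boolean (n_portions,), False for portions removed by filtering
    """
    ref = reference[:, keep_mask]
    origin = ref.mean(axis=0)              # common principal component origin
    centered = ref - origin
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]
    ref_scores = centered @ axes.T         # reference samples in PC space
    test_scores = (test[keep_mask] - origin) @ axes.T
    return ref_scores, test_scores

# Synthetic illustration only; no real sequencing data is used.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 0.05, size=(200, 50))
keep_mask = np.ones(50, dtype=bool)
keep_mask[:5] = False                      # assume the first five portions were filtered out
test = rng.normal(0.0, 0.05, size=50)
test[10:20] += 0.5                         # simulated copy number gain
ref_scores, test_scores = project_onto_principal_components(reference, test, keep_mask)
```

Because the reference samples are centered before projection, their scores average to zero, which is what places them around the common origin of the principal component space.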
Medical disorders and illnesses
[0221] The methods described herein may be applicable to any suitable medical disorder or condition. Non-limiting examples of medical disorders and conditions include cytoproliferative disorders and conditions, wasting disorders and conditions, degenerative disorders and conditions, autoimmune disorders and conditions, preeclampsia, chemical or environmental toxicity, liver injury or disease, kidney injury or disease, vascular disease, hypertension, and myocardial infarction.
[0222] In some embodiments, the cytoproliferative disorder or condition is cancer, a tumor, a neoplasm, metastatic disease, or a combination thereof. The cytoproliferative disorder or condition may be, in some cases, a disorder or condition of the liver, lungs, spleen, pancreas, colon, skin, bladder, eyes, brain, esophagus, head, neck, ovaries, testes, prostate, etc., or a combination thereof. Non-limiting examples of cancer include hematopoietic neoplastic disorders, which involve hyperplastic / neoplastic cells of hematopoietic origin (e.g., arising from myeloid, lymphoid, or erythroid cells, or their progenitor cells) and may result from poorly differentiated acute leukemias (e.g., erythroblastic leukemia and acute megakaryoblastic leukemia). Specific myeloid disorders include, but are not limited to, acute promyelocytic leukemia (APML), acute myeloid leukemia (AML), and chronic myeloid leukemia (CML). Specific lymphoid malignancies include, but are not limited to, B-lineage acute lymphoblastic leukemia (ALL) and T-lineage ALL, chronic lymphocytic leukemia (CLL), prolymphocytic leukemia (PLL), hairy cell leukemia (HLL), and Waldenström macroglobulinemia (WM). Specific forms of malignant lymphoma include, but are not limited to, non-Hodgkin lymphoma and its variants, peripheral T-cell lymphoma, adult T-cell leukemia / lymphoma (ATL), cutaneous T-cell lymphoma (CTCL), large granular lymphocytic leukemia (LGLL), Hodgkin's disease, and Reed-Sternberg disease. Cellular proliferative disorders may, in some cases, be non-endocrine or endocrine tumors. Examples of non-endocrine tumors include, but are not limited to, adenocarcinoma, acinar cell carcinoma, adenosquamous cell carcinoma, giant cell tumor, intraductal papillary mucinous neoplasm, mucinous cystadenoma, pancreatoblastoma, serous cystadenoma, and solid pseudopapillary neoplasm. Endocrine tumors may, in some cases, be islet cell tumors.
[0223] In some embodiments, the present disclosure provides a method for determining whether a test sample contains cancer. The method includes providing a set of genomic portions, each associated with a copy number change quantification for the test sample. The genomic portions include portions of a reference genome to which sequence reads obtained for the sample nucleic acid from the test subject are mapped, and the copy number change quantification associated with each genomic portion is determined from the quantification of sequence reads mapped to that genomic portion. The method uses a computing device to filter from the set of genomic portions a first subset of portions associated with copy number changes consistently present in a reference set of samples, and / or a second subset of portions other than representative portions derived from subgroups identified by a clustering process applied to the reference set of samples, thereby generating a filtered set of genomic portions that does not include the first subset of portions and / or the second subset of portions. The method then transforms the filtered set of genomic portions into a parameter-reduced set by performing a principal component transformation on the filtered set of genomic portions, which yields the principal components of the test sample. The generated principal components can be represented in a principal component space containing a common principal component origin. The distance between the principal components of a sample and the common principal component origin can be determined and compared to a predetermined threshold. Tumor samples with abnormal CNA events show outlier behavior outside the central cloud, and therefore the distance determined for such samples is typically greater than the predetermined threshold. In some cases, the distance is the Mahalanobis distance, and the predetermined threshold is greater than 300, e.g., greater than 400, greater than 450, greater than 500, or about 500 (i.e., a cutoff log10 Mahalanobis distance between 2 and 3).
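The distance comparison described above can be illustrated with a short sketch. It assumes that principal component scores are already available for a reference set and for the test samples; the function name `mahalanobis_distances`, the synthetic scores, and the choice to estimate the covariance from the reference scores are assumptions made for illustration. The text does not state whether the cutoff applies to the distance or to its square, so the sketch uses the plain (unsquared) distance.

```python
import numpy as np

def mahalanobis_distances(reference_scores, test_scores):
    """Mahalanobis distance of each test row from the origin of the reference cloud."""
    origin = reference_scores.mean(axis=0)
    cov = np.cov(reference_scores, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    delta = np.atleast_2d(test_scores) - origin
    # d^2 = delta^T * inv_cov * delta, evaluated row by row
    d2 = np.einsum("ij,jk,ik->i", delta, inv_cov, delta)
    return np.sqrt(d2)

THRESHOLD = 500.0  # one of the example cutoffs named in the text

# Synthetic principal component scores, for illustration only.
rng = np.random.default_rng(1)
reference_scores = rng.normal(0.0, 1.0, size=(500, 3))
typical = np.array([0.5, -0.3, 0.2])         # sample inside the central cloud
outlier = np.array([600.0, 500.0, -400.0])   # sample far outside the central cloud
distances = mahalanobis_distances(reference_scores, np.vstack([typical, outlier]))
flags = distances > THRESHOLD                # True marks a sample beyond the cutoff
log10_distances = np.log10(distances)
```

A cutoff of 300 to 500 corresponds to a log10 distance of roughly 2.48 to 2.70, which is consistent with the stated range of 2 to 3.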
[0224] In some embodiments, the debilitating disorder or condition, or the degenerative disorder or condition, is cirrhosis of the liver, amyotrophic lateral sclerosis (ALS), Alzheimer's disease, Parkinson's disease, multiple system atrophy, atherosclerosis, progressive supranuclear palsy, Tay-Sachs disease, diabetes mellitus, heart disease, keratoconus, inflammatory bowel disease (IBD), prostatitis, osteoarthritis, osteoporosis, rheumatoid arthritis, Huntington's disease, chronic traumatic encephalopathy, chronic obstructive pulmonary disease (COPD), tuberculosis, chronic diarrhea, acquired immunodeficiency syndrome (AIDS), superior mesenteric artery syndrome, or a combination thereof.
[0225] In some embodiments, the autoimmune disorder or condition is acute disseminated encephalomyelitis (ADEM), Addison's disease, alopecia areata, ankylosing spondylitis, antiphospholipid syndrome (APS), autoimmune hemolytic anemia, autoimmune hepatitis, autoimmune inner ear disease, bullous pemphigoid, celiac disease, Chagas disease, chronic obstructive pulmonary disease, Crohn's disease (a type of idiopathic inflammatory bowel disease "IBD"), dermatomyositis, type 1 diabetes, endometriosis, Goodpasture syndrome, Graves' disease, Guillain-Barré syndrome (GBS), Hashimoto's disease, hidradenitis suppurativa, idiopathic thrombocytopenic purpura, interstitial cystitis, systemic lupus erythematosus, mixed connective tissue disease, morphea, multiple sclerosis (MS), myasthenia gravis, narcolepsy, neuromyotonia, pemphigus vulgaris, pernicious anemia, polymyositis, primary biliary cirrhosis, rheumatoid arthritis, schizophrenia, scleroderma, Sjögren's syndrome, temporal arteritis (also known as "giant cell arteritis"), ulcerative colitis (a type of idiopathic inflammatory bowel disease "IBD"), vasculitis, vitiligo vulgaris, Wegener's granulomatosis, or combinations thereof.
Preeclampsia
[0226] In some embodiments, the presence or absence of preeclampsia is determined by using the methods or apparatus described herein. Preeclampsia is a condition in which high blood pressure develops during pregnancy (e.g., pregnancy-induced hypertension) and is associated with significant amounts of protein in the urine. In certain cases, preeclampsia may be associated with elevated levels of extracellular nucleic acids and / or altered methylation patterns. For example, a positive correlation has been observed between levels of extracellular, fetal-derived hypermethylated RASSF1A and the severity of preeclampsia. In certain cases, increased DNA methylation of the H19 gene is observed in preeclamptic placentas compared to normal controls.
Pathogens
[0227] In some embodiments, the presence or absence of a pathogenic state is determined by the methods or apparatus described herein. Pathogenic states can be caused by infection of a host by pathogens, including, but not limited to, bacteria, viruses, or fungi. Since pathogens typically possess nucleic acids (e.g., genomic DNA, genomic RNA, mRNA) that can be distinguished from host nucleic acids, the methods, machines, and apparatus provided herein can be used to determine the presence or absence of pathogens. Often, pathogens possess nucleic acids that have characteristics specific to a particular pathogen, such as epigenetic states and / or one or more sequence mutations, duplications, and / or deletions. Therefore, the methods provided herein can be used to identify a particular pathogen or pathogen variant (e.g., a strain).
Use of cell-free nucleic acids
[0228] In certain cases, nucleic acids from abnormal or affected cells associated with a specific condition or disorder are released from cells as circulating cell-free nucleic acids (CCF-NA). For example, cancer cell nucleic acids are present in CCF-NA, and analysis of CCF-NA using the methods provided herein can be used to determine whether a subject has cancer or is at risk of having it. Analysis of the presence or absence of cancer cell nucleic acids in CCF-NA can be used, for example, in cancer screening. In certain cases, serum CCF-NA levels may be elevated in patients with various types of cancer compared to healthy patients. For example, patients with metastatic disease may, in some cases, have serum DNA levels approximately twice as high as those without metastatic disease. Therefore, the methods described herein can provide results by processing sequencing read counts obtained from CCF-NA extracted from a subject (e.g., a subject who has, is suspected of having, is susceptible to, or is suspected of being susceptible to a particular condition or disease).
Markers
[0229] In certain cases, polynucleotides in abnormal or diseased cells are modified relative to nucleic acids in normal or non-diseased cells (e.g., single nucleotide changes, single nucleotide mutations, copy number changes, copy number polymorphisms). In some cases, polynucleotides are present in abnormal or diseased cells but not in normal or non-diseased cells, and in some cases, polynucleotides are not present in abnormal or diseased cells but are present in normal or non-diseased cells. Therefore, markers may, in some cases, be single nucleotide changes / mutations and / or copy number changes / mutations (e.g., differentially expressed DNA or RNA (e.g., mRNA)). Patients with metastatic disease may be identified, for example, by cancer-specific markers and / or specific single nucleotide polymorphisms or short tandem repeats. Non-limiting examples of cancer types that may be positively correlated with elevated levels of circulating DNA include breast cancer, colorectal cancer, gastrointestinal cancer, hepatocellular carcinoma, lung cancer, melanoma, non-Hodgkin lymphoma, leukemia, multiple myeloma, bladder cancer, liver cancer, cervical cancer, esophageal cancer, pancreatic cancer, and prostate cancer. Various cancers may possess nucleic acids with features distinguishable from those derived from non-cancerous healthy cells, such as epigenetic states and / or sequence mutations, duplications, and / or deletions, and may in some cases release them into the bloodstream. Such features may, for example, be specific to certain types of cancer. Thus, the methods described herein may, in some cases, provide results based on determining the presence or absence of a particular marker, and in some cases, the result is the presence or absence of a particular type of state (e.g., a particular type of cancer, or simply the presence or absence of cancer of any type).
[0230] In block 135, the results of the analysis may be filtered using a filter subsystem. In some cases, the results of the analysis are filtered by intensity and / or significance. In certain cases, the results of the analysis are filtered by the intensity and / or significance of the identified genomic instability. Intensity and / or significance may be determined using one or more processes described herein. For example, the filtering step may include the use of one or more statistical algorithms, decision analysis, comparison, and / or similar processing steps, as described in detail herein.
[0231] In some embodiments, the results of the analysis can be filtered using any number of suitable statistical algorithms. Non-limiting examples of statistical algorithms suitable for use in the filtering described herein include principal component analysis, decision trees, counter nulls, multiple comparisons, omnibus tests, the Behrens-Fisher problem, bootstrap, Fisher's method for combining independent significance tests, null hypotheses, Type I errors, Type II errors, exact probability tests, one-sample Z-tests, two-sample Z-tests, one-sample t-tests, paired t-tests, two-sample pooled t-tests with equal variances, two-sample unpooled t-tests with unequal variances, one-proportion Z-tests, pooled two-proportion Z-tests, unpooled two-proportion Z-tests, one-sample chi-squared tests, two-sample F-tests for equal variances, confidence intervals, credible intervals, significance, meta-analysis, simple linear regression, robust linear regression, or combinations thereof.
[0232] In some embodiments, the results of the analysis can be filtered by performing comparisons (e.g., comparing a test profile with a reference profile). Two or more datasets, two or more relationships, and / or two or more profiles can be compared in any suitable manner. Non-limiting examples of statistical methods suitable for comparing datasets, relationships, and / or profiles include the Behrens-Fisher method, the bootstrap method, Fisher's method for combining independent significance tests, the Neyman-Pearson test, confirmatory data analysis, exploratory data analysis, exact probability tests, F-tests, Z-tests, t-tests, calculation and / or comparison of uncertainty measures, null hypotheses, counter-null hypotheses, chi-squared tests, omnibus tests, calculation and / or comparison of significance levels (e.g., statistical significance), meta-analysis, multivariate analysis, regression, simple linear regression, robust linear regression, or combinations of the above. In certain embodiments, comparing two or more datasets, relationships, and / or profiles includes determining and / or comparing uncertainty measures.
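As one illustration of filtering by comparing a test profile with a reference profile, the sketch below applies a simple z-score comparison per region and keeps only the regions whose two-sided p-value falls below a cutoff. The region names, the values, the cutoff `ALPHA`, and the function name `z_score_vs_reference` are hypothetical, and the normal approximation is an assumption (with so few reference values, a t-based test might be more appropriate). It is a sketch of one of the listed options, not the disclosed filter.

```python
from statistics import NormalDist, mean, stdev

ALPHA = 0.001  # hypothetical significance cutoff

def z_score_vs_reference(test_value, reference_values):
    """Return (z, two-sided p-value) for a test value against reference values."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)          # sample standard deviation
    z = (test_value - mu) / sigma
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical normalized coverage values for two regions.
reference_profile = {
    "region_A": [1.00, 0.98, 1.02, 1.01, 0.99, 1.00, 1.03, 0.97],
    "region_B": [1.00, 0.98, 1.02, 1.01, 0.99, 1.00, 1.03, 0.97],
}
test_profile = {"region_A": 1.15, "region_B": 1.01}

results = {
    name: z_score_vs_reference(test_profile[name], ref)
    for name, ref in reference_profile.items()
}
# Filtering step: keep only regions whose deviation is significant.
significant = {name: zp for name, zp in results.items() if zp[1] < ALPHA}
```

In this example only `region_A` survives the filter, because its test value lies many reference standard deviations from the reference mean.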
[0233] In block 140, a reporting subsystem is used to generate a clinical laboratory report for the results of the analysis and / or the filtered analysis. In some cases, the filtering and / or clinical laboratory reporting process includes measures of test performance (e.g., sensitivity and / or specificity) and / or confidence (e.g., confidence level, confidence interval). Measures of test performance and / or confidence may, in some cases, be derived from clinical validation studies conducted before clinical testing of the test samples. In certain embodiments, one or more of sensitivity, specificity, and / or confidence are expressed as percentages. In some embodiments, the percentages expressed independently for each of sensitivity, specificity, or confidence level are greater than approximately 90% (e.g., about 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99%, or greater than 99% (e.g., greater than or equal to about 99.5%, greater than or equal to about 99.9%, greater than or equal to about 99.95%, greater than or equal to about 99.99%)). Confidence intervals expressed for a specific confidence level (e.g., a confidence level of approximately 90% to approximately 99.9% (e.g., approximately 95%)) can be expressed as a range of values, and in some cases, as a range of sensitivity and / or specificity at a specific confidence level. In some embodiments, the coefficient of variation (CV) is expressed as a percentage, and in some cases, the percentage is approximately 10% or less (e.g., approximately 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1%), or less than 1% (e.g., approximately 0.5% or less, approximately 0.1% or less, approximately 0.05% or less, approximately 0.01% or less). In certain embodiments, probabilities (e.g., the probability that a particular outcome and / or classification is not due to chance) are expressed as a standard score (e.g., a z-score), a p-value, or the result of a t-test. In some embodiments, measured variances, confidence levels, confidence intervals, sensitivity, specificity, etc. (collectively referred to as confidence parameters) for results and / or classifications can be generated using one or more data processing operations described herein. Specific examples of generating results and / or classifications, and associated confidence levels, are described, for example, in International Patent Application Publication Nos. 2013 / 052913, 2014 / 190286, and 2015 / 051163, each of which is incorporated herein by reference in its entirety for all purposes.
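The performance and confidence measures named above can be computed as in the following sketch. The counts and replicate values are invented for illustration, the function names are hypothetical, and the Wilson score interval is simply one common way to attach a confidence interval to a proportion; the text does not specify which interval method is used.

```python
import math
from statistics import NormalDist, mean, stdev

def wilson_interval(successes, total, confidence=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half, center + half

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def coefficient_of_variation(values):
    """Coefficient of variation, expressed as a percentage."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical validation counts and replicate measurements.
tp, fn, tn, fp = 95, 5, 990, 10
sensitivity, specificity = sensitivity_specificity(tp, fn, tn, fp)
sens_low, sens_high = wilson_interval(tp, tp + fn, confidence=0.95)
cv_percent = coefficient_of_variation([10.1, 9.9, 10.0, 10.2, 9.8])
```

Here the sensitivity is 95% and the specificity is 99%, and the interval `(sens_low, sens_high)` expresses the sensitivity as a range at the 95% confidence level.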
[0234] The results and / or classifications for a test sample are often ordered by, and in many cases provided to, a healthcare professional or other qualified person (e.g., a physician or assistant), who transmits the results and / or classifications to the subject from which the test sample was obtained. In certain embodiments, the results and / or classifications are provided using a suitable visual medium (e.g., a peripheral device or component of the machine, e.g., a printer or display). The classifications and / or results are often provided to a healthcare professional or qualified person in the form of a report. The report typically includes a representation of the results and / or classifications (e.g., values for the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease status, or an assessment or probability), possibly including relevant confidence parameters, and possibly including a measure of the performance of the test used to produce the results and / or classifications. The report may also include recommendations for follow-up procedures (such as procedures to verify the results or classifications). The report may include, in some cases, a visual representation of a chromosome or a portion thereof (e.g., a chromosome ideogram or karyogram), and may also show visualizations of duplicated and / or deleted regions for the chromosomes identified in the test sample (e.g., visualization of an entire chromosome for a chromosomal deletion or duplication, visualization of an entire chromosome showing a deleted or duplicated region, visualization of a duplicated or deleted portion of a chromosome, visualization of the portion of a chromosome remaining after a deletion of a portion of that chromosome).
[0235] The report may be presented in a format suitable for facilitating the determination of genomic instability, genotype, phenotype, gene mutation, and / or the presence or absence of disease by a medical professional or other qualified person. Non-limiting examples of formats suitable for use in generating the report include digital data, graphs, 2D graphs, 3D graphs, 4D graphs, images (e.g., jpg, bitmap (e.g., bmp), pdf, tiff, gif, raw, png, etc., or any other suitable format), pictograms, charts, tables, bar graphs, pie charts, diagrams, flowcharts, scatter plots, maps, histograms, density charts, function graphs, circuit diagrams, block diagrams, bubble maps, constellation charts, contour plots, cartograms, spider charts, Venn diagrams, nomograms, etc., or any combination thereof.
[0236] The report may be generated by computer and / or human data entry and may be transmitted and communicated using a suitable electronic medium (e.g., via the Internet, via computer, via facsimile, from one network location to another with the same or different physical sites), or by another method of sending or receiving data (e.g., postal service, courier service, etc.). Non-limiting examples of communication media for transmitting the report include audio files, computer-readable files (e.g., PDF files), paper files, laboratory files, medical record files, or any other media described in the preceding paragraph. In certain embodiments, the laboratory file or medical record file may be in tangible or electronic form (e.g., computer-readable form). After the report has been generated and transmitted, it may be received via a suitable communication medium by obtaining a written and / or graphical representation including the results and / or classification, which, at the time of review, allows a medical professional or other qualified person to make a determination regarding genomic instability, genotype, phenotype, gene mutation, and / or presence or absence of disease in the test sample.
[0237] Results and / or classifications are, in some cases, provided by and obtained from a laboratory (e.g., from a laboratory file). A laboratory file may be generated by a laboratory that performs one or more tests on a test sample to determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease. Laboratory staff (e.g., a laboratory administrator) can analyze information associated with the test sample that forms the basis of the results and / or classification (e.g., test profile, reference profile, test values, reference values, deviation levels, patient information). If a call related to genomic instability, genotype, phenotype, gene mutation, and / or disease presence is borderline or questionable, laboratory staff may re-run the same procedure using the same test sample (e.g., an aliquot of the same sample) or a different test sample from the test subject. The laboratory may be at the same location as, or a different location (e.g., a different country) from, the staff who evaluate genomic instability, phenotype, gene mutation, and / or disease presence from the laboratory file. For example, a laboratory file may be generated in one location and transmitted to another location, where the information about the test sample within it is evaluated by a medical professional or other qualified person and optionally transmitted to the subject from which the test sample was obtained. The laboratory may, in some cases, generate and / or transmit a laboratory report that includes a classification of genomic instability, genotype, phenotype, gene mutation, and / or presence or absence of disease for the test samples. The laboratory generating the clinical laboratory report may, in some cases, be an accredited laboratory, and in some cases, a laboratory certified under the Clinical Laboratory Improvement Amendments (CLIA).
[0238] Results and / or classifications are, in some cases, components of a diagnosis of a subject, and in some cases, results and / or classifications are used and / or evaluated as part of providing a diagnosis for a test sample. For example, a medical professional or other qualified person may analyze the results and / or classifications and provide a diagnosis based on, or based in part on, the results and / or classifications. In some embodiments, the determination, detection, or diagnosis of a condition, disease, syndrome, or abnormality involves the use of results and / or classifications to determine the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or condition. In some embodiments, results and / or classifications based on counted mapped sequence reads, normalized counts, and / or their transformations determine the presence or absence of genomic instability, genotype, and / or gene mutation. In certain embodiments, the diagnosis includes the determination of the presence or absence of a condition, syndrome, or abnormality. In certain cases, the diagnosis includes the determination of genomic instability and / or gene mutations as the nature and / or cause of the condition, disease, syndrome, or abnormality. Accordingly, the present specification provides a method for diagnosing genomic instability, genotype, phenotype, gene mutation, and / or the presence or absence of disease in a test sample by means of the methods described herein, and optionally by generating and transmitting a laboratory report that includes a classification of genomic instability, genotype, phenotype, gene mutation, and / or the presence or absence of disease in the test sample.
[0239] The results and / or classifications are, in some cases, components of healthcare and / or treatment of the subject. The results and / or classifications are, in some cases, used and / or evaluated as part of providing treatment to the subject from which the test sample was obtained. For example, results and / or classifications indicating the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease are components of healthcare and / or treatment of the subject from which the test sample was obtained. Medical care, treatment, and / or diagnosis may be in any preferred area of health, such as prenatal care, medical care for cell proliferation status, cancer, etc. The results and / or classifications determining the presence or absence of genomic instability, genotype, phenotype, gene mutation, and / or disease, syndrome, or abnormality by the methods described herein may, in some cases, be independently verified by further tests. Any suitable type of further testing may be used to verify the results and / or classification, and non-limiting examples include, for example, blood tests (e.g., serological tests), biopsies, scans (e.g., CT scans, MRI scans), invasive sampling (e.g., amniocentesis or chorionic villus sampling), karyotyping, microarray assays, ultrasound, sonograms, etc.
[0240] A healthcare professional or other qualified person may provide appropriate medical recommendations based on the results and / or classifications provided in the laboratory report. In some embodiments, the recommendations depend on the results and / or classifications provided (e.g., cancer, stage and / or type of cancer, Down syndrome, Turner syndrome, conditions associated with genetic variations in T13 (trisomy 13), conditions associated with genetic variations in T18 (trisomy 18)). Non-limiting examples of recommendations that may be provided based on the results or classifications in the laboratory report include surgery, radiotherapy, chemotherapy, genetic counseling, postnatal treatment (e.g., life planning, long-term care, medication, symptomatic treatment), abortion, organ transplantation, blood transfusion, further testing as described in the previous paragraph, or a combination thereof. Therefore, methods for treating subjects and methods for providing healthcare to subjects may optionally include generating a classification of the presence or absence of genomic instability, a genotype, a phenotype, a gene mutation, and / or a disease for a test sample by the methods described herein, and optionally generating and transmitting a laboratory report that includes the classification.
[0241] The generation of results and / or classifications can be considered as the conversion of nucleic acid sequence reads from the test sample to a representation of the cellular nucleic acid of the subject. For example, the conversion of nucleic acid sequence reads from a subject by the method described herein to generate results and / or classifications can be considered as the conversion of relatively small sequence read fragments to a representation of the relatively large and complex structure of nucleic acids in the subject. In some embodiments, the results and / or classifications arise from the conversion of sequence reads from the subject to a representation of an existing nucleic acid structure present in the subject (e.g., a mixture of genome, chromosomes, chromosomal segments, and circulating cell-free nucleic acid fragments in the subject).
[0242] In some embodiments, the methods herein include treating a subject when the presence of a genetic alteration or mutation is determined in a test sample from that subject. In some embodiments, treating a subject includes performing a medical procedure when the presence of a genetic alteration or mutation is determined in a test sample. In some embodiments, the medical procedure includes, for example, an invasive diagnostic procedure such as amniocentesis, chorionic villus sampling, or biopsy. For example, a medical procedure including amniocentesis or chorionic villus sampling may be performed when the presence of fetal aneuploidy is determined in a test sample from a pregnant woman. In another example, a medical procedure including biopsy may be performed when the presence of a genetic alteration indicating or associated with the presence of cancer is determined in a test sample from a subject. The invasive diagnostic procedure may be performed, for example, to confirm the determination of the presence of a genetic alteration or mutation, and / or to further characterize a medical condition associated with the genetic alteration or mutation. In some embodiments, the medical procedure may be performed as treatment for a medical condition associated with the genetic alteration or mutation. Treatment may include, for example, one or more of the following: surgery, radiation therapy, chemotherapy, abortion, organ transplantation, cell transplantation, blood transfusion, medication, and symptomatic treatment.
[0243] In some embodiments, the methods herein include treating a subject when the absence of a genetic alteration or mutation is determined in a test sample from the subject. In some embodiments, treating a subject includes performing a medical procedure when the absence of a genetic alteration or mutation is determined in a test sample. For example, when the absence of a genetic alteration or mutation is determined in a test sample, the medical procedure may include health monitoring, retesting, further screening, follow-up testing, etc. In some embodiments, the methods herein include treating a subject consistent with an euploid pregnancy or normal pregnancy when the absence of fetal aneuploidy, genetic mutation, or genetic alteration is determined in a test sample from a pregnant woman. For example, when the absence of fetal aneuploidy, genetic mutation, or genetic alteration is determined in a test sample from a pregnant woman, a medical procedure consistent with an euploid pregnancy or normal pregnancy may be performed. A medical procedure consistent with an euploid pregnancy or normal pregnancy may include one or more procedures performed as part of monitoring the health of the fetus and / or the mother, or monitoring the well-being of the fetus and mother. Medical procedures consistent with euploid pregnancy or normal pregnancy may include one or more procedures to treat symptoms of pregnancy, which may include, for example, one or more of the following: nausea, fatigue, breast tenderness, frequent urination, back pain, abdominal pain, leg cramps, constipation, heartburn, shortness of breath, hemorrhoids, urinary incontinence, varicose veins, and sleep disturbances. 
Medical procedures consistent with a euploid pregnancy or normal pregnancy may include, for example, one or more procedures performed throughout prenatal care to assess potential risks, treat complications, address pre-existing medical conditions (e.g., hypertension, diabetes), and monitor fetal growth and development. Medical procedures consistent with a euploid pregnancy or normal pregnancy may include, for example, complete blood count (CBC) monitoring, Rh antibody testing, urinalysis, urine culture monitoring, rubella screening, hepatitis B and C screening, sexually transmitted infection (STI) screening (e.g., screening for syphilis, chlamydia, gonorrhea), human immunodeficiency virus (HIV) screening, tuberculosis (TB) screening, alpha-fetoprotein screening, fetal heart rate monitoring (e.g., using an ultrasound transducer), uterine activity monitoring (e.g., using a toco transducer), genetic screening and / or diagnostic testing for hereditary disorders (e.g., cystic fibrosis, sickle cell anemia, hemophilia A), glucose screening, glucose tolerance testing, treatment of gestational diabetes, treatment of gestational hypertension, treatment of pre-eclampsia, group B streptococcus (GBS) blood type screening, group B streptococcus culture, treatment of group B streptococcus (e.g., with antibiotics), ultrasound monitoring (e.g., routine ultrasound monitoring, level II ultrasound monitoring, targeted ultrasound monitoring), non-stress test monitoring, biophysical profile monitoring, amniotic fluid index monitoring, serological testing (e.g., pregnancy-associated plasma protein-A (PAPP-A), alpha-fetoprotein (AFP), human chorionic gonadotropin (hCG), unconjugated estriol (uE3), and inhibin-A (inhA) testing), genetic testing, amniocentesis diagnostic testing, and chorionic villus sampling (CVS) diagnostic testing.
[0244] In some embodiments, the methods herein include treating subjects consistent with not having cancer when the absence of a gene mutation or gene alteration is determined in a test sample from the subject. In certain embodiments, medical procedures consistent with a healthy prognosis may be performed when the absence of a gene alteration or gene mutation associated with cancer is determined in a test sample. For example, medical procedures consistent with a healthy prognosis include, without limitation, monitoring the health status of the subject from whom the test sample was examined, performing secondary tests (e.g., secondary screening tests), performing confirmatory tests, monitoring one or more biomarkers associated with cancer (e.g., prostate-specific antigen (PSA) in men), monitoring blood cells (e.g., red blood cells, white blood cells, platelets), monitoring one or more vital signs (e.g., heart rate, blood pressure), and / or monitoring one or more blood metabolites (e.g., total cholesterol, HDL (high-density lipoprotein), LDL (low-density lipoprotein), triglycerides, total cholesterol / HDL ratio, glucose, fibrinogen, hemoglobin, dehydroepiandrosterone (DHEA), homocysteine, C-reactive protein, hormones (e.g., thyroid-stimulating hormone, testosterone, estrogen, estradiol...
Claims
1. A computer-implemented method comprising: accessing non-invasive prenatal testing (NIPT) sequence read data for a first sample group, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of a reference genome; generating, based on the NIPT sequence read data, a first training data subset and a second training data subset, wherein each example in the second training data subset is classified as negative for cancer; accessing copy number variant data for a second sample group, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events derived from cancer patients, the genomic events being copy number counts for one or more segments that deviate from a reference profile; generating a synthetic training dataset based on the copy number variant data obtained from the cancer patients, wherein generating the synthetic training dataset includes generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores onto the bin count profiles of the second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer; augmenting the first training data subset with the synthetic training dataset to generate an augmented training dataset; and training, using the augmented training dataset, a machine learning model for classifying samples as negative for cancer or positive for cancer, wherein the training includes iterative operations for finding a set of parameters for the machine learning model that minimizes a loss function or error function of the machine learning model, each iteration including finding the set of parameters for the machine learning model such that the value of the loss function or error function using the set of parameters is less than the value of the loss function or error function using a different set of parameters in a previous iteration, wherein the loss function or error function is configured to measure the difference between (i) an estimated class output for each example in the augmented training dataset and (ii) a label providing ground truth information for each example in the augmented training dataset, the ground truth information identifying whether the example is classified as negative for cancer or positive for cancer.
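As an illustration only, and not as part of the claim language, the iterative training recited in claim 1 can be sketched in Python. The sketch assumes a logistic regression classifier, a binary cross-entropy loss, and a step-halving rule that accepts a parameter update only when the loss is lower than in the previous iteration. The claim does not fix any of these choices, and the names `train`, `log_loss`, and the toy profiles are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(weights, bias, examples, labels):
    # Mean binary cross-entropy between the estimated class output and the label.
    eps = 1e-12
    total = 0.0
    for x, y in zip(examples, labels):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(examples)

def train(examples, labels, iterations=200, step=1.0):
    n = len(examples)
    weights, bias = [0.0] * len(examples[0]), 0.0
    history = [log_loss(weights, bias, examples, labels)]
    for _ in range(iterations):
        # Gradient of the mean cross-entropy with respect to weights and bias.
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(examples, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v / n
            grad_b += err / n
        # Halve the step until the candidate parameters lower the loss,
        # so every accepted iteration has a smaller loss than the previous one.
        while step > 1e-12:
            cand_w = [w - step * g for w, g in zip(weights, grad_w)]
            cand_b = bias - step * grad_b
            cand_loss = log_loss(cand_w, cand_b, examples, labels)
            if cand_loss < history[-1]:
                weights, bias = cand_w, cand_b
                history.append(cand_loss)
                break
            step *= 0.5
        else:
            break  # no improving step was found; stop training
    return weights, bias, history

# Toy augmented dataset (made-up numbers): label 0 = negative, 1 = positive.
negatives = [[1.00, 1.01, 0.99], [0.98, 1.00, 1.02], [1.01, 0.99, 1.00]]
positives = [[1.00, 1.30, 1.28], [0.99, 1.25, 1.31], [1.02, 1.33, 1.27]]
examples = negatives + positives
labels = [0, 0, 0, 1, 1, 1]
weights, bias, history = train(examples, labels)
```

The `history` list records the loss after each accepted iteration, which makes the "less than in a previous iteration" condition directly checkable.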
2. The computer-implemented method according to claim 1, wherein generating the first training data subset and the second training data subset comprises normalizing the NIPT sequence read data by (i) smoothing the bin count profile using LOESS, (ii) performing population-based correction of the bin count profile using principal component analysis, or (iii) both.
3. The computer-implemented method according to claim 1 or 2, further comprising reducing the dimensionality of the augmented training dataset using another principal component analysis, the other principal component analysis comprising mapping the augmented training dataset to a first k principal components, the mapping reducing data redundancy and reducing the feature space n.
4. The computer-implemented method according to any one of claims 1 to 3, further comprising reducing the dimensionality of the augmented training dataset using an oncogene feature selection process, wherein the oncogene feature selection process comprises generating a subset of a predetermined number of bins that overlap with a predetermined number of oncogenes, and using the bin count profile from the normalized NIPT sequence read data in the subset of the predetermined number of bins as a new model attribute.
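For illustration only, the oncogene feature selection of claim 4 can be sketched as an interval-overlap filter. The sketch assumes that bins and genes are given as half-open `(chromosome, start, end)` intervals; the coordinates, the bin size, and the function names below are hypothetical and do not come from the specification.

```python
def bins_overlapping_genes(bins, genes):
    """Return the indices of bins that overlap at least one gene interval."""
    selected = []
    for i, (b_chrom, b_start, b_end) in enumerate(bins):
        for g_chrom, g_start, g_end in genes:
            # Two half-open intervals overlap when each starts before the other ends.
            if b_chrom == g_chrom and b_start < g_end and g_start < b_end:
                selected.append(i)
                break
    return selected

def select_features(profile, selected):
    """Keep only the bin counts at the selected bin indices."""
    return [profile[i] for i in selected]

# Hypothetical 50 kb bins and made-up gene coordinates.
bins = [
    ("chr17", 0, 50_000),
    ("chr17", 50_000, 100_000),
    ("chr17", 100_000, 150_000),
    ("chr8", 0, 50_000),
]
genes = [("chr17", 60_000, 110_000), ("chr8", 10_000, 20_000)]

selected = bins_overlapping_genes(bins, genes)            # [1, 2, 3]
reduced = select_features([1.0, 1.2, 1.1, 0.9], selected) # [1.2, 1.1, 0.9]
```

Selecting bins once and reusing the index list keeps the reduced feature space identical across training and test samples.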
5. The computer-implemented method according to any one of claims 1 to 4, further comprising: determining an indicator of systematic abnormality for each example in the augmented training dataset, wherein the indicator is a genomic instability number, a tumor content, or both; and adding the indicator of systematic abnormality to each example in the augmented training dataset, wherein the machine learning model is trained to classify the samples as negative for cancer or positive for cancer using the augmented training dataset with the indicators of systematic abnormality, and the machine learning model uses the indicators of systematic abnormality as additional model features during the training to find the set of parameters.
6. The computer-implemented method according to claim 5, wherein the genomic instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
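As a rough illustration, not part of the claim language, the genomic instability number of claim 6 can be approximated by a discrete sum of absolute deviations. The sketch assumes a predicted value of 1.0 for a normalized bin count and a unit bin width, and it uses a centered moving average purely as a stand-in for LOESS smoothing. All three are assumptions, as are the function names.

```python
def moving_average(profile, half_window=1):
    # Stand-in smoother (NOT LOESS): centered moving average, truncated at the edges.
    smoothed = []
    for i in range(len(profile)):
        window = profile[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def genomic_instability_number(smoothed_profile, predicted=1.0, bin_width=1.0):
    # Discrete approximation of the integral of |profile - predicted| over the bins.
    return sum(abs(v - predicted) for v in smoothed_profile) * bin_width

flat = [1.0] * 10                          # profile with no copy number change
altered = [1.0] * 4 + [1.3] * 4 + [1.0] * 2  # profile with a gained segment

gin_flat = genomic_instability_number(moving_average(flat))
gin_altered = genomic_instability_number(moving_average(altered))
```

A profile without deviations gives a value of zero, and the value grows with both the size and the amplitude of the deviating segments.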
7. The computer-implemented method according to any one of claims 1 to 6, wherein the synthetic training dataset presents the same statistical distribution of genomic events as that presented in the copy number variant data.
8. A computer-implemented method comprising: accessing non-invasive prenatal testing (NIPT) sequence read data for a sample, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on the sample, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of a reference genome; determining, based on the NIPT sequence read data, an indicator of systematic abnormality for the sample, wherein the indicator is a genomic instability number, a tumor content, or both; inputting the NIPT sequence read data and the indicator of systematic abnormality into a machine learning model constructed as a binary classifier; classifying, using the machine learning model, the sample as negative for cancer or positive for cancer, wherein the machine learning model uses the indicator of systematic abnormality as an additional model feature for predicting the negative class or the positive class; and outputting, using the machine learning model, the negative class or the positive class for cancer.
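For illustration only, the classification step of claim 8 can be sketched as follows. The sketch assumes an already trained logistic model in which the indicator of systematic abnormality is appended to the bin counts as one extra feature. The weights, bias, threshold, and sample values are made up for the example and are not taken from the specification.

```python
import math

def classify(bin_counts, indicator, weights, bias, threshold=0.5):
    """Classify one sample, using the indicator as an additional model feature."""
    features = list(bin_counts) + [indicator]
    if len(features) != len(weights):
        raise ValueError("expected one weight per bin plus one for the indicator")
    score = sum(w * f for w, f in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    label = "positive" if probability >= threshold else "negative"
    return label, probability

# Hypothetical trained parameters: three bin weights plus one indicator weight.
weights = [0.5, 0.5, 0.5, 4.0]
bias = -3.0

label_flat, p_flat = classify([1.0, 1.0, 1.0], 0.0, weights, bias)
label_altered, p_altered = classify([1.0, 1.3, 1.3], 0.6, weights, bias)
```

With these made-up parameters the flat sample falls below the threshold and the altered sample above it, which shows how the indicator shifts the predicted class.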
9. The computer-implemented method according to claim 8, wherein the machine learning model includes multiple parameters trained using an augmented training dataset, and the augmented training dataset comprises: an original training dataset containing NIPT sequence read data for a sample group, wherein the NIPT sequence read data is generated as part of performing the whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of the reference genome; and a synthetic training dataset generated based on copy number variant data obtained from cancer patients, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events derived from the cancer patients, the genomic events being copy number counts for one or more segments that deviate from a reference profile.
10. The computer-implemented method according to claim 9, wherein the synthetic training dataset is generated by: generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients; extracting event features and standard scores from the empirical population of genomic events; and mapping the event features and standard scores onto the bin count profiles of a second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer.
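As an illustration only, one possible reading of the synthetic dataset generation in claim 10 is sketched below. It assumes that each genomic event is summarized by a start bin, a length in bins, and a standard score (z-score), and that mapping an event onto a cancer-negative profile means shifting the affected bins by the z-score times a per-bin standard deviation. This event layout, the mapping rule, and the function name are assumptions; the claim does not specify them.

```python
import random

def synthesize_positive(negative_profile, events, bin_sd, rng, n_events=1):
    """Draw events from the empirical population and map them onto a negative profile.

    Each event is a hypothetical (start_bin, n_bins, z_score) tuple.
    Returns the altered profile and the label 1 (positive for cancer).
    """
    profile = list(negative_profile)
    for _ in range(n_events):
        start, length, z = rng.choice(events)
        for i in range(start, min(start + length, len(profile))):
            # Shift the bin count by z standard deviations of that bin.
            profile[i] += z * bin_sd[i]
    return profile, 1

# Made-up empirical population of events and a made-up negative profile.
events = [(2, 3, 4.0), (0, 2, -3.0), (5, 1, 6.0)]
negative_profile = [1.0] * 8
bin_sd = [0.02] * 8

rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic_profile, synthetic_label = synthesize_positive(
    negative_profile, events, bin_sd, rng, n_events=2
)
```

Because events are sampled from the empirical population rather than from a parametric model, repeated draws reproduce the observed mix of event sizes and amplitudes.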
11. The computer-implemented method according to claim 9 or 10, wherein the genomic instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
12. The computer-implemented method according to any one of claims 8 to 11, further comprising generating a report including the negative class or the positive class for cancer and the results of the whole-genome sequencing assay.
13. A system comprising: one or more processors; and one or more computer-readable media storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising: accessing non-invasive prenatal testing (NIPT) sequence read data for a first sample group, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of a reference genome; generating, based on the NIPT sequence read data, a first training data subset and a second training data subset, wherein each example in the second training data subset is classified as negative for cancer; accessing copy number variant data for a second sample group, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events derived from cancer patients, the genomic events being copy number counts for one or more segments that deviate from a reference profile; generating a synthetic training dataset based on the copy number variant data obtained from the cancer patients, wherein generating the synthetic training dataset includes generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores onto the bin count profiles of the second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer; augmenting the first training data subset with the synthetic training dataset to generate an augmented training dataset; and training, using the augmented training dataset, a machine learning model for classifying samples as negative for cancer or positive for cancer, wherein the training includes iterative operations for finding a set of parameters for the machine learning model that minimizes a loss function or error function of the machine learning model, each iteration including finding the set of parameters for the machine learning model such that the value of the loss function or error function using the set of parameters is less than the value of the loss function or error function using a different set of parameters in a previous iteration, wherein the loss function or error function is configured to measure the difference between (i) an estimated class output for each example in the augmented training dataset and (ii) a label providing ground truth information for each example in the augmented training dataset, the ground truth information identifying whether the example is classified as negative for cancer or positive for cancer.
14. The system according to claim 13, wherein generating the first training data subset and the second training data subset comprises normalizing the NIPT sequence read data by (i) smoothing the bin count profile using LOESS, (ii) performing population-based correction of the bin count profile using principal component analysis, or (iii) both.
15. The system according to claim 13 or 14, wherein the one or more computer-readable media further store instructions that, when executed by the one or more processors, cause the system to perform an operation comprising reducing the dimensionality of the augmented training dataset using another principal component analysis, the other principal component analysis comprising mapping the augmented training dataset to a first k principal components distinguishing between negative and positive classes, the mapping reducing data redundancy and reducing the feature space n.
16. The system according to any one of claims 13 to 15, wherein the one or more computer-readable media further store instructions that, when executed by the one or more processors, cause the system to perform an operation comprising reducing the dimensionality of the augmented training dataset using an oncogene feature selection process, the oncogene feature selection process comprising generating a subset of a predetermined number of bins that overlap with a predetermined number of oncogenes, and using the bin count profile from the normalized NIPT sequence read data in the subset of the predetermined number of bins as a new model attribute.
17. The system according to any one of claims 13 to 16, wherein the one or more computer-readable media further store instructions that, when executed by the one or more processors, cause the system to perform operations comprising: determining an indicator of systematic abnormality for each example in the augmented training dataset, wherein the indicator is a genomic instability number, a tumor content, or both; and adding the indicator of systematic abnormality to each example in the augmented training dataset, wherein the machine learning model is trained to classify the samples as negative for cancer or positive for cancer using the augmented training dataset with the indicators of systematic abnormality, and the machine learning model uses the indicators of systematic abnormality as additional model features during the training to find the set of parameters.
18. The system according to claim 17, wherein the genome instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
19. The system according to any one of claims 13 to 18, wherein the synthetic training dataset presents the same statistical distribution of genomic events as that presented in the copy number variant data.
20. A system comprising: one or more processors; and one or more computer-readable media storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising: accessing non-invasive prenatal testing (NIPT) sequence read data for a sample, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on the sample, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of a reference genome; determining, based on the NIPT sequence read data, an indicator of systematic abnormality for the sample, wherein the indicator is a genomic instability number, a tumor content, or both; inputting the NIPT sequence read data and the indicator of systematic abnormality into a machine learning model constructed as a binary classifier; classifying, using the machine learning model, the sample as negative for cancer or positive for cancer, wherein the machine learning model uses the indicator of systematic abnormality as an additional model feature for predicting the negative class or the positive class; and outputting, using the machine learning model, the negative class or the positive class for cancer.
21. The system according to claim 20, wherein the machine learning model includes multiple parameters trained using an augmented training dataset, and the augmented training dataset comprises: an original training dataset containing NIPT sequence read data for a sample group, wherein the NIPT sequence read data is generated as part of performing the whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of the reference genome; and a synthetic training dataset generated based on copy number variant data obtained from cancer patients, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events derived from the cancer patients, the genomic events being copy number counts for one or more segments that deviate from a reference profile.
22. The system according to claim 21, wherein the synthetic training dataset is generated by: generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients; extracting event features and standard scores from the empirical population of genomic events; and mapping the event features and standard scores onto the bin count profiles of a second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer.
23. The system according to claim 21 or 22, wherein the genome instability number is calculated as the integral of the absolute deviation of the LOESS smoothed normalized autosomal bin count profile from the predicted value.
24. The system according to any one of claims 20 to 23, wherein the one or more computer-readable media further store instructions that, when executed by the one or more processors, cause the system to perform an operation comprising generating a report including the negative class or the positive class for cancer and the results of the whole-genome sequencing assay.
25. A computer program product tangibly embodied in a non-transitory machine-readable medium, comprising instructions configured to cause one or more data processors to perform operations comprising: accessing non-invasive prenatal testing (NIPT) sequence read data for a first sample group, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile comprising a sequence read count for each bin associated with a segment of a reference genome; generating, based on the NIPT sequence read data, a first training data subset and a second training data subset, wherein each example in the second training data subset is classified as negative for cancer; accessing copy number variant data for a second sample group, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information about genomic events derived from cancer patients, the genomic events being copy number counts for one or more segments that deviate from a reference profile; generating a synthetic training dataset based on the copy number variant data obtained from the cancer patients, wherein generating the synthetic training dataset includes generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients, extracting event features and standard scores from the empirical population of genomic events, and mapping the event features and standard scores onto the bin count profiles of the second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer; augmenting the first training data subset with the synthetic training dataset to generate an augmented training dataset; and training, using the augmented training dataset, a machine learning model for classifying samples as negative for cancer or positive for cancer, wherein the training includes iterative operations for finding a set of parameters for the machine learning model that minimizes a loss function or error function of the machine learning model, each iteration including finding the set of parameters for the machine learning model such that the value of the loss function or error function using the set of parameters is less than the value of the loss function or error function using a different set of parameters in a previous iteration, wherein the loss function or error function is configured to measure the difference between (i) an estimated class output for each example in the augmented training dataset and (ii) a label providing ground truth information for each example in the augmented training dataset, the ground truth information identifying whether the example is classified as negative for cancer or positive for cancer.
26. The computer program product according to claim 25, wherein generating the first training data subset and the second training data subset comprises normalizing the NIPT sequence read data by (i) smoothing the bin count profile using LOESS, (ii) performing population-based correction of the bin count profile using principal component analysis, or (iii) both.
27. The computer program product according to claim 25 or 26, further comprising instructions configured to cause one or more data processors to perform an operation, wherein the operation includes reducing the dimensionality of the augmented training dataset using another principal component analysis, the other principal component analysis including mapping the augmented training dataset to a first k principal components, distinguishing between negative and positive classes, the mapping reducing data redundancy and reducing the feature space n.
28. A computer program product according to any one of claims 25 to 27, further comprising instructions configured to cause one or more data processors to perform an operation, wherein the operation includes reducing the dimensionality of the expanded training dataset using an oncogene feature selection process, the oncogene feature selection process comprising generating a subset of a predetermined number of bins that overlap with a predetermined number of oncogenes, and using the bin count profile from the normalized NIPT sequence read data in the subset of the predetermined number of bins as a new model attribute.
29. The computer program product according to any one of claims 25 to 28, further comprising instructions configured to cause one or more data processors to perform an operation, wherein the operation includes: determining an indicator of systematic abnormality for each example in the expanded training dataset, wherein the indicator is a genomic instability number, a tumor content, or both; and adding the indicator of systematic abnormality to each example in the expanded training dataset, wherein the machine learning model is trained to classify the samples as negative for cancer or positive for cancer using the expanded training dataset with the indicators of systematic abnormality, and the machine learning model uses the indicators of systematic abnormality as additional model features during training to find the set of parameters.
30. The computer program product according to claim 29, wherein the genomic instability number is calculated as the integral of the absolute deviation of the LOESS-smoothed normalized autosomal bin count profile from the predicted value.
31. The computer program product according to any one of claims 25 to 30, wherein the synthetic training dataset presents the same statistical distribution of genomic events as that presented in the copy number variant data.
32. A computer program product tangibly embodied in a non-transitory machine-readable medium, comprising instructions configured to cause one or more data processors to perform an operation, wherein the operation includes: accessing non-invasive prenatal testing (NIPT) sequence read data for a sample, wherein the NIPT sequence read data is generated as part of performing a whole-genome sequencing assay on the sample, and the NIPT sequence read data includes a bin count profile that includes sequence read counts for each bin associated with a segment of a reference genome; determining, based on the NIPT sequence read data, an indicator of systematic abnormality for the sample, wherein the indicator is a genomic instability number, a tumor content, or both; inputting the NIPT sequence read data and the indicator of systematic abnormality into a machine learning model constructed as a binary classifier; classifying, using the machine learning model, the sample as negative for cancer or positive for cancer, wherein the machine learning model uses the indicator of systematic abnormality as an additional model feature for predicting the negative class or the positive class; and outputting, using the machine learning model, the negative class or the positive class for cancer.
33. The computer program product according to claim 32, wherein the machine learning model includes a plurality of parameters trained using an augmented training dataset, the augmented training dataset comprising: an original training dataset containing NIPT sequence read data for a group of samples, wherein the NIPT sequence read data is generated as part of performing the whole-genome sequencing assay on NIPT samples, and the NIPT sequence read data includes a bin count profile containing sequence read counts for each bin associated with a segment of the reference genome; and a synthetic training dataset generated based on copy number variant data obtained from cancer patients, wherein the copy number variant data is generated as part of performing a copy number polymorphism assay, and the copy number variant data includes information on cancer patient-derived genomic events in which a copy number count for one or more segments deviates from a reference profile.
34. The computer program product according to claim 33, wherein the synthetic training dataset is generated by: generating an empirical population of genomic events characteristic of the copy number variant data obtained from the cancer patients; extracting event features and standard scores from the empirical population of genomic events; and mapping the event features and standard scores onto the bin count profile for a second training data subset to generate the synthetic training dataset, wherein each example in the synthetic training dataset is classified as positive for cancer.
35. The computer program product according to claim 33 or 34, wherein the genomic instability number is calculated as the integral of the absolute deviation of the LOESS-smoothed normalized autosomal bin count profile from the predicted value.
36. The computer program product according to any one of claims 32 to 35, further comprising instructions configured to cause one or more data processors to perform an operation including generating a report including the negative class or the positive class for cancer and the results of the whole-genome sequencing assay.
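As a non-limiting illustration of the genomic instability number recited in claims 30 and 35, the calculation might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names `loess_smooth` and `genomic_instability_number` are hypothetical, the LOESS step is simplified to a tricube-weighted local linear fit over bin index, the integral is approximated by a sum over bins of unit width, and the predicted value of a normalized autosomal bin count is assumed to be 1.0.

```python
from typing import List, Sequence


def loess_smooth(y: Sequence[float], frac: float = 0.3) -> List[float]:
    """Smooth y over bin index with tricube-weighted local linear fits.

    Simplified stand-in for LOESS (assumption): one pass, no robustness
    iterations, and the bin index is used as the x coordinate.
    """
    n = len(y)
    k = min(n, max(3, round(frac * n)))  # neighbours used per local fit
    smoothed: List[float] = []
    for i in range(n):
        # Bandwidth: distance to the k-th nearest bin, padded by 1 so that
        # all k neighbours receive a strictly positive weight.
        h = sorted(abs(i - j) for j in range(n))[k - 1] + 1.0
        sw = sx = sy = sxx = sxy = 0.0
        for j in range(n):
            u = abs(i - j) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3  # tricube kernel
            sw += w
            sx += w * j
            sy += w * y[j]
            sxx += w * j * j
            sxy += w * j * y[j]
        denom = sw * sxx - sx * sx
        if abs(denom) < 1e-12:
            # Degenerate window: fall back to the weighted mean.
            smoothed.append(sy / sw)
        else:
            slope = (sw * sxy - sx * sy) / denom
            intercept = (sy - slope * sx) / sw
            smoothed.append(intercept + slope * i)
    return smoothed


def genomic_instability_number(
    profile: Sequence[float], expected: float = 1.0, frac: float = 0.3
) -> float:
    """Sum of |smoothed bin count - expected| over the autosomal bins.

    `expected` = 1.0 is an assumption about how the profile is normalized.
    """
    return sum(abs(s - expected) for s in loess_smooth(profile, frac))


if __name__ == "__main__":
    flat = [1.0] * 40                             # no copy number events
    gain = [1.0] * 10 + [1.5] * 10 + [1.0] * 20   # one simulated gain
    print(genomic_instability_number(flat))
    print(genomic_instability_number(gain))
```

In this sketch a flat normalized profile yields a value near zero, while a profile carrying a simulated gain yields a larger value, so the number could be appended to each example as the additional model feature described in claims 29 and 32.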