Method and system for distinguishing somatic cell genome sequences from germline genome sequences.

By amplifying, capturing, and sequencing the nucleic acid molecules in a sample and combining proxy genome sequences with statistical models, the method addresses the challenge of distinguishing somatic from germline genome sequences, supporting precision medicine and cancer monitoring.

JP7875815B2 · Active · Publication Date: 2026-06-18 · FOUNDATION MEDICINE INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: FOUNDATION MEDICINE INC
Filing Date: 2021-06-03
Publication Date: 2026-06-18

Application Information

IPC: C12Q1/6869; A61K45/00; A61P35/00; C12M1/00; C12N15/09; C12Q1/02; C12Q1/6844; C12Q1/686

CPC: G16B20/00; G16H50/20; G16B30/10; G16B20/20; C12Q1/6844; C12Q1/6869; C12Q1/6886; C12Q2600/16

AI Tagging

Application Domain

Bioreactor/fermenter combinations Biological substance pretreatments

Technology Topics

Somatic cell Genome

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately distinguish between somatic and germline genome sequences in patient samples, especially in the absence of matched normal samples, which hinders the effective implementation of precision medicine.

Method used

Nucleic acid molecules in a sample are amplified, captured, and sequenced; proxy genome sequences and statistical models are then used to calculate allele frequency distances that identify the somatic or germline origin of a genome sequence.

Benefits of technology

It enables accurate differentiation of somatic and germline genome sequences in a single sample, supporting personalized treatment in precision medicine and cancer monitoring.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

Described herein are methods for distinguishing between somatic and germline variants, and devices for implementing such methods. In certain implementations of the methods, the methods may include identifying a genomic sequence of interest in a patient sample at a genomic locus, identifying one or more proxy genomic sequences for the sequence of interest, comparing the observed frequency of the sequence of interest to a centrality measure of the observed frequencies of the one or more proxy genomic sequences, and characterizing the genomic sequence of interest as either germline or somatic based on the comparison.

Description

[Technical Field]

[0001] Cross-Reference to Related Applications: This application claims the benefit of U.S. Provisional Patent Application No. 63/035,572, filed on 5 June 2020, and U.S. Provisional Patent Application No. 63/041,437, filed on 19 June 2020, both of which are incorporated herein by reference in their entirety.

[0002] Technical Field: This disclosure relates to a system and method for distinguishing somatic genome sequences from germline genome sequences.

[Background Technology]

[0003] Background: The germline genome sequence refers to the sequence that an organism inherits from its parents. In particular, if one or both of an organism's parents have a specific genomic mutation (or if the organism acquires a specific mutation very early in its development), that mutation may be in the organism's germline and may be passed on to its offspring (if any).

[0004] In contrast, somatic genome sequences are sequences that are not passed from parent to offspring. For example, organisms can develop genomic mutations caused by external factors (e.g., pollution, radiation, diet, smoking, etc.), and these genomic mutations are limited to specific tissues, fluids, or other anatomical materials. In some cases, these mutations can lead to undesirable medical conditions, including but not limited to cancer.

[0005] Precision medicine is the field of treating patients with therapies that target their individual characteristics or conditions. For many patients (including cancer patients), this may involve determining genomic information about both the patient's "normal" genomic state and the genomic state of the patient's "abnormal" tissues, fluids, or other anatomical materials. This information may originate from samples taken from the patient, such as tumor biopsies, blood samples, or any other type of sample containing both normal and abnormal tissues, fluids, or other anatomical materials.

[0006] These samples can be assayed to determine (at least partially) the genomic sequence of the material they contain. However, it can sometimes be difficult to identify whether a particular genomic sequence originates from a patient's normal or abnormal anatomical material; that is, it can sometimes be difficult to determine whether a particular genomic sequence is germline or somatic.

[0007] Understanding whether genetic variants observed in the DNA of cancer patients are germline or somatic is crucial in both clinical practice and cancer research. The somatic/germline distinction can be made, for example, by sequencing matched tumor and normal tissues from the same patient. Variants present in the tumor but not in normal tissue are classified as somatic, while those present in both are classified as germline. However, such dual-sample approaches are constrained by cost and sample availability. Typically, matched normal samples are not available in clinical practice. For example, in the case of tissue biopsy, a single sample containing both the tumor and adjacent normal tissue is collected. Therefore, there is a need to develop methods that can reliably classify detected variants by origin, somatic or germline.

[Summary of the Invention]

[Means for Solving the Problem]

[0008] Overview: Methods, devices, and computer-readable media for distinguishing somatic genome sequences from germline genome sequences are described herein.

[0009] This specification provides a method for identifying a target genomic sequence as germline or somatic, the method comprising: providing a plurality of nucleic acid molecules obtained from a sample from a subject, wherein the plurality of nucleic acid molecules includes a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; optionally, attaching one or more adapters to one or more nucleic acid molecules from the plurality of nucleic acid molecules; amplifying nucleic acid molecules from the plurality of nucleic acid molecules; capturing nucleic acid molecules from the amplified nucleic acid molecules by hybridization to one or more bait molecules; sequencing the captured nucleic acid molecules with a sequencer to obtain a plurality of sequence reads corresponding to one or more genomic loci; selecting, using one or more processors, a target genomic sequence at a genomic locus from the one or more genomic loci; selecting, using one or more processors, one or more proxy genomic sequences for the target genomic sequence; determining, using one or more processors, an allele frequency distance using a summary statistic or distribution indicating the observed allele frequency of the target genomic sequence and the observed allele frequencies of the one or more proxy genomic sequences; and identifying, using one or more processors, the target genomic sequence as germline or somatic using the allele frequency distance.

[0010] In some embodiments, the subjects are cancer patients. In some embodiments, the sample includes a tissue biopsy sample, a liquid biopsy sample, a circulating tumor cell (CTC) sample, a cell-free DNA (cfDNA) sample, or a normal control. In some embodiments, the sample is a liquid biopsy sample and includes blood, plasma, cerebrospinal fluid, sputum, feces, urine, or saliva. In some embodiments, tumor nucleic acid molecules are derived from the tumor portion of a heterogeneous tissue biopsy sample, and non-tumor nucleic acid molecules are derived from the normal portion of a heterogeneous tissue biopsy sample. In some embodiments, tumor nucleic acid molecules are derived from the circulating tumor DNA (ctDNA) fraction of a cell-free DNA sample, and non-tumor nucleic acid molecules are derived from the non-tumor fraction of the cell-free DNA sample. In some embodiments, one or more adapters include amplification primers or sequencing adapters. In some embodiments, one or more bait molecules include one or more nucleic acid molecules, each nucleic acid molecule including a region complementary to the region of the captured nucleic acid molecule. In some embodiments, amplifying the nucleic acid molecules includes performing polymerase chain reaction (PCR) or isothermal amplification techniques. In some embodiments, sequencing involves the use of next-generation sequencing (NGS) technology. In some embodiments, sequencing involves a next-generation sequencer. In some embodiments, one or more proxy genome sequences are located within a defined segment of the genome sequence of interest, and the selected genome sequence of interest is located within the same defined segment. In some embodiments, the genome sequence of interest is segmented into multiple segments based on the uniformity of copy numbers within each segment. In some embodiments, the summary statistic is the mean allele frequency or the median allele frequency. 
In some embodiments, the allele frequency distance is determined using a distribution showing the observed allele frequencies of the genome sequence of interest and the observed frequencies of multiple proxy genome sequences, and the genome sequence of interest is identified as germline or somatic based on the probability that the observed allele frequencies of the genome sequence of interest fit or do not fit within the distribution.
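The summary-statistic variant of the allele frequency distance can be illustrated with a minimal Python sketch. The function names, the absolute-difference form of the distance, the 0.1 threshold, and the rule that a small distance indicates germline are all assumptions made for illustration; the text above only states that a mean or median of the proxy allele frequencies may be used.

```python
from statistics import mean, median


def allele_frequency_distance(target_af, proxy_afs, statistic="median"):
    """Absolute gap between the target's allele frequency and a summary
    statistic (mean or median) of the proxy allele frequencies."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    if statistic == "median":
        summary = median(proxy_afs)
    elif statistic == "mean":
        summary = mean(proxy_afs)
    else:
        raise ValueError("statistic must be 'mean' or 'median'")
    return abs(target_af - summary)


def classify_by_distance(target_af, proxy_afs, threshold=0.1):
    """Hypothetical rule: a target close to its proxies is called germline."""
    distance = allele_frequency_distance(target_af, proxy_afs)
    return "germline" if distance <= threshold else "somatic"


# Illustrative (made-up) allele frequencies for proxy SNPs on one segment.
proxy_afs = [0.48, 0.52, 0.50, 0.47, 0.53]
print(classify_by_distance(0.49, proxy_afs))  # germline
print(classify_by_distance(0.20, proxy_afs))  # somatic
```

A fixed threshold is the simplest possible decision rule; the trained statistical model described in later paragraphs would replace it in practice.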

[0011] In some embodiments, a method for identifying a target genome sequence as germline or somatic cell includes: selecting a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules using one or more processors; selecting one or more proxy genome sequences for the target genome sequence using one or more processors; determining an allele frequency distance using summary statistics or distributions showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of one or more proxy genome sequences using one or more processors; and identifying (e.g., classifying) the target genome sequence as germline or somatic cell using the allele frequency distance using one or more processors.

[0012] In some embodiments of this method, the summary statistic is the mean allele frequency or the median allele frequency. In some embodiments, the allele frequency distance is determined using a distribution showing the observed allele frequencies of the genome sequence of interest and the observed frequencies of several proxy genome sequences, and the genome sequence of interest is identified as germline or somatic cell based on the probability that the observed allele frequencies of the genome sequence of interest fit or do not fit within the distribution.
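The distribution-based variant can be sketched as follows. This is only one possible reading: it assumes the proxy allele frequencies are modeled with a normal distribution and that "fit" is measured with a two-sided tail probability. The normal model, the function names, and the `alpha` cutoff are assumptions for illustration and are not specified in the text.

```python
from statistics import NormalDist, mean, stdev


def fit_probability(target_af, proxy_afs):
    """Two-sided tail probability of the target allele frequency under a
    normal distribution fitted to the proxy allele frequencies."""
    if len(proxy_afs) < 2:
        raise ValueError("need at least two proxy allele frequencies")
    mu, sigma = mean(proxy_afs), stdev(proxy_afs)
    if sigma == 0:
        raise ValueError("proxy allele frequencies have zero spread")
    z = abs(target_af - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(z))


def classify_by_fit(target_af, proxy_afs, alpha=0.01):
    """Hypothetical rule: call germline when the target fits the proxies."""
    p = fit_probability(target_af, proxy_afs)
    return "germline" if p >= alpha else "somatic"


# Illustrative (made-up) proxy allele frequencies.
proxies = [0.48, 0.52, 0.50, 0.47, 0.53]
print(classify_by_fit(0.51, proxies))  # germline
print(classify_by_fit(0.20, proxies))  # somatic
```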

[0013] In some embodiments, the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include a DNA molecule. In some embodiments, the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include an RNA molecule.

[0014] In some embodiments, the method further includes sequencing tumor nucleic acid molecules and non-tumor nucleic acid molecules from a patient sample to determine the patient genome sequence. In some embodiments, the patient genome sequence is obtained or determined using next-generation sequencing technology. In some embodiments, the sequencer is a next-generation sequencer.

[0015] In some embodiments of this method, one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment. In some embodiments, the patient genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment. In some embodiments, this method includes segmenting the patient genome sequence into multiple segments.
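Restricting proxies to the segment that contains the target can be sketched as below. The `Segment` and `Variant` types, the half-open coordinate convention, and the example coordinates are hypothetical and serve only to illustrate the same-segment constraint.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive


class Variant(NamedTuple):
    chrom: str
    pos: int
    af: float


def find_segment(variant, segments):
    """Return the segment containing the variant, or None if there is none."""
    for seg in segments:
        if seg.chrom == variant.chrom and seg.start <= variant.pos < seg.end:
            return seg
    return None


def proxies_on_same_segment(target, snps, segments):
    """Keep only SNPs that lie on the same segment as the target."""
    seg = find_segment(target, segments)
    if seg is None:
        return []
    return [s for s in snps if s != target and find_segment(s, [seg]) is not None]


# Made-up segments and variants.
segments = [Segment("chr1", 0, 1000), Segment("chr1", 1000, 5000)]
target = Variant("chr1", 1500, 0.30)
snps = [
    Variant("chr1", 200, 0.50),
    Variant("chr1", 1200, 0.66),
    Variant("chr1", 4000, 0.64),
    Variant("chr2", 1500, 0.50),
]
print([s.pos for s in proxies_on_same_segment(target, snps, segments)])
# [1200, 4000]
```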

[0016] In some embodiments of this method, the patient genome sequence is determined using targeted sequencing. In some embodiments, the targeted sequencing includes targeted sequencing of one or more cancer-related genes or parts thereof. In some embodiments, the targeted sequencing includes targeted sequencing of one or more exon regions.

[0017] In some embodiments, the method includes: identifying a target genomic sequence in a patient sample at a genomic locus using one or more processors; identifying one or more proxy genomic sequences for the target sequence using one or more processors; comparing the observed frequency of the target sequence with a centrality measure of the observed frequencies of one or more proxy genomic sequences using one or more processors; and, based on this comparison, identifying (e.g., classifying or characterizing) the target genomic sequence as either germline or somatic.

[0018] In some embodiments of this method, one or more proxy genome sequences contain single nucleotide polymorphisms (SNPs).

[0019] In some embodiments of this method, one or more proxy genome sequences contain alleles.

[0020] In some embodiments, the method further includes identifying, by one or more processors, a segment of the patient's genome that contains the genomic locus. In some embodiments, identifying, by one or more processors, the segment includes performing a segmentation procedure on a contiguous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three different segments. In some embodiments, the proxy is identified, by one or more processors, to be located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether genomic parameters are equal across each individual segment. In some embodiments, the genomic parameter is copy number.
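A segmentation procedure that groups a contiguous region into segments of equal copy number can be sketched as follows. The sketch assumes integer copy-number calls are already available per genomic bin, which is a simplification; segmentation of noisy coverage data would need a statistical method that the text does not name. The bin layout and function name are hypothetical.

```python
from itertools import groupby


def segment_by_copy_number(bins):
    """Merge adjacent bins that share a copy number into segments.

    bins: list of (start, end, copy_number) tuples, ordered along one
    contiguous region. Returns a list of (start, end, copy_number).
    """
    segments = []
    for copy_number, run in groupby(bins, key=lambda b: b[2]):
        run = list(run)
        # A segment spans from the first bin's start to the last bin's end.
        segments.append((run[0][0], run[-1][1], copy_number))
    return segments


# Made-up bins covering a region large enough to yield three segments.
bins = [(0, 100, 2), (100, 200, 2), (200, 300, 3), (300, 400, 3), (400, 500, 2)]
print(segment_by_copy_number(bins))
# [(0, 200, 2), (200, 400, 3), (400, 500, 2)]
```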

[0021] In some embodiments of any of the above methods for identifying a target genomic sequence as germline or somatic, identifying, by one or more processors, the target genomic sequence as germline or somatic comprises: inputting an allele frequency distance into a trained statistical model; and outputting, from the trained statistical model, a value indicative of the likelihood that the target genomic sequence is germline or a value indicative of the likelihood that the target genomic sequence is somatic. In some embodiments, the allele frequency distance is adjusted to correct for contamination level in the patient sample, low sequencing read depth, noisy estimates of allele frequency, low segment germline single nucleotide polymorphism (SNP) count, or high variability of segment germline SNP allele frequencies. In some embodiments, the trained statistical model includes a function that associates the allele frequency distance with a value indicative of the likelihood that the target genomic sequence is germline or a value indicative of the likelihood that the target genomic sequence is somatic.
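A function that associates the allele frequency distance with a likelihood value can be sketched with a logistic curve. The coefficients below are hypothetical placeholders, not trained values, and the sign convention (larger distance means less likely germline) is an assumption for illustration.

```python
import math


def germline_likelihood(distance, intercept=2.0, slope=-20.0):
    """Logistic mapping from allele frequency distance to a value in (0, 1).

    intercept and slope are hypothetical; a trained model would supply them.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance)))


def somatic_likelihood(distance, intercept=2.0, slope=-20.0):
    """Complement of the germline value under the same two-class model."""
    return 1.0 - germline_likelihood(distance, intercept, slope)


for d in (0.01, 0.10, 0.30):
    print(d, round(germline_likelihood(d), 3), round(somatic_likelihood(d), 3))
```

With these placeholder coefficients the two values cross at a distance of about 0.1; any adjustment of the distance for contamination or read depth would be applied before calling the function.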

[0022] In some embodiments, the trained statistical model is a logistic regression model. In some embodiments, the trained statistical model is trained using tumor samples having known germline sequences. In some embodiments, the trained statistical model is trained using data on tumor samples having known germline sequences and known somatic sequences. In some embodiments, the method further includes training a statistical model using data on tumor samples having known germline sequences. In some embodiments, the method further includes training a statistical model using data on tumor samples having known germline sequences and known somatic sequences.
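Training a logistic regression model on labeled variants can be sketched in plain Python with batch gradient descent. The training data below are made up, the single-feature model (distance only) is a simplification, and the learning rate and epoch count are arbitrary choices; none of these come from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(distances, labels, lr=1.0, epochs=10000):
    """Fit intercept and slope by gradient descent on the logistic loss.

    labels: 1 for known germline variants, 0 for known somatic variants.
    """
    b0, b1 = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(distances, labels):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the loss w.r.t. z
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Made-up allele frequency distances with known origin.
train_x = [0.01, 0.02, 0.03, 0.05, 0.20, 0.25, 0.30, 0.35]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
b0, b1 = train_logistic(train_x, train_y)
print("P(germline | d=0.02) =", round(sigmoid(b0 + b1 * 0.02), 3))
print("P(germline | d=0.30) =", round(sigmoid(b0 + b1 * 0.30), 3))
```

Because this toy data set is perfectly separable, the coefficients keep growing with more epochs; a real training run would typically add regularization or use a library implementation.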

[0023] In some embodiments, the trained statistical model is trained using data on variant allele frequencies that exclude variants located in genomic regions known to have allele frequencies deviating from expected values. In some embodiments, the method further includes training a statistical model using data on variant allele frequencies that exclude variants located in genomic regions known to have allele frequencies deviating from expected values.
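Excluding training variants that fall in regions with known allele frequency deviations can be sketched as a simple interval filter. The region list, coordinates, and dictionary layout are hypothetical; the text does not say which regions are excluded.

```python
def exclude_known_regions(variants, excluded_regions):
    """Drop variants located inside any excluded region.

    variants: list of dicts with 'chrom' and 'pos' keys.
    excluded_regions: list of (chrom, start, end) with half-open intervals.
    """
    def in_excluded(v):
        return any(
            v["chrom"] == chrom and start <= v["pos"] < end
            for chrom, start, end in excluded_regions
        )

    return [v for v in variants if not in_excluded(v)]


# Made-up coordinates for illustration only.
excluded = [("chr6", 1000, 2000)]
training = [
    {"chrom": "chr6", "pos": 1500, "af": 0.71},
    {"chrom": "chr6", "pos": 2500, "af": 0.49},
    {"chrom": "chr1", "pos": 1500, "af": 0.51},
]
print([(v["chrom"], v["pos"]) for v in exclude_known_regions(training, excluded)])
# [('chr6', 2500), ('chr1', 1500)]
```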

[0024] In some embodiments, the trained statistical model is trained using data incorporating prior knowledge, based on historical data or a database, of the likelihood that a variant is a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant. In some embodiments, the method further includes training a statistical model using data incorporating prior knowledge, based on historical data or a database, of the likelihood that a variant is a germline variant, a somatic variant, or a CHIP variant.

[0025] In some embodiments, the trained statistical model is trained using data describing the noise level for a given variant call and its genomic context. In some embodiments, the method further includes training the statistical model using data describing the noise level for a given variant call and its genomic context.

[0026] In some embodiments, one or more proxy genome sequences include single nucleotide polymorphisms (SNPs). In some embodiments, one or more proxy genome sequences include alleles. In some embodiments of this method, the genome sequence of interest includes a genome variant.

[0027] In some embodiments of this method, the method further includes generating, using one or more processors, a report identifying the target genome sequence as either germline or somatic. In some embodiments, the method includes transmitting the report to, for example, a healthcare provider. In some embodiments, the report is transmitted via a computer network or peer-to-peer connection.

[0028] In some embodiments of any of the above methods, the patient sample is derived from a tissue biopsy containing tumor tissue and non-tumor tissue. In some embodiments, the tissue biopsy is a solid tissue biopsy or a liquid biopsy. In some embodiments, the tissue biopsy is a liquid biopsy containing blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample includes cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample includes circulating tumor DNA (ctDNA) obtained from the subject.

[0029] This specification further describes methods for treating a patient's cancer, each comprising: identifying, using one or more processors, one or more target genomic sequences as somatic using one or more of the above methods; selecting a cancer treatment modality based on the one or more identified somatic sequences; and treating the cancer with the selected cancer treatment modality. In some embodiments, the one or more identified somatic sequences are associated with the success of cancer treatment with the selected treatment modality. In some embodiments, the method comprises determining, using one or more processors, the microsatellite instability status of the cancer using the one or more identified somatic sequences; and selecting a cancer treatment modality based on the microsatellite instability status of the cancer. In some embodiments, the method comprises determining, using one or more processors, the tumor mutation load (tumor mutational burden) for the cancer using the one or more identified somatic sequences; and selecting a cancer treatment modality based on whether the tumor mutation load exceeds a predetermined tumor mutation load threshold. In some embodiments, the cancer treatment modality comprises administering an effective amount of one or more anticancer agents to the patient if the tumor mutation load exceeds the predetermined threshold. In some embodiments, the one or more anticancer agents include cancer immunotherapy agents. In some embodiments, the cancer immunotherapy agent is an immune checkpoint inhibitor.
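The tumor mutation load comparison can be sketched as a count of somatic calls normalized by the sequenced region size. The per-megabase normalization, the panel size, and the threshold of 10 mutations per megabase are assumptions for illustration; the text only refers to a predetermined threshold.

```python
def tumor_mutation_load(somatic_count, panel_size_bp):
    """Somatic mutations per megabase of sequenced territory."""
    if panel_size_bp <= 0:
        raise ValueError("panel_size_bp must be positive")
    return somatic_count / (panel_size_bp / 1_000_000)


def exceeds_threshold(calls, panel_size_bp, threshold=10.0):
    """Return (load, flag); threshold is a hypothetical cutoff."""
    somatic_count = sum(1 for call in calls if call == "somatic")
    load = tumor_mutation_load(somatic_count, panel_size_bp)
    return load, load > threshold


# Made-up classification results for a 1.5 Mb panel.
calls = ["somatic"] * 18 + ["germline"] * 40
print(exceeds_threshold(calls, 1_500_000))  # (12.0, True)
```

Only variants classified as somatic contribute to the count, which is why the germline/somatic classification step precedes this calculation.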

[0030] This specification also describes a method for monitoring the progression or recurrence of cancer in a patient, the method comprising: identifying, using one or more processors, one or more target genomic sequences as somatic using one or more of the above methods; and detecting, using one or more processors, the presence or absence of the one or more target genomic sequences identified as somatic in a second patient sample obtained from the patient after the cancer has been treated. In some embodiments, the method comprises obtaining the second patient sample from the patient. In some embodiments, the method comprises treating the patient's cancer after the first patient sample is obtained from the patient and before the second patient sample is obtained from the patient. In some embodiments, the second patient sample comprises cell-free DNA. In some embodiments, detecting the presence or absence of the one or more target genomic sequences identified as somatic in the second patient sample comprises sequencing nucleic acid molecules in the second patient sample.
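Checking a follow-up sample for previously identified somatic sequences can be sketched as a set lookup. Representing a variant as a `(chrom, pos, ref, alt)` tuple and treating any exact match as "present" are assumptions for illustration; a real pipeline would also need read-support criteria that the text does not specify.

```python
def detect_in_followup(baseline_somatic, followup_variants):
    """Map each baseline somatic variant to True/False presence in follow-up.

    Variants are (chrom, pos, ref, alt) tuples.
    """
    observed = set(followup_variants)
    return {variant: variant in observed for variant in baseline_somatic}


# Made-up variants for illustration.
baseline = [("chr7", 100, "A", "T"), ("chr17", 250, "G", "C")]
followup = [("chr7", 100, "A", "T"), ("chr2", 900, "C", "T")]
status = detect_in_followup(baseline, followup)
print(status)
# {('chr7', 100, 'A', 'T'): True, ('chr17', 250, 'G', 'C'): False}
```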

[0031] This specification further describes a method for selecting neoantigens for a cancer vaccine personalized for a subject with cancer, the method comprising: identifying, using one or more processors, one or more target genomic sequences as somatic using one or more of the above methods, wherein the one or more target genomic sequences identified as somatic are located within an exon region of a gene; and selecting, using one or more processors, from the one or more target genomic sequences identified as somatic, a genomic sequence encoding a neoantigen suitable for a cancer vaccine for the subject. In some embodiments, the method further comprises producing a vaccine containing the neoantigen.

[0032] This specification also describes a non-transitory computer-readable storage medium storing one or more programs, wherein the one or more programs include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: select a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; select one or more proxy genome sequences for the target genome sequence; determine an allele frequency distance using a summary statistic or distribution showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identify the target genome sequence as germline or somatic using the allele frequency distance. In some embodiments, the summary statistic is the mean allele frequency or median allele frequency. In some embodiments, the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequencies of the target genome sequence fit or do not fit within the distribution. In some embodiments, the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include a DNA molecule. In some embodiments, the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include an RNA molecule.

[0033] In some embodiments of the non-transitory computer-readable storage medium, one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment. In some embodiments, the patient genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment.

[0034] In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to segment a patient genome sequence into multiple segments.

[0035] In some embodiments of the non-transitory computer-readable storage medium, the patient genome sequence is determined using targeted sequencing. In some embodiments, the patient genome sequence is determined using next-generation sequencing. In some embodiments, the targeted sequencing includes targeted sequencing of one or more cancer-related genes or parts thereof. In some embodiments, the targeted sequencing includes targeted sequencing of one or more exon regions.

[0036] In some embodiments, a non-transitory computer-readable storage medium stores one or more programs that include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to identify a target genomic sequence in a patient sample at a genomic locus, to identify one or more proxy genomic sequences for the target sequence, to compare the observed frequency of the target sequence with a centrality measure of the observed frequencies of the one or more proxy genomic sequences, and, based on this comparison, to characterize the target genomic sequence as either germline or somatic.

[0037] In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further include instructions that, when executed by one or more processors of the electronic device, cause the electronic device to generate a report identifying the target genome sequence as either germline or somatic. In some embodiments, the electronic device includes a display, and the one or more programs further include instructions that, when executed by one or more processors of the electronic device, cause the electronic device to display the report.

[0038] In some embodiments of the non-transitory computer-readable storage medium, one or more proxy genome sequences include single nucleotide polymorphisms (SNPs).

[0039] In some embodiments of the non-transitory computer-readable storage medium, one or more proxy genome sequences include alleles.

[0040] In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to identify a segment of the patient's genome containing the genomic locus. In some embodiments, identifying the segment includes performing a segmentation procedure on a contiguous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three distinct segments. In some embodiments, one or more proxy genomic sequences are identified as being located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether genomic parameters are equal throughout each individual segment. In some embodiments, the genomic parameter is the copy number.

[0041] In some embodiments of the non-transitory computer-readable storage medium, the target genome sequence includes a genome variant.

[0042] In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to receive sequencing data related to the patient genome sequence. In some embodiments, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to assemble the patient genome sequence using the sequencing data. In some embodiments, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause a sequencer to sequence nucleic acid molecules derived from a patient sample and thereby obtain the sequencing data.

[0043] In some embodiments of the non-transitory computer-readable storage medium, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to generate a report identifying the target genome sequence as either germline or somatic. In some embodiments, the one or more programs further include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to transmit the report using a computer network.

[0044] In some embodiments of the non-transitory computer-readable storage medium, the electronic device includes a display, and the one or more programs further include instructions that, when executed by one or more processors of the electronic device, cause the electronic device to display a report.

[0045] In some embodiments of the non-transitory computer-readable storage medium, one or more proxy genome sequences include single nucleotide polymorphisms (SNPs).

[0046] In some embodiments of the non-transitory computer-readable storage medium, one or more proxy genome sequences include alleles.

[0047] In some embodiments of the non-transitory computer-readable storage medium, the target genome sequence includes a genome variant.

[0048] Also disclosed herein is an electronic device comprising one or more processors and a memory for storing one or more programs configured to be executed by the one or more processors, wherein the one or more programs include instructions for selecting a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample comprising a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; instructions for selecting one or more proxy genome sequences for the target genome sequence; instructions for determining an allele frequency distance using summary statistics or distributions indicating the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and instructions for identifying the target genome sequence as germline or somatic cell using the allele frequency distance. In some embodiments, the summary statistics are mean allele frequencies or median allele frequencies. In some embodiments, the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed frequencies of multiple proxy genome sequences, and the target genome sequence is identified as germline or somatic cell based on the probability that the observed allele frequencies of the target genome sequence fit or do not fit within the distribution. In some embodiments, tumor nucleic acid molecules and non-tumor nucleic acid molecules include DNA molecules. In some embodiments, tumor nucleic acid molecules and non-tumor nucleic acid molecules include RNA molecules. In some embodiments, the patient genome sequence is determined using next-generation sequencing.

[0049] In some embodiments of the electronic device, one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment. In some embodiments, the patient genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment. In some embodiments, one or more programs further include instructions for segmenting the patient genome sequence into multiple segments.
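
The same-segment constraint described above can be sketched as follows. This is a minimal illustration that assumes segments and SNPs are already available as simple records; the `Segment` and `Snp` types, their field names, and the `select_proxies` helper are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chrom: str
    start: int        # inclusive
    end: int          # inclusive
    copy_number: int  # assumed uniform within the segment


@dataclass(frozen=True)
class Snp:
    chrom: str
    pos: int
    allele_freq: float


def find_segment(chrom, pos, segments):
    """Return the segment containing (chrom, pos), or None if uncovered."""
    for seg in segments:
        if seg.chrom == chrom and seg.start <= pos <= seg.end:
            return seg
    return None


def select_proxies(chrom, pos, segments, snps):
    """Keep only SNPs that fall in the same segment as the target locus."""
    seg = find_segment(chrom, pos, segments)
    if seg is None:
        return []
    return [
        s for s in snps
        if s.chrom == seg.chrom and seg.start <= s.pos <= seg.end and s.pos != pos
    ]


# Hypothetical data: two segments with different copy numbers.
segments = [Segment("chr1", 1, 1000, 2), Segment("chr1", 1001, 2000, 3)]
snps = [Snp("chr1", 100, 0.49), Snp("chr1", 900, 0.51), Snp("chr1", 1500, 0.66)]
print([s.pos for s in select_proxies("chr1", 500, segments, snps)])
```

Restricting proxies to the target's own segment means the proxies and the target are compared under the same copy number, which is the stated purpose of segmenting by copy-number uniformity.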

[0050] In some embodiments of electronic devices, the patient genome sequence is determined using targeted sequencing. In some embodiments, the targeted sequencing includes targeted sequencing of one or more genes or parts thereof that are associated with cancer. In some embodiments, the targeted sequencing includes targeted sequencing of one or more exon regions.

[0051] In some embodiments, the electronic device comprises one or more processors and a memory storing one or more programs configured to be executed by the one or more processors, wherein the one or more programs include: instructions for identifying a target genome sequence in a patient sample at a genomic locus; instructions for identifying one or more proxy genome sequences for the target genome sequence; instructions for comparing the observed frequency of the target genome sequence with a centrality measure of the observed frequency of the one or more proxy genome sequences; and instructions for identifying the target genome sequence as germline or somatic based on the comparison.

[0052] In some embodiments of electronic devices, one or more proxy genome sequences contain single nucleotide polymorphisms (SNPs).

[0053] In some embodiments of electronic devices, one or more proxy genome sequences include alleles.

[0054] In some embodiments of the electronic device, the one or more programs further include instructions for identifying a segment of the patient's genome that contains the genomic locus. In some embodiments, identifying the segment involves performing a segmentation procedure on a contiguous portion of the patient's genome. In some embodiments, the portion of the patient's genome is large enough to identify three distinct segments. In some embodiments, the proxies are identified as those located on the same segment as the genomic locus. In some embodiments, the segmentation procedure identifies segments according to whether a genomic parameter is equal throughout each individual segment. In some embodiments, the genomic parameter is the copy number.
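
One simple way to picture a segmentation procedure of this kind is a run-length grouping of ordered loci by copy number. The sketch below assumes an integer copy number has already been called at each locus; that input format and the `segment_by_copy_number` name are assumptions for illustration only, and a practical pipeline would typically segment noisy coverage data with a statistical change-point method instead.

```python
from itertools import groupby


def segment_by_copy_number(loci):
    """Group position-sorted (position, copy_number) pairs into maximal
    contiguous runs with equal copy number.

    Returns a list of (start, end, copy_number) tuples."""
    segments = []
    for copy_number, run in groupby(loci, key=lambda item: item[1]):
        run = list(run)
        segments.append((run[0][0], run[-1][0], copy_number))
    return segments


# Hypothetical loci on one contiguous portion of the genome.
loci = [(100, 2), (200, 2), (300, 3), (400, 3), (500, 2)]
print(segment_by_copy_number(loci))
# -> [(100, 200, 2), (300, 400, 3), (500, 500, 2)]
```

In this toy input the contiguous portion is long enough to yield three distinct segments, matching the condition mentioned in the paragraph above.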

[0055] In some embodiments of electronic devices, the target genome sequence includes genome variants.

[0056] In some embodiments of the electronic device, one or more programs further include instructions for receiving sequencing data related to the patient genome sequence. In some embodiments, one or more programs further include instructions for assembling the patient genome sequence using the sequencing data. In some embodiments, one or more programs further include instructions for causing a sequencer to sequence nucleic acid molecules derived from a patient sample, thereby obtaining sequencing data.

[0057] In some embodiments of electronic devices, one or more proxy genome sequences contain single nucleotide polymorphisms (SNPs).

[0058] In some embodiments of electronic devices, one or more proxy genome sequences include alleles.

[0059] In some embodiments of electronic devices, the target genome sequence includes genome variants.

[0060] In some embodiments of the electronic device, the one or more programs further include instructions for generating a report identifying the target genome sequence as germline or somatic. In some embodiments, the one or more programs further include instructions for transmitting the report over a computer network or a peer-to-peer connection. In some embodiments, the device further comprises a display, and the one or more programs further include instructions for displaying the report.

[0061] In some embodiments of the electronic device, the patient sample is derived from a tissue biopsy containing tumor tissue and non-tumor tissue. In some embodiments, the tissue biopsy is a solid tissue biopsy or a liquid biopsy. In some embodiments, the tissue biopsy is a liquid biopsy containing blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample includes cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample includes circulating tumor DNA (ctDNA) obtained from the subject.

[0062] This specification also describes a system comprising any one of the electronic devices described herein and a sequencer configured to sequence nucleic acid molecules derived from a patient sample. In some embodiments, the sequencer is a next-generation sequencer.

[0063] This specification discloses a method for identifying a target genome sequence as germline or somatic, the method comprising: identifying, using one or more processors, the target genome sequence in a patient sample at a genomic locus; identifying, using the one or more processors, a proxy genome sequence for the target genome sequence; comparing, using the one or more processors, the observed allele fraction of the target genome sequence with the observed allele fraction of the proxy genome sequence; and identifying, using the one or more processors, the target genome sequence as germline or somatic based on the comparison. In some embodiments, the proxy genome sequence has the same copy number as the target genome sequence. In some embodiments, identifying the target genome sequence as germline or somatic using the one or more processors comprises: inputting an allele frequency distance into a trained statistical model; and outputting from the trained statistical model a value indicating the likelihood that the target genome sequence is germline, or a value indicating the likelihood that the target genome sequence is somatic. In some embodiments, the allele fractions of the target genome sequence and the proxy genome sequence are determined using next-generation sequencing technology. In some embodiments, the allele fractions of the target genome sequence and the proxy genome sequence are determined using microarray technology. In some embodiments, the patient sample includes a solid tissue biopsy or a liquid biopsy. In some embodiments, the patient sample is a liquid biopsy including blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample includes cell-free DNA (cfDNA) obtained from the subject. In some embodiments, the patient sample includes circulating tumor DNA (ctDNA) obtained from the subject. In some embodiments, the patient is a cancer patient.
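
To make the model step concrete, the fragment below sketches a logistic function that maps an allele frequency distance to a germline likelihood. The logistic form is one of the model types contemplated in this specification; the coefficient values, the 0.5 cutoff, and the function names are hypothetical placeholders, not trained or disclosed values.

```python
import math


def germline_probability(distance, intercept=4.0, slope=-40.0):
    """Logistic mapping from allele frequency distance to the likelihood
    that the target is germline. The intercept and slope are made-up
    illustrative values; a real model would learn them from samples with
    known germline and somatic sequences."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance)))


def classify(distance, cutoff=0.5):
    """Label the target using a hypothetical probability cutoff."""
    return "germline" if germline_probability(distance) >= cutoff else "somatic"


for d in (0.0, 0.1, 0.3):
    print(d, round(germline_probability(d), 4), classify(d))
```

With these placeholder coefficients, a target whose allele frequency matches its proxies scores close to 1, and the score falls toward 0 as the distance grows. The complementary value, one minus the output, can be read as the somatic likelihood.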
In certain embodiments, for example, the following items are provided:
(Item 1) A method for identifying a target genome sequence as germline or somatic, the method comprising: providing a plurality of nucleic acid molecules obtained from a sample from a subject, wherein the plurality of nucleic acid molecules includes a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selectively attaching one or more adapters to one or more nucleic acid molecules from the plurality of nucleic acid molecules; amplifying nucleic acid molecules from the plurality of nucleic acid molecules; capturing nucleic acid molecules from the amplified nucleic acid molecules, wherein the captured nucleic acid molecules are captured from the amplified nucleic acid molecules by hybridization to one or more bait molecules; sequencing the captured nucleic acid molecules using a sequencer to obtain a plurality of sequence reads corresponding to one or more genomic loci; selecting, by one or more processors, a target genome sequence at a genomic locus from the one or more genomic loci; selecting, by the one or more processors, one or more proxy genome sequences for the target genome sequence; determining, by the one or more processors, an allele frequency distance using summary statistics or distributions representing the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic using the allele frequency distance.
(Item 2) The method according to item 1, wherein the subject is a cancer patient.
(Item 3) The method according to item 1 or item 2, wherein the sample includes a tissue biopsy sample, a liquid biopsy sample, a circulating tumor cell (CTC) sample, a cell-free DNA (cfDNA) sample, or a normal control.
(Item 4) The method according to item 3, wherein the sample is a liquid biopsy sample and includes blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.
(Item 5) The method according to any one of items 1 to 3, wherein the tumor nucleic acid molecules are derived from the tumor portion of a heterogeneous tissue biopsy sample, and the non-tumor nucleic acid molecules are derived from the normal portion of the heterogeneous tissue biopsy sample.
(Item 6) The method according to any one of items 1 to 3, wherein the tumor nucleic acid molecules are derived from the circulating tumor DNA (ctDNA) fraction of a cell-free DNA sample, and the non-tumor nucleic acid molecules are derived from the non-tumor fraction of the cell-free DNA sample.
(Item 7) The method according to any one of items 1 to 6, wherein the one or more adapters include an amplification primer or a sequencing adapter.
(Item 8) The method according to any one of items 1 to 7, wherein the one or more bait molecules comprise one or more nucleic acid molecules, and each nucleic acid molecule comprises a region complementary to a region of a captured nucleic acid molecule.
(Item 9) The method according to any one of items 1 to 8, wherein amplifying the nucleic acid molecules includes performing a polymerase chain reaction (PCR) or an isothermal amplification technique.
(Item 10) The method according to any one of items 1 to 9, wherein the sequencing includes the use of next-generation sequencing (NGS) technology.
(Item 11) The method according to any one of items 1 to 10, wherein the sequencer includes a next-generation sequencer.
(Item 12) The method according to any one of items 1 to 11, wherein the one or more proxy genome sequences are located within a defined segment of the target genome sequence, and the selected target genome sequence is located within the same defined segment.
(Item 13) The method according to item 12, wherein the target genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment.
(Item 14) The method according to any one of items 1 to 13, wherein the summary statistic is the mean allele frequency or the median allele frequency.
(Item 15) The method according to any one of items 1 to 14, wherein the allele frequency distance is determined using a distribution representing the observed allele frequency of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence does or does not fit within the distribution.
(Item 16) A method for identifying a target genome sequence as germline or somatic, the method comprising: selecting, by one or more processors, a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting, by the one or more processors, one or more proxy genome sequences for the target genome sequence; determining, by the one or more processors, an allele frequency distance using summary statistics or distributions representing the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic using the allele frequency distance.
(Item 17) The method according to item 16, comprising sequencing the tumor nucleic acid molecules and non-tumor nucleic acid molecules from the patient sample using a sequencer to determine the patient genome sequence.
(Item 18) The method according to item 17, wherein the patient genome sequence is obtained using next-generation sequencing technology.
(Item 19) The method according to item 17, wherein the sequencer is a next-generation sequencer.
(Item 20) The method according to any one of items 16 to 19, wherein the one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment.
(Item 21) The method according to item 20, wherein the patient genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment.
(Item 22) The method according to item 20 or 21, comprising segmenting the patient genome sequence into multiple segments.
(Item 23) The method according to any one of items 16 to 22, wherein the summary statistic is the mean allele frequency or the median allele frequency.
(Item 24) The method according to any one of items 16 to 23, wherein the allele frequency distance is determined using a distribution representing the observed allele frequency of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence does or does not fit within the distribution.
(Item 25) The method according to any one of items 16 to 24, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include DNA molecules.
(Item 26) The method according to any one of items 16 to 25, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include RNA molecules.
(Item 27) The method according to any one of items 16 to 26, wherein the patient genome sequence is determined using targeted sequencing.
(Item 28) The method according to item 27, wherein the targeted sequencing comprises targeted sequencing of one or more cancer-related genes or portions thereof.
(Item 29) The method according to item 27 or item 28, wherein the targeted sequencing comprises targeted sequencing of one or more exon regions.
(Item 30) A method for identifying a target genome sequence as germline or somatic, the method comprising: identifying, by one or more processors, a target genome sequence in a patient sample at a genomic locus; identifying, by the one or more processors, one or more proxy genome sequences for the target sequence; comparing, by the one or more processors, the observed frequency of the target genome sequence with a centrality measure of the observed frequency of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic based on the comparison.
(Item 31) The method according to item 30, further comprising identifying, by the one or more processors, a segment of the patient's genome containing the genomic locus.
(Item 32) The method according to item 31, wherein identifying the segment by the one or more processors includes performing a segmentation procedure on a contiguous portion of the patient's genome.
(Item 33) The method according to item 32, wherein the portion of the patient's genome is large enough to identify three different segments.
(Item 34) The method according to item 31, wherein the proxy is identified by the one or more processors such that it is located within the same segment as the genomic locus.
(Item 35) The method according to item 32, wherein the segmentation procedure involves identifying segments, by the one or more processors, according to whether a genomic parameter is equal throughout each individual segment.
(Item 36) The method according to item 35, wherein the genomic parameter is the copy number.
(Item 37) The method according to any one of items 16 to 36, wherein identifying, by the one or more processors, the target genome sequence as germline or somatic comprises: inputting the allele frequency distance into a trained statistical model; and outputting, from the trained statistical model, a value indicating the likelihood that the target genome sequence is germline, or a value indicating the likelihood that the target genome sequence is somatic.
(Item 38) The method according to any one of items 16 to 37, wherein the allele frequency distance is adjusted to correct for contamination levels in the patient sample, low sequencing read depth, noisy estimates of allele frequencies, low segment germline single nucleotide polymorphism (SNP) counts, or high variability in segment germline SNP allele frequencies.
(Item 39) The method according to item 37 or item 38, wherein the trained statistical model includes a function that associates the allele frequency distance with a value indicating the likelihood that the target genome sequence is germline, or with a value indicating the likelihood that the target genome sequence is somatic.
(Item 40) The method according to any one of items 37 to 39, wherein the trained statistical model is a logistic regression model.
(Item 41) The method according to any one of items 37 to 40, further comprising training the statistical model using data on tumor samples having known germline sequences.
(Item 42) The method according to any one of items 37 to 41, further comprising training the statistical model using data on tumor samples having known germline sequences and known somatic sequences.
(Item 43) The method according to any one of items 37 to 40, wherein the trained statistical model is trained using data on tumor samples having known germline sequences.
(Item 44) The method according to item 43, wherein the trained statistical model is trained using data from tumor samples having known germline sequences and known somatic sequences.
(Item 45) The method according to any one of items 37 to 44, further comprising training the statistical model with variant allele frequency data that excludes variants located in genomic regions known to have allele frequencies that deviate from expected values.
(Item 46) The method according to any one of items 37 to 44, wherein the trained statistical model is trained using variant allele frequency data that excludes variants located in genomic regions known to have allele frequencies that deviate from expected values.
(Item 47) The method according to any one of items 37 to 46, further comprising training the statistical model with data that incorporate prior knowledge, based on historical data or a database, of the likelihood of a variant being a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant.
(Item 48) The method according to any one of items 37 to 46, wherein the trained statistical model is trained using data that incorporate prior knowledge, based on historical data or a database, of the likelihood of a variant being a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant.
(Item 49) The method according to any one of items 37 to 48, further comprising training the statistical model with data describing the noise level for a given variant call and its genomic context.
(Item 50) The method according to any one of items 37 to 48, wherein the trained statistical model is trained using data describing the noise level for a given variant call and its genomic context.
(Item 51) The method according to any one of items 16 to 50, wherein the one or more proxy genome sequences include a single nucleotide polymorphism (SNP).
(Item 52) The method according to any one of items 16 to 51, wherein the one or more proxy genome sequences include alleles.
(Item 53) The method according to any one of items 16 to 52, wherein the target genome sequence includes a genomic variant.
(Item 54) The method according to any one of items 16 to 53, further comprising generating, by the one or more processors, a report identifying the target genome sequence as germline or somatic.
(Item 55) The method according to item 54, further comprising sending the report to a healthcare provider.
(Item 56) The method according to item 54 or item 55, wherein the report is transmitted via a computer network or a peer-to-peer connection.
(Item 57) The method according to any one of items 16 to 56, wherein the patient sample is derived from a tissue biopsy including tumor tissue and non-tumor tissue.
(Item 58) The method according to item 57, wherein the tissue biopsy is a solid tissue biopsy or a liquid biopsy.
(Item 59) The method according to item 58, wherein the tissue biopsy is a liquid biopsy including blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.
(Item 60) The method according to any one of items 16 to 59, wherein the patient sample includes cell-free DNA (cfDNA) obtained from the subject.
(Item 61) The method according to any one of items 16 to 60, wherein the patient sample includes circulating tumor DNA (ctDNA) obtained from the subject.
(Item 62) A method of treating cancer in a patient, the method comprising: identifying, by one or more processors, one or more target genome sequences as somatic according to the method of any one of items 16 to 61; selecting a cancer treatment modality based on the one or more identified somatic sequences; and treating the cancer using the selected cancer treatment modality.
(Item 63) The method according to item 62, wherein the one or more identified somatic sequences are associated with the success of cancer treatment using the selected cancer treatment modality.
(Item 64) The method according to item 62, comprising: determining, by the one or more processors, a microsatellite instability status of the cancer using the one or more identified somatic sequences; and selecting the cancer treatment modality based on the microsatellite instability status of the cancer.
(Item 65) The method according to item 62, comprising determining, by the one or more processors, a tumor mutational burden for the cancer using the one or more identified somatic sequences, wherein the cancer treatment modality is selected based on the tumor mutational burden exceeding a predetermined tumor mutational burden threshold.
(Item 66) The method according to item 64 or item 65, wherein the cancer treatment modality includes administering an effective amount of one or more anticancer agents to the patient when the tumor mutational burden exceeds a predetermined threshold.
(Item 67) The method according to item 66, wherein the one or more anticancer agents include a cancer immunotherapy agent.
(Item 68) The method according to item 67, wherein the cancer immunotherapy agent is an immune checkpoint inhibitor.
(Item 69) A method for monitoring the progression or recurrence of cancer in a patient, the method comprising: identifying one or more target genome sequences as somatic according to the method of any one of items 16 to 67, wherein the patient sample is a first patient sample obtained from a patient with cancer; and detecting, by the one or more processors, the presence or absence of the one or more target genome sequences identified as somatic in a second patient sample obtained from the patient after the cancer has been treated.
(Item 70) The method according to item 69, comprising obtaining the second patient sample from the patient.
(Item 71) The method according to item 69 or item 70, comprising treating the cancer of the patient after the first patient sample is obtained from the patient and before the second patient sample is obtained from the patient.
(Item 72) The method according to any one of items 69 to 71, wherein the second patient sample contains cell-free DNA.
(Item 73) The method according to any one of items 69 to 72, wherein detecting the presence or absence of the one or more target genome sequences identified as somatic in the second patient sample comprises sequencing nucleic acid molecules in the second patient sample.
(Item 74) A method for selecting a neoantigen for a personalized cancer vaccine for a subject with cancer, the method comprising: identifying one or more target genome sequences as somatic according to the method of any one of items 16 to 67, wherein the one or more target genome sequences identified as somatic are located within an exon region of a gene; and selecting, by the one or more processors, from the one or more target genome sequences identified as somatic, a genome sequence encoding a neoantigen suitable for a cancer vaccine for the subject.
(Item 75) The method according to item 74, further comprising preparing a vaccine containing the neoantigen.
(Item 76) A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: select a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; select one or more proxy genome sequences for the target genome sequence; determine an allele frequency distance using summary statistics or distributions representing the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identify the target genome sequence as germline or somatic using the allele frequency distance.
(Item 77) The non-transitory computer-readable storage medium according to item 76, wherein the one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment.
(Item 78) The non-transitory computer-readable storage medium according to item 77, wherein the patient genome sequence is segmented into multiple segments based on the uniformity of copy numbers within each segment.
(Item 79) The non-transitory computer-readable storage medium according to item 77 or 78, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to segment the patient genome sequence into a plurality of segments.
(Item 80) The non-transitory computer-readable storage medium according to any one of items 76 to 79, wherein the summary statistic is the mean allele frequency or the median allele frequency.
(Item 81) The non-transitory computer-readable storage medium according to any one of items 76 to 80, wherein the allele frequency distance is determined using a distribution representing the observed allele frequency of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence does or does not fit within the distribution.
(Item 82) The non-transitory computer-readable storage medium according to any one of items 76 to 81, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include DNA molecules.
(Item 83) The non-transitory computer-readable storage medium according to any one of items 76 to 82, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include RNA molecules.
(Item 84) The non-transitory computer-readable storage medium according to any one of items 76 to 83, wherein the patient genome sequence is determined using targeted sequencing.
(Item 85) The non-transitory computer-readable storage medium according to any one of items 76 to 84, wherein the patient genome sequence is determined using next-generation sequencing.
(Item 86) The non-transitory computer-readable storage medium according to item 84 or 85, wherein the targeted sequencing includes targeted sequencing of one or more cancer-related genes or portions thereof.
(Item 87) The non-transitory computer-readable storage medium according to any one of items 84 to 86, wherein the targeted sequencing comprises targeted sequencing of one or more exon regions.
(Item 88) A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: identify a target genome sequence in a patient sample at a genomic locus; identify one or more proxy genome sequences for the target sequence; compare the observed frequency of the target genome sequence with a centrality measure of the observed frequency of the one or more proxy genome sequences; and identify the target genome sequence as germline or somatic based on the comparison.
(Item 89) The non-transitory computer-readable storage medium according to item 88, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to identify a segment of the patient's genome containing the genomic locus.
(Item 90) The non-transitory computer-readable storage medium according to item 88, wherein identifying the segment involves performing a segmentation procedure on a contiguous portion of the patient's genome.
(Item 91) The non-transitory computer-readable storage medium according to item 90, wherein the portion of the patient's genome is large enough to identify three different segments.
(Item 92) The non-transitory computer-readable storage medium according to any one of items 88 to 91, wherein the one or more proxy genome sequences are identified as being located on the same segment as the genomic locus.
(Item 93) The non-transitory computer-readable storage medium according to any one of items 90 to 92, wherein the segmentation procedure identifies segments according to whether a genomic parameter is equal throughout each individual segment.
(Item 94) The non-transitory computer-readable storage medium according to item 93, wherein the genomic parameter is the copy number.
(Item 95) The non-transitory computer-readable storage medium according to any one of items 76 to 94, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to receive sequencing data relating to the patient genome sequence.
(Item 96) The non-transitory computer-readable storage medium according to item 95, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to assemble the patient genome sequence using the sequencing data.
(Item 97) The non-transitory computer-readable storage medium according to item 95 or 96, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause a sequencer to sequence nucleic acid molecules derived from the patient sample and thereby obtain the sequencing data.
(Item 98) The non-transitory computer-readable storage medium according to any one of items 76 to 97, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to generate a report identifying the target genome sequence as germline or somatic.
(Item 99) The non-transitory computer-readable storage medium according to any one of items 76 to 98, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to transmit the report using a computer network.
(Item 100) The non-transitory computer-readable storage medium according to any one of items 76 to 99, wherein the electronic device comprises a display, and the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to display the report.
(Item 101) The non-transitory computer-readable storage medium according to any one of items 76 to 100, wherein the one or more proxy genome sequences comprise single nucleotide polymorphisms (SNPs). (Item 102) The non-transitory computer-readable storage medium according to any one of items 76 to 101, wherein the one or more proxy genome sequences comprise alleles. (Item 103) The non-transitory computer-readable storage medium according to any one of items 76 to 102, wherein the target genome sequence comprises a genomic variant. (Item 104) An electronic device comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including: instructions for selecting a target genome sequence at a genomic locus from within a patient genome sequence obtained from a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; instructions for selecting one or more proxy genome sequences for the target genome sequence; instructions for determining an allele frequency distance using a summary statistic or distribution indicative of the observed allele frequency of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and instructions for identifying the target genome sequence as germline or somatic using the allele frequency distance. (Item 105) The electronic device according to item 104, wherein the one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment. (Item 106) The electronic device according to item 105, wherein the patient genome sequence is segmented into multiple segments based on the uniformity of copy number within each segment.
(Item 107) The electronic device according to any one of items 104 to 106, wherein the one or more programs further include instructions for segmenting the patient genome sequence into multiple segments. (Item 108) The electronic device according to any one of items 104 to 107, wherein the summary statistic is the mean allele frequency or the median allele frequency. (Item 109) The electronic device according to any one of items 104 to 108, wherein the allele frequency distance is determined using a distribution indicative of the observed allele frequency of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence fits within the distribution. (Item 110) The electronic device according to any one of items 104 to 109, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include DNA molecules. (Item 111) The electronic device according to any one of items 104 to 110, wherein the tumor nucleic acid molecules and the non-tumor nucleic acid molecules include RNA molecules. (Item 112) The electronic device according to any one of items 104 to 111, wherein the patient genome sequence is determined using next-generation sequencing. (Item 113) The electronic device according to any one of items 104 to 112, wherein the patient genome sequence is determined using targeted sequencing. (Item 114) The electronic device according to item 113, wherein the targeted sequencing includes targeted sequencing of one or more cancer-related genes or a portion thereof. (Item 115) The electronic device according to item 113 or item 114, wherein the targeted sequencing includes targeted sequencing of one or more exon regions.
(Item 116) An electronic device comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including: instructions for identifying a target genome sequence in a patient sample at a genomic locus; instructions for identifying one or more proxy genome sequences for the target genome sequence; instructions for comparing the observed frequency of the target genome sequence with a centrality measure of the observed frequencies of the one or more proxy genome sequences; and instructions for identifying the target genome sequence as germline or somatic based on the comparison. (Item 117) The electronic device according to item 116, wherein the one or more programs further include instructions for identifying a segment of the patient's genome in which the genomic locus is located. (Item 118) The electronic device according to item 117, wherein identifying the segment includes performing a segmentation procedure on a contiguous portion of the patient's genome. (Item 119) The electronic device according to item 118, wherein the portion of the patient's genome is large enough to identify three different segments. (Item 120) The electronic device according to any one of items 117 to 119, wherein the one or more proxy genome sequences are identified as being located within the same segment as the genomic locus. (Item 121) The electronic device according to any one of items 118 to 120, wherein the segmentation procedure identifies segments according to whether a genomic parameter is equal throughout each individual segment. (Item 122) The electronic device according to item 121, wherein the genomic parameter is copy number.
(Item 123) The electronic device according to any one of items 104 to 122, wherein the one or more programs further include instructions for receiving sequencing data related to the patient genome sequence. (Item 124) The electronic device according to item 123, wherein the one or more programs further include instructions for assembling the patient genome sequence using the sequencing data. (Item 125) The electronic device according to item 123 or 124, wherein the one or more programs further include instructions for causing a sequencer to sequence nucleic acid molecules derived from the patient sample and thereby obtain the sequencing data. (Item 126) The electronic device according to any one of items 104 to 125, wherein the one or more proxy genome sequences include a single nucleotide polymorphism (SNP). (Item 127) The electronic device according to any one of items 104 to 126, wherein the one or more proxy genome sequences comprise alleles. (Item 128) The electronic device according to any one of items 104 to 127, wherein the target genome sequence includes a genomic variant. (Item 129) The electronic device according to any one of items 104 to 128, wherein the one or more programs further include instructions for generating a report indicating that the target genome sequence is germline or somatic. (Item 130) The electronic device according to item 129, wherein the one or more programs further include instructions for transmitting the report via a computer network or peer-to-peer connection. (Item 131) The electronic device according to item 129 or 130, wherein the device further comprises a display, and the one or more programs further include instructions for displaying the report. (Item 132) The electronic device according to any one of items 104 to 131, wherein the patient sample is derived from a tissue biopsy containing tumor tissue and non-tumor tissue.
(Item 133) The electronic device according to item 132, wherein the tissue biopsy is a solid tissue biopsy or a liquid biopsy. (Item 134) The electronic device according to item 133, wherein the tissue biopsy is a liquid biopsy including blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. (Item 135) The electronic device according to any one of items 104 to 134, wherein the patient sample contains cell-free DNA (cfDNA) obtained from the subject. (Item 136) The electronic device according to any one of items 104 to 135, wherein the patient sample contains circulating tumor DNA (ctDNA) obtained from the subject. (Item 137) A system comprising the electronic device according to any one of items 104 to 136 and a sequencer configured to sequence nucleic acid molecules derived from the patient sample. (Item 138) The system according to item 137, wherein the sequencer is a next-generation sequencer. (Item 139) A method for identifying a target genome sequence as germline or somatic, the method comprising: identifying, using one or more processors, a target genome sequence in a patient sample at a genomic locus; identifying, by the one or more processors, a proxy genome sequence for the target genome sequence; comparing, by the one or more processors, the observed allele fraction of the target genome sequence with the observed allele fraction of the proxy genome sequence; and identifying, by the one or more processors, the target genome sequence as germline or somatic based on the comparison. (Item 140) The method according to item 139, wherein the proxy genome sequence has the same copy number as the target genome sequence.
(Item 141) The method according to item 139 or item 140, wherein identifying, by the one or more processors, the target genome sequence as germline or somatic comprises inputting an allele frequency distance into a trained statistical model, wherein the trained statistical model outputs a value indicating the likelihood that the target genome sequence is germline, or a value indicating the likelihood that the target genome sequence is somatic. (Item 142) The method according to any one of items 139 to 141, wherein the allele fraction of the target genome sequence and the allele fraction of the proxy genome sequence are determined using next-generation sequencing technology. (Item 143) The method according to item 142, wherein the allele fraction of the target genome sequence and the allele fraction of the proxy genome sequence are determined using microarray technology. (Item 144) The method according to any one of items 139 to 143, wherein the patient sample includes a solid tissue biopsy or a liquid biopsy. (Item 145) The method according to item 144, wherein the patient sample is a liquid biopsy containing blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. (Item 146) The method according to any one of items 139 to 145, wherein the patient sample includes cell-free DNA (cfDNA) obtained from the subject. (Item 147) The method according to any one of items 139 to 146, wherein the patient sample includes circulating tumor DNA (ctDNA) obtained from the subject. (Item 148) The method according to any one of items 139 to 147, wherein the patient is a cancer patient.

[0064] Incorporation by reference All publications, patents, and patent applications mentioned in this specification are hereby incorporated by reference in their entirety, as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference in its entirety. In case of conflict between the terms of this specification and the terms of the incorporated references, the terms of this specification shall govern.

Brief Description of the Drawings

[0065] [Figure 1] FIG. 1 is a schematic diagram of a section of a patient's genome.

[0066] [Figure 2] FIG. 2 is a flowchart of a process for distinguishing between germline and somatic genomic sequences.

[0067] [Figure 3] FIG. 3 is a schematic diagram of genomic segmentation.

[0068] [Figure 4] FIG. 4 shows an exemplary system including an electronic device that can be used to execute the method described in this specification.

[0069] [Figure 5A] Figure 5A illustrates an exemplary process for determining the expected differences in variant allele fractions for somatic and germline variants given the same tumor fraction, ploidy, and copy number.

[0070] [Figure 5B] Figure 5B shows an exemplary method for determining allele frequency distances from expected germline allele frequencies (AFDIS), and an exemplary density distribution of AFDIS, from which an empirical cumulative distribution function (ECDF) can be constructed.

[0071] [Figure 5C] Figure 5C shows an exemplary plot of AFDIS plotted against the calculated purity of the tumor sample.

[0072] [Figure 5D] Figure 5D shows a non-limiting example of a ROC curve for the classification of somatic and germline variants in a tumor sample by the method disclosed herein.

[0073] [Figure 5E] Figure 5E shows a non-limiting example of a probability plot of an exemplary logistic regression model that may be used in several embodiments.

[0074] [Figure 5F] Figure 5F shows plots of somatic cell probabilities for different variants determined using an exemplary logistic regression model.

[0075] [Figure 5G] Figure 5G shows the improvement of the claimed method over the conventional SGZ method.

[0076] [Figure 5H] Figure 5H shows a non-limiting example of sensitivity plots of training and test data used to train and test a logistic regression model according to the exemplary methods disclosed herein.

[0077] [Figure 5I] Figure 5I shows a non-limiting example of positive predictive value (PPV) plots for training and test data used to train and test a logistic regression model according to the exemplary methods disclosed herein.

[0078] [Figure 5J] Figure 5J shows a non-limiting example of data for variant classification in the BRCA1 and BRCA2 genes using an exemplary embodiment of the described method.

[0079] [Figure 5K] Figure 5K shows a non-limiting example of data for variant classification in the STK11 gene using an exemplary embodiment of the described method.

[0080] [Figure 6A] Figure 6A shows a non-limiting example of a plot of variant allele frequency (AF) versus segment minor allele frequency (MAF) for known germline variants in tumor samples.

[0081] [Figure 6B] Figure 6B shows a non-limiting example of plots of density versus variant AF corresponding to segment MAF values of 0.1, 0.2, and 0.3, respectively, derived from the data plotted in Figure 6A.

[Modes for carrying out the invention]

[0082] Detailed explanation Methods, devices, and computer-readable media for distinguishing somatic cell genome sequences from germline genome sequences are described herein. A target genome sequence in a patient sample at a genomic locus can be identified. One or more proxy genome sequences can then be identified for the target sequence. The observed frequency of the target sequence can be compared to a centrality measure of the observed frequencies of one or more proxy genome sequences, and based on this comparison, the target genome sequence can be characterized as either a germline sequence or a somatic cell sequence.

[0083] Several methods have been developed to determine the somatic / germline status of variants in single-sample settings. These include matching against public germline databases such as dbSNP, or using a substitute normal constructed from a large number of normal individuals in place of a matched normal. See, for example, Hiltemann et al., "Discriminating somatic and germline mutations in tumor DNA samples without matching normals," Genome Res., Vol. 25, No. 9, pp. 1382-1390 (2015). However, such methods are ineffective for rare germline variants limited to families or small populations. There is also the so-called "basic method," which treats variants with allele frequencies (or allele fractions) close to 50% or 100% as germline and classifies all other variants as somatic. See Jones et al., "Personalized genomic analyses for cancer mutation discovery and interpretation," Sci. Transl. Med., Vol. 7, No. 283, pp. 283ra53 (2015). This basic method does not account for the fact that aneuploidy can shift the allele frequencies of germline variants significantly away from the expected 50% or 100%. The terms “allele frequency” and “allele fraction” are used interchangeably herein and refer to the proportion of sequence reads corresponding to a particular allele relative to the total number of sequence reads for a genomic locus.
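
As a non-limiting illustration of the "basic method" discussed above, the following Python sketch calls a variant germline when its allele fraction lies near 50% or 100%. The function name and the 10% tolerance are assumptions chosen for this example and are not taken from the cited references.

```python
def basic_method_call(allele_fraction, tolerance=0.10):
    """Call 'germline' if the allele fraction is near 0.5 or 1.0, else 'somatic'.

    The tolerance is a hypothetical value used only for illustration.
    """
    near_heterozygous = abs(allele_fraction - 0.5) <= tolerance
    near_homozygous = abs(allele_fraction - 1.0) <= tolerance
    return "germline" if (near_heterozygous or near_homozygous) else "somatic"


# A heterozygous germline variant in a copy-neutral region
call_balanced = basic_method_call(0.50)
# A germline variant carried on 2 of 3 copies in a high-purity sample
call_aneuploid = basic_method_call(0.67)
```

In this sketch the second variant is labeled somatic even though it was posited to be germline, which is the failure mode attributed above to aneuploidy.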

[0084] The SGZ (somatic-germline-zygosity) algorithm, published in early 2018, sought to solve the single-sample somatic / germline classification problem by taking into account tumor content, tumor ploidy, and local copy number. SGZ demonstrated significantly better accuracy in somatic / germline identification on validation datasets than the "basic method" (Sun et al., "A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal," PLoS Comput Biol., Vol. 14, No. 2, e1005965 (2018), which is incorporated herein by reference in its entirety). The application of the SGZ algorithm to FMI's massively parallel sequencing (MPS)-based diagnostic products has enabled effective somatic / germline status determination for short variants (substitutions and indels), making it an essential tool for applications such as tumor mutational burden (TMB) estimation.

[0085] The method described herein for somatic / germline classification represents a further improvement over the SGZ approach. The new approach is based on the same fundamental principle: in tumor / normal mixtures, somatic and germline variants often have different expected allele frequencies determined by tumor fraction, tumor ploidy, and local copy number. However, in contrast to SGZ, which estimates expected germline allele frequencies by computational modeling of tumor fraction, tumor ploidy, and local copy number, the novel method disclosed herein directly infers expected germline allele frequencies from known germline SNPs located on the same copy number segment as the variant in question. Therefore, using the method herein, it is not necessary to determine or model copy number or tumor purity to obtain accurate calls for somatic and germline variants.
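
The principle that somatic and germline variants have different expected allele frequencies in a tumor / normal mixture can be sketched numerically. The Python example below uses a simple mixture model that assumes a diploid normal compartment and a heterozygous germline variant; this model, the function name, and the parameter names are assumptions made for illustration and are not formulas stated in this specification.

```python
def expected_allele_fractions(purity, tumor_copy_number, variant_copies_in_tumor):
    """Return (germline_af, somatic_af) under a simple mixture model.

    Assumptions (illustrative only): normal cells are diploid, a germline
    variant is heterozygous (one copy per normal cell), and a somatic
    variant is absent from normal cells.
    """
    # Total allele copies at the locus per "average" cell in the mixture
    total = purity * tumor_copy_number + 2.0 * (1.0 - purity)
    germline_af = (purity * variant_copies_in_tumor + (1.0 - purity)) / total
    somatic_af = (purity * variant_copies_in_tumor) / total
    return germline_af, somatic_af


# Tumor fraction 50%, local copy number 3, variant on 2 of the 3 tumor copies
germline_af, somatic_af = expected_allele_fractions(0.5, 3, 2)
print(round(germline_af, 3), round(somatic_af, 3))  # 0.6 0.4
```

Under these assumed inputs the two classes are separated by 0.2 in allele fraction, and the germline value is not 50%. The proxy-based method described above obtains the expected germline value from nearby germline SNPs instead of from such a model.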

[0086] In some embodiments, a trained model, such as a logistic regression model, is used to predict the probability that a variant is somatic based on the difference between the observed variant allele frequency and the estimated expected germline variant allele frequency. In some embodiments, the model is trained with matched tumor / normal pair data and validated on independent datasets. In some embodiments, the model is trained with data from tumor samples having known germline (and, optionally, known somatic) sequences. In some embodiments, the model is trained with data from mixed tumor / normal samples having known germline (and, optionally, known somatic) sequences. This validation demonstrates that the novel classification index is superior to SGZ in sensitivity and positive predictive value (PPV) for somatic variant classification.
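
A minimal sketch of such a trained model follows, assuming a single input feature (the absolute difference between the observed variant allele frequency and the estimated expected germline allele frequency). The training values, learning rate, and function names are hypothetical, and plain gradient descent is used here only to keep the example self-contained; it is not asserted to be the training procedure of any embodiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(distances, labels, learning_rate=0.5, epochs=5000):
    """Fit P(somatic) = sigmoid(w * distance + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for d, y in zip(distances, labels):
            error = sigmoid(w * d + b) - y
            grad_w += error * d
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical training data: allele frequency distance, label 1 = somatic
train_distances = [0.01, 0.02, 0.03, 0.05, 0.15, 0.20, 0.25, 0.30]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(train_distances, train_labels)


def somatic_probability(distance):
    """Predicted probability that a variant with this distance is somatic."""
    return sigmoid(w * distance + b)
```

With these hypothetical data, a variant whose allele frequency sits close to the expected germline value receives a somatic probability below 0.5, and one that sits far from it receives a probability above 0.5.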

[0087] The determined genome sequence may be a somatic variant sequence or a germline sequence. There are publicly accessible databases of known germline sequences (e.g., dbSNP or gnomAD (see gnomad.broadinstitute.org)). A match between a known germline sequence and a sequence determined by sequencing nucleic acids in a sample obtained from the subject indicates a high probability that the sequence associated with the sample is a germline sequence. However, a failure to match a known germline sequence does not demonstrate that the sequence is a somatic variant sequence, as it may be a previously unknown (or unrecorded) germline sequence of the subject. The methods described herein enable the classification of sequences as germline sequences or somatic variant sequences.

[0088] Method for calling somatic cell sequences or germline sequences The methods described herein enable the identification of a target genomic sequence as a germline sequence or a somatic cell sequence. In some embodiments, the somatic cell sequence is related to the patient's cancer. For example, the patient sample may comprise a mixture of tumor nucleic acid molecules (i.e., nucleic acid molecules derived either directly (e.g., in the case of a tumor biopsy) or indirectly (e.g., in the case of a liquid biopsy or body fluid sample containing circulating tumor DNA (ctDNA) and cell-free DNA (cfDNA)) from the tumor) and non-tumor nucleic acid molecules (i.e., nucleic acid molecules derived from non-tumorous, preferably healthy tissue, cells, liquid biopsy samples, or body fluid samples). The method may comprise the steps of selecting a target genomic sequence from within the patient genomic sequence (i.e., a genomic sequence obtained for the patient, which may be the whole genome or a portion thereof (e.g., an exome or target region within the whole genome)) and selecting one or more proxy genomic sequences for the target genomic sequence. The patient genomic sequence may comprise one or more alleles at any given locus (e.g., a somatic cell sequence and / or germline sequence at any given locus).

[0089] The patient's genome sequence can be determined by sequencing nucleic acid molecules from a sample (e.g., a mixed tumor / normal tissue sample, or a cell-free DNA (cfDNA) sample containing a mixture of ctDNA and non-tumor cfDNA). The target genome sequence can be identified or selected from the patient's genome sequence at a genomic locus. The selected genome sequence is a test sequence to be characterized as germline or somatic. In some embodiments, the target genome sequence differs from a reference sequence. In some embodiments, the target genome sequence differs from a sequence in a selected germline sequence database.

[0090] Figure 1 is a schematic diagram of a sample genome region. Region 100 may include the entire genome of an organism, or it may include only a fraction of the entire genome. Region 100 is shown as a continuous line in Figure 1, but generally, region 100 may include several components that are physically separated on the organism's chromosome(s). In some implementations, the sample from which region 100 is determined may include normal patient tissue, fluids containing normal cells or cell-free DNA, or other anatomical material. In some implementations, the sample may include abnormal (e.g., cancerous or genetically mutated) tissue, fluids containing abnormal cells or circulating tumor DNA, or other anatomical material. In some implementations, the sample may include a combination of normal and abnormal tissue, body fluids, or other anatomical material.

[0091] The genomic region 100 shown in Figure 1 may correspond to a single strand or strand fragment of DNA, or a strand or strand fragment of RNA. Although not shown in Figure 1, region 100 contains sequences of various bases (i.e., cytosine ("C"), guanine ("G"), adenine ("A"), thymine ("T"), or uracil ("U")). Specific sequences of bases can often determine important anatomical features or characteristics of a patient, such as whether the patient has cancer, and if so, whether a treatment may be effective or ineffective.

[0092] The techniques described below involve characterizing a target sequence 102 within a genomic region 100 as either germline or somatic. Characterization is assisted by the use of a reference sequence 104, which is an exemplary genomic sequence representing a “normal” (e.g., non-cancerous) patient. In some implementations, the reference sequence 104 may include a sequence determined by the Human Genome Project, e.g., hg19.

[0093] Reference sequence 104 contains known polymorphic regions 106a and 106b. Polymorphic regions 106a and 106b are regions (containing any number of bases, from a single base to hundreds or more bases) where a variation in the genome sequence of a particular organism is expected across a population of organisms without corresponding adverse effects. For example, humans have polymorphic regions corresponding to various hair colors, eye colors, or other individualized features. Genomic region 100, corresponding to an actual patient sample, has specific base values 108a and 108b at the locations in region 100 corresponding to polymorphic regions 106a and 106b in reference sequence 104. In other words, polymorphic regions 106a and 106b in reference sequence 104 are the locations where a person's specific features (e.g., hair color) are determined. Base values 108a and 108b are the individualized determinations of those features (e.g., red hair) that describe a particular patient.

[0094] In some cases, the polymorphic regions 106a and 106b contain one or more single nucleotide polymorphisms (or "SNPs"). In some cases, the polymorphic region may contain the entire allele or a portion thereof.

[0095] Figure 2 is a flowchart of the process for distinguishing germline and somatic cell genome sequences. Process 200 begins with identifying (i.e., selecting or classifying) the genomic region of interest (step 202). In some implementations, step 202 includes identifying the region of interest (i.e., the sequence of interest) 102 from within a larger genomic region 100.

[0096] The determination of a genome sequence (e.g., genomic region 100) from a physical sample can be achieved in various ways. One such method is described in U.S. Patent No. 9,340,830, and another is described in U.S. Patent Publication No. 2017/0356053, both of which are incorporated herein by reference in their entirety. More generally, there is a category of machines, called genome sequencers, capable of determining the genetic sequence of an input sample. In some cases, the disclosed methods and systems may be implemented using any of various next-generation sequencing (NGS) technologies and sequencers, including cyclic array sequencers and single-molecule sequencers configured for massively parallel sequencing. Furthermore, various subregions of the genomes of humans and other organisms are known to be associated with various medical conditions.

[0097] The techniques described herein do not depend on the use of a specific sequencing platform or specific sequencing technique, and any of these machines and associated techniques can be used in step 202. In some cases, the disclosed methods may be implemented using alternative nucleic acid sequencing techniques, such as microarrays and fluorescence in situ hybridization (FISH).

[0098] In some implementations, the region of interest (i.e., sequence) 102 is identified as corresponding to a known gene locus within the reference sequence 104. In some implementations, the region of interest 102 corresponds to a mutation relative to the reference sequence 104 (i.e., a subsection of the genomic region 100, other than a polymorphic region, whose sequence differs from that of the corresponding portion of the reference sequence 104). In some implementations, the sequence of interest corresponds to a gene associated with a medical condition the patient has. In some implementations, the region of interest 102 is an oncogene or a part thereof.

[0099] In step 204, one or more proxy genome sequences are identified for the target genome sequence. The selected one or more proxy genome sequences may be known germline sequences (e.g., based on matching with known germline sequences from a database of known germline sequences, or by sequencing healthy tissue, cells, or cell-free DNA from the subject or another healthy individual). Referring to Figure 1, one characterization of proxy 110 is that it is a sequence at a locus that is known to (a) encode germline genetic information and (b) have the same copy number as the target sequence 102 (e.g., because it has been confirmed to be physically close to, or located within the same copy number segment as, the target sequence 102). An alternative characterization requires that proxy 110 be known to encode somatic genetic information. For convenience, this document assumes that proxy 110 encodes germline information unless otherwise specified, but those skilled in the art will understand the equivalence of the two approaches.

[0100] The germline status of specific proxy sequence candidates may be known from the research literature or from publicly available databases (e.g., dbSNP or gnomAD (available at gnomad.broadinstitute.org)), or may be discovered by other means. Somatic variants, on the other hand, can be identified from matched tumor / normal samples, i.e., samples from the same patient containing both tumor DNA and non-tumor ("normal") DNA. In particular, variants found in tumor DNA but not in the corresponding normal DNA are necessarily somatic. Known somatic variants may also be discovered by other means.

[0101] Referring to Figure 3, in some implementations, step 204 is performed by employing a segmentation process. In such a process, a portion of the patient's genome 100 is divided into segments (depicted by dashed vertical lines in Figure 3) based on genetic parameters. Segments are defined such that all parameter values within a particular segment are equal (within a desired range or a desired threshold). For example, segments may be contiguous sequences having approximately the same (i.e., within a desired range or a desired threshold) sequencing depth or copy number. In some implementations, the genetic parameters used to segment the input include copy number or the frequency of the allele or sub-allelic segment of interest. One or more proxy sequences may be located within the same segment as the genome sequence of interest, and therefore it is very likely that one or more proxy genome sequences and the genome sequence of interest will have the same copy number.
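
The idea of dividing a contiguous region into segments whose parameter values are equal within a threshold can be sketched as follows. This greedy procedure, its function name, and its tolerance are assumptions made for illustration only; it is far simpler than established segmentation algorithms and is not asserted to be the procedure used in any embodiment.

```python
def segment_by_copy_number(values, tolerance=0.3):
    """Split per-locus copy-number estimates into contiguous segments.

    Returns a list of (start_index, end_index_exclusive) tuples. A new
    segment begins when a value differs from the running mean of the
    current segment by more than `tolerance` (a hypothetical threshold).
    """
    if not values:
        return []
    segments = []
    start = 0
    segment_sum = values[0]
    for i in range(1, len(values)):
        segment_mean = segment_sum / (i - start)
        if abs(values[i] - segment_mean) > tolerance:
            segments.append((start, i))
            start = i
            segment_sum = values[i]
        else:
            segment_sum += values[i]
    segments.append((start, len(values)))
    return segments


# Hypothetical copy-number estimates along a region, in genomic order
copy_numbers = [2.0, 2.1, 1.9, 3.0, 3.1, 2.9, 3.0, 1.0, 1.1]
print(segment_by_copy_number(copy_numbers))  # [(0, 3), (3, 7), (7, 9)]
```

A target sequence and a proxy sequence whose indices fall within the same returned interval would be treated as sharing a copy number in this sketch.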

[0102] Various segmentation procedures are known in the field. Four such algorithms are iSeg (described in Girimurugan et al., "iSeg: an Efficient Algorithm for Segmentation of Genomic and Epigenomic Data," BMC Bioinformatics, Vol. 19: p. 131 (2018), the entirety of which is incorporated herein by reference), CBS (described in Olshen et al., "Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data," Biostatistics, Vol. 5, No. 4, pp. 557-572 (2004), the entirety of which is incorporated herein by reference), SLMSuite (described in Orlandini et al., "SLMSuite: A Suite of Algorithms for Segmenting Genomic Profiles," BMC Bioinformatics, Vol. 18: p. 321 (2017), the entirety of which is incorporated herein by reference), and PELT (described in Killick et al., "Optimal detection of changepoints with a linear computational cost," Journal of the American Statistical Association, Vol. 107, No. 500, pp. 1590-1598 (2012)). In some embodiments, the patient genome sequence is segmented into multiple segments based on the uniformity of copy number within each segment.

[0103] Referring again to Figure 2, in some implementations, only proxies 110 located on the same segment as the region of interest 102 are identified. In some implementations, the proxies 110 include all known germline SNPs located on the same segment as the region of interest 102. In some implementations, the proxies 110 include all known germline alleles located on the same segment as the region of interest 102. In some implementations, for example, when it is difficult to correctly divide the genome sequence into segments corresponding to different copy numbers, only proxies 110 that are no more than a predetermined number of bases away from the region of interest 102 are identified. For example, in some cases, the maximum number of bases separating the region of interest from a proxy sequence may range from approximately 10 to approximately 1,000 bases. In some cases, the maximum number of bases separating the region of interest from a proxy sequence may be approximately 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 bases. In some cases, the maximum number of bases separating the region of interest from a proxy sequence may be any value within the range of values described in this paragraph.
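
The two selection strategies in this paragraph (same segment, or within a maximum base distance when segmentation is unreliable) can be sketched as below. The function name, the coordinate representation, and the example positions are hypothetical and serve only to illustrate the selection logic.

```python
def select_proxies(target_pos, germline_snp_positions, segments=None, max_distance=1000):
    """Choose proxy positions for a target locus.

    If `segments` (half-open (start, end) coordinate intervals) is supplied
    and one of them contains the target, return the known germline SNPs in
    that segment. Otherwise fall back to SNPs within `max_distance` bases.
    """
    if segments:
        for start, end in segments:
            if start <= target_pos < end:
                return [p for p in germline_snp_positions
                        if start <= p < end and p != target_pos]
    return [p for p in germline_snp_positions
            if p != target_pos and abs(p - target_pos) <= max_distance]


# Hypothetical coordinates of known germline SNPs and of two segments
snps = [100, 950, 1200, 5200, 9000]
segments = [(0, 5000), (5000, 10000)]

same_segment = select_proxies(1000, snps, segments=segments)
nearby_only = select_proxies(1000, snps, max_distance=300)
print(same_segment, nearby_only)  # [100, 950, 1200] [950, 1200]
```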

[0104] In step 206, the frequency of proxy 110 is identified. In step 208, the allele frequencies (allele fractions) of sequences from the region of interest (i.e., the genome sequence of interest) 102 are identified. Here, “frequency” refers to a normalized statistical frequency, e.g., the number of occurrences of a sequence or proxy in the sample divided by the total number of occurrences of any given sequence at the same genomic locus. In some implementations, several frequency measurements may be performed. The allele frequencies of the genome sequence of interest and one or more proxy genome sequences can be determined by sequencing nucleic acid molecules in a sample from the subject. In some cases, allele frequencies may be determined using other methodologies, e.g., microarrays or fluorescence in situ hybridization (FISH) techniques. If several proxies are used, outlier proxy frequencies may be discarded, and the remaining frequencies may be combined as a single statistical centrality measure (e.g., a summary statistic such as the mean, median, or mode, or a distribution of allele frequencies of the proxy sequences (e.g., a probability distribution)), resulting in step 210 involving a single numerical comparison. For example, in some embodiments, the centrality measure (summary statistic) is the mean allele frequency for one or more proxy sequences. In some embodiments, the centrality measure (summary statistic) is the median allele frequency for one or more proxy sequences. When a single proxy genome sequence is used, the centrality measure of the observed frequency of the proxy genome sequence is the frequency of that proxy sequence. In some embodiments, the centrality measure may be the distribution of observed allele frequencies for the proxy sequence.
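The frequency definition and the combination of several proxy frequencies into one centrality measure can be sketched as follows. The outlier rule used here (dropping proxies more than a fixed deviation from the raw median) and all names and numbers are assumptions; the text leaves the choice of outlier handling and summary statistic open.

```python
from statistics import median
from typing import List


def allele_frequency(allele_reads: int, total_reads: int) -> float:
    """Occurrences of the allele divided by all reads at the same locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return allele_reads / total_reads


def proxy_central_frequency(frequencies: List[float], max_deviation: float = 0.15) -> float:
    """Median proxy frequency after discarding outliers (assumed rule)."""
    if not frequencies:
        raise ValueError("at least one proxy frequency is required")
    raw_median = median(frequencies)
    kept = [f for f in frequencies if abs(f - raw_median) <= max_deviation]
    # If every value was discarded, fall back to the raw median.
    return median(kept) if kept else raw_median


proxy_frequencies = [0.48, 0.50, 0.52, 0.95]  # hypothetical; 0.95 is an outlier
print(proxy_central_frequency(proxy_frequencies))
```

With a single proxy, the function simply returns that proxy's frequency, which matches the single-proxy case described above.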

[0105] In determination 210, a single proxy frequency or multiple proxy frequencies (e.g., a measure of centrality of the observed frequencies of one or more proxy sequences) are compared to one or more frequencies for the region of interest to determine whether they are equal. Throughout this specification and application, the term “equal” includes “equal within a desired range” or “equal within a desired threshold,” which can be determined routinely based on the desired sensitivity and specificity of process 200. The range or threshold can be set, for example, using a statistical threshold or statistical test selected by those skilled in the art. If, instead of combining proxy frequencies as described above, several proxies 110 are used and individual comparisons are made, then determination 210 is “yes” if a certain percentage of the comparisons (e.g., greater than 50%, greater than 55%, greater than 60%, greater than 65%, greater than 70%, greater than 75%, greater than 80%, greater than 85%, greater than 90%, or greater than 95%) are equal.

[0106] If the proxy frequency is equal to the frequency of the target sequence, the target sequence is classified as germline (step 212). Otherwise, the target sequence is classified as somatic (step 214). Alternatively, if proxy 110 is selected such that it is known to encode somatic (rather than germline) information, then equal frequencies indicate that the target sequence is somatic, and unequal frequencies indicate that the target sequence is germline.
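The per-proxy comparison in determination 210 and the classification in steps 212 and 214 can be sketched as a vote. The tolerance that defines "equal", the required fraction of matching comparisons, and the function names are illustrative assumptions.

```python
from typing import List


def frequencies_equal(a: float, b: float, tolerance: float = 0.05) -> bool:
    """'Equal' here means equal within an assumed absolute tolerance."""
    return abs(a - b) <= tolerance


def classify_by_vote(
    target_frequency: float,
    proxy_frequencies: List[float],
    tolerance: float = 0.05,
    min_fraction: float = 0.5,
    proxies_are_germline: bool = True,
) -> str:
    """Determination 210 is 'yes' if more than min_fraction of comparisons are equal."""
    if not proxy_frequencies:
        raise ValueError("at least one proxy frequency is required")
    matches = sum(
        1 for p in proxy_frequencies if frequencies_equal(target_frequency, p, tolerance)
    )
    is_equal = matches / len(proxy_frequencies) > min_fraction
    if proxies_are_germline:
        return "germline" if is_equal else "somatic"
    # Proxies known to be somatic invert the interpretation.
    return "somatic" if is_equal else "germline"


print(classify_by_vote(0.50, [0.48, 0.52, 0.51, 0.20]))
print(classify_by_vote(0.20, [0.48, 0.52, 0.51]))
```

The `proxies_are_germline` flag covers the alternative in which the proxies are known to be somatic.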

[0107] In some implementations, the comparison in determination 210 may also be used to eliminate potentially erroneous classifications. In particular, the frequency of true somatic variants is necessarily lower than that of true germline variants because both tumor DNA and non-tumor DNA contribute to the germline variant frequency, while only tumor DNA contributes to the somatic variant frequency. Therefore, in some implementations, if the frequency of the sequence of interest exceeds the proxy frequency, then the sequence of interest is classified as germline.
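The inequality behind this safeguard can be sketched under simplifying assumptions that are not part of the original text: tumor purity $p$, tumor copy number $C$ at the locus, $M$ tumor copies carrying the variant, a diploid non-tumor genome, and a heterozygous germline variant. All symbols are introduced here for illustration.

```latex
% Assumed symbols: p = tumor purity, C = tumor copy number at the locus,
% M = number of tumor copies carrying the variant; normal cells are diploid.
\begin{align}
\mathrm{AF}_{\text{germline}} &= \frac{pM + (1-p)}{pC + 2(1-p)}, \\
\mathrm{AF}_{\text{somatic}}  &= \frac{pM}{pC + 2(1-p)}, \\
\mathrm{AF}_{\text{germline}} - \mathrm{AF}_{\text{somatic}} &= \frac{1-p}{pC + 2(1-p)} > 0
\quad \text{for } p < 1 .
\end{align}
```

For the same $M$, the non-tumor term $(1-p)$ appears only in the germline numerator, which is why a somatic variant is expected to sit below a germline variant in a mixed sample.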

[0108] For example, in some embodiments, comparing the observed frequency of a target genome sequence to a measure of centrality of the observed frequencies of one or more proxy genome sequences may include determining the “allele frequency distance” (AFDIS) of the target genome sequence from the allele frequency expected if the target genome sequence is a germline sequence. The expected allele frequency is determined based on the observed frequencies (or a summary statistic of the observed frequencies) of one or more proxy sequences, which are assumed to be germline based on how they were selected. In some embodiments, the AFDIS may be expressed numerically as follows: $\mathrm{AFDIS} = \mathrm{AF}_{\text{germline}} - \mathrm{AF}_{\text{variant}}$, where $\mathrm{AF}_{\text{germline}}$ is the expected allele frequency if the target genome sequence were germline, as determined based on the observed allele frequencies of one or more proxy sequences, and $\mathrm{AF}_{\text{variant}}$ is the observed allele frequency of the target genome sequence.

[0109] In some embodiments, the allele frequency distance can be determined using the distribution of observed frequencies of proxy genome sequences. The distribution can be used to determine the probability that the genome sequence of interest is germline or somatic. In some embodiments, the allele frequency distance is the probability that the observed frequencies of the genome sequence of interest fit (or do not fit) the distribution of observed frequencies of a set of proxy sequences. For example, if the allele frequencies of the genome sequence of interest fall within the distribution, the genome sequence of interest can be identified as a germline sequence. If the allele frequencies of the genome sequence of interest do not fit within the distribution, the genome sequence of interest can be identified as somatic. Those skilled in the art may select statistical tests or predetermined thresholds to determine whether the allele frequencies of the genome sequence of interest fit within the distribution.
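One way to test whether a target frequency fits the distribution of proxy frequencies is a z-score against the proxy mean and standard deviation. This is only one of many possible statistical tests. The cutoff of three standard deviations and the example values are assumptions.

```python
from statistics import mean, stdev
from typing import List


def fits_proxy_distribution(
    target_frequency: float, proxy_frequencies: List[float], max_z: float = 3.0
) -> bool:
    """Return True if the target lies within max_z standard deviations of the proxies."""
    if len(proxy_frequencies) < 2:
        raise ValueError("at least two proxy frequencies are required")
    mu = mean(proxy_frequencies)
    sigma = stdev(proxy_frequencies)  # sample standard deviation
    if sigma == 0:
        return target_frequency == mu
    return abs(target_frequency - mu) / sigma <= max_z


proxies = [0.47, 0.49, 0.50, 0.51, 0.53]  # hypothetical germline proxy frequencies
print(fits_proxy_distribution(0.52, proxies))  # fits: consistent with germline
print(fits_proxy_distribution(0.20, proxies))  # does not fit: consistent with somatic
```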

[0110] In some embodiments, allele frequency distances can be used to classify a target genome sequence. For example, in some embodiments, if the allele frequency distance exceeds a selected threshold, the target genome sequence is classified as somatic. In some embodiments, if the allele frequency distance falls below a selected threshold, the target genome sequence is classified as germline. Thresholds can be set based on a desired precision or specificity tolerance.
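A minimal sketch of the distance-based call follows, taking the distance as the expected germline frequency minus the observed variant frequency and using the median of the proxies as the expected germline frequency. The threshold value of 0.1 and the treatment of a distance exactly at the threshold are assumptions.

```python
from statistics import median
from typing import List


def allele_frequency_distance(proxy_frequencies: List[float], variant_frequency: float) -> float:
    """Expected germline frequency (median of proxies) minus the observed frequency."""
    if not proxy_frequencies:
        raise ValueError("at least one proxy frequency is required")
    expected_germline = median(proxy_frequencies)
    return expected_germline - variant_frequency


def classify_by_distance(distance: float, threshold: float = 0.1) -> str:
    """Distance above the threshold -> somatic; otherwise germline (assumed tie rule)."""
    return "somatic" if distance > threshold else "germline"


proxies = [0.48, 0.50, 0.52]  # hypothetical proxy allele frequencies
for observed in (0.20, 0.49):
    d = allele_frequency_distance(proxies, observed)
    print(observed, round(d, 3), classify_by_distance(d))
```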

[0111] In some embodiments, the classification of a target genome sequence as germline or somatic may involve the use of a statistical model. The statistical model may, for example, take the allele frequency distance for a given target genome sequence as input and output a classification of the target genome sequence as somatic (or potentially somatic) or germline (or potentially germline). The classification may also be based on the probability that the target genome sequence is somatic or germline. In some implementations, the target genome sequence may be classified as ambiguous if, for example, the probability that the sequence is somatic or germline is not sufficiently high. The probability threshold for making a call may be based on the desired specificity and / or precision of the call. For example, in some embodiments, if the probability that the target genome sequence is somatic is greater than any one of 0.8, 0.85, 0.9, 0.95, 0.96, 0.97, 0.98, or 0.99 (or any selected value between them), the target genome sequence is classified as somatic; if the probability that the target genome sequence is somatic is less than any one of 0.2, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, or 0.01 (or any selected value between them), the target genome sequence is classified as germline. Target genome sequences that the statistical model does not classify as somatic or germline may be labeled as ambiguous.
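The three-way call described in this paragraph can be sketched as follows. The cutoffs 0.95 and 0.05 are taken from the listed example values, while the function name and the string labels are assumptions.

```python
def call_from_probability(
    p_somatic: float, somatic_cutoff: float = 0.95, germline_cutoff: float = 0.05
) -> str:
    """Map a somatic probability to 'somatic', 'germline', or 'ambiguous'."""
    if not 0.0 <= p_somatic <= 1.0:
        raise ValueError("p_somatic must be between 0 and 1")
    if p_somatic > somatic_cutoff:
        return "somatic"
    if p_somatic < germline_cutoff:
        return "germline"
    return "ambiguous"  # probability not high enough for either call


for probability in (0.99, 0.50, 0.01):
    print(probability, call_from_probability(probability))
```

Tightening the cutoffs raises the specificity of each call at the cost of labeling more sequences as ambiguous.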

[0112] In some embodiments, the statistical model is trained using data from one or more matched tumor / normal sample pairs. The normal sample in the matched tumor / normal sample pair can be sequenced to establish a ground truth for germline sequences, and the tumor sample can be sequenced to establish a ground truth for somatic variant sequences (i.e., sequences that are not germline according to the matched normal sample). Sequencing data from the tumor sample, which may contain a mixture of normal and tumor nucleic acid molecules, is then used to determine the allele frequency distance for selected target genome sequences, each of which is labeled as somatic (p_somatic equal to 1) or germline (p_somatic equal to 0). Then, a function relating the allele frequency distance to the probability of being somatic can be generated using the training data.
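One possible form of the function relating allele frequency distance to somatic probability is a one-variable logistic regression. The sketch below fits it by plain gradient descent using only the standard library. The training values, learning rate, and epoch count are invented for illustration and do not come from any matched tumor / normal data.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(
    distances: List[float], labels: List[int], learning_rate: float = 0.5, epochs: int = 5000
) -> Tuple[float, float]:
    """Fit p_somatic = sigmoid(weight * distance + bias) by gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(distances)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(distances, labels):
            error = sigmoid(weight * x + bias) - y  # gradient of the log loss
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


def probability_somatic(distance: float, weight: float, bias: float) -> float:
    return sigmoid(weight * distance + bias)


# Hypothetical training set: label 0 = germline, 1 = somatic.
train_distances = [0.00, 0.01, -0.02, 0.03, 0.02, 0.20, 0.25, 0.30, 0.15, 0.35]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w, b = fit_logistic(train_distances, train_labels)
print(probability_somatic(0.30, w, b), probability_somatic(0.00, w, b))
```

In practice a regularized fit from an established statistics library would likely be preferable, since perfectly separable training data makes the unregularized weights grow without bound.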

[0113] Other methods may be used to train the statistical model. For example, in some embodiments, the model is trained using only germline sequence data or only somatic cell sequence data.

[0114] In some implementations, the comparison in step 210 may be performed indirectly by a statistical model. For example, if the median allele frequency of the set of proxies is used as the centrality measure in step 206, then a logistic regression model can be constructed on the difference between the allele frequency of the target sequence and the median allele frequency of the proxies. In some implementations, this logistic regression model is used as described above.

[Math. 1]

[0115] The rationale underlying this characterization is that each proxy is physically close to the target sequence in the patient's genome. Therefore, the proxy and the target sequence are likely to experience identical or similar genomic dynamics or mutations, such as duplication events or deletions. Rather than attempting to model the specific dynamics of the target sequence to correlate observed frequencies with germline / somatic state, this approach replaces such models with direct empirical measurements. This approach offers advantages insofar as conventional models have historically been somewhat insensitive or inaccurate.

[0116] The methods described herein may further include generating a report identifying one or more target genomic sequences as germline or somatic. The generated report can be transmitted (e.g., via a computer network) to a patient, healthcare provider, or other party. This report is particularly useful for evaluating cancer treatments, determining treatment effectiveness, monitoring cancer progression or recurrence, designing personalized cancer vaccines, and other beneficial uses.

Electronic devices and systems

[0117] Figure 4 shows an example of a system according to one embodiment. Device 400 may be a host computer connected to a network. Device 400 may be a client computer or a server. As shown in Figure 4, device 400 may be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (a portable electronic device, e.g., a telephone or tablet). The device may include, for example, one or more of a processor 410, an input device 420, an output device 430, memory storage 440, and / or a communication device 460. The input device 420 and the output device 430 may be connectable to or integrated with the computer. In some embodiments, the device is configured to operate a sequencer 470 that can sequence nucleic acid molecules in a patient sample to obtain sequencing data.

[0118] The input device 420 may be any suitable device that provides input, such as a touchscreen, keyboard or keypad, mouse, or voice recognition device. The output device 430 may be any suitable device that provides output, such as a display, touchscreen, haptic device, or speaker.

[0119] The memory storage 440 may be any suitable device that provides storage, such as electrical, magnetic, or optical memory, including RAM, cache, hard drive, or removable storage disk. The communication device 460 may include any suitable device that can send and receive signals over a network, such as a network interface chip or device. The components of the computer may be connected in any suitable manner, such as a physical bus or wirelessly.

[0120] Software such as the SGZ module 450 and other sequence analysis and variant calling program modules are stored in memory storage 440 and can be executed by processor(s) 410, and may include, for example, code for an AFDIS-based logistic regression model and other programming to perform the functions of this disclosure (for example, as performed in the device described above).

[0121] Software such as the SGZ module 450, as well as other sequence analysis and variant calling program modules, can also be stored and / or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device (e.g., those described above) that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, the computer-readable storage medium may be any medium, such as storage 440, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.

[0122] Software such as the SGZ module 450, as well as other sequence analysis and variant calling program modules, can also be propagated within any transmission medium for use by or in connection with an instruction execution system, apparatus, or device (e.g., those described above) that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, the transmission medium can be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device. Transmission media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation media.

[0123] Device 400 may be connected to a network, which may be any suitable type of interconnected communication system. The network may implement any suitable communication protocol and may be protected by any suitable security protocol. The network may include network links of any suitable configuration that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.

[0124] Device 400 can implement any operating system suitable for running over a network. Software such as the SGZ module 450 and other sequence analysis and variant calling program modules can be written in any suitable programming language such as C, C++, Java®, or Python. In various embodiments, application software embodying the functions of this disclosure can be deployed in different configurations (e.g., in a client / server configuration, or via a web browser as a web-based application or web service).

Subject, sample, and sequencing

[0125] The sample used in the methods described herein (e.g., a patient sample) may include a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules. Tumor nucleic acid molecules can be obtained directly or indirectly from a tumor. For example, tumor nucleic acid molecules can be obtained from a tumor tissue biopsy. Tumor biopsies often include both tumor and non-tumor tissue, thereby providing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules. In some embodiments, tumor nucleic acid molecules and non-tumor nucleic acid molecules are obtained from a body fluid or liquid biopsy sample (e.g., blood, plasma, cerebrospinal fluid, etc.) that may contain cell-free (or circulating free) DNA, including tumor (e.g., circulating tumor DNA or ctDNA) and non-tumor cell-free nucleic acid molecules.

[0126] Patient samples may be taken, for example, from subjects who have cancer, are suspected of having cancer, or have previously received treatment for cancer. In certain embodiments, samples are obtained from subjects with solid tumors, hematological malignancies, or metastatic forms thereof. In certain embodiments, samples are obtained from subjects who have cancer or are at risk of having cancer. In certain embodiments, samples are obtained from subjects who are not receiving treatment to treat cancer, are receiving treatment to treat cancer, or have previously received treatment to treat cancer, as described herein.

[0127] Various tissues can be sources of samples used in this method. Genomic or subgenomic nucleic acids (e.g., DNA or RNA) can be isolated from the target sample (e.g., a sample containing tumor cells, a blood sample, a blood component sample, a sample containing cell-free DNA (cfDNA), a sample containing circulating tumor DNA (ctDNA), a sample containing circulating tumor cells (CTCs), or any normal control (e.g., normal adjacent tissue (NAT))).

[0128] In some embodiments, the sample is obtained from a liquid biopsy. Liquid biopsy patient samples may be derived from, for example, blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.

[0129] In some embodiments, the patient sample is derived from a solid tissue sample, such as a solid tumor biopsy. A solid tumor biopsy often contains a mixture of tumor tissue and non-tumor tissue. In some embodiments, the solid tissue biopsy sample is a fresh sample. In some embodiments, the solid tissue biopsy sample is a frozen or previously frozen sample. In some embodiments, the solid tissue biopsy sample is a preserved sample (e.g., a chemically preserved sample). In certain embodiments, the sample is a formalin-fixed paraffin-embedded (FFPE) sample.

[0130] In some embodiments, the tumor purity of a patient sample (i.e., the portion of the sample that is tumor nucleic acid molecules compared to the total nucleic acid molecules) for any of the sample types disclosed herein is about 1% or more, about 5% or more, about 10% or more, about 15% or more, about 20% or more, about 25% or more, about 30% or more, about 40% or more, about 50% or more, about 60% or more, about 70% or more, or about 80% or more. In some embodiments, the tumor purity of a patient sample is about 99% or less, about 95% or less, about 90% or less, about 85% or less, about 80% or less, about 75% or less, about 70% or less, about 60% or less, about 50% or less, about 40% or less, about 30% or less, about 25% or less, or about 20% or less.

[0131] In one embodiment, the method further includes obtaining a sample, for example, a patient sample as described herein. The sample may be obtained directly or indirectly. In one embodiment, the sample is obtained, for example, by isolation or purification from a sample containing cfDNA. In one embodiment, the sample is obtained, for example, by isolation or purification from a sample containing ctDNA. In one embodiment, the sample is obtained, for example, by isolation or purification from a sample containing both malignant and non-malignant cells (e.g., tumor-infiltrating lymphocytes). In one embodiment, the sample is obtained, for example, by isolation or purification from a sample containing CTCs. In some embodiments, the sample is obtained by solid tissue biopsy.

[0132] Sequencing libraries can be prepared from patient samples using known methods. Nucleic acid molecules can be purified or isolated from patient samples. In some embodiments, isolated nucleic acids are fragmented or sheared using known methods. For example, nucleic acid molecules can be fragmented by physical shearing (e.g., sonication), enzymatic cleavage, chemical cleavage, and other methods well known to those skilled in the art. Nucleic acids can be linked to adapter sequences for sequencing. In some cases, the adapter may include amplification primers and / or sequencing adapters. In some cases, nucleic acid molecules purified or isolated from patient samples or sequencing libraries prepared therefrom can be amplified, for example, by polymerase chain reaction (PCR) or isothermal amplification methods well known to those skilled in the art.

[0133] In some embodiments, nucleic acid molecules from patient samples and used to prepare a sequencing library (or a selected (e.g., captured) subset thereof) are sequenced to generate the patient genome sequence. Sequencing methods are well known in the art and can be carried out using multiplex (e.g., next-generation) or single-molecule sequencing. The patient genome sequence determined by sequencing does not have to be the patient's entire genome. For example, in some embodiments, a portion of the patient's genome (i.e., less than the whole genome) is sequenced using targeted sequencing methods (e.g., the use of specific probe (or bait) molecules for capture based on hybridization). See, for example, U.S. Patent No. 9,340,830B2. Targeted sequencing may be used to target, for example, one or more exon regions, one or more intron regions, one or more intragene regions, one or more 3'-UTRs (untranslated regions), and / or one or more 5'-UTRs.

[0134] In some embodiments, targeted sequencing may be used to sequence one or more genes or parts of one or more genes that are associated with cancer. Exemplary cancer-related genes that can be sequenced using targeted sequencing include, but are not limited to, ABL2, AKT2, AKT3, ARAF, ARFRP1, ARID1A, ATM, ATR, AURKA, AURKB, BCL2, BCL2A1, BCL2L1, BCL2L2, BCL6, BRCA1, BRCA2, CARD11, CBL, CCND1, CCND2, CCND3, CCNE1, CDH1, CDH2, CDH20, CDH5, CDK4, CDK6, CDK8, CDKN2B, CDKN2C, CHEK1, CHEK2, CRKL, CRLF2, DNMT3A, DOT1L, EPHA3, EPHA5, EPHA6, EPHA7, EPHB1, EPHB4, EPHB6, ERBB3, ERBB4, ERG, ETV1, ETV4, ETV5, ETV6, EWSR1, EZH2, FANCA, FBXW7, FGFR4, FLT1, FLT4, FOXP4, GATA1, GNA11, GNAQ, GNAS, GPR124, GUCY1A2, HOXA3, HSP90AA1, IDH1, IDH2, IGF1R, IGF2R, IKBKE, IKZF1, INHBA, IRS2, JAK1, JAK3, JUN, KDR, LRP1B, LTK, MAP2K1, MAP2K2, MAP2K4, MCL1, MDM2, MDM4, MEN1, MITF, MLH1, MPL, MRE11A, MSH2, MSH6, MTOR, MUTYH, MYCL1, MYCN, NF2, NKX2-1, NTRK1, NTRK3, PAK3, PAX5, PDGFRB, PIK3R1, PKHD1, PLCG1, PRKDC, PTCH1, PTPN11, PTPRD, RAF1, RARA, RICTOR, RPTOR, RUNX1, SMAD2, SMAD3, SMAD4, SMARCA4, SMARCB1, SMO, SOX10, SOX2, SRC, STK11, TBX22, TET2, TGFBR2, TMPRSS2, TOP1, TSC1, TSC2, USP9X, VHL, WT1, ABL1, AKT1, ALK, APC, AR, BRAF, CDKN2A, CEBPA, CTNNB1, EGFR, ERBB2, ESR1, FGFR1, FGFR2, FGFR3, FLT3, HRAS, JAK2, KIT, KRAS, MET, MLL, MYC, NF1, NOTCH1, NPM1, NRAS, PDGFRA, PIK3CA, PTEN, RB1, RET, and TP53.

[0135] In certain embodiments, the sample is obtained from a subject having cancer. Exemplary cancers include, but are not limited to, B-cell cancers (e.g., multiple myeloma), melanoma, breast cancer, lung cancer (such as non-small cell lung cancer or NSCLC), bronchial cancer, colorectal cancer, prostate cancer, pancreatic cancer, stomach cancer, ovarian cancer, bladder cancer, brain or central nervous system cancer, peripheral nervous system cancer, esophageal cancer, cervical cancer, uterine or endometrial cancer, oral or pharyngeal cancer, liver cancer, kidney cancer, testicular cancer, biliary tract cancer, small intestine or appendix cancer, salivary gland cancer, thyroid cancer, adrenal gland cancer, osteosarcoma, chondrosarcoma, hematological cancers, adenocarcinoma, inflammatory myofibroblastic tumor, gastrointestinal stromal tumor (GIST), colon cancer, multiple myeloma (MM), myelodysplastic syndrome (MDS), myeloproliferative disorder (MPD), acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), chronic myeloid leukemia (CML), chronic lymphocytic leukemia (CLL), polycythemia vera, Hodgkin lymphoma, non-Hodgkin lymphoma (NHL), soft tissue sarcoma, fibrosarcoma, myxosarcoma, liposarcoma, osteogenic sarcoma, chordoma, angiosarcoma, endosplenic synovoma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinoma, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatocellular carcinoma, cholangiocarcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' tumor, bladder cancer, epithelial carcinoma, glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, neuroblastoma, retinoblastoma, cellular lymphomas, mantle cell lymphoma, hepatocellular carcinoma, thyroid cancer, gastric cancer, head and neck cancer, small cell carcinoma, essential thrombocythemia, agnogenic myeloid metaplasia, hypereosinophilic syndrome, systemic mastocytosis, familial hypereosinophilia, chronic eosinophilic leukemia, neuroendocrine carcinomas, and cancerous tumors.

[0136] In one embodiment, the cancer is a hematological malignancy (or pre-malignancy). As used herein, hematological malignancy refers to tumors of hematopoietic or lymphoid tissue, such as tumors affecting the blood, bone marrow, or lymph nodes. Exemplary hematological malignancies include, but are not limited to, leukemia (e.g., acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), hairy cell leukemia, acute monocytic leukemia (AMoL), chronic myelomonocytic leukemia (CMML), juvenile myelomonocytic leukemia (JMML), or large granular lymphocytic leukemia) and lymphoma (e.g., AIDS-associated lymphoma, cutaneous T-cell lymphoma, Hodgkin lymphoma (e.g., classical Hodgkin lymphoma or nodular lymphocyte-predominant Hodgkin lymphoma), mycosis fungoides, non-Hodgkin lymphoma (e.g., B-cell non-Hodgkin lymphoma (e.g., Burkitt lymphoma, chronic lymphocytic leukemia / small lymphocytic lymphoma (CLL / SLL), diffuse large B-cell lymphoma, follicular lymphoma, immunoblastic large cell lymphoma, precursor B-lymphoblastic lymphoma, or mantle cell lymphoma) or T-cell non-Hodgkin lymphoma (mycosis fungoides, anaplastic large cell lymphoma, or precursor T-lymphoblastic lymphoma)), or primary central nervous system lymphoma). As used herein, pre-malignancy refers to tissue that is not yet malignant but is poised to become malignant.

[0137] In some embodiments, the sample is obtained, e.g., collected, from a subject, e.g., a patient, having a condition or disease, such as a hyperproliferative disorder (e.g., as described herein) or a non-cancerous indication. In some embodiments, the disease is a hyperproliferative disorder. In some embodiments, the hyperproliferative disorder is cancer, such as a solid tumor or a hematological malignancy. In some embodiments, the cancer is a solid tumor. In some embodiments, the cancer is a hematological malignancy, such as leukemia or lymphoma.

[0138] In some embodiments, the subject has cancer. In some embodiments, the subject has been treated for cancer or is being treated for cancer. In some embodiments, the subject needs to be monitored for cancer progression or regression after being treated, for example, with cancer therapy. In some embodiments, the subject needs to be monitored for cancer recurrence. In some embodiments, the subject is at risk of having cancer. In some embodiments, the subject has not been treated with cancer therapy. In some embodiments, the subject has a genetic predisposition to cancer (e.g., having a mutation that increases the baseline risk of developing cancer). In some embodiments, the subject is exposed to an environment that increases the risk of developing cancer (e.g., radiation or chemicals). In some embodiments, the subject needs to be monitored for the development of cancer.

[0139] In some embodiments, the patient has been previously treated with targeted therapy, e.g., one or more targeted therapies. In some embodiments, a post-targeted therapy sample, e.g., a specimen, is obtained, e.g., collected, from a patient who has been previously treated with targeted therapy. In some embodiments, the post-targeted therapy sample is a specimen obtained, e.g., collected, after the completion of targeted therapy.

[0140] In some embodiments, the patient has not been previously treated with targeted therapy. In some embodiments, for patients not previously treated with targeted therapy, the sample includes a resection, e.g., an original resection, or a recurrence, e.g., a disease recurrence after treatment, e.g., non-targeted therapy. In some embodiments, the sample is a primary tumor or metastasis, e.g., a metastatic biopsy, or a portion thereof. In some embodiments, the sample is obtained from a site, e.g., a tumor site, with the highest percentage of tumor cells compared to an adjacent site, e.g., an adjacent site with tumor cells. In some embodiments, the sample is obtained from a site, e.g., a tumor site, with the largest tumor focus compared to an adjacent site, e.g., an adjacent site with tumor cells.

[0141] In some embodiments, the subject is a human.

Cancer treatment methods

[0142] The genomic profile of cancer can often influence the likelihood of success for various cancer treatment approaches. For example, a given anticancer drug may be more likely to succeed in treating a particular cancer with a certain genomic profile than in treating a particular cancer with a different genomic profile. The methods described herein can be used to characterize the genomic profile of cancer by distinguishing somatic sequences that may be cancer-related from germline sequences.

[0143] For example, a method for treating a patient's cancer may include identifying (e.g., classifying) one or more target genomic sequences as somatic using the method described herein, and selecting a mode of cancer treatment based on the one or more identified somatic sequences. The cancer can then be treated with an effective amount of the selected mode of cancer treatment. This enables personalized cancer treatment for a patient based on somatic sequences specific to that patient's cancer. In contrast, if the treatment selection is based on germline variants rather than somatic variants, there is a risk that the selected mode of treatment may be ineffective for the patient's cancer.

[0144] Exemplary cancer treatments may include, for example, selected chemotherapy agents, selected immuno-oncology agents (such as immune checkpoint inhibitors), surgical resection, radiotherapy, targeted therapy, gene expression regulators, angiogenesis inhibitors, and hormone therapy.

[0145] Cancer treatment may be selected, for example, based on the association between one or more identified somatic sequences and successful cancer treatments using selected therapeutic modes. Exemplary associations between cancer type, somatic sequences, and therapeutic modes are listed in Table 1. [Table 1-1] [Table 1-2]

[0146] The microsatellite instability (MSI) status of cancer may be useful in selecting the mode of cancer treatment. Microsatellite instability may be due to a deficiency in the DNA mismatch repair (MMR) pathway in cancer cells, which results in an abnormally high frequency of gene mutations. See Kim et al., "The Landscape of Microsatellite Instability in Colorectal and Endometrial Cancer Genomes," Cell, Vol. 155, No. 4, pp. 858-868 (2013). MSI status is generally characterized as high (MSI-H), low (MSI-L), or stable (MSS) (or alternatively, MSI-H or not MSI-H; or MSI-H or MSI undetermined) based on the MSI signature. The MSI-H status has been detected in several types of solid tumors and may be an indicator of the success of cancer treatment with a particular mode of cancer treatment. See Cortes-Ciriano et al., "A molecular portrait of microsatellite instability across multiple cancers," Nature Communications, Vol. 8, p. 15180 (2017). Mutations in microsatellites (i.e., MSI events) can be detected by distinguishing somatic sequences from germline sequences using the methods described herein.

[0147] The success of certain cancer treatment modes is related to the MSI-H status of the cancer. For example, PD-1 inhibitors (e.g., pembrolizumab) have been shown to be particularly effective in treating MSI-H solid tumors (e.g., unresectable or metastatic solid tumors). In some embodiments, cancers determined to have an MSI-H status are treated with an effective dose of an immuno-oncology agent. In some embodiments, cancers determined to have an MSI-H status are treated with an effective dose of an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is AMP-224, AMP-514, atezolizumab, AUNP12, avelumab, BGB-A317, BMS-986189, CA-170, camrelizumab, cemiplimab, CK-301, dostarlimab, durvalumab, ipilimumab, INCMGA00012, KN035, nivolumab, pembrolizumab, sintilimab, spartalizumab, tislelizumab, or toripalimab. In some embodiments, cancers determined to have an MSI-H status are treated with an effective dose of a PD-1 inhibitor, a PD-L1 inhibitor, or a CTLA-4 inhibitor. In some embodiments, cancers determined to have an MSI-H status are treated with an effective dose of pembrolizumab.

[0148] In some embodiments, the method for treating cancer includes: identifying (e.g., classifying) one or more target genomic sequences as somatic cells using the method described herein; determining the microsatellite instability state of the cancer using the identified somatic cell sequences; and selecting a cancer treatment mode based on the microsatellite instability state of the cancer. The cancer can then be treated with an effective amount of the selected cancer treatment mode. In some embodiments, the cancer is colorectal cancer, endometrial cancer, biliary tract cancer, bladder cancer, breast cancer, esophageal cancer, gastric cancer, gastroesophageal junction cancer, pancreatic cancer, prostate cancer, renal cell carcinoma, retroperitoneal adenocarcinoma, sarcoma, small cell lung cancer, small intestine cancer, or thyroid cancer.

[0149] In some embodiments, the tumor mutational burden (TMB) of cancer is determined using one or more somatic sequences identified using the method described herein to select a treatment mode. TMB is a cancer genomic biomarker that quantifies the frequency of somatic mutations in a patient's tumor. High TMB correlates with higher neoantigen expression, which helps the immune system recognize the tumor. This has been detected across numerous tumor types and is associated with improved response rates and extended progression-free survival in patients receiving immunotherapy. See Goodman et al., "Tumor Mutational Burden as an Independent Predictor of Response to Immunotherapy in Diverse Cancers," Mol. Cancer Ther., Vol. 16, No. 11, pp. 2598-2608 (2017).

[0150] Tumor mutational burden can be determined for cancer by identifying cancer-related somatic cell sequences using the methods described herein.

[0151] TMB can provide a quantitative value that allows for the selection of cancer treatment modes based on whether the tumor mutational burden is above or below a predetermined threshold. In some embodiments, the predetermined threshold is approximately 5 mutations/Mb, 10 mutations/Mb, 15 mutations/Mb, 20 mutations/Mb, 25 mutations/Mb, 30 mutations/Mb, 40 mutations/Mb, 50 mutations/Mb or more, or any number in between (for example, the predetermined threshold may be approximately 5 mutations/Mb to approximately 50 mutations/Mb). As an example, certain immuno-oncology agents have been found to be particularly effective when used to treat tumors with a high tumor mutational burden. See, for example, Fabrizio et al., "Beyond microsatellite testing: assessment of tumor mutational burden identifies subsets of colorectal cancer who may respond to immune checkpoint inhibition," J. Gastrointestinal Oncology, Vol. 9, No. 4, pp. 610-617 (2018).
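The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the function names, the mutation count, and the size of the sequenced region are hypothetical, and the source does not specify how the megabase denominator is determined.

```python
def tumor_mutational_burden(somatic_mutation_count: int, region_size_bp: int) -> float:
    """Return somatic mutations per megabase (Mb) over the interrogated region.

    Assumption: the denominator is the size of the sequenced region in base pairs.
    """
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    return somatic_mutation_count / (region_size_bp / 1_000_000)


def exceeds_threshold(tmb: float, threshold: float = 10.0) -> bool:
    """True when TMB is above the predetermined threshold (mutations/Mb)."""
    return tmb > threshold


# Hypothetical sample: 18 somatic mutations over a 1.2 Mb panel.
tmb = tumor_mutational_burden(somatic_mutation_count=18, region_size_bp=1_200_000)
high_tmb = exceeds_threshold(tmb, threshold=10.0)
```

Only sequences classified as somatic contribute to the count, which is why the somatic/germline classification precedes the TMB calculation.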

[0152] In some embodiments, cancers determined to have a TMB exceeding a predetermined threshold are treated with an effective dose of an immuno-oncology agent. In some embodiments, cancers determined to have a TMB exceeding a predetermined threshold are treated with an effective dose of an immune checkpoint inhibitor. In some embodiments, the immune checkpoint inhibitor is AMP-224, AMP-514, atezolizumab, AUNP12, avelumab, BGB-A317, BMS-986189, CA-170, camrelizumab, cemiplimab, CK-301, dostarlimab, durvalumab, ipilimumab, INCMGA00012, KN035, nivolumab, pembrolizumab, sintilimab, spartalizumab, tislelizumab, or toripalimab. In some embodiments, cancers determined to have a TMB exceeding a predetermined threshold are treated with an effective dose of a PD-1 inhibitor, a PD-L1 inhibitor, or a CTLA-4 inhibitor. In some embodiments, cancers determined to have a TMB exceeding a predetermined threshold are treated with an effective dose of pembrolizumab. In some embodiments, cancers determined to have a TMB exceeding a predetermined threshold are treated with an effective dose of pembrolizumab, where the predetermined threshold is approximately 10 mutations/Mb.

[0153] In some embodiments, a method for treating cancer includes: identifying one or more target genomic sequences as somatic using the methods described herein; determining the tumor mutational burden of the cancer using the one or more identified somatic sequences; and selecting a cancer treatment mode based on the tumor mutational burden exceeding a predetermined threshold. The cancer can then be treated with an effective amount of the selected cancer treatment mode. In some embodiments, the cancer is colorectal cancer, endometrial cancer, biliary tract cancer, bladder cancer, breast cancer, esophageal cancer, gastric cancer, gastroesophageal junction cancer, pancreatic cancer, prostate cancer, renal cell carcinoma, retroperitoneal adenocarcinoma, sarcoma, small cell lung cancer, small intestine cancer, or thyroid cancer.

Monitoring of cancer progression

[0154] Monitoring cancer progression and/or detecting minimal residual disease is beneficial for evaluating cancer treatment plans and/or monitoring cancer recurrence in patients. Cancer patients can receive cancer treatment until the cancer is no longer detectable. Nevertheless, patients may remain susceptible to recurrence. Patients can be monitored for cancer recurrence by detecting nucleic acid molecules derived from recurrent tumors (e.g., ctDNA molecules). In other embodiments, cancer patients can be treated for the disease, and the progression of the cancer (e.g., an increase or decrease in tumor volume) can be monitored by quantifying the amount of tumor nucleic acid molecules detected in the patient (e.g., the ctDNA level).

[0155] Identifying somatic cell sequences can be particularly useful in monitoring cancer progression or detecting minimal residual disease in cancer. Somatic cell sequences can provide a genomic signature for cancer and can be used to distinguish tumor nucleic acid molecules from non-tumor nucleic acid molecules.

[0156] Patient samples can be obtained and analyzed at two or more time points to monitor cancer progression or recurrence. A first sample is analyzed to identify one or more somatic cell sequences according to the method described herein. The first sample may be obtained before, during, or after cancer treatment, but the patient generally has some degree of detectable cancer.

[0157] A second sample may be obtained at a later point in time after the patient has been treated for cancer and can be analyzed to determine whether one or more identified somatic cell sequences are present in the sample. The presence of somatic cell sequences indicates that the patient still has cancer or that the cancer has recurred. The absence of detectable somatic cell sequences does not definitively prove that the patient does not have cancer, but indicates that the cancer level may be low.

[0158] The second patient sample may be of the same type as the first patient sample, or it may be of a different type. In some embodiments, the second patient sample is obtained from a liquid biopsy. For example, a liquid biopsy patient sample may be blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva. In some embodiments, the patient sample is obtained from a solid tissue sample, such as a solid tumor biopsy. In some embodiments, the solid tissue biopsy sample is a fresh sample. In some embodiments, the solid tissue biopsy sample is a frozen sample or a previously frozen sample. In some embodiments, the solid tissue biopsy sample is a preserved sample (e.g., a chemically preserved sample). In certain embodiments, the sample is a formalin-fixed paraffin-embedded (FFPE) sample.

[0159] Somatic sequences may be detected in DNA or RNA (or both) from the second sample. The presence or absence of somatic sequences in the second sample may be detected by sequencing, quantitative PCR (qPCR), reverse transcription PCR (RT-PCR), fluorescence in situ hybridization (FISH), or any other suitable method for the specific detection of one or more somatic sequences. In certain embodiments, nucleic acid molecules are isolated from the second sample. In some embodiments, nucleic acid molecules are detected directly from the second sample.

[0160] In some embodiments, the presence of one or more somatic sequences is identified in the second sample, and the patient may be treated for cancer using the same mode of treatment with which the cancer was previously treated, or a different one.

[0161] In some embodiments, a method for monitoring cancer progression or recurrence in a patient includes identifying one or more target genomic sequences as somatic cells using the method described herein, wherein the patient sample is obtained from a patient with cancer; obtaining a second patient sample from the patient after the cancer has been treated; and detecting the presence or absence of the one or more target genomic sequences identified as somatic cells in the second patient sample. For example, one or more target genomic sequences may be identified as somatic cells by selecting the target genomic sequence at a genomic locus from a patient genomic sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting one or more proxy genomic sequences for the target genomic sequence; determining an allele frequency distance using summary statistics showing the observed allele frequencies of the target genomic sequence and the observed frequencies of one or more proxy genomic sequences; and identifying the target genomic sequence as germline or somatic cells using the allele frequency distance. In some embodiments, the method includes treating the patient's cancer after a first patient sample is obtained from the patient and before a second patient sample is obtained from the patient. In some embodiments, the method includes treating the patient's cancer if the presence of one or more target genomic sequences identified as somatic cells is detected in the second patient sample.
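The monitoring step, checking whether sequences identified as somatic in the first sample are present in the second sample, can be sketched as a set intersection. This is an illustrative sketch only: the variant representation (chromosome, position, reference allele, alternate allele), the function name, and the coordinates are hypothetical placeholders, and a real assay would also need detection thresholds that the source does not specify here.

```python
from typing import Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def detect_tracked_variants(
    baseline_somatic: Set[Variant], followup_observed: Set[Variant]
) -> Set[Variant]:
    """Return the baseline somatic variants that are also observed at follow-up."""
    return baseline_somatic & followup_observed


# Placeholder coordinates, not real loci.
baseline = {("chr1", 1000, "C", "T"), ("chr2", 2000, "G", "A")}
followup = {("chr1", 1000, "C", "T"), ("chr3", 3000, "A", "G")}

detected = detect_tracked_variants(baseline, followup)
# A non-empty result suggests persistent or recurrent disease; an empty result
# does not prove absence of cancer, only that the tracked variants were not detected.
```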

Neoantigen selection and cancer vaccine production

[0162] Somatic sequences detected in the exon regions of various genes may be suitable, for example, as neoantigens in the development of personalized cancer vaccines. Peptides encoded by somatic variant sequences can be generated that stimulate the immune system to kill cancer cells. See, for example, Richters et al., "Best practices for bioinformatics characterization of neoantigens for clinical utility," Genome Medicine, Vol. 11, p. 56 (2019).

[0163] In some embodiments, a method for selecting a neoantigen for a cancer vaccine personalized for a subject with cancer comprises identifying one or more target genomic sequences as somatic cells using the method described herein, wherein the one or more target genomic sequences identified as somatic cells are located within the exon region of a gene, and selecting from the one or more target genomic sequences identified as somatic cells a genomic sequence encoding a neoantigen suitable as a cancer vaccine for a subject. For example, the one or more target genomic sequences may be identified as somatic cells by: selecting the target genomic sequence at a genomic locus from a patient genomic sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting one or more proxy genomic sequences for the target genomic sequence; determining an allele frequency distance using summary statistics showing the observed allele frequencies of the target genomic sequence and the observed frequencies of one or more proxy genomic sequences; and identifying the target genomic sequence as germline or somatic cell using the allele frequency distance.

[0164] In some embodiments, the method further includes producing a vaccine containing a neoantigen.

Examples

Example 1 - Discrimination between somatic variants and germline variants based on allele frequency distance (AFDIS)

[0165] The following examples are provided to illustrate exemplary embodiments of the invention as described herein and are not intended to limit the scope of the invention.

[0166] Using the previously described SGZ algorithm (e.g., Sun et al. (2018), ibid.), the difference between the expected variant allele frequencies of somatic and germline variants (e.g., a mutation substituting C for T) can be determined, provided that the tumor fraction of the sample, the number of variant alleles, and the copy number of the genomic locus are determined, as shown in FIG. 5A. The variant allele frequencies (VAFs) expected for somatic and germline variants can be determined as follows:

$$\mathrm{AF}_{\text{somatic}} = \frac{pV}{pC + 2(1-p)}, \qquad \mathrm{AF}_{\text{germline}} = \frac{pV + (1-p)}{pC + 2(1-p)}$$

where p is the tumor purity, V is the number of variant alleles, and C is the copy number at the locus. For example, if the tumor purity (p) of the sample is 0.25, the number of variant alleles (V) is 3, and the copy number (C) is 4, then the expected allele frequency is 0.3 if the variant is somatic and 0.6 if it is germline. See, e.g., Sun et al., "A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal", PLoS Comput Biol., Vol. 14, No. 2, p. e1005965 (2018).
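The worked numbers in this paragraph (p = 0.25, V = 3, C = 4, giving 0.3 and 0.6) can be reproduced with the expected-frequency expressions of the SGZ model as published in Sun et al. (2018). The sketch below assumes a diploid normal component; the function names are illustrative and are not taken from the source.

```python
def expected_af_somatic(p: float, V: int, C: int) -> float:
    """Expected allele frequency of a somatic variant.

    p: tumor purity, V: number of variant alleles in the tumor,
    C: tumor copy number at the locus. The normal fraction (1 - p)
    is assumed diploid and carries no variant copies.
    """
    return (p * V) / (p * C + 2 * (1 - p))


def expected_af_germline(p: float, V: int, C: int) -> float:
    """Expected allele frequency of a heterozygous germline variant.

    The normal fraction (1 - p) contributes one variant copy out of two.
    """
    return (p * V + (1 - p)) / (p * C + 2 * (1 - p))


af_som = expected_af_somatic(0.25, 3, 4)   # 0.75 / 2.5
af_germ = expected_af_germline(0.25, 3, 4)  # 1.5 / 2.5
```

The gap between the two values (0.3 versus 0.6 here) is what allows the two origins to be separated, but computing it requires estimates of p, V, and C.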

[0167] This example provides an alternative approach to the aforementioned SGZ algorithm that does not require modeling of tumor purity, variant allele number, or copy number. The allele frequency distance from the expected germline allele frequency (AFDIS) is determined as follows:

$$\mathrm{AFDIS} = \mathrm{AF}_{\text{germline}} - \mathrm{AF}_{\text{variant}}$$

where AF_germline is the allele frequency the sequence would have if it were a definitive germline sequence, as defined by the allele frequency of the corresponding proxy sequences, and AF_variant is the observed allele frequency of the sequence being characterized. To understand the allele frequency distance distribution of germline variants, the Circular Binary Segmentation algorithm described in Olshen et al., "Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data," Biostatistics, Vol. 5, No. 4, pp. 557-572 (October 2004), was used to segment the genomic sequences from 3,802 tumor samples based on copy number uniformity. Approximately 2.1 million known germline variants (identified in the dbSNP and/or gnomAD databases) were selected from the 3,802 samples, and the allele frequency (based on sequencing) of each germline variant was compared with the median allele frequency of proxy sequences within the same segment to determine the allele frequency distance of each germline variant. The probability density of the approximately 2.1 million germline variants from the 3,802 samples is shown in Figure 5B, and selected values are shown in Table 2. An empirical cumulative distribution function (ECDF) can be constructed from this germline AFDIS distribution data and used to assess the probability that a given AFDIS arises from a germline variant. [Table 2]
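An empirical cumulative distribution function of the kind mentioned above can be sketched in a few lines. The AFDIS values below are invented toy numbers, not data from the 3,802 samples, and `build_ecdf` is a hypothetical helper name.

```python
import bisect
from typing import Callable, Sequence


def build_ecdf(values: Sequence[float]) -> Callable[[float], float]:
    """Return a function x -> fraction of reference values that are <= x."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    if n == 0:
        raise ValueError("at least one reference value is required")

    def ecdf(x: float) -> float:
        # bisect_right counts how many sorted values are <= x
        return bisect.bisect_right(sorted_vals, x) / n

    return ecdf


# Toy germline AFDIS values (hypothetical); real data cluster tightly around 0.
germline_afdis = [-0.04, -0.02, -0.01, 0.0, 0.0, 0.01, 0.02, 0.03, 0.05, 0.12]
ecdf = build_ecdf(germline_afdis)

# Fraction of known germline variants whose AFDIS is at or below 0.1.
fraction_below = ecdf(0.1)
```

A candidate AFDIS that falls far into the upper tail of this distribution is unlikely to have come from a germline variant.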

[0168] An AFDIS threshold of 0.1, corresponding to a cumulative probability of 0.993 based on the ECDF described above, was empirically determined to effectively separate somatic variants from germline variants. As shown in Table 2, AFDIS thresholds ranging from approximately 0.05 to 0.1 all provided good discrimination between somatic and germline variants. Nevertheless, a trained statistical model was constructed to estimate the probability that any given sequence is germline or somatic, as described below.

[0169] Next, allelic frequency distances were determined for 92 high-purity / low-purity tumor samples with known germline sequences, somatic cell sequences, and tumor purity, all of which were matched genotypes. Generally, low-purity samples are considered to be approximations of normal samples and allow for reliable determination of the somatic cell vs. germline status of variants within them. Therefore, ground truths for the somatic cell / germline status of selected sequences were established using low-purity samples. Figure 5C shows variant AFDIS for germline and somatic cell sequences from 92 tumor samples, plotted against the calculated purity of the samples. Gray circles represent ground truth somatic cell sequences, and black circles represent ground truth germline sequences.

Example 2 - Logistic regression of somatic / germline status based on AFDIS

[0170] Logistic regression models were generated using available data from 21 matched tumor / normal pairs (lung squamous cell carcinoma (n=5), ovarian serous carcinoma (n=4), lung adenocarcinoma (n=3), invasive ductal carcinoma of the breast (n=2), anal carcinoma (n=1), urothelial carcinoma of the bladder (n=1), CRC (n=1), renal clear cell carcinoma (n=1), high-grade ovarian serous carcinoma (n=1), dermatosarcoma (n=1), and endometrial adenocarcinoma (n=1)). Matched tumor / normal pairs enabled reliable determination of somatic and germline sequences. Figure 5D shows the receiver operating characteristic (ROC) curve of this approach, i.e., of the classification model for distinguishing between somatic and germline variants. The curve plots true positive (TP) against false positive (FP) performance. Leave-one-out cross-validation (LOOCV) of the model showed a precision of 0.97 (95% confidence interval = [0.95, 0.99]) and a Cohen's (unweighted) kappa statistic of 0.93. The model was trained using matched tumor / normal pair data to output the probability that a given sequence is a somatic sequence. For known germline sequences in the training data, the probability that the sequence is somatic is 0. For known somatic sequences in the training data, the probability that the sequence is somatic is 1. The logistic regression model was trained using the training dataset according to the following function:

[Equation]
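The function itself is not reproduced in this text. As a hedged illustration only, a single-predictor logistic regression is conventionally written as P(somatic) = 1 / (1 + exp(-(b0 + b1 * AFDIS))). The sketch below fits such a model by plain gradient descent on invented toy data; the coefficient names, learning rate, and data are assumptions and do not reflect the trained model of the disclosure.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs: List[float], ys: List[int],
                 lr: float = 0.5, epochs: int = 5000) -> Tuple[float, float]:
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Toy training set (hypothetical): label 0 = known germline, 1 = known somatic.
afdis = [-0.03, -0.01, 0.0, 0.02, 0.04, 0.12, 0.10, 0.15, 0.22, 0.30, 0.03]
label = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(afdis, label)


def p_somatic(x: float) -> float:
    """Probability that a variant with AFDIS x is somatic under the toy fit."""
    return sigmoid(b0 + b1 * x)
```

The labels follow the convention stated above: known germline sequences are assigned a somatic probability of 0 and known somatic sequences a probability of 1.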

[0171] The AFDIS data, calculated as described above for variants in a total of 188 tumor samples across three different test sets, was input into a trained model to determine the probability that each selected sequence was somatic or germline. Based on the somatic variant probability, the variant sequences were labeled as somatic (above the somatic probability threshold), germline (below the germline probability threshold), or ambiguous (i.e., between the somatic and germline probability thresholds). See Figure 5F.
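The three-way labeling described above can be sketched as a pair of cut-offs on the model's somatic probability. The threshold values 0.9 and 0.1 are placeholders chosen for illustration; the source does not state the thresholds used.

```python
def label_variant(p_somatic: float,
                  somatic_threshold: float = 0.9,
                  germline_threshold: float = 0.1) -> str:
    """Label a variant from its somatic probability.

    Both threshold defaults are hypothetical.
    """
    if not 0.0 <= p_somatic <= 1.0:
        raise ValueError("p_somatic must be a probability in [0, 1]")
    if p_somatic > somatic_threshold:
        return "somatic"
    if p_somatic < germline_threshold:
        return "germline"
    return "ambiguous"


labels = [label_variant(p) for p in (0.97, 0.50, 0.02)]
```

Keeping an explicit ambiguous band lets downstream uses act only on confident calls.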

[0172] As shown in Figure 5G, the AFDIS classification index demonstrates an improvement over the conventional SGZ method in the classification of a set of 93 tumor samples with matched normal samples that was used to validate the conventional SGZ method. The genomic sequences of the 93 tumor samples were obtained using a different hybrid capture bait set than that used for the training dataset, demonstrating that the AFDIS classification index is robust and applicable to genomic data collected in various ways. Non-limiting examples of the variant-level performance of this method (number of true positives (TP), number of false positives (FP), and positive predictive value (PPV)) are outlined in Table 3. [Table 3]

[0173] A non-limiting example of sample-level sensitivity performance data for this method is shown in Figure 5H, and a non-limiting example of positive predictive value (PPV) performance data is shown in Figure 5I. The "violin plots" shown in Figures 5H and 5I have a plot shape that indicates the probability density of the values on the vertical axis. The box plot nested inside the violin plot shows the median, first and third quartiles, minimum, maximum, and outliers of the parameter plotted on the vertical axis. In the PPV plot of this example, the PPV for most samples is 100%, and therefore the median, maximum, and first and third quartile indices are compressed.

[0174] A non-limiting example of data for variant classification in the BRCA1 and BRCA2 genes is shown in Figure 5J. A non-limiting example of data for variant classification in the STK11 gene is shown in Figure 5K. As expected, BRCA1 and BRCA2 mutations were found to be enriched for germline-derived variants in breast cancer compared to other cancer types (p = 0.025, chi-squared test), and STK11 mutations were found to be enriched for somatic-derived variants in lung cancer compared to other cancer types (p = 0.0026, chi-squared test).

Example 3 - Logistic regression of somatic / germline status based on AFDIS

[0175] The disclosed method for distinguishing between somatic variants and germline variants is based on a comparison of the allele frequency (AF) of the variant in question with the allele frequencies of known variants located near its genomic location. In some cases, as described above, known germline variants in germline databases (e.g., public databases) can be used for the comparison. If the AF of the variant in question is very similar to the AF of nearby known germline variants, the variant is very likely to be germline; if it is very different, the variant is unlikely to be germline.

[0176] Generally, the AF of a given variant is primarily determined by its copy number and the tumor fraction of the sample. The tumor fraction is a constant for a particular sample, and therefore, the AF of a given variant in a given sample is largely determined by its copy number. This means that the AF can be compared to the AF of a germline variant of the same copy number to infer the somatic / germline state of the variant. Two non-limiting examples implementing such a comparison are described below and in Example 4.

[0177] In one implementation, an "allele frequency distance" (AFDIS) is calculated, which represents the distance between the AF of the variant in question and the median AF of germline variants located on the same copy number segment (for example, located in the same physically contiguous part of the genomic segment or, provided that the segment is present at the same copy number as the variant in question, located in a discontinuous part of the genomic segment). AFDIS was first calculated as:

$$\mathrm{AFDIS} = \left|\mathrm{MAF}_{\text{variant}} - \mathrm{MAF}_{\text{segment}}\right|$$

where MAF is the minor allele frequency; that is, AFDIS was the absolute distance between the minor allele frequency of the target variant and the median minor allele frequency of the segment germline variants. Next, to capture the relationship between “somatic probability” and AFDIS, a logistic regression model was trained with a training dataset consisting of known somatic and germline variants. The model was then improved by using a signed (directional) distance, i.e., by redefining AFDIS as:

$$\mathrm{AFDIS} = \mathrm{AF}_{\text{segment}} - \mathrm{AF}_{\text{variant}}$$

where AF_segment is the median allele frequency of the segment germline variants. In this formulation, the sign of AFDIS accounts for the fact that somatic variants have lower allele frequencies than germline variants of the same copy number when normal tissue, cells, or cfDNA are mixed into the sample. This is because sequencing reads from the normal part of the sample, or from normal cells in the blood, contain germline variants but not somatic variants. The logistic regression model is thereby trained to recognize that a negative AFDIS is associated with a low probability that the variant is somatic. The use of the directional AFDIS calculation improved the model's performance in distinguishing between somatic and germline variants.
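The difference between the absolute and the directional definition can be seen on a small example. The function names and the allele frequencies below are hypothetical; only the two formulas (absolute difference of minor allele frequencies, and segment median AF minus variant AF) come from the paragraph above.

```python
from statistics import median
from typing import Sequence


def afdis_absolute(variant_maf: float, segment_germline_mafs: Sequence[float]) -> float:
    """Unsigned distance between a variant's MAF and the segment's median germline MAF."""
    return abs(variant_maf - median(segment_germline_mafs))


def afdis_directional(variant_af: float, segment_germline_afs: Sequence[float]) -> float:
    """Signed distance: median germline AF on the segment minus the variant's AF."""
    return median(segment_germline_afs) - variant_af


segment_afs = [0.48, 0.50, 0.52]  # hypothetical germline AFs on one copy number segment

low_af_variant = afdis_directional(0.20, segment_afs)   # positive: AF below germline median
high_af_variant = afdis_directional(0.70, segment_afs)  # negative: AF above germline median
unsigned = afdis_absolute(0.20, segment_afs)
```

With the signed form, a variant whose AF sits below the germline median (as expected for a somatic variant diluted by normal DNA) and one whose AF sits above it are no longer treated alike.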

[0178] The AFDIS-based approach has the advantage of being computationally simple and easy to perform, and therefore can be easily modified to include other considerations in a given implementation. Specifically, since AFDIS is a single predictor variable in the logistic regression model, the AFDIS value can be easily adjusted to correct the results to account for other potential technical issues. For example, to account for the increased uncertainty introduced by mild contamination of nucleic acid samples, adjustments can be applied to the AFDIS value depending on the contamination level to move the AFDIS value to a range that corresponds to a more accurate classification of somatic / germline variants by the model. Similar adjustments can be made to account for additional uncertainties introduced by factors such as low read depth, noisy AF estimation, low segment germline SNP count, and high variability in segment germline SNP AF. The extent and manner in which these adjustments are implemented can be designed and adjusted using training datasets containing known somatic and germline variants.

Example 4 - Germline exclusion based on the probability distribution of germline allele frequencies

[0179] In this particular implementation, a large dataset of known germline variants is constructed, each having its own AF and a corresponding segment MAF, which is the median MAF of the other known germline variants located in the same copy number segment. Figure 6A shows a plot of variant AF versus segment MAF. For an unknown variant to be classified, its AF and corresponding segment MAF are determined. To classify the unknown variant, the subset of known germline variants whose segment MAFs are similar to that of the unknown variant is retrieved from the known germline dataset (e.g., one of the three density vs. variant AF plots shown in Figure 6B, which correspond to the variant allele frequency distributions around segment MAFs of 0.1, 0.2, and 0.3, respectively, in Figure 6A). Using this data, a distribution of germline AF values can be established for a given segment MAF (i.e., for a given copy number, since segment MAF is essentially determined by the copy number of the segment). The AF of the unknown variant is compared to this germline AF distribution to estimate the probability that the unknown variant is a germline variant. For example, an unknown variant with an AF of either 0.1 or 0.9 and a segment MAF of 0.1 is likely to be a germline variant, while an unknown variant with an AF of 0.4 and a segment MAF of 0.1 is likely to be a somatic variant.
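One way to sketch this lookup is to select the reference germline variants with a similar segment MAF and ask what fraction of them have an AF close to the unknown variant's AF. The scoring rule, the window sizes, and the reference records below are all assumptions made for illustration; the source does not specify how the probability is computed from the distribution.

```python
from typing import List, Tuple

# Hypothetical reference records: (variant AF, segment MAF) for known germline variants.
GermlineRecord = Tuple[float, float]


def germline_support(unknown_af: float, unknown_segment_maf: float,
                     reference: List[GermlineRecord],
                     maf_window: float = 0.02, af_window: float = 0.05) -> float:
    """Fraction of similar-segment germline variants whose AF is near the unknown AF.

    maf_window and af_window are illustrative tolerances, not values from the source.
    """
    similar = [af for af, seg_maf in reference
               if abs(seg_maf - unknown_segment_maf) <= maf_window]
    if not similar:
        raise ValueError("no reference germline variants with a similar segment MAF")
    near = sum(1 for af in similar if abs(af - unknown_af) <= af_window)
    return near / len(similar)


reference = [
    (0.09, 0.1), (0.10, 0.1), (0.11, 0.1),  # germline AFs near the segment MAF
    (0.89, 0.1), (0.90, 0.1), (0.91, 0.1),  # and near 1 - segment MAF
    (0.30, 0.3), (0.70, 0.3),               # a different copy number state
]

support_at_010 = germline_support(0.10, 0.1, reference)
support_at_040 = germline_support(0.40, 0.1, reference)
```

In this toy reference set, an AF of 0.1 or 0.9 at segment MAF 0.1 is well supported by germline variants, while an AF of 0.4 has no germline support, mirroring the example in the text.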

Example 5 - Performance Verification

[0180] The disclosed method provides an exemplary technique for selecting somatic variants from baseline tissue or liquid biopsy samples for plasma monitoring. To further enhance performance for this particular purpose, several additional measures have been devised, including (i) selecting well-behaved variants for constructing the logistic regression model (e.g., by excluding variants located in genomic regions where allele frequencies are known or expected to deviate from expected values, such as regions with repetitive sequences or regions sharing homology with other regions of the genome); (ii) incorporating prior knowledge of the likelihood that a variant is a germline variant, a somatic variant, or a clonal hematopoiesis of indeterminate potential (CHIP) variant, based on historical data and publicly available databases; and (iii) taking into account the noise level of variant calling and its genomic context. These measures have been found to improve the performance of somatic variant classification.

[0181] The ability of the disclosed AFDIS-based logistic regression model to distinguish somatic variants from germline variants in a sample was validated, for example, using data from matched tumor / normal pairs. Tables 4 and 5 outline the initial training and test datasets used to develop the logistic regression model, as well as non-limiting examples of performance metrics (false positives (FP), sensitivity, and positive predictive value (PPV)) obtained at the variant level and the sample level, respectively. [Table 4] [Table 5]
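For reference, the two ratio metrics named above follow their standard definitions: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The counts in the sketch are invented and are not values from Tables 4 and 5.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of true somatic variants that were called somatic."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no true somatic variants to evaluate")
    return true_positives / total


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of somatic calls that are truly somatic."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no somatic calls to evaluate")
    return true_positives / total


# Hypothetical counts for one sample.
sens = sensitivity(true_positives=90, false_negatives=10)
ppv = positive_predictive_value(true_positives=90, false_positives=3)
```

Variant-level metrics pool the counts across all samples, whereas sample-level metrics compute these ratios per sample before summarizing.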

[0182] The dataset used in the variant calling pipeline validation study included data from 86 matched tissue / peripheral blood mononuclear cell (PBMC) sample pairs. Performance metrics at the variant level and the sample level are summarized in Tables 6 and 7, respectively. [Table 6] [Table 7]

[0183] The dataset used in the additional variant calling pipeline validation study included data from 746 matched tissue / peripheral blood mononuclear cell (PBMC) sample pairs. Performance metrics at the variant level and the sample level are summarized in Tables 8 and 9, respectively. [Table 8] [Table 9]

[0184] It will be understood that the above methods and systems are presented as examples, not as limitations. Numerous variations, additions, omissions, and other modifications will be apparent to those skilled in the art. In addition, the order or presentation of the method steps in the above description and drawings is not intended to require that the enumerated steps be performed in this order unless a specific order is explicitly required or is evident from the context.

[0185] The methods and processes of the present invention described herein are intended to include any suitable method of causing one or more other parties or entities to perform the process, unless a different meaning is expressly provided or is evident from the context. In some embodiments, such parties or entities do not need to be under the direction or control of any other party or entity, nor do they need to be located within a particular jurisdiction. Thus, for example, a statement or enumeration such as “add a first number to a second number” includes causing one or more parties or entities to add the two numbers together. For example, if person X enters into an arm's-length agreement with person Y to add two numbers, and person Y actually adds the two numbers, then both person X and person Y perform the recited process: person Y by actually adding the numbers, and person X by causing person Y to add the numbers. Furthermore, if person X is located in the United States and person Y is located outside the United States, the method is performed in the United States by virtue of person X's involvement in causing the process to be performed.

[0186] The terms used in the description of the various embodiments described herein are intended solely to describe a particular embodiment and are not intended to limit it. As used in the description of the various embodiments described and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural form unless the context clearly indicates otherwise. The terms “and / or” as used herein will also be understood to refer to and encompass any possible combination of one or more of the enumerated items relating to the description. Where used herein, the terms “includes,” “including,” “comprises,” and / or “comprising” express the presence of the described features, integers, processes, operations, elements, and / or components, but will not exclude the presence or addition of one or more other features, integers, processes, operations, elements, components, and / or groups thereof.

[0187] All publications, patents, and patent application disclosures referenced herein are incorporated herein by reference in their entirety. In the event of any conflict between the references incorporated herein and this disclosure, this disclosure shall prevail.

[0188] While specific embodiments of the present invention have been shown and described, it will be apparent to those skilled in the art that various changes and modifications in form and detail can be made without departing from the spirit and scope of the invention as defined by the following claims. The following claims are intended to include all possible changes and modifications and should be interpreted in the broadest sense permitted by law.

Claims

1. A method for identifying a target genome sequence as germline or somatic, the method comprising: providing a plurality of nucleic acid molecules obtained from a sample from a subject, wherein the plurality of nucleic acid molecules includes a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selectively attaching one or more adapters to one or more nucleic acid molecules from the plurality of nucleic acid molecules; amplifying nucleic acid molecules from the plurality of nucleic acid molecules; capturing nucleic acid molecules from the amplified nucleic acid molecules, wherein the captured nucleic acid molecules are captured from the amplified nucleic acid molecules by hybridization to one or more bait molecules; sequencing the captured nucleic acid molecules using a sequencer to obtain a plurality of sequence reads corresponding to one or more genomic loci; selecting, by one or more processors, a target genome sequence at a genomic locus from the one or more genomic loci; selecting, by the one or more processors, one or more proxy genome sequences for the target genome sequence; determining, by the one or more processors, an allele frequency distance using summary statistics or distributions that show the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic using the allele frequency distance.

2. The method according to claim 1, wherein the subject is a cancer patient.

3. The method according to claim 1 or claim 2, wherein the sample includes a tissue biopsy sample, a liquid biopsy sample, a circulating tumor cell (CTC) sample, a cell-free DNA (cfDNA) sample, or a normal control.

4. The method according to claim 3, wherein the sample is a liquid biopsy sample and includes blood, plasma, cerebrospinal fluid, sputum, feces, urine, or saliva.

5. The method according to any one of claims 1 to 3, wherein the tumor nucleic acid molecule is derived from the tumor portion of the heterogeneous tissue biopsy sample, and the non-tumor nucleic acid molecule is derived from the normal portion of the heterogeneous tissue biopsy sample.

6. The method according to any one of claims 1 to 3, wherein the tumor nucleic acid molecule is derived from the circulating tumor DNA (ctDNA) fraction of the cell-free DNA sample, and the non-tumor nucleic acid molecule is derived from the non-tumor fraction of the cell-free DNA sample.

7. The method according to any one of claims 1 to 6, wherein the one or more adapters include an amplification primer or a sequencing adapter.

8. The method according to any one of claims 1 to 7, wherein the one or more bait molecules comprise one or more nucleic acid molecules, and each nucleic acid molecule comprises a region complementary to the region of the captured nucleic acid molecule.

9. The method according to any one of claims 1 to 8, wherein amplification of nucleic acid molecules is performed by carrying out a polymerase chain reaction (PCR) or isothermal amplification technique.

10. The method according to any one of claims 1 to 9, wherein the sequencing includes the use of next-generation sequencing (NGS) technology.

11. The method according to any one of claims 1 to 10, wherein the sequencer includes a next-generation sequencer.

12. The method according to any one of claims 1 to 11, wherein one or more proxy genome sequences are located within a defined segment of the target genome sequence, and the selected target genome sequence is located within the same defined segment.

13. The method according to claim 12, wherein the target genome sequence is segmented into a plurality of segments based on the uniformity of copy numbers within each segment.

14. The method according to any one of claims 1 to 13, wherein the summary statistic is the mean allele frequency or the median allele frequency.

15. The method according to any one of claims 1 to 14, wherein the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence falls within the distribution.
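By way of illustration only, the allele frequency distance recited above can be sketched in two forms: an absolute gap between the target and a summary statistic (mean or median) of the proxies, and a gap expressed in units of the proxies' spread as a simple stand-in for a distribution-based comparison. The function names, the 0.10 cutoff, and the example frequencies are hypothetical and are not taken from the claims or the description.

```python
from statistics import mean, median, stdev


def allele_frequency_distance(target_af, proxy_afs, statistic="median"):
    """Absolute gap between the target AF and a summary statistic of the proxy AFs."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele frequency is required")
    if statistic == "median":
        center = median(proxy_afs)
    elif statistic == "mean":
        center = mean(proxy_afs)
    else:
        raise ValueError("statistic must be 'mean' or 'median'")
    return abs(target_af - center)


def standardized_distance(target_af, proxy_afs):
    """Gap measured in units of the proxies' sample standard deviation."""
    if len(proxy_afs) < 2:
        raise ValueError("at least two proxies are needed to estimate a spread")
    spread = stdev(proxy_afs)
    if spread == 0:
        raise ValueError("proxy allele frequencies have zero spread")
    return abs(target_af - mean(proxy_afs)) / spread


def call_origin(target_af, proxy_afs, cutoff=0.10):
    """Hypothetical rule: a target close to its proxies is called germline."""
    distance = allele_frequency_distance(target_af, proxy_afs)
    return "germline" if distance <= cutoff else "somatic"


# Made-up germline SNP allele frequencies from a single segment (assumption).
proxies = [0.48, 0.50, 0.52, 0.49, 0.51]
print(call_origin(0.50, proxies))  # target sits with the proxies
print(call_origin(0.20, proxies))  # target sits far from the proxies
```

A fixed cutoff is used here only to keep the sketch short; the claims also contemplate passing the distance to a trained statistical model.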

16. A method for identifying a target genome sequence as germline or somatic, the method comprising: selecting, by one or more processors, a target genome sequence at a genomic locus from within a patient genome sequence obtained for a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; selecting, by the one or more processors, one or more proxy genome sequences for the target genome sequence; determining, by the one or more processors, an allele frequency distance using summary statistics or distributions that show the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic using the allele frequency distance.

17. The method according to claim 16, comprising sequencing the tumor nucleic acid molecules and non-tumor nucleic acid molecules from the patient sample using a sequencer to determine the patient genome sequence.

18. The method according to claim 17, wherein the patient genome sequence is obtained using next-generation sequencing technology.

19. The method according to claim 17, wherein the sequencer is a next-generation sequencer.

20. The method according to any one of claims 16 to 19, wherein one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment.

21. The method according to claim 20, wherein the patient genome sequence is segmented into a plurality of segments based on the uniformity of copy numbers within each segment.

22. The method according to claim 20 or 21, comprising segmenting the patient genome sequence into a plurality of segments.

23. The method according to any one of claims 16 to 22, wherein the summary statistic is the mean allele frequency or the median allele frequency.

24. The method according to any one of claims 16 to 23, wherein the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence falls within the distribution.

25. The method according to any one of claims 16 to 24, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include a DNA molecule.

26. The method according to any one of claims 16 to 25, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include an RNA molecule.

27. The method according to any one of claims 16 to 26, wherein the patient genome sequence is determined using targeted sequencing.

28. The method according to claim 27, wherein the targeted sequencing comprises targeted sequencing of one or more genes or a portion thereof related to cancer.

29. The method according to claim 27 or claim 28, wherein the targeted sequencing comprises targeted sequencing of one or more exon regions.

30. A method for identifying a target genome sequence as germline or somatic, the method comprising: identifying, using one or more processors, a target genome sequence at a genomic locus in a patient sample; identifying, by the one or more processors, one or more proxy genome sequences for the target genome sequence; comparing, by the one or more processors, the observed frequency of the target genome sequence with a centrality measure of the observed frequencies of the one or more proxy genome sequences; and identifying, by the one or more processors, the target genome sequence as germline or somatic based on the comparison.

31. The method according to claim 30, further comprising using one or more processors to identify a segment of the patient's genome containing the genomic locus.

32. The method according to claim 31, wherein identifying the segment by the one or more processors includes performing a segmentation procedure on a continuous portion of the patient's genome.

33. The method according to claim 32, wherein the portion of the patient's genome is large enough to identify three different segments.

34. The method according to claim 31, wherein the one or more proxy genome sequences are identified by the one or more processors such that they are located within the same segment as the genomic locus.

35. The method according to claim 32, wherein the segmentation procedure involves one or more processors identifying segments according to whether the genomic parameters are equal throughout each individual segment.

36. The method according to claim 35, wherein the genome parameter is the copy number.
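For illustration, the segmentation and same-segment proxy selection recited above can be sketched as follows. This is a deliberately simplified assumption: segments are formed wherever consecutive loci report exactly the same copy number, whereas a practical pipeline would segment noisy coverage data with a dedicated algorithm. The function names and the example loci are hypothetical.

```python
from itertools import groupby


def segment_by_copy_number(loci):
    """Group consecutive (position, copy_number) loci that share one copy number.

    Returns (start, end, copy_number) tuples; loci must be sorted by position.
    """
    segments = []
    for copy_number, run in groupby(loci, key=lambda locus: locus[1]):
        run = list(run)
        segments.append((run[0][0], run[-1][0], copy_number))
    return segments


def proxies_in_same_segment(target_pos, snp_positions, segments):
    """Return the SNP positions lying in the segment that contains the target."""
    for start, end, _ in segments:
        if start <= target_pos <= end:
            return [p for p in snp_positions
                    if start <= p <= end and p != target_pos]
    return []


# Made-up loci: three runs of uniform copy number give three segments.
loci = [(100, 2), (200, 2), (300, 2), (400, 3), (500, 3), (600, 2), (700, 2)]
segments = segment_by_copy_number(loci)
print(segments)
print(proxies_in_same_segment(450, [150, 250, 420, 480, 650], segments))
```

Restricting proxies to the target's own segment keeps the comparison between sequences that share the same copy-number state.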

37. The method according to any one of claims 16 to 36, wherein identifying the target genome sequence as germline or somatic by the one or more processors comprises inputting the allele frequency distance into a trained statistical model, wherein the trained statistical model outputs a value indicating the likelihood that the target genome sequence is germline, or a value indicating the likelihood that the target genome sequence is somatic.

38. The method according to any one of claims 16 to 37, wherein the allele frequency distance is adjusted to correct for contamination levels in the patient sample, low sequencing read depth, noisy estimates of allele frequencies, low segment germline single nucleotide polymorphism (SNP) counts, or high variability in segment germline SNP allele frequencies.

39. The method according to claim 37 or claim 38, wherein the trained statistical model includes a function that associates the allele frequency distance with a value indicating the likelihood that the target genome sequence is germline, or with a value indicating the likelihood that the target genome sequence is somatic.

40. The method according to any one of claims 37 to 39, wherein the pre-trained statistical model is a logistic regression model.

41. The method according to any one of claims 37 to 40, further comprising training the statistical model using data on tumor samples having known germline sequences.

42. The method according to any one of claims 37 to 41, further comprising training the statistical model with data on tumor samples having known germline sequences and known somatic cell sequences.

43. The method according to any one of claims 37 to 40, wherein the trained statistical model is trained using data on tumor samples having known germline sequences.

44. The method according to claim 43, wherein the trained statistical model is trained using data on tumor samples having known germline sequences and known somatic cell sequences.

45. The method according to any one of claims 37 to 44, further comprising training the statistical model with data on variant allele frequencies to exclude variants located in genomic regions known to have allele frequencies that deviate from expected values.

46. The method according to any one of claims 37 to 44, wherein the trained statistical model is trained using data on variant allele frequencies to exclude variants located in genomic regions known to have allele frequencies that deviate from expected values.

47. The method according to any one of claims 37 to 46, further comprising training the statistical model with data that incorporates prior knowledge, based on historical data or a database, of the likelihood of variants being germline variants, somatic variants, or clonal hematopoiesis of indeterminate potential (CHIP) variants.

48. The method according to any one of claims 37 to 46, wherein the trained statistical model is trained using data that incorporates prior knowledge, based on historical data or a database, of the likelihood of variants being germline variants, somatic variants, or clonal hematopoiesis of indeterminate potential (CHIP) variants.

49. The method according to any one of claims 37 to 48, further comprising training the statistical model with data describing the noise level for a given variant call and its genomic context.

50. The method according to any one of claims 37 to 48, wherein the trained statistical model is trained using data that describes the noise level for a given variant call and its genomic context.
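As a non-limiting sketch of the trained statistical model recited above, a one-feature logistic regression can map an allele frequency distance to a value indicating the likelihood that a sequence is somatic. The training distances and labels below are invented toy values, and the function names, learning rate, and step count are assumptions made for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(distances, labels, learning_rate=1.0, steps=2000):
    """Fit an intercept and slope by batch gradient descent on the log loss."""
    intercept, slope = 0.0, 0.0
    n = len(distances)
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for d, y in zip(distances, labels):
            error = sigmoid(intercept + slope * d) - y
            grad_b += error
            grad_w += error * d
        intercept -= learning_rate * grad_b / n
        slope -= learning_rate * grad_w / n
    return intercept, slope


def somatic_probability(distance, intercept, slope):
    """Value between 0 and 1; the germline likelihood is its complement."""
    return sigmoid(intercept + slope * distance)


# Toy training set (assumption): label 1 = known somatic, 0 = known germline.
train_distances = [0.01, 0.02, 0.03, 0.05, 0.20, 0.25, 0.30, 0.35]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
intercept, slope = fit_logistic(train_distances, train_labels)
print(somatic_probability(0.02, intercept, slope))
print(somatic_probability(0.30, intercept, slope))
```

Additional inputs named in the claims, such as contamination level, read depth, or prior knowledge about a variant, could enter such a model as further features; they are omitted here for brevity.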

51. The method according to any one of claims 16 to 50, wherein the one or more proxy genome sequences include a single nucleotide polymorphism (SNP).

52. The method according to any one of claims 16 to 51, wherein the one or more proxy genome sequences include alleles.

53. The method according to any one of claims 16 to 52, wherein the target genome sequence includes a genome variant.

54. The method according to any one of claims 16 to 53, further comprising generating, using the one or more processors, a report identifying the target genome sequence as germline or somatic.

55. The method according to claim 54, comprising transmitting the aforementioned report to a healthcare provider.

56. The method according to claim 54 or claim 55, wherein the report is transmitted via a computer network or peer-to-peer connection.

57. The method according to any one of claims 16 to 56, wherein the patient sample is derived from a tissue biopsy including tumor tissue and non-tumor tissue.

58. The method according to claim 57, wherein the tissue biopsy is a solid tissue biopsy or a liquid biopsy.

59. The method according to claim 58, wherein the tissue biopsy is a liquid biopsy containing blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.

60. The method according to any one of claims 16 to 59, wherein the patient sample comprises cell-free DNA (cfDNA) obtained from the subject.

61. The method according to any one of claims 16 to 60, wherein the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject.

62. A method for obtaining one or more somatic sequences as an indicator of a cancer treatment modality to be used for treating a patient's cancer, the method comprising identifying one or more target genome sequences as somatic using the method according to any one of claims 16 to 61, wherein the one or more identified somatic sequences serve as an indicator that the cancer treatment modality should be used for the treatment of the cancer.

63. The method according to claim 62, wherein the one or more identified somatic sequences are associated with the success of cancer treatment using the selected treatment modality.

64. The method according to claim 62, comprising determining, by the one or more processors, a microsatellite instability status of the cancer using the one or more identified somatic sequences, wherein the microsatellite instability status of the cancer serves as an indicator that the cancer treatment modality should be used for the treatment of the cancer.

65. The method according to claim 62, comprising determining, by the one or more processors, a tumor mutational burden for the cancer using the one or more identified somatic sequences, wherein a tumor mutational burden exceeding a predetermined tumor mutational burden threshold serves as an indicator that the cancer treatment modality should be used for the treatment of the cancer.

66. The method according to claim 65, wherein the cancer treatment modality includes administering an effective amount of one or more anticancer agents to the patient when the tumor mutational burden exceeds the predetermined threshold.

67. The method according to claim 66, wherein the one or more anticancer agents include a cancer immunotherapy agent.

68. The method according to claim 67, wherein the cancer immunotherapy agent is an immune checkpoint inhibitor.
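For illustration, the tumor mutation load comparison recited above can be sketched as a count of variants called somatic per megabase of sequenced territory, checked against a threshold. The 10 mutations per megabase threshold, the 0.8 Mb panel size, and the function names are assumptions chosen for this example and are not values stated in the claims.

```python
def tumor_mutational_burden(somatic_variant_count, panel_size_bp):
    """Somatic mutations per megabase of sequenced territory."""
    if panel_size_bp <= 0:
        raise ValueError("panel size must be positive")
    return somatic_variant_count * 1_000_000 / panel_size_bp


def exceeds_threshold(calls, panel_size_bp, threshold_per_mb=10.0):
    """Count only variants called somatic; germline calls do not contribute."""
    somatic_count = sum(1 for call in calls if call == "somatic")
    return tumor_mutational_burden(somatic_count, panel_size_bp) > threshold_per_mb


# Made-up call set: 12 somatic and 30 germline calls on a 0.8 Mb panel.
calls = ["somatic"] * 12 + ["germline"] * 30
print(tumor_mutational_burden(12, 800_000))
print(exceeds_threshold(calls, 800_000))
```

The sketch shows why the germline versus somatic call matters here: a germline variant miscalled as somatic would inflate the count and could push a sample over the threshold.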

69. A method for monitoring the progression or recurrence of cancer in a patient, the method comprising: identifying one or more target genome sequences as somatic using the method according to any one of claims 16 to 67, wherein the patient sample is obtained from a patient with cancer; and detecting, by the one or more processors, the presence or absence of the one or more target genome sequences identified as somatic in a second patient sample obtained from the patient after the cancer has been treated.

70. The method according to claim 69, wherein the second patient sample is obtained from the patient who is undergoing cancer treatment after the first patient sample has been obtained from the patient.

71. The method according to any one of claims 69 to 70, wherein the second patient sample comprises cell-free DNA.

72. The method according to any one of claims 69 to 71, wherein detecting the presence or absence of one or more target genomic sequences identified as somatic cells in the second patient sample comprises sequencing nucleic acid molecules in the second patient sample.

73. A method for selecting a neoantigen for a personalized cancer vaccine for a subject with cancer, the method comprising: identifying one or more target genome sequences as somatic using the method according to any one of claims 16 to 67, wherein the one or more target genome sequences identified as somatic are located within an exon region of a gene; and selecting, using the one or more processors, from the one or more target genome sequences identified as somatic, a genome sequence encoding a neoantigen suitable as a cancer vaccine for the subject.

74. The method according to claim 73, further comprising preparing a vaccine containing the neoantigen.

75. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: select a target genome sequence at a genomic locus from a patient genome sequence obtained from a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; select one or more proxy genome sequences for the target genome sequence; determine an allele frequency distance using summary statistics or distributions that show the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and identify the target genome sequence as germline or somatic using the allele frequency distance.

76. The non-transitory computer-readable storage medium according to claim 75, wherein the one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment.

77. The non-transitory computer-readable storage medium according to claim 76, wherein the patient genome sequence is segmented into a plurality of segments based on the uniformity of the copy number within each segment.

78. The non-transitory computer-readable storage medium according to claim 76 or 77, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to segment the patient genome sequence into a plurality of segments.

79. The non-transitory computer-readable storage medium according to any one of claims 75 to 78, wherein the summary statistic is the mean allele frequency or the median allele frequency.

80. The non-transitory computer-readable storage medium according to any one of claims 75 to 79, wherein the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence falls within the distribution.

81. The non-transitory computer-readable storage medium according to any one of claims 75 to 80, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include a DNA molecule.

82. The non-transitory computer-readable storage medium according to any one of claims 75 to 81, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include an RNA molecule.

83. The non-transitory computer-readable storage medium according to any one of claims 75 to 82, wherein the patient genome sequence is determined using targeted sequencing.

84. The non-transitory computer-readable storage medium according to any one of claims 75 to 83, wherein the patient genome sequence is determined using next-generation sequencing.

85. The non-transitory computer-readable storage medium according to claim 83 or 84, wherein the targeted sequencing comprises targeted sequencing of one or more genes or a portion thereof related to cancer.

86. The non-transitory computer-readable storage medium according to any one of claims 83 to 85, wherein the targeted sequencing comprises targeted sequencing of one or more exon regions.

87. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: identify a target genome sequence at a genomic locus in a patient sample; identify one or more proxy genome sequences for the target genome sequence; compare the observed frequency of the target genome sequence with a centrality measure of the observed frequencies of the one or more proxy genome sequences; and identify the target genome sequence as germline or somatic based on the comparison.

88. The non-transitory computer-readable storage medium according to claim 87, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to identify a segment of the patient's genome containing the genomic locus.

89. The non-transitory computer-readable storage medium according to claim 87, wherein identifying the segment comprises performing a segmentation procedure on a continuous portion of the patient's genome.

90. The non-transitory computer-readable storage medium according to claim 89, wherein the portion of the patient's genome is large enough to identify three different segments.

91. The non-transitory computer-readable storage medium according to any one of claims 87 to 90, wherein the one or more proxy genome sequences are identified as being located within the same segment as the genomic locus.

92. The non-transitory computer-readable storage medium according to any one of claims 89 to 91, wherein the segmentation procedure identifies segments according to whether the genomic parameters are equal throughout each individual segment.

93. The non-transitory computer-readable storage medium according to claim 92, wherein the genome parameter is the copy number.

94. The non-transitory computer-readable storage medium according to any one of claims 75 to 93, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to receive sequencing data relating to the patient genome sequence.

95. The non-transitory computer-readable storage medium according to claim 94, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to assemble the patient genome sequence using the sequencing data.

96. The non-transitory computer-readable storage medium according to claim 94 or 95, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause a sequencer to sequence nucleic acid molecules derived from the patient sample and thereby obtain the sequencing data.

97. The non-transitory computer-readable storage medium according to any one of claims 75 to 96, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to generate a report identifying the target genome sequence as germline or somatic.

98. The non-transitory computer-readable storage medium according to any one of claims 75 to 97, wherein the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to transmit the report using a computer network.

99. The non-transitory computer-readable storage medium according to any one of claims 75 to 98, wherein the electronic device comprises a display, and the one or more programs further include instructions that, when executed by the one or more processors of the electronic device, cause the electronic device to display the report.

100. The non-transitory computer-readable storage medium according to any one of claims 75 to 99, wherein the one or more proxy genome sequences include a single nucleotide polymorphism (SNP).

101. The non-transitory computer-readable storage medium according to any one of claims 75 to 100, wherein the one or more proxy genome sequences include alleles.

102. The non-transitory computer-readable storage medium according to any one of claims 75 to 101, wherein the target genome sequence includes a genome variant.

103. An electronic device comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, wherein the one or more programs include: instructions to select a target genome sequence at a genomic locus from within a patient genome sequence obtained from a patient sample containing a mixture of tumor nucleic acid molecules and non-tumor nucleic acid molecules; instructions to select one or more proxy genome sequences for the target genome sequence; instructions to determine an allele frequency distance using summary statistics or distributions that show the observed allele frequencies of the target genome sequence and the observed allele frequencies of the one or more proxy genome sequences; and instructions to identify the target genome sequence as germline or somatic using the allele frequency distance.

104. The electronic device according to claim 103, wherein one or more proxy genome sequences are located within a defined segment of the patient genome sequence, and the selected target genome sequence is located within the same defined segment.

105. The electronic device according to claim 104, wherein the patient genome sequence is segmented into a plurality of segments based on the uniformity of copy numbers within each segment.

106. The electronic device according to any one of claims 103 to 105, wherein the one or more programs further include instructions for segmenting the patient genome sequence into a plurality of segments.

107. The electronic device according to any one of claims 103 to 106, wherein the summary statistic is the mean allele frequency or the median allele frequency.

108. The electronic device according to any one of claims 103 to 107, wherein the allele frequency distance is determined using a distribution showing the observed allele frequencies of the target genome sequence and the observed allele frequencies of a plurality of proxy genome sequences, and the target genome sequence is identified as germline or somatic based on the probability that the observed allele frequency of the target genome sequence falls within the distribution.

109. The electronic device according to any one of claims 103 to 108, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include a DNA molecule.

110. The electronic device according to any one of claims 103 to 109, wherein the tumor nucleic acid molecule and the non-tumor nucleic acid molecule include an RNA molecule.

111. The electronic device according to any one of claims 103 to 110, wherein the patient genome sequence is determined using next-generation sequencing.

112. The electronic device according to any one of claims 103 to 111, wherein the patient genome sequence is determined using targeted sequencing.

113. The electronic device according to claim 112, wherein the targeted sequencing includes targeted sequencing of one or more genes or a portion thereof related to cancer.

114. The electronic device according to claim 112 or claim 113, wherein the targeted sequencing includes targeted sequencing of one or more exon regions.

115. An electronic device comprising: one or more processors; and a memory storing one or more programs configured to be executed by the one or more processors, wherein the one or more programs include: instructions to identify a target genome sequence at a genomic locus in a patient sample; instructions to identify one or more proxy genome sequences for the target genome sequence; instructions to compare the observed frequency of the target genome sequence with a centrality measure of the observed frequencies of the one or more proxy genome sequences; and instructions to identify the target genome sequence as germline or somatic based on the comparison.

116. The electronic device according to claim 115, wherein the one or more programs further include instructions for identifying a segment of the patient's genome in which the genomic locus is located.

117. The electronic device according to claim 116, wherein identifying the segment comprises performing a segmentation procedure on a continuous portion of the patient's genome.

118. The electronic device according to claim 117, wherein the portion of the patient's genome is large enough to identify three different segments.

119. The electronic device according to any one of claims 116 to 118, wherein one or more proxy genome sequences are identified as being located within the same segment as the genomic locus.

120. The electronic device according to any one of claims 117 to 119, wherein the segmentation procedure identifies segments such that a genomic parameter is equal throughout each individual segment.

121. The electronic device according to claim 120, wherein the genomic parameter is copy number.

122. The electronic device according to any one of claims 103 to 121, wherein the one or more programs further include instructions for receiving sequencing data related to the patient genome sequence.

123. The electronic device according to claim 122, wherein the one or more programs further include instructions for assembling the patient genome sequence using the sequencing data.

124. The electronic device according to claim 122 or 123, wherein the one or more programs further include instructions for causing a sequencer to sequence nucleic acid molecules derived from the patient sample, thereby obtaining the sequencing data.

125. The electronic device according to any one of claims 103 to 124, wherein the one or more proxy genome sequences include a single nucleotide polymorphism (SNP).

126. The electronic device according to any one of claims 103 to 125, wherein the one or more proxy genome sequences include alleles.

127. The electronic device according to any one of claims 103 to 126, wherein the target genome sequence includes a genome variant.

128. The electronic device according to any one of claims 103 to 127, wherein the one or more programs further include instructions for generating a report identifying the target genome sequence as either germline or somatic.

129. The electronic device according to claim 128, wherein the one or more programs further include instructions for transmitting the report via a computer network or peer-to-peer connection.

130. The electronic device according to claim 128 or 129, wherein the device further comprises a display, and the one or more programs further include instructions for displaying the report.

131. The electronic device according to any one of claims 103 to 130, wherein the patient sample is derived from a tissue biopsy including tumor tissue and non-tumor tissue.

132. The electronic device according to claim 131, wherein the tissue biopsy is a solid tissue biopsy or a liquid biopsy.

133. The electronic device according to claim 132, wherein the tissue biopsy is a liquid biopsy including blood, plasma, cerebrospinal fluid, sputum, feces, urine, or saliva.

134. The electronic device according to any one of claims 103 to 133, wherein the patient sample comprises cell-free DNA (cfDNA) obtained from the subject.

135. The electronic device according to any one of claims 103 to 134, wherein the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject.

136. A system comprising an electronic device according to any one of claims 103 to 135 and a sequencer configured to sequence nucleic acid molecules derived from the patient sample.

137. The system according to claim 136, wherein the sequencer is a next-generation sequencer.

138. A method for identifying a target genome sequence as germline or somatic, the method comprising: identifying, using one or more processors, a target genome sequence in a patient sample at a genomic locus; identifying, using the one or more processors, a proxy genome sequence for the target genome sequence; comparing, using the one or more processors, the observed allele fraction of the target genome sequence with the observed allele fraction of the proxy genome sequence; and identifying, using the one or more processors, the target genome sequence as germline or somatic based on the comparison.

139. The method according to claim 138, wherein the proxy genome sequence has the same copy number as the target genome sequence.

140. The method according to claim 138 or claim 139, wherein identifying, using the one or more processors, the target genome sequence as germline or somatic comprises inputting an allele frequency distance into a trained statistical model, wherein the trained statistical model outputs a value indicating the likelihood that the target genome sequence is germline, or a value indicating the likelihood that the target genome sequence is somatic.

141. The method according to any one of claims 138 to 140, wherein the allele fraction of the target genome sequence and the allele fraction of the proxy genome sequence are determined using next-generation sequencing technology.

142. The method according to claim 141, wherein the allele fraction of the target genome sequence and the allele fraction of the proxy genome sequence are determined using microarray technology.

143. The method according to any one of claims 138 to 142, wherein the patient sample includes a solid tissue biopsy or a liquid biopsy.

144. The method according to claim 143, wherein the patient sample is a liquid biopsy containing blood, plasma, cerebrospinal fluid, sputum, stool, urine, or saliva.

145. The method according to any one of claims 138 to 144, wherein the patient sample comprises cell-free DNA (cfDNA) obtained from the subject.

146. The method according to any one of claims 138 to 145, wherein the patient sample comprises circulating tumor DNA (ctDNA) obtained from the subject.

147. The method according to any one of claims 138 to 146, wherein the patient is a cancer patient.
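
For illustration only, the comparison recited in claims 115 and 138 to 140 might be sketched as follows. This is a minimal sketch and not the claimed implementation. It assumes that the centrality measure is the median, that the allele frequency distance is an absolute difference, and that a logistic curve with made-up parameters can stand in for the trained statistical model. The function names, the `slope` and `midpoint` values, and the 0.5 decision threshold are all hypothetical and do not come from the claims.

```python
from statistics import median
import math


def allele_fraction_distance(target_af, proxy_afs):
    """Absolute distance between the target allele fraction and the
    median allele fraction of the proxy sequences (assumed centrality measure)."""
    if not proxy_afs:
        raise ValueError("at least one proxy allele fraction is required")
    return abs(target_af - median(proxy_afs))


def germline_likelihood(distance, slope=40.0, midpoint=0.1):
    """Hypothetical stand-in for a trained statistical model.

    A logistic curve that decreases as the distance grows: a target that
    tracks its proxies closely scores near 1, a distant one scores near 0.
    The slope and midpoint are illustrative values, not fitted parameters.
    """
    return 1.0 / (1.0 + math.exp(slope * (distance - midpoint)))


def classify(target_af, proxy_afs, threshold=0.5):
    """Label the target as 'germline' or 'somatic' from the model output."""
    distance = allele_fraction_distance(target_af, proxy_afs)
    if germline_likelihood(distance) >= threshold:
        return "germline"
    return "somatic"
```

In this sketch, a target variant observed at an allele fraction of 0.48 alongside proxy SNPs at 0.47, 0.50, and 0.52 would be labeled germline, while a target at 0.12 with the same proxies would be labeled somatic. Restricting the proxies to the same copy-number segment as the target, as in claims 119 and 139, would be the caller's responsibility here.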