Methods and systems for mass-spectrometry-based detection of non-canonical peptides
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BIONTECH SE
- Filing Date
- 2025-10-31
- Publication Date
- 2026-06-18
AI Technical Summary
Existing mass-spectrometry data analysis techniques are inadequate for accurately identifying non-canonical proteins and peptides, leading to false positives and unsuitable detection of the 'dark' proteome, which is crucial for diagnosing and treating diseases like cancer and infectious diseases.
A method using machine-learning classifiers and unique input feature design to enhance mass-spectrometry data analysis, eliminating false positives by determining quality features and prediction values for candidate peptides, and selecting a subset of detected peptides for inclusion in a final set.
Facilitates accurate identification of non-canonical peptides, enabling targeted immunotherapies for cancer and infectious diseases by improving the detection of non-canonical proteins in biological samples.
Description
Attorney Docket No.: 2013237-1511
METHODS AND SYSTEMS FOR MASS-SPECTROMETRY-BASED DETECTION OF NON-CANONICAL PEPTIDES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and benefit of U.S. Provisional Patent Application No. 63/715,436 filed November 1, 2024, the disclosure of which is incorporated by reference herein in its entirety.
SEQUENCE LISTING
[0002] The instant application contains a Sequence Listing, which has been submitted electronically in XML format and is hereby incorporated by reference in its entirety. The XML copy, created October 31, 2025, is named 2013237-1511.xml, and is 29,392 bytes in size.
BACKGROUND
[0003] To first order, proteins are produced via transcription and translation of protein coding genes, each into a single, canonical protein. However, a variety of mechanisms may give rise to non-canonical proteins, which, in turn, are increasingly understood to account for a large, but “dark”, fraction of the proteome. For example, non-canonical proteins may be encoded by regions in DNA that were not previously considered to have the ability to express or, at the translation level, may be produced via alternative translation and splicing events. This dark proteome may be relevant to diagnosis and treatment of diseases. Accordingly, improved technologies for accurately identifying non-canonical proteins and peptides are needed.
SUMMARY
[0004] Presented herein are technologies for identifying non-canonical polypeptide targets within a sample via mass-spectrometry. Among other things, non-canonical target detection technologies leverage machine-learning classifiers along with unique input feature design to eliminate and / or mitigate false positive identifications that can plague attempts to detect small quantities of non-canonical polypeptide targets in biological samples. Accordingly, methods and systems of the present disclosure address significant shortcomings of previous mass-spectrometry data analysis techniques that rendered them unsuitable for detecting targets from the “dark” proteome. In doing so, technologies of the present disclosure facilitate target identification for immunotherapies, including treatments for cancer and / or infectious diseases.
[0005] In some aspects, the present disclosure provides methods for detecting non-canonical peptides within a biological sample via mass-spectrometry, said provided methods including:
(a) obtaining, by a processor of a computing device, mass spectrometry data for the biological sample, said mass spectrometry data including one or more sample spectrum (spectra);
(b) identifying, by the processor, based on the mass spectrometry data, a plurality of candidate peptides, each candidate peptide having a corresponding candidate mass spectrum determined, by the processor, to match at least a portion of the one or more sample spectrum (spectra) of the mass spectrometry data, and wherein at least a portion of the plurality of candidate peptides are non-canonical peptides;
(c) determining, by the processor, for each candidate peptide, values for one or more quality features, wherein, for a particular candidate peptide, the one or more quality features measure a quality [e.g., accuracy, explanatory power (e.g., percentage accounted for), likelihood of being correct] with which the candidate mass spectrum corresponding to the particular candidate peptide matches the portion of the one or more sample spectrum (spectra);
(d) determining, by the processor, for each of the plurality of candidate peptides, a corresponding prediction value using a machine learning model, wherein, for a particular candidate peptide, the corresponding prediction value (i) measures a predicted likelihood of, and / or (ii) classifies, the particular candidate peptide being present in the biological sample as determined by the machine learning model based on a set of input feature values including (i) the values for the one or more quality features determined for the particular candidate peptide and (ii) a value of a peptide source feature that indicates whether the particular candidate peptide is a non-canonical peptide;
(e) selecting, by the processor, a subset of the plurality of candidate peptides for inclusion in a final set of detected peptides, based on their corresponding prediction values; and
(f) storing and / or providing (e.g., for display and / or further processing), by the processor, the final set of detected peptides.
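For illustration only, steps (c) through (e) above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Candidate` class, the logistic scoring function, the weights, and the 0.5 threshold are all hypothetical stand-ins for a trained machine learning model and its decision rule.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sequence: str
    quality: List[float]   # step (c): quality feature values (hypothetical features)
    non_canonical: bool    # peptide source feature


def prediction_value(c: Candidate, weights: List[float],
                     source_weight: float, bias: float) -> float:
    """Step (d): a logistic function standing in for the trained model (assumption)."""
    z = bias + sum(w * q for w, q in zip(weights, c.quality))
    z += source_weight * (1.0 if c.non_canonical else 0.0)
    return 1.0 / (1.0 + math.exp(-z))


def select_detected(cands: List[Candidate], weights: List[float],
                    source_weight: float, bias: float,
                    threshold: float = 0.5) -> List[Candidate]:
    """Step (e): keep candidates whose prediction value meets the threshold."""
    return [c for c in cands
            if prediction_value(c, weights, source_weight, bias) >= threshold]
```

With a negative `source_weight`, a non-canonical candidate needs stronger quality features than a canonical one to reach the same prediction value, which is one way a source-aware model could suppress false positives from a large non-canonical search space.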
[0006] In some embodiments, a biological sample is a cell sample (e.g., a solution of cells; e.g., a dissociated cell sample).
[0007] In some embodiments, a biological sample is a tissue sample (e.g., a tissue biopsy).
[0008] In some embodiments, a biological sample includes cancer cells.
[0009] In some embodiments, a biological sample is an organoid sample.
[0010] In some embodiments, a biological sample is a sample obtained from a subject (e.g., a human patient) having been diagnosed with cancer.
[0011] In some embodiments, a biological sample includes cells infected with an infectious agent (e.g., virus, bacterium, parasite, etc.).
[0012] In some embodiments, a biological sample is a sample obtained from a subject (e.g., a human patient) having been infected with an infectious agent (e.g., virus, bacterium, parasite, etc.).
[0013] In some embodiments, mass spectrometry data is or has been obtained using a purified version of a biological sample obtained following one or more sample preparation steps.
[0014] In some embodiments, one or more sample preparation steps include isolation of MHC-bound peptides from the biological sample.
[0015] In some embodiments, one or more sample preparation steps include a protease digestion step (e.g., trypsin digestion).
[0016] In some embodiments, mass spectrometry data is tandem mass spectrometry data.
[0017] In some embodiments, mass spectrometry data includes a plurality of sample spectra, each sample spectrum generated via an MS/MS scan associated with a particular selected precursor ion of a particular survey scan.
[0018] In some embodiments, provided methods include generating the mass spectrometry data using a tandem mass spectrometer.
[0019] In some embodiments, step (b) includes identifying, by the processor, for each of at least a portion of the one or more sample spectrum (spectra), a matching candidate peptide.
[0020] In some embodiments, provided methods include, for a given sample spectrum: selecting, by the processor, a plurality of prospective candidate peptides from one or more target databases; determining, for each of the plurality of prospective candidate peptides, a corresponding candidate mass spectrum; determining, by the processor, for each of the plurality of prospective candidate peptides, one or more corresponding spectral similarity scores, wherein, for a particular prospective candidate peptide, the one or more corresponding spectral similarity scores are determined based on the corresponding candidate mass spectrum and the given sample spectrum; and selecting, by the processor, one or more of the prospective candidate peptides as matching candidate peptides.
[0021] In some embodiments, one or more corresponding spectral similarity scores include a cross-correlation score determined based on a cross-correlation between (i) the prospective candidate peptide’s corresponding mass spectrum and (ii) the given sample spectrum.
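As a hedged illustration of a cross-correlation-based spectral similarity score, the sketch below bins two peak lists onto a common m/z grid and subtracts the average shifted dot product from the unshifted one. The binning scheme, the bin width, the offset window of 75 bins, and all function names are assumptions chosen for the example; the disclosure does not specify how the cross-correlation score is computed.

```python
def bin_spectrum(peaks, n_bins, bin_width=1.0):
    """Place (m/z, intensity) peaks into fixed-width bins, keeping the max per bin."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        i = int(mz / bin_width)
        if 0 <= i < n_bins:
            bins[i] = max(bins[i], intensity)
    return bins


def shifted_dot(a, b, offset):
    """Dot product of a with b shifted by `offset` bins."""
    total = 0.0
    for i, x in enumerate(a):
        j = i + offset
        if 0 <= j < len(b):
            total += x * b[j]
    return total


def cross_correlation_score(candidate_peaks, sample_peaks,
                            n_bins=2000, max_offset=75):
    """Unshifted correlation minus the mean correlation over nonzero offsets."""
    cand = bin_spectrum(candidate_peaks, n_bins)
    samp = bin_spectrum(sample_peaks, n_bins)
    at_zero = shifted_dot(cand, samp, 0)
    offsets = [o for o in range(-max_offset, max_offset + 1) if o != 0]
    background = sum(shifted_dot(cand, samp, o) for o in offsets) / len(offsets)
    return at_zero - background
```

Subtracting the background term penalizes candidates whose peaks align with the sample spectrum only by chance at arbitrary shifts, so a true match scores well above a spectrum with unrelated peaks.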
[0022] In some embodiments, one or more target databases include one or more sequence database(s).
[0023] In some embodiments, one or more target databases include a canonical human proteome sequence.
[0024] In some embodiments, one or more target databases include a non-canonical proteome database including one or more polypeptide sequences of one or more non-canonical proteins.
[0025] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences resulting from an alternative genetic event.
[0026] In some embodiments, an alternative genetic event includes: transcription from a novel and / or unannotated open reading frame; transcription from a pseudogene; an insertion and / or deletion mutation; a frameshift mutation; a transposable element insertion or deletion; and / or insertion of a retroviral element.
[0027] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences translated from a polyribonucleotide sequence resulting from an alternative transcriptional event.
[0028] In some embodiments, an alternative transcriptional event includes: transcription initiated at an alternative start site; and / or transcription terminated at an alternative termination site.
[0029] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences translated from a polyribonucleotide sequence resulting from an alternative post-transcriptional event.
[0030] In some embodiments, an alternative post-transcriptional event includes: a post-transcriptional mutation of a polyribonucleotide sequence; a truncation of a polyribonucleotide sequence; and / or alternative splicing.
[0031] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences translated from an alternative polyribonucleotide sequence.
[0032] In some embodiments, an alternative polyribonucleotide sequence includes: a long non-coding RNA sequence, a junction of an exon and a transposable element (JET), a transposable element, or a circular RNA.
[0033] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences resulting from an alternative translation event.
[0034] In some embodiments, an alternative translation event includes: translation from an internal ribosome entry site; incorrect incorporation of one or more amino acids into a polypeptide chain (e.g., Aberrant Translation products); and / or premature termination of a polypeptide chain (e.g., microproteins).
[0035] In some embodiments, a non-canonical proteome database includes one or more polypeptide sequences resulting from an alternative post-translational event.
[0036] In some embodiments, an alternative post-translational event includes: glycosylation; phosphorylation; SUMOylation; methylation; acylation; and / or truncation and / or cleavage.
[0037] In some embodiments, a non-canonical proteome database includes polypeptide sequences produced via alternative splicing events.
[0038] In some embodiments, a non-canonical proteome database includes polypeptide sequences produced from undiscovered and / or unannotated open reading frames.
[0039] In some embodiments, a non-canonical proteome database includes polypeptide sequences generated based on one or more post-translational modifications.
[0040] In some embodiments, a non-canonical proteome database includes endogenous retrovirus (ERV)-derived polypeptide sequences.
[0041] In some embodiments, a non-canonical proteome database includes polypeptide sequences determined based on somatic mutations in genomic sequence data.
[0042] In some embodiments, a non-canonical proteome database includes polypeptide sequences of one or more infectious agents (e.g., viral polypeptides, bacterial polypeptides, parasite polypeptides, etc.).
[0043] In some embodiments, step (b) includes, for at least a portion of the candidate peptides, generating, as the corresponding candidate mass spectrum, a predicted mass spectrum based on a sequence of the candidate peptide.
[0044] In some embodiments, for a particular candidate peptide, at least a portion of the one or more quality features measure a likelihood that the particular candidate peptide produced the matching portion of the one or more sample spectrum (spectra) (e.g., based on the corresponding candidate mass spectrum of the particular candidate peptide).
[0045] In some embodiments, for a particular candidate peptide, at least a portion of the one or more quality features measure (e.g., quantify) a similarity between (i) the candidate mass spectrum corresponding to the particular candidate peptide and (ii) a particular, matching, sample spectrum of the one or more sample spectra and / or a portion thereof.
[0046] In some embodiments, quality features include one or more scores and / or features listed in Tables 1 and 2.
[0047] In some embodiments, a peptide source feature is a binary feature identifying the particular candidate peptide as having been selected from (i) a canonical (e.g., human proteome) sequence database or (ii) one or more non-canonical polypeptide sequence databases.
[0048] In some embodiments, a peptide source feature has a value selected from three or more possible values (e.g., as in an enumerated data type) identifying the particular candidate peptide as having been selected from (i) a canonical (e.g., human proteome) sequence database or (ii) one of two or more non-canonical polypeptide sequence databases.
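The binary and enumerated forms of the peptide source feature described in the two preceding paragraphs can be illustrated as follows. This is a sketch under assumptions: the database names in `PeptideSource`, the one-hot encoding of the enumerated value, and the helper `build_input_features` are hypothetical and are not taken from the disclosure.

```python
from enum import Enum


class PeptideSource(Enum):
    """Hypothetical source databases; the non-canonical entries are examples only."""
    CANONICAL = 0
    ERV = 1
    LNCRNA = 2


def binary_source_feature(source):
    """Binary form: 0 for the canonical database, 1 for any non-canonical database."""
    return 0 if source is PeptideSource.CANONICAL else 1


def one_hot_source_feature(source):
    """Enumerated form, encoded one-hot (an assumed encoding) over all sources."""
    return [1 if source is member else 0 for member in PeptideSource]


def build_input_features(quality_values, source, enumerated=False):
    """Concatenate quality feature values with the peptide source feature."""
    if enumerated:
        src = one_hot_source_feature(source)
    else:
        src = [binary_source_feature(source)]
    return list(quality_values) + src
```

For example, `build_input_features([0.8, 12.5], PeptideSource.ERV)` appends a single flag to the quality values, while passing `enumerated=True` appends one indicator per source database, which lets a model treat each non-canonical database differently.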
[0049] In some embodiments, a machine learning model is a support vector machine (SVM).
[0050] In some embodiments, a machine learning model is an artificial neural network (ANN).
[0051] In some embodiments, a machine learning model is or has been trained using a supervised training method.
[0052] In some embodiments, a machine learning model is or has been trained using a semi-supervised training method.
[0053] In some embodiments, provided methods include using the final set of detected peptides in creation of a pharmaceutical composition (e.g., an immunogenic composition, e.g., a vaccine composition).
[0054] In some embodiments, a pharmaceutical composition includes one or more polynucleotide(s) encoding at least a portion of the peptides of the final set.
[0055] In some aspects, the present disclosure provides systems including: a processor of a computing device; and memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to perform provided methods as described herein (e.g., in paragraphs above).
[0056] In some embodiments, provided systems include one or more target sequence databases, each sequence database including polypeptide sequences for detection / identification within the biological sample.
[0057] In some embodiments, one or more target sequence databases include a non-canonical polypeptide sequence database (e.g., a non-canonical polypeptide sequence as described herein, e.g., in paragraphs above, such as, for example, paragraphs [0023] to [0041]).
[0058] In some embodiments, provided methods include a tandem mass spectrometer.
[0059] In some aspects, the present disclosure provides methods for selecting peptides for inclusion (e.g., as neoepitopes) in a pharmaceutical composition for prevention and / or treatment of cancer (e.g., a personalized cancer vaccine), said provided methods including: (a) obtaining, by a processor of a computing device, mass spectrometry data for a sample obtained from a subject having cancer and / or at risk for cancer; (b) selecting, by the processor, a plurality of candidate peptides for detection in the sample, said plurality of candidate peptides selected from one or more target peptide databases; (c) determining, by the processor, using a machine learning model, one or more of the plurality of candidate peptides to be present in the sample based at least in part on the mass spectrometry data and including, by the processor, the one or more candidate peptides determined to be present in the sample in a set of detected peptides; and (d) selecting one or more peptides from the set of detected peptides for inclusion in the pharmaceutical composition.
[0060] In some embodiments, one or more target peptide databases include a mutant database including sequences of polypeptides determined to harbor mutations associated with the cancer.
[0061] In some embodiments, at step (c), for a given candidate peptide the machine learning model receives, as input: (i) values for a set of quality features determined for the given candidate peptide based on (A) one or more sample spectrum (spectra) of the mass spectrometry data and (B) a corresponding candidate mass spectrum determined for the given candidate peptide; and (ii) a peptide source feature that identifies the given candidate peptide as selected from the mutant database or a canonical human proteome sequence database.
[0062] In some embodiments, at step (c), for a given candidate peptide, a machine learning model generates, as output, a prediction value measuring a likelihood, as determined by the machine learning model, that the given candidate peptide is present in the sample.
[0063] In some embodiments, provided methods include determining polynucleotide sequences encoding the one or more peptides selected for inclusion in the pharmaceutical composition.
[0064] In some embodiments, step (d) includes determining, by the processor, an immunogenic presentation score for each member of the set of detected peptides and selecting the one or more peptides for inclusion in the pharmaceutical composition based at least in part on the determined immunogenic presentation scores.
[0065] In some aspects, the present disclosure provides pharmaceutical compositions including one or more polypeptides corresponding to non-canonical peptides having been detected in a biological sample based on mass spectrometry data and a prediction value determined using a machine learning-based classifier as described herein (e.g., in paragraphs above).
[0066] In some aspects, the present disclosure provides pharmaceutical compositions including one or more nucleotides (e.g., ribonucleic acid) encoding one or more non-canonical peptides having been detected in a biological sample based on mass spectrometry data and a prediction value determined using a machine learning-based classifier as described herein (e.g., in paragraphs above).
[0067] Features of embodiments described with respect to one aspect of the invention may be applied with respect to another aspect of the invention.
BRIEF DESCRIPTION OF THE DRAWING
[0068] The foregoing and other objects, aspects, features, and advantages of the present disclosure will become more apparent and better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:
[0069] FIG. 1A is a schematic illustrating an approach for generating mass-spectrometry data, according to an illustrative embodiment.
[0070] FIG. 1B is a schematic illustrating an approach for generating tandem mass-spectrometry (MS/MS) data, according to an illustrative embodiment.
[0071] FIG. 2A is a block-flow diagram illustrating a process for identifying and scoring candidate peptides, according to an illustrative embodiment.
[0072] FIG. 2B is a schematic illustrating certain ions produced from peptides and appearing in mass spectrometry data, according to an illustrative embodiment.
[0073] FIG. 2C is a schematic showing an illustrative observed sample mass spectrum compared with an illustration of a theoretically predicted mass spectrum for a candidate peptide.
[0074] FIG. 2D is a schematic illustrating scoring peptide-spectrum matches (PSMs) along with decoy peptides for determining score cutoff for achieving a desired, estimated, false discovery rate (FDR), according to an illustrative embodiment.
[0075] FIG. 2E is a schematic illustrating scoring peptide-spectrum matches (PSMs) along with decoy peptides for determining score cutoff for achieving a desired, estimated, false discovery rate (FDR), according to an illustrative embodiment.
[0076] FIG. 3 is a block-flow diagram illustrating a process for detecting non-canonical targets within a sample, according to an illustrative embodiment.
[0077] FIG. 4 is a block-flow diagram illustrating an iterative process for training a semi-supervised machine-learning classifier using mass-spectrometry data, according to an illustrative embodiment.
[0078] FIG. 5A is a schematic illustrating sources of tumor enriched dark proteome targets, according to an illustrative embodiment.
[0079] FIG. 5B is a block flow diagram of an example process for identifying and scoring candidate epitopes for cancer immunotherapy-based treatment, according to an illustrative embodiment.
[0080] FIG. 5C is a block flow diagram of an example process for identifying and scoring candidate epitopes for infectious disease vaccine development, according to an illustrative embodiment.
[0081] FIG. 6 is a block diagram of an exemplary cloud computing environment, used in certain embodiments.
[0082] FIG. 7 is a block diagram of an example computing device and an example mobile computing device used in certain embodiments.
[0083] FIG. 8 is a graph showing predicted HLA-motif enrichment for canonical peptides, decoy peptides, and endogenous retroviral (ERV)-derived peptides, according to an illustrative embodiment.
[0084] FIG. 9 is a schematic and a bar plot showing numbers of canonical and ERV-derived peptides detected in a biological sample using a search and score approach with a single FDR threshold, according to an illustrative embodiment.
[0085] FIG. 10 is a schematic and a bar plot showing numbers of canonical and peanut peptides detected in a biological sample using a search and score approach with a single FDR threshold, according to an illustrative embodiment.
[0086] FIG. 11A is an illustrative schematic showing impact of search space heterogeneity on FDR rates when a single cutoff threshold is determined, according to an illustrative embodiment.
[0087] FIG. 11B is an illustrative schematic showing impact of search space heterogeneity on FDR rates when a single cutoff threshold is determined, according to an illustrative embodiment.
[0088] FIG. 11C is an illustrative schematic showing impact of search space heterogeneity on FDR rates when a single cutoff threshold is determined, according to an illustrative embodiment.
[0089] FIG. 12 is a schematic illustrating use of a machine learning model to classify candidate peptides, according to an illustrative embodiment.
[0090] FIG. 13 is a bar graph comparing number of peanut peptides detected via a standard search and score-based threshold approach with a machine learning model-based classifier approach using a peptide source feature.
[0091] FIG. 14 is a bar graph comparing number of long noncoding RNA (lncRNA) peptides detected via a standard search and score-based threshold approach with a machine learning model-based classifier approach using a peptide source feature.
[0092] FIG. 15 is a set of box and whisker plots for different types of peptides and detection methods.
[0093] FIG. 16A is an MS2 fragmentation plot of an endogenous MVAEPPRV peptide. The x-axis depicts the retention time of the peptides in minutes, and the y-axis depicts the fragment intensity.
[0094] FIG. 16B is an MS2 fragmentation plot of a synthetic MVAEPPRV peptide. The x-axis depicts the retention time of the peptides in minutes, and the y-axis depicts the fragment intensity.
[0095] FIG. 17 is a head-to-toe plot comparing fragment peaks of endogenous (top) and synthetic (bottom) MVAEPPRV peptides, confirming the presence of (e.g., HLA binding of) the endogenous peptide.
[0096] The features and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and / or structurally similar elements.
CERTAIN DEFINITIONS
[0097] About: The term “about”, when used herein in reference to a value, refers to a value that is similar, in context, to the referenced value. In general, those skilled in the art, familiar with the context, will appreciate the relevant degree of variance encompassed by “about” in that context. For example, in some embodiments, the term “about” may encompass a range of values that are within 25%, 20%, 19%, 18%, 17%, 16%, 15%, 14%, 13%, 12%, 11%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, or less of the referred value.
[0098] Agent: As used herein, the term “agent” may refer to a physical entity. In some embodiments, an agent may be characterized by a particular feature and / or effect. For example, as used herein, the term “therapeutic agent” refers to a physical entity that has a therapeutic effect and / or elicits a desired biological and / or pharmacological effect. In some embodiments, an agent may be a compound, molecule, or entity of any chemical class including, for example, a small molecule, polypeptide, nucleic acid, saccharide, lipid, metal, or a combination or complex thereof. In some embodiments, part or all of an agent may be depicted herein as a chemical structure, or may be described using chemical nomenclature and / or with reference to general principles of organic chemistry, e.g., in accordance with the Periodic Table of Elements, CAS version, Handbook of Chemistry and Physics, 75th Ed; “Organic Chemistry”, Thomas Sorrell, University Science Books, Sausalito: 1999, and / or “March’s Advanced Organic Chemistry”, 5th Ed., Ed.: Smith, M.B. and March, J., John Wiley & Sons, New York: 2001, the entire contents of which are hereby incorporated by reference. Unless otherwise stated or clear from context, chemical structures depicted herein may be considered to reference or include one or more, or all, stereoisomeric (e.g., enantiomeric or diastereomeric) forms of the structure, and / or one or more, or all, geometric or conformational isomeric forms of the structure. For example, unless otherwise indicated or clear, both R and S configurations of a stereocenter may be contemplated in embodiments of the disclosure.
In some embodiments, a compound may be described and / or utilized as a particular single stereochemical isomer; alternatively or additionally, in some embodiments, such a compound may be described and / or utilized as a combination (e.g., a mixture) of one or more enantiomeric (e.g., diastereomeric) forms (e.g., as a racemic preparation). Analogously, in some embodiments, a single geometric isomer may be described and / or utilized; in some embodiments, a combination (e.g., a mixture) of geometric (or conformational) isomers may be described and / or utilized. Unless otherwise stated or clear from context, all tautomeric forms of provided compounds are within the scope of the disclosure. Still further, unless otherwise indicated or clear from context, in some embodiments, a particular chemical compound (e.g., as may be represented by a depicted chemical structure) may be described and / or utilized in an alternative isotopic form - i.e., in a form in which one or more atoms is isotopically altered (e.g., so that a hydrogen is replaced by deuterium or tritium, and / or a carbon is replaced by 13C- or 14C-). Thus, in some embodiments, a particular compound may be described and / or utilized as or in an isotopically enriched preparation.
[0099] Amino acid: In its broadest sense, as used herein, the term “amino acid” refers to a compound and / or substance that can be, is, or has been incorporated into a polypeptide chain, e.g., through formation of one or more peptide bonds. In some embodiments, an amino acid has the general structure H2N-C(H)(R)-COOH. In some embodiments, an amino acid is a naturally-occurring amino acid. In some embodiments, an amino acid is a non-natural amino acid; in some embodiments, an amino acid is a D-amino acid; in some embodiments, an amino acid is an L-amino acid. “Standard amino acid” refers to any of the twenty standard L-amino acids commonly found in naturally occurring peptides. “Nonstandard amino acid” refers to any amino acid, other than the standard amino acids, regardless of whether it is prepared synthetically or obtained from a natural source. In some embodiments, an amino acid, including a carboxy- and / or amino-terminal amino acid in a polypeptide, can contain a structural modification as compared with the general structure above. For example, in some embodiments, an amino acid may be modified by methylation, amidation, acetylation, pegylation, glycosylation, phosphorylation, and / or substitution (e.g., of the amino group, the carboxylic acid group, one or more protons, and / or the hydroxyl group) as compared with the general structure. In some embodiments, such modification may, for example, alter the circulating half-life of a polypeptide containing the modified amino acid as compared with one containing an otherwise identical unmodified amino acid. In some embodiments, such modification does not significantly alter a relevant activity of a polypeptide containing the modified amino acid, as compared with one containing an otherwise identical unmodified amino acid.
As will be clear from context, in some embodiments, the term “amino acid” may be used to refer to a free amino acid; in some embodiments it may be used to refer to an amino acid residue of a polypeptide.
[0100] Antigen. The term “antigen”, as used herein, refers to (i) an agent that elicits an immune response; and / or (ii) an agent that binds to a T cell receptor (e.g., when presented by an MHC molecule) or to an antibody. In some embodiments, an antigen elicits a humoral response (e.g., including production of antigen-specific antibodies); in some embodiments, an antigen elicits a cellular response (e.g., involving T cells whose receptors specifically interact with the antigen). In some embodiments, an antigen binds to an antibody. In some embodiments, an antigen may or may not induce a particular physiological response in an organism. In general, an antigen may be or include any chemical entity such as, for example, a small molecule, a nucleic acid, a polypeptide, a carbohydrate, a lipid, a polymer (in some embodiments other than a biologic polymer [e.g., other than a nucleic acid or amino acid polymer]), etc. In some embodiments, an antigen is or comprises a polypeptide. In some embodiments, an antigen is or comprises a glycan. Those of ordinary skill in the art will appreciate that, in general, an antigen may be provided in isolated or pure form, or alternatively may be provided in crude form (e.g., together with other materials, for example in an extract such as a cellular extract or other relatively crude preparation of an antigen-containing source). In some embodiments, antigens utilized in accordance with the present invention are provided in a crude form. In some embodiments, an antigen is a recombinant antigen.
[0101] Cancer. The term “cancer” is used herein to generally refer to a disease or condition in which cells of a tissue of interest exhibit relatively abnormal, uncontrolled, and / or autonomous growth, so that they exhibit an aberrant growth phenotype characterized by a significant loss of control of cell proliferation. In some embodiments, cancer may comprise cells that are precancerous (e.g., benign), malignant, pre-metastatic, metastatic, and / or non-metastatic. In some embodiments, cancer may be characterized by a solid tumor. In some embodiments, cancer may be characterized by a hematologic tumor. In general, examples of different types of cancers known in the art include, for example, triple negative breast cancer (TNBC), hematopoietic cancers including leukemias, lymphomas (Hodgkin’s and non-Hodgkin’s), myelomas and myeloproliferative disorders, sarcomas, melanomas, adenomas, carcinomas of solid tissue, squamous cell carcinomas of the mouth, throat, larynx, and lung, liver cancer, genitourinary cancers such as prostate, cervical, bladder, uterine, and endometrial cancer and renal cell carcinomas, bone cancer, pancreatic cancer, skin cancer, cutaneous or intraocular melanoma, cancer of the endocrine system, cancer of the thyroid gland, cancer of the parathyroid gland, head and neck cancers, ovarian cancer, breast cancer, glioblastomas, colorectal cancer, gastro-intestinal cancers and nervous system cancers, benign lesions such as papillomas, and the like.
[0102] Corresponding to. As used herein, the term “corresponding to” refers to a relationship between two or more entities. For example, the term “corresponding to” may be used to designate the position / identity of a structural element in a compound or composition relative to another compound or composition (e.g., to an appropriate reference compound or composition). For example, in some embodiments, a monomeric residue in a polymer (e.g., an amino acid residue in a polypeptide or a nucleic acid residue in a polynucleotide) may be identified as “corresponding to” a residue in an appropriate reference polymer. For example, those of ordinary skill will appreciate that, for purposes of simplicity, residues in a polypeptide are often designated using a canonical numbering system based on a reference related polypeptide, so that an amino acid “corresponding to” a residue at position 190, for example, need not actually be the 190th amino acid in a particular amino acid chain but rather corresponds to the residue found at 190 in the reference polypeptide; those of ordinary skill in the art readily appreciate how to identify “corresponding” amino acids. For example, those skilled in the art will be aware of various sequence alignment strategies, including software programs such as, for example, BLAST, CS-BLAST, CUDASW++, DIAMOND, FASTA, GGSEARCH / GLSEARCH, Genoogle, HMMER, HHpred / HHsearch, IDF, Infernal, KLAST, USEARCH, parasail, PSI-BLAST, PSI-Search, ScalaBLAST, Sequilab, SAM, SSEARCH, SWAPHI, SWAPHI-LS, SWIMM, or SWIPE that can be utilized, for example, to identify “corresponding” residues in polypeptides and / or nucleic acids in accordance with the present disclosure.
Those of skill in the art will also appreciate that, in some instances, the term “corresponding to” may be used to describe an event or entity that shares a relevant similarity with another event or entity (e.g., an appropriate reference event or entity). To give but one example, a gene or protein in one organism may be described as “corresponding to” a gene or protein from another organism in order to indicate, in some embodiments, that it plays an analogous role or performs an analogous function and / or that it shows a particular degree of sequence identity or homology, or shares a particular characteristic sequence element.
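As an illustration of how a “corresponding” residue can be read off a pairwise alignment, the following is a minimal sketch. It assumes the two sequences have already been aligned (e.g., by one of the tools listed above) and are supplied as equal-length strings with "-" marking gaps; the function name and the example sequences are hypothetical and are not taken from the disclosure.

```python
from typing import Optional


def corresponding_position(aligned_ref: str, aligned_query: str,
                           ref_pos: int, gap: str = "-") -> Optional[int]:
    """Map a 1-based residue position in the reference to the query.

    Returns the 1-based position of the query residue aligned to the
    reference residue at ref_pos, or None if the reference residue is
    aligned to a gap in the query.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    ref_index = 0    # residues of the reference seen so far
    query_index = 0  # residues of the query seen so far
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char != gap:
            ref_index += 1
        if query_char != gap:
            query_index += 1
        if ref_char != gap and ref_index == ref_pos:
            return query_index if query_char != gap else None
    raise ValueError("ref_pos is outside the reference sequence")


# Hypothetical alignment: the query lacks the reference's second residue
# and carries one inserted residue.
reference = "MKT-AYI"
query = "M-TGAYI"
third = corresponding_position(reference, query, 3)   # query residue 2
second = corresponding_position(reference, query, 2)  # None (gap in query)
```

In this sketch the residue “corresponding to” reference position 3 is the second residue of the query, which mirrors the point made above that a corresponding residue need not occupy the same ordinal position in its own chain.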
[0103] Encode. As used herein, the term “encode” or “encoding” refers to sequence information of a first molecule that guides production of a second molecule having a defined sequence of nucleotides (e.g., a polyribonucleotide) or a defined sequence of amino acids. For example, a DNA molecule can encode an RNA molecule (e.g., by a transcription process that includes a DNA-dependent RNA polymerase enzyme). An RNA molecule can encode a polypeptide (e.g., by a translation process). A gene or a polynucleotide encodes a polypeptide if transcription and translation of an RNA corresponding to that gene or polynucleotide produces the polypeptide in a cell or other biological system. In some embodiments, a coding region of a polyribonucleotide encoding a target antigen refers to a coding strand, the nucleotide sequence of which is identical to the polyribonucleotide sequence of such a target antigen. In some embodiments, a coding region of a polyribonucleotide encoding a target antigen refers to a noncoding strand of such a target antigen, which may be used as a template for transcription of a gene or cDNA.
[0104] Epitope. As used herein, the term “epitope” refers to a moiety that is specifically recognized by an immune system (e.g., an immune system component) of a subject. For example, in some embodiments, an epitope may be a moiety that is specifically recognized by a T cell, a B cell, an immunoglobulin (e.g., antibody or receptor), binding component or an aptamer. In some embodiments, an epitope is comprised of a plurality of chemical atoms or groups on an antigen. In some embodiments, such chemical atoms or groups are surface-exposed when the antigen adopts a relevant three-dimensional conformation. In some embodiments, such chemical atoms or groups are physically near to each other in space when the antigen adopts such a conformation. In some embodiments, at least some such chemical atoms or groups are physically separated from one another when the antigen adopts an alternative conformation (e.g., is linearized).
[0105] Expression. As used herein, the term “expression” of a nucleic acid sequence refers to the generation of a gene product from the nucleic acid sequence. In some embodiments, a gene product can be a transcript, e.g., a polyribonucleotide as provided herein. In some embodiments, a gene product can be a polypeptide. In some embodiments, expression of a nucleic acid sequence involves one or more of the following: (1) production of an RNA template from a DNA sequence (e.g., by transcription); (2) processing of an RNA transcript (e.g., by splicing, editing, etc.); (3) translation of an RNA into a polypeptide or protein; and / or (4) post-translational modification of a polypeptide or protein.
[0106] Identity. As used herein, the term “identity” refers to the overall relatedness between polynucleotide molecules (e.g., DNA molecules and / or RNA molecules) and / or between polypeptide molecules. In some embodiments, polynucleotide molecules (e.g., DNA molecules and / or RNA molecules) and / or polypeptide molecules are considered to be “substantially identical” to one another if their sequences are at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identical. Calculation of the percent identity of two nucleic acid or polypeptide sequences, for example, can be performed by aligning the two sequences for optimal comparison purposes (e.g., gaps can be introduced in one or both of a first and a second sequence for optimal alignment and non-identical sequences can be disregarded for comparison purposes). In certain embodiments, the length of a sequence aligned for comparison purposes is at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or substantially 100% of the length of a reference sequence. The nucleotides at corresponding positions are then compared. When a position in the first sequence is occupied by the same residue (e.g., nucleotide or amino acid) as the corresponding position in the second sequence, then the molecules are identical at that position. The percent identity between the two sequences is a function of the number of identical positions shared by the sequences, taking into account the number of gaps, and the length of each gap, which needs to be introduced for optimal alignment of the two sequences. The comparison of sequences and determination of percent identity between two sequences can be accomplished using a mathematical algorithm.
For example, the percent identity between two nucleotide sequences can be determined using the algorithm of Meyers and Miller, 1989, which has been incorporated into the ALIGN program (version 2.0). In some exemplary embodiments, nucleic acid sequence comparisons made with the ALIGN program use a PAM120 weight residue table, a gap length penalty of 12 and a gap penalty of 4. The percent identity between two nucleotide sequences can, alternatively, be determined using the GAP program in the GCG software package using an NWSgapdna.CMP matrix.
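The position-by-position comparison described above can be sketched as follows. This is a minimal illustration, not the ALIGN or GAP implementation: it assumes the alignment has already been computed and is supplied as two equal-length strings with "-" for gaps, and it adopts one common convention (identical positions divided by the number of alignment columns, so that gapped columns lower the score). Other denominators are in use, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns of a precomputed pairwise alignment.

    Columns in which both sequences carry a gap are ignored; columns with a
    gap in only one sequence count toward the length but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Hypothetical aligned nucleotide sequences: 4 identical columns out of 6.
identity = percent_identity("ACGT-A", "ACCTGA")
substantially_identical = identity >= 80.0  # one of the thresholds named above
```

Because the gapped column counts toward the denominator, the single gap in this example lowers the result in the same way a mismatch would, which is one way of "taking into account the number of gaps".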
[0107] In order. As used herein with reference to a polynucleotide or polyribonucleotide, “in order” refers to the order of features from 5' to 3' along the polynucleotide or polyribonucleotide. As used herein with reference to a polypeptide, “in order” refers to the order of features moving from the N-terminal-most of the features to the C-terminal-most of the features along the polypeptide. “In order” does not mean that no additional features can be present among the listed features. For example, if Features A, B, and C of a polynucleotide are described herein as being “in order, Feature A, Feature B, and Feature C,” this description does not exclude, e.g., Feature D being located between Features A and B.
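Read this way, “in order” amounts to a subsequence test: the listed features must appear in the stated order, but other features may be interspersed. A minimal sketch, assuming features are represented simply as labels listed from 5' to 3' (or N-terminus to C-terminus); the function name and labels are hypothetical.

```python
def is_in_order(listed, observed):
    """True if every listed feature occurs in `observed` in the given order.

    `observed` is the full series of features along the molecule; features
    not in `listed` may appear anywhere without affecting the result.
    """
    remaining = iter(observed)
    # `feature in remaining` advances the iterator past the first match,
    # so each listed feature must be found after the previous one.
    return all(feature in remaining for feature in listed)


with_extra_feature = is_in_order(["A", "B", "C"], ["A", "D", "B", "C"])
swapped = is_in_order(["A", "B", "C"], ["B", "A", "C"])
```

Here `with_extra_feature` is true even though Feature D sits between Features A and B, while `swapped` is false because Feature B precedes Feature A.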
[0108] Machine learning module, machine learning model: As used herein, the terms “machine learning module” and “machine learning model” are used interchangeably and refer to a computer implemented process (e.g., a software function) that implements one or more particular machine learning algorithms, such as artificial neural networks (ANNs), random forests, decision trees, support vector machines, and the like, in order to determine, for a given input, one or more output values. In certain embodiments, machine learning models are deep learning models or deep neural networks - for example, ANNs that comprise, in addition to an input layer and an output layer, one or more hidden layers (e.g., in between). In some embodiments, machine learning modules implementing machine learning techniques are trained in a supervised manner, for example using curated and / or manually annotated datasets. In certain embodiments, machine learning models may be trained in an unsupervised manner, using unlabeled data. In certain embodiments, a machine learning model may be trained via a reinforcement approach, for example wherein a reward / penalty system is used to train a machine learning model to learn strategies for accomplishing specified tasks. Training a machine learning model may be used to determine various parameters of a model, such as weights associated with layers in neural networks. In some embodiments, once a machine learning module is trained, e.g., to accomplish a specific task, such as classifying example peptides based on mass spectrometry data, values of determined parameters are fixed and the machine learning module is used to process new data (e.g., different from the training data), such as new candidate peptides.
The process of presenting a machine learning model with multiple examples, comparing its output to target values [e.g., pre-assigned target labels representing, for each example, a desired and / or known (e.g., ground truth) classification, state, etc. of the example], and updating parameters to progressively improve performance may be referred to as training, while the use of a (e.g., previously trained) machine learning model to generate predictions about new data, for which ground truth values may be unknown, may be referred to as inference. In some embodiments, machine learning modules may receive feedback, e.g., based on user review of accuracy, and such feedback may be used as additional training data, for example to dynamically update the machine learning module. In some embodiments, a trained machine learning module is a classification algorithm with adjustable and / or fixed (e.g., locked) parameters, e.g., a random forest classifier. In some embodiments, two or more machine learning modules may be combined and implemented as a single module and / or a single software application. In some embodiments, two or more machine learning modules may also be implemented separately, e.g., as separate software applications. A machine learning module may be software and / or hardware. For example, a machine learning module may be implemented entirely as software, or certain functions of an ANN module may be carried out via specialized hardware (e.g., via an application specific integrated circuit (ASIC), field programmable gate arrays (FPGAs), and the like).
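The distinction drawn above between training (parameters are determined from labeled examples) and inference (parameters are fixed and applied to new data) can be sketched with a deliberately tiny classifier. The single-threshold “decision stump” below is a stand-in chosen for brevity; it is not the classifier of the present disclosure, and the feature values and labels are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: parameters cannot change after training
class Stump:
    feature_index: int
    threshold: float

    def predict(self, features):
        """Inference: apply the fixed parameters to one feature vector."""
        return 1 if features[self.feature_index] >= self.threshold else 0


def train_stump(examples, labels):
    """Training: choose the feature/threshold pair with the fewest errors."""
    best = None
    best_errors = len(examples) + 1
    for j in range(len(examples[0])):
        for candidate in sorted({x[j] for x in examples}):
            errors = sum(
                (1 if x[j] >= candidate else 0) != y
                for x, y in zip(examples, labels)
            )
            if errors < best_errors:
                best_errors = errors
                best = Stump(feature_index=j, threshold=candidate)
    return best


# Invented training set: two features per example, binary ground-truth labels.
training_examples = [(0.2, 1.0), (0.4, 0.0), (0.7, 1.0), (0.9, 0.0)]
training_labels = [0, 0, 1, 1]
model = train_stump(training_examples, training_labels)

# Inference on new examples for which no ground truth is supplied.
prediction_high = model.predict((0.8, 0.0))
prediction_low = model.predict((0.3, 1.0))
```

A random forest or neural network replaces the exhaustive threshold search with its own fitting procedure, but the same two phases apply: parameters are fitted once against labeled data and then held fixed while new candidates are scored.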
[0109] Neoantigen. As used herein, the term “neoantigen” refers to an antigen that is not present in a reference, such as a normal non-cancerous or germline cell, but is present in a cancer cell. In some embodiments, a neoantigen includes one or more mutations relative to a corresponding antigen present in a normal non-cancerous or germline cell.
[0110] Neoantigen epitope. As used herein, the term “neoantigen epitope” refers to an epitope that is not present in a reference, such as a normal non-cancerous or germline cell, but is present in a cancer cell.
[0111] Nucleic acid / Polynucleotide. As used herein, the term “nucleic acid” refers to a polymer of at least 10 nucleotides. In some embodiments, a nucleic acid is or comprises DNA. In some embodiments, a nucleic acid is or comprises RNA. In some embodiments, a nucleic acid is or comprises peptide nucleic acid (PNA). In some embodiments, a nucleic acid is or comprises a single stranded nucleic acid. In some embodiments, a nucleic acid is or comprises a double-stranded nucleic acid. In some embodiments, a nucleic acid comprises both single and double-stranded portions. In some embodiments, a nucleic acid comprises a backbone that comprises one or more phosphodiester linkages. In some embodiments, a nucleic acid comprises a backbone that comprises both phosphodiester and non-phosphodiester linkages. For example, in some embodiments, a nucleic acid may comprise a backbone that comprises one or more phosphorothioate or 5'-N-phosphoramidite linkages and / or one or more peptide bonds, e.g., as in a “peptide nucleic acid”. In some embodiments, a nucleic acid comprises one or more, or all, natural residues (e.g., adenine, cytosine, deoxyadenosine, deoxycytidine, deoxyguanosine, deoxythymidine, guanine, thymine, uracil). In some embodiments, a nucleic acid comprises one or more, or all, non-natural residues. In some embodiments, a non-natural residue comprises a nucleoside analog (e.g., 2-aminoadenosine, 2-thiothymidine, inosine, pyrrolo-pyrimidine, 3-methyl adenosine, 5-methylcytidine, C-5 propynyl-cytidine, C-5 propynyl-uridine, 2-aminoadenosine, C5-bromouridine, C5-fluorouridine, C5-iodouridine, C5-propynyl-uridine, C5-propynyl-cytidine, C5-methylcytidine, 2-aminoadenosine, 7-deazaadenosine, 7-deazaguanosine, 8-oxoadenosine, 8-oxoguanosine, 6-O-methylguanine, 2-thiocytidine, methylated bases, intercalated bases, and combinations thereof).
In some embodiments, a non-natural residue comprises one or more modified sugars (e.g., 2'-fluororibose, ribose, 2'-deoxyribose, arabinose, and hexose) as compared to those in natural residues. In some embodiments, a nucleic acid has a nucleotide sequence that encodes a functional gene product such as an RNA or polypeptide. In some embodiments, a nucleic acid has a nucleotide sequence that comprises one or more introns. In some embodiments, a nucleic acid may be prepared by isolation from a natural source, enzymatic synthesis (e.g., by polymerization based on a complementary template, e.g., in vivo or in vitro), reproduction in a recombinant cell or system, or chemical synthesis. In some embodiments, a nucleic acid is at least 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10,000, 10,500, 11,000, 11,500, 12,000, 12,500, 13,000, 13,500, 14,000, 14,500, 15,000, 15,500, 16,000, 16,500, 17,000, 17,500, 18,000, 18,500, 19,000, 19,500, or 20,000 or more residues or nucleotides long.
[0112] Polypeptide. As used herein, the term “polypeptide” refers to a polymeric chain of amino acids. In some embodiments, a polypeptide has an amino acid sequence that occurs in nature. In some embodiments, a polypeptide has an amino acid sequence that does not occur in nature. In some embodiments, a polypeptide has an amino acid sequence that is engineered in that it is designed and / or produced through action of the hand of man. In some embodiments, a polypeptide may comprise or consist of natural amino acids, non-natural amino acids, or both. In some embodiments, a polypeptide may comprise or consist of only natural amino acids or only non-natural amino acids. In some embodiments, a polypeptide may comprise D-amino acids, L-amino acids, or both. In some embodiments, a polypeptide may comprise only D-amino acids. In some embodiments, a polypeptide may comprise only L-amino acids. In some embodiments, a polypeptide may include one or more pendant groups or other modifications, e.g., modifying or attached to one or more amino acid side chains, at the polypeptide’s N-terminus, at the polypeptide’s C-terminus, or any combination thereof. In some embodiments, such pendant groups or modifications comprise acetylation, amidation, lipidation, methylation, pegylation, etc., including combinations thereof. In some embodiments, a polypeptide may be cyclic, and / or may comprise a cyclic portion. In some embodiments, a polypeptide is not cyclic and / or does not comprise any cyclic portion. In some embodiments, a polypeptide is linear. In some embodiments, a polypeptide may be or comprise a stapled polypeptide.
In some embodiments, the term “polypeptide” may be appended to a name of a reference polypeptide, activity, or structure; in such instances it is used herein to refer to polypeptides that share the relevant activity or structure and thus can be considered to be members of the same class or family of polypeptides. For each such class, the present specification provides and / or those skilled in the art will be aware of exemplary polypeptides within the class whose amino acid sequences and / or functions are known; in some embodiments, such exemplary polypeptides are reference polypeptides for the polypeptide class or family. In some embodiments, a member of a polypeptide class or family shows significant sequence homology or identity with, shares a common sequence motif (e.g., a characteristic sequence element) with, and / or shares a common activity (in some embodiments at a comparable level or within a designated range) with a reference polypeptide of the class (in some embodiments, with all polypeptides within the class). For example, in some embodiments, a member polypeptide shows an overall degree of sequence homology or identity with a reference polypeptide that is at least about 30-40%, and is often greater than about 50%, 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or more and / or includes at least one region (e.g., a conserved region that may in some embodiments be or comprise a characteristic sequence element) that shows very high sequence identity, often greater than 90% or even 95%, 96%, 97%, 98%, or 99%. Such a conserved region usually encompasses at least 3-4 and often up to 35 or more amino acids; in some embodiments, a conserved region encompasses at least one stretch of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35 or more contiguous amino acids.
In some embodiments, a relevant polypeptide may comprise or consist of a fragment of a parent polypeptide.
[0113] Ribonucleic acid (RNA) or Polyribonucleotide. As used herein, the term “ribonucleic acid,” “RNA,” or “polyribonucleotide” refers to a polymer of ribonucleotides. In some embodiments, an RNA is single stranded. In some embodiments, an RNA is double stranded. In some embodiments, an RNA comprises both single and double stranded portions. In some embodiments, an RNA can comprise a backbone structure as described in the definition of “Nucleic acid / Polynucleotide” above. An RNA can be a regulatory RNA (e.g., siRNA, microRNA, etc.), or a messenger RNA (mRNA). In some embodiments, an RNA is a mRNA. In some embodiments, where an RNA is a mRNA, an RNA typically comprises at its 3' end a poly(A) region. In some embodiments, where an RNA is a mRNA, an RNA typically comprises at its 5' end an art-recognized cap structure, e.g., for recognition and attachment of a mRNA to a ribosome to initiate translation. In some embodiments, an RNA is a synthetic RNA. Synthetic RNAs include RNAs that are synthesized in vitro (e.g., by enzymatic synthesis methods and / or by chemical synthesis methods).
[0114] Ribonucleotide. As used herein, the term “ribonucleotide” encompasses unmodified ribonucleotides and modified ribonucleotides. For example, unmodified ribonucleotides include the purine bases adenine (A) and guanine (G), and the pyrimidine bases cytosine (C) and uracil (U). Modified ribonucleotides may include one or more modifications including, but not limited to, for example, (a) end modifications, e.g., 5' end modifications (e.g., phosphorylation, dephosphorylation, conjugation, inverted linkages, etc.), 3' end modifications (e.g., conjugation, inverted linkages, etc.), (b) base modifications, e.g., replacement with modified bases, stabilizing bases, destabilizing bases, or bases that base pair with an expanded repertoire of partners, or conjugated bases, (c) sugar modifications (e.g., at the 2' position or 4' position) or replacement of the sugar, and (d) internucleoside linkage modifications, including modification or replacement of the phosphodiester linkages. The term “ribonucleotide” also encompasses ribonucleotide triphosphates including modified and non-modified ribonucleotide triphosphates.
[0115] Subject: As used herein, the term “subject” refers to an organism to be administered with a composition described herein, e.g., for experimental, diagnostic, prophylactic, and / or therapeutic purposes. Typical subjects include animals (e.g., mammals such as mice, rats, rabbits, non-human primates, domestic pets, etc.) and humans. In some embodiments, a subject is a human subject. In some embodiments, a subject is suffering from a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject is susceptible to a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject displays one or more symptoms or characteristics of a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject displays one or more non-specific symptoms of a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject does not display any symptom or characteristic of a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject is someone with one or more features characteristic of susceptibility to or risk of a disease, disorder, or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a subject is a patient. In some embodiments, a subject is an individual to whom diagnosis and / or therapy is and / or has been administered.
[0116] Therapy. The term “therapy” refers to an administration or delivery of an agent or intervention that has a therapeutic effect and / or elicits a desired biological and / or pharmacological effect (e.g., has been demonstrated to be statistically likely to have such effect when administered to a relevant population). In some embodiments, a therapeutic agent or therapy is any substance that can be used to alleviate, ameliorate, relieve, inhibit, prevent, delay onset of, reduce severity of, and / or reduce incidence of one or more symptoms or features of a disease, disorder, and / or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, a therapeutic agent or therapy is a medical intervention that can be performed to alleviate, relieve, inhibit, prevent, delay onset of, reduce severity of, and / or reduce incidence of one or more symptoms or features of a disease, disorder, and / or condition.
[0117] Treat. As used herein, the term “treat,” “treatment,” or “treating” refers to any method used to partially or completely alleviate, ameliorate, relieve, inhibit, prevent, delay onset of, reduce severity of, and / or reduce incidence of one or more symptoms or features of a disease, disorder, and / or condition (e.g., cancer and / or a cancer-associated condition). Treatment may be administered to a subject who does not exhibit signs of a disease, disorder, and / or condition (e.g., cancer and / or a cancer-associated condition). In some embodiments, treatment may be administered to a subject who exhibits only early signs of the disease, disorder, and / or condition (e.g., cancer and / or a cancer-associated condition), for example for the purpose of decreasing the risk of developing pathology associated with the disease, disorder, and / or condition. In some embodiments, treatment may be administered to a subject at a later-stage of disease, disorder, and / or condition (e.g., cancer and / or a cancer-associated condition).
DETAILED DESCRIPTION
[0118] It is contemplated that systems, architectures, devices, methods, and processes of the claimed invention encompass variations and adaptations developed using information from the embodiments described herein. Adaptation and / or modification of the systems, architectures, devices, methods, and processes described herein may be performed, as contemplated by this description.
[0119] Throughout the description, where articles, devices, systems, and architectures are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are articles, devices, systems, and architectures of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.
[0120] It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.
[0121] The mention herein of any publication, for example, in the Background section, is not an admission that the publication serves as prior art with respect to any of the claims presented herein. The Background section is presented for purposes of clarity and is not meant as a description of prior art with respect to any claim.
[0122] Documents are incorporated herein by reference as noted. Where there is any discrepancy in the meaning of a particular term, the meaning provided in the Definition section above is controlling.
[0123] Headers are provided for the convenience of the reader - the presence and / or placement of a header is not intended to limit the scope of the subject matter described herein.
[0124] The present disclosure provides technologies for identification of non-canonical target peptides in a sample via mass-spectrometry. Among other things, in certain embodiments, technologies of the present disclosure include methods and systems for analyzing mass-spectrometry data to identify non-canonical target peptides. In certain embodiments, analysis tools presented herein originate and / or are motivated by an insight that previous mass-spectrometry-based peptide identification techniques, which rely on setting detection thresholds that are nominally calculated to obtain a desired false discovery rate (FDR), can, surprisingly, in fact lead to extremely high - near 100% - FDR when used to detect non-canonical peptides. Recognizing this challenge, the approaches of the present disclosure utilize machine-learning classifiers that receive particular input features. Namely, in certain embodiments, as described in further detail herein, methods and systems of the present disclosure utilize a peptide source feature that identifies whether a candidate peptide whose presence is being tested for is, for example, from a canonical (e.g., human) proteome, or a non-canonical peptide. As demonstrated herein, supplying a machine-learning classifier with (among other features) a peptide source feature allows it to effectively learn and / or account for class heterogeneity, such as a relative likelihood of rarer non-canonical peptides, and confidently identify them, avoiding false positives, even in the presence of a large background of canonical peptides that may often be present in biological samples.
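One way a peptide source feature might be supplied to a classifier alongside other features is sketched below. This is a minimal illustration under stated assumptions: the quality features shown (`match_score`, `mass_error_ppm`), the two source categories, and the one-hot encoding are all hypothetical choices made for the example and are not taken from the present disclosure.

```python
from dataclasses import dataclass

# Hypothetical source categories; a real pipeline may distinguish more classes.
SOURCES = ("canonical", "non_canonical")


@dataclass
class CandidatePeptide:
    sequence: str
    match_score: float     # hypothetical spectrum-match quality feature
    mass_error_ppm: float  # hypothetical precursor mass error feature
    source: str            # which database the candidate sequence came from


def feature_vector(candidate: CandidatePeptide) -> list:
    """Concatenate quality features with a one-hot peptide source feature."""
    if candidate.source not in SOURCES:
        raise ValueError(f"unknown peptide source: {candidate.source!r}")
    source_one_hot = [1.0 if candidate.source == s else 0.0 for s in SOURCES]
    return [candidate.match_score, candidate.mass_error_ppm, *source_one_hot]


# Two invented candidates with identical quality features but different sources.
canonical = CandidatePeptide("PEPTIDEK", 0.9, 1.5, "canonical")
non_canonical = CandidatePeptide("PEPTIDER", 0.9, 1.5, "non_canonical")
vectors = [feature_vector(canonical), feature_vector(non_canonical)]
```

The two vectors differ only in their source entries, so a classifier trained on such inputs is free to require stronger evidence from one class than the other, rather than applying a single score threshold to both.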
[0125] Among other things, the ability to accurately and reliably detect non-canonical target peptides in biological samples that mass-spectrometry technologies described herein provide allows for identification of targets for treatment of diseases such as cancer and infectious diseases, such as viral infections. Technologies described herein may be utilized in the context of hit and lead identification, or may be used “online”, in pipelines that generate personalized treatments for patients, such as for cancer immunotherapies.
A. Protein Mass Spectrometry
[0126] Turning to FIGs. 1A and 1B, in certain embodiments, methods and systems of the present disclosure are used to analyze mass spectrometry data to, for example, identify peptides within a biological sample. As shown in FIG. 1A, mass-spectrometry data may be generated for a given biological sample via workflows that include, for example, various sample preparation steps to produce and isolate peptides for analysis, along with mass-spectrometry techniques such as tandem mass-spectrometry.
i. Mass-Spectrometry, Sample Preparation, and Data Acquisition
[0127] For example, as shown in FIG. 1A, in certain embodiments, a sample preparation step is performed to obtain, for a raw biological sample 102, a purified biological sample comprising isolated peptides of interest.
[0128] A raw biological sample 102 may be, for example, a cell sample, an organoid sample, or tissue sample. For example, in certain embodiments, peptide detection technologies described herein may be used in connection with (e.g., to identify peptides within) cancer samples, such that a raw biological sample 102 is or is derived from a sample obtained from one or more subjects having or suspected to have cancer. Such cancer samples may be blood or other bodily fluids (e.g., liquid biopsy) and may be purified, for example, to isolate cells of interest. In certain embodiments, a cancer sample is, or is derived from, a tissue sample, for example as obtained via biopsy.
[0129] In certain embodiments, a raw biological sample 102 is processed via sample preparation step(s) 104 to isolate and purify potential peptides of interest therein, to obtain a purified sample 106 for input to a mass spectrometer 108.
[0130] A sample preparation step 104 may, for example, include immunopeptidomics techniques used to isolate and purify peptides bound to major histocompatibility complex (MHC) molecules, such as MHC class I (MHC I), MHC class II (MHC II), human leukocyte antigen (HLA) class I (HLA I), and HLA class II (HLA II). Sample preparation steps 104 for isolating and purifying MHC-bound peptides may include, for example, cell or tissue lysis, immunoprecipitation, and purification steps. Example protocols for various immunopeptidomic steps are included, for example, at Thermofisher.com/us/en/home/industrial/mass-spectrometry/immunopeptidomics-mass-spectrometry/sample-preparation.html#purify.
[0131] In certain embodiments, a sample preparation step 104 includes digestion with one or more proteases (e.g., trypsin).
[0132] In certain embodiments, once obtained, a sample, such as a purified sample 106, is provided to a mass-spectrometer 108, which generates mass-spectrometry data 110 for the sample.
[0133] As shown in FIG. 1B, in certain embodiments, a mass-spectrometer 108 is a tandem mass spectrometer, and is coupled to a chromatography component, such as, for example, a liquid chromatography (LC) column 152. A sample may be eluted through an LC column 152, and the resultant effluent 154 provided as input to a mass-spectrometer 108, such as a tandem mass spectrometer (e.g., via electrospray). As illustrated in FIG. 1B, a tandem MS instrument may perform a survey scan 156 followed by one or more MS / MS scans 158, for example one for each of one or more identified precursor ions. A survey scan may generate a survey spectrum 182, which can be used to identify one or more precursor ions. For each of at least a portion of the identified precursor ions, an MS / MS scan may be performed, resulting in one or more MS / MS spectra 184 for each survey spectrum.
[0134] This scanning process may be repeated, over time, as sample exits from the purification component (e.g., elutes from a chromatography column), to obtain a plurality of MS / MS spectra. The plurality of MS / MS spectra may be arranged as sets of MS / MS spectra, each set associated with a particular time point, and each MS / MS spectrum 184 of a set associated with a particular precursor ion of a survey spectrum 182.
ii. Searching for Candidate Peptides and Scoring Spectrum Matches
[0135] Turning to FIG. 2A, in certain embodiments, mass spectrometry data 202 generated as described herein may be analyzed to identify peptides present within a biological sample. For example, a biological sample may be used to generate mass spectrometry data 202 comprising a plurality of observed sample spectra, as described herein. Observed sample spectra may then be used to search for and identify an initial set of one or more candidate peptides 204, potentially present in the biological sample.
[0136] For example, an initial set of candidate peptides 204 may be identified by matching observed sample spectra to one or more candidate peptides, for example to determine candidate peptides that may be responsible for producing at least a portion of the observed sample spectra.
[0137] For example, as illustrated in FIG. 2A, a particular observed sample spectrum 226 may be selected, for use in identifying one or more candidate peptides. In certain embodiments, one or more reference libraries 232 may be used as sources of candidate peptides. A reference library 232 may be a sequence database, comprising a sequence of one or more particular proteomes, such as a canonical human proteome. A reference library 232 may be a list of one or more peptides or proteins.
[0138] Reference libraries 232 may be scanned, and potential candidate peptides selected. For each potential candidate peptide 234, a candidate mass spectrum may be determined 236, and compared with observed sample spectrum 226 to determine whether candidate mass spectrum 236 is a match for observed sample spectrum 226. Spectrum matches may be identified based on various criteria that measure, for example, a similarity between a particular candidate spectrum and a particular observed sample spectrum. Measurements of spectral similarity according to various criteria may be quantified by determining, for a given sample spectrum and a candidate peptide and its corresponding mass spectrum, one or more spectral similarity scores. Spectral similarity scores may, accordingly, be determined based on observed sample spectrum 226 and candidate mass spectrum 236 and used to identify candidate peptide 234 as a match or not. One or more determined spectral similarity scores may be used individually or in combination - such as via an overall similarity score as a function of one or more different similarity scores - to determine whether candidate peptide 234 is a match.
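By way of illustration only, one possible spectral similarity score is sketched below as a cosine similarity between binned spectra. The disclosure does not prescribe this particular score; the function names, the 1.0 m/z bin width, the peak lists, and the 0.5 cutoff are all hypothetical assumptions made for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin index -> intensity)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(observed, candidate, bin_width=1.0):
    """Cosine of the angle between two binned spectra; 0.0 if either is empty."""
    a = bin_spectrum(observed, bin_width)
    b = bin_spectrum(candidate, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical (m/z, intensity) peak lists, not taken from the disclosure.
observed = [(175.1, 30.0), (262.2, 80.0), (375.2, 55.0)]
candidate = [(175.1, 1.0), (262.1, 1.0), (500.3, 1.0)]

score = cosine_similarity(observed, candidate)
is_match = score >= 0.5  # hypothetical cutoff, chosen only for the example
```

In practice several such scores may be computed for each spectrum and candidate pair and combined into an overall score, as the paragraph above describes.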
[0139] Additionally or alternatively, other quality scores, not necessarily based on and / or measuring a similarity between candidate mass spectrum 236 and observed sample spectrum 226, may be used to determine whether candidate peptide 234 and its corresponding candidate mass spectrum are a match for an observed sample spectrum 226.
Table 1. Example match quality scores
[0140] This process of comparing an observed sample spectrum with candidate mass spectra determined for peptides selected from a reference library may be repeated for each of at least a portion of sample spectra in mass spectrometry data 202 to identify an initial set of candidate peptides.
[0141] Search results, that is, candidate peptides for inclusion in an initial set of matches, may be selected in various ways, e.g., according to desired rules and / or selection criteria. For example, in certain embodiments, for a given sample spectrum, a single matching candidate peptide is determined and / or retained, such that a single candidate peptide is identified for each sample spectrum. For example, for a given sample mass spectrum, only a highest scoring candidate peptide may be retained. In certain embodiments, for a given sample mass spectrum, one or more (e.g., a plurality of) candidate peptides are determined and / or retained. For example, for each sample spectrum, the N (where N is an integer, e.g., 1, 2, 3, 4, 5, 10, 15, etc.) highest scoring candidate peptides may be retained, such that for each sample spectrum, a same number (N) of candidate peptides are determined. In certain embodiments, for each sample spectrum, only those candidate peptides scoring above a particular threshold are retained, such that a number of candidate peptides determined for a given sample spectrum may vary from spectrum to spectrum. In certain cases, a sample spectrum may not be associated with any matching candidate peptides, e.g., if none score above the selected threshold.
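The retention rules described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout (spectrum identifier, peptide, score), the function names, and the example scores are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict


def retain_top_n(matches, n=1):
    """Keep the n highest-scoring candidate peptides for each sample spectrum."""
    by_spectrum = defaultdict(list)
    for spectrum_id, peptide, score in matches:
        by_spectrum[spectrum_id].append((peptide, score))
    return {
        sid: sorted(cands, key=lambda c: c[1], reverse=True)[:n]
        for sid, cands in by_spectrum.items()
    }


def retain_above_threshold(matches, threshold):
    """Keep every candidate scoring above the threshold.

    A spectrum with no candidate above the threshold keeps an empty list.
    """
    kept = {sid: [] for sid, _, _ in matches}
    for sid, peptide, score in matches:
        if score > threshold:
            kept[sid].append((peptide, score))
    return kept


# Hypothetical search results: (spectrum id, candidate peptide, score).
matches = [
    ("s1", "PEPTIDEK", 0.91), ("s1", "PEPTLDEK", 0.64), ("s1", "AAPTIDEK", 0.20),
    ("s2", "LLGDFFRK", 0.35), ("s2", "LIGDFFRK", 0.30),
]

top1 = retain_top_n(matches, n=1)    # single best candidate per spectrum
top2 = retain_top_n(matches, n=2)    # N = 2 candidates per spectrum
above = retain_above_threshold(matches, threshold=0.5)  # count varies per spectrum
```

With the threshold rule, spectrum "s2" retains no candidates, which corresponds to the case noted above in which a sample spectrum is not associated with any matching candidate peptide.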
[0142] Candidate peptide spectra that are used to match candidate peptides with observed sample spectra may be determined and / or obtained in various ways, for example via lookup in a database of candidate spectra and / or via theoretical prediction.
[0143] For example, turning to FIGs. 2B and 2C, in certain embodiments, sequence data may be used to generate theoretically predicted mass spectra for candidate peptides. For example, as shown in FIG. 2B, predicted ions, such as b-ions 254a, y-ions 254b, and internal ions 256, may be determined for a particular candidate peptide 252 and a theoretically predicted mass spectrum generated based on the predicted ions. Certain groups of ions may be relevant for particular types of sample preparation steps. For example, b-ions and y-ions are typically associated with mass-spectrometry data generated via tryptic digestion, while internal ions are associated with immunopeptidomic mass spectrometry data that aims to identify MHC (e.g., HLA) bound peptides. Turning to FIG. 2C, a theoretically predicted candidate mass spectrum may then be compared with an observed sample mass spectrum, and one or more similarity scores that measure a similarity between the selected candidate mass spectrum and the sample spectrum may be determined as described herein and used to identify the candidate peptide as a match or not.
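As a minimal sketch of theoretical prediction, the example below computes singly charged b-ion and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. It is an illustration only: the function name and the example peptide are hypothetical, and internal ions, higher charge states, modifications, and intensity prediction are deliberately omitted.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # monoisotopic mass of H2O


def b_y_ions(peptide):
    """Return (b_ions, y_ions) as m/z lists for singly charged fragments.

    b_ions[i-1] covers the first i residues (N-terminal fragment);
    y_ions[i-1] covers the last i residues (C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical three-residue peptide used only to exercise the function.
b, y = b_y_ions("GAK")
```

A predicted spectrum built from such m/z values can then be scored against an observed spectrum with any of the similarity scores discussed herein.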
[0144] In certain embodiments, an initial set of candidate peptides identified via an initial spectral search approach as described above may be referred to as peptide-spectrum matches (PSMs).
iii. Filtering for Desired False Discovery Rates (FDR)
[0145] In certain embodiments, an initial set of candidate peptides (e.g., PSMs) may be refined and / or filtered to obtain a desired false discovery rate (FDR). In particular, as described herein, candidate peptides in an initial set (e.g., of PSMs) can be scored and filtered based on their determined scores in comparison with a threshold value, such that only candidate peptides scoring at or above a selected threshold are retained. Turning to FIGs. 2D and 2E, a threshold value that achieves a particular (e.g., desired) FDR can be determined via a decoy search, whereby a search against a reference library comprising decoy peptides (“decoy library”, for short) is performed as described above. Decoy peptides are peptides that are known or believed to not be present in a sample for which mass spectrometry data 202 was generated. Accordingly, in contrast to a reference library comprising target peptides (“target library”, for short), for which at least a portion of the target peptides are expected to be present in a sample, none of the decoy peptides are expected to be present.
[0146] Decoy libraries may be constructed by reversing and / or shuffling one or more peptides (e.g., all) of a target library. For example, a canonical human proteome sequence may be reversed, or shuffled, to create a decoy library from which decoy peptides may be selected. In certain embodiments, a decoy library may be a sequence of a proteome that cannot, or is extremely unlikely to, be present in the sample (e.g., could only be present as the result of contamination). For example, for a sample comprising human biological material, a decoy library may be a proteome of a plant or nut, such as peanut.
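Decoy construction by reversal or shuffling can be sketched as follows. The library layout (name mapped to sequence), the "DECOY_" prefix, the seed handling, and the example sequences are assumptions made for illustration and are not specified by the disclosure.

```python
import random


def reversed_decoy(sequence):
    """Reverse a target sequence to form a decoy."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Shuffle the residues of a target sequence (same composition, new order)."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def build_decoy_library(target_library, method="reverse", seed=0):
    """Map each target entry to a decoy entry with a 'DECOY_' name prefix."""
    decoys = {}
    for i, (name, seq) in enumerate(target_library.items()):
        if method == "reverse":
            decoy = reversed_decoy(seq)
        else:
            decoy = shuffled_decoy(seq, seed + i)
        decoys["DECOY_" + name] = decoy
    return decoys


# Hypothetical target sequences used only for the example.
target_library = {"protA": "MKTAYIAKQR", "protB": "GAVLIPFW"}
reversed_library = build_decoy_library(target_library, method="reverse")
shuffled_library = build_decoy_library(target_library, method="shuffle", seed=42)
```

Both methods preserve amino acid composition and length, so decoys resemble targets in mass distribution. Note that a reversed or shuffled sequence can occasionally coincide with a real sequence (e.g., a palindromic peptide), which a production pipeline would need to check for.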
[0147] By searching and scoring candidate peptides from decoy libraries along with target libraries, values of thresholds can be tuned to expressly exclude all but a few highest scoring decoys, in order to achieve a desired estimated FDR.
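The threshold tuning described above can be sketched as follows, assuming the common target-decoy estimate in which the FDR at a threshold is the number of decoys divided by the number of targets scoring at or above that threshold. The estimator, the function names, and the nine example scores are assumptions for illustration and are not taken from the figures.

```python
def estimate_fdr(psms, threshold):
    """Estimated FDR = decoys / targets among PSMs scoring at or above threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return float("inf") if decoys else 0.0
    return decoys / targets


def threshold_for_fdr(psms, desired_fdr):
    """Lowest observed score whose estimated FDR does not exceed desired_fdr.

    Returns None if no observed score satisfies the desired FDR.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= desired_fdr:
            best = threshold
    return best


# Hypothetical scored PSMs: (score, is_decoy).
psms = [
    (9.1, False), (8.7, False), (8.0, False), (7.2, False), (6.5, True),
    (6.1, False), (5.0, False), (4.2, True), (3.3, True),
]

strict = threshold_for_fdr(psms, desired_fdr=0.0)   # excludes every decoy
relaxed = threshold_for_fdr(psms, desired_fdr=0.2)  # admits one decoy per five or more targets
retained = [p for p in psms if p[0] >= relaxed]
```

Scanning every observed score, rather than stopping at the first failure, handles the case where the estimated FDR is not monotonic in the threshold.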
[0148] For example, in the illustrative schematic shown in FIG. 2D, each of nine observed sample spectra is matched to a candidate peptide (PSM) via a hypothetical search of a target library and a decoy library, the latter being a reversed version of the target library. The initial set of candidate peptides identified includes a mixture of target and decoy peptides, and each candidate peptide is scored. Turning to FIG. 2E, sorting candidate peptides according to their scores shows that highest scoring candidate peptides tend to be from the target library, while decoys tend to score lower. By setting a threshold value such that only a certain, e.g., desired, percentage of decoy peptides score higher than the threshold value, a desired FDR can be obtained.
iv. Non-Canonical Peptide Classes
[0149] As described and demonstrated herein, in particular in Example 1, while search and score cutoff techniques described above can achieve reasonable accuracy in identifying standard, canonical, peptides in a sample, they suffer serious drawbacks when searching for rare classes of peptides that may be present in a sample at low levels, or not at all. Examples of these non-canonical peptides include peptides associated with unidentified and / or unknown proteins as well as peptides resulting from alternative methods of transcribing, and / or translating proteins from, RNA (e.g., via alternative splicing), translation of non-canonical open reading frames from RNA molecules, and ribosomal infidelity, e.g., resulting in amino acid substitution and / or frameshifting, and post-translational modifications, etc.
[0150] In some embodiments, a non-canonical peptide comprises a polypeptide sequence resulting from an alternative genetic event. In some embodiments, an alternative genetic event comprises: transcription from a novel and / or unannotated open reading frame; transcription from a pseudogene; an insertion and / or deletion mutation; a frameshift mutation; a transposable element insertion or deletion; and / or insertion of a retroviral element.
[0151] In some embodiments, a non-canonical peptide comprises a polypeptide sequence translated from a polyribonucleotide sequence resulting from an alternative transcriptional event. In some embodiments, an alternative transcriptional event comprises: transcription initiated at an alternative start site; and / or transcription terminated at an alternative termination site.
[0152] In some embodiments, a non-canonical peptide comprises a polypeptide sequence translated from a polyribonucleotide sequence resulting from an alternative post-transcriptional event. In some embodiments, an alternative post-transcriptional event comprises: a post-transcriptional mutation of a polyribonucleotide sequence; a truncation of a polyribonucleotide sequence; and / or alternative splicing.
[0153] In some embodiments, a non-canonical peptide comprises a polypeptide sequence translated from an alternative polyribonucleotide sequence. In some embodiments, an alternative polyribonucleotide sequence comprises: a long non-coding RNA sequence, a junction of an exon and a transposable element (JET), a transposable element, or a circular RNA.
[0154] In some embodiments, a non-canonical peptide comprises a polypeptide sequence resulting from an alternative translation event. In some embodiments, an alternative translation event comprises: translation from an internal ribosome entry site; incorrect incorporation of one or more amino acids into a polypeptide chain; and / or premature termination of a polypeptide chain.
[0155] In some embodiments, a non-canonical peptide comprises a polypeptide sequence resulting from an alternative post-translational event. In some embodiments, an alternative post-translational event comprises: glycosylation; phosphorylation; SUMOylation; methylation; acylation; and / or truncation and / or cleavage.
B. Machine Learning-Based Techniques for Identification of Non-Canonical Proteins
[0156] In certain embodiments, machine learning-based classifiers may be used to address challenges associated with identifying non-canonical peptide classes. Turning to FIG. 3, in certain embodiments, a machine learning model, such as, e.g., a classifier model, may be used to classify a particular candidate as predicted to be present in a sample (i.e., a “hit”, or true positive) or not present (i.e., a false positive).
[0157] For example, as shown in FIG. 3, a particular candidate peptide may be classified as likely present or not using a machine learning model that receives input features that include (1) values of one or more spectral similarity features, measuring, e.g., as described herein, a similarity between (i) a candidate mass spectrum corresponding to the particular candidate peptide and (ii) an observed sample spectrum, as well as (2) a peptide source feature that identifies a source of the particular candidate peptide - namely, whether it is a target peptide selected from a canonical proteome or a non-canonical peptide. Based on these input features, the machine learning model may then output a prediction (e.g., one or more prediction value(s)) that classifies, and / or can be used to classify, the candidate peptide as a detected peptide or as not present, i.e., as undetected and / or a false positive that should be excluded from a final list of detected peptides.
i. Input Features
[0158] In certain embodiments, input features may include quality features that measure a quality of a match between candidate mass spectrum and observed sample spectrum. Quality features may be determined based on, e.g., as a function of, (i) the sample mass spectrum and (ii) the candidate mass spectrum and, e.g., provide a measure of similarity of the spectra, a percentage explanation of the candidate mass spectrum, etc. Any of the scores listed in Table 1 above may be used as quality features.
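For illustration, quality features of the kind described above may be combined with the peptide source feature of paragraph [0157] into a single input vector, as in the sketch below. The feature names ("cosine", "fraction_matched_ions"), the class name, the example values, and the 0/1 encoding of the source flag are hypothetical choices and are not prescribed by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    peptide: str
    quality_scores: dict      # hypothetical score name -> value
    is_non_canonical: bool    # peptide source: non-canonical vs canonical library


# Order of entries in the feature vector (hypothetical names).
FEATURE_NAMES = ["cosine", "fraction_matched_ions", "source_non_canonical"]


def feature_vector(candidate):
    """Build the classifier input: quality features followed by the source flag."""
    return [
        candidate.quality_scores["cosine"],
        candidate.quality_scores["fraction_matched_ions"],
        1.0 if candidate.is_non_canonical else 0.0,
    ]


# Hypothetical candidates with identical match quality but different sources.
canonical = Candidate("PEPTIDEK", {"cosine": 0.8, "fraction_matched_ions": 0.6}, False)
non_canonical = Candidate("PEPTLDEK", {"cosine": 0.8, "fraction_matched_ions": 0.6}, True)

x_canonical = feature_vector(canonical)
x_non_canonical = feature_vector(non_canonical)
```

Two candidates with the same match quality then differ only in the final entry, which is what lets a classifier treat the two search spaces differently.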
[0159] In certain embodiments, input features may include other features, based on, for example, features associated with sample preparation, features associated with the candidate peptide sequence, and other features, several examples of which are shown in Table 2, below. See also, e.g., L. Kall et al., “Semi-supervised learning for peptide identification from shotgun proteomics datasets,” Nature Methods, 4(11) 923-925 (2007).
Table 2. Example input features
ii. Classifier Values
[0160] As described herein, and illustrated in FIG. 3, machine learning model 322 may generate, as output, a prediction that classifies, or can be used to classify, a particular candidate peptide 304 as detected (i.e., present in a sample) or not (e.g., not likely to be present in a sample). A prediction may take the form of one or more prediction values, such as, for example, a number (e.g., a real-valued number; e.g., a floating point number) that quantifies likelihood that the particular candidate peptide is present in a sample, such as a number between 0 and 1, representing a probability, or a number representing an odds ratio, or a logit or log-odds. In certain embodiments, a prediction value is or may be determined from two numbers, such as a logit representation. In certain embodiments, a classifier value may be a binary value, such as a Boolean “True” or “False”, or a binary 1 or 0. In certain embodiments, an internal or intermediate output of machine learning model 322 may be a floating point number between 0 and 1, which may, in turn, be compared with a classification threshold value to generate a Boolean “True” or “False”.
iii. Machine learning models
[0161] Machine learning model 322 may be implemented using any of several different machine learning techniques, including, without limitation, artificial neural networks (ANNs), support vector machines (SVMs), random-forests, naive Bayes classifiers, and the like.
iv. Machine learning model training procedures
[0162] Machine learning classifiers used in connection with this approach may be trained as above, in supervised or semi-supervised fashions.
[0163] For example, in a supervised training approach, a (e.g., fully labeled) set of dedicated training examples may be obtained and used to train machine learning model 322 to distinguish between candidate peptides that are present within a sample and decoys, which are not, based on values of input features. Each training example may correspond to an example peptide that is representative of candidate peptides (e.g., both target peptides and decoys) that machine learning model 322 will, at inference, be tasked with classifying and may comprise (i) a set of input feature values and (ii) a (e.g., known) label, the latter representing a ground truth, known classification of example peptide. During training, machine learning model may, accordingly, be repeatedly provided with input feature values of example peptides and generate output predictions based on the received input feature values. Determined output predictions may then be compared with each example’s label, to assess accuracy of machine learning model’s prediction. Parameters, such as weights in an ANN, SVM coefficients, and the like, may be adjusted based on functions, such as a loss function, that measure a difference between machine learning model’s prediction and the examples’ labels. This procedure of generating predictions for examples and tuning adjustable weights and / or coefficients of machine learning model to improve accuracy may be performed in an iterative fashion, in accordance with various optimization methods such as gradient descent, sub-gradient descent, coordinate descent, etc.
[0164] Once trained, values of adjustable weights and / or coefficients of machine learning model 322 are held fixed, and machine learning model 322 may be used to generate predictions for candidate peptides, whose class / label is as yet unknown - a stage typically referred to as inference.
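A minimal sketch of supervised training followed by inference is given below, using logistic regression fitted by batch gradient descent on a log loss. Logistic regression is chosen here only for brevity and is an assumption of the example, as are the feature layout (similarity score, source flag), the labeled examples, the learning rate, and the epoch count.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Predicted probability that a candidate with features x is present."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(examples, labels, lr=1.0, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(examples)
    weights, bias = [0.0] * len(examples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(examples, labels):
            error = predict_proba(weights, bias, x) - y  # d(loss)/d(logit)
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical labeled examples: [similarity score, source flag]; label 1 = present.
X = [[0.90, 0.0], [0.80, 0.0], [0.85, 1.0], [0.20, 0.0], [0.30, 1.0], [0.10, 0.0]]
y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

weights, bias = train(X, y)

# Inference: parameters are held fixed and applied to unlabeled candidates.
p_high = predict_proba(weights, bias, [0.95, 0.0])
p_low = predict_proba(weights, bias, [0.05, 0.0])
detected = p_high >= 0.5  # hypothetical classification threshold
```

The same train-then-freeze pattern applies to the other model types named herein; only the parameterization and the loss function change.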
[0165] Additionally or alternatively, in certain embodiments, machine learning model may be trained in a semi-supervised fashion, for example as described in L. Kall et al., “Semi-supervised learning for peptide identification from shotgun proteomics datasets,” Nature Methods, 4(11) 923-925 (2007) with regard to the “Percolator” algorithm. In certain embodiments, semi-supervised training may be used to avoid a need to obtain a fully labeled dataset of training examples, which may require time-consuming and painstaking manual labeling, and often cannot be reused / transferred if experimental conditions change.
[0166] Turning to FIG. 4, semi-supervised training approaches may utilize an initial set of candidate peptides (e.g., PSMs) identified via a search and score approach, e.g., as described in Section A, above, in lieu of a fully labeled (e.g., manually curated) dataset of dedicated training examples. That is, the example semi-supervised training process shown in FIG. 4 utilizes (i) decoys, which do (e.g., necessarily) have known labels, as negative examples and (ii) high-scoring candidate peptides (e.g., PSMs) as putative positive examples, along with an iterative procedure to train a machine learning model-based classifier, such as an SVM.
[0167] For example, in certain embodiments, in an initial step 400, mass spectrometry data 402 is analyzed to search for, and score, 406 candidate peptides selected from a target library 404a and a decoy library 404b. Search and score step 406 may be performed twice, once using target library 404a and once using decoy library 404b, to generate a first plurality of candidate peptides selected from target library 404a, which may be present in a sample, along with a second plurality of candidate peptides, selected from decoy library 404b, which are known to not be present in the sample. These first and second pluralities of candidate peptides may be combined to yield an initial set of candidate peptides which may be used for training machine learning model. In certain embodiments, a portion of initial set of candidate peptides is selected as a training split 408a, and a portion held out 408b. Candidate peptides may be ranked according to preliminary scores, such as spectral similarity scores, determined via initial search and score procedure 406, and a high scoring subset selected 410 for use as positive training examples 414a. Decoys may be selected (e.g., at random) 412 and used as negative examples 414b.
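The selection of initial training examples described above can be sketched as follows. The split fraction, the fraction of targets taken as positives, the restriction of positives to target-library candidates, and the dictionary layout are assumptions made for the example and are not specified by the disclosure.

```python
import random


def initial_training_examples(candidates, train_fraction=0.8, top_fraction=0.25, seed=0):
    """Split candidates, then pick putative positives and random decoy negatives.

    Each candidate is a dict with 'score' (preliminary search score) and
    'is_decoy' (True if it came from the decoy library).
    """
    rng = random.Random(seed)
    shuffled = candidates[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train_split, held_out = shuffled[:cut], shuffled[cut:]

    # Rank target candidates in the training split by preliminary score.
    targets = sorted(
        (c for c in train_split if not c["is_decoy"]),
        key=lambda c: c["score"],
        reverse=True,
    )
    decoys = [c for c in train_split if c["is_decoy"]]

    n_pos = max(1, int(len(targets) * top_fraction))
    positives = targets[:n_pos]                                   # high scoring subset
    negatives = rng.sample(decoys, k=min(n_pos, len(decoys)))     # random decoys
    return train_split, held_out, positives, negatives


# Hypothetical initial set: ten target and ten decoy candidates.
candidates = (
    [{"peptide": f"T{i}", "score": 0.5 + 0.05 * i, "is_decoy": False} for i in range(10)]
    + [{"peptide": f"D{i}", "score": 0.1 + 0.03 * i, "is_decoy": True} for i in range(10)]
)
train_split, held_out, positives, negatives = initial_training_examples(candidates)
```

The held-out portion plays no role in choosing training examples, so it remains available for evaluating the trained classifier.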
[0168] An iterative procedure 450 may then be used to train a machine learning-based classifier and update training examples. In particular, in a first iteration, positive 414a and negative examples 414b are used to train 452 a machine learning classifier. Once trained, machine learning classifier may be used to score 454 candidate peptides from training split 408a. These new, machine learning-based, scores may then be used to select high scoring candidate peptides 460 for use as an updated set of positive examples 464a and decoys from training split again selected 462 (e.g., at random) for use as an updated set of negative examples 464b. This procedure 450 may be repeated, for example a set number of iterations, or until various measures of convergence are achieved. In this manner, rather than using ground truth labels as positive examples, a set of high scoring candidate peptides are used as positive examples, avoiding a need for time consuming manual labeling.
v. Peptide Source Feature
[0169] Among other things, the present disclosure includes the insight that, for non-canonical peptides, a peptide source feature that conveys a source of a candidate peptide can be used to ensure accurate classification, even for non-canonical peptides that are present at low rates, or may not be present at all, within a sample.
[0170] Without wishing to be bound to any particular theory, it is believed that providing a machine learning classifier with an indication as to whether a given candidate peptide belongs to a special - e.g., non-canonical - class allows the machine learning classifier to account for search space heterogeneity, and determine, via its training procedure, whether peptides from the special, non-canonical, class are present in the sample or not. That is, if the hypothesis that non-canonical peptides are present within a sample is invalid, then during training, the machine learning classifier will effectively learn that candidate peptides identified as originating from the non-canonical class are less likely to be present in the sample and, accordingly, score candidate peptides in this class lower. In other words, the machine learning classifier will learn that the peptide source feature should weigh heavily in whether or not a given candidate peptide is in fact likely to be present in a sample. On the other hand, if non-canonical peptides are indeed present in the sample, then the machine learning model will learn to weigh other features, such as quality features (and / or others) in order to make stringent but confident calls.
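The behavior described above can be illustrated with a small, fully synthetic sketch of semi-supervised training in which the peptide source flag is one of two input features. Several simplifications are assumptions of the example rather than features of the disclosure: logistic regression stands in for an SVM, all decoys (rather than a random subset) serve as negatives, no held-out split is used, decoys are assumed to carry the same source flag as the library they were derived from, and every similarity value is invented. The data are constructed so that no non-canonical target matches well, i.e., the case in which non-canonical peptides are absent from the sample.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def features(c):
    """[similarity score, peptide source flag (1.0 = non-canonical)]."""
    return [c["similarity"], 1.0 if c["non_canonical"] else 0.0]


def logit(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit(X, y, lr=1.0, epochs=300):
    """Logistic regression trained by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, label in zip(X, y):
            err = sigmoid(logit(w, b, x)) - label
            gw = [g + err * xj for g, xj in zip(gw, x)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def semi_supervised_fit(candidates, n_pos, rounds=3):
    """Iteratively retrain, re-picking the top-scoring targets as positives."""
    targets = [c for c in candidates if not c["is_decoy"]]
    decoys = [c for c in candidates if c["is_decoy"]]
    scores = [c["similarity"] for c in targets]  # first round: search score
    for _ in range(rounds):
        ranked = sorted(range(len(targets)), key=lambda i: scores[i], reverse=True)
        positives = [targets[i] for i in ranked[:n_pos]]
        X = [features(c) for c in positives + decoys]
        y = [1.0] * len(positives) + [0.0] * len(decoys)
        w, b = fit(X, y)
        scores = [logit(w, b, features(c)) for c in targets]  # rescore targets
    return w, b


def make(similarity, non_canonical, is_decoy):
    return {"similarity": similarity, "non_canonical": non_canonical, "is_decoy": is_decoy}


# Synthetic candidates (all values hypothetical).
candidates = (
    [make(s, False, False) for s in (0.90, 0.85, 0.80, 0.75, 0.30, 0.25)]  # canonical targets
    + [make(s, True, False) for s in (0.35, 0.30, 0.28, 0.20)]             # non-canonical targets
    + [make(s, False, True) for s in (0.30, 0.28, 0.22, 0.15)]             # canonical decoys
    + [make(s, True, True) for s in (0.33, 0.27, 0.25, 0.18)]              # non-canonical decoys
)
w, b = semi_supervised_fit(candidates, n_pos=4)
```

Because the putative positives contain no non-canonical candidates while the decoys do, the learned weight on the source flag is negative, so a non-canonical candidate must show a better match than a canonical one to reach the same score. If well-matched non-canonical targets were present among the positives, the flag would carry less weight and the quality features would dominate.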
[0171] Accordingly, as demonstrated in Examples 1 and 2 below, combining machine-learning-based classifiers with a dedicated peptide source input feature opens the door for highly accurate detection of non-canonical peptides, with faithful FDRs. In doing so, a variety of applications may, accordingly, employ the technologies of the present disclosure.
C. Detection of Non-Canonical Peptides for Diagnostics and Treatment
[0172] Turning to FIGs. 5A-C, non-canonical peptide identification technologies described herein may be leveraged to identify targets for disease diagnostics and / or treatment. For example, as illustrated in FIG. 5A, methods and systems of the present disclosure may be used, in certain embodiments, to evaluate various biological sources of non-canonical (e.g., “dark”) proteome targets, such as circular RNA, LncRNAs, etc.
[0173] In certain embodiments, non-canonical peptide targets may be associated with cancer, such as tumor associated antigens and / or neoantigens and / or portions (e.g., epitopes) thereof. In certain embodiments, non-canonical peptide targets may be associated with infectious agents, such as viruses (e.g., SARS-CoV-2, influenza, mpox, etc.), bacteria, parasites (e.g., malaria) and the like.
[0174] In certain embodiments, peptide targets identified via mass-spectrometry techniques described herein may be nominated for targeting via various modalities, including, but not limited to, engineered T-cell (TCR-T) therapy, soluble TCR therapy, chimeric antigen receptor T-cell therapy (CAR-T), antibody-drug conjugates (ADCs), and the like. For example, in certain embodiments, technologies of the present disclosure may be used to detect non-canonical peptides, and TCRs against detected peptides may be determined and used to enhance T-cells to increase affinity and capacity to eliminate cancer cells. In certain embodiments, non-canonical peptides identified via technologies described herein may be analyzed to identify surface proteins which can, in turn, be used to design ADCs harboring identified surface proteins. In certain embodiments, identified peptide targets may be nominated for use in RNA vaccines, e.g., as described in PCT Publications WO 2012 / 159643, WO 2014 / 082729, WO 2012 / 159754, WO 2020 / 020444 and WO 2020 / 020894, the content of each of which is incorporated by reference herein in its entirety.
[0175] Turning to FIG. 5B, in certain embodiments, non-canonical peptide detection technologies of the present disclosure may be used in connection with personalized treatment approaches, e.g., as part of a cancer treatment pipeline. As illustrated in FIG. 5B, in certain embodiments a database comprising oncogenic mutations may be created 502 and used as a library of non-canonical peptide targets 504. For example, bioinformatics pipelines may be used to identify oncogenic mutations 502 in protein coding genes and generate a database of protein variants that can be stored in mutant peptide database 504. Immunopeptidome mass spectrometry may then be used to analyze biological samples obtained from patients, such as cells or tissue samples obtained, e.g., via biopsy. Non-canonical peptide detection technologies of the present disclosure may then be used in connection with mutant database 504 and a canonical human proteome sequence database 506 to identify peptides in obtained cancer samples 508. As described herein, accurate detection of non-canonical - namely, oncogenic mutation-based - peptides can be ensured through use of a peptide source flag that allows machine learning-based classifiers to account for search space heterogeneity. Peptides from mutant database 504 identified in the sample may correspond to neoantigens and / or portions thereof (e.g., neoepitopes), specific to cancer cells and, accordingly, may be selected for targeting via cancer immunotherapy techniques. Peptide sequences identified via the approaches herein may be subjected to further analysis, for example immunogenic presentation scoring 510, for selection for inclusion in pharmaceutical compositions 512, such as ribonucleic acid (RNA)-based vaccines for cancer treatment.
Further details regarding RNA-based compositions encoding neoepitopes for cancer treatment are described, for example, in PCT Publications WO 2012 / 159643, WO 2014 / 082729, WO 2012 / 159754, WO 2020 / 020444 and WO 2020 / 020894, the content of each of which is incorporated by reference herein in its entirety.
[0176] Turning to FIG. 5C, in certain embodiments, mass spectrometry analysis techniques of the present disclosure may be used for identification of infectious agent polypeptides for use in pharmaceutical compositions, e.g., immunogenic compositions (e.g., vaccines). As illustrated in FIG. 5C, in certain embodiments, infected cells, e.g., from a human or animal subject, may be obtained 552 and used to generate mass spectrometry data. Candidate peptides may be identified using a canonical human proteome sequence database 556 along with a database of polypeptide sequences for the infectious agent 554 - the latter being treated as the non-canonical peptide database. Mass spectrometry techniques of the present disclosure may be used to identify MHC-bound peptides associated with the infectious agent 558, which, in turn, may be targeted for inclusion in a pharmaceutical composition, such as a vaccine 562. In certain embodiments, identified infectious agent polypeptides are analyzed to evaluate immunogenic presentation 560.
[0177] Accordingly, by improving accuracy with which non-canonical peptides which may, or may not, be present in a sample under test can be identified via mass spectrometry, methods and systems of the present disclosure open up new avenues for creation of pharmaceutical compositions, e.g., for treatment of cancer, infectious disease, etc.
D. Machine Learning, Computer System, and Network Environment
[0178] Certain embodiments described herein make use of computer algorithms in the form of software instructions executed by a computer processor. In certain embodiments, the software instructions include a machine learning (ML) module, also referred to herein as artificial intelligence (AI) software. As used herein, a machine learning module refers to a computer implemented process (e.g., a software function) that implements one or more specific machine learning techniques, e.g., artificial neural networks (ANNs), e.g., convolutional neural networks (CNNs), random forest, decision trees, support vector machines, and the like, in order to determine, for a given input, one or more output values. In certain embodiments, the input comprises image data and / or alphanumeric data which can include 2D and / or 3D datasets, numbers, words, phrases, or lengthier strings, for example. In certain embodiments, the one or more output values comprise image data (e.g., 2D and / or 3D datasets) and / or values representing numeric values, words, phrases, or other alphanumeric strings.
[0179] In certain embodiments, machine learning modules implementing machine learning techniques are trained, for example, using datasets that include categories of data described herein. Such training may be used to determine various parameters of machine learning algorithms implemented by a machine learning module, such as weights associated with layers in neural networks. In certain embodiments, once a machine learning module is trained, e.g., to accomplish a specific task such as identifying certain response strings, values of determined parameters are fixed and the (e.g., unchanging, static) machine learning module is used to process new data (e.g., different from the training data) and accomplish its trained task without further updates to its parameters (e.g., the machine learning module does not receive feedback and / or updates). In certain embodiments, available input data includes training data and validation data, e.g., where the validation data is separate and non-overlapping with the training data. For example, in certain embodiments, training data is used during the training process to optimize a model, whereas validation data is used to check the accuracy of the model while operating on previously unseen data. In certain embodiments, training data is divided into batches (e.g., portions) that are sequentially used (e.g., in random order) as sets of inputs to train a model. In certain embodiments, a model is trained multiple times (e.g., epochs) on the entire set of training data. In certain embodiments, machine learning modules may receive feedback, e.g., based on user review of accuracy, and such feedback may be used as additional training data, to dynamically update the machine learning module. In certain embodiments, two or more machine learning modules may be combined and implemented as a single module and / or a single software application. 
In certain embodiments, two or more machine learning modules may also be implemented separately, e.g., as separate software applications. A machine learning module may be software and / or hardware. For example, a machine learning module may be implemented entirely as software, or certain functions of an ANN module may be carried out via specialized hardware (e.g., via an application specific integrated circuit (ASIC) and / or field programmable gate arrays (FPGAs)).
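By way of non-limiting illustration, the division of available data into non-overlapping training and validation sets, and the presentation of training data in randomly ordered batches over multiple epochs, may be sketched as follows. The function names, the 20% validation fraction, and the fixed random seed are assumptions chosen for illustration only and are not taken from the present disclosure.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle samples and split them into non-overlapping training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # The first n_validation shuffled samples are held out; the rest are used for training.
    return shuffled[n_validation:], shuffled[:n_validation]


def iterate_batches(training, batch_size, epochs, seed=0):
    """Yield (epoch, batch) pairs; each epoch visits every training sample once, in random order."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(training)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield epoch, order[start:start + batch_size]


training, validation = split_train_validation(range(10))
batches = list(iterate_batches(training, batch_size=3, epochs=2))
```

In this sketch a training routine would consume each yielded batch in turn, while the validation list is used only to check accuracy on data not seen during training.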
[0180] In certain embodiments, machine learning modules implementing machine learning techniques may be composed of individual nodes (e.g., units, neurons). A node may receive a set of inputs that may include at least a portion of a given input data for the machine learning module and / or at least one output of another node. A node may have at least one parameter to apply and / or a set of instructions to perform (e.g., mathematical functions to execute) over the set of inputs. In certain embodiments, node instructions may include a step to provide various relative importance to the set of inputs using various parameters, such as weights. The weights may be applied by performing scalar multiplication (e.g., or other mathematical functions) between a set of input values and the parameters, resulting in a set of weighted inputs. In certain embodiments, a node may have a transfer function to combine the set of weighted inputs into one output value. A transfer function may be implemented by a summation of all the weighted inputs and the addition of an offset (e.g., bias) value. In certain embodiments, a node may have an activation function to introduce non-linearity into the output value. Non-limiting examples of the activation function include Rectified Linear Unit (ReLU), logistic (e.g., sigmoid), hyperbolic tangent (tanh), and softmax. In certain embodiments, a node may have a capability of remembering previous states (e.g., recurrent nodes). Previous states may be applied to the input and output values using a set of learning parameters.
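By way of non-limiting illustration, the computation performed by a single node as described above (weighting of inputs, summation with a bias, and application of an activation function) may be sketched as follows. The function names and the numeric values are hypothetical and chosen for illustration only.

```python
import math


def relu(x: float) -> float:
    """Rectified linear activation: passes positive values, clips negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic activation: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias, activation):
    """Weight each input, sum the weighted inputs with a bias, then apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical node with two inputs: 0.5*1.0 + (-1.0)*2.0 + 0.25 is negative, so ReLU gives 0.0.
example = node_output([1.0, 2.0], [0.5, -1.0], 0.25, relu)
```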
[0181] In certain embodiments, the machine learning module comprises a deep learning architecture composed of nodes organized into layers. For example, a layer is a set of nodes that receives data input (e.g., weighted or non-weighted input), transforms it (e.g., by carrying out instructions, e.g., applying a set of functions e.g., linear and / or non-linear functions), and passes transformed values as output (e.g., to the next layer). In certain embodiments, the set of nodes in a particular layer may share the same parameters and instructions without interacting with each other. A machine learning module may be composed of at least one layer (e.g., ordered). Examples of types of layers include convolutional layers (e.g., layers with a kernel, a matrix of parameters that is slid across an input to be multiplied with multiple input values to reduce them to a single output value); fully connected (FC) layers (e.g., all nodes are connected to all outputs of the previous layer); recurrent layers, long short-term memory (LSTM) layers, gated recurrent unit (GRU) layers (e.g., nodes with the various abilities to memorize and apply their previous inputs and / or outputs); batch normalization (BN) layers (e.g., layers that normalize a set of outputs from another layer, allowing for more independent learning of individual layers); activation layers (e.g., layers with nodes that only contain an activation function); and / or (un)pooling layers (e.g., layers that reduce (increase) dimensions of an input by summarizing (splitting) input values in defined patches).
[0182] In certain embodiments, the performance of a machine learning module may be characterized by its ability to produce output data with specific accuracy. To achieve specific accuracy, a training process is performed to find optimal parameters, such as weights, for each node in each layer of the machine learning module. In certain embodiments, the training process of a machine learning module may involve using output data to calculate an objective function (e.g., cost function, loss function, error function) that needs to be optimized (e.g., minimized, maximized). For example, a machine learning objective function may be a combination of a loss function and a regularization term. The loss function measures how well the output of the module matches the expected output for a given input. The loss function may take various forms, such as mean squared error, mean absolute error, binary cross-entropy, or categorical cross-entropy, for example. The regularization term may be needed to prevent overfitting and improve generalization of the training process. Examples of regularization techniques include L1 Regularization or Lasso Regression, L2 Regularization or Ridge Regression, and Dropout (e.g., dropping layer outputs at random during the training process).
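By way of non-limiting illustration, an objective function formed as the sum of a loss function and a weighted regularization term may be sketched as follows. Mean squared error is used as the loss, and the function names, the regularization strength, and the numeric values are assumptions chosen for illustration only.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predicted and expected outputs."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def l1_penalty(weights):
    """L1 (Lasso) penalty: sum of absolute weight values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """L2 (Ridge) penalty: sum of squared weight values."""
    return sum(w * w for w in weights)


def objective(predictions, targets, weights, strength, penalty=l2_penalty):
    """Objective to be minimized: loss plus regularization strength times the penalty."""
    return mean_squared_error(predictions, targets) + strength * penalty(weights)


# Hypothetical values: MSE = (0 + 4) / 2 = 2.0; L2 penalty = 1 + 4 = 5; total = 2.0 + 0.5 * 5 = 4.5.
value_l2 = objective([1.0, 2.0], [1.0, 4.0], [1.0, -2.0], strength=0.5)
# Same data with an L1 penalty: |1| + |-2| = 3; total = 2.0 + 0.5 * 3 = 3.5.
value_l1 = objective([1.0, 2.0], [1.0, 4.0], [1.0, -2.0], strength=0.5, penalty=l1_penalty)
```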
[0183] In certain embodiments, objective function optimization of a machine learning module may involve finding at least one (e.g., all) of the present global optima (e.g., as opposed to local optima). In certain embodiments, the algorithm for objective function optimization follows principles of mathematical optimization for a multi-variable function and relies on achieving specific accuracy of the process. Examples of objective function optimization algorithms include gradient descent, nonlinear conjugate gradient, random search, Levenberg-Marquardt algorithm, limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm, pattern search, basin hopping method, Krylov method, Adam method, genetic algorithm, particle swarm optimization, surrogate optimization, and simulated annealing.
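By way of non-limiting illustration, gradient descent, the first of the optimization algorithms listed above, may be sketched for a one-variable objective as follows. The quadratic objective, the learning rate, and the number of steps are assumptions chosen for illustration only; a practical machine learning module would apply the same update to many parameters at once.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to move toward a minimum of the objective."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3) and whose minimum is at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)
```

For this objective each step shrinks the distance to the minimum by a constant factor, so the returned value approaches 3 closely after 200 steps.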
[0184] In certain embodiments, technologies of the present disclosure may be provided using a network environment. For example, as shown in FIG. 6, an implementation of a network environment 600 for use in providing systems, methods, and architectures as described herein is shown and described. In brief overview, referring now to FIG. 6, a block diagram of an exemplary cloud computing environment 600 is shown and described. The cloud computing environment 600 may include one or more resource providers 602a, 602b, 602c (collectively, 602). Each resource provider 602 may include computing resources. In some implementations, computing resources may include any hardware and / or software used to process data. For example, computing resources may include hardware and / or software capable of executing algorithms, computer programs, and / or computer applications. In some implementations, exemplary computing resources may include application servers and / or databases with storage and retrieval capabilities. Each resource provider 602 may be connected to any other resource provider 602 in the cloud computing environment 600. In some implementations, the resource providers 602 may be connected over a computer network 608. Each resource provider 602 may be connected to one or more computing devices 604a, 604b, 604c (collectively, 604), over the computer network 608.
[0185] The cloud computing environment 600 may include a resource manager 606. The resource manager 606 may be connected to the resource providers 602 and the computing devices 604 over the computer network 608. In some implementations, the resource manager 606 may facilitate the provision of computing resources by one or more resource providers 602 to one or more computing devices 604. The resource manager 606 may receive a request for a computing resource from a particular computing device 604. The resource manager 606 may identify one or more resource providers 602 capable of providing the computing resource requested by the computing device 604. The resource manager 606 may select a resource provider 602 to provide the computing resource. The resource manager 606 may facilitate a connection between the resource provider 602 and a particular computing device 604. In some implementations, the resource manager 606 may establish a connection between a particular resource provider 602 and a particular computing device 604. In some implementations, the resource manager 606 may redirect a particular computing device 604 to a particular resource provider 602 with the requested computing resource.
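By way of non-limiting illustration, the request handling described for the resource manager (receiving a request, identifying capable resource providers, selecting one, and facilitating a connection) may be sketched as follows. The class names, the resource labels, and the first-match selection policy are hypothetical and chosen for illustration only; the present disclosure does not prescribe a particular selection policy.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProvider:
    """Hypothetical resource provider offering a named set of computing resources."""
    name: str
    resources: set = field(default_factory=set)


class ResourceManager:
    """Hypothetical resource manager that matches device requests to capable providers."""

    def __init__(self, providers):
        self.providers = list(providers)

    def capable_providers(self, resource):
        """Identify every provider able to supply the requested resource."""
        return [p for p in self.providers if resource in p.resources]

    def connect(self, device, resource):
        """Select the first capable provider and return a (device, provider name) pairing."""
        candidates = self.capable_providers(resource)
        if not candidates:
            return None  # No provider can satisfy the request.
        return device, candidates[0].name


manager = ResourceManager([
    ResourceProvider("602a", {"database"}),
    ResourceProvider("602b", {"application-server", "database"}),
])
connection = manager.connect("604a", "application-server")
```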
[0186] FIG. 7 shows an example of a computing device 700 and a mobile computing device 750 that can be used to implement the techniques described in this disclosure. The computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 750 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting.
[0187] The computing device 700 includes a processor 702, a memory 704, a storage device 706, a high-speed interface 708 connecting to the memory 704 and multiple high-speed expansion ports 710, and a low-speed interface 712 connecting to a low-speed expansion port 714 and the storage device 706. Each of the processor 702, the memory 704, the storage device 706, the high-speed interface 708, the high-speed expansion ports 710, and the low-speed interface 712, are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 702 can process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706 to display graphical information for a GUI on an external input / output device, such as a display 716 coupled to the high-speed interface 708. In other implementations, multiple processors and / or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system). Thus, as the term is used herein, where a plurality of functions are described as being performed by “a processor”, this encompasses embodiments wherein the plurality of functions are performed by any number of processors (one or more) of any number of computing devices (one or more). Furthermore, where a function is described as being performed by “a processor”, this encompasses embodiments wherein the function is performed by any number of processors (one or more) of any number of computing devices (one or more) (e.g., in a distributed computing system).
[0188] The memory 704 stores information within the computing device 700. In some implementations, the memory 704 is a volatile memory unit or units. In some implementations, the memory 704 is a non-volatile memory unit or units. The memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk.
[0189] The storage device 706 is capable of providing mass storage for the computing device 700. In some implementations, the storage device 706 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 702), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 704, the storage device 706, or memory on the processor 702).
[0190] The high-speed interface 708 manages bandwidth-intensive operations for the computing device 700, while the low-speed interface 712 manages lower bandwidth-intensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 708 is coupled to the memory 704, the display 716 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 710, which may accept various expansion cards (not shown). In the implementation, the low-speed interface 712 is coupled to the storage device 706 and the low-speed expansion port 714. The low-speed expansion port 714, which may include various communication ports (e.g., USB, Bluetooth®, Ethernet, wireless Ethernet) may be coupled to one or more input / output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
[0191] The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 720, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 722. It may also be implemented as part of a rack server system 724. Alternatively, components from the computing device 700 may be combined with other components in a mobile device (not shown), such as a mobile computing device 750. Each of such devices may contain one or more of the computing device 700 and the mobile computing device 750, and an entire system may be made up of multiple computing devices communicating with each other.
[0192] The mobile computing device 750 includes a processor 752, a memory 764, an input / output device such as a display 754, a communication interface 766, and a transceiver 768, among other components. The mobile computing device 750 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 752, the memory 764, the display 754, the communication interface 766, and the transceiver 768, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
[0193] The processor 752 can execute instructions within the mobile computing device 750, including instructions stored in the memory 764. The processor 752 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 752 may provide, for example, for coordination of the other components of the mobile computing device 750, such as control of user interfaces, applications run by the mobile computing device 750, and wireless communication by the mobile computing device 750.
[0194] The processor 752 may communicate with a user through a control interface 758 and a display interface 756 coupled to the display 754. The display 754 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 756 may comprise appropriate circuitry for driving the display 754 to present graphical and other information to a user. The control interface 758 may receive commands from a user and convert them for submission to the processor 752. In addition, an external interface 762 may provide communication with the processor 752, so as to enable near area communication of the mobile computing device 750 with other devices. The external interface 762 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
[0195] The memory 764 stores information within the mobile computing device 750. The memory 764 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 774 may also be provided and connected to the mobile computing device 750 through an expansion interface 772, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 774 may provide extra storage space for the mobile computing device 750, or may also store applications or other information for the mobile computing device 750. Specifically, the expansion memory 774 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, the expansion memory 774 may be provided as a security module for the mobile computing device 750, and may be programmed with instructions that permit secure use of the mobile computing device 750. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
[0196] The memory may include, for example, flash memory and / or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 752), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the memory 764, the expansion memory 774, or memory on the processor 752). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 768 or the external interface 762.
[0197] The mobile computing device 750 may communicate wirelessly through the communication interface 766, which may include digital signal processing circuitry where necessary. The communication interface 766 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 768 using a radio-frequency. In addition, short-range communication may occur, such as using a Bluetooth®, Wi-Fi™, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 770 may provide additional navigation- and location-related wireless data to the mobile computing device 750, which may be used as appropriate by applications running on the mobile computing device 750.
[0198] The mobile computing device 750 may also communicate audibly using an audio codec 760, which may receive spoken information from a user and convert it to usable digital information. The audio codec 760 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 750. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 750.
[0199] The mobile computing device 750 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 780. It may also be implemented as part of a smart-phone 782, personal digital assistant, or other similar mobile device.
[0200] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and / or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and / or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0201] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and / or object-oriented programming language, and / or in assembly / machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and / or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and / or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and / or data to a programmable processor.
[0202] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
[0203] The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
[0204] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[0205] In some implementations, various modules described herein can be separated, combined or incorporated into single or combined modules. Modules depicted in the figures are not intended to limit the systems described herein to the software architectures shown therein.
[0206] Elements of different implementations described herein may be combined to form other implementations not specifically set forth above. Elements may be left out of the processes, computer programs, databases, etc. described herein without adversely affecting their operation. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Various separate elements may be combined into one or more individual elements to perform the functions described herein.
Throughout the description, where apparatus and systems are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are apparatus and systems of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.
[0207] It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.
[0208] While the invention has been particularly shown and described with reference to specific preferred embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
E. Examples
i. Example 1: Utilizing Peptide Source Features with Machine Learning-Based Classifiers to Overcome Inaccurate False Discovery Rates for Non-Canonical Peptide Classes
[0209] The present example demonstrates that a conventional search and score approach using a score cutoff leads to potential false identifications of non-canonical candidate peptides. The search and score implementation in this example used Spectrum Mill to identify and score peptides. Spectrum Mill is particularly well suited for HLA-peptide description in MS and, notably, utilizes internal ions for peptide identification and, accordingly, tends to outperform other methods that may emphasize b and y ions typically associated with standard tryptic digest-based mass-spectrometry data.
[0210] To evaluate performance of techniques based on a search and score approach together with an FDR-based score cutoff, the present example aimed to identify non-canonical peptides produced via Endogenous Retroviruses (ERVs) in samples. ERVs may produce HLA-I peptides, but are excluded from standard proteome searches. ERVs are a product of retro-transposition of genomic DNA and have been studied for their expression across cancer types. Since ERVs are highly expressed in colon cancer, the present example aimed to identify ERV-based peptides in colon cancer samples. Mass spectrometry data from colon cancer samples was analyzed using Spectrum Mill to detect ERV peptides.
[0211] HLA-motif enrichment analysis was used to evaluate the quality of ERV peptides that were detected by Spectrum Mill. FIG. 8 is a graph of percent rank (for best-matching in-genotype allele) for canonical peptides, decoy peptides, and ERV peptides. As shown in FIG. 8, canonical peptides have high HLA-motif enrichment (i.e., strong predicted HLA binding), and therefore are considered relatively confident detections. The decoy peptides, however, have lower motif enrichment (i.e., poor predicted HLA binding). The ERV peptides have lower motif enrichment, similar to that of decoy peptides, suggesting that FDR is not well-controlled when Spectrum Mill is used for detection of ERV peptides. Accordingly, the Spectrum Mill algorithm may not be suitable for identification of non-canonical peptides.
[0212] To evaluate the ability of a search and score-based threshold approach like Spectrum Mill for detection of non-canonical peptides (e.g., ERV peptides), a single FDR threshold was determined and applied to detection of canonical candidate peptides and ERV peptides.
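By way of non-limiting illustration, one common way of deriving a single score cutoff from a target FDR is target-decoy counting, in which the FDR at a given cutoff is estimated as the number of decoy matches divided by the number of target matches scoring at or above that cutoff. The sketch below assumes this generic estimator; it is not a description of the procedure implemented in Spectrum Mill, and the function name and scores are hypothetical.

```python
def score_cutoff_for_fdr(target_scores, decoy_scores, target_fdr=0.01):
    """Return the lowest cutoff whose estimated FDR (decoys / targets) meets the target, or None."""
    best = None
    # Candidate cutoffs are the observed target scores, scanned from most to least stringent.
    for cutoff in sorted(set(target_scores), reverse=True):
        n_targets = sum(s >= cutoff for s in target_scores)
        n_decoys = sum(s >= cutoff for s in decoy_scores)
        if n_decoys / n_targets <= target_fdr:
            best = cutoff
    return best


# Hypothetical scores: one decoy (5.5) outscores the weakest target (5).
targets = [10, 9, 8, 7, 6, 5]
decoys = [5.5, 3]
strict_cutoff = score_cutoff_for_fdr(targets, decoys, target_fdr=0.1)
loose_cutoff = score_cutoff_for_fdr(targets, decoys, target_fdr=0.2)
```

Because the cutoff is computed from pooled target and decoy matches, the same value is applied to every peptide class in the search, regardless of how likely that class is to be present in the sample.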
[0213] FIG. 9 shows the number of identified peptides for canonical candidate peptides 902 and ERV peptides 904. In this example, MS data was generated from 44 normal and cancerous colon samples. The target FDR was set to 1% for detection of canonical candidate peptides 902 and the ERV peptides 904 in the Spectrum Mill implementation 906. As shown by the graph 908, 21,300 canonical peptides and 4,200 ERV peptides were detected. Accordingly, the Spectrum Mill implementation identified a number of non-canonical, ERV-based, peptides. Since detection was performed using a 1% FDR target threshold, the conventional search and score-based threshold implies that a relatively large number of non-canonical, ERV-based, peptides have been detected (e.g., given the stringent FDR threshold). This result, however, is at odds with the predicted binding shown in FIG. 8, suggesting that the majority of the detected ERV peptides were false positives.
[0214] To further evaluate the discrepancy between these two results, and potential errors stemming from a conventional search and score-based threshold approach, mass spectrometry data was generated from 85 mono-allelic cell lines and used to search for peptides selected from a “ridiculome” - i.e., a class of peptides that could not be present in the sample - namely, peanut peptides.
[0215] FIG. 10 shows the number of identified peptides for canonical candidate peptides 1002 and ridiculome peptides 1004. The target FDR was set to 1% for detection of canonical candidate peptides 1002 and the ridiculome peptides 1004 in the Spectrum Mill implementation 1006. As shown by graph 1008, 89,600 canonical peptides were detected, and, despite the fact that peanut peptides should be entirely absent from the samples analyzed, a relatively large number (4,250) of peanut peptides were detected via the Spectrum Mill implementation. Although a threshold was set to achieve a desired 1% FDR, the FDR for the “special” (i.e., hypothetical) peptide class, peanut, is in fact 100%. That is, since there was no peanut in the sample, all the detected peanut peptides are false positives. This result illustrates that a conventional search and score-based thresholding approach cannot reliably detect special classes of peptides (namely, non-canonical peptides) that, on the one hand, may be present in a sample or, on the other hand, may not be present at all.
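The class-specific error rate stated above can be written out explicitly. This is a worked illustration only, assuming the usual definition of FDR as the fraction of detections that are false, and taking the number of true peanut detections to be zero because no peanut material was in the samples. The symbols FP and TP are introduced here for illustration and do not appear in the figures.

```latex
\mathrm{FDR}_{\text{peanut}}
  = \frac{\mathrm{FP}_{\text{peanut}}}{\mathrm{FP}_{\text{peanut}} + \mathrm{TP}_{\text{peanut}}}
  = \frac{4{,}250}{4{,}250 + 0}
  = 100\%
```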
[0216] Without wishing to be bound to any particular theory, it is believed that false identifications of non-canonical peptides, or peptides that may not be present in samples, by conventional search and score-based threshold techniques, even at stringent FDR thresholds, are caused by a search space heterogeneity that is not accounted for when a single FDR threshold is used, as illustrated in FIGs. 11A-C. FIGs. 11A-C illustrate how peptide spectrum match quality scores and thresholds based thereon may - accurately or inaccurately - capture different classes of peptides that may or may not be present in a sample.
[0217] FIG. 11A illustrates a typical case, where peptides from, e.g., a canonical human proteome class are searched for and scored, along with decoys (obtained by reversing the canonical human proteome). As illustrated in FIG. 11A, a threshold may be established to achieve a desired FDR, e.g., at 1%.
[0218] FIG. 11B illustrates how match quality scores may be distributed for a class of peptides that are not present in a sample, such as the peanut peptides described above.
[0219] FIG. 11C is a graph of match quality score per candidate peptide, for a combined (canonical and non-canonical) class of peptides. The candidate peptides include canonical target peptides, canonical decoy (e.g., reversed) peptides, peanut target peptides and peanut decoy (e.g., reversed) peptides. For a combined class of peptides, a 1% FDR region is established by using a suitable score cutoff that is determined by the canonical peptides, and, accordingly, may capture peanut peptides 1102a, 1102b, which are not present in the sample.
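The effect illustrated in FIGs. 11A-C can be reproduced with a small numerical sketch. The following is a minimal illustration, not the Spectrum Mill implementation: it assumes the common target-decoy estimate (decoys divided by targets above a cutoff), and the `Psm` record, the function names, the integer score scale, and all data values are hypothetical and synthetic. A single cutoff is chosen on the pooled (canonical plus peanut) matches, and the FDR estimate is then recomputed within each class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Psm:
    score: int      # match quality score (higher is better), toy integer scale
    is_decoy: bool  # True for reversed-sequence decoy hits
    source: str     # hypothetical class label: "canonical" or "peanut"


def estimated_fdr(psms, cutoff):
    """Target-decoy estimate: decoys / targets among PSMs scoring >= cutoff."""
    targets = sum(1 for p in psms if p.score >= cutoff and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= cutoff and p.is_decoy)
    return decoys / targets if targets else float("inf")


def global_cutoff(psms, target_fdr):
    """Lowest observed score whose pooled FDR estimate is <= target_fdr."""
    passing = [s for s in {p.score for p in psms}
               if estimated_fdr(psms, s) <= target_fdr]
    return min(passing) if passing else None


def fdr_by_class(psms, cutoff):
    """Apply the single pooled cutoff, then estimate FDR within each class."""
    return {src: estimated_fdr([p for p in psms if p.source == src], cutoff)
            for src in sorted({p.source for p in psms})}


# Synthetic toy data (not from the disclosure): canonical targets score high,
# while peanut targets are distributed like decoys because peanut is absent.
psms = (
    [Psm(500 + 5 * i, False, "canonical") for i in range(200)]
    + [Psm(100 + 20 * i, True, "canonical") for i in range(20)]
    + [Psm(110 + 20 * i, False, "peanut") for i in range(20)]
    + [Psm(105 + 20 * i, True, "peanut") for i in range(20)]
)

cutoff = global_cutoff(psms, target_fdr=0.01)
per_class = fdr_by_class(psms, cutoff)
print(cutoff, per_class)  # 470 {'canonical': 0.005, 'peanut': 0.5}
```

In this toy data the pooled estimate at the chosen cutoff stays below 1%, yet two peanut targets pass, both false by construction, and the within-class estimate for peanut is 50%. The pooled cutoff is dominated by the large, well-separated canonical class, which is the heterogeneity the figures describe.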
[0220] To address the challenge of search space heterogeneity, identified as a hurdle to detection of non-canonical peptides, a machine learning-based classifier was utilized. In the present example, a public SVM-based tool, Percolator, was modified to use (1) a usual set of input features 1202 relating to spectrum match quality and other characteristics of the candidate peptide and / or observed spectrum (such as backbone cleavage score, scored peak intensity, delta parent mass, charge number, and delta rank1-rank2), as described, e.g., in L. Käll et al., “Semisupervised learning for peptide identification from shotgun proteomics datasets,” Nature Methods, 4(11) 923-925 (2007), and, additionally, (2) a new, previously undescribed, peptide source feature 1204 that indicates whether a candidate peptide is a member of a non-canonical class (potentially not present at all in the sample), or not.
[0221] As illustrated in FIG. 12, Percolator is an SVM-based algorithm that aims to create combinatorial decision boundaries between target peptides and decoy peptides. FIG. 12 illustrates a reduced dimension 2D plot of target peptide hits 1206a and decoy peptide hits 1206b and identifies a decision boundary 1208 to separate the two classes of peptides 1206a, 1206b. To overcome the issue of class-specific FDR scoring, a new input feature (“is_peptide_non-canonical?”) 1204 was defined for Percolator, providing the class a peptide belongs to.
[0222] Without wishing to be bound to any particular theory, if, for a given sample, a hypothesis that, e.g., non-canonical peptides are present, is wrong, then it is believed that the Percolator tool (or other machine learning-based classifier) will learn that this class is untrustworthy; e.g., during training, the machine learning tool learns that non-canonical peptides are not present, and classifies them as false (e.g., not detected). However, if non-canonical peptides are indeed present in the sample, then it is believed that the machine learning-based classifier, Percolator, will be able to make stringent but confident calls. Accordingly, by including a peptide source feature as input, a machine learning-based classifier (in this example, Percolator) is able to perform class-specific peptide scoring.
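How the peptide source feature might sit alongside the usual quality features can be sketched as follows. This is an illustrative sketch only and does not reproduce Percolator's actual input format or training procedure: the `CandidatePsm` record, the function names, the class labels in `SOURCES`, and the example values are all hypothetical. It shows a binary encoding (canonical versus any non-canonical database) and a multi-valued encoding (one indicator per non-canonical database), together with target-decoy labels of the kind a semi-supervised classifier could be trained on.

```python
from dataclasses import dataclass

# Hypothetical class labels; the first entry is the canonical reference class.
SOURCES = ("canonical", "lncRNA", "ERV")


@dataclass(frozen=True)
class CandidatePsm:
    backbone_cleavage_score: float
    scored_peak_intensity: float
    delta_parent_mass: float
    charge: int
    delta_rank1_rank2: float
    source: str     # database the candidate (or its decoy) was drawn from
    is_decoy: bool  # True for reversed-sequence decoys


def quality_features(psm):
    """Usual spectrum-match quality features, in a fixed order."""
    return [psm.backbone_cleavage_score, psm.scored_peak_intensity,
            psm.delta_parent_mass, float(psm.charge), psm.delta_rank1_rank2]


def binary_source_feature(psm):
    """1.0 if the candidate came from any non-canonical database, else 0.0."""
    return [0.0 if psm.source == "canonical" else 1.0]


def one_hot_source_feature(psm):
    """One indicator per non-canonical class; all zeros means canonical."""
    return [1.0 if psm.source == s else 0.0 for s in SOURCES[1:]]


def input_vector(psm, multi_class=False):
    """Quality features followed by the peptide source feature."""
    encode = one_hot_source_feature if multi_class else binary_source_feature
    return quality_features(psm) + encode(psm)


def training_label(psm):
    """Target-decoy labeling: +1 for targets, -1 for decoys."""
    return -1 if psm.is_decoy else 1


# Illustrative values only (not measured data).
lnc_target = CandidatePsm(12.5, 80.0, 0.002, 2, 3.1, "lncRNA", False)
canon_decoy = CandidatePsm(4.0, 35.0, 0.010, 3, 0.4, "canonical", True)

print(input_vector(lnc_target), training_label(lnc_target))
print(input_vector(lnc_target, multi_class=True))
print(input_vector(canon_decoy), training_label(canon_decoy))
```

Because decoys carry the source label of the database they were generated from, a classifier trained on such vectors sees how targets and decoys of each class are distributed. This is what lets it weight an absent class as untrustworthy, in line with the reasoning above.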
[0223] The performance of Percolator with the new input feature (“is_peptide_non-canonical?” as illustrated in FIG. 12) was first tested using the peanut proteome. FIG. 13 shows the number of identified peanut peptides (mono-allelic cell line data) using the standard Spectrum Mill algorithm 44, and using Percolator with the new input feature (e.g., is_peptide_peanut?). The standard implementation of Spectrum Mill identified 4.25K target peanut peptides, whereas Percolator (with the peptide source feature) identified only 110 target peanut peptides, reducing the number of false positives significantly.
[0224] The performance of Percolator with the new input feature (is_peptide_non-canonical?) was then tested using human lncRNAs, a genuine source of peptides. FIG. 14 shows the number of identified lncRNA peptides (mono-allelic cell line data) using the standard Spectrum Mill algorithm, and using Percolator with the new input feature (e.g., is_peptide_lncRNA?). The standard implementation of Spectrum Mill identified 7.17K target lncRNA peptides, whereas Percolator (with the peptide source feature) identified 1.68K target lncRNA peptides. Introduction of Percolator reduces the number of identified target lncRNA peptides, but not to the extent that the number of identified target peanut peptides was reduced, as shown in FIG. 13. Accordingly, introduction of the new input feature results in stringent selection of lncRNA peptides, and Percolator learns that lncRNA peptides are a plausible target class.
ii. Example 2: Identification of lncRNA-based Non-Canonical Peptides
[0225] This example demonstrates an experimental validation of the above-described approach for detecting non-canonical peptides from the dark proteome using mass spectrometry, namely, long non-coding RNA (lncRNA)-derived peptides. LncRNAs are non-coding RNA species that are at least 200 nucleotides in length. Traditionally, lncRNAs were viewed as junk sequences; recently, however, lncRNAs have been shown to harbor significant coding potential and have been experimentally validated to be translated. See, e.g., P. Zeng et al., “Defining Essentiality Score of Protein-Coding Genes and Long noncoding RNAs,” Front. Genet., 9, 380 (2018) and T. Ouspenskaia et al., “Unannotated proteins expand the MHC-I-restricted immunopeptidome in cancer,” Nat Biotechnol, 40(2), 209-217 (2022).
[0226] The approach described above was applied to detect lncRNA-derived peptides in samples via mass-spectrometry, and HLA-motif enrichment analysis was used to assess quality of detected peptides. Results of the analysis are shown in FIG. 15, which shows a graph of percent rank (for best-matching in-genotype allele) for canonical peptides, decoy peptides, and ERV peptides, similar to the data shown in FIG. 8, along with additional data for the Spectrum Mill + Percolator approach as applied to detection of lncRNAs. The right two box-and-whisker-style plots show HLA-motif enrichment for all detected lncRNA peptides and those lncRNA peptides that were observed at least five times (“greater than 5 hits”), thereby instilling greater confidence in their validity. As shown in FIG. 15, canonical peptides have high HLA-motif enrichment (i.e., strong predicted HLA binding), and are considered relatively confident detections, while decoy peptides have lower motif enrichment (i.e., poor predicted HLA binding). The ERV peptides, detected by the standard Spectrum Mill implementation, have lower motif enrichment, similar to that of decoy peptides. The lncRNA peptides, detected by Spectrum Mill and Percolator with the new input feature, have higher motif enrichment.
[0227] FIGs. 16A-B show MS2 fragmentation plots of an endogenous MVAEPPRV peptide (FIG. 16A) and a synthetic MVAEPPRV peptide (FIG. 16B). The x-axis depicts the retention time of the peptides in minutes, the y-axis the fragment intensity.
[0228] FIG. 17 shows a head-to-toe plot comparing fragment peaks of an endogenous (top; 1702) MVAEPPRV peptide (e.g., the spectrum matched to) and a synthetic (bottom; 1704) MVAEPPRV peptide, confirming presence of the endogenous peptide.
EQUIVALENTS
[0229] Elements of different implementations described herein may be combined to form other implementations not specifically set forth above. Elements may be left out of the processes, computer programs, databases, etc. described herein without adversely affecting their operation. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Various separate elements may be combined into one or more individual elements to perform the functions described herein.
[0230] Throughout the description, where apparatus and systems are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are apparatus, and systems of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.
[0231] It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.
[0232] While the invention has been particularly shown and described with reference to specific preferred embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims
What is claimed is:
1. A method for detecting non-canonical peptides within a biological sample via mass-spectrometry, the method comprising: (a) obtaining, by a processor of a computing device, mass spectrometry data for the biological sample, said mass spectrometry data comprising one or more sample spectrum (spectra); (b) identifying, by the processor, based on the mass spectrometry data, a plurality of candidate peptides, each candidate peptide having a corresponding candidate mass spectrum determined, by the processor, to match at least a portion of the one or more sample spectrum (spectra) of the mass spectrometry data, and wherein at least a portion of the plurality of candidate peptides are non-canonical peptides; (c) determining, by the processor, for each candidate peptide, values for one or more quality features, wherein, for a particular candidate peptide, the one or more quality features measure a quality with which the candidate mass spectrum corresponding to the particular candidate peptide matches the portion of the one or more sample spectrum (spectra); (d) determining, by the processor, for each of the plurality of candidate peptides, a corresponding prediction value using a machine learning model, wherein, for a particular candidate peptide, the corresponding prediction value (i) measures a predicted likelihood of, and / or (ii) classifies, the particular candidate peptide being present in the biological sample as determined by the machine learning model based on a set of input feature values comprising (i) the values for the one or more quality features determined for the particular candidate peptide and (ii) a value of a peptide source feature that indicates whether the particular candidate peptide is a non-canonical peptide; (e) selecting, by the processor, a subset of the plurality of candidate peptides for inclusion in a final set of detected peptides, based on their corresponding prediction values; and (f) storing and / or providing, by the processor, the final set of detected peptides.
2. The method of claim 1, wherein the biological sample is a cell sample.
3. The method of claim 1, wherein the biological sample is a tissue sample.
4. The method of any one of the preceding claims, wherein the biological sample comprises cancer cells.
5. The method of any one of the preceding claims, wherein the biological sample is an organoid sample.
6. The method of any one of the preceding claims, wherein the biological sample is a sample obtained from a subject having been diagnosed with cancer.
7. The method of any one of the preceding claims, wherein the biological sample comprises cells infected with an infectious agent.
8. The method of claim 7, wherein the biological sample is a sample obtained from a subject having been infected with an infectious agent.
9. The method of claim 8, wherein the infectious agent is a virus.
10. The method of any one of the preceding claims, wherein the mass spectrometry data is or has been obtained using a purified version of the biological sample obtained following one or more sample preparation steps.
11. The method of claim 10, wherein the one or more sample preparation steps comprise isolation of MHC-bound peptides from the biological sample.
12. The method of claim 10 or 11, wherein the one or more sample preparation steps comprise a protease digestion step.
13. The method of any one of the preceding claims, wherein the mass spectrometry data is tandem mass spectrometry data.
14. The method of any one of the preceding claims, wherein the mass spectrometry data comprises a plurality of sample spectra, each sample spectrum generated via an MS / MS scan associated with a particular selected precursor ion of a particular survey scan.
15. The method of any one of the preceding claims, comprising generating the mass spectrometry data using a tandem mass spectrometer.
16. The method of any one of the preceding claims, wherein step (b) comprises identifying, by the processor, for each of at least a portion of the one or more sample spectrum (spectra), a matching candidate peptide.
17. The method of claim 16, comprising, for a given sample spectrum: selecting, by the processor, a plurality of prospective candidate peptides from one or more target databases; determining, for each of the plurality of prospective candidate peptides, a corresponding candidate mass spectrum; determining, by the processor, for each of the plurality of prospective candidate peptides, one or more corresponding spectral similarity scores, wherein, for a particular prospective candidate peptide, the one or more corresponding spectral similarity scores are determined based on the corresponding candidate mass spectrum and the given sample spectrum; and selecting, by the processor, one or more of the prospective candidate peptides as matching candidate peptides.
18. The method of claim 17, wherein the one or more corresponding spectral similarity scores comprise a cross-correlation score determined based on a cross-correlation between (i) the prospective candidate peptide’s corresponding mass spectrum and (ii) the given sample spectrum.
19. The method of claim 17 or 18, wherein the one or more target databases comprise one or more sequence database(s).
20. The method of claim 19, wherein the one or more target databases comprise a canonical human proteome sequence.
21. The method of claim 19 or 20, wherein the one or more target databases comprise a non-canonical proteome database comprising one or more polypeptide sequences of one or more non-canonical proteins.
22. The method of claim 21, wherein the non-canonical proteome database comprises one or more polypeptide sequences resulting from an alternative genetic event.
23. The method of claim 22, wherein the alternative genetic event comprises: transcription from a novel and / or unannotated open reading frame; transcription from a pseudogene; an insertion and / or deletion mutation; a frameshift mutation; a transposable element insertion or deletion; and / or insertion of a retroviral element.
24. The method of any one of claims 21 to 23, wherein the non-canonical proteome database comprises one or more polypeptide sequences translated from a polyribonucleotide sequence resulting from an alternative transcriptional event.
25. The method of claim 24, wherein the alternative transcriptional event comprises: transcription initiated at an alternative start site; and / or transcription terminated at an alternative termination site.
26. The method of any one of claims 21 to 25, wherein the non-canonical proteome database comprises one or more polypeptide sequences translated from a polyribonucleotide sequence resulting from an alternative post-transcriptional event.
27. The method of claim 26, wherein the alternative post-transcriptional event comprises: a post-transcriptional mutation of a polyribonucleotide sequence; a truncation of a polyribonucleotide sequence; and / or alternative splicing.
28. The method of any one of claims 21 to 27, wherein the non-canonical proteome database comprises one or more polypeptide sequences translated from an alternative polyribonucleotide sequence.
29. The method of claim 28, wherein the alternative polyribonucleotide sequence comprises: a long non-coding RNA sequence, a junction of an exon and a transposable element (JET), a transposable element, or a circular RNA.
30. The method of any one of claims 21 to 29, wherein the non-canonical proteome database comprises one or more polypeptide sequences resulting from an alternative translation event.
31. The method of claim 30, wherein the alternative translation event comprises: translation from an internal ribosome entry site; incorrect incorporation of one or more amino acids into a polypeptide chain; and / or premature termination of a polypeptide chain.
32. The method of any one of claims 21 to 31, wherein the non-canonical proteome database comprises one or more polypeptide sequences resulting from an alternative post-translational event.
33. The method of claim 32, wherein the alternative post-translational event comprises: glycosylation; phosphorylation; SUMOylation; methylation; acylation; and / or truncation and / or cleavage.
34. The method of any one of claims 21 to 33, wherein the non-canonical proteome database comprises polypeptide sequences produced via alternative splicing events.
35. The method of any one of claims 21 to 34, wherein the non-canonical proteome database comprises polypeptide sequences produced from undiscovered and / or unannotated open reading frames.
36. The method of any one of claims 21 to 35, wherein the non-canonical proteome database comprises polypeptide sequences generated based on one or more post-translational modifications.
37. The method of any one of claims 21 to 36, wherein the non-canonical proteome database comprises endogenous retrovirus (ERV)-derived polypeptide sequences.
38. The method of any one of claims 21 to 37, wherein the non-canonical proteome database comprises polypeptide sequences determined based on somatic mutations in genomic sequence data.
39. The method of any one of claims 21 to 38, wherein the non-canonical proteome database comprises polypeptide sequences of one or more infectious agents.
40. The method of any one of the preceding claims, wherein step (b) comprises, for at least a portion of the candidate peptides, generating, as the corresponding candidate mass spectrum, a predicted mass spectrum based on a sequence of the candidate peptide.
41. The method of any one of the preceding claims, wherein, for a particular candidate peptide, at least a portion of the one or more quality features measure a likelihood that the particular candidate peptide produced the matching portion of the one or more sample spectrum (spectra).
42. The method of any one of the preceding claims, wherein, for a particular candidate peptide, at least a portion of the one or more quality features measure a similarity between (i) the candidate mass spectrum corresponding to the particular candidate peptide and (ii) a particular, matching, sample spectrum of the one or more sample spectra and / or a portion thereof.
43. The method of any one of the preceding claims, wherein the quality features comprise one or more scores and / or features listed in Tables 1 and 2.
44. The method of any one of the preceding claims, wherein the peptide source feature is a binary feature identifying the particular candidate peptide as having been selected from (i) a canonical sequence database or (ii) one or more of non-canonical polypeptide sequence databases.
45. The method of any one of the preceding claims, wherein the peptide source feature has a value selected from three or more possible values identifying the particular candidate peptide as having been selected from (i) a canonical sequence database or (ii) one of two or more non-canonical polypeptide sequence databases.
46. The method of any one of the preceding claims, wherein the machine learning model is a support vector machine (SVM).
47. The method of any one of the preceding claims, wherein the machine learning model is an artificial neural network (ANN).
48. The method of any one of the preceding claims, wherein the machine learning model is or has been trained using a supervised training method.
49. The method of any one of the preceding claims, wherein the machine learning model is or has been trained using a semi-supervised training method.
50. The method of any one of the preceding claims, comprising using the final set of detected peptides in creation of a pharmaceutical composition.
51. The method of claim 50, wherein the pharmaceutical composition comprises one or more polynucleotide(s) encoding at least a portion of the peptides of the final set.
52. A system comprising: a processor of a computing device; and memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 51.
53. The system of claim 52, comprising one or more target sequence databases, each sequence database comprising polypeptide sequences for detection / identification within the biological sample.
54. The system of claim 53, wherein the one or more target sequence databases comprise a non-canonical polypeptide sequence database.
55. The system of any one of claims 52 to 54, comprising a tandem mass spectrometer.
56. A method for selecting peptides for inclusion in a pharmaceutical composition for prevention and / or treatment of cancer, the method comprising: (a) obtaining, by a processor of a computing device, mass spectrometry data for a sample obtained from a subject having cancer and / or at risk for cancer; (b) selecting, by the processor, a plurality of candidate peptides for detection in the sample, said plurality of candidate peptides selected from one or more target peptide databases; (c) determining, by the processor, using a machine learning model, one or more of the plurality of candidate peptides to be present in the sample based at least in part on the mass spectrometry data and including, by the processor, the one or more candidate peptides determined to be present in the sample in a set of detected peptides; and (d) selecting one or more peptides from the set of detected peptides for inclusion in the pharmaceutical composition.
57. The method of claim 56, wherein the one or more target peptide databases comprises a mutant database comprising sequences of polypeptides determined to harbor mutations associated with the cancer.
58. The method of claim 57, wherein, at step (c), for a given candidate peptide the machine learning model receives, as input: (i) values for a set of quality features determined for the given candidate peptide based on (A) one or more sample spectrum (spectra) of the mass spectrometry data and (B) a corresponding candidate mass spectrum determined for the given candidate peptide; and (ii) a peptide source feature that identifies the given candidate peptide as selected from the mutant database or a canonical human proteome sequence database.
59. The method of any one of claims 56 to 58, wherein, at step (c), for a given candidate peptide, the machine learning model generates, as output, a prediction value measuring a likelihood, as determined by the machine learning model, that the given candidate peptide is present in the sample.
60. The method of any one of claims 56 to 59, comprising determining polynucleotide sequences encoding the one or more peptides selected for inclusion in the pharmaceutical composition.
61. The method of any one of claims 56 to 60, wherein step (d) comprises determining, by the processor, an immunogenic presentation score for each member of the set of detected peptides and selecting the one or more peptides for inclusion in the pharmaceutical composition based at least in part on the determined immunogenic presentation scores.
62. A pharmaceutical composition comprising one or more polypeptides corresponding to non-canonical peptides having been detected in a biological sample based on mass spectrometry data and a prediction value determined using a machine learning-based classifier.
63. A pharmaceutical composition comprising one or more nucleotides encoding one or more non-canonical peptides having been detected in a biological sample based on mass spectrometry data and a prediction value determined using a machine learning-based classifier.