Deep learning model for predicting the MHC class I or MHC class II immunogenicity of tumor-specific neoantigens
By using a neural network model to jointly predict MHC binding affinity and the presentation probability of tumor-specific neoantigens, the problem of low accuracy in predicting tumor-specific neoantigens in existing technologies is solved, enabling efficient and personalized cancer vaccine design.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- AMAZON TECH INC
- Filing Date
- 2021-12-01
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies struggle to accurately predict tumor-specific neoantigens with high positive predictive values in cancer patients, leading to inefficient, time-consuming, and labor-intensive personalized cancer vaccine design.
A neural network model jointly predicts MHC class I or MHC class II binding affinity and the presentation probability of tumor-specific neoantigens. Using the peptide sequence and flanking regions of each tumor-specific neoantigen, together with HLA allele pseudo-sequences, as model inputs improves prediction accuracy.
The method achieves high accuracy in predicting the MHC class I or MHC class II immunogenicity of tumor-specific neoantigens, supports the effective design of personalized cancer vaccines, and increases the likelihood of eliciting an immune response.
Figure CN117136410B_ABST
Description
[0001] This application claims the benefit of U.S. Provisional Application No. 63/139,074, filed January 19, 2021, the entire contents of which are incorporated herein by reference.
1. Background Technology
[0003] Cancer is a leading cause of death worldwide, accounting for a quarter of all deaths. Siegel et al., CA: A Cancer Journal for Clinicians, 68:7-30 (2018). There were 18.1 million new cancer cases and 9.6 million cancer-related deaths in 2018. Bray et al., CA: A Cancer Journal for Clinicians, 68(6):394-424. There are many existing standard of care cancer therapies, including ablation techniques (e.g., surgery and radiation) and chemotherapy techniques (e.g., chemotherapeutic agents). Unfortunately, such therapies are often associated with serious risks, toxic side effects, extremely high costs, and uncertain efficacy.
[0004] Cancer immunotherapy (e.g., cancer vaccines) has emerged as a promising approach to cancer treatment. The goal of cancer immunotherapy is to harness the immune system to selectively destroy cancer while leaving normal tissue unharmed. Traditional cancer vaccines typically target tumor-associated antigens (TAAs). TAAs are normally present in normal tissues but are overexpressed in cancer. However, because these antigens are often present in normal tissues, immune tolerance can prevent immune activation. Several clinical trials targeting TAAs have failed to demonstrate a durable beneficial effect compared to standard-of-care treatment. (Li et al., Ann Oncol., 28 (Supplement 12): xii11-xii17 (2017)).
[0005] Neoantigens represent attractive targets for cancer immunotherapy. Neoantigens are non-autologous proteins that are individual-specific. They originate from random somatic mutations in the genome of tumor cells and are not expressed on the surface of normal cells. Id. Because neoantigens are expressed only on tumor cells and therefore do not induce central immune tolerance, cancer vaccines targeting cancer neoantigens have potential advantages, including reduced central immune tolerance and improved safety. Id.
[0006] The mutational landscape of cancer is complex, and tumor mutations are often unique to each individual subject. Most somatic mutations detected by sequencing do not generate effective neoantigens. Only a small fraction of mutations in tumor DNA or tumor cells are transcribed, translated, and processed into tumor-specific neoantigens with enough accuracy to design potentially effective vaccines. Furthermore, not all neoantigens are immunogenic. In fact, the proportion of T cells that spontaneously recognize endogenous neoantigens is approximately 1% to 2%. See Paul et al., J. Immunol., 192, 5831-5839 (2013); Yewdell, Immunity, 25, 533-543 (2006). Of the approximately 1% of MHC-binding neoantigens, only about 50% are recognized by T cells, and only 30-40% are naturally processed to enable tumor cell killing. Id.
[0007] Current in silico methods primarily focus on modeling which neoantigen peptides bind to MHC class I or MHC class II molecules, or on predicting which neoantigens might be processed into short peptides by tumor cells and presented by MHC class I/II molecules. Available tools lack the predictive accuracy to determine which of the presented peptides are immunogenic. Existing methods are therefore associated with low positive predictive values. For example, in one study, three melanoma patients were each immunized with seven peptides with in vitro-proven MHC binding affinity <500 nM (Carreno et al., Science, 348, 803-808 (2015)). Of the 21 peptides tested, only 9 induced T cell responses. Id. If personalized vaccines containing neoantigen peptides are designed using methods with low positive predictive value, patients are less likely to receive therapeutic neoantigens that can elicit an immune response against cancer. Furthermore, current techniques are time-consuming and labor-intensive.
[0008] Therefore, systematic identification of personalized neoantigens in cancer patients is a crucial prerequisite for the successful development of personalized cancer vaccines. However, effectively and accurately predicting immunogenic neoantigen candidates with high positive predictive value for personalized vaccines remains a challenge.
2. Summary of the Invention
[0010] This disclosure relates to a novel method for predicting the immunogenicity of tumor-specific neoantigens (MHC class I or MHC class II) by jointly predicting MHC class I or MHC class II binding affinity and predicting the likelihood that the tumor-specific neoantigen will be presented on the cell surface by MHC class I or MHC class II proteins. The method accurately identifies tumor-specific neoantigens with high predictive value, which may be processed by tumor cells into peptides that bind to MHC class I or MHC class II molecules of a subject and may contact T-cell receptors and ultimately become immunogenic. This method has high predictive accuracy for identifying neoantigens that will elicit an immune response, which is crucial for developing effective personalized immunogenic compositions (e.g., cancer vaccines). This has been an obstacle when using existing methods.
[0011] Furthermore, the methods described herein outperform the gold-standard predictors, namely the MHCflurry-1.4 binding affinity predictor and the MHCflurry-2.0 predictor. See the Examples section for additional details. Each of these models is a separate predictor. MHCflurry-1.4 is an allele-specific MHC class I binding predictor (O'Donnell et al., Cell Systems, 7:129-132 (2018)). The MHCflurry-2.0 predictor is a pan-allele predictor of MHC class I-presented peptides.
[0012] The inventors have further developed an immunogenicity assessment dataset to create a unique benchmark that directly evaluates the method's ability to retrieve immunogenic peptides from a large pool of peptide candidates for a given MHC class I allele. This ability is a crucial component of vaccine design based on personalized immunotherapy.
[0013] To describe more clearly the methods used to predict the MHC class I immunogenicity of tumor-specific neoantigens, Figure 1 provides a schematic flowchart of the method.
[0014] A method for predicting the immunogenicity of tumor-specific neoantigens (MHC class I or MHC class II) begins by obtaining a peptide sequence of the tumor-specific neoantigen and a corresponding flanking region of the peptide sequence. The flanking region may be an amino acid sequence located directly to the left or right of the tumor-specific neoantigen peptide. For example, the flanking region may be an amino acid sequence located at the C-terminus and/or N-terminus of the tumor-specific neoantigen peptide. Typically, the total length of the flanking region may be about 10 amino acids. For example, the length of the flanking region directly to the left of the tumor-specific neoantigen may be about 5 amino acids, and the length of the flanking region directly to the right of the tumor-specific neoantigen may be about 5 amino acids. The peptide sequence and the flanking region are then encoded into numerical vectors. Each numerical vector encodes the amino acid residues of the peptide and the flanking region of the tumor-specific neoantigen, as well as the positions of those residues. HLA allele pseudo-sequences representing HLA alleles are obtained. The length of the HLA pseudo-sequence may be at least about 20 to about 100 amino acids. Preferably, the length of the HLA pseudo-sequence is at least about 30 to 60 amino acids. The HLA allele pseudo-sequence is encoded into a corresponding numerical vector. The HLA allele can be type A, type B, type C, type DQ, type DP, or type DR.
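The encoding step described above can be illustrated with a minimal sketch. The 20-letter amino acid alphabet, the padding symbol, the fixed lengths, and the helper names `encode_sequence` and `encode_neoantigen` are assumptions made for illustration; the specification does not fix a particular encoding.

```python
# Illustrative sketch only: alphabet, padding scheme, and helper names are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for short peptides or flanks
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
AA_TO_INDEX[PAD] = 0


def encode_sequence(seq, length, keep_end=False):
    """Map a sequence to a fixed-length list of integer residue indices.

    keep_end=True keeps the residues nearest the end of the sequence and pads
    on the left (useful for the N-terminal flank, whose last residues touch
    the peptide); otherwise the start is kept and padding goes on the right.
    """
    if keep_end:
        seq = seq[-length:] if length else ""
        padded = seq.rjust(length, PAD)
    else:
        seq = seq[:length]
        padded = seq.ljust(length, PAD)
    return [AA_TO_INDEX[aa] for aa in padded]


def encode_neoantigen(peptide, n_flank, c_flank, max_peptide_len=15, flank_len=5):
    """Encode N-flank + peptide + C-flank as residue indices plus positions."""
    tokens = (
        encode_sequence(n_flank, flank_len, keep_end=True)
        + encode_sequence(peptide, max_peptide_len)
        + encode_sequence(c_flank, flank_len)
    )
    positions = list(range(len(tokens)))  # position of each residue slot
    return tokens, positions


tokens, positions = encode_neoantigen("SIINFEKL", n_flank="QLE", c_flank="TEWTS")
```

Here each residue becomes an integer index and each slot carries its position, so a later embedding layer can look up both the residue identity and where it sits relative to the peptide.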
[0015] Then, a neural network model is used to jointly predict the MHC class I or MHC class II binding affinity of the tumor-specific neoantigens and, for each peptide of interest, the numerical probability that the corresponding peptide will be presented on the cell surface by MHC class I or MHC class II proteins. The neural network model can be a pan-allele model, an allele-specific model, a supertype-specific model, or a combination thereof.
[0016] Initially, the neural network model is trained on a training dataset to optimize its performance. The training dataset includes a peptide-MHC class I or MHC class II affinity measurement dataset and a cell surface peptide presentation dataset. Preferably, the neural network model is trained on both positive and negative training data. The negative training data may contain peptides that do not have tumor-specific neoantigen MHC class I or MHC class II binding affinity and / or are not presented on the cell surface by MHC class I or MHC class II proteins.
[0017] The model input layer includes a numerical vector containing the peptide sequence and flanking regions of the tumor-specific neoantigen, as well as a numerical vector containing the HLA allele pseudo-sequence. Next, each of these numerical vectors is passed through an amino acid embedding layer. The neural network model then flattens the output of the amino acid embedding layer to generate numerical vector representations of the peptide sequence and flanking regions of the tumor-specific neoantigen, as well as of the HLA allele pseudo-sequence.
[0018] To predict the binding affinity of a tumor-specific neoantigen to MHC class I or MHC class II proteins, the numerical vectors of the tumor-specific neoantigen peptide sequence and the HLA allele pseudo-sequence are concatenated. The model further includes the application of one or more layers and/or one or more activation functions. For example, the model may include the application of one or more fully connected layers. For example, the model may include the application of a dropout layer. For example, the model may include the application of an activation function. In an example, the model may include the application of one or more fully connected layers, one or more dropout layers, and/or one or more activation functions. The output is a numerical score representing the binding affinity of the peptide ligand to MHC class I or MHC class II proteins. To predict the probability that a tumor-specific neoantigen will be presented on the cell surface by MHC class I or MHC class II proteins, the peptide sequence of interest, the flanking regions of the peptide sequence, and the HLA allele pseudo-sequence are concatenated into a single numerical vector. The predicted peptide ligand-MHC class I or MHC class II binding affinity is also concatenated. The model further includes the application of one or more layers and/or one or more activation functions. For example, the model may include the application of one or more fully connected layers. For example, the model may include the application of a dropout layer. For example, the model may include the application of an activation function. The model may further include applying one or more fully connected layers, one or more dropout layers, and/or one or more activation functions. The output is the numerical probability that the peptide will be presented on the cell surface by MHC class I or MHC class II proteins.
The binding affinity of tumor-specific neoantigens to MHC class I or MHC class II proteins and the numerical probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins are indicators of the MHC class I or MHC class II immunogenicity of the tumor-specific neoantigens. Typically, MHC class I immunogenicity is CD8+ T cell immunogenicity, and MHC class II immunogenicity is CD4+ T cell immunogenicity.
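The two-output arrangement described above, in which the predicted affinity is fed into the presentation prediction, can be sketched as a small forward pass. This is a toy illustration under stated assumptions, not the disclosed model: the class name `JointPredictor`, the single hidden layer per head, the ReLU and sigmoid activations, the layer sizes, and the random initialization are all hypothetical, and dropout is omitted because the sketch covers inference only.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_dense(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class JointPredictor:
    """Hypothetical two-head network: affinity head feeds the presentation head."""

    def __init__(self, peptide_dim, flank_dim, allele_dim, hidden=8, seed=0):
        rng = random.Random(seed)
        self.aff_hidden = init_dense(rng, peptide_dim + allele_dim, hidden)
        self.aff_out = init_dense(rng, hidden, 1)
        pres_in = peptide_dim + flank_dim + allele_dim + 1  # +1 for predicted affinity
        self.pres_hidden = init_dense(rng, pres_in, hidden)
        self.pres_out = init_dense(rng, hidden, 1)

    def forward(self, peptide_vec, flank_vec, allele_vec):
        # Affinity head: peptide and allele pseudo-sequence vectors are concatenated.
        h = relu(dense(peptide_vec + allele_vec, *self.aff_hidden))
        affinity = sigmoid(dense(h, *self.aff_out)[0])
        # Presentation head: peptide, flanks, allele, and the predicted affinity.
        x = peptide_vec + flank_vec + allele_vec + [affinity]
        h = relu(dense(x, *self.pres_hidden))
        presentation = sigmoid(dense(h, *self.pres_out)[0])
        return affinity, presentation


model = JointPredictor(peptide_dim=4, flank_dim=2, allele_dim=3)
affinity, presentation = model.forward([0.1] * 4, [0.2] * 2, [0.3] * 3)
```

In this sketch the flanking regions influence only the presentation output, which mirrors the text: binding affinity is predicted from the peptide and allele alone, while presentation additionally sees the flanks and the affinity score.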
[0019] The method may further include validating the neural network by: applying one or more ranking metrics to an immunogenicity validation dataset; ranking the peptides, for each allele in the immunogenicity validation dataset, based on the predicted MHC class I binding affinity of each peptide and the numerical probability that the peptide will be presented on the cell surface by MHC class I proteins; and aggregating the ranking metrics across all alleles. The ranking metrics can be aggregated using allele-frequency weighting. The method may additionally include calibrating the neural network model. The neural network model can be calibrated using probabilistic calculations. These calculations can estimate the overall presentation probability across the subject's alleles.
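The allele-weighted validation described above can be sketched as follows. Precision at k is used here only as an example of a ranking metric, and a single score per peptide stands in for whatever combination of predicted affinity and presentation probability is used for ranking; both choices, along with the function names and the toy frequencies, are assumptions for illustration.

```python
def precision_at_k(scores, labels, k):
    """Fraction of immunogenic peptides (label 1) among the k top-scored peptides."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top) if top else 0.0


def frequency_weighted_metric(per_allele, allele_freq, k):
    """Aggregate a per-allele ranking metric using allele-frequency weights.

    per_allele maps allele name -> (scores, labels); allele_freq maps allele
    name -> population frequency. Weights are renormalized over the alleles
    that are actually present in the validation data.
    """
    total_weight = sum(allele_freq[allele] for allele in per_allele)
    weighted_sum = sum(
        allele_freq[allele] * precision_at_k(scores, labels, k)
        for allele, (scores, labels) in per_allele.items()
    )
    return weighted_sum / total_weight


# Toy validation data: model scores and ground-truth immunogenicity labels.
validation = {
    "HLA-A*02:01": ([0.9, 0.8, 0.1], [1, 0, 1]),
    "HLA-B*07:02": ([0.2, 0.7, 0.6], [0, 1, 1]),
}
frequencies = {"HLA-A*02:01": 0.3, "HLA-B*07:02": 0.1}  # made-up example weights
overall = frequency_weighted_metric(validation, frequencies, k=2)
```

Weighting by allele frequency means that performance on common alleles dominates the summary score, which matches the goal of reflecting how the model would behave across a patient population.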
[0020] Tumor-specific neoantigens predicted to have MHC class I or MHC class II immunogenicity can be selected for use in immunogenic compositions. Typically, about 10 to about 20 tumor-specific neoantigens can be selected for use in immunogenic compositions.
3. Description of the attached drawings
[0022] Figure 1 is a diagram of the model architecture. Model inputs: a) peptide sequence (with flanking regions); b) allele pseudo-sequences. Model outputs: a) predicted binding affinity; b) predicted presentation probability. Loss functions used to train the model: a) MSE with inequality loss for binding affinity prediction; b) binary focal loss for presentation probability prediction.
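As an illustration of the loss named for the presentation output, the following is a minimal sketch of a binary focal loss in its commonly published form. The parameter names `gamma` and `pos_weight` and their default values are assumptions and are not taken from this specification; `pos_weight` is unrelated to the α that weights the loss components in Figure 7.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, pos_weight=0.25, eps=1e-7):
    """Focal loss for one sample.

    p: predicted presentation probability in (0, 1)
    y: ground-truth label, 1 (presented) or 0 (not presented)
    gamma: focusing parameter; gamma = 0 reduces to weighted cross-entropy
    pos_weight: weight on positive samples (negatives get 1 - pos_weight)
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    weight = pos_weight if y == 1 else 1.0 - pos_weight
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.6, 1)  # less confident prediction
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified samples, which is useful when presented peptides are heavily outnumbered by non-presented ones.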
[0023] Figure 2A is a graph depicting the ratio of peptides whose similarity to their partner peptides does not exceed a threshold. Figure 2B is a graph showing the distribution of peptide lengths across three datasets (affinity, presentation, and immunogenicity).
[0024] Figure 3 is a graph showing the distribution of peptide-MHC binding affinity labels. The y-axis represents the percentage of samples belonging to each bin in the training data.
[0025] Figure 4 is a graph showing the distribution of HLA allele supertypes in the dataset samples.
[0026] Figures 5A to 5C are graphs depicting the sample distribution for each HLA allele. Figure 5A shows the sample distribution for peptide-MHC binding affinity. Figure 5B shows the sample distribution for cell surface peptide presentation. Figure 5C shows the sample distribution for T cell immunogenicity.
[0027] Figure 6 is a chart depicting the immunogenicity allele frequency weights based on allele frequencies in the US population.
[0028] Figure 7 is a graph showing a performance comparison for different values of α, which determines the weight of each loss component.
[0029] Figures 8A and 8B are graphs showing the correlation between the immunogenicity rate and the predicted presentation probability (Figure 8A) and the predicted affinity (Figure 8B). This experiment was performed on an immunogenicity validation set in which the ground-truth immunogenicity labels were known and presentation and binding affinity were predicted using a generalized model.
[0030] Figure 9 shows an exemplary provider network environment.
[0031] Figure 10 is a block diagram of an exemplary provider network that provides storage services and hardware virtualization to customers according to some implementations.
[0032] Figure 11 is a block diagram illustrating an exemplary computer system.
[0033] Figure 12 shows the model input. The model receives two sequences: a token sequence and a corresponding segment sequence. The token sequence is obtained by concatenating the <cls> token, the allele pseudo-sequence tokens, the <sep> token, the n-flank tokens, the peptide tokens, the c-flank tokens, and the <eos> token. The segment sequence provides, for each token, an index indicating which segment that token belongs to.
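The input layout described for Figure 12 can be sketched as follows. The separator token is written here as `<sep>`, and each of the seven parts is given its own segment index; both are assumptions for illustration, as are the function name `build_model_input` and the example sequences.

```python
def build_model_input(allele_pseudo, n_flank, peptide, c_flank):
    """Build the token sequence and the matching segment-index sequence."""
    parts = [
        ["<cls>"],
        list(allele_pseudo),  # one token per pseudo-sequence residue
        ["<sep>"],
        list(n_flank),
        list(peptide),
        list(c_flank),
        ["<eos>"],
    ]
    tokens, segment_ids = [], []
    for segment_index, part in enumerate(parts):
        tokens.extend(part)
        segment_ids.extend([segment_index] * len(part))
    return tokens, segment_ids


# Hypothetical example: a short pseudo-sequence and a peptide with two-residue flanks.
tokens, segment_ids = build_model_input("YFAM", n_flank="QL", peptide="SIINFEKL", c_flank="TE")
```

Because the two sequences have equal length, a model can add a segment embedding to each token embedding to tell allele residues, flank residues, and peptide residues apart.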
[0034] Figure 13 is a schematic diagram of a transformer layer consisting of a multi-head self-attention module followed by a feedforward module (composed of two linear layers with a GELU activation in between). Layer normalization is applied at the beginning of each module, and residual dropout is applied at the end of each module, before the residual connection.
4. Detailed Implementation
[0036] This disclosure relates to a novel method for predicting the MHC class I or MHC class II immunogenicity of tumor-specific neoantigens by jointly predicting MHC class I or MHC class II binding affinity and the likelihood that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins. The novel method is preferably used for predicting the MHC class I or MHC class II immunogenicity of tumor-specific neoantigens.
[0037] The method includes obtaining sequencing data of tumor-specific neoantigens and HLA allele pseudo-sequences representing HLA alleles. For example, exome, transcriptome, and/or whole-genome nucleotide sequencing can be used to obtain sequencing data and peptide sequences of tumor-specific neoantigens. The method may further include encoding the peptide sequence and optional flanking regions of each tumor-specific neoantigen into corresponding numerical vectors. Each numerical vector includes information describing the amino acids constituting the peptide sequence and the positions of those amino acids in the sequence.
[0038] The method may further include encoding HLA pseudo-sequences into numerical vectors. The method may include inputting numerical vectors into a neural network model to jointly predict the binding affinity of tumor-specific neoantigens to MHC class I or MHC class II, and for each tumor-specific neoantigen, the numerical probability that the corresponding peptide will be presented on the cell surface by MHC class I or MHC class II proteins. Both predictions can be used as indicators for predicting MHC class I or MHC class II immunogenicity (e.g., CD8+ T cell immunogenicity or CD4+ T cell immunogenicity). After inputting the numerical vectors into the neural network model, the numerical vectors can be converted into an amino acid embedding layer, which can then be flattened to produce a numerical vector representation of each peptide sequence, optionally a peptide flanking region, and an HLA allele pseudo-sequence for the tumor-specific neoantigen.
[0039] Next, the neural network model can be used to predict the binding affinity of tumor-specific neoantigens to MHC class I or MHC class II and the probability that the tumor-specific neoantigen will be presented on the cell surface by MHC class I or MHC class II proteins. These predictions can be performed by concatenating the tumor-specific neoantigen peptide sequences, the HLA allele pseudo-sequences, and the optional peptide flanking regions. One or more layers and/or functions can then be applied. For example, one or more fully connected dense layers can be applied. For example, one or more dropout layers can be applied. For example, one or more activation functions can be applied. In an embodiment, a combination of one or more fully connected dense layers, one or more dropout layers, and/or one or more activation functions can be applied. The output is a numerical score representing the binding affinity of the tumor-specific neoantigen to MHC class I or MHC class II and/or the numerical probability that the peptide will be presented on the cell surface by MHC class I or MHC class II proteins. These predicted values can serve as indicators of immunogenicity. Immunogenic tumor-specific neoantigens can then be selected for inclusion in personalized immunogenic compositions.
[0040] The predictions disclosed herein are made by a model trained on a training dataset. The training dataset includes multiple samples. The training dataset may include a peptide-MHC class I or MHC class II affinity measurement dataset and a cell surface peptide presentation dataset. Preferably, the neural network model is trained on both positive and negative training data. The negative training data may contain peptides that do not have MHC class I or MHC class II binding affinity and/or are not presented on the cell surface by MHC class I or MHC class II proteins.
[0041] Neoantigens predicted to be immunogenic in MHC class I or MHC class II can be selected for inclusion in the immunogenic composition. Typically, about 10 to about 20 tumor-specific neoantigens can be selected for the immunogenic composition. For example, the immunogenic composition may contain about 5, about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, about 20, about 21, about 22, about 23, about 24, or about 25 tumor-specific neoantigens.
[0042] I. Definition
[0043] All publications and patents cited in this disclosure are incorporated herein by reference in their entirety. If any material incorporated herein by reference contradicts or is inconsistent with this specification, this specification shall supersede any such material. Citation of any reference herein is not an admission that such reference is prior art to this disclosure. Various terms used throughout the specification and claims are relevant to various aspects of this specification. Unless otherwise stated, such terms shall be given their ordinary meaning in the art. Other specifically defined terms shall be interpreted in a manner consistent with the definitions provided herein.
[0044] As used herein, unless the context clearly indicates otherwise, the singular forms "a," "an," and "the" include the plural forms. Unless otherwise specifically indicated, the terms "including," "such as," etc., are intended to convey inclusion without limitation.
[0045] As used herein, the term "cancer" refers to a physiological condition in which a population of cells in a subject is characterized by uncontrolled proliferation, immortality, metastatic potential, rapid growth and proliferation rates, and / or certain morphological features. Cancer can typically take the form of a tumor or mass, but can exist alone in the subject's body or circulate in the bloodstream as independent cells, such as leukemia or lymphoma cells. The term cancer encompasses all types of cancer and metastatic tumors, including hematologic malignancies, solid tumors, sarcomas, carcinomas, and other solid and non-solid tumors. Examples of cancer include, but are not limited to, carcinomas, lymphomas, blastomas, sarcomas, and leukemias. More specific examples of this type of cancer include squamous cell carcinoma, small cell lung cancer, non-small cell lung cancer, lung adenocarcinoma, lung squamous cell carcinoma, peritoneal cancer, hepatocellular carcinoma, gastrointestinal cancer, pancreatic cancer, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer (e.g., triple-negative breast cancer, hormone receptor-positive breast cancer), osteosarcoma, melanoma, colon cancer, colorectal cancer, endometrial cancer (e.g., serous) or uterine cancer, salivary gland cancer, kidney cancer, liver cancer, prostate cancer, vulvar cancer, thyroid cancer, hepatocellular carcinoma, and various types of head and neck cancer. Triple-negative breast cancer is breast cancer that is negative for estrogen receptor (ER), progesterone receptor (PR), and Her2 / neu gene expression. Hormone receptor-positive breast cancer is breast cancer that is positive for at least one of the following: ER or PR, and negative for Her2 / neu (HER2).
[0046] As used herein, the term "neoantigen" refers to an antigen with at least one alteration that distinguishes it from its corresponding parent antigen, for example, through mutations in tumor cells or tumor cell-specific post-translational modifications. Mutations may include frameshifts, insertions/deletions, missense or nonsense substitutions, splice site alterations, genomic rearrangements or gene fusions, or any genomic or expression alteration that produces a neoantigen. Mutations may include splice mutations. Tumor cell-specific post-translational modifications may include aberrant phosphorylation. Tumor cell-specific post-translational modifications may also include proteasome-generated spliced antigens. See Liepe et al., Science, 354(6310):354-358 (2016). Typically, point mutations account for approximately 95% of mutations in tumors, with the remainder being insertions/deletions and frameshifts. See Snyder et al., N Engl J Med., 371:2189-2199 (2014).
[0047] As used herein, the term "tumor-specific neoantigen" is a neoantigen that is present in the tumor cells or tissues of a subject but not in the subject's normal cells or tissues.
[0048] As used herein, the term "immunogenicity" refers to the ability to elicit an immune response (e.g., a T-cell response, a B-cell response, or both).
[0049] As used herein, the term "HLA allele pseudo-sequence" refers to an amino acid sequence generated by an algorithm to represent the amino acid sequence of the HLA allele.
[0050] As used herein, the term "subject" refers to any animal, such as any mammal, including but not limited to humans, non-human primates, rodents, etc. In some embodiments, the mammal is a mouse. In some embodiments, the mammal is a human.
[0051] As used herein, the term "tumor cell" refers to any cell that is a cancer cell or originates from a cancer cell. The term "tumor cell" can also refer to a cell that exhibits cancer-like properties, such as uncontrolled proliferation, resistance to growth signals, metastatic ability, and loss of the ability to undergo programmed cell death.
[0052] As used herein, the term "neural network" refers to a machine learning model for classification or regression that consists of multiple layers of linear transformations followed by element-wise nonlinearities, and that is typically trained via stochastic gradient descent and backpropagation.
[0053] Any term not directly defined herein should be understood to have the meaning generally associated with it as understood within the field of this invention. Certain terms are discussed herein to provide practitioners with additional guidance on describing the compositions, apparatus, methods, etc., of various aspects of the invention and how to manufacture or use them. It should be understood that the same thing can be described in more than one way. Therefore, alternative language and synonyms may be used for any one or more terms discussed herein. Whether terms are described in detail or discussed herein is not important. Several synonyms or alternative methods, materials, etc., are provided. Unless explicitly stated, the use of one or more synonyms or equivalents does not preclude the use of other synonyms or equivalents. The use of examples (including examples of terms) is for illustrative purposes only and does not limit the scope and meaning of the various aspects of the invention herein.
[0054] Additional descriptions of the methods and guidance on practicing the methods are provided herein. For ease of presentation, further details and guidance are provided regarding preferred aspects of predicting MHC class I immunogenicity by jointly predicting MHC class I binding affinity and the likelihood of tumor-specific neoantigens being presented on cell surfaces by MHC class I proteins. The further details and guidance also apply to predicting MHC class II immunogenicity.
[0055] II. Training
[0056] The method may include training the neural network model on a training dataset to optimize its performance, so that the neural network can predict the MHC class I or MHC class II binding affinity of tumor-specific neoantigens and the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins.
[0057] The training dataset used in the methods described herein includes multiple samples. Training data can include a variety of data. The training dataset may include peptide MHC class I or MHC class II affinity measurement datasets and cell surface peptide presentation datasets, as well as optional immunogenicity datasets. The training dataset may include human rhinovirus data. Negative samples can be used for immunogenicity assessment. The training dataset can be used to train one or more neural network models. In embodiments, one or more neural network models can be trained. For example, at least two or more neural networks can be trained. At least about three, four, five, six, seven, eight, nine, ten, or more neural networks can be trained.
[0058] Datasets of peptide-MHC class I or MHC class II affinity measurements may include experimentally measured binding affinities of peptides for specific MHC class I or MHC class II alleles. Such datasets may be obtained from one or more data sources, such as publicly available data sources. For example, the dataset may be obtained from the Immune Epitope Database ("IEDB", iedb.org). The training dataset may be further expanded based on one or more data sources. The training dataset may be further refined for the methods disclosed herein. For example, the training data may contain predictions of binding affinity between the peptide and each associated MHC molecule. Experimentally measured binding affinities in the dataset may be quantitative (associated with the relation "="), qualitative (associated with the inequalities "<" or ">"), or a combination thereof. Quantitative data may include IC50 values in nM. Qualitative data can be represented as high positive (e.g., binding affinity <100 nM), intermediate positive (e.g., binding affinity <1,000 nM), low positive (e.g., binding affinity <5,000 nM), or negative (e.g., binding affinity >5,000 nM). Training datasets containing MHC class I or MHC class II affinity measurements may also contain peptides eluted from MHC and identified by mass spectrometry.
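The quantitative and qualitative measurement types described above can be brought into a single (value, inequality) form, sketched below. The thresholds follow the examples given in the text, interpreted as IC50 values in nM; the record field names `ic50_nm` and `qualitative` and the function name are hypothetical.

```python
# Thresholds follow the examples in the text; field names are hypothetical.
QUALITATIVE_TO_MEASUREMENT = {
    "high positive": (100.0, "<"),
    "intermediate positive": (1000.0, "<"),
    "low positive": (5000.0, "<"),
    "negative": (5000.0, ">"),
}


def normalize_measurement(record):
    """Return (ic50_nm, inequality) for a quantitative or qualitative record."""
    if record.get("ic50_nm") is not None:
        # Quantitative measurement: an exact value, associated with "=".
        return float(record["ic50_nm"]), "="
    # Qualitative measurement: only a bound on the IC50 is known.
    label = record["qualitative"].strip().lower()
    return QUALITATIVE_TO_MEASUREMENT[label]


examples = [
    {"ic50_nm": 42},
    {"qualitative": "High positive"},
    {"qualitative": "Negative"},
]
normalized = [normalize_measurement(r) for r in examples]
```

Keeping the inequality next to the value lets a later loss function treat a qualitative record as a one-sided constraint rather than as an exact target.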
[0059] The MHC class I affinity measurement dataset can be further refined to retain a subset of predicted binding affinities for specific MHC class I alleles. For example, entries for HLA-A, HLA-B, and/or HLA-C alleles can be retained. Peptides of a specific length can also be retained. Peptides with a length of at least about 5 amino acids to about 20 amino acids can be retained. The amino acid length of the peptide can be about 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids. Peptides in the training set can have the same or different lengths and can vary depending on the type of MHC allele. Preferably, the peptide length is about 5 to about 15 amino acids. Peptides containing post-translational modifications or atypical amino acids can be removed.
[0060] The MHC class II affinity measurement dataset can be further refined to retain a subset of predicted binding affinities for specific MHC class II alleles. For example, entries for HLA-DP, HLA-DQ, and/or HLA-DR alleles can be retained. For example, peptides of a specific length can be retained. Peptides with a length of at least about 5 amino acids to about 40 amino acids can be retained. The amino acid length of the peptide can be about 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40 amino acids. The peptides in the training set can have the same or different lengths and can vary depending on the type of MHC allele. Typically, peptides are about 13 to about 35 amino acids long. Peptides containing post-translational modifications or atypical amino acids can be removed.
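The refinement rules in the two preceding paragraphs can be sketched as a single filter. The length ranges use the preferred and typical values stated above (about 5 to 15 residues for class I, about 13 to 35 for class II); the allele-name prefixes, the canonical alphabet check, and the function name `keep_entry` are assumptions for illustration.

```python
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Preferred/typical length ranges from the text; allele prefixes are illustrative.
LENGTH_RANGE = {"I": (5, 15), "II": (13, 35)}
ALLELE_PREFIXES = {
    "I": ("HLA-A", "HLA-B", "HLA-C"),
    "II": ("HLA-DP", "HLA-DQ", "HLA-DR"),
}


def keep_entry(peptide, allele, mhc_class):
    """Decide whether a (peptide, allele) entry stays in the refined dataset."""
    shortest, longest = LENGTH_RANGE[mhc_class]
    return (
        allele.startswith(ALLELE_PREFIXES[mhc_class])
        and shortest <= len(peptide) <= longest
        # Any character outside the 20 canonical residues (a modification
        # annotation or an atypical amino acid) removes the entry.
        and set(peptide) <= CANONICAL_AMINO_ACIDS
    )


entries = [
    ("SIINFEKL", "HLA-A*02:01"),
    ("SIINFEKX", "HLA-A*02:01"),     # atypical residue, removed
    ("SIINFEKL", "HLA-DRB1*01:01"),  # class II allele, removed from class I set
]
class_i_entries = [e for e in entries if keep_entry(e[0], e[1], "I")]
```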
[0061] MHC class I or MHC class II binding affinity can be modeled with a regression model. Specifically, a loss function can be used. Exemplary loss functions include cross-entropy loss, mean squared error, Huber loss, Kullback-Leibler divergence, MAE (L1), MAE (L3), likelihood function, and hinge loss. In particular, variations of the mean squared loss function can be used. The mean squared loss function, denoted $L_{BA\text{-}MSE}$, supports both quantitative and qualitative peptide-MHC binding affinity measurements in the dataset: a measurement associated with an inequality (> or <) contributes to the loss only when the inequality is violated. A possible expression is:
[0062]
[0063]
[0064] Scheme 1
[0065] Here, $y_i$ and $\hat{y}_i$ represent the target and predicted peptide-MHC binding affinity values for the i-th sample. The affinity target can be converted before training: the IC50 nM value is converted from the range [0, 50000] to a target value in the range [0, 1]. The following function can be used to convert the IC50 nM value:
[0066]
[0067] Scheme 2
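The expressions of the two schemes are not reproduced in this text. The Python sketch below shows one way a loss with the described behavior could be written, under stated assumptions: the target conversion uses a logarithmic transform, 1 - log(IC50) / log(50000), with IC50 clipped to [1, 50000] nM, which is a commonly used choice and not necessarily the one in the original scheme; and the relation attached to each sample is expressed in the transformed [0, 1] space. The names `ic50_to_target` and `ba_mse` are hypothetical.

```python
import math

MAX_IC50 = 50000.0


def ic50_to_target(ic50_nm):
    """Map IC50 (nM) to [0, 1]; strong binders map close to 1 (assumed log transform)."""
    clipped = min(max(ic50_nm, 1.0), MAX_IC50)
    return 1.0 - math.log(clipped) / math.log(MAX_IC50)


def ba_mse(targets, preds, relations):
    """Mean squared error in which '<' and '>' samples count only when violated.

    relations[i] is '=', '<', or '>' and states how the true value relates to
    targets[i] in the transformed [0, 1] space.
    """
    total = 0.0
    for y, y_hat, rel in zip(targets, preds, relations):
        if rel == "=":
            err = y_hat - y
        elif rel == "<":   # true value lies below y: penalize predictions above y
            err = max(y_hat - y, 0.0)
        elif rel == ">":   # true value lies above y: penalize predictions below y
            err = max(y - y_hat, 0.0)
        else:
            raise ValueError(f"unknown relation: {rel!r}")
        total += err * err
    return total / len(targets)


print(ic50_to_target(500.0))
print(ba_mse([0.5, 0.5], [0.7, 0.3], ["=", "<"]))
```

Because the assumed transform is decreasing in IC50, a measurement reported as IC50 > 5,000 nM corresponds to the relation "<" in the transformed space.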
[0068] MHC class I or MHC class II affinity measurement datasets may contain measurements of the binding affinity of at least approximately 5,000, approximately 10,000, approximately 15,000, approximately 20,000, approximately 25,000, approximately 30,000, approximately 35,000, approximately 40,000, approximately 45,000, approximately 50,000, approximately 60,000, approximately 70,000, approximately 80,000, approximately 90,000, approximately 100,000, approximately 150,000, approximately 200,000, approximately 250,000, approximately 300,000, approximately 350,000, approximately 400,000, approximately 450,000, approximately 500,000, approximately 550,000, approximately 600,000, approximately 650,000, approximately 700,000, approximately 750,000, approximately 800,000, approximately 850,000, approximately 900,000, approximately 950,000, approximately 1,000,000, approximately 1,250,000, approximately 1,500,000, approximately 1,750,000, approximately 2,000,000, or more peptides to MHC class I or MHC class II alleles. Typically, an MHC class I or MHC class II affinity measurement dataset contains at least approximately 20,000 unique peptides.
[0069] Cell surface peptide presentation datasets can contain peptides known to be presented by HLA molecules. For example, cell surface peptides can be identified by peptide elution experiments or from mass spectrometry data. Cell surface peptide presentation datasets can be obtained from one or more data sources, such as publicly available data sources. For example, peptides obtained from the Immune Epitope Database ("IEDB", iedb.org) or the SysteMHC project may be useful data sources. Cell surface peptide presentation datasets can also be generated experimentally. For example, peptides can be prepared by eluting peptides from cell lines expressing HLA molecules and analyzing said peptides by mass spectrometry. The training dataset can be further expanded based on one or more data sources.
[0070] These training datasets are typically selected for the methods disclosed herein. The peptide sequences are usually represented as strings, where each character represents an amino acid. The peptide sequences can be converted into numerical vectors that include information describing the amino acids of the peptide and the positions of those amino acids. The numerical vectors can be binary encoded. For example, a peptide sequence $p_i$ of $k_i$ amino acids is represented by a row vector of $20 \cdot k_i$ elements, where the single element corresponding to the amino acid at a specific position in the sequence has a value of 1 and the remaining elements have values of 0. For example, for the amino acid alphabet A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y, the 3-amino acid peptide sequence AFP can be represented by a row vector of 60 elements (shown here in blocks of 20, one block per position): $p_i$ = 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0. When the training dataset contains amino acid sequences of varying lengths, the numerical vector may include padding characters so that all peptides are encoded at equal length. Padding characters may be applied to the left or right of the peptide sequence. Those skilled in the art will recognize that other types of encoding schemes can be applied.
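The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the disclosed method: the function name `one_hot` is hypothetical, and representing a padded position as a block of 20 zeros is an assumption (the text does not state how padding characters are encoded).

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(peptide, length=None, pad_left=False):
    """Flatten a peptide into a row vector of 20 * length elements.

    Padded positions are encoded as 20 zeros (an assumption of this sketch).
    """
    length = len(peptide) if length is None else length
    if len(peptide) > length:
        raise ValueError("peptide is longer than the requested length")
    rows = []
    for aa in peptide:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(aa)] = 1  # exactly one element per position is set
        rows.append(row)
    pad = [[0] * len(ALPHABET) for _ in range(length - len(peptide))]
    rows = pad + rows if pad_left else rows + pad
    return [value for row in rows for value in row]


vector = one_hot("AFP")
print(len(vector), [i for i, v in enumerate(vector) if v == 1])
```

For AFP the non-zero entries fall at zero-based indices 0, 24, and 52, that is, A in the first block, F in the second, and P in the third.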
[0071] Loss functions can be applied to cell surface peptide presentation datasets. Exemplary loss functions include cross-entropy loss, mean squared error, Huber loss, Kullback-Leibler divergence, MAE (L1), MAE (L3), likelihood function, and hinge loss. Specifically, the cross-entropy loss function can be applied to the cell surface peptide presentation dataset. In a particular implementation, focal loss for binary classification can be used. Focal loss can be used to mitigate the effect of class imbalance in the dataset. Focal loss, denoted $L_{P\text{-}FL}$, is a weighted extension of the standard binary cross-entropy loss that places greater emphasis on poorly classified samples. The focal loss expression is as follows:
[0072]
[0073] where $p_{t,i}$, the predicted probability of the correct class, is
[0074]
[0075] Scheme 3
[0076] γ is a real-valued parameter, which is set to 1. In the binary case, $y_i \in \{0, 1\}$ is the ground-truth label, and $\hat{y}_i$ is the predicted presentation probability of the i-th sample.
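The expression itself is not reproduced in this text. As a hedged illustration, the sketch below implements the standard binary focal loss, in which each sample contributes -(1 - p_t)^γ · log(p_t), with p_t equal to the predicted probability when the label is 1 and to one minus the predicted probability otherwise. Whether the original scheme includes further weighting terms is not known; the function name `focal_loss`, the averaging over samples, and the clipping constant `eps` are assumptions.

```python
import math


def focal_loss(labels, probs, gamma=1.0, eps=1e-7):
    """Mean binary focal loss over a batch (standard form, assumed here)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        p_t = p if y == 1 else 1.0 - p         # probability of the correct class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(labels)


# A confidently correct sample contributes far less than a badly wrong one.
print(focal_loss([1], [0.9]), focal_loss([1], [0.1]))
```

With γ = 0 the expression reduces to the ordinary binary cross-entropy, which is why focal loss can be described as a weighted extension of it.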
[0077] A ranking objective can further be used for training on cell surface peptide presentation datasets. For example, N-way classification can be used for ranking-oriented training. N-way classification allows positive and negative samples in the dataset to compete. The samples are grouped so that each group of N samples contains one positive sample (N = the number of negative samples + 1), and the model classifies which sample in the group is the positive one. For the N-way classification loss, cross-entropy or focal loss functions can be applied.
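One way to realize such a ranking objective is a softmax cross-entropy over the scores of one positive sample and its competing negative samples. The sketch below assumes the model emits an unnormalized score (logit) per sample; that assumption, and the function name `n_way_loss`, are illustrative and not taken from the text.

```python
import math


def n_way_loss(pos_score, neg_scores):
    """Cross-entropy of the positive sample within a group of N = len(neg_scores) + 1."""
    scores = [pos_score] + list(neg_scores)
    top = max(scores)
    # log-sum-exp computed stably by subtracting the maximum score
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score


print(n_way_loss(5.0, [0.0, 0.0]))   # positive ranked first: small loss
print(n_way_loss(0.0, [5.0, 0.0]))   # a negative outranks the positive: large loss
```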
[0078] Cell surface peptide presentation datasets may contain at least approximately 5,000, approximately 10,000, approximately 15,000, approximately 20,000, approximately 25,000, approximately 30,000, approximately 35,000, approximately 40,000, approximately 45,000, approximately 50,000, approximately 60,000, approximately 70,000, approximately 80,000, approximately 90,000, approximately 100,000, approximately 150,000, approximately 200,000, approximately 250,000, approximately 300,000, approximately 350,000, approximately 400,000, approximately 450,000, approximately 500,000, approximately 550,000, approximately 600,000, approximately 650,000, approximately 700,000, approximately 750,000, approximately 800,000, approximately 850,000, approximately 900,000, approximately 950,000, approximately 1,000,000, approximately 1,250,000, approximately 1,500,000, approximately 1,750,000, approximately 2,000,000, or more peptides. Training datasets with more than 35,000 samples are preferred.
[0079] The neural network model can be trained on all training data or a portion thereof. For example, a neural network model can be trained on approximately 100% of the training data, approximately 95%, approximately 90%, approximately 85%, approximately 80%, approximately 75%, approximately 70%, approximately 65%, approximately 60%, approximately 55%, or less of the training dataset. A neural network model can be trained on all training data in the MHC class I or MHC class II affinity measurement set and on all training data in the cell surface peptide presentation training dataset. For example, a neural network model can be trained on approximately 100%, approximately 95%, approximately 90%, approximately 85%, approximately 80%, approximately 75%, approximately 70%, approximately 65%, approximately 60%, approximately 55%, or less of the MHC class I or MHC class II affinity measurement dataset and / or approximately 100%, approximately 95%, approximately 90%, approximately 85%, approximately 80%, approximately 75%, approximately 70%, approximately 65%, approximately 60%, approximately 55%, or less of the cell surface peptide presentation training dataset.
[0080] In one implementation, training data from one or more training datasets can be cross-trained. For example, an MHC class I or MHC class II affinity measurement dataset and a cell surface peptide presentation dataset can be cross-trained. Each dataset typically contains a single known target. For example, the MHC class I or MHC class II affinity measurement data includes peptide affinity, and the cell surface peptide presentation dataset contains peptides that can be presented on the cell surface by MHC class I or MHC class II proteins. To cross-train the training datasets, the missing target for each training set can be inferred. For example, peptides presented on the cell surface can be inferred to have high binding affinity values, and peptides not presented on the cell surface can be inferred to have low binding affinity. Conversely, peptides with high binding affinity can be inferred to be peptides presented on the cell surface, while peptides with low binding affinity can be inferred to be peptides not presented on the cell surface.
[0081] In one implementation, self-distillation can be performed on training data from one or more training datasets. Self-distillation can be performed by extracting binding affinity and presentation estimates from multiple samples. These samples can be added to corresponding weakly labeled data in the training dataset. Self-distillation can be performed using multi-allelic spectral data. Self-distillation can also be performed using positive presenters. For positive presenters in the training dataset with unknown binding affinity, an established model can be used to estimate the binding affinity.
[0082] Neural network models are preferably trained on both positive and negative training data to limit bias. Training a network on an imbalanced dataset can bias the model by learning more representations of the dominant class, while other classes may be ignored. For example, a neural network model trained only on a positive training dataset may be biased towards overpredicting the binding affinity of peptide tumor-specific neoantigens to MHC class I or MHC class II, or overpredicting the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins. A neural network model trained only on a negative training set may be biased towards underpredicting the binding affinity of peptide tumor-specific neoantigens to MHC class I or MHC class II, or underpredicting the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins.
[0083] MHC class I or MHC class II affinity measurement datasets typically include both positive and negative training data. For example, positive training data may include binding affinity predictions classified as positive (e.g., binding affinity <5,000 nM). Negative training data may include binding affinity predictions classified as negative (e.g., binding affinity >5,000 nM). If desired, additional negative training data can be incorporated into the training set by augmenting the training dataset to include random peptides with low affinity. For example, random peptides may have qualitatively weak affinity targets of approximately >20,000 nM.
[0084] Cell surface peptide presentation training datasets typically contain positive training data (e.g., peptides presented on the cell surface by MHC class I proteins) and do not contain negative training data (e.g., peptides that cannot be presented on the cell surface by MHC class I proteins). When the training dataset does not contain negative training data, the positive training dataset can be used to generate a probabilistic negative training dataset (e.g., a negative training dataset derived from a positive training dataset). Negative training datasets can be generated by shuffling 'positive' peptides of HLA alleles. Peptides can be shuffled by changing amino acid lengths (e.g., making the peptide longer or shorter). Alternatively, the peptide amino acid sequence can be modified by, for example, amino acid substitution, insertion, or deletion. Insertion includes fusion of amino and / or carboxyl termini and intra-sequence insertion of one or more amino acid residues. Deletion is characterized by the removal of one or more amino acid residues from the peptide sequence. Amino acid substitution is usually a single-residue substitution, but can also occur at multiple positions. Substitutions, deletions, insertions, or any combination thereof can be combined to obtain peptides that are not presented on the cell surface by MHC class I or MHC class II proteins. For example, a peptide sequence having the following amino acid sequence: AVGGGERRYIKL can be modified to: CVGGGEHRYIMNNL.
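The peptide shuffling described above can be sketched as random substitutions, insertions, and deletions applied to a presented peptide. This is a minimal illustration under stated assumptions: the function names, the number of edits, and the uniform choice of operations and residues are all hypothetical, and a mutated peptide is only presumed, not shown, to be non-presented.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def mutate_peptide(peptide, rng, n_edits=3):
    """Apply n_edits random substitutions, insertions, or deletions."""
    residues = list(peptide)
    for _ in range(n_edits):
        op = rng.choice(("sub", "ins", "del"))
        if op == "del" and len(residues) > 1:
            del residues[rng.randrange(len(residues))]
        elif op == "ins":
            residues.insert(rng.randrange(len(residues) + 1), rng.choice(ALPHABET))
        else:
            # Substitution; also the fallback when a deletion would empty the peptide.
            residues[rng.randrange(len(residues))] = rng.choice(ALPHABET)
    return "".join(residues)


def make_negatives(positives, seed=0):
    """Derive presumed negatives, skipping any candidate that is a known positive."""
    rng = random.Random(seed)
    known = set(positives)
    negatives = []
    for peptide in positives:
        candidate = mutate_peptide(peptide, rng)
        if candidate not in known:
            negatives.append(candidate)
    return negatives


positives = ["AVGGGERRYIKL", "SIINFEKL"]
print(make_negatives(positives))
```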
[0085] Alternatively, or in combination with peptide shuffling, HLA shuffling can be used to generate negative training datasets. HLA alleles classified as 'positive' (e.g., HLA alleles that present the corresponding peptide on the cell surface) can be replaced with different alleles that do not belong to the supertype of the positive allele.
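HLA shuffling can be sketched as re-pairing a presented peptide with an allele drawn from a different supertype. The supertype table below is a small illustrative mapping assumed for this sketch, and the function name `shuffle_hla` is hypothetical; a real pipeline would use a curated supertype classification.

```python
import random

# Illustrative supertype assignments, assumed for this sketch only.
SUPERTYPE = {
    "HLA-A*02:01": "A02",
    "HLA-A*02:06": "A02",
    "HLA-A*03:01": "A03",
    "HLA-B*07:02": "B07",
}


def shuffle_hla(peptide, positive_allele, rng):
    """Pair the peptide with an allele outside the positive allele's supertype."""
    own = SUPERTYPE[positive_allele]
    candidates = sorted(a for a, s in SUPERTYPE.items() if s != own)
    if not candidates:
        raise ValueError("no allele outside the positive supertype")
    return peptide, rng.choice(candidates)


print(shuffle_hla("SIINFEKL", "HLA-A*02:01", random.Random(0)))
```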
[0086] Alternatively, or in combination with peptide shuffling and / or HLA shuffling, HRV negative sampling can be used to generate negative training datasets.
[0087] The training data can be further filtered to remove redundant peptides. For example, duplicate peptides (e.g., those with the same amino acid sequence) can be removed, so that the training dataset contains unique peptides. Those skilled in the art will readily understand how to identify the peptides (i.e., determine whether the peptides are the same or different).
[0088] The trained neural network model can be validated using an immunogenicity dataset. Validating the neural network model may include applying one or more ranking metrics to the immunogenicity dataset. Peptides in the immunogenicity validation dataset can be ranked based on their predicted MHC class I or MHC class II binding affinity and the numerical probability that the peptide will be presented on the cell surface by MHC class I or MHC class II proteins. Ranking metrics for all alleles can be aggregated. The ranking metrics can be aggregated using weighted allele frequencies.
[0089] In one implementation, unlabeled datasets can be used to train neural network models. For example, peptides in a cell surface peptide presentation dataset may be unlabeled. Without being bound by theory, it is believed that unlabeled datasets (e.g., peptide sequences) can provide numerical vector representations that more accurately characterize the input peptide sequence.
[0090] III. Model Architecture
[0091] This disclosure relates to using a neural network model to jointly predict the binding affinity of tumor-specific neoantigens to MHC class I or MHC class II and, for each peptide of interest, the numerical probability that the corresponding peptide will be presented by MHC class I or MHC class II proteins on the cell surface (i.e., the surface of tumor cells). The neural network model is suitable both for tumor-specific neoantigens that it has encountered during training and for those that it has not.
[0092] A neural network model can be a single neural network comprising a series of nodes arranged in one or more layers. Nodes can be connected to other nodes via connections with their respective association parameters. The value at a particular node can be represented as the sum of the values of the nodes connected to that particular node, weighted by the association parameters and mapped by an activation function associated with that particular node. The neural network model used in the methods described herein can be a pan-allele model, an allele-specific model, a supertype-specific model, or a combination thereof.
[0093] In one specific embodiment, the method includes converting a peptide sequence of a tumor-specific neoantigen into a numerical vector. The peptide sequence is typically represented as a string, where each character represents an amino acid. The peptide sequence can be converted into a numerical vector containing information describing the amino acids and their positions within the peptide. The numerical vector can be binary encoded. For example, a peptide sequence $p_i$ of $k_i$ amino acids is represented by a row vector of $20 \cdot k_i$ elements, where the single element corresponding to the amino acid at a specific position in the sequence has a value of 1 and the remaining elements have values of 0. For example, for the amino acid alphabet A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y, the 4-amino acid peptide sequence AGQY can be represented by a row vector of 80 elements (shown here in blocks of 20, one block per position): $p_i$ = 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.
[0094] The peptide sequence of a tumor-specific neoantigen can be from about 5 amino acids to about 40 amino acids in length. For example, the peptide sequence can be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40 amino acids. MHC class I molecules bind to short peptides. MHC class I molecules can accommodate peptides that are typically from about 5 amino acids to about 10 amino acids in length. In one embodiment, the peptide sequence of the tumor-specific neoantigen is a short peptide with a length of about 5 amino acids to about 10 amino acids. MHC class II molecules bind to longer peptides. MHC class II molecules can accommodate peptides that are typically about 13 to about 25 amino acids in length. In another embodiment, the peptide sequence of the tumor-specific neoantigen is a long peptide with a length of about 13 to about 25 amino acids.
[0095] The peptide sequences of tumor-specific neoantigens can be of the same or different lengths. When the peptide sequences of tumor-specific neoantigens have different lengths (e.g., one peptide sequence is 7 amino acids and another peptide sequence is 15 amino acids), padding characters can be added to the numerical vector such that each peptide of the tumor-specific neoantigen reaches the maximum peptide length including flanking regions (e.g., 15 amino acids). Padding characters can be added to the C-terminus or N-terminus of the flanking regions. As an example, padding characters would encode amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y.
[0096] The peptide sequence of a tumor-specific neoantigen may include sequences flanking the tumor-specific neoantigen peptide. The flanking sequence may be located directly to the left of the tumor-specific neoantigen peptide sequence, directly to the right of the tumor-specific neoantigen peptide sequence, or both.
[0097] In an embodiment, the peptide sequence of the tumor-specific neoantigen may include at least one C-terminal sequence flanking the tumor-specific neoantigen peptide within its source protein sequence, or at least one N-terminal sequence flanking the tumor-specific neoantigen peptide within its source protein sequence. Preferably, the peptide sequence of the tumor-specific neoantigen includes at least one C-terminal amino acid sequence flanking the tumor-specific neoantigen peptide and at least one N-terminal amino acid sequence flanking the tumor-specific neoantigen peptide.
[0098] The length of the flanking region can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40 or more amino acids. The length of the flanking region directly to the left of the tumor-specific neoantigen peptide can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more amino acids. The length of the flanking region directly to the right of the tumor-specific neoantigen peptide can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more amino acids. In an embodiment, the peptide sequence of the tumor-specific neoantigen includes a flanking region directly to the left of the tumor-specific neoantigen with a length of up to about 10 amino acids and / or a flanking region directly to the right of the tumor-specific neoantigen with a length of up to about 10 amino acids. Preferably, the peptide sequence includes a 5-amino-acid-long flanking region directly to the left of the tumor-specific neoantigen and a 5-amino-acid-long flanking region directly to the right of the tumor-specific neoantigen. The flanking regions can be encoded into numerical vectors in the same manner as described above.
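Extracting the flanking regions from a source protein sequence can be sketched as simple slicing. The function name `with_flanks`, the zero-based `start` index, and the example protein string are assumptions made for illustration; flanks are truncated when the peptide lies near either end of the protein.

```python
def with_flanks(protein, start, length, flank=5):
    """Return (left flank, peptide, right flank) taken from a source protein.

    `start` is the zero-based index of the first peptide residue.
    """
    end = start + length
    left = protein[max(0, start - flank):start]
    right = protein[end:end + flank]
    return left, protein[start:end], right


protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence
print(with_flanks(protein, 7, 9))   # 5-residue flanks on both sides
print(with_flanks(protein, 2, 5))   # left flank truncated at the N-terminus
```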
[0099] The method further includes converting HLA allele pseudo-sequences into numerical vectors. The HLA allele pseudo-sequence represents an HLA allele. An HLA allele pseudo-sequence can be from approximately 5 amino acids to approximately 100 amino acids in length. For example, the length of the HLA allele pseudo-sequence can be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 amino acids. The length of the HLA allele pseudo-sequence can be approximately 30 to approximately 60 amino acids. In the specific embodiments of the method disclosed herein, the length of the HLA allele pseudo-sequence is approximately 40 to approximately 50 amino acids.
[0100] The input to the neural network model may include (i) a numerical vector containing a peptide sequence of a tumor-specific neoantigen and flanking regions of the peptide sequence, and (ii) a numerical vector containing HLA allele pseudo-sequences. Optionally, the input to the neural network may include a segment identifier sequence. The segment identifier sequence informs the model which segment each amino acid belongs to. Next, (i) the numerical vector containing the peptide sequence of the tumor-specific neoantigen and flanking regions of the peptide sequence, and (ii) the numerical vector containing the HLA allele pseudo-sequences, are passed through one or more embedding layers. The embedding layers map the high-dimensional vectors containing the peptide sequence of the tumor-specific neoantigen and flanking regions of the peptide sequence, as well as the numerical vectors containing the HLA allele pseudo-sequences, into a low-dimensional space. The embedding layers can be considered the first layer of the neural network model. The outputs of the embedding layers can then be flattened to produce numerical vector representations of each peptide sequence, flanking region, and HLA allele pseudo-sequence of the tumor-specific neoantigen.
[0101] To predict the binding affinity of tumor-specific neoantigen peptides to MHC class I or MHC class II, the tumor-specific neoantigen peptide sequence and the HLA allele pseudo-sequence are concatenated, that is, linked together end to end. Flanking regions do not need to be concatenated for predicting MHC class I or MHC class II binding affinity; although not required, they may be included in some cases. Once the tumor-specific neoantigen peptide sequence and HLA pseudo-sequence have been concatenated, one or more parameters (e.g., layers and / or functions) can be applied.
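The embedding and flattening steps, together with a segment identifier sequence, can be illustrated with a small lookup table. Multiplying a one-hot vector by an embedding weight matrix is equivalent to selecting one row of that matrix, which is what the lookup does. The embedding dimension, the random initialization, the segment numbering, and all function names below are assumptions made for this sketch; in a trained model the table entries would be learned parameters.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def make_embedding(dim, seed=0):
    """Build a residue -> dense vector table with random (untrained) values."""
    rng = random.Random(seed)
    return {aa: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for aa in ALPHABET}


def embed_and_flatten(sequence, table):
    """Look up each residue and flatten the result into a single row vector."""
    return [value for aa in sequence for value in table[aa]]


def segment_ids(left_flank, peptide, right_flank, hla_pseudo):
    """Mark which segment each position belongs to (0, 1, 2, 3 by convention here)."""
    ids = []
    for seg_id, segment in enumerate((left_flank, peptide, right_flank, hla_pseudo)):
        ids.extend([seg_id] * len(segment))
    return ids


table = make_embedding(dim=4)
peptide_vec = embed_and_flatten("SIINFEKL", table)
print(len(peptide_vec))
print(segment_ids("AB", "CDE", "F", "GH"))
```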
[0102] Exemplary layers that can be applied include, but are not limited to, fully connected dense layers, sequence layers, activation layers, normalization layers, dropout layers, cropping layers, pooling layers and unpooling layers, combination layers, object detection layers, or generative adversarial network layers.
[0103] Exemplary convolutional and fully connected layers include 2D convolutional layers, 3D convolutional layers, 2D grouped convolutional layers, transposed 2D convolutional layers, transposed 3D convolutional layers, or fully connected dense layers. Exemplary sequence layers include sequence input layers, LSTM layers, bidirectional LSTM layers, GRU layers, sequence folding layers, sequence unfolding layers, flatten layers, or word embedding layers. Exemplary activation layers include ReLU layers, leaky ReLU layers, clipped ReLU layers, ELU activation layers, hyperbolic tangent activation layers, or PReLU layers. Exemplary normalization, dropout, and cropping layers include batch normalization layers, group normalization layers, channel-wise local response normalization layers, dropout layers, 2D crop layers, 3D crop layers, 2D resize layers, and 3D resize layers. Exemplary pooling and unpooling layers include average pooling layers, 3D average pooling layers, global average pooling layers, 3D global average pooling layers, max pooling layers, 3D max pooling layers, global max pooling layers, or max unpooling layers. Exemplary combination layers include addition layers, multiplication layers, depth concatenation layers, and weighted average layers. Exemplary object detection layers include ROI input layers, ROI max pooling layers, ROI alignment layers, anchor box layers, region proposal layers, SSD merge layers, space-to-depth layers, region proposal network layers, focal loss layers, and box regression layers.
[0104] In an implementation, one or more fully connected dense layers may be applied. A fully connected layer multiplies the input by a weight matrix and then adds a bias vector. For example, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more fully connected dense layers may be applied. When predicting MHC class I or MHC class II binding affinity, at least three fully connected dense layers are preferably applied.
[0105] In an implementation, one or more activation layers (functions) may be applied. Activation functions may be assigned to neurons or entire layers of neurons. Exemplary activation functions that may be applied are ELU activation functions or ReLU activation functions. Other activation layers described above and / or known to those skilled in the art may be applied. An activation function transforms the summed, weighted input of a node into the activation or output of that node. For example, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more activation layers (functions) may be applied. Typically, about 1, 2, 3, 4, or 5 activation functions may be applied.
[0106] One or more dropout layers can be applied. Dropout layers help reduce overfitting and thus provide better results. For example, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more dropout layers can be applied. Typically, about one, two, three, four, or five dropout layers can be applied.
[0107] In an exemplary neural network model, one or more fully connected dense layers, one or more activation functions, and one or more dropout layers may be applied. In a preferred neural network model, one or more fully connected dense layers, an activation function (e.g., the ELU activation function), and one or more dropout layers may be applied.
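The combination of a fully connected dense layer, an ELU activation, and a dropout layer can be sketched as a forward pass in plain Python. This is a minimal illustration, not the disclosed model: the weights are supplied by the caller rather than learned, the dropout variant shown is "inverted" dropout (surviving units are rescaled during training), and all function names and the example numbers are assumptions.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: multiply the input by a weight matrix, add a bias vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, alpha * (exp(v) - 1) otherwise."""
    return [v if v > 0 else alpha * (math.exp(v) - 1.0) for v in x]


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


rng = random.Random(0)
x = [1.0, 2.0]
hidden = dropout(elu(dense(x, [[1.0, 0.0], [0.5, -1.5]], [0.0, 1.0])), 0.5, rng)
print(hidden)
```

At inference time the dropout layer is called with `training=False` and passes its input through unchanged.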
[0108] To enable better learning of sequence representations, one or more LSTM layers or one or more bidirectional LSTM layers can be applied. Transformers can also be added. For example, about 1, about 2, about 3, about 4, about 5, about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15 or more transformer layers can be added. Transformers can add position embeddings to the embedded amino acid sequence and can include one or more stacked encoder layers. For example, transformers can include one or more multi-head attention layers, one or more dropout layers, one or more normalization layers, one or more feedforward layers or combinations thereof. Exemplary transformers may include: (1) a multi-head attention layer, (2) a dropout layer (0.1 ratio), (3) a normalization layer, (4) a feedforward layer (linear layer and ReLU layer), (5) a dropout layer (0.1 ratio), and (6) a normalization layer.
[0109] Regression models can be applied to predict the binding affinity of tumor-specific neoantigen peptides to MHC class I or MHC class II. In particular, variants of the mean squared loss function can be used. The mean squared loss function, denoted $L_{BA\text{-}MSE}$, supports both quantitative and qualitative peptide-MHC binding affinity measurements in the dataset: a measurement associated with an inequality (> or <) contributes to the loss only when the inequality is violated. The expressions that can be adopted are shown in Scheme 1 and Scheme 2.
[0110] The output includes a numerical score representing the binding affinity of the peptide ligand to MHC class I or MHC class II.
[0111] The neural network model further jointly predicts the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins. To predict the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins, the tumor-specific peptide sequences, corresponding flanking regions, and HLA allele pseudo-sequences are concatenated into a single numerical vector.
[0112] Once the tumor-specific neoantigen peptide sequence, the corresponding flanking region, and the HLA pseudosequence have been assembled, one or more parameters (e.g., layers and / or functions) can be applied.
[0113] One or more fully connected dense layers may be applied. For example, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more fully connected dense layers may be applied. To predict the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or II proteins, it is preferable to apply at least three fully connected dense layers.
[0114] One or more activation functions can be assigned to neurons or an entire layer of neurons. Exemplary activation functions that can be applied are ELU activation functions or ReLU activation functions. Other activation layers described above and / or known to those skilled in the art can be applied. An activation function transforms the summed, weighted input of a node into the activation or output of that node. For example, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more activation layers (functions) can be applied. Typically, about one, two, three, four, or five activation functions can be applied.
[0115] One or more dropout layers can be applied. Dropout layers help reduce overfitting and thus provide better results. For example, one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more dropout layers can be applied. Typically, about one, two, three, four, or five dropout layers can be applied.
[0116] Focal loss binary classification (see Scheme 3 above) can be used to predict the probability that tumor-specific neoantigens will be presented on the cell surface by MHC class I or MHC class II proteins. The output is the numerical probability that the peptide will be presented on the cell surface by MHC class I or MHC class II proteins.
[0117] The neural network model can be further calibrated. If the model is not calibrated, the neural network may overestimate or underestimate the probability. Therefore, calibrating the neural network described herein can improve the accuracy and confidence of the predicted probability. To calibrate the neural network, probability calculations can be applied. In particular, probability calculations can be applied to one or more of the subject's HLA alleles, for example, 1, 2, 3, 4, 5, or 6 HLA alleles. Probability calculations can be used to estimate the overall presentation probability across the subject's alleles based on the model's predictions for each HLA allele. The neural network can be further calibrated by calibrating the neural network presentation predictions on a validation dataset. For example, a low-degree polynomial can be fitted to the calibration curve. The polynomial coefficients can be constrained to be positive to obtain a monotonically increasing function. Lasso linear regression can be used. The calibrated predictions (e.g., the binding affinity of the tumor-specific neoantigen to MHC class I and the probability that the tumor-specific neoantigen will be presented on the cell surface by MHC class I proteins) can be used as an indicator of immunogenicity.
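The two calibration ideas above can be illustrated as follows. The text does not specify the probability calculation used to combine per-allele predictions; the sketch assumes the alleles act independently, so that the overall probability is one minus the product of the per-allele probabilities of non-presentation. The polynomial evaluation shows why non-negative coefficients yield a non-decreasing calibration function on [0, 1]; fitting the coefficients (for example, with a positivity-constrained Lasso regression on validation data) is not shown. Function names and example numbers are hypothetical.

```python
def any_allele_probability(per_allele_probs):
    """P(presented by at least one allele), assuming independence between alleles."""
    not_presented = 1.0
    for p in per_allele_probs:
        not_presented *= 1.0 - p
    return 1.0 - not_presented


def calibrate(p, coeffs):
    """Evaluate c0 + c1*p + c2*p**2 + ... for a raw probability p in [0, 1].

    Non-negative c1, c2, ... make the derivative non-negative for p >= 0,
    so the calibrated value never decreases as p increases.
    """
    if any(c < 0 for c in coeffs[1:]):
        raise ValueError("coefficients of degree >= 1 must be non-negative")
    return sum(c * p ** k for k, c in enumerate(coeffs))


print(any_allele_probability([0.5, 0.5]))
print(calibrate(0.5, [0.0, 0.5, 0.5]))
```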
[0118] The immunogenicity data described herein can be used to evaluate the performance of neural network models. One or more ranking metrics can be used to evaluate neural network predictions. Exemplary ranking metrics include, but are not limited to, top-k, Precision@K, nDCG@K, reciprocal rank, and positive predictive value. One or more ranking metrics can be used to rank all corresponding peptides for each allele in the immunogenicity dataset based on predicted cell surface peptide presentation probabilities and / or predicted binding affinity scores. The ranking metrics can then be summarized across alleles, weighted by allele frequency.
[0119] In one implementation, self-supervised pre-training can be performed. Exemplary pre-training tasks are masked language modeling and next peptide prediction.
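The per-allele ranking evaluation can be sketched as follows, using the standard definitions of Precision@K, nDCG@K, and reciprocal rank. The binary immunogenicity labels, the function names, and the frequency-weighted mean used for the summary are illustrative assumptions; the description does not fix the exact aggregation formula.

```python
import math

def rank_by_score(peptides):
    """peptides: list of (predicted_score, label); returns labels, best score first."""
    return [label for _, label in sorted(peptides, key=lambda t: t[0], reverse=True)]

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked peptides that are immunogenic (label 1)."""
    return sum(ranked_labels[:k]) / k if k > 0 else 0.0

def ndcg_at_k(ranked_labels, k):
    """Discounted cumulative gain of the top k, normalized by the ideal ordering."""
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first immunogenic peptide, or 0.0 if there is none."""
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def frequency_weighted_mean(metric_by_allele, allele_frequency):
    """Summarize a per-allele metric, weighting each allele by its frequency."""
    total = sum(allele_frequency[a] for a in metric_by_allele)
    return sum(metric_by_allele[a] * allele_frequency[a] for a in metric_by_allele) / total
```

A typical use would rank the peptides of each allele with `rank_by_score`, compute one metric per allele, and pass the resulting dictionary to `frequency_weighted_mean` together with population allele frequencies.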
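To make the masked language modeling task concrete, the sketch below prepares one training example by hiding residues of a peptide sequence; a model would then be trained to recover the hidden residues. The 15% masking rate, the `<mask>` token, and the function name are assumptions borrowed from common practice in language-model pre-training, and the next peptide prediction task is not shown.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token

def mask_peptide(peptide, mask_rate=0.15, rng=random):
    """Return (tokens, targets) for masked language modeling.

    tokens:  the peptide residues with some positions replaced by MASK.
    targets: the original residue at each masked position, None elsewhere.
    """
    tokens, targets = [], []
    for residue in peptide:
        if rng.random() < mask_rate:
            tokens.append(MASK)
            targets.append(residue)
        else:
            tokens.append(residue)
            targets.append(None)
    return tokens, targets
```

Because the targets come from the sequence itself, this kind of pre-training needs no presentation or immunogenicity labels.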
[0120] IV. Computer Implementation of the Method
[0121] Computer systems, programmed or otherwise configured, can be used to implement the methods disclosed herein. A computer system may include a single computing device or multiple computing devices interconnected using one or more computing networks. The computer system can use its computing capabilities to execute the neural network models described herein.
[0122] The computer system may include a central processing unit (CPU), which may be a single-core or multi-core processor, or multiple processors for parallel processing. The system may include memory (e.g., random access memory, read-only memory, flash memory), electronic storage units (e.g., cloud platforms), communication interfaces for communicating with one or more systems, and other peripheral devices, such as data storage devices, other memories, and display adapters. The memory, storage devices, interfaces, and peripheral devices may communicate with the CPU via a communication bus. Any or all of these components may communicate via a shared internal or external network, and the system may communicate with one or more user devices via a network. The network may be the Internet, an intranet and / or extranet, or an intranet and / or extranet that communicates with the Internet. The network may include one or more computer servers that enable distributed computing, such as cloud computing. The computer system may communicate with a processing system. The processing system may be configured to implement the methods disclosed herein.
[0123] Examples of computing devices include, but are not limited to, desktop computers, laptop computers, mobile phones, tablet computers, personal computers, wearable computers, servers, personal digital assistants (PDAs), hybrid PDAs / mobile phones, e-book readers, set-top boxes, voice command devices, cameras, digital media players, etc. In some embodiments, the computing device may have one or more user interfaces, command-line interfaces (CLI), application programming interfaces (APIs), and / or other programming interfaces for submitting training requests, deployment requests, and / or execution requests. In some embodiments, the computing device may execute standalone applications that interact with neural network models.
[0124] In some implementations, the network includes any wired network, wireless network, or a combination thereof. For example, the network can be a personal area network (PAN), local area network (LAN), wide area network (WAN), wireless broadcast network (e.g., for radio or television), cable network, satellite network, cellular telephone network, or a combination thereof. As another example, the network can be a publicly accessible network linking various other networks, possibly operated by a variety of different parties (such as the Internet). In some implementations, the network can be a private or semi-private network, such as a corporate or university intranet. The network may include one or more wireless networks, such as a Global System for Mobile Communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Long Term Evolution (LTE) network, or any other type of wireless network. The network may use protocols and components for communication via the Internet or any other of the aforementioned types of networks. For example, protocols used by the network may include HTTP, HTTP Secure (HTTPS), Message Queuing Telemetry Transport (MQTT), CoAP, etc. Protocols and components for communication via the Internet or any other of the aforementioned types of communication networks are well known to those skilled in the art and therefore will not be described in further detail herein.
[0125] Figure 9 illustrates an exemplary provider network (or "service provider system") environment according to some implementations. Provider network 900 may provide resource virtualization to customers via one or more virtualization services 910, which allow customers to purchase, rent, or otherwise obtain instances 912 of virtualized resources, including but not limited to computing and storage resources implemented on devices within one or more provider networks in one or more data centers. A local Internet Protocol (IP) address 916 may be associated with resource instance 912; the local IP address is the internal network address of resource instance 912 on provider network 900. In some implementations, provider network 900 may also provide customers with public IP addresses 914 and / or ranges of public IP addresses (e.g., Internet Protocol version 4 (IPv4) or Internet Protocol version 6 (IPv6) addresses) that can be obtained from provider network 900.
[0126] As is customary, provider network 900 may, via virtualization service 910, allow service provider customers (e.g., customers who operate one or more customer networks 950A-950C, each including one or more customer devices 952) to dynamically associate at least some public IP addresses 914 assigned or allocated to the customer with specific resource instances 912 assigned to the customer. Provider network 900 may also allow customers to remap public IP addresses 914 previously mapped to one virtualized computing resource instance 912 assigned to the customer to another virtualized computing resource instance 912 also assigned to the customer. Using the virtualized computing resource instances 912 and public IP addresses 914 provided by the service provider, service provider customers (such as operators of customer networks 950A-950C) may, for example, implement customer-specific applications and present these applications on an intermediate network 940 (such as the Internet). Then, other network entities 920 on the intermediate network 940 can generate traffic destined for a public IP address 914 published by customer networks 950A-950C; this traffic is routed to the service provider's data center, and at the data center, it is routed via the network layer to the local IP address 916 of the virtualized computing resource instance 912 currently mapped to the destination public IP address 914. Similarly, response traffic from the virtualized computing resource instance 912 can be routed back to the intermediate network 940 via the network layer to reach the source entity 920.
[0127] As used herein, a local IP address refers to an internal or "private" network address of a resource instance, such as within a provider network. Local IP addresses may reside within an address block reserved by the Internet Engineering Task Force (IETF) Request for Comments (RFC) 1918 and / or have an address format specified by IETF RFC 4193, and may vary within the provider network. Network traffic originating outside the provider network is not directly routed to local IP addresses; instead, such traffic uses public IP addresses mapped to the local IP address of the resource instance. Provider networks may include network devices or appliances that provide Network Address Translation (NAT) or similar functionality to perform mappings from public IP addresses to local IP addresses and vice versa.
[0128] A public IP address is a mutable (reassignable) Internet network address that is assigned to a resource instance by the service provider or by a customer. Traffic routed to a public IP address is translated (e.g., via 1:1 NAT) and forwarded to the corresponding local IP address of the resource instance.
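The relationship between local and public addresses can be illustrated with Python's standard `ipaddress` module. The sketch below checks whether an address falls within the RFC 1918 private IPv4 blocks or the RFC 4193 unique local IPv6 block, and models 1:1 NAT as a simple lookup table. The class `OneToOneNat` and its methods are hypothetical, and the example addresses are documentation addresses, not addresses used by any provider.

```python
import ipaddress

# Address blocks reserved by RFC 1918 (IPv4) and RFC 4193 (IPv6).
RFC1918_BLOCKS = [ipaddress.ip_network(n)
                  for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
RFC4193_BLOCK = ipaddress.ip_network("fc00::/7")

def is_local_address(addr):
    """True if addr lies in an RFC 1918 or RFC 4193 address block."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return any(ip in block for block in RFC1918_BLOCKS)
    return ip in RFC4193_BLOCK

class OneToOneNat:
    """Hypothetical 1:1 NAT table: each public IP maps to one local IP."""

    def __init__(self):
        self._public_to_local = {}

    def map(self, public_ip, local_ip):
        if not is_local_address(local_ip):
            raise ValueError(f"{local_ip} is not a local address")
        self._public_to_local[public_ip] = local_ip

    def translate_inbound(self, public_ip):
        """Destination rewrite for traffic arriving from outside the network."""
        return self._public_to_local[public_ip]

    def translate_outbound(self, local_ip):
        """Source rewrite for response traffic leaving the network."""
        for public_ip, mapped in self._public_to_local.items():
            if mapped == local_ip:
                return public_ip
        raise KeyError(local_ip)
```

For example, after `nat.map("203.0.113.10", "10.0.0.5")`, inbound traffic addressed to 203.0.113.10 would be forwarded to 10.0.0.5, and responses from 10.0.0.5 would leave with 203.0.113.10 as the source.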
[0129] Some public IP addresses may be assigned to specific resource instances by the provider's network infrastructure; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some implementations, the mapping of standard IP addresses to the local IP addresses of resource instances is the default startup configuration for all resource instance types.
[0130] At least some public IP addresses can be assigned to or obtained by customers of provider network 900; customers can then assign their assigned public IP addresses to specific resource instances assigned to them. These public IP addresses may be referred to as customer public IP addresses, or simply customer IP addresses. Instead of being assigned to resource instances by provider network 900 as in the case of standard IP addresses, customer IP addresses can be assigned by the customer, for example, via an API provided by the service provider. Unlike standard IP addresses, customer IP addresses are assigned to customer accounts and can be remapped by the respective customer to other resource instances as needed or desired. Customer IP addresses are associated with customer accounts, not with specific resource instances, and the customer controls the IP address until the customer chooses to release it. Unlike traditional static IP addresses, customer IP addresses allow customers to mask resource instance or availability zone failures by remapping their public IP addresses to any resource instance associated with their customer account. For example, customer IP addresses enable customers to engineer around problems with their resource instances or software by remapping their customer IP addresses to alternative resource instances.
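The account-level ownership and remapping behavior described above can be sketched with a small in-memory registry. The class `CustomerIpRegistry`, its method names, and the instance identifiers are hypothetical and do not correspond to any real provider API; the point is only that the address belongs to the account, so it can be moved between that account's instances until it is released.

```python
class CustomerIpRegistry:
    """Hypothetical registry of customer IP addresses and their current targets."""

    def __init__(self):
        self._ip_owner = {}        # public IP -> customer account
        self._instance_owner = {}  # instance id -> customer account
        self._mapping = {}         # public IP -> instance id

    def allocate(self, account, public_ip):
        self._ip_owner[public_ip] = account

    def register_instance(self, account, instance_id):
        self._instance_owner[instance_id] = account

    def associate(self, account, public_ip, instance_id):
        """Map (or remap) a customer IP to one of the account's own instances."""
        if self._ip_owner.get(public_ip) != account:
            raise PermissionError("address is not allocated to this account")
        if self._instance_owner.get(instance_id) != account:
            raise PermissionError("instance does not belong to this account")
        self._mapping[public_ip] = instance_id

    def release(self, account, public_ip):
        """Give the address back; the customer controls it until this point."""
        if self._ip_owner.get(public_ip) != account:
            raise PermissionError("address is not allocated to this account")
        self._ip_owner.pop(public_ip)
        self._mapping.pop(public_ip, None)

    def target(self, public_ip):
        return self._mapping.get(public_ip)
```

If the instance currently behind an address fails, calling `associate` again with a standby instance redirects traffic without changing the published address.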
[0131] Figure 10 is a block diagram of an exemplary provider network that provides storage services and hardware virtualization services to customers, according to some implementations. Hardware virtualization service 1020 provides multiple computing resources 1024 (e.g., VMs) to customers. Computing resources 1024 may be rented or leased, for example, to customers of provider network 1000 (e.g., to customers implementing customer network 1050). Each computing resource 1024 may have one or more local IP addresses. Provider network 1000 may be configured to route packets from the local IP addresses of computing resources 1024 to a public internet destination, and to route packets from a public internet source to the local IP addresses of computing resources 1024.
[0132] Provider network 1000 can provide client network 1050 (for example, coupled to intermediate network 1040 via local network 1056) with the ability to implement virtual computing systems 1092 via hardware virtualization service 1020, which is coupled to intermediate network 1040 and to provider network 1000. In some embodiments, hardware virtualization service 1020 can provide one or more APIs 1002, such as network service interfaces, via which client network 1050 can access the functionality provided by hardware virtualization service 1020, for example, via console 1094 (e.g., web-based applications, standalone applications, mobile applications, etc.). In some embodiments, at provider network 1000, each virtual computing system 1092 at client network 1050 can correspond to a computing resource 1024 that is leased, rented, or otherwise provided to client network 1050.
[0133] From an instance of virtual computing system 1092 and / or another client device 1090 (e.g., via console 1094), a client can access the functionality of storage service 1010, for example, via one or more APIs 1002, to access data from and store data to storage resources 1018A-1018N of virtual data storage devices 1016 (e.g., folders or "buckets," virtual volumes, databases, etc.) provided by provider network 1000. In some embodiments, a virtualized data storage gateway (not shown) may be provided at client network 1050, which may locally cache at least some data, such as frequently accessed or critical data, and may communicate with storage service 1010 via one or more communication channels to upload new or modified data from the local cache so that the primary store of the data (virtualized data storage device 1016) is maintained. In some implementations, via virtual computing system 1092 and / or another client device 1090, a user can mount and access virtual data storage device 1016 volumes via storage service 1010, which acts as a storage virtualization service, and these volumes can be presented to the user as local (virtual) storage devices 1098.
[0134] Although not shown in Figure 10, virtualization services can also be accessed from resource instances within provider network 1000 via API 1002. For example, a customer, equipment service provider, or other entity can access virtualization services from within a corresponding virtual network on provider network 1000 via API 1002 to request the allocation of one or more resource instances within that virtual network or in another virtual network.
[0135] In some implementations, a system that implements some or all of the techniques described herein may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media, such as computer system 1100 shown in Figure 11. In the illustrated embodiment, the computer system 1100 includes one or more processors 1110 coupled to system memory 1120 via an input / output (I / O) interface 1130. The computer system 1100 further includes a network interface 1140 coupled to the I / O interface 1130. Although Figure 11 shows computer system 1100 as a single computing device, in various embodiments, computer system 1100 may include a single computing device or any number of computing devices configured to work together as a single computer system 1100.
[0136] In various embodiments, computer system 1100 may be a single-processor system including one processor 1110, or a multiprocessor system including several processors 1110 (e.g., two, four, eight, or another suitable number). Processor 1110 may be any suitable processor capable of executing instructions. For example, in various embodiments, processor 1110 may be a general-purpose or embedded processor implementing any of a variety of instruction set architectures (ISAs), such as x86, ARM, PowerPC, SPARC, or MIPS ISA, or any other suitable ISA. In a multiprocessor system, each of the processors 1110 may commonly, but not necessarily, implement the same ISA.
[0137] System memory 1120 can store instructions and data accessible by processor 1110. In various embodiments, system memory 1120 can be implemented using any suitable memory technology, such as random access memory (RAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), non-volatile / flash memory, or any other type of memory. In the illustrated embodiments, program instructions and data implementing one or more desired functions, such as the methods, techniques, and data described above, are shown as being stored within system memory 1120.
[0138] In one embodiment, I / O interface 1130 may be configured to coordinate I / O communications between processor 1110, system memory 1120, and any peripheral devices (including network interface 1140 or other peripheral interfaces) within the device. In some embodiments, I / O interface 1130 may perform any necessary protocol, timing, or other data conversions to convert data signals from one component (e.g., system memory 1120) into a format suitable for use by another component (e.g., processor 1110). In some embodiments, I / O interface 1130 may include support for devices attached via various types of peripheral buses (such as, for example, the Peripheral Component Interconnect (PCI) bus standard or variants of the Universal Serial Bus (USB) standard). In some embodiments, the functionality of I / O interface 1130 may be divided among two or more separate components, such as a northbridge and a southbridge. Furthermore, in some embodiments, some or all of the functions of I / O interface 1130, such as the interface to system memory 1120, may be directly integrated into processor 1110.
[0139] Network interface 1140 can be configured to allow data to be exchanged between computer system 1100 and other devices 1160 attached to one or more networks 1150, such as other computer systems or devices as shown in Figure 1. In various embodiments, the network interface 1140 can support communication via any suitable wired or wireless general-purpose data network, such as various types of Ethernet. Additionally, the network interface 1140 can support communication via telecommunications / telephone networks such as analog voice networks or digital fiber optic communication networks, storage area networks (SANs) such as Fibre Channel SANs, or via any other suitable type of network and / or protocol for I / O.
[0140] In some implementations, computer system 1100 includes one or more offload cards 1170 (including one or more processors 1175 and possibly one or more network interfaces 1140) connected using I / O interface 1130 (e.g., a bus implementing a version of the Peripheral Component Interconnect Express (PCI-E) standard, or another interconnect such as QuickPath Interconnect (QPI) or UltraPath Interconnect (UPI)). For example, in some implementations, computer system 1100 may act as a host electronic device that hosts computing instances (e.g., operating as part of a hardware virtualization service), and one or more offload cards 1170 may execute a virtualization manager that can manage the computing instances running on the host electronic device. As an example, in some implementations, offload card 1170 may perform computing instance management operations such as pausing and / or unpausing computing instances, starting and / or terminating computing instances, performing dump / copy operations, etc. In some implementations, these management operations may be performed by the offload card 1170 in coordination with a hypervisor (e.g., upon request from the hypervisor) executed by other processors 1110A-1110N of the computer system 1100. However, in some implementations, the virtualization manager implemented by the offload card 1170 may accommodate requests from other entities (e.g., from the computing instances themselves) and may not coordinate with (or serve) any separate hypervisor.
[0141] In some embodiments, system memory 1120 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above. However, in other embodiments, program instructions and / or data may be received, transmitted, or stored on different types of computer-accessible media. Generally, computer-accessible media may include non-transitory storage media (storage media / memory media), such as magnetic or optical media, for example, a disc or DVD / CD coupled to computer system 1100 via I / O interface 1130. Non-transitory computer-accessible storage media may also include any volatile or non-volatile media that may be included as system memory 1120 or another type of memory in some embodiments of computer system 1100, such as RAM (e.g., SDRAM, Double Data Rate (DDR) SDRAM, SRAM, etc.), read-only memory (ROM), etc. In addition, computer-accessible media may include transmission media or signals (such as electrical signals, electromagnetic signals, or digital signals) transmitted via communication media (such as networks and / or wireless links), such as those implemented via network interface 1140.
[0142] The various implementations discussed or suggested herein can be implemented in a variety of operating environments, which in some cases may include one or more user computers, computing devices, or processing devices that can be used to operate any of a variety of applications. User or client devices may include any of a variety of general-purpose personal computers, such as desktop or laptop computers running standard operating systems, and cellular, wireless, and handheld devices running mobile software and capable of supporting a variety of network connectivity and messaging protocols. Such systems may also include multiple workstations running any of a variety of commercially available operating systems, as well as other known applications for purposes such as development and database management. These devices may also include other electronic devices, such as virtual terminals, thin clients, gaming systems, and / or other devices capable of communicating via a network.
[0143] Most implementations utilize at least one network familiar to those skilled in the art to support communication using any of a variety of widely available protocols, such as Transmission Control Protocol / Internet Protocol (TCP / IP), File Transfer Protocol (FTP), Universal Plug and Play (UPnP), Network File System (NFS), Common Internet File System (CIFS), Extensible Messaging and Presence Protocol (XMPP), AppleTalk, etc. The network may include, for example, a Local Area Network (LAN), a Wide Area Network (WAN), a Virtual Private Network (VPN), the Internet, an intranet, an extranet, the Public Switched Telephone Network (PSTN), an infrared network, a wireless network, and any combination thereof.
[0144] In implementations utilizing a web server, the web server can run any of a variety of server or middleware applications, including HTTP servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, data servers, Java servers, business application servers, etc. The server can also execute programs or scripts in response to requests from user devices, such as by executing one or more web applications, which can be implemented as one or more scripts or programs written in any programming language (such as Java®, C, C#, or C++) or any scripting language (such as Perl, Python, PHP, or TCL) and combinations thereof. The server may also include a database server, including but not limited to those commercially available from Oracle®, Microsoft®, Sybase®, IBM®, etc. The database server can be relational or non-relational (e.g., "NoSQL"), distributed or non-distributed, etc.
[0145] The environment disclosed herein may include various data storage devices and other memories and storage media as described above. They may reside in multiple locations, such as on storage media local to one or more computers (and / or residing within one or more computers), or on storage media remote from any or all of the computers across the network. In a particular set of embodiments, information may reside in a storage area network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing functions attributable to a computer, server, or other network device may be appropriately stored locally and / or remotely. When the system includes computerized devices, each such device may include hardware elements that can be electrically coupled via a bus, including, for example, at least one central processing unit (CPU), at least one input device (e.g., mouse, keyboard, controller, touchscreen, or keypad), and / or at least one output device (e.g., display device, printer, or speaker). Such systems may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices (such as random access memory (RAM) or read-only memory (ROM)), as well as removable media devices, memory cards, flash memory cards, etc.
[0146] Such devices may also include computer-readable storage medium readers, communication devices (e.g., modems, network interface cards (wireless or wired), infrared communication devices, etc.), and working memory as described above. The computer-readable storage medium reader may be connected to or configured to receive computer-readable storage media, representing remote, local, fixed, and / or removable storage devices, as well as storage media for temporarily and / or more permanently containing, storing, transmitting, and retrieving computer-readable information. Systems and various devices typically also include multiple software applications, modules, services, or other elements residing within at least one working memory device, including operating systems and applications such as client applications or web browsers. It should be understood that alternative embodiments may have various variations different from the embodiments described above. For example, custom hardware may also be used, and / or specific elements may be implemented in hardware, software (including portable software such as applets), or both. Furthermore, connections to other computing devices (e.g., network input / output devices) may be employed.
[0147] Storage media and computer-readable media used for containing code or portions thereof may include any suitable media known or used in the art, including storage media and communication media, such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing and / or transmitting information (such as computer-readable instructions, data structures, program modules or other data), including RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other storage technologies, optical disc read-only memory (CD-ROM), digital versatile optical disc (DVD) or other optical storage devices, magnetic tape cassettes, magnetic tape, disk storage devices or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible by system devices. Based on the disclosure and teachings provided herein, those skilled in the art will understand other ways and / or methods for implementing various embodiments.
[0148] Various implementation schemes have been described in the foregoing description. Specific configurations and details have been set forth for illustrative purposes to provide a thorough understanding of the implementation schemes. However, it will be apparent to those skilled in the art that the described implementation schemes can be practiced without the specific details. Furthermore, well-known features may be omitted or simplified to avoid obscuring the described implementation schemes.
[0149] Parenthesized text and blocks with dashed borders (e.g., large dashes, small dashes, dot-dash lines, and dotted lines) are used herein to illustrate optional operations for adding additional features to some embodiments. However, this notation should not be construed as implying that these are the only options or optional operations, and / or that blocks with solid borders are not optional in some embodiments.
[0150] V. Immunogenic Compositions
[0151] The present invention further relates to personalized (i.e., subject-specific) immunogenic compositions (e.g., cancer vaccines) comprising one or more tumor-specific antigens selected using the methods described herein. Such immunogenic compositions can be formulated according to standard procedures in the art. The immunogenic compositions are capable of eliciting a specific immune response.
[0152] The immunogenic composition can be formulated such that the selection and amount of the tumor-specific neoantigen are tailored to the subject's specific cancer. For example, the selection of the tumor-specific neoantigen can depend on the specific type of cancer, the stage of cancer, the subject's immune status, and the subject's MHC type.
[0153] The immunogenic composition may contain at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50 or more tumor-specific neoantigens. The immunogenic composition may contain about 10-20 tumor-specific neoantigens, about 10-30 tumor-specific neoantigens, about 10-40 tumor-specific neoantigens, about 10-50 tumor-specific neoantigens, about 10-60 tumor-specific neoantigens, about 10-70 tumor-specific neoantigens, about 10-80 tumor-specific neoantigens, about 10-90 tumor-specific neoantigens, or about 10-100 tumor-specific neoantigens. Preferably, the immunogenic composition contains at least about 10 tumor-specific neoantigens or at least about 20 tumor-specific neoantigens.
[0154] The immunogenic composition may further comprise natural or synthetic antigens. These natural or synthetic antigens can enhance the immune response. Exemplary natural or synthetic antigens include, but are not limited to, pan-DR epitopes (PADREs) and tetanus toxin antigens.
[0155] The immunogenic composition may be in any form, such as synthetic long peptides, RNA, DNA, cells, dendritic cells, nucleotide sequences, polypeptide sequences, plasmids, or vectors.
[0156] The tumor-specific neoantigen can also be contained in a viral vector-based vaccine platform, such as vaccinia, fowlpox, self-replicating alphavirus, Maraba virus, adenovirus (see, for example, Tatsis et al., Molecular Therapy, 10:616-629 (2004)), or lentivirus, including but not limited to second-generation, third-generation, or mixed second / third-generation lentiviruses and recombinant lentiviruses of any generation designed to target specific cell types or receptors (see, for example, Hu et al., Immunol Rev., 239(1):45-61 (2011); Sakuma et al., Biochem J., 443(3):603-18 (2012)). Depending on the packaging capacity of the aforementioned viral vector-based vaccine platform, this method can deliver one or more nucleotide sequences encoding one or more tumor-specific neoantigen peptides. The sequences can be flanked by non-mutated sequences, can be separated by linkers, or can be preceded by one or more sequences targeting subcellular compartments (see, for example, Gros et al., Nat Med., 22(4):433-8 (2016); Stronen et al., Science, 352(6291):1337-1341 (2016); Lu et al., Clin Cancer Res., 20(13):3401-3410 (2014)). Upon introduction into the host, infected cells express one or more tumor-specific neoantigens, thereby triggering a host immune response (e.g., CD8+ or CD4+) against the one or more tumor-specific neoantigens. Vaccinia virus vectors and methods useful in immunization protocols are described, for example, in U.S. Patent No. 4,722,848. Another vector is BCG (Bacillus Calmette-Guérin). BCG vectors are described in Stover et al. (Nature 351:456-460 (1991)). A variety of other vaccine vectors that will be apparent to those skilled in the art from the description herein may also be used for the therapeutic administration or immunization of neoantigens.
[0157] The immunogenic composition may contain individualized components according to the specific needs of the subject.
[0158] The immunogenic compositions described herein may further comprise adjuvants. An adjuvant is any substance that, when incorporated into the immunogenic composition, increases or otherwise enhances and / or elevates the immune response against tumor-specific neoantigens, but does not produce an immune response against tumor-specific neoantigens when administered alone. The adjuvant preferably produces an immune response against the neoantigen and does not cause allergic reactions or other adverse reactions. It is contemplated herein that the adjuvant may be administered before, together with, simultaneously with, or after the administration of the immunogenic composition.
[0159] Adjuvants can enhance immune responses through several mechanisms, including, for example, lymphocyte recruitment, B and / or T cell stimulation, and macrophage stimulation. When the immunogenic compositions of the present invention contain or are administered with one or more adjuvants, the adjuvants that can be used include, but are not limited to, mineral salt adjuvants or mineral salt gel adjuvants, particulate adjuvants, microparticulate adjuvants, mucosal adjuvants, and immunostimulatory adjuvants. Examples of adjuvants include, but are not limited to, aluminum salts (such as aluminum hydroxide, aluminum phosphate, and aluminum sulfate), 3-de-O-acylated monophosphoryl lipid A (MPL) (see GB 2220211), MF59 (Novartis), AS03 (GlaxoSmithKline), AS04 (GlaxoSmithKline), polysorbate 80 (Tween 80; ICL Americas, Inc.), imidazopyridine compounds (see International Application No. PCT / US2007 / 064857, published as International Publication No. WO2007 / 109812), imidazoquinoxaline compounds (see International Application No. PCT / US2007 / 064858, published as International Publication No. WO2007 / 109813), and saponins, such as QS21 (see Kensil et al., Vaccine Design: The Subunit and Adjuvant Approach (edited by Powell and Newman, Plenum Press, NY, 1995); U.S. Patent No. 5,057,540). In some embodiments, the adjuvant is a Freund's adjuvant (complete or incomplete). Other suitable adjuvants are oil-in-water emulsions (such as squalene or peanut oil), optionally in combination with immunostimulants such as monophosphoryl lipid A (see Stoute et al., N. Engl. J. Med. 336, 86-91 (1997)).
[0160] CpG immunostimulatory oligonucleotides have also been reported to enhance the effects of adjuvants in the vaccine setting. Other TLR-binding molecules, such as RNA-binding TLR 7, TLR 8, and / or TLR 9, may also be used.
[0161] Other examples of useful adjuvants include, but are not limited to, chemically modified CpGs (e.g., CpR, Idera), Poly(I:C) (e.g., polyI:C12U), poly-ICLC, non-CpG bacterial DNA or RNA, and immunologically active small molecules and antibodies such as cyclophosphamide, sunitinib, bevacizumab, Celebrex (celecoxib), NCX-4016, sildenafil, tadalafil, vardenafil, sorafenib, XL-999, CP-547632, pazopanib, ZD2171, AZD2171, ipilimumab, tremelimumab, and SC58175, which can have a therapeutic effect and / or act as an adjuvant. In some embodiments, poly-ICLC is a preferred adjuvant.
[0162] The immunogenic composition may contain one or more tumor-specific neoantigens described herein, alone or with a pharmaceutically acceptable carrier. Suspensions or dispersions of one or more tumor-specific neoantigens may be used, particularly isotonic suspensions, dispersions, or amphiphilic solvents. The immunogenic composition may be sterile and / or may contain excipients such as preservatives, stabilizers, wetting agents and / or emulsifiers, solubilizers, salts and / or buffers for adjusting osmotic pressure, and is prepared in a manner known per se, for example by means of conventional dispersion and suspension methods. In some embodiments, such dispersions or suspensions may contain viscosity modifiers. The suspension or dispersion is maintained at a temperature of about 2°C to 8°C, or preferably frozen for long-term storage and then thawed shortly before use. For injection, the vaccine or immunogenic preparation may be formulated in an aqueous solution, preferably in a physiologically compatible buffer, such as Hanks' solution, Ringer's solution, or physiological saline buffer. The solution may contain formulations such as suspending agents, stabilizers, and / or dispersants.
[0163] In some embodiments, the compositions described herein additionally contain a preservative, such as the mercury derivative thimerosal. In a specific embodiment, the pharmaceutical compositions described herein contain 0.001% to 0.01% thimerosal. In other embodiments, the pharmaceutical compositions described herein do not contain a preservative.
[0164] Excipients may be present independently of adjuvants. The function of the excipient may be, for example, to increase the molecular weight of the immunogenic composition, enhance activity or immunogenicity, impart stability, enhance biological activity, or prolong serum half-life. Excipients may also be used to assist in the presentation of one or more tumor-specific neoantigens to T cells (e.g., CD4+ or CD8+ T cells). The excipient may be a carrier protein, such as, but not limited to, keyhole limpet hemocyanin, serum proteins (such as transferrin, bovine serum albumin, human serum albumin), thyroglobulin or ovalbumin, immunoglobulins or hormones (such as insulin), or palmitic acid. For immunization of humans, the carrier is generally a physiologically acceptable and safe carrier. Alternatively, the carrier may be a dextran, such as agarose gel.
[0165] Cytotoxic T cells recognize antigens in the form of peptides bound to MHC molecules, rather than the intact foreign antigens themselves. MHC molecules are located on the cell surface of antigen-presenting cells. Therefore, activation of cytotoxic T cells is possible in the presence of a trimeric complex of the peptide antigen, the MHC molecule, and the antigen-presenting cell (APC). The immune response may be enhanced if, in addition to the one or more tumor-specific antigens used to activate cytotoxic T cells, APCs carrying the corresponding MHC molecules are also added. Therefore, in some embodiments, the immunogenic composition additionally contains at least one APC.
[0166] Immunogenic compositions may contain an acceptable carrier (e.g., an aqueous carrier). Various aqueous carriers can be used, such as water, buffered water, 0.9% saline, 0.3% glycine, hyaluronic acid, etc. These compositions can be sterilized using conventional, well-known sterilization techniques, or can be sterile filtered. The resulting aqueous solution can be packaged for use as is, or lyophilized, in which case the lyophilized preparation is combined with a sterile solution prior to administration. The composition may, as needed, contain pharmaceutically acceptable auxiliary substances to approximate physiological conditions, such as pH adjusters and buffers, tonicity adjusting agents, wetting agents, etc., for example sodium acetate, sodium lactate, sodium chloride, potassium chloride, calcium chloride, sorbitan monolaurate, triethanolamine oleate, etc.
[0167] Neoantigens can also be administered via liposomes, which target the neoantigen to specific cellular tissues, such as lymphoid tissue. Liposomes can also be used to extend half-life. Liposomes include emulsions, foams, micelles, insoluble monolayers, liquid crystals, phospholipid dispersions, lamellar layers, etc. In these preparations, the neoantigen to be delivered is incorporated as part of the liposome, alone or in combination with molecules that bind to receptors ubiquitous in lymphocytes (such as monoclonal antibodies binding to the CD45 antigen) or with other therapeutic or immunogenic compositions. Thus, liposomes filled with the desired neoantigen can be directed to a site on lymphoid cells, where the liposomes then deliver the selected immunogenic composition. Liposomes can be formed from standard vesicle-forming lipids, which typically comprise neutral and negatively charged phospholipids and sterols, such as cholesterol. The selection of lipids is typically guided by considerations such as liposome size, acid lability, and the stability of the liposome in the bloodstream. Various methods can be used to prepare liposomes, as described, for example, in Szoka et al., Ann. Rev. Biophys. Bioeng. 9:467 (1980); U.S. Patent Nos. 4,235,871, 4,501,728, 4,837,028, and 5,019,369.
[0168] To target immune cells, ligands to be incorporated into liposomes may include, for example, antibodies or fragments thereof specific to cell surface determinants of the desired immune system cells. Liposome suspensions can be administered intravenously, locally, or otherwise, in doses that vary particularly depending on the route of administration, the peptide delivered, and the stage of the disease being treated.
[0169] An alternative approach for targeting immune cells is to incorporate components of the immunogenic composition, such as antigens (i.e., tumor-specific neoantigens), ligands, or adjuvants (e.g., TLRs), into poly(lactic-co-glycolic acid) microspheres. These microspheres can entrap components of the immunogenic composition and serve as an endosomal delivery device.
[0170] Nucleic acids encoding tumor-specific neoantigens described herein may also be administered to subjects for therapeutic or immunization purposes. Nucleic acids can be conveniently delivered to subjects using numerous methods. For example, nucleic acids can be delivered directly as "naked DNA." This method is described, for example, in Wolff et al., Science 247: 1465-1468 (1990), and U.S. Patent Nos. 5,580,859 and 5,589,466. Nucleic acids can also be administered using ballistic delivery, as described, for example, in U.S. Patent No. 5,204,253. Particles consisting solely of DNA can be administered. Alternatively, DNA can be adhered to particles, such as gold particles. Methods for delivering nucleic acid sequences may include viral vectors, mRNA vectors, and DNA vectors with or without electroporation. Nucleic acids can also be delivered in combination with cationic compounds, such as cationic lipids.
[0171] The immunogenic compositions described herein can be administered to subjects via, but not limited to, oral, intradermal, intratumoral, intramuscular, intraperitoneal, intravenous, topical, subcutaneous, percutaneous, intranasal, and inhalation routes, as well as via scarification (scratching through the top layers of the skin, e.g., using a bifurcated needle). Immunogenic compositions can be applied to tumor sites to induce a local immune response against the tumor.
[0172] The dosage of one or more tumor-specific neoantigens may depend on the type of composition and the subject's age, weight, body surface area, individual condition, individual pharmacokinetic data, and administration pattern.
[0173] This document also discloses a method for manufacturing an immunogenic composition comprising one or more tumor-specific neoantigens selected by performing the steps of the method disclosed herein. Immunogenic compositions as described herein can be manufactured using methods known in the art. For example, the method disclosed herein for generating tumor-specific neoantigens or vectors (e.g., vectors comprising at least one sequence encoding one or more tumor-specific neoantigens) may include culturing host cells under conditions suitable for expressing the neoantigen or vector, wherein the host cells contain at least one polynucleotide encoding the neoantigen or vector; and purifying the neoantigen or vector. Standard purification methods include chromatographic, electrophoretic, immunological, precipitation, dialysis, filtration, concentration, and chromatofocusing techniques.
[0174] Host cells may include Chinese hamster ovary (CHO) cells, NS0 cells, yeast, or HEK293 cells. Host cells may be transformed with one or more polynucleotides containing a nucleic acid sequence encoding at least one of the tumor-specific neoantigens or vectors disclosed herein. In some embodiments, the isolated polynucleotide may be cDNA.
[0175] 5. Examples
[0176] Example 1.1: Training Data
[0177] The model was trained to predict the probability of peptide-MHC binding and the presentation of endogenous peptides on MHC class I. These were treated as proxies for CD8+ T cell immunogenicity. The curated peptide-MHC binding affinity data from MHCflurry ("curated_training_data.no_mass_spec.csv") was used, which includes data from IEDB [1] and Kim et al. [2]. The only processing step performed on this curated dataset was to add the peptide source proteins, so that flanking regions could be extracted for our negative sampling method. After this processing step, the final dataset used for training was called "curated_training_data.no_mass_spec.multiple_context.blast.v2.csv". From this dataset, only entries with HLA-A / B / C alleles and peptides of length 8-15 without post-translational modifications were retained. Each sample carries either a quantitative ('=') or a qualitative ('<' / '>') target, as proposed in MHCflurry-1.2 [3]. Qualitative entries in the MHCflurry curated dataset represent positive-high (<100 nM), positive-intermediate (<1000 nM), positive-low (<5000 nM), or negative (>5000 nM) qualitative values. Additionally, a cell surface peptide presentation dataset was used, consisting of cell-surface-presented peptides identified by peptide elution experiments and mass spectrometry from several sources, including the Sarkizova et al. [4] dataset, in which mass spectrometry was used to analyze >185,000 peptides eluted from 95 HLA-A, HLA-B, HLA-C, and HLA-G monoallelic cell lines. Mass spectrometry hits from the MHCflurry curated dataset were also used, where the relevant samples are identified by the 'mass spectrometry' value in the 'measurement_source' column. This includes 226,684 MS-identified ligands deposited in IEDB [1] or the SysteMHC Atlas [5], or published by Abelin et al. [6].
[0178] Additionally, cell surface presentation data were used, obtained at Fred Hutchinson Cancer Center by stably transfecting HEK293 cells to express secreted human HLA molecules covalently linked to β-2-microglobulin. Cell supernatants were then prepared to capture secreted MHC-peptide complexes and analyzed by mass spectrometry. Peptides from the purified complexes are reported. Data files are available in the compressed file "RolandPeptidePresentationData.zip". These data sources consist of peptide sequences found to be presented on MHC class I molecules associated with specific HLA alleles. For our final presentation dataset ("mass_spec_data.multiple_context.blast.allele_supertypes.v3.csv"), samples from all of the above sources that contain HLA-A / B / C alleles and peptide source proteins were merged, so that flanking regions could be extracted for use in the negative sampling approach. Near-duplicate "extended" samples were also filtered out: samples with the same allele whose peptides differed only by a single additional amino acid at either end were considered duplicates, a consequence of inaccuracies in mass spectrometry measurements. Samples (peptide-MHC pairs) appearing in the immunogenicity assessment data were also filtered out of the affinity and presentation datasets, to ensure that the assessment set was completely held out during the training phase and was never seen by the model.
[0179] Before training a new model, these datasets (affinity and presentation) are randomly split into training and validation splits. For each allele $i$, we randomly sample $N_{ba_i}$ peptides from the binding affinity samples with the $i$-th allele and $N_{p_i}$ peptides from the presentation samples with the $i$-th allele, where $N_{ba_i} = \min(0.25 \cdot |\text{unique affinity peptides}|_i,\ 100)$ and $N_{p_i} = \min(0.25 \cdot |\text{unique presentation peptides}|_i,\ 100)$.
[0180] For each allele, the sampled peptides (the union of the peptides sampled from both datasets) were treated as the held-out validation set, and all samples containing these peptides were removed from the training set and used only for validation. The validation set was used to determine early stopping of model training: training was stopped if the validation loss did not improve over N consecutive epochs. N was set to 20 in all experiments. When training multiple similar models, we used different training-validation splits for ensemble / model selection purposes.
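The per-allele hold-out described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names, the `(allele, peptide)` tuple layout, the fixed seed, and rounding 0.25·|unique peptides| down to an integer are all assumptions.

```python
import random
from collections import defaultdict


def validation_peptides(affinity, presentation, fraction=0.25, cap=100, seed=0):
    """Return {allele: set of held-out peptides}.

    `affinity` and `presentation` are iterables of (allele, peptide) pairs
    (hypothetical layout).
    """
    rng = random.Random(seed)
    held_out = defaultdict(set)
    for dataset in (affinity, presentation):
        unique = defaultdict(set)
        for allele, peptide in dataset:
            unique[allele].add(peptide)
        for allele, peptides in unique.items():
            # N = min(0.25 * |unique peptides|, 100), rounded down (assumption).
            n = min(int(fraction * len(peptides)), cap)
            # Union of the peptides sampled from both datasets.
            held_out[allele] |= set(rng.sample(sorted(peptides), n))
    return dict(held_out)


def split(dataset, held_out):
    """Send every sample whose peptide is held out for its allele to validation."""
    train, val = [], []
    for allele, peptide in dataset:
        target = val if peptide in held_out.get(allele, set()) else train
        target.append((allele, peptide))
    return train, val
```

Calling `validation_peptides` with a different `seed` yields a different training-validation split, which is one way to obtain the distinct splits mentioned for ensembling.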
[0181] Human rhinovirus (HRV) data were included as negative data, taken from ICS applied to 1600 HRV 15-mers constructed from HRV-1A, HRV-B, and HRV-C chimeras (as described in Fischer et al., (2007), Nature Medicine, 13, 100-106). Specifically, negative samples randomly sampled from the ICS results were used for immunogenicity assessment and tested as part of the immunogenicity assessment data. The remaining HRV samples were considered negative presentation samples and were randomly sampled during training.
[0182] Example 1.2: Immunogenicity assessment data
[0183] To assess how well BigMHC 1.0 predictions of peptide-MHC binding affinity and cell surface peptide presentation translate into predictions of T cell immunogenicity, a T cell immunogenicity dataset was created to test and validate the BigMHC machine learning model. Reported peptide-MHC pairs from the CTL / CD8+ epitope summary table in the HIV Molecular Immunology Database and from the CTL epitope summary tables in the HCV Immunology Database were used. These tables provide experimentally validated HIV / HCV CTL / CD8+ epitopes.
[0184] All of these samples belong to the positive immunogenicity class. To obtain peptide-MHC samples with negative immunogenicity as well, we manually reviewed the experiments from which the positive pairs originated. We found that for some experiments the negatives could be reconstructed from the positive results, even though no negative findings were reported. Specifically, many experiments tested all possible peptide-MHC pair combinations for a given set of peptides and a given set of HLA alleles, a method known as the matrix approach. From the positive findings, we extracted the reported peptides and alleles and concluded that at least all possible peptide-MHC combinations of these peptides and alleles were tested in the experiment. (There may be additional negatives that we missed, since a peptide or allele that was negative in all of its pairings would not be reported at all.) Given this list of tested peptide-MHC pairs, any pair in the list that is not reported as positive can be concluded to be negative. The 31 largest experiments (by number of samples tested, |unique alleles| × |unique peptides|) were reviewed to verify whether the matrix approach was used, and 18 of them employed this method. We cannot infer negative samples from experiments that used other methods.
[0185] For the remaining smaller experiments, the matrix approach is assumed and all implied negatives are extracted, unless a sample is reported as positive in a different experiment, in which case the sample is treated as positive. Only HLA-A / B / C alleles and peptides of length 8-15 are retained.
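The matrix-approach reasoning can be illustrated with a short sketch: every combination of the reported peptides and reported alleles is assumed to have been tested, and any combination not reported as positive is taken as negative. The function name and the `(peptide, allele)` tuple layout are hypothetical.

```python
from itertools import product


def infer_matrix_negatives(positive_pairs, positive_elsewhere=frozenset()):
    """Infer negative (peptide, allele) pairs for one experiment.

    positive_pairs:     positives reported by this experiment.
    positive_elsewhere: pairs reported positive in any other experiment;
                        these are never emitted as negatives.
    """
    positives = set(positive_pairs)
    peptides = sorted({peptide for peptide, _ in positives})
    alleles = sorted({allele for _, allele in positives})
    # Assumed test matrix: |unique peptides| x |unique alleles|.
    tested = set(product(peptides, alleles))
    return sorted(tested - positives - set(positive_elsewhere))
```

For example, positives `("P1", "A")` and `("P2", "B")` imply that `("P1", "B")` and `("P2", "A")` were tested and found negative, unless one of them is positive in another experiment.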
[0186] In addition to the data mentioned above, we added positive immunogenic samples reported in IEDB (Vita et al., (2019), Nucleic Acids Research, 47, D339-D343) and additional randomly sampled HRV-negative samples. The HRV negatives were sampled as follows.
[0187] For alleles with a relatively low positive:negative sample ratio, we considered an ideal ratio of 1:100 and, where possible (since we did not have HRV samples for all alleles), sampled HRV negatives until we reached approximately this ratio. Following this balancing procedure, we filtered out the samples of all alleles with a ratio less than 1:5.
[0188] Following this procedure, 2,985 positive sample pairs and 68,469 negative sample pairs were obtained, covering 110 HLA alleles and 1,416 unique peptides.
[0189] This immunogenicity dataset was split into two sets: a validation set, used to tune the hyperparameters of our model and make additional configuration choices, and a test set, on which we performed the final benchmark tests.
[0190] Example 1.3: Data Analysis
[0191] The following figures were drawn to better visualize and understand the distribution of the data.
[0192] Peptide length: Figure 2B shows the peptide length distribution across all three datasets.
[0193] Peptide diversity: The similarity between two peptides is defined as the number of overlapping amino acids under the best possible alignment. In the analysis shown in Figure 2A, for each similarity threshold, the fraction of peptides that have no similarity greater than the given threshold to any other peptide is calculated. For this analysis, only unique peptides are considered, ignoring duplicate peptides in each dataset. Specifically, the affinity dataset consists of 35,467 unique peptides (22.45%) from a total of 158,001 samples, the presentation dataset consists of 265,236 unique peptides (68.93%) from a total of 384,812 samples, and the immunogenicity dataset consists of 1,416 unique peptides (1.98%) from a total of 71,474 samples.
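One way to read this similarity measure is sketched below. It assumes that "best possible alignment" means the best ungapped offset between the two peptides; the text does not state whether gaps are allowed, and the function names are hypothetical. The pairwise loop is quadratic and is meant for illustration only.

```python
def similarity(a: str, b: str) -> int:
    """Maximum number of identical aligned residues over all ungapped offsets."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, residue in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == residue
        )
        best = max(best, matches)
    return best


def fraction_without_similar_peptide(peptides, threshold):
    """Fraction of unique peptides with no other peptide more similar than threshold."""
    unique = sorted(set(peptides))
    if not unique:
        return 0.0
    isolated = sum(
        1
        for p in unique
        if all(similarity(p, q) <= threshold for q in unique if q != p)
    )
    return isolated / len(unique)
```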
[0194] Target distribution: The distribution of targets, and the imbalance among them, differs across the datasets. Binding affinity: the peptide-MHC binding affinity dataset consists of a mixture of quantitative and qualitative targets. Figure 3 shows the high-level, qualitative distribution of binding affinity targets. Presentation: the cell surface peptide presentation dataset is binary, and our dataset consists only of positive samples. At the start of each epoch of the training procedure, we apply negative sample mining to generate corresponding negative samples for each positive sample. Immunogenicity: the T cell immunogenicity dataset is binary but highly imbalanced, consisting of 2,985 positive peptide-MHC pairs and 68,469 negative peptide-MHC pairs.
[0195] Sample distribution per supertype: To better visualize and understand the underlying allele distribution in the dataset (which is difficult to visualize on an allele-by-allele basis), Figure 4 shows the distribution of HLA allele supertypes among the samples in our datasets. The HLA supertype classification determined by Sidney et al. [7] is applied.
[0196] Sample distribution per HLA allele: In addition to the HLA allele supertype distribution, Figure 4 also shows the distribution of HLA alleles among the peptide-MHC samples of each dataset.
[0197] Example 1.4: Negative Data Mining During Training
[0198] The cell surface peptide presentation data consist only of "positive" samples and do not provide the negative samples (peptides that are not presented on the cell surface) needed to train a binary presentation classifier. Therefore, to train such a classifier, the following strategies for probabilistic negative mining during training are employed:
[0199] HLA allele shuffling. Given a positive sample consisting of a peptide and its corresponding HLA allele, the given allele is replaced with a randomly sampled different allele that does not belong to the supertype of the positive allele. The HLA supertype classification determined by Sidney et al. [7] is applied, which assigns each HLA allele to one or two HLA supertypes. This classification leaves a small number of unclassified HLA alleles. The unclassified alleles are mapped to the following three additional supertype categories according to their corresponding HLA-A / B / C groups: "unclassified-A", "unclassified-B", and "unclassified-C", and these groups are treated in the same manner as the other supertype categories.
[0200] Peptide shuffling. Given a positive sample consisting of a peptide and its corresponding HLA allele, the given peptide is replaced with a randomly sampled amino acid subsequence of the same length from the peptide source protein. Furthermore, the affinity training dataset is also expanded to include random peptides with qualitatively weak affinity targets (>20,000 nM) sampled from the amino acid data distribution, following the method in MHCflurry-1.6 [8]. The lengths of these random peptides are determined in such a way that each peptide length is forced to have the same number of non-binding data points for each allele.
[0201] HRV negative sampling. We randomly sampled from the negative HRV data (excluding samples used for immunogenicity assessment). The underlying assumption is that, in most cases, negative immunogenicity is caused by negative cell surface presentation, so negative presentation can be assumed for these samples during training (although there are cases with positive presentation and negative immunogenicity).
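The first two strategies (allele shuffling and peptide shuffling) can be sketched as below. The function names and the `supertypes` mapping are hypothetical; the supertype assignments in any real run would come from the Sidney et al. classification plus the three "unclassified" groups. Excluding windows identical to the positive peptide is an added assumption, and both functions raise `IndexError` when no valid candidate exists.

```python
import random


def shuffle_allele(allele, supertypes, rng):
    """Pick a random allele sharing no supertype with the positive allele.

    supertypes: {allele: set of supertype names} (one or two per allele).
    """
    positive = supertypes[allele]
    candidates = sorted(a for a, s in supertypes.items() if not (s & positive))
    return rng.choice(candidates)


def shuffle_peptide(peptide, source_protein, rng):
    """Pick a random same-length window of the source protein.

    Windows equal to the positive peptide are skipped (assumption).
    """
    k = len(peptide)
    windows = [
        source_protein[i:i + k]
        for i in range(len(source_protein) - k + 1)
        if source_protein[i:i + k] != peptide
    ]
    return rng.choice(windows)
```

Because the negatives are re-drawn with a fresh random state at the start of each epoch, the model sees a different set of negatives for the same positives over the course of training.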
[0202] Example 1.5: Cross-task objective inference
[0203] To enable joint, multi-task training, all training samples from both the binding affinity dataset and the presentation dataset are used. However, for each sample, only a single known target (rather than both tasks) from its corresponding dataset of origin is known. To mitigate this problem (and better leverage multi-task training), targets from each task to the other are inferred by assuming that samples presented on cell surfaces (positive presentations) will also have high binding affinity values and samples with poor binding affinity values will not be presented (negative presentations). Specifically, for each positive presentation sample, we infer qualitative high-affinity targets (<500 nM), while for samples with poor binding affinity measurements (>5000 nM), we infer negative presentation targets. The remaining "missing targets" (targets we cannot infer) are simply ignored (masked) during training by assigning zero sample weights (only for tasks with missing targets).
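The inference and masking rules above can be sketched as follows. The `Targets` container and function name are hypothetical, and treating a sample as a poor binder when it is either a quantitative value above 5000 nM or a qualitative ">" bound of at least 5000 nM is one reading of "poor binding affinity measurements (>5000 nM)".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Targets:
    affinity_nm: Optional[float]   # measured value or bound; None if unknown
    inequality: Optional[str]      # "=", "<" or ">"; None if unknown
    presented: Optional[int]       # 1, 0, or None if unknown
    affinity_weight: float = 1.0
    presentation_weight: float = 1.0


def infer_missing_targets(affinity_nm=None, inequality=None, presented=None):
    poor_binder = affinity_nm is not None and (
        (inequality == "=" and affinity_nm > 5000)
        or (inequality == ">" and affinity_nm >= 5000)
    )
    if affinity_nm is None and presented == 1:
        # Positive presentation implies a qualitative high-affinity target.
        affinity_nm, inequality = 500.0, "<"
    elif presented is None and poor_binder:
        # Poor binders are assumed not to be presented.
        presented = 0
    return Targets(
        affinity_nm,
        inequality,
        presented,
        # A task whose target is still missing is masked with zero weight.
        affinity_weight=0.0 if affinity_nm is None else 1.0,
        presentation_weight=0.0 if presented is None else 1.0,
    )
```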
[0204] Example 1.6: Self-distillation
[0205] Using the BigMHC predictor, we extracted binding affinity and presentation estimates for various samples and added these samples and their corresponding "weak" labels to the training dataset. We performed this self-distillation process in the following two cases:
[0206] 1. Multi-allele mass spectrometry data. We used the MULTI-ALLELIC OLD dataset as described in MHCflurry-2.0, which contains >200K positive mass spectrometry hits. We used BigMHC-1.3.1 predictions to determine which of the alleles in each sample was responsible for the hit. First, we filtered out any multi-allele hits with known positive presenters from our presentation training data. Second, we selected the allele with the highest presentation probability and retained the sample only if the presentation probability was above a certain threshold (0.5) and the binding affinity was below a certain threshold (5000 nM).
[0207] 2. Positive Presenters. For each positive presenter with an unknown binding affinity in our training data (after performing the steps above), we estimated the binding affinity based on BigMHC-1.3.1 predictions. We added all samples with predicted binding affinity below 5000 nM to the binding affinity training data.
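The allele assignment in step 1 can be sketched as follows. The `predict` callable stands in for the earlier model version and is a hypothetical interface returning a `(presentation_probability, affinity_nm)` pair; the function name is likewise illustrative.

```python
def assign_allele(peptide, alleles, predict,
                  min_presentation=0.5, max_affinity_nm=5000.0):
    """Assign a multi-allelic mass spectrometry hit to a single allele.

    Returns (allele, presentation_probability, affinity_nm) as a weak label,
    or None if the best allele fails either threshold.
    """
    scored = [(predict(peptide, allele), allele) for allele in alleles]
    # Select the allele with the highest predicted presentation probability.
    (prob, affinity_nm), best_allele = max(scored, key=lambda item: item[0][0])
    if prob > min_presentation and affinity_nm < max_affinity_nm:
        return best_allele, prob, affinity_nm
    return None
```

Step 2 follows the same pattern: the predicted affinity of a known positive presenter is kept as a weak affinity label only when it is below 5000 nM.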
[0208] Example 2.1: Sequence Representation
[0209] Each HLA allele is represented by a 49-amino-acid pseudo-sequence, as used in MHCflurry-1.4. This pseudo-sequence encoding uses the amino acids at 49 selected positions, determined by a multiple sequence alignment of a large number of MHC class I alleles across species. The HLA allele pseudo-sequence representation is available in the file "allele_sequences.csv". In addition, the peptide padding and encoding method from O'Donnell et al. [3] is used to represent peptides of 8–15 amino acids with a fixed-length encoding designed to preserve the positions of the residues that make the most important stabilizing contacts with the MHC. For most alleles, these "anchor positions" occur near the beginning or the end of the peptide. The peptide is represented as a sequence of length 15 in which missing residues are filled with the character "X", which is effectively treated as a 21st amino acid. The first four and last four residues of the peptide are mapped to the first four and last four positions of the representation. The middle seven positions are filled as needed: 8-mers leave all middle positions as X, while 15-mers fill all positions. In this way, the positions most likely to contain anchor residues are consistently mapped to the same positions in the representation. The flanking regions of the peptide are also encoded: five amino acids on each side, concatenated to the corresponding edge of the encoded peptide.
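A minimal sketch of this fixed-length peptide representation is given below. Centering the middle residues within the seven middle positions, and padding flanks shorter than five residues with "X", are assumptions; the text only specifies where the first and last four residues go. The function name is hypothetical.

```python
def encode_peptide(peptide, n_flank="", c_flank="", length=15, edge=4,
                   flank=5, pad="X"):
    """Return N-flank + padded peptide + C-flank as a fixed-length string."""
    if not 2 * edge <= len(peptide) <= length:
        raise ValueError("peptide length must be between 8 and 15")
    middle = peptide[edge:len(peptide) - edge]
    free = (length - 2 * edge) - len(middle)   # unused middle positions
    left = free // 2                            # centered (assumption)
    core = (peptide[:edge] + pad * left + middle
            + pad * (free - left) + peptide[-edge:])
    # Keep the residues closest to the peptide; pad short flanks with X.
    n_part = (n_flank[-flank:] if n_flank else "").rjust(flank, pad)
    c_part = c_flank[:flank].ljust(flank, pad)
    return n_part + core + c_part
```

With these defaults every input maps to a string of length 25, which can then be passed through the amino acid embedding layer position by position.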
[0210] In contrast to O'Donnell et al. [3], who used fixed amino acid embeddings based on the BLOSUM62 substitution matrix, a trainable embedding layer was used, which was jointly trained end-to-end with the rest of our neural network. This embedding layer encodes each amino acid in the peptide or allele pseudo-sequence into a 16-dimensional vector.
[0211] Example 2.2: Flanking Regions
[0212] For each peptide sequence, all instances in which the sequence occurs as a subsequence of a longer protein sequence in the UniProt dataset [9] were identified. The following files were downloaded and searched: (1) the UniProt human proteome dataset, "UP000005640_9606.fasta"; (2) the complete UniProtKB / Swiss-Prot dataset, "uniprot_sprot.fasta"; (3) the additional sequences representing all annotated splice variants of the UniProtKB / Swiss-Prot dataset, "uniprot_sprot_varsplic.fasta"; and (4) the additional sequences from the netMHCpan-4.0 immunogenicity datasets, CD8 ("CD8_epitopes_netMHCpan.fas") and CD4 ("CD4_epitopes_netMHCIIpan.fsa").
[0213] Each longer sequence containing the peptide is referred to as a "parental sequence." Each peptide is associated with one or more "flanking regions" of length 10, consisting of the five amino acids immediately preceding and the five amino acids immediately following the peptide in one of its parental sequences. All unique flanking-region combinations are saved to a file, and each unique combination is assigned a weight that is inversely proportional to the number of unique flanking-region combinations for its peptide. These weights are used as sample weights during training, so that the network can learn from all possible variations without giving greater total weight to peptides with many variations. For peptides without an exact match, we use BLAST10 to find the closest matching peptide and use the corresponding "parental sequence" of that peptide to extract the relevant flanking region.
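The flanking-region extraction and weighting can be sketched as below for the exact-match case. The function names are hypothetical, and padding with "X" where the peptide sits within five residues of a protein end is an assumption; the BLAST fallback is not shown.

```python
def flanking_regions(peptide, proteins, flank=5, pad="X"):
    """Collect unique (N-flank, C-flank) pairs over all exact occurrences."""
    contexts = set()
    for sequence in proteins:
        start = sequence.find(peptide)
        while start != -1:
            end = start + len(peptide)
            n_flank = sequence[max(0, start - flank):start].rjust(flank, pad)
            c_flank = sequence[end:end + flank].ljust(flank, pad)
            contexts.add((n_flank, c_flank))
            start = sequence.find(peptide, start + 1)
    return sorted(contexts)


def flank_weights(contexts):
    """Weight each unique context by 1 / (number of unique contexts)."""
    return {context: 1.0 / len(contexts) for context in contexts}
```

With this weighting the sample weights of one peptide always sum to 1, regardless of how many parental sequences it appears in.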
[0214] Example 2.3: Self-supervised pre-training
[0215] We utilize a large protein database to pre-train our model and learn good initial sequence representations. Inspired by BERT pre-training, and using a subset of 25M proteins from the UniParc database, we train the peptide transformer model on the following two tasks:
[0216] 1. Masked Language Modeling. We randomly select 15% of the tokens to be considered "masked" and train a token classification head that attempts to predict the original tokens (based on all other tokens) using a cross-entropy loss. In the model input, the "masked" tokens are sometimes replaced by a random token (10%), sometimes left unchanged (10%), and otherwise replaced by the mask token (80%).
[0217] 2. Next Peptide Prediction. In the pre-training phase, our input sequence is a concatenation of two peptide sequences (as opposed to the concatenation of peptide and allele sequences in the main training phase). The two sequences are separated by a special separator token (<sep>) and have different segment indices and embeddings (the segment sequence is an additional input to the network that simply indicates whether each token belongs to the first sequence, the second sequence, or is a special token; the embedding of the segment index is then added to the token and position embeddings). A classifier is trained on the <cls> token output to predict whether the second peptide is the next peptide to appear in the protein (after the first peptide). The peptide pairs are either two consecutive peptides of the same length from human proteins, or randomly sampled from different proteins.
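The masking scheme of the first task can be sketched as follows. Selecting each token independently with probability 0.15 (rather than exactly 15% of positions), drawing random replacements from the 20 standard amino acids, and the `<mask>` token name are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # hypothetical mask token name


def mask_tokens(tokens, rng, select_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue
        labels[i] = token              # the head must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK           # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(AMINO_ACIDS)   # 10%: random token
        # remaining 10%: leave the token unchanged
    return inputs, labels
```

The cross-entropy loss is then computed only at positions where the label is not `None`.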
[0218] Example 2.3: Training Objectives and Multi-Task Loss
[0219] Our neural network was jointly trained to predict peptide-MHC binding affinity and cell surface peptide presentation. Following O'Donnell et al. [3], for both the quantitative and the qualitative peptide-MHC binding affinity measurements in the dataset we used a variant of the mean squared error (MSE) loss function, denoted $L_{BA\text{-}MSE}$, in which measurements associated with an inequality (>) or (<) contribute to the loss only when the inequality is violated. The exact expressions are outlined in Schemes 1 and 2.
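The idea of an MSE that respects inequalities can be sketched as below. This is an illustration rather than the expressions of Schemes 1 and 2: it assumes predictions and targets are on the same scale as the measured affinity (for example log nM), that "<" means the true value lies below the target, and that the loss is a weighted mean. If the affinities are transformed so that larger values mean stronger binding, the direction of each inequality flips.

```python
def ba_mse(predictions, targets, inequalities, weights=None):
    """MSE in which '<' and '>' samples are penalized only when violated."""
    total, weight_sum = 0.0, 0.0
    for i, (pred, target, ineq) in enumerate(
            zip(predictions, targets, inequalities)):
        weight = 1.0 if weights is None else weights[i]
        diff = pred - target
        if ineq == "<" and pred <= target:
            diff = 0.0   # bound satisfied: no penalty
        elif ineq == ">" and pred >= target:
            diff = 0.0   # bound satisfied: no penalty
        total += weight * diff * diff
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Passing a weight of zero for a sample removes it from the loss, which is how samples with a missing affinity target can be masked.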
[0220] For the binary classification task of cell surface peptide presentation, we utilize the focal loss [10], denoted $L_{P\text{-}FL}$, which is a weighted extension of the standard binary cross-entropy loss that gives more emphasis to poorly classified samples. The exact expression is outlined in Scheme 3 (and reproduced below).
[0221]
[0222] where $p_{t,i}$, the predicted probability of the correct class, is:
[0223]
[0224] Scheme 3
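For reference, the focal loss as defined by Lin et al. [10] can be written in the notation of this section as shown below. This is a sketch of the standard form only: averaging over the $N$ samples and omitting a class-balancing factor are assumptions, and $\hat{y}_i$ is used here for the predicted presentation probability of the $i$-th sample.

```latex
L_{P\text{-}FL} = -\frac{1}{N} \sum_{i=1}^{N} \left(1 - p_{t,i}\right)^{\gamma} \log\left(p_{t,i}\right),
\qquad
p_{t,i} =
\begin{cases}
\hat{y}_i, & \text{if } y_i = 1 \\
1 - \hat{y}_i, & \text{otherwise}
\end{cases}
```

With $\gamma = 0$ this reduces to the standard binary cross-entropy; larger $\gamma$ shrinks the contribution of samples that are already classified well.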
[0225] Here, γ is a real-valued parameter, which we set to 1. In the binary case, $y_i \in \{0, 1\}$ are the ground truth labels, and $\hat{y}_i \in [0, 1]$ is the predicted presentation probability of the i-th sample. Lin et al. [10] showed that focal loss is effective for handling imbalanced data, and recent work by Mukhoti et al. [11] showed that it also produces a better calibrated network than the standard cross-entropy loss. The overall objective function is a linear combination of the MSE variant for peptide-MHC binding affinity and the binary focal loss for cell surface peptide presentation: $L = \alpha L_{BA\text{-}MSE} + (1 - \alpha) L_{P\text{-}FL}$.
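A small numerical sketch of the presentation loss and the combined objective follows. The function names, the weighted-mean reduction, and the clipping constant `eps` are assumptions; the per-sample term is the standard focal loss of Lin et al. with γ defaulting to 1 as stated above.

```python
import math


def focal_loss(probs, labels, gamma=1.0, weights=None, eps=1e-7):
    """Weighted mean binary focal loss over predicted presentation probabilities."""
    total, weight_sum = 0.0, 0.0
    for i, (p, y) in enumerate(zip(probs, labels)):
        weight = 1.0 if weights is None else weights[i]
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        p_t = p if y == 1 else 1.0 - p       # probability of the correct class
        total += weight * -((1.0 - p_t) ** gamma) * math.log(p_t)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def combined_loss(l_ba_mse, l_p_fl, alpha):
    """L = alpha * L_BA-MSE + (1 - alpha) * L_P-FL."""
    return alpha * l_ba_mse + (1.0 - alpha) * l_p_fl
```

Setting `gamma=0` recovers the ordinary binary cross-entropy, which makes the effect of the focusing term easy to check on a few hand-picked probabilities.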
[0226] During training, we also applied the negative sampling strategies described in Example 1.4. These were applied at the beginning of each epoch to both the presentation and affinity tasks. Negative samples for the validation set were generated once, at the beginning of the first epoch, and kept fixed throughout training. Where possible, the targets inferred from each task for the other were also used. For each loss term, a sample weighting mechanism was applied to assign different weights to samples based on their number of flanking-region variations, and samples with missing targets were masked (only for the specific task whose target was unknown). As explained in Example 1.5, samples with inferred targets had the same sample weights for both tasks.
[0227] For predicting cell surface peptide presentation, we train a binary classifier using the binary cross-entropy loss, denoted $L_{\text{P-BCE}}$. The presentation objective is:
[0228]
[0229] Scheme 9
[0230] The overall objective function is a linear combination of the above training objectives, with predefined loss weight coefficients:
[0231]
[0232] Scheme 10
[0233] During training, we also applied a negative sampling strategy in which each negative sample was drawn at random using one of the four methods described above. We also defined a negative:positive ratio hyperparameter, Nratio, which determines how many negative samples are drawn for each positive presentation sample. Each sampled negative was assigned a sample weight of 1 / Nratio, which effectively ensures that the $L_{\text{P-BCE}}$ objective is trained on balanced data (the overall sample weights of positive and negative training samples are equal). Furthermore, during the main training phase, we used masked language modeling as an auxiliary objective, where the masked tokens were randomly selected only from the flanking regions of peptides. This objective applies only to native peptide sequences and is therefore ignored for the types of sampled negatives in which the peptide (e.g., a randomly sampled amino acid sequence) or its context (e.g., an HRV negative sample) is "synthetic." We denote this auxiliary loss by $L_{\text{A-MLM}}$.
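The effect of the 1 / Nratio weighting can be sketched with one of the negative sampling methods (a randomly sampled amino acid sequence). The `(peptide, allele)` sample layout, the function names, and the example inputs are hypothetical and chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_peptide_negative(positive, rng):
    """One assumed negative type: a random peptide of the same length."""
    peptide, allele = positive
    decoy = "".join(rng.choice(AMINO_ACIDS) for _ in peptide)
    return decoy, allele

def build_epoch(positives, n_ratio, rng, sample_negative=random_peptide_negative):
    """Return (sample, label, weight) triples for one training epoch.

    Each positive gets weight 1 and is paired with n_ratio freshly sampled
    negatives of weight 1 / n_ratio, so both classes carry equal total weight.
    """
    samples = []
    for positive in positives:
        samples.append((positive, 1, 1.0))
        for _ in range(n_ratio):
            samples.append((sample_negative(positive, rng), 0, 1.0 / n_ratio))
    return samples

rng = random.Random(0)
epoch = build_epoch(
    [("SIINFEKL", "HLA-A*02:01"), ("GILGFVFTL", "HLA-A*02:01")], n_ratio=4, rng=rng
)
positive_weight = sum(w for _, label, w in epoch if label == 1)
negative_weight = sum(w for _, label, w in epoch if label == 0)
```

Calling `build_epoch` again at the start of every epoch resamples the negatives, matching the per-epoch sampling described for the training set.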
[0234] Example 2.4 Calibration
[0235] Recent work by Guo et al. [12] has shown that modern neural networks are poorly calibrated. Calibration is very important to us because our ranking logic pipeline uses probabilistic computations to make decisions based on predicted presentation probabilities. Specifically, in the vaccine design pipeline, given the six HLA alleles of a subject, we apply the following probabilistic computation to estimate the overall presentation probability across the subject's alleles from the model's predictions for each individual HLA allele:
[0236]
[0237] Scheme 4
[0238] These types of computations are more reliable if our predictors are properly calibrated. Therefore, we further calibrate the network's presentation predictions on the validation set by fitting a low-order polynomial to the calibration curve. All polynomial coefficients are constrained to be positive to obtain a monotonically increasing function, and Lasso linear regression is used for the fit. This step is crucial for obtaining well-calibrated presentation probabilities, which are later used in our inference pipeline; however, since the calibration step is monotonic, it does not affect the ranking of peptides within an individual allele. In our vaccine design pipeline, we use these calibrated presentation predictions as our best proxy for estimating immunogenicity probabilities.
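A minimal sketch of both ideas follows, under stated assumptions. The calibration fit is a plain coordinate-descent Lasso with non-negative coefficients and no intercept, written out by hand for self-containment. The subject-level combination assumes a noisy-OR over alleles, $1 - \prod_a (1 - p_a)$, which treats alleles as independent. The expression of Scheme 4 is not reproduced in this text, so that form is an assumption, as are all function names.

```python
def fit_monotone_polynomial(p, y, degree=3, lam=1e-4, n_sweeps=200):
    """Non-negative Lasso fit of y ~ sum_k w_k * p**k, k = 1..degree.

    Coordinate descent on (1/2n)*||y - Xw||^2 + lam*sum(w) subject to w >= 0.
    Non-negative coefficients make the polynomial non-decreasing on [0, 1].
    """
    n = len(p)
    X = [[pi ** k for k in range(1, degree + 1)] for pi in p]
    z = [sum(row[j] ** 2 for row in X) / n for j in range(degree)]
    w = [0.0] * degree
    for _ in range(n_sweeps):
        for j in range(degree):
            rho = 0.0
            for row, yi in zip(X, y):
                # prediction from all features except feature j
                partial = sum(w[k] * row[k] for k in range(degree) if k != j)
                rho += row[j] * (yi - partial)
            rho /= n
            w[j] = max(0.0, (rho - lam) / z[j]) if z[j] > 0 else 0.0
    return w

def calibrate(p, w):
    """Apply the fitted polynomial, clipped to a valid probability."""
    return min(1.0, sum(wk * p ** (k + 1) for k, wk in enumerate(w)))

def subject_presentation_probability(per_allele_probs):
    """Assumed noisy-OR: probability that at least one allele presents."""
    prob_not_presented = 1.0
    for p in per_allele_probs:
        prob_not_presented *= 1.0 - p
    return 1.0 - prob_not_presented

# Toy calibration curve for an over-confident model: observed rate = raw**2.
raw = [i / 10 for i in range(1, 10)]
observed = [r ** 2 for r in raw]
coeffs = fit_monotone_polynomial(raw, observed)
calibrated = [calibrate(r, coeffs) for r in raw]
```

Because the fitted map is non-decreasing, applying it does not reverse the order of any two peptides for one allele, which is the property the text relies on.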
[0239] Example 2.5 Model Architecture
[0240] A. Model Architecture 1
[0241] The architecture of this model consists of three main components: sequence processing, peptide-MHC binding affinity prediction, and cell surface peptide presentation prediction. The input to our model is the primary peptide sequence, including the flanking regions as described above, and a 49-amino acid HLA allele pseudo-sequence. First, the peptide sequence is encoded into a fixed-length vector, and then all amino acid sequences are encoded using a shared d-dimensional amino acid embedding layer. The embedded sequences are then flattened to produce a vector representation for each peptide, allele, and flanking region. For the peptide-MHC binding affinity prediction component, we concatenate the peptide and HLA allele representations and apply two dense layers of sizes 512 and 256, each followed by an exponential linear unit (ELU) activation and a dropout layer with a dropout probability of p=0.5. The output of this component is called the "affinity representation." An additional dense linear layer is then used to output the predicted binding affinity logit. Finally, for the cell surface peptide presentation prediction component, the peptide, flanking region, and HLA allele representations are first concatenated into a single vector. Then, a series of two fully connected layers, configured like those of the peptide-MHC binding affinity prediction component, is applied, and the "affinity representation" output above is also concatenated. The output of this component is called the "presentation representation." An additional dense linear layer is then added to predict the cell surface peptide presentation probability logit.
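The data flow of this architecture can be traced with a small NumPy forward pass. This is a sketch with random weights, not a trained model. The embedding size, sequence lengths, and vocabulary size are placeholder assumptions, dropout is omitted (inference mode), and the "affinity representation" is assumed to be concatenated after the two presentation layers, which the text leaves open.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 21, 16                                # assumed: 20 amino acids + padding
PEP_LEN, FLANK_LEN, ALLELE_LEN = 15, 10, 49      # assumed fixed input lengths

def dense(n_in, n_out):
    """Randomly initialized (weights, bias) pair for one dense layer."""
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mlp(x, layers):
    for weights, bias in layers:
        x = elu(x @ weights + bias)  # dropout (p=0.5) would follow in training
    return x

embedding = rng.normal(0.0, 0.1, (VOCAB, D))     # shared amino acid embedding

def embed_flat(token_ids):
    """Embed each residue, then flatten to one vector per sequence."""
    return embedding[token_ids].reshape(token_ids.shape[0], -1)

aff_layers = [dense((PEP_LEN + ALLELE_LEN) * D, 512), dense(512, 256)]
aff_out = dense(256, 1)
pres_layers = [dense((PEP_LEN + FLANK_LEN + ALLELE_LEN) * D, 512), dense(512, 256)]
pres_out = dense(256 + 256, 1)

def forward(peptide, flank, allele):
    pep, flk, alle = embed_flat(peptide), embed_flat(flank), embed_flat(allele)
    # Binding affinity branch: peptide + allele only.
    aff_repr = mlp(np.concatenate([pep, alle], axis=1), aff_layers)
    aff_logit = aff_repr @ aff_out[0] + aff_out[1]
    # Presentation branch: peptide + flanks + allele, then reuse aff_repr.
    pres_hidden = mlp(np.concatenate([pep, flk, alle], axis=1), pres_layers)
    pres_repr = np.concatenate([pres_hidden, aff_repr], axis=1)
    pres_logit = pres_repr @ pres_out[0] + pres_out[1]
    return aff_logit, pres_logit

batch = 4
aff, pres = forward(
    rng.integers(0, VOCAB, (batch, PEP_LEN)),
    rng.integers(0, VOCAB, (batch, FLANK_LEN)),
    rng.integers(0, VOCAB, (batch, ALLELE_LEN)),
)
```

Feeding the affinity representation into the presentation branch is what lets the presentation head build on binding information while still seeing the flanking regions, which the affinity branch never receives.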
[0242] B. Model Architecture 2
[0243] The model consists of three components: 1. Sequence embedding; 2. Self-attention transformer layers; and 3. Prediction heads.
[0244] 1. Sequence Embedding. The model receives the following two sequences as input: 1) The amino acid sequence (the concatenation of the allele pseudo-sequence and the peptide with its flanking regions). 2) The segment identifier sequence, which "informs" the model which segment each amino acid belongs to (allele, peptide, context, and special token). Figure 12 shows the composition of the token and segment input sequences. Each token in the sequence is represented by the sum of its amino acid embedding, a learned positional embedding, and a segment embedding.
[0245]
[0246] Scheme 11
[0247] 2. Self-Attention Transformer Layers. The embedded sequence is processed by 12 consecutive transformer layers. Each transformer layer contains a multi-head self-attention module followed by a feedforward module (consisting of two linear layers with a GELU activation in between), as shown in Figure 13. Layer normalization is applied at the beginning of each module, and dropout with a rate of p=0.1 is applied at the end of each module, before the residual connection.
[0248] 3. Prediction Heads. The model comprises the following three prediction heads:
[0249] Masked language modeling: The final representation at each position of the sequence is fed into an LM prediction head, which consists of a linear layer followed by GELU activation and layer normalization, and an additional linear projection to the size of the token vocabulary whose weights are tied to the token embedding matrix, with an additional learned bias.
[0250] Binding affinity: The final representation at the <cls> token position is fed into a prediction head consisting of: linear + GELU + layer normalization + dropout + linear.
[0251] Presentation: The final representation at the <cls> token position, concatenated with the single binding affinity logit, is fed into a prediction head consisting of: linear + GELU + layer normalization + dropout + linear.
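The input construction for this architecture can be sketched as follows. The token order (`<cls>`, allele, `<sep>`, left flank, peptide, right flank, `<eos>`), the vocabulary, the segment numbering, and the embedding size are all assumptions made for illustration; Figure 12 is not reproduced in this text. Only the summation of token, positional, and segment embeddings is taken from the description.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<pad>", "<cls>", "<sep>", "<eos>", "<mask>"]   # assumed token set
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}
SEGMENTS = {"special": 0, "allele": 1, "peptide": 2, "context": 3}

def build_inputs(allele_pseudo, left_flank, peptide, right_flank):
    """Return parallel arrays of token ids and segment ids (assumed layout)."""
    parts = [
        (["<cls>"], "special"),
        (list(allele_pseudo), "allele"),
        (["<sep>"], "special"),
        (list(left_flank), "context"),
        (list(peptide), "peptide"),
        (list(right_flank), "context"),
        (["<eos>"], "special"),
    ]
    token_ids = [VOCAB[t] for tokens, _ in parts for t in tokens]
    segment_ids = [SEGMENTS[seg] for tokens, seg in parts for _ in tokens]
    return np.array(token_ids), np.array(segment_ids)

def embed(token_ids, segment_ids, tok_emb, pos_emb, seg_emb):
    """Sum of token, learned positional, and segment embeddings per position."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

rng = np.random.default_rng(0)
d_model, max_len = 8, 128                       # placeholder sizes
tok_emb = rng.normal(size=(len(VOCAB), d_model))
pos_emb = rng.normal(size=(max_len, d_model))
seg_emb = rng.normal(size=(len(SEGMENTS), d_model))

placeholder_allele = (AMINO_ACIDS * 3)[:49]     # stand-in 49-residue pseudo-sequence
token_ids, segment_ids = build_inputs(placeholder_allele, "AAAAA", "SIINFEKLL", "GGGGG")
x = embed(token_ids, segment_ids, tok_emb, pos_emb, seg_emb)
```

The resulting matrix `x` has one row per token and is what the first transformer layer would consume; the row at the `<cls>` position is the one the affinity and presentation heads read after the final layer.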
[0252] Example 2.6 Model Integration
[0253] The above neural network architecture was used to train the following two types of models: (1) pan-allele, which uses all training data; and (2) allele-specific, where the training data is divided by HLA allele. A separate allele-specific model was trained for each HLA allele with sufficient training data (the criterion being at least 1K binding affinity samples and 1K presentation samples in the dataset).
[0254] Both types have identical architectures and are trained similarly. They differ only in their training data, the list of alleles supported at inference time, and their weight initialization. While the pan-allele model is randomly initialized, the allele-specific models are fine-tuned from the best trained pan-allele model. At inference time, an ensemble of trained models is used, averaging the predictions for each sample across all models that support the allele of that sample.
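The inference-time averaging can be sketched as below. The `TrainedModel` wrapper and the convention that `supported_alleles=None` marks a pan-allele model are illustrative assumptions.

```python
class TrainedModel:
    """Wraps a prediction function with the set of alleles it supports."""

    def __init__(self, predict, supported_alleles=None):
        self.predict = predict
        self.supported_alleles = supported_alleles  # None means pan-allele

    def supports(self, allele):
        return self.supported_alleles is None or allele in self.supported_alleles

def ensemble_predict(models, peptide, allele):
    """Average predictions over every model that supports the given allele."""
    predictions = [m.predict(peptide, allele) for m in models if m.supports(allele)]
    if not predictions:
        raise ValueError(f"no model supports allele {allele}")
    return sum(predictions) / len(predictions)

# Stub predictors standing in for trained networks.
pan_model = TrainedModel(lambda peptide, allele: 0.4)
a0201_model = TrainedModel(lambda peptide, allele: 0.8, {"HLA-A*02:01"})
models = [pan_model, a0201_model]

both = ensemble_predict(models, "GILGFVFTL", "HLA-A*02:01")
pan_only = ensemble_predict(models, "GILGFVFTL", "HLA-B*07:02")
```

For an allele without a dedicated model, the ensemble degrades gracefully to the pan-allele prediction alone.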
[0255] Example 2.7 Model Selection
[0256] Early experiments showed that subsets of models, or even individual models, often outperform ensembles of many models. Therefore, we developed the following model selection procedure: 1. For each model configuration (a pan-allele model, and an allele-specific model for each allele with sufficient training data), we train 10 models with identical training settings, differing only in their folds (training-validation splits). After training, for each model configuration, we select the single best-performing model of the 10 trained models on the immunogenicity validation split, according to our evaluation scheme. 2. We apply per-allele stratified model selection, where for each allele (with immunogenicity data for performance validation), we select the best configuration from the following options: a) using only the allele-specific model; b) using only the pan-allele model; c) using an ensemble of the pan-allele model and the allele-specific model and averaging their predictions.
[0257] Example 2.8 Evaluation
[0258] To evaluate model performance, T-cell immunogenicity data were used to examine how well a given trained model ranks positive (immunogenic) peptide-HLA allele pairs above negative (non-immunogenic) peptide-HLA allele pairs. Specifically, the top 20 peptides were of interest, as this roughly corresponds to the number of peptides required to manufacture a given vaccine. For a given immunogenicity validation / test set, the model's predictions were extracted for all samples. With this in mind, three common ranking metrics focused on the top K items were used: Precision@K, nDCG@K, and reciprocal rank. The positive predictive value (PPV) metric used in previous work (such as O'Donnell et al. [13]) was also used. For each allele in the immunogenicity validation / test set, all corresponding peptides were ranked based on either the predicted cell surface peptide presentation probability or the predicted binding affinity score, and Precision@K, nDCG@K, reciprocal rank, and PPV were calculated for that allele.
[0259] After calculating these measures for each allele individually, we aggregated them across all HLA alleles by applying a weighted average, where each allele was weighted according to its frequency in the U.S. population. We used allele frequencies from the four largest ethnic groups in the U.S. [11]. The frequencies derived from these sources were further weighted according to the share of each ethnic group in the U.S. population [12]. Specifically, 0.54 was used for European whites, 0.22 for Hispanics, 0.17 for African Americans, and 0.07 for Asians. The motivation for weighting by allele frequency in the population was to make the measures capture (or at least better correlate with) the proportion of subjects that this approach could theoretically help. The population-based frequency weights of the HLA alleles in the immunogenicity evaluation are plotted in Figure 7. The final metrics, weighted Precision@K (WP@K), weighted nDCG@K (WnDCG@K), weighted reciprocal rank (WRR), and weighted positive predictive value (WPPV), are given by the following formulas:
[0260]
[0261] where
[0262]
[0263] Scheme 5
[0264] $rel_k$ is the ground-truth relevance of the k-th ranked item, which in our case is a binary immunogenicity label. $IDCG_K$ is the $DCG_K$ score of the ideal ranking based on the ground truth.
[0265]
[0266] Scheme 6
[0267] where $rank_i$ is the rank of the first positive sample among the ranked peptides of the i-th allele.
[0268]
[0269] Scheme 7
[0270] where $N_i$ is the number of positive samples for the i-th allele. In all of the metrics,
[0271]
[0272] Scheme 8
[0273] are the HLA allele frequency weights based on US census data, and the metric values are in the range [0,1] (higher is better). In practice, K=20 is used for the WnDCG@K metric, and for WP@K we use $K_i = \min(20, |\text{positive samples}_i|)$, because some alleles in our evaluation data had fewer than 20 positive samples.
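The per-allele metrics and their weighted aggregation can be sketched with the textbook definitions. The formulas of Schemes 5 to 8 are not reproduced in this text, so the exact forms here are assumptions: the $\log_2$ discount in DCG, PPV taken as the positive fraction among the top $N_i$ ranked peptides, and weights renormalized over the evaluated alleles.

```python
import math

POPULATION_SHARE = {            # ethnic group shares quoted in the text
    "European whites": 0.54,
    "Hispanics": 0.22,
    "African Americans": 0.17,
    "Asians": 0.07,
}

def allele_weight(freq_by_group):
    """Population-weighted frequency of one allele across the four groups."""
    return sum(POPULATION_SHARE[g] * freq_by_group[g] for g in POPULATION_SHARE)

def rank_labels(scores, labels):
    """Binary labels reordered from highest to lowest predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def precision_at_k(ranked, k):
    return sum(ranked[:k]) / k

def dcg_at_k(ranked, k):
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ranked[:k]))

def ndcg_at_k(ranked, k):
    ideal = dcg_at_k(sorted(ranked, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(ranked):
    for pos, rel in enumerate(ranked, start=1):
        if rel:
            return 1.0 / pos
    return 0.0

def ppv(ranked):
    n_pos = sum(ranked)                      # N_i, number of positives
    return sum(ranked[:n_pos]) / n_pos if n_pos else 0.0

def weighted_metric(per_allele_values, allele_weights):
    """Weighted average over alleles, renormalizing the weights (assumption)."""
    total = sum(allele_weights[a] for a in per_allele_values)
    return sum(allele_weights[a] * v for a, v in per_allele_values.items()) / total

ranked = rank_labels([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
k_i = min(20, sum(ranked))                   # K_i = min(20, |positive samples_i|)
p_at_k = precision_at_k(ranked, k_i)
```

Capping K at the number of positives keeps Precision@K able to reach 1 for alleles with few positive samples.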
[0274] While the PPV metric is informative, it does not focus on the top-ranked peptides, which is the region of greatest interest to us. The reciprocal rank metric uses the rank of the first positive peptide among the ranked peptides, with better ranks yielding higher scores. However, this metric ignores the presence of additional positive peptides among the top-ranked peptides. This is less than ideal for vaccine design, as we want to include as many positive peptides as possible (since not all presented peptides are immunogenic and generate the desired immune response). The Precision@K metric captures only the positivity rate among the top K peptides and does not consider their actual ranking within the top K (a first-ranked positive peptide receives exactly the same score as a K-th ranked positive peptide, which is less than ideal). The nDCG@K metric considers both the positivity rate within the top K and the corresponding ranks. When the number of positives is below 20, normalizing by the DCG of the ideal ranking also removes the need to cap K and yields values in the [0,1] range. However, these scores are still less interpretable and intuitive than the Precision@K metric.
[0275] Example 3.1 Results
[0276] Unless explicitly stated otherwise, we train all models with the ADAM optimizer, using a learning rate of 0.001 and a batch size of 256. We set α=0.5 as the loss term coefficient, giving equal weight to both loss terms. Training is stopped early after 20 epochs without improvement in the validation loss, evaluated on the held-out validation set. A small L1 regularization factor of 1e-9 and a dropout rate of 0.5 are applied. Negative sampling and target inference are applied for all models. For allele-specific models, we do not want to replace the alleles of positive samples with alleles from different supertypes, because we are only interested in samples of one specific allele. We therefore reverse the procedure: before filtering out all data from other alleles, we retain a pool of positive samples whose alleles belong to other supertypes (other than the supertype of the specific allele we are training on), randomly sample positive samples from this pool, and replace their "external" alleles with the allele we are currently training on. For the final predictor, an ensemble of the following model types is trained: 1. pan-allele models; and 2. allele-specific models. The allele-specific models are trained only for the subset of alleles with sufficient training data, specifically alleles with at least 1K binding affinity training samples and 1K presentation training samples.
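The early stopping rule can be sketched independently of any training framework. The `run_epoch` callback (assumed to train one epoch and return the validation loss), the `max_epochs` cap, and the toy loss curve are hypothetical.

```python
def train_with_early_stopping(run_epoch, max_epochs=1000, patience=20):
    """Stop once `patience` epochs pass without a new best validation loss.

    Returns the epoch index and value of the best validation loss seen.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss

# Toy validation curve: improves until epoch 5, then gets steadily worse.
epochs_run = []

def toy_epoch(epoch):
    epochs_run.append(epoch)
    return abs(epoch - 5) + 1.0

best_epoch, best_loss = train_with_early_stopping(toy_epoch)
```

In a real training loop one would also checkpoint the weights at `best_epoch`, since the weights at the stopping epoch are 20 epochs past the best validation loss.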
[0277] For each type, 10 similar models were trained on different training-validation splits, and the model that performed best on the immunogenicity validation set was selected. Per-allele stratified model selection was applied to determine which model to use for each allele during inference.
[0278] To verify whether multi-task training is beneficial in our setup, the effect of the loss weight coefficient α was explored. Several pan-allele model experiments were run three times each with identical settings except for this hyperparameter, and the mean WP@K on the immunogenicity test split was reported. Results obtained when ranking by predicted affinity and when ranking by predicted presentation were both plotted to capture the full effect. Figure 7 clearly shows that α=0 and α=1, which correspond to presentation-only training and affinity-only training respectively, give worse results than the intermediate values, which correspond to joint training with a weighted combination of the two objectives.
[0279] To verify our hypothesis that presentation probability is a good indicator for estimating immunogenicity, we plotted a histogram of binned presentation predictions against the fraction of positive immunogenic samples in each bin. The positive slope shown in Figure 8A confirms that $dP_{\text{immunogenicity}} / dP_{\text{presentation}} > 0$, meaning that, on average, increasing the presentation probability of a peptide increases the chance that the peptide is immunogenic. For comparison, we also explored the equivalent behavior for binding affinity predictions and observed a similar pattern in Figure 8B. We also compared the performance of our model with other state-of-the-art predictors.
[0280] Specifically, we compared the performance with that of MHCflurry-2.0 from O'Donnell et al. [13] and its earlier versions. The comparisons were made using our evaluation scheme and metrics, and the reported figures were calculated on the immunogenicity test split. As can be seen from Table 1, our best predictor significantly outperforms the MHCflurry-2.0 predictor and all variants of previous versions. We can also see that our single pan-allele model outperforms the ensemble of allele-specific models (which makes sense, since the latter do not support predictions for all alleles), but combining both through per-allele stratified model selection provides additional performance improvements.
[0281] Table 1. Evaluation results on the proposed immunogenicity benchmark: performance comparison between our best predictor and the MHCflurry predictors. The reported metrics are weighted Precision@K (WP@K), weighted nDCG@K (WnDCG@K), weighted reciprocal rank (WRR), and weighted positive predictive value (WPPV); higher values are better.
[0282]
[0283] We further compared the per-allele performance (for all alleles with immunogenicity data available for evaluation) between our best predictor, "pan-+ allele-specific presentation predictor," and the best predictor of MHCflurry-2.0, "pan-(+ms) affinity predictor." The results are presented in Table 2.
[0284] Table 2. Per-allele performance of the best predictor described herein compared to the MHCflurry-2.0 baseline on the proposed immunogenicity benchmark. The reported metrics are Precision@K (P@K), nDCG@20, reciprocal rank (RR), and positive predictive value (PPV); higher values are better for all of them.
[0285]
[0286]
[0287]
[0288] References cited in the embodiments
[0289]
[0290]
[0291] 6. Equivalents
[0292] It will be apparent to those skilled in the art that other suitable modifications and adaptations to the methods of the invention described herein are readily available and can be made using suitable equivalents without departing from the scope of this disclosure or embodiments. Certain compositions and methods have now been described in detail, and the same compositions and methods will be more clearly understood by referring to the following examples, which are introduced for illustrative purposes only and not for limitation.
Claims
1. A method for predicting the MHC class I immunogenicity of tumor-specific neoantigens, the method comprising: a) obtaining the peptide sequence of a tumor-specific neoantigen and the corresponding flanking region of the peptide sequence; and encoding the peptide sequence and the flanking region into numerical vectors, wherein each numerical vector encodes the amino acid residues of the peptide of the tumor-specific neoantigen and the amino acid residues of the flanking region, as well as the positions of the amino acid residues; b) obtaining HLA allele pseudo-sequences, wherein the HLA allele pseudo-sequences represent HLA alleles; and encoding the HLA allele pseudo-sequences into corresponding numerical vectors; c) using a neural network model to jointly predict, for each tumor-specific neoantigen, the MHC class I binding affinity of the tumor-specific neoantigen and the numerical probability that the corresponding peptide will be presented on the cell surface by MHC class I proteins, wherein: (i) the neural network model is trained on a training dataset to optimize the performance of the neural network model; wherein the training dataset includes a peptide-MHC class I affinity measurement dataset and a cell surface peptide presentation dataset; wherein the training dataset includes positive training data and negative training data; wherein the negative training data is generated by replacing the HLA allele of the positive training data with a different HLA allele from a supertype to which the HLA allele of the positive training data does not belong; (ii) the neural network model comprises an input layer comprising a numerical vector containing the peptide sequence of the tumor-specific neoantigen and the flanking region, and a numerical vector containing the HLA allele pseudo-sequence; (iii) the numerical vector containing the peptide sequence of the tumor-specific neoantigen and the flanking region and the numerical vector containing the HLA allele pseudo-sequence are encoded by an amino acid embedding layer; (iv) the output of the amino acid embedding layer is flattened to generate a numerical vector representation of each of the peptide sequence of the tumor-specific neoantigen, the flanking region of the peptide sequence, and the HLA allele pseudo-sequence; (v) the MHC class I binding affinity of the tumor-specific neoantigen is predicted by concatenating the peptide sequence of the tumor-specific neoantigen and the HLA allele pseudo-sequence and applying one or more layers and / or one or more activation functions, wherein the output is a numerical score representing the MHC class I binding affinity of the tumor-specific neoantigen; and (vi) the probability that the tumor-specific neoantigen will be presented on the cell surface by MHC class I proteins is predicted by concatenating the peptide sequence, the flanking region of the peptide sequence, and the HLA allele pseudo-sequence into a single numerical vector and applying one or more layers and / or one or more activation functions, wherein the output is the numerical probability that the peptide will be presented on the cell surface by MHC class I proteins; wherein the MHC class I binding affinity of the tumor-specific neoantigen and the numerical probability of the tumor-specific neoantigen being presented on the cell surface by the MHC class I protein are indicators of the MHC class I immunogenicity of the tumor-specific neoantigen.
2. The method of claim 1, further comprising validating the neural network model by: (i) applying one or more ranking metrics to an immunogenicity validation dataset; (ii) ranking the peptides of each allele in the immunogenicity validation dataset based on the predicted MHC class I binding affinity of the peptides and the numerical probability that the peptides will be presented on the cell surface by MHC class I proteins; and (iii) aggregating the one or more ranking metrics across all alleles.
3. The method of claim 2, wherein the one or more ranking metrics are aggregated by using weighted allele frequencies.
4. The method according to any one of claims 1-3, wherein the neural network model is a pan-allele model, an allele-specific model, a supertype-specific model, or a combination thereof.
5. The method according to any one of claims 1-3, wherein the length of the HLA pseudo-sequence is 30 to 60 amino acids.
6. The method according to any one of claims 1-3, wherein the length of the peptide sequence of the tumor-specific neoantigen is from 8 to 15 amino acids.
7. The method according to any one of claims 1-3, wherein the flanking region is located directly to the left of the tumor-specific neoantigen peptide sequence and / or directly to the right of the tumor-specific neoantigen peptide sequence.
8. The method of any one of claims 1-3, wherein the length of the flanking region is 10 amino acids.
9. The method of claim 8, wherein the length of the flanking region directly to the left of the tumor-specific neoantigen is 5 amino acids.
10. The method of claim 8, wherein the length of the flanking region directly to the right of the tumor-specific neoantigen is 5 amino acids.
11. The method of any one of claims 1-3, wherein the method further comprises calibrating the neural network model.
12. The method of any one of claims 1-3, wherein the negative training data further comprises human rhinovirus (HRV) negative sampling of peptides.
13. The method of any one of claims 1-3, wherein the negative training data comprises peptides that do not have the binding affinity for the tumor-specific neoantigen MHC class I and / or are not presented on the cell surface by the MHC class I protein.
14. The method according to any one of claims 1-3, wherein the HLA allele is HLA type A, B or C.
15. The method according to any one of claims 1-3, wherein the tumor-specific neoantigen MHC class I immunogenicity is CD8+ T cell immunogenicity.
16. The method of any one of claims 1-3, wherein one or more tumor-specific neoantigens predicted to be MHC class I immunogenic are selected for the immunogenic composition.
17. The method of claim 16, wherein at least 20 tumor-specific neoantigens are selected for the immunogenic composition.
18. The method of any one of claims 1-3, wherein the one or more layers are fully connected layers.
19. The method of any one of claims 1-3, wherein the one or more layers are dropout layers.
20. The method of any one of claims 1-3, wherein the one or more layers and / or activation functions comprise applying one or more fully connected layers, applying dropout layers, and applying activation functions.
21. The method of claim 11, wherein the neural network model is calibrated using probability calculation.
22. The method of claim 21, wherein the probability calculation estimates the overall presentation probability of the subject's alleles.