Methods, systems, and computer program products for determining the likelihood of presentation of neoantigens
By combining deep learning models with mass spectrometry data, the presentation probability of neoantigens on HLA alleles is predicted, solving the problem that existing technologies cannot accurately predict neoantigen presentation, and enabling efficient identification of neoantigens and the development of personalized cancer vaccines.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MYNEO NV
- Filing Date
- 2021-07-12
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies cannot effectively predict the likelihood of neoantigen presentation on the surface of cancer cells, especially for untrained HLA alleles, resulting in low positive predictive values and an inability to accurately model the entire cell surface presentation process.
Using deep learning models, including deep semantic similarity models, convolutional deep semantic similarity models, and recurrent deep semantic similarity models, combined with mass spectrometry data, a model is trained to predict the presentation probability of neoantigens on HLA alleles. By comparing exome and transcriptome data of tumor and normal cells, aberrant genomic events are identified and peptide sequences of neoantigen sets are generated.
It can accurately predict the presentation probability of neoantigens on any HLA allele, improving the accuracy and comprehensiveness of neoantigen identification, and is suitable for the development of personalized cancer vaccines.
Figure CN115836350B_ABST
Description
Technical Field
[0001] The present invention relates to a computer-implemented method, computer system, and computer program product for determining the presentation probability of a neoantigen.
Background Technology
[0002] In addition to normal epitopes, cancer cells may present on their surface neoantigens derived from aberrant genomic events, which can be recognized by T cells.
[0003] Neoantigens are newly formed antigens that have not previously been recognized by the immune system. In recent years, targeting these neoantigens has proven to be a very promising approach to personalized medicine.
[0004] New technological advancements have increased the availability of mass spectrometry-derived lists of peptides that actually bind to major histocompatibility complex (MHC) molecules on the cell surface. These lists are known as "ligandomes." Existing neoantigen discovery methods begin by generating a list of all potential neoantigens produced by cancer cells and rely on computer prediction algorithms to extract epitopes most likely to be presented on the cell surface and potentially elicit an immune response.
[0005] WO 2017106638 describes a method for identifying one or more neoantigens that are likely to be presented on the surface of tumor cells of a subject. Furthermore, this document discloses systems and methods for obtaining high-quality sequencing data from tumors and for identifying somatic variations in polymorphic genomic data. Finally, WO '638 describes a unique cancer vaccine.
[0006] US 20190311781 describes methods for identifying peptides that include features associated with successful cellular processing, transport, and MHC presentation using machine learning algorithms or statistical inference models. US 20180085447 describes methods for identifying immunogenic mutant peptides with therapeutic potential as cancer vaccines, and more specifically methods for identifying novel T-cell-activating epitopes from all genetically altered proteins. These mutant proteins give rise to new epitopes after proteolytic degradation within antigen-presenting cells.
[0007] EP 3256853 describes methods for predicting T-cell epitopes for vaccination. In particular, the document relates to methods for predicting whether modifications of peptides or polypeptides (such as tumor-associated neoantigens) are immunogenic (especially for vaccination), or methods for predicting which of these modifications are most immunogenic (especially most suitable for vaccination).
[0008] Several other tools are available for solving the same problem, such as NetMHCpan and MHCflurry. These tools predict the binding affinity of a peptide for a given HLA allele. Other methods, such as EDGE and MARIA, also output learned presentation probabilities, but they do not consider the HLA sequence and instead encode the HLA type as a categorical variable.
[0009] Furthermore, the initial prediction methods used the binding affinity of candidate neoantigens to MHC as an indicator of the likelihood of presentation on the cell surface. However, these methods cannot model the entire cell surface presentation process and therefore have low positive predictive values. Additionally, these methods cannot predict the presentation probability of neoepitopes for HLA molecules that were not included in model training.
[0010] The purpose of this invention is to provide a solution to at least some of the above-mentioned drawbacks, as well as an improvement on the prior art.
Summary of the Invention
[0011] In a first aspect, the present invention relates to a computer-implemented method for determining the presentation probability of a set of neoantigens by tumor cells of a subject's tumor.
[0012] In a second aspect, the present invention relates to a computer system for determining the presentation probability of a set of neoantigens by tumor cells of a subject's tumor.
[0013] In a third aspect, the present invention relates to a computer program product for determining the presentation probability of a set of neoantigens by tumor cells of a subject's tumor.
[0014] In a fourth aspect, the present invention relates to the use of the method, system, or product for determining a treatment for a subject.
[0015] The objective of this invention is to predict the likelihood that neoepitopes of variable length are presented on the surface of cancer cells by the set of HLA alleles expressed by those cells. For this purpose, a deep learning model is used.
[0016] The present invention is advantageous because the presentation probability of a neoepitope for any HLA allele can be predicted, even if the model was not trained on that HLA allele.
Attached Figure Description
[0017] Figure 1 shows precision-recall curves obtained by testing the model according to the invention on a test dataset. Figure 1A shows a performance comparison of the model according to the invention with the existing algorithms EDGE and MHCflurry when tested on the same test dataset. Figure 1B illustrates the predictive power of the model according to the invention when tested on a new dataset.
Detailed Implementation
[0018] In a first aspect, the present invention relates to a computer-implemented method for determining the presentation probability of a neoantigen set. In second and third aspects, the present invention relates to a computer system and computer program product. In a fourth aspect, the present invention relates to the use of any method, system, or product for determining treatment for a subject. Hereinafter, the invention will be described in detail, preferred embodiments will be discussed, and the invention will be illustrated by non-limiting examples.
[0019] Unless otherwise defined, all terms (including technical and scientific terms) used in disclosing this invention have the meanings commonly understood by one of ordinary skill in the art to which this invention pertains. Further guidance, including definitions of terms used in the specification, is provided to better understand the teachings of this invention. The terms or definitions used herein are for illustrative purposes only.
[0020] As used herein, the following terms have the following meanings:
[0021] As used herein, “a,” “an,” and “the” refer to both singular and plural referents unless the context clearly indicates otherwise. For example, “a compartment” means one or more compartments.
[0022] As used herein, “comprising” and “comprised of” are synonymous with “including” and “containing”, and are inclusive or open-ended terms that specify the presence of what follows (e.g., components) and do not exclude or preclude the presence of other unlisted components, features, elements, members, or steps known in the art or disclosed herein.
[0023] The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within that range, as well as the recited endpoints. All percentages should be understood as weight percentages unless otherwise defined or unless a different meaning would be obvious to those skilled in the art from their use and the context of their use. Unless otherwise defined, the expressions “weight%”, “weight percentage”, “%wt” or “wt%” throughout this document and the specification mean the relative weight of each component to the total weight of the formulation.
[0024] Although the terms “one or more” or “at least one” (such as one or more or at least one member of a group of members) are clear on their own, by further example, the term in particular covers reference to any one of the members or to any two or more of the members, for example, any ≥3, ≥4, ≥5, ≥6 or ≥7 of the members, and up to all the members.
[0025] Unless otherwise defined, all terms used in this disclosure (including technical and scientific terms) have the meanings commonly understood by one of ordinary skill in the art to which this invention pertains. Further guidance, including definitions of terms used in the specification, is provided to better understand the teachings of this invention. The terms or definitions used herein are for illustrative purposes only.
[0026] Throughout this specification, references to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in at least one embodiment of the invention. Therefore, the phrases "in one embodiment" or "in an embodiment" appearing throughout this specification do not necessarily all refer to the same embodiment, although they may. Furthermore, in one or more embodiments, particular features, structures, or characteristics may be combined in any suitable manner, as will be apparent to those skilled in the art from this disclosure. Moreover, while some embodiments described herein include some features but not other features included in other embodiments, combinations of features from different embodiments are intended to be within the scope of the invention and form different embodiments, as will be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments may be used in any combination.
[0027] Furthermore, the terms first, second, third, etc., used in the specification and claims are used to distinguish similar elements and are not necessarily used to describe order or chronological order, unless specifically stated otherwise. It should be understood that such terms are interchangeable where appropriate, and that embodiments of the invention described herein can operate in a different order than those described or shown herein.
[0028] In a first aspect, the present invention relates to a computer-implemented method for determining the presentation probability of a neoantigen set by tumor cells of a subject's tumor. The method preferably includes the step of obtaining at least one of exome or whole-genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells and normal cells associated with the subject's tumor. The method preferably further includes the step of obtaining a set of aberrant genomic events associated with the tumor by comparing exome and / or whole-genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells with exome and / or whole-genome nucleotide sequencing data and transcriptome nucleotide sequencing data from normal cells. The method preferably further includes the step of obtaining data representing peptide sequences of individual neoantigens in a set of neoantigens identified at least partially based on the set of aberrant events, wherein the peptide sequences of each neoantigen include at least one alteration that distinguishes them from corresponding wild-type peptide sequences identified from the subject's normal cells. The method preferably further includes the step of obtaining data representing HLA peptide sequences based on tumor exome and / or whole-genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells. 
The method preferably further includes the step of training a deep learning model on a training dataset comprising a positive dataset, wherein the positive dataset comprises multiple input-output pairs, wherein each input-output pair includes, as input, an entry of an epitope sequence identified or inferred from a surface-bound or secreted HLA / peptide complex encoded by a corresponding HLA allele expressed by training cells, and wherein each input-output pair also includes, as output, an entry of a peptide sequence of the α-chain encoded by the corresponding HLA allele. The method preferably further includes the step of determining, using the trained model, the presentation probability of each neoantigen in the neoantigen set for the HLA peptide sequence.
[0029] In a second aspect, the present invention relates to a computer system for determining the presentation probability of a set of neoantigens by tumor cells of a subject's tumor. The computer system is configured to perform the computer-implemented method according to the first aspect of the invention.
[0030] In a third aspect, the present invention relates to a computer program product for determining the presentation probability of a set of neoantigens by tumor cells of a subject's tumor. The computer program product includes instructions that, when executed by a computer, cause the computer to perform the method according to the first aspect of the invention.
[0031] In a fourth aspect, the present invention relates to the use of a method according to a first aspect of the invention and / or a computer system according to a second aspect of the invention and / or a computer program product according to a third aspect of the invention for determining treatment of a subject.
[0032] This invention provides a computer-implemented method, computer system, and computer program product for determining the presentation probability of a neoantigen by tumor cells of a subject's tumor, and the use of any of said method, system, or product for determining treatment for said subject. Those skilled in the art will understand that the method is implemented in a computer program product and executed using a computer system. Those skilled in the art will also appreciate that the presentation probability of a neoantigen set can be used to determine treatment for a subject. Hereinafter, all four aspects of the invention are therefore treated together.
[0033] As used herein, “subject” refers to a term known in the prior art, and should preferably be understood as a human or animal body, most preferably as a human body. As used herein, “animal” preferably refers to a vertebrate, more preferably to birds and mammals, and even more preferably to mammals. As used herein, “subject in need” should be understood as a subject who will benefit from the treatment.
[0034] A preferred embodiment of the present invention preferably provides: obtaining at least one of exome or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells and normal cells associated with the tumor of a subject. A preferred embodiment also preferably provides: obtaining a set of aberrant genomic events associated with the tumor by comparing exome and / or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells with exome and / or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from normal cells. Evidently, exome, whole genome, and transcriptome nucleotide sequencing data are each compared with sequencing data of the same type.
[0035] As used herein, “neoepitope” refers to a term known in the art and should preferably be understood as a class of major histocompatibility complex (MHC)-binding peptides generated by tumor-specific mutations. These peptides represent antigenic determinants of neoantigens. Neoepitopes are recognized by the immune system as targets of T cells and can trigger an immune response against cancer.
[0036] As used herein, “neoantigen” refers to a term known in the prior art and should preferably be understood as an antigen having at least one alteration that distinguishes it from the most closely related wild-type antigen (i.e., the corresponding wild-type sequence), for example by tumor cell mutation, tumor cell-specific post-translational modification, fusion, transposon insertion, alternative splicing events, or any alteration known to those skilled in the art. Furthermore, a neoantigen may or may not include polypeptide or nucleotide sequences.
[0037] Preferably, the set of aberrant genomic events includes one or more single nucleotide polymorphisms (SNPs), indel (insertion / deletion) mutations, gene fusions, chromosomal rearrangements (such as inversions, translocations, duplications, or chromothripsis), transposon insertions, or alternative splicing events. In the context of this specification, the term "indel" should be understood as a molecular biology term used to describe the insertion or deletion of one or more nucleotides in the genome of an organism. Furthermore, in the context of this specification, the term "SNP" or "single nucleotide polymorphism" refers to a substitution of a single nucleotide occurring at a specific location in the genome of an organism.
[0038] This invention can utilize, or not utilize, input peptides or novel epitope sequences generated by a novel epitope discovery pipeline, starting with raw sequencing data from a subject (preferably a patient). This raw sequencing data includes at least tumor DNA, preferably tumor DNA generated from a biopsy. Preferably, the raw data also includes tumor RNA, more preferably tumor RNA generated from a biopsy. Preferably, the raw data also includes normal DNA generated from a sample (preferably a blood sample) from the subject. Preferably, the raw data also includes normal RNA generated from a sample (preferably a blood sample) from the subject.
[0039] As used herein, “sample” refers to a term known in the prior art and should preferably be understood as a single cell or multiple cells or cell fragments or body fluids taken from a subject by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspiration, irrigation, scraping, surgical incision or intervention or any other means known in the art.
[0040] The novel epitope discovery pipeline outputs a list of all genomic and transcriptomic alteration events occurring within tumors. These “abnormal genomic events” include novel transposon insertion events, novel RNA isoforms, novel gene fusions, novel RNA editing events, and novel nucleotide-based post-translational modifications on the resulting proteins. Furthermore, it detects single nucleotide polymorphisms (SNPs) and indels (localized insertion or deletion mutations) at both RNA and DNA levels and processes the results of both analyses to produce a list of high-confidence SNPs and indels.
[0041] According to a preferred embodiment, a confidence score is associated with each aberrant genomic event in the set of aberrant genomic events, based at least in part on the number of sequencing reads supporting that event. Preferably, the confidence score is also based at least in part on the prevalence in the genome of the sequencing data supporting that event. A preferred embodiment further includes obtaining a subset of aberrant genomic events by comparing the confidence score of each aberrant genomic event in the set to a threshold, wherein an event is added to the subset if its confidence score exceeds the threshold. According to this preferred embodiment, the set of neoantigens is identified at least in part based on this subset of aberrant events. Events with high confidence scores are supported by a high number of sequencing reads and are prevalent in the genome, and are therefore selected for further investigation. As a result, performance is improved.
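The thresholding step described in this paragraph can be illustrated with a minimal sketch. The specification does not give a formula for the confidence score, so the rule below (read support capped at 1.0, multiplied by prevalence), the `AberrantEvent` record, the `read_saturation` parameter, and the example values are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AberrantEvent:
    """Hypothetical record for one aberrant genomic event."""
    event_id: str
    supporting_reads: int  # number of sequencing reads supporting the event
    prevalence: float      # assumed to be a fraction in [0, 1]


def confidence_score(event: AberrantEvent, read_saturation: int = 20) -> float:
    """Assumed scoring rule: capped read support multiplied by prevalence."""
    read_support = min(event.supporting_reads / read_saturation, 1.0)
    return read_support * event.prevalence


def select_confident_events(events, threshold: float):
    """Keep only the events whose confidence score exceeds the threshold."""
    return [e for e in events if confidence_score(e) > threshold]


# Toy events with made-up values.
events = [
    AberrantEvent("snp_1", supporting_reads=40, prevalence=0.45),
    AberrantEvent("indel_1", supporting_reads=3, prevalence=0.05),
    AberrantEvent("fusion_1", supporting_reads=12, prevalence=0.30),
]
confident = select_confident_events(events, threshold=0.15)
print([e.event_id for e in confident])
```

Only the events that pass the threshold would then be carried forward to neoantigen identification.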
[0042] It should be noted that this invention will not function if the input sequence includes non-standard amino acids. In the context of this specification, the term "non-standard amino acid" should be understood as a non-standard or non-coded amino acid, that is, one that is not naturally encoded or found in the genetic code of any organism.
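A simple input check can enforce this restriction before peptides reach the model. The sketch below assumes that "standard" means the 20 canonical amino acids in one-letter code; the function names and example peptides are hypothetical and not taken from the specification.

```python
# Assumption: the accepted alphabet is the 20 canonical one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_amino_acids(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 codes."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


def split_valid_invalid(peptides):
    """Separate peptides the model can accept from those it must reject."""
    valid, rejected = [], []
    for peptide in peptides:
        if has_only_standard_amino_acids(peptide):
            valid.append(peptide)
        else:
            rejected.append(peptide)
    return valid, rejected


# Example peptides; "X" and "B" are not among the 20 canonical codes.
valid, rejected = split_valid_invalid(
    ["SIINFEKL", "GILGFVFTL", "SIINXEKL", "SIINBEKL"]
)
print(valid, rejected)
```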
[0043] A preferred embodiment of the present invention provides data representing HLA peptide sequences based on tumor exome and / or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells. Therefore, the same genomic data used to identify this set of neoantigens is used to assess the HLA composition of tumor biopsies. Preferably, the present invention provides data representing peptide sequences of individual HLAs in an HLA set based on tumor exome and / or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from tumor cells.
[0044] As used herein, "human leukocyte antigen (HLA)" refers to a term known in the art and should preferably be understood as a gene complex encoding proteins of the human major histocompatibility complex (MHC). These cell surface proteins are responsible for regulating the human immune system. HLA genes are highly polymorphic, meaning they may have many different alleles, which allows them to fine-tune a subject's adaptive immune system. In the context of this specification, the terms "HLA binding affinity" or "MHC binding affinity" should be understood as the binding affinity between a specific antigen and a specific MHC allele. In the context of this specification, the term "HLA type" should be understood as the complement of HLA gene alleles.
[0045] A simplified embodiment of the present invention preferably provides training a deep learning model on a training dataset. The training dataset preferably includes a positive dataset. The positive dataset preferably includes multiple input-output pairs. Each input-output pair preferably includes an entry for an epitope sequence as input. The epitope sequence is preferably identified or inferred from a surface-bound or secreted HLA / peptide complex encoded by the corresponding HLA allele expressed by the training cells. Each input-output pair preferably also includes an entry for a peptide sequence of the α-chain encoded by the corresponding HLA allele as output.
[0046] The term "training cells" as used herein should preferably be understood as cells from which samples are obtained, wherein the samples are used to obtain the input and output of input-output pairs in a positive dataset. Training cells may or may not be obtained from cells of a monoallelic cell line (such as a human cell line) or from cells of a multiallelic tissue (such as human tissue).
[0047] According to a preferred embodiment, each positive input consists of the sequence of an epitope of 8 to 15 amino acids that has been shown to be presented on the cell surface. Each associated positive output consists of the concatenated amino acid sequence (up to 71 amino acids) of the α-chains of the HLA alleles expressed by the cell in the same dataset.
[0048] According to a preferred embodiment, the epitope sequence of the input in each input-output pair of the positive dataset is obtained by mass spectrometry. In another or further embodiment, the peptide sequence of the α-chain encoded by the corresponding HLA allele of the output in each input-output pair of the positive dataset is obtained by mass spectrometry.
[0049] In embodiments of the invention, positive input-output pairs can be assigned different weights, preferably based on their frequency of occurrence in the mass spectrometry data used to construct the positive training set. The weights modify the impact of the input-output pairs on the training of the deep learning model. When training the model with the input-output pairs, larger weights will result in larger adjustments to the parameters associated with the deep learning model, as explained further below.
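As a minimal sketch of frequency-based weighting, the snippet below counts how often each input-output pair is observed and scales the counts so that the average weight is 1.0. The specification only states that weights are preferably based on frequency of occurrence; the normalization rule, the function name, and the placeholder data are assumptions.

```python
from collections import Counter


def frequency_weights(observations):
    """Map each distinct (epitope, hla_sequence) pair to a training weight.

    Assumed rule: weight = observation count / mean count, so the average
    weight over distinct pairs is 1.0.
    """
    counts = Counter(observations)
    if not counts:
        return {}
    mean_count = sum(counts.values()) / len(counts)
    return {pair: count / mean_count for pair, count in counts.items()}


# Toy observations: example epitopes paired with a placeholder HLA string.
observations = [
    ("SIINFEKL", "HLA_SEQ_A"),
    ("SIINFEKL", "HLA_SEQ_A"),
    ("SIINFEKL", "HLA_SEQ_A"),
    ("GILGFVFTL", "HLA_SEQ_A"),
]
weights = frequency_weights(observations)
print(weights)
```

A pair seen more often than average receives a weight above 1.0 and therefore contributes more to each parameter update.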
[0050] According to another preferred embodiment, the training dataset for training the deep learning model further includes a negative dataset. The negative dataset preferably includes multiple input-output pairs. Each input-output pair preferably includes an entry for a peptide sequence as input. The peptide sequence is preferably a random sequence from the human proteome. Each input-output pair preferably also includes a peptide sequence encoded by a random HLA allele as output.
[0051] According to a preferred embodiment, each negative input is a random sequence from the human proteome that is not present in any ligandome dataset. The input is a random sequence consisting of 8 to 15 amino acids. Each associated output is a concatenation of the α-chain sequences of a random set of HLA alleles present in the positive dataset.
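Construction of such a negative dataset could look roughly like the following sketch. The function name, its parameters, the retry limit, and all sequences are hypothetical; the proteome strings and α-chain strings are placeholders rather than real protein or HLA sequences.

```python
import random


def sample_negative_pairs(proteome, presented, hla_alpha_seqs, n_pairs,
                          alleles_per_pair=1, min_len=8, max_len=15, seed=0):
    """Draw (random peptide, concatenated HLA alpha-chain sequence) pairs."""
    rng = random.Random(seed)
    proteins = [p for p in proteome if len(p) >= min_len]
    if not proteins:
        raise ValueError("no protein is long enough to sample from")
    allele_names = sorted(hla_alpha_seqs)
    pairs = []
    attempts = 0
    while len(pairs) < n_pairs:
        attempts += 1
        if attempts > 100 * n_pairs:
            raise RuntimeError("could not find enough unpresented peptides")
        protein = rng.choice(proteins)
        length = rng.randint(min_len, min(max_len, len(protein)))
        start = rng.randint(0, len(protein) - length)
        peptide = protein[start:start + length]
        if peptide in presented:
            continue  # skip peptides observed in the ligandome data
        chosen = rng.sample(allele_names, alleles_per_pair)
        output = "".join(hla_alpha_seqs[name] for name in chosen)
        pairs.append((peptide, output))
    return pairs


# Placeholder data, not real protein or HLA sequences.
proteome = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
            "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"]
presented = {"KTAYIAKQR"}
hla_alpha_seqs = {"allele_A": "ACDEFGHIKL", "allele_B": "MNPQRSTVWY"}
pairs = sample_negative_pairs(proteome, presented, hla_alpha_seqs,
                              n_pairs=5, alleles_per_pair=2)
print(pairs)
```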
[0052] As used herein, "proteome" refers to a term known in the prior art, and should preferably be understood as the entire collection of proteins expressed or capable of being expressed by a genome, cell, tissue, or organism at a specific time. It is the collection of proteins expressed in a given type of cell or organism under given conditions at a given time. "Proteomics" is the study of the proteome.
[0053] Preferably, a portion (preferably the majority) of the input-output pairs of the positive dataset (more preferably both positive and negative datasets) is used to train the deep learning model. Preferably, a portion (preferably a minority) of the input-output pairs of the positive dataset (more preferably both positive and negative datasets) is used to validate the trained deep learning model.
[0054] The ratio between the number of positive and negative input-output pairs used to train a deep learning model may or may not be varied. This ratio is an important parameter for model training.
[0055] The ratio between the number of positive and negative input-output pairs used to validate a deep learning model may or may not be varied. This ratio is an important parameter for validating the model.
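The split into training and validation portions and the choice of a positive-to-negative ratio can be sketched as follows. The 25% validation fraction, the ratio of three negatives per positive, and the function names are illustrative assumptions; the specification does not fix these values.

```python
import random


def mix_at_ratio(positives, negatives, negatives_per_positive, seed=0):
    """Label positives 1 and a sampled subset of negatives 0, then shuffle."""
    n_neg = int(len(positives) * negatives_per_positive)
    if n_neg > len(negatives):
        raise ValueError("not enough negative pairs for the requested ratio")
    rng = random.Random(seed)
    labelled = [(pair, 1) for pair in positives]
    labelled += [(pair, 0) for pair in rng.sample(negatives, n_neg)]
    rng.shuffle(labelled)
    return labelled


def split_train_validation(examples, validation_fraction=0.25, seed=0):
    """Shuffle and split so that the larger part is used for training."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder pair identifiers standing in for input-output pairs.
positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(50)]
dataset = mix_at_ratio(positives, negatives, negatives_per_positive=3)
train, validation = split_train_validation(dataset)
print(len(train), len(validation))
```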
[0056] According to a preferred embodiment, the positive dataset includes a monoallelic dataset and a multiallelic dataset. The monoallelic dataset preferably includes input-output pairs obtained from training cells from monoallelic cell lines. The multiallelic dataset preferably includes input-output pairs obtained from training cells from multiallelic tissues. The training cells obtained from monoallelic cell lines are preferably cells obtained from monoallelic human cell lines. The training cells obtained from multiallelic tissues are preferably cells obtained from human tissues. Multiallelic human tissues may or may not be healthy or cancerous.
[0057] As used herein, “monoallelic” is a term known in the prior art and should preferably be understood as a situation in which only one allele is present at a locus or gene site in a population.
[0058] As used herein, “multiallelic” refers to a term known in the art and should preferably be understood as the situation where many alleles are present. A polymorphism with many alleles is “multiallelic”, also known as “polyallelic”.
[0059] According to a preferred embodiment, training the deep learning model includes two or more training cycles. Each training cycle preferably includes multiple training steps. Each training step preferably includes processing one input-output pair from multiple input-output pairs. Preferably, one of the two or more training cycles includes training the deep learning model on the monoallelic dataset. Preferably, one of the two or more training cycles includes training the deep learning model on both the monoallelic and multiallelic datasets.
[0060] According to another preferred embodiment, the present invention provides three or more training cycles. One of the three or more training cycles is a supervised learning period, wherein the model is trained on the monoallelic and multiallelic datasets to predict the complete amino acid sequence presented by a specific set of alleles. One of the three or more training cycles is a burn-in period, during which only samples from the monoallelic dataset are used so that the model learns specific peptide-HLA relationships. One of the three or more training cycles is a generalization period, during which the multiallelic dataset is used to generalize what the model has learned to patient data.
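A staged schedule of this kind can be sketched as a generator that decides which pairs are available in each epoch. The sketch covers only a burn-in phase followed by a generalization phase, and it assumes that the generalization phase draws on both datasets; the phase names as strings, the epoch counts, and the placeholder pairs are illustrative.

```python
import random


def training_schedule(mono_pairs, multi_pairs, burn_in_epochs,
                      generalization_epochs, seed=0):
    """Yield (phase, shuffled pairs) for each epoch of a two-phase schedule."""
    rng = random.Random(seed)
    for epoch in range(burn_in_epochs + generalization_epochs):
        if epoch < burn_in_epochs:
            # Burn-in: only monoallelic pairs are shown to the model.
            phase, pool = "burn-in", list(mono_pairs)
        else:
            # Generalization (assumed here to use both datasets).
            phase, pool = "generalization", list(mono_pairs) + list(multi_pairs)
        rng.shuffle(pool)
        yield phase, pool


# Placeholder pairs, not real epitope or HLA sequences.
mono = [("PEPTIDE_M1", "HLA_A"), ("PEPTIDE_M2", "HLA_B")]
multi = [("PEPTIDE_T1", "HLA_A" + "HLA_B")]
for phase, pool in training_schedule(mono, multi, burn_in_epochs=2,
                                     generalization_epochs=1):
    print(phase, len(pool))
```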
[0061] According to a preferred embodiment, the epitope sequence of the input for each input-output pair in the positive dataset is obtained by mass spectrometry. New technological advancements have increased the availability of mass spectrometry-derived lists of peptides that actually bind to MHC molecules on the cell surface. These lists are referred to as “ligandomes.” In the context of this document, the term “ligandome” should be understood as the complete set of molecular ligands for proteins in cells and organisms. Preferably, a positive set of input-output pairs is constructed from ligandome data from training cells.
[0062] Preferably, the deep learning model according to the present invention is at least one of the following: a deep semantic similarity model, a convolutional deep semantic similarity model, a recurrent deep semantic similarity model, a deep relevance matching model, a deep and wide model, a deep language model, a transformer network, a long short-term memory network, a learned deep learning text embedding, a learned named entity recognition, a Siamese neural network, an interaction Siamese network, or a lexical and semantic matching network, or any combination thereof.
[0063] Preferably, training the deep learning model includes determining a scoring function. More preferably, the scoring function is one or more of a mean squared error scoring function, an average scoring function, or a maximum scoring function. Preferably, the scoring function is constructed as the sum of squared errors between the probabilities output by the model and the HLA-neoepitope relation information associated with the training dataset. This can be implemented by using scores of 0 and 1, which represent the ground-truth values "not presented" (=0) and "presented" (=1) in the training dataset.
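A sum-of-squared-errors scoring function over 0/1 ground-truth labels can be written in a few lines. The optional per-pair weights are an assumption that reflects the weighting of input-output pairs mentioned in paragraph [0049]; the function name and example values are illustrative.

```python
def sum_squared_error(predicted, labels, weights=None):
    """Sum of (optionally weighted) squared errors.

    predicted: model output probabilities in [0, 1]
    labels:    ground truth, 0 = not presented, 1 = presented
    weights:   optional per-pair weights (assumed extension)
    """
    if weights is None:
        weights = [1.0] * len(predicted)
    if not (len(predicted) == len(labels) == len(weights)):
        raise ValueError("predicted, labels and weights must align")
    return sum(w * (p - y) ** 2 for p, y, w in zip(predicted, labels, weights))


# Made-up predictions for three input-output pairs.
predicted = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
print(sum_squared_error(predicted, labels))
print(sum_squared_error(predicted, labels, weights=[2.0, 1.0, 1.0]))
```

A perfect model would score 0, and the score grows as predicted probabilities move away from the ground truth.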
[0064] In another embodiment of the invention, the coefficients of the model are adjusted at each training step to minimize the scoring function. A neural network consists of interconnected neurons, and each connection is associated with a weight that is multiplied by the input value and so determines the importance of that connection to the receiving neuron. For the neural network to learn, the weights associated with the connections must be updated after the data has been passed forward through the network. These weights are typically adjusted through a process called backpropagation, which reduces the discrepancy between actual and predicted results in subsequent forward passes.
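To make the update rule concrete, the sketch below trains a single sigmoid neuron by gradient descent on a squared-error score. It is a deliberately small stand-in for backpropagation through a full network; the learning rate, the feature values, and the `sample_weight` parameter are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(weights, bias, features, label,
               learning_rate=0.5, sample_weight=1.0):
    """One forward pass and one gradient update for a single sigmoid neuron."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)                        # forward pass: predicted probability
    loss = sample_weight * (p - label) ** 2
    # Backward pass (chain rule): dL/dz = 2 * weight * (p - y) * p * (1 - p)
    dz = 2.0 * sample_weight * (p - label) * p * (1.0 - p)
    new_weights = [w - learning_rate * dz * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * dz
    return new_weights, new_bias, loss


# Repeatedly present one "presented" (label 1) example with made-up features.
weights, bias = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    weights, bias, loss = train_step(weights, bias,
                                     features=[1.0, 0.5], label=1.0)
    losses.append(loss)
print(losses[0], losses[-1])
```

A larger `sample_weight` scales the gradient and therefore the size of the parameter adjustment, which is how weighted input-output pairs would influence training.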
[0065] Preferably, the deep learning model according to the invention is a sequence-to-sequence model. As used herein, "sequence-to-sequence model (seq2seq)" refers to a term known in the art, also known as an encoder-decoder model, which should preferably be understood as a model in which the encoder reads the input sequence and outputs a single vector, and the decoder reads the vector to produce an output sequence. Therefore, such a model aims to map inputs of fixed and / or variable lengths to outputs of fixed and / or variable lengths, where the lengths of the input and output can be different. Using a seq2seq method, in which HLA alleles are modeled through the amino acid sequences of specific functionally relevant parts of their entire structure, has the advantage of being able to extrapolate and predict the presentation probability of neoepitopes for HLA alleles on which the model was not trained. Most preferably, the seq2seq model is a transformer network.
[0066] According to a preferred embodiment, the present invention provides a method for processing the input of one input-output pair of multiple input-output pairs into an embedded input value vector by transforming the corresponding entries of the epitope sequence using a novel epitope embedder and a position encoder. The embedded input value vector includes information about the plurality of amino acids constituting the epitope sequence of the corresponding entry and the set of positions of the amino acids in the epitope sequence. According to another preferred embodiment, the present invention provides a method for processing the output of the pair into an embedded output value vector by transforming the corresponding peptide sequence entries of the α-chain using an allele embedder and a position encoder. The embedded output value vector includes information about the plurality of amino acids constituting the peptide sequence of the corresponding entry and the set of positions of the amino acids in the peptide sequence. The embedders and encoders discussed above allow the inputs and outputs of deep learning models to be transformed into appropriate formats before and after processing, and during training, validation, or use.
[0067] Most preferably, the deep learning model is a transformer network or transformer. Transformer networks were developed to solve the problem of sequence transduction, or neural machine translation, meaning any task that transforms an input sequence into an output sequence. For a model to perform sequence transduction, it must have some kind of memory: it needs to compute the correlations and connections between the inputs, including long-range connections. Transformer neural networks rely on self-attention and can replace earlier approaches in which long short-term memory (LSTM) networks or convolutional neural networks (CNNs) were combined with attention between the encoder and decoder of the model. The self-attention mechanism allows the model's inputs to interact with each other and determine which elements or parts they should pay more attention to. The output is an aggregate of these interactions and attention scores.
[0068] More specifically, the attention function can be described as mapping a query (i.e., a sequence) and a set of key-value pairs to an output, where the query (q), keys (k), values (v), and output are all vectors. The keys and values can be viewed as the model's memory, meaning all queries that have been processed previously. Scores are computed to determine the self-attention of tokens (i.e., amino acids) in the sequence. Each token in the sequence is scored against the token for which self-attention is being computed. This score determines how much focus is placed on the other parts of the sequence when a token at a certain position is encoded. The score is computed as the dot product of the query vector with the key vector of the token being scored. With scaled dot-product attention, the output is computed as a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with the corresponding key.
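As an illustration of the attention function described above, the following sketch computes scaled dot-product attention in plain Python. It assumes the standard formulation, in which the dot products are divided by the square root of the key dimension and normalized with a softmax; the text itself does not spell out these details, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score the query against every key (dot product, scaled by sqrt(d_k)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two identical keys receive equal weight, so the output is the mean value.
attended = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

In self-attention, the queries, keys, and values are all derived from the same sequence of token embeddings.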
[0069] There are various motivations for using self-attention methods. A major advantage of transformer-type neural networks is that encoder self-attention can be parallelized, which reduces the overall model training time. Another is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. A key factor affecting the ability to learn such dependencies is the length of the paths that the forward and backward signals must traverse in the network. The shorter these paths are between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.
[0070] According to a preferred embodiment, the transformer network includes an encoder and a decoder.
[0071] The encoder includes:
[0072] ο Neoepitope embedder;
[0073] ο Position encoder;
[0074] ο One or more sequence encoders, each consisting of two sub-layers:
[0075] i. Multi-head self-attention sublayer;
[0076] ii. Feedforward sublayer;
[0077] The decoder includes:
[0078] ο One or more sequence decoders, each consisting of three sub-layers:
[0079] i. Multi-head self-attention sublayer;
[0080] ii. Multi-head encoder-decoder attention sublayer;
[0081] iii. Feedforward sublayer;
[0082] ο HLA sequence embedder;
[0083] ο Probability generator, the probability generator including:
[0084] i. Linear mapper;
[0085] ii. Softmax layer.
[0086] An "embedder" uses an embedding algorithm to transform each input into a vector or tensor. This transformation is necessary because many machine learning algorithms, including deep neural networks, require their inputs to be vectors of continuous values, as they wouldn't work with plain text strings. Using embedders offers the advantages of dimensionality reduction and contextual similarity. By reducing the dimensionality of features or datasets, model accuracy improves, algorithms train faster, require less storage space, and redundant features and noise are removed. The similarity between a pair of inputs can be calculated by applying some similarity or distance measure to the corresponding vector pair, thus giving a more expressive representation of the data.
[0087] In the transformer, self-attention ignores the position of tokens in the sequence. However, the position and order of the tokens (i.e., amino acids) are essential parts of the sequence. To overcome this limitation, the transformer explicitly adds a "positional encoding", a piece of information about its position in the sequence that is added to each token. Both the input and output embedding sequences are positionally encoded to allow the self-attention process to correctly infer position-related interdependencies. The positional encodings are summed with the input or output embeddings before these are passed to the first attention layer.
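The text does not state which positional encoding is used. As one possible sketch, the following assumes the sinusoidal encoding commonly used with transformers, summed element-wise with the token embeddings; the function names are hypothetical.

```python
import math


def positional_encoding(length, d_model):
    """Sinusoidal positional encoding: one row of d_model values per position.
    (Assumed scheme; the source only says that positional information is added.)"""
    table = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Sum each token embedding with the encoding of its position."""
    table = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, table)]


# Two identical token embeddings become distinguishable once positions are added.
encoded = add_positions([[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]])
```

Because the encoding depends only on the position, the same amino acid at two different positions yields two different input vectors.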
[0088] The "Sequence Encoder" consists of a stack of several identical layers. Each layer has two sub-layers. The first sub-layer is a "multi-head self-attention" mechanism, and the second sub-layer is a simple "feedforward network." Instead of computing attention only once, the multi-head mechanism runs in parallel multiple times through scaled dot-product attention operations. The independent attention outputs are simply concatenated and linearly transformed to the desired dimension. This extends the model's ability to focus on different locations. The output of the self-attention layer is fed into a simple feedforward neural network, where information moves further in only one direction. Residual connections or shortcuts are used around each of the two sub-layers, which allows the model to use fewer layers during the initial training phase, thus simplifying the network. Each layer ends with a normalized sum of its own output and residual connections. The "Sequence Decoder" is very similar to the encoder but has an additional "multi-head encoder-decoder attention sub-layer." The encoder-decoder sub-layer is different from either the encoder or the decoder attention sub-layer. Unlike multi-head self-attention, the encoder-decoder attention sub-layer creates its query matrix from the layer below it, which is the decoder self-attention layer, and obtains the key and value matrix from the output of the encoder layer. This helps the decoder focus on the appropriate position in the input sequence.
[0089] By using a "linear mapping" or transformation and a "softmax function" or "softmax layer," the decoder output is transformed into the predicted probability of the next token. The linear mapping layer reduces the dimensionality of the data and the number of network parameters. The softmax layer is a multi-class operation, meaning it is used to determine the probabilities of multiple classes at once. Because the output of the softmax function can be interpreted as probabilities—that is, they must sum to 1—the softmax layer is often the final layer used in neural network functions.
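The linear mapping followed by a softmax layer can be sketched as below. The weights, bias, and decoder output are hypothetical values chosen only to show the shape of the computation.

```python
import math


def linear(x, weights, bias):
    """Linear mapping: one weight row and one bias term per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-dimensional decoder output mapped onto 2 classes.
decoder_output = [0.5, -1.0, 2.0]
weights = [[0.2, 0.1, -0.3], [-0.1, 0.4, 0.5]]
bias = [0.0, 0.1]
probs = softmax(linear(decoder_output, weights, bias))
```

Here the linear mapper reduces a 3-dimensional vector to 2 scores, and the softmax converts those scores into class probabilities.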
[0090] According to a preferred embodiment, the training of the deep learning model includes multiple training steps, each training step including processing one of the multiple input-output pairs according to the following steps:
[0091] The input of the input-output pair is processed into an embedded input value vector by using a neoepitope embedder and a position encoder to transform the corresponding entry of the epitope sequence. The embedded input value vector includes information about the multiple amino acids that constitute the epitope sequence of the corresponding entry and the set of positions of the amino acids in the epitope sequence.
[0092] The output of the pair is processed into an embedded output value vector by using an allele embedder and a position encoder to transform the corresponding entry of the peptide sequence of the α chain. The embedded output value vector includes information about the multiple amino acids that constitute the peptide sequence of the corresponding entry and the set of positions of the amino acids in the peptide sequence.
[0093] The embedded input numeric vector is processed into an encoded input numeric vector using at least one sequence encoder including a multi-head self-attention sublayer and a feedforward sublayer. The encoded input numeric vector includes information about the characteristics of the epitope sequence of the corresponding entries in the epitope sequence.
[0094] The embedded output numerical vector is processed into an output attention numerical vector using a multi-head self-attention sublayer. This output attention numerical vector includes information about the interdependence of multiple amino acids in the peptide sequence of the corresponding peptide sequence entry that constitutes the α chain.
[0095] The encoded input numerical vector and the corresponding output attention vector are processed into a correlated numerical vector using a multi-head encoder-decoder attention sublayer and a feedforward sublayer. This correlated numerical vector includes relevant information between the encoded input numerical vector and the corresponding output attention vector; and
[0096] The correlated numerical vector is processed, using a probability generator, into a probability of correspondence between the embedded input numerical vector and the embedded output numerical vector.
[0097] In another implementation, the input (the epitope sequence) and the output (the HLA peptide sequence) of the pair are encoded according to one of several modes.
[0098] According to the first possible mode, each amino acid position is one-hot encoded, meaning it is transformed into a 1×20 vector, since there are 20 standard amino acids. Every position in the vector is 0 (zero) except for one position that holds a 1 (one), and the location of that 1 indicates which amino acid is actually present. In this way, for example, a 9-mer is transformed into a 9×20 matrix in which only 9 positions are 1 and all other positions are 0.
[0099] According to the second possible mode, each amino acid is tokenized individually, which means constructing an amino acid-to-number dictionary in which each amino acid is represented by a numerical value. For example, proline is represented as 1, valine as 2, and so on. In this way, a 9-mer is transformed into a vector of 9 numbers.
[0100] According to the third possible mode, each amino acid is replaced by an embedding vector of n values. These n values relate to specific characteristics of the amino acid, which can be physical, chemical, or otherwise defined. As a preferred example, the amino acid is embedded using the values of n principal components derived from a set of physicochemical properties / characteristics. Thus, in this example, a 9-mer is transformed into a 9×n numerical matrix.
[0101] The three possible embedding modes can be applied directly at the level of single amino acid positions, where one amino acid is embedded into one embedding vector. In another or further mode, to embed the epitope sequence (input) and the HLA sequence (output), the sequence can be split into strings of length greater than 1. In this way, k-mers are considered instead of individual amino acids.
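The first two encoding modes and the k-mer split can be sketched as follows. The alphabetical ordering of the amino acids, the resulting token numbers, and the use of overlapping k-mers are assumptions made for illustration; the text does not fix any of them. The third mode is omitted because it requires physicochemical values that are not given here.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical amino acid-to-number dictionary, numbered from 1.
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Mode 1: each residue becomes a 1x20 vector holding a single 1."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS] for aa in seq]


def tokenize(seq):
    """Mode 2: each residue becomes one integer from the dictionary."""
    return [TOKEN_IDS[aa] for aa in seq]


def kmers(seq, k):
    """Split a sequence into overlapping strings of length k (assumed overlap)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


epitope = "GILGFVFTL"           # example 9-mer
matrix = one_hot(epitope)       # 9 rows of 20 values
tokens = tokenize(epitope)      # 9 integers
triplets = kmers(epitope, 3)    # 7 overlapping 3-mers
```

For the 9-mer above, the one-hot matrix has 9 rows with exactly one 1 each, matching the 9×20 description in the text.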
[0102] According to another preferred embodiment, the processing of one of the plurality of input-output pairs further includes the following steps:
[0103] ο Obtain a data point for the scoring function used for training by comparing the probability of correspondence between the embedded input numerical vector and the embedded output numerical vector with the correspondence information associated with the training dataset; and
[0104] ο Adjust the parameters associated with the deep learning model to optimize the scoring function.
[0105] Preferably, the scoring function is one or more of a mean squared error scoring function, an average scoring function, or a maximum scoring function.
[0106] In one implementation, the scoring function can be a binary cross-entropy loss function.
[0107] In embodiments of the invention, as previously described, positive input-output pairs can be assigned different weights, preferably depending on their frequency of occurrence in the mass spectrometry data used to construct the positive training set. The weights modify the impact of the positive input-output pairs on the training of the deep learning model. When training the model with these input-output pairs, larger weights will result in larger adjustments to the parameters associated with the deep learning model.
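A weighted binary cross-entropy loss is one way to let some positive pairs influence training more than others. The sketch below is an assumption-laden illustration: the text does not specify how weights are derived from mass spectrometry frequencies, so the weights shown are hypothetical.

```python
import math


def weighted_bce(predicted, labels, weights, eps=1e-12):
    """Binary cross-entropy in which each pair's term is scaled by its weight.
    A larger weight gives a larger loss term and thus a larger gradient."""
    total = 0.0
    for p, y, w in zip(predicted, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


predicted = [0.9, 0.3]   # hypothetical model outputs for two positive pairs
labels = [1, 1]
uniform = weighted_bce(predicted, labels, [1.0, 1.0])
# The second pair is (hypothetically) observed three times as often.
frequency_weighted = weighted_bce(predicted, labels, [1.0, 3.0])
```

Tripling the weight of the poorly predicted pair raises its share of the loss, so a gradient step corrects that pair more strongly.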
[0108] According to another preferred embodiment, the transformer network includes an encoder but not a decoder. In this network, the embedding vectors of the input epitope sequence and the input HLA sequence are processed together as a single vector. To indicate whether the values of the input embedding vector relate to the neoepitope or to the HLA, a type of masking is performed. This means, for example, changing the sign of the values associated with the epitope input while leaving the sign of the values associated with the HLA input unchanged. Furthermore, in this network model, custom delimiter values are inserted at various positions in the input embedding vector, particularly at the beginning and / or end of the vector and between the epitope-related values and the HLA-related values. In this way, both input sequences can be processed as a single vector while remaining distinguishable from each other.
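The single-vector input of the encoder-only variant might be assembled as in the sketch below. The delimiter value and the function name are hypothetical, and positive token values are assumed so that the sign flip marks the epitope part unambiguously.

```python
DELIMITER = 99.0  # hypothetical custom delimiter value


def build_single_input(epitope_values, hla_values, delimiter=DELIMITER):
    """Concatenate epitope and HLA values into one vector.
    Epitope values get their sign flipped (the masking step), HLA values are
    left unchanged, and delimiters mark the start, the boundary, and the end."""
    masked_epitope = [-v for v in epitope_values]
    return [delimiter] + masked_epitope + [delimiter] + list(hla_values) + [delimiter]


single_vector = build_single_input([1, 2], [3, 4])
```

Every negative entry belongs to the epitope, every positive non-delimiter entry belongs to the HLA sequence.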
[0109] According to another preferred embodiment, after training the model, one or more of the following are obtained:
[0110] - A set of coefficients that can be used to reproduce the function given the correct structure;
[0111] - A set of parameters that describes all aspects of the trained model;
[0112] - A structural scheme that can be used to regenerate inference / test models;
[0113] - The HLA dictionary observed during model training.
[0114] According to one embodiment, the present invention provides a method in which other semi-independent models can be trained around the central structure to account for other relevant biological parameters. These biological parameters include: RNA expression of the gene from which the neoepitope is derived, RNA expression of all other genes in the sample, expression of non-coding RNA, post-translational modification status, RNA editing events, immune scores for each immune cell type, clonality of the sample, confidence scores for all genomic alteration events, peptide-MHC binding affinity predicted by other tools, peptide-MHC complex stability, peptide stability and turnover rate, adjacent amino acids within the original protein of the neoepitope, proteasome activity, and peptide processing activity. The model structure is configured such that missing data for any parameter on this list will not prevent the model from outputting a probability.
[0115] According to a preferred embodiment, the present invention further includes the following steps:
[0116] - Train a semi-independent neural network on a semi-independent training dataset, wherein the semi-independent training dataset includes at least the positive dataset of the deep learning model, or a variant thereof, and an associated prediction improvement parameter training dataset, wherein the associated prediction improvement parameter training dataset involves one or more of the following biological parameters: RNA expression of the gene from which the neoepitope is derived, RNA expression of multiple genes in the cancer tissue sample, expression of non-coding RNA sequences, post-translational modification information, RNA editing event information, immune scores of multiple immune cell types, clonality of the cancer tissue sample, confidence scores of multiple genomic alteration events, peptide-MHC binding affinity, peptide-MHC complex stability, peptide stability and / or turnover rate, adjacent amino acids within the neoepitope sequence, proteasome activity, and peptide processing activity.
[0117] Preferably, the associated prediction improvement parameter training dataset involves at least the adjacent amino acids within the neoepitope sequence;
[0118] - Determine, using the trained semi-independent neural network, the semi-independent presentation probability of each neoantigen in the set of neoantigens for the HLA peptide sequence; and
[0119] - For each neoantigen in this group of neoantigens, combine the determined semi-independent presentation probability with the presentation probability obtained through training the model to obtain the overall presentation probability;
[0120] Preferably, the combination is performed via a trained single-layer neural network;
[0121] Preferably, the semi-independent neural network is a single-layer neural network;
[0122] Preferably, at least one of the exome or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data of tumor cells and normal cells associated with the tumor from the subject is obtained from cancer tissue samples and healthy tissue samples from the subject, respectively.
[0123] In one implementation, training of all sublayers is performed using an Adam-type optimization algorithm. An optimizer is an algorithm or method used to modify properties of the neural network, such as its weights and learning rate, in order to reduce the loss or error and obtain results faster. Adam builds on adaptive learning rate methods to find an individual learning rate for each parameter: it uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.
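A single Adam update, using the first and second moment estimates mentioned above, can be sketched as follows. The hyperparameter defaults are the commonly used ones and are an assumption here, as is the toy quadratic used to exercise the step.

```python
import math


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running estimates of the first and second
    moments of the gradient; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment (mean)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Toy problem: minimize (w - 3)^2 starting from w = 0.
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 201):
    grad = [2.0 * (w[0] - 3.0)]
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
```

Because each update is normalized by the square root of the second moment, the first step has a size close to the learning rate regardless of the gradient's scale, which is what gives every weight its own effective learning rate.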
[0124] According to one implementation, the deep learning model, preferably a transformer network, is trained for 5 epochs with 5-fold cross-validation. k-fold cross-validation is easy to understand and implement, and it yields an estimate of model skill on new data that typically has lower bias than other methods. In k-fold cross-validation, there is a bias-variance tradeoff associated with the choice of k. Performing k-fold cross-validation with k=5 produces a test error rate estimate that suffers from neither excessively high bias nor very high variance.
[0125] As used herein, "epoch" refers to a term known in the art, which should preferably be understood as the number of passes a machine learning algorithm has completed over the entire training dataset. One epoch is one cycle through the entire training dataset.
[0126] As used herein, "k-fold cross-validation" refers to a term known in the art, which should preferably be understood as a statistical method for estimating the skill of a machine learning model. The method involves randomly dividing a set of observations into k groups (folds) of approximately equal size. The first group is treated as the validation set, and the model is fitted on the remaining k-1 groups; the procedure is repeated so that each group serves as the validation set once. The results of a k-fold cross-validation run are typically summarized as the mean of the model's skill scores. It is also good practice to include a measure of the variance of the skill scores, such as the standard deviation or standard error.
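The k-fold procedure can be sketched with the standard library alone. The function names, the fixed seed, and the per-fold scores are hypothetical; the sketch only shows how the folds are formed and how the fold scores are summarized.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def summarize(scores):
    """Mean and sample standard deviation of the per-fold skill scores."""
    return statistics.mean(scores), statistics.stdev(scores)


splits = list(k_fold_indices(23, k=5))
# Hypothetical per-fold scores, for illustration only.
mean_score, spread = summarize([0.81, 0.79, 0.83, 0.80, 0.82])
```

Each observation appears in exactly one validation fold and in the training set of the other four folds.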
[0127] The invention is further described by way of the following non-limiting examples, which further illustrate the invention and are not intended to, nor should they be construed as, limiting the invention.
[0128] Examples
[0129] Example 1:
[0130] This example relates to the training of a sequence-to-sequence transformation model according to the present invention.
[0131] The sequence-to-sequence transformation model has the following structure:
[0132] - Encoder:
[0133] ο Neoepitope embedder;
[0134] ο Position encoder;
[0135] ο One or more sequence encoders, each consisting of two sub-layers:
[0136] i. Multi-head self-attention sublayer;
[0137] ii. Feedforward sublayer;
[0138] - Decoder:
[0139] ο One or more sequence decoders, each sequence decoder comprising three sub-layers:
[0140] i. Multi-head self-attention sublayer;
[0141] ii. Multi-head encoder-decoder attention sublayer;
[0142] iii. Feedforward sublayer;
[0143] ο HLA sequence embedder;
[0144] ο Probability generator, the probability generator comprising:
[0145] i. Linear mapper;
[0146] ii. Softmax layer.
[0147] The sequence-to-sequence transformation model described above is trained by processing a set of positive and negative input-output pairs via this model.
[0148] Positive sets of input-output pairs were constructed from ligandome data from monoallelic human cell lines or multiallelic human tissues (healthy or cancerous). Each positive input consisted of the sequence of an epitope (8 to 15 amino acids) that was shown to be presented on the cell surface in the given dataset. Each associated positive output consisted of the concatenated amino acid sequence (71 amino acids) of the α chain of the HLA allele expressed by the cells in the same dataset.
[0149] Negative sets of input-output pairs were constructed from the human proteome. Each input is a random 8- to 15-mer sequence from the human proteome that is not presented in any ligandome dataset. Each associated output is a concatenation of the α chain sequences of a random set of HLA alleles present in the positive dataset.
[0150] Each training input-output pair is processed by this model as follows:
[0151] - If needed, the input peptide is padded with "." tokens to a length of 15, and the resulting sequence is then embedded by the neoepitope embedder into a 21×15 one-hot tensor.
[0152] - The model is sequence-based: each HLA is embedded by the allele embedder into a 21×71 one-hot tensor according to the sequence of its two α-helices that interact with the peptide.
[0153] - Positional encoding is then performed on the input and output embedding sequences to allow the self-attention process to correctly infer positional interdependencies.
[0154] - Each sequence encoder processes the embedded input sequence in turn. The self-attention sublayers learn intra-peptide interdependencies, and the feedforward sublayers process the input embeddings accordingly.
[0155] - The result of this encoding process is a fixed-size encoding that represents the features of the input neoepitope.
[0156] - The embedded HLA sequence is processed by each decoder in turn and combined with the encoded neoepitope input, gradually forming the embedded output sequence. The self-attention sublayer learns intra-allele dependencies, the peptide attention sublayer relates the encoded peptide representation to the embedded output, and the feedforward sublayer modifies the embedded output accordingly. In this step, the correspondence between input and output is established. It should be noted that the attention sublayers, which allow intra-sequence dependencies to be detected, significantly improve the model's overall predictive ability.
[0157] - Finally, the generator processes the embedded output to output the probability of correspondence between the embedded input and the embedded output, representing the probability of presentation (0 to 1, with 1 being the highest probability).
[0158] - The scoring function is constructed as the sum of squared errors between the model's output probability and the actual HLA-peptide relationship (0: the peptide is not presented on the surface of cells expressing the allele, i.e., the peptide is part of the aforementioned negative dataset; 1: the peptide is presented on the surface of cells expressing the allele, i.e., it is part of the aforementioned positive dataset). Other ways of aggregating the data are also possible, such as an average scoring function or a maximum scoring function.
[0159] - At each training step, that is, with each new input-output pair processed, the coefficients of the model are adjusted to minimize the scoring function thus defined.
[0160] The model was trained as follows:
[0161] - The model was trained for 5 epochs with 5-fold cross-validation.
[0162] - Training follows these steps: First, the model is trained on all samples simply to predict, amino acid by amino acid, the complete sequence of amino acids presented by a specific set of alleles (self-supervised learning). Then, only samples derived from monoallelic HLA datasets (e.g., from monoallelic cell lines) are used for training ("burn-in") so that the model learns specific peptide-HLA relationships. Finally, multiallelic HLA instances are used for training to generalize the model's learning to real-world patient data.
[0163] - An Adam-type optimizer is used to train all layers of the model.
[0164] - At the end of training, the model outputs the set of coefficients that can be used to reproduce its function given the correct structure, the set of parameters describing all aspects of the model's training, the structural scheme that can be used to regenerate the model for inference / testing, and the HLA dictionary seen during model training.
[0165] Example 2:
[0166] This example relates to the use of the trained model according to Example 1 in the workflow of the invention.
[0167] This embodiment provides a workflow for predicting the likelihood that a variable-length neoepitope is presented at the surface of a cancer cell, given the set of HLA alleles expressed by that cell.
[0168] The workflow uses a sequence-to-sequence transformation model. This model allows extrapolation and prediction of the presentation probability of a neoepitope on any HLA allele, even an allele on which the model has not been trained.
[0169] The workflow is as follows:
[0170] - First, neoepitopes are discovered using next-generation sequencing data from cancer biopsies. Both DNA and RNA sequencing data are used to extract a set of aberrant genomic events that may give rise to neoepitopes.
[0171] - These events are assigned confidence scores based on the number of sequencing reads that support them and their prevalence in the genome, and epitopes from the highest-confidence events are selected for the subsequent steps.
[0172] - The same genomic data are also used to assess the HLA composition of the biopsy.
[0173] - The sequences of the selected peptides are provided to the trained model along with the known HLA sequences.
[0174] - The model calculates the peptide presentation probability for each HLA in the provided set and outputs the overall peptide probability based on these individual values.
[0175] - Additionally, the workflow may optionally include steps that make the probability predictions more accurate by providing the model with other biological parameters, such as RNA expression levels, MHC binding probability, or the protein context of the neoepitope.
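The workflow states that an overall peptide probability is derived from the per-HLA probabilities but does not say how. The sketch below shows two plausible aggregation rules, both of which are assumptions and not the rule used by the invention: a "noisy-OR" (presentation by at least one allele, treating alleles as independent) and a simple maximum. The allele names are used only as example keys and the probabilities are hypothetical.

```python
def overall_presentation_probability(per_hla_probs, mode="noisy_or"):
    """Aggregate per-allele presentation probabilities into one value.
    Both modes are illustrative assumptions."""
    probs = list(per_hla_probs.values())
    if mode == "noisy_or":
        # Probability that at least one allele presents the peptide,
        # assuming the alleles act independently.
        not_presented = 1.0
        for p in probs:
            not_presented *= (1.0 - p)
        return 1.0 - not_presented
    if mode == "max":
        return max(probs)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical per-allele outputs for one peptide.
per_hla = {"HLA-A*02:01": 0.5, "HLA-B*07:02": 0.5}
combined = overall_presentation_probability(per_hla)
best_allele = overall_presentation_probability(per_hla, mode="max")
```

The noisy-OR never falls below the best single-allele probability, whereas the maximum ignores contributions from the other alleles.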
[0176] Example 3:
[0177] This example relates to an alternative implementation of the transformation model based on Example 1.
[0178] As described in Example 1 above, if necessary, the input new epitope sequence is padded with "." tokens to a length of 15, and the resulting sequence is then embedded by the new epitope embedder into a 21×15 one-hot tensor. The model in Example 1 therefore requires the sequence to be within the correct length range. However, this model can also be implemented to allow epitopes and HLAs of any length. Similarly, the model can be implemented to allow variable-length embeddings. Furthermore, the model can be implemented to allow embeddings onto matrices of different sizes, up to 300×15.
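The padding and 21×15 one-hot embedding described above can be sketched as follows. Right-padding, the row order of the alphabet, and the function names are assumptions; the text only specifies the "." token, the length of 15, and the 21×15 shape.

```python
# 20 standard amino acids plus the "." padding token: 21 symbols (assumed order).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY."
MAX_LEN = 15


def pad(seq, length=MAX_LEN, pad_token="."):
    """Right-pad a peptide with the padding token up to the given length."""
    if len(seq) > length:
        raise ValueError(f"sequence longer than {length} residues")
    return seq + pad_token * (length - len(seq))


def embed_one_hot(seq, length=MAX_LEN):
    """Return a 21 x length one-hot tensor: one row per symbol, one column per position."""
    padded = pad(seq, length)
    return [[1 if aa == symbol else 0 for aa in padded] for symbol in ALPHABET]


tensor = embed_one_hot("GILGFVFTL")  # example 9-mer padded to 15 positions
```

Calling `embed_one_hot` with `length=71` on a 71-residue HLA sequence would give the 21×71 shape used for the alleles in Example 1.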
[0179] As described above in Example 1, this model is sequence-based, and each HLA is embedded by the allele embedder into a 21×71 one-hot tensor according to the sequence of its two α-helices that interact with the peptide. Alternatively, the model can process the associated HLAs as categorical codes. Categorical encoding refers to transforming categorical features into one or more numerical features. Each HLA is then encoded according to a central repository that gathers all HLA sequences known at the time the model was built. Alternatively, the model can also be sequence-free. In this case, HLAs are one-hot encoded based on their central repository codes as described above. The associated HLA sequences are processed one after another, so that a specific neoepitope is processed once for each HLA sequence found to be associated with a particular sample. If the amino acid sequence of an HLA allele is unknown, the model will not output a prediction. This is an infrequent but real possibility for some rare HLA alleles.
[0180] Example 4:
[0181] This example involves using the workflow described in Example 2 to determine the treatment for a subject.
[0182] The treatment plan is determined as follows:
[0183] - Based on the determined presentation probabilities, a subset of the identified set of neoantigens is selected to obtain a selected neoantigen subset.
[0184] - The subset is obtained by comparing the presentation probability of each neoantigen in the set of neoantigens with a threshold, wherein a neoantigen is added to the subset if its associated presentation probability exceeds the threshold; and
[0185] - One or more T cells that are antigen-specific for at least one neoantigen in the subset are identified.
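The threshold-based selection step can be sketched in a few lines. The peptide sequences, probabilities, and threshold value below are hypothetical.

```python
def select_neoantigens(presentation_probs, threshold):
    """Keep neoantigens whose presentation probability exceeds the threshold,
    ordered from most to least likely to be presented."""
    selected = [(pep, p) for pep, p in presentation_probs.items() if p > threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [pep for pep, _ in selected]


# Hypothetical model outputs for four candidate neoantigens.
probabilities = {
    "KLDETGNSL": 0.92,
    "AVGSYVYSV": 0.50,
    "LLDFVRFMGV": 0.71,
    "QPRAPIRPI": 0.12,
}
subset = select_neoantigens(probabilities, threshold=0.5)
```

A probability equal to the threshold is excluded, in line with the wording "exceeds the threshold".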
[0186] Example 5:
[0187] This example relates to an improved model comprising a sequence-to-sequence transformation model according to Example 1 and one or more semi-independent models of said transformation model. The improved model can be used in a workflow according to Example 2 to determine treatment for a subject.
[0188] According to this example, multiple semi-independent single-layer neural network models are trained around the central transformer structure to take other relevant biological parameters into account. Each of the multiple semi-independent models is trained by training a single-layer neural network on a semi-independent training dataset, which includes the training dataset of the sequence-to-sequence transformation model and an associated prediction improvement parameter training dataset. Incorporating the parameters from the prediction improvement parameter training dataset improves the overall prediction accuracy.
[0189] The prediction improvement parameter training dataset of each of the multiple semi-independent single-layer neural network models involves one or more of the following biological parameters: RNA expression of the gene from which the neoepitope is derived, RNA expression of all genes in the cancer tissue sample except the gene from which the neoepitope is derived, expression of non-coding RNA sequences, post-translational modification status, RNA editing events, immune scores for each immune cell type, clonality of the cancer tissue sample, confidence scores for all genomic alteration events, peptide-MHC binding affinity predicted by other tools, peptide-MHC complex stability, peptide stability and turnover rate, adjacent amino acids within the original protein of the neoepitope, proteasome activity, and peptide processing activity.
[0190] After each semi-independent model has been trained, the trained semi-independent neural network is used to determine the semi-independent presentation probability of each neoantigen in the set of neoantigens for the HLA peptide sequence. The determined semi-independent presentation probabilities are then combined with the presentation probabilities obtained from the trained model to obtain the overall presentation probability. According to this example, this combination is performed using a trained single-layer neural network.
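A single-layer network that combines the two probabilities could look like the sketch below. The sigmoid activation, the log-loss gradient update, the learning rate, and the four training rows are all assumptions made for illustration; the text only states that the combination is performed by a trained single-layer neural network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combine(features, weights, bias):
    """Single-layer network: weighted sum of the input probabilities, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_combiner(rows, labels, lr=0.5, epochs=500):
    """Fit the weights and bias by stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = combine(x, weights, bias) - y  # log-loss gradient w.r.t. the pre-activation
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical rows: (transformer probability, semi-independent probability).
rows = [(0.9, 0.8), (0.8, 0.9), (0.2, 0.1), (0.1, 0.3)]
labels = [1, 1, 0, 0]
weights, bias = train_combiner(rows, labels)
overall = combine((0.85, 0.9), weights, bias)
```

With more semi-independent models, each row simply gains one additional probability and the layer gains one additional weight.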
[0191] Example 6:
[0192] This example involves a comparison between the model according to the invention and two existing algorithms, namely the EDGE algorithm and the MHCflurry algorithm.
[0193] The sequence-to-sequence transformation model according to the present invention was developed and trained on the following:
[0194] - A positive dataset comprising 326,297 publicly available input-output pairs, wherein each input-output pair includes, as input, an entry of an epitope sequence identified or inferred from a surface-bound or secreted HLA / peptide complex encoded by the corresponding HLA allele expressed by training cells, and, as output, an entry of a peptide sequence of the α chain encoded by the corresponding HLA allele; the pairs are publicly available from Abelin et al., 2017; Bulik-Sullivan et al., 2019; di Marco et al., 2017; Sarkizova et al., 2019; and Trolle et al., 2016; and
[0195] - A negative dataset comprising 652,594 input-output pairs, each input-output pair including an entry for a peptide sequence as input, wherein the peptide sequence is a random sequence of the human proteome, and wherein each input-output pair also includes a peptide sequence encoded by a random HLA allele as output.
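The construction of the negative dataset can be sketched as follows. The counts above correspond to exactly two negative pairs per positive pair (652,594 = 2 × 326,297). The sketch assumes peptide lengths of 8 to 11 residues and uniform sampling, neither of which is stated in this example; the function name, the argument names, and the toy sequences are hypothetical.

```python
import random


def sample_negative_pairs(proteome, hla_alpha_chains, n_pairs,
                          lengths=(8, 9, 10, 11), seed=0):
    """Pair random proteome subsequences with random HLA alpha-chain sequences.

    proteome: dict mapping protein id -> amino acid sequence
    hla_alpha_chains: dict mapping allele name -> alpha-chain sequence
    lengths: candidate peptide lengths (an assumption, not from the source)
    """
    rng = random.Random(seed)
    proteins = [s for s in proteome.values() if len(s) >= min(lengths)]
    alleles = sorted(hla_alpha_chains)
    pairs = []
    while len(pairs) < n_pairs:
        protein = rng.choice(proteins)
        k = rng.choice(lengths)
        if len(protein) < k:
            continue  # protein too short for this length, draw again
        start = rng.randrange(len(protein) - k + 1)
        peptide = protein[start:start + k]
        allele = rng.choice(alleles)
        pairs.append((peptide, hla_alpha_chains[allele]))
    return pairs


# Toy inputs; real inputs would be the human proteome and HLA protein sequences.
toy_proteome = {"P1": "MKTAYIAKQRQISFVKSHFSRQ", "P2": "MSDNE"}
toy_hla = {"allele_1": "GSHSMRYF", "allele_2": "GSHSMRYY"}
negatives = sample_negative_pairs(toy_proteome, toy_hla, n_pairs=20)
```

A random pair may by chance coincide with a truly presented peptide; this sketch does not filter such cases.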
[0196] The model was then tested on a test dataset comprising:
[0197] - 729 positive pairs selected from the published test dataset of the EDGE algorithm (Bulik-Sullivan et al., 2019); and
[0198] - 1,822,500 negative pairs, each negative pair including an entry of a peptide sequence as input, wherein the peptide sequence is a random sequence of the human proteome, and wherein each negative pair also includes a peptide sequence encoded by a random HLA allele as output.
[0199] Note that pairs already included in the model's training phase should not be included in the test dataset.
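A minimal sketch of this exclusion, assuming that each pair is represented as a tuple of (peptide sequence, HLA α-chain sequence); the function name and the toy sequences are hypothetical.

```python
def remove_training_overlap(test_pairs, training_pairs):
    """Drop every test pair that also occurs in the training data."""
    training_set = set(training_pairs)  # set lookup keeps the filter linear
    return [pair for pair in test_pairs if pair not in training_set]


# Made-up sequences for illustration only.
training = [("AAAAAAAAA", "HLASEQ1"), ("CCCCCCCCC", "HLASEQ2")]
candidates = [("AAAAAAAAA", "HLASEQ1"), ("AAAAAAAAA", "HLASEQ2")]
test_set = remove_training_overlap(candidates, training)
```

Note that the same peptide paired with a different HLA sequence counts as a different pair under this representation.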
[0200] A precision-recall curve is generated for the test dataset. Precision measures the proportion of epitopes called positive that are truly presented, while recall measures the proportion of truly presented epitopes that are correctly called positive. The precision-recall curve is therefore a good measure of the model's ability to recover the expected positives without raising false positives. The better the model, the more the precision-recall curve is skewed towards the upper right corner.
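For illustration, a precision-recall curve can be computed from model scores as in the sketch below. It assumes that a higher score means a more confident positive call and that labels are 1 for presented and 0 for not presented; the function names are hypothetical. With 729 positive and 1,822,500 negative pairs, a random ranking would reach a precision of only 729 / 1,823,229 = 1/2501, which gives a sense of how demanding this test set is.

```python
def precision_recall_curve(scores, labels):
    """Return (recall, precision) points, one per distinct score threshold."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    total_positives = sum(labels)
    tp = fp = 0
    curve = []
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            curve.append((tp / total_positives, tp / (tp + fp)))
    return curve


def chance_precision(n_positive, n_negative):
    """Expected precision of a random ranking: the share of positives."""
    return n_positive / (n_positive + n_negative)


curve = precision_recall_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
baseline = chance_precision(729, 1_822_500)
```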
[0201] The results are shown in Figure 1A. The results of the transformation model according to the invention are shown in blue (the curve most skewed towards the upper right), while the results of the EDGE algorithm are shown in black. The (approximately flat) green line represents the best precision achieved by the affinity-based model MHCflurry.
[0202] These results clearly demonstrate that the model according to the present invention outperforms the existing prior art algorithm EDGE and current state-of-the-art industrial methods, such as MHCflurry, on the same test dataset.
[0203] Example 7:
[0204] This example relates to the ability of a model according to the invention to be used for extrapolation and prediction.
[0205] As a sequence-to-sequence algorithm, the model derives its predictive power not from categorical data but from comparing two sequences and mapping the correlations between them. This means that predictions can be made for HLA alleles for which no training data is available, provided their protein sequences are known.
[0206] Given that obtaining new training data is a long and expensive process, this extrapolation / prediction capability is a real advantage.
[0207] To test this capability, the model was trained as in Example 6, and a new test dataset was constructed from 2,039 positive pairs uniquely associated with the HLA-A*74:02 allele, for which no data was present in the training set, and 5,097,500 negative pairs. Each negative pair included an entry for a peptide sequence as input, wherein the peptide sequence was a random sequence of the human proteome, and wherein each negative pair also included a peptide sequence encoded by a random HLA allele as output.
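The hold-out design of this test can be sketched as follows, assuming that each record carries the allele name next to the peptide and the α-chain sequence. The allele name is used only for splitting; the model itself sees only sequences, which is what allows prediction for an unseen allele. The function name, the peptides, and the placeholder sequences are hypothetical.

```python
def split_by_allele(records, held_out_allele):
    """Split (peptide, allele_name, alpha_chain) records by allele.

    Returns (train_pairs, test_pairs), each a list of
    (peptide, alpha_chain) tuples with the allele name dropped.
    """
    train, test = [], []
    for peptide, allele, alpha_chain in records:
        target = test if allele == held_out_allele else train
        target.append((peptide, alpha_chain))
    return train, test


# Made-up records; the alpha-chain strings are placeholders, not real sequences.
records = [
    ("AAAAAAAAA", "HLA-A*02:01", "ALPHA_SEQ_A0201"),
    ("CCCCCCCCC", "HLA-A*74:02", "ALPHA_SEQ_A7402"),
    ("DDDDDDDDD", "HLA-B*07:02", "ALPHA_SEQ_B0702"),
]
train_pairs, test_pairs = split_by_allele(records, "HLA-A*74:02")
```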
[0208] The results are shown in Figure 1B. The precision-recall curves clearly demonstrate that the model according to the invention has very good predictive power, even on this previously unseen allele.
Claims
1. A computer-implemented method for determining the probability of presentation of a set of neoantigens by tumor cells of a subject's tumor, the computer-implemented method comprising the steps of: - Obtaining at least one of exome or whole genome nucleotide sequencing data and transcriptome nucleotide sequencing data from the subject's tumor cells and from normal cells associated with the tumor; - Obtaining a set of anomalous genomic events associated with the tumor by comparing the exome and / or whole genome nucleotide sequencing data and the transcriptome nucleotide sequencing data from the tumor cells with the exome and / or whole genome nucleotide sequencing data and the transcriptome nucleotide sequencing data from the normal cells; - Obtaining data representing the peptide sequence of each neoantigen in a set of neoantigens identified at least in part based on the set of anomalous genomic events, wherein the peptide sequence of each neoantigen includes at least one alteration that makes the peptide sequence different from the corresponding wild-type peptide sequence identified from the normal cells of the subject; - Obtaining data representing peptide sequences of the human leukocyte antigen (HLA) based on the exome and / or whole genome nucleotide sequencing data and the transcriptome nucleotide sequencing data from the tumor cells; - Training a deep learning model on a training dataset including a positive dataset, wherein the positive dataset comprises multiple input-output pairs, wherein each input-output pair includes, as input, an entry of an epitope sequence, the epitope sequence being identified or inferred from a surface-bound or secreted HLA / peptide complex encoded by a corresponding HLA allele expressed by the training cells, and wherein each input-output pair also includes, as output, an entry of a peptide sequence of the α chain encoded by the corresponding HLA allele; and - Determining, by means of the trained model, the presentation probability of each neoantigen in the set of neoantigens for the peptide sequence of the HLA.
2. The computer-implemented method according to claim 1, further comprising the following steps: - Associating a confidence score with each anomalous genomic event in the set of anomalous genomic events, based at least in part on the number of sequencing reads of the sequencing data supporting each associated anomalous genomic event; - Obtaining a subset of anomalous genomic events by comparing the confidence score of each anomalous genomic event in the set of anomalous genomic events with a threshold, wherein an event is added to the subset if its associated confidence score exceeds the threshold; wherein the neoantigen set is identified at least in part based on the subset of the anomalous genomic events.
3. The computer-implemented method according to claim 1 or 2, wherein the positive dataset includes a monoallelic dataset and a multiallelic dataset, wherein the monoallelic dataset includes input-output pairs obtained from training cells from monoallelic cell lines, and wherein the multiallelic dataset includes input-output pairs obtained from training cells from multiallelic tissues.
4. The computer-implemented method according to claim 3, wherein the training of the deep learning model includes two or more training cycles, wherein each training cycle includes multiple training steps, wherein each training step includes processing one of the multiple input-output pairs, wherein one of the two or more training cycles includes training the deep learning model on the monoallelic dataset, and wherein one of the two or more training cycles includes training the deep learning model on both the monoallelic dataset and the multiallelic dataset.
5. The computer-implemented method according to claim 1, wherein the training dataset used to train the deep learning model also includes a negative dataset comprising multiple input-output pairs, each input-output pair including an entry of a peptide sequence as input, wherein the peptide sequence is a random sequence of the human proteome, and wherein each input-output pair also includes a peptide sequence encoded by a random HLA allele as output.
6. The computer-implemented method according to claim 1, wherein the deep learning model is at least one of the following: a deep semantic similarity model, a convolutional deep semantic similarity model, a recurrent deep semantic similarity model, a deep relevance matching model, a deep and wide model, a deep language model, a transformer network, a long short-term memory network, a learned deep learning text embedding, a learned named entity recognition, a Siamese neural network, an interaction Siamese network, or a lexical and semantic matching network, or a combination thereof.
7. The computer-implemented method according to claim 1, wherein the deep learning model is a transformer network.
8. The computer-implemented method according to claim 7, wherein the training of the deep learning model includes multiple training steps, each training step including processing one of the multiple input-output pairs according to the following steps: - Processing the input of the input-output pair into an embedded input numerical vector by transforming the corresponding entry of the epitope sequence using a neoepitope embedder and a position encoder, the embedded input numerical vector including information about the plurality of amino acids that make up the epitope sequence of the corresponding entry and about the set of positions of the amino acids in the epitope sequence; - Processing the output of the input-output pair into an embedded output numerical vector by transforming the corresponding entry of the peptide sequence of the α chain using an allele embedder and a position encoder, the embedded output numerical vector including information about the plurality of amino acids that make up the peptide sequence of the corresponding entry and about the set of positions of the amino acids in the peptide sequence; - Processing the embedded input numerical vector into an encoded input numerical vector using at least one sequence encoder comprising a multi-head self-attention sublayer and a feedforward sublayer, the encoded input numerical vector including information about the features of the epitope sequence of the corresponding entry; - Processing the embedded output numerical vector into an output attention numerical vector using a multi-head self-attention sublayer, the output attention numerical vector including information about the interdependence of the plurality of amino acids that make up the peptide sequence of the α chain of the corresponding entry; - Processing the encoded input numerical vector and the corresponding output attention numerical vector into a correlation numerical vector using a multi-head encoder-decoder attention sublayer and a feedforward sublayer, the correlation numerical vector including correlation information between the encoded input numerical vector and the corresponding output attention numerical vector; and - Processing the correlation numerical vector into a probability of correspondence between the embedded input numerical vector and the embedded output numerical vector using a probability generator.
9. The computer-implemented method according to claim 8, wherein processing one of the plurality of input-output pairs further includes the following steps: - Obtaining a data point for the scoring function used for training by comparing the probability of correspondence between the embedded input numerical vector and the embedded output numerical vector with the correspondence information associated with the training dataset; and - Adjusting the parameters associated with the deep learning model to optimize the scoring function.
10. The computer-implemented method according to claim 9, wherein the scoring function is one or more of the following: the sum of squared errors scoring function, the average scoring function, or the maximum scoring function.
11. The computer-implemented method according to any one of claims 7 to 10, wherein the transformer network includes an encoder and a decoder; the encoder includes: - a neoepitope embedder; - a position encoder; - one or more sequence encoders, each consisting of two sublayers: i. a multi-head self-attention sublayer; ii. a feedforward sublayer; and the decoder includes: - one or more sequence decoders, each consisting of three sublayers: i. a multi-head self-attention sublayer; ii. a multi-head encoder-decoder attention sublayer; iii. a feedforward sublayer; - an HLA sequence embedder; - a probability generator, the probability generator comprising: i. a linear mapper; ii. a softmax layer.
12. The computer-implemented method according to claim 1, further comprising the following steps: - Training a semi-independent neural network on a semi-independent training dataset, wherein the semi-independent training dataset includes at least the positive dataset of the deep learning model or a variant thereof and an associated prediction improvement parameter training dataset, wherein the associated prediction improvement parameter training dataset relates to one or more of the following biological parameters: RNA expression of the gene from which the neoepitope is derived, RNA expression of multiple genes in cancer tissue samples, expression of non-coding RNA sequences, post-translational modification information, RNA editing event information, immune scores of multiple immune cell types, clonality of cancer tissue samples, confidence scores of multiple genomic alteration events, peptide-MHC binding affinity, peptide-MHC complex stability, peptide stability and / or turnover rate, adjacent amino acids within the neoepitope sequence, proteasome activity, and peptide processing activity; - Determining the semi-independent presentation probability of each neoantigen in the neoantigen set for the HLA peptide sequence using the trained semi-independent neural network; and - Combining, for each neoantigen in the neoantigen set, the determined semi-independent presentation probability and the presentation probability obtained by the trained model to obtain an overall presentation probability.
13. The computer-implemented method according to claim 12, wherein the associated prediction improvement parameter training dataset relates at least to adjacent amino acids within the neoepitope sequence.
14. The computer-implemented method according to claim 12, wherein the combining is performed using a trained single-layer neural network.
15. The computer-implemented method according to claim 12, wherein the semi-independent neural network is a single-layer neural network.
16. The computer-implemented method according to claim 2, further comprising: identifying one or more T cells that are antigen-specific for at least one neoantigen in the subset.
17. A computer system for determining the presentation probability of a neoantigen set by means of tumor cells of a subject's tumor, said computer system being configured to perform a computer-implemented method according to any one of claims 1 to 16.
18. A computer program product for determining the presentation probability of a neoantigen set by means of tumor cells of a subject's tumor, the computer program product comprising instructions that, when executed by a computer, cause the computer to perform a computer-implemented method according to any one of claims 1 to 16.