Transposon-derived tumor neoantigen recognition method based on deep learning guided proteomics and verification method

A genome-wide transposon six-frame translation database is reduced with deep learning de novo analysis, searched with multiple database search engines, and combined with targeted mass spectrometry validation. This addresses the reliability and tumor-specificity problems of transposon-derived tumor neoantigen identification and enables efficient identification and validation of tumor neoantigens.

CN122201454A · Pending · Publication Date: 2026-06-12 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2026-01-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for identifying transposon-derived tumor neoantigens suffer from problems such as low identification sensitivity and low recognition rate due to database expansion, lack of systematic validation, and insufficient tumor-specific assessment.

Method used

A genome-wide transposon six-frame translation protein database is constructed and then reduced in a deep learning-guided manner: high-confidence sequence tags predicted by a deep learning de novo analysis model are used to screen the database, the reduced database is searched with multiple database search engines, and the resulting candidates are validated by targeted mass spectrometry to ensure tumor specificity and reliability.

Benefits of technology

It significantly improves the reliability and sensitivity of transposon-derived tumor neoantigen identification, reduces the risk of false positive identification, ensures tumor specificity and reproducibility, and the output results are suitable for tumor immunotherapy design.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122201454A_ABST

Abstract

The application relates to a transposon-derived tumor neoantigen identification method and a verification method based on deep learning-guided proteomics. The identification method comprises the following steps: acquiring HLA immunopeptidome mass spectrometry data and constructing a genome-wide transposon six-frame translation protein database; using a deep learning de novo analysis model to predict peptide sequences from the HLA immunopeptidome mass spectrometry data to obtain high-confidence sequence tags, and using the tags to perform sample-specific simplification of the genome-wide transposon six-frame translation protein database to obtain a simplified transposon-specific protein database; merging the simplified transposon-specific protein database with a human standard protein database, searching the HLA immunopeptidome mass spectrometry data against the merged database, and performing grouped FDR control and affinity screening to obtain candidate transposon-derived HLA peptides; and further screening the candidates to obtain transposon-derived tumor-specific neoantigens. Compared with the prior art, the application has advantages such as improved identification sensitivity.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics, and in particular to a method for identifying transposon-derived tumor neoantigens based on deep learning-guided proteomics and a verification method thereof.

Background Technology

[0002] T cell-mediated tumor immune responses depend on peptides presented to the cell surface by human leukocyte antigen (HLA) molecules. Peptides that are expressed only or significantly enriched in tumor cells, can be presented by HLA, and can be recognized by T cells are called tumor neoantigens and are important targets for precision immunotherapies such as personalized tumor vaccines and adoptive cell therapy.

[0003] Early research primarily focused on classic neoantigens derived from somatic mutations in protein-coding regions. However, with the development of immunopeptidomics technology and integrated multi-omics analysis, increasing evidence suggests that non-classical sources such as non-coding regions, aberrantly spliced transcripts, long non-coding RNAs, circular RNAs, and transposable elements (TEs) can also generate a large number of peptides that can be presented by HLA. These "non-classical neoantigens" are often expressed at low levels or not at all in normal tissues and have higher tumor specificity and potential immunogenicity.

[0004] Transposons comprise nearly half of the human genome and include multiple categories such as LINE, SINE, LTR / endogenous retrovirus (ERV), DNA transposons, and SVA, representing an important source of non-canonical coding potential. Existing studies have shown that transposons in tumors can be aberrantly activated through loss of epigenetic silencing, producing proteins or peptides that contain TE sequences. These peptides can be presented by HLA and become potential tumor target antigens.

[0005] Existing computational workflows for non-classical neoantigen identification are mostly based on RNA-seq-driven transcript annotation and open reading frame prediction. These methods translate a large number of potential transcripts into protein sequences, merge them with standard protein databases, and then search the mass spectrometry data using traditional database search software. This type of method has the following main shortcomings:
1. The large number of potential transcripts leads to extreme expansion of the protein database, resulting in a huge search space that significantly reduces the sensitivity and statistical power of mass spectrometry identification;
2. Existing workflows are mostly designed for general non-coding ORFs or fusion peptides and have not been optimized for the repetitiveness and sequence homology of TE sequences, resulting in a low identification rate for TE-derived peptides;
3. Most studies stay at the algorithm level and rely on FDR control for "discovery phase" identification, lacking a systematic verification step based on targeted mass spectrometry (parallel reaction monitoring, PRM), which limits the reliability and translatability of candidate neoantigens;
4. The tumor-specificity assessment of transposon-derived peptides relies heavily on expression thresholds and has not fully incorporated Ribo-seq translational evidence or "negative selection" information such as thymus-presented peptides, making it difficult to effectively exclude peptides that may be subject to immune tolerance.

[0006] Therefore, there is an urgent need for a specialized technical solution for transposon-derived tumor neoantigens to address the aforementioned technical problems.

Summary of the Invention

[0007] The purpose of this invention is to provide a method for identifying transposon-derived tumor neoantigens based on deep learning-guided proteomics, and a verification method thereof, to improve the reliability of identification.

[0008] The objective of this invention can be achieved through the following technical solution: a method for identifying transposon-derived tumor neoantigens based on deep learning-guided proteomics, including the following steps:
HLA immunopeptidome mass spectrometry data are acquired from tumor tissues or tumor cell lines, and a genome-wide transposon six-frame translation protein database is constructed based on the reference genome and transposon annotations;
A deep learning de novo analysis model is used to predict peptide sequences from the HLA immunopeptidome mass spectrometry data to obtain high-confidence sequence tags, and the high-confidence sequence tags are used to perform sample-specific simplification of the genome-wide transposon six-frame translation protein database to obtain a simplified transposon-specific protein database;
The simplified transposon-specific protein database is merged with the human standard protein database, the HLA immunopeptidome mass spectrometry data are searched against the merged database, and candidate transposon-derived HLA peptides are obtained after grouped FDR control and affinity screening;
The candidate transposon-derived HLA peptides are screened to obtain transposon-derived tumor-specific neoantigens.

[0009] Furthermore, the steps for constructing the genome-wide transposon six-frame translation protein database include:
Based on the preset reference genome and transposon annotations, the genomic sequence of each transposon element is extracted;
The genomic sequence of each transposon element is translated in six frames to obtain the translation products;
The translation products are split at stop codons, and sequence fragments shorter than a preset threshold are discarded to obtain the genome-wide transposon six-frame translation protein database, wherein the database contains a classification index constructed according to transposon category, family, and subfamily.

[0010] Furthermore, the step of obtaining the simplified transposon-specific protein database includes: Based on the classification index, the high-confidence sequence tag is compared with the genome-wide transposon six-frame translation protein database using short peptide homology, and at least one transposon protein sequence matching the high-confidence sequence tag is retained to construct a simplified transposon-specific protein database.

[0011] Furthermore, at least two deep learning de novo analysis models are used. When two deep learning de novo analysis models are used, the step of obtaining high-confidence sequence tags includes:
Each HLA immunopeptidome mass spectrometry data file is converted into a standard format;
A first deep learning de novo analysis model, which adopts a sequence-to-sequence Transformer neural network structure, predicts a peptide sequence for each standard-format HLA immunopeptidome mass spectrum to obtain a first predicted sequence;
A second deep learning de novo analysis model, which adopts a convolutional neural network structure and/or a Transformer neural network structure, predicts a peptide sequence for each standard-format HLA immunopeptidome mass spectrum to obtain a second predicted sequence;
The first and second predicted sequences of the same HLA immunopeptidome mass spectrum are subjected to confidence-weighted fusion or voting selection, and a sliding window is applied to the first and second predicted sequences to select sequence fragments with an average residue confidence not less than a first threshold and a length not less than a second threshold as high-confidence sequence tags.

[0012] Furthermore, the step of obtaining the candidate transposon-derived HLA peptides includes:
The simplified transposon-specific protein database and the human standard protein database are merged to form a joint database;
At least two search engines are used to search the HLA immunopeptidome mass spectrometry data against the joint database under non-specific enzyme digestion conditions to obtain the peptide-spectrum match (PSM) results output by each search engine, where the parameters set for the search include peptide length, fragment mass error tolerance, and variable modifications;
Reversed or scrambled decoy sequences are introduced for the search performed by each search engine, and the PSM results output by each search engine are re-scored;
The PSM results from the simplified transposon-specific protein database and the PSM results from the human standard protein database are grouped to obtain a transposon group and a standard group;
For the transposon group and the standard group, based on the target-decoy strategy, a discrimination score is extracted from each PSM result of each group, and the score distribution of each group is constructed;
Based on the score distribution, a sliding score threshold is set, the numbers of PSM results of the two sets (target and decoy) above each score threshold are counted, and the ratio of the two counts is used as the estimated false discovery rate at that score threshold, forming a list of score thresholds and false discovery rate estimates;
From this list, the highest score threshold whose false discovery rate estimate is no greater than a preset threshold is obtained and used as the score cutoff value;
At both the PSM and peptide levels, all PSM results in the two groups that meet the score cutoff value are obtained and used as the reliable set of transposon-derived presented peptides and the set of standard peptides, respectively;
Based on the set of transposon-derived presented peptides, HLA binding affinity is evaluated, and transposon-derived presented peptides with HLA binding ability are selected as candidate transposon-derived HLA peptides.

[0013] Furthermore, the step of obtaining the candidate transposon-derived HLA peptides includes:
A list of HLA-I alleles of the tumor tissue or tumor cell line is obtained, determined using an RNA-seq-based HLA typing tool or from external information;
For each transposon-derived presented peptide, the binding affinity to every HLA-I allele in the list is predicted, and a binding score expressed as a percentile rank is obtained;
The smallest percentile rank among all percentile ranks is selected as the optimal allele rank;
Transposon-derived presented peptides whose optimal allele rank is less than a preset threshold are identified as HLA-binding peptides, and those that meet a stricter threshold are identified as strong-binding peptides. Both HLA-binding peptides and strong-binding peptides are candidate transposon-derived HLA peptides.

[0014] Furthermore, the step of obtaining the transposon-derived tumor-specific neoantigens includes:
Transcriptome data from tumor tissue and paired normal tissue are obtained, and an expression matrix of transposons or transposon-gene chimeric transcripts is constructed from the transcriptome data;
Standard differential expression analysis is used to identify transposon-related transcripts that are significantly upregulated in tumor tissue but expressed below a threshold in paired normal tissue and in public databases of healthy tissues;
The candidate transposon-derived HLA peptides corresponding to these transposon-related transcripts or to their subfamilies are labeled as candidate transposon-derived tumor-specific neoantigens with expression support;
Ribo-seq data from multiple normal tissues are acquired, open reading frames with stable translation evidence are identified using ORF prediction and reading frame statistics algorithms, and candidate transposon-derived tumor-specific neoantigens that perfectly match these open reading frames are eliminated in a first round;
Candidates found in a database of MHC-I presented peptides from thymus tissue are then further eliminated, yielding transposon-derived tumor-specific neoantigens that avoid central immune tolerance.
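The three negative-screening filters in [0014] can be illustrated with a minimal Python sketch. It assumes the upstream analyses have already been run and reduced to simple inputs: the field names `tumor_up` and `normal_expr`, the function name `screen_candidates`, and the expression threshold of 1.0 are all hypothetical and are not specified in the patent text.

```python
def screen_candidates(candidates, normal_orf_proteins, thymus_peptides,
                      max_normal_expr=1.0):
    """Apply expression, Ribo-seq ORF, and thymus filters to candidate peptides.

    candidates: list of dicts with keys
        'peptide'     : amino acid sequence of the candidate HLA peptide
        'tumor_up'    : True if the source transcript is significantly
                        upregulated in tumor (result of differential analysis)
        'normal_expr' : expression of the source transcript in normal tissue
    normal_orf_proteins: protein sequences of ORFs with translation evidence
        in normal tissues (from Ribo-seq analysis)
    thymus_peptides: peptides presented by MHC-I in thymus tissue
    max_normal_expr: hypothetical threshold; the text leaves it unspecified
    """
    thymus = set(thymus_peptides)
    kept = []
    for cand in candidates:
        peptide = cand["peptide"]
        # Filter 1: expression support (up in tumor, low in normal tissue).
        if not (cand["tumor_up"] and cand["normal_expr"] < max_normal_expr):
            continue
        # Filter 2: drop peptides that exactly match a translated normal ORF.
        if any(peptide in orf for orf in normal_orf_proteins):
            continue
        # Filter 3: drop peptides already presented in the thymus.
        if peptide in thymus:
            continue
        kept.append(peptide)
    return kept


if __name__ == "__main__":
    demo = [
        {"peptide": "AAAAAAAAK", "tumor_up": True, "normal_expr": 0.1},
        {"peptide": "CCCCCCCCK", "tumor_up": True, "normal_expr": 5.0},
    ]
    print(screen_candidates(demo, ["MMDDDDDDDDKLL"], ["EEEEEEEEK"]))
```

The filters are applied in the order given in the text; because each one only removes candidates, the order affects runtime but not the final set.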

[0015] Furthermore, the method also includes using drug-induced differential information to screen for drug-induced transposon-derived tumor-specific neoantigens among the candidate transposon-derived HLA peptides. The specific screening steps include:
For tumor cell lines, candidate transposon-derived HLA peptides are identified before and after treatment with a tumor drug;
The presence and abundance of the candidate transposon-derived HLA peptides before and after drug treatment are compared, and candidates that appear only after drug treatment or whose abundance increases significantly after drug treatment, and that also pass the normal-tissue translation evidence and thymus negative screening criteria, are obtained as drug-induced transposon-derived tumor-specific neoantigens.
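The before/after comparison in [0015] can be sketched as follows. This is only an illustration: the text asks for a "significant" increase without naming a test, so the sketch substitutes a plain fold-change cutoff (`min_fold`, a hypothetical parameter), and the function name `drug_induced` is likewise invented here. Peptides returned by it would still need to pass the normal-tissue translation and thymus filters.

```python
def drug_induced(before, after, min_fold=2.0):
    """Return peptides that appear only after treatment or rise by min_fold.

    before, after: dicts mapping peptide sequence -> measured abundance
    min_fold: hypothetical fold-change cutoff standing in for a proper
              significance test on replicate measurements
    """
    induced = []
    for peptide, post in after.items():
        pre = before.get(peptide, 0.0)
        if post <= 0:
            continue  # not detected after treatment
        if pre == 0 or post / pre >= min_fold:
            induced.append(peptide)
    return sorted(induced)


if __name__ == "__main__":
    untreated = {"PEPTIDEA": 1.0, "PEPTIDEB": 2.0}
    treated = {"PEPTIDEA": 3.0, "PEPTIDEB": 2.5, "PEPTIDEC": 1.0}
    print(drug_induced(untreated, treated))
```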

[0016] This invention also provides a targeted mass spectrometry-based experimental verification method, which performs targeted mass spectrometry verification on transposon-derived tumor-specific neoantigens obtained by the aforementioned deep learning-guided proteomics-based transposon-derived tumor neoantigen recognition method, and outputs the verification results.

[0017] Furthermore, the step of performing targeted mass spectrometry verification includes:
The candidate transposon-derived HLA peptides corresponding to the transposon-derived tumor-specific neoantigens are ranked according to set indicators, and a preset number of them are automatically selected as target peptides for priority verification, where the indicators include matching score, number of supporting spectra, HLA binding strength, and reproducibility across samples;
Based on the amino acid sequence of each target peptide, the theoretical mass-to-charge ratio, charge state, and predicted chromatographic retention time are calculated, and a PRM inclusion list is generated from them;
The PRM inclusion list is sent to the mass spectrometer to instruct it to preferentially isolate and fragment the precursor ions of the target peptides that fall within a preset time window and to collect their MS/MS signals, so that the target peptides are detected qualitatively and quantitatively with high sensitivity and signal-to-noise ratio;
Fragment ion peak lists are extracted and signal intensities are normalized for the discovery-stage MS/MS spectrum, the PRM spectrum, and the synthetic peptide MS/MS spectrum, and similarity indices are calculated, where the discovery stage includes the prediction stage of the deep learning de novo analysis model, the database search stage, and the re-scoring stage, and the discovery-stage MS/MS spectrum is the experimental MS/MS spectrum corresponding to the candidate transposon-derived HLA peptide;
When the similarity index between the discovery-stage MS/MS spectrum and the synthetic peptide MS/MS spectrum and the similarity index between the PRM spectrum and the synthetic peptide MS/MS spectrum are both greater than a preset threshold, the corresponding target peptide is identified as a PRM-verified transposon-derived tumor-specific neoantigen;
The PRM-verified transposon-derived tumor-specific neoantigens are summarized statistically by transposon class, family, and subfamily to obtain, for each tumor tissue or tumor cell line, the number of transposon-derived tumor-specific neoantigens contributed by each transposon class, their HLA allele restriction, and their reproducibility across multiple samples.
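The spectral comparison and the two-sided acceptance rule can be sketched in Python. The patent text does not name the similarity index, so cosine similarity is used here purely as an assumption. The representation of a spectrum as a dict of fragment ion label to intensity, the function names, and the threshold of 0.8 are also hypothetical.

```python
import math


def normalize_peaks(peaks):
    """Scale intensities so that the most intense fragment ion equals 1."""
    top = max(peaks.values())
    return {ion: intensity / top for ion, intensity in peaks.items()}


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two peak lists keyed by fragment ion label."""
    a = normalize_peaks(spec_a)
    b = normalize_peaks(spec_b)
    shared = set(a) & set(b)
    dot = sum(a[ion] * b[ion] for ion in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def prm_verified(discovery, prm, synthetic, threshold=0.8):
    """Accept a target peptide only if both spectra resemble the synthetic one.

    threshold is a hypothetical value; the text only says 'preset threshold'.
    """
    return (cosine_similarity(discovery, synthetic) > threshold
            and cosine_similarity(prm, synthetic) > threshold)


if __name__ == "__main__":
    synthetic = {"b2": 100.0, "y3": 200.0, "y4": 160.0}
    discovery = {"b2": 45.0, "y3": 100.0, "y4": 85.0}
    prm = {"b2": 6.0, "y3": 10.0, "y4": 8.0}
    print(prm_verified(discovery, prm, synthetic))
```

Keying peaks by ion label sidesteps m/z tolerance matching; a real comparison on raw peak lists would first need to align fragment m/z values within the instrument's mass tolerance.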

[0018] Compared with the prior art, the present invention has the following beneficial effects: (1) The high-confidence sequence tags obtained by the deep learning de novo analysis model are used for homology screening and simplification of the genome-wide transposon six-frame translation protein database, which significantly reduces the search space while preserving coverage and lowers the risk of false-positive identification of transposon-derived peptides. In addition, grouped FDR control and affinity screening reduce the statistical bias of a global FDR strategy against the minority class (TE peptides), improving the reliability and reproducibility of TE peptide identification under the same FDR constraint and thus the reliability of transposon-derived tumor-specific neoantigen identification.

[0019] (2) The search space is significantly reduced, improving the sensitivity and specificity of TE peptide identification: a complete genome-wide transposon six-frame translation protein database is constructed by six-frame translation of genome-wide transposon sequences, and the high-confidence sequence tags obtained by the deep learning de novo analysis model are used for sample-specific simplification, retaining only candidate sequences with experimental support. This greatly reduces the number of candidate peptides searched and significantly improves the number and score quality of identified transposon-derived HLA peptides while controlling the FDR.

[0020] (3) Grouped FDR control and independent statistical modeling for TE peptides: this invention groups peptides from transposon-specific proteins separately from peptides from standard proteins and establishes a target-decoy model for each group. The FDR of the TE group and of the standard group is controlled separately at the peptide-spectrum match level and at the peptide level. This effectively avoids the statistical bias of the traditional global FDR strategy against TE peptides, which are far fewer in number than standard peptides, thereby ensuring the credibility of the identification results for transposon-derived peptides.

[0021] (4) Combining multi-engine database search with deep learning de novo analysis to balance sensitivity and accuracy: this invention runs multiple database search engines in parallel and uses a deep learning de novo analysis model to predict high-confidence sequence tags. This improves the coverage of peptide-spectrum matching and, through statistical integration and re-scoring, also improves identification accuracy.

[0022] (5) Introduce multi-dimensional biological constraints and strictly screen tumor-specific neoantigens: Candidate transposon-derived HLA peptides not only need to meet the FDR and HLA binding thresholds, but also need to be "negatively screened" by combining tumor and normal tissue transcriptome data, normal tissue Ribo-seq translation evidence and thymus-presenting peptide information, effectively excluding peptides that are widely expressed in normal tissues or have participated in central immune tolerance, so as to preferentially obtain truly tumor-specific neoantigens, namely transposon-derived tumor-specific neoantigens.

[0023] (6) Introducing PRM targeted mass spectrometry verification to form a closed-loop "discovery + verification" technical solution: compared with existing methods that rely solely on database search results from the discovery stage, this invention automatically generates a PRM inclusion list and performs parallel reaction monitoring and spectral similarity evaluation on candidate transposon-derived HLA peptides. This significantly raises the level of verification of candidate neoantigens and makes the output better suited for direct use in subsequent T cell functional validation and clinical translation.

[0024] (7) Outputting TE-derived neoantigen resources that can directly support the design of tumor immunotherapy: this invention outputs not only individual transposon-derived tumor-specific neoantigen peptides but also, systematically, the corresponding transposon class, family, subfamily, and HLA restriction, which can be used directly to design "off-the-shelf" shared tumor vaccines or personalized TCR-T treatment regimens.

Attached Figure Description

[0025] Figure 1 is a schematic diagram of the identification and verification methods of the present invention;
Figure 2 is a schematic diagram of the process of constructing the genome-wide transposon six-frame translation protein database and simplifying it in a sample-specific manner according to the present invention;
Figure 3 is a schematic diagram of the screening process for transposon-derived tumor-specific neoantigens according to the present invention;
Figure 4 is a schematic diagram of the PRM targeted mass spectrometry verification process and of the spectrum similarity evaluation of the present invention.

Detailed Implementation

[0026] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.

[0027] Terminology Definition ① "Transposon (TE)": refers to mobile genetic elements in the genome, including LINE, SINE, LTR / ERV, DNA transposon, SVA, etc.

[0028] ② "Transposon-derived presented peptides": refers to HLA-presented peptides that are determined to originate from the transposon-specific protein database after mass spectrometry database searching and control at a preset false discovery rate (FDR).

[0029] ③ "Candidate transposon-derived HLA peptides": refers to the set of peptides obtained by further screening based on HLA binding affinity thresholds on the basis of "transposon-derived presented peptides".

[0030] ④ "Candidate transposon-derived tumor-specific neoantigen peptides": refers to the set of peptides obtained by combining evidence of tumor / normal tissue expression or translation, thymus negative screening, and optional drug-induced differential information based on "candidate transposon-derived HLA peptides".

[0031] ⑤ "Validated transposon-derived tumor-specific neoantigens": refers to the set of peptides confirmed after PRM targeted acquisition of "candidate transposon-derived tumor-specific neoantigen peptides" and determination by spectral similarity threshold.

[0032] ⑥ "BLAST retrieval": refers to using retrieval tools to retrieve the corresponding TE proteins from the transposon database using the prediction results of the deep learning de novo parsing model as the query sequence, and generating a sample-specific simplified transposon-specific protein database from the retrieval results.

[0033] ⑦ “MSFragger”, “MS-GF+” and “Comet”: These are three search engines used to identify input HLA immunopeptide mass spectrometry data.

[0034] ⑧ "iProphet": This is a merging tool used to combine the identification results from multiple search engines to provide more accurate results.

[0035] Example 1
This embodiment provides a method for identifying transposon-derived tumor neoantigens based on deep learning-guided proteomics. As shown in Figure 1, the method includes the following steps:
(1) Acquisition of HLA immunopeptidome mass spectrometry data
This step obtains the HLA immunopeptidome mass spectrometry data. Specifically: representative tumor tissues or cell lines, such as glioma, lung cancer, or melanoma cell lines, are selected. HLA-I (human leukocyte antigen class I) associated peptides are prepared using standard immunopeptidomics procedures. High-resolution Orbitrap mass spectrometry in DDA (data-dependent acquisition) mode is used to obtain the raw HLA immunopeptidome mass spectrometry data files of the presented peptides. Optionally, RNA-seq and/or DNA-seq data of the same cell line are also acquired.

[0036] (2) Construction of a genome-wide transposon six-frame translation protein database
This step constructs a genome-wide transposon six-frame translation protein database based on the reference genome and transposon annotations. Specifically: based on a predefined human reference genome (e.g., hg38) and transposon annotation resources such as Dfam, the genomic sequences of all transposon elements labeled as LINE, SINE, LTR / ERV, DNA, SVA, etc., are extracted. Each sequence is translated in six frames to obtain translation products. The translation products are then split at stop codons into multiple polypeptide fragments, and fragments shorter than 8 amino acids are discarded to obtain the genome-wide transposon six-frame translation protein database. A classification index is constructed for the transposon protein sequences in the database according to transposon category, family, and subfamily.
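As an illustration of step (2), the following Python sketch translates a transposon DNA sequence in six frames, splits at stop codons, and drops fragments shorter than 8 residues. It uses only the standard genetic code and the standard library. The function names, the tuple layout of `elements`, and the header format `class|family|subfamily|locus|strand+frame|index` are hypothetical choices made for this sketch; extraction of sequences from hg38 and Dfam annotations is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate DNA codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_fragments(dna, min_len=8):
    """Yield (strand, frame, fragment) for stop-free fragments >= min_len."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", forward), ("-", reverse)):
        for frame in range(3):
            for fragment in translate(seq[frame:]).split("*"):
                if len(fragment) >= min_len:
                    yield strand, frame, fragment


def build_database(elements, min_len=8):
    """Build {header: fragment}; headers carry class, family, and subfamily.

    elements: iterable of (te_class, family, subfamily, locus_id, dna)
    """
    database = {}
    for te_class, family, subfamily, locus_id, dna in elements:
        for strand, frame, fragment in six_frame_fragments(dna, min_len):
            header = (f"{te_class}|{family}|{subfamily}|{locus_id}"
                      f"|{strand}{frame}|{len(database)}")
            database[header] = fragment
    return database


if __name__ == "__main__":
    toy = [("LINE", "L1", "L1HS", "locus1", "ATGGCCAAAGGGTTTCCCGAAGACTAA")]
    for header, fragment in build_database(toy).items():
        print(header, fragment)
```

Encoding class, family, and subfamily in each header is one simple way to realize the classification index: entries can later be grouped or filtered by splitting the header on `|`.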

[0037] (3) Deep learning de novo sequence tag generation and sample-specific simplification
This step uses multiple deep learning de novo analysis models to predict sequences from the HLA immunopeptidome mass spectrometry data, extracts high-confidence sequence tags, and performs sample-specific simplification of the genome-wide transposon six-frame translation protein database. Specifically: the raw HLA immunopeptidome mass spectrometry data files are converted to a standard format (such as mzML or MGF) and input into at least two deep learning de novo analysis models (e.g., Transformer-based models or convolutional neural network models) to predict a peptide sequence for each MS/MS spectrum in the data. For each prediction result, a sliding window strategy based on residue-level confidence is used to extract high-confidence sequence tags with a length of not less than 8.

[0038] Based on the constructed classification index, the above high-confidence sequence tags are compared with the genome-wide transposon six-frame translation protein database using short peptide homology. Under the condition of ignoring the distinction between leucine and isoleucine, a complete match is required or only a very small number of mismatches are allowed. Only transposon protein sequences containing at least one matching tag are retained to construct a sample-specific simplified transposon protein database.
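The tag-based reduction can be sketched as below. This minimal version handles only the exact-match case with leucine and isoleucine treated as equivalent; the small number of mismatches the text permits would need an alignment tool (the terminology section mentions BLAST retrieval) and is not implemented here. The function names and the toy sequences are hypothetical.

```python
def normalize(sequence):
    """Map isoleucine to leucine, since the two residues are isobaric."""
    return sequence.upper().replace("I", "L")


def reduce_database(te_proteins, tags):
    """Keep TE protein entries that contain at least one tag exactly.

    te_proteins: dict mapping entry name -> protein sequence
    tags: iterable of high-confidence sequence tags
    """
    normalized_tags = {normalize(tag) for tag in tags}
    reduced = {}
    for name, sequence in te_proteins.items():
        normalized_seq = normalize(sequence)
        if any(tag in normalized_seq for tag in normalized_tags):
            reduced[name] = sequence  # keep the original, unnormalized entry
    return reduced


if __name__ == "__main__":
    database = {"entry_1": "MKTAYIAKQRQISF", "entry_2": "GGGSSSPPPWWHHH"}
    print(reduce_database(database, ["TAYLAKQR"]))
```

For a genome-wide database, a k-mer index over the protein entries would avoid the nested scan used here, at the cost of more memory.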

[0039] In this embodiment, when two deep learning de novo analysis models are selected, the steps for generating high-confidence sequence tags include:
First deep learning de novo analysis model: predicts peptide sequences directly from the MS/MS spectra based on a sequence-to-sequence Transformer neural network structure;
Second deep learning de novo analysis model: predicts peptide sequences from the MS/MS spectra based on a convolutional neural network and/or Transformer neural network structure;
Results integration: the prediction results of the different deep learning de novo analysis models are compared, and the predicted sequences of the same spectrum are fused by confidence weighting or selected by voting. A sliding window is applied to each predicted sequence to select sequence fragments whose average residue confidence is not lower than the first threshold and whose length is not less than the second threshold as high-confidence sequence tags.
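The sliding-window selection can be illustrated with a short Python sketch. It emits every fixed-length window whose mean residue confidence reaches the first threshold, so each tag has exactly the minimum length. The value 0.9 for the first threshold and the function name are assumptions of this sketch, while the window length of 8 follows the embodiment. Fusion or voting across models is not shown.

```python
def high_confidence_tags(peptide, residue_conf, min_mean_conf=0.9, window=8):
    """Return unique windows of `window` residues with mean confidence >= cutoff.

    peptide: predicted peptide sequence from a de novo model
    residue_conf: per-residue confidence values, same length as peptide
    min_mean_conf: first threshold (hypothetical value)
    window: second threshold, used here as the fixed tag length
    """
    if len(peptide) != len(residue_conf):
        raise ValueError("peptide and residue_conf must have equal length")
    tags = []
    for start in range(len(peptide) - window + 1):
        scores = residue_conf[start:start + window]
        if sum(scores) / window >= min_mean_conf:
            tag = peptide[start:start + window]
            if tag not in tags:  # keep first occurrence, preserve order
                tags.append(tag)
    return tags


if __name__ == "__main__":
    sequence = "ACDEFGHIKLM"
    confidence = [0.1] + [1.0] * 8 + [0.1, 0.1]
    print(high_confidence_tags(sequence, confidence))
```

Emitting fixed-length windows rather than merging adjacent ones keeps the guarantee simple: every returned tag individually satisfies both thresholds, which a merged region does not necessarily do.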

[0040] The process of steps (2)-(3) above in this embodiment is shown in Figure 2.

[0041] (4) Multi-engine database search and grouped FDR control
This step merges the simplified transposon-specific protein database with a human standard protein database, searches the merged database with multiple database search engines under non-specific digestion conditions, and controls the false discovery rate separately for transposon-derived peptides and canonical protein-derived peptides to obtain a reliable set of transposon-derived presented peptides. Specifically: the simplified transposon-specific protein database is merged with a human standard protein database (such as the UniProt reference proteome) to obtain a joint database. At least two search engines are used to search the HLA immunopeptidome mass spectrometry data under non-specific enzyme digestion conditions, with the peptide length set to 8-15 amino acids, the fragment mass error tolerance set to 20 ppm, and variable modifications including methionine oxidation, and the peptide-spectrum match (PSM) results output by each search engine are obtained. The above parameters are merely exemplary settings; those skilled in the art can adjust them according to instrument conditions and sample types without affecting the technical effects of the present invention.

[0042] For each search engine's peptide-spectrum matching results, reversed or scrambled decoy sequences are introduced for target-decoy statistical modeling. Statistical integration tools are used to uniformly re-score the matching results from the different engines. All peptide-spectrum matching results are grouped according to whether the corresponding peptide originates from the simplified transposon-specific protein database or the human standard protein database, giving a transposon group and a standard group. Based on the target-decoy strategy, a discrimination score is extracted from each peptide-spectrum matching result in each group, and a score distribution is constructed for both groups. A sliding score threshold is then set, the number of peptide-spectrum matching results in both groups above each threshold is counted, and the ratio of these two counts is taken as the estimate of the false discovery rate (FDR) at that threshold, forming a list of score thresholds and FDR estimates. The highest score threshold in this list at which the FDR estimate is no greater than a preset condition (e.g., 0.01) is taken as the score cutoff value. Based on this cutoff value, the FDR is controlled to be no greater than the preset condition at both the PSM (peptide-spectrum match) level and the peptide level. That is, all peptide-spectrum matching results in the two groups that meet the score cutoff value are obtained and used as the reliable set of transposon-derived presented peptides and the set of standard peptides, respectively.
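
A minimal Python sketch of group-wise target-decoy filtering follows. It rests on two interpretive assumptions that this paragraph does not fix: the FDR estimate at a threshold is taken as the number of decoy matches divided by the number of target matches scoring at or above it (the usual target-decoy estimate), and the cutoff is the most permissive score threshold whose estimate stays within the limit. The `PSM` record and all function names are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class PSM(NamedTuple):
    peptide: str
    score: float    # re-scored discrimination score, higher is better
    is_decoy: bool  # True if the match is to a reversed/scrambled decoy
    group: str      # "transposon" or "standard"


def score_cutoff(psms: List[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Walk the score-sorted list and return the most permissive score
    threshold whose decoy/target ratio is <= max_fdr (None if none)."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    cutoff = None
    for i, psm in enumerate(ranked):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only after the last PSM of a run of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1].score == psm.score:
            continue
        if targets and decoys / targets <= max_fdr:
            cutoff = psm.score
    return cutoff


def grouped_fdr_filter(psms: List[PSM], max_fdr: float = 0.01) -> Dict[str, List[PSM]]:
    """Apply the cutoff separately within each group and keep target PSMs."""
    accepted: Dict[str, List[PSM]] = {}
    for group in sorted({p.group for p in psms}):
        members = [p for p in psms if p.group == group]
        cutoff = score_cutoff(members, max_fdr)
        accepted[group] = [] if cutoff is None else [
            p for p in members if not p.is_decoy and p.score >= cutoff
        ]
    return accepted
```

Estimating the cutoff separately per group keeps the much larger standard group from dominating the score distribution of the smaller transposon group, which is the motivation for grouped FDR control.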

[0043] (5) HLA binding affinity prediction. This step assesses the human leukocyte antigen (HLA) binding affinity of the transposon-derived presented peptides and screens those with HLA binding capacity as candidate transposon-derived HLA peptides. Specific steps include: the list of HLA-I alleles in the tumor tissue or tumor cell line is determined using RNA-seq-based HLA typing tools, or from external information and/or clinical input information. The binding affinity of each transposon-derived presented peptide to all alleles in the list is predicted using HLA ligand binding affinity prediction tools that support multiple alleles (such as one or more of NetMHCpan and MixMHCpred), and a binding score expressed as a percentile rank is obtained. The lowest of all percentile rank values is selected as the optimal allele rank value, which is used as the final HLA binding score of the peptide.

[0044] HLA-binding peptides are defined as those whose optimal allele rank value is below a preset threshold (e.g., rank value < 2%). Transposon-derived presented peptides meeting this threshold are considered HLA-binding peptides, while those meeting a more stringent threshold (e.g., rank value < 0.1%) are considered strong-binding peptides. This screening yields the set of candidate transposon-derived HLA peptides.
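
The best-allele selection and thresholding can be sketched in Python as below. The sketch assumes the percentile ranks have already been produced by a prediction tool and are supplied as a mapping from allele name to rank (in percent); the function name and the label strings are hypothetical, and the default thresholds are the exemplary 2% and 0.1% given above.

```python
from typing import Dict, Tuple


def classify_binding(
    ranks_by_allele: Dict[str, float],
    binder_rank: float = 2.0,   # exemplary HLA-binding threshold (%)
    strong_rank: float = 0.1,   # exemplary strong-binding threshold (%)
) -> Tuple[str, float, str]:
    """Pick the allele with the lowest percentile rank and label the peptide."""
    if not ranks_by_allele:
        raise ValueError("at least one allele rank is required")
    best_allele = min(ranks_by_allele, key=ranks_by_allele.get)
    best_rank = ranks_by_allele[best_allele]
    if best_rank < strong_rank:
        label = "strong_binder"
    elif best_rank < binder_rank:
        label = "binder"
    else:
        label = "non_binder"
    return best_allele, best_rank, label


if __name__ == "__main__":
    ranks = {"HLA-A*02:01": 0.05, "HLA-B*07:02": 12.0, "HLA-C*07:02": 3.5}
    print(classify_binding(ranks))
```

Peptides labeled as binders or strong binders would both enter the candidate set; the label only records which threshold was met.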

[0045] (6) Screening for tumor-specific and inducible TE neoantigens. This step combines HLA binding affinity prediction, transcriptome data from tumor and normal tissues, Ribo-seq translation evidence, thymus-presented peptide information, and differences in the immunopeptidome before and after drug treatment to perform tumor-specific screening of the candidate peptides, resulting in a set of transposon-derived tumor-specific neoantigens. As shown in Figure 3, it specifically includes: 1) Differential expression analysis between tumor and normal tissues. Based on transcriptome data from tumor tissue and paired normal tissue samples, expression matrices of transposons or transposon-gene chimeric transcripts are constructed, and standard differential analysis is used to identify TE-related transcripts that are significantly upregulated in tumor samples but expressed below a threshold in paired normal tissue and in public healthy-tissue databases.

[0046] Candidate transposon-derived HLA peptides originating from the aforementioned TE transcripts or their subfamilies are labeled as candidate transposon-derived tumor-specific neoantigens with expression support.

[0047] 2) Evidence of normal tissue translation and thymus-related negative screening. Based on publicly available or self-built ribosome profiling (Ribo-seq) data from normal tissues, ORF (open reading frame) prediction and reading-frame statistics algorithms are used to identify open reading frames with stable translation evidence, and peptides that perfectly match these ORFs are removed from the candidate set.

[0048] Candidate peptides are searched against publicly available or self-built thymic MHC-I presented peptide databases. Peptides appearing in the thymic presentation library are considered potentially subject to central immune tolerance and are removed, yielding transposon-derived tumor-specific neoantigens that avoid central immune tolerance.
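
The two negative screens can be sketched in Python as a simple filter. The sketch assumes that a "perfect match" means the peptide occurs as an exact substring of a translated normal-tissue ORF and that the thymic library is a collection of exact peptide sequences; the function and argument names are hypothetical.

```python
from typing import Iterable, List


def negative_screen(
    candidates: Iterable[str],
    normal_orf_proteins: Iterable[str],
    thymic_peptides: Iterable[str],
) -> List[str]:
    """Drop candidates found in translated normal-tissue ORFs or in the
    thymic MHC-I presented peptide library; keep the rest in input order."""
    orfs = list(normal_orf_proteins)
    thymic = set(thymic_peptides)
    kept: List[str] = []
    for peptide in candidates:
        if peptide in thymic:
            continue  # potentially subject to central tolerance
        if any(peptide in orf for orf in orfs):
            continue  # exact match to an ORF translated in normal tissue
        kept.append(peptide)
    return kept


if __name__ == "__main__":
    print(negative_screen(
        ["SLYNTVATL", "GILGFVFTL", "KLVALGINAV"],
        normal_orf_proteins=["MMSLYNTVATLKK"],
        thymic_peptides=["GILGFVFTL"],
    ))
```

For large ORF collections the linear substring scan would normally be replaced by an indexed search, but the filtering logic stays the same.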

[0049] Optionally, this embodiment can use differences in the immunopeptidome before and after drug treatment to perform inducibility screening of candidate transposon-derived HLA peptides, thereby obtaining drug-induced transposon-derived tumor-specific neoantigens. Specifically: for tumor cell lines treated with DNA methylation inhibitors or interferon, the identification process of Example 1 is performed before and after drug treatment to identify candidate transposon-derived HLA peptides. The presence and abundance changes of the candidate transposon-derived HLA peptides are compared, and those that appear only after treatment or whose abundance increases significantly, and that meet the above negative screening conditions, are labeled as drug-induced TE-derived tumor-specific neoantigens.

[0050] Example 2. This embodiment provides a targeted mass spectrometry-based experimental verification method, which verifies the transposon-derived tumor-specific neoantigens obtained in Example 1 using PRM targeted mass spectrometry. As shown in Figure 1, it includes the following steps: (1) Priority ranking of candidate peptides. The peptides of the transposon-derived tumor-specific neoantigens obtained in Example 1 are ranked according to the following criteria: ① peptide-spectrum match score and number of supporting spectra in the discovery phase; ② HLA binding affinity strength; ③ reproducibility across multiple samples; ④ degree of tumor enrichment of the transposon subfamily to which the peptide belongs.

[0051] A preset number of the highest-ranked peptides are selected as PRM priority validation targets (priority validation target peptides).
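
The text lists the ranking criteria but does not state how they are combined. One simple possibility, shown here purely as an assumption, is a rank-sum scheme: each candidate is ranked separately on every criterion, the positions are added, and the candidates with the smallest totals are selected. All field names below are hypothetical.

```python
from typing import Dict, List

# (field name, True if a higher value is better); all names are hypothetical.
CRITERIA = [
    ("psm_score", True),         # criterion 1: match score
    ("n_spectra", True),         # criterion 1: number of supporting spectra
    ("best_rank", False),        # criterion 2: lower percentile rank binds better
    ("n_samples", True),         # criterion 3: reproducibility across samples
    ("tumor_enrichment", True),  # criterion 4: subfamily tumor enrichment
]


def prioritize(candidates: List[Dict], top_n: int) -> List[Dict]:
    """Rank-sum prioritization: sum each candidate's position across all
    criteria and return the top_n candidates with the smallest totals."""
    totals = [0] * len(candidates)
    for field, higher_is_better in CRITERIA:
        order = sorted(
            range(len(candidates)),
            key=lambda i: candidates[i][field],
            reverse=higher_is_better,
        )
        for position, idx in enumerate(order):
            totals[idx] += position
    ranked = sorted(range(len(candidates)), key=lambda i: totals[i])
    return [candidates[i] for i in ranked[:top_n]]
```

A weighted score or a strict lexicographic ordering over the same criteria would be equally consistent with the description; the rank-sum form is chosen only because it needs no scaling between criteria with different units.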

[0052] (2) PRM inclusion list generation and data acquisition. Based on the amino acid sequence of the target peptide, the theoretical mass-to-charge ratio (m/z) under each possible charge state is calculated, and the chromatographic retention time range is estimated using retention time prediction tools. The m/z-charge state combinations and the retention time window of the target peptide are written into the PRM inclusion list and sent to the mass spectrometer control software, instructing the mass spectrometer to preferentially isolate and fragment precursor ions of the target peptide that fall within the preset time window and to collect their secondary (MS/MS) mass spectrometry signals. This allows high-sensitivity, high signal-to-noise qualitative and quantitative detection of the target peptide, largely free of interference from other high-abundance ions.
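
The m/z calculation and inclusion-list assembly can be sketched in Python as follows. The sketch uses standard monoisotopic residue masses for unmodified peptides and assumes the predicted retention time is supplied by an external tool; the row layout, function names, default charge states, and the 2-minute half-window are illustrative assumptions rather than a vendor file format.

```python
from typing import Dict, List, Sequence

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # H2O added for the peptide termini
PROTON = 1.007276  # mass of one charge carrier


def precursor_mz(peptide: str, charge: int) -> float:
    """Theoretical m/z of an unmodified peptide: (M + z * proton) / z."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    try:
        neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]}") from None
    return (neutral + charge * PROTON) / charge


def inclusion_rows(
    peptide: str,
    predicted_rt_min: float,            # from an external RT predictor
    charges: Sequence[int] = (1, 2, 3),
    rt_half_window_min: float = 2.0,    # illustrative window half-width
) -> List[dict]:
    """One inclusion-list row per charge state, with an RT window."""
    return [
        {
            "peptide": peptide,
            "charge": z,
            "mz": round(precursor_mz(peptide, z), 4),
            "rt_start_min": max(0.0, predicted_rt_min - rt_half_window_min),
            "rt_end_min": predicted_rt_min + rt_half_window_min,
        }
        for z in charges
    ]
```

Variable modifications such as methionine oxidation would add their mass shift to the neutral mass before the division by charge; they are omitted here to keep the sketch short.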

[0053] High-resolution mass spectrometry is used to acquire high-quality MS/MS spectra of immunopeptide-enriched samples (i.e., tumor samples) in PRM mode under preset gradient and collision energy conditions.

[0054] (3) Evaluation and verification of spectral similarity. Fragment ion peak list extraction and signal intensity normalization are performed on three kinds of spectra: the MS/MS spectra from the discovery phase (which includes the deep learning de novo sequencing screening, database search, and rescoring steps), the PRM spectra (acquired during the targeted validation phase), and the MS/MS spectra of synthetic peptides (reference secondary mass spectra of standard peptides synthesized in vitro). Similarity indices between spectra, such as spectral angle and Pearson correlation coefficient, are calculated. The similarity index results of this embodiment are shown in Figure 4.

[0055] When the similarity between the discovery-phase spectrum and the synthetic peptide spectrum, and the similarity between the PRM spectrum and the synthetic peptide spectrum, are both greater than preset thresholds (e.g., spectral angle > 0.7, correlation coefficient > 0.9), the corresponding target peptide is identified as a transposon-derived tumor neoantigen verified by PRM. Furthermore, this step summarizes the verified transposon-derived tumor neoantigens by transposon class, family, and subfamily, and outputs the number of transposon-derived tumor-specific neoantigens contributed by different transposon classes in each sample, their HLA allele restriction, and their reproducibility across multiple samples.
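
The similarity test can be sketched in Python as below. The exact formula for the spectral angle is not given in the text; the sketch assumes the commonly used normalized spectral angle, 1 - 2·arccos(cosine similarity)/π, computed on fragment intensities aligned by ion annotation (for example "b2", "y5"). The function names and the dictionary representation of a spectrum are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Spectrum = Dict[str, float]  # fragment ion label -> normalized intensity


def _aligned(a: Spectrum, b: Spectrum) -> Tuple[List[float], List[float]]:
    """Align two spectra on the union of ion labels, filling gaps with 0."""
    keys = sorted(set(a) | set(b))
    return [a.get(k, 0.0) for k in keys], [b.get(k, 0.0) for k in keys]


def spectral_angle(a: Spectrum, b: Spectrum) -> float:
    """Normalized spectral angle in [0, 1]; 1 means identical direction."""
    x, y = _aligned(a, b)
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    cosine = sum(p * q for p, q in zip(x, y)) / (norm_x * norm_y)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def pearson(a: Spectrum, b: Spectrum) -> float:
    """Pearson correlation of the aligned intensity vectors."""
    x, y = _aligned(a, b)
    n = len(x)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def is_validated(discovery: Spectrum, prm: Spectrum, synthetic: Spectrum,
                 min_angle: float = 0.7, min_corr: float = 0.9) -> bool:
    """Both the discovery and PRM spectra must match the synthetic reference."""
    return all(
        spectral_angle(s, synthetic) > min_angle and pearson(s, synthetic) > min_corr
        for s in (discovery, prm)
    )
```

Both measures are insensitive to overall intensity scale, so spectra acquired at different signal levels can be compared directly once the fragment ions are matched.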

[0056] The identification and verification methods of Embodiments 1 and 2 described above can be deployed on servers or workstations equipped with multi-core CPUs and/or GPUs, comprising memory, a processor, and a computer program stored in the memory and executable on the processor.

[0057] The computer program can encapsulate each module using a workflow framework: the input layer receives the raw HLA immunopeptidome mass spectrometry data and multi-omics data (including transcriptome and/or genome sequencing data matched with the sample); the intermediate layer includes modules for construction and sample-specific simplification of the genome-wide transposon six-frame translation protein database, high-confidence sequence tag generation, multi-engine database search and grouped FDR control, HLA binding affinity prediction, and transposon-derived tumor-specific neoantigen screening; the output layer generates the PRM inclusion list and the transposon-derived tumor-specific neoantigen result report.

[0058] Those skilled in the art can choose containerization (e.g., Docker) or virtual environment management to encapsulate the dependent software according to actual needs, thereby making the process automated and reproducible.

[0059] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0060] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

[0061] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0062] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0063] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0064] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0065] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics, characterized in that it comprises the following steps: acquiring HLA immunopeptidome mass spectrometry data from tumor tissues or tumor cell lines, and constructing a genome-wide transposon six-frame translation protein database based on a reference genome and transposon annotations; using a deep learning de novo parsing model to predict peptide sequences from the HLA immunopeptidome mass spectrometry data to obtain high-confidence sequence tags, and using the high-confidence sequence tags to perform sample-specific simplification of the genome-wide transposon six-frame translation protein database to obtain a simplified transposon-specific protein database; merging the simplified transposon-specific protein database with a human standard protein database, searching the HLA immunopeptidome mass spectrometry data against the merged database, and performing grouped FDR control and affinity screening to obtain candidate transposon-derived HLA peptides; and screening the candidate transposon-derived HLA peptides to obtain transposon-derived tumor-specific neoantigens.

2. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 1, characterized in that the steps for constructing the genome-wide transposon six-frame translation protein database include: extracting the genomic sequence of each transposon element based on the preset reference genome and transposon annotations; translating the genomic sequence of each transposon element in six frames to obtain translation products; and splitting the translation products at stop codons and discarding sequence fragments shorter than a preset threshold to obtain the genome-wide transposon six-frame translation protein database, wherein the genome-wide transposon six-frame translation protein database contains a classification index constructed according to transposon class, family, and subfamily.

3. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 2, characterized in that the steps for obtaining the simplified transposon-specific protein database include: based on the classification index, comparing the high-confidence sequence tags with the genome-wide transposon six-frame translation protein database by short-peptide homology alignment, and retaining at least one transposon protein sequence that matches the high-confidence sequence tags to construct the simplified transposon-specific protein database.

4. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 1, characterized in that the deep learning de novo parsing model comprises at least two models, and when two deep learning de novo parsing models are used, the step of obtaining high-confidence sequence tags includes: converting each set of HLA immunopeptidome mass spectrometry data into a standard format; predicting peptide sequences from each set of standard-format HLA immunopeptidome mass spectrometry data using a first deep learning de novo parsing model to obtain a first predicted sequence, the first deep learning de novo parsing model adopting a sequence-to-sequence Transformer neural network structure; predicting peptide sequences from each set of standard-format HLA immunopeptidome mass spectrometry data using a second deep learning de novo parsing model to obtain a second predicted sequence, the second deep learning de novo parsing model adopting a convolutional neural network structure and/or a Transformer neural network structure; and performing confidence-weighted fusion or voting selection on the first and second predicted sequences of the same HLA immunopeptidome mass spectrometry data, and applying a sliding window to the first and second predicted sequences to select sequence fragments with an average residue confidence not less than a first threshold and a length not less than a second threshold as high-confidence sequence tags.

5. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 1, characterized in that the steps for obtaining candidate transposon-derived HLA peptides include: merging the simplified transposon-specific protein database and the human standard protein database to form a joint database; searching the HLA immunopeptidome mass spectrometry data against the joint database with at least two search engines under non-specific enzyme digestion conditions to obtain the peptide-spectrum matching results output by each search engine, wherein the parameters set for the search include peptide length, fragment mass error tolerance, and variable modifications; introducing reversed or scrambled decoy sequences for the peptide-spectrum matching results output by each search engine; re-scoring the peptide-spectrum matching results output by each search engine; grouping the peptide-spectrum matching results into those from the simplified transposon-specific protein database and those from the human standard protein database to obtain a transposon group and a standard group; for the transposon group and the standard group, based on the target-decoy strategy, extracting a discrimination score from each peptide-spectrum matching result of each group and constructing the score distributions of the two groups; based on the score distributions, setting a sliding score threshold, counting the number of peptide-spectrum matching results of the two groups above each score threshold, calculating the ratio of the two counts, and using the ratio as the false discovery rate estimate at that score threshold, forming a list of score thresholds and false discovery rate estimates; obtaining from the list the highest score threshold whose false discovery rate estimate is no greater than a preset threshold and using it as the score cutoff value; at both the PSM level and the peptide level, obtaining all peptide-spectrum matching results in the two groups that meet the score cutoff value and using them as the reliable set of transposon-derived presented peptides and the set of standard peptides, respectively; and evaluating HLA binding affinity on the set of transposon-derived presented peptides and screening those with HLA binding ability as candidate transposon-derived HLA peptides.

6. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 5, characterized in that the steps for obtaining the candidate transposon-derived HLA peptides include: obtaining a list of HLA-I alleles of the tumor tissue or tumor cell line, determined using an RNA-seq-based HLA typing tool or from external information; for each transposon-derived presented peptide, predicting the binding affinity with all HLA-I alleles in the HLA-I allele list and obtaining a binding score expressed as a percentile rank value; selecting the smallest of all percentile rank values as the optimal allele rank value; and identifying transposon-derived presented peptides whose optimal allele rank value is less than a preset threshold as HLA-binding peptides, and those that meet a stricter threshold as strong-binding peptides, wherein both the HLA-binding peptides and the strong-binding peptides are candidate transposon-derived HLA peptides.

7. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 1, characterized in that the steps for obtaining the transposon-derived tumor-specific neoantigens include: obtaining transcriptome data from tumor tissue and paired normal tissue; constructing an expression matrix of transposons or transposon-gene chimeric transcripts based on the transcriptome data; using standard differential analysis to identify transposon-related transcripts that are significantly upregulated in tumor tissue but expressed below a threshold in paired normal tissue and in public healthy-tissue databases; labeling the candidate transposon-derived HLA peptides of the transposon-related transcripts or of their respective subfamilies as candidate transposon-derived tumor-specific neoantigens with expression support; acquiring Ribo-seq data from various normal tissues, identifying open reading frames with stable translation evidence using ORF prediction and reading-frame statistics algorithms, and performing a first elimination of candidate transposon-derived tumor-specific neoantigens that perfectly match the open reading frames; and performing a further elimination of the candidate transposon-derived tumor-specific neoantigens based on a thymic MHC-I presented peptide database to obtain transposon-derived tumor-specific neoantigens that avoid central immune tolerance.

8. The method for recognizing transposon-derived tumor neoantigens based on deep learning-guided proteomics according to claim 1, characterized in that it further includes using drug-induced differential information to screen drug-induced transposon-derived tumor-specific neoantigens from the candidate transposon-derived HLA peptides, the specific screening steps including: for tumor cell lines, identifying candidate transposon-derived HLA peptides before and after treatment with tumor drugs; and, based on a comparison of the presence and abundance changes of the candidate transposon-derived HLA peptides before and after tumor drug treatment, obtaining as drug-induced transposon-derived tumor-specific neoantigens those candidate transposon-derived HLA peptides that appear only after tumor drug treatment or whose abundance increases significantly, and that meet the screening criteria of normal tissue translation evidence and thymus-related negative screening.

9. A method for experimental verification based on targeted mass spectrometry, characterized in that the transposon-derived tumor-specific neoantigens obtained by the deep learning-guided proteomics-based transposon-derived tumor neoantigen recognition method according to any one of claims 1-8 are subjected to targeted mass spectrometry verification, and the verification results are output.

10. The method for experimental verification based on targeted mass spectrometry according to claim 9, characterized in that the steps for targeted mass spectrometry verification include: sorting the candidate transposon-derived HLA peptides of the transposon-derived tumor-specific neoantigens according to set indicators, and automatically selecting a preset number of candidate transposon-derived HLA peptides as target peptides for priority verification, the indicators including matching score, number of supporting spectra, HLA binding strength, and sample reproducibility; calculating the theoretical mass-to-charge ratio, charge state, and predicted chromatographic retention time based on the amino acid sequence of the target peptide; generating a PRM inclusion list based on the theoretical mass-to-charge ratio, charge state, and predicted chromatographic retention time; sending the PRM inclusion list to the mass spectrometer to instruct the mass spectrometer to preferentially isolate and fragment the precursor ions of the target peptide that fall within a preset time window, and to collect their secondary mass spectrometry signals for high-sensitivity, high signal-to-noise qualitative and quantitative detection of the target peptide; performing fragment ion peak list extraction and signal intensity normalization on the discovery-phase MS/MS spectrum, the PRM spectrum, and the synthetic peptide MS/MS spectrum, and calculating similarity indices, wherein the discovery phase includes the prediction stage of the deep learning de novo parsing model, the database search stage, and the re-scoring stage, and the discovery-phase MS/MS spectrum is the experimental secondary mass spectrum corresponding to the candidate transposon-derived HLA peptide; when the similarity index between the discovery-phase MS/MS spectrum and the synthetic peptide MS/MS spectrum, and the similarity index between the PRM spectrum and the synthetic peptide MS/MS spectrum, are both greater than a preset threshold, identifying the corresponding target peptide as a transposon-derived tumor-specific neoantigen verified by PRM; and statistically summarizing the transposon-derived tumor-specific neoantigens verified by PRM according to transposon class, family, and subfamily to obtain the number of transposon-derived tumor-specific neoantigens contributed by different transposon classes in each tumor tissue or tumor cell line, their HLA allele restriction, and their reproducibility among multiple samples.