Method for predicting pathogenicity of gene variations based on perturbed protein language models
By combining high-throughput experimental data with protein language models, a gene-specific variation effect prediction method was constructed, which solved the problems of insufficient accuracy and interpretability of existing models in predicting the pathogenicity of gene variations, and realized efficient and reliable prediction and clinical application of gene variation pathogenicity.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-12
AI Technical Summary
Existing protein language models lack gene specificity when predicting the pathogenicity of gene variations, resulting in insufficient accuracy and a lack of interpretability, making it difficult to scale up across the genome.
By combining high-throughput experimental data with protein language models, and by fine-tuning model parameters and introducing experimental perturbation data and clinical annotation databases, a gene-specific variation effect prediction method is constructed, which outputs pathogenicity classification results and confidence scores, and candidate mutations are verified through cell experiments.
It significantly improves the accuracy and interpretability of predicting the pathogenicity of gene variants, can identify unknown variants and provide reliable clinical decision support, and has good generalization and application value.
Figure CN122201432A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of bioinformatics, and in particular to a method for predicting the pathogenicity of gene variations based on perturbation protein language models.
Background Technology
[0002] Accurately predicting the functional impact of gene variants is crucial for understanding human diseases and advancing precision medicine. This is especially true for cancer-related genes, such as the key tumor suppressor gene TP53, where accurate classification of the pathogenicity of missense variants is vital for disease diagnosis, treatment decisions, and risk assessment. However, many missense mutations are still classified as "variants of uncertain significance" (VUS), and their functional impact is difficult to determine quickly and accurately using traditional methods.
[0003] Existing methods for predicting variation effects mainly fall into two categories. The first category is based on high-throughput experimental assays, such as deep mutational scanning and Perturb-seq, which can directly measure the functional impact of thousands of mutations and thus provide causal evidence. However, these methods are costly and time-consuming, making them difficult to scale up across the genome. The second category is based on computational prediction, where protein language models, pre-trained on large-scale evolutionary sequence data, can capture the structural and functional constraints of proteins, enabling unsupervised evaluation of variation effects. Compared to traditional supervised learning methods that rely on clinical annotation databases, protein language models exhibit better generalization ability.
[0004] However, existing protein language models still have significant limitations in predicting mutation effects. First, general models lack gene-level specificity, making it difficult to capture the unique functional constraints and mutation patterns of specific genes. Second, most models rely on evolutionary conservation or clinical associations as training signals rather than direct experimental functional evidence, leading to discrepancies between predicted results and actual biological functions. For example, while existing models such as AlphaMissense and PrimateAI-3D have made progress in overall performance, their training data mainly comes from clinical databases and evolutionary analyses, lacking direct measurement of the functional impact of mutations. Furthermore, existing models typically output classification results in a "black box" manner, lacking interpretability of the predictive basis, which limits their credibility and application depth in clinical practice and research.
[0005] Therefore, how to combine high-throughput experimental data with protein language models to construct a variation effect prediction method with gene specificity, high accuracy and interpretability remains a technical problem that urgently needs to be solved in this field.
[0006] Therefore, this invention is proposed.
Summary of the Invention
[0007] To address the aforementioned technical problems, this invention provides a method for predicting the pathogenicity of gene variations based on a perturbation protein language model. By combining high-throughput experimental data with a protein language model, a variation effect prediction method with gene specificity, high accuracy, and interpretability is constructed.
[0008] In order to achieve the objective of this invention, the following technical solution is adopted: This invention provides a method for predicting the pathogenicity of gene variants based on a perturbation protein language model, comprising the following steps:
S1. Obtain the mutation dataset of the target gene, wherein the mutation dataset includes multiple mutation samples and their corresponding pathogenicity labels, and the pathogenicity labels are derived from experimental perturbation data and clinical annotation databases;
S2. Fine-tune the protein language model using the mutation dataset, wherein the fine-tuning includes updating the parameters of the last layer of the protein language model and the parameters of the added classification head;
S3. Input the sequence of the mutation to be tested in the target gene into the fine-tuned model, and output the pathogenicity classification result and confidence score of the mutation sequence to be tested.
[0009] Furthermore, the experimental perturbation data includes deep mutation scan data, and the clinical annotation data includes the ClinVar database; The deep mutation scanning data refers to data obtained by CRISPR or base editing technology that reflects the impact of mutations on protein binding affinity or functional activity.
[0010] Furthermore, in S2, the update specifically includes: using a binary cross-entropy loss function for parameter optimization, with class weighting applied to handle the imbalance between categories in the dataset. In the weighted binary cross-entropy loss function, the weight of each category is inversely proportional to the number of samples of the corresponding category in the mutation dataset.
[0011] Furthermore, in step S3, while outputting the pathogenicity classification result of the mutation sequence to be tested, the mutant embedding vector and the wild-type embedding vector of the target gene are extracted from the fine-tuned model, and the Euclidean distance between them is calculated as a measure of the structural perturbation caused by the mutation to be tested in the target gene.
[0012] Furthermore, the method also includes screening candidate mutations from the mutations to be tested in the target gene based on pathogenicity classification results and confidence scores, and performing functional verification of the candidate mutations through cell experiments; Candidate mutations must meet any of the following conditions: they are classified as pathogenic mutations but are labeled as variants of uncertain significance in known databases, they are not covered by experimental perturbation data, or they are inconsistent with the prediction results of the reference model.
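The three screening conditions can be illustrated with a minimal Python sketch. The record fields, the `"VUS"` label string, and the ordering of candidates by distance of the confidence score from 0.5 are assumptions made for illustration; the text only states that a candidate must satisfy at least one of the three conditions.

```python
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    variant: str                          # hypothetical variant identifier
    predicted_pathogenic: bool            # classification result of the fine-tuned model
    confidence: float                     # sigmoid score in (0, 1)
    clinvar_label: Optional[str]          # e.g. "VUS", "benign"; None if unlisted
    in_perturbation_data: bool            # covered by experimental perturbation data
    reference_pathogenic: Optional[bool]  # reference-model call; None if unavailable


def is_candidate(p: Prediction) -> bool:
    """True if the prediction meets any one of the three screening conditions."""
    vus_called_pathogenic = p.predicted_pathogenic and p.clinvar_label == "VUS"
    not_covered = not p.in_perturbation_data
    disagrees = (p.reference_pathogenic is not None
                 and p.reference_pathogenic != p.predicted_pathogenic)
    return vus_called_pathogenic or not_covered or disagrees


def screen_candidates(predictions: List[Prediction]) -> List[Prediction]:
    """Keep candidates, most confident first (this ordering is an assumption)."""
    candidates = [p for p in predictions if is_candidate(p)]
    return sorted(candidates, key=lambda p: abs(p.confidence - 0.5), reverse=True)
```

A variant that is covered by the perturbation data, is not a VUS, and agrees with the reference model is excluded from the candidate list.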
[0013] Furthermore, the target gene is any one of P53, VHL, ATM, BRCA1, RAD51C, or BAP1.
[0014] Furthermore, the target gene is P53.
[0015] Furthermore, the confidence score is obtained by applying the sigmoid function to the output logit value of the classification head.
[0016] Furthermore, the protein language model is an ESMC model.
[0017] Furthermore, the input format for the mutation to be tested in the target gene is a sequence window containing the mutation site and 25 amino acid residues upstream and downstream of it.
[0018] Furthermore, the cell experiments include: detecting the growth advantage of cells carrying candidate mutations relative to wild-type cells under drug stress conditions.
[0019] Furthermore, the growth advantage was assessed by detecting changes in the abundance of mutant cells before and after drug treatment.
[0020] This invention also provides a prediction system for a method of predicting the pathogenicity of gene variants based on a perturbation protein language model, comprising: The data acquisition module is used to acquire the mutation dataset of the target gene, which includes multiple mutation samples and their corresponding pathogenicity labels; the model fine-tuning module is used to fine-tune the basic model based on the protein language model on the mutation dataset; the prediction module is used to input the test mutation sequence of the target gene into the fine-tuned model and output the pathogenicity classification result and confidence score of the test mutation sequence.
[0021] This invention also provides the application of the above-mentioned method for predicting the pathogenicity of gene variants based on perturbation protein language models in constructing treatment decision-making or prognostic assessment models for gene loss-of-function cancers.
[0022] The present invention has the following technical effects: (1) Significantly improved prediction accuracy: By integrating deep mutation scanning experimental data and clinical annotation data, and by fine-tuning the protein language model specifically, it consistently outperforms the most advanced general prediction model in key indicators such as accuracy, precision, F1 score and Matthews correlation coefficient.
[0023] (2) Achieving efficient identification of unknown variants: This protocol successfully identified 7 novel functional mutations and 5 novel non-functional mutations, including 3 variants marked as "of uncertain significance" in the ClinVar database, providing direct experimental evidence for the clinical reclassification of these variants. Among the 9 controversial mutations that were inconsistent with existing model predictions, this protocol correctly predicted 6, with an accuracy rate of 66.7%, fully demonstrating its advantages in predicting challenging variants.
[0024] (3) Enhanced interpretability and reliability of predictions: The method not only outputs binary classifications but also provides confidence scores based on logistic transformations and introduces the Euclidean distance (a measure of structural perturbation) between wild-type and mutant sequence embedding vectors as a quantitative indicator. These outputs make the prediction results more interpretable, and the model gives higher confidence for correct predictions.
[0025] (4) It has good generalization and application value: The framework has been successfully extended to five other cancer-related genes, including VHL and ATM, and still maintains its performance advantage on most genes, demonstrating its potential as a general gene-specific prediction framework. In addition, this method can reclassify variants with unclear clinical significance, providing a direct decision support tool for precision medicine.
[0026] This invention integrates causal experimental data with protein language models, achieving a gene-specific variation effect prediction that is significantly superior to existing methods in terms of accuracy, interpretability, and practicality.
Attached Figure Description
[0027] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0028] Figure 1: A method for predicting the pathogenicity of gene variants based on perturbation protein language models. a) Development and application process of the model; b) the TP53-specific VEP model, which enhances the existing PLM by adding a classification head to predict the pathogenicity of variant sequences.
Figure 2: The TP53-specific VEP model improves the accuracy of pathogenicity prediction. a) Benchmark performance of eight protein language models using the training dataset, based on the area under the precision-recall curve (AUPR). b) TP53-specific fine-tuning significantly improves the performance of all ESM variants, especially smaller models such as ESM2_35M and ESM2_8M, whose performance improves particularly rapidly. c) Comparison of performance metrics (accuracy, precision, F1 score, MCC) between the AM model and the TP53-specific model; the TP53-specific model generally outperforms the AM model. The values above the CaVepP53 bars represent the percentage improvement relative to AM. d) The model assigns higher confidence scores to correctly predicted benign and pathogenic mutations than to incorrect predictions; the groups are correctly predicted benign mutations (CPBM), incorrectly predicted benign mutations (IPBM), correctly predicted pathogenic mutations (CPPM), and incorrectly predicted pathogenic mutations (IPPM). e) Distribution of prediction scores for cancer driver and non-driver mutations in the three models (total sample size n=503), with box plots showing the median, interquartile range (IQR), and whiskers (1.5×IQR); a two-sided Mann-Whitney U test was used to compare the driver and non-driver mutation groups, with sample sizes of n=253 for the non-driver group and n=250 for the driver group; ***p<0.001. f) Distribution of experimentally measured functional scores for each pathogenicity class predicted by the models on the William independent missense mutation set (total sample size n=3,945); CaVepP53 binary classification sample sizes: n=3,362 (benign) and n=687 (pathogenic); AlphaMissense sample sizes: n=475 (ambiguous), n=2,531 (benign) and n=1,024 (pathogenic); PrimateAI-3D sample sizes: n=943 (pathogenic) and n=317 (benign); significance was assessed using a two-sided Mann-Whitney U test (*p<0.05, **p<0.01, ***p<0.001). g) Stacked bar chart showing the proportion of driver (top) and non-driver (bottom) variants in each model prediction category; the bars show absolute counts (total n=503); CaVepP53 and PrimateAI-3D used binary classification (benign / pathogenic); AlphaMissense used ternary classification (benign / pathogenic / uncertain).
Figure 3: Saturation mutation prediction of p53. a) CaVepP53 pathogenicity score heatmap of all possible amino acid substitutions in the TP53 protein sequence; b) comparative analysis of predicted pathogenic variants in DBD and non-DBD regions; c) distribution of TP53-specific VEP model scores across all ClinVar labels (including the VUS classification); d) prediction of experimentally validated single amino acid deletion variants, with each point representing a variant. Benign variants (n=32) are shown in blue, and pathogenic variants (n=125) are shown in red. Statistical significance was assessed using a two-sided Mann-Whitney U test. ***p<0.001.
Figure 4: Functional validation of the predicted TP53 variants. a) Flowchart: a pool of mixed cells containing TP53 primer-edited mutants (red) and wild-type cells (gray) is treated with DMSO or 10 μM... Nutlin-3a for 5 days; functional mutants exhibit a growth advantage under Nutlin-3a selection, while non-functional mutants do not. b) Editing efficiency of TP53 mutants: 3 functional mutants predicted by AM and identified by high-throughput screening; solid bars: DMSO control; striped bars: Nutlin-3a treatment. c) Editing efficiency of newly identified functional / non-functional mutants; solid bars: DMSO control; striped bars: Nutlin-3a treatment. Panels b and c show mean ± standard deviation (n=2 biological replicates). A two-tailed unpaired Student's t-test was used to compare the two groups. ns, no statistical significance; *p<0.05; **p<0.01; ***p<0.001. d) Model prediction and experimental validation of pathogenic variants: solid circles represent variants that have been experimentally confirmed as pathogenic. Experimental results are shown in gray, AM model predictions in blue, and CaVepP53 predictions in variant-specific colors consistent with panels b and c. Hollow circles indicate variants where CaVepP53 predictions do not match experimental results. Of the eight pathogenic variants predicted by CaVepP53, seven were experimentally validated, while the AM model only predicted three. e) Morphology and H&E staining of liver tumors induced by MYC and Trp53 mutations, with tumor boundaries marked by dashed lines and tumor regions marked by asterisks. f) Reclassification of ClinVar VUS based on model predictions and experimental validation. The panel shows five variants initially labeled as "variants of uncertain significance" (VUS) in ClinVar and Ensembl. All five variants were experimentally validated as functionally pathogenic, with CaVepP53 correctly predicting four and the AM model correctly predicting three.
Detailed Implementation
[0029] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0030] The following is a detailed explanation using specific embodiments:
Example
The model development process is shown in Figure 1. The model is named CaVepP53.
[0031] 1. Dataset Preparation
This embodiment first obtains the mutation dataset of the target gene for subsequent model fine-tuning. The mutation dataset includes multiple mutation samples and their corresponding pathogenicity labels, which are derived from experimental perturbation data and clinical annotation databases.
[0032] Most existing VEP models rely heavily on ClinVar as a source of labeled data. However, the number of mutations in ClinVar is limited, and its labels are primarily derived from correlation-based clinical associations, lacking direct evidence of functional impact. Previous studies have recognized this limitation and attempted to correct for data leakage by using evolution-derived datasets during pre-training. However, these datasets fall far short of experimentally validated functional relevance. To improve model performance and ensure causal validity, the training dataset was expanded by integrating ClinVar with the deep mutational scanning dataset (hereinafter referred to as the Experimental Perturbation Data) published by Stiewe et al. In the Experimental Perturbation Data, mutations are assessed through base editing techniques, and changes in binding affinity are quantified to provide experimental evidence of functional relevance. This integration shifts variant classification from correlation to causality, expanding the gold standard dataset to include 3,522 single amino acid polymorphisms (SAPs), of which 1,976 were labeled as pathogenic and 1,546 as benign. Each residue can have up to 19 amino acid substitutions. Including all possible substitutions in the training data does not necessarily improve model performance; in some cases, it introduces noise and reduces prediction accuracy.
[0033] To determine the most suitable baseline model for the TP53 VEP, several state-of-the-art models were first benchmarked on experimental perturbation data. The AM (AlphaMissense) model performed best (Figure 2a). However, since the source code for the AM model was not available and full model pre-training required a large amount of computational resources, a PLM from the ESM series was used instead for further development.
[0034] Furthermore, the ESM models support efficient transfer learning and allow partial fine-tuning without updating all model parameters, which significantly reduces computational requirements. Comprehensive testing revealed that fine-tuning only the last Transformer block of ESMC is sufficient to achieve good predictive performance (Figure 2b). Model performance also improves with increasing parameter count, with ESMC outperforming ESM-2 and ESM-1b (Figure 2b). These results are consistent with previously reported benchmark results for 3D structure prediction. Even models with relatively few parameters showed rapid performance improvement after fine-tuning on TP53-specific data (Figure 2c). Among them, ESMC_600M showed the highest correlation with experimental measurements, and its performance was comparable to AM.
[0035] Considering the importance of local sequence context in modeling, the input window size was further optimized. Five-fold cross-validation showed that including 25 amino acids upstream and downstream of the mutation site yielded optimal performance. Furthermore, fine-tuning only the last Transformer block achieved better results than other parameter freezing strategies. Based on these findings, the final CaVepP53 model was implemented using ESMC_600M, fine-tuned on TP53-specific data, with an input window size of 51 amino acid residues (the mutation site plus 25 residues on each side), and only the parameters of the last Transformer block were updated.
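As a minimal sketch of the 51-residue input window (mutation site plus 25 residues on each side), the following Python function extracts matching wild-type and mutant windows. The function name, the 1-based position convention, and the truncation of the window at the protein termini are assumptions; the text does not state how sites near the ends of the sequence are handled.

```python
from typing import Tuple


def mutation_window(wild_type: str, position: int, alt: str,
                    flank: int = 25) -> Tuple[str, str]:
    """Return (wild-type window, mutant window) around a 1-based position.

    The window spans `flank` residues on each side of the mutation site,
    i.e. 2 * flank + 1 = 51 residues by default. Near the termini the
    window is simply truncated (an assumption, not stated in the text).
    """
    if not 1 <= position <= len(wild_type):
        raise ValueError("position is outside the protein sequence")
    if len(alt) != 1:
        raise ValueError("alt must be a single amino acid letter")
    idx = position - 1
    mutant = wild_type[:idx] + alt + wild_type[idx + 1:]
    start = max(0, idx - flank)
    end = min(len(wild_type), idx + flank + 1)
    return wild_type[start:end], mutant[start:end]


# Toy 100-residue sequence for illustration only (not a real protein).
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 5
wt_window, mt_window = mutation_window(toy_sequence, position=50, alt="A")
```

For an interior site, both windows have 51 residues and differ only at the central position (index 25).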
[0036] Specifically, taking the TP53 gene as an example, the protein language model was fine-tuned using variant annotation datasets from publicly available experimental data and a clinical annotation database (ClinVar in this embodiment). The experimental perturbation data originated from a study on saturation mutations in the primary DNA binding domain of TP53 (J. S. Funk, M. Klimovich, D. Drangenstein, et al. "Deep CRISPR mutagenesis characterizes the functional diversity of TP53 mutations," Nat Genet 57, no. 1 (2025): 140-153). This data quantified the impact of mutations on protein binding affinity using CRISPR base editing technology. The processed dataset was designated as the experimental perturbation data.
[0037] Experimental data were labeled for pathogenicity based on the relative functional score (RFS) provided in the study, with variants with RFS > 0 classified as pathogenic and variants with RFS ≤ 0 classified as benign. For the clinical annotation database ClinVar, variants annotated as "benign," "likely benign," or "benign / likely benign" were classified as benign, while variants annotated as "pathogenic," "likely pathogenic," or "pathogenic / likely pathogenic" were classified as pathogenic. The experimental perturbation data contributed 3,419 single amino acid polymorphisms (SAPs), of which 1,955 were classified as pathogenic and 1,464 as benign; the ClinVar dataset provided 327 SAPs, of which 203 were identified as pathogenic and 124 as benign.
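The two labeling rules can be written as a short Python sketch. The function names and the normalization of the ClinVar strings (lower-casing, removing spaces around the slash) are assumptions; annotations outside the six listed categories return `None` here, on the assumption that they are not used as training labels.

```python
from typing import Optional

# ClinVar categories named in the text, normalized to lower case without
# spaces around the slash.
CLINVAR_BENIGN = {"benign", "likely benign", "benign/likely benign"}
CLINVAR_PATHOGENIC = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}


def label_from_rfs(rfs: float) -> int:
    """Experimental label: 1 (pathogenic) if RFS > 0, else 0 (benign)."""
    return 1 if rfs > 0 else 0


def label_from_clinvar(significance: str) -> Optional[int]:
    """ClinVar label: 1 pathogenic, 0 benign, None for any other annotation."""
    key = "/".join(part.strip() for part in significance.lower().split("/"))
    if key in CLINVAR_PATHOGENIC:
        return 1
    if key in CLINVAR_BENIGN:
        return 0
    return None
```

Note that RFS = 0 falls on the benign side of the threshold, matching the "RFS ≤ 0" rule.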
[0038] The variant data from these datasets were merged, and duplicates and conflicting data were removed to ensure data integrity and consistency. These datasets contain a total of 3,522 SAPs, of which 1,976 were labeled as pathogenic and 1,546 as benign.
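A minimal sketch of the merge step is given below. It assumes that a variant appearing with the same label in both sources is kept once, and that a variant with contradictory labels is dropped entirely; the text says only that duplicates and conflicting data were removed, so the exact conflict policy is an assumption, as are the variant identifiers.

```python
from typing import Dict, Iterable, Tuple

LabeledVariant = Tuple[str, int]  # (variant identifier, label: 1 pathogenic / 0 benign)


def merge_labels(*sources: Iterable[LabeledVariant]) -> Dict[str, int]:
    """Merge labeled variants, collapsing duplicates and dropping conflicts."""
    merged: Dict[str, int] = {}
    conflicts = set()
    for source in sources:
        for variant, label in source:
            if variant in merged and merged[variant] != label:
                conflicts.add(variant)      # same variant, contradictory labels
            merged.setdefault(variant, label)
    return {v: lab for v, lab in merged.items() if v not in conflicts}


# Hypothetical toy inputs for illustration only.
perturbation = [("V1", 1), ("V2", 0), ("V3", 1)]
clinvar = [("V2", 0), ("V3", 0), ("V4", 1)]
dataset = merge_labels(perturbation, clinvar)
```

In this toy example V2 is a consistent duplicate and is kept once, while V3 carries conflicting labels and is removed.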
[0039] To enhance the model's generality, this embodiment extends the framework to five other cancer-related genes: VHL, ATM, BRCA1, RAD51C, and BAP1. For this purpose, experimental deep mutational scanning data for these genes were acquired, with datasets derived from previously published saturation mutagenesis studies. For each gene, a pathogenicity label was assigned based on a threshold defined in the original study to ensure consistency with the biological context of each experiment (Figure 1b).
[0040] 2. Model fine-tuning and optimization
The protein language model is fine-tuned using a mutation dataset, wherein the fine-tuning includes updating the parameters of the last Transformer block of the protein language model and of the added classification head.
[0041] Specifically, ESMC_600M was chosen as the base protein language model (PLM). During model training, pathogenic variant sequences were labeled as 1, and benign variant sequences were labeled as 0. To fine-tune the pre-trained ESMC model, a fully connected layer was added to its architecture as a classification head, responsible for outputting the pathogenicity prediction for each variant.
[0042] The fine-tuning process optimizes model parameters by minimizing the cross-entropy loss function. To address the imbalance between the number of pathogenic and benign variant samples in the dataset, this embodiment employs a weighted binary cross-entropy loss function for parameter optimization. The binary cross-entropy loss function is denoted as $L_{BCE}$ and defined as:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$
where $N$ represents the total number of samples, $y_i$ is the true label of sample $i$ (1 indicates pathogenicity, 0 indicates benignity), and $p_i$ is the predicted probability of the corresponding sample.
[0043] To address the inherent class imbalance between pathogenic and benign variants, a weighted binary cross-entropy loss function was employed during training. The weighted loss function $L_{wBCE}$ is defined as follows:
$$L_{wBCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[w_1\, y_i \log(p_i) + w_0\,(1 - y_i)\log(1 - p_i)\right]$$
where $w_0$ and $w_1$ are the weighting coefficients of the negative (benign) and positive (pathogenic) categories, respectively, and their values are inversely proportional to the number of samples of the corresponding category in the mutation dataset: $w_c = N / (2 \cdot N_c)$, where $N$ represents the total number of training samples and $N_c$ represents the number of samples in category $c$. This weighting scheme reduces the model's bias towards the majority class, thereby enhancing the sensitivity to detect pathogenic variants.
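The weighting rule w_c = N / (2 · N_c) and the resulting loss can be sketched in plain Python as follows. The function names and the clipping constant `eps` are assumptions added for numerical safety; the class counts are the full-dataset totals reported in this description (1,546 benign, 1,976 pathogenic), whereas the counts within an individual training fold would differ slightly.

```python
import math
from typing import Sequence, Tuple


def class_weights(n_neg: int, n_pos: int) -> Tuple[float, float]:
    """Return (w0, w1) with w_c = N / (2 * N_c)."""
    n = n_neg + n_pos
    return n / (2 * n_neg), n / (2 * n_pos)


def weighted_bce(y_true: Sequence[int], p_pred: Sequence[float],
                 w0: float, w1: float, eps: float = 1e-12) -> float:
    """Mean weighted binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += w1 * y * math.log(p) + w0 * (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Full-dataset class counts from the description.
w0, w1 = class_weights(n_neg=1546, n_pos=1976)
```

With these counts the minority (benign) class receives a weight of about 1.139 and the majority (pathogenic) class about 0.891; setting both weights to 1 recovers the unweighted binary cross-entropy.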
[0044] To optimize model performance and avoid overfitting, k-fold cross-validation (k=5) was employed during training. Specifically, the dataset was randomly divided into 5 subsets. In each iteration, one subset was used as the validation set, and the remaining 4 subsets were used for training. This cross-validation method was repeated across multiple iterations to evaluate the model's generalization ability and ensure consistent performance across different data partitions.
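The 5-fold partitioning can be sketched as follows. This is a plain random split with a fixed seed; whether the original split was stratified by class is not stated, so the unstratified split, the seed value, and the function name are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]     # k nearly equal subsets
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# 3,522 samples, as in the merged TP53 dataset.
splits = list(k_fold_indices(3522, k=5))
```

Each sample appears in exactly one validation fold, and every fold trains on the remaining four subsets.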
[0045] Model tuning was performed based on performance metrics such as F1 score, area under the receiver operating characteristic curve (AUC-ROC), and precision. The F1 score is a balanced metric that combines precision and recall, and is particularly useful when dealing with imbalanced datasets. The F1 score is calculated as the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Precision (positive predictive value) and recall (sensitivity) are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where TP, FP, and FN represent true positive, false positive, and false negative, respectively.
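As a worked illustration of these definitions, the following sketch computes precision, recall, and F1 from binary labels (1 = pathogenic). The function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: TP = 2, FP = 1, FN = 1.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

In the toy example precision and recall are both 2/3, so their harmonic mean (F1) is also 2/3.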
[0046] The overall discriminative ability of the model is summarized by the AUC-ROC metric. The ROC curve plots the relationship between the true positive rate (recall) and the false positive rate (1-specificity) at all decision thresholds. The AUC-ROC value represents the probability that the model will rank a randomly selected positive example higher than a randomly selected negative example; a higher value indicates better classification performance.
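The ranking interpretation of AUC-ROC can be computed directly by comparing every positive example with every negative example. The sketch below does this in quadratic time for clarity; counting ties as half a win is the usual convention and is an assumption here, as is the function name.

```python
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.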
[0047] Cross-validation is used to select hyperparameters to optimize the balance between these metrics (precision, recall, F1 score, and AUC-ROC), thereby ensuring the robustness and accuracy of the model.
[0048] 3. Model Training
The model was fine-tuned using the AdamW optimizer with an initial learning rate of 4e-5 and a weight decay of 1e-5 to prevent overfitting. A linearly decaying learning rate scheduler was employed, with a warm-up phase covering the first 10% of the total training steps to allow the model to gradually adapt to the target task.
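The learning rate schedule can be sketched as a function of the training step. The sketch assumes that warm-up rises linearly from zero to the stated rate of 4e-5 over the first 10% of steps and that the rate then decays linearly to zero at the final step; the start and end values of the ramp are not given in the text, and the function name is hypothetical.

```python
def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float = 4e-5,
                           warmup_fraction: float = 0.1) -> float:
    """Learning rate at `step` for linear warm-up followed by linear decay."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # Warm-up: ramp from 0 up to peak_lr (assumed starting point).
        return peak_lr * step / warmup_steps
    # Decay: ramp from peak_lr down to 0 at total_steps (assumed end point).
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Example with a hypothetical run of 1,000 optimizer steps.
schedule = [linear_warmup_decay_lr(s, 1000) for s in range(1001)]
```

With 1,000 total steps the rate peaks at step 100 and is back to half the peak value at step 550.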
[0049] The model is trained for a maximum of 3 epochs. If the validation loss does not improve by at least 0.0001 over 5 consecutive evaluation steps, early termination is triggered. Training uses automatic mixed precision (FP16) to accelerate computation while maintaining numerical stability.
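The early-termination rule can be sketched as a small stateful helper. The class and function names are hypothetical; the sketch assumes that an evaluation counts as an improvement only if the validation loss falls at least 0.0001 below the best value seen so far, and that 5 consecutive non-improving evaluations trigger the stop.

```python
from typing import Optional, Sequence


class EarlyStopping:
    """Stop after `patience` consecutive evaluations without sufficient improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation loss; return True if training should stop."""
        if self.best - val_loss >= self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def stop_index(losses: Sequence[float]) -> Optional[int]:
    """Index of the evaluation at which stopping triggers, or None."""
    stopper = EarlyStopping()
    for i, loss in enumerate(losses):
        if stopper.update(loss):
            return i
    return None
```

For example, a decrease from 0.60 to 0.59995 is smaller than the 0.0001 threshold and therefore still counts towards the patience limit.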
[0050] To ensure the reproducibility of the results, 5-fold cross-validation with a fixed random seed was used to rigorously evaluate model performance. In each fold, the base ESMC model was first fine-tuned on the training set, followed by hyperparameter tuning and model selection on the validation set. After cross-validation, the optimal number of training steps was determined by averaging the best-performing training steps across all folds. This optimal number of training steps was then used to train the final model on the entire training dataset.
[0051] Table 1: Hyperparameter settings for the ESMC-based mutation classification model
4. Model Output and Pathogenicity Prediction
After fine-tuning, the ESMC model outputs a confidence score representing the predicted pathogenicity of each input mutation sequence. This confidence score is derived from the model's logit value for classifying a mutation as pathogenic (1) rather than benign (0). The logit is the raw output of the model's classification layer, representing the unscaled log odds that the mutation is pathogenic.
[0052] The model outputs a logit value for each input sequence, which is then transformed using the sigmoid function to obtain the pathogenicity probability P∈(0,1). Values close to 1 indicate a high confidence level in pathogenicity, while values close to 0 indicate benignity. In this embodiment, 0.5 is used as the classification threshold.
[0053] This confidence score provides an interpretable metric reflecting the model's certainty regarding the functional impact of the mutation. A higher logit value (leading to a higher confidence score) indicates stronger evidence supporting the pathogenicity of the mutation, while a lower logit value indicates weaker evidence. Therefore, this probabilistic output enables a detailed and quantifiable assessment of the pathogenic potential of a mutation, thus supporting more informed decision-making in interpreting gene variations.
[0054] To enhance the interpretability of the model's predictions, Euclidean distance is further introduced as a metric to measure the embedding similarity between wild-type and mutant sequences. Specifically, for each mutation, the model uses a window of 51 amino acids centered on the mutation site to calculate the average embedding of the wild-type and mutant sequences. This ensures that the mutation site and its local sequence environment are taken into account when evaluating the mutation effect.
[0055] The average embedding vector of a sequence is calculated as the mean of all token embedding vectors within the window, and its mathematical expression is:
$$\bar{e} = \frac{1}{L}\sum_{i=1}^{L} e_i$$
where $e_i$ represents the embedding of the i-th amino acid within the window of 51 amino acids, and $L$ is the total number of amino acids in the window (in this embodiment, $L$ = 51).
[0056] After obtaining the average embedding vectors of the wild-type and variant sequences, the Euclidean distance between them is calculated to quantify the difference between the two sequences. The Euclidean distance $d$ is calculated as follows:
$$d = \sqrt{\sum_{j=1}^{D}\left(\bar{e}^{\,WT}_j - \bar{e}^{\,MT}_j\right)^2}$$
where $\bar{e}^{\,WT}_j$ and $\bar{e}^{\,MT}_j$ represent the values of the wild-type and mutant average embeddings in dimension $j$ of the $D$-dimensional embedding space. A larger Euclidean distance indicates a greater difference between the wild-type and mutant sequences, suggesting a higher likelihood that the mutation disrupts protein function. In this case, a larger Euclidean distance is often associated with pathogenic mutations, as it indicates that the mutation causes a significant change in protein structure or function.
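The mean pooling and distance computation can be sketched with plain Python lists. In practice the per-residue embeddings would come from the fine-tuned protein language model; the toy two-dimensional vectors and function names below are placeholders for illustration only.

```python
import math
from typing import List, Sequence


def mean_embedding(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-residue embeddings over the window (length L)."""
    length = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / length for j in range(dim)]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy per-residue embeddings (window of 2 residues, dimension 2).
wt_tokens = [[0.0, 0.0], [2.0, 0.0]]
mt_tokens = [[0.0, 0.0], [2.0, 6.0]]
distance = euclidean_distance(mean_embedding(wt_tokens), mean_embedding(mt_tokens))
```

In the toy example the mean embeddings are [1, 0] and [1, 3], giving a distance of 3; identical wild-type and mutant embeddings would give a distance of 0.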
[0057] To explore the basis of the model's predictions, the Euclidean distance between wild-type and mutant embeddings was examined. Although this distance provides a continuous measure of the change in representation, it correlated only weakly with experimental functional scores in the validation set. This suggests that, although the Euclidean distance metric provides interpretability by quantifying perturbations at the sequence level, other factors, such as local structural context, may be needed to fully capture functional effects.
[0058] 5. Learnable-embedding CNN baseline
To establish a non-pretrained baseline model, this embodiment implements a lightweight convolutional neural network (CNN) with a learnable embedding layer (LearnableEmb-CNN). Each protein sequence is tokenized into integers (1-20, with 0 used for padding) and fed into a randomly initialized embedding layer (vocabulary size 21, dimension 128). This embedding layer is jointly optimized during training, so that task-specific amino acid representations are learned without relying on pre-trained models or biophysical features. Learnable positional encodings are added before three one-dimensional convolutional blocks (kernel sizes 3, 5, and 7; channel numbers 128→64→128→256). Each convolutional block is followed, in order, by batch normalization, ReLU activation, and dropout (p=0.3). Finally, global max pooling is applied, and two fully connected layers (256→128→2) generate the binary classification logits. The model employs a cross-entropy loss function with balanced class weights and an Adam optimizer (learning rate = 1×10⁻⁶). The baseline model (approximately 1 million parameters) was trained with a batch size of 32 and an early stopping strategy based on the validation set F1 score (patience value = 5), using 5-fold cross-validation. This baseline model quantifies what can be learned from the task alone, without external knowledge, thus enabling ablation comparisons with pLM-based models in the study.
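As a non-limiting illustration of the input encoding used by the baseline, the following sketch maps a protein sequence to integers 1-20 and pads with 0. The alphabetical ordering of the 20 standard amino acids and the names `TOKEN_OF` and `tokenize` are assumptions of this example; this embodiment specifies only the integer range and the padding value.

```python
from typing import List

# Assumed ordering of the 20 standard amino acids; the text fixes only the range 1-20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_OF = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
PAD_ID = 0  # padding value given in the text


def tokenize(sequence: str, max_len: int) -> List[int]:
    """Convert a sequence to integer tokens, truncating or right-padding to max_len.

    Raises KeyError on non-standard residue letters.
    """
    ids = [TOKEN_OF[aa] for aa in sequence.upper()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

The resulting vocabulary has 21 entries (20 residues plus padding), which matches the stated embedding layer vocabulary size.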
[0059] 6. Candidate mutation screening and experimental validation
Based on the model's prediction results, namely the pathogenicity classification results and confidence scores, candidate mutations are screened from the target gene's mutations, and their function is validated through cell experiments. A candidate mutation meets any of the following criteria: it is classified as pathogenic but labeled as a variant of uncertain significance (VUS) in a known database (such as ClinVar); it is not covered by experimental perturbation data; or its prediction is inconsistent with that of the reference model. The model assigned higher confidence scores to correct predictions and lower confidence scores to incorrect predictions (Figure 2d), which highlights its internal reliability. This strategy ensures a comprehensive assessment of the model's predictive accuracy and lays the foundation for discovering new, potentially pathogenic mutations.
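As a non-limiting illustration, the three screening criteria above can be expressed as a simple filter. The `Prediction` record, its field names, and the ranking of candidates by distance of the confidence score from 0.5 are assumptions of this sketch; this embodiment does not state how candidates are ordered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    variant: str                    # variant identifier (hypothetical field)
    label: str                      # "pathogenic" or "benign" from the fine-tuned model
    confidence: float               # sigmoid confidence score in (0, 1)
    database_label: Optional[str]   # e.g. "VUS", or None if the variant is absent
    in_perturbation_data: bool      # covered by experimental perturbation data?
    reference_label: Optional[str]  # call from a reference model, if any


def is_candidate(p: Prediction) -> bool:
    """A variant qualifies if it meets any one of the three criteria."""
    pathogenic_but_vus = p.label == "pathogenic" and p.database_label == "VUS"
    not_covered = not p.in_perturbation_data
    disagrees = p.reference_label is not None and p.reference_label != p.label
    return pathogenic_but_vus or not_covered or disagrees


def screen_candidates(predictions: List[Prediction]) -> List[Prediction]:
    """Keep qualifying variants, most decisive predictions first (assumed ordering)."""
    hits = [p for p in predictions if is_candidate(p)]
    return sorted(hits, key=lambda p: abs(p.confidence - 0.5), reverse=True)
```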
[0060] To further evaluate the reliability of CaVepP53, a completely independent external test dataset containing 3,945 missense mutations was constructed. This dataset was derived from a p53 functional dataset containing 7,467 TP53 mutations, and any variants overlapping with the internal training and test sets were strictly excluded. Using the experimentally determined functional activity score as the gold standard, each model was evaluated by testing whether the functional scores of the variants it classified as pathogenic were significantly lower than those of the variants it classified as benign. All three models showed statistically significant differences, and CaVepP53 showed the clearest separation between the functional score distributions of the predicted pathogenic and benign groups (Figure 2f). Furthermore, analysis of 503 clinically annotated TP53 variants obtained from the cBioPortal database (Figure 2e, 2g) indicated that CaVepP53 showed particularly strong consistency in identifying driver variants, achieving an ROC-AUC of 0.918 and a Cohen's d of 2.03, whereas AM and PrimateAI-3D achieved ROC-AUCs of 0.893 and 0.865 and Cohen's d values of 1.94 and 1.64, respectively.
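As a non-limiting illustration of the two summary statistics reported above, the following sketch computes ROC-AUC as the probability that a randomly chosen positive scores above a randomly chosen negative, and Cohen's d using the pooled sample standard deviation. Which exact variant of Cohen's d was used in this embodiment is not stated, so the pooled form is an assumption, as are the function names.

```python
import math
from statistics import mean, variance
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Fraction of positive/negative pairs ranked correctly; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Difference in means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)
```

The pairwise loop is quadratic in the number of variants, which is acceptable for a few hundred annotated variants; a rank-based implementation would be preferable for larger sets.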
[0061] The predicted results were verified experimentally.
[0062] Plasmid construction: pegRNA and nick-sgRNA plasmids were constructed. The pegRNA plasmid backbone was amplified from pGL3-U6-sgRNA-EGFP (Addgene, #107721) and then digested with BsaI-HFv2 (NEB) to generate sticky ends. Spacer oligonucleotides, pegRNA 3' extension oligonucleotides, and sgRNA scaffold oligonucleotides were each synthesized as fragments with compatible sticky ends. These four fragments were then ligated together using T4 DNA ligase (NEB). The universal primers and pegRNA sequences used in the construction are listed in Tables 2-3. The PE7 plasmid was purchased from Addgene (#214812).
[0063] Table 2: Primers for constructing pegRNA
Table 3: pegRNA sequences
Construction of TP53 mutant cells
HCT-116 cells were seeded at 50% confluence in 24-well plates. After 12 hours of culture, a transfection mixture containing 0.9 μg PE7 plasmid, 0.3 μg pegRNA-EGFP plasmid, 0.1 μg nick-sgRNA-mCherry plasmid, and 3 μL transfection reagent was prepared and added to the wells. Cells were incubated with the mixture for 24 hours. Two days after transfection, EGFP/mCherry double-positive cells were isolated by flow cytometry for subsequent experiments.
[0064] Competitive growth assay
Nutlin-3a was dissolved in dimethyl sulfoxide (DMSO) to prepare a 40 mM stock solution. TP53 mutant HCT-116 cells (24-well plates, 4,000 cells per well) were treated for 5 days with either 10 μM Nutlin-3a or the solvent control (DMSO). This experiment relies on a known mechanism: functional TP53 mutants evade Nutlin-3a-induced cell cycle arrest and apoptosis due to impaired transcriptional activity, while non-functional mutants remain subject to these effects. Functional TP53 mutants therefore exhibit a growth advantage and show higher editing efficiency after Nutlin-3a treatment. Cells were subsequently collected, and TP53 alleles were amplified by PCR and analyzed by Sanger sequencing. The primers used are listed in Table 4. Mutant allele frequencies were calculated using EditR and compared between the Nutlin-3a-treated group and the DMSO control group.
[0065] Table 4: Sequences of target primers
Establishment of liver tumors in mice
Mice were injected by hydrodynamic injection with a solution containing 5 μg MYC-luciferase plasmid, 1 μg SB100X plasmid, 30 μg PE7 plasmid, 15 μg pegRNA-EGFP plasmid, and 15 μg nick-sgRNA-mCherry plasmid (dissolved in 10% (w/v) physiological saline). Once tumors had formed (confirmed by luciferase bioluminescence imaging), the mice were euthanized.
[0066] HE staining
Paraffin-embedded sections were dewaxed with xylene, rehydrated through a graded ethanol series, and soaked in distilled water for 10 minutes. The sections were then stained with hematoxylin until the cell nuclei were clearly visible, rinsed twice with distilled water, and blued with ammonia. Subsequently, the sections were counterstained with eosin, dehydrated through a graded ethanol series, cleared with xylene, and mounted. The materials used are shown in Table 5.
[0067] Table 5: Bill of Materials
Python (v3.13) and GraphPad Prism were used for data visualization and statistical analysis. Data are presented as mean ± standard deviation (SD) from two independent biological replicates (n = 2 per group). For comparisons between two groups, a standard two-tailed unpaired Student's t-test was used, assuming that the data were normally distributed with equal variances. For data that were not normally distributed, the Mann-Whitney U test was used. Significance levels are expressed as ns (not significant, p ≥ 0.05), *p < 0.05, **p < 0.01, and ***p < 0.001.
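As a non-limiting illustration, the significance notation defined above can be encoded as a small helper for figure annotation. The function name is an assumption of this example; the thresholds are those stated in the paragraph above.

```python
def significance_label(p_value: float) -> str:
    """Return the marker for a p-value: ns, *, **, or ***."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"
```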
[0068] After confirming the high performance of the fine-tuned TP53-specific VEP model, saturation mutagenesis prediction analysis was performed on the entire protein sequence (Figure 3a). Of the 7,467 variant predictions generated, 2,737 were classified as pathogenic variants, located primarily within the DNA-binding domain (DBD), while 4,694 were classified as benign variants. This distribution suggests that the model tends to identify functionally sensitive regions, such as the DBD (Figure 3a, 3b). To assess whether location within the DBD region alone is sufficient for accurate prediction, a biologically meaningful baseline model was established: a rule-based classifier that predicts pathogenicity solely based on whether the variant is located within the DBD region (amino acid residues 102-292). This classifier achieved an MCC value of 0.3663 and an F1 score of 0.7623 on the same test set, confirming that "DBD mutation" is indeed a strong predictive feature.
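As a non-limiting illustration, the rule-based DBD baseline and the two metrics reported for it can be sketched as follows. The residue range 102-292 is taken from the paragraph above; the variant label format (for example "R248W"), the function names, and the encoding of pathogenic as 1 and benign as 0 are assumptions of this example.

```python
import math
import re
from typing import Sequence, Tuple

DBD_START, DBD_END = 102, 292  # residue range given in the text


def position_of(variant: str) -> int:
    """Extract the residue number from a missense label such as 'R248W'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", variant)
    if match is None:
        raise ValueError(f"not a missense label: {variant!r}")
    return int(match.group(1))


def dbd_rule(variant: str) -> int:
    """Predict 1 (pathogenic) if the variant lies in the DBD, else 0 (benign)."""
    return 1 if DBD_START <= position_of(variant) <= DBD_END else 0


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom
```

Because the rule assigns the same label to every variant inside the domain, it can reach a high F1 score while the MCC, which also accounts for true negatives, stays moderate.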
[0069] However, this rule-based classifier has limitations: it cannot distinguish between pathogenic and benign variants within the DBD region, nor can it identify pathogenic mutations outside this region. In contrast, the fine-tuned model can capture these nuances, enabling more detailed genome-wide analysis. Given the large number of variants of uncertain significance (VUS) in the ClinVar database and the continued reporting of new TP53 mutations, these mutations were further classified using the model. The goal was to determine whether they exhibited intermediate effects or resembled known categories but remained unclassified. The analysis showed that approximately 53% of ClinVar missense VUS were predicted as benign, while 47% were predicted as pathogenic (Figure 3c). Based on the model's excellent performance on missense mutations, its application was extended to single amino acid deletions. Using 157 experimentally validated single-residue deletion variants from experimental perturbation data, the model was able to accurately distinguish between benign and pathogenic deletions (Figure 3d).
[0070] Nutlin-3a, an MDM2 inhibitor, stabilizes the p53 protein, thereby promoting cell cycle arrest or apoptosis. However, due to impaired transcriptional activity, TP53 loss-of-function mutants can escape Nutlin-3a-induced growth inhibition and continue to proliferate under drug treatment. Based on this biological mechanism, a competitive growth assay was established to validate the predictions of the fine-tuned TP53-specific variant effect predictor in prime-edited HCT-116 cell lines.
[0071] Functional mutants exhibited significant growth advantages under drug selection, showing higher editing efficiency when co-cultured with wild-type cells (Figure 4a). To verify the reliability of this assay, three functional mutants previously identified by AM, high-throughput screening, and CaVepP53 were randomly selected. Compared with the control group, Nutlin-3a treatment significantly increased editing efficiency (Figure 4b), which is completely consistent with the predictions of CaVepP53. These results validate the robustness of the competitive assay in identifying functional variants and confirm its effectiveness in distinguishing functional mutants. After the assay was established, 19 TP53 mutants were tested experimentally: 10 of them were consistent with the predictions of AM / screening, and 9 were misclassified by AM. The accuracy of CaVepP53 reached 68.2%, which is 18.2 percentage points higher than that of the pre-trained model (ESMC_600M), demonstrating the effectiveness of fine-tuning.
[0072] CaVepP53 achieved a 9.1 percentage-point improvement in accuracy over the AM model on this experimental dataset. Specifically, CaVepP53 correctly predicted 6 of the 9 mutations misclassified by AM, an accuracy of 66.7% (6/9) on these challenging cases. Crucially, CaVepP53 identified 7 new functional variants (S99Y, R110A, S116P, V218I, P250A, L265I, G334L) and 5 new non-functional mutations (Q52F, D61P, T329H, E343L, S367P) (Figure 4c), including three variants of uncertain significance (VUS) from ClinVar and Ensembl (S99Y, S116P, P250A). Notably, the CaVepP53 model demonstrated a remarkable ability to identify pathogenic mutations (Figure 4d). Furthermore, three mutants (S116P, P250A, and L265I) were randomly selected, and their hepatic tumorigenic potential was evaluated in mice. Both the S116P (corresponding to mouse S113P) and L265I (mouse L262I) mutants induced malignant transformation in MYC-expressing hepatocytes, consistent with the known oncogenic mutant R248W (mouse R245W), thus confirming their in vivo tumorigenic potential (Figure 4e). These results provide causal evidence for the reclassification of VUS and the advancement of TP53 functional annotation (Figure 4f). In summary, CaVepP53 demonstrated enhanced predictive performance, accurately classifying 15 of the 22 validated variants, including resolving VUS cases such as S116P and P250A, thus supporting evidence-driven re-annotation.
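As a non-limiting arithmetic check, the reported percentages can be reproduced from the stated counts. The sketch assumes that the 9 variants misclassified by AM are counted among the 22 validated variants, so that AM is correct on 13 of them, and it reads the reported gains as percentage-point differences; both readings are inferences from the figures above rather than statements in this embodiment.

```python
def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


TOTAL_VALIDATED = 22      # validated variants (stated)
CAVEP_CORRECT = 15        # correctly classified by CaVepP53 (stated)
AM_MISCLASSIFIED = 9      # misclassified by AM (stated)
CAVEP_ON_AM_ERRORS = 6    # of those 9, correctly predicted by CaVepP53 (stated)

cavep_accuracy = pct(CAVEP_CORRECT, TOTAL_VALIDATED)
rescue_rate = pct(CAVEP_ON_AM_ERRORS, AM_MISCLASSIFIED)

# Inferred, not stated: AM correct on the remaining 22 - 9 = 13 variants.
am_accuracy = pct(TOTAL_VALIDATED - AM_MISCLASSIFIED, TOTAL_VALIDATED)
gain_over_am = round(cavep_accuracy - am_accuracy, 1)
```

Under these assumptions the values come out as 68.2%, 66.7%, 59.1%, and a 9.1 percentage-point gain, in agreement with the reported figures.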
[0073] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the technical solutions of the embodiments of the present invention.
Claims
1. A method for predicting the pathogenicity of gene variants based on a perturbation protein language model, characterized in that it includes the following steps:
S1. Obtain the mutation dataset of the target gene, wherein the mutation dataset includes multiple mutation samples and their corresponding pathogenicity labels, and the pathogenicity labels are derived from experimental perturbation data and clinical annotation databases;
S2. Fine-tune the protein language model using the mutation dataset, wherein the fine-tuning includes: fine-tuning and updating the parameters of the last layer of the protein language model and the parameters of the added classification head;
S3. Input the sequence of the mutation to be tested in the target gene into the fine-tuned model, and output the pathogenicity classification result and confidence score of the mutation sequence to be tested.
2. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that the experimental perturbation data includes deep mutational scanning data, and the clinical annotation data includes the ClinVar database; the deep mutational scanning data refers to data obtained by CRISPR or base editing technology that reflects the impact of mutations on protein binding affinity or functional activity.
3. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that, in S2, the update specifically includes: using a binary cross-entropy loss function for parameter optimization, and using weighted cross-entropy for class-imbalanced datasets, wherein the weight of each category in the binary cross-entropy loss function is inversely proportional to the number of samples of the corresponding category in the mutation dataset.
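As a non-limiting illustration of the class weighting recited in claim 3, the following sketch derives per-class weights inversely proportional to class counts and applies them in a binary cross-entropy loss. The "balanced" normalization n / (k × count), the averaging over samples, and the function names are assumptions of this example; the claim specifies only the inverse proportionality.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Per-class weights proportional to 1 / class count.

    The normalization n / (k * count) is an assumed convention.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 weights: Dict[int, float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy with each sample scaled by its class weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(labels)
```

With three pathogenic samples and one benign sample, for example, the benign class receives three times the weight of the pathogenic class, so that both classes contribute equally to the total loss.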
4. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that, in step S3, while outputting the pathogenicity classification result of the mutation sequence to be tested, the mutant embedding vector and the wild-type embedding vector of the target gene after fine-tuning are extracted and used to calculate a measure of structural perturbation of the mutation to be tested in the target gene.
5. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that the method further includes screening candidate mutations from the mutations to be tested in the target gene based on pathogenicity classification results and confidence scores, and verifying the function of the candidate mutations through cell experiments; a candidate mutation must meet any of the following conditions: it is classified as a pathogenic mutation but is labeled as a variant of uncertain significance in known databases, it is not covered by experimental perturbation data, or its prediction is inconsistent with the prediction result of the reference model.
6. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that the target gene is any one of P53, VHL, ATM, BRCA1, RAD51C, or BAP1.
7. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that the confidence score is obtained by applying the sigmoid function to the output logit value of the classification head.
8. The method for predicting the pathogenicity of gene variants based on a perturbation protein language model according to claim 1, characterized in that the protein language model is the ESMC model.
9. A prediction system applied to the method for predicting the pathogenicity of gene variants based on a perturbation protein language model as described in any one of claims 1-8, characterized in that it includes:
a data acquisition module, used to acquire the mutation dataset of the target gene, which includes multiple mutation samples and their corresponding pathogenicity labels;
a model fine-tuning module, used to fine-tune the base model, which is based on the protein language model, on the mutation dataset;
a prediction module, used to input the target gene mutation sequence to be tested into the fine-tuned model and output the pathogenicity classification result and confidence score of the mutation sequence to be tested.
10. The application of the method for predicting the pathogenicity of gene variants based on a perturbation protein language model as described in any one of claims 1-8 in constructing a treatment decision or prognostic assessment model for gene loss-of-function cancers.