Model for predicting high-affinity mutation site of antibody, and construction method therefor and use thereof
By training and optimizing the ESM model to predict antibody high-affinity mutation sites, the problem of predicting antibody multi-point mutation combinations was solved, achieving rapid and accurate antibody affinity enhancement and guiding the construction of high-affinity antibodies.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- BIOINTRON BIOLOGICAL INC
- Filing Date
- 2025-12-17
- Publication Date
- 2026-07-02
AI Technical Summary
How to effectively predict the impact of antibody multi-point mutation combinations on affinity remains a challenge in existing technologies. Blindly combining them may lead to a decrease in affinity, and there is a lack of rapid and accurate prediction methods.
An ESM model was trained and optimized to obtain a model for predicting high-affinity mutation sites on antibodies. The mutation sites in the amino acid sequence of the mutated antibody were masked in the manner of a masked language model, and the model output the probability of correctly predicting the target residue. The model was then adjusted to improve prediction accuracy. Combined with single-point mutation testing and combination scoring, this method guides the construction of high-affinity antibodies.
It enables rapid and accurate prediction of combinations of high-affinity mutation sites for antibodies, guides antibody affinity maturation technology, and significantly improves antibody affinity.
Figures: PCTCN2025143190-APPB-I100001, PCTCN2025143190-APPB-I100002, PCTCN2025143190-APPB-I100003
Abstract
Description
A model for predicting antibody high-affinity mutation sites, its construction method, and its application.
Technical Field
[0001] This application belongs to the field of biotechnology and relates to a model for predicting antibody high-affinity mutation sites, its construction method, and its application.
Background Technology
[0002] Antibodies are important proteins in the immune system, used to recognize and neutralize foreign substances such as pathogens, viruses, and bacteria. Antibody molecules typically consist of four polypeptide chains: two heavy chains and two light chains. These chains form a Y-shaped structure in the molecular structure, enabling the antibody to bind to specific antigens. The role of antibodies in the immune response is to neutralize or label specific antigens by binding to them through their variable regions, allowing other immune cells to recognize and eliminate these antigens. Antibody diversity arises from genetic recombination and mutation, enabling the immune system to recognize virtually an unlimited number of antigens. Antibodies have wide applications in medicine and research, including disease diagnosis, vaccine development, and treatment. For example, monoclonal antibodies are widely used in cancer treatment; they can specifically target antigens on the surface of cancer cells, thereby inhibiting the growth and spread of cancer cells.
[0003] Antibody affinity refers to the binding strength between an antibody and an antigenic epitope (antigenic determinant). In the development and application of antibodies, affinity is an important indicator of biological activity and clinical value. High-affinity antibodies can significantly reduce drug dosage, reduce side effects, and save costs. Affinity maturation refers to the gradual increase in the binding strength of antibodies to antigen during the immune response. In vivo, it is achieved through high-frequency somatic mutations concentrated mainly in the antibody CDR regions. In vitro affinity maturation follows the same core idea: screening for antibody variants with higher affinity through random or targeted mutations. This mainly involves performing single-point saturation mutagenesis at each amino acid site in the CDR regions to construct a single-point saturation mutant plasmid library of the parent antibody and identify single mutations that improve affinity. These single mutations can then be combined to obtain a series of high-affinity mutants. For example, CN115925947A discloses an affinity maturation method and affinity-matured anti-human PD-L1 single-domain antibodies: by combining single-point saturation mutations covering the entire CDR region with a high-throughput mammalian expression and screening system, high-affinity anti-human PD-L1 single-domain antibodies are obtained.
[0004] However, how to combine single-point mutations to obtain effective multi-point mutations remains one of the challenges in this field. Blindly combining them can lead to a decrease in affinity. Therefore, developing an effective method to predict the impact of combined mutations on antibody affinity is of great significance to the field of antibody development.
Summary of the Invention
[0005] This application provides a model for predicting the affinity of mutant antibodies, a method for constructing the model, and its application. It develops a model for predicting high-affinity mutation sites of antibodies, so as to achieve rapid and accurate prediction of combinations of high-affinity mutation sites of antibodies and promote the development of affinity maturation technology.
[0006] In a first aspect, this application provides a method for constructing a model to predict antibody high-affinity mutation sites, the method comprising:
[0007] The amino acid sequences of mutant antibodies with enhanced affinity are used as the training set. The training set includes the amino acid sequence of each mutant antibody, the mutation site, and the amino acid after mutation, which serves as the prediction target. The ESM model is trained on the training set. Training includes masking the amino acid at the mutation site of the mutant antibody in the manner of a masked language model, predicting the amino acid at the masked site, outputting the probability of correctly predicting the target, and adjusting the model based on the output to obtain the model for predicting antibody high-affinity mutation sites.
[0008] In this application, a model for predicting high-affinity mutation sites of antibodies is constructed based on ESM model training and optimization. The resulting model can effectively predict and evaluate the naturalness of different mutation sequences, and thus infer the impact of mutations on antibody affinity. It can be effectively applied to antibody affinity maturation technology to predict effective combinations of mutation sites.
[0009] Preferably, when there is more than one masked mutation site, the mean of the probabilities of the correctly predicted targets is used as the output result.
[0010] Preferably, the adjustment continues until the output result is not lower than a preset threshold, where the threshold is 0.5.
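The averaging rule of paragraph [0009] and the stopping threshold of paragraph [0010] can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function names `combination_score` and `meets_threshold` and the example probabilities are assumptions introduced here.

```python
from statistics import fmean

THRESHOLD = 0.5  # preset threshold stated in paragraph [0010]


def combination_score(target_probs):
    """Mean of the probabilities assigned to the correct target at each masked site."""
    if not target_probs:
        raise ValueError("at least one masked site is required")
    return fmean(target_probs)


def meets_threshold(target_probs, threshold=THRESHOLD):
    """True when the averaged output is not lower than the threshold."""
    return combination_score(target_probs) >= threshold


# Hypothetical probabilities for two masked mutation sites.
example = [0.8, 0.4]
score = combination_score(example)
```

Under this reading, adjustment of the model would stop once `meets_threshold` returns true for the training output.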
[0011] Secondly, this application provides a model for predicting antibody high-affinity mutation sites, which is constructed by the method for constructing the model for predicting antibody high-affinity mutation sites described in the first aspect.
[0012] Thirdly, this application provides a method for predicting combinations of high-affinity mutation sites on antibodies, the method comprising:
[0013] The amino acid sequence of the mutant antibody is input into the model for predicting antibody high-affinity mutation sites described in the second aspect. All single mutation sites are masked in the manner of a masked language model. The amino acid at each masked site is then predicted, and the probability of correctly predicting the target is output. The prediction results of the multiple mutation sites are averaged to obtain the score of the mutation site combination. The mutation site combination with the highest score is selected as the candidate antibody high-affinity mutation site combination.
[0014] In this application, a method for predicting combinations of high-affinity mutation sites for antibodies is further developed. Based on the constructed prediction model, the method can quickly and accurately predict the impact of different mutation combinations on antibody affinity, and guide the construction of high-affinity antibodies.
[0015] In this application, the combinations of mutation sites can be sorted from high to low based on their scores, and then the combinations of mutation sites with high scores can be selected for affinity testing to verify the results and guide the construction of high-affinity antibodies.
[0016] Preferably, the method further includes performing a single-point saturation mutation on the antibody to be predicted and testing its affinity to obtain the mutated antibody and single-point mutation information.
[0017] Preferably, the method further includes a step of performing an affinity test on the antibody containing the combination of high-affinity mutation sites of the candidate antibody.
[0018] It is understood that this application can make predictions for any antibody.
[0019] Preferably, the antibody to be predicted includes a CD138 antibody.
[0020] Preferably, the amino acid sequence of the heavy chain variable region of the antibody to be predicted includes the sequence shown in SEQ ID NO.1.
[0021] Preferably, the amino acid sequence of the light chain variable region of the antibody to be predicted includes the sequence shown in SEQ ID NO.2.
[0022] Preferably, the combination of mutation sites includes a combination of heavy chain mutation sites and a combination of light chain mutation sites. The combination of heavy chain mutation sites includes D54F and S57F, and the combination of light chain mutation sites includes a combination of L46G, Y50G, I53F and N92Y, or a combination of L46G, Y50G, I53W and N92Y.
[0023] Fourthly, this application provides a high-affinity antibody containing the combination of mutation sites predicted by the method for predicting combinations of high-affinity mutation sites for antibodies as described in the third aspect.
[0024] Specifically, this application provides a high-affinity CD138 antibody, wherein the high-affinity CD138 antibody is mutated based on the combination of mutation sites predicted by the method for predicting combinations of high-affinity mutation sites for antibodies as described in the third aspect, on the basis of the heavy chain variable region of the amino acid sequence as shown in SEQ ID NO.1 and the light chain variable region of the amino acid sequence as shown in SEQ ID NO.2.
[0025] Preferably, the mutations in the heavy chain variable region include D54F and S57F.
[0026] Preferably, the mutations in the variable region of the light chain include combinations of L46G, Y50G, I53F and N92Y, or combinations of L46G, Y50G, I53W and N92Y.
[0027] Fifthly, this application provides an apparatus for predicting combinations of high-affinity mutation sites of antibodies, the apparatus comprising a single-point mutation testing unit, a mutation combination prediction unit, and a verification unit;
[0028] The single-point mutation test unit is used to perform the following:
[0029] Performing single-point saturation mutagenesis on the antibody to be predicted and testing affinity to obtain the mutant antibody and single-point mutation information.
[0030] The mutation combination prediction unit is used to perform the following:
[0031] The amino acid sequence of the mutant antibody is input into the model for predicting high-affinity mutation sites of the antibody as described in the second aspect, and the probability of correctly predicting the target is output. The prediction results of multiple mutation sites are summed and averaged to obtain the score of the mutation site combination. The mutation site combination with the highest score is selected as the candidate antibody high-affinity mutation site combination.
[0032] The verification unit is used to perform the following:
[0033] Performing affinity testing on antibodies containing the candidate antibody high-affinity mutation site combination.
[0034] In a sixth aspect, this application provides an electronic device comprising: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method for predicting combinations of high-affinity mutation sites of antibodies as described in the third aspect.
[0035] In a seventh aspect, this application provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method for predicting combinations of high-affinity mutation sites for antibodies as described in the third aspect.
[0036] Compared with the prior art, this application has at least the following beneficial effects:
[0037] In this application, a model for predicting high-affinity mutation sites of antibodies is constructed based on ESM model training and optimization. The resulting model can effectively predict and evaluate the naturalness of different mutation sequences, and thus infer the impact of mutations on antibody affinity. It can be effectively applied to antibody affinity maturation technology to predict effective combinations of mutation sites and guide the construction of high-affinity antibodies.
Detailed Implementation
[0038] The technical solution of this application will be further described below through specific embodiments. However, the examples below are merely simplified examples of this application and do not represent or limit the scope of protection of this application. The scope of protection of this application shall be determined by the claims.
[0039] Where specific techniques or conditions are not specified in the examples, they shall be performed in accordance with the techniques or conditions described in the literature in this field, or in accordance with the product instructions. Reagents or instruments whose manufacturers are not specified are all conventional products that can be purchased from legitimate channels.
[0040] The Transformer is a deep learning model based on a self-attention mechanism. It primarily consists of an encoder and a decoder. The encoder maps the input sequence to a sequence of context-dependent vector representations, while the decoder predicts the next output based on the encoder's output and the partial output sequence generated so far. In the encoder, the input sequence is first converted into vector representations by an embedding layer. These vectors then pass through multiple self-attention layers and feedforward neural network layers. The self-attention layers calculate the correlation between different positions in the sequence, while the feedforward layers apply a further non-linear transformation to the representation of each position. The decoder also contains multiple self-attention and feedforward layers, but unlike the encoder, its self-attention layers use a mask to prevent the decoder from seeing future positions when predicting the current position. The decoder additionally attends to the encoder's output, enabling information exchange between the encoder and decoder. The Transformer uses multi-head attention: the queries, keys, and values are projected into several subspaces (heads), and each head performs its own attention calculation.
[0041] In the field of amino acid sequence analysis, Masked Language Models (MLMs), as a pre-training strategy, aim to enhance a model's understanding of protein sequences through masking and prediction techniques. Specifically, this model masks certain amino acids in the original amino acid sequence and then uses contextual information to recover the masked amino acids, thereby learning the sequence's intrinsic structure and biological semantics. In practice, MLM first randomly selects several positions in the amino acid sequence and replaces these positions with specific masking markers. The model is then trained to identify and predict the original amino acids at these masked positions. This process not only enables the model to capture sequence dependencies between amino acids but also allows it to understand the roles of amino acids in protein function and structure formation. In this way, MLM-pretrained models gain a deep understanding of the features of amino acid sequences, which is crucial for various bioinformatics tasks. For example, in areas such as protein function prediction, structure prediction, sequence variation analysis, and protein-protein interaction prediction, models pre-trained with MLM have demonstrated superior performance.
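The masking step described above can be sketched in a few lines. This is an illustration under stated assumptions: the mask symbol `<mask>`, the helper name `mask_positions`, and the toy sequence are introduced here and are not taken from the application.

```python
MASK_TOKEN = "<mask>"  # assumed mask symbol


def mask_positions(sequence, positions, mask_token=MASK_TOKEN):
    """Mask the given 1-based positions of an amino acid sequence.

    Returns (tokens, targets): the token list with masks inserted, and a dict
    mapping each masked position to the residue the model should recover.
    """
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        if not 1 <= pos <= len(tokens):
            raise IndexError(f"position {pos} is outside the sequence")
        targets[pos] = sequence[pos - 1]
        tokens[pos - 1] = mask_token
    return tokens, targets


# Toy sequence for illustration only; it is not a sequence from the application.
tokens, targets = mask_positions("EVQLVESGG", [3, 7])
```

In pre-training the positions are chosen at random; in the method of this application the masked positions are the mutation sites.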
[0042] ESM (Evolutionary Scale Modeling) is a family of Transformer-based protein language models used to predict protein structure and function. ESM models treat protein sequences as a language, with each amino acid acting as a character, and learn the statistical patterns of this language with deep neural networks trained on a masked language modeling objective rather than autoregressively. ESM-2 (Evolutionary Scale Modeling 2), developed by Meta AI (formerly Facebook AI Research), is an MLM-based protein language model designed to understand and predict protein sequence structure and function through pre-training and fine-tuning. It is built on the Transformer encoder architecture and is available in several sizes, the largest having about 15 billion parameters. As the backbone of the ESMFold structure predictor, ESM-2 supports atomic-level prediction of protein 3D structures, providing a powerful tool for bioinformatics and protein engineering.
[0043] In this application, a model for predicting high-affinity mutation sites of antibodies is constructed based on ESM model training and optimization. The resulting model can effectively predict and evaluate the naturalness of different mutation sequences, and thus infer the impact of mutations on antibody affinity. It can be effectively applied to antibody affinity maturation technology to predict effective combinations of mutation sites and guide the construction of high-affinity antibodies.
[0044] In one embodiment of this application, an apparatus for predicting combinations of high-affinity mutation sites for antibodies is provided. The apparatus includes a single-point mutation testing unit, a mutation combination prediction unit, and a verification unit. The single-point mutation testing unit performs the following actions: performing a single-point saturation mutation on the antibody to be predicted and testing its affinity to obtain the mutated antibody and single-point mutation information. The mutation combination prediction unit performs the following actions: inputting the amino acid sequence of the mutated antibody into a model predicting high-affinity mutation sites for antibodies, outputting the probability of correctly predicting the target, summing and averaging the prediction results of multiple mutation sites as a score for the mutation site combination, and selecting the mutation site combination with the highest score as a candidate high-affinity mutation site combination for antibodies. The verification unit performs the following actions: performing an affinity test on antibodies containing the candidate high-affinity mutation site combinations for antibodies.
[0045] In another specific embodiment of this application, an electronic device may be provided, the electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method for predicting combinations of antibody high affinity mutation sites.
[0046] In another specific embodiment of this application, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method for predicting the combination of high-affinity mutation sites of antibodies.
[0047] Example 1
[0048] This embodiment uses the CD138 antibody as an example to predict combinations of mutation sites.
[0049] The heavy chain amino acid sequence of the CD138 antibody is shown in SEQ ID NO.1, and the light chain amino acid sequence is shown in SEQ ID NO.2. The single-point saturation mutation sites of the heavy chain include ... (each label gives the original residue, the position, and the mutated residue; the first represents a G-to-W mutation at position 50 of the heavy chain, and the other mutations can be read in the same way). The light chain mutation sites include ... .
[0050] Model training and optimization, and the prediction of mutation site combinations, are based on the ESM model. Taking G50W as a training example, position 50 of the heavy chain (H) in the original sequence is replaced with a mask. The ESM model then predicts the residue at this position, and the target for this site is ... .
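The mutation labels used in this example, such as H_G50W in paragraph [0070] or D54F in paragraph [0071], follow a regular pattern: optional chain prefix, original residue, position, mutated residue. A small parser makes the convention explicit. The class name `Mutation` and the function `parse_mutation` are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    chain: str      # "H" or "L"; empty string when the label has no chain prefix
    wild_type: str  # residue before mutation
    position: int   # 1-based position in the chain
    mutant: str     # residue after mutation, i.e. the prediction target


_PATTERN = re.compile(r"^(?:([HL])_)?([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Parse labels such as "H_G50W" or "D54F" into a Mutation record."""
    match = _PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    chain, wild_type, position, mutant = match.groups()
    return Mutation(chain or "", wild_type, int(position), mutant)


m = parse_mutation("H_G50W")
```

Here `m.position` is the site to be masked and `m.mutant` is the residue the model is trained to predict at that site.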
[0051] Specifically, the mutant antibody sequence will be input... .
[0052] (1) Embedding layer:
[0053] , obtain emb.
[0054] (2) Encoder:
[0055] The encoder comprises a self-attention layer followed by a feedforward network layer. The embedded state tensor EmbeddingVec1 is multiplied by three weight matrices to obtain three matrices Q, K, and V, which serve as the input to the multi-head attention layer. Because the multi-head design lets each position establish correlations with the other positions in the sequence in several ways, there are multiple sets of Q, K, and V matrices. The calculation formula for the i-th set is as follows:
[0056] ;
[0057] ;
[0058] .
[0059] The attention tensor of the i-th output of the multi-head attention layer is calculated as follows:
[0060]
[0061] Where dK is the dimension of K.
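The formulas of paragraphs [0056] to [0060] do not appear in this text. Assuming they follow the standard scaled dot-product attention of the Transformer, which matches the surrounding description (three weight matrices per head, scaling by the dimension of K), they would take the following form. The weight-matrix symbols $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are assumed names.

```latex
\begin{aligned}
Q_i &= \mathrm{EmbeddingVec1}\, W_i^{Q},\\
K_i &= \mathrm{EmbeddingVec1}\, W_i^{K},\\
V_i &= \mathrm{EmbeddingVec1}\, W_i^{V},\\
\mathrm{Attention}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_K}}\right) V_i .
\end{aligned}
```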
[0062] Then, the attention tensors are concatenated and mapped to the original input space as the output MultiHead of the multi-head attention layer.
[0063] ;
[0064] .
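The formulas of paragraphs [0063] and [0064] are likewise absent here. Under the standard multi-head formulation, concatenating the $h$ attention tensors and mapping them back to the input space would read as below. The number of heads $h$ and the output projection $W^{O}$ are assumed symbols, and how AttentionOut1 is derived from MultiHead (directly, or after a residual connection and normalization) is not stated in this text.

```latex
\mathrm{MultiHead} = \mathrm{Concat}\left(\mathrm{Attention}_1, \ldots, \mathrm{Attention}_h\right) W^{O}
```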
[0065] The resulting multi-head attention layer output, AttentionOut1, is then input into the feedforward network layer to obtain FNNOut1. The feedforward network layer consists of a linear layer, a ReLU activation, and a second linear layer applied in sequence. The calculation formula is as follows:
[0066] .
[0067] The feedforward network layer also includes a residual connection structure. After the residual connection, the final output of the encoder, EncoderVec, is obtained. The formula for calculating the encoding tensor EncoderVec is as follows:
[0068] .
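The formulas of paragraphs [0066] and [0068] are missing from this text. A form consistent with the description (two linear maps with a ReLU between them, followed by a residual connection) is sketched below. The parameters $W_1$, $b_1$, $W_2$, and $b_2$ are assumed symbols; if layer normalization is used, it would wrap the residual sum.

```latex
\begin{aligned}
\mathrm{FNNOut1} &= \max\!\left(0,\ \mathrm{AttentionOut1}\, W_1 + b_1\right) W_2 + b_2,\\
\mathrm{EncoderVec} &= \mathrm{AttentionOut1} + \mathrm{FNNOut1}.
\end{aligned}
```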
[0069] (3) Transition layer:
[0070] The encoder output is mapped to a probability for each amino acid character. During training, the model learns to assign high probability to affinity-enhancing mutations such as H_G50W, so that the prediction at the masked position is as close to W as possible. During prediction, if H_G50W and H_I52W are scored as a combination, the output is the average of the probability of W at position 50 of the H chain and the probability of W at position 52 of the H chain.
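The mapping from model outputs to a combination score can be sketched as follows, assuming the final layer produces one logit per amino acid at each masked position and that probabilities are obtained with a softmax. The 20-letter alphabet ordering, the function names, and the logits are assumptions for illustration; a real vocabulary also contains special tokens.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_probability(logits, target):
    """Probability assigned to the target residue at one masked position."""
    return softmax(logits)[AMINO_ACIDS.index(target)]


def score_combination(site_logits, targets):
    """Average the target probabilities over all masked sites."""
    probs = [target_probability(l, t) for l, t in zip(site_logits, targets)]
    return sum(probs) / len(probs)


# Hypothetical logits: uniform at one site, W strongly favoured at the other.
uniform = [0.0] * 20
favours_w = [0.0] * 20
favours_w[AMINO_ACIDS.index("W")] = 5.0
score = score_combination([uniform, favours_w], ["W", "W"])
```

With these made-up logits, the first site contributes a probability of 1/20 and the second a probability close to 1, and the combination score is their mean.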
[0071] The higher-scoring light chain combinations are: L46G, Y50G, I53F, and N92Y, with a score of 0.5152; and L46G, Y50G, I53W, and N92Y, with a score of 0.5443. The higher-scoring heavy chain combination is D54F and S57F, with a score of 0.7.
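Sorting combinations from high to low score, as described in paragraph [0015], is straightforward once the scores are available. The sketch below uses the two light chain scores reported above; the function name `rank_combinations` is an assumption.

```python
# Scores reported in paragraph [0071] for the light chain combinations.
light_chain_scores = {
    ("L46G", "Y50G", "I53F", "N92Y"): 0.5152,
    ("L46G", "Y50G", "I53W", "N92Y"): 0.5443,
}


def rank_combinations(scores):
    """Return (combination, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank_combinations(light_chain_scores)
best_combination, best_score = ranked[0]
```

The top-ranked entries would then be taken forward to affinity testing.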
[0072] Example 2
[0073] This embodiment tests the effect of the mutant combination obtained in Example 1 on affinity.
[0074] (1) Preparation of antibodies with specific mutant combinations
[0075] Two CD138 antibodies can be designed based on the mutation combinations. Based on the heavy and light chains of SEQ ID NO.1 and SEQ ID NO.2, mutant antibody 1 is a combination mutation of heavy chain D54F and S57F, and a combination mutation of light chain L46G, Y50G, I53F and N92Y; mutant antibody 2 is a combination mutation of heavy chain D54F and S57F, and a combination mutation of light chain L46G, Y50G, I53W and N92Y.
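Building a mutant sequence from a list of point mutations can be sketched as below. The helper `apply_mutations` and the toy sequence are assumptions for illustration; the toy sequence is not SEQ ID NO.1 or SEQ ID NO.2, so the positions used here differ from those in the application.

```python
import re


def apply_mutations(sequence, mutations):
    """Apply point mutations such as "D54F" (1-based positions) to a sequence.

    Raises ValueError if a label is malformed or if the residue at the given
    position does not match the wild-type residue named in the label.
    """
    residues = list(sequence)
    for label in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
        if match is None:
            raise ValueError(f"unrecognised mutation label: {label!r}")
        wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues) or residues[position - 1] != wild_type:
            raise ValueError(f"{label}: wild-type residue not found at position {position}")
        residues[position - 1] = mutant
    return "".join(residues)


# Toy 6-residue sequence standing in for a variable region.
toy = "ADSGDK"
mutant = apply_mutations(toy, ["D2F", "S3F"])
```

Checking the wild-type residue before substitution guards against numbering mismatches between the label and the sequence.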
[0076] 1) Primer design: Design and synthesize primers for antibody light and heavy chain genes, and add appropriate restriction endonuclease recognition sequences to the 5' end of the primers.
[0077] 2) Gene synthesis: Genes are spliced using PCR reaction with designed primers to obtain the light and heavy chain DNA sequences of the antibody.
[0078] 3) Enzyme digestion reaction: The PCR product and the cloning vector are digested with the same restriction endonuclease to produce sticky ends.
[0079] 4) Ligation reaction: The enzyme-digested antibody gene fragment is ligated to the vector using DNA ligase.
[0080] 5) Transformation: The ligation product is transformed into competent E. coli.
[0081] 6) Screening and identification: Positive clones are screened with antibiotics, and colony PCR and sequence analysis are performed to confirm the correctness of the inserted fragment.
[0082] 7) Expression and purification: Transform the validated plasmid into a eukaryotic or prokaryotic expression system to induce the expression of antibody protein, and purify the protein using methods such as affinity chromatography.
[0083] (2) Affinity test
[0084] 1) Ligand capture: The purified antibody is diluted to an appropriate concentration with running buffer and immobilized on the surface of the PAHC200M chip.
[0085] 2) Analyte dilution: The analyte is serially diluted 3-fold in running buffer. The diluted analytes are then injected sequentially onto the chip surface from the lowest to the highest concentration, with binding and dissociation measured over the corresponding time intervals.
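The 3-fold dilution series and its injection order can be worked out as below. The top concentration of 900 nM and the five dilution points are hypothetical values chosen for illustration; the application does not state them.

```python
def serial_dilution(top_concentration, factor=3.0, steps=5):
    """Concentrations of a serial dilution, ordered from lowest to highest."""
    if factor <= 1 or steps < 1:
        raise ValueError("factor must exceed 1 and steps must be at least 1")
    series = [top_concentration / factor ** i for i in range(steps)]
    return series[::-1]  # injection order: lowest concentration first


# Hypothetical series: 900 nM top concentration, five points.
injections = serial_dilution(900.0, factor=3.0, steps=5)
```

Each value in `injections` is one third of the next, which matches the lowest-to-highest injection order described above.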
[0086] 3) Baseline equilibration: Before running the analyte, perform multiple buffer cycles to equilibrate the baseline. All binding and dissociation steps are performed in running buffer.
[0087] 4) Chip regeneration: After all analyte concentrations have been cycled, regenerate three times with 10 mM Glycine at pH 2.0 for 20 seconds each time to wash away ligands and undissociated analytes.
[0088] 5) Data acquisition and analysis: After collecting the data, perform kinetic analysis using Carterra software. Fit the data using a 1:1 Langmuir binding model or another suitable model, and calculate the KD value.
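For reference, the 1:1 Langmuir binding model describes the sensor response $R$ with the rate equations below, where $C$ is the analyte concentration, $R_{\max}$ the maximum response, $k_a$ the association rate constant, and $k_d$ the dissociation rate constant. These symbols are the conventional ones and are not defined in the application itself.

```latex
\begin{aligned}
\frac{dR}{dt} &= k_a\, C\, (R_{\max} - R) - k_d\, R && \text{(association phase)},\\
\frac{dR}{dt} &= -k_d\, R && \text{(dissociation phase, } C = 0\text{)},\\
K_D &= \frac{k_d}{k_a}.
\end{aligned}
```

A lower $K_D$ corresponds to higher affinity.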
[0089] The affinity results are shown in Table 1, indicating that the combination of mutation sites predicted in this application can significantly improve the affinity of the antibody.
[0090]
[0091] In summary, this application constructs a model for predicting high-affinity mutation sites of antibodies based on ESM model training and optimization. The resulting model can effectively predict and evaluate the naturalness of different mutation sequences, and thus infer the impact of mutations on antibody affinity. It can be effectively applied to antibody affinity maturation technology to predict effective combinations of mutation sites and guide the construction of high-affinity antibodies.
[0092] The applicant declares that the above description is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Those skilled in the art should understand that any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application fall within the protection and disclosure scope of this application.
Claims
1. A method for constructing a model to predict antibody high-affinity mutation sites, comprising: The amino acid sequences of mutant antibodies with enhanced affinity are used as the training set. The training set includes the amino acid sequence of each mutant antibody, the mutation site, and the amino acid after mutation, which serves as the prediction target. The ESM model is trained on the training set. Training includes masking the amino acid at the mutation site of the mutant antibody in the manner of a masked language model, predicting the amino acid at the masked site, outputting the probability of correctly predicting the target, and adjusting the model based on the output to obtain the model for predicting antibody high-affinity mutation sites.
2. The method for constructing a model for predicting antibody high-affinity mutation sites according to claim 1, wherein, when there is more than one masked mutation site, the mean of the probabilities of the correctly predicted targets is used as the output result.
3. The method for constructing a model for predicting antibody high-affinity mutation sites according to claim 1, wherein the adjustment continues until the output result is not lower than a preset threshold, which is 0.5.
4. A model for predicting antibody high-affinity mutation sites, which is constructed by the method for constructing a model for predicting antibody high-affinity mutation sites as described in any one of claims 1-3.
5. A method for predicting combinations of high-affinity mutation sites on antibodies, comprising: The amino acid sequence of the mutant antibody is input into the model for predicting high-affinity mutation sites of antibodies as described in claim 4. All single mutation sites are masked in the manner of a masked language model. The amino acid at each masked site is then predicted, and the probability of correctly predicting the target is output. The prediction results of the multiple mutation sites are averaged to obtain the score of the mutation site combination. The mutation site combination with the highest score is selected as the candidate antibody high-affinity mutation site combination.
6. The method for predicting combinations of antibody high-affinity mutation sites according to claim 5, wherein, The method further includes performing a single-point saturation mutation on the antibody to be predicted and testing affinity to obtain the mutated antibody and single-point mutation information.
7. An apparatus for predicting combinations of high-affinity mutation sites of antibodies, comprising a single-point mutation testing unit, a mutation combination prediction unit, and a verification unit; The single-point mutation testing unit is used to perform the following: performing single-point saturation mutagenesis on the antibody to be predicted and testing affinity to obtain the mutant antibody and single-point mutation information. The mutation combination prediction unit is used to perform the following: inputting the amino acid sequence of the mutant antibody into the model for predicting high-affinity mutation sites of antibodies as described in claim 4, and outputting the probability of correctly predicting the target. The prediction results of the multiple mutation sites are averaged to obtain the score of the mutation site combination. The mutation site combination with the highest score is selected as the candidate antibody high-affinity mutation site combination. The verification unit is used to perform the following: performing affinity testing on antibodies containing the candidate antibody high-affinity mutation site combination.
8. A high-affinity antibody comprising the combination of mutation sites predicted by the method for predicting combinations of high-affinity mutation sites of an antibody as described in claim 5 or 6.
9. A high-affinity CD138 antibody, wherein the high-affinity CD138 antibody is mutated based on the combination of mutation sites predicted by the method for predicting combinations of antibody high-affinity mutation sites as described in claim 5 or 6, on the basis of the heavy chain variable region of the amino acid sequence as shown in SEQ ID NO.1 and the light chain variable region of the amino acid sequence as shown in SEQ ID NO.2. Preferably, the mutations in the heavy chain variable region include D54F and S57F; Preferably, the mutations in the variable region of the light chain include combinations of L46G, Y50G, I53F and N92Y, or combinations of L46G, Y50G, I53W and N92Y.
10. An electronic device comprising: At least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for predicting combinations of high-affinity antibody mutation sites as described in claim 5 or 6.
11. A computer-readable storage medium having a computer program stored thereon, wherein, When the program is executed by a processor, it implements the method for predicting combinations of high-affinity mutation sites for antibodies as described in claim 5 or 6.