A method for predicting the functional fitness of AAV capsid proteins carrying an inserted 7-position short peptide sequence
By constructing a chain-star hybrid graph structure combining GNN and BiLSTM models, the nonlocal epistasis effect of the 7-position short peptide sequence in AAV capsid protein is explicitly modeled, solving the problems of insufficient prediction accuracy and generalization ability in existing technologies, and realizing high-precision functional prediction of the insertion of the 7-position short peptide sequence in AAV capsid protein.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DALIAN UNIV OF TECH
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for predicting the function of inserting a 7-position short peptide sequence into the AAV capsid protein suffer from problems such as strong bias in single-architecture generalization, difficulty in taking into account both structural association and sequence characteristics, and weak generalization ability in small samples. These issues prevent them from meeting the clinical translation needs of AAV vectors for cross-species and multi-tissue targeting.
We construct a graph neural network (GNN) model and a bidirectional long short-term memory (BiLSTM) model. The GNN operates on a chain-star hybrid graph structure to explicitly model the nonlocal epistatic effects between residues in the 7-position short peptide sequence, the BiLSTM extracts sequence context features, and a linear regression model fuses the predictions of the two models to achieve high-precision prediction.
The method improves the accuracy and generalization ability of functional prediction for AAV capsid proteins with an inserted 7-position short peptide sequence: the coefficient of determination rises from 0.845 to 0.884, and the Pearson correlation coefficient rises from 0.919 to 0.940. It also reduces the cost of wet-lab experiments and improves the robustness of the model.
Figure CN122245423A_ABST
Description
Technical Field
[0001] This invention belongs to the interdisciplinary fields of bioinformatics, computational biology and artificial intelligence, and specifically relates to a method for predicting the functional fitness of AAV capsid proteins with an inserted 7-position short peptide sequence.
Background Technology
[0002] Adeno-Associated Virus (AAV), with its advantages of low immunogenicity, long-lasting gene expression, and broad-spectrum cell tropism, has become one of the most promising vectors for clinical translation in in vivo gene therapy, bringing revolutionary hope for the treatment of intractable diseases such as hereditary and neurodegenerative diseases. However, wild-type AAV vectors still face three major bottlenecks: insufficient tissue targeting precision, safety risks caused by off-target effects, and low transduction efficiency in special tissues such as the central nervous system and retina. These problems severely restrict the scope of their clinical application and the improvement of therapeutic effects. To overcome these difficulties, capsid protein engineering has become a key strategy. Among them, inserting exogenous short peptides into the hypervariable regions (HVRs) on the capsid surface has been proven to be an efficient means of achieving precise optimization of targeting.
[0003] Compared with long-peptide insertions, which can disorder the capsid structure and sharply reduce packaging efficiency, the insertion of short peptides 7 amino acids in length (i.e., 7-position short peptide sequences) has distinct technical advantages and practical value. On the one hand, a 7-position short peptide sequence is large enough to adopt a conformation that forms a functional receptor-binding site, enabling targeted enrichment and efficient internalization of the vector through precise interaction with specific proteins on the surface of target tissue cells. On the other hand, it interferes only minimally with the natural folding and assembly of the capsid, thus preserving as far as possible the in vivo stability of rAAV and its potential for large-scale production. Numerous studies have confirmed the effectiveness of inserting a 7-position short peptide sequence. The variant AAV-PHP.B, obtained by inserting the sequence TLAVPFK into the VR-VIII loop of the AAV9 capsid, crosses the blood-brain barrier by specifically binding the LY6A protein on the surface of mouse cerebral vascular endothelial cells, giving a nearly 100-fold increase in central nervous system transduction efficiency after systemic administration compared with wild-type AAV9. The retina-targeting variant AAV2-7m8, produced by a 7-peptide insertion, significantly enhances the transduction specificity for retinal photoreceptor cells after intravitreal injection and greatly reduces the dosage and off-target toxicity of ocular gene therapy. A more representative study is "Systematic multi-trait AAV capsid engineering for efficient gene delivery" (the Fit4Function strategy), published in Nature Communications in 2024. By constructing a sequence-function mapping of 7-position short peptide sequences with a long short-term memory network (LSTM) model, that study developed AAV variants combining high production efficiency with cross-species liver targeting. Their liver transduction efficiency in cynomolgus and rhesus monkeys was significantly better than that of AAV9, and some variants showed up to 1000 times higher transduction efficiency than AAV9 in human hepatocytes. The study thereby achieved accurate prediction and engineering design of multiple traits of 7-peptide insertion vectors.
[0004] Despite the significant advantages of the 7-position short peptide sequence insertion strategy and the preliminary exploration of related prediction models, practical applications still face severe challenges. The theoretical number of 7-position short peptide sequences alone is $20^7$, approximately 1.28 billion, and the influence of the inserted sequence on AAV capsid function exhibits extremely complex nonlinear epistatic effects. These include residue charge interactions, hydrogen bond formation, steric hindrance, and the global constraints that macroscopic physicochemical properties of the inserted sequence, such as overall hydrophobicity and net charge, place on capsid assembly and receptor recognition. Existing prediction models for 7-position short peptide sequences are insufficient to meet the precision requirements of AAV vector modification. Furthermore, in the field of AAV targeting and production efficiency prediction, traditional directed evolution is essentially experimental screening rather than model prediction; it cannot cover the massive number of sequence combinations and is inefficient.
[0005] The Fit4Function strategy employs sequence modeling, but because it only propagates features within a one-dimensional sequence space and lacks an explicit modeling mechanism for non-adjacent topological relationships between residues, it cannot fully characterize the global epistatic effect generated by the synergistic interaction of multiple residues in a short peptide sequence at position 7. In the field of general bioactivity prediction, Transformer models such as PeptideBERT rely on massive pre-training data and are prone to overfitting in small AAV sample scenarios; XGBoost models such as XG-M7G, while quantifying the contribution of single residues, cannot explain the epistatic effect between residues; and sequence alignment models such as TargetScan7.2 rely solely on conservation analysis, completely ignoring site-specific interactions. Furthermore, all existing models suffer from inductive bias limitations: single architectures (such as RNNs biased towards sequence order) exhibit large prediction variance and weak generalization ability in short sequence + complex interaction scenarios, making it difficult to adapt to the clinical translation needs of AAV vectors targeting multiple species and tissues.
[0006] Based on the aforementioned research background and technological gaps, this work builds on and optimizes the model of the Fit4Function strategy. It aims to develop a method for predicting the function of AAV capsid proteins with an inserted 7-position short peptide sequence that integrates sequence context features with graph-structure topology features and remains highly robust under small-sample conditions. This will provide technical support for the rational design of next-generation multi-trait optimized rAAV vectors and promote the upgrade of gene therapy from broad-spectrum delivery to precise targeting.
Summary of the Invention
[0007] The purpose of this invention is to overcome the shortcomings of existing functional prediction techniques for the insertion of a 7-position short peptide sequence in AAV capsid proteins, which suffer from strong inductive bias in a single architecture, difficulty in simultaneously considering structural associations and sequence features, and weak generalization ability under small sample conditions. This invention provides a structure-sequence hybrid prediction method. This method constructs a graph neural network model (GNN) with complementary inductive biases and a bidirectional long short-term memory network model (BiLSTM). It then uses a linear regression model to fuse the prediction results of the two models, achieving high-precision prediction of key functions such as targeting, transduction efficiency, and production adaptability of the inserted 7-position short peptide sequence in AAV capsid proteins. This method introduces a chain-star hybrid graph structure containing a global master node to explicitly model the nonlocal epistatic effect between residues in the 7-position short peptide sequence. Simultaneously, it uses the bidirectional long short-term memory network model to extract sequence context features and fuses the two features through a linear regression model. This overcomes the shortcomings of a single architecture with excessive inductive bias, difficulty in simultaneously considering structural associations and sequence features, and weak generalization ability under small sample conditions.
[0008] A method for predicting the functional fitness of AAV capsid proteins with an inserted 7-position short peptide sequence, comprising the following steps:
Step A: Data preprocessing. Acquire functional sequencing data containing different 7-position short peptide sequences and their corresponding next-generation sequencing (NGS) data, and apply a logarithmic transformation to the sequencing frequency values to obtain the functional fitness score, which serves as the target value for all model training and prediction. Then, for each 7-position short peptide sequence, extract sequence-level protein parameter features characterizing its overall physicochemical properties, and construct a basic feature vector for each amino acid residue by fusing the amino acid one-hot encoding vector with the physicochemical property feature vector. Divide all 7-position short peptide sequences and their corresponding functional fitness scores into a training set, a validation set, and a test set according to a set ratio.
[0009] Step B: Feature extraction based on the chain-star hybrid graph. Construct a chain-star hybrid graph structure from the 7-position short peptide sequence. The structure includes 7 amino acid residue nodes and 1 global master node that aggregates global features. The 7 amino acid residue nodes are position-encoded to obtain position encoding vectors. For each amino acid residue node, the basic feature vector from step A is concatenated with the position encoding vector of that node to form its initial feature vector. The feature vector of the global master node is initialized to a zero vector. The graph neural network model GNN learns features on the chain-star hybrid graph structure: neighborhood information is aggregated using attention weights computed between nodes, updating the features of the global master node and the amino acid residue nodes. A hybrid readout strategy is then adopted. First, the one-hot encoded feature matrix and the physicochemical property feature matrix of the 7 amino acid residues are each average-pooled. The two pooled results are then concatenated with the updated global master node features and the sequence-level protein parameter feature vector obtained in step A. After a fully connected regression layer, trained with supervision, the first prediction result is output.
Step C: Feature extraction based on sequence context. Following the amino acid order of the 7-position short peptide, input the basic feature vectors obtained in step A into the bidirectional long short-term memory network BiLSTM to obtain a bidirectional contextual hidden state representation for each amino acid residue position. An attention mechanism performs a weighted aggregation of these hidden state representations to extract the sequence context features, which are processed by a fully connected regression layer, trained with supervision, to output the second prediction result.
Step D: Model ensemble based on linear regression. Using the first prediction result and the second prediction result as input features, and the functional fitness score obtained in step A as the target value, fit a linear regression model to obtain its optimal weight coefficients and bias term.
[0010] Step E: Prediction output. The 7-position short peptide sequences to be predicted are preprocessed and mapped, then input into the trained GNN and BiLSTM models to obtain the corresponding first and second prediction results. These results are substituted into the linear regression model from step D to calculate the final functional fitness score.
[0011] Furthermore, the functional fitness score in step A is calculated as follows:
$$y = \log(f + \epsilon)$$
where $y$ is the functional fitness score, used as the target value for all model training and prediction; $f$ is the sequencing frequency value of the 7-position short peptide sequence; and $\epsilon$ is a non-zero smoothing factor that avoids an undefined logarithm when the sequencing frequency value is zero.
[0012] Further, in step A, the extraction of the basic feature vector includes: mapping each amino acid residue in the 7-position short peptide sequence to a 20-dimensional amino acid one-hot encoding vector; extracting 9-dimensional physicochemical property features for each amino acid residue, specifically hydrophobicity, net charge, polarity, relative accessible surface area, van der Waals volume, hydrogen bond donating ability, hydrogen bond accepting ability, α-helix propensity, and β-sheet propensity; and fusing the one-hot encoding vector with the physicochemical property feature vector to obtain a 29-dimensional basic feature vector $\mathbf{x}_i$ for each amino acid residue:
$$\mathbf{x}_i = [\mathbf{o}_i \,\|\, \mathbf{p}_i]$$
where $i \in \{1, 2, \dots, 7\}$ is the position index of a single amino acid residue within the 7-position short peptide sequence; $\mathbf{o}_i$ is the one-hot encoded vector of the amino acid residue; $\mathbf{p}_i$ is the physicochemical property feature vector of the amino acid residue; and $\|$ denotes vector concatenation.
[0013] Further, the graph structure data mentioned in step B adopts a chain-star hybrid graph structure, constructed as follows: the node set of the chain-star hybrid graph structure contains 7 amino acid residue nodes and 1 global master node; the edge set is the union of chain edges and star edges, where chain edges connect adjacent amino acid residue nodes and star edges connect the global master node to all amino acid residue nodes. For the $i$-th amino acid residue node, the initial feature vector $\mathbf{h}_i^{(0)}$ is calculated using the following formula:
$$\mathbf{h}_i^{(0)} = [\mathbf{x}_i \,\|\, \mathbf{pe}_i]$$
where $\mathbf{h}_i^{(0)}$ is the initial input feature vector of the $i$-th amino acid residue node; $i \in \{1, 2, \dots, 7\}$ is the index of the amino acid in the sequence; $\mathbf{x}_i$ is the 29-dimensional basic feature vector of the amino acid residue; $\|$ denotes vector concatenation; and $\mathbf{pe}_i$ is the 7-dimensional position encoding vector representing the relative position of the amino acid in the 7-position short peptide sequence. The dimension corresponding to the position of the amino acid residue in the sequence is 1, and the other dimensions are 0.
[0014] Further, in step B, the graph neural network model includes a multi-layer dynamic graph attention convolutional network (GATv2). In each GATv2 layer, an attention score is constructed for each node pair by linearly transforming and combining the feature vectors of the node and its neighbor, representing the structural correlation between the two nodes. The attention scores are normalized to obtain the attention coefficients between the node and its neighbors, and the neighbor features are weighted by these coefficients and aggregated to update the representation of the current node. Because the dynamic attention mechanism lets the attention weights adapt to changes in node features, it strengthens the modeling of local structural differences in the AAV capsid protein and of the perturbation caused by inserting a 7-position short peptide sequence. The feature update process specifically includes: first calculating the attention score $e_{ij}^{(l)}$ between node $i$ and node $j$:
$$e_{ij}^{(l)} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\!\left(\mathbf{W}\mathbf{h}_i^{(l)} + \mathbf{W}\mathbf{h}_j^{(l)}\right)$$
then calculating the attention coefficient $\alpha_{ij}^{(l)}$ between node $i$ and its neighboring node $j$ in the $l$-th GATv2 layer:
$$\alpha_{ij}^{(l)} = \frac{\exp\!\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(e_{ik}^{(l)}\right)}$$
and, based on the attention coefficients, aggregating neighborhood information to update the node features:
$$\mathbf{h}_i^{(l+1)} = \mathrm{ELU}\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\,\mathbf{W}\mathbf{h}_j^{(l)}\right)$$
where $\mathbf{a}^{\top}$ is the transpose of the attention weight vector; $\mathrm{LeakyReLU}$ is the leaky rectified linear unit activation function; $\mathbf{W}$ is the linear transformation weight matrix; $\mathbf{h}_i^{(l)}$ and $\mathbf{h}_j^{(l)}$ are the feature vectors of node $i$ and neighboring node $j$ in the $l$-th layer; $\exp$ is the exponential function; $\mathbf{h}_i^{(l+1)}$ is the feature vector of node $i$ after the GATv2 update; $\mathcal{N}(i)$ is the neighborhood of node $i$; and $\mathrm{ELU}$ is the ELU activation function.
[0015] Furthermore, in step B, the first prediction result is output using a hybrid readout strategy, calculated as follows:
$$\hat{y}_{\mathrm{GNN}} = \mathrm{FC}\!\left(\left[\mathbf{h}_g^{(L)} \,\|\, \mathrm{MeanPool}(\mathbf{O}) \,\|\, \mathrm{MeanPool}(\mathbf{P}) \,\|\, \mathbf{s}\right]\right)$$
where $\mathbf{h}_g^{(L)}$ is the final feature vector of the global master node after multiple GATv2 feature updates; $\mathrm{MeanPool}$ denotes the average pooling operation; $\mathbf{O}$ is the one-hot encoded feature matrix composed of the one-hot encoded vectors of the seven amino acid residues in the 7-position short peptide sequence; $\mathbf{P}$ is the physicochemical property feature matrix composed of the physicochemical property feature vectors of the seven amino acid residues; $\mathbf{s}$ is the sequence-level protein parameter feature vector of the 7-position short peptide sequence; and $\mathrm{FC}$ is a fully connected regression layer.
[0016] Furthermore, in step C, the BiLSTM feature extraction specifically includes:
Input layer: construct an input tensor sequence of shape $7 \times 29$ (sequence length by feature dimension).
Bidirectional long short-term memory (BiLSTM) layer: the BiLSTM scans each 7-position short peptide sequence forward and backward, processing the feature vector of each amino acid residue in turn and outputting the hidden state $\mathbf{h}_t$ for the corresponding position $t$:
$$\mathbf{h}_t = \mathrm{BiLSTM}(\mathbf{x}_t), \quad t = 1, \dots, 7$$
where $\mathbf{x}_t$ is the basic feature vector of the amino acid residue at position $t$ of the 7-position short peptide sequence.
[0017] Attention layer: calculate the normalized attention weights $\alpha_t$ and generate the weighted context vector $\mathbf{c}$:
$$\mathbf{u}_t = \tanh(\mathbf{W}_a \mathbf{h}_t + \mathbf{b}_a), \qquad \alpha_t = \frac{\exp(\mathbf{u}_t^{\top}\mathbf{u}_w)}{\sum_{k=1}^{7} \exp(\mathbf{u}_k^{\top}\mathbf{u}_w)}, \qquad \mathbf{c} = \sum_{t=1}^{7} \alpha_t \mathbf{h}_t$$
where $\mathbf{u}_t$ is the feature representation of the hidden state at position $t$ after the linear transformation; $\tanh$ is the hyperbolic tangent function, which maps the linearly transformed value into $[-1, 1]$; $\mathbf{W}_a$ is a learnable weight matrix; $\mathbf{b}_a$ is the bias term; $\mathbf{u}_w$ is the learnable context query vector; $\alpha_t$ is the normalized attention weight for position $t$; and $\mathbf{c}$ is the final weighted context vector.
[0018] Output layer: the vector $\mathbf{c}$ is input into the fully connected layer to obtain the second prediction result $\hat{y}_{\mathrm{BiLSTM}}$:
$$\hat{y}_{\mathrm{BiLSTM}} = \mathrm{FC}(\mathbf{c})$$
where $\mathbf{c}$ is the final weighted context vector and $\mathrm{FC}$ is a fully connected regression layer.
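The attention layer described above can be illustrated with a minimal pure-Python sketch. It assumes the common additive-attention form (tanh projection, dot product with a context query vector, softmax over positions); the function name `attention_pool` and all numeric values are hypothetical, and the hidden states here are toy numbers rather than real BiLSTM outputs.

```python
import math

def attention_pool(hidden, W, b, u_w):
    """Attention pooling over per-position hidden states.

    hidden: list of T vectors of length d (stand-ins for BiLSTM outputs h_t)
    W: a x d matrix, b: length-a bias, u_w: length-a context query vector
    Returns (alpha, c): normalized weights and the weighted context vector.
    """
    scores = []
    for h in hidden:
        # u_t = tanh(W h_t + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        # unnormalized score u_t^T u_w
        scores.append(sum(ui * qi for ui, qi in zip(u, u_w)))
    # softmax over positions, shifted by the max for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # c = sum_t alpha_t h_t
    d = len(hidden[0])
    c = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(d)]
    return alpha, c

# Toy example: 7 positions, hidden size 4, attention size 3 (hypothetical values).
hidden = [[0.1 * (t + 1) * (k + 1) for k in range(4)] for t in range(7)]
W = [[0.5, -0.2, 0.1, 0.0], [0.0, 0.3, -0.1, 0.2], [0.1, 0.1, 0.1, 0.1]]
b = [0.0, 0.1, -0.1]
u_w = [1.0, -0.5, 0.25]
alpha, c = attention_pool(hidden, W, b, u_w)
```

The context vector `c` would then feed a regression layer; in a real model the weights are learned jointly with the BiLSTM rather than fixed by hand.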
[0019] Furthermore, the training process of the GNN and BiLSTM specifically includes: using the training set, validation set, and test set divided in step A, wherein the validation set does not participate in the gradient update of the model; iteratively training the GNN and BiLSTM using the training set respectively, and using the mean squared error as the loss function for parameter optimization; introducing an early stopping mechanism to monitor model performance: after each training epoch, calculating the loss value of the model on the validation set; setting a patience value of N training epochs, and terminating training if the validation set loss value does not decrease within N consecutive training epochs; saving the model parameters corresponding to the minimum validation set loss value during training as the optimal weights.
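The early stopping rule can be sketched independently of any deep learning framework. The helper below is hypothetical (not taken from the patent): it takes a precomputed list of per-epoch validation losses and reports where training would stop and which epoch's weights would be kept.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve on the best
    value for `patience` consecutive epochs. best_epoch marks the checkpoint
    whose weights would be kept. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the patience limit.
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With these toy losses and a patience of 3, training stops at epoch 5 and the weights from epoch 2 are kept; a patience of 4 would let training reach the lower loss at epoch 6.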
[0020] Further, in step D, the weight learning and parameter fitting of the linear regression model specifically includes: taking the validation set obtained in step A and preprocessing and mapping its 7-position short peptide sequences; loading the GNN and BiLSTM models trained in steps B and C with their optimal weights, and running inference on the preprocessed feature representation of each validation sequence to obtain the first prediction result $\hat{y}_{\mathrm{GNN}}$ and the second prediction result $\hat{y}_{\mathrm{BiLSTM}}$ of the validation set sequences; then linearly combining $\hat{y}_{\mathrm{GNN}}$ and $\hat{y}_{\mathrm{BiLSTM}}$ and, using the functional fitness scores $y$ of the validation set sequences, fitting the weight coefficients $w_1$, $w_2$ and the bias term $b$ of the linear regression model by minimizing the mean squared error. The calculation formula is as follows:
$$(w_1, w_2, b) = \arg\min_{w_1, w_2, b} \frac{1}{N} \sum_{n=1}^{N} \left(y_n - \left(w_1 \hat{y}_{\mathrm{GNN},n} + w_2 \hat{y}_{\mathrm{BiLSTM},n} + b\right)\right)^2$$
where $N$ is the number of validation set sequences. After fitting, the weight coefficients $w_1$, $w_2$ and the bias term $b$ are the parameters of the trained linear regression model and are used for the prediction output in step E.
[0021] Further, in step E, the functional fitness score of the 7-position short peptide sequence to be predicted is computed with the linear regression model fitted in step D, as follows: the sequence to be predicted undergoes the same preprocessing as in step A, including generation of the amino acid one-hot encoding vectors, physicochemical property feature vectors, position encoding vectors, and sequence-level protein parameter features. The preprocessed features are then input into the trained GNN and BiLSTM models, according to each model's input requirements, to obtain the first prediction result $\hat{y}_{\mathrm{GNN}}$ and the second prediction result $\hat{y}_{\mathrm{BiLSTM}}$ of the sequence. $\hat{y}_{\mathrm{GNN}}$ and $\hat{y}_{\mathrm{BiLSTM}}$ are input into the linear regression model fitted in step D, which calculates and outputs the final predicted value $\hat{y}$:
$$\hat{y} = w_1 \hat{y}_{\mathrm{GNN}} + w_2 \hat{y}_{\mathrm{BiLSTM}} + b$$
where $w_1$, $w_2$ and $b$ are the parameters of the linear regression model fitted in step D.
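Steps D and E amount to a two-input stacking regressor. The sketch below fits the weights by ordinary least squares through the normal equations, which minimizes the mean squared error; the function names and the toy validation predictions are hypothetical, and a library routine such as a standard linear regression fit would normally replace the hand-written solver.

```python
def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_stacker(pred_gnn, pred_lstm, targets):
    """Least-squares fit of y ~ w1*pred_gnn + w2*pred_lstm + b."""
    rows = [[p1, p2, 1.0] for p1, p2 in zip(pred_gnn, pred_lstm)]
    # Normal equations: (X^T X) theta = X^T y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    w1, w2, b = solve_linear_system(XtX, Xty)
    return w1, w2, b

def predict_stacked(p_gnn, p_lstm, w1, w2, b):
    """Final score for one sequence from the two base-model predictions."""
    return w1 * p_gnn + w2 * p_lstm + b

# Hypothetical validation-set predictions; targets built from known weights
# so the fit can be checked.
val_gnn = [0.2, 1.1, -0.4, 0.9, 1.8]
val_lstm = [0.5, 0.7, -0.1, 1.4, 1.5]
val_y = [0.6 * g + 0.4 * l + 0.05 for g, l in zip(val_gnn, val_lstm)]
w1, w2, b = fit_stacker(val_gnn, val_lstm, val_y)
final_score = predict_stacked(1.0, 2.0, w1, w2, b)
```

Fitting the combiner on held-out validation predictions, as the text specifies, keeps the base models' training-set overfitting from leaking into the ensemble weights.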
[0022] Compared with the prior art, the present invention has the following beneficial effects: improved prediction accuracy and generalization ability. For the task of predicting the function of AAV capsid proteins with an inserted 7-position short peptide sequence, existing techniques (such as the Fit4Function strategy) establish a sequence-function mapping but remain limited in goodness of fit. Through the ensemble prediction of the GNN and BiLSTM, and under exactly the same test conditions, the present invention raises the coefficient of determination ($R^2$) from 0.845 to 0.884 and the Pearson correlation coefficient (Pearson r) from 0.919 to 0.940 relative to existing techniques. This indicates that the present invention explains more of the variance in the data and can effectively reduce the cost of wet-lab experiments.
Attached Figure Description
[0023] Figure 1 is a schematic diagram of the overall process of the method of the present invention.
[0024] Figure 2 is a schematic diagram of the chain-star hybrid graph topology of the 7-mer insertion sequence in this invention.
[0025] Figure 3 is a model architecture diagram of the GNN in this invention.
[0026] Figure 4 is a model architecture diagram of the BiLSTM in this invention.
[0027] Figure 5 is a scatter plot of the predictions of the GNN, BiLSTM and linear regression model on the test set in this invention, where (a) is the prediction result of the GNN on the test set, (b) is the prediction result of the BiLSTM on the test set, and (c) is the prediction result of the linear regression model on the test set.
[0028] Figure 6 is a performance comparison chart between the model of this invention and the benchmark model under different training set sizes. The blue curve represents the model of this invention, and the orange curve represents the benchmark model; the shaded area around each curve represents the standard deviation over multiple independent experiments, reflecting the stability of the model.
Detailed Implementation
[0029] To describe the present invention in more detail, the technical solution of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
[0030] This experiment takes the prediction of viral production fitness by inserting a 7-position short peptide sequence between positions 588 and 589 of the AAV capsid VP1 protein as an example, and constructs a linear regression prediction method based on GNN and BiLSTM.
[0031] Step A: Data Preprocessing
1. Data Source: The modeling_library_production_fitness.csv dataset from Fit4Function was selected. It contains approximately 75,000 AAV variants with 7-position short peptide sequences and their corresponding raw production data from next-generation sequencing (NGS).
[0032] 2. Construction of Functional Fitness Score: A non-zero smoothing factor was introduced, and a logarithmic transformation was applied to the sequencing frequency to construct the functional fitness score label. The calculation formula is as follows:
$$y = \log(f + \epsilon)$$
where $y$ is the functional fitness score, used as the target value for all model training and prediction; $f$ is the production value of the 7-position short peptide sequence; and $\epsilon$ is a non-zero smoothing factor that avoids an undefined logarithm when the sequencing frequency is zero. In this experiment, the value of $\epsilon$ was determined by a hyperparameter grid search on the validation set and set to 0.005. Experiments show that this value best balances the skewness of the data distribution, making model training more stable.
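As a small illustration of the label construction, the sketch below applies the smoothed log transform with the smoothing factor 0.005 reported above. The natural logarithm is an assumption (the text only says "logarithmic"), and the function name `fitness_score` is hypothetical.

```python
import math

def fitness_score(frequency, epsilon=0.005):
    """Log-transform a sequencing frequency with a non-zero smoothing factor.

    The natural log is an assumption; the source does not state the base.
    """
    if frequency < 0:
        raise ValueError("frequency must be non-negative")
    return math.log(frequency + epsilon)

# A zero frequency stays finite thanks to the smoothing factor.
scores = [fitness_score(f) for f in (0.0, 0.01, 1.0)]
```

A different log base would only rescale the labels by a constant, so it would not change rank-based comparisons between sequences.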
[0033] 3. Basic Feature Extraction: For each 7-position short peptide sequence, the following features are extracted for subsequent processes: One-hot encoding vector: each of the 20 natural amino acid residues is mapped to a 20-dimensional sparse vector; Physicochemical property vector: 9-dimensional physicochemical features are extracted for each amino acid residue, namely hydrophobicity, net charge, polarity, relative accessible surface area, van der Waals volume, hydrogen bond donating ability, hydrogen bond accepting ability, α-helix propensity, and β-sheet propensity. Concatenating the two types of features gives a 29-dimensional basic feature vector $\mathbf{x}_i$ for each amino acid residue:
$$\mathbf{x}_i = [\mathbf{o}_i \,\|\, \mathbf{p}_i]$$
where $i \in \{1, 2, \dots, 7\}$ is the index of the amino acid in the sequence; $\mathbf{o}_i$ is the one-hot encoding of the amino acid residue; $\mathbf{p}_i$ is the physicochemical property encoding of the amino acid residue; and $\|$ denotes vector concatenation.
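The 29-dimensional residue features can be assembled as in the following sketch. The residue ordering in `AMINO_ACIDS` and the all-zero placeholder table are assumptions made only to exercise the shapes; the source does not specify which physicochemical scales supply the nine descriptors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
N_PHYSCHEM = 9

def one_hot(residue):
    """20-dimensional one-hot vector for a single residue."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1.0
    return vec

def physchem_features(residue, table):
    """Look up the 9 physicochemical descriptors of a residue.

    `table` is a hypothetical dict {residue: [9 floats]}; real values would
    come from published amino-acid scales not specified in the text.
    """
    return list(table[residue])

def basic_features(peptide, table):
    """Return 7 vectors of length 29 (one-hot followed by descriptors)."""
    if len(peptide) != 7:
        raise ValueError("expected a 7-residue peptide")
    return [one_hot(aa) + physchem_features(aa, table) for aa in peptide]

# Placeholder table (all zeros), used only to check dimensions.
placeholder = {aa: [0.0] * N_PHYSCHEM for aa in AMINO_ACIDS}
X = basic_features("TLAVPFK", placeholder)
```

The example peptide TLAVPFK is the AAV-PHP.B insert mentioned in the background section.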
[0034] 4. Sequence-level protein parameter feature extraction: For each 7-position short peptide sequence, a sequence-level protein parameter feature vector $\mathbf{s}$ is extracted to characterize the overall physicochemical properties of the short peptide.
[0035] The sequence-level protein parameter features include: short peptide sequences, molecular weight, theoretical isoelectric point, amino acid composition ratio, aromatic amino acid ratio, aliphatic index, instability index, average hydrophobicity, etc. These features are subsequently used as global sequence features in the GNN's hybrid readout strategy to generate the first prediction result.
[0036] 5. Dataset Partitioning: To ensure the rigor of model evaluation and a fair comparison with existing technologies, this experiment adopts the same dataset partitioning strategy as Fit4Function: Training set: 24,000 sample sequences, with their corresponding functional fitness scores as labels, were randomly selected from the 74,500 sequences in the dataset and used to train the parameters of the graph neural network model GNN and the bidirectional long short-term memory network model BiLSTM.
[0037] Validation set: 5000 sample sequences and their labels are selected from the remaining data and are strictly excluded from the parameter updates of GNN and BiLSTM.
[0038] Test set: All other samples in the dataset constitute an independent test set, which is used to evaluate the performance of all models.
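The partition described above (24,000 training, 5,000 validation, remainder as test) can be sketched as a seeded shuffle. The seed value and the function name `split_dataset` are assumptions; reproducing the exact Fit4Function split would require its original indices, which are not given here.

```python
import random

def split_dataset(records, n_train=24000, n_val=5000, seed=0):
    """Shuffle records and split them into train / validation / test lists."""
    if n_train + n_val > len(records):
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seed is a hypothetical choice
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Integers stand in for (sequence, fitness score) records.
records = list(range(74500))
train, val, test = split_dataset(records)
```

With 74,500 records this leaves 45,500 sequences in the test set, and the three subsets are disjoint by construction.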
[0039] Step B: Feature extraction based on chain-star hybrid graph
1. Chain-star hybrid graph structure construction: For each 7-position short peptide sequence, a chain-star hybrid graph structure is constructed, with the nodes and edges defined as follows: Nodes: 7 amino acid residue nodes and 1 global master node; Edges: chain edges connecting adjacent amino acid residue nodes, and star edges connecting the global master node to all amino acid residue nodes.
[0040] 2. Node Initialization: For each 7-position short peptide sequence, construct a graph consisting of 7 amino acid nodes plus 1 global master node. For the $i$-th amino acid residue node, the initial feature vector $\mathbf{h}_i^{(0)}$ is calculated as follows:
$$\mathbf{h}_i^{(0)} = [\mathbf{x}_i \,\|\, \mathbf{pe}_i]$$
where $\mathbf{h}_i^{(0)}$ is the initial input feature vector of the $i$-th amino acid residue node; $i \in \{1, 2, \dots, 7\}$ is the index of the amino acid in the sequence; $\mathbf{x}_i$ is the 29-dimensional basic feature vector of the amino acid residue; $\|$ denotes vector concatenation; and $\mathbf{pe}_i$ is the 7-dimensional position encoding vector representing the relative position of the amino acid in the 7-position short peptide sequence. The dimension corresponding to the position of the amino acid residue in the sequence is 1, and the other dimensions are 0.
[0041] The initial feature vector $\mathbf{h}_g^{(0)}$ of the global master node contained in the graph is initialized as a zero vector with the same dimension as the feature of a single amino acid residue node. In subsequent processes, it serves as the convergence center for feature updates, integrating the topological information of the entire graph.
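The graph construction and node initialization can be sketched as follows. Storing each undirected edge as two directed edges is an assumption (a common convention in graph libraries, not stated in the text), as are the names `chain_star_edges` and `initial_node_features` and the choice of index 7 for the master node.

```python
N_RES = 7
MASTER = 7  # index of the global master node (hypothetical convention)

def chain_star_edges():
    """Directed edge list (both directions) for the chain-star hybrid graph."""
    edges = []
    for i in range(N_RES - 1):      # chain edges between adjacent residues
        edges += [(i, i + 1), (i + 1, i)]
    for i in range(N_RES):          # star edges between master and residues
        edges += [(MASTER, i), (i, MASTER)]
    return edges

def initial_node_features(basic):
    """basic: 7 vectors of 29 floats -> 8 node vectors of 36 floats."""
    nodes = []
    for i, x in enumerate(basic):
        pos = [0.0] * N_RES
        pos[i] = 1.0                # 7-dimensional one-hot position encoding
        nodes.append(list(x) + pos)
    nodes.append([0.0] * len(nodes[0]))  # master node starts as a zero vector
    return nodes

edges = chain_star_edges()
nodes = initial_node_features([[0.0] * 29 for _ in range(N_RES)])
```

The 6 chain edges and 7 star edges give 13 undirected edges, so any two residues are at most two hops apart through the master node, which is how non-adjacent residues can exchange information.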
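[0042] 3. Feature Update Based on Graph Attention Mechanism (GATv2): A multi-layer dynamic graph attention convolutional network (GATv2) is used to learn features on the chain-star hybrid graph. The feature update process specifically includes the following. First, the attention score $e_{ij}^{(l)}$ between node $i$ and node $j$ is calculated:
$$e_{ij}^{(l)} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\left(\mathbf{W}\left[\mathbf{h}_i^{(l)} \,\Vert\, \mathbf{h}_j^{(l)}\right]\right)$$
Then, the attention coefficient $\alpha_{ij}^{(l)}$ between node $i$ and its neighboring node $j$ in the $l$-th layer of the dynamic graph attention convolutional network is calculated:
$$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$$
Based on the attention coefficients, neighborhood information is aggregated to update the node features:
$$\mathbf{h}_i^{(l+1)} = \mathrm{ELU}\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\,\mathbf{W}\,\mathbf{h}_j^{(l)}\right)$$
where $\mathbf{a}^{\top}$ represents the transpose of the attention weight vector; $\mathrm{LeakyReLU}$ represents the leaky rectified linear unit activation function; $\mathbf{W}$ is the linear transformation weight matrix; $\mathbf{h}_i^{(l)}$ and $\mathbf{h}_j^{(l)}$ represent the feature vectors of node $i$ and its neighboring node $j$ in the $l$-th layer, respectively; $\Vert$ represents vector concatenation; $\exp$ is the exponential function; $\mathbf{h}_i^{(l+1)}$ is the feature vector of node $i$ after the GATv2 update; $\mathcal{N}(i)$ is the neighborhood set of node $i$; and $\mathrm{ELU}$ is the ELU activation function.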
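A single attention update of this kind can be sketched in plain Python. This is a hedged illustration rather than the patented implementation: the split of the transformation matrix into a left block `W_l` (applied to the node itself) and a right block `W_r` (applied to the neighbor, and reused for aggregation) follows common GATv2 implementations and is an assumption, as are the LeakyReLU slope of 0.2, the hidden size of 8, the absence of self-loops, and the random weights.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(v, slope=0.2):  # slope value is an assumption
    return [x if x > 0 else slope * x for x in v]

def elu(v):
    return [x if x > 0 else math.exp(x) - 1.0 for x in v]

def gatv2_layer(h, adj, W_l, W_r, a):
    """One GATv2-style update: h[i] <- ELU(sum_j alpha_ij * W_r h[j]).

    Scores use a^T LeakyReLU(W_l h_i + W_r h_j), which equals applying
    W = [W_l | W_r] to the concatenation [h_i || h_j].
    """
    left = [matvec(W_l, x) for x in h]
    right = [matvec(W_r, x) for x in h]
    new_h, attention = [], {}
    for i in sorted(adj):
        nbrs = sorted(adj[i])
        scores = []
        for j in nbrs:
            z = leaky_relu([u + v for u, v in zip(left[i], right[j])])
            scores.append(sum(ak * zk for ak, zk in zip(a, z)))
        top = max(scores)                       # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        alphas = [e / sum(exps) for e in exps]  # softmax over the neighborhood
        attention[i] = dict(zip(nbrs, alphas))
        agg = [sum(al * right[j][k] for al, j in zip(alphas, nbrs))
               for k in range(len(a))]
        new_h.append(elu(agg))
    return new_h, attention

# Chain-star adjacency: residues 0..6, master node 7 (indexing is an assumption).
adj = {i: {7} | ({i - 1} if i > 0 else set()) | ({i + 1} if i < 6 else set())
       for i in range(7)}
adj[7] = set(range(7))

rng = random.Random(0)
d_in, d_out = 36, 8
h = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(8)]
W_l = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
W_r = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
a = [rng.gauss(0, 0.1) for _ in range(d_out)]
new_h, attention = gatv2_layer(h, adj, W_l, W_r, a)
print(len(new_h), len(new_h[0]))  # 8 8
```

Because the nonlinearity is applied before the dot product with the attention vector, the ranking of neighbors can differ from node to node, which is the sense in which the attention is dynamic.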
[0043] 4. Hybrid readout strategy for outputting prediction results: After multi-layer graph convolution, a hybrid readout strategy is used to output the first prediction result $\hat{y}_1$. The specific calculation formula is as follows:
$$\hat{y}_1 = \mathrm{FC}\left(\left[\mathbf{h}_g^{(L)} \,\Vert\, \mathrm{Mean}\left(\mathbf{X}^{\mathrm{oh}}\right) \,\Vert\, \mathrm{Mean}\left(\mathbf{X}^{\mathrm{pc}}\right) \,\Vert\, \mathbf{s}\right]\right)$$
where $\mathbf{h}_g^{(L)}$ is the final feature vector of the global master node after $L$ GATv2 feature updates; $\mathrm{Mean}(\cdot)$ indicates an average pooling operation; $\mathbf{X}^{\mathrm{oh}}$ is the one-hot encoded feature matrix composed of the one-hot encoded vectors of the seven amino acid residues in the 7-position short peptide sequence; $\mathbf{X}^{\mathrm{pc}}$ is the physicochemical property feature matrix composed of the physicochemical property feature vectors of the seven amino acid residues; $\mathbf{s}$ is the sequence-level protein parameter feature vector of the 7-position short peptide sequence; and $\mathrm{FC}$ is a fully connected regression layer.
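The hybrid readout can be sketched as below. The fully connected regression layer is simplified here to a single linear layer, and the master-node dimension (8), the number of sequence-level features (4), and all numeric values are illustrative assumptions; only the four concatenated feature groups come from the text.

```python
def mean_pool(matrix):
    """Column-wise mean of a list of equal-length row vectors."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

def hybrid_readout(h_master, onehot_rows, physchem_rows, seq_features, weights, bias):
    """Concatenate the four feature groups and apply one linear regression layer."""
    z = h_master + mean_pool(onehot_rows) + mean_pool(physchem_rows) + seq_features
    if len(weights) != len(z):
        raise ValueError("weight vector does not match the readout dimension")
    return sum(w * x for w, x in zip(weights, z)) + bias

# Illustrative inputs (all values are placeholders, not data from the patent).
h_master = [0.5] * 8                                   # final master-node features
onehot_rows = [[1.0 if j == i else 0.0 for j in range(20)] for i in range(7)]
physchem_rows = [[float(i)] * 9 for i in range(7)]
seq_features = [1.0, 2.0, 3.0, 4.0]                    # sequence-level features
weights = [1.0] * (8 + 20 + 9 + 4)
y1 = hybrid_readout(h_master, onehot_rows, physchem_rows, seq_features, weights, 0.0)
print(round(y1, 6))
```

Pooling the raw one-hot and physicochemical matrices alongside the master node gives the regression layer direct access to composition-level information that does not depend on the learned graph features.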
[0044] 5. GNN Model Training and Parameter Optimization: Using the training set obtained in step A, supervised training of the GNN is performed on the constructed chain-star hybrid graphs and their corresponding true functional fitness labels. An early stopping mechanism is introduced during training to prevent overfitting: after each training epoch, the loss of the model on the validation set is calculated, with the patience value set to 20 training epochs. When the validation loss has not decreased for 20 consecutive training epochs, training is stopped, and the model parameters corresponding to the minimum validation loss during training are saved as the optimal weight parameters of the GNN for the subsequent prediction and model integration steps.
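The early stopping rule can be sketched independently of any particular model. The callback names (`train_one_epoch`, `validation_loss`, `get_state`), the `max_epochs` cap, and the simulated loss curve are assumptions for illustration; only the patience of 20 epochs and the rule of keeping the parameters with the minimum validation loss come from the text.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, get_state,
                              patience=20, max_epochs=1000):
    """Stop when validation loss has not decreased for `patience` epochs.

    Returns the saved state with the lowest validation loss and that loss.
    """
    best_loss, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_state, stale = loss, get_state(), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_loss

# Simulated validation curve: improves for 30 epochs, then plateaus at 0.5.
losses = [1.0 / (e + 1) for e in range(30)] + [0.5] * 100
state = {"epoch": -1}

def train_one_epoch():
    state["epoch"] += 1

def validation_loss():
    return losses[state["epoch"]]

def get_state():
    return dict(state)  # snapshot standing in for a copy of the model weights

best_state, best_loss = train_with_early_stopping(
    train_one_epoch, validation_loss, get_state)
print(best_state["epoch"], state["epoch"])  # 29 49
```

In the simulated run the best epoch is 29, and training stops at epoch 49 after 20 consecutive epochs without improvement.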
[0045] Step C: Feature extraction based on sequence context
This step uses the bidirectional long short-term memory network model BiLSTM to model the sequence context features of the amino acids in the inserted 7-position short peptide sequence, and outputs the second prediction result through a fully connected regression layer.
[0046] 1. Sequence Input Construction: Following the amino acid order of the 7-position short peptide sequence, the 29-dimensional basic feature vectors built in step A are stacked into an input tensor of shape $7 \times 29$:
$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_7] \in \mathbb{R}^{7 \times 29}$$
[0047] where $\mathbf{x}_t$ represents the basic feature vector of the $t$-th amino acid residue, composed of the one-hot encoding and the physicochemical properties of that residue, and $t \in \{1, 2, \dots, 7\}$ represents the index of the amino acid in the sequence.
[0048] 2. Bidirectional Long Short-Term Memory (BiLSTM) Layer: The BiLSTM performs forward and backward scans over each 7-position short peptide sequence, sequentially processing the feature vector of each amino acid residue and outputting, for each position $t$, the corresponding hidden state $\mathbf{h}_t$:
$$\mathbf{h}_t = \mathrm{BiLSTM}(\mathbf{x}_t), \quad t = 1, 2, \dots, 7$$
where $\mathbf{x}_t$ represents the basic feature vector of the amino acid residue at the $t$-th position of the 7-position short peptide sequence.
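The forward and backward scans can be sketched with a minimal LSTM written in plain Python. This assumes the standard LSTM gate equations, a hidden size of 16 per direction, and random weights, none of which are specified in the text; the helper names are hypothetical. The hidden state at each position is the concatenation of the forward and backward states.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(M, b, v):
    return [sum(m * x for m, x in zip(row, v)) + bk for row, bk in zip(M, b)]

def lstm_scan(xs, p):
    """Run a single-direction LSTM over xs and return the hidden state per step."""
    hidden = len(p["bi"])
    h, c, out = [0.0] * hidden, [0.0] * hidden, []
    for x in xs:
        z = list(x) + h                                           # [x_t || h_{t-1}]
        i = [sigmoid(v) for v in affine(p["Wi"], p["bi"], z)]     # input gate
        f = [sigmoid(v) for v in affine(p["Wf"], p["bf"], z)]     # forget gate
        o = [sigmoid(v) for v in affine(p["Wo"], p["bo"], z)]     # output gate
        g = [math.tanh(v) for v in affine(p["Wg"], p["bg"], z)]   # candidate cell
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        out.append(h)
    return out

def bilstm(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every position."""
    forward = lstm_scan(xs, fwd)
    backward = lstm_scan(xs[::-1], bwd)[::-1]  # scan reversed, then realign
    return [hf + hb for hf, hb in zip(forward, backward)]

def random_params(rng, in_dim, hidden):
    p = {}
    for gate in "ifog":
        p["W" + gate] = [[rng.gauss(0, 0.1) for _ in range(in_dim + hidden)]
                         for _ in range(hidden)]
        p["b" + gate] = [0.0] * hidden
    return p

rng = random.Random(0)
xs = [[rng.uniform(-1, 1) for _ in range(29)] for _ in range(7)]  # 7 x 29 input
fwd, bwd = random_params(rng, 29, 16), random_params(rng, 29, 16)
states = bilstm(xs, fwd, bwd)
print(len(states), len(states[0]))  # 7 32
```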
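[0049] 3. Attention Layer: The normalized attention weights $\alpha_t$ are calculated and a weighted context vector $\mathbf{c}$ is generated:
$$\mathbf{u}_t = \tanh\left(\mathbf{W}_w \mathbf{h}_t + \mathbf{b}_w\right)$$
$$\alpha_t = \frac{\exp\left(\mathbf{u}_t^{\top}\mathbf{u}_w\right)}{\sum_{k=1}^{7}\exp\left(\mathbf{u}_k^{\top}\mathbf{u}_w\right)}$$
$$\mathbf{c} = \sum_{t=1}^{7} \alpha_t \mathbf{h}_t$$
where $\mathbf{u}_t$ is the feature representation obtained by linearly transforming the hidden state of the amino acid residue at the $t$-th position of the 7-position short peptide sequence; $\tanh$ is the hyperbolic tangent function, which maps the linearly transformed value into the interval $(-1, 1)$; $\mathbf{W}_w$ is a learnable weight matrix; $\mathbf{b}_w$ is the bias term; $\mathbf{u}_w$ is a learnable context query vector; $\alpha_t$ is the normalized attention weight for position $t$; and $\mathbf{c}$ is the final weighted context vector.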
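The attention pooling step can be sketched as follows. The function name, the two-dimensional toy hidden states, and the weight values are illustrative assumptions; the computation follows the three stages described above (linear transformation with tanh, softmax against a context query vector, weighted sum of hidden states).

```python
import math

def attention_pool(hidden_states, W, b, u_w):
    """Additive attention: u_t = tanh(W h_t + b), alpha = softmax(u_t . u_w)."""
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bk)
             for row, bk in zip(W, b)]
        scores.append(sum(uk * qk for uk, qk in zip(u, u_w)))
    top = max(scores)                          # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    context = [sum(al * h[k] for al, h in zip(alphas, hidden_states))
               for k in range(len(hidden_states[0]))]
    return alphas, context

# Toy hidden states for 7 positions (placeholder values).
hidden_states = [[float(t), 1.0] for t in range(7)]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

# A zero query vector gives uniform weights, so the context is the plain mean.
uniform_alphas, mean_context = attention_pool(hidden_states, W, b, [0.0, 0.0])

# A non-zero query vector shifts weight toward positions with larger scores.
alphas, context = attention_pool(hidden_states, W, b, [1.0, 0.0])
print([round(a, 3) for a in alphas])
```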
[0050] 4. Output the second prediction result: The vector $\mathbf{c}$ is input into the fully connected layer to obtain the second prediction result $\hat{y}_2$:
$$\hat{y}_2 = \mathrm{FC}(\mathbf{c})$$
[0051] where $\mathrm{FC}$ represents the fully connected regression layer, and $\mathbf{c}$ represents the calculated weighted context vector.
[0052] 5. BiLSTM Model Training and Parameter Optimization: Using the training set obtained in step A, the constructed sequence input tensor and its corresponding functional fitness score are input into the BiLSTM for supervised training. The early stopping mechanism is also adopted during the training process. After each training epoch, the loss value of the model on the validation set is calculated. When the validation set loss value does not decrease within 20 consecutive training epochs, the training is stopped and the model parameters corresponding to the minimum validation set loss value are saved as the optimal weight parameters of the BiLSTM model.
[0053] Step D: Model Integration Based on Linear Regression
1. Obtaining Validation Set Prediction Results: Using the validation set obtained in step A, the same encoding and mapping process as in steps B and C is applied to each 7-position short peptide sequence in the validation set. The GNN and BiLSTM models trained in steps B and C, loaded with their optimal weight parameters, are then used to run inference on the validation set samples, yielding the corresponding first prediction results $\hat{y}_1$ and second prediction results $\hat{y}_2$.
[0054] 2. Linear regression model parameter fitting: The first prediction result $\hat{y}_1$ and the second prediction result $\hat{y}_2$ are taken as input variables, and the functional fitness score $y$ corresponding to the validation set is taken as the target. The weight coefficients $w_1$, $w_2$ and the bias term $b$ of the linear regression model are fitted by minimizing the mean squared error over the $N$ validation samples. The calculation formula is as follows:
$$(w_1, w_2, b) = \arg\min_{w_1, w_2, b} \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \left(w_1 \hat{y}_{1,n} + w_2 \hat{y}_{2,n} + b\right)\right)^2$$
After fitting, the weight coefficients $w_1$, $w_2$ and the bias term $b$ serve as the parameters of the trained linear regression model.
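The least-squares fit of the two-input linear regression can be sketched in closed form. The function names and the synthetic predictions below are assumptions for illustration (the targets are generated from known coefficients so that the fit can be checked); the text specifies only that the coefficients minimize the mean squared error on the validation set.

```python
def fit_linear_ensemble(y1, y2, y):
    """Least-squares fit of y ~ w1*y1 + w2*y2 + b via the normal equations."""
    rows = [(p, q, 1.0) for p, q in zip(y1, y2)]
    # Build the augmented 3x4 system [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(3)]
    for col in range(3):                       # Gauss-Jordan with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictions are collinear; fit is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(3):
            if r != col:
                M[r] = [vr - M[r][col] * vc for vr, vc in zip(M[r], M[col])]
    w1, w2, b = (M[i][3] for i in range(3))
    return w1, w2, b

def ensemble_predict(y1, y2, w1, w2, b):
    """Final score as the fitted linear combination of the two model outputs."""
    return w1 * y1 + w2 * y2 + b

# Synthetic validation predictions (placeholders, not results from the patent).
gnn_pred = [0.0, 1.0, 2.0, 3.0, 4.0]
lstm_pred = [1.0, 0.0, 2.0, 1.0, 3.0]
target = [0.6 * p + 0.4 * q + 0.1 for p, q in zip(gnn_pred, lstm_pred)]

w1, w2, b = fit_linear_ensemble(gnn_pred, lstm_pred, target)
print(round(w1, 6), round(w2, 6), round(b, 6))  # 0.6 0.4 0.1
```

Fitting the combination on the validation set rather than the training set keeps the ensemble weights from simply favoring whichever model has memorized the training data more closely.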
[0055] Step E: Prediction output
After the models have completed the above steps, the following prediction process is performed for each AAV capsid protein 7-mer insertion sequence to be predicted:
1. Data preprocessing: The same preprocessing as in step A is applied to the target sequence to generate the amino acid one-hot encoding vectors, physicochemical property feature vectors, position encoding vectors, and sequence-level protein parameter features.
[0056] 2. GNN and BiLSTM Inference: The encoded features are input into the trained GNN and BiLSTM models according to each model's input requirements, yielding the first prediction result $\hat{y}_1$ and the second prediction result $\hat{y}_2$, respectively.
3. Linear Regression Model Output: The first prediction result $\hat{y}_1$ and the second prediction result $\hat{y}_2$ are fed into the linear regression model trained in step D, and the final predicted value $\hat{y}$ is obtained through weighted calculation:
$$\hat{y} = w_1 \hat{y}_1 + w_2 \hat{y}_2 + b$$
where the weight coefficients $w_1$, $w_2$ and the bias term $b$ are the parameters fitted in step D. The higher the score, the higher the production fitness of the AAV capsid variant during viral packaging, i.e., the higher the expected viral yield or enrichment.
[0057] The baseline model set up for the experiment was the Production model in the Fit4Function strategy. The evaluation metrics were the coefficient of determination ($R^2$), which evaluates the model's goodness of fit, and the Pearson correlation coefficient (Pearson r), which assesses predictive correlation. The experimental results are shown in Table 1. The method of this invention outperforms the baseline model on both evaluation metrics.
Table 1. Performance comparison of the proposed method and the baseline model on the Production prediction task.
[0058] 1. Experimental data show that although the baseline model establishes a basic sequence-function mapping, it encounters a fitting bottleneck when dealing with the complex epistatic effects of AAV capsid proteins. This invention introduces a chain-star hybrid graph structure and uses the GNN to explicitly model nonlocal interactions between residues, while the BiLSTM captures the linear dependencies and contextual information of the sequence. By fusing the prediction results of the GNN and BiLSTM through a weighted linear regression model, the prediction accuracy is significantly improved: the $R^2$ score improved by approximately 4%, and the Pearson r improved by approximately 2.1%. Figure 5 illustrates the correlation between predicted and true values for the different models on the test set. The horizontal axis represents the true value, and the vertical axis represents the predicted value. The dashed line represents the ideal reference line, on which the predicted value equals the true value.
[0059] 2. To evaluate the model's generalization ability and robustness with small sample sizes, the learning curves of the proposed model and the baseline model were compared for training set sizes ranging from 200 to 5,000 sequences (as shown in Figure 6). The solid line in the figure represents the average Pearson r of 5 independent repeated experiments, and the semi-transparent shaded area around the solid line represents the standard deviation, which reflects the range of fluctuation in model performance. Detailed analysis follows:
Rapid generalization ability under small-sample conditions: With a training set of 200 sequences, the baseline model's average Pearson r is only 0.40, indicating poor prediction performance, whereas the model of this invention achieves a high correlation of approximately 0.78 under the same conditions. This verifies that, by integrating graph-structure topological features and sequence context features through a linear regression model, this invention can mine the functional mapping rules of short peptide sequences more efficiently than a single-architecture sequence model, and thus outputs stable, high-precision predictions even with a small sample size.
[0060] Model robustness and stability: As shown in Figure 6, in the range of relatively small training sets (200 to 1,000 sequences), the baseline model exhibits a wide shaded area around its curve. This reflects the high prediction variance of the baseline model under small-sample conditions, indicating unstable model performance and difficulty in guaranteeing reproducibility in practical applications. In contrast, the shaded area of this invention remains small throughout the entire test range, indicating lower variance in the predictions across repeated experiments and demonstrating higher robustness of the model.
[0061] Convergence efficiency comparison: As the number of training sequences increases, the performance of the model in this invention improves significantly faster than that of the baseline model. With 2,000 training sequences, the Pearson r of this invention already exceeds 0.90, while the baseline model requires more data to reach this level.
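The statistics behind the learning curves (Pearson r per run, then its mean and standard deviation over 5 repeats) and the $R^2$ metric can be computed as sketched below. The function names are hypothetical, the use of the sample standard deviation is an assumption because the text does not say which estimator was used, and the five r values are placeholders rather than experimental results.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def summarize_repeats(r_values):
    """Mean and sample standard deviation across repeated runs (assumption)."""
    return statistics.fmean(r_values), statistics.stdev(r_values)

# Hypothetical Pearson r values from 5 repeated runs at one training-set size.
mean_r, std_r = summarize_repeats([0.1, 0.2, 0.3, 0.4, 0.5])
print(round(mean_r, 3), round(std_r, 3))
```

The mean corresponds to the solid line of a learning curve at one training-set size, and the standard deviation to the half-width of the shaded band.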
[0062] Experimental results show that this invention not only achieves higher final prediction accuracy but also offers superior data efficiency. In practical AAV engineering, the method of this invention requires only a small amount of wet-lab experimental data (screening of a few hundred sequences) to establish a highly reliable prediction model, thereby significantly reducing the time and sequencing cost of AAV capsid development.
Claims
1. A method for predicting the functional fitness of an AAV capsid protein with an inserted 7-position short peptide sequence, characterized in that the method includes the following steps:
Step A: Data preprocessing; functional sequencing data containing different 7-position short peptide sequences and their corresponding next-generation sequencing (NGS) data are acquired, and a logarithmic transformation is performed on the sequencing frequency values of the functional sequencing data to obtain the functional fitness score $y$ used for all model training and prediction. Then, for each 7-position short peptide sequence, sequence-level protein parameter features characterizing the overall physicochemical properties of the 7-position short peptide sequence are extracted, and a basic feature vector is constructed for each amino acid residue by combining the amino acid one-hot encoding vector and the physicochemical property feature vector. All 7-position short peptide sequences and their corresponding functional fitness scores $y$ are divided into a training set, a validation set, and a test set according to a certain ratio;
Step B: Feature extraction based on chain-star hybrid graph; a chain-star hybrid graph structure is constructed based on the 7-position short peptide sequence. The chain-star hybrid graph structure includes 7 amino acid residue nodes and 1 global master node that aggregates global features. The 7 amino acid residue nodes are position-encoded to obtain position encoding vectors. For each amino acid residue node, the basic feature vector described in step A is concatenated with the position encoding vector corresponding to that amino acid residue node to serve as the initial feature vector of that node. The feature vector of the global master node is initialized to a zero vector. The graph neural network model GNN is used to learn features on the chain-star hybrid graph structure: neighborhood information is aggregated by calculating the attention weights between nodes to update the features of the global master node and the amino acid residue nodes. A hybrid readout strategy is adopted: first, the one-hot encoded feature matrix and the physicochemical property feature matrix of the 7 amino acid residues in the 7-position short peptide sequence are each average-pooled; then, the two average-pooling results are concatenated with the updated global master node features and the sequence-level protein parameter feature vector obtained in step A. After processing by a fully connected regression layer, the first prediction result is output through supervised training;
Step C: Feature extraction based on sequence context; following the amino acid order of the 7-position short peptide sequence, the basic feature vectors obtained in step A are input into the bidirectional long short-term memory network BiLSTM to obtain the bidirectional contextual hidden state representation of each amino acid residue position. An attention mechanism is introduced to weight and aggregate the hidden state representations of the amino acid residue positions, extracting sequence context features, which are processed through a fully connected regression layer to output a second prediction result through supervised training;
Step D: Model ensemble based on linear regression; the first prediction result and the second prediction result are used as input features, and the functional fitness score $y$ obtained in step A is used as the target value, to perform weight learning and parameter fitting of the linear regression model, thereby obtaining the optimal weight coefficients and bias term of the linear regression model;
Step E: Prediction output; the 7-position short peptide sequences to be predicted are preprocessed and mapped, then input into the trained GNN and BiLSTM models to obtain the corresponding first and second prediction results. These results are then substituted into the linear regression model of step D to calculate the final functional fitness score.
2. The method according to claim 1, characterized in that the formula for calculating the functional fitness score in step A is: $$y = \log(f + \epsilon)$$ where $y$ is the functional fitness score, which serves as the target value for all model training and prediction; $f$ represents the sequencing frequency value of the 7-position short peptide sequence; and $\epsilon$ is a non-zero smoothing factor used to avoid an undefined logarithm when the sequencing frequency value is zero.
3. The method according to claim 1, characterized in that, in step A, the extraction process of the basic feature vector includes: mapping each amino acid residue in the 7-position short peptide sequence to a 20-dimensional amino acid one-hot encoding vector; extracting 9-dimensional physicochemical property features for each amino acid residue, specifically: hydrophobicity, net charge, polarity, relative accessible surface area, van der Waals volume, hydrogen bond donating ability, hydrogen bond accepting ability, α-helix propensity, and β-sheet propensity; and concatenating the one-hot encoding vector of the amino acid with the physicochemical property feature vector to obtain a 29-dimensional basic feature vector $\mathbf{x}_i$ for each amino acid residue: $$\mathbf{x}_i = \mathbf{x}_i^{\mathrm{oh}} \,\Vert\, \mathbf{x}_i^{\mathrm{pc}}$$ where $i \in \{1, 2, \dots, 7\}$ represents the position index of a single amino acid residue within the 7-position short peptide sequence; $\mathbf{x}_i^{\mathrm{oh}}$ represents the one-hot encoded vector of the amino acid residue; $\mathbf{x}_i^{\mathrm{pc}}$ represents the physicochemical property feature vector of the amino acid residue; and $\Vert$ indicates the vector concatenation operation.
4. The method according to claim 1, characterized in that the graph structure data in step B adopts a chain-star hybrid graph structure, constructed as follows: the node set of the chain-star hybrid graph structure is defined to include 7 amino acid residue nodes and 1 global master node; the edge set of the chain-star hybrid graph structure is defined as the union of chain edges and star edges, where chain edges connect adjacent amino acid residue nodes, and star edges connect the global master node to all amino acid residue nodes; for the $i$-th amino acid residue node, the initial feature vector $\mathbf{h}_i^{(0)}$ is calculated using the following formula: $$\mathbf{h}_i^{(0)} = \mathbf{x}_i \,\Vert\, \mathbf{p}_i$$ where $\mathbf{h}_i^{(0)}$ represents the initial input feature vector of the $i$-th amino acid residue node; $i \in \{1, 2, \dots, 7\}$ represents the index of the amino acid in the sequence; $\mathbf{x}_i$ is the 29-dimensional basic feature vector of the amino acid residue; $\Vert$ represents the vector concatenation operation; and $\mathbf{p}_i$ is the 7-dimensional position encoding vector representing the relative position of the amino acid in the 7-position short peptide sequence, in which the dimension corresponding to the position of the residue is 1 and all other dimensions are 0.
5. The method according to claim 4, characterized in that, in step B, the graph neural network model includes a multi-layer dynamic graph attention convolutional network (GATv2). In each layer of the GATv2 network, an attention score is constructed between node pairs by linearly transforming and combining the feature vectors of the node itself and its neighboring node, thus representing the structural correlation between nodes. A normalization operation is performed on the attention scores to obtain the attention coefficients between the node and its neighboring nodes. The features of the neighboring nodes are then weighted and aggregated according to these attention coefficients to update the structural representation of the current node. The dynamic graph attention mechanism allows the attention weights to adjust adaptively as node features change, thereby enhancing the modeling of local structural differences in the AAV capsid protein and of the perturbation effect of inserting the 7-position short peptide sequence. The feature update process specifically includes: first, calculating the attention score $e_{ij}^{(l)}$ between node $i$ and node $j$: $$e_{ij}^{(l)} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\left(\mathbf{W}\left[\mathbf{h}_i^{(l)} \,\Vert\, \mathbf{h}_j^{(l)}\right]\right)$$ then, calculating the attention coefficient $\alpha_{ij}^{(l)}$ between node $i$ and its neighboring node $j$ in the $l$-th layer of the dynamic graph attention convolutional network: $$\alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}$$ and, based on the attention coefficients, aggregating neighborhood information to update the node features: $$\mathbf{h}_i^{(l+1)} = \mathrm{ELU}\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\,\mathbf{W}\,\mathbf{h}_j^{(l)}\right)$$ where $\mathbf{a}^{\top}$ represents the transpose of the attention weight vector; $\mathrm{LeakyReLU}$ represents the leaky rectified linear unit activation function; $\mathbf{W}$ is the linear transformation weight matrix; $\mathbf{h}_i^{(l)}$ and $\mathbf{h}_j^{(l)}$ represent the feature vectors of node $i$ and its neighboring node $j$ in the $l$-th layer, respectively; $\Vert$ represents vector concatenation; $\exp$ is the exponential function; $\mathbf{h}_i^{(l+1)}$ is the feature vector of node $i$ after the GATv2 update; $\mathcal{N}(i)$ is the neighborhood set of node $i$; and $\mathrm{ELU}$ is the ELU activation function.
6. The method according to claim 5, characterized in that, in step B, the first prediction result $\hat{y}_1$ is output using a hybrid readout strategy, and the specific calculation formula is as follows: $$\hat{y}_1 = \mathrm{FC}\left(\left[\mathbf{h}_g^{(L)} \,\Vert\, \mathrm{Mean}\left(\mathbf{X}^{\mathrm{oh}}\right) \,\Vert\, \mathrm{Mean}\left(\mathbf{X}^{\mathrm{pc}}\right) \,\Vert\, \mathbf{s}\right]\right)$$ where $\mathbf{h}_g^{(L)}$ is the final feature vector of the global master node after $L$ GATv2 feature updates; $\mathrm{Mean}(\cdot)$ indicates an average pooling operation; $\mathbf{X}^{\mathrm{oh}}$ is the one-hot encoded feature matrix composed of the one-hot encoded vectors of the seven amino acid residues in the 7-position short peptide sequence; $\mathbf{X}^{\mathrm{pc}}$ is the physicochemical property feature matrix composed of the physicochemical property feature vectors of the seven amino acid residues; $\mathbf{s}$ is the sequence-level protein parameter feature vector of the 7-position short peptide sequence; and $\mathrm{FC}$ is a fully connected regression layer.
7. The method according to claim 1, characterized in that, in step C, the BiLSTM feature extraction specifically includes: Input layer: constructing a sequence input tensor $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_7]$ of shape $7 \times 29$; Bidirectional long short-term memory (BiLSTM) layer: the BiLSTM performs forward and backward scans over each 7-position short peptide sequence, sequentially processing the feature vector of each amino acid residue and outputting, for each position $t$, the corresponding hidden state $\mathbf{h}_t$: $$\mathbf{h}_t = \mathrm{BiLSTM}(\mathbf{x}_t), \quad t = 1, 2, \dots, 7$$ where $\mathbf{x}_t$ represents the basic feature vector of the amino acid residue at the $t$-th position of the 7-position short peptide sequence; Attention layer: calculating the normalized attention weights $\alpha_t$ and generating a weighted context vector $\mathbf{c}$: $$\mathbf{u}_t = \tanh\left(\mathbf{W}_w \mathbf{h}_t + \mathbf{b}_w\right), \quad \alpha_t = \frac{\exp\left(\mathbf{u}_t^{\top}\mathbf{u}_w\right)}{\sum_{k=1}^{7}\exp\left(\mathbf{u}_k^{\top}\mathbf{u}_w\right)}, \quad \mathbf{c} = \sum_{t=1}^{7} \alpha_t \mathbf{h}_t$$ where $\mathbf{u}_t$ is the feature representation obtained by linearly transforming the hidden state of the amino acid residue at the $t$-th position; $\tanh$ is the hyperbolic tangent function, which maps the linearly transformed value into the interval $(-1, 1)$; $\mathbf{W}_w$ is a learnable weight matrix; $\mathbf{b}_w$ is the bias term; $\mathbf{u}_w$ is a learnable context query vector; $\alpha_t$ is the normalized attention weight for position $t$; and $\mathbf{c}$ is the final weighted context vector; Output layer: inputting the vector $\mathbf{c}$ into the fully connected layer to obtain the second prediction result $\hat{y}_2$: $$\hat{y}_2 = \mathrm{FC}(\mathbf{c})$$ where $\mathbf{c}$ is the final weighted context vector and $\mathrm{FC}$ is a fully connected regression layer.
8. The method according to claim 1, characterized in that, The training process of GNN and BiLSTM specifically includes: using the training set, validation set, and test set divided in step A, wherein the validation set does not participate in the model's gradient update; iteratively training GNN and BiLSTM using the training set respectively, and using mean squared error as the loss function for parameter optimization; introducing an early stopping mechanism to monitor model performance: after each training epoch, calculating the model's loss value on the validation set; setting a patience value of N training epochs, and terminating training if the validation set loss value does not decrease within N consecutive training epochs; saving the model parameters corresponding to the minimum validation set loss value during training as the optimal weights.
9. The method according to claim 1, characterized in that, in step D, the weight learning and parameter fitting of the linear regression model specifically includes: using the validation set obtained in step A, preprocessing and mapping the 7-position short peptide sequences in the validation set; loading the GNN and BiLSTM models trained in steps B and C with their optimal weights, and inputting the feature representations obtained after preprocessing and mapping of each 7-position short peptide sequence in the validation set into the GNN and BiLSTM models for inference, respectively, to obtain the first prediction result $\hat{y}_1$ and the second prediction result $\hat{y}_2$ of each validation set sequence; linearly combining $\hat{y}_1$ and $\hat{y}_2$, and fitting the weight coefficients $w_1$, $w_2$ and the bias term $b$ of the linear regression model against the functional fitness scores $y$ of the validation set sequences by minimizing the mean squared error over the $N$ validation samples, with the calculation formula as follows: $$(w_1, w_2, b) = \arg\min_{w_1, w_2, b} \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \left(w_1 \hat{y}_{1,n} + w_2 \hat{y}_{2,n} + b\right)\right)^2$$ After fitting, the weight coefficients $w_1$, $w_2$ and the bias term $b$ serve as the parameters of the trained linear regression model and are used for the prediction output in step E.
10. The method according to claim 1, characterized in that, in step E, the functional fitness score of the 7-position short peptide sequence to be predicted is predicted using the linear regression model fitted in step D, as follows: the 7-position short peptide sequence to be predicted is preprocessed in the same way as in step A, including generating the amino acid one-hot encoding vectors, physicochemical property feature vectors, position encoding vectors, and sequence-level protein parameter features; the preprocessed features are then input into the trained GNN and BiLSTM models according to each model's input requirements to obtain the first prediction result $\hat{y}_1$ and the second prediction result $\hat{y}_2$ for the sequence; $\hat{y}_1$ and $\hat{y}_2$ are input into the linear regression model fitted in step D, which calculates and outputs the final predicted value $\hat{y}$, with the calculation formula as follows: $$\hat{y} = w_1 \hat{y}_1 + w_2 \hat{y}_2 + b$$ where $w_1$, $w_2$, and $b$ are the parameters of the linear regression model fitted in step D.