A sequence feature-based pig brain neurotrophic peptide structure-activity relationship mining method and system

CN122245429A · Pending · Publication Date: 2026-06-19 · PINGDINGSHAN HUIXINYUAN BIOTECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
PINGDINGSHAN HUIXINYUAN BIOTECHNOLOGY CO LTD
Filing Date
2026-03-24
Publication Date
2026-06-19

Figure CN122245429A_ABST

Abstract

This invention relates to the field of bioinformatics processing and discloses a method and system for mining the structure-activity relationship (SAR) of porcine brain neurotrophic peptides based on sequence features. The method includes constructing an original sample index table that fuses multi-source production data, extracting peptide sequence features to generate a sequence feature matrix, constructing a sequence-process joint graph containing peptide nodes and process state nodes, training a structure-activity relationship graph neural network to mine the SAR, and deriving a process control decision table from a process response sample set generated by the network, thereby achieving online optimization of the porcine brain neurotrophic peptide preparation process. The invention addresses the difficulty of SAR mining caused by the separation of process parameters, sequence information, and activity data. It characterizes the synergistic effect of sequence and process, supports reverse optimization of process parameters, and thus improves the targeted enrichment efficiency and bioactivity retention of the target neurotrophic peptides.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics processing, and in particular to a method and system for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features.

Background Technology

[0002] Neurotrophic peptides possess significant biological activity and show promise for applications in maintaining neurological health and assisting in the treatment of related diseases. Currently, the preparation of neurotrophic peptides from pig brains primarily employs a bio-extraction process. This process involves continuous steps such as raw material pretreatment, enzymatic extraction, multi-stage purification, and activity protection. Each step requires the empirical setting of process parameters such as temperature, pH, time, and purification conditions.

[0003] In actual production, when workers operate according to fixed process parameters, insufficient separation of target neurotrophic peptides from impurities often occurs. This is because there is a complex nonlinear relationship between the process parameters of each production stage and the sequence characteristics and bioactivity requirements of the target neurotrophic peptide. This relationship is difficult to accurately identify and quantify through human experience. When the characteristics of raw materials or the state of intermediate products change, the fixed parameters cannot dynamically adapt to changes in this implicit relationship, resulting in poor targeted enrichment of the target peptide. This leads to a low proportion of the target neurotrophic peptide in the final product, requiring additional purification steps to meet the purity requirements of subsequent applications. Furthermore, over-purification may cause loss of the target peptide, affecting the actual application effect of the product.

Summary of the Invention

[0004] To overcome the aforementioned deficiencies of the prior art, the present invention provides the following technical solution:

[0005] A method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features, comprising:

[0006] S1. Obtain original process and sequencing activity data of multiple batches of pig brains to construct an original sample index table. Based on the original sample index table, extract sequence features to generate a neurotrophic peptide sequence feature matrix.

[0007] S2. Construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network.

[0008] S3. Use the trained structure-activity relationship graph neural network to generate a process response sample set from historical process records, and derive a process control candidate rule set based on the process response sample set to obtain a process control decision table.

[0009] S4. Use the process control decision table to generate a process adjustment instruction sequence from real-time production data, and use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.

[0010] Furthermore, the steps for constructing the original sample index table and generating the neurotrophic peptide sequence feature matrix include:

[0011] S11: Collect raw data from multiple batches of pig brain production lines and construct a multi-source production data set;

[0012] S12: Based on a multi-source production data set, construct an original sample index table using a batch process stage binding method;

[0013] S13: Based on the original sample index table, perform noise reduction and redundancy elimination on the mass spectrometry peptide sequences to obtain a set of non-redundant peptide sequences.

[0014] S14: Based on a set of non-redundant peptide sequences, extract multi-dimensional sequence features to generate a set of sequence feature vectors;

[0015] S15: Construct a neurotrophic peptide sequence feature matrix based on the sequence feature vector set and the original sample index table.

[0016] Furthermore, the process flow of the batch process stage binding method is as follows:

[0017] S121: Determine the batch time axis based on the batch timestamp acquisition sequence. The batch time axis is a set of time intervals divided into process stages within a single production batch, including the raw material pretreatment stage, enzymatic extraction stage, multi-stage purification stage, and activity protection stage.

[0018] S122: The enzymatic hydrolysis parameter acquisition sequence, temperature monitoring acquisition sequence, pH monitoring acquisition sequence, and purification condition acquisition sequence in the multi-source production data set are matched with time intervals according to the batch time axis. Each process parameter record is mapped to its batch number and process stage to obtain a process parameter binding record set.

[0019] S123: Match each peptide sequence and its abundance information in the mass spectrometry sequencing acquisition sequence according to the sampling timestamp and batch time axis, and associate the peptide sequence with the corresponding batch number and process stage to obtain a peptide batch association record set;

[0020] S124: For each activity score data in the activity evaluation collection sequence, associate it with the peptide batch association record set one by one according to the batch source identifier of the test sample, and attach a corresponding activity tag to each peptide record to obtain a peptide record set with activity tags;

[0021] S125: Connect the process parameter binding record set with the peptide record set with active tags according to the batch number and process stage to generate the original sample index table.

[0022] Furthermore, the process for extracting the multidimensional sequence features is as follows:

[0023] S141: For each peptide sequence in the non-redundant peptide sequence set, its amino acid composition characteristics are statistically analyzed. The amino acid composition characteristics are vectors of length 20, and each component of the vector corresponds to the frequency of a standard amino acid in the peptide sequence.

[0024] S142: For each peptide sequence in the non-redundant peptide sequence set, calculate its physicochemical properties, including molecular weight, isoelectric point, hydrophobicity index, net charge number, and instability index.

[0025] S143: For each peptide sequence in the non-redundant peptide sequence set, calculate its dipeptide frequency features and tripeptide frequency features. The dipeptide frequency features are vectors of length 400, and the tripeptide frequency features are vectors of length 8000.

[0026] S144: For each peptide sequence in the non-redundant peptide sequence set, call the secondary structure prediction model to generate secondary structure features, and obtain a secondary structure proportion vector of length 3.

[0027] S145: For each peptide sequence in the non-redundant peptide sequence set, call the disordered region prediction model to generate disordered region features. The disordered region features include the proportion of amino acids in disordered regions and the length of the longest continuous disordered region.

[0028] S146: For each peptide sequence in the non-redundant peptide sequence set, call the pre-trained protein language model to generate a sequence embedding vector;

[0029] S147: The amino acid composition features, physicochemical property features, dipeptide frequency features, tripeptide frequency features, secondary structure proportion vector, disordered region features, and sequence embedding vector are concatenated in that order to obtain the complete sequence feature vector of each peptide, and the vectors of all peptides are collected to obtain the sequence feature vector set.
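The count-based blocks of steps S141 and S143 can be illustrated with a minimal sketch. It covers only the amino acid, dipeptide, and tripeptide frequency vectors (lengths 20, 400, and 8000). The physicochemical, secondary structure, disordered region, and language model blocks of S142 and S144 to S146 are omitted, and in the full vector of S147 the physicochemical block sits between the amino acid and dipeptide blocks. The function names, the lexicographic ordering of k-mers, and the example peptide are assumptions made for illustration, not details given in this application.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence, k):
    """Return the frequency vector over all 20**k k-mers (lexicographic order)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(max(len(sequence) - k + 1, 0)):
        pos = index.get(sequence[i:i + k])
        if pos is not None:  # windows with non-standard residues are skipped
            counts[pos] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def composition_features(sequence):
    """Concatenate amino acid (20), dipeptide (400) and tripeptide (8000) blocks."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_frequencies(sequence, k))
    return features


# Hypothetical peptide used only to exercise the functions.
vector = composition_features("ACDKACD")
print(len(vector))  # 20 + 400 + 8000 = 8420
```

For short peptides most of the 8000 tripeptide entries are zero, so a sparse representation may be preferable in practice.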

[0030] Furthermore, the steps of constructing the sequence-process joint graph and training the structure-activity relationship graph neural network include:

[0031] S21: Construct a peptide sequence similarity subgraph based on the neurotrophic peptide sequence feature matrix;

[0032] S22: Construct a set of process state nodes based on the process parameter encoding vectors in the neurotrophic peptide sequence feature matrix;

[0033] S23: Construct a sequence-process joint graph based on peptide sequence similarity subgraphs and process state node sets;

[0034] S24: Construct and train a structure-activity relationship graph neural network based on the sequence-process joint graph.
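One possible reading of steps S21 to S23 is sketched below: peptide nodes are linked to each other when their feature vectors are similar, and each peptide node is linked to the process state node under which it was produced. The cosine similarity measure, the 0.8 threshold, the edge labels, and the toy two-dimensional vectors are all assumptions for illustration. The application does not specify them at this point.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_joint_graph(peptide_features, peptide_to_state, threshold=0.8):
    """Return a typed edge list for a heterogeneous peptide/process-state graph.

    peptide_features: {peptide_id: feature vector}
    peptide_to_state: {peptide_id: process state node id}
    """
    edges = []
    ids = sorted(peptide_features)
    # Peptide-peptide edges: the sequence similarity subgraph (S21).
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(peptide_features[a], peptide_features[b]) >= threshold:
                edges.append((("peptide", a), ("peptide", b), "similar_to"))
    # Peptide-state edges: the heterogeneous links of the joint graph (S23).
    for pid, state in sorted(peptide_to_state.items()):
        edges.append((("peptide", pid), ("state", state), "produced_under"))
    return edges


features = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
states = {"p1": "s0", "p2": "s0", "p3": "s1"}
graph_edges = build_joint_graph(features, states)
print(len(graph_edges))
```

The all-pairs comparison is quadratic in the number of peptides. For large peptide sets an approximate nearest-neighbour index would be a natural substitute.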

[0035] Furthermore, the process of constructing and training the structure-activity relationship graph neural network includes:

[0036] S241: Define the network architecture of the structure-activity relationship graph neural network, which includes three modules: a node feature embedding layer, a heterogeneous graph attention layer, and a multi-task prediction layer.

[0037] S242: Divide the peptide nodes in the sequence-process joint graph according to the preset training, validation, and test ratio to obtain the training node set, the validation node set, and the test node set;

[0038] S243: Define a multi-task loss function, which is a weighted sum of the prediction loss of nerve cell survival rate and the prediction loss of neurite growth promotion rate, and each prediction loss is calculated using mean square error;

[0039] S244: The structure-activity relationship graph neural network is iteratively trained using a mini-batch gradient descent strategy. In each training iteration, a batch of peptide nodes and their local subgraphs are sampled and input into the network to calculate the predicted activity value. The multi-task loss function value is calculated with the activity label vector and backpropagated to update the parameters.

[0040] S245: After each training cycle, input the local subgraph corresponding to the set of validation nodes into the network to calculate the validation loss value. When the validation loss value no longer decreases for several consecutive training cycles, the training is considered to have converged, and the current network parameters are saved.

[0041] S246: The structure-activity relationship graph neural network after training convergence is taken as the trained structure-activity relationship graph neural network, and the attention weight matrix learned in the heterogeneous graph attention layer is saved.
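The loss of S243 and the convergence check of S245 can be sketched without committing to a particular deep learning framework. The equal task weights, the patience of three training cycles, and the loss values in the example are assumptions. The application only states that the loss is a weighted sum of two mean square errors and that training stops after "several consecutive" cycles without improvement.

```python
def mse(predictions, targets):
    """Mean square error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def multitask_loss(pred_survival, true_survival, pred_neurite, true_neurite,
                   w_survival=0.5, w_neurite=0.5):
    """Weighted sum of the survival-rate loss and the neurite-growth loss (S243)."""
    return (w_survival * mse(pred_survival, true_survival)
            + w_neurite * mse(pred_neurite, true_neurite))


def early_stopping(validation_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses (S245).

    Training stops once the loss has failed to decrease for `patience`
    consecutive epochs; best_epoch is the epoch with the lowest loss so far.
    """
    best, best_epoch, stale = float("inf"), -1, 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return len(validation_losses) - 1, best_epoch


loss = multitask_loss([0.8, 0.6], [1.0, 0.6], [0.5], [0.3])
stop_epoch, best_epoch = early_stopping([0.9, 0.7, 0.6, 0.61, 0.62, 0.6, 0.5])
print(loss, stop_epoch, best_epoch)
```

S245 saves the parameters current at the stopping point. The sketch also reports the best epoch, since restoring the best checkpoint is a common alternative.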

[0042] Furthermore, the steps of generating a process response sample set and deriving the process control decision table include:

[0043] S31: Based on the trained structure-activity relationship graph neural network, a virtual evaluation of historical process parameter combinations is performed to generate a process response sample set;

[0044] S32: Based on the process response sample set, a joint strategy of reinforcement learning and evolutionary search is used to derive the candidate rule set for process control;

[0045] S33: Based on the candidate rule set for process control, perform conflict resolution and coverage optimization to generate a process control decision table.

[0046] Furthermore, the process of deriving the candidate rule set for process control using a joint strategy of reinforcement learning and evolutionary search is as follows:

[0047] S321: Define the process trajectory optimization objective function, which is a multi-objective evaluation function, including the objective of maximizing the comprehensive neurotrophic index, the objective of minimizing production energy consumption, the objective of minimizing inter-batch variability, and the objective of maximizing the abundance of target high-activity peptides;

[0048] S322: Construct a process trajectory state space and an action space. The process trajectory state space is defined as the joint space of the current process stage, the current process parameter encoding vector, and the current intermediate peptide profile features. The action space is defined as the set of adjustment ranges for each process parameter in the next process stage.

[0049] S323: Construct a process policy network using the proximal policy optimization (PPO) algorithm. The input of the process policy network is the current state vector, and the output is the probability distribution over the adjustment actions in the action space.

[0050] S324: Use the process response sample set as the response model of the virtual environment, and generate candidate process trajectories by combining Monte Carlo tree search with the process policy network.

[0051] S325: Sort the generated candidate process trajectories according to their trajectory return values, and select the top 10% of process trajectories by trajectory return value as the set of high-quality trajectories.

[0052] S326: Employ an evolutionary search strategy to perform crossover and mutation operations on the high-quality trajectory set to generate new candidate process trajectories, and iteratively update the high-quality trajectory set until the average trajectory return value no longer significantly improves for multiple consecutive rounds.

[0053] S327: Perform state discretization processing on each process trajectory in the high-quality trajectory set, map continuous process states to discrete process state node numbers, and map continuous adjustment actions to preset discrete action categories.

[0054] S328: Extract state-action pairs from the discretized process trajectory, perform voting statistics on actions with the same state, select the action with the highest frequency as the recommended action for that state, and collect all state-recommended action pairs to obtain the process control candidate rule set.
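The selection and voting steps (S325 and S328) reduce to a sort followed by a per-state majority vote. The sketch below assumes trajectories have already been generated and discretized (S324, S326, S327). The state identifiers and action names are hypothetical, and the policy network, tree search, and evolutionary operators are not modelled.

```python
from collections import Counter, defaultdict


def select_elite(trajectories, fraction=0.10):
    """Keep the top `fraction` of trajectories by return value (S325).

    trajectories: list of (return_value, [(state, action), ...]).
    At least one trajectory is always kept.
    """
    ranked = sorted(trajectories, key=lambda t: t[0], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def derive_rules(elite):
    """Majority-vote the action for each discrete state (S328)."""
    votes = defaultdict(Counter)
    for _, steps in elite:
        for state, action in steps:
            votes[state][action] += 1
    return {state: counter.most_common(1)[0][0] for state, counter in votes.items()}


# Hypothetical discretized trajectories with their return values.
elite_example = [
    (9.0, [("s1", "raise_temp"), ("s2", "hold")]),
    (8.5, [("s1", "raise_temp"), ("s2", "lower_ph")]),
    (8.0, [("s1", "hold"), ("s2", "lower_ph")]),
]
rules = derive_rules(elite_example)
print(rules)
```

When two actions tie for a state, `Counter.most_common` returns the one encountered first, so an explicit tie-breaking rule may be needed before the rules reach the conflict resolution of S33.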

[0055] Furthermore, the steps of generating a process adjustment instruction sequence and performing online optimization include:

[0056] S41: Collect real-time production data and encode it as a description of the current process status;

[0057] S42: Based on the current process status description and the process control decision table, generate a process adjustment instruction sequence;

[0058] S43: Perform online optimization of the preparation process of porcine brain neurotrophic peptides using the process adjustment instruction sequence;

[0059] S44: Establish a production feedback mechanism to achieve adaptive updates of the process control decision table.
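Steps S41 and S42 amount to encoding the live readings as a discrete state and looking that state up in the decision table. The sketch below assumes a state made of the process stage plus binned temperature and pH. The bin edges, the instruction names, and the fallback instruction for uncovered states are illustrative assumptions, and the feedback update of S44 is not modelled.

```python
from bisect import bisect_right


def discretize(value, edges):
    """Return the index of the bin that `value` falls into (edges ascending)."""
    return bisect_right(edges, value)


def encode_state(stage, temperature, ph,
                 temp_edges=(35.0, 40.0, 45.0), ph_edges=(6.5, 7.5)):
    """Encode real-time readings as a discrete process state (S41)."""
    return (stage, discretize(temperature, temp_edges), discretize(ph, ph_edges))


def next_instruction(decision_table, state, default="hold"):
    """Look up the recommended adjustment; fall back to `default` if uncovered (S42)."""
    return decision_table.get(state, default)


# Hypothetical decision table with a single rule.
table = {("enzymatic_extraction", 2, 1): "lower_temperature"}
state = encode_state("enzymatic_extraction", 42.0, 7.0)
instruction = next_instruction(table, state)
print(state, instruction)
```

Because the lookup is a dictionary access, the online cost is constant per decision, which matches the stated aim of turning the optimization strategy into quickly queried rules.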

[0060] A sequence feature-based structure-activity relationship (SAR) mining system for porcine brain neurotrophic peptides is provided to implement the aforementioned sequence feature-based SAR mining method for porcine brain neurotrophic peptides. The system comprises:

[0061] Feature matrix construction module: used to obtain original process and sequencing activity data of multiple batches of pig brain to construct an original sample index table, and extract sequence features based on the original sample index table to generate a neurotrophic peptide sequence feature matrix;

[0062] Graph Neural Network Training Module: Used to construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network;

[0063] Process decision derivation module: Used to generate a process response sample set from historical process records using the trained structure-activity relationship graph neural network, and derive a process control candidate rule set based on the process response sample set to obtain a process control decision table;

[0064] Online optimization execution module: Used to generate a process adjustment instruction sequence from real-time production data using the process control decision table, and to use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.

[0065] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0066] This invention addresses the problem of difficulty in accurately mining structure-activity relationships in the preparation of porcine brain neurotrophic peptides due to the separation of process parameters, sequence information, and activity data. By constructing an original sample index table, it unifies and binds scattered process parameters, peptide sequences, and activity tags by batch and process stage, eliminating the barrier to cross-reference of heterogeneous data. Furthermore, it integrates multi-dimensional features such as amino acid composition, physicochemical properties, and sequence embedding vectors generated by pre-trained protein language models to generate a high-information-density sequence feature matrix. This overcomes the limitations of traditional sequence features that rely solely on basic physicochemical properties, enabling the encoding of evolutionary conservation and functional semantic information of peptide sequences.

[0067] A sequence-process joint graph constructed based on sequence feature matrices connects peptide nodes and process state nodes through heterogeneous edges. Combined with a structure-activity relationship graph neural network with an attention mechanism, it achieves collaborative modeling of peptide sequence similarity and process state correlation. The heterogeneous graph attention layer learns the attention weight matrix to automatically identify the activation effect of process state on specific amino acid site combinations, accurately characterizing the nonlinear interaction effect between sequence and process, thus overcoming the bottleneck of traditional models' inability to capture complex structure-activity relationships.

[0068] By utilizing a trained neural network to evaluate candidate parameters in batches within a virtual process space, a process response sample set covering a broad parameter space is generated. Combined with reinforcement learning and evolutionary search, a process control decision table is derived, overcoming the traditional limitation of relying on numerous real experiments to explore high-dimensional parameter spaces. The decision table solidifies continuous optimization strategies into a set of rules that can be quickly queried, reducing the computational complexity of online decision-making. This enables real-time process regulation based on sequence structure-activity relationships, significantly improving the targeted enrichment efficiency and activity retention stability of target neurotrophic peptides, providing technical support for the intelligent preparation of bioactive peptides based on specific computational models.

Description of the Drawings

[0069] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0070] Figure 1 is a flowchart of the method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features, as described in this invention;

[0071] Figure 2 is a schematic diagram illustrating the construction of the original sample index table in an embodiment of the present invention;

[0072] Figure 3 is a schematic diagram of multidimensional sequence feature splicing in an embodiment of the present invention;

[0073] Figure 4 is a schematic diagram of process state node clustering in an embodiment of the present invention;

[0074] Figure 5 is a schematic diagram of process control decision table lookup in an embodiment of the present invention;

[0075] Figure 6 is a functional block diagram of the structure-activity relationship mining system for porcine brain neurotrophic peptides based on sequence features, as described in this invention.

Detailed Implementation

[0076] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0077] Example 1:

[0078] As shown in Figure 1, this embodiment provides a method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features, including:

[0079] S1: Obtain original process and sequencing activity data of multiple batches of pig brains to construct an original sample index table, and extract sequence features based on the original sample index table to generate a neurotrophic peptide sequence feature matrix.

[0080] This step addresses the problem of fragmented process parameters, sequence information, and activity data in the preparation of porcine brain neurotrophic peptides, which makes it difficult to accurately mine structure-activity relationships. It constructs a unified index system for the original sample index table and then integrates multi-source sequence features to generate a high-information-density neurotrophic peptide sequence feature matrix. This neurotrophic peptide sequence feature matrix, as the final output of S1, will be used in S2 to construct a sequence-process joint graph and train a structure-activity relationship graph neural network.

[0081] Specifically, the process of constructing the original sample index table and generating the neurotrophic peptide sequence feature matrix includes:

[0082] S11: Collect raw data from multiple batches of pig brain production lines and construct a multi-source production data set.

[0083] In this step, the sensors, detection equipment and analytical instruments on the porcine brain neurotrophic peptide preparation production line are used as data sources. Enzymatic hydrolysis parameter acquisition sequences, temperature monitoring acquisition sequences, pH monitoring acquisition sequences, purification condition acquisition sequences, mass spectrometry sequencing acquisition sequences, activity evaluation acquisition sequences and batch timestamp acquisition sequences are obtained through synchronous acquisition, and a multi-source production data set is constructed based on these acquisition sequences. The enzymatic hydrolysis parameter acquisition sequence is the discrete record sequence of process parameters such as enzyme type, enzyme-to-liquid ratio, and solid-liquid ratio in the enzymatic hydrolysis reactor throughout the enzymatic hydrolysis stage, obtained by reading the parameter log of the process control system. The temperature monitoring acquisition sequence is the sequence of temperature readings fed back by temperature sensors arranged in key equipment such as enzymatic hydrolysis tanks, purification columns, and freeze-drying chambers, obtained by periodic reading through a multi-channel temperature acquisition module. The pH monitoring acquisition sequence is the sequence of pH readings output by online pH electrodes, reflecting the changing trend of the reaction environment during enzymatic hydrolysis and purification. The purification condition acquisition sequence is the time-series record of purification process parameters such as ultrafiltration membrane molecular weight cutoff, chromatographic column elution gradient program, and flow rate. The mass spectrometry sequencing acquisition sequence is the set of peptide sequences and their abundance information output after sequencing each batch of samples by liquid chromatography-tandem mass spectrometry. The activity evaluation acquisition sequence is the activity score data obtained by nerve cell viability detection or in vitro neurite growth promotion experiments. These acquisition sequences are combined according to production line number, batch number, process stage and sampling time to obtain a multi-source production data set with batch and time as joint primary keys, which provides the basic data source for the subsequent construction of the original sample index table.

[0084] S12: Based on a multi-source production data set, construct an original sample index table using a batch process stage binding method.

[0085] Specifically, the process flow of the batch process stage binding method is as follows:

[0086] S121: Determine the batch time axis based on the batch timestamp acquisition sequence. The batch time axis is a set of time intervals that divides a single production batch into process stages, including the raw material pretreatment stage, enzymatic extraction stage, multi-stage purification stage, and activity protection stage, and serves as a unified time reference for aligning all process parameters and detection data.

[0087] S122: Perform a time interval matching operation on the enzymatic hydrolysis parameter acquisition sequence, temperature monitoring acquisition sequence, pH monitoring acquisition sequence, and purification condition acquisition sequence in the multi-source production data set according to the batch time axis. Each process parameter record is mapped to its corresponding batch number and process stage, yielding a process parameter binding record set. The time interval matching operation assigns a process parameter record to a process stage according to which time interval of the batch time axis its timestamp falls into.

[0088] S123: Match each peptide sequence and its abundance information in the mass spectrometry sequencing acquisition sequence against the batch time axis according to its sampling timestamp, and associate the peptide sequence with the corresponding batch number and process stage to obtain a peptide batch association record set.

[0089] S124: Associate each activity score in the activity evaluation acquisition sequence with the peptide batch association record set according to the batch source identifier of the test sample, and attach the corresponding activity tag to each peptide record to obtain a peptide record set with activity tags.

[0090] S125: The process parameter binding record set and the peptide record set with activity tags are joined using batch number and process stage as keys to generate the original sample index table. The original sample index table is a two-dimensional relational table in which each row represents a unique peptide sample record. Its column fields include batch number, process stage, sampling time, peptide sequence, peptide abundance, enzymatic hydrolysis temperature, enzymatic hydrolysis pH, enzymatic hydrolysis time, solid-liquid ratio, enzyme-to-substrate ratio, purification molecular weight cutoff, elution gradient parameters, and activity score. The number of rows is determined by the total number of non-redundant peptides detected in all batches, and the number of columns is determined by the sum of the process parameter dimension and the detection index dimension. Figure 2 is a schematic diagram of the construction of the original sample index table in this embodiment. As shown in Figure 2, the diagram illustrates the core data processing logic for resolving the data silos caused by multiple heterogeneous sources in the preparation of porcine brain neurotrophic peptides. The left side shows three independent data sources: process parameter logs recording the physical environment (such as temperature and time), mass spectrometry sequencing libraries characterizing molecular structures (peptide sequences), and activity detection reports reflecting the final efficacy. In actual production, these three types of data are often stored separately, making it difficult to analyze the impact of specific processes on the activity of specific sequences. The converging arrows in the middle of the diagram represent the batch process stage binding method, which uses batch number and process stage as a unified primary key to align data rows from different sources along the time dimension. The original sample index table generated on the right is a wide table. Each row contains the sequence information of a peptide and binds it to the specific process environment during its production (such as an enzymatic hydrolysis temperature of 42.0℃) and the final activity score. This structured data fusion removes the obstacle of data fragmentation for the subsequent step S15 of constructing the feature matrix, so that the subsequently trained model can learn the intrinsic relationship between process conditions, sequence structure, and biological activity.

[0091] Specifically, in the preparation of porcine brain neurotrophic peptides, process parameters are recorded separately in the control system log, sequence information is stored in the mass spectrometry database, and activity data is saved in the test report. These three types of data are independent and lack a unified index. When it is necessary to analyze under what process conditions a highly active peptide was produced, it is necessary to manually compare timestamps and batch numbers one by one, which is inefficient and prone to errors. By constructing an original sample index table, using batch number, process stage, and sampling time as unified index keys, the three types of heterogeneous data—process parameters, peptide sequences, and activity tags—are bound together in the same row. This eliminates the obstacles to correlation analysis caused by data fragmentation and provides a structured data foundation for subsequent structure-activity relationship mining.
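The batch process stage binding of S121 to S125 can be sketched for a single batch as interval matching followed by a join on process stage. Timestamps are simplified to minutes from batch start, intervals are treated as half-open, and the last parameter reading in a stage is kept. These choices, the field names, and the sample values (apart from the 42.0℃ hydrolysis temperature mentioned for Figure 2) are assumptions for illustration.

```python
def stage_of(timestamp, timeline):
    """Return the stage whose half-open interval [start, end) contains timestamp."""
    for stage, start, end in timeline:
        if start <= timestamp < end:
            return stage
    return None


def build_index_table(batch, timeline, parameter_records, peptide_records,
                      activity_by_batch):
    """Bind parameters, peptides and activity scores into one row per peptide."""
    # S122: map each process parameter record to its stage (last reading wins).
    params_by_stage = {}
    for ts, name, value in parameter_records:
        stage = stage_of(ts, timeline)
        if stage is not None:
            params_by_stage.setdefault(stage, {})[name] = value
    rows = []
    for ts, sequence, abundance in peptide_records:
        stage = stage_of(ts, timeline)  # S123: bind the peptide to its stage
        if stage is None:
            continue
        row = {"batch": batch, "stage": stage, "sampling_time": ts,
               "peptide": sequence, "abundance": abundance,
               "activity": activity_by_batch.get(batch)}  # S124: activity tag
        row.update(params_by_stage.get(stage, {}))  # S125: join on batch + stage
        rows.append(row)
    return rows


timeline = [("pretreatment", 0, 60), ("enzymatic_extraction", 60, 240),
            ("purification", 240, 400), ("activity_protection", 400, 480)]
parameters = [(90, "temperature_c", 42.0), (95, "ph", 7.2), (300, "flow_rate", 1.5)]
peptides = [(200, "GPAGPK", 0.12), (350, "LLVDK", 0.05)]  # hypothetical sequences
index_rows = build_index_table("B001", timeline, parameters, peptides, {"B001": 0.83})
print(index_rows[0])
```

Keeping only the last reading per stage discards the time series within a stage. Aggregating to a mean or a trajectory summary would be an equally plausible design.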

[0092] S13: Based on the original sample index table, perform noise reduction and redundancy elimination processing on the mass spectrometry peptide sequences to obtain a set of non-redundant peptide sequences.

[0093] Specifically, the noise reduction and redundancy elimination process is as follows:

[0094] S131, Extract all peptide sequence fields from the original sample index table to construct the original peptide sequence list.

[0095] S132, For each peptide sequence in the original peptide sequence list, calculate its mass spectrometry matching confidence score. Specifically, the mass spectrometry matching confidence score is calculated using the MaxQuant mass spectrometry interpretation software. The input to MaxQuant is the raw mass spectrometry file output by the liquid chromatography-tandem mass spectrometer, which is a binary spectrum file containing the precursor ion mass-to-charge ratios, the fragment ion mass-to-charge ratios, and their intensities. The output of MaxQuant is a peptide identification result table in text format, with each row containing fields such as peptide sequence, posterior error probability score, Andromeda score, and number of fragment ion matches. The mass spectrometry matching confidence score is the negative logarithm of the posterior error probability score; the lower the posterior error probability score, the higher the mass spectrometry matching confidence score. A mass spectrometry matching confidence threshold is set. This threshold is determined through a target-decoy database search strategy based on false positive rate control requirements, as follows: a decoy sequence database of similar size to the target protein database is constructed; the mass spectrometry data is searched against both the target and decoy databases simultaneously; the ratio of decoy hits to target hits at each candidate confidence threshold is used as an estimate of the false positive rate; and the confidence score at which the estimated false positive rate equals a preset false positive rate tolerance is selected as the mass spectrometry matching confidence threshold. This preset false positive rate tolerance is determined according to commonly accepted standards in the field of mass spectrometry identification. Peptide sequences whose mass spectrometry matching confidence scores are lower than the threshold are marked as low-confidence sequences and removed, resulting in a list of high-confidence peptide sequences.
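The threshold selection in S132 may be sketched as follows. This is a minimal illustration and not the MaxQuant procedure itself: the base-10 logarithm, the function names, the sample probabilities, and the rule of taking the lowest cutoff whose decoy-to-target ratio does not exceed the tolerance are all assumptions introduced here.

```python
import math


def confidence(pep):
    """Confidence score as the negative base-10 log of the posterior error probability."""
    return -math.log10(max(pep, 1e-300))


def pick_threshold(target_scores, decoy_scores, tolerance):
    """Lowest score cutoff whose decoy/target hit ratio is within the tolerance."""
    for cutoff in sorted(set(target_scores)):
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_target and n_decoy / n_target <= tolerance:
            return cutoff
    return None  # no cutoff satisfies the tolerance


# Hypothetical posterior error probabilities for target and decoy hits.
target_scores = [confidence(p) for p in (0.5, 0.2, 0.05, 0.01, 0.001, 0.0001)]
decoy_scores = [confidence(p) for p in (0.6, 0.3, 0.1)]
cutoff = pick_threshold(target_scores, decoy_scores, tolerance=0.25)
high_confidence = [s for s in target_scores if s >= cutoff]
```

With these illustrative values, the loosest cutoff leaves two decoy hits against six target hits, which exceeds the tolerance, so the next cutoff is chosen and one low-confidence target hit is removed.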

[0096] S133, perform sequence clustering on the list of high-confidence peptide sequences, grouping peptides whose sequence similarity is higher than a preset sequence similarity threshold into the same cluster, and taking the peptide with the highest abundance in each cluster as the representative sequence. The sequence similarity is calculated from the normalized edit distance.

[0097] In a preferred embodiment, the formula for calculating the sequence similarity is as follows:

[0098] $\mathrm{Sim}(A,B) = 1 - \dfrac{D_{\mathrm{edit}}(A,B)}{\max(L_A,\, L_B)}$;

[0099] where $\mathrm{Sim}(A,B)$ represents the sequence similarity between peptide sequence A and peptide sequence B, with values ranging from 0 to 1; $D_{\mathrm{edit}}(A,B)$ represents the edit distance between peptide sequence A and peptide sequence B, which is the minimum number of single-character edit operations required to convert sequence A into sequence B, where single-character edit operations include insertion, deletion, and replacement; $L_A$ indicates the number of amino acid residues in peptide sequence A; $L_B$ indicates the number of amino acid residues in peptide sequence B; and $\max(L_A, L_B)$ indicates taking the larger of the two sequence lengths.

[0100] The preset sequence similarity threshold is determined as follows: based on the typical frequency of functionally equivalent amino acid substitutions given by a conservative amino acid substitution matrix, and in combination with the peptide length distribution, the maximum edit distance ratio that still preserves biological functional equivalence is calculated, and the complement of this ratio (one minus the ratio) is used as the preset sequence similarity threshold.
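The similarity measure and the clustering of S133 can be illustrated with the following sketch. The embodiment does not name a clustering algorithm, so the greedy scheme below (visiting peptides in descending abundance so that each cluster's first member is its most abundant one) is an assumption, as are the function names, the sample peptides, and the threshold of 0.8.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insertion, deletion, and replacement."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # replacement or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """One minus the edit distance divided by the longer sequence length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_representatives(abundance, threshold):
    """Greedy clustering (assumed scheme): a peptide joins the first representative
    it is similar enough to, otherwise it starts a new cluster."""
    reps = []
    for seq in sorted(abundance, key=abundance.get, reverse=True):
        if not any(similarity(seq, rep) > threshold for rep in reps):
            reps.append(seq)
    return reps


# Hypothetical peptides and abundances.
abundance = {"KPVAAG": 900.0, "KPVAAGL": 120.0, "KPVSAG": 60.0, "DEEQW": 300.0}
representatives = cluster_representatives(abundance, threshold=0.8)
```

In this example the two variants of `KPVAAG` differ from it by a single edit and are absorbed into its cluster, while `DEEQW` forms a cluster of its own.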

[0101] S134: Collect the representative sequences of all clusters to obtain a non-redundant peptide sequence set, map the non-redundant peptide sequence set back to the corresponding rows of the original sample index table, and update the peptide sequence field of the original sample index table to the representative sequence identifier.

[0102] S14: Based on a set of non-redundant peptide sequences, extract multi-dimensional sequence features to generate a set of sequence feature vectors.

[0103] Specifically, the process for extracting multidimensional sequence features is as follows:

[0104] S141, For each peptide sequence in the non-redundant peptide sequence set, its amino acid composition features are calculated. The amino acid composition feature is a vector of length 20, and each component of the vector corresponds to the frequency of one standard amino acid in the peptide sequence, obtained by dividing the number of occurrences of that amino acid by the total length of the peptide.

[0105] S142, For each peptide sequence in the non-redundant peptide sequence set, calculate its physicochemical properties. These properties include molecular weight, isoelectric point, hydrophobicity index, net charge number, and instability index. The molecular weight is obtained by summing the molecular weights of the amino acids in the peptide and subtracting the mass of the water molecules lost in forming the peptide bonds; the isoelectric point is obtained by iteratively solving, using a bisection method, for the pH value at which the net charge of the peptide is zero; the hydrophobicity index is calculated as the weighted average over an amino acid hydrophobicity scale; the net charge number is obtained as the difference between the numbers of basic and acidic amino acids; and the instability index is obtained by summing the instability weights of adjacent amino acid pairs in the peptide.
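Two of the properties in S142 can be sketched as follows: the bisection solution for the isoelectric point and the net charge number. The pKa values are illustrative assumptions (the embodiment does not name a pKa set), as is the Henderson-Hasselbalch charge model, the counting of K, R, and H as basic and D and E as acidic, and all function names.

```python
# Illustrative pKa values (assumption; the source does not specify a pKa set).
PKA_POSITIVE = {"N_TERM": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_TERM": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge_at_ph(seq, ph):
    """Henderson-Hasselbalch net charge of a peptide at the given pH."""
    charge = 0.0
    for group, pka in PKA_POSITIVE.items():
        n = 1 if group == "N_TERM" else seq.count(group)
        charge += n / (1.0 + 10 ** (ph - pka))
    for group, pka in PKA_NEGATIVE.items():
        n = 1 if group == "C_TERM" else seq.count(group)
        charge -= n / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection on pH; the net charge decreases monotonically as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge_at_ph(seq, mid) > 0:
            lo = mid   # still positively charged: the zero crossing is at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def net_charge_number(seq):
    """Count of basic residues (K, R, H) minus count of acidic residues (D, E)."""
    return sum(seq.count(a) for a in "KRH") - sum(seq.count(a) for a in "DE")


pi_basic = isoelectric_point("KPVAAG")   # hypothetical basic peptide
pi_acidic = isoelectric_point("DEEQW")   # hypothetical acidic peptide
```

The bisection relies only on the sign of the net charge at the midpoint, so any other charge model could be substituted without changing the search loop.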

[0106] S143. For each peptide sequence in the non-redundant peptide sequence set, calculate its dipeptide frequency features and tripeptide frequency features. The dipeptide frequency feature is a vector of length 400, where each component corresponds to the frequency of one dipeptide combination in the peptide; the tripeptide frequency feature is a vector of length 8000, where each component corresponds to the frequency of one tripeptide combination in the peptide, and the frequency of any tripeptide combination that cannot be counted because the peptide is too short is recorded as zero.
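The composition, dipeptide, and tripeptide vectors of S141 and S143 are all k-mer frequency vectors with k equal to 1, 2, and 3, which the following sketch computes with one helper. Dividing each count by the number of sliding windows is an assumption for k greater than 1 (for k equal to 1 it coincides with dividing by the peptide length, as stated in S141); the alphabet ordering and the function name are also illustrative.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by letter


def kmer_frequencies(seq, k):
    """Frequency vector over all 20**k k-mers; all zeros if the peptide is shorter than k."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    windows = len(seq) - k + 1
    if windows <= 0:
        return vec
    for i in range(windows):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # windows containing non-standard residues are skipped
            vec[pos] += 1.0 / windows
    return vec


composition = kmer_frequencies("KPVAAG", 1)   # length 20
dipeptides = kmer_frequencies("KPVAAG", 2)    # length 400
tripeptides = kmer_frequencies("KP", 3)       # length 8000, all zeros
```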

[0107] S144, for each peptide sequence in the non-redundant peptide sequence set, a secondary structure prediction model is invoked to generate secondary structure features. Specifically, the secondary structure prediction model uses the PSIPRED secondary structure predictor. The input data of the PSIPRED secondary structure predictor is the single-letter symbol sequence of the peptide's amino acids, and the input data structure is a string whose length equals the number of amino acid residues in the peptide. The output data of the PSIPRED secondary structure predictor is the secondary structure probability distribution for each amino acid site, and the output data structure is a two-dimensional floating-point matrix, with the number of rows equal to the peptide length and the number of columns being 3. The three columns correspond to the probabilities of helix structures, β-sheet (strand) structures, and random coil structures, respectively. The proportions of sites of each structure type are summarized to obtain a secondary structure proportion vector of length 3, whose three components are the proportions of sites predicted as helix structures, β-sheet structures, and random coil structures, respectively.
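The summarization of the per-site probability matrix into a proportion vector may be sketched as below. The sketch does not call PSIPRED; it takes a hypothetical probability matrix as input, and assigning each site to its highest-probability class is an assumption about how a site is counted as "predicted" to be of a given type.

```python
def structure_proportions(prob_matrix):
    """prob_matrix: one [p_helix, p_sheet, p_coil] row per residue."""
    counts = [0, 0, 0]
    for row in prob_matrix:
        # Assumption: each site is assigned to its most probable class.
        counts[max(range(3), key=lambda c: row[c])] += 1
    n = len(prob_matrix)
    return [c / n for c in counts] if n else [0.0, 0.0, 0.0]


# Hypothetical predictor output for a four-residue peptide.
probs = [[0.8, 0.1, 0.1],
         [0.7, 0.2, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.2, 0.7]]
proportions = structure_proportions(probs)
```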

[0108] S145, for each peptide sequence in the non-redundant peptide sequence set, a disorder region prediction model is invoked to generate disorder region features. Specifically, the disorder region prediction model uses the IUPred2A disorder region predictor. The input data of the IUPred2A disorder region predictor is the single-letter symbol sequence of the peptide's amino acids, and the input data structure is a string. The output data of the IUPred2A disorder region predictor is the disorder tendency score for each amino acid site, and the output data structure is a one-dimensional floating-point vector with a vector length equal to the peptide length. Each component has a value ranging from 0 to 1, and sites with a score greater than 0.5 are determined to be disordered residues. The disorder region features include two components: the proportion of disordered amino acids and the length of the longest continuous disordered region. The proportion of disordered amino acids is the number of disordered residues divided by the total peptide length, and the length of the longest continuous disordered region is the maximum number of consecutive disordered residues in the peptide.
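The two disorder region features of S145 can be derived from the per-site score vector as in the following sketch. The sketch does not call IUPred2A; the score vector and the function name are hypothetical, while the 0.5 cutoff is the one stated above.

```python
def disorder_features(scores, cutoff=0.5):
    """Return (proportion of disordered residues, longest continuous disordered run)."""
    flags = [s > cutoff for s in scores]
    longest = run = 0
    for is_disordered in flags:
        run = run + 1 if is_disordered else 0
        longest = max(longest, run)
    ratio = sum(flags) / len(flags) if flags else 0.0
    return ratio, longest


# Hypothetical per-residue disorder tendency scores.
ratio, longest_run = disorder_features([0.2, 0.6, 0.7, 0.4, 0.9, 0.8, 0.95, 0.1])
```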

[0109] S146, for each peptide sequence in the non-redundant peptide sequence set, a pre-trained protein language model is invoked to generate a sequence embedding vector. Specifically, the pre-trained protein language model is the ESM-2 protein language model. The input data of the ESM-2 protein language model is the single-letter symbol sequence of the peptide's amino acids, with a string structure and a maximum sequence length of 1024 amino acid residues. Sequences exceeding the maximum length are truncated. The output data of the ESM-2 protein language model is a fixed-dimensional sequence embedding vector, with a one-dimensional floating-point vector structure and a vector dimension of 1280. This vector is obtained by averaging the hidden states of the last layer of the model along the sequence length dimension. The sequence embedding vector can encode the global semantic information and evolutionary conservation patterns of the peptide sequence.

[0110] S147, sequentially concatenate the amino acid composition features, physicochemical property features, dipeptide frequency features, tripeptide frequency features, secondary structure proportion vector, disorder region features, and sequence embedding vector to obtain a complete sequence feature vector for each peptide. The complete sequence feature vectors of all peptides are then aggregated to form a sequence feature vector set. Figure 3 is a schematic diagram of multidimensional sequence feature concatenation provided in the embodiments of this application.

[0111] As shown in Figure 3, the diagram details the generation process of neurotrophic peptide sequence feature vectors. Starting from a specific peptide sequence (e.g., Lys-Pro-Val-Ala-Ala-Gly), four main feature extraction paths branch downwards. These include not only traditional statistical features (such as the 20-dimensional amino acid composition and the physicochemical properties) but also, more importantly, the sequence embedding vector (1280 dimensions) shown in the right-hand branch. This feature comes from the ESM-2 protein language model, which can capture the evolutionary information and contextual semantics implicit in peptide sequences, overcoming the insensitivity of traditional physicochemical features to sequence order. The bottom of the diagram shows the convergence and concatenation of all feature branches, which forms a high-dimensional complete sequence feature vector that is further combined with process parameters and activity tags in step S15. This dense feature representation allows the model not only to capture the chemical composition of peptides but also to represent their deeper biological semantics, so that key sequence fragments with neuroprotective potential can be identified more accurately in structure-activity relationship mining.

[0112] S15: Construct a neurotrophic peptide sequence feature matrix based on the sequence feature vector set and the original sample index table.

[0113] Specifically, the process for constructing the neurotrophic peptide sequence feature matrix is as follows:

[0114] S151. Extract the process parameter fields corresponding to each peptide record from the original sample index table, including enzymatic hydrolysis temperature, enzymatic hydrolysis pH, enzymatic hydrolysis time, solid-liquid ratio, enzyme-to-base ratio, purification molecular weight cutoff, and elution gradient parameters. Perform linear normalization on each process parameter field to map its numerical range to the interval between zero and one, and obtain the process parameter encoding vector.

[0115] In a preferred embodiment, the calculation formula for the linear normalization process is as follows:

[0116] $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$;

[0117] where $x'$ represents the normalized process parameter value, with values ranging from 0 to 1; $x$ represents the original process parameter value; $x_{\min}$ indicates the minimum value of the process parameter field in the original sample index table; $x_{\max}$ indicates the maximum value of the process parameter field in the original sample index table; and $x_{\max} - x_{\min}$ indicates the range of the field.
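A minimal sketch of the linear normalization applied to one process parameter field is given below. The function name and the sample temperatures are hypothetical, and mapping a constant field (zero range) to 0 is an assumption, since the embodiment does not address that case.

```python
def min_max_normalize(values):
    """Map a process parameter field linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # assumption: a constant field maps to 0
    return [(v - lo) / span for v in values]


# Hypothetical enzymatic hydrolysis temperatures in degrees Celsius.
normalized = min_max_normalize([37.0, 42.0, 50.0, 55.0])
```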

[0118] S152, extract the activity score field corresponding to each peptide record from the original sample index table, and normalize the activity score according to the two dimensions of nerve cell survival rate and neurite growth promotion rate to obtain the activity tag vector.

[0119] S153, the complete sequence feature vector, the corresponding process parameter encoding vector, and the activity tag vector of each peptide in the sequence feature vector set are concatenated column by column to form the comprehensive feature row vector of the peptide.

[0120] S154: Stack the comprehensive feature row vectors of all peptides to generate a neurotrophic peptide sequence feature matrix. The data structure of the neurotrophic peptide sequence feature matrix is a two-dimensional numerical matrix. The number of rows equals the total number of peptides in the non-redundant peptide sequence set, and the number of columns equals the sum of the dimensions of the complete sequence feature vector, the process parameter encoding vector, and the activity tag vector. Each row of the neurotrophic peptide sequence feature matrix corresponds to one peptide sample, and each column corresponds to one feature dimension. The elements in the matrix take values that are continuous real numbers between zero and one.

[0121] Specifically, in traditional studies of the structure-activity relationship of neurotrophic peptides in the porcine brain, sequence feature extraction is often limited to amino acid composition or simple physicochemical properties, making it difficult to capture high-order semantic information and local fragment patterns of peptide sequences. By introducing sequence embedding vectors generated by the ESM-2 protein language model, the implicit information of peptide sequences at the evolutionary and functional levels can be encoded into dense vector representations, complementing traditional physicochemical features. Simultaneously, by uniformly encoding process parameters and activity tags with sequence features into the same matrix structure, subsequent graph neural networks can simultaneously learn the synergistic effects of sequences and processes within a single input space. This avoids the information loss caused by separately modeling and then fusing multimodal data, laying a high-dimensional feature expression foundation for accurately mining the structure-activity relationship of neurotrophic peptides.

[0122] S2: Construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network.

[0123] This step addresses the challenge of capturing the complex nonlinear relationship between neurotrophic peptide sequence characteristics and process conditions using traditional regression models. It unifies the modeling of peptide sequences and process conditions into a heterogeneous graph structure and employs a graph neural network with an attention mechanism for structure-activity relationship learning. The trained structure-activity relationship graph neural network, as the final output of S2, will be invoked in S3 to generate a process response sample set and derive a process control decision table.

[0124] Specifically, the process of constructing the sequence-process joint graph and training the structure-activity relationship graph neural network includes:

[0125] S21: Construct a peptide sequence similarity subgraph based on the neurotrophic peptide sequence feature matrix.

[0126] Specifically, the process for constructing the peptide sequence similarity subgraph is as follows:

[0127] S211, extract the complete sequence feature vector part corresponding to each row from the neurotrophic peptide sequence feature matrix to construct the sequence feature submatrix.

[0128] S212: Calculate the cosine similarity of the complete sequence feature vectors of any two peptides in the sequence feature submatrix.

[0129] In a preferred embodiment, the formula for calculating the cosine similarity is as follows:

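[0130] $\cos(\mathbf{v}_i, \mathbf{v}_j) = \dfrac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert_2 \, \lVert \mathbf{v}_j \rVert_2}$;

[0131] where $\cos(\mathbf{v}_i, \mathbf{v}_j)$ represents the cosine similarity between the complete sequence feature vectors of peptide i and peptide j, with values ranging from -1 to 1; $\mathbf{v}_i$ represents the complete sequence feature vector of peptide i; $\mathbf{v}_j$ represents the complete sequence feature vector of peptide j; $\mathbf{v}_i \cdot \mathbf{v}_j$ represents the dot product of the two vectors; $\lVert \mathbf{v}_i \rVert_2$ represents the L2 norm of vector $\mathbf{v}_i$; and $\lVert \mathbf{v}_j \rVert_2$ represents the L2 norm of vector $\mathbf{v}_j$.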

[0132] S213, Set a sequence similarity adjacency threshold. This threshold is determined based on the distribution histogram of cosine similarity between all peptide pairs in the sequence feature submatrix. Specifically, it is determined by: calculating the cosine similarity values of all peptide pairs and constructing a distribution histogram; computing the cumulative distribution function of the histogram; and taking the similarity value at which the cumulative probability reaches a preset nearest neighbor ratio as the sequence similarity adjacency threshold. The preset nearest neighbor ratio is determined by: evaluating, through cross-validation on the validation dataset, the impact of different nearest neighbor ratios on the performance of the downstream structure-activity relationship prediction task; selecting the nearest neighbor ratio that optimizes the prediction performance on the validation set as the preset nearest neighbor ratio; and ensuring that the number of edges corresponding to this ratio is within the acceptable range of computational resources.

[0133] S214. For each peptide segment in the sequence feature sub-matrix, establish an undirected edge between it and other peptide segments whose cosine similarity is greater than or equal to the sequence similarity adjacency threshold, and set the edge weight to the corresponding cosine similarity value.

[0134] S215, construct a peptide sequence similarity subgraph by treating all peptide segments as nodes and all established undirected edges as edge sets. The data structure of the peptide sequence similarity subgraph is an undirected weighted graph. The number of elements in the node set is equal to the total number of peptide segments in the non-redundant peptide sequence set. The number of elements in the edge set is determined by the sequence similarity adjacency threshold. The weight attribute attached to each edge is the cosine similarity value of the corresponding peptide pair.
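Steps S212 to S215 can be sketched as follows. Reading the preset nearest neighbor ratio as the fraction of most-similar peptide pairs to keep is an interpretive assumption, and the function names, the two-dimensional feature vectors, and the ratio of one third are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_edges(features, keep_ratio):
    """Keep roughly the top `keep_ratio` fraction of peptide pairs as weighted edges."""
    pairs = [(i, j, cosine(features[i], features[j]))
             for i, j in combinations(range(len(features)), 2)]
    ranked = sorted((s for _, _, s in pairs), reverse=True)
    n_keep = max(1, round(keep_ratio * len(ranked)))
    threshold = ranked[n_keep - 1]          # adjacency threshold
    edges = [(i, j, s) for i, j, s in pairs if s >= threshold]
    return threshold, edges


# Hypothetical low-dimensional feature vectors for four peptides.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
threshold, edges = build_similarity_edges(features, keep_ratio=1 / 3)
```

Each returned tuple is an undirected edge `(i, j, weight)` whose weight is the cosine similarity of the pair; ties at the threshold are all kept.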

[0135] S22: Construct a set of process state nodes based on the process parameter encoding vectors in the neurotrophic peptide sequence feature matrix.

[0136] Specifically, the process for constructing the process state node set is as follows:

[0137] S221, extract the process parameter encoding vector part corresponding to each row from the neurotrophic peptide sequence feature matrix to construct the process parameter submatrix.

[0138] S222, perform density clustering on the process parameter submatrix to group records that are close in distance in the process parameter space into the same cluster. Specifically, the density clustering operation uses the DBSCAN density clustering algorithm. The input data of the DBSCAN density clustering algorithm is the process parameter submatrix, with an input data structure of a two-dimensional floating-point matrix. The number of rows in the matrix equals the number of peptide samples, and the number of columns equals the dimension of the process parameters. The output data of the DBSCAN density clustering algorithm is the cluster label for each sample point. The output data structure is a one-dimensional integer vector, with a vector length equal to the number of samples, and each component is a cluster number or a noise point label. The DBSCAN density clustering algorithm uses Euclidean distance as the distance metric. The neighborhood radius parameter is determined as follows: the distance from each sample point in the process parameter submatrix to its k-th nearest neighbor is calculated, and the distances are plotted in ascending order to create a k-distance curve. The distance value corresponding to the inflection point of the k-distance curve is taken as the neighborhood radius parameter, where k is twice the dimension of the process parameters. The minimum sample size parameter is determined as follows: based on the number of production batches and the desired clustering granularity, the quotient of the total number of production batches divided by the desired number of clusters is taken as the minimum sample size parameter. Figure 4 is a schematic diagram of process state node clustering provided in the embodiments of this application. As shown in Figure 4, the diagram illustrates how continuously varying production process parameters are transformed into discrete graph nodes. The coordinate system in the figure represents the process parameter space (e.g., the horizontal axis represents the enzymatic hydrolysis pH, and the vertical axis represents the enzymatic hydrolysis temperature), and the scattered points represent the actual process records collected in historical batches. Because sensor data is continuous and fluctuates, directly using the raw data to construct a graph model would cause an explosion in the number of nodes and a sparse graph. The figure delineates three main clusters using dashed ellipses of different colors, corresponding to three typical process state nodes (Node1, Node2, Node3): high-temperature weak-acid, medium-temperature neutral, and low-temperature weak-alkali, respectively. This processing performs noise reduction and state abstraction, compressing an effectively unbounded number of parameter combinations into a finite number of process states. In the subsequent construction of the sequence-process joint graph in S23, these nodes serve as hubs connecting different peptides, enabling the model to learn the general rules of peptide enrichment for a type of process state (rather than for a single isolated time point), which improves the generalization ability and computational efficiency of the structure-activity relationship model.
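The k-distance curve used to choose the neighborhood radius may be sketched as below. The embodiment does not state how the inflection point is located; taking the value just before the largest jump in the sorted curve is a crude stand-in introduced here purely for illustration, and the function names and sample points are hypothetical.

```python
import math


def k_distance_curve(points, k):
    """Ascending list of each point's distance to its k-th nearest neighbour."""
    curve = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(dists[k - 1])
    return sorted(curve)


def elbow_by_largest_jump(curve):
    """Crude inflection estimate (assumption): the value just before the largest jump."""
    jumps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    return curve[jumps.index(max(jumps))]


# Hypothetical normalized (pH, temperature) records: two dense groups and one outlier.
group = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
points = group + [(x + 1.0, y + 1.0) for x, y in group] + [(3.0, 3.0)]
curve = k_distance_curve(points, k=4)   # k = 2 x dimension, as stated above
radius = elbow_by_largest_jump(curve)
```

In practice the curve would usually be inspected visually or with a dedicated knee-detection method before fixing the radius.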

[0139] S223, define each cluster as a process state node, calculate the mean vector of all process parameter encoding vectors in the cluster, and use it as the node attribute vector of the process state node.

[0140] S224. For process parameter records that are identified as noise points by density clustering, assign them to the nearest cluster to ensure that every process parameter record has a corresponding process state node to which it belongs.

[0141] S225, collect all process state nodes and their node attribute vectors to obtain a process state node set. The data structure of the process state node set is a node list, where each node contains two fields: a process state number and a node attribute vector. The dimension of the node attribute vector is equal to the dimension of the process parameter encoding vector.
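Steps S223 to S225 can be illustrated with the following sketch, which takes cluster labels of the kind a density clustering step would output. Measuring "nearest cluster" by Euclidean distance to the cluster mean vector is an assumption, as are the noise label of -1, the function name, and the sample vectors.

```python
import math


def build_state_nodes(vectors, labels, noise_label=-1):
    """Return (centroids, final_labels); noise points go to the nearest centroid."""
    groups = {}
    for vec, lab in zip(vectors, labels):
        if lab != noise_label:
            groups.setdefault(lab, []).append(vec)
    # Node attribute vector of each process state node: mean of its member vectors.
    centroids = {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for lab, vecs in groups.items()}
    final_labels = [
        lab if lab != noise_label
        else min(centroids, key=lambda c: math.dist(vec, centroids[c]))
        for vec, lab in zip(vectors, labels)
    ]
    return centroids, final_labels


# Hypothetical normalized process parameter vectors and clustering labels.
vectors = [[0.1, 0.2], [0.3, 0.2], [0.9, 0.8], [0.7, 0.8], [0.6, 0.6]]
labels = [0, 0, 1, 1, -1]
centroids, final_labels = build_state_nodes(vectors, labels)
```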

[0142] S23: Construct a sequence-process joint graph based on peptide sequence similarity subgraphs and process state node sets.

[0143] Specifically, the process for constructing the sequence-process joint graph is as follows:

[0144] S231, retain all peptide nodes in the peptide sequence similarity subgraph, and add all process state nodes in the process state node set to the node set, forming a heterogeneous node set containing both peptide nodes and process state nodes.

[0145] S232, for each row of peptide records in the neurotrophic peptide sequence feature matrix, determine the process state node to which its process parameter encoding vector belongs, and establish an association edge between the peptide node and the corresponding process state node. The weight of the association edge is set to the normalized abundance value of the peptide in the corresponding process state, and the normalized abundance value is the abundance of the peptide divided by the sum of the abundances of all peptides in the same process state.

[0146] S233. For all peptide nodes associated with the same process state node, calculate the mean value of the activity tag vector of these peptides, and use this mean value as the activity association attribute of the process state node to characterize the overall contribution of the process state to neurotrophic activity.

[0147] S234, merge the sequence similarity edges and peptide-process state association edges in the peptide sequence similarity subgraph to form a heterogeneous edge set. The heterogeneous edge set contains two types of edges: sequence similarity edges between peptide nodes and association edges between peptide nodes and process state nodes.

[0148] S235, combine the heterogeneous node set and the heterogeneous edge set to obtain the sequence-process joint graph. The data structure of the sequence-process joint graph is a heterogeneous graph, containing two types of nodes (peptide nodes and process state nodes) and two types of edges (sequence similarity edges and association edges). Each node is accompanied by a node attribute vector, and each edge is accompanied by an edge weight attribute.
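The association edges of S232 and the activity association attribute of S233 can be sketched as follows. The record layout (`peptide`, `state`, `abundance`, `activity`), the function name, and the sample values are hypothetical.

```python
from collections import defaultdict


def association_edges(records):
    """Return (edges, state_activity) for peptide-to-process-state association.

    edges: (peptide, state, abundance normalized within the state).
    state_activity: mean activity tag vector of the peptides in each state.
    """
    totals = defaultdict(float)
    by_state = defaultdict(list)
    for r in records:
        totals[r["state"]] += r["abundance"]
        by_state[r["state"]].append(r["activity"])
    edges = [(r["peptide"], r["state"], r["abundance"] / totals[r["state"]])
             for r in records]
    state_activity = {s: [sum(col) / len(vs) for col in zip(*vs)]
                      for s, vs in by_state.items()}
    return edges, state_activity


# Hypothetical records; activity = [survival rate, neurite growth promotion rate].
records = [
    {"peptide": "p1", "state": "S1", "abundance": 300.0, "activity": [0.8, 0.6]},
    {"peptide": "p2", "state": "S1", "abundance": 100.0, "activity": [0.4, 0.2]},
    {"peptide": "p3", "state": "S2", "abundance": 50.0, "activity": [0.5, 0.5]},
]
edges, state_activity = association_edges(records)
```

Together with the sequence similarity edges, these tuples form the heterogeneous edge set of the joint graph.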

[0149] S24: Construct and train a structure-activity relationship graph neural network based on the sequence-process joint graph.

[0150] Specifically, the process of constructing and training the structure-activity relationship graph neural network is as follows:

[0151] S241, define the network architecture of the structure-activity relationship graph neural network. The structure-activity relationship graph neural network comprises three modules: a node feature embedding layer, a heterogeneous graph attention layer, and a multi-task prediction layer. The node feature embedding layer maps the complete sequence feature vector of peptide nodes and the node attribute vector of process state nodes to an embedding space of unified dimension through fully connected layers, obtaining initial node embedding vectors.

[0152] The heterogeneous graph attention layer specifically adopts the HAN heterogeneous graph attention network architecture, which includes two levels: a node-level attention mechanism and a semantic-level attention mechanism. The node-level attention mechanism performs, for each node, a weighted aggregation of the information of its neighboring nodes of the same type, with the attention weights obtained by calculating the correlation between each neighboring node and the central node through a two-layer fully connected network. The semantic-level attention mechanism performs a weighted fusion of the information transmitted by the different types of edges, with the semantic attention weights calculated from the similarity between a learnable semantic vector and the aggregation result of each type of edge. The number of attention heads in the heterogeneous graph attention layer is determined by evaluating, on a validation dataset, the impact of different numbers of attention heads on the model's prediction performance and training stability, and selecting the number of attention heads that gives the best prediction performance on the validation set while keeping training stable.
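The weighted fusion performed at the semantic level can be illustrated with the simplified sketch below. It is a stand-in and not the HAN architecture itself: the similarity is taken as a plain dot product between a fixed semantic vector and each edge type's aggregation result, followed by a softmax, whereas in a trained network the semantic vector would be learned. The function name, edge type names, and vectors are hypothetical.

```python
import math


def semantic_fusion(per_type_embeddings, semantic_vector):
    """Softmax-weighted fusion of per-edge-type aggregation results (simplified)."""
    scores = {t: sum(q * z for q, z in zip(semantic_vector, emb))
              for t, emb in per_type_embeddings.items()}
    peak = max(scores.values())                       # subtract max for stability
    exps = {t: math.exp(s - peak) for t, s in scores.items()}
    total = sum(exps.values())
    weights = {t: e / total for t, e in exps.items()}
    dim = len(semantic_vector)
    fused = [sum(weights[t] * per_type_embeddings[t][d] for t in weights)
             for d in range(dim)]
    return weights, fused


# Hypothetical aggregation results of one peptide node for the two edge types.
per_type = {"sequence_similarity": [1.0, 0.0], "association": [0.0, 1.0]}
weights, fused = semantic_fusion(per_type, semantic_vector=[1.0, 0.0])
```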

[0153] The multi-task prediction layer inputs the peptide node embedding vector updated by the heterogeneous graph attention layer into multiple parallel fully connected prediction heads, and outputs the predicted values of nerve cell survival rate and neurite growth promotion rate, respectively.

[0154] S242, the peptide nodes in the sequence-process joint graph are divided according to a preset training, validation, and testing ratio to obtain a training node set, a validation node set, and a testing node set. The preset training, validation, and testing ratio is determined by: evaluating, based on the data scale, the minimum number of training samples required for model training and the minimum number of validation samples required for validation evaluation; and, while ensuring sufficient training samples, allocating enough samples to the validation and testing sets to ensure the statistical significance of the evaluation results.

[0155] S243, Define a multi-task loss function. The multi-task loss function is a weighted sum of the loss from predicting nerve cell survival rate and the loss from predicting neurite growth promotion rate, with each prediction loss calculated using mean squared error.

[0156] In a preferred embodiment, the calculation formula for the multi-task loss function is as follows:

[0157] $L_{\mathrm{total}} = \lambda_1 L_{\mathrm{surv}} + \lambda_2 L_{\mathrm{neur}}$;

[0158] where $L_{\mathrm{total}}$ represents the total multi-task loss value; $L_{\mathrm{surv}}$ represents the prediction loss of nerve cell survival rate, which is calculated as the mean of the squared difference between the predicted survival rate and the actual survival rate over all training samples; $L_{\mathrm{neur}}$ represents the prediction loss of neurite growth promotion rate, which is calculated as the mean of the squared difference between the predicted growth promotion rate and the actual growth promotion rate over all training samples; and $\lambda_1$ and $\lambda_2$ are the weighting coefficients of the two prediction tasks, which satisfy the following condition: [condition missing]. The weighting coefficients are determined as follows: initial weights are assigned based on the relative importance of the two prediction tasks in product quality evaluation, and an uncertainty weighting strategy is used during training to dynamically adjust the weights according to the prediction uncertainty of each task.
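The weighted sum of the two mean squared error terms can be sketched as follows. The sketch uses fixed weights only; the dynamic uncertainty weighting strategy mentioned above is not implemented, and the function names, weight values, and sample predictions are hypothetical.

```python
def mse(predicted, actual):
    """Mean of the squared differences over all samples."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def multitask_loss(pred_surv, true_surv, pred_neur, true_neur,
                   w_surv=0.5, w_neur=0.5):
    """Fixed-weight sum of the survival-rate loss and the neurite-growth loss."""
    return w_surv * mse(pred_surv, true_surv) + w_neur * mse(pred_neur, true_neur)


# Hypothetical predictions and labels for two training samples.
loss = multitask_loss(pred_surv=[0.8, 0.6], true_surv=[1.0, 0.6],
                      pred_neur=[0.5, 0.5], true_neur=[0.5, 0.1],
                      w_surv=0.6, w_neur=0.4)
```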

[0159] S244, a mini-batch gradient descent strategy is employed to iteratively train the structure-activity relationship graph neural network. In each training iteration, a batch of peptide nodes and their local subgraphs in the sequence-process joint graph are sampled from the training node set. These local subgraphs are then input into the structure-activity relationship graph neural network for forward computation to obtain the predicted activity values. The multi-task loss is then calculated from the predicted activity values and the activity tag vectors, and the network parameters are updated through backpropagation.

[0160] S245, after each training cycle, the local subgraph corresponding to the validation node set is input into the structure-activity relationship graph neural network for forward computation to calculate the validation loss value. When the validation loss value no longer decreases for several consecutive training cycles, the training is considered to have converged, and the current network parameters are saved. Here, "several consecutive" means 5 to 20 training cycles, with preferred values of 5, 10, or 15 training cycles. The core judgment logic is: when the validation loss value shows no statistically significant decrease within this consecutive period, the model is considered to have converged, and continued training is likely to lead to overfitting. At this point, training is stopped, and the current optimal network parameters are saved.
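The convergence check of S245 corresponds to patience-based early stopping, sketched below on a list of validation losses. A plain comparison with an optional `min_delta` margin is used in place of a statistical significance test, which is a simplifying assumption; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), 0-based; stop_epoch is None if never triggered."""
    best, best_epoch, waited = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0   # improvement: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch                # no improvement for `patience` cycles
    return None, best_epoch


# Hypothetical validation losses, one per training cycle.
stop_epoch, best_epoch = early_stopping_epoch(
    [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64], patience=5)
```

The parameters saved would be those of `best_epoch`, that is, the cycle with the lowest validation loss seen before stopping.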

[0161] S246: The structure-activity relationship graph neural network obtained after training has converged is denoted as the trained structure-activity relationship graph neural network, and the attention weight matrix learned in the heterogeneous graph attention layer is saved for subsequent interpretation of the influence of process state on the sequence structure-activity relationship.
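
The convergence check in S245 can be sketched as a patience-based early-stopping rule. This is a minimal sketch: the function name, the `min_delta` tolerance (a simple stand-in for the "statistically significant" decrease test, which is not specified here), and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Scan per-cycle validation losses and return (stop_epoch, best_epoch).

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive cycles; best_epoch marks the
    cycle whose parameters would be kept as the optimal ones.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve; patience of 5 cycles (lower end of 5 to 20).
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
```

In this toy run the loss last improves at cycle 2, so training stops five cycles later and the parameters from cycle 2 are the ones retained.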

[0162] Specifically, traditional structure-activity relationship (SAR) models feed peptide sequence features directly into regression models to predict activity values, neglecting the crucial context that each peptide is formed under specific process conditions. In the preparation of porcine brain neurotrophic peptides, peptides with the same sequence may exhibit different activity levels under different enzymatic hydrolysis temperatures or pH conditions. This interaction between sequence and process is difficult to capture through simple feature concatenation. By unifying peptide sequences and process states into a sequence-process joint graph and employing the HAN heterogeneous graph attention mechanism for message passing, the SAR graph neural network can automatically learn the activating effect of certain process states on specific combinations of amino acid sites, characterizing the synergistic effect of sequence and process at the graph structure level. The attention weight matrix of the heterogeneous graph attention layer provides interpretable contribution metrics, enabling researchers to identify which process states favor the enrichment of highly active neurotrophic peptides. This provides sequence-guided decision support for subsequent process optimization and addresses the inability of traditional linear or shallow nonlinear models to characterize complex structure-activity relationships.

[0163] S3: Use the trained structure-activity relationship graph neural network to generate a process response sample set from historical process records, and derive the process control candidate rule set from the process response sample set to obtain the process control decision table.

[0164] This step addresses the critical challenge of turning structure-activity relationship prediction into inverse optimization of process parameters. It uses the trained structure-activity relationship graph neural network to evaluate candidate process combinations in batches in a virtual process space, generating a process response sample set that covers a broad parameter space. A strategy combining reinforcement learning and evolutionary search is then employed to derive a process control decision table. This process control decision table, as the final output of S3, is invoked in S4 to generate a sequence of process adjustment instructions from real-time production data.

[0165] Specifically, the process of generating a process response sample set and deriving a process control decision table includes:

[0166] S31: Based on the trained structure-activity relationship graph neural network, a virtual evaluation of historical process parameter combinations is performed to generate a process response sample set.

[0167] Specifically, the process for generating the process response sample set is as follows:

[0168] S311, extract all historical batch process parameter records from the original sample index table, determine the process status node number to which each process parameter record belongs according to the density clustering results in S22, and construct a historical process status record table. The data structure of the historical process status record table is a two-dimensional relational table, and each row contains batch number, process stage, process status node number, and original process parameter encoding vector.

[0169] S312, a sampling grid is set in the process parameter space, and each process parameter dimension is divided at equal intervals according to a preset sampling step size to generate a set of candidate process parameter combinations. The preset sampling step size is determined as follows: based on the effective variation range of each process parameter and the control accuracy of the process equipment, the minimum adjustment amount that the process equipment can stably achieve is taken as the lower limit of the sampling step size, and the effective variation range is divided by the expected number of sampling points as the upper limit of the sampling step size. A step size value between the lower limit and the upper limit that makes the total number of candidate combinations within the range that computing resources can bear is selected as the preset sampling step size.

[0170] S313. For each set of process parameters in the candidate process parameter combination set, calculate the Euclidean distance between it and the node attribute vector of each process state node in the process state node set, and map the process parameter combination to the nearest process state node to obtain the process state mapping table of the candidate process parameters.
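
The grid sampling in S312 and the nearest-node mapping in S313 can be sketched as follows. The parameter names, ranges, step sizes, and node attribute vectors are hypothetical, and the toy vectors are left unnormalized for readability; in practice the distance would be computed on the encoded (normalized) vectors.

```python
import itertools
import math

def build_grid(ranges, steps):
    """Divide each parameter range at equal intervals and return every
    combination as a dict {parameter name: value}."""
    axes = []
    for name, (low, high) in ranges.items():
        n = int(round((high - low) / steps[name]))
        axes.append([(name, low + i * steps[name]) for i in range(n + 1)])
    return [dict(combo) for combo in itertools.product(*axes)]

def nearest_node(vector, node_vectors):
    """Return the id of the process state node whose attribute vector has
    the smallest Euclidean distance to `vector`."""
    return min(node_vectors, key=lambda nid: math.dist(vector, node_vectors[nid]))

# Hypothetical two-parameter space and two process state nodes.
ranges = {"temperature_c": (40.0, 60.0), "ph": (6.0, 8.0)}
steps = {"temperature_c": 10.0, "ph": 1.0}
nodes = {0: (45.0, 6.5), 1: (58.0, 7.8)}

grid = build_grid(ranges, steps)
# Process state mapping table: one (candidate combination, node id) per row.
mapping = [(c, nearest_node((c["temperature_c"], c["ph"]), nodes)) for c in grid]
```

The grid size is the product of the per-parameter point counts, which is why the step size has to be chosen with the available computing resources in mind.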

[0171] S314: For each set of process parameters in the process state mapping table of candidate process parameters, the embedding vector of the corresponding process state node is input into the trained structure-activity relationship graph neural network. Through the message passing mechanism of the heterogeneous graph attention layer, the activity distribution of the neurotrophic peptides that may be enriched in this process state is predicted. The activity distribution includes the predicted mean of nerve cell survival rate, the predicted mean of neurite growth promotion rate, and the predicted abundance ratio of highly active peptides.

[0172] S315, for each set of candidate process parameters, calculate the comprehensive neurotrophic index.

[0173] In a preferred embodiment, the formula for calculating the comprehensive neurotrophic index is as follows:

[0174] $I = \alpha \, \bar{S} + \beta \, \bar{G} + \gamma \, R_{\text{high}}$;

[0175] where $I$ is the comprehensive neurotrophic index, with a value range of [value missing]; $\bar{S}$ is the mean predicted nerve cell survival rate, i.e. the arithmetic mean of the predicted survival rates of all associated peptides under this process condition; $\bar{G}$ is the mean predicted neurite growth promotion rate, i.e. the arithmetic mean of the predicted growth promotion rates of all associated peptides under this process condition; $R_{\text{high}}$ is the abundance ratio of highly active peptides, i.e. the sum of the abundances of peptides whose predicted activity values exceed the activity threshold, divided by the sum of the abundances of all peptides under this process condition; and $\alpha$, $\beta$, $\gamma$ are weighting coefficients satisfying [condition missing]. The weighting coefficients are determined by setting their proportions according to the relative weight requirements for nerve cell survival rate, neurite growth promotion rate, and high-activity component content in the product quality standards.
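
A minimal sketch of the index calculation, assuming the index is a weighted sum of the three quantities defined above. The record fields (`survival`, `neurite`, `activity`, `abundance`), the weights, and the threshold are hypothetical.

```python
def composite_index(peptides, threshold, w_surv, w_neur, w_high):
    """Weighted sum of mean predicted survival rate, mean predicted neurite
    growth promotion rate, and abundance ratio of highly active peptides."""
    n = len(peptides)
    mean_surv = sum(p["survival"] for p in peptides) / n
    mean_neur = sum(p["neurite"] for p in peptides) / n
    total_abundance = sum(p["abundance"] for p in peptides)
    high_abundance = sum(p["abundance"] for p in peptides
                         if p["activity"] > threshold)
    high_ratio = high_abundance / total_abundance
    return w_surv * mean_surv + w_neur * mean_neur + w_high * high_ratio

# Two peptides associated with one candidate process condition (toy values).
peptides = [
    {"survival": 0.8, "neurite": 0.6, "activity": 0.9, "abundance": 3.0},
    {"survival": 0.6, "neurite": 0.4, "activity": 0.5, "abundance": 1.0},
]
index = composite_index(peptides, threshold=0.7,
                        w_surv=0.4, w_neur=0.3, w_high=0.3)
```

Here the means are 0.7 and 0.5 and the high-activity abundance ratio is 0.75, so the index evaluates to 0.4 × 0.7 + 0.3 × 0.5 + 0.3 × 0.75 = 0.655.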

[0176] S316, each set of candidate process parameters and its corresponding comprehensive neurotrophic index, activity distribution prediction results, and process status node number are collected to generate a process response sample set. The data structure of the process response sample set is a two-dimensional relational table, where each row corresponds to a set of candidate process parameters, and the column fields include the process parameter encoding vector, comprehensive neurotrophic index, predicted mean of nerve cell survival rate, predicted mean of neurite growth promotion rate, abundance ratio of highly active peptides, and process status node number.

[0177] S32: Based on the process response sample set, a joint strategy of reinforcement learning and evolutionary search is used to derive the candidate rule set for process control.

[0178] Specifically, the process for deriving the candidate rule set for process control is as follows:

[0179] S321, Define the objective function for process trajectory optimization. The objective function for process trajectory optimization is a multi-objective evaluation function, which includes the objectives of maximizing the comprehensive neurotrophic index, minimizing production energy consumption, minimizing batch-to-batch variability, and maximizing the abundance of target highly active peptides.

[0180] In a preferred embodiment, the objective function for optimizing the process trajectory uses a weighted summation method to fuse multiple objectives into a single scalar objective. The specific calculation formula is as follows:

[0181] $J = w_1 \bar{I} - w_2 E - w_3 V + w_4 \bar{R}_{\text{high}}$;

[0182] where $J$ is the overall optimization target value of the process trajectory, and a larger value indicates better trajectory quality; $\bar{I}$ is the average comprehensive neurotrophic index over the process stages in the trajectory; $E$ is the normalized production energy consumption, calculated by dividing the sum of the energy consumption of each process stage in the trajectory by the baseline energy consumption value; $V$ is the batch-to-batch variability, calculated as the standard deviation of the differences in the comprehensive neurotrophic index between adjacent process states; $\bar{R}_{\text{high}}$ is the average abundance ratio of highly active peptides over the process stages in the trajectory; and $w_1$, $w_2$, $w_3$, $w_4$ are the weighting coefficients of the respective objectives. The weighting coefficients are determined as follows: based on the priority ranking of product quality, energy consumption cost, and production stability in the enterprise's production strategy, the relative weight of each objective is calculated using the analytic hierarchy process (AHP) and used as its weighting coefficient.

[0183] The production energy consumption is calculated by summing the equipment energy consumption models of each process stage. The equipment energy consumption model is the sum, over all process equipment, of the product of the energy consumption per unit time and the operating time under the given operating parameters. The energy consumption per unit time of each piece of process equipment is calculated from the power parameters in the equipment technical manual and the actual operating load rate.
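
The scalar objective and the energy model can be sketched as below, assuming a weighted sum in which the two maximized terms enter with a positive sign and the two minimized terms with a negative sign. The function names, the use of a population standard deviation, and all numeric values are assumptions.

```python
import statistics

def stage_energy(equipment):
    """Energy of one process stage: sum over equipment of
    rated power (kW) x load rate x operating time (h)."""
    return sum(power * load * hours for power, load, hours in equipment)

def trajectory_objective(indices, stage_energies, baseline_energy,
                         high_ratios, weights):
    """Weighted-sum objective for one process trajectory."""
    w_index, w_energy, w_var, w_high = weights
    mean_index = statistics.fmean(indices)
    norm_energy = sum(stage_energies) / baseline_energy
    # Differences of the index between adjacent entries of the trajectory.
    diffs = [b - a for a, b in zip(indices, indices[1:])]
    variability = statistics.pstdev(diffs) if diffs else 0.0
    mean_high = statistics.fmean(high_ratios)
    return (w_index * mean_index - w_energy * norm_energy
            - w_var * variability + w_high * mean_high)

# Hypothetical three-stage trajectory.
energy_stage_1 = stage_energy([(10.0, 0.5, 2.0), (5.0, 1.0, 1.0)])
objective = trajectory_objective(
    indices=[0.6, 0.7, 0.8],
    stage_energies=[10.0, 20.0, 10.0],
    baseline_energy=50.0,
    high_ratios=[0.5, 0.6, 0.7],
    weights=(0.5, 0.2, 0.1, 0.2),
)
```

Because the index rises by the same amount at each step in this toy trajectory, the variability term is essentially zero and the objective is about 0.5 × 0.7 − 0.2 × 0.8 + 0.2 × 0.6 = 0.31.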

[0184] S322, Construct the process trajectory state space and action space. The process trajectory state space is defined as the joint space of the current process stage, the current process parameter encoding vector, and the current intermediate peptide profile feature, where the current intermediate peptide profile feature is the peptide composition distribution predicted in the current process stage; the action space is defined as the set of adjustment ranges for each process parameter in the next process stage, where the adjustment range of each process parameter is between the negative adjustment upper limit and the positive adjustment upper limit, and the adjustment upper limit is determined based on the response capability and safety constraints of the process equipment.

[0185] S323: Construct a process policy network using the proximal policy optimization (PPO) algorithm. The input of the process policy network is the current state vector in the process trajectory state space. The data structure of the current state vector is a one-dimensional floating-point vector whose dimension equals the sum of the process stage encoding dimension, the process parameter encoding vector dimension, and the intermediate peptide profile feature dimension. The output of the process policy network is the probability distribution over the adjustment actions in the action space. The output data structure is a one-dimensional floating-point vector whose dimension equals the total number of discretized action categories, and each component is the selection probability of the corresponding action.

[0186] The network architecture of the process policy network includes a state encoding layer and a policy output layer. The state encoding layer consists of multiple fully connected layers and maps the current state vector to a hidden representation vector. The policy output layer consists of a fully connected layer and a Softmax activation function and maps the hidden representation vector to an action probability distribution. The clipping parameter of the proximal policy optimization algorithm is determined by evaluating, during the pre-training phase, the impact of different clipping parameter values on the stability of policy updates, and selecting a value that ensures a smooth decrease in policy entropy during training without policy collapse. The learning rate follows a warm-up and cosine annealing schedule, with the initial learning rate determined through a learning rate range test on a small-scale validation set.
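
To make the role of the clipping parameter and the learning rate schedule concrete, the sketch below computes the standard PPO clipped surrogate objective for a batch and a linear warm-up followed by cosine annealing. The default clipping value of 0.2, the function names, and the step counts are assumptions; this embodiment selects these values empirically.

```python
import math

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch,
    where r is the new/old policy probability ratio and A the advantage."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)

def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then cosine annealing towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Toy batch: one action made more likely, one made less likely.
objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0])
```

A smaller clipping value keeps each policy update closer to the previous policy, which is the stability property the pre-training evaluation is meant to tune.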

[0187] S324: Using the process response sample set as the response model of the virtual environment, candidate process trajectories are generated by combining Monte Carlo tree search with the process policy network. The search depth of the Monte Carlo tree search is determined by taking the total number of process stages as the maximum search depth, so that the search covers the entire production process. The number of simulations is chosen, based on computational resources and response time requirements, as a value that keeps the search results stable while the time consumed by a single search does not exceed a preset time limit. The exploration coefficient is determined by evaluating the impact of different values on search diversity and convergence speed on a validation dataset, and selecting a value that balances exploration and exploitation in the search results.

[0188] In each trajectory generation, starting from the initial process state, adjustment actions are sampled according to the action probability distribution output by the process policy network, and the comprehensive neurotrophic index corresponding to the adjusted process parameters is queried from the process response sample set and accumulated into the trajectory return value.

[0189] S325: Sort the generated candidate process trajectories according to their trajectory return values, and select the top 10% of process trajectories by trajectory return value as the set of high-quality trajectories.

[0190] S326: Apply an evolutionary search strategy, performing crossover and mutation operations on the high-quality trajectory set to generate new candidate process trajectories. The trajectory return value of each new trajectory is evaluated using the process response sample set. The high-quality trajectory set is updated iteratively until its average trajectory return value no longer improves significantly over multiple consecutive iterations.
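
The top-10% selection in S325 and the evolutionary refinement in S326 can be sketched as follows. This is a toy sketch under stated assumptions: a trajectory is a list of discrete actions, the return function is a stand-in for querying the process response sample set, and single-point crossover, the mutation rate, and the stopping parameters are all hypothetical choices.

```python
import random

def select_elite(trajectories, return_fn, fraction=0.10):
    """Keep the top `fraction` of trajectories by return value."""
    ranked = sorted(trajectories, key=return_fn, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]

def crossover(a, b, rng):
    """Single-point crossover of two equal-length trajectories."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def mutate(traj, actions, rng, rate=0.1):
    """Replace each step with a random action with probability `rate`."""
    return [rng.choice(actions) if rng.random() < rate else step for step in traj]

def evolve(elite, return_fn, actions, rng, offspring=20, patience=3, tol=1e-6):
    """Iterate until the mean elite return stops improving for
    `patience` consecutive generations."""
    size = len(elite)
    best_mean = sum(map(return_fn, elite)) / size
    stale = 0
    while stale < patience:
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng),
                           actions, rng) for _ in range(offspring)]
        elite = sorted(elite + children, key=return_fn, reverse=True)[:size]
        mean = sum(map(return_fn, elite)) / size
        if mean > best_mean + tol:
            best_mean, stale = mean, 0
        else:
            stale += 1
    return elite

rng = random.Random(0)
actions = [-1, 0, 1]                      # hypothetical action codes
population = [[rng.choice(actions) for _ in range(4)] for _ in range(50)]
elite = select_elite(population, sum)     # `sum` is a toy return function
final = evolve(elite, sum, actions, rng)
```

Because each generation keeps the best members of the old elite plus the offspring, the mean return of the elite set never decreases from one generation to the next.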

[0191] S327: For each process trajectory in the high-quality trajectory set, perform state discretization: map the continuous process states in the trajectory to discrete process state node numbers in the process state node set, and map the continuous adjustment actions in the trajectory to preset discrete action categories, thus obtaining discretized process trajectories. The preset discrete action categories are a set of discretized process control actions, each of which can be mapped directly to a continuous adjustment action. They are defined from the control parameters of each process stage of porcine brain neurotrophic peptide preparation, such as temperature, pH, elution flow rate, and enzymatic hydrolysis time, combined with the equipment execution capability and the parameter adjustment range. Specific examples include slight temperature increase, slight pH decrease, medium adjustment of elution flow rate, and coordinated temperature-pH adjustment.

[0192] S328: Extract state-action pairs from the discretized process trajectories, count votes for the actions observed in the same state, select the most frequent action as the recommended action for that state, and aggregate all state-recommended action pairs to obtain the process control candidate rule set. The data structure of the process control candidate rule set is a rule list in which each rule has a condition part and an action part. The condition part consists of the discrete process state node number and the intermediate peptide profile feature category, and the action part is the recommended process parameter adjustment category.
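
The vote counting in S328 can be sketched with a counter per condition. The tuple layout and the action names are hypothetical.

```python
from collections import Counter, defaultdict

def derive_rules(trajectories):
    """Each trajectory is a list of (state_node, profile_category, action).
    Returns {(state_node, profile_category): most frequent action}."""
    votes = defaultdict(Counter)
    for trajectory in trajectories:
        for node, category, action in trajectory:
            votes[(node, category)][action] += 1
    return {cond: counter.most_common(1)[0][0] for cond, counter in votes.items()}

# Three toy discretized trajectories.
trajectories = [
    [(2, "mixed", "temp_up_small"), (3, "high", "hold")],
    [(2, "mixed", "temp_up_small")],
    [(2, "mixed", "ph_down_small")],
]
rules = derive_rules(trajectories)
```

When two actions tie, `most_common` returns the one encountered first, so a production implementation would need an explicit tie-breaking rule.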

[0193] S33: Based on the candidate rule set for process control, perform conflict resolution and coverage optimization to generate a process control decision table.

[0194] Specifically, the process for generating the process control decision table is as follows:

[0195] S331, perform conflict detection on the rules in the candidate rule set for process control. The conflict detection method is as follows: if there are two rules that have the same condition part but different action parts, then the two rules are determined to be in conflict.

[0196] S332, For the detected conflict rule group, a confidence-first strategy is used to resolve the conflict. The method of the confidence-first strategy is as follows: calculate the frequency of each conflict rule in the high-quality trajectory set, retain the rules with higher occurrence frequencies, and delete the rules with lower occurrence frequencies.

[0197] S333, Perform coverage analysis on the rule set after conflict resolution. The method of coverage analysis is as follows: count the proportion of process state nodes that the condition parts of all rules in the rule set can cover to the total number of process state nodes, and use this as the rule coverage index.

[0198] S334: If the rule coverage index is lower than the preset coverage threshold, a nearest neighbor rule supplementation strategy is applied to the uncovered process state nodes. The nearest neighbor rule supplementation strategy is as follows: for each uncovered process state node, find the rule in the rule set whose condition-part process state node is closest to the uncovered node, copy the action part of that rule as the action part of a new rule, set the uncovered node number as the condition part of the new rule, and add the new rule to the rule set. The preset coverage threshold is determined as follows: based on the diversity of process states that may occur in the production scenario and the fault tolerance requirements of decision table queries, statistically analyze the frequency distribution of each process state in the historical production data, and take the proportion of process states whose cumulative frequency reaches the production coverage requirement as the preset coverage threshold.
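
Steps S331 to S334 can be sketched as three small functions. The rule tuple layout, node vectors, and threshold are hypothetical, and copying every rule of the nearest covered node (one per profile category) is one possible reading of the supplementation strategy.

```python
import math

def resolve_conflicts(rules):
    """rules: list of (node, category, action, frequency). For rules with
    the same condition, keep the action with the highest frequency."""
    best = {}
    for node, category, action, freq in rules:
        key = (node, category)
        if key not in best or freq > best[key][1]:
            best[key] = (action, freq)
    return {key: action for key, (action, _) in best.items()}

def coverage(rule_map, all_nodes):
    """Share of process state nodes that appear in some rule condition."""
    covered = {node for node, _ in rule_map}
    return len(covered & set(all_nodes)) / len(all_nodes)

def supplement(rule_map, node_vectors, threshold):
    """Give each uncovered node the rules of its nearest covered node."""
    result = dict(rule_map)
    if coverage(rule_map, node_vectors) >= threshold:
        return result
    covered = {node for node, _ in rule_map}
    for node, vector in node_vectors.items():
        if node in covered:
            continue
        nearest = min(covered, key=lambda c: math.dist(vector, node_vectors[c]))
        for (n, category), action in rule_map.items():
            if n == nearest:
                result[(node, category)] = action
    return result

rules = [(0, "high", "hold", 5), (0, "high", "temp_up", 2), (1, "mixed", "ph_down", 3)]
node_vectors = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.9, 0.8)}
resolved = resolve_conflicts(rules)
before = coverage(resolved, node_vectors)
completed = supplement(resolved, node_vectors, threshold=0.9)
after = coverage(completed, node_vectors)
```

In this example node 2 has no rule of its own, lies closest to node 1, and therefore inherits the `ph_down` recommendation.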

[0199] S335, the rule set after conflict resolution and coverage optimization is formatted into a process control decision table. The process control decision table has a two-dimensional lookup table data structure, with row indexes representing process status node numbers, column indexes representing intermediate peptide profile feature categories, and table cell content representing recommended process parameter adjustment categories. The process control decision table supports quick lookup of recommended actions using row and column indexes.

[0200] Specifically, traditional process optimization methods rely on numerous real-world experiments to find the optimal parameter combinations step by step, which is time-consuming and costly and makes it difficult to explore high-dimensional process parameter spaces systematically. By using the trained structure-activity relationship graph neural network to evaluate candidate process parameter combinations in a virtual environment, a process response sample set covering a wide parameter space can be generated without real-world experiments, removing the traditional need for numerous experiments to obtain the process response surface. The joint strategy of reinforcement learning and evolutionary search can automatically find regions of the continuous process parameter space that are rich in highly active neurotrophic peptides while also taking energy consumption and stability into account, enabling inverse design: instead of predicting results from a given process, the process is derived from a given objective. The process control decision table solidifies complex continuous optimization strategies into a set of rules that can be queried quickly by row and column index, allowing complex intelligent algorithms to be deployed in actual industrial control scenarios and reducing the computational complexity and response latency of online decision-making.

[0201] S4: Generate a process adjustment instruction sequence based on real-time production data using the process control decision table, and use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.

[0202] This step addresses the closed-loop challenge of applying the offline-optimized process control decision table to real-time production control. It involves collecting real-time production data and matching it with the process control decision table to generate an executable sequence of process adjustment instructions. A production feedback mechanism is then established to enable adaptive updates to the decision table. This step achieves a seamless application from algorithm discovery to process control, significantly improving the targeted enrichment efficiency and bioactivity retention level of target neurotrophic peptides.

[0203] Specifically, the process of generating the process adjustment instruction sequence and performing online optimization includes:

[0204] S41: Collect real-time production data and encode it as a description of the current process status.

[0205] Specifically, the acquisition and encoding process is as follows:

[0206] S411: Set up real-time data acquisition nodes at each process stage of the porcine brain neurotrophic peptide preparation production line, including an enzymatic hydrolysis reactor temperature sensor, an online pH electrode, a flow meter, an online chromatographic detector, and a rapid mass spectrometry fingerprint acquisition module.

[0207] S412: Read the output signals of each sensor and detection device according to the preset acquisition frequency, and generate real-time temperature readings, real-time pH readings, real-time flow rate readings, real-time chromatographic signal intensity sequences, and real-time mass spectrometry fingerprint feature vectors. The preset acquisition frequency is determined as follows: based on the dynamic response time constant of each process stage, take the reciprocal of the response time constant as the upper limit of the acquisition frequency; based on the data processing capability of the control system, take the frequency corresponding to the maximum data throughput that the system can process in real time as the lower limit of the acquisition frequency; and select an acquisition frequency between the upper and lower limits that meets the control accuracy requirements as the preset acquisition frequency.

[0208] S413, normalize the real-time temperature reading, real-time pH reading, and real-time flow rate reading according to the linear normalization method defined in S151 to obtain the real-time process parameter encoding vector.

[0209] S414 performs peak identification processing on the real-time chromatographic signal intensity sequence, extracting the retention time and peak area ratio of the main chromatographic peaks as the real-time chromatographic feature vector.

[0210] S415, calculate the matching degree between the real-time mass spectrometry fingerprint feature vector and the representative sequences in the non-redundant peptide sequence set to generate a real-time peptide spectrum distribution vector. Each component of the real-time peptide spectrum distribution vector represents the normalized matching strength of the corresponding representative sequence at the current time.

[0211] S416: Concatenate the real-time process parameter encoding vector, the real-time chromatographic feature vector, and the real-time peptide spectrum distribution vector to obtain the current process state description vector.

[0212] S417: Calculate the Euclidean distance between the current process state description vector and the node attribute vector of each process state node in the process state node set, and take the process state node number with the smallest distance as the current process state node mapping result.

[0213] S418: Perform category segmentation on the real-time peptide spectrum distribution vector. Based on the cumulative matching strength ratio of highly active peptides in the real-time peptide spectrum distribution vector, classify the current intermediate peptide profile as a highly active dominant class, a moderately active mixed class, or a low-activity dominant class, and use this as the current intermediate peptide profile feature category. A highly active peptide is defined as a peptide whose comprehensive neurotrophic index, as predicted by the trained structure-activity relationship graph neural network, is higher than a preset activity threshold. The preset activity threshold is determined from the minimum requirement for the content of highly active components in the product quality standard: the minimum comprehensive neurotrophic index required for the product to meet the qualification standard is used as the threshold.
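
The category segmentation in S418 can be sketched as a threshold rule on the cumulative matching strength of highly active peptides. The two cut points (0.6 and 0.3), the category labels, and the peptide identifiers are assumptions; this embodiment does not state the boundaries between the three classes.

```python
def classify_profile(distribution, high_activity_ids, high_cut=0.6, low_cut=0.3):
    """distribution: {representative sequence id: normalized matching strength}.
    Returns one of three intermediate peptide profile feature categories."""
    total = sum(distribution.values())
    high = sum(v for k, v in distribution.items() if k in high_activity_ids)
    ratio = high / total if total else 0.0
    if ratio >= high_cut:
        return "high_activity_dominant"
    if ratio <= low_cut:
        return "low_activity_dominant"
    return "moderate_activity_mixed"

# Toy real-time peptide spectrum distribution vector, keyed by sequence id.
distribution = {"P1": 0.5, "P2": 0.3, "P3": 0.2}
category_a = classify_profile(distribution, {"P1", "P2"})
category_b = classify_profile(distribution, {"P1"})
category_c = classify_profile(distribution, {"P3"})
```

The resulting label is what serves as the column index when the process control decision table is queried.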

[0214] S42: Based on the current process status description and the process control decision table, generate a process adjustment instruction sequence.

[0215] Specifically, the process for generating the process adjustment instruction sequence is as follows:

[0216] S421: Using the current process state node mapping result as the row index and the current intermediate peptide profile feature category as the column index, query the process control decision table to obtain the recommended process parameter adjustment category. Figure 5 is a schematic diagram of the process control decision table lookup provided in this embodiment. The diagram illustrates how the results of offline optimization are used to achieve millisecond-level online process control. The left side represents the real-time state of the production line, including the mapped process state node (such as node 2) and the intermediate peptide profile feature category monitored in real time (such as the moderately active mixed class). The process control decision table in the middle is a two-dimensional matrix that pre-stores the optimal strategies derived by the reinforcement learning algorithm. The arrows in the figure show the lookup process: the system locates the specific cell from the row index (process state) and the column index (peptide profile feature category), reads the recommended action directly (such as temperature decrease or pH increase), and converts it into specific output instructions. This mechanism avoids the high computational latency of running complex graph neural networks or evolutionary algorithms in real time at the production site. By solidifying the complex inverse optimization logic into a decision table, this embodiment can achieve rapid-response closed-loop control in the preparation of porcine brain neurotrophic peptides, so that production conditions are continually adjusted in a direction that favors the enrichment of highly active peptides.

[0217] S422: If the process control decision table has no cell for the current row and column index, a nearest neighbor matching strategy is executed. The nearest neighbor matching strategy is as follows: find the filled row index in the process control decision table that is closest to the current process state node mapping result, and read the recommended action in that row for the current intermediate peptide profile feature category; if that cell is also empty, read the most frequent recommended action in that row as the replacement.
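
The lookup in S421 and the fallback in S422 can be sketched with a nested dictionary as the decision table. The table contents, node vectors, and the choice to measure row closeness by Euclidean distance between node attribute vectors are assumptions.

```python
import math
from collections import Counter

def lookup(table, node_vectors, node, category):
    """table: {node: {category: action}}. Direct cell lookup first, then
    the nearest filled row, then that row's most frequent action."""
    row = table.get(node)
    if row and category in row:
        return row[category]
    filled = [n for n, r in table.items() if r]
    nearest = min(filled, key=lambda n: math.dist(node_vectors[node], node_vectors[n]))
    row = table[nearest]
    if category in row:
        return row[category]
    return Counter(row.values()).most_common(1)[0][0]

# Hypothetical decision table; node 2 has no row of its own.
table = {
    0: {"high": "hold", "mixed": "hold"},
    1: {"high": "ph_up", "mixed": "ph_up", "low": "temp_up"},
}
node_vectors = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.9, 0.9)}

direct = lookup(table, node_vectors, 1, "low")
from_neighbor = lookup(table, node_vectors, 2, "mixed")
row_fallback = lookup(table, node_vectors, 0, "low")
```

All three paths are dictionary reads plus, at worst, one pass over the filled rows, which is what keeps the online decision latency low compared with running the network.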

[0218] S423, decode the recommended process parameter adjustment categories obtained from the query into specific parameter adjustment values. The decoding method is as follows: based on the mapping table between process parameter adjustment categories and continuous adjustment ranges, convert discrete categories into adjustment increments for each process parameter. The mapping table is generated synchronously during the discretization process in S327.

[0219] S424: The adjustment increment of each process parameter is superimposed on the original parameter value corresponding to the current real-time process parameter encoding vector to obtain the target process parameter setting value.

[0220] S425, Perform a safety constraint check on the target process parameter setpoints. The method for the safety constraint check is as follows: compare the target setpoint of each process parameter with the upper and lower allowable limits of that parameter; if it exceeds the allowable range, truncate to the boundary value. The upper and lower allowable limits are determined based on the technical specifications of the process equipment and the product quality and safety requirements.

[0221] S426: The target process parameter setpoints that have passed the safety constraint check are arranged by process stage sequence and parameter type to generate a single process adjustment instruction. The data structure of the single process adjustment instruction is a list of key-value pairs, where the key is the process parameter name and the value is the target setpoint.
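
Steps S423 to S426 (decode, superimpose, clamp, assemble) can be sketched in one function. The category names, increment values, parameter names, and limits are hypothetical.

```python
def build_instruction(current, category, increments, limits):
    """Decode an adjustment category into per-parameter increments, add
    them to the current values, and truncate each target to its allowed
    range. Returns {parameter name: target setpoint}."""
    deltas = increments[category]
    instruction = {}
    for name, value in current.items():
        target = value + deltas.get(name, 0.0)
        low, high = limits[name]
        instruction[name] = min(high, max(low, target))
    return instruction

# Mapping table from discrete categories to continuous increments (toy values).
increments = {
    "temp_up_small": {"temperature_c": 2.0},
    "temp_up_ph_down": {"temperature_c": 2.0, "ph": -0.5},
}
limits = {"temperature_c": (45.0, 55.0), "ph": (6.0, 8.0)}
current = {"temperature_c": 54.0, "ph": 7.0}

instruction = build_instruction(current, "temp_up_ph_down", increments, limits)
```

In this example the temperature target of 56.0 exceeds the upper limit and is truncated to 55.0, while the pH target of 6.5 passes unchanged.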

[0222] S427: During the production process, the processes from S411 to S426 are continuously repeated. New single-process adjustment commands are generated according to a preset control cycle. Multiple single-process adjustment commands are arranged in chronological order to form a process adjustment command sequence. The preset control cycle is determined as follows: based on the process response lag time and control accuracy requirements, the process response lag time is taken as the lower limit of the control cycle to avoid over-adjustment; the maximum adjustment interval that meets the control accuracy requirements is taken as the upper limit of the control cycle; and a cycle value between the lower and upper limits that balances the control effect and computational load is selected as the preset control cycle.

[0223] S43: Perform online optimization of the preparation process of porcine brain neurotrophic peptides using the process adjustment instruction sequence.

[0224] Specifically, the process for performing online optimization is as follows:

[0225] S431, each single process adjustment command in the process adjustment command sequence is sent to the production line control system. The production line control system includes a temperature controller, a pH regulating valve, a flow rate regulating pump, and a chromatographic gradient program controller.

[0226] S432, the temperature controller receives the target temperature setpoint from a single process adjustment command, and adjusts the heating or cooling power to bring the temperature of the enzymatic hydrolysis reactor closer to the target value; the pH regulating valve receives the target pH setpoint, and adjusts the inflow ratio of the acid-base buffer to bring the pH of the reaction system closer to the target value; the flow rate regulating pump receives the target flow rate setpoint and adjusts the eluent flow rate in the chromatographic purification stage; the chromatographic gradient program controller receives the target elution gradient parameters and adjusts the slope and inflection point of the gradient elution program.

[0227] S433: While the control system executes the process adjustment instruction sequence, the deviation between the actual response value and the target setpoint of each process parameter is continuously monitored. If the deviation exceeds the preset allowable deviation range, a compensation adjustment amount is added to the process adjustment instruction in the next control cycle. The preset allowable deviation range is determined according to the sensitivity of product quality to each process parameter.
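The deviation check in S433 can be sketched as follows. The text states only that a compensation amount is added when the deviation leaves the allowable range; the proportional form of the compensation, the function name `compensate_setpoint`, and the `gain` parameter are assumptions for illustration.

```python
def compensate_setpoint(target: float, actual: float,
                        tolerance: float, gain: float = 1.0) -> float:
    """Return the setpoint to issue in the next control cycle.

    If the actual response is within `tolerance` of the target, the target is
    returned unchanged. Otherwise a compensation amount proportional to the
    deviation is added (the proportional rule and `gain` are assumptions).
    """
    deviation = actual - target
    if abs(deviation) <= tolerance:
        return target
    # Push the setpoint against the observed deviation.
    return target - gain * deviation


# Reactor overshoots a 40.5 degree target by 0.8 degrees, tolerance is 0.5:
next_setpoint = compensate_setpoint(40.5, 41.3, tolerance=0.5, gain=0.5)
```

In this usage the next setpoint is lowered by half of the observed overshoot; a deviation inside the tolerance would leave the setpoint untouched.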

[0228] S44: Establish a production feedback mechanism to achieve adaptive updates of the process control decision table.

[0229] Specifically, the process for establishing the production feedback mechanism is as follows:

[0230] S441: After each production batch is completed, mass spectrometry sequencing and activity evaluation are performed on the final product to obtain the actual peptide sequence set and actual activity score of that batch.

[0231] S442: Compare the actual peptide sequence set of this batch with the non-redundant peptide sequence set, identify newly appearing peptide sequences and peptide sequences that have disappeared, and update the non-redundant peptide sequence set.

[0232] S443: The real-time production data collected during the production of this batch, the process adjustment instruction sequence that was executed, and the actual activity score are correlated to generate a new labeled sample record. The new labeled sample record has the same data structure as the original sample index table, including batch number, process stage, sampling time, peptide sequence, peptide abundance, process parameters, and activity score.

[0233] S444: Add the new labeled sample record to the original sample index table, extract the sequence features of the newly appearing peptides according to the process in S14, and update the neurotrophic peptide sequence feature matrix.

[0234] S445: Determine whether the cumulative number of new labeled sample records has reached the preset update trigger threshold. The preset update trigger threshold is determined as follows: taking into account the computational resource cost of a model update and the rate of data accumulation, the minimum number of new samples required to improve model performance to a statistically significant degree is calculated, and this minimum number is used as the preset update trigger threshold.

[0235] S446: If the cumulative number of new labeled sample records reaches the preset update trigger threshold, the process from S21 to S246 is re-executed: the structure-activity relationship graph neural network is retrained on the updated sequence-process joint graph to obtain the updated trained structure-activity relationship graph neural network. Based on the new model, the process from S312 to S335 is re-executed to update the process control decision table and realize adaptive optimization of the process control strategy.

[0236] S447: Based on the updated trained structure-activity relationship graph neural network, re-execute the process from S31 to S335 to generate an updated process control decision table, which replaces the process control decision table currently in use.

[0237] S448: Deploy the updated process control decision table to the production line control system so that subsequent production batches use the feedback-optimized decision rules for process adjustment.
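The bookkeeping part of the feedback mechanism (the set comparison of S442, the record accumulation of S443 and S444, and the trigger check of S445) can be sketched as below. This is an illustration under stated assumptions, not the method itself: the names `FeedbackState` and `ingest_batch` are hypothetical, the text does not say whether disappeared sequences are removed from the non-redundant set (they are retained here), and retraining itself is left to the caller.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackState:
    non_redundant: set[str]            # current non-redundant peptide sequence set
    update_threshold: int              # preset update trigger threshold (S445)
    pending_records: list[dict] = field(default_factory=list)


def ingest_batch(state: FeedbackState, batch_peptides: set[str],
                 record: dict) -> tuple[set[str], set[str], bool]:
    """Process one finished batch; return (new, disappeared, retrain_due)."""
    new = batch_peptides - state.non_redundant          # newly appearing sequences
    disappeared = state.non_redundant - batch_peptides  # sequences absent this batch
    # Assumption: disappeared sequences stay in the set so that historical
    # samples remain indexable; only new sequences are added.
    state.non_redundant |= new
    state.pending_records.append(record)
    retrain_due = len(state.pending_records) >= state.update_threshold
    return new, disappeared, retrain_due
```

A caller would clear `pending_records` after a retraining run triggered by `retrain_due`, so that the count restarts for the next update cycle.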

[0238] For example, the generation and execution of the above process adjustment instruction sequence is illustrated with one batch of porcine brain neurotrophic peptide preparation. In the enzymatic extraction stage, the real-time temperature sensor reads an enzymatic hydrolysis reactor temperature of 42 degrees Celsius, the online pH electrode reads a pH of 7.2, and the rapid mass spectrometry fingerprint acquisition module determines that the matching intensity ratio of the highly active target peptide in the current intermediate product is 18%, which places the intermediate product in the medium-activity mixed category. The distances between the normalized real-time process parameter encoding vector and the nodes of the process state node set are calculated, and the vector is mapped to process state node 7. With node 7 as the row index and the medium-activity mixed category as the column index, the process control decision table is queried, and the recommended process parameter adjustment category is "temperature slight decrease, pH slight increase". This adjustment category is decoded into specific adjustment amounts: a temperature decrease of 1.5 degrees Celsius and a pH increase of 0.1. After the safety constraint check, a single-process adjustment instruction is generated with a target temperature setpoint of 40.5 degrees Celsius and a target pH setpoint of 7.3. The instruction is sent to the temperature controller and the pH regulating valve for execution, moving the enzymatic hydrolysis reaction conditions in a direction that favors the enrichment of highly active neurotrophic peptides.
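The worked example above can be traced in a short sketch. Only the readings (42 degrees Celsius, pH 7.2), node number 7, the category names, the adjustment amounts, and the resulting setpoints come from the example; the node coordinates, the normalized encoding `(0.56, 0.58)`, the safety limits, and all identifiers are hypothetical values chosen for illustration.

```python
import math

# Hypothetical process state nodes: node number -> normalized (temperature, pH).
STATE_NODES = {6: (0.30, 0.40), 7: (0.55, 0.60), 8: (0.80, 0.75)}

# One cell of the decision table: (state node, activity category) -> adjustment category.
DECISION_TABLE = {
    (7, "medium-activity mixed"): "temperature slight decrease, pH slight increase",
}

# Decoding of adjustment categories into (delta temperature in deg C, delta pH).
ADJUSTMENTS = {"temperature slight decrease, pH slight increase": (-1.5, 0.1)}

# Hypothetical safety limits used for the safety constraint check.
SAFE_RANGE = {"temperature": (35.0, 55.0), "ph": (6.5, 8.0)}


def nearest_node(encoded: tuple[float, float]) -> int:
    """Map an encoded process state to the closest node (Euclidean distance)."""
    return min(STATE_NODES, key=lambda n: math.dist(STATE_NODES[n], encoded))


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def build_instruction(temperature: float, ph: float,
                      encoded: tuple[float, float], activity_category: str) -> dict:
    """Look up the decision table and produce one single-process adjustment instruction."""
    node = nearest_node(encoded)
    category = DECISION_TABLE[(node, activity_category)]
    d_temp, d_ph = ADJUSTMENTS[category]
    return {
        "node": node,
        "target_temperature": round(clamp(temperature + d_temp, *SAFE_RANGE["temperature"]), 2),
        "target_ph": round(clamp(ph + d_ph, *SAFE_RANGE["ph"]), 2),
    }


instruction = build_instruction(42.0, 7.2, (0.56, 0.58), "medium-activity mixed")
print(instruction)
```

Clamping to a fixed range is only one possible form of safety constraint check; a production system might instead reject the instruction or limit the rate of change.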

[0239] Specifically, the traditional preparation of porcine brain neurotrophic peptides employs a static production model in which process parameters are set first and then passively monitored. The parameters remain largely fixed throughout the batch and cannot be adjusted in real time in response to fluctuations in raw material characteristics or changes in intermediate product states, which results in low target peptide separation efficiency and unstable activity retention between batches. In this step, the offline-derived process control decision table is applied directly to real-time production decisions, and the process adjustment instruction sequence decomposes the complex optimization process into a series of executable control actions, thereby achieving a closed loop of online process optimization with the sequence structure-activity relationship (SAR) as the core constraint. The production feedback mechanism enables the system to continuously accumulate new labeled data and periodically update the SAR graph neural network and the process control decision table, giving the system active learning and self-correction capabilities, so that the decision rules can adapt to dynamic factors such as batch differences in raw materials, equipment aging, and changes in market demand. This step tightly couples the sequence structure-activity relationship mining results with the specific porcine brain neurotrophic peptide preparation process, carrying the results from algorithmic discovery through to process control. It addresses the industry pain point that, under the traditional model dependent on manual experience, the complex relationships between process stages are difficult to control accurately, and it provides algorithmic support for the efficient, stable, and large-scale production of neurotrophic peptides.

[0240] Embodiment 2:

[0241] This embodiment, based on Embodiment 1, provides a system for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features. As shown in Figure 6, the system includes:

[0242] Feature matrix construction module: used to obtain original process and sequencing activity data of multiple batches of pig brain to construct an original sample index table, and extract sequence features based on the original sample index table to generate a neurotrophic peptide sequence feature matrix;

[0243] Graph Neural Network Training Module: Used to construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network;

[0244] Process decision derivation module: Used to generate a process response sample set from historical process records using the trained structure-activity relationship graph neural network, and derive a process control candidate rule set based on the process response sample set to obtain a process control decision table;

[0245] Online optimization execution module: Used to generate a process adjustment instruction sequence based on real-time production data using the process control decision table, and to use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.

Claims

1. A method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features, characterized in that the method includes: S1: Obtain original process and sequencing activity data of multiple batches of pig brains to construct an original sample index table, and extract sequence features based on the original sample index table to generate a neurotrophic peptide sequence feature matrix; S2: Construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network; S3: Use the trained structure-activity relationship graph neural network to generate a process response sample set from historical process records, and derive a process control candidate rule set based on the process response sample set to obtain a process control decision table; S4: Generate a process adjustment instruction sequence based on real-time production data using the process control decision table, and use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.

2. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 1, characterized in that the steps for constructing the original sample index table and generating the neurotrophic peptide sequence feature matrix include: collecting raw data from multiple batches on the pig brain production line to construct a multi-source production data set; based on the multi-source production data set, constructing the original sample index table using a batch process stage binding method; based on the original sample index table, performing noise reduction and redundancy elimination on the mass spectrometry peptide sequences to obtain a non-redundant peptide sequence set; based on the non-redundant peptide sequence set, extracting multi-dimensional sequence features to generate a sequence feature vector set; and constructing the neurotrophic peptide sequence feature matrix based on the sequence feature vector set and the original sample index table.

3. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 1, characterized in that the steps for constructing the sequence-process joint graph and training the structure-activity relationship graph neural network include: constructing a peptide sequence similarity subgraph based on the neurotrophic peptide sequence feature matrix; constructing a process state node set based on the process parameter encoding vectors in the neurotrophic peptide sequence feature matrix; constructing the sequence-process joint graph based on the peptide sequence similarity subgraph and the process state node set; and constructing and training the structure-activity relationship graph neural network based on the sequence-process joint graph.

4. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 1, characterized in that the steps for generating the process response sample set and deriving the process control decision table include: based on the trained structure-activity relationship graph neural network, performing a virtual evaluation of historical process parameter combinations to generate the process response sample set; based on the process response sample set, deriving the process control candidate rule set using a joint strategy of reinforcement learning and evolutionary search; and based on the process control candidate rule set, performing conflict resolution and coverage optimization to generate the process control decision table.

5. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 1, characterized in that the steps for generating the process adjustment instruction sequence and performing online optimization include: collecting real-time production data and encoding it as a current process state description; generating the process adjustment instruction sequence by matching the current process state description against the process control decision table; performing online optimization of the preparation process of porcine brain neurotrophic peptides using the process adjustment instruction sequence; and establishing a production feedback mechanism to enable adaptive updates of the process control decision table.

6. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 2, characterized in that the process flow of the batch process stage binding method is as follows: a batch time axis is determined based on the batch timestamp acquisition sequence, the batch time axis being a set of time intervals divided by process stage within a single production batch, including the raw material pretreatment stage, the enzymatic extraction stage, the multi-stage purification stage, and the activity protection stage; the enzymatic hydrolysis parameter acquisition sequence, temperature monitoring acquisition sequence, pH monitoring acquisition sequence, and purification condition acquisition sequence in the multi-source production data set are matched to time intervals according to the batch time axis, and each process parameter record is mapped to its batch number and process stage to obtain a process parameter binding record set; for each peptide sequence and its abundance information in the mass spectrometry sequencing acquisition sequence, the sampling timestamp is matched against the batch time axis, and the peptide sequence is associated with the corresponding batch number and process stage to obtain a peptide batch association record set; each activity score in the activity evaluation acquisition sequence is associated with the peptide batch association record set according to the batch source identifier of the test sample, and a corresponding activity label is attached to each peptide record to obtain a peptide record set with activity labels; and the process parameter binding record set and the peptide record set with activity labels are joined on batch number and process stage to generate the original sample index table.

7. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 2, characterized in that the process of multi-dimensional sequence feature extraction is as follows: for each peptide sequence in the non-redundant peptide sequence set, its amino acid composition feature is computed, the amino acid composition feature being a vector of length 20 in which each component corresponds to the frequency of one standard amino acid in the peptide sequence; for each peptide sequence in the non-redundant peptide sequence set, its physicochemical property features are calculated, including molecular weight, isoelectric point, hydrophobicity index, net charge, and instability index; for each peptide sequence in the non-redundant peptide sequence set, its dipeptide and tripeptide frequency features are computed; for each peptide sequence in the non-redundant peptide sequence set, a secondary structure prediction model is called to generate secondary structure features, yielding a secondary structure proportion vector of length 3; for each peptide sequence in the non-redundant peptide sequence set, a disordered region prediction model is called to generate disordered region features, which include the proportion of amino acids located in disordered regions and the length of the longest continuous disordered region; for each peptide sequence in the non-redundant peptide sequence set, a pre-trained protein language model is called to generate a sequence embedding vector; and the amino acid composition feature, physicochemical property features, dipeptide frequency features, tripeptide frequency features, secondary structure proportion vector, disordered region features, and sequence embedding vector are concatenated in order to obtain the complete sequence feature vector of each peptide, and the sequence feature vectors of all peptides form the sequence feature vector set.

8. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 3, characterized in that the process of constructing and training the structure-activity relationship graph neural network includes: defining the network architecture of the structure-activity relationship graph neural network, which includes three modules: a node feature embedding layer, a heterogeneous graph attention layer, and a multi-task prediction layer; dividing the peptide nodes in the sequence-process joint graph according to a preset training, validation, and testing ratio to obtain a training node set, a validation node set, and a testing node set; defining a multi-task loss function, which is a weighted sum of the prediction loss for nerve cell survival rate and the prediction loss for neurite growth promotion rate, each prediction loss being calculated as a mean squared error; iteratively training the structure-activity relationship graph neural network using a mini-batch gradient descent strategy, wherein in each training iteration a batch of peptide nodes and their local subgraphs is sampled and input into the network to calculate predicted activity values, the multi-task loss function value is calculated against the activity label vector, and the loss is backpropagated to update the parameters; after each training cycle, inputting the local subgraphs corresponding to the validation node set into the network to calculate a validation loss value, and, when the validation loss value has not decreased for several consecutive training cycles, considering the training to have converged and saving the current network parameters; and denoting the structure-activity relationship graph neural network after training convergence as the trained structure-activity relationship graph neural network, and saving the attention weight matrix learned in the heterogeneous graph attention layer.

9. The method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to claim 4, characterized in that the process of deriving the process control candidate rule set using a joint strategy of reinforcement learning and evolutionary search is as follows: a process trajectory optimization objective function is defined, which is a multi-objective evaluation function including the objectives of maximizing the comprehensive neurotrophic index, minimizing production energy consumption, minimizing batch-to-batch variability, and maximizing the abundance of target high-activity peptides; a process trajectory state space and an action space are constructed, the process trajectory state space being defined as the joint space of the current process stage, the current process parameter encoding vector, and the current intermediate peptide profile features, and the action space being defined as the set of adjustment ranges for each process parameter in the next process stage; a process policy network is constructed using the proximal policy optimization (PPO) algorithm, the input of the process policy network being the current state vector and the output being the probability distribution over the adjustment actions in the action space; the process response sample set is used as the response model of the virtual environment, and candidate process trajectories are generated by combining Monte Carlo tree search with the process policy network; the generated candidate process trajectories are sorted by trajectory return value, and the top 10% of process trajectories with the highest trajectory return values are selected as the high-quality trajectory set; an evolutionary search strategy is used to perform crossover and mutation operations on the high-quality trajectory set to generate new candidate process trajectories, and the high-quality trajectory set is iteratively updated until the average trajectory return value no longer increases; for each process trajectory in the high-quality trajectory set, state discretization is performed, mapping continuous process states to discrete process state node numbers and continuous adjustment actions to preset discrete action categories; state-action pairs are extracted from the discretized process trajectories, the actions recorded for the same state are tallied by voting, and the action with the highest frequency is selected as the recommended action for that state; and all state-recommended action pairs are collected to obtain the process control candidate rule set.

10. A system for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features, used to implement the method for mining the structure-activity relationship of porcine brain neurotrophic peptides based on sequence features according to any one of claims 1-9, characterized in that the system includes: Feature matrix construction module: used to acquire original process and sequencing activity data of multiple batches of pig brain to construct an original sample index table, and extract sequence features based on the original sample index table to generate a neurotrophic peptide sequence feature matrix; Graph neural network training module: used to construct a sequence-process joint graph using the neurotrophic peptide sequence feature matrix, and train a structure-activity relationship graph neural network based on the sequence-process joint graph to obtain the trained structure-activity relationship graph neural network; Process decision derivation module: used to generate a process response sample set from historical process records using the trained structure-activity relationship graph neural network, and derive a process control candidate rule set based on the process response sample set to obtain a process control decision table; Online optimization execution module: used to generate a process adjustment instruction sequence based on real-time production data using the process control decision table, and to use the process adjustment instruction sequence to perform online optimization of the preparation process of porcine brain neurotrophic peptides.