A pre-training-based method for predicting the concentration of fermentation products
Non-uniform-length data are processed for pre-training, and the pre-trained model is then fine-tuned on a limited amount of labeled data, which addresses the underuse of unlabeled data in fermentation processes. This allows product concentration to be predicted efficiently even when labeled data are scarce, improving the real-time performance and accuracy of fermentation process optimization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HEBEI UNIV OF TECH
- Filing Date
- 2024-01-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to make effective use of unlabeled data when predicting fermentation product concentrations, which lowers the efficiency of fermentation process optimization. Deep learning models in particular perform poorly when labeled data are scarce.
A pre-training-based approach is adopted. A data rotation module handles data of non-uniform length; a positive-negative sample contrastive learning module increases the model's sensitivity to strongly correlated variables; a contextual contrastive learning module learns the differences between fermentation batches; and a data reconstruction module learns the correlations between variables. The model is then fine-tuned using limited labeled data.
Even with insufficient labeled data, the model can accurately predict the trend of product concentration during fermentation, improving the real-time performance and accuracy of fermentation process optimization. Its advantage is most pronounced when only a small number of labels is available.
Figure CN117972425B_ABST
Description
Technical Field
[0001] This invention relates to the fields of fermentation product concentration prediction and deep learning pre-training technology, and in particular to a pre-training-based method for predicting fermentation product concentration.
Background Technology
[0002] Biomanufacturing is a green production method that uses living organisms to process and synthesize materials. As shown in Figure 1, the biomanufacturing industry chain comprises upstream, midstream, and downstream segments: the upstream industry uses gene editing and high-throughput screening technologies to produce high-performance industrial strains; the midstream industry optimizes and scales up the fermentation process to determine the optimal conditions for large-scale fermentation of those strains; and the downstream industry separates and purifies fermentation products for commercial delivery. In recent years, synthetic biology and high-throughput screening technologies have significantly improved the availability of high-performance industrial strains upstream. However, midstream fermentation process optimization still relies on empirical knowledge and statistical methods with repeated trial and error. Because it is difficult to quantify and standardize, it has become a bottleneck for industrializing upstream innovations. How to use intelligent technologies to improve the efficiency of fermentation optimization has therefore attracted attention from both academia and industry.
[0003] Fermentation process optimization refers to the process in which fermentation operators adjust control parameters based on specific quality-characterizing variables and their own experience. These variables, known as indicator variables, primarily include cell concentration, product concentration, and substrate concentration. Because indicator variables must be measured offline and are not available in real time, operators are often limited to coarse-grained adjustments based on experience. Soft sensing replaces hardware measurement with software: it uses easily measurable process variables to predict the variables of interest. In recent years, researchers have applied soft sensing to online monitoring at fermentation sites, enabling more direct observation of the entire fermentation process and fine-grained, real-time, precise control, which improves the efficiency of fermentation process optimization.
[0004] Methods for building soft sensor models in the fermentation field fall into three groups: biological mechanism methods, traditional machine learning, and deep learning. Mechanism-based methods rely heavily on sufficient and reliable prior knowledge and build models from complex kinetic equations. Their advantage is an interpretable white-box model. However, as genetically engineered strains become more available and iterate faster, it is difficult to elucidate the mechanisms in a short time, and the kinetic equations are not robust in practice, so mechanism-based methods are unsuitable for intelligent fermentation. Traditional machine learning methods mine latent information from data through feature engineering and need no mechanistic knowledge of fermentation. Their advantage is the ability to learn the correlation between fermentation process data and indicator variables. However, the data must strictly satisfy the model's distribution assumptions, and the feature engineering must be reworked for each dataset, which limits generality and lengthens development. Such methods cannot keep up with the rapid iteration of high-performance industrial strains and are therefore also unsuitable for intelligent fermentation. Deep learning methods do not require repeated changes to feature engineering and can discover latent patterns in data. Their advantage is that they can quickly learn the temporal and nonlinear behavior of fermentation process variables. However, most existing methods are supervised and require a large amount of labeled data, whereas in actual fermentation processes indicator variables usually have to be measured offline. Supervised learning is therefore unsuitable when only a small amount of labeled data is available.
[0005] The key to addressing insufficient labeled data is to make effective use of the information in unlabeled data. The paper [Yao L, Ge Z. Deep learning of semisupervised process data with hierarchical extreme learning machine and soft sensor application[J]. IEEE Transactions on Industrial Electronics, 2017, 65(2): 1490-1498.] uses an autoencoder to construct a data reconstruction task and minimizes the reconstruction error, extracting features from unlabeled data for subsequent training; the paper [Shen B, Yao L, Ge Z. Nonlinear probabilistic latent variable regression models for soft sensor application: From shallow to deep structure[J]. Control Engineering Practice, 2020, 94: 104198.] stacks variational autoencoders, additionally considers the distributional difference between the reconstructed data and the original data, and extracts features from unlabeled data at a deeper level; the paper [Qiu K, Wang J, Zhou X, et al. Soft sensor based on localized semi-supervised relevance vector machine for penicillin fermentation process with asymmetric data[J]. Measurement, 2022, 202: 111823.] integrates three similarity measures and uses labeled data to estimate the labels of unlabeled data; the paper [Gopakumar V, Tiwari S, Rahman I. A deep learning based data driven soft sensor for bioprocesses[J]. Biochemical Engineering Journal, 2018, 136: 28-39.] uses self-organizing maps to extract information from unlabeled data for initializing the weights of an artificial neural network, and then fine-tunes the network with labeled data, showing good performance in predicting indicator variables. These methods make effective use of the information in unlabeled data and perform well in their respective downstream models. However, they cannot be applied to other types of deep learning models. The field of fermentation soft sensing still lacks a method for extracting information from unlabeled data that is applicable to most deep learning networks.
[0006] Self-supervised learning learns useful representations from unlabeled data. By setting suitable pre-training tasks, self-supervised pre-trained models can reach performance comparable to supervised learning with only limited labeled data. It is therefore promising for extracting information from unlabeled fermentation process data in a way that applies to most deep learning networks. BERT, proposed in the paper [Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).], introduces two pre-training tasks: the first is masked language modeling, which randomly masks a portion of the words in a text and then predicts the original content of the masked words; the second is next sentence prediction, i.e., determining whether two sentences are consecutive in the original text. Trained on these two tasks, the model captures semantic information well, and fine-tuning on downstream tasks achieves even better performance. Recently, such pre-training tasks have played a crucial role in large-scale generative language models, with strong results in natural language processing. For time series data, however, pre-training tasks are still at an exploratory stage. The article [Eldele E, Ragab M, Chen Z, et al. Time-series representation learning via temporal and contextual contrasting[J]. arXiv preprint arXiv:2106.14112, 2021.] designs a contrastive learning framework for time series data. It first proposes two data augmentation techniques, adding random noise and random shuffling. It then pulls the representations of the two augmented samples closer together and pushes them away from the representations of other samples. Finally, it designs a challenging cross-view prediction task. After pre-training with these methods, the model learns a consistent representation of time series data and performs well on downstream classification tasks. The article [Zhang X, Zhao Z, Tsiligkaridis T, et al. Self-supervised contrastive pre-training for time series via time-frequency consistency[J]. arXiv preprint arXiv:2206.08496, 2022.] proposes the time-frequency consistency principle, which treats time-domain and frequency-domain data as different views of the same data. It designs a time-frequency contrastive learning framework that pulls the time-domain and frequency-domain representations closer together, and achieves excellent results on classification problems. This research explores universal patterns shared by time series. Introducing time-series pre-training tasks is expected to improve model accuracy when only a small number of labels is available. Because fermentation batches differ from one another, and different variables influence the fermentation process to different degrees, contrastive learning tasks are well suited to learning this prior knowledge and show considerable potential. However, existing research is not directly applicable to predicting fermentation product concentrations, so designing pre-training tasks that benefit fermentation product concentration prediction is an important problem.
Summary of the Invention
[0007] The purpose of this invention is to provide a pre-training-based method for predicting the product concentration of a fermentation process. As shown in Figure 2, the method first processes the non-uniform-length data, then pre-trains the model, and finally fine-tunes it. As shown in Figure 2(a), a data rotation module is first set up to solve the problem of non-uniform data length. As shown in Figure 2(b), the pre-training process includes: a positive-negative sample contrastive learning module to increase the model's sensitivity to strongly correlated variables; a contextual contrastive learning module to learn the differences between fermentation batches; and a data reconstruction module that randomly masks fermentation process data and attempts to reconstruct the original samples, so as to learn the correlations between variables and capture numerical relationships. As shown in Figure 2(c), the fine-tuning process uses limited labeled data to further adjust the model parameters, enabling the model to predict product concentrations accurately even when labeled data are insufficient.
[0008] The technical solution adopted in this invention is:
[0009] A pre-training-based method for predicting fermentation product concentration, comprising the following steps:
[0010] S1: Construct a framework for predicting fermentation product concentrations that includes a pre-training process and a fine-tuning process;
[0011] S2: Set up the data rotation module, which organizes historical fermentation process data into training data and stores the training data in mini_batch;
[0012] S3: In the pre-training process, first set up a positive-negative sample contrastive learning module. This module perturbs strongly correlated variables to construct negative samples and perturbs weakly correlated variables to construct positive samples. It pre-trains the model by maximizing the distance between the representations of the original sample and the negative sample and minimizing the distance between the representations of the original sample and the positive sample;
[0013] S4: Next in the pre-training process, set up a contextual contrastive learning module. This module treats the different batches of data in mini_batch as negative samples of one another and pre-trains the model by maximizing the distance between the representations of the original sample and the negative samples;
[0014] S5: Finally in the pre-training process, set up a data reconstruction module. This module randomly masks the original samples using a masking mechanism and attempts to reconstruct the masked samples back to the original samples as closely as possible;
[0015] S6: Calculate the total loss of the pre-training process and update the model parameters using gradient descent;
[0016] S7: In the fine-tuning process, further fine-tune the model using the limited labels Y from the historical fermentation process data, calculating the loss and further adjusting the model parameters;
[0017] S8: Obtain the current batch data from the fermentation data system, load the model, and predict the product concentration of the current batch data.
[0018] Furthermore, in step S2, the data rotation module organizes historical fermentation process data into training data. For ease of explanation, define a dataset $D$ containing $n$ batches of fermentation data. For every $D_i \in D$, the subset $X_i$ represents the easily measurable fermentation process data and has $l_i$ time steps, where the subscript $i$ denotes the $i$-th batch, and $Y_i$ represents the variable to be predicted. Define $mini\_batch^{(l)}$ as a set of training data of length $l$, referred to simply as mini_batch below. Processing the data with the data rotation mechanism includes the following steps:
[0019] 1-1) Select a batch of data as the query set q;
[0020] 1-2) Maintain a queue S of size batch_size-1, where batch_size is the number of fermentation batches contained in one set of training data;
[0021] 1-3) Combine the query set q with the queue data S and truncate them to the minimum data length $l_{min}$.
[0022] This constructs a mini_batch that can be used for model training. By iterating over the query set and updating the queue many times, the tail data of long sequences are preserved as far as possible.
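The data rotation steps 1-1) to 1-3) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `rotate_mini_batches`, the first-in-first-out queue update, and the choice to skip queries until the queue is full are all assumptions, since the text does not specify how the queue S is refreshed.

```python
from collections import deque

def rotate_mini_batches(batches, batch_size):
    """Yield mini-batches truncated to the shortest sequence they contain.

    batches: list of sequences (one per fermentation batch), possibly of
    different lengths. Each batch takes a turn as the query q; the queue S
    holds the batch_size - 1 most recently used batches (assumed FIFO).
    """
    queue = deque(maxlen=batch_size - 1)       # queue S
    for q in batches:                          # 1-1) pick the query batch q
        if len(queue) == batch_size - 1:       # 1-2) queue S is full
            group = [q] + list(queue)          # 1-3) merge q with S ...
            l_min = min(len(x) for x in group)
            yield [x[:l_min] for x in group]   # ... and cut to l_min
        queue.append(q)                        # oldest entry drops out

# Four hypothetical batches with 5, 8, 6 and 9 time steps.
data = [list(range(n)) for n in (5, 8, 6, 9)]
mini_batches = list(rotate_mini_batches(data, batch_size=2))
```

Because each batch is paired with different neighbours over successive iterations, a long sequence is only cut as short as its current shortest partner, which is how its tail data can survive in some mini-batches.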
[0023] Furthermore, in step S3, a positive-negative sample contrastive learning module is set up. Define a data augmentation family τ, which can include operations such as random data flipping, random shuffling, and random amplification. For the $i$-th batch of data $X_i \in D_i \in$ mini_batch, weakly correlated variables are selected for data augmentation to construct a positive sample, written here as $X_i^{+}$, and strongly correlated variables are selected for data augmentation to construct a negative sample, written here as $X_i^{-}$. Positive samples guide the model to learn that changes in weakly correlated variables do not significantly affect the result, while negative samples guide it to learn that changes in strongly correlated variables produce significantly different results. An autoregressive model $M_{enc}$ encodes the process data to obtain $h_i = M_{enc}(X_i)$, $h_i^{+} = M_{enc}(X_i^{+})$, and $h_i^{-} = M_{enc}(X_i^{-})$, where $h_i$, $h_i^{+}$, and $h_i^{-}$ are the mappings of the original data, the positive sample, and the negative sample computed by the autoregressive model. Designating $(h_i, h_i^{+})$ as a positive pair and $(h_i, h_i^{-})$ as a negative pair constructs the positive-negative sample contrastive learning task.
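As a rough illustration of how positive and negative samples might be built, the sketch below applies random amplification (one member of the augmentation family τ) to selected variables only. The helper `augment_columns`, the gain range, and the assignment of column 0 as strongly correlated and column 1 as weakly correlated are hypothetical choices for this example; the text does not say how variables are ranked.

```python
import random

def augment_columns(sample, columns, scale_range=(0.5, 1.5), seed=None):
    """Return a copy of `sample` with the chosen variables randomly amplified.

    sample: list of time steps, each a list of variable values.
    columns: indices of the variables to perturb.
    """
    rng = random.Random(seed)
    # One random gain per perturbed variable, shared across all time steps.
    gains = {c: rng.uniform(*scale_range) for c in columns}
    return [[v * gains.get(c, 1.0) for c, v in enumerate(row)] for row in sample]

# Hypothetical split: column 0 is strongly correlated with the product
# concentration, column 1 is weakly correlated.
strong_cols, weak_cols = [0], [1]
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
x_pos = augment_columns(x, weak_cols, seed=0)    # positive sample
x_neg = augment_columns(x, strong_cols, seed=0)  # negative sample
```

The positive sample leaves the strongly correlated variable untouched, so its encoding should stay close to that of the original; the negative sample changes exactly that variable, so its encoding should move away.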
[0024] Furthermore, in step S4, a contextual contrastive learning module is set up. Because different fermentation batches usually produce different results, the results the model computes from different batches of data should also differ. On this basis, take the $j$-th sample $X_j \in D_j \in$ mini_batch. Weakly correlated variables are selected for data augmentation to construct a positive sample, written here as $X_j^{+}$, and strongly correlated variables are selected for data augmentation to construct a negative sample, written here as $X_j^{-}$. This gives $h_j = M_{enc}(X_j)$, $h_j^{+} = M_{enc}(X_j^{+})$, and $h_j^{-} = M_{enc}(X_j^{-})$, where $h_j$, $h_j^{+}$, and $h_j^{-}$ are the mappings of the original data, the positive sample, and the negative sample computed by the autoregressive model. Designating $(h_i, h_j)$, $(h_i, h_j^{+})$, and $(h_i, h_j^{-})$ as negative pairs constructs the contextual contrastive learning task.
[0025] Furthermore, in step S5, a data reconstruction module is set up. The data reconstruction pre-training task mainly includes the following steps:
[0026] 2-1) Apply random masking to the original sample $X_i$ to obtain the masked sample, written here as $\tilde{X}_i$;
[0027] 2-2) For a specific time step $t$, compute $\tilde{h}_{it} = M_{enc}(\tilde{X}_{i,\le t})$, where $\tilde{X}_{i,\le t}$ refers to the masked data taken up to time step $t$ and $\tilde{h}_{it}$ is the mapping computed by the autoregressive model;
[0028] 2-3) Construct a linear projection head $P_m$ and compute $\hat{x}_{it} = P_m(\tilde{h}_{it})$, where $P_m$ consists of one fully connected layer and $\hat{x}_{it}$, the mapping computed by the linear projection head, is regarded as the reconstructed data at time $t$; the reconstructed data for all time steps of the original sample $X_i$ are obtained in the same way.
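Steps 2-1) to 2-3) can be traced with a toy example. Everything here is a stand-in chosen for readability: `encode_prefix` replaces the trained autoregressive encoder with a simple causal mean, the mask value 0.0 and the mask ratio are arbitrary, and `linear_head` is a hand-written fully connected layer rather than a learned one.

```python
import random

def random_mask(sample, mask_ratio=0.15, mask_value=0.0, seed=None):
    """2-1) Replace a random fraction of entries with mask_value."""
    rng = random.Random(seed)
    return [[mask_value if rng.random() < mask_ratio else v for v in row]
            for row in sample]

def encode_prefix(masked, t):
    """2-2) Stand-in for M_enc: mean of each variable over steps 0..t.

    A real encoder would be a trained autoregressive network; the only
    property kept here is that it sees data up to step t and nothing later.
    """
    prefix = masked[: t + 1]
    return [sum(col) / len(prefix) for col in zip(*prefix)]

def linear_head(h, weights, bias):
    """2-3) One fully connected layer: x_hat = W h + b."""
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(weights, bias)]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x_masked = random_mask(x, mask_ratio=0.5, seed=1)
h_t = encode_prefix(x_masked, t=1)
identity = [[1.0, 0.0], [0.0, 1.0]]
x_hat_t = linear_head(h_t, identity, [0.0, 0.0])  # reconstruction at t = 1
```

In training, the weights of the encoder and of the projection head would be adjusted so that `x_hat_t` approaches the unmasked value `x[t]`.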
[0029] Furthermore, in step S6, the total loss of the pre-training process is calculated. Calculating the total loss mainly includes the following steps:
[0030] 3-1) For the positive and negative pairs in steps S3 and S4, to maximize the similarity between positive pairs and minimize the similarity between negative pairs, a loss function $L_{C,i}$ is constructed to measure the distance between positive and negative pairs. For $X_i$, the loss function can be expressed as:
[0031]
[0032] Here sim(u, v) is the cosine similarity between vectors u and v, and ξ is a hyperparameter that adjusts the similarity scale. The negative terms cover the other batches of data in the same mini_batch and their augmented samples. The formula also contains an indicator function, and exp is the exponential function with the natural constant e as its base. Minimizing $L_{C,i}$ helps the model capture the changes caused by strongly correlated variables, weaken the influence of weakly correlated variables, and learn the differences between fermentation batches;
[0033] 3-2) For the data reconstruction in step S5, the mean absolute error loss function is used to measure the distance between the original sample $X_i$ and the reconstructed sample $\hat{X}_i$. Specifically, for $X_i$ the loss function $L_{M,i}$ can be expressed as:
[0034]
[0035] Here, pred_step is a set of k distinct positive integers, each smaller than the total number of time steps of sample $X_i$; the size of k is limited by the available computing resources. $\hat{x}_{it}$ is the reconstructed data at time t, and $x_{it}$ is the original data at time t. Minimizing $L_{M,i}$ drives the model to reconstruct the original sample from the masked sample;
[0036] 3-3) The total loss of the pre-training phase combines the losses of positive-negative sample contrastive learning, contextual contrastive learning, and data reconstruction. For sample $X_i$ it can be expressed as:
[0037] $L_i = \lambda_1 L_{C,i} + \lambda_2 L_{M,i}$
[0038] where $\lambda_1$ and $\lambda_2$ are adjustable hyperparameters. Minimizing $L_i$ yields the pre-trained model, which is then copied for use in the fine-tuning phase.
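The three loss terms in steps 3-1) to 3-3) can be sketched as follows. The contrastive term uses a common InfoNCE-style form built from cosine similarity and the scale ξ; this is an assumption standing in for the patent's equation, which is not reproduced in this text, and it omits the indicator function. Function names and default values are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(h, h_pos, negatives, xi=0.5):
    """InfoNCE-style loss: small when h is close to h_pos and far from negatives."""
    pos = math.exp(cosine(h, h_pos) / xi)
    neg = sum(math.exp(cosine(h, n) / xi) for n in negatives)
    return -math.log(pos / (pos + neg))

def mae(pred, target):
    """Mean absolute error between two equal-length sequences of numbers."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(l_c, l_m, lam1=1.0, lam2=1.0):
    """L_i = lambda_1 * L_C,i + lambda_2 * L_M,i."""
    return lam1 * l_c + lam2 * l_m

# An encoding aligned with its positive and opposed to its negative gives a
# lower contrastive loss than the reverse arrangement.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
reversed_ = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

In an actual training loop these quantities would be computed on tensors by a deep learning framework so that gradient descent (step S6) can update the encoder parameters.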
[0039] Furthermore, in step S7, the model is further fine-tuned using the limited labels Y from the historical fermentation process data. For a labeled time step $t_{label}$, sample $X_i$ has labeled data $Y_i$. The sample data up to $t_{label}$ are fed to the pre-trained model to compute its output, and a linear projection head $P_{out}$, which contains one fully connected layer, then computes the predicted product concentration at time $t_{label}$. The label predictions for all labeled time steps are obtained in the same way. Using the mean absolute error as the loss function of the fine-tuning stage, the fine-tuning loss $L_{reg}$ for sample $X_i$ can be expressed as:
[0040]
[0041] Here, label_num is the total number of labels used during fine-tuning, $\hat{y}_{ik}$ is the k-th predicted value, and $y_{ik}$ is the k-th value of $Y_i$.
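The equation for the fine-tuning loss is not reproduced in this text. Based on the stated definitions (mean absolute error over label_num labels, predicted value $\hat{y}_{ik}$, true value $y_{ik}$), a plausible form would be the following; the exact indexing is an assumption.

```latex
L_{reg} = \frac{1}{\mathrm{label\_num}} \sum_{k=1}^{\mathrm{label\_num}} \left| \hat{y}_{ik} - y_{ik} \right|
```

For example, with two labels, predictions 1.0 and 2.0, and true values 2.0 and 4.0, this gives $L_{reg} = (1 + 2)/2 = 1.5$.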
[0042] The beneficial effects of adopting the above technical solution are as follows:
[0043] This invention provides a pre-training-based method for predicting the product concentration of a fermentation process. As shown in Figure 2, the method first processes the non-uniform-length data, then pre-trains the model, and finally fine-tunes it. As shown in Figure 2(a), a data rotation module is first set up to solve the problem of non-uniform data length. As shown in Figure 2(b), the pre-training process includes: a positive-negative sample contrastive learning module to increase the model's sensitivity to strongly correlated variables; a contextual contrastive learning module to learn the differences between fermentation batches; and a data reconstruction module that randomly masks fermentation process data and attempts to reconstruct the original samples, so as to learn the correlations between variables and capture numerical relationships. As shown in Figure 2(c), the fine-tuning process uses limited labeled data to further adjust the model parameters, enabling the model to predict product concentrations accurately even when labeled data are insufficient.
[0044] The proposed method for predicting fermentation process product concentration was applied to IndPenSim, a publicly available industrial penicillin production dataset. Experimental analysis showed that, on four different batches of fermentation data, the model using the data rotation module accurately identified the decreasing trend of product concentration at the end of fermentation. The model using the positive-negative sample contrastive learning and contextual contrastive learning modules outperformed the model without them, while the model using the random masking module achieved the highest accuracy. The model performed best when all four modules were combined. Notably, when the model was fine-tuned with only two labels, the model using the proposed pre-training method accurately identified the decreasing trend of product concentration in the middle of fermentation, which purely supervised learning could not achieve.
Attached Figure Description
[0045] Figure 1 is a schematic diagram of the upstream, midstream, and downstream segments of the biomanufacturing industry chain;
[0046] Figure 2 is a framework diagram of the pre-training-based method for predicting fermentation process product concentration;
[0047] Figure 3 is a diagram of the data augmentation;
[0048] Figure 4 shows the results of the comparative experiments on penicillin concentration prediction;
[0049] Figure 5 shows the trend of $R^2$ for penicillin concentration prediction under different numbers of labels;
[0050] Figure 6 shows the penicillin concentration prediction results of the ablation experiment;
[0051] Figure 7 shows the results of the hyperparameter sensitivity analysis.
Detailed Implementation
[0052] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.
[0053] Against the background of fermentation process product concentration prediction, this invention uses pre-training technology together with deep learning to propose a pre-training-based method for predicting fermentation process product concentration. Its framework is shown in Figure 2 and includes the following steps:
[0054] S1: Construct a framework for predicting fermentation product concentrations that includes a pre-training process and a fine-tuning process;
[0055] S2: Set up the data rotation module, which organizes historical fermentation process data into training data and stores the training data in mini_batch;
[0056] For ease of explanation, define a dataset $D$ containing $n$ batches of fermentation data. For every $D_i \in D$, the subset $X_i$ represents the easily measurable fermentation process data and has $l_i$ time steps, where the subscript $i$ denotes the $i$-th batch, and $Y_i$ represents the variable to be predicted. Define $mini\_batch^{(l)}$ as a set of training data of length $l$, referred to simply as mini_batch below. Processing the data with the data rotation mechanism includes the following steps:
[0057] 1-1) Select a batch of data as the query set q;
[0058] 1-2) Maintain a queue S of size batch_size-1, where batch_size is the number of fermentation batches contained in one set of training data;
[0059] 1-3) Combine the query set q with the queue data S and truncate them to the minimum data length $l_{min}$.
[0060] This constructs a mini_batch that can be used for model training. By iterating over the query set and updating the queue many times, the tail data of long sequences are preserved as far as possible.
[0061] S3: In the pre-training process, first set up a positive-negative sample contrastive learning module. This module perturbs strongly correlated variables to construct negative samples and perturbs weakly correlated variables to construct positive samples. It pre-trains the model by maximizing the distance between the representations of the original sample and the negative sample and minimizing the distance between the representations of the original sample and the positive sample.
[0062] Define a data augmentation family τ, which can include operations such as random data flipping, random shuffling, and random amplification. For the $i$-th batch of data $X_i \in D_i \in$ mini_batch, weakly correlated variables are selected for data augmentation to construct a positive sample, written here as $X_i^{+}$, and strongly correlated variables are selected for data augmentation to construct a negative sample, written here as $X_i^{-}$. Positive samples guide the model to learn that changes in weakly correlated variables do not significantly affect the result, while negative samples guide it to learn that changes in strongly correlated variables produce significantly different results. An autoregressive model $M_{enc}$ encodes the process data to obtain $h_i = M_{enc}(X_i)$, $h_i^{+} = M_{enc}(X_i^{+})$, and $h_i^{-} = M_{enc}(X_i^{-})$, where $h_i$, $h_i^{+}$, and $h_i^{-}$ are the mappings of the original data, the positive sample, and the negative sample computed by the autoregressive model. Designating $(h_i, h_i^{+})$ as a positive pair and $(h_i, h_i^{-})$ as a negative pair constructs the positive-negative sample contrastive learning task.
[0063] S4: Next in the pre-training process, set up a contextual contrastive learning module. This module treats the different batches of data in mini_batch as negative samples of one another and pre-trains the model by maximizing the distance between the representations of the original sample and the negative samples.
[0064] Because different fermentation batches usually produce different results, the results the model computes from different batches of data should also differ. On this basis, take the $j$-th sample $X_j \in D_j \in$ mini_batch. Weakly correlated variables are selected for data augmentation to construct a positive sample, written here as $X_j^{+}$, and strongly correlated variables are selected for data augmentation to construct a negative sample, written here as $X_j^{-}$. This gives $h_j = M_{enc}(X_j)$, $h_j^{+} = M_{enc}(X_j^{+})$, and $h_j^{-} = M_{enc}(X_j^{-})$, where $h_j$, $h_j^{+}$, and $h_j^{-}$ are the mappings of the original data, the positive sample, and the negative sample computed by the autoregressive model. Designating $(h_i, h_j)$, $(h_i, h_j^{+})$, and $(h_i, h_j^{-})$ as negative pairs constructs the contextual contrastive learning task.
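To make the pairing in steps S3 and S4 concrete, the sketch below lists, for each batch in a mini_batch, its positive partner and its negative partners. The function name `contrast_pairs` is hypothetical, and treating the augmented samples of other batches as negatives follows the description of the loss in step S6; the strings stand in for encoding vectors.

```python
def contrast_pairs(h, h_pos, h_neg):
    """List positive and negative partners for every batch in a mini_batch.

    h, h_pos, h_neg: lists of encodings (original, positive-augmented,
    negative-augmented), one entry per fermentation batch.
    """
    pairs = {}
    for i in range(len(h)):
        negatives = [h_neg[i]]              # S3: own strongly perturbed sample
        for j in range(len(h)):
            if j != i:                      # S4: other batches and their augmentations
                negatives += [h[j], h_pos[j], h_neg[j]]
        pairs[i] = {"positive": h_pos[i], "negatives": negatives}
    return pairs

# Placeholder encodings for a mini_batch of two fermentation batches.
pairs = contrast_pairs(["a", "b"], ["a+", "b+"], ["a-", "b-"])
```

With a mini_batch of size B, each anchor therefore has one positive and 1 + 3(B - 1) negatives under this reading.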
[0065] S5: Finally in the pre-training process, set up a data reconstruction module. This module randomly masks the original samples using a masking mechanism and attempts to reconstruct the masked samples back to the original samples as closely as possible.
[0066] 2-1) Apply random masking to the original sample $X_i$ to obtain the masked sample, written here as $\tilde{X}_i$;
[0067] 2-2) For a specific time step $t$, compute $\tilde{h}_{it} = M_{enc}(\tilde{X}_{i,\le t})$, where $\tilde{X}_{i,\le t}$ refers to the masked data taken up to time step $t$ and $\tilde{h}_{it}$ is the mapping computed by the autoregressive model;
[0068] 2-3) Construct a linear projection head $P_m$ and compute $\hat{x}_{it} = P_m(\tilde{h}_{it})$, where $P_m$ consists of one fully connected layer and $\hat{x}_{it}$, the mapping computed by the linear projection head, is regarded as the reconstructed data at time $t$; the reconstructed data for all time steps of the original sample $X_i$ are obtained in the same way.
[0069] S6: Calculate the total loss of the pre-training process and update the model parameters using gradient descent;
[0070] 3-1) For the positive and negative pairs in steps S3 and S4, to maximize the similarity between positive pairs and minimize the similarity between negative pairs, a loss function $L_{C,i}$ is constructed to measure the distance between positive and negative pairs. For $X_i$, the loss function can be expressed as:
[0071]
[0072] Here sim(u, v) is the cosine similarity between vectors u and v, and ξ is a hyperparameter that adjusts the similarity scale. The negative terms cover the other batches of data in the same mini_batch and their augmented samples. The formula also contains an indicator function, and exp is the exponential function with the natural constant e as its base. Minimizing $L_{C,i}$ helps the model capture the changes caused by strongly correlated variables, weaken the influence of weakly correlated variables, and learn the differences between fermentation batches;
[0073] 3-2) For the data reconstruction in step S5, the mean absolute error loss function is used to measure the distance between the original sample $X_i$ and the reconstructed sample $\hat{X}_i$. Specifically, for $X_i$ the loss function $L_{M,i}$ can be expressed as:
[0074]
[0075] Here, pred_step is a set containing k unique positive integers, each of which is less than the sample X. i The total time step, and the size of k, are limited by the amount of computing resources available. Let x be the reconstructed data at time t. it Here are the original data at time t. Minimize L. M,i This can prompt the model to reconstruct the original sample from the reconstructed sample;
[0076] 3-3) The total loss in the pre-training phase combines the losses from positive and negative sample contrastive learning, contextual contrastive learning, and data reconstruction. For sample $X_i$ it can be expressed as:
[0077] $L_i = \lambda_1 L_{C,i} + \lambda_2 L_{M,i}$
[0078] Where $\lambda_1$ and $\lambda_2$ are adjustable hyperparameters. After $L_i$ is minimized, the pre-trained model is obtained and copied for use in the fine-tuning phase.
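The pre-training losses of steps 3-1) to 3-3) can be illustrated with a minimal Python sketch. The formula images for $L_{C,i}$ and $L_{M,i}$ are not reproduced in this text, so the InfoNCE-style form of the contrastive term, the function names, the argument layout, and the default ξ = 0.2 are assumptions made for illustration only and are not the formulas of the invention. Only the weighted sum in `total_pretrain_loss` follows the expression given in paragraph [0077].

```python
import math
import random


def cosine_sim(u, v):
    """Cosine similarity sim(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, xi=0.2):
    """Assumed InfoNCE-style contrastive loss (illustrative form only).

    anchor    -- representation of the original sample
    positive  -- representation of its positive sample
    negatives -- representations of negative samples (perturbed strongly
                 correlated variables, other batches in the mini-batch)
    xi        -- similarity scale hyperparameter (default is arbitrary)
    """
    pos = math.exp(cosine_sim(anchor, positive) / xi)
    neg = sum(math.exp(cosine_sim(anchor, n) / xi) for n in negatives)
    return -math.log(pos / (pos + neg))


def choose_pred_steps(total_steps, k, seed=0):
    """Draw k distinct positive integers, each smaller than total_steps."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_steps), k))


def reconstruction_loss(original, reconstructed, pred_step):
    """Mean absolute error over the time steps listed in pred_step.

    original, reconstructed -- lists of per-time-step variable vectors
    """
    total, count = 0.0, 0
    for t in pred_step:
        for x, x_hat in zip(original[t], reconstructed[t]):
            total += abs(x_hat - x)
            count += 1
    return total / count


def total_pretrain_loss(l_c, l_m, lambda1, lambda2):
    """Weighted sum L_i = lambda1 * L_C,i + lambda2 * L_M,i."""
    return lambda1 * l_c + lambda2 * l_m
```

In this sketch a well-separated positive pair yields a small contrastive term, and the weights `lambda1` and `lambda2` decide how strongly the contrastive and reconstruction objectives contribute to the parameter update.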
[0079] S7: The fine-tuning process uses the limited labels Y in the historical fermentation process data to fine-tune the model further: the loss is calculated and the model parameters are adjusted again;
[0080] For a labeled time step $t_{label}$, sample $X_i$ has labeled data $Y_i$. The sample data up to $t_{label}$ is fed to the pre-trained model to compute its output, and a linear projection head $P_{out}$, which contains one fully connected layer, then computes the predicted product concentration at time $t_{label}$. The label predictions for all labeled time steps are obtained in the same way. With the mean absolute error as the loss function in the fine-tuning stage, the fine-tuning loss $L_{reg}$ for sample $X_i$ can be expressed as:
[0081]
[0082] Here, label_num is the total number of labels used during fine-tuning, $\hat{y}_{ik}$ is the k-th predicted value, and $y_{ik}$ is the k-th value of $Y_i$.
[0083] S8: Obtain the current batch data from the fermentation data system, load the model, and predict the product concentration of the current batch data;
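The fine-tuning loss of step S7 can be sketched as follows. The formula image for $L_{reg}$ is not reproduced in this text, so this is a plain mean absolute error over the labeled time steps, written under that assumption; the function name `fine_tune_loss` and its arguments are hypothetical.

```python
def fine_tune_loss(predicted, measured):
    """Mean absolute error over the labeled time steps (assumed form of L_reg).

    predicted -- predicted product concentrations at the labeled time steps
    measured  -- offline-measured concentrations (labels) at the same steps
    """
    if not measured or len(predicted) != len(measured):
        raise ValueError("need one prediction per label and at least one label")
    label_num = len(measured)  # total number of labels used in fine-tuning
    return sum(abs(p - y) for p, y in zip(predicted, measured)) / label_num
```

With only two labels, for example the start and end of fermentation, `label_num` is 2 and the loss averages two absolute errors.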
[0084] Experimental verification of the pre-training-based fermentation product concentration prediction method of this invention:
[0085] 1. Test Environment
[0086] The system uses Windows 10, an Intel Core i7-10700K CPU (3.8GHz x 8 CPUs) and an NVIDIA GeForce RTX 3070 GPU. The development environment is Python 3.7, and the deep learning framework used is PyTorch 1.13.0.
[0087] 2. Experimental verification
[0088] Experimental results and analysis on the public penicillin dataset IndPenSim.
[0089] (1) Dataset Description
[0090] To verify the effectiveness of the proposed method, industrial-scale penicillin fermentation was selected as the research example. IndPenSim is an industrial-scale penicillin fermentation simulator whose fermentation scale has been extended to 1000L. The simulator was validated using readily available real industrial penicillin fermentation data. IndPenSim provides 100 batches of simulation data: 60 batches of recipe-driven data, 30 batches of operator-controlled data, and 10 batches of fault data. The process variables in the dataset are shown in Table 1. In recipe-driven batches, the "Manual control" variables remain constant during fermentation; in operator-controlled batches, the setpoints of the "Manual control" variables are changed manually by the operator at specific times. The 100 batches were sampled online at 0.2-hour intervals, with fermentation cycles ranging from 167 to 290 hours and data lengths ranging from 794 to 1356 records. This paper uses the operator-controlled data, randomly selecting 20 batches as the training set, 5 batches as the validation set, and 5 batches as the test set. The first 21 variables in the dataset are readily available parameters that play a decisive role in the fermentation process, so they were selected as input variables. Penicillin concentration, the fermentation product, is an important indicator for adjusting fermentation process parameters and can only be measured offline; it was therefore selected as the variable to be predicted.
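The random 20/5/5 split of the 30 operator-controlled batches can be sketched as below. The batch identifiers, the function name `split_batches`, and the fixed seed are hypothetical; the text does not state how the random selection was seeded or which batch numbers were drawn.

```python
import random


def split_batches(batch_ids, n_train=20, n_val=5, n_test=5, seed=0):
    """Randomly split batch identifiers into train/validation/test lists."""
    ids = list(batch_ids)
    if n_train + n_val + n_test > len(ids):
        raise ValueError("not enough batches for the requested split")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Splitting by whole batches rather than by individual time steps keeps every record of a fermentation run in the same subset.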
[0091] Table 1 Variables in the fermentation process
[0092]
[0093] (2) Implementation details and evaluation indicators
[0094] The model was built with the PyTorch library, and a Transformer was chosen as the encoder model. The Adam optimizer was used with a learning rate of 0.0003. Training ran for 30 epochs with an early stopping mechanism to prevent overfitting. The data masking ratio was 50%, pred_step was set to 10, λ1 = 1.5, and λ2 = 1. The strongly correlated variables were glucose replenishment rate, dissolved oxygen concentration, oxygen uptake rate, and carbon dioxide release rate; the weakly correlated variable was pH value. For these two types of variables, random flipping, amplification, shuffling, and the addition of random noise were used for augmentation, as shown in Figure 3, where Figure 3(a) shows the augmentation results for control variables such as glucose replenishment rate and Figure 3(b) shows the augmentation results for curve-like variables such as dissolved oxygen concentration. When the number of labels is 2, the penicillin concentration values at the start and end of fermentation are selected as labels. When the number of labels is greater than 2, the start and end points of fermentation are selected first, and the remaining labels are taken at evenly spaced points in the middle of fermentation. $R^2$ is used to evaluate model performance and is calculated as follows:
[0095]
[0096] Where $y$ is the true value, $\hat{y}$ is the model's predicted value, $\bar{y}$ is the mean of the true values, and $l$ is the length of the data. Because the dataset records the penicillin concentration at every time step, $R^2$ can be calculated using all $l$ predicted values. $R^2$ characterizes the degree of fit between two curves and normally lies between 0 and 1: the closer the value is to 1, the higher the degree of fit between the two curves and the better the predictive performance of the model.
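The label selection rule and the evaluation metric described above can be sketched in Python. The formula image for $R^2$ is not reproduced in this text, so `r_squared` uses the standard coefficient of determination as an assumption; `select_label_steps` is one possible reading of "start, end, and evenly spaced points in between", and both function names are hypothetical.

```python
def select_label_steps(total_steps, label_num):
    """Pick label_num time-step indices: first, last, and evenly spaced between."""
    if label_num < 2 or label_num > total_steps:
        raise ValueError("label_num must be between 2 and total_steps")
    last = total_steps - 1
    return [round(j * last / (label_num - 1)) for j in range(label_num)]


def r_squared(y_true, y_pred):
    """Standard coefficient of determination over all l time steps."""
    l = len(y_true)
    mean_true = sum(y_true) / l
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this standard definition a prediction that always equals the mean of the true values scores 0, a perfect prediction scores 1, and a fit worse than the mean would score below 0.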
[0097] (3) Comparative Experiment
[0098] For ease of reference, the proposed method is denoted QMCP (Queue Masked Contrastive Pre-training). This section presents comparative experiments under 2-label and 16-label conditions and addresses two questions: (1) Can limited labeled data support the training of a purely supervised learning model? (2) Is the pre-training method in this paper superior to existing research? Three benchmark models were therefore selected for the comparative experiments: CNN-LSTM, TS-TCC, and TF-C. The CNN-LSTM model is trained using only the limited labeled data, while the TS-TCC and TF-C models are pre-trained in their respective ways and then fine-tuned with a linear projection head.
[0099] Figure 4 compares the penicillin concentration prediction curves of the proposed method and the comparative methods. "true" represents the actual penicillin concentration, "fine_tune" indicates the model pre-trained and then fine-tuned using the proposed method (QMCP), "supervised_only" indicates the model trained directly on a small amount of labeled data without pre-training, and "TF-C" and "TS-TCC" are the names of the comparative methods. Table 2 compares the average $R^2$ of the proposed model and the three comparative models on the test set in the 2-label and 16-label cases. When the CNN-LSTM model is trained with 2 labels and with 16 labels, its highest $R^2$ is only 0.8003, which is not ideal. The model in this paper, trained directly with 2 labels and with 16 labels without pre-training, achieves considerable $R^2$ values, but as shown in Figure 4(a1) and (a2), it fails to capture the upward trend of penicillin concentration in the middle and late stages of fermentation, and in (a3) and (a4) it fails to capture the turning point at which the penicillin concentration declines. Figure 4(b1)-(b4) show that increasing the number of labels improves model performance, but enough labeled data may not be available in actual production. Limited labeled data is therefore insufficient to support the training of supervised learning models.
[0100] For TS-TCC and TF-C, this paper uses their own pre-training methods and adds only one linear layer to complete the regression task. Table 2 shows that the proposed method achieves the best results: the $R^2$ values of the TS-TCC model are all lower than those of the model in this paper, and TF-C performs poorly in the 2-label case. In Figure 4(a1)-(a4), the predicted values of TF-C are too high in the early stage of fermentation and tend to stabilize in the later stage, presumably because of abrupt changes in the frequency-domain data. In Figure 4(a1), the predicted value of TF-C in the later stage of fermentation best matches the true value, demonstrating the potential of that model when more labels are available. However, it has no advantage in the few-label setting of this paper. In summary, compared with existing methods, the pre-training method proposed in this paper shows superior performance when labeled data is insufficient.
[0101] Figure 5 shows the test-set $R^2$ as the number of labels increases from 2 to 16. In most cases the model in this paper outperforms the purely supervised model. Only when the number of labels is 6 is the $R^2$ of the purely supervised model higher than that of our model, by 0.0066. In particular, when the number of labels is 2, the $R^2$ of our model leads by 0.1387, demonstrating the superiority of our model in the semi-supervised case.
[0102] Table 2 Comparative Experiments
[0103]
[0104] (4) Ablation test
[0105] To evaluate the effectiveness of each module of QMCP, this section conducts ablation experiments under 2-label and 16-label conditions and addresses three questions: (1) Is the data rotation mechanism proposed in this paper effective? (2) Is the mask pre-training mechanism used in this paper effective? (3) Is the data augmentation method proposed in this paper beneficial? Four benchmark models are designed for the ablation experiments: MCP, QMP, QCP, and QMCP*. The MCP model does not use the data rotation mechanism and instead trims all training data to the minimum length. The QMP model uses only the mask mechanism for pre-training, and QCP uses only the contrastive learning method proposed in this paper for pre-training. QMCP* does not use the data augmentation method proposed in this paper and instead augments all variables in the process data to form positive samples.
[0106] Figure 6 compares the penicillin concentration prediction curves of each model in the 2-label case, where "true" represents the actual penicillin concentration value, "fine_tune" refers to the QMCP method in this paper, and "MCP", "QMP", "QCP", and "QMCP*" are the ablation models. Table 3 compares the average $R^2$ of each model on the test set in the 2-label and 16-label cases. In the 16-label case our model outperforms the other baselines, and its advantage is particularly evident in the 2-label case. In Figure 6(c1), a slight decrease in yield occurs at the end of fermentation. The models that use the data rotation mechanism accurately identify this downward trend, while the MCP model, which lacks the mechanism, fails to identify it and its initial predictions rise too rapidly. This indicates that using the data rotation mechanism to handle non-uniform data is beneficial. The $R^2$ of QMP is 0.2 lower than that of the model in this paper, indicating that the introduction of contrastive learning benefits model prediction. QCP has the lowest $R^2$, a decrease of 0.31, and in Figure 6(c3) and (c4) it does not identify the point at which the penicillin concentration decreases, indicating the effectiveness of the masking mechanism. Finally, the $R^2$ of QMCP* decreases by 0.13, indicating that the data augmentation method proposed in this paper is superior to previous data augmentation methods.
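The difference between the MCP baseline (trim everything to the global minimum length) and a queue-based data rotation can be illustrated with the following sketch. It assumes a first-in-first-out queue, one query batch per iteration, and trimming each group to the shortest sequence in that group; the update policy and the function names are assumptions for illustration and may differ from the data rotation module of the invention.

```python
from collections import deque


def trim_to_global_min(batches):
    """MCP-style baseline: cut every batch to the shortest length overall."""
    min_len = min(len(b) for b in batches)
    return [b[:min_len] for b in batches]


def rotation_groups(batches, queue_size):
    """Assumed queue-based rotation: each query batch is grouped with the
    batches currently in the queue and the group is trimmed to its own
    minimum length; the query then enters the queue (oldest one leaves)."""
    queue = deque(maxlen=queue_size)
    groups = []
    for query in batches:
        if len(queue) == queue_size:
            group = [query] + list(queue)
            min_len = min(len(b) for b in group)
            groups.append([b[:min_len] for b in group])
        queue.append(query)
    return groups


# Toy batches with 5, 9, 8 and 10 time steps.
toy = [list(range(n)) for n in (5, 9, 8, 10)]
global_lengths = [len(b) for b in trim_to_global_min(toy)]
group_lengths = [len(g[0]) for g in rotation_groups(toy, queue_size=1)]
```

In this toy run the global trim keeps only 5 time steps of every batch, whereas groups that contain only longer batches keep 8 time steps, which is how tail data of long sequences can survive.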
[0107] Table 3 Ablation Experiment
[0108]
[0109] (5) Hyperparameter sensitivity analysis
[0110] This section performs a sensitivity analysis on the hyperparameters of the total loss in the pre-training phase. Figure 7 shows the average test-set $R^2$ when the number of labels is 2. Figure 7(a) fixes λ2 to 1 and varies λ1 from 0.1 to 10. The results show that model performance is impaired when the contrastive learning weight is too low or too high, and that the model is not sensitive to λ1 when 1 < λ1 < 2. Figure 7(b) fixes λ1 to 1 and shows that increasing the data reconstruction weight improves model performance, but only within a certain range. An excessive weight causes a sharp drop in model performance, possibly because the model becomes overly sensitive to the numerical values of the fermentation process data, which hinders the mapping to label values during the final fine-tuning stage. When λ2 < 2, the model is insensitive to the value of λ2. Therefore, λ1 = 1.5 and λ2 = 1 were chosen as the final weights.
[0111] 3. Conclusion
[0112] This paper proposes a pre-trained product concentration prediction model, QMCP, which provides real-time indicator variables to help fermentation operators intuitively understand the current fermentation state and take appropriate action. QMCP is first pre-trained. A data rotation mechanism handles the problem of non-uniform data length. Positive and negative sample contrastive learning uses perturbed samples of strongly correlated variables as negative samples and perturbed samples of weakly correlated variables as positive samples, pulling positive pairs together and pushing negative pairs apart so that the model keenly captures the features of strongly correlated variables. Contextual contrastive learning uses different samples in the mini-batch as negative pairs and pushes their representations apart to learn the differences between fermentation batches. A data reconstruction module attempts to restore the original data from masked data, encouraging the model to learn the correlation between variables and capture numerical features. Finally, the model is fine-tuned with a small amount of labeled data. Experiments show that QMCP outperforms the other baselines, especially when the number of labels is 2, that is, when only the fermentation start and end points are used as labels: it accurately identifies the product concentration decline trend in the middle of fermentation, which pure supervised learning cannot do. Furthermore, contrastive learning alone is detrimental to the task in this paper because it fails to capture numerical features, while the masking mechanism combined with contrastive learning delivers strong performance.
[0113] The foregoing detailed examples of the present invention are merely preferred embodiments and should not be construed as limiting the scope of the invention. All equivalent variations and modifications made within the scope of the present invention should still fall within the patent coverage of the present invention.
Claims
1. A pre-training-based fermentation product concentration prediction method, characterized by comprising the following steps: S1: constructing a framework for predicting fermentation product concentration that includes a pre-training process and a fine-tuning process; S2: setting up a data rotation module to arrange the historical fermentation process data as training data, and storing the training data in a database; S3: in the pre-training process, first setting up a positive and negative sample contrastive learning module, which perturbs strongly correlated variables to construct negative samples and perturbs weakly correlated variables to construct positive samples, and pre-trains the model by maximizing the distance between the representations of the original sample and the negative sample and minimizing the distance between the representations of the original sample and the positive sample; S4: next in the pre-training process, setting up a contextual contrastive learning module, which treats different batches of data within the same mini-batch as negative samples of each other and pre-trains the model by maximizing the distance between the representations of the original sample and the negative samples; S5: last in the pre-training process, setting up a data reconstruction module, which randomly masks the original samples using a masking mechanism and attempts to reconstruct the masked samples into the original samples; S6: calculating the total loss of the pre-training process and updating the model parameters using gradient descent; S7: in the fine-tuning process, using the limited labels in the historical fermentation process data to further fine-tune the model, calculating the loss and further adjusting the model parameters; S8: obtaining the current batch data from the fermentation data system, loading the model, and predicting the product concentration of the current batch data; wherein, in step S3, a data augmentation family is defined in the positive and negative sample contrastive learning module, the family including random flipping, random shuffling, and random amplification operations; for a given batch of data, weakly correlated variables are selected for data augmentation to construct positive samples, and strongly correlated variables are selected for data augmentation to construct negative samples; the positive samples guide the model to learn that changes in weakly correlated variables do not significantly affect the result, while the negative samples guide the model to learn that changes in strongly correlated variables produce large differences in the result; an autoregressive model is then used to encode the process data, yielding the mapping of the original data, the mapping of the positive samples, and the mapping of the negative samples computed by the autoregressive model; the mappings of the original data and the positive samples are specified as a positive pair, the mappings of the original data and the negative samples are specified as a negative pair, and a positive and negative sample contrastive learning task is constructed.
2. The method of claim 1, wherein, in step S2, the data rotation module organizes historical fermentation process data into training data; for ease of explanation, a dataset of n batches of fermentation data is defined, in which each subset represents the easily measurable fermentation process data of one batch and contains the data at each time step, the subscript indicating the index of the batch; the variable to be predicted is defined correspondingly, and a set of training data of a given length is defined; organizing data using the data rotation mechanism includes the following steps: 2-1) selecting a batch of data as the query set; 2-2) maintaining a queue of fixed size, the size indicating how many batches of data a set of training data contains; 2-3) integrating the query set with the queue data and trimming to the minimum data length among them; a set of data that can be used for model training is thus constructed; after multiple iterations of the query set and multiple updates of the queue, the tail data of long sequences is retained.
3. The method of claim 1, wherein, in step S4, in the contextual contrastive learning module, since the results of different fermentation batches are often different, the results computed by the model from different batches of data should also differ; on this basis, for a given sample, weakly correlated variables are selected for data augmentation to construct positive samples and strongly correlated variables are selected for data augmentation to construct negative samples, yielding the mapping of the original data, the mapping of the positive samples, and the mapping of the negative samples computed by the autoregressive model; the mappings from different batches are specified as negative pairs, and a contextual contrastive learning task is constructed.
4. The method of claim 1, wherein, in step S5, in the data reconstruction module, the data reconstruction pre-training task mainly includes the following steps: 5-1) randomly masking the original sample to obtain a masked sample; 5-2) for a given time step t, computing the mapping of the data at time step t with the autoregressive model; 5-3) constructing a linear projection head composed of a fully connected layer and computing the mapping of the linear projection head, which is considered to be the reconstructed data at time t; the reconstructed data of the original sample at all time steps is obtained in the same way.
5. The method of claim 1, wherein, in step S6, calculating the total loss of the pre-training process mainly includes the following steps: 6-1) for the positive pairs and negative pairs in step S3 and step S4, in order to maximize the similarity between the positive pairs and minimize the similarity between the negative pairs, a loss function $L_{C,i}$ is constructed to measure the distance between the positive pairs and the negative pairs; for $X_i$ the loss function is expressed as: wherein sim(u, v) is the cosine similarity between vector u and vector v, ξ is a hyperparameter used to adjust the similarity scale, the negative set consists of the other batches of data and their augmented samples in the same mini-batch, selected by an indicator function; minimizing $L_{C,i}$ helps the model capture changes brought about by strongly correlated variables, weaken the influence of weakly correlated variables, and learn the differences between different fermentation batches; 6-2) for the data reconstruction in step S5, the mean absolute error loss function is used to measure the distance between the original sample and the reconstructed sample; specifically, for $X_i$ the loss function $L_{M,i}$ is expressed as: wherein pred_step is a set of k non-repeating positive integers, each smaller than the total number of time steps of the sample $X_i$, the size of the set is limited by the available computing resources, $\hat{x}_{it}$ is the reconstructed data at time t, and $x_{it}$ is the original data at time t; minimizing $L_{M,i}$ encourages the model to bring the reconstructed sample close to the original sample; 6-3) the total loss of the pre-training stage combines the loss of positive and negative sample contrastive learning, the loss of contextual contrastive learning, and the loss of data reconstruction; for a sample $X_i$ it is expressed as $L_i = \lambda_1 L_{C,i} + \lambda_2 L_{M,i}$, wherein $\lambda_1$ and $\lambda_2$ are adjustable hyperparameters; after $L_i$ is minimized, the pre-trained model is obtained and copied to the fine-tuning phase.
6. The method of claim 1, wherein, in step S7, the limited labels in the historical fermentation process data are used to further fine-tune the model; specifically, for a labeled time step $t_{label}$, the sample $X_i$ has labeled data $Y_i$; the sample data up to $t_{label}$ is used to compute the output of the pre-trained model, and a linear projection head $P_{out}$, which contains a fully connected layer, then computes the predicted product concentration at time $t_{label}$; the label predictions for all labeled time steps are obtained in the same way; using the mean absolute error as the loss function in the fine-tuning stage, the fine-tuning loss $L_{reg}$ for the sample $X_i$ is expressed as: wherein label_num is the total number of labels used for fine-tuning, $\hat{y}_{ik}$ is the k-th predicted value, and $y_{ik}$ is the k-th value of $Y_i$.