Industrial process modeling method and system fusing active learning and random configuration network

By integrating active learning with stochastic configuration networks, the problems of sample imbalance and information redundancy in industrial modeling are addressed, achieving high-precision and high-efficiency model construction with a low annotation budget and meeting the needs of online industrial modeling.

CN122196557A · Pending · Publication Date: 2026-06-12 · QINGDAO UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QINGDAO UNIV OF SCI & TECH
Filing Date: 2026-05-14
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies in industrial modeling suffer from problems such as unbalanced sample selection, severe information redundancy, high computational complexity, and poor model generalization ability, making it difficult to achieve high-precision and high-efficiency model construction with low annotation budgets.

Method used

We adopt a method that integrates active learning and stochastic configuration networks. We calculate the uncertainty metric through a query committee, combine two-layer clustering screening and sample redundancy elimination to optimize the candidate sample set, and perform incremental modeling within a preset annotation budget. We update the model by exploiting the adaptive hidden-layer construction and analytical weight solution of the stochastic configuration network.

Benefits of technology

With a low annotation budget, the model achieves simultaneous improvement in prediction accuracy, operational coverage, and training efficiency, avoids repeated sampling and information redundancy in local areas, improves the efficiency of annotation resource utilization and the model's fitting ability, and meets the timeliness requirements of industrial online modeling.

✦ Generated by Eureka AI based on patent content.

Figure CN122196557A_ABST

Abstract

The application provides an industrial process modeling method and system fusing active learning and random configuration network, relates to the technical field of industrial data modeling, and comprises the following steps: collecting industrial process data and performing pretreatment, and constructing a to-be-labeled sample pool and an initial labeling set; a query committee is constructed, and a first uncertainty measure of an unlabeled sample is calculated; based on the first uncertainty measure, a candidate labeling set is selected from the to-be-labeled sample pool; after double-layer clustering and redundancy elimination are performed on the candidate labeling set, a to-be-labeled sample set is selected; after the to-be-labeled sample set is labeled and incorporated into the initial labeling set, an updated training set is obtained, incremental modeling is performed based on the updated training set, and an iterative model is obtained; iteration is repeatedly performed until a preset iteration termination condition is reached, and a final industrial data modeling model is obtained. The application realizes synchronous improvement of model prediction accuracy, working condition coverage capability and training efficiency under a low labeling budget.

Description

Technical Field

[0001] This invention relates to the field of industrial data modeling technology, and in particular to an industrial process modeling method and system that integrates active learning and stochastic configuration networks.

Background Technology

[0002] In recent years, with the deep integration of the Industrial Internet and big data technologies, data-driven soft measurement technology has become a core means for real-time monitoring of key process indicators in complex industrial processes. To address the high cost and long cycle of acquiring labeled data from industrial sites, modeling frameworks that integrate active learning with efficient machine learning models have become a research hotspot. Active learning aims to improve model performance at the lowest labeling cost by intelligently selecting high-value samples for labeling. Its strategies have evolved from early uncertainty-based sampling to more robust batch sampling methods such as query-by-committee and expected model change maximization. Meanwhile, stochastic configuration networks (SCNs), a simple and efficient class of randomized learning models, have shown potential for handling nonlinear and small-sample problems in industrial modeling because they construct the hidden layer adaptively and solve the output weights analytically. How to combine advanced active learning mechanisms with efficient stochastic configuration networks to construct industrial models with high accuracy, strong generalization, and high cost-effectiveness under limited labeling budgets is currently an important frontier direction in this field.

[0003] Despite the progress made by the aforementioned existing technologies, significant limitations remain in complex industrial modeling scenarios. First, at the sample selection level, existing mainstream active learning strategies have inherent flaws: while query committee strategies can assess uncertainty, they tend to repeatedly select samples from local regions in the feature space during batch sampling, leading to "sampling congestion," highly homogeneous selected samples, severe information redundancy, and a lack of consideration for the global representativeness of the sample space distribution. Second, in terms of efficiency and generalization, existing methods struggle to synergistically optimize multiple objectives: introducing clustering alone to ensure representativeness introduces a large number of "simple samples" that do not contribute to model improvement; attempting to remove redundancy by calculating sample similarity faces the problem of exploding computational complexity under high-dimensional data, failing to meet the timeliness requirements of industrial online applications. Furthermore, existing technologies often use deep networks or traditional SCNs for direct training on small samples, making them prone to overfitting and exhibiting poor generalization ability.

[0004] How to solve the above-mentioned technical problems is the challenge addressed by this invention.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides an industrial process modeling method and system that integrates active learning and stochastic configuration networks, achieving simultaneous improvements in model prediction accuracy, operational condition coverage, and training efficiency under a low annotation budget.

[0006] The technical solution adopted by this invention to solve its technical problem is as follows: This invention provides an industrial process modeling method that integrates active learning and stochastic configuration networks, comprising the following steps:
S1. Collect and preprocess the input and output variable data of the industrial process to obtain the sample pool to be labeled, and select samples of a preset size from the sample pool to be labeled to construct the initial labeled set;
S2. Construct a query committee based on the initial labeled set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate annotation set from the unlabeled sample pool based on the first uncertainty measure;
S3. Perform two-layer clustering screening on the candidate annotation set to obtain an optimized candidate set, and perform sample redundancy elimination on the optimized candidate set to obtain a sample set after redundancy elimination;
S4. Within the preset annotation budget, filter the sample set after redundancy elimination to obtain the sample set to be annotated;
S5. After labeling the sample set to be annotated, merge it into the initial labeled set to obtain the updated training set, and use a stochastic configuration network for incremental modeling based on the updated training set to obtain the iterative model;
S6. Repeat steps S2 to S5 until the preset iteration termination condition is met to obtain the final industrial data modeling model.

[0007] Preferably, the preprocessing in step S1 includes missing value imputation, outlier correction and min-max normalization; the initial annotation set is constructed by selecting samples of a preset size from the pool of samples to be labeled using a random sampling method without replacement.

[0008] Preferably, step S2 specifically includes: For the initial labeled set, a bootstrap sampling method is used to generate several resampled subsets, and one stochastic configuration network sub-model is trained on each resampled subset, so as to construct a query committee composed of several stochastic configuration network sub-models; The query committee is used to predict the samples in the unlabeled sample pool. For each sample, the variance of the prediction results of all sub-models is calculated in each output dimension, and the mean of these variances over the output dimensions is taken as the first uncertainty measure. The samples are sorted in descending order of the first uncertainty measure, and a predetermined proportion of the top-ranked samples is selected to form the candidate annotation set.

[0009] Preferably, the two-layer clustering screening in step S3 includes a first-layer clustering and a second-layer weighted clustering; The first-layer clustering uses the k-means clustering algorithm to partition the candidate annotation set into several clusters. The cluster diversity index of each cluster is calculated, the clusters are sorted in descending order of their diversity indices, and the top-ranked clusters are selected to form a pre-selected cluster set. The cluster diversity index is calculated as follows: for a single cluster, the mean of the feature vectors of all samples in that cluster is calculated to obtain the cluster center vector; the mean of the feature vectors of all labeled samples in the initial labeled set is calculated to obtain the baseline center vector of the labeled region; the Euclidean distance between the cluster center vector and the baseline center vector of the labeled region is calculated, and the normalized value of this Euclidean distance is used as the cluster diversity index of that cluster. The second-layer weighted clustering operates on the samples in the pre-selected cluster set, uses the first uncertainty measure calculated in step S2 as the sample weight, and takes minimizing the weighted sum of squared distances from the samples to their corresponding cluster centers as the objective function. Weighted k-means clustering is performed, and the sample with the highest uncertainty measure is selected from each weighted sub-cluster to form the optimized candidate set.
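The cluster diversity index described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the min-max form of the normalization (the text only says the distance is normalized), and the toy two-dimensional data are all hypothetical.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cluster_diversity_indices(clusters, labeled_features):
    """Euclidean distance of each cluster center from the labeled-region
    baseline center, min-max normalized to [0, 1] (assumed normalization)."""
    baseline = mean_vector(labeled_features)
    dists = [math.dist(mean_vector(c), baseline) for c in clusters]
    lo, hi = min(dists), max(dists)
    if hi == lo:  # all clusters equally far away: no ranking information
        return [0.0 for _ in dists]
    return [(d - lo) / (hi - lo) for d in dists]

def preselect_clusters(clusters, labeled_features, top_n):
    """Indices of the top_n clusters in descending order of diversity index."""
    idx = cluster_diversity_indices(clusters, labeled_features)
    order = sorted(range(len(clusters)), key=lambda i: idx[i], reverse=True)
    return order[:top_n]

# Toy data: the labeled region is centered at (0.1, 0.0).
labeled = [[0.0, 0.0], [0.2, 0.0]]
clusters = [
    [[0.1, 0.0], [0.1, 0.2]],  # center (0.1, 0.1), distance 0.1
    [[0.1, 1.0], [0.1, 1.0]],  # center (0.1, 1.0), distance 1.0
    [[0.6, 0.0], [0.6, 0.0]],  # center (0.6, 0.0), distance 0.5
]
indices = cluster_diversity_indices(clusters, labeled)
selected = preselect_clusters(clusters, labeled, top_n=2)
```

Clusters far from the already labeled region rank first, which is how the first layer favors unexplored operating conditions.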

[0010] Preferably, the sample redundancy elimination process performed on the optimized candidate set in step S3 includes: Based on the query committee, a pseudo-label is generated for each sample in the optimized candidate set; the pseudo-label is the arithmetic mean of the prediction results of all stochastic configuration network sub-models in the query committee for that sample. A single pseudo-labeled sample from the optimized candidate set is then taken as the simulated labeled sample and merged into the initial labeled set to obtain a temporary labeled set. The query committee is reconstructed based on the temporary labeled set to obtain an updated query committee, and based on the updated query committee, the second uncertainty measure is calculated for every candidate sample in the optimized candidate set other than the current simulated labeled sample. For each of these other candidate samples, the difference between the first uncertainty measure and the second uncertainty measure is calculated to obtain the expected change in variance, and the ratio of the expected change in variance to the first uncertainty measure is calculated to obtain the relative decline rate. An uncertainty threshold is set for the second uncertainty measure and a decline rate threshold is set for the relative decline rate. When the second uncertainty measure is lower than the uncertainty threshold and the relative decline rate is greater than or equal to the decline rate threshold, the corresponding candidate sample is determined to be a redundant sample and is removed.
The filtering of the sample set after redundancy elimination in step S4 includes: The sample set after redundancy elimination is sorted in descending order of the first uncertainty measure calculated in step S2, and then filtered based on the preset annotation budget to obtain the sample set to be annotated.
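The redundancy decision rule above reduces to two comparisons per candidate. The sketch below illustrates only that rule; the threshold values, the candidate names, and the handling of a zero first uncertainty measure are assumptions made for the example, since the text does not give them.

```python
def relative_decline_rate(u1, u2):
    """(u1 - u2) / u1: expected change in variance relative to the first measure."""
    if u1 <= 0:
        return 0.0  # assumption: with zero initial uncertainty there is no decline
    return (u1 - u2) / u1

def is_redundant(u1, u2, uncertainty_threshold, rate_threshold):
    """Redundant when the second measure falls below the uncertainty threshold
    AND the relative decline rate reaches the decline rate threshold."""
    return (u2 < uncertainty_threshold
            and relative_decline_rate(u1, u2) >= rate_threshold)

# Hypothetical (first, second) uncertainty measures for three candidates.
candidates = {
    "a": (0.040, 0.004),  # low residual uncertainty, 90% decline -> redundant
    "b": (0.040, 0.036),  # still uncertain after the simulated labeling -> keep
    "c": (0.010, 0.009),  # low uncertainty but only 10% decline -> keep
}
kept = [name for name, (u1, u2) in candidates.items()
        if not is_redundant(u1, u2, uncertainty_threshold=0.01, rate_threshold=0.5)]
```

Requiring both conditions means a candidate is dropped only when labeling another sample would already explain most of its uncertainty.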

[0011] Preferably, the incremental modeling using a stochastic configuration network based on the updated training set in step S5 includes: A stochastic configuration network is constructed, with the input features of the updated training set as the model input and the corresponding output variables as the model output target. Hidden layer nodes are adaptively and iteratively configured through preset inequality constraints. After each hidden layer node is configured, the output layer weights of the model are analytically solved using the least squares method with L2 regularization, and the model prediction residuals are updated. The preset inequality constraint is expressed as follows (written in the standard stochastic configuration network notation):

$$\xi_{L,q} = \frac{\langle e_{L-1,q},\, g_L \rangle^{2}}{g_L^{\mathrm{T}} g_L} - (1 - r - \mu_L)\,\lVert e_{L-1,q} \rVert^{2} \ge 0$$

[0012] where $\xi_{L,q}$ denotes the contribution of the $L$-th hidden node to the $q$-th output component, used to select the optimal node parameters; $e_{L-1,q}$ denotes the current residual vector in the $q$-th output dimension before the network adds the $L$-th node; $g_L$ denotes the hidden layer output vector of the candidate node under the current input; $r$ denotes the constraint parameter for network construction, with a value range of (0,1), used to control the strength of random mapping; $\mu_L$ denotes a sequence of non-negative real numbers satisfying $\lim_{L\to\infty}\mu_L = 0$ and $\mu_L \le 1 - r$, used to dynamically adjust the looseness of the inequality constraint during the iteration process; and $q$ denotes the index of the output dimension. The formula for the output layer weights is as follows:

$$\beta = \left(H^{\mathrm{T}} H + \lambda I\right)^{-1} H^{\mathrm{T}} T$$

[0013] where $T$ is the target output matrix of the training set, $H$ is the hidden layer output matrix, $\lambda$ is the regularization coefficient, $I$ is the identity matrix, and $\beta$ is the output weight matrix.
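To make the node-by-node construction concrete, the following sketch builds a small network using the inequality check and the L2-regularized least-squares solution. It is an illustration under assumptions, not the patented implementation: the sigmoid activation, the [-1, 1] range for random weights, the sequence mu_L = (1 - r) / (L + 1), the rule of keeping the admissible candidate with the largest total contribution, and the toy data are all choices made for the example.

```python
import numpy as np

def node_contributions(residual, g, r, mu):
    """xi_q = <e_q, g>^2 / (g^T g) - (1 - r - mu) * ||e_q||^2 for each output q.
    residual: (N, m) current residuals; g: (N,) candidate node output."""
    proj = (residual.T @ g) ** 2 / (g @ g)
    return proj - (1.0 - r - mu) * np.sum(residual ** 2, axis=0)

def ridge_output_weights(H, T, lam):
    """beta = (H^T H + lam * I)^-1 H^T T, solved without forming the inverse."""
    n_nodes = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_nodes), H.T @ T)

# Toy regression problem (hypothetical data, 3 inputs, 1 output).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 3))
T = np.sin(X.sum(axis=1, keepdims=True))
r, lam, max_nodes, n_candidates = 0.99, 1e-6, 10, 50

H = np.empty((X.shape[0], 0))   # hidden layer output matrix, grows by one column per node
residual = T.copy()
initial_norm = np.linalg.norm(residual)
for L in range(1, max_nodes + 1):
    mu = (1.0 - r) / (L + 1)    # assumed sequence: non-negative, <= 1 - r, tends to 0
    best_g, best_score = None, -np.inf
    for _ in range(n_candidates):
        w, b = rng.uniform(-1.0, 1.0, size=3), rng.uniform(-1.0, 1.0)
        g = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid node output (assumption)
        xi = node_contributions(residual, g, r, mu)
        # Admissible only if the inequality holds in every output dimension.
        if np.all(xi >= 0) and xi.sum() > best_score:
            best_g, best_score = g, xi.sum()
    if best_g is None:
        break                   # no admissible candidate found in this round
    H = np.column_stack([H, best_g])
    beta = ridge_output_weights(H, T, lam)       # analytic output weights
    residual = T - H @ beta                      # update prediction residuals
final_norm = np.linalg.norm(residual)
```

Because the output weights are re-solved in closed form after each node, no backpropagation loop is needed, which is the property the method relies on for fast retraining.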

[0014] Preferably, the preset iteration termination condition in step S6 includes the model's performance reaching a preset accuracy requirement, or the cumulative number of labeled samples reaching a preset total labeling budget limit.

[0015] This invention also provides an industrial process modeling system that integrates active learning and stochastic configuration networks, including:
The data acquisition and preprocessing module, used to collect input and output variable data of industrial processes and preprocess them to obtain a sample pool to be labeled, and to select samples of a preset size from the sample pool to be labeled to construct an initial labeled set;
The uncertainty screening module, used to construct a query committee based on the initial labeled set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate annotation set from the unlabeled sample pool based on the first uncertainty measure;
The clustering screening and redundancy elimination module, used to perform two-layer clustering screening on the candidate annotation set to obtain an optimized candidate set, and to perform sample redundancy elimination on the optimized candidate set to obtain a sample set after redundancy elimination;
The final sample selection module, used to filter the sample set after redundancy elimination within the preset annotation budget to obtain the sample set to be annotated;
The incremental training and update module, used to label the sample set to be annotated and merge it into the initial labeled set to obtain the updated training set, and to use a stochastic configuration network for incremental modeling based on the updated training set to obtain the iterative model;
The iteration and model output module, used to repeatedly execute steps S2 to S5 until the preset iteration termination condition is reached, and to obtain the final industrial data modeling model.

[0016] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described industrial process modeling method that integrates active learning and stochastic configuration networks.

[0017] The present invention also provides a computer storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described industrial process modeling method that integrates active learning and stochastic configuration networks.

[0018] The beneficial effects of this invention are as follows: it achieves simultaneous improvement in model prediction accuracy, operational condition coverage, and training efficiency under a low annotation budget. First, a query committee completes the initial uncertainty screening, identifying the samples of highest value to the model. Then, two-layer clustering screening combined with sample redundancy elimination retains core samples with high uncertainty while also taking into account the global distribution of the sample space, which avoids repeated sampling in local areas, eliminates redundant sample information, and improves the efficiency of annotation resource utilization. The hierarchical screening logic of uncertainty screening, two-layer clustering, and redundancy removal filters out simple samples that bring no gain. At the same time, two-layer clustering reduces the complexity of similarity calculation on high-dimensional data, improving screening efficiency while ensuring sample representativeness and meeting the timeliness requirements of industrial online modeling. The incrementally updated labeled set is modeled with a stochastic configuration network. Leveraging its adaptive hidden layer construction and analytical weight solution, combined with an iterative incremental optimization mechanism, the method mitigates overfitting under small sample sizes and improves the model's fitting ability and generalization performance on nonlinear and non-stationary industrial data, supporting the accuracy and stability of industrial modeling.

Attached Figure Description

[0019] Figure 1 is a diagram illustrating the method steps of the present invention.

[0020] Figure 2 is a system module diagram of the present invention.

[0021] Figure 3 is a schematic diagram comparing the prediction results and error probability density of Embodiment 3 of the present invention with other comparative methods in the task of predicting ammonia nitrogen concentration in effluent.

[0022] Figure 4 is a diagram of the internal structure of a computer device according to Embodiment 4 of the present invention.

Detailed Implementation

[0023] To clearly illustrate the technical features of this solution, the solution is explained below through specific embodiments.

[0024] Embodiment 1: As shown in Figure 1, this embodiment is an industrial process modeling method that integrates active learning and stochastic configuration networks. Taking the prediction of ammonia nitrogen concentration in the effluent of a municipal wastewater treatment plant as an example, the method of the present invention is described in detail and includes the following steps:
S1. Collect and preprocess the input and output variable data of the industrial process to obtain the sample pool to be labeled, and select samples of a preset size from the sample pool to be labeled to construct the initial labeled set.
Historical data from 90 consecutive days of operation of the wastewater treatment plant were collected at a sampling interval of 1 hour, resulting in a total of 2160 samples. As input variables, 12 easily measurable process variables closely related to the effluent ammonia nitrogen concentration were selected: influent flow rate, influent chemical oxygen demand, influent ammonia nitrogen concentration, influent total phosphorus, dissolved oxygen concentration in the aerobic zone, mixed liquor suspended solids concentration, sludge return ratio, sludge age, effluent pH, effluent temperature, oxidation-reduction potential, and air flow rate. The output variable was the effluent ammonia nitrogen concentration, which has to be obtained through offline laboratory testing.

[0025] It should be noted that the above variable selection is only a specific example of this embodiment. In actual applications, the corresponding input and output variables can be determined according to the specific industrial process.

[0026] Preprocessing operations were performed on the collected raw data. First, missing value handling was performed: for variables with consecutive missing values of no more than 3 hours, linear interpolation was used to fill in the missing values; for periods with consecutive missing values of more than 3 hours, all samples corresponding to that period were removed. In this example, a total of 27 invalid samples were removed, leaving 2133 samples.
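The missing-value rule above can be sketched for a single variable as follows. This is a simplified illustration under assumptions: with a 1-hour sampling interval, a 3-hour gap is treated as 3 consecutive missing samples; missing values are encoded as `None`; gaps touching either end of the series are dropped because they cannot be interpolated; and the function name is hypothetical.

```python
def handle_missing(series, max_gap=3):
    """Return (filled_values, kept_indices) for one variable.
    Runs of at most max_gap consecutive None values are linearly interpolated
    between the neighbouring valid points; longer runs, and runs at either
    end of the series, are dropped."""
    n = len(series)
    filled = list(series)
    drop = set()
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        j = i
        while j < n and series[j] is None:
            j += 1                      # [i, j) is one run of missing values
        interior = i > 0 and j < n
        if interior and (j - i) <= max_gap:
            left, right = series[i - 1], series[j]
            step = (right - left) / (j - i + 1)
            for k in range(i, j):
                filled[k] = left + step * (k - i + 1)
        else:
            drop.update(range(i, j))    # too long (or unbounded) to interpolate
        i = j
    kept = [k for k in range(n) if k not in drop]
    return [filled[k] for k in kept], kept

# A 2-sample gap is interpolated; a 4-sample gap is removed.
raw = [1.0, None, None, 4.0, None, None, None, None, 9.0]
filled, kept = handle_missing(raw)
```

In a multi-variable setting the dropped indices of all variables would be combined so that the whole sample is removed, as the paragraph describes.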

[0027] Furthermore, the 3σ criterion is used to detect outliers for each variable. Sample points that fall outside the mean ± 3 standard deviations are considered outliers, and each is replaced with the median of the normal samples immediately before and after it for that variable. If 5 or more outliers appear consecutively, the entire sample segment is removed. In this embodiment, a total of 16 outliers were corrected and 1 consecutive outlier segment (5 samples in total) was removed, leaving 2128 samples.
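A single-variable sketch of the 3σ correction is given below. It rests on several assumptions: the sample standard deviation is used, "the two normal samples before and after" is read as the nearest normal sample on each side of the outlier run, and the run-length cutoff is a parameter set so that a run of 5 is removed, matching the segment removed in this embodiment. The function name and the readings are hypothetical.

```python
import statistics

def correct_outliers(values, n_sigma=3.0, min_run_to_drop=5):
    """Return (corrected_values, kept_indices) for one variable.
    Points farther than n_sigma standard deviations from the mean are outliers.
    Short runs are replaced by the median of the nearest normal neighbours;
    runs of min_run_to_drop or more consecutive outliers are dropped."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    is_out = [abs(v - mean) > n_sigma * std for v in values]
    corrected, drop = list(values), set()
    n, i = len(values), 0
    while i < n:
        if not is_out[i]:
            i += 1
            continue
        j = i
        while j < n and is_out[j]:
            j += 1                       # [i, j) is one run of consecutive outliers
        neighbours = []
        if i > 0:
            neighbours.append(values[i - 1])   # nearest normal sample before the run
        if j < n:
            neighbours.append(values[j])       # nearest normal sample after the run
        if (j - i) >= min_run_to_drop or not neighbours:
            drop.update(range(i, j))
        else:
            fill = statistics.median(neighbours)
            for k in range(i, j):
                corrected[k] = fill
        i = j
    kept = [k for k in range(n) if k not in drop]
    return [corrected[k] for k in kept], kept

# Twenty hypothetical readings with one spike at index 10.
readings = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0,
            50.0,
            1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
cleaned, kept_idx = correct_outliers(readings)
```

Note that a single extreme point inflates the standard deviation, so the 3σ test needs a reasonably long series before it can flag anything.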

[0028] It should also be noted that the specific thresholds for missing value handling and outlier correction (3 hours, 3σ, 5 consecutive outliers) can be adaptively adjusted according to the data quality of different industrial sites.

[0029] Furthermore, min-max normalization is applied to all input and output variables, linearly mapping the data to the [0,1] interval. It should be noted that the normalization parameters need to be saved for applying the same transformation to unknown new samples and for inverse normalization of the model's prediction results.
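The point about saving the normalization parameters can be illustrated with a short sketch. The function names are hypothetical, and mapping a constant column to 0 is an assumption added to avoid division by zero.

```python
def fit_min_max(column):
    """Learn the normalization parameters (min, max) from the training data."""
    return min(column), max(column)

def transform(column, params):
    """Map values to [0, 1] using previously saved parameters."""
    lo, hi = params
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # assumption: a constant column maps to 0
    return [(v - lo) / span for v in column]

def inverse_transform(scaled, params):
    """Map normalized values (e.g. model predictions) back to physical units."""
    lo, hi = params
    return [s * (hi - lo) + lo for s in scaled]

train = [2.0, 4.0, 6.0]
params = fit_min_max(train)                  # saved once, reused afterwards
scaled = transform(train, params)
restored = inverse_transform(scaled, params)
new_scaled = transform([8.0], params)        # a new sample may fall outside [0, 1]
```

Reusing `params` for new samples, rather than refitting, keeps training data and online data on the same scale.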

[0030] The 2128 samples obtained after preprocessing constitute the sample pool to be labeled. In actual industrial scenarios, obtaining the output variable requires offline testing; therefore, the output values of all samples in the sample pool to be labeled are initially unknown. To start the active learning iteration, a preset number of samples are randomly selected from the sample pool to be labeled for manual testing and labeling, forming the initial labeled set. In this embodiment, 30 samples are drawn by simple random sampling without replacement and sent to the laboratory to obtain their actual effluent ammonia nitrogen concentration values, constituting the initial labeled set. The remaining 2098 samples constitute the unlabeled sample pool.
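The split into an initial labeled set and an unlabeled pool can be sketched with index sets. The function name and the fixed seed are assumptions for reproducibility of the example only.

```python
import random

def split_initial_set(pool_size, initial_size, seed=None):
    """Draw initial_size distinct indices (simple random sampling without
    replacement); the remaining indices form the unlabeled pool."""
    rng = random.Random(seed)
    initial = sorted(rng.sample(range(pool_size), initial_size))
    chosen = set(initial)
    unlabeled = [i for i in range(pool_size) if i not in chosen]
    return initial, unlabeled

# Sizes from this embodiment: 2128 preprocessed samples, 30 sent for labeling.
initial_idx, unlabeled_idx = split_initial_set(2128, 30, seed=0)
```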

[0031] Furthermore, the size of the initial labeled set can be chosen according to the labeling cost and the initial modeling requirements of the industrial site. A size of 20 to 50 samples is generally recommended so that the initial stochastic configuration network model has basic training data support.

[0032] After step S1 is completed, the sample pool to be labeled, the initial labeled set, and the unlabeled sample pool are obtained. The above data will be used as the input for the subsequent step S2 to build the query committee and start active learning iteration.

[0033] S2. Construct a query committee based on the initial labeled set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate annotation set from the unlabeled sample pool based on the first uncertainty measure. It should be noted that step S2 takes as input the initial labeled set and the unlabeled sample pool output by step S1. Its core purpose is to construct a query committee based on stochastic configuration networks, quantify the prediction uncertainty of each sample in the unlabeled sample pool, and perform an initial screening of high-information samples. This addresses two problems in the existing technology, namely the large bias of uncertainty estimates obtained from a single model and the consumption of the limited labeling budget by low-value samples, and provides a reliable sample-priority basis for the clustering screening and redundancy elimination in subsequent steps.

[0034] Furthermore, this step first constructs a query committee, which consists of several structurally independent, randomly configured network sub-models with differentiated training data sources. It should be noted that this invention selects a randomly configured network as the base model for the query committee. The core reason is that the randomly configured network can adaptively configure hidden layer nodes through preset inequality constraints and analytically solve the output weights using the least squares method. It does not require time-consuming backpropagation iterative optimization and can quickly complete the training of multiple independent models. This adapts to the rapid modeling needs of active learning iterative processes in industrial scenarios and solves the industry pain points of high training overhead and low iteration efficiency of query committees based on deep learning models in the prior art.

[0035] It should also be noted that the query committee is constructed using the bootstrap method. The initial labeled set of 30 samples generated in step S1 is used as the sole training data source. Multiple independent resampled subsets are generated from it by random sampling with replacement. Each resampled subset has the same size as the initial labeled set, that is, a single resampled subset contains 30 samples, and the same sample may be drawn more than once within a subset. At the same time, the sample distribution of each resampled subset is kept consistent with the overall working condition distribution of the initial labeled set, to avoid a severe working condition bias in a resampled subset that would cause the sub-model training to fail.
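The resampling step can be sketched as below. The committee size of 5 is an assumption, since the text only says "multiple" subsets, and the sketch does not implement the check that each subset preserves the working condition distribution; the function name is hypothetical.

```python
import random

def bootstrap_subsets(n_samples, n_subsets, seed=None):
    """Generate n_subsets index lists, each of length n_samples, drawn with
    replacement from range(n_samples)."""
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_subsets)]

# 30 labeled samples as in this embodiment; 5 committee members (assumed).
subsets = bootstrap_subsets(n_samples=30, n_subsets=5, seed=0)
```

Each index list would then select the rows used to train one sub-model of the committee.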

[0036] Furthermore, after constructing multiple resampled subsets, a randomly configured network sub-model is independently trained for each resampled subset. All sub-models use completely identical network construction constraint parameters to ensure that the fitting ability of each sub-model is at the same level, eliminating the interference of model capacity differences on subsequent uncertainty evaluation results. In this embodiment, the preset constraint parameters of each randomly configured network sub-model include: a maximum number of hidden layer nodes of 35, a residual convergence accuracy of 1e-4, a constraint parameter r of 0.99, and a regularization coefficient λ of 1e-6. All sub-models are trained until the residuals meet the preset accuracy or reach the upper limit of the number of hidden layer nodes, ensuring that each sub-model can effectively fit the initial labeled data and has reliable prediction capability for effluent ammonia nitrogen concentration.

[0037] Furthermore, after completing the construction of the query committee, the uncertainty measure of the prediction results of each unlabeled sample in the unlabeled sample pool is calculated. It should be noted that the uncertainty measure of the prediction results in this step is used to quantify the degree of disagreement of the query committee on the prediction results of a single group of unlabeled samples. The higher the degree of disagreement, the higher the uncertainty of the sample, and the more insufficient the existing model learns the feature patterns of the sample. Labeling the sample can bring greater information gain to the model, which is the core quantitative basis for subsequent sample selection.

[0038] It should also be noted that the uncertainty measure is calculated as follows: the 12-dimensional input feature vector of each unlabeled sample in the unlabeled sample pool generated in step S1 is fed to every stochastic configuration network sub-model in the query committee to obtain that sub-model's predicted effluent ammonia nitrogen concentration for the sample; for a single unlabeled sample, the variance of the prediction results of all sub-models is calculated in each output dimension, and the mean of these variances over the output dimensions is taken as the uncertainty measure. In this embodiment, each of the 2098 unlabeled samples in the unlabeled sample pool is fed in turn to the constructed sub-models to obtain the corresponding predicted effluent ammonia nitrogen concentrations, and the variance of these predicted values is calculated, which is the first uncertainty measure of that sample.

[0039] It should be noted that in this embodiment the predicted output dimension is 1 (i.e., the predicted value of ammonia nitrogen concentration in the effluent), so the variance can be used directly as the uncertainty measure. If the predicted output dimension is greater than 1, the mean of the variances of all dimensions is taken as the uncertainty measure of the sample.
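The uncertainty measure for one sample can be sketched as follows. The use of the population variance is an assumption (the text does not say whether population or sample variance is meant), and the function name and prediction values are hypothetical.

```python
import statistics

def first_uncertainty(predictions):
    """predictions[k][d] is sub-model k's prediction in output dimension d
    for ONE sample. Returns the mean, over output dimensions, of the
    across-model variance (population variance, by assumption)."""
    per_dim = [statistics.pvariance(dim_preds) for dim_preds in zip(*predictions)]
    return statistics.fmean(per_dim)

# Single output, three committee members: variance of [1, 2, 3] is 2/3.
u_single = first_uncertainty([[1.0], [2.0], [3.0]])

# Two outputs, two members: variances 1.0 and 0.0 average to 0.5.
u_multi = first_uncertainty([[1.0, 0.0], [3.0, 0.0]])
```

With one output dimension the mean over dimensions is simply the variance itself, which is the case described in this embodiment.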

[0040] Furthermore, after calculating the uncertainty measure of all unlabeled samples in the unlabeled sample pool, sample screening is performed based on the first uncertainty measure to obtain a candidate labeled set. It should be noted that the core purpose of this screening is to eliminate simple samples with low uncertainty that do not substantially help improve model performance, while controlling the computational load of subsequent clustering screening and redundancy elimination steps to meet the timeliness requirements of industrial field modeling.

[0041] It should also be noted that the screening is performed as follows: all unlabeled samples in the unlabeled sample pool are sorted in descending order of the calculated uncertainty measure, and a preset proportion of the top-ranked samples is selected to form the candidate annotation set. In this embodiment, taking into account the preset single-round labeling budget and the need to limit computation, the screening ratio is set to 15% of the size of the unlabeled sample pool, that is, the top 315 samples with the highest uncertainty are selected (15% of 2098 is 314.7, rounded up to 315). The samples in this candidate set are those on which the query committee's predictions diverge most and which therefore have the highest potential information gain. This keeps low-value simple samples out of the later stages and provides the input for the two-layer clustering screening in step S3.
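The top-proportion selection can be sketched as below. Rounding up is an assumption chosen because it reproduces the 315 candidates reported for a pool of 2098; the function name and the uncertainty values are hypothetical.

```python
import math

def select_candidates(uncertainty, ratio):
    """Indices of the top ceil(ratio * n) samples by descending uncertainty."""
    k = math.ceil(ratio * len(uncertainty))
    order = sorted(range(len(uncertainty)),
                   key=lambda i: uncertainty[i], reverse=True)
    return order[:k]

# Small example: keep the top half of four samples.
top_half = select_candidates([0.1, 0.9, 0.5, 0.3], ratio=0.5)

# Pool size from this embodiment with placeholder uncertainty values.
pool_uncertainty = [i * 1e-3 for i in range(2098)]
candidate_idx = select_candidates(pool_uncertainty, ratio=0.15)
```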

[0042] Furthermore, after step S2 is completed, the output is a complete query committee model system and candidate label set. The above output will be directly used as the input for the subsequent step S3 to perform the two-layer clustering screening operation.

[0043] S3. Perform two-level clustering screening on the candidate annotation set to obtain an optimized candidate set, and perform sample redundancy elimination processing on the optimized candidate set to obtain a sample set after redundancy elimination; It should be noted that step S3 follows the candidate annotation set and query committee output from step S2. Its core purpose is to perform spatial distribution optimization screening on the uncertain samples obtained from the initial screening. This addresses the core pain points of batch sampling in existing active learning technologies, such as "sampling congestion," severe sample homogenization, and the inability to balance high information content and spatial representativeness. Through a two-layer clustering screening mechanism, while retaining the high uncertainty of the samples, it strengthens the ability of the samples to cover the entire feature space, and finally obtains an optimized candidate set that balances high information content and spatial representativeness.

[0044] Furthermore, this step first performs a first-level clustering screening, which is used to divide the sample feature space distribution of the candidate labeled set, quantify the exploration value of different sample clusters relative to the labeled area, and screen out a pre-selected cluster set with high working condition coverage potential.

[0045] It should also be noted that the first layer of clustering is implemented using unsupervised clustering. The clustering object is all samples in the candidate label set generated in step S2. The 12-dimensional normalized input feature vector of the sample is used as the clustering basis, and the k-means clustering algorithm is used to complete the division of sample clusters. The number of clusters is determined according to the sample size of the candidate label set and the number of working conditions in the industrial process. In this embodiment, for the candidate label set consisting of 315 groups of samples, the number of clusters is set to 15. All samples are divided into 15 independent sample clusters by k-means clustering. Each sample cluster corresponds to a set of samples with similar working conditions in the feature space, ensuring that the sample features within the same cluster are highly similar and the sample features between different clusters are significantly different.

[0046] Furthermore, after completing the first-level clustering of sample clusters, the cluster diversity index of each sample cluster relative to the labeled area is calculated. It should be noted that the cluster diversity index is used to quantify the distribution difference between the corresponding sample cluster and the existing initial labeled set of samples in the feature space. The higher the index value, the lower the overlap between the sample cluster and the labeled sample, and the higher the exploration value for the uncovered working conditions. It can effectively solve the problem in the existing technology that the sampling is concentrated near the labeled area and ignores the edge working conditions.

[0047] It should also be noted that the specific calculation method of the cluster diversity index is as follows: For a single cluster, calculate the mean of the feature vectors of all samples in the cluster to obtain the cluster center vector; calculate the mean of the feature vectors of all labeled samples in the initial label set to obtain the baseline center vector of the labeled area; calculate the Euclidean distance between the cluster center vector and the baseline center vector of the labeled area, and use the normalized value of the Euclidean distance as the cluster diversity index of the sample cluster.

[0048] The sample clusters are sorted in descending order of their cluster diversity index, and a predetermined number of top-ranked clusters are selected as a pre-selected cluster set. In this embodiment, the cluster diversity index is calculated for each of the 15 sample clusters obtained from the first layer of clustering, the 15 clusters are sorted in descending order of this index, and, taking the subsequent annotation budget into account, the top 8 clusters are selected to form the pre-selected cluster set. The sample clusters in this set retain the core attribute of high uncertainty and differ strongly from the already labeled area, so they can effectively cover edge conditions and abnormal conditions that have not been captured by the existing labeled samples.
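The cluster diversity index and the selection of the pre-selected cluster set can be sketched as follows. The function names are hypothetical, the cluster labels are assumed to come from a prior k-means run, and min-max scaling is assumed for the "normalized value" of the Euclidean distance, since the text does not specify the normalization.

```python
import numpy as np

def cluster_diversity(X_cand, labels, X_labeled):
    """Normalized distance of each cluster center from the labeled-area center."""
    baseline = X_labeled.mean(axis=0)                    # baseline center vector
    cluster_ids = np.unique(labels)
    centers = np.stack([X_cand[labels == c].mean(axis=0) for c in cluster_ids])
    dist = np.linalg.norm(centers - baseline, axis=1)    # Euclidean distances
    span = dist.max() - dist.min()
    # Min-max normalization (assumed); all zeros if every distance is equal.
    index = (dist - dist.min()) / span if span > 0 else np.zeros_like(dist)
    return cluster_ids, index

def preselect_clusters(cluster_ids, index, n_keep=8):
    """Keep the n_keep clusters with the highest diversity index."""
    order = np.argsort(-index, kind="stable")
    return cluster_ids[order[:n_keep]]

# Toy data: labeled samples at the origin, three clusters at distances 1, 2, 3.
X_labeled = np.zeros((2, 2))
X_cand = np.array([[1, 0], [1, 0], [0, 2], [0, 2], [3, 0], [3, 0]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])
ids, index = cluster_diversity(X_cand, labels, X_labeled)
chosen = preselect_clusters(ids, index, n_keep=2)
```

The cluster farthest from the labeled area ranks first, which is the behavior the paragraph above relies on to reach edge working conditions.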

[0049] Furthermore, after completing the screening of the pre-selected cluster set for the first-level clustering, the second-level weighted clustering screening is performed. The core purpose is to further lock in the optimal sample within the pre-selected cluster set that has both high uncertainty and local representativeness, avoid the homogenization of samples within the same cluster, and further reduce the computational load of subsequent redundancy elimination steps.

[0050] It should also be noted that the second-level weighted clustering takes all samples in the pre-selected cluster set obtained from the first-level screening as the object, and uses the first uncertainty measure value of the sample calculated in step S2 as the cluster weight to perform weighted k-means clustering. The core objective function of weighted clustering is to minimize the sum of squared distances from the weighted samples to the corresponding cluster centers. By setting the weights, samples with higher uncertainty have higher priority in the clustering process, ensuring that the clustering results are tilted towards samples with high information content, thus solving the problem that conventional clustering only considers spatial distance and is prone to selecting samples with low information content.
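For illustration, the weighted objective described above is commonly written as follows. The symbols $w_i$ (first uncertainty measure of sample $x_i$), $S_k$ (the $k$-th sub-cluster), $c_k$ (its center) and $K$ (the number of sub-clusters) are notation assumed here and do not appear in the original text.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in S_k} w_i \,\lVert x_i - c_k \rVert^{2},
\qquad
c_k = \frac{\sum_{x_i \in S_k} w_i \, x_i}{\sum_{x_i \in S_k} w_i}
```

For fixed assignments, $J$ is minimized when each center is the weighted mean of its members, so samples with higher uncertainty pull the centers toward themselves.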

[0051] Furthermore, in this embodiment, for the 8 pre-selected clusters obtained from the first layer of screening, a second layer of weighted clustering is independently performed on the samples within each cluster. The number of clusters within a single cluster is set to 3. The first uncertainty metric of the samples is used as the weighting coefficient. After completing the weighted clustering division of each cluster, the sample with the highest first uncertainty metric value is selected from the sub-clusters obtained from each weighted clustering as the representative sample of that sub-cluster. In this embodiment, after the 8 pre-selected clusters undergo the second layer of weighted clustering, a total of 24 weighted cluster sub-clusters are obtained. Correspondingly, 24 groups of samples with high uncertainty and high representativeness are selected to jointly constitute the optimized candidate set.

[0052] It should be noted that the samples in this optimized candidate set not only ensure global representativeness of working conditions not covered by the feature space through the first layer of clustering, but also ensure the high information content of a single sample through the second layer of weighted clustering. This effectively avoids the "sampling congestion" problem of batch sampling in existing technologies, while keeping the sample size within a reasonable range. This significantly reduces the computational cost of subsequent redundancy elimination processing in this step, and is suitable for the timeliness requirements of industrial site modeling.
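The second-layer step for one pre-selected cluster can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, a deterministic farthest-point initialization is used for reproducibility (the text does not specify the initialization), and a tiny constant is added to the weights so that the weighted mean is always defined.

```python
import numpy as np

def weighted_kmeans(X, w, k, n_iter=50):
    """Weighted k-means: minimizes sum_i w_i * ||x_i - c_label(i)||^2."""
    w = np.asarray(w, dtype=float) + 1e-12       # keep weights strictly positive
    # Farthest-point initialization starting from the heaviest sample (assumed).
    centers = [X[int(np.argmax(w))]]
    while len(centers) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)               # assign to nearest center
        new_centers = centers.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():                       # weighted mean of the members
                new_centers[j] = np.average(X[mask], axis=0, weights=w[mask])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers

def pick_representatives(X, w, k=3):
    """From each sub-cluster, pick the sample with the highest uncertainty."""
    labels, _ = weighted_kmeans(X, w, k)
    reps = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            reps.append(int(members[np.argmax(w[members])]))
    return reps

# Toy cluster: three well-separated pairs of samples with placeholder weights.
X = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10.1, 0]], dtype=float)
w = np.array([0.1, 0.9, 0.8, 0.2, 0.3, 0.7])
reps = pick_representatives(X, w, k=3)
```

Applied to each of the 8 pre-selected clusters with `k=3`, a routine of this kind would return up to 24 representatives, matching the size of the optimized candidate set in this embodiment.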

[0053] Furthermore, based on the trained query committee, pseudo-labels are generated for all samples in the optimized candidate set. It should be noted that pseudo-labels are simulated estimates of the sample output variables that require neither manual offline analysis nor any annotation budget. Their core function is to provide a computational benchmark for subsequent simulated annotation operations and the identification of redundant relationships between samples.

[0054] It should also be noted that the specific method for generating pseudo-labels is as follows: each sample in the optimized candidate set is input into each random configuration network sub-model within the query committee, the predicted value of the output variable for that sample is obtained from each sub-model, and the arithmetic mean of all predicted values is calculated. This mean is then used as the pseudo-label of that sample. In this embodiment, pseudo-labels are generated sequentially for the 24 samples in the optimized candidate set, ensuring that each sample corresponds to a unique pseudo-label.

[0055] Furthermore, after the pseudo-labels are generated, redundant sample identification based on the expected variance change is performed. It should be noted that the expected variance change is used to quantify the reduction in prediction uncertainty of other candidate samples in the same batch after a sample is simulated and labeled. Further, the ratio of the expected variance change to the first uncertainty measure of the other candidate samples before simulated labeling is calculated to obtain the relative decrease rate, which is used to determine whether the uncertainty reduction of the other candidate samples reaches a preset significant decrease condition. If, after simulating labeling a sample, the relative decrease rate of another candidate sample reaches a preset decrease percentage threshold, it indicates that the information contained in the other candidate sample has been sufficiently covered by the previous sample, and further labeling of it is unlikely to bring additional information gain to the model.

[0056] Specifically, for any sample $x_i$ in the optimized candidate set, a pseudo-label $\hat{y}_i$ is first generated for the sample based on the query committee, and the pair $(x_i, \hat{y}_i)$ is then temporarily added to the current annotation set as a simulated annotation sample. After the query committee is reconstructed, the second uncertainty measure of every other sample $x_j$ ($j \neq i$) in the optimized candidate set is calculated. Let the first uncertainty measure of the other sample $x_j$ before simulated annotation be $U_1(x_j)$, and let its second uncertainty measure after sample $x_i$ has been simulated and annotated be $U_2(x_j \mid x_i)$. Then the expected variance change of sample $x_i$ with respect to sample $x_j$ is $\Delta U(x_j \mid x_i) = U_1(x_j) - U_2(x_j \mid x_i)$. Further, the relative decrease rate of the uncertainty of sample $x_j$ is calculated as $\rho(x_j \mid x_i) = \Delta U(x_j \mid x_i) / (U_1(x_j) + \varepsilon)$, where $\varepsilon$ is an extremely small positive number that prevents the denominator from being zero.

[0057] Furthermore, an uncertainty threshold and a decrease rate threshold are set, and redundant samples are removed based on a second uncertainty measure, the expected change in variance, and the relative decrease rate. The uncertainty threshold is dynamically determined based on the first uncertainty measure of the samples in the candidate labeled set in step S2; the decrease rate threshold is used to limit the minimum relative decrease in sample uncertainty when it reaches a significant level.

[0058] After sample $x_i$ and its pseudo-label are temporarily added to the current annotation set as a simulated annotation sample, if the second uncertainty measure of another sample $x_j$ in the optimized candidate set is lower than the adaptive redundancy decision threshold, and the relative decrease rate corresponding to sample $x_j$ is greater than or equal to the preset decrease rate threshold, then sample $x_j$ is considered to have strong information overlap with sample $x_i$, and sample $x_j$ is removed as a redundant sample. Through the above processing, duplicate samples that are spatially close and have similar model contributions in the optimized candidate set are eliminated in advance, and the remaining samples constitute the sample set after redundancy elimination.

[0059] In one specific implementation, the 24 groups of samples in the optimization candidate set are arranged in descending order according to the first uncertainty metric. Starting from the high uncertainty samples at the top of the ranking, simulated labeling, uncertainty updating, and redundancy determination operations are performed sequentially. After completing one round of redundancy removal, the sample with the highest first uncertainty metric is selected from the remaining samples to continue the above processing until there are no candidate samples that meet the redundancy determination criteria. In this embodiment, a total of 9 groups of candidate samples with high information overlap are removed through the above process, and the remaining 15 groups of candidate samples constitute the sample set after redundancy elimination.
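One possible reading of the sequential redundancy elimination described above is sketched below. Several points are assumptions: the function names are hypothetical, the pseudo-label generation and the query committee reconstruction are hidden behind a caller-supplied callback `second_uncertainty(i)` that returns the second uncertainty measure of every candidate after sample `i` has been simulated and labeled, and each surviving sample is used once as the simulated sample.

```python
import numpy as np

def eliminate_redundant(u1, second_uncertainty, u_thresh, r_thresh, eps=1e-12):
    """Greedy redundancy elimination over candidates ranked by first uncertainty."""
    remaining = list(np.argsort(-u1, kind="stable"))     # descending first uncertainty
    kept = []
    while remaining:
        i = remaining.pop(0)             # highest-uncertainty sample not yet processed
        kept.append(int(i))
        u2 = second_uncertainty(i)       # uncertainties after simulating the label of i
        survivors = []
        for j in remaining:
            delta = u1[j] - u2[j]                # expected variance change
            rate = delta / (u1[j] + eps)         # relative decrease rate
            redundant = (u2[j] < u_thresh) and (rate >= r_thresh)
            if not redundant:
                survivors.append(j)
        remaining = survivors
    return kept

# Toy data: labeling sample 0 explains sample 1, labeling sample 2 explains sample 3.
u1 = np.array([0.9, 0.8, 0.5, 0.4])
effects = {0: {1: 0.1}, 2: {3: 0.05}}    # placeholder stand-in for committee retraining

def toy_second_uncertainty(i):
    u2 = u1.copy()
    for j, value in effects.get(int(i), {}).items():
        u2[j] = value
    return u2

kept = eliminate_redundant(u1, toy_second_uncertainty, u_thresh=0.2, r_thresh=0.5)
```

In the toy case samples 1 and 3 are dropped and samples 0 and 2 are retained. The thresholds 0.2 and 0.5 are placeholders; the embodiment does not state numeric values for them.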

[0060] After obtaining the sample set after redundancy removal, it is used as input for step S4 to further filter the sample set to be labeled within the preset labeling budget. Thus, the expected variance change is used to identify the redundancy relationship between samples, and the first uncertainty measure is used to prioritize the samples after redundancy removal. The two play the roles of redundancy removal and priority retention of high-information samples, respectively, in the sample screening process.

[0061] S4. Within the preset annotation budget, the sample set after redundancy elimination is filtered to obtain the sample set to be annotated; It should be noted that after redundant samples are removed, the sample set to be labeled is selected within the preset labeling budget. The preset labeling budget is the maximum number of samples that can be labeled per round, pre-set based on the offline testing labor costs, reagent costs, and sample requirements for a single round of model optimization at the industrial site. Its core function is to maximize model performance while controlling labeling costs. In this embodiment, considering the wastewater treatment plant's offline testing capabilities and cost control requirements, the preset labeling budget for a single round of active learning iteration is 10 samples. The specific selection method is as follows: the 15 remaining valid candidate samples after removing redundancy are sorted again in descending order according to the prediction uncertainty metric, and the top 10 samples are selected to form the sample set to be labeled.

[0062] Furthermore, after step S4 is completed, the output is the sample set to be labeled. The samples in this sample set have three core advantages: high uncertainty, full-condition space representativeness, and low information redundancy. This ensures that the test labeling of each set of samples can bring maximum performance improvement to the model, effectively solving the industry pain points of wasteful labeling budget and redundant sample information in the existing technology. This sample set to be labeled will be directly used as the input of the subsequent step S5, which will be used to complete manual labeling and then merge into the initial labeling set to perform incremental modeling operation of the randomly configured network.

[0063] S5. After labeling the sample set to be labeled, merge it into the initial label set to obtain the updated training set. Based on the updated training set, use a random configuration network for incremental modeling to obtain the iterative model. It should be noted that step S5 takes as input the sample set to be labeled output from step S4. Its core purpose is to label the high-value samples in a standardized manner and add them to the label set, update the label set, and perform incremental modeling optimization based on the random configuration network. This addresses the core pain points of existing technologies, such as susceptibility to overfitting in small-sample scenarios, poor generalization ability, complex parameter tuning of deep networks, and the inability to adapt to the dynamic iterative update requirements of industrial scenarios. Ultimately, the random configuration network iterative model optimized in this round of active learning is obtained.

[0064] Furthermore, this step first performs manual annotation of the sample set to be labeled and updates the label set. It should be noted that the manual annotation uses the same testing and detection methods as the initial label set in step S1 to ensure the accuracy, consistency and comparability of the labeled data, and to avoid systematic errors introduced due to differences in annotation methods, which would affect the model training effect.

[0065] It should also be noted that in this embodiment, the 10 final sets of samples to be labeled generated in step S4 are sent to the laboratory to perform offline testing of the ammonia nitrogen concentration in the effluent using Nessler's reagent spectrophotometry, and the real output labeled value corresponding to each set of samples is obtained to form a complete labeled sample. All 10 sets of labeled samples are then incorporated into the initial labeled set constructed in step S1 to update the labeled set for training. The updated labeled set contains a total of 40 fully labeled samples, which significantly expands the range of working conditions covered compared to the initial labeled set, providing more comprehensive training data support for the incremental modeling of the subsequent randomly configured network.

[0066] Furthermore, after updating the annotation set, an incremental modeling operation based on a randomly configured network is performed. It should be noted that this invention uses a randomly configured network as the core modeling model. The core reason is that this model can adaptively configure hidden layer nodes through preset inequality constraints, eliminating the need for time-consuming backpropagation iterative parameter tuning. The output weights can be analytically solved using the least squares method, resulting in fast training speed and strong small-sample fitting ability. It perfectly adapts to the rapid modeling needs of active learning in multi-round iterations in industrial scenarios, solving the industry pain points of traditional deep learning models being prone to overfitting and having low training efficiency in small-sample scenarios.

[0067] It should also be noted that the incremental modeling process uses the same basic constraint parameters as the query committee sub-model in step S2 to ensure the consistency of the model's fitting ability and the comparability of the iteration process. The core constraint parameters set in this embodiment include: the maximum number of hidden layer nodes is 35, the model residual convergence accuracy is 1e-4, the random mapping constraint parameter r is 0.99, the output weight solution regularization coefficient λ is 1e-6, and the activation function is the same as the continuously differentiable sigmoid function in step S2 to ensure the model's nonlinear fitting ability and numerical stability.

[0068] Furthermore, the specific implementation process of incremental modeling of the randomized network starts from model initialization and residual calculation. It should be noted that in the model initialization stage, the 12-dimensional input features of the updated label set are used as the model input, and the normalized value of the effluent ammonia nitrogen concentration of the corresponding label is used as the model output target. The number of hidden layer nodes of the model is initialized to 0. At the same time, the initial residual vector of the model is calculated. The initial residual vector is the difference between the output target vector of the label set and the initial predicted value of the model. The initial predicted value of the model is the arithmetic mean of the output target of the label set, which is used as the optimization benchmark for the iterative configuration of hidden layer nodes.

[0069] It should also be noted that after initialization, adaptive configuration and selection of hidden layer nodes are performed. Specifically, in each iteration, 100 sets of input weights and bias parameters for candidate hidden layer nodes are randomly generated. For each set of candidate parameters, it is verified whether they satisfy the preset inequality constraints consistent with those of the sub-model in step S2. These inequality constraints ensure that newly added hidden layer nodes can make a positive contribution to the current residual of the model, avoiding the addition of invalid nodes. From the candidate parameters that satisfy the constraints, the set of parameters that contributes the most to the reduction of residuals is selected as the parameters for the newly added valid hidden layer nodes in this round, completing the configuration of a single node. The formulas for the inequality constraints are as follows:

[0070] The constraint, written here in standard random configuration network notation, is:

$$\xi_{L,q} = \frac{\langle e_{L-1,q},\, g_L \rangle^{2}}{g_L^{\top} g_L} - (1 - r - \mu_L)\,\lVert e_{L-1,q} \rVert^{2} \ge 0, \quad q = 1, \dots, m$$

where $\xi_{L,q}$ denotes the contribution value of the $L$-th hidden node to the $q$-th output component, used to select the optimal node parameters; $e_{L-1,q}$ denotes the current residual vector in the $q$-th output dimension before the network adds the $L$-th node; $g_L$ denotes the hidden layer output vector of the candidate node under the current input; $r$ denotes the constraint parameter for network construction, with a value range of (0,1), used to control the strength of random mapping; $\mu_L$ denotes a sequence of non-negative real numbers satisfying $\lim_{L \to \infty} \mu_L = 0$ and $\mu_L \le 1 - r$, used to dynamically adjust the looseness of the inequality constraint during the iteration process; and $q$ denotes the index of the output dimension, with $m$ the number of output dimensions.

[0071] Furthermore, after configuring each effective hidden layer node, based on the output matrices of all current hidden layer nodes, the output layer weights of the model are analytically solved using least squares with L2 regularization. The regularization coefficient is a preset 1e-6 to avoid overfitting under small sample training conditions; the formula is expressed as follows:

[0072] $$\beta = (H^{\top} H + \lambda I)^{-1} H^{\top} T$$

where $T$ is the output target vector of the training set, $H$ is the hidden layer output matrix, $\lambda$ is the regularization coefficient, $I$ is the identity matrix, and $\beta$ is the output weight matrix.

[0073] After solving for the output weights, the predicted residual vector of the model is updated. It is then determined whether the current residual has reached the preset convergence accuracy of 1e-4, or whether the number of hidden layer nodes has reached the upper limit of 35. If the termination condition is not met, the configuration of hidden layer nodes, weight solving, and residual update operations are repeated until the termination condition is met. In this embodiment, after iterative optimization, the final number of hidden layer nodes is 28, and the model residual is reduced to 9.62e-5, meeting the preset convergence accuracy requirement. Model construction can be completed without reaching the upper limit of the number of nodes, significantly improving modeling efficiency while ensuring the model's fitting accuracy.
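The incremental construction loop described in the preceding paragraphs can be sketched as follows, using the parameter values of this embodiment (at most 35 nodes, tolerance 1e-4, r = 0.99, regularization 1e-6, 100 candidates per iteration). The rest is assumed for illustration: the function names, the sampling range [-1, 1] for candidate weights and biases, the choice of mu_L = (1 - r)/(L + 1), the use of the root mean square of the residual as the convergence measure, and keeping the target mean as a fixed offset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def build_scn(X, T, max_nodes=35, tol=1e-4, r=0.99, lam=1e-6,
              n_candidates=100, scale=1.0, seed=0):
    """Incrementally add hidden nodes that satisfy the inequality constraint."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    t_mean = T.mean(axis=0)                  # initial prediction: mean of the targets
    E = T - t_mean                           # initial residual, shape (n, m)
    W, b = np.empty((d, 0)), np.empty(0)
    H, beta = np.empty((n, 0)), np.empty((0, T.shape[1]))
    while H.shape[1] < max_nodes and np.sqrt(np.mean(E ** 2)) > tol:
        L = H.shape[1] + 1
        mu = (1.0 - r) / (L + 1)             # assumed sequence, tends to 0
        Wc = rng.uniform(-scale, scale, size=(d, n_candidates))
        bc = rng.uniform(-scale, scale, size=n_candidates)
        G = sigmoid(X @ Wc + bc)             # candidate node outputs, (n, n_candidates)
        # Contribution value of every candidate for every output dimension.
        xi = (E.T @ G) ** 2 / (G ** 2).sum(axis=0) \
            - (1.0 - r - mu) * (E ** 2).sum(axis=0)[:, None]
        ok = (xi >= 0).all(axis=0)           # constraint must hold in every dimension
        if not ok.any():
            break                            # no valid candidate in this draw
        best = int(np.where(ok, xi.sum(axis=0), -np.inf).argmax())
        W = np.column_stack([W, Wc[:, best]])
        b = np.append(b, bc[best])
        H = np.column_stack([H, G[:, best]])
        # Ridge-regularized least squares for the output weights.
        A = H.T @ H + lam * np.eye(H.shape[1])
        beta = np.linalg.solve(A, H.T @ (T - t_mean))
        E = T - t_mean - H @ beta            # updated residual
    return {"W": W, "b": b, "beta": beta, "t_mean": t_mean}

def predict_scn(model, X):
    H = sigmoid(X @ model["W"] + model["b"])
    return model["t_mean"] + H @ model["beta"]

# Toy regression problem with placeholder data (not the wastewater data set).
rng = np.random.default_rng(1)
X = rng.random((40, 3))
T = np.sin(X @ np.array([1.0, -2.0, 0.5]))[:, None]
model = build_scn(X, T)
rmse_before = float(np.sqrt(np.mean((T - T.mean(axis=0)) ** 2)))
rmse_after = float(np.sqrt(np.mean((T - predict_scn(model, X)) ** 2)))
```

Because the output weights are obtained in closed form after every added node, no backpropagation is needed, which is the property the text relies on for fast retraining in each active learning round.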

[0074] It should also be noted that after the incremental construction of the randomly configured network is completed, the normalization parameters of the model are saved synchronously and the output is denormalized. Because the model training process is carried out on normalized data, the normalization extreme value parameters (minimum and maximum) of each input variable and output variable generated in step S1 need to be saved together with the model. When the model performs actual inference, the incoming online data are first subjected to min-max normalization consistent with the training process. After the model outputs its prediction, denormalization is performed using the corresponding extreme value parameters to obtain a predicted effluent ammonia nitrogen concentration with real physical meaning, ensuring that the model can be applied directly to online prediction at the industrial site.
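The normalization round trip described above can be illustrated with a minimal sketch. The function names and the toy concentration values are assumptions; only the min-max scheme itself comes from the text.

```python
import numpy as np

def fit_minmax(a):
    """Extreme value parameters saved alongside the model."""
    return a.min(axis=0), a.max(axis=0)

def normalize(a, lo, hi):
    span = np.where(hi > lo, hi - lo, 1.0)    # guard against constant variables
    return (a - lo) / span

def denormalize(a_norm, lo, hi):
    return a_norm * (hi - lo) + lo

# Toy output variable in physical units (placeholder values).
y = np.array([0.5, 2.5, 4.5])
lo, hi = fit_minmax(y)
y_norm = normalize(y, lo, hi)                 # values used during training
y_back = denormalize(y_norm, lo, hi)          # values reported at inference time
```

At inference time the same `lo` and `hi` obtained from the training data would be reused, rather than being recomputed from the online data.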

[0075] Furthermore, after step S5 is completed, the output is the random configuration network iterative model optimized in this round of iteration. Compared with the initial model, this model achieves a dual improvement in prediction accuracy and operating condition coverage, and can effectively adapt to the multi-operating condition fluctuation scenario of the sewage treatment process. This iterative model will be directly used as the iterative basis for the subsequent step S6, and will be used to perform multiple rounds of active learning loop optimization iterations until the model performance meets the preset application requirements of the industrial site.

[0076] S6. Repeat steps S2 to S5 until the preset iteration termination condition is met to obtain the final industrial data modeling model.

[0077] It should be noted that step S6 takes as input the random configuration network iterative model, the updated labeled set, and the remaining unlabeled sample pool output from step S5. Its core purpose is to perform closed-loop iteration of multiple rounds of active learning and model optimization. This addresses the core pain points in existing technologies, such as insufficient optimization from single-round sampling, the inability of the model to continuously adapt to the fluctuations of multiple working conditions in industrial processes, and the inability to optimally balance labeling budget and model performance. Through repeated iteration, high-value information in the unlabeled samples is continuously mined until the model performance meets the preset application requirements of the industrial site, ultimately resulting in an industrial data soft measurement model that can be applied directly.

[0078] Furthermore, the core execution logic of this step is to repeatedly execute the complete technical process from steps S2 to S5, using a single round of active learning iteration as the basic unit. Each iteration takes the updated labeled set, iterative model, and remaining unlabeled sample pool as input after the previous iteration, and outputs a newly optimized iterative model and a further expanded labeled set, forming a closed-loop active learning optimization chain. This ensures that each iteration supplements the model with new, non-redundant, high-value working condition information, continuously improving the model's prediction accuracy and generalization ability. It should be noted that after each iteration, the labeled samples are permanently removed from the unlabeled sample pool. The next iteration only performs uncertainty assessment and sampling operations on the remaining unlabeled samples to avoid duplicate sampling.

[0079] It should also be noted that, to avoid unlimited iteration and to balance the labeling costs and model performance requirements in industrial settings, this step pre-sets dual iteration termination conditions. Iteration stops when either condition is triggered, ensuring that model performance meets industrial application requirements while strictly controlling labeling costs to stay within budget. The first termination condition is model performance reaching the target, specifically when the model's prediction accuracy on the preset independent test set reaches the preset accuracy requirement. In this embodiment, the accuracy indicators are: root mean square error (RMSE) ≤ 0.32 for predicted effluent ammonia nitrogen concentration, and mean absolute percentage error (MAPE) ≤ 13%. These indicators meet the engineering application requirements for online control of wastewater treatment plant effluent quality. The second termination condition is budget exhaustion, specifically when the cumulative number of labeled samples reaches the preset total labeling budget limit. In this embodiment, considering the offline testing cost control requirements of wastewater treatment plants, the total labeling budget limit is set to 10% of the total number of samples in the sample pool to be labeled, i.e., the cumulative number of labeled samples does not exceed 212 sets, avoiding cost overruns caused by unlimited labeling.
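The dual termination check can be sketched as follows. The thresholds (RMSE ≤ 0.32, MAPE ≤ 13%, budget of 10% of the pool) are those of this embodiment. The function names are hypothetical, truncation toward zero is assumed for the budget, and the pool size of 2128 is an inference (2098 unlabeled samples plus an initial label set of 30) rather than a figure stated in this paragraph.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    # Percentage error; assumes no true value is zero.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def should_stop(y_true, y_pred, n_labeled, pool_size,
                rmse_max=0.32, mape_max=13.0, budget_ratio=0.10):
    """Stop when either the accuracy target or the labeling budget is reached."""
    budget = int(budget_ratio * pool_size)            # assumed truncation
    accuracy_met = (rmse(y_true, y_pred) <= rmse_max
                    and mape(y_true, y_pred) <= mape_max)
    budget_spent = n_labeled >= budget
    return accuracy_met or budget_spent

# Toy test-set values (placeholders, not measured concentrations).
y_true = np.array([1.0, 2.0, 4.0])
good = np.array([1.1, 1.9, 4.2])
bad = np.array([2.0, 3.0, 5.0])
stop_on_accuracy = should_stop(y_true, good, n_labeled=40, pool_size=2128)
keep_going = should_stop(y_true, bad, n_labeled=40, pool_size=2128)
stop_on_budget = should_stop(y_true, bad, n_labeled=212, pool_size=2128)
```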

[0080] Furthermore, in this embodiment, after 18 rounds of complete active learning iteration optimization, the cumulative number of labeled samples is 210, accounting for 9.87% of the total number of samples in the sample pool to be labeled, which does not exceed the preset total labeling budget limit; at the same time, the root mean square error (RMSE) of the optimized model in predicting effluent ammonia nitrogen concentration on the independent test set is reduced to 0.3038, and the mean absolute percentage error (MAPE) is reduced to 12.25%, which fully meets the preset industrial application accuracy requirements, triggering the first termination condition and stopping the iteration loop.
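The figures in this paragraph can be cross-checked with simple arithmetic. The check below assumes an initial label set of 30 samples (inferred from the 40 labeled samples reported after the first round of 10 new labels) and a total pool of 2098 + 30 = 2128 samples; both values are inferences and are not stated explicitly in this paragraph.

```latex
N_{\text{labeled}} = 30 + 18 \times 10 = 210,
\qquad
\frac{210}{2128} \approx 9.87\%,
\qquad
210 \le \lfloor 0.10 \times 2128 \rfloor = 212
```

Under these assumptions the reported 210 labeled samples and 9.87% share are mutually consistent and remain below the 212-sample budget, so the accuracy condition, not the budget condition, ends the loop.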

[0081] It should also be noted that after the iteration ends, the randomized network model completed in the last iteration is locked as the final industrial data model. The network structure parameters, hidden layer node configuration, output weight matrix, and data normalization extreme value parameters corresponding to the model are simultaneously sealed to form a complete deployable model file, ensuring that the model can be directly reproduced and applied in engineering.

[0082] It should also be noted that the final model obtained after this step has significant performance advantages over conventional modeling methods in the existing technology. Specifically, it can achieve better prediction accuracy than the benchmark model trained with 100% full labeled data using only 9.87% of the full sample labeled data, directly saving more than 90% of offline testing and labeling costs for industrial sites. At the same time, through multiple rounds of iterative two-layer clustering screening, the model fully covers all working conditions of the sewage treatment process, including low load, normal load, high load, and rainy season impact. Compared with traditional active learning methods, the model has significantly reduced prediction bias under marginal and abnormal working conditions, and has extremely strong anti-interference ability and generalization performance.

[0083] Furthermore, after step S6 is completed, the entire process of the industrial data modeling method based on active learning reinforced random configuration network disclosed in this embodiment is completed. The final industrial data model can be directly deployed in the industrial SCADA system of the sewage treatment plant to realize real-time online soft measurement prediction of effluent ammonia nitrogen concentration, providing reliable data support for the precise control and compliance management of sewage treatment process, fully realizing all the inventive objectives of this invention, and completely solving the core pain points of the prior art in the background art.

[0084] Example 2: As shown in Figure 2, this embodiment is an industrial process modeling system that integrates active learning and a random configuration network, including: a data acquisition and preprocessing module, used to collect input and output variable data of the industrial process and preprocess them to obtain a sample pool to be labeled, and to select samples of a preset size from the sample pool to be labeled to construct an initial label set; an uncertainty screening module, used to construct a query committee based on the initial label set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate label set from the unlabeled sample pool based on the first uncertainty measure; a clustering screening and redundancy elimination module, used to perform two-level clustering screening on the candidate label set to obtain an optimized candidate set, and to perform sample redundancy elimination processing on the optimized candidate set to obtain a sample set after redundancy elimination; a final sample selection module, used to filter the sample set after redundancy elimination within the preset annotation budget to obtain the sample set to be labeled; an incremental training and update module, used to label the sample set to be labeled and merge it into the initial label set to obtain the updated training set, and, based on the updated training set, to use a random configuration network for incremental modeling to obtain the iterative model; and an iteration and model output module, used to repeatedly execute steps S2 to S5 until the preset iteration termination condition is reached, thereby obtaining the final industrial data modeling model.

[0085] Example 3: As shown in Figure 3, to verify the effectiveness of the industrial process modeling method integrating active learning and stochastic configuration networks proposed in this invention, this embodiment compares it with random sampling (RS), k-means clustering sampling, query-by-committee (QBC) sampling, and expected model change maximization (EMCM) sampling in the task of predicting the ammonia nitrogen concentration in effluent.

[0086] Figure 3 compares the experimental performance of the proposed active learning regression method with that of the other methods on the prediction of effluent ammonia nitrogen concentration in a wastewater treatment plant (WWTP). Panel (a), on the left, is a line graph comparing each method's predicted values with the actual effluent ammonia nitrogen concentrations on the selected samples. The prediction curve of the proposed method (red solid line) agrees closely with the actual values (black solid line) in overall trend, following the dynamic changes in ammonia nitrogen concentration and achieving a relatively accurate fit. In contrast, the prediction curves of the other sampling methods (RS, k-means, QBC, EMCM, and QWE, the method before improvement) deviate significantly. Panel (b), on the right, plots the probability density function of the ammonia nitrogen prediction error for each method. The error density curve of the proposed method has the highest peak and a narrower spread than those of the other methods. This indicates that its prediction error is small on most samples and fluctuates little, demonstrating good stability and reliability.

[0087] All methods used the same data preprocessing and the same partitioning into training candidate pool, validation set, and test set, with the same stochastic configuration network (SCN) as the base model. Repeated experiments were conducted with the same annotation budget and model parameter settings. The root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) on the test set were used as evaluation metrics. The experimental results in Tables 1 and 2 are the mean ± standard deviation over the repeated experiments. Smaller RMSE, MAE, and MAPE values indicate lower model prediction error.
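
The three evaluation metrics named above can be computed as in the following minimal sketch. The function names `regression_metrics` and `mean_std` are hypothetical, and the use of the sample standard deviation (`ddof=1`) for the "mean ± standard deviation" summary is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, MAPE in percent) for one test-set evaluation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # MAPE is undefined when a true value is zero; this sketch assumes none are.
    mape = float(np.mean(np.abs(err / y_true)) * 100.0)
    return rmse, mae, mape

def mean_std(values):
    """Summarize one metric over repeated runs (sample std is an assumption)."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))
```

Because MAPE divides by the true value, it can fluctuate strongly across runs when some concentrations are close to zero, which may be one reason for wide MAPE spreads.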

[0088] Table 1. Comparison of prediction performance of different sampling methods in the task of predicting ammonia nitrogen concentration in effluent.

[0089] As shown in Table 1, in the task of predicting ammonia nitrogen concentration in effluent, the method of this invention achieved the lowest RMSE, MAE, and MAPE, indicating that it is superior to the comparative methods in prediction accuracy and error stability. Specifically, the test set RMSE of the method of this invention is 0.3840±0.0098, which is 13.6%, 4.2%, 3.1%, 3.0%, and 5.4% lower than that of the RS, k-means, QBC, EMCM, and QWE methods, respectively. This result shows that the sample selection mechanism of this invention, which chains uncertainty screening, two-layer clustering screening, and redundancy elimination based on the expected variance change, can select more informative, more representative, and less redundant training samples when labeled samples are limited, thereby improving the prediction accuracy of the stochastic configuration network model.

[0090] When the annotation ratio is set to 40%, 50%, and 60% of the training candidate samples, the method of the present invention maintains a low prediction error. Under partial annotation, its prediction performance approaches or even exceeds that of a stochastic configuration network (SCN) model trained on all labeled samples. This demonstrates that the method of the present invention can maintain the accuracy and stability of the prediction model for key variables in industrial processes while reducing the cost of manual testing or annotation.

[0091] Table 2. Comparison of prediction performance of various methods under different training sample annotation ratios.

[0092] As shown in Table 2, when the annotation ratio is 40%, the test set RMSE, MAE, and MAPE of the method of this invention are 0.3038±0.0034, 0.1856±0.0009, and 12.2536±6.9438, respectively, all lower than the 0.3168±0.0092, 0.1901±0.0014, and 13.3626±11.0034 corresponding to the full-sample SCN model. This indicates that the method of this invention can achieve or even exceed the modeling effect of the full sample model using only a portion of the labeled samples. Furthermore, when the annotation ratio is increased to 50% and 60%, the method of this invention outperforms the QBC, EMCM, and QWE methods in all three indicators of RMSE, MAE, and MAPE, indicating that the method of this invention can continuously improve model performance and maintain good error stability as the annotation budget increases.

[0093] Example 4: This embodiment provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps in the above-described method embodiments.

[0094] This computer device can be a server, and its internal structure can be as shown in Figure 4. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores server data. The network interface communicates with external terminals via a network connection. When the computer program is executed by the processor, it implements an industrial process modeling method that integrates active learning and stochastic configuration networks.

[0095] Those skilled in the art will understand that the structure shown in Figure 4 is merely a block diagram of the part of the structure related to the present application and does not limit the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or arrange the components differently.

[0096] Example 5: This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the above-described method embodiments.

[0097] If the functions implemented by the method are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0098] The logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.

[0099] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer diskettes (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.

[0100] The technical features of this invention not described can be implemented by or using existing technology, and will not be repeated here. Of course, the above description is not a limitation of this invention, and this invention is not limited to the examples above. Any changes, modifications, additions or substitutions made by those skilled in the art within the scope of this invention should also be within the protection scope of this invention.

Claims

1. An industrial process modeling method integrating active learning and stochastic configuration networks, characterized in that it includes the following steps:
S1. Collect and preprocess the input and output variable data of the industrial process to obtain the sample pool to be labeled; construct an initial labeled set by selecting samples of a preset size from the sample pool to be labeled;
S2. Construct a query committee based on the initial labeled set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate annotation set from the unlabeled sample pool based on the first uncertainty measure;
S3. Perform two-layer clustering screening on the candidate annotation set to obtain an optimized candidate set, and perform sample redundancy elimination processing on the optimized candidate set to obtain a sample set after redundancy elimination;
S4. Within the preset annotation budget, filter the sample set after redundancy elimination to obtain the sample set to be annotated;
S5. After labeling the sample set to be annotated, merge it into the initial labeled set to obtain the updated training set; based on the updated training set, use a stochastic configuration network for incremental modeling to obtain the iterative model;
S6. Repeat steps S2 to S5 until the preset iteration termination condition is met to obtain the final industrial data model.

2. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 1, characterized in that the preprocessing in step S1 includes missing value imputation, outlier correction, and min-max normalization; and the initial labeled set is constructed by selecting samples of a preset size from the sample pool to be labeled using random sampling without replacement.
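
The preprocessing and initial sampling of step S1 might look like the following minimal sketch. The claim names the three operations but not their concrete rules, so column-mean imputation and clipping at a multiple of the standard deviation are assumptions, as are the names `preprocess`, `initial_label_indices`, and `clip_sigma`.

```python
import numpy as np

def preprocess(X, clip_sigma=3.0):
    """Impute, correct outliers, and min-max normalize a (samples, features) array."""
    X = np.asarray(X, dtype=float).copy()
    # Missing-value imputation: replace NaN with the column mean (assumed rule).
    col_mean = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_mean[nan_cols]
    # Outlier correction: clip to mean +/- clip_sigma * std (assumed rule).
    mu, sd = X.mean(axis=0), X.std(axis=0)
    X = np.clip(X, mu - clip_sigma * sd, mu + clip_sigma * sd)
    # Min-max normalization to [0, 1]; constant columns are mapped to 0.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def initial_label_indices(n_pool, n_init, seed=0):
    """Random sampling without replacement of the initial labeled set."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=n_init, replace=False)
```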

3. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 2, characterized in that step S2 specifically includes: for the initial labeled set, generating several resampled subsets by bootstrap sampling, and training one stochastic configuration network sub-model on each resampled subset to construct a query committee composed of several stochastic configuration network sub-models; using the query committee to predict the samples in the unlabeled sample pool, calculating the variance of the prediction results of all stochastic configuration network sub-models in each output dimension, and taking the mean of these variances over the output dimensions as the first uncertainty measure; and sorting the samples in descending order of the first uncertainty measure and selecting a preset proportion of them to form the candidate annotation set.
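
The committee construction and variance-based uncertainty of step S2 can be sketched as follows. This is not the patent's implementation: the class `RandomFeatureRegressor` is a hypothetical stand-in for a stochastic configuration network sub-model (a fixed random hidden layer with a ridge-regression output layer), and the committee size, hidden width, and selection proportion are illustrative assumptions.

```python
import numpy as np

class RandomFeatureRegressor:
    """Stand-in for an SCN sub-model: fixed random hidden layer + ridge output."""
    def __init__(self, n_hidden=20, lam=1e-3, seed=0):
        self.n_hidden, self.lam = n_hidden, lam
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, Y):
        self.W = self.rng.uniform(-1, 1, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1, 1, size=self.n_hidden)
        H = self._hidden(X)
        A = H.T @ H + self.lam * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ Y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

def build_committee(X_lab, Y_lab, n_members=5, seed=0):
    """Train one sub-model per bootstrap resample of the labeled set."""
    rng = np.random.default_rng(seed)
    committee = []
    for k in range(n_members):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))  # sample with replacement
        model = RandomFeatureRegressor(seed=seed + k + 1)
        committee.append(model.fit(X_lab[idx], Y_lab[idx]))
    return committee

def first_uncertainty(committee, X_pool):
    """Variance across members per output dimension, averaged over dimensions."""
    preds = np.stack([m.predict(X_pool) for m in committee])  # (members, samples, outputs)
    return preds.var(axis=0).mean(axis=1)

def select_candidates(uncertainty, proportion=0.3):
    """Indices of the most uncertain samples, in descending order of uncertainty."""
    n_keep = max(1, int(round(proportion * len(uncertainty))))
    return np.argsort(-uncertainty)[:n_keep]
```

Targets are expected as a 2-D array of shape (samples, outputs), so a single output variable is passed as one column.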

4. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 3, characterized in that the two-layer clustering screening in step S3 includes a first-layer clustering and a second-layer weighted clustering; the first-layer clustering uses the k-means clustering algorithm to divide the candidate annotation set into several clusters, calculates the cluster diversity index of each cluster, sorts the clusters in descending order of their diversity indices, and selects the top-ranked clusters to form a pre-selected cluster set; the cluster diversity index is calculated as follows: for a single cluster, the mean of the feature vectors of all samples in that cluster is calculated to obtain the cluster center vector; the mean of the feature vectors of all labeled samples in the initial labeled set is calculated to obtain the baseline center vector of the labeled region; the Euclidean distance between the cluster center vector and the baseline center vector of the labeled region is calculated, and the normalized value of this Euclidean distance is used as the cluster diversity index of that cluster; the second-layer weighted clustering takes the samples in the pre-selected cluster set as its objects, uses the first uncertainty measure calculated in step S2 as the sample weight, and performs weighted k-means clustering with the objective of minimizing the weighted sum of squared distances from the samples to their corresponding cluster centers; and the samples with the highest uncertainty measure are selected from each weighted sub-cluster to form the optimized candidate set.
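
A compact sketch of the two-layer screening follows, written with NumPy only. Several details are assumptions rather than the claimed method: the diversity index is normalized by dividing by the largest distance, exactly one sample is kept per weighted sub-cluster, and the names `weighted_kmeans`, `two_layer_screening`, `k1`, `top_clusters`, and `k2` are hypothetical.

```python
import numpy as np

def weighted_kmeans(X, k, weights=None, n_iter=50, seed=0):
    """Lloyd iterations minimizing sum_i w_i * ||x_i - c_a(i)||^2."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any() and w[mask].sum() > 0:
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers

def two_layer_screening(X_cand, u_cand, X_lab, k1=4, top_clusters=2, k2=3, seed=0):
    # Layer 1: plain k-means, then rank clusters by distance to the labeled centroid.
    labels1, _ = weighted_kmeans(X_cand, k1, seed=seed)
    baseline = X_lab.mean(axis=0)
    ids = np.unique(labels1)
    dist = np.array([np.linalg.norm(X_cand[labels1 == c].mean(axis=0) - baseline)
                     for c in ids])
    diversity = dist / dist.max() if dist.max() > 0 else dist  # assumed normalization
    chosen = ids[np.argsort(-diversity)[:top_clusters]]
    pre_idx = np.where(np.isin(labels1, chosen))[0]
    # Layer 2: uncertainty-weighted k-means on the pre-selected clusters.
    k2 = min(k2, len(pre_idx))
    labels2, _ = weighted_kmeans(X_cand[pre_idx], k2, weights=u_cand[pre_idx], seed=seed)
    picked = []
    for c in np.unique(labels2):
        members = pre_idx[labels2 == c]
        picked.append(members[np.argmax(u_cand[members])])  # most uncertain member
    return np.array(sorted(picked))
```

The returned indices refer to rows of `X_cand`. Ranking clusters by their distance from the labeled centroid favors regions of the operating space that the current labeled set does not yet cover.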

5. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 4, characterized in that the sample redundancy elimination processing performed on the optimized candidate set in step S3 includes: generating, based on the query committee, a pseudo-label for each sample in the optimized candidate set, the pseudo-label being the arithmetic mean of the prediction results of all stochastic configuration network sub-models in the query committee for the same sample; taking a single sample from the optimized candidate set together with its pseudo-label as a simulated labeled sample, and merging the simulated labeled sample into the initial labeled set to obtain a temporary labeled set; reconstructing the query committee based on the temporary labeled set to obtain an updated query committee, and calculating, based on the updated query committee, the second uncertainty measure of the other candidate samples in the optimized candidate set, excluding the current simulated labeled sample; calculating the difference between the first uncertainty measure and the second uncertainty measure of each of the other candidate samples to obtain the expected variance change, and calculating the ratio of the expected variance change to the first uncertainty measure to obtain the relative decline rate; setting an uncertainty threshold for the second uncertainty measure and a decline rate threshold for the relative decline rate; and, when the second uncertainty measure of a candidate sample is lower than the uncertainty threshold and its relative decline rate is greater than or equal to the decline rate threshold, determining that candidate sample to be redundant and removing it.
The filtering of the sample set after redundancy elimination in step S4 includes: sorting the sample set after redundancy elimination in descending order of the first uncertainty measure calculated in step S2, and then filtering it according to the preset annotation budget to obtain the sample set to be annotated.
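
One possible reading of the redundancy elimination and the budget-limited selection is sketched below. The claim does not fix the order in which candidates are simulated or how already removed candidates are treated, so the sequential loop here is an assumption. The committee members are random-feature ridge models standing in for stochastic configuration network sub-models, and all function and parameter names are hypothetical.

```python
import numpy as np

def fit_committee(X, Y, n_members=5, n_hidden=15, lam=1e-2, seed=0):
    """Bootstrap committee of random-feature ridge models (stand-in for SCN sub-models)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))
        W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))
        b = rng.uniform(-1, 1, size=n_hidden)
        H = np.tanh(X[idx] @ W + b)
        beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ Y[idx])
        members.append((W, b, beta))
    return members

def committee_predict(members, X):
    return np.stack([np.tanh(X @ W + b) @ beta for W, b, beta in members])

def uncertainty(members, X):
    return committee_predict(members, X).var(axis=0).mean(axis=1)

def eliminate_redundant(X_lab, Y_lab, X_opt, u1, committee, u_thresh, rate_thresh, seed=0):
    """Return indices of X_opt that survive redundancy elimination."""
    keep = np.ones(len(X_opt), dtype=bool)
    pseudo = committee_predict(committee, X_opt).mean(axis=0)  # pseudo-labels
    for i in range(len(X_opt)):
        if not keep[i]:
            continue
        others = np.where(keep)[0]
        others = others[others != i]
        if len(others) == 0:
            break
        # Temporary labeled set: labeled data plus one simulated labeled sample.
        X_tmp = np.vstack([X_lab, X_opt[i:i + 1]])
        Y_tmp = np.vstack([Y_lab, pseudo[i:i + 1]])
        updated = fit_committee(X_tmp, Y_tmp, seed=seed)
        u2 = uncertainty(updated, X_opt[others])               # second uncertainty measure
        rate = (u1[others] - u2) / np.maximum(u1[others], 1e-12)  # relative decline rate
        redundant = (u2 < u_thresh) & (rate >= rate_thresh)
        keep[others[redundant]] = False
    return np.where(keep)[0]

def final_selection(kept_idx, u1, budget):
    """Step S4: sort survivors by first uncertainty (descending), keep `budget` of them."""
    order = kept_idx[np.argsort(-u1[kept_idx])]
    return order[:budget]
```

Using the pseudo-label instead of a real label lets the method estimate how much labeling one candidate would reduce the uncertainty of the others without spending any of the annotation budget.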

6. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 5, characterized in that the incremental modeling based on the updated training set using a stochastic configuration network in step S5 includes: constructing a stochastic configuration network with the input features of the updated training set as the model input and the corresponding output variables as the model output target; adaptively and iteratively configuring hidden layer nodes under a preset inequality constraint; and, after each hidden layer node is configured, analytically solving the output layer weights of the model by the least squares method with L2 regularization and updating the model prediction residuals.
The preset inequality constraint is expressed as follows:
$$\xi_{L,q} = \frac{\langle e_{L-1,q},\, h_L \rangle^{2}}{h_L^{\mathsf{T}} h_L} - (1 - r - \mu_L)\,\lVert e_{L-1,q} \rVert^{2} \ge 0$$
where $\xi_{L,q}$ denotes the contribution value of the $L$-th hidden node to the $q$-th output component and is used to select the optimal node parameters; $e_{L-1,q}$ denotes the current residual vector in the $q$-th output dimension before the network adds the $L$-th node; $h_L$ denotes the hidden layer output vector of the candidate node under the current input; $r$ denotes the constraint parameter for network construction, with a value range of (0,1), used to control the strength of the random mapping; $\{\mu_L\}$ denotes a sequence of non-negative real numbers satisfying $\lim_{L \to \infty} \mu_L = 0$ and $\mu_L \le 1 - r$, used to dynamically adjust the looseness of the inequality constraint during the iteration process; and $q$ denotes the index of the output dimension.
The output layer weights are calculated as follows:
$$\beta = \left(H^{\mathsf{T}} H + \lambda I\right)^{-1} H^{\mathsf{T}} Y$$
where $Y$ is the output matrix of the training set, $H$ is the hidden layer output matrix, $\lambda$ is the regularization coefficient, $I$ is the identity matrix, and $\beta$ is the output weight matrix.
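
The node-by-node construction described in this claim can be illustrated with the sketch below. It assumes the inequality takes the commonly published stochastic configuration network form, in which a candidate node with hidden output h is admissible when (e_q · h)^2 / (h · h) - (1 - r - mu_L) * ||e_q||^2 >= 0 for every output dimension q, with mu_L = (1 - r) / (L + 1). The sigmoid activation, the candidate count, the scale schedule, and the rule of stopping when no admissible node is found are all assumptions, and the names `train_scn` and `predict_scn` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_scn(X, Y, max_nodes=30, n_candidates=50, r=0.99, lam=1e-3,
              tol=1e-3, scales=(0.5, 1.0, 5.0, 10.0), seed=0):
    """Grow hidden nodes one at a time; re-solve ridge output weights after each."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, b = np.empty((d, 0)), np.empty(0)
    beta = np.zeros((0, Y.shape[1]))
    E = Y.copy()                              # residual before any node is added
    for L in range(1, max_nodes + 1):
        if np.sqrt(np.mean(E ** 2)) <= tol:
            break
        mu = (1.0 - r) / (L + 1)              # assumed sequence: mu_L -> 0, mu_L <= 1 - r
        best = None
        for s in scales:                      # widen the random range only if needed
            for _ in range(n_candidates):
                w, c = rng.uniform(-s, s, size=d), rng.uniform(-s, s)
                h = sigmoid(X @ w + c)
                # Per-output contribution xi_q of this candidate node.
                xi = (E.T @ h) ** 2 / (h @ h) - (1.0 - r - mu) * np.sum(E ** 2, axis=0)
                if xi.min() >= 0 and (best is None or xi.sum() > best[0]):
                    best = (xi.sum(), w, c)
            if best is not None:
                break
        if best is None:                      # no admissible node found: stop growing
            break
        W, b = np.column_stack([W, best[1]]), np.append(b, best[2])
        H = sigmoid(X @ W + b)
        # Analytic L2-regularized least-squares solution for the output weights.
        beta = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ Y)
        E = Y - H @ beta                      # update prediction residuals
    return W, b, beta

def predict_scn(model, X):
    W, b, beta = model
    return sigmoid(X @ W + b) @ beta
```

Because the output weights have a closed-form solution, adding a node costs one linear solve rather than a gradient-descent run, which is what makes repeated retraining inside the active learning loop affordable.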

7. The industrial process modeling method integrating active learning and stochastic configuration networks according to claim 6, characterized in that the preset iteration termination condition in step S6 is that the model's performance reaches a preset accuracy requirement, or that the cumulative number of labeled samples reaches a preset total annotation budget.
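
The two termination conditions can be wired into the outer loop roughly as follows. This is a sketch under assumptions: accuracy is measured here as a validation RMSE compared against a target, and `round_fn` is a hypothetical callback that performs one pass of steps S2 to S5 with a given per-round budget and returns that RMSE.

```python
def should_stop(val_rmse, target_rmse, n_labeled, total_budget):
    """True when either termination condition of step S6 holds."""
    return val_rmse <= target_rmse or n_labeled >= total_budget

def run_rounds(round_fn, n_initial, per_round, total_budget, target_rmse, max_rounds=100):
    """Repeat S2 to S5 until the accuracy target or the total budget is reached."""
    n_labeled, history = n_initial, []
    for _ in range(max_rounds):
        # Never request more labels than the remaining total budget allows.
        k = max(0, min(per_round, total_budget - n_labeled))
        val_rmse = round_fn(k)
        n_labeled += k
        history.append((n_labeled, val_rmse))
        if should_stop(val_rmse, target_rmse, n_labeled, total_budget):
            break
    return history
```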

8. An industrial process modeling system integrating active learning and stochastic configuration networks, characterized in that it is used to perform the steps of the industrial process modeling method integrating active learning and stochastic configuration networks as described in any one of claims 1 to 7, and includes: a data acquisition and preprocessing module, used to collect and preprocess the input and output variable data of the industrial process to obtain a sample pool to be labeled, and to select samples of a preset size from the sample pool to be labeled to construct an initial labeled set; an uncertainty screening module, used to construct a query committee based on the initial labeled set, calculate the first uncertainty measure of each unlabeled sample in the unlabeled sample pool through the query committee, and select a candidate annotation set from the unlabeled sample pool based on the first uncertainty measure; a clustering screening and redundancy elimination module, used to perform two-layer clustering screening on the candidate annotation set to obtain an optimized candidate set, and to perform sample redundancy elimination processing on the optimized candidate set to obtain a sample set after redundancy elimination; a final sample selection module, used to filter the sample set after redundancy elimination within the preset annotation budget to obtain the sample set to be annotated; an incremental training and update module, used to label the sample set to be annotated, merge it into the initial labeled set to obtain the updated training set, and perform incremental modeling with a stochastic configuration network based on the updated training set to obtain the iterative model; and an iteration and model output module, used to repeat steps S2 to S5 until the preset iteration termination condition is reached, yielding the final industrial data model.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the industrial process modeling method integrating active learning and stochastic configuration networks as described in any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the industrial process modeling method integrating active learning and stochastic configuration networks as described in any one of claims 1 to 7.