Data set cleaning method, model prediction method and system based on multi-model cooperation
By employing a multi-model collaborative dataset cleaning method and utilizing cross-validation and voting mechanisms to filter high-error samples, the problem of quality assessment and cleaning in high-throughput materials data processing is solved, improving the robustness and transferability of the model and making it suitable for materials design in high-performance computing and big data environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- EAST CHINA UNIV OF SCI & TECH
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies lack automatic and objective quality assessment and high-error sample cleaning methods in high-throughput materials data processing, leading to decreased model accuracy and interference with the judgment of physical laws. Furthermore, general-purpose automatic machine learning tools lack a systematic dataset screening mechanism, affecting model performance and robustness.
A dataset cleaning method based on multi-model collaboration is adopted. Through cross-validation and voting mechanisms of multiple machine learning models, combined with adaptive thresholds to filter high-error samples, and iterative cleaning and backfilling are carried out during training to form a closed-loop iterative process to ensure that the model performance reaches the preset indicators.
It achieves automated and robust data quality assessment and high-error sample cleaning, improves the generalization performance and robustness of the model, adapts to different data scales and target scales, and has good transferability and reproducibility.
Figure CN122196339A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a dataset cleaning method, model prediction method and system based on multi-model collaboration.
Background Technology
[0002] Against the backdrop of rapid development in high-performance computing and big data, new material research and development is shifting from the traditional "trial and error" approach to rational design driven by first-principles calculations such as density functional theory (DFT). However, generating DFT and other computational data consumes significant computing resources. Furthermore, these theoretical calculations inevitably contain anomalous data caused by factors such as non-convergence, pseudo-convergence, unreasonable structural design, or inconsistent parameter selection. If these anomalous data are used directly for machine learning modeling and analysis without effective screening, they can easily reduce model accuracy and interfere with the judgment of physical laws. Currently, dataset processing often relies on manual checks based on researchers' experience, or on simple rules such as fixed thresholds and quantiles applied to absolute or relative errors. Such approaches, for example the anomalous data identification method disclosed in CN114881290A, which couples mathematical statistical analysis with neural network prediction, are not only inefficient and highly subjective but also transfer poorly across different systems, different physical properties, and different data scales.
[0003] Meanwhile, although general-purpose automated machine learning tools can automatically search for models and hyperparameters, they typically focus only on optimizing the overall model score and lack a systematic dataset screening mechanism. Therefore, there is an urgent need in this field for a method that can automatically and objectively assess the quality of high-throughput computational material data and clean up high-error samples during model training. Such a method would improve the performance and robustness of machine learning models while minimizing computational resource consumption and ensuring a sufficient number of effective samples, providing reliable technical support for high-throughput machine learning-driven material screening and design.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a dataset cleaning method, model prediction method and system based on multi-model collaboration, which can automatically and objectively assess the quality of high-throughput computing data and clean high-error samples.
[0005] The objective of this invention can be achieved through the following technical solutions: A dataset cleaning method based on multi-model collaboration includes the following steps:
S1: Obtain the raw data to be processed, and parse the data according to the preset field names to form a feature matrix and a target vector, thereby building a sample set;
S2: Deploy the configuration information for dataset cleaning, including the number of machine learning models $n$, the number of cross-validation folds $K$, the voting threshold, the target machine learning performance threshold $r^2$, and the minimum training sample size $m$;
S3: Input the feature matrix and target vector of a training set that meets the minimum training sample size $m$ into each of the $n$ machine learning models for training, calculate the $K$-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold $r^2$. If yes, proceed to step S7; otherwise, proceed to step S4;
S4: Use each of the $n$ trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample;
S5: For each sample, count the number of the $n$ machine learning models that identify it as a high-error sample, and compare this count with the preset voting threshold. If the count is greater than the voting threshold, the sample is identified as a high-error sample and is removed from the sample set;
S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size $m$. If so, new data samples are added to the sample set according to the replenishment strategy, and the process then returns to step S3 for iterative training;
S7: Output the current sample set and the corresponding machine learning model.
[0006] Furthermore, the aforementioned $K$-fold cross-validation score is calculated as:
$$\mathrm{CV}_K(f) = \frac{1}{K} \sum_{k=1}^{K} R^2\left(f_{T \setminus T_k},\, T_k\right)$$
In the formula, $f$ is the regression function, $k$ is the current fold number, $T$ is the complete training set used in training the machine learning model, $T_k$ is the validation set of the $k$-th fold, $T \setminus T_k$ is the training set with the $k$-th validation set removed, $f_{T \setminus T_k}$ is $f$ trained on $T \setminus T_k$ and scored by its $R^2$ on $T_k$, and $\mathrm{CV}_K(f)$ is the $K$-fold cross-validation score of the regression function.
[0007] Furthermore, the relative error is calculated as:
$$\text{error} = \frac{|y - \hat{y}|}{\max(y) - \min(y)}$$
In the formula, $\text{error}$ is the relative error, $y$ is the true value of the corresponding target vector, $\hat{y}$ is the predicted value of the corresponding target vector, $\max(y)$ is the maximum of the true target values $y$, and $\min(y)$ is the minimum of the true target values $y$.
[0008] Furthermore, the configuration information in step S2 also includes an error threshold coefficient $E$. In step S4, the mean and standard deviation of the relative errors of the machine learning model's predictions are calculated and combined with the error threshold coefficient $E$ to determine an error threshold for the current model. This threshold is used to judge whether the relative error of each sample predicted by the machine learning model is greater than the error threshold; if it is, the corresponding sample is a high-error sample.
[0009] Furthermore, the error threshold is calculated as:
$$S = m + E s$$
In the formula, $S$ is the error threshold, and $m$ and $s$ are the mean and standard deviation of the relative errors $x_i$ ($i = 1, \dots, N$), where $x_i$ is the relative error of the $i$-th prediction of the current machine learning model and $N$ is the total number of predictions of the current machine learning model. Here $m$ denotes the mean relative error, not the minimum training sample size of step S2.
[0010] Furthermore, the method also includes setting the number of cleaning rounds in the configuration information in step S2, and determining in step S6 whether the current iteration round has reached the number of cleaning rounds. If it has, step S7 is executed; otherwise, the process returns to step S3.
[0011] Furthermore, the data parsing process in step S1 includes: extracting data from the original data according to preset field names, and performing numerical conversion and outlier processing to obtain a feature matrix and a target vector.
[0012] Furthermore, the configuration information in step S2 also includes the minimum replenishment quantity of the replenishment strategy.
[0013] This invention also provides a model prediction method for the maximum Gibbs free energy change of the electrocatalytic reduction of CO2 to CO using a graphene-based tetranitrogen-coordinated metal single-atom catalyst, comprising the following steps:
S1: Obtain the raw data to be processed. The process of obtaining the raw data includes: randomly generating a series of single-atom structures within a preset element range to form single-atom structures composed of different elements, and obtaining the maximum Gibbs free energy change in the reaction path; parsing the raw data according to preset field names to form a feature matrix and a target vector, thereby building a sample set, wherein the target vector is the maximum Gibbs free energy change;
S2: Deploy the configuration information for dataset cleaning, including the number of machine learning models $n$, the number of cross-validation folds $K$, the voting threshold, the target machine learning performance threshold $r^2$, and the minimum training sample size $m$;
S3: Input the feature matrix and target vector of a training set that meets the minimum training sample size $m$ into each of the $n$ machine learning models for training, calculate the $K$-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold $r^2$. If yes, proceed to step S7; otherwise, proceed to step S4;
S4: Use each of the $n$ trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample;
S5: For each sample, count the number of the $n$ machine learning models that identify it as a high-error sample, and compare this count with the preset voting threshold. If the count is greater than the voting threshold, the sample is identified as a high-error sample and is removed from the sample set;
S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size $m$. If so, new data samples are added to the sample set according to the replenishment strategy, and the process then returns to step S3 for iterative training;
S7: Output the current sample set and the corresponding machine learning model, and use the output machine learning model to predict the maximum Gibbs free energy change of the single-atom structure.
[0014] The present invention also provides a dataset cleaning system based on multi-model collaboration, including a memory and a processor. The memory stores a computer program, and the processor calls the computer program to execute the steps of the method described above.
[0015] Compared with the prior art, the present invention has the following advantages: (1) Compared with conventional machine learning, which only focuses on “model selection”, this invention can automatically obtain the optimal machine learning model that can be deployed without relying on a lot of manual labor, and explicitly handle abnormal samples in the data.
[0016] The method first selects random samples as the training set for multi-model search. While searching for the optimal model, it introduces a sample residual metric based on multi-model cross-validation and a data-driven adaptive threshold for voting and cleaning. When the training set is insufficient, sample data is added back, forming a closed-loop iteration of "training—evaluation—cleaning—addition—retraining" until the model performance metrics meet the predefined machine learning model requirements, producing a usable dataset and the optimal machine learning model. This invention has advantages such as automation, robustness, and transferability. The process is clear and simple to implement, and it can dynamically balance the quality and scale of training samples while ensuring rigorous evaluation.
[0017] (2) To address the common data problems in large-scale DFT theoretical calculations, such as non-convergence, pseudo-convergence, and structural issues, this method measures the prediction error of each calculation sample within a cross-validation framework and employs the adaptive statistical threshold $S = m + E s$. Static data filtering rules such as fixed quantiles are thereby replaced with data-driven thresholds that automatically identify and remove high-error samples and adapt to data of different target scales and distributions, which makes the method more stable.
[0018] (3) This invention identifies high-error samples not by screening with a single model but by voting among the $n$ machine learning models: the number of models that identify a sample as a high-error sample is compared with the voting threshold. This makes the cleaning of high-error samples more effective, because a sample is cleaned and discarded only when it deviates significantly from multiple models.
[0019] (4) This invention modularizes the "training-evaluation-cleaning-replenishment-retraining" data process: the composition of the model set, the weights of each model, and the voting thresholds are managed through independent configuration. Users can flexibly adjust the multi-model data screening strategy according to specific material properties. When switching from one data system to another, users only need to change the data adaptation method and a small amount of configuration to reuse the same automatic screening and modeling process, which reflects good portability and reproducibility.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of a dataset cleaning method based on multi-model collaboration provided in an embodiment of the present invention;
Figure 2 is a simplified flowchart of a dataset cleaning method based on multi-model collaboration provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of a series of graphene-based tetranitrogen-coordinated metal single-atom catalyst structures randomly generated within a certain elemental range, as provided in Embodiment 1 of the present invention;
Figure 4 shows the final $r^2$ value after fifty-two rounds of data iteration, as provided in Embodiment 1 of the present invention;
Figure 5 shows the final $r^2$ value after nine rounds of data filtering, as provided in Embodiment 2 of the present invention.
Detailed Implementation
[0021] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0022] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0023] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0024] This invention provides a dataset cleaning method based on multi-model collaboration. This method centers on a unified data adapter, automated search of multiple machine learning models, and sample residual statistics. It automatically constructs and selects the optimal model without requiring extensive manual trial and error or empirical threshold settings. Simultaneously, it identifies and votes to eliminate high-error samples across multiple models, steadily improving the model's generalization performance and robustness. Compared to traditional methods relying on manual cleaning and single-stage partitioning and evaluation, this invention employs a closed-loop process of "training-evaluation-cleaning-retraining" within a rigorous cross-validation framework, significantly reducing human subjectivity and overfitting risks. This method is suitable for high-throughput modeling and batch evaluation of data, providing transferable, reproducible, and traceable modeling support for automated prediction, quality screening, and rational design in scientific research and industrial scenarios.
[0025] As shown in Figure 1 and Figure 2, the specific steps of the solution are as follows:
S1: Obtain the raw data to be processed, and parse the data according to the preset field names to form a feature matrix and a target vector, thereby building a sample set;
S2: Deploy the configuration information for dataset cleaning, including the number of machine learning models $n$, the number of cross-validation folds $K$, the voting threshold, the target machine learning performance threshold $r^2$, and the minimum training sample size $m$;
S3: Input the feature matrix and target vector of a training set that meets the minimum training sample size $m$ into each of the $n$ machine learning models for training, calculate the $K$-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold $r^2$. If yes, proceed to step S7; otherwise, proceed to step S4;
S4: Use each of the $n$ trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample;
S5: For each sample, count the number of the $n$ machine learning models that identify it as a high-error sample, and compare this count with the preset voting threshold. If the count reaches the voting threshold, the sample is identified as a high-error sample and is removed from the sample set;
S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size $m$. If so, new data samples are added to the sample set according to the replenishment strategy, and the process then returns to step S3 for iterative training;
S7: Output the current sample set and the corresponding machine learning model.
[0026] Specifically, the data parsing process in step S1 includes: extracting data from the original data according to the preset field names, and performing numerical conversion and outlier processing to obtain the feature matrix and target vector.
[0027] The adaptation phase completes quantification and basic outlier handling, producing the standardized outputs $X$, $y$, and the feature names.
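The adaptation step described above can be sketched as follows. This is a minimal illustration only: the function name `adapt_records`, the example field names, and the choice to simply skip records with missing, non-numeric, or non-finite values are assumptions standing in for the "basic outlier handling", which the text does not specify in detail.

```python
import math

def adapt_records(records, feature_fields, target_field, id_field="id"):
    """Parse raw records into (ids, X, y, feature_names).

    Records with a missing, non-numeric, or non-finite value are skipped;
    this is an assumed, simple stand-in for basic outlier handling.
    """
    ids, X, y = [], [], []
    for rec in records:
        try:
            row = [float(rec[name]) for name in feature_fields]
            target = float(rec[target_field])
        except (KeyError, TypeError, ValueError):
            continue  # missing field or value that cannot be converted
        if not all(math.isfinite(v) for v in row + [target]):
            continue  # NaN or infinity
        ids.append(rec[id_field])
        X.append(row)
        y.append(target)
    return ids, X, y, list(feature_fields)

# Hypothetical raw records; the field names are illustrative only.
raw = [
    {"id": "s1", "R": "1.32", "EG": "1.90", "dG": "0.45"},
    {"id": "s2", "R": "1.26", "EG": "bad", "dG": "0.80"},   # non-numeric
    {"id": "s3", "R": "1.39", "EG": "1.65", "dG": "nan"},   # non-finite
]
ids, X, y, names = adapt_records(raw, ["R", "EG"], "dG")
```

Returning the sample ids alongside $X$ and $y$ keeps the matrix rows traceable to the original records when samples are later deleted.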
[0028] Optionally, the method also includes setting the number of cleaning rounds in the configuration information in step S2, and determining in step S6 whether the current iteration round has reached the number of cleaning rounds. If it has, step S7 is executed; otherwise, the process returns to step S3.
[0029] Optionally, the configuration information in step S2 also includes the minimum replenishment quantity of the replenishment strategy. This setting can be changed flexibly, so different deployment configurations can be used for different samples.
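The configuration items of step S2 can be gathered in one object, as in the following sketch. The class name `CleaningConfig`, the field names, the default values for the optional items, and the validation rules are all assumptions made for illustration; the example values are those given later for Embodiment 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CleaningConfig:
    n_models: int             # n: number of machine learning models
    k_folds: int              # K: cross-validation folds
    vote_threshold: int       # V_min: flags needed to delete a sample
    target_r2: float          # r^2: target cross-validation score
    min_train_size: int       # m: minimum training sample size
    error_coeff: float = 1.0  # E: error threshold coefficient (optional)
    max_rounds: int = 100     # number of cleaning rounds (assumed default)
    min_replenish: int = 1    # minimum replenishment quantity (assumed default)

    def __post_init__(self):
        # Assumed sanity checks; the source does not prescribe them.
        if not 1 <= self.vote_threshold <= self.n_models:
            raise ValueError("vote_threshold must lie in [1, n_models]")
        if self.k_folds < 2 or self.min_train_size < self.k_folds:
            raise ValueError("need k_folds >= 2 and min_train_size >= k_folds")

cfg = CleaningConfig(n_models=12, k_folds=5, vote_threshold=12,
                     target_r2=0.90, min_train_size=100)
```

Keeping the settings in an immutable object is one way to make each cleaning run reproducible from its configuration alone.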
[0030] In step S3, the identified feature information is passed to the $n$ user-defined machine learning models for training, and the best of the $n$ trained machine learning models is selected as the output.
[0031] Based on the machine learning model obtained from each round of training, the cross-validation score is calculated as follows:
$$\mathrm{CV}_K(f) = \frac{1}{K} \sum_{k=1}^{K} R^2\left(f_{T \setminus T_k},\, T_k\right)$$
In the formula, $f$ is the regression function, $k$ is the current fold number, $T$ is the complete training set used in training the machine learning model, $T_k$ is the validation set of the $k$-th fold, $T \setminus T_k$ is the training set with the $k$-th validation set removed, $f_{T \setminus T_k}$ is $f$ trained on $T \setminus T_k$ and scored by its $R^2$ on $T_k$, and $\mathrm{CV}_K(f)$ is the $K$-fold cross-validation score of the regression function.
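A K-fold cross-validation score of this kind can be computed as in the sketch below. Several things here are assumptions for illustration: the score of each fold is taken to be $R^2$, folds are formed by interleaving indices rather than by shuffling, and `fit_line` is a toy one-dimensional least-squares model standing in for a real machine learning model.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination R^2."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Toy regression function f: 1-D least-squares line; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def k_fold_score(xs, ys, fit, K):
    """Average R^2 over K folds; fold k holds out indices k, k+K, k+2K, ..."""
    scores = []
    for k in range(K):
        val = [i for i in range(len(xs)) if i % K == k]      # T_k
        train = [i for i in range(len(xs)) if i % K != k]    # T \ T_k
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in val]
        scores.append(r2_score([ys[i] for i in val], preds))
    return sum(scores) / K

xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]          # exactly linear toy data
score = k_fold_score(xs, ys, fit_line, K=5)
```

Because every sample is scored only by a model that never saw it during fitting, the same out-of-fold predictions can later be reused for the per-sample error statistics.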
[0032] In step S4, the relative error of each sample under each model is calculated as follows:
$$\text{error} = \frac{|y - \hat{y}|}{\max(y) - \min(y)}$$
In the formula, $\text{error}$ is the relative error, $y$ is the true value of the corresponding target vector, $\hat{y}$ is the predicted value of the corresponding target vector, $\max(y)$ is the maximum of the true target values $y$, and $\min(y)$ is the minimum of the true target values $y$.
[0033] Here $|y - \hat{y}|$ is the absolute error of each sample, which is then divided by the range $\max(y) - \min(y)$ to make the error relative.
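The relative error above amounts to a range-normalized absolute error, as the following sketch shows. The function name `relative_errors` and the guard against a zero target range are illustrative additions not taken from the source.

```python
def relative_errors(y_true, y_pred):
    """Range-normalized absolute error |y - y_hat| / (max(y) - min(y))."""
    y_range = max(y_true) - min(y_true)
    if y_range == 0:
        # Assumed guard: the formula is undefined for a constant target.
        raise ValueError("target range is zero; relative error is undefined")
    return [abs(t - p) / y_range for t, p in zip(y_true, y_pred)]

y_true = [0.2, 0.6, 1.0, 1.8]
y_pred = [0.3, 0.6, 0.6, 1.8]
errs = relative_errors(y_true, y_pred)   # target range is 1.8 - 0.2 = 1.6
```

In this toy case the absolute errors 0.1 and 0.4 become 0.0625 and 0.25 of the target range, so the values are dimensionless and comparable across targets of different scale.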
[0034] Optionally, the configuration information in step S2 may also include an error threshold coefficient $E$. In step S4, the mean and standard deviation of the relative errors of the machine learning model's predictions are calculated and combined with the error threshold coefficient $E$ to determine the error threshold of the current model. This threshold is used to judge whether the relative error of each sample predicted by the machine learning model exceeds the error threshold; if it does, the corresponding sample is considered a high-error sample.
[0035] The error threshold is calculated as:
$$S = m + E s$$
In the formula, $S$ is the error threshold, and $m$ and $s$ are the mean and standard deviation of the relative errors $x_i$ ($i = 1, \dots, N$), where $x_i$ is the relative error of the $i$-th prediction of the current machine learning model and $N$ is the total number of predictions of the current machine learning model.
[0036] With $S = m + E s$ as the threshold benchmark, this normalization strategy eliminates the incomparability caused by differences in the dimensions and scales of different task objectives, making the thresholds more transferable and stable.
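The adaptive threshold can be illustrated for a single model as follows. The function name `high_error_flags` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample form is intended.

```python
import statistics

def high_error_flags(errors, E=1.0):
    """Flag the errors that exceed S = m + E * s for one model."""
    m = statistics.fmean(errors)     # mean relative error
    s = statistics.pstdev(errors)    # population std; an assumption
    S = m + E * s
    return S, [e > S for e in errors]

errors = [0.02, 0.03, 0.02, 0.03, 0.40]
S, flags = high_error_flags(errors, E=1.0)
```

Here the mean is 0.10 and the standard deviation is about 0.15, so $S$ is about 0.25 and only the last sample is flagged. A larger $E$ raises the threshold and cleans less aggressively.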
[0037] In step S5, high-error samples are determined by a voting mechanism across multiple machine learning models: several machine learning models are selected to perform cross-validation prediction on the training data, and the relative error of each model on each training sample is calculated. For each model, the mean $m$ and standard deviation $s$ of its errors over all samples are used to construct the threshold $S = m + E s$, which serves as the high-error criterion for that model. $E$ is an adjustable hyperparameter used to balance cleaning intensity against sample retention. The number of times $V_i$ that each training sample is marked as high-error by the models is counted, and a minimum voting threshold $V_{min}$ is set. If $V_i \ge V_{min}$, the sample is finally determined to be a high-error sample; it is deleted from the original training data table according to its id, and a deletion list is created.
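The voting rule $V_i \ge V_{min}$ can be sketched as follows, assuming each model has already produced one boolean flag per sample. The function name `vote_high_error` and the data layout are illustrative assumptions.

```python
def vote_high_error(flags_per_model, sample_ids, v_min):
    """Return ids of samples flagged as high-error by at least v_min models.

    flags_per_model[j][i] is True when model j flags sample i.
    """
    votes = [sum(col) for col in zip(*flags_per_model)]   # V_i per sample
    return [sid for sid, v in zip(sample_ids, votes) if v >= v_min]

flags_per_model = [
    [False, True, False, True],    # model 1
    [False, True, False, False],   # model 2
    [False, True, True, False],    # model 3
]
deleted = vote_high_error(flags_per_model, ["a", "b", "c", "d"], v_min=2)
```

With three models and `v_min=2`, only sample "b" (three votes) is deleted, while "c" and "d" (one vote each) are kept. Setting `v_min` equal to the number of models requires unanimous agreement before a sample is removed.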
[0038] To prevent over-cleaning, the process is subject to two constraints: first, the training set must retain at least the minimum number of training samples $m$; second, if no high-error samples are detected in a given round, no cleaning is performed. After each round of cleaning, the system rebuilds the feature matrix and repeats cross-validation, ensuring that the matrix indices always remain aligned with the sample ids.
[0039] In step S6, if the number of training set samples after deletion is lower than the minimum training sample size, new data is added to the training set according to the replenishment strategy (for example, if the dataset consists of DFT theoretical calculation data, additional DFT calculations are required to supplement the data) until the lower limit on the training set size is met. If the available samples are ultimately insufficient, training ends.
[0040] Reasons for ending training include reaching the preset number of training rounds and the final cross-validation score reaching the performance threshold.
[0041] Specifically, steps S3 to S6 are re-executed on the cleaned dataset to form a closed-loop iteration of "training-evaluation-cleaning-retraining": (1) if the cross-validation $R^2$ of the current round reaches or exceeds the preset threshold $r^2$, the target is met, the data combination and optimal model of this round are immediately exported, and the process ends; (2) if the target is not met but there are no more samples to delete, or the maximum number of cleaning rounds is reached, the process stops and the current best result is retained. This closed-loop mechanism improves the robustness and generalization ability of the model round by round while preserving the sample size and statistical robustness.
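The stopping logic of this closed loop can be sketched as a small driver. This is an illustration under stated assumptions: `evaluate` is a hypothetical callback standing in for steps S3 to S5 (training, cross-validation, and voting), replenishment is omitted, and the sketch simply stops if a deletion would take the set below the minimum size.

```python
def run_cleaning_loop(sample_ids, evaluate, target_r2, max_rounds, min_size):
    """Iterate until the target score, an empty deletion list, or max_rounds.

    evaluate(ids) must return (cv_score, ids_to_delete).
    Returns (best_ids, best_score, stop_reason, rounds_run).
    """
    current = list(sample_ids)
    best_score, best_ids = float("-inf"), list(current)
    reason, rounds = "max_rounds", 0
    for rounds in range(1, max_rounds + 1):
        score, to_delete = evaluate(current)
        if score > best_score:                 # keep the current best result
            best_score, best_ids = score, list(current)
        if score >= target_r2:
            reason = "target_met"
            break
        drop = set(to_delete)
        if not drop:
            reason = "no_deletions"
            break
        if len(current) - len(drop) < min_size:
            reason = "below_min_size"          # replenishment would go here
            break
        current = [s for s in current if s not in drop]
    return best_ids, best_score, reason, rounds

# Toy evaluator: the score improves as the two planted outliers are removed.
OUTLIERS = {"x1", "x2"}

def toy_evaluate(ids):
    remaining = [s for s in ids if s in OUTLIERS]
    return 0.95 - 0.10 * len(remaining), remaining[:1]

ids = ["a", "b", "x1", "c", "x2", "d"]
best_ids, best_score, reason, rounds = run_cleaning_loop(
    ids, toy_evaluate, target_r2=0.90, max_rounds=10, min_size=3)
```

In this toy run one outlier is removed per round, and the loop stops in the third round once the score reaches the target.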
[0042] Once the process is complete, step S7 outputs the optimal model, the cleaned data table, the list of deleted high-error samples (including sample ids), the remaining sample set, and the training logs for each round (including statistics such as each round's $R^2$, $m$, $s$, threshold $S$, the number of deleted samples, and the number of remaining samples).
Example 1
[0043] In the field of computational materials science, this invention, as an adaptive machine-learning-based method for cleaning high-error samples from datasets, also performs well in filtering computational data based on first principles (such as density functional theory) and in machine learning modeling. Taking graphene-based tetranitrogen-coordinated metal single-atom catalysts for the electrocatalytic reduction of CO2 to CO as the object of study, iterative data filtering effectively improves machine learning model performance with only a small amount of density functional theory computation. The overall process is as follows: a series of single-atom structures is randomly generated within a certain elemental range (as shown in Figure 3), forming single-atom structures with different elemental compositions. DFT calculations are performed to obtain quantities such as the COOH and CO adsorption energies, from which the maximum Gibbs free energy change in the reaction pathway is obtained. The calculated maximum Gibbs free energy change is used as the target value $y$. An automatic machine learning model search is performed under the $k$-fold cross-validation framework. High-error samples are identified and cleaned by adaptive-threshold voting on the per-sample residuals of multiple models, and the process is iterated until the optimal regression model that meets the set indicators is obtained.
[0044] Specifically, this embodiment provides a model prediction method for the maximum Gibbs free energy change of the electrocatalytic reduction of CO2 to CO using a graphene-based tetranitrogen-coordinated metal single-atom catalyst, including the following steps:
S1: Obtain the raw data to be processed. The process of obtaining the raw data includes: randomly generating a series of single-atom structures within a preset element range to form single-atom structures composed of different elements, and obtaining the maximum Gibbs free energy change in the reaction path; parsing the raw data according to the preset field names to form a feature matrix and a target vector, thereby building a sample set, with the target vector being the maximum Gibbs free energy change. In this embodiment, the maximum Gibbs free energy changes of 180 random single-atom structures are pre-calculated, and the calculation results are stored in a dataset file along with the corresponding element codes. The data is parsed using preset field names extracted according to the element codes, and seven feature fields are uniformly quantified to form the feature matrix $X$: atomic radius (R), electronegativity (EG), d electron number (DE), cohesive energy (EC), electron affinity (EA), ionization energy (IE), and smooth overlap of atomic positions (SOAP). The maximum Gibbs free energy change is used as the target vector $y$. To facilitate tracking and backtracking, a sample ID is generated for each record.
[0045] S2: Deploy the configuration information for dataset cleaning, including the number of machine learning models $n$, the number of cross-validation folds $K$, the voting threshold, the target machine learning performance threshold $r^2$, and the minimum training sample size $m$. In this embodiment, because the dataset is small, the training sample size is set to 100, and the number of samples replenished at one time equals the number of samples deleted. $n = 12$ machine learning models are deployed, with the number of iterations set to 100 in the hyperparameters and the random seed fixed at 1 to ensure reproducible results. Model evaluation uses $k = 5$-fold cross-validation, and the multi-model search is performed with a performance threshold $r^2 = 0.90$ and a voting threshold $V_{min} = 12$. When the cross-validation $R^2$ of a candidate model reaches or exceeds the performance threshold, the search can be terminated early. During error cleaning, the number of samples retained in the training set must not be less than the minimum training sample size $m = 100$. $E = 1$ is used as the adaptive error threshold coefficient.
[0046] S3: Input the feature matrix and target vector of a training set that meets the minimum training sample size $m$ into each of the $n$ machine learning models for training, calculate the $K$-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold $r^2$. If yes, proceed to step S7; otherwise, proceed to step S4. In this embodiment, within the $k$-fold cross-validation framework ($k = 5$), the 100 pre-set training samples are fed into the 12 machine learning models, and the cross-validation score of each machine learning model is used as the optimization objective. Each of the given machine learning models is trained and its cross-validation score is computed as follows:
$$\mathrm{CV}_K(f) = \frac{1}{K} \sum_{k=1}^{K} R^2\left(f_{T \setminus T_k},\, T_k\right)$$
The candidate model with the highest score in this round is retained. Here $f$ is the regression function, $K$ is the number of cross-validation folds, $k$ is the fold index, $T$ is the complete training set, $T_k$ is the validation set of the $k$-th fold, and $T \setminus T_k$ is the training set with the $k$-th validation set removed. This approach avoids manual model-by-model trial and error, improving modeling efficiency, portability, and the consistency of experimental reproduction.
[0047] S4: Use each of the $n$ trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample. Reusing the same $k$-fold split, the 12 machine learning models produce the cross-validation prediction $\hat{y}$ for each sample, and the relative error is defined as:
$$\text{error} = \frac{|y - \hat{y}|}{\max(y) - \min(y)}$$
[0048] For each model, the mean $m$ and standard deviation $s$ of the errors are calculated, where $x_i$ is the error of the $i$-th sample. The threshold is $S = m + E s$, with $E = 1$. When a sample's error exceeds the threshold ($\text{error} > S$), the sample is marked as a high-error sample under that model.
[0049] S5: For each sample, count the number of the n machine learning models that judge it to be a high-error sample, and compare this count with the preset voting threshold. If the count reaches the voting threshold, the sample is identified as a high-error sample and removed from the sample set. In this embodiment, the number of times V_i that each sample is judged as high-error across the models is counted; with V_min = 12, a sample with V_i ≥ V_min is identified as a high-error sample and removed from the dataset. This strategy replaces static filtering rules such as fixed quantiles with data-driven thresholds, making it more adaptable to data of different target scales and distributions, and thus more stable.
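The voting step can be sketched as below, assuming each model contributes one boolean flag per sample. The function name `vote_high_error`, the three-model setup, and the flag values are hypothetical and chosen only to keep the example small.

```python
def vote_high_error(flags_by_model, v_min):
    """Return indices of samples flagged as high-error by at least v_min models."""
    n_samples = len(flags_by_model[0])
    votes = [
        sum(1 for model_flags in flags_by_model if model_flags[i])
        for i in range(n_samples)
    ]
    return [i for i, v in enumerate(votes) if v >= v_min]


# Three hypothetical models, five samples.
flags_by_model = [
    [False, True, False, False, True],
    [False, True, False, True, True],
    [False, False, False, False, True],
]
removed = vote_high_error(flags_by_model, v_min=2)
kept = [i for i in range(5) if i not in removed]
```

Setting v_min equal to the number of models, as in this embodiment, removes a sample only when every model agrees; a lower v_min cleans more aggressively.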
[0050] The cleaning intensity is adaptively determined by the error distribution on different datasets, which improves the stability and robustness of the modeling process.
[0051] S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size m. If so, new data samples are added to the sample set according to the replenishment strategy, and the process returns to step S3 for iterative training. Specifically, if the size of the training set after cleaning is less than the minimum training sample size, samples are randomly drawn from the remaining samples and added to the training set according to the strategy until the minimum training sample size is met; if the remaining data are insufficient, cleaning is stopped.
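A minimal sketch of the replenishment step follows. It assumes samples are tracked by ID and that exactly the shortfall is drawn from the unused pool; the application leaves the batch size to the configured strategy, so that choice, like the function name `replenish`, is an assumption.

```python
import random


def replenish(train_ids, pool_ids, min_size, rng):
    """Top up train_ids from pool_ids until min_size is reached or the pool runs out.

    Returns (new_train_ids, new_pool_ids, exhausted), where exhausted is True
    when the pool could not cover the shortfall.
    """
    train = list(train_ids)
    pool = list(pool_ids)
    shortfall = min_size - len(train)
    if shortfall <= 0:
        return train, pool, False
    take = min(shortfall, len(pool))
    drawn = rng.sample(pool, take)  # random draw without replacement
    drawn_set = set(drawn)
    train.extend(drawn)
    pool = [p for p in pool if p not in drawn_set]
    return train, pool, take < shortfall


rng = random.Random(1)  # fixed seed for reproducibility
train, pool, exhausted = replenish([1, 2, 3], [10, 11, 12, 13], min_size=5, rng=rng)
```

The `exhausted` flag gives the caller a way to stop cleaning when the remaining data are insufficient.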
[0052] Repeat steps S3 to S6 to form a closed loop of "training-evaluation-cleaning-replenishment-retraining". Finally, in the 52nd round of training, the cross-validation R² score is 0.92, which meets the performance threshold r² = 0.90, and training is terminated, as shown in Figure 4.
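The closed loop can be outlined as a small driver function. This is a sketch, not the implementation of the application: `evaluate` and `find_high_error` are hypothetical callables standing in for steps S3 and S4 to S5, and the toy data below simply treat a known set of IDs as noisy.

```python
import random


def cleaning_loop(train_ids, pool_ids, evaluate, find_high_error,
                  min_size, r2_target, max_rounds, seed=1):
    """Train/evaluate -> clean -> replenish -> retrain until the target is met.

    evaluate(ids) returns the best cross-validated R^2 for the sample set.
    find_high_error(ids) returns the ids voted as high-error.
    Both callables are placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    train, pool = list(train_ids), list(pool_ids)
    history = []
    for round_no in range(1, max_rounds + 1):
        score = evaluate(train)
        history.append((round_no, score, len(train)))
        if score >= r2_target:
            break  # target met: go to the output step
        bad = set(find_high_error(train))
        train = [i for i in train if i not in bad]  # remove voted samples
        shortfall = min_size - len(train)
        if shortfall > 0:
            if not pool:
                break  # nothing left to add: stop cleaning
            drawn = rng.sample(pool, min(shortfall, len(pool)))
            drawn_set = set(drawn)
            train += drawn
            pool = [p for p in pool if p not in drawn_set]
    return train, history


# Toy stand-ins: ids in `noisy` lower the score and are always voted out.
noisy = {3, 7, 12}
evaluate = lambda ids: 1.0 - sum(i in noisy for i in ids) / len(ids)
find_high_error = lambda ids: [i for i in ids if i in noisy]
final_ids, history = cleaning_loop(range(10), range(10, 20), evaluate,
                                   find_high_error, min_size=10,
                                   r2_target=0.95, max_rounds=50)
```

Each tuple in `history` records the round number, the score, and the sample count, which is enough to plot a convergence curve per round.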
[0053] S7: Output the current sample set and the corresponding machine learning model, and use the output machine learning model to predict the maximum Gibbs free energy change of the single-atom structure.
[0054] Example 2
This embodiment uses the public dataset "Boston Housing" as the target dataset. The core tasks include: constructing a normalized feature matrix and target vector from the public data source; performing an automated machine learning model search within a K-fold cross-validation framework; and using adaptive-threshold voting based on the residuals of multiple models to identify and remove high-error samples, iterating until an optimal regression model that meets the set criteria is obtained. This process provides a complete and reproducible modeling baseline and comparable evaluation results for subsequent quality score prediction, outlier identification, and interpretable analysis.
[0055] Step 1: Read the "Boston Housing Prices" data directly from the UCI public database. The adapter module parses the data according to preset field names and converts 11 feature fields into numerical values to form the feature matrix X: Residential Land Ratio (ZN), Non-Retail Land Ratio (INDUS), Proximity to the Charles River (CHAS), Nitric Oxide Concentration (NOX), Average Number of Rooms (RM), Older Home Ratio (AGE), Distance to Employment Center (DIS), Highway Accessibility (RAD), Property Tax Rate (TAX), Student-Teacher Ratio (PTRATIO), and Low-Income Population Ratio (LSTAT). The median house price (MEDV) is used as the target vector y. To facilitate tracking and backtracking, a sample ID is generated for each record.
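The parsing step can be sketched as follows, assuming the raw records arrive as dictionaries keyed by the field names listed above. The function name, the ID format `S0000`, and the rule of skipping unparsable rows are assumptions; the record values are placeholders, not real dataset rows.

```python
FEATURE_FIELDS = ["ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                  "DIS", "RAD", "TAX", "PTRATIO", "LSTAT"]
TARGET_FIELD = "MEDV"


def build_sample_set(records):
    """Convert dict records into (ids, X, y) using the preset field names."""
    ids, X, y = [], [], []
    for row_no, rec in enumerate(records):
        try:
            features = [float(rec[name]) for name in FEATURE_FIELDS]
            target = float(rec[TARGET_FIELD])
        except (KeyError, ValueError, TypeError):
            continue  # assumption: rows that cannot be parsed are dropped
        ids.append(f"S{row_no:04d}")  # hypothetical sample-ID format
        X.append(features)
        y.append(target)
    return ids, X, y


# Placeholder records: the second one has a non-numeric NOX field.
good = dict({name: "1.0" for name in FEATURE_FIELDS}, MEDV="24.0")
bad = dict(good, NOX="n/a")
ids, X, y = build_sample_set([good, bad])
```

Keeping the IDs alongside X and y lets later steps report exactly which records were deleted or replenished.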
[0056] Step 2: Deploy the configuration information: select a sample size of 200, a per-replenishment count of 100, and n = 5 machine learning models. Among the deployment hyperparameters, the number of iteration rounds is set to 50 and the random seed is fixed at 1 to ensure reproducible results. Model evaluation uses K = 10-fold cross-validation, with a performance threshold r² = 0.90 and a voting threshold V_min = 2. A multi-model search is performed, and the search can be terminated early once a candidate model's cross-validated R² reaches or exceeds the performance threshold. During error cleaning, the number of samples retained in the training set must not fall below the minimum training sample size m = 200, and E = 2 is used as the adaptive error threshold coefficient.
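The configuration of this step could be captured in a single immutable object. The sketch below uses the values stated above as defaults; the class name, the field names, and the sanity checks in `validate` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningConfig:
    initial_samples: int = 200   # sample size selected in Step 2
    replenish_count: int = 100   # samples added per replenishment
    n_models: int = 5            # n
    max_rounds: int = 50         # iteration rounds
    random_seed: int = 1         # fixed for reproducibility
    cv_folds: int = 10           # K
    r2_target: float = 0.90      # performance threshold r^2
    vote_threshold: int = 2      # V_min
    min_train_size: int = 200    # m
    error_coeff: float = 2.0     # E

    def validate(self):
        """Basic consistency checks (assumed rules, not taken from the source)."""
        if not 1 <= self.vote_threshold <= self.n_models:
            raise ValueError("vote_threshold must lie between 1 and n_models")
        if not 2 <= self.cv_folds <= self.min_train_size:
            raise ValueError("cv_folds must lie between 2 and min_train_size")
        if self.replenish_count < 1 or self.max_rounds < 1:
            raise ValueError("replenish_count and max_rounds must be positive")
        return self


cfg = CleaningConfig().validate()
```

Freezing the dataclass prevents the settings from changing between rounds, which supports the reproducibility goal stated in this step.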
[0057] Step 3: Within a K-fold cross-validation framework (K = 10), the 200 preset training samples are fed into 5 machine learning models, with the cross-validation score of each model serving as the evaluation basis and optimization objective. Each of the given machine learning models is trained and its cross-validation score is calculated, and the candidate model with the highest score in the current round is retained. In the score calculation, f is the regression function, K is the number of cross-validation folds, k is the index of the current fold, T is the complete training set, T_k is the validation set of the k-th fold, and T \ T_k is the training set with the k-th fold removed. This approach avoids manual, model-by-model trial and error, improving modeling efficiency, portability, and the consistency of experimental reproduction.
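The model search in this step can be illustrated with a self-contained sketch. It assumes the score is the mean per-fold R², uses a single scalar feature, and compares two toy regressors (a mean predictor and a least-squares line) in place of the 5 unspecified models; all names and the synthetic data are hypothetical.

```python
import random
from statistics import fmean


def k_fold_indices(n_samples, k, seed=1):
    """Shuffle indices once and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def r2_score(y_true, y_pred):
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(x, y):
    """Baseline model: always predict the training mean."""
    m = fmean(y)
    return lambda xs: [m for _ in xs]


def fit_line(x, y):
    """Ordinary least squares on one feature."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return lambda xs: [my + slope * (a - mx) for a in xs]


def cross_val_r2(fit, x, y, folds):
    """Mean R^2 over folds: fit on T minus T_k, score on T_k."""
    scores = []
    for val_idx in folds:
        val = set(val_idx)
        train_idx = [i for i in range(len(x)) if i not in val]
        predict = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        y_val = [y[i] for i in val_idx]
        scores.append(r2_score(y_val, predict([x[i] for i in val_idx])))
    return fmean(scores)


# Synthetic data: a line with small alternating noise.
x = [float(i) for i in range(40)]
y = [2.0 * a + 1.0 + (0.5 if i % 2 else -0.5) for i, a in enumerate(x)]
folds = k_fold_indices(len(x), k=10)  # the same folds are reused for every model
models = {"mean": fit_mean, "line": fit_line}
scores = {name: cross_val_r2(fit, x, y, folds) for name, fit in models.items()}
best = max(scores, key=scores.get)
```

Reusing one fold partition for every candidate keeps the scores comparable and lets the same out-of-fold predictions feed the later error analysis.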
[0058] Step 4: Based on the error calculation configured in Step 2, the same K = 10-fold partition is used across the 5 machine learning models to obtain the cross-validated prediction ŷ of each sample. The relative error is defined as errors = |y - ŷ| / (max(y) - min(y)). For each model, the mean μ and standard deviation σ of errors are calculated, where x_i is the error of the i-th sample. The threshold is S = μ + Eσ, with E = 2. When a sample satisfies errors > S, it is marked as a high-error sample under that model. The number of times V_i that each sample is judged as high-error across the models is then counted; with V_min = 2, a sample with V_i ≥ V_min is identified as a high-error sample and removed from the dataset. This strategy replaces static filtering rules such as fixed quantiles with data-driven thresholds, making it more adaptable to data of different target scales and distributions, and thus more stable.
[0059] Step 5: Based on the sample error distributions obtained from cross-validation under the 5 models, the threshold S = μ + Eσ determines whether a sample is a high-error sample for a given model, and voting with V_i ≥ V_min identifies the high-error samples. The cleaning intensity is thus determined adaptively by the error distribution of each dataset, improving the stability and robustness of the modeling process. If the number of samples in the training set after cleaning is lower than the minimum training sample size, 100 samples are randomly drawn from the remaining samples and added to the training set according to the strategy until the minimum training sample size is met; if the remaining data are insufficient, the cleaning process stops.
[0060] Step 6: Based on the list of samples to be deleted from Step 5, delete the corresponding samples from the original data table and reconstruct X and y. Steps 3 through 5 are then repeated to form a closed loop of "training-evaluation-cleaning-replenishment-retraining". Finally, in the ninth round of training, the cross-validation R² score is 0.90, which meets the performance threshold r² = 0.90, and training is terminated, as shown in Figure 5.
[0061] Step 7: Export the optimal regression model, the cleaned data table, the deletion list, and the complete log (recording, for each round, R², μ, σ, the threshold S, the number of deleted samples, and the number of remaining samples). The process then ends.
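A per-round log record of the kind listed above could be serialized as JSON. The sketch below is one possible layout, not the format used by the application: the key names and the function `make_round_entry` are hypothetical, and the numbers are placeholders rather than results.

```python
import json


def make_round_entry(round_no, best_r2, per_model_stats, n_deleted, n_remaining):
    """Build one log record.

    per_model_stats maps a model name to a (mu, sigma, threshold) tuple.
    """
    return {
        "round": round_no,
        "best_cv_r2": best_r2,
        "models": {
            name: {"mu": mu, "sigma": sigma, "threshold": thr}
            for name, (mu, sigma, thr) in per_model_stats.items()
        },
        "deleted": n_deleted,
        "remaining": n_remaining,
    }


# Placeholder numbers for illustration only; they are not results from the source.
entry = make_round_entry(1, 0.5, {"model_a": (0.1, 0.05, 0.2)}, 3, 197)
log_text = json.dumps([entry], indent=2)
```

Storing μ, σ, and the threshold per model makes it possible to audit afterwards why a given sample was voted out in a given round.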
[0062] Example 3
This embodiment provides a dataset cleaning system based on multi-model collaboration, including a memory and a processor. The memory stores a computer program, and the processor calls the computer program to execute the steps of the dataset cleaning method based on multi-model collaboration as described in Example 1.
[0063] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A dataset cleaning method based on multi-model collaboration, characterized in that it includes the following steps:
S1: Obtain the raw data to be processed, and parse the data according to preset field names to form a feature matrix and a target vector, building a sample set;
S2: Deploy configuration information for dataset cleaning, including the number of machine learning models n, the number of cross-validation folds K, a voting threshold, a target machine learning performance threshold r², and a minimum training sample size m;
S3: Input the feature matrix and target vector that meet the minimum training sample size m into the n machine learning models for training, calculate the K-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold r²; if so, proceed to step S7; otherwise, proceed to step S4;
S4: Use the n trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample;
S5: For each sample, count the number of the n machine learning models that judge it to be a high-error sample, and compare this count with the preset voting threshold; if the count is greater than the voting threshold, the sample is identified as a high-error sample and removed from the sample set;
S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size m; if so, add new data samples to the sample set according to the replenishment strategy, and return to step S3 for iterative training;
S7: Output the current sample set and the corresponding machine learning model.
2. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the K-fold cross-validation score is calculated by an expression in which f is the regression function, k is the index of the current fold, T is the complete training set used in training the machine learning model, T_k is the validation set of the k-th fold, T \ T_k is the training set with the validation set of the k-th fold removed, and the result is the K-fold cross-validation score of the regression function f.
3. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the expression for calculating the relative error is: errors = |y - ŷ| / (max(y) - min(y)), where errors is the relative error, y is the true value of the corresponding target vector, ŷ is the predicted value of the corresponding target vector, max(y) is the maximum of the true values of the target vector y, and min(y) is the minimum of the true values of the target vector y.
4. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the configuration information in step S2 also includes an error threshold coefficient E; in step S4, the mean and standard deviation are calculated from the relative errors of the prediction results of the machine learning model and combined with the error threshold coefficient E to determine an error threshold for the current model; it is then determined whether the relative error of each sample predicted by the machine learning model is greater than the error threshold, and if so, the corresponding sample is a high-error sample.
5. The dataset cleaning method based on multi-model collaboration according to claim 4, characterized in that the expression for calculating the error threshold is: S = μ + Eσ, where S is the error threshold, μ is the mean, σ is the standard deviation, and x_i is the relative error of the i-th prediction result of the current machine learning model, with the mean and standard deviation taken over the total number of prediction results of the current machine learning model.
6. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the method further includes setting the number of cleaning rounds in the configuration information in step S2, and determining in step S6 whether the current iteration round has reached the number of cleaning rounds; if it has, step S7 is executed; otherwise, the process returns to step S3.
7. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the data parsing process in step S1 includes: extracting data from the raw data according to the preset field names, and performing numerical conversion and outlier processing to obtain a feature matrix and a target vector.
8. The dataset cleaning method based on multi-model collaboration according to claim 1, characterized in that the configuration information in step S2 also includes setting the minimum replenishment count in the replenishment strategy.
9. A model prediction method for the maximum Gibbs free energy change of CO production in electrocatalytic CO2 reduction catalyzed by a graphene-based tetranitrogen-coordinated metal single-atom catalyst, characterized in that it includes the following steps:
S1: Obtain the raw data to be processed, wherein obtaining the raw data includes: randomly generating a series of single-atom structures within a preset element range to form single-atom structures composed of different elements, and obtaining the maximum Gibbs free energy change along the reaction path; parse the raw data according to preset field names to form a feature matrix and a target vector, building a sample set, wherein the target vector is the maximum Gibbs free energy change;
S2: Deploy configuration information for dataset cleaning, including the number of machine learning models n, the number of cross-validation folds K, a voting threshold, a target machine learning performance threshold r², and a minimum training sample size m;
S3: Input the feature matrix and target vector that meet the minimum training sample size m into the n machine learning models for training, calculate the K-fold cross-validation score of each machine learning model, and determine whether the score is not less than the target machine learning performance threshold r²; if so, proceed to step S7; otherwise, proceed to step S4;
S4: Use the n trained machine learning models to predict the target vector from the feature matrix of each sample, and obtain the corresponding relative error to determine whether the sample is a high-error sample;
S5: For each sample, count the number of the n machine learning models that judge it to be a high-error sample, and compare this count with the preset voting threshold; if the count is greater than the voting threshold, the sample is identified as a high-error sample and removed from the sample set;
S6: Determine whether the number of samples in the sample set after deletion is lower than the minimum training sample size m; if so, add new data samples to the sample set according to the replenishment strategy, and return to step S3 for iterative training;
S7: Output the current sample set and the corresponding machine learning model, and use the output machine learning model to predict the maximum Gibbs free energy change of the single-atom structure.
10. A dataset cleaning system based on multi-model collaboration, characterized in that it includes a memory and a processor, the memory storing a computer program, and the processor invoking the computer program to perform the steps of the method as described in any one of claims 1 to 8.