Tunneling parameter optimization prediction method and system based on semi-supervised learning for shield tunneling machine

By employing semi-supervised learning methods, combined with multi-stage missing value completion, outlier detection, multi-model integration, and GPU acceleration, an integrated system architecture was constructed. This solved the data quality and computational efficiency problems in the prediction of tunnel boring machine excavation parameters, enabling efficient, robust real-time prediction and rapid decision-making.

CN122019985B · Active · Publication Date: 2026-06-23 · SOUTHWEST JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTHWEST JIAOTONG UNIV
Filing Date
2026-04-09
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing tunnel boring machine (TBM) parameter prediction technologies suffer from problems such as insufficient data quality processing, poor model generalization ability, scarce labels leading to decreased accuracy, low computational efficiency, and lack of an integrated system architecture, making it difficult to meet the real-time requirements of low-latency prediction and rapid decision-making at construction sites.

Method used

By employing a semi-supervised learning approach, through multi-stage missing value completion, dual-index outlier detection, fusion of multiple clustering algorithms, multi-model integration, and GPU-accelerated training, an end-to-end integrated system architecture is constructed to achieve efficient processing and real-time prediction of multi-source heterogeneous construction data.

Benefits of technology

It significantly improves data quality, enhances the robustness and effectiveness of the model, is highly adaptable and computationally efficient, and can meet the needs of rapid deployment and management in engineering sites, thus addressing the shortcomings of existing technologies.



Abstract

This application discloses a shield machine tunneling parameter optimization and prediction method and system based on semi-supervised learning, and relates to the technical field of engineering construction control. The method includes: data preprocessing and feature generation; geological clustering through multi-algorithm fusion; feature optimization and multi-model construction; semi-supervised enhancement through collaborative training within geological clusters, noise augmentation, and iterative learning; selection of the optimal model based on a comprehensive score; and routing of predictions by cluster-center distance, with an uncertainty evaluation output. The application addresses poor data quality, label scarcity, geological distribution shift, and low computational efficiency in shield construction, and improves the accuracy, generalization ability, and engineering applicability of tunneling parameter prediction.

Description

Technical Field

[0001] This invention relates to the field of engineering construction control technology, and in particular to a method and system for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning.

Background Technology

[0002] Current methods for predicting tunnel boring machine (TBM) parameters can be mainly divided into two categories: mechanistic model-based methods and data-driven methods. Mechanism-based methods rely on theories such as rock and soil mechanics and TBM dynamics to establish mathematical models. Although they are interpretable in terms of mechanism, their modeling process requires a large number of geological and equipment internal parameters that are difficult to obtain accurately. Furthermore, they are poorly adaptable to complex working conditions such as heterogeneous strata and groundwater variations. The model assumptions are often difficult to meet in actual construction, resulting in limited prediction accuracy.

[0003] Data-driven approaches, especially those utilizing machine learning techniques to learn mapping relationships from historical data, are gradually becoming the mainstream research method. These methods include traditional statistical analysis and modern machine learning models. While data-driven approaches reduce reliance on precise mechanistic models, their performance is severely limited by data quality and completeness. In practical engineering, shield tunneling construction data commonly contains a large number of missing values, anomalous noise, and multi-source heterogeneous characteristics, directly affecting the input quality and stability of the model. Furthermore, traditional supervised learning models typically assume that training and prediction data follow the same distribution; however, in long-distance tunnel excavation, geological conditions often exhibit drastic and non-stationary changes, leading to severe domain shift problems and significantly reducing the model's generalization ability when applied across geological formations or engineering projects.

[0004] While some studies have introduced transfer learning or domain adaptation techniques to address the distribution offset problem, these approaches are still in the early stages of exploration in the field of tunnel boring machine (TBM) construction. They generally suffer from limitations such as computational complexity, high dependence on labeled data, and a lack of deep integration with engineering geological features. Furthermore, many key parameters during construction (such as soil properties and tool wear) rely on manual or laboratory testing, making it difficult to obtain complete data in real time. This results in a severe shortage of labeled data suitable for model training, hindering further improvements in the performance of supervised learning models.

[0005] From an engineering implementation perspective, existing technical solutions still have the following systemic shortcomings: First, data processing flows are usually isolated and fragmented, with preprocessing, feature engineering, model training, and deployment stages separated, lacking an end-to-end integrated architecture, which increases the complexity of system maintenance and result traceability; Second, when faced with large-scale, high-dimensional real-time construction data streams, traditional CPU-based computing models are inefficient in feature construction, cluster analysis, and multi-model training, making it difficult to meet the real-time requirements of low-latency prediction and rapid decision-making at construction sites; Third, few methods can organically integrate geological condition identification, feature adaptive optimization, semi-supervised learning, and high-performance computing to form a comprehensive solution that can simultaneously address poor data quality, few labels, rapid distribution changes, and high computational requirements.

Summary of the Invention

[0006] To address the problems in existing shield tunneling parameter prediction technologies, such as insufficient data quality processing, poor model generalization ability, decreased accuracy due to scarce labels, low computational efficiency, and lack of integrated system architecture, this invention proposes a shield tunneling parameter optimization prediction method and system based on semi-supervised learning. This method enables efficient processing of multi-source heterogeneous construction data, adaptive geological modeling, and real-time prediction, thereby solving the aforementioned problems.

[0007] This application discloses a method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning, including:

[0008] S1. Extract active control parameters, passive response parameters and geological-related parameters from historical construction data, perform multi-stage completion of missing values, and perform dual-index detection and cleaning of outliers.

[0009] S2. Generate extended features based on the preprocessed passive response parameters and geologically related parameter data, including interactive features, statistical features and combined features;

[0010] S3. Multiple normalization methods are used to construct multi-scale views for geologically related parameters, and multiple clustering algorithms are used to cluster them under each view. The final geological cluster labels are generated through a two-level fusion strategy.

[0011] S4. Construct a random forest regressor under different normalization methods to evaluate the feature scaling effect, select the optimal scaling method, and perform multi-stage feature selection.

[0012] S5. For each geological cluster, extract the corresponding sample subset, and train multiple prediction models for each active parameter within the subset.

[0013] S6. Perform collaborative training regression, Gaussian noise data augmentation, and iterative incremental training within geological clusters to improve the model's generalization ability;

[0014] S7. Under each geological cluster and active parameter combination, select the optimal sub-model based on the comprehensive evaluation index on the validation set.

[0015] S8. Calculate the distance between the new input data and the centers of various geological clusters, route it to the corresponding sub-model for prediction, and calculate the uncertainty score of the prediction results;

[0016] S9. Save the trained model and related components as a structured file, supporting local and remote deployment.

[0017] Preferably, the multi-stage completion of missing values in S1 includes: smoothing the feature columns with continuous missing values using cubic spline interpolation, and using the median to complete isolated missing values that still exist after interpolation.

[0018] The dual-index detection and cleaning of outliers includes: first, calculating the modified Z-score for each feature, screening out samples that exceed a preset threshold as suspected outliers, then calculating the Mahalanobis distance of each sample in the multidimensional feature space to measure its deviation from the overall distribution, and removing the top 5% of samples with the highest degree of anomaly.

[0019] Preferably, the interaction features in S2 are second-order interaction polynomial features, retaining only the interaction terms between different original features;

[0020] The statistical characteristics include mean, standard deviation, skewness, and kurtosis;

[0021] The combined features include the sum, product, and ratio features of geological parameters, wherein a smoothing factor is introduced into the denominator of the ratio features.

[0022] Preferably, the normalization methods in S3 include standardization, robust scaling, and quantile scaling;

[0023] The various clustering algorithms include K-means, DBSCAN, and Gaussian mixture model;

[0024] The two-level fusion strategy includes: performing consensus voting on the clustering results of views at different scales within the same algorithm to obtain stable labels for that algorithm; and then performing majority voting on the stable labels of different algorithms to generate the final geological cluster labels.

[0025] Preferably, in step S4, selecting the optimal scaling method includes: constructing random forest regressors under different normalization methods, calculating the coefficient of determination R² using 3-fold cross-validation, and selecting the scaling method with the highest R².

[0026] The multi-stage feature selection includes variance filtering, mutual information selection, and recursive feature elimination, which are performed sequentially.

[0027] Preferably, the multiple prediction models in S5 include random forest, extreme random tree, gradient boosting tree, XGBoost, LightGBM, CatBoost, and deep neural network regressors;

[0028] When a GPU is detected as available, the GPU-accelerated model training process is automatically enabled.

[0029] Preferably, the collaborative training regression in S6 includes: using two types of base models with complementary error characteristics to predict unlabeled data within the same geological cluster, screening samples whose prediction difference is lower than the consistency threshold and whose distance from the cluster center is within a typical range, generating pseudo-labels by weighted averaging and injecting them into the training set;

[0030] In the Gaussian noise data augmentation, the noise standard deviation is proportional to the standard deviation of the corresponding feature;

[0031] The iterative incremental training divides the training data into multiple incremental packages and updates the model parameters in batches.
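The screening rule of [0029] and the noise scaling of [0030] can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the tolerance `agree_tol`, the distance cutoff `max_dist`, the weight `w_a`, and the noise ratio `ratio` are hypothetical and do not appear in the original text.

```python
import numpy as np

def select_pseudo_labels(pred_a, pred_b, dist_to_center,
                         agree_tol, max_dist, w_a=0.5):
    """Keep unlabeled samples on which two base models agree and which lie
    close to their cluster center; the pseudo-label is a weighted average."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    dist = np.asarray(dist_to_center, dtype=float)
    agree = np.abs(pred_a - pred_b) < agree_tol   # consistency threshold
    typical = dist <= max_dist                    # assumed "typical range" cutoff
    mask = agree & typical
    pseudo = w_a * pred_a + (1.0 - w_a) * pred_b
    return mask, pseudo[mask]

def augment_with_noise(X, ratio=0.05, seed=0):
    """Add zero-mean Gaussian noise whose std is `ratio` times each feature's std."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    scale = ratio * X.std(axis=0)                 # per-feature noise level
    return X + rng.normal(0.0, scale, size=X.shape)

mask, pseudo = select_pseudo_labels(
    pred_a=[1.0, 2.0, 5.0], pred_b=[1.1, 3.0, 5.05],
    dist_to_center=[0.5, 0.5, 9.0], agree_tol=0.2, max_dist=2.0)
X_aug = augment_with_noise(np.array([[1.0, 7.0], [3.0, 7.0], [5.0, 7.0]]))
```

In this toy call only the first sample passes both checks: the second fails the agreement test and the third lies too far from its cluster center. A constant feature column receives no noise because its standard deviation is zero.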

[0032] Preferably, the comprehensive evaluation index in S7 includes the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE);

[0033] The selection of the optimal sub-model is based on a comprehensive scoring formula:

[0034]

[0035] The model with the highest comprehensive score is selected as the optimal sub-model.

[0036] Preferably, the routing process in S8 includes: calculating the Euclidean distance between the input geological parameters and the centers of various geological clusters, and selecting the sub-model corresponding to the cluster with the smallest distance for prediction;

[0037] When the sub-model is an ensemble model, the uncertainty score is the standard deviation of the prediction results of each base learner.
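The routing and uncertainty steps of [0036] and [0037] reduce to a nearest-center lookup and a standard deviation. The sketch below assumes the cluster centers are available as rows of an array; the function names and the sample values are illustrative only.

```python
import numpy as np

def route_to_cluster(x_geo, centers):
    """Return the index of the nearest cluster center and all distances."""
    distances = np.linalg.norm(np.asarray(centers) - np.asarray(x_geo), axis=1)
    return int(distances.argmin()), distances

def ensemble_uncertainty(base_predictions):
    """Standard deviation of the base learners' predictions for one sample."""
    return float(np.std(base_predictions))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])    # illustrative cluster centers
cluster_id, distances = route_to_cluster([9.0, 8.0], centers)
score = ensemble_uncertainty([1.0, 2.0, 3.0])      # e.g. outputs of three base learners
```

The geological parameters should be scaled with the same normalizer that was used during clustering before the distances are computed, otherwise the nearest center is not meaningful.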

[0038] This application also discloses a semi-supervised learning-based shield tunneling parameter optimization and prediction system, used to implement the above-mentioned semi-supervised learning-based shield tunneling parameter optimization and prediction method, including:

[0039] The data preprocessing module is used to classify and extract historical construction data, perform multi-stage completion of missing values, outlier detection and cleaning based on modified Z-scores and Mahalanobis distance, and generate interactive, statistical and combined features.

[0040] The geological clustering module is used to perform multi-scale normalization of geological parameters and to perform parallel clustering using K-means, DBSCAN and Gaussian mixture models. It generates geological cluster labels and cluster centers through a two-level fusion strategy.

[0041] The feature optimization module is used to evaluate the performance of different normalization methods and select the optimal scaling method through a random forest regressor, and then perform variance filtering, mutual information selection and recursive feature elimination in sequence to obtain an optimized feature subset.

[0042] The model building module is used to train multiple base models for samples from various geological clusters and implement a semi-supervised enhancement strategy based on collaborative training, Gaussian noise enhancement, and iterative incremental learning within the geological clusters.

[0043] The model selection module is used to evaluate each candidate model on the validation set based on the comprehensive scoring formula and select the optimal sub-model under the combination of each geological cluster and active parameters.

[0044] The prediction and uncertainty assessment module is used to perform geological routing based on the distance between the input data and the cluster center and call the corresponding sub-model for prediction. It calculates the standard deviation of each base learner's prediction as an uncertainty score on the output of the integrated model.

[0045] The model storage and deployment module is used to save the trained model, normalizer, geological labels and evaluation results as structured files, and supports local or remote deployment via REST API and WebSocket interface.

[0046] The beneficial effects of this invention are:

[0047] (1) The introduction of dual-strategy outlier detection and multi-strategy missing value completion significantly improves data quality;

[0048] (2) By integrating multiple clustering algorithms and combining automatic parameter selection and noise point redistribution, a more robust geological classification can be obtained;

[0049] (3) Employ multi-stage feature optimization and scaler selection to improve model generalization and robustness;

[0050] (4) A multi-model integrated structure is constructed for each geological category, which is highly adaptable and computationally efficient;

[0051] (5) The semi-supervised mechanism combining collaborative training, data augmentation and iterative increment effectively alleviates the problem of insufficient labels;

[0052] (6) Fully introduce GPU acceleration for training and inference, significantly reducing model building and prediction latency;

[0053] (7) Provides an end-to-end integrated system architecture to support rapid deployment and traceable management on engineering sites.

Description of the Drawings

[0054] Figure 1 is a flowchart of the shield tunneling parameter optimization and prediction method based on semi-supervised learning according to an embodiment of the present invention;

[0055] Figure 2 is a schematic diagram of the geological clustering process according to an embodiment of the present invention;

[0056] Figure 3 is a schematic diagram of the multi-model construction and semi-supervised enhancement process in an embodiment of the present invention.

Detailed Implementation

[0057] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided with reference to the accompanying drawings and embodiments.

[0058] This application discloses a method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning. Its process is shown in Figure 1 and includes:

[0059] S1. Data Preprocessing. First, historical records are extracted from the tunnel boring machine (TBM) construction database. The raw data is then broken down according to physical meaning into active control parameters (such as thrust, cutterhead torque, grouting volume, etc.), passive response parameters (such as earth pressure, displacement, vibration signals, etc.), and geologically relevant parameters (such as surrounding rock type, water content, permeability coefficient, etc.). Active parameters characterize the amount of operation actively applied by the TBM during construction, passive response parameters reflect the passive response data collected during construction, and geological parameters describe the geological conditions of the tunneling section. Each type of parameter is stored as an independent dataset. This categorized storage method facilitates flexible feature selection and combination in subsequent steps, while reducing interference from irrelevant features on model training. To ensure efficiency in subsequent processing, this stage also performs timestamp alignment and redundant field removal.

[0060] To address the issue of missing values in numerical features, a multi-stage imputation process is employed. This embodiment adopts a two-stage strategy. The first stage uses cubic spline interpolation to smoothly impute missing segments with temporal continuity, maintaining the continuity of the temporal trend. The second stage uses the median to impute isolated missing values that still exist after interpolation, avoiding the bias that extreme values would introduce under mean imputation. This method balances local data smoothness with global distribution consistency, thereby reducing the perturbation of the model input distribution by missing values.
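A minimal sketch of the two-stage strategy follows, using SciPy's `CubicSpline`. It rests on one assumption not stated in the text: the values "still missing after interpolation" are taken to be those outside the range of observed points (leading or trailing gaps), where the spline is not extrapolated. The function name is hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def impute_two_stage(values):
    """Stage 1: cubic spline inside the observed range.
    Stage 2: median of the observed values for anything still missing."""
    y = np.asarray(values, dtype=float).copy()
    t = np.arange(y.size)
    known = ~np.isnan(y)
    observed_median = np.median(y[known])
    if known.sum() >= 2:                       # CubicSpline needs at least 2 points
        spline = CubicSpline(t[known], y[known])
        first, last = t[known][0], t[known][-1]
        interior = ~known & (t > first) & (t < last)
        y[interior] = spline(t[interior])      # fill gaps between observations
    y[np.isnan(y)] = observed_median           # fill the remaining edge gaps
    return y

series = [np.nan, 1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan]
filled = impute_two_stage(series)
```

Here the two interior gaps are filled by the spline, while the first and last entries fall back to the median of the observed values.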

[0061] Optionally, before performing cubic spline interpolation and median imputation for missing values, a missing pattern analysis can be performed. This analysis statistically analyzes the distribution characteristics of missing values across feature dimensions and time series, generating a visual report. This report helps engineers determine whether the missing values are random or systematic, guiding whether more complex interpolation strategies or data repair methods are needed.

[0062] Robust outlier detection is performed on each type of parameter data. The outlier detection stage combines the Modified Z-Score and the Mahalanobis Distance for multidimensional anomaly identification. First, the Modified Z-Score (based on the Median Absolute Deviation, MAD) of each feature is calculated, and samples exceeding a preset threshold are screened as suspected outliers. Then, for the multidimensional feature set, the Mahalanobis distance of each sample is calculated to measure its deviation from the overall distribution. If the determinant of the feature covariance matrix is zero or its absolute value is close to zero, so that the matrix cannot be inverted, the calculation automatically falls back to the Euclidean distance between the sample and the feature mean vector. This ensures that the outlier measurement still operates normally with singular matrices and that data cleaning is not interrupted by a non-invertible covariance matrix.

[0063] Optionally, the modified Z-score threshold used in outlier detection can be dynamically adjusted based on the statistical characteristics of the current dataset. For example, the threshold range can be adaptively set according to the skewness and kurtosis of the sample distribution, rather than being fixed at the traditional 3.0, thereby improving the sensitivity and accuracy of detection under different geological conditions and working conditions, and reducing the possibility of false positives and false negatives.

[0064] To avoid data loss due to excessive cleaning, only the top 5% of samples with the highest degree of abnormality are removed. At the same time, the feature values and locations of abnormal samples are recorded for review, thereby reducing the interference of extreme abnormal data on model training.
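The dual-index cleaning can be sketched with NumPy as below. The text does not spell out exactly how the two indices are combined, so this sketch makes an assumption: it removes the 5% of samples with the largest distance and returns the modified Z-score flags alongside for review. The function names and the determinant tolerance `det_tol` are hypothetical, and a fixed tolerance is only meaningful for features on comparable scales.

```python
import numpy as np

def modified_z_scores(X):
    """Per-feature modified Z-score based on the median absolute deviation."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, np.finfo(float).eps, mad)  # guard constant columns
    return 0.6745 * (X - med) / mad

def deviation_distances(X, det_tol=1e-12):
    """Mahalanobis distance to the mean; Euclidean if the covariance is singular."""
    diff = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    if abs(np.linalg.det(cov)) < det_tol:
        return np.linalg.norm(diff, axis=1)              # fallback described in [0062]
    inv_cov = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

def clean_outliers(X, z_threshold=3.0, remove_fraction=0.05):
    """Return a keep-mask, the Z-score suspects, and the removed row indices."""
    suspects = (np.abs(modified_z_scores(X)) > z_threshold).any(axis=1)
    dist = deviation_distances(X)
    n_remove = int(round(remove_fraction * len(X)))
    removed_idx = np.argsort(dist)[::-1][:n_remove]      # largest distances first
    keep = np.ones(len(X), dtype=bool)
    keep[removed_idx] = False
    return keep, suspects, removed_idx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [50.0, 50.0]                                      # planted extreme sample
keep, suspects, removed_idx = clean_outliers(X)
```

Returning `removed_idx` keeps the locations of the discarded samples available for the review step mentioned in [0064].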

[0065] S2. Feature Generation. Based on the preprocessed passive response parameters and geologically relevant parameter data, extended features are generated, including interactive features, statistical features, and combined features.

[0066] Several key features in the passive response parameters are paired to form second-order interaction polynomial features, retaining only the interaction terms to capture the correlation between different features. Specifically, a second-order interaction polynomial feature generation strategy is introduced. The polynomial feature generation process is implemented using the `PolynomialFeatures` class. The parameter `interaction_only` is set to `True` to ensure that the generated polynomial features only contain interaction terms between different original features, excluding the square terms of individual features. The parameter `include_bias` is set to `False` to avoid introducing constant bias columns, thereby reducing redundant feature dimensions and improving computational efficiency. This strategy, which retains only interaction terms between features and excludes square and constant terms, enhances the model's ability to capture nonlinear relationships between parameters while effectively controlling feature dimensions and avoiding the dimensionality explosion problem.

[0067] In certain application scenarios, to capture more complex parameter interactions, third-order interaction features can be added on top of the generated second-order interaction features to more comprehensively reflect the higher-order nonlinear relationships between active parameters, passive parameters, and geological parameters. However, the dimensionality of third-order interaction features grows rapidly, so feature selection algorithms (such as mutual information, L1 regularization, or recursive feature elimination) need to be combined to control the number of features and avoid the curse of dimensionality and excessive computational overhead.

[0068] The statistical characteristics of the passive response parameters across the sample dimension are calculated to characterize the overall distribution pattern. To capture the changing trends of the passive parameters within different time windows, this embodiment calculates the mean, standard deviation, skewness, and kurtosis. These indicators reflect the central location of the data, the degree of fluctuation, the asymmetry of the distribution, and the heaviness of its tails, respectively, providing additional morphological information to the model and improving its sensitivity to complex fluctuation patterns.
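As a rough illustration, the four statistics can be computed per window with NumPy alone. The sketch assumes non-overlapping windows of fixed length and reports excess kurtosis (normal distribution = 0); neither choice is specified in the text, and the function name is hypothetical.

```python
import numpy as np

def window_stats(x, window):
    """Mean, std, skewness and excess kurtosis for consecutive windows."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size // window
    blocks = x[: n_windows * window].reshape(n_windows, window)
    mean = blocks.mean(axis=1)
    std = blocks.std(axis=1)
    centered = blocks - mean[:, None]
    safe_std = np.where(std == 0, np.nan, std)   # constant windows yield NaN
    skew = (centered ** 3).mean(axis=1) / safe_std ** 3
    kurt = (centered ** 4).mean(axis=1) / safe_std ** 4 - 3.0
    return np.column_stack([mean, std, skew, kurt])

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 10.0]
feats = window_stats(signal, window=4)           # one row per window
```

The second window has the same mean as the first but a strongly positive skewness, which is the kind of shape information the mean and standard deviation alone would miss.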

[0069] For the geological parameters, combined features are generated, including the sum (Sum) and product (Product) of geological parameters as well as ratio features between several parameters, to enhance the model's ability to perceive the coupling effects of geological conditions and to improve the separability of geological features. For ratio features, a smoothing factor is uniformly added to the denominator. This prevents division-by-zero errors, keeps the computation numerically stable when the denominator is extremely small, avoids numerical overflow, and improves robustness to extreme geological parameter values.
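A short sketch of the combined features is given below. The value of the smoothing factor is not given in the text, so `smoothing=1e-6` is a placeholder assumption, and the sketch presumes non-negative parameters so that the denominator cannot cancel to zero.

```python
import numpy as np

def combined_features(a, b, smoothing=1e-6):
    """Sum, product and smoothed ratio of two geological parameters."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ratio = a / (b + smoothing)          # smoothing factor avoids division by zero
    return np.column_stack([a + b, a * b, ratio])

# Second sample has a zero denominator, which the smoothing factor absorbs.
feats = combined_features([2.0, 4.0], [1.0, 0.0])
```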

[0070] Optionally, specific ratio and product characteristics can be calculated separately for different stratigraphic types (such as sandy soil, clay, gravel, granite, etc.). For example, for the combination of water content and permeability coefficient, the ratio of the two can be calculated in highly permeable gravel layers to reflect seepage risk. In clay layers, the product of water content and plasticity index can be calculated to characterize the combined effect of plasticity and water content. By differentiating stratigraphic structural characteristics, geological sensitivity can be significantly enhanced, making the model more closely reflect the mechanical and hydrological characteristics of each stratum.

[0071] S3. Geological Clustering. The process is shown in Figure 2. In the preprocessing stage before geological clustering, this embodiment normalizes the same set of geological parameters using different normalization methods, constructing three parallel scale views: Standard Scaler to enhance mean-variance comparability; Robust Scaler to reduce the dominant influence of a few extreme geological values on distance metrics; and Quantile Transformer to obtain a smoother neighborhood structure under non-normal, long-tailed distributions. Unlike existing schemes that use only a single scaler before clustering, this embodiment uses three scale views as multi-perspective geological representations, providing complementary inputs for the subsequent three clustering algorithms, and explicitly utilizing the consistency between scale views in the fusion stage to improve the stability of geological partitioning.

[0072] Specifically, notation is introduced for the number of historical samples, the geologically relevant parameter feature matrix, the passive response parameter feature matrix, and the active control parameters (the prediction targets), each of which is indexed individually (e.g., propulsion force, cutterhead torque, grouting volume, etc.). Each geological cluster obtained from geological clustering is likewise indexed and has an associated cluster center vector. For any input sample, the routing distance to a cluster is the Euclidean distance between the sample and that cluster's center vector.

[0073] The clustering stage employs a combination of three complementary algorithms: K-means, DBSCAN, and Gaussian Mixture Model (GMM), running them at each of the three scales to obtain multiple candidate clustering results. K-means utilizes the centroid assumption to characterize an approximately convex geological distribution; DBSCAN uses density connectivity to identify irregular clusters and explicitly label noise points; and GMM uses probabilistic mixture to softly partition geological distributions with uneven variance within clusters and ellipsoidal boundaries. A two-stage fusion strategy is then adopted: first, consensus voting is performed on the labels of the three scales within the same algorithm to obtain stable labels for that algorithm; then, majority voting is performed among the three algorithms to form the final geological cluster labels, and cluster centers and cluster confidence are output simultaneously. This framework significantly reduces the bias risk of a single clusterer under specific geological distribution conditions, unlike existing approaches that rely solely on a single K-means or single density clustering to obtain geological segments. The final cluster centers and confidence are reused in subsequent training and inference modules, achieving deep coupling between geological clustering and the prediction model.
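Voting across clusterings only makes sense once the cluster ids of different runs refer to the same groups, because each algorithm numbers its clusters arbitrarily. The sketch below handles this with an assumption that is not part of the original text: labels are first aligned to a reference run by maximum overlap (Hungarian matching via SciPy), and only then voted on. It also assumes DBSCAN noise points have already been reassigned, so that all labels are non-negative. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(reference, labels):
    """Renumber `labels` so that its clusters overlap `reference` as much as possible."""
    k = int(max(reference.max(), labels.max())) + 1
    overlap = np.zeros((k, k), dtype=int)
    np.add.at(overlap, (labels, reference), 1)    # co-occurrence counts
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = np.empty(k, dtype=int)
    mapping[rows] = cols
    return mapping[labels]

def majority_vote(label_sets):
    """Align every label vector to the first one, then vote per sample."""
    reference = np.asarray(label_sets[0])
    aligned = np.stack([reference] + [align_labels(reference, np.asarray(l))
                                      for l in label_sets[1:]])
    k = int(aligned.max()) + 1
    counts = np.stack([(aligned == c).sum(axis=0) for c in range(k)])
    return counts.argmax(axis=0)                  # ties go to the lowest id

def two_level_fusion(results):
    """results maps algorithm name -> list of label vectors, one per scale view."""
    stable = [majority_vote(views) for views in results.values()]  # level 1
    return majority_vote(stable)                                    # level 2

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([1, 1, 2, 2, 0, 0])   # same partition as `a`, different numbering
c = np.array([0, 0, 1, 1, 2, 1])   # disagrees on the last sample
final = two_level_fusion({"kmeans": [a, b, c], "dbscan": [a, a, b], "gmm": [b, c, b]})
```

In this toy input the single disagreement in `c` is outvoted at both levels, and the fused labels reproduce the partition in `a`.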

[0074] In the geological clustering process, this embodiment can add hierarchical clustering as a fourth clustering method on the basis of the original K-means, DBSCAN, and Gaussian mixture models, and incorporate its results into the majority voting fusion mechanism. Hierarchical clustering does not require pre-setting the number of clusters and can adaptively determine the geological category boundaries through dendrograms, thereby improving the robustness of classification for complex stratigraphic boundaries and irregularly distributed data.

[0075] In K-means clustering, the optimal number of clusters K is automatically determined within a preset search interval. For each candidate K, k-means++ initialization is used and multiple random restarts are performed to reduce the risk of local optima. The k-means++ strategy selects the initial cluster centers through a probability distribution that favors widely separated centers, which reduces the sensitivity of the result to initialization, lowers the probability of getting trapped in local optima, and improves the stability of clustering and the consistency of repeated experiments. The silhouette coefficient and the Calinski-Harabasz index are calculated for each candidate and combined according to preset weights (0.6 for the silhouette coefficient and 0.4 for the normalized Calinski-Harabasz index). The K with the maximum combined value is taken as the optimal number of clusters, thus taking into account both intra-cluster compactness and inter-cluster separation. In addition to outputting cluster labels, this embodiment also outputs the corresponding cluster center vectors and intra-cluster variance statistics for subsequent model routing and boundary sample determination. This differs from existing technologies that output only discrete labels and do not retain routable continuous center representations, and it provides a foundation for sharing the same geological geometry across clustering, training, and inference.

[0076] In Figure 2, the neighborhood radius parameter of DBSCAN (eps) is set to the 90th percentile of the k-nearest-neighbor distance distribution. Specifically, eps is adaptively determined using the k-nearest-neighbor distance curve: the k-nearest-neighbor distance of each sample in the geological feature space is calculated (k takes a fixed small value or is set according to the dimension; in this embodiment, it is 5), and its 90th percentile is used as eps, thereby adapting to the density level of different engineering data and keeping the clustering results stable under different data distribution conditions. min_samples can be adaptively set according to the sample size and dimension to avoid excessive fragmentation under high-dimensional sparse conditions. Noise points identified by DBSCAN (labeled -1) are not discarded directly. Instead, a KNN classifier (k is 3) is constructed using the core point set as training samples and assigns each noise point to its nearest cluster, so that every sample obtains a valid geological condition label while DBSCAN's ability to detect outliers is retained. A distance threshold is introduced during reassignment: when the distance from a noise point to the nearest cluster center is still large, the point is marked as a low-confidence sample and its weight in subsequent semi-supervised pseudo-label injection is reduced. This noise point processing step not only improves the completeness of the cluster partition, but also provides a "high-confidence / low-confidence" grading for the semi-supervised stage, reflecting the coupled design of geological clustering and semi-supervised enhancement.
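The selection of K can be sketched with scikit-learn as follows. The weights 0.6 and 0.4 come from the text; the search interval, the number of restarts, and the use of min-max scaling across candidates as the normalization are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def select_k(X, k_range=range(2, 7), w_sil=0.6, w_ch=0.4, seed=0):
    """Pick K by a weighted mix of silhouette and normalized Calinski-Harabasz."""
    sil, ch = [], []
    for k in k_range:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=seed).fit_predict(X)
        sil.append(silhouette_score(X, labels))
        ch.append(calinski_harabasz_score(X, labels))
    sil, ch = np.array(sil), np.array(ch)
    span = ch.max() - ch.min()
    ch_norm = (ch - ch.min()) / span if span > 0 else np.zeros_like(ch)
    score = w_sil * sil + w_ch * ch_norm          # combined criterion
    return list(k_range)[int(score.argmax())]

# Three well-separated synthetic blobs standing in for geological parameters.
rng = np.random.default_rng(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in blob_centers])
best_k = select_k(X)
```

The Calinski-Harabasz index is unbounded, so it needs some normalization before it can be mixed with the silhouette coefficient, which lies in [-1, 1].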

[0077] In DBSCAN clustering, the minimum number of samples (min_samples) parameter can be automatically calculated based on the total number of samples in the dataset and the feature dimension. For example, it can be set to min_samples = max(5, 2 × number of dimensions) to ensure that the clustering algorithm maintains good performance on high-dimensional data and datasets of different sizes. This automated parameter setting method can improve the versatility of the algorithm in multiple projects and under multiple working conditions.
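A minimal sketch of the adaptive DBSCAN settings described in [0076] and [0077], assuming scikit-learn. The function name `adaptive_dbscan` and the synthetic data are hypothetical; the low-confidence distance threshold is left out for brevity, and the noise mask is returned so that such samples can be down-weighted later.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors


def adaptive_dbscan(X, k=5, percentile=90, knn_k=3):
    """DBSCAN with eps from the k-NN distance distribution and KNN noise reassignment."""
    n_features = X.shape[1]
    # Ask for k + 1 neighbours: each point is returned as its own nearest neighbour.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    eps = float(np.percentile(dists[:, -1], percentile))
    min_samples = max(5, 2 * n_features)
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    labels = db.labels_.copy()
    was_noise = labels == -1
    core_idx = db.core_sample_indices_
    if was_noise.any() and len(core_idx) >= knn_k:
        # Train on core points only, then give each noise point the label of its nearest cluster.
        knn = KNeighborsClassifier(n_neighbors=knn_k).fit(X[core_idx], labels[core_idx])
        labels[was_noise] = knn.predict(X[was_noise])
    return labels, was_noise, eps, min_samples


rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(60, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(60, 2)),
    np.array([[2.5, 2.5], [8.0, -1.0]]),  # two isolated points
])
labels, was_noise, eps, min_samples = adaptive_dbscan(X)
print(eps, min_samples, int(was_noise.sum()), np.unique(labels))
```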

[0078] S4. Feature Optimization. The geological parameters are scaled with different normalization methods (standardization, robust scaling, quantile scaling, etc.) to obtain multiple sets of scaled geological data, and a Random Forest regressor is constructed on each set. The coefficient of determination (R²) is calculated using 3-fold cross-validation, and the scaler with the highest R² is selected as the feature standardization scheme for subsequent model training. This strategy adaptively selects the best feature scaling method for different data distribution patterns, thereby improving the model's generalization ability and robustness.

[0079] The main hyperparameters of the random forest regressor in this embodiment include: n_estimators=50 (number of base learners) to control the number of trees, and random_state=42 (random seed) to ensure reproducibility. A smaller number of base learners reduces computational overhead while maintaining prediction performance, while a fixed random seed ensures consistency of results across multiple runs, which is beneficial for experimental reproducibility and the stability of engineering deployment.
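The scaler comparison can be sketched as below, assuming scikit-learn. The regression target used for scoring is not specified in this passage, so the sketch assumes it is one active parameter; `pick_scaler` and the synthetic data are hypothetical names and values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, RobustScaler, StandardScaler


def pick_scaler(X, y):
    """Return the scaler name with the highest mean 3-fold R², plus all scores."""
    candidates = {
        "standard": StandardScaler(),
        "robust": RobustScaler(),
        # n_quantiles is kept below the size of a training fold.
        "quantile": QuantileTransformer(n_quantiles=min(100, len(X) // 2), random_state=42),
    }
    scores = {}
    for name, scaler in candidates.items():
        # Hyperparameters from [0079]: 50 trees and a fixed seed.
        model = make_pipeline(scaler, RandomForestRegressor(n_estimators=50, random_state=42))
        scores[name] = float(cross_val_score(model, X, y, cv=3, scoring="r2").mean())
    return max(scores, key=scores.get), scores


rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                                    # stand-in geological parameters
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)    # assumed target
best_scaler, scaler_scores = pick_scaler(X, y)
print(best_scaler, scaler_scores)
```

Wrapping the scaler and the regressor in one pipeline means each scaler is fitted only on the training folds, which avoids leaking validation statistics into the comparison.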

[0080] The feature selection process is executed in three stages: variance filtering, mutual information selection, and recursive feature elimination. The first stage uses variance filtering to remove invalid features with variance below a preset threshold to reduce redundancy. The second stage uses mutual information to evaluate the non-linear dependencies between features and the target variable, retaining features with higher importance. The third stage uses recursive feature elimination (RFE) to iteratively remove the features with the lowest contribution from the training set until the feature subset with optimal model performance is obtained. This process maximizes the predictive value of the feature set while reducing computational complexity.
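The three stages can be sketched with scikit-learn as follows. The variance threshold and the number of features kept at each stage are hypothetical, and a fixed-size RFE stands in for the search for the best-performing subset (a cross-validated variant such as `RFECV` would be closer to that description).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, VarianceThreshold, mutual_info_regression


def three_stage_selection(X, y, var_threshold=1e-8, mi_keep=6, rfe_keep=3):
    """Return the column indices that survive all three stages."""
    idx = np.arange(X.shape[1])
    # Stage 1: drop near-constant features.
    keep = VarianceThreshold(threshold=var_threshold).fit(X).get_support()
    idx = idx[keep]
    # Stage 2: keep the features with the highest mutual information with y.
    mi = mutual_info_regression(X[:, idx], y, random_state=42)
    top = np.sort(np.argsort(mi)[::-1][: min(mi_keep, len(idx))])
    idx = idx[top]
    # Stage 3: recursive feature elimination driven by random forest importances.
    rfe = RFE(
        RandomForestRegressor(n_estimators=50, random_state=42),
        n_features_to_select=min(rfe_keep, len(idx)),
    ).fit(X[:, idx], y)
    return idx[rfe.support_]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 7] = 1.0  # a constant, uninformative feature
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
selected = three_stage_selection(X, y)
print(selected)
```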

[0081] The mutual information calculation in the feature optimization stage not only relies on traditional statistical mutual information methods, but can also be supplemented by random forest feature importance ranking. Specifically, after initially screening out highly relevant features, the importance of these features is scored using a random forest model, and the intersection or weighted result of the two is taken to enhance the stability and robustness of feature selection.

[0082] S5. Multi-model Construction. The process is shown in Figure 3. The samples are grouped by geological condition, and the corresponding sample subset is extracted for each group. Within each subset, multiple prediction models are trained separately for each active parameter. This embodiment includes seven types of base models: Random Forest (RF), ExtraTrees, Gradient Boosting Tree (GBDT), XGBoost, LightGBM, CatBoost, and Deep Neural Network Regressor (DNN). When GPUs are available, they are used preferentially to accelerate model training and improve computational efficiency. For the same geological cluster and the same target active parameter, each of the above models outputs its own predicted value (the Random Forest, ExtraTrees, GBDT, XGBoost, LightGBM, CatBoost, and DNN outputs), and the models then enter the semi-supervised augmentation and model selection process. This multi-model parallel training covers the respective strengths of different algorithms in nonlinear fitting, feature interaction capture, and imbalanced sample handling, and provides diverse candidate models for subsequent integration. During model construction, this embodiment allows the hyperparameters of base models such as Random Forest, XGBoost, LightGBM, CatBoost, and the deep neural network to be tuned for each geological cluster, rather than using a single fixed set of hyperparameters. This allows refined optimization based on the characteristics of each geological condition (such as differences in data distribution and noise level), thereby improving the adaptability and prediction accuracy of each cluster model.

[0083] In CatBoost model training, this embodiment automatically selects the computing mode based on the operating environment: when a GPU is detected as available, the `task_type` parameter is set to GPU to significantly accelerate training by utilizing the high parallel computing capabilities of the graphics card. If the GPU is unavailable, the `task_type` parameter is set to CPU, automatically reverting to CPU mode to ensure compatibility with environments without a GPU, thereby maintaining system hardware adaptability while ensuring training speed. This mechanism can flexibly adapt to different hardware conditions, ensuring the stability and efficiency of the model training process.
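A minimal sketch of the automatic `task_type` selection. How the embodiment detects a GPU is not stated, so probing for `nvidia-smi` is an assumption, and `gpu_available` and `catboost_params` are hypothetical helper names. The sketch only builds the parameter dictionary, so it runs without CatBoost installed.

```python
import shutil
import subprocess


def gpu_available():
    """Heuristic (assumption): a GPU is usable if `nvidia-smi` exists and exits cleanly."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        return subprocess.run([exe], capture_output=True, timeout=10).returncode == 0
    except (OSError, subprocess.SubprocessError):
        return False


def catboost_params(use_gpu=None, **overrides):
    """Build CatBoost keyword arguments with task_type chosen from the environment."""
    if use_gpu is None:
        use_gpu = gpu_available()
    params = {"task_type": "GPU" if use_gpu else "CPU", "random_seed": 42, "verbose": False}
    params.update(overrides)
    return params


params = catboost_params()
print(params["task_type"])
# With CatBoost installed, the dictionary can be passed on as
# CatBoostRegressor(**params).
```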

[0084] The deep neural network regressor consists of multiple fully connected layers, each followed by a batch normalization layer to accelerate convergence and mitigate vanishing gradients. The ReLU (Rectified Linear Unit) activation function is used to enhance non-linear expressiveness, and a dropout layer (with a dropout rate of 0.3) is introduced between key layers to prevent overfitting. The optimizer is Adam, combined with the ReduceLROnPlateau learning rate scheduler, which lowers the learning rate if the validation loss fails to improve over multiple rounds. An early stopping strategy is also implemented to terminate training when validation set performance no longer improves, thus balancing convergence speed and generalization performance. A batch normalization layer can also be added before the input layer to batch normalize the input features. This alleviates training instability caused by differences in feature distribution, accelerates model convergence, and reduces the risk of overfitting to some extent. In this embodiment, the random forest regressor sets the number of base learners to 500 (i.e., n_estimators=500); the Gradient Boosting Tree (GBDT) allows the learning rate and maximum depth to be set; XGBoost allows the column sampling parameter `colsample_bytree` to be set to 0.8; and the DNN structure can consist of FC, BN, ReLU, and Dropout layers trained with the Adam optimizer. These parameters can be adjusted according to project requirements.

[0085] S6. Semi-supervised Augmentation. The semi-supervised augmentation strategy consists of three parts: co-training regression, Gaussian noise data augmentation, and iterative incremental training. The key difference in this embodiment is that semi-supervised augmentation is confined to the training domain within each geological cluster, and the cluster center distance and cluster credibility output by the clustering stage are used as gating conditions. Unlike existing global pseudo-label augmentation, this embodiment applies consistency constraints within clusters, which reduces pseudo-label contamination caused by distribution drift across geological conditions. Different clusters can also employ different augmentation intensities and screening thresholds. For example, a more conservative screening ratio is used for geological clusters with larger intra-cluster variance, so that the semi-supervised strategy adapts to the geological structure and forms a geological grouping-driven semi-supervised augmentation strategy.

[0086] In co-training regression, two high-performing base models are trained from labeled data. Unlabeled data is then predicted using these models, and samples with predicted differences below a consistency threshold are added to the training set. This process is iterated repeatedly to expand the labeled sample size. Specifically, two base models with complementary error characteristics (e.g., tree models and neural networks) are selected and trained on labeled samples within the same geological cluster. Then, dual-model predictions are performed on unlabeled samples, and the difference between the two model predictions and the distance from the unlabeled sample to the cluster center are calculated simultaneously. Only samples with prediction differences at the low quantile of the distribution of unlabeled sample differences and whose distances meet the typical range within the cluster are injected into the training set as high-confidence pseudo-label samples. The pseudo-label is a weighted average of the two model predictions, and pseudo-label samples are assigned training weights lower than those of true samples to further suppress pseudo-label noise propagation. This co-training mechanism, which combines consistency, geological distance gating, and weighted pseudo-labels, differs from existing semi-supervised regression schemes that only use single-model thresholds or only select based on consistency. It can more stably expand the effective training samples in geologically non-stationary scenarios.

[0087] In the co-training phase, heterogeneous model combinations can be employed, that is, base models of different types (such as tree models and neural networks) are selected for co-training. Tree models handle nonlinear feature interactions well, while neural networks fit well in high-dimensional feature spaces. Combining the two increases predictive complementarity and improves the consistency screening of unlabeled data.

[0088] Optionally, in this embodiment, during the unlabeled sample selection stage of co-training regression, the prediction differences between the two base models on the unlabeled data are sorted in ascending order. Samples whose difference is at or below the 20th percentile are treated as having high prediction consistency and are added to the training set, which improves the reliability of the labels of newly added samples. The consistency selection can be written as "difference ≤ 20th percentile of the difference distribution", meaning that only unlabeled samples whose prediction difference does not exceed the 20th percentile of the difference distribution participate in pseudo-label injection.
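The selection rule, together with the distance gating and reduced pseudo-label weight described in [0086], can be sketched with NumPy as below. The function name, the equal averaging weights, the pseudo-label weight of 0.5, and the distance limit are hypothetical values chosen for illustration.

```python
import numpy as np


def select_pseudo_labels(pred_a, pred_b, dist_to_center, max_dist,
                         diff_quantile=0.20, weight_a=0.5, pseudo_weight=0.5):
    """Pick unlabeled samples on which both models agree and that lie inside the cluster."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    dist_to_center = np.asarray(dist_to_center, dtype=float)
    diff = np.abs(pred_a - pred_b)
    threshold = np.quantile(diff, diff_quantile)          # 20th percentile of the differences
    keep = (diff <= threshold) & (dist_to_center <= max_dist)
    # Pseudo-label: weighted average of the two model predictions.
    pseudo_y = weight_a * pred_a[keep] + (1.0 - weight_a) * pred_b[keep]
    # Pseudo-labelled samples get a lower training weight than true samples (weight 1.0).
    sample_weight = np.full(int(keep.sum()), pseudo_weight)
    return np.flatnonzero(keep), pseudo_y, sample_weight


pred_a = np.arange(1.0, 11.0)
deltas = np.array([0.10, 0.50, 0.05, 1.00, 0.02, 0.30, 0.01, 0.80, 0.40, 0.20])
pred_b = pred_a + deltas
dist = np.full(10, 0.5)
dist[6] = 3.0  # consistent predictions, but far from the cluster center
idx, pseudo_y, weights = select_pseudo_labels(pred_a, pred_b, dist, max_dist=2.0)
print(idx, pseudo_y, weights)
```

In this example two samples pass the consistency test, but one of them is rejected by the distance gate, which is the situation the combined criterion is meant to catch.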

[0089] Gaussian noise data augmentation is performed within clusters. Small-amplitude noise following a Gaussian distribution is injected into the original training samples to generate an augmented sample set, which is then merged with the original data to retrain the model. Specifically, the noise intensity of each dimension is first determined statistically according to the intra-cluster feature scale, ensuring that the noise standard deviation is proportional to the corresponding feature standard deviation. Pruning / constraints are applied to key physical quantities when necessary to avoid generating unreasonable samples. The augmented samples are then merged with the original samples to retrain candidate models or ensemble models, enabling the model to maintain output stability when facing sensor jitter, acquisition errors, and perturbations in geological parameters. Unlike existing methods that use noise augmentation as a general regularization technique, the noise augmentation in this embodiment is driven by intra-cluster scale statistics. The augmentation intensity can adaptively adjust according to the differences in geological clusters, thus coupling with the geological clustering framework.

[0090] In addition to Gaussian noise perturbation, oversampling methods such as SMOTE (Synthetic Minority Over-sampling Technique) can be introduced during the data augmentation stage to generate synthetic samples for geological clusters with a small number of samples, thereby balancing the proportion of various types of samples and preventing the model from having a prediction bias towards large categories of samples during training.

[0091] Optionally, in this embodiment, when generating Gaussian noise during data augmentation training, the noise standard deviation is set to 0.01 times the standard deviation of the input features to ensure that the distribution of the augmented samples is close to that of the original samples, thereby increasing the robustness of the model without destroying the original feature patterns. The noise injected into a feature can be formally represented as zero-mean Gaussian noise ε ~ N(0, (0.01·σ)²), where σ² is the variance statistic of that feature dimension across samples within the cluster.
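A minimal NumPy sketch of intra-cluster Gaussian augmentation with the 0.01 ratio. The function name `gaussian_augment`, the optional `bounds` clipping (standing in for the constraints on key physical quantities), and the synthetic data are assumptions.

```python
import numpy as np


def gaussian_augment(X, y, noise_ratio=0.01, n_copies=1, bounds=None, seed=0):
    """Append noisy copies of the samples of one cluster to the original samples."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    sigma = noise_ratio * X.std(axis=0)  # per-feature noise scale from intra-cluster statistics
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_noisy = X + rng.normal(0.0, 1.0, size=X.shape) * sigma
        if bounds is not None:
            # Optional physical limits as (lower, upper), scalars or per-feature arrays.
            X_noisy = np.clip(X_noisy, bounds[0], bounds[1])
        X_parts.append(X_noisy)
        y_parts.append(y)  # labels are left unchanged
    return np.vstack(X_parts), np.concatenate(y_parts)


rng = np.random.default_rng(3)
X_cluster = rng.normal(loc=[10.0, 200.0, 0.5], scale=[1.0, 20.0, 0.05], size=(100, 3))
y_cluster = rng.normal(size=100)
X_aug, y_aug = gaussian_augment(X_cluster, y_cluster)
print(X_aug.shape, y_aug.shape)
```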

[0092] In Figure 3, "partial_fit" refers to the incremental update interface of models that support incremental learning, and "Model Registry" denotes the model registry, which is used to register and version-manage the optimal sub-models obtained during training.

[0093] Iterative training is used to handle the continuous influx of data during construction and changes in geological stages. Training data within a cluster is divided into multiple incremental packages by time or batch. The first package completes baseline training, and subsequent packages are applied preferentially as rapid updates through `partial_fit`, so that models supporting incremental learning are updated continuously after the initial full training. For models that do not support `partial_fit`, `warm_start` is used, or the model is retrained with the previous round's model as the initialization. After each iteration, model performance and intra-cluster error statistics are recorded. When significant intra-cluster distribution drift is detected, an optional process can be triggered to re-cluster or update the cluster centers, ensuring that model updates remain consistent with the geological grouping. This mechanism differs from existing static offline training methods and keeps the model effective when geological conditions change dynamically.

[0094] In an iterative training strategy, a snapshot of the current intermediate model can be saved after each round of training, and its performance metrics can be recorded. This not only facilitates reverting to a better historical version when performance degrades, but also provides multi-stage candidate models for subsequent model fusion, thereby improving overall prediction robustness.

[0095] Optionally, in this embodiment, the iterative training divides all training samples into 5 sub-batches, and adds new samples to the training set in batches. The number of new samples added each time is 1 / 5 of the total number of samples. In each iteration, the model is incrementally updated or fully retrained to gradually improve the model performance.
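The five-batch incremental update can be sketched with a scikit-learn estimator that exposes `partial_fit`. `SGDRegressor` is used only as a stand-in for an incrementally trainable model; the data, the number of passes per batch, and the use of R² as the recorded metric are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=500)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

model = SGDRegressor(random_state=42)
history = []  # (batch index, validation R²) recorded after each incremental package

# Split the training samples into 5 sub-batches, each 1/5 of the total.
for batch_id, idx in enumerate(np.array_split(np.arange(len(X_train)), 5)):
    for _ in range(20):  # partial_fit performs one pass per call, so repeat a few times
        model.partial_fit(X_train[idx], y_train[idx])
    history.append((batch_id, float(r2_score(y_val, model.predict(X_val)))))

for batch_id, r2 in history:
    print(batch_id, round(r2, 3))
```

Keeping the per-batch history is what makes it possible to roll back to an earlier snapshot when a later update degrades validation performance.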

[0096] S7. Model Selection. Under each geological condition and active parameter combination, calculate the R², RMSE, and MAE indices for all candidate models on the validation set. Then, weight them according to the comprehensive scoring formula. The model with the highest comprehensive score is determined as the final sub-model for that combination. If necessary, use ensemble methods such as voting regression to fuse the prediction results of multiple models. The comprehensive scoring formula is as follows:

[0097]

[0098] In this formula, R² is the coefficient of determination, RMSE is the root mean square error, and MAE is the mean absolute error. R² serves as the primary evaluation indicator, while RMSE and MAE act as penalty terms to achieve a balance between model fitting accuracy and error control. The weights in the comprehensive scoring formula can be dynamically adjusted based on the specific optimization objectives of the construction task. For example, when prioritizing safety, the weight of RMSE can be increased to more strictly control the risk of large errors. In tasks aiming for energy consumption optimization, the weight of R² can be increased to enhance the accuracy of overall trend prediction.
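As an illustration of the selection logic only, the sketch below scores candidates with R² as the main term and RMSE and MAE as penalties. The weights 0.6, 0.2, and 0.2, the function names, and the toy predictions are hypothetical and are not taken from the formula above.

```python
import math


def regression_metrics(y_true, y_pred):
    """R², RMSE and MAE computed from two equal-length sequences."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def composite_score(m, w_r2=0.6, w_rmse=0.2, w_mae=0.2):
    # Hypothetical weights: R² is rewarded, RMSE and MAE are penalized.
    return w_r2 * m["r2"] - w_rmse * m["rmse"] - w_mae * m["mae"]


y_val = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "rf": [1.1, 1.9, 3.2, 3.8, 5.1],
    "gbdt": [1.5, 2.5, 2.0, 4.5, 4.0],
}
scores = {name: composite_score(regression_metrics(y_val, pred)) for name, pred in candidates.items()}
best_model = max(scores, key=scores.get)
print(best_model, scores)
```

Because RMSE and MAE carry the units of the target while R² is dimensionless, a real implementation would likely need the error terms normalized (or the weights rescaled) per active parameter for the three terms to be comparable.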

[0099] This formula emphasizes model fit while applying weighted penalties to the error metrics, and thus reflects the overall predictive performance of the model in a more balanced way. The sub-model with the highest comprehensive score is selected as the optimal model for the corresponding geological condition and active parameter combination. The globally optimal model is then determined by selecting, from among the sub-models of all geological condition and active parameter combinations, the one with the highest coefficient of determination (R²) on the validation set. In addition to R², the statistics for this model also report its root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) on the validation set, together with the number of training samples, allowing a comprehensive evaluation of the model's applicability under different geological conditions and sample sizes. This multi-metric output helps engineers understand the model's predictive accuracy and applicability and serves as a reference for subsequent construction optimization.

[0100] In the statistical analysis of the globally optimal model, this embodiment not only records the model's global performance indicators but also statistically analyzes its error distribution characteristics under different geological clusters, such as the mean absolute error and error variance of each cluster. This information can provide a reference for construction strategies, helping construction units determine under which geological conditions the monitoring frequency should be increased or construction parameters adjusted.

[0101] An uncertainty score threshold can be set during the prediction phase. When the model's uncertainty score exceeds this threshold, the system will automatically trigger a manual review mechanism or call a backup model to recalculate the prediction results. This safety strategy effectively prevents high-risk prediction results from being directly used for construction control, reducing potential engineering risks.

[0102] S8. Prediction and Uncertainty Assessment. In the prediction phase, the input passive parameters and geological parameters undergo the same standardization process as in the training phase. The cluster center vectors output from the clustering phase are used for geological routing: after scaling the input geological parameters in the same way, their Euclidean distances to each cluster center are calculated, and the sub-model of the cluster with the smallest distance is selected for prediction. When an input sample is at a cluster boundary, if the difference between the smallest and second smallest distances is below a threshold, the sub-models of the top two clusters can be simultaneously called for weighted fusion, and the uncertainty score and boundary sample label are reported concurrently. By integrating cluster center and routing logic throughout training and inference, this invention significantly differs from schemes that simply hard-switch models based on fixed geological categories, and further demonstrates the deep coupling between the geological clustering framework and the prediction system.
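The routing step can be sketched with NumPy as follows. The boundary margin, the inverse-distance fusion weights, and the constant toy sub-models are assumptions; the text states only that the two nearest clusters are fused with weights when the distance gap falls below a threshold.

```python
import numpy as np


def route(x_scaled, centers, boundary_margin=0.5):
    """Return [(cluster index, weight), ...] and a boundary flag for one scaled sample."""
    d = np.linalg.norm(centers - x_scaled, axis=1)  # Euclidean distance to each center
    order = np.argsort(d)
    nearest, second = int(order[0]), int(order[1])
    if d[second] - d[nearest] < boundary_margin:
        # Assumption: inverse-distance weights for the two nearest clusters.
        inv = 1.0 / (d[[nearest, second]] + 1e-12)
        w = inv / inv.sum()
        return [(nearest, float(w[0])), (second, float(w[1]))], True
    return [(nearest, 1.0)], False


def predict_routed(x_scaled, centers, sub_models, boundary_margin=0.5):
    routes, is_boundary = route(x_scaled, centers, boundary_margin)
    y = sum(w * sub_models[c](x_scaled) for c, w in routes)
    return y, is_boundary


centers = np.array([[0.0, 0.0], [4.0, 0.0]])
sub_models = [lambda x: 10.0, lambda x: 20.0]  # toy per-cluster predictors

y_inner, inner_flag = predict_routed(np.array([0.5, 0.0]), centers, sub_models)
y_edge, edge_flag = predict_routed(np.array([2.1, 0.0]), centers, sub_models)
print(y_inner, inner_flag, y_edge, edge_flag)
```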

[0103] When the sub-model is an ensemble model, the standard deviation of the prediction results of each base learner is calculated simultaneously with the output prediction value. This standard deviation serves as a quantitative indicator of the uncertainty of the prediction result, thus providing a confidence reference for decision-making. Prediction uncertainty assessment involves calculating the standard deviation of the prediction results of each base learner (such as RF, GBDT, CatBoost, DNN, etc.) within the ensemble model, and taking the mean of the standard deviations of all prediction targets as the uncertainty score. This provides a quantitative basis for decision-making regarding risk. The lower the score, the more consistent the prediction results of each base model, and the higher the reliability of the prediction results. Conversely, a higher score suggests that the prediction may fluctuate significantly. This uncertainty information can provide a risk reference for construction decisions, such as conducting geological surveys or adjusting parameters in advance when uncertainty is high.
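A minimal NumPy sketch of the uncertainty score: the standard deviation across base learners is computed per prediction target and then averaged over targets. Using the mean of the base predictions as the ensemble output and the review threshold of 1.0 are assumptions made for the example.

```python
import numpy as np


def ensemble_predict_with_uncertainty(base_predictions):
    """base_predictions has shape (n_base_learners, n_targets)."""
    p = np.asarray(base_predictions, dtype=float)
    prediction = p.mean(axis=0)       # assumption: simple average as the ensemble output
    per_target_std = p.std(axis=0)    # spread of the base learners for each target
    return prediction, float(per_target_std.mean())


# Three base learners predicting two active parameters for one sample.
base_predictions = [
    [100.0, 2.0],
    [102.0, 2.2],
    [98.0, 1.8],
]
prediction, uncertainty = ensemble_predict_with_uncertainty(base_predictions)
needs_review = uncertainty > 1.0  # hypothetical threshold for manual review or a backup model
print(prediction, round(uncertainty, 4), needs_review)
```

Since the standard deviations are averaged in raw units, a target with a large numeric range dominates the score unless the targets are scaled first.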

[0104] To facilitate horizontal comparison of prediction uncertainty across different engineering projects and time periods, this embodiment can perform quantile standardization on the uncertainty score, so that the score falls within the range of 0 to 1 and reflects its relative position in the historical prediction results of the project.
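One way to realize this quantile standardization is an empirical-CDF lookup against the project's historical scores, sketched below with NumPy. The function name and the historical values are hypothetical.

```python
import numpy as np


def quantile_standardize(score, history):
    """Fraction of historical uncertainty scores that are <= score, in [0, 1]."""
    history = np.sort(np.asarray(history, dtype=float))
    return float(np.searchsorted(history, score, side="right") / len(history))


history = [0.2, 0.5, 0.9, 1.4, 3.0]  # past uncertainty scores of one project
mid = quantile_standardize(0.9, history)
low = quantile_standardize(0.1, history)
high = quantile_standardize(5.0, history)
print(low, mid, high)
```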

[0105] S9. Model Storage, Archiving, and Deployment. The trained ensemble model, normalizer, geological labels, and evaluation results are saved as files for subsequent retrieval and online deployment. The method proposed in this embodiment generates three types of Excel files: (1) a detailed evaluation result file, recording the detailed performance of each sub-model under different indicators; (2) a summary file of overall indicators, recording the average performance and distribution of all models, used for quickly comparing model performance under different geological conditions and parameter combinations; and (3) a summary file of optimal models, listing the best models and their performance indicators for each combination. This structured storage method facilitates subsequent retrieval and version management, while also supporting rapid loading and retrieval on-site.

[0106] In addition to supporting common Pickle or Joblib formats, the model storage stage can also export models to ONNX (Open Neural Network Exchange) format, enabling cross-platform deployment. The ONNX format is compatible with various inference frameworks (such as TensorRT, ONNX Runtime, OpenVINO, etc.), facilitating efficient inference on different hardware platforms (such as GPU servers, embedded devices, edge computing nodes), and improving the flexibility of engineering implementation.

[0107] This embodiment offers flexible deployment options, supporting both local and remote deployment modes. Local deployment is suitable for single tunnel boring machines (TBMs) or partial construction projects, running directly in the control room or on edge computing nodes to meet low-latency prediction requirements. Remote deployment connects to a construction monitoring platform via a network interface, enabling centralized management and simultaneous monitoring of multiple projects. Both deployment modes can be seamlessly integrated with existing TBM construction management systems, enabling automatic collection, display, and alarm functions for prediction results.

[0108] During the deployment phase, this embodiment can simultaneously provide a dual-interface mode based on REST API and WebSocket. The REST API is suitable for batch prediction scenarios, allowing users to submit multiple data samples at once via HTTP requests and obtain prediction results. The WebSocket interface is suitable for real-time data stream prediction, maintaining a continuous connection with the tunnel boring machine control system and triggering the prediction process immediately when new sensor data arrives. This dual-mode design can cover various engineering application needs from offline analysis to online control and supports flexible switching within the same deployment instance.

[0109] To further improve computational efficiency, the method of this embodiment supports operation on multi-GPU platforms. It uses data parallelism and model parallelism to distribute large-scale data and multi-model training tasks across multiple GPUs for simultaneous execution. Furthermore, mixed-precision training reduces GPU memory usage and accelerates computation, thereby shortening training and prediction time while maintaining model accuracy.

[0110] In the GPU-accelerated training phase, this embodiment can introduce mixed precision, using FP16 (16-bit floating-point numbers) for some calculations, while keeping critical calculations (such as gradient accumulation) using FP32 (32-bit floating-point numbers). This approach can significantly improve computational throughput and GPU memory utilization efficiency without significantly sacrificing numerical precision, thereby accelerating model training and reducing hardware resource consumption, making it particularly suitable for rapid iterative training of large-scale engineering data.

[0111] The data processing stage can employ a multi-threaded data loading and preprocessing pipeline design. This means that during model training and inference, steps such as data loading, cleaning, and feature construction are executed in parallel with model computation. By pre-completing data preparation on the CPU, the GPU can achieve continuous training and inference without waiting, effectively reducing I / O bottlenecks and improving the overall system throughput.
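The overlap of data preparation and model computation can be sketched with a background thread and a bounded queue from the Python standard library. `prefetching_batches` and `load_batch` are hypothetical names, and the sketch omits error propagation from the loader thread.

```python
import queue
import threading


def prefetching_batches(load_batch, n_batches, depth=2):
    """Yield batches while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=depth)  # bounded, so the loader stays at most `depth` batches ahead
    sentinel = object()

    def producer():
        for i in range(n_batches):
            q.put(load_batch(i))    # loading, cleaning and feature construction would go here
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def load_batch(i):
    # Stand-in for reading and preprocessing one batch of sensor records.
    return [i * 10 + j for j in range(3)]


# The consumer loop plays the role of the training or inference step.
results = [sum(batch) for batch in prefetching_batches(load_batch, n_batches=4)]
print(results)
```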

[0112] This application also discloses a semi-supervised learning-based shield tunneling parameter optimization and prediction system, used to implement the above-mentioned semi-supervised learning-based shield tunneling parameter optimization and prediction method, including:

[0113] The data preprocessing module, used to execute S1-S2, is configured to read tunnel boring machine (TBM) construction records from historical construction datasets and classify them into three datasets based on the physical meaning of the parameters: active parameters, passive parameters, and geological parameters. Based on this, missing numerical features are first imputed using cubic spline interpolation and then by the median. Robust outlier detection, combining modified Z-scores and Mahalanobis distance, is performed independently on each type of parameter, and the top 5% of samples with the highest outlier severity are removed. Subsequently, second-order interactive polynomial features and statistical features are generated for passive parameters, and combined features are generated for geological parameters. A smoothing factor is introduced during the calculation of ratio features to prevent numerical overflow.

[0114] The data preprocessing module can be further configured to automatically generate visual statistical reports during the missing value imputation, outlier detection, and feature construction processes, including histograms of missing value distribution, box plots of outlier detection, and comparison charts of feature importance before and after feature construction. This allows operators to intuitively assess data quality and preprocessing effectiveness, thereby assisting in decisions on whether to adjust the preprocessing strategy.

[0115] The geological clustering module is used to execute S3. This module is configured to standardize, robustly scale, and scale geological parameters respectively. For each scaled result, K-means, DBSCAN, and Gaussian mixture models are used for clustering. The optimal number of clusters is determined by a weighted score of the silhouette coefficient and the Calinski-Harabasz index. After DBSCAN clustering, a K-nearest neighbor classifier is used to reassign noise points to the nearest cluster. Finally, the different clustering results are merged by majority voting to generate global geological condition labels.

[0116] The feature optimization module is used to execute S4. This module is configured to evaluate the average R² performance of each method using a random forest regressor combined with 3-fold cross-validation under various standardization methods, and select the best scaler as the benchmark for subsequent processing. Based on this, variance filtering, mutual information selection and recursive feature elimination are performed on the passive parameter features in sequence to output the optimal feature subset.

[0117] The model building module is used to execute S5-S6. This module is configured to train multiple prediction models for each sample subset of geological conditions and each active parameter, including random forest, extreme random tree, gradient boosting tree, XGBoost, LightGBM, CatBoost, and deep neural network regressors. GPU acceleration training is automatically enabled when GPU is available. At the same time, semi-supervised strategies such as co-training regression, data augmentation training, and iterative training are introduced to improve the generalization performance of the model by utilizing unlabeled data when there are insufficient labeled samples.

[0118] The model selection module is used to execute S7. This module is configured to compute R², RMSE, and MAE for the candidate models on the validation set under each geological cluster and active parameter combination, combine them into a weighted comprehensive score, and select the model with the highest score as the final sub-model of that combination. It then traverses all sub-models globally and selects the model with the best R² as the globally optimal model. At the same time, it calculates the distribution of geological conditions in the globally optimal model set.

[0119] The prediction and uncertainty assessment module is used to execute S8. This module is configured to perform the same standardization process as the training phase after receiving the input passive parameters and geological parameters, and select the most similar geological condition label based on Euclidean distance to call the corresponding sub-model for prediction. When the called sub-model is an ensemble model, the standard deviation of the prediction results of each base learner is calculated as the prediction uncertainty value, and the uncertainty index and the most similar geological condition label are returned when the prediction value is output.

[0120] The prediction and uncertainty assessment module can be further configured to output, together with the prediction results, the uncertainty score calculated from the standard deviation of the prediction results of each base learner in the ensemble model, and to return the label of the most similar geological condition selected. This allows the uncertainty index to be combined with the prediction values for confidence analysis and safety judgment when the prediction values are applied to construction control or risk assessment, thereby achieving more reliable prediction of tunnel boring machine excavation parameters.

[0121] The model storage and deployment module is used to execute S9. This module is configured to save the ensemble model, normalizer, geological labels, detailed evaluation results, overall index summary, and optimal model summary as files after model training and evaluation are completed. Excel is the preferred format for the evaluation and summary files, which are used for subsequent analysis, archiving, and online deployment.

[0122] Optionally, the system proposed in this embodiment supports running on cloud computing platforms (such as AWS, Azure, Alibaba Cloud, etc.) and can be combined with distributed training frameworks (such as Horovod or DeepSpeed) for multi-node parallel training. This mode is particularly suitable for processing ultra-large-scale historical shield tunneling construction data across projects and regions, which can significantly shorten the model training cycle and provide technical support for synchronous modeling and updating of multiple construction sites.

[0123] In engineering applications, the system can connect to the real-time sensor data interface of the tunnel boring machine (TBM) to directly receive multi-source sensor data, including thrust, cutterhead torque, earth pressure, and grouting volume. The system can trigger prediction tasks at fixed time intervals or when geological conditions change significantly, and supports dynamic updates of model parameters to ensure that the prediction results are consistent with the current geological and construction conditions, thereby improving the model's timeliness and on-site adaptability.

[0124] The system log module can record detailed input data, prediction output, uncertainty score, and model version number used for each prediction task. The log can also include prediction execution time and hardware operating status information, providing comprehensive data for engineering quality traceability and performance optimization, and ensuring the verifiability and monitorability of the construction process.
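One possible shape for a log entry covering the fields listed in [0124] is shown below. The class name, field names, and the JSON-lines serialization are assumptions made for illustration; the source only lists what is recorded, not how.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class PredictionLogRecord:
    timestamp: str
    model_version: str
    inputs: dict           # input data used for the prediction
    prediction: dict       # predicted tunneling parameters
    uncertainty: dict      # uncertainty score per predicted parameter
    elapsed_ms: float      # prediction execution time
    hardware_status: dict = field(default_factory=dict)

    def to_json_line(self) -> str:
        # One JSON object per line keeps the log easy to append and parse.
        return json.dumps(asdict(self), sort_keys=True)

# Illustrative values only.
record = PredictionLogRecord(
    timestamp="2026-01-01T00:00:00Z",
    model_version="v1",
    inputs={"earth_pressure": 2.1},
    prediction={"thrust": 1200.0},
    uncertainty={"thrust": 16.3},
    elapsed_ms=4.2,
    hardware_status={"gpu_available": False},
)
line = record.to_json_line()
```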

[0125] To ensure system security, this embodiment performs an integrity check with a secure hash algorithm such as SHA-256 during the model file deployment phase, verifying that the model file has not been tampered with or damaged during transmission and deployment. If an anomaly is detected in the model file, the system refuses to load the model and issues a security warning, thereby preventing erroneous predictions caused by a corrupted model.
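The integrity check in [0125] can be sketched with Python's standard `hashlib` module. This sketch assumes the expected digest is distributed alongside the model file through a trusted channel. The function names are hypothetical, and raising `ValueError` stands in for the security warning.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_model_bytes(path, expected_sha256):
    """Return the raw model file contents only if the digest matches."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"model file integrity check failed: {path}")
    with open(path, "rb") as fh:
        return fh.read()
```

Deserializing the model would happen only after `load_model_bytes` returns, so a tampered file is never parsed.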

[0126] The system provides construction personnel with an interactive visual interface that displays prediction results, uncertainty scores, corresponding geological clustering labels, and historical prediction trend curves in real time. The interface also integrates multi-dimensional data comparison functions, allowing construction personnel to simultaneously view predicted values, actual measured values, and their deviations on the same screen, facilitating timely adjustments to construction strategies.

[0127] To adapt to the needs of engineering projects in different countries and regions, the system supports multilingual interface switching, including but not limited to Chinese, English, Russian, and Arabic. Users can switch the display language with a single click in the interface settings, enabling collaborative work among cross-language teams. This is particularly suitable for the rapid deployment of multinational engineering contracting companies when working in different regions.

[0128] For special projects, this embodiment can introduce external meteorological or hydrological data (such as rainfall, groundwater level changes, river flow velocity, etc.) into the model as additional input features. Especially in tunnel construction that crosses soft soil layers or is close to water bodies, it can better reflect the impact of environmental factors on tunneling parameters, thereby improving the accuracy of prediction and the ability to predict engineering risks.

[0129] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.

Claims

1. A method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning, characterized in that the method comprises:
S1. Extracting active control parameters, passive response parameters, and geology-related parameters from historical construction data, performing multi-stage completion of missing values, and performing dual-index detection and cleaning of outliers;
S2. Generating extended features, including interaction features, statistical features, and combined features, from the preprocessed passive response parameters and geology-related parameters;
S3. Constructing multi-scale views by applying multiple normalization methods to the geological parameters, performing clustering with multiple clustering algorithms under each view, and generating final geological cluster labels through a two-level fusion strategy, wherein the two-level fusion strategy comprises: performing consensus voting on the clustering results of the different scale views within the same algorithm to obtain stable labels for that algorithm, and then performing majority voting on the stable labels of the different algorithms to generate the final geological cluster labels;
S4. Constructing random forest regressors under different normalization methods to evaluate the effect of feature scaling, selecting the optimal scaling method, and performing multi-stage feature selection;
S5. For each geological cluster, extracting the corresponding sample subset and training multiple prediction models for each active parameter within the subset;
S6. Performing co-training regression, Gaussian noise data augmentation, and iterative incremental training within the geological clusters to improve the generalization ability of the models;
S7. For each combination of geological cluster and active parameter, selecting the optimal sub-model based on the comprehensive evaluation index on the validation set;
S8. Calculating the distance between new input data and the center of each geological cluster, routing the data to the corresponding sub-model for prediction, and calculating the uncertainty score of the prediction results;
S9. Saving the trained models and related components as structured files, supporting local and remote deployment.
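The two-level fusion strategy of S3 can be illustrated with a small voting sketch. It rests on two assumptions the claim does not spell out: cluster label IDs have already been aligned so that the same integer means the same cluster across views and algorithms, and "consensus voting" is treated as a plurality vote. The function names and sample labels are illustrative.

```python
from collections import Counter

def vote(labels):
    """Most common label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def fuse_labels(results):
    """results[algorithm][view] is a list of per-sample cluster labels."""
    stable = {}
    for algorithm, views in results.items():
        # Level 1: vote across the scale views of one algorithm.
        stable[algorithm] = [vote(per_sample) for per_sample in zip(*views.values())]
    # Level 2: vote across the algorithms' stable labels.
    return [vote(per_sample) for per_sample in zip(*stable.values())]

# Illustrative labels for four samples, three views, three algorithms.
results = {
    "kmeans": {"standard": [0, 0, 1, 1], "robust": [0, 0, 1, 1], "quantile": [0, 1, 1, 1]},
    "dbscan": {"standard": [0, 0, 1, 0], "robust": [0, 0, 1, 0], "quantile": [0, 0, 1, 1]},
    "gmm":    {"standard": [0, 1, 1, 1], "robust": [0, 0, 1, 1], "quantile": [0, 0, 0, 1]},
}
final_labels = fuse_labels(results)
```

In practice the label alignment step (for example, matching clusters by overlap) would have to run before this voting, since independent clustering runs number their clusters arbitrarily.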

2. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 1, characterized in that the multi-stage completion of missing values in S1 includes: filling feature columns that contain consecutive missing values by cubic spline interpolation, and filling any isolated missing values that remain after interpolation with the median. The dual-index detection and cleaning of outliers includes: first, calculating the modified Z-score of each feature and flagging samples that exceed a preset threshold as suspected outliers; then calculating the Mahalanobis distance of each sample in the multidimensional feature space to measure its deviation from the overall distribution, and removing the 5% of samples with the highest degree of anomaly.
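Two of the steps in claim 2, median filling and modified Z-score screening, can be sketched in plain Python. The sketch omits the cubic spline stage and the Mahalanobis stage. The constant 0.6745 and the threshold 3.5 are the conventional choices for the modified Z-score, used here as assumptions because the claim only refers to "a preset threshold". Sample values are illustrative.

```python
from statistics import median

def fill_with_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]

def modified_z_scores(column):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    m = median(column)
    mad = median(abs(v - m) for v in column)
    if mad == 0:
        # Simplification: with zero MAD no sample is flagged.
        return [0.0 for _ in column]
    return [0.6745 * (v - m) / mad for v in column]

# Illustrative feature column with one gap and one extreme reading.
raw = [10.0, 11.0, None, 9.0, 10.5, 50.0]
filled = fill_with_median(raw)
scores = modified_z_scores(filled)
suspects = [i for i, z in enumerate(scores) if abs(z) > 3.5]
```

The flagged indices in `suspects` would then be passed to the second, multivariate stage rather than removed outright.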

3. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 2, characterized in that the interaction features in S2 are second-order interaction polynomial features that retain only the interaction terms between different original features; the statistical features include the mean, standard deviation, skewness, and kurtosis; and the combined features include the sum, product, and ratio features of the geological parameters, wherein a smoothing factor is introduced into the denominator of the ratio features.
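A brief sketch of two feature types from claim 3: interaction-only second-order terms and a ratio with a smoothing factor. The feature names, the value of `eps`, and the assumption that denominators are non-negative are all illustrative and not taken from the source.

```python
from itertools import combinations

def interaction_features(row):
    """Products of distinct feature pairs only (no squared terms)."""
    names = sorted(row)
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(names, 2)}

def ratio_feature(numerator, denominator, eps=1e-6):
    # eps is the smoothing factor; it keeps the ratio finite when the
    # denominator is zero (assumes non-negative denominators).
    return numerator / (denominator + eps)

# Hypothetical geological parameters for one sample.
row = {"cohesion": 2.0, "friction_angle": 3.0, "water_content": 4.0}
interactions = interaction_features(row)
ratio = ratio_feature(row["cohesion"], row["water_content"])
```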

4. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 3, characterized in that the multiple normalization methods in S3 include standardization, robust scaling, and quantile scaling, and the multiple clustering algorithms include K-means, DBSCAN, and Gaussian mixture models.

5. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 4, characterized in that the selection of the optimal scaling method in S4 includes: constructing random forest regressors under the different normalization methods, calculating the coefficient of determination R² using 3-fold cross-validation, and selecting the scaling method with the highest R²; and the multi-stage feature selection includes variance filtering, mutual information selection, and recursive feature elimination, performed in that order.

6. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 5, characterized in that the multiple prediction models in S5 include random forest, extremely randomized trees, gradient boosting tree, XGBoost, LightGBM, CatBoost, and deep neural network regressors; and when a GPU is detected as available, GPU-accelerated model training is enabled automatically.

7. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 6, characterized in that the co-training regression in S6 includes: using two types of base models with complementary error characteristics to predict unlabeled data within the same geological cluster, selecting the samples whose prediction difference is below the consistency threshold and whose distance from the cluster center is within a typical range, generating pseudo-labels for them by weighted averaging, and adding them to the training set; in the Gaussian noise data augmentation, the noise standard deviation is proportional to the standard deviation of the corresponding feature; and the iterative incremental training divides the training data into multiple incremental packages and updates the model parameters in batches.
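The pseudo-label selection and noise augmentation of claim 7 can be sketched as follows. Several details are assumptions: the "typical range" is modeled as a maximum distance from the cluster center, the weights default to an equal average, and `noise_ratio` is a hypothetical proportionality constant. The two models' predictions are passed in as plain lists, so the sketch is independent of any particular regressor.

```python
import random
from statistics import pstdev

def select_pseudo_labels(preds_a, preds_b, center_distances,
                         consistency_threshold, max_distance, weight_a=0.5):
    """Return {sample index: pseudo-label} for samples both models agree on."""
    pseudo = {}
    for i, (a, b, d) in enumerate(zip(preds_a, preds_b, center_distances)):
        agrees = abs(a - b) < consistency_threshold
        typical = d <= max_distance  # assumed meaning of "typical range"
        if agrees and typical:
            pseudo[i] = weight_a * a + (1.0 - weight_a) * b
    return pseudo

def augment_with_noise(rows, noise_ratio=0.01, seed=0):
    """Add zero-mean Gaussian noise with std = noise_ratio * feature std."""
    rng = random.Random(seed)
    stds = [pstdev(column) for column in zip(*rows)]
    return [[x + rng.gauss(0.0, noise_ratio * s) for x, s in zip(row, stds)]
            for row in rows]

# Illustrative predictions of two base models on three unlabeled samples.
pseudo = select_pseudo_labels(
    preds_a=[10.0, 20.0, 30.0],
    preds_b=[10.4, 25.0, 30.2],
    center_distances=[0.5, 0.5, 3.0],
    consistency_threshold=1.0,
    max_distance=2.0,
)
augmented = augment_with_noise([[1.0, 5.0], [3.0, 5.0]])
```

Only the first sample passes both filters here: the second fails the consistency check and the third lies too far from the cluster center.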

8. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 7, characterized in that the comprehensive evaluation index in S7 includes the coefficient of determination R², the root mean square error (RMSE), and the mean absolute error (MAE); the optimal sub-model is selected according to a comprehensive scoring formula (the formula is not reproduced in this text), and the model with the highest comprehensive score is selected as the optimal sub-model.

9. The method for optimizing and predicting tunneling parameters of a tunnel boring machine based on semi-supervised learning according to claim 8, characterized in that the routing process in S8 includes: calculating the Euclidean distance between the input geological parameters and the center of each geological cluster, and selecting the sub-model corresponding to the cluster with the smallest distance for prediction; and when the sub-model is an ensemble model, the uncertainty score is the standard deviation of the prediction results of its base learners.
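The nearest-center routing of claim 9 reduces to a minimum over Euclidean distances. The sketch below assumes the input vector has already been scaled with the same normalizer used when the cluster centers were computed. The function name and center coordinates are illustrative.

```python
import math

def route_to_cluster(geology, cluster_centers):
    """Return the label of the cluster whose center is nearest (Euclidean)."""
    return min(
        cluster_centers,
        key=lambda label: math.dist(geology, cluster_centers[label]),
    )

# Illustrative cluster centers in a two-dimensional geological feature space.
centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}
cluster = route_to_cluster((4.0, 4.5), centers)
```

The returned label would then be used to look up the sub-model trained for that cluster and the requested active parameter.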

10. A shield tunneling parameter optimization and prediction system based on semi-supervised learning, characterized in that the system implements the method of claims 1-9 and comprises:
A data preprocessing module, used to classify and extract historical construction data, perform multi-stage completion of missing values, perform outlier detection and cleaning based on modified Z-scores and the Mahalanobis distance, and generate interaction, statistical, and combined features;
A geological clustering module, used to perform multi-scale normalization of the geological parameters, perform parallel clustering using K-means, DBSCAN, and Gaussian mixture models, and generate geological cluster labels and cluster centers through a two-level fusion strategy, wherein the two-level fusion strategy comprises: performing consensus voting on the clustering results of the different scale views within the same algorithm to obtain the stable labels of that algorithm, and then performing majority voting on the stable labels of the different algorithms to generate the final geological cluster labels;
A feature optimization module, used to evaluate the performance of the different normalization methods with a random forest regressor and select the optimal scaling method, and then to perform variance filtering, mutual information selection, and recursive feature elimination in sequence to obtain an optimized feature subset;
A model building module, used to train multiple base models on the samples of each geological cluster and to implement, within the geological clusters, a semi-supervised enhancement strategy based on co-training, Gaussian noise augmentation, and iterative incremental learning;
A model selection module, used to evaluate each candidate model on the validation set according to the comprehensive scoring formula and to select the optimal sub-model for each combination of geological cluster and active parameter;
A prediction and uncertainty assessment module, used to perform geological routing based on the distance between the input data and the cluster centers, call the corresponding sub-model for prediction, and, for the output of an ensemble model, calculate the standard deviation of the base learners' predictions as the uncertainty score;
A model storage and deployment module, used to save the trained models, normalizer, geological labels, and evaluation results as structured files, and to support local or remote deployment via a REST API and a WebSocket interface.