Wafer CMP non-uniformity virtual measurement method based on a data-mechanism fusion large model
By employing a large-model approach that fuses data with physical mechanisms, the invention addresses the challenge of multi-source heterogeneous data in virtual measurement of wafer CMP non-uniformity, achieving high-precision, interpretable, and stable virtual measurement that adapts to process changes in semiconductor manufacturing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies face challenges in virtual measurement of wafer CMP non-uniformity with multi-source heterogeneous data, including difficulty bridging semantic gaps, a lack of physical consistency and interpretability in purely data-driven models, and insufficient generalization under unseen operating conditions.
The method adopts a large-model approach based on data-mechanism fusion: multimodal data features are extracted by heterogeneous encoders and semantically aligned by a cross-modal cross-attention mechanism. A dual-branch prediction network combining a physical-mechanism branch and a data-driven branch is optimized with a joint loss function.
The method achieves deep semantic alignment of multimodal data, improves the accuracy and interpretability of virtual measurement, and generalizes well across operating conditions. It adapts stably to process parameter fine-tuning and consumable aging in semiconductor manufacturing, reducing the economic cost of retraining models.
Figure CN122241353A_ABST
Description
Technical Field
[0001] This invention relates to the field of measurement technology for chemical mechanical polishing (CMP), and in particular to a virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model.
Background Technology
[0002] Chemical mechanical polishing (CMP) is a core process for achieving global wafer planarization in advanced integrated circuit manufacturing, and its processing accuracy directly determines the yield of chip manufacturing processes. As process nodes continue to approach physical limits, the coupled effects of pattern effects, dynamic wear of polishing pads, and fluctuations in process parameters have made within-wafer non-uniformity (WIWNU) and within-die non-uniformity (WIDNU) critical bottlenecks restricting yield. Therefore, performing virtual measurement (VM) of CMP non-uniformity indicators, so that predicted results replace or supplement physical measurements, is a core prerequisite for closed-loop process control.
[0003] Currently, the main technical solutions for virtual measurement of CMP non-uniformity are as follows: (1) Traditional statistical modeling methods, such as building empirical models based on response surface methodology (RSM) or modified Preston equations to quantify the influence of parameters such as polishing pressure and rotation speed on non-uniformity. (2) Purely data-driven deep learning methods, which use convolutional neural networks (CNNs) to extract wafer spatial features, or long short-term memory (LSTM) networks to mine time-series signal features such as pressure and torque, to learn a nonlinear mapping from data to quality indicators. (3) Early physics-informed neural network (PINN) methods, which add the residuals of partial differential equations (such as mechanics equations) as regularization terms to the loss function used for training.
[0004] In practical applications, facing the extreme conditions of strong coupling of multiple factors in advanced processes, the aforementioned existing technologies have revealed the following serious technical bottlenecks:
[0005] (1) The “semantic gap” between multi-source heterogeneous data is difficult to bridge: the CMP process produces heterogeneous data such as high-frequency time-series signals, high-resolution spatial images, structured recipes, and unstructured expert documents. Existing methods mostly fuse these by simple vector concatenation, which makes it difficult to capture the deep interactions between modalities and limits the joint representation of multi-source information.
[0006] (2) Purely data-driven models lack physical consistency and interpretability: such deep learning models rely solely on fitting statistical correlations in the data, without incorporating the fluid dynamics and contact mechanics of the polishing process. The models are therefore prone to overfitting, producing outputs that fit the data well but violate basic mechanics, and cannot give process engineers physically meaningful support for parameter tuning decisions.
[0007] (3) Generalization degrades severely under unseen and complex operating conditions: existing shallow PINN-based hybrid models do not integrate the mechanism deeply enough. On a semiconductor production line, once data become scarce or operating conditions switch dynamically (for example, a change to a new type of slurry, polishing pad aging, or other out-of-distribution conditions), the model relies too heavily on the statistical regularities of the training data and its prediction accuracy drops sharply, failing to meet the robustness requirements of mass production.
Summary of the Invention
[0008] To address the aforementioned problems in existing technologies, this invention proposes a virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model. The method achieves deep semantic alignment of time series, images, parameters, and knowledge through heterogeneous encoders and a cross-modal cross-attention mechanism.
[0009] This invention also provides a virtual measurement system for wafer CMP non-uniformity based on a large model of data mechanism fusion.
[0010] To address the above problems, this invention adopts the following technical solution: a virtual measurement method for wafer CMP non-uniformity based on a large data mechanism fusion model, comprising the following steps:
[0011] Step 1: Acquire multi-source heterogeneous data and perform feature extraction with heterogeneous encoders. Collect multimodal data from the CMP process, including time-series process signals, spatial graphic features, structured process knowledge, and unstructured expert knowledge. For these modalities, use a 1D-CNN+Transformer hybrid architecture, a Vision Transformer, a multilayer perceptron, and a graph attention network, respectively, as heterogeneous encoders to extract the corresponding single-modal representation vectors in parallel.
[0012] Step 2: Construct a cross-modal attention fusion mechanism to generate a unified representation. Using the single-modal representation vectors from Step 1 as input, calculate the semantic association matrix between the feature vectors of each pair of modalities and propagate attention information. Introduce a learnable gating mechanism to adaptively weight and integrate the propagated modal features with the original features, generating a unified representation vector of the multi-source data.
[0013] Step 3: Construct a dual-branch prediction network combining physics and data. The unified representation vector output from Step 2 is input into two parallel prediction branches. The physical mechanism branch calculates the theoretical contact pressure using Hertz contact theory, estimates the slurry film thickness using the Reynolds equation (as prior features), and, combined with the extended Preston equation, outputs a physical-mechanism-driven prediction result. The knowledge-enhanced data branch uses the unified representation to retrieve from the expert knowledge graph, formats the retrieved content into a prompt, and inputs it into a large language model (LLM); at the same time, it extracts spatial sequence features with a physics-informed Transformer (PIT) that incorporates a physical affinity matrix, and outputs the data-driven prediction result and confidence interval.
[0014] The prediction results of the two branches are weighted and fused using learnable weights to generate a non-uniformity prediction tensor.
[0015] Step 4: Optimize the network model by calculating the joint loss function based on multiple physical conservation laws. During the backpropagation training phase of the model, construct and optimize the end-to-end joint loss function, which includes data fitting terms, multiple physical conservation constraints, knowledge retrieval terms, and uncertainty regularization terms. Minimize the joint loss function using the gradient descent algorithm to drive the iterative update of network parameters until the model converges.
[0016] Furthermore, the time-series process signal in Step 1 is acquired by a time-series signal sensor, an acoustic emission sensor, or a motor drive current sensor.
[0017] Furthermore, Step 3 can also output the physical-mechanism-driven prediction result by combining the Archard wear equation or the nonlinear removal rate equation based on the Tseng-Wang model.
[0018] Furthermore, Step 3 can also extract spatial sequence features by introducing the physical affinity matrix into a Mamba network or a bidirectional LSTM network with physical attention bias, and output the data-driven prediction result and confidence interval.
[0019] Furthermore, in Step 1, a 1D-CNN+Transformer hybrid architecture, a Vision Transformer, a multilayer perceptron, and a graph attention network are used as heterogeneous encoders to extract the corresponding single-modal representation vectors in parallel. The specific process is as follows:
[0020] (11) Time-series process signal encoding: the acquired time-series feature matrix is input into a hybrid architecture consisting of a one-dimensional convolution and a Transformer. First, local features are extracted by a one-dimensional convolutional layer:
[0021] $$H_{\mathrm{loc}} = \sigma\left(W_c * X_t + b_c\right)$$
[0022] The local features are then fed into the Transformer encoding layer for multi-head self-attention (MHSA) processing:
[0023] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
[0024] where $X_t$ is the input time-series feature matrix; $W_c$ and $b_c$ are the convolution kernel weights and bias, respectively; $*$ denotes the convolution operation; $\sigma$ is the activation function; $H_{\mathrm{loc}}$ is the local feature matrix; $Q$, $K$, and $V$ are the query, key, and value matrices obtained from $H_{\mathrm{loc}}$ by linear transformation; $d_k$ is the dimension of the key vectors; $\mathrm{softmax}$ is the normalized exponential function; and the output is the time-series modality encoding vector $F_t$.
[0025] (12) Spatial graphic feature encoding: the acquired spatial feature maps are processed using a Vision Transformer architecture. First, the image is flattened into a sequence by a patch embedding layer, and positional encoding is added:
[0026] $$Z_0 = \mathrm{Flatten}(X_v) + E_{\mathrm{pos}}$$
[0027] where $X_v$ is the input spatial graphic feature map; $\mathrm{Flatten}(\cdot)$ denotes the feature flattening operation; $E_{\mathrm{pos}}$ is the added positional encoding matrix; and $Z_0$ is the serialized feature. After multiple layers of self-attention processing, the image modality encoding vector $F_v$ is output.
[0028] (13) Structured process knowledge encoding: a multi-layer fully connected network performs a nonlinear mapping on the concatenated feature vector of the process parameters:
[0029] $$h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad h^{(0)} = x_s$$
[0030] where $x_s$ is the input structured process feature vector; $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector of the $l$-th layer, respectively; $\sigma$ is a nonlinear activation function; and the final output is the structured modality encoding vector $F_s$.
[0031] (14) Unstructured expert knowledge encoding: after the expert text is extracted into an entity-relationship graph, a graph attention network is used for encoding. In the $l$-th graph convolution layer, node information is aggregated as follows:
[0032] $$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l)}\Big)$$
[0033] where $h_i^{(l)}$ is the feature vector of node $i$ in the $l$-th layer; $\mathcal{N}(i)$ is the set of neighboring nodes of node $i$; $\alpha_{ij}$ is the calculated attention coefficient; $W^{(l)}$ is the weight matrix; $\sigma$ is a nonlinear activation function; and the output is the knowledge-graph modality encoding vector $F_k$.
[0034] Furthermore, the construction of the cross-modal cross-attention fusion mechanism in step two to generate a unified representation specifically involves semantically aligning the independent features output in step one.
[0035] First, the semantic association matrix between any two modalities i and j is calculated:
[0036]
[0037] where the quantities involved are the encoded feature vectors of the i-th and j-th modalities; the learnable query projection matrix and key projection matrix; and the resulting semantic association score matrix between the two modalities.
[0038] Subsequently, attention information is propagated, and a gating mechanism is used for adaptive feature integration:
[0039]
[0040] where the quantities involved are the unified representation vector of the multi-source data produced by the final fusion; the modality index set; the feature vector of each modality updated after cross-modal attention propagation; the original feature vector of that modality; and the gating coefficient matrix generated by a learnable network.
[0041] Furthermore, constructing the physics and data dual-branch prediction network in Step 3 specifically involves:
[0042] (31) Physical prior features, namely the theoretical pressure field and the liquid film thickness, are generated using Hertz contact theory and the Reynolds equation; the network prediction is then combined with the extended Preston equation to calculate the prediction tensor of the physical branch:
[0043]
[0044] where the quantities involved are the physical baseline removal rate at a given wafer coordinate; an empirical coefficient; and the local pressure distribution, relative sliding speed, and lubrication efficiency factor extracted by the neural network.
[0045] (32) Based on the unified representation from Step 2, relevant content is retrieved from the knowledge graph and formatted as prompt text P, which is input into a large language model (LLM) and a physics-informed Transformer (PIT) to output the data-driven prediction tensor together with its uncertainty variance:
[0046]
[0047] where the quantities involved are the prediction tensor generated by the data-driven branch; the variance of the confidence interval of the corresponding prediction; the hidden-state features; the formatted prompt text obtained by retrieval; and the feature vector concatenation operation. The PIT explicitly injects a physical affinity matrix, characterizing the physical coupling across the wafer, during attention computation.
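[0048] (33) Prediction result fusion:
[0049] $$\hat{Y} = w_{\mathrm{phys}}\,\hat{Y}_{\mathrm{phys}} + w_{\mathrm{data}}\,\hat{Y}_{\mathrm{data}}$$
[0050] where $\hat{Y}$ is the final full-field prediction tensor for the non-uniformity indices (WIWNU and WIDNU); $\hat{Y}_{\mathrm{phys}}$ and $\hat{Y}_{\mathrm{data}}$ are the prediction tensors of the physical branch and the data-driven branch; and $w_{\mathrm{phys}}$ and $w_{\mathrm{data}}$ are the fusion weights of the physical and data-driven terms, learned automatically by a gating network.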
[0051] Furthermore, calculating the joint loss function based on multiple physical conservation laws to optimize the network model in Step 4 specifically includes: during the backpropagation training phase, calculating the overall optimization objective, a loss function that incorporates physical regularization:
[0052]
[0053] where the terms are the total loss for end-to-end training; the mean squared error of the data fit between predicted and actual values; the total physical penalty term, grouped in parentheses in the formula, which combines the mass conservation constraint loss, the energy conservation constraint loss, and the monotonicity constraint loss with their corresponding weights; the knowledge-retrieval loss; and the regularization term for uncertainty estimation. Optimizing this joint loss function drives the network model to iterative convergence.
[0054] This invention also provides a system for implementing the virtual measurement method for wafer CMP non-uniformity based on a large data mechanism fusion model, which is deployed in the form of software intelligent algorithm modules on semiconductor manufacturing workshop control equipment or cloud platforms. The system includes:
[0055] A multi-source data unified representation and fusion module, connected to the CMP machine sensor network, which fuses the independent feature encodings of the multimodal data through cross-modal cross-attention fusion (CMAF) to generate a unified representation of the multi-source features;
[0056] A mechanism-data dual-branch hybrid modeling module, with a built-in physical calculation engine and physics-informed neural network, which performs prior extraction based on Hertz and Reynolds theory and physical-branch prediction based on the extended Preston equation;
[0057] A large-model retrieval-enhanced prediction and optimization module, which incorporates a large language model and a PIT model to perform contextual retrieval and global non-uniformity index inference; during model training, it calculates the joint loss function incorporating mass and energy conservation constraints to close the system update loop.
[0058] A computer device includes a memory and a processor; the memory stores a computer program, and the processor, when executing the program, implements all the steps of heterogeneous feature extraction, CMAF fusion, dual-branch inference and prediction, and joint-loss optimization and update.
[0059] A computer-readable storage medium storing a corresponding computer instruction program, which, when read and executed by a processor, implements any of the method steps described above.
[0060] Compared with the prior art, the beneficial technical effects of the present invention are as follows:
[0061] (1) The method achieves deep semantic alignment and significantly improves the efficiency of multimodal data utilization. The cross-modal cross-attention fusion (CMAF) mechanism provided by this invention effectively mines the deep correlations between modalities. After CMAF fusion, the average inter-modality feature similarity rises to 0.78, which is 27.9% higher than with simple concatenation, and the intra-cluster compactness reaches 0.87. The model thus effectively eliminates the semantic gap between multi-source heterogeneous data and provides a high-purity feature basis for virtual measurement.
[0062] (2) The method described in this application has high physical consistency, which greatly improves the accuracy and interpretability of virtual measurement. The present invention uses a dual-branch network and multiple physical conservation constraints to make the root mean square error (RMSE) of global prediction as low as 0.028 and the coefficient of determination (R²) as high as 0.94. The accuracy of virtual measurement is 30.6% higher than the existing basic Transformer benchmark. The key indicator, physical consistency error (PCE), is 0.018, which is 22.2% lower than the traditional physical information neural network (PINN), effectively avoiding the output of predictions that violate basic mechanical common sense.
[0063] (3) The method generalizes well across operating conditions and effectively alleviates “virtual measurement drift”. Condition drift is common in semiconductor manufacturing. When a previously unseen polishing slurry and polishing pad are introduced as an out-of-distribution (OOD) test set, the traditional purely data-driven model degrades significantly (R² drops to 0.78), while the R² of the present invention remains at 0.91, the error increase stays controlled, and the PCE is 56.9% lower than that of the control group. This generalization performance indicates that, when deployed on a wafer mass production line, the system can adapt smoothly to process parameter fine-tuning and consumable aging, reducing the economic cost of re-collecting data and retraining the model after frequent changes in operating conditions, and providing reliable monitoring for yield control in advanced processes.
Description of the Drawings
[0064] Figure 1 is a schematic diagram of the overall framework of the mechanism-enhanced multimodal large model of the present invention;
[0065] Figure 2 is a framework diagram of the multimodal data preprocessing method of the present invention;
[0066] Figure 3 is a schematic diagram of the large-model-based cross-modal alignment mechanism of the present invention;
[0067] Figure 4 is a schematic diagram of the deep integration framework of the physical mechanism of the present invention;
[0068] Figure 5 is a diagram of the predictive inference framework of the present invention, which integrates physical mechanisms with the large model.
Detailed Description
[0069] The technical solution of the present invention will be further described clearly and in detail below with reference to the embodiments and accompanying drawings.
[0070] Example 1
[0071] As shown in Figure 1, this embodiment provides a mechanism-enhanced multimodal large model framework for virtual measurement of wafer CMP non-uniformity.
[0072] It should be noted that the method steps described in this embodiment reflect a strict computational data flow from data perception, feature independent encoding, cross-modal fusion, bi-branch reasoning to joint physics optimization. The computational order of each step is fixed and cannot be arbitrarily interchanged.
[0073] Specifically, the following steps are included:
[0074] Step 1: Acquire multi-source heterogeneous data and perform independent feature extraction with heterogeneous encoders. As shown in Figures 1 and 2, time-series process signals, spatial graphic images, structured process parameters, and unstructured expert text are acquired in real time on the production line. For these four types of heterogeneous data, dedicated heterogeneous encoders extract single-modal features, as detailed below:
[0075] (11) Time-series process signal encoding: the time-series feature matrix is input into a hybrid architecture consisting of a one-dimensional convolutional network (1D-CNN) and a Transformer. First, local features are extracted by one-dimensional convolutional layers:
[0076] $$H_{\mathrm{loc}} = \sigma\left(W_c * X_t + b_c\right)$$
[0077] The local features are then fed into the Transformer encoding layer for multi-head self-attention (MHSA) processing:
[0078] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
[0079] where $X_t$ is the input time-series feature matrix; $W_c$ and $b_c$ are the convolution kernel weights and bias, respectively; $*$ denotes the convolution operation; $\sigma$ is the activation function; $H_{\mathrm{loc}}$ is the local feature matrix; $Q$, $K$, and $V$ are the query, key, and value matrices obtained from $H_{\mathrm{loc}}$ by linear transformation; $d_k$ is the dimension of the key vectors; $\mathrm{softmax}$ is the normalized exponential function; and the final output is the time-series modality encoding vector $F_t$.
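The two operations in sub-step (11) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: it assumes a single input channel, a single attention head, ReLU as the activation, and identity query/key/value projections; the function names and the toy signal are hypothetical.

```python
import math

def conv1d_relu(x, kernel, bias):
    """Valid 1D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(kernel[j] * x[t + j] for j in range(k)) + bias)
            for t in range(len(x) - k + 1)]

def softmax(row):
    """Numerically stable normalized exponential."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Toy single-channel process signal (hypothetical values).
signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]
local = conv1d_relu(signal, kernel=[0.5, 0.5], bias=0.0)  # local features
tokens = [[v] for v in local]          # one 1-dimensional token per time step
encoded = self_attention(tokens, tokens, tokens)  # identity Q/K/V projections (assumption)
```

In a trained model the projections and kernel weights would be learned, and several heads and layers would be stacked; the sketch only shows the data flow from local convolutional features to attention-weighted features.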
[0080] (12) Spatial graphic feature encoding: the spatial feature map is processed using the Vision Transformer (ViT) architecture. First, the image is flattened into a sequence by a patch embedding layer, and positional encoding is added:
[0081] $$Z_0 = \mathrm{Flatten}(X_v) + E_{\mathrm{pos}}$$
[0082] where $X_v$ is the input spatial graphic feature map; $\mathrm{Flatten}(\cdot)$ denotes the feature flattening operation; $E_{\mathrm{pos}}$ is the added positional encoding matrix; and $Z_0$ is the serialized feature. After multiple layers of self-attention processing, the image modality encoding vector $F_v$ is output.
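The patch flattening and positional encoding of sub-step (12) can be illustrated with a small sketch. It assumes non-overlapping square patches, omits the learned linear projection that a ViT normally applies to each patch, and takes the positional encoding as a given matrix; `patchify`, `add_position`, and the 4×4 toy image are hypothetical.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch
    blocks, each flattened row by row into one token."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [[image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            for r in range(0, h, patch) for c in range(0, w, patch)]

def add_position(tokens, pos):
    """Add a positional encoding vector to each token element-wise."""
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]

# Toy 4 x 4 "image" whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)                      # 4 tokens of length 4
pos = [[0.01 * i] * 4 for i in range(len(patches))]  # placeholder encoding (assumption)
tokens = add_position(patches, pos)
```

With the 256×256 images described in the preprocessing paragraph, a patch size of 16 (an assumed value, not stated in the source) would give 256 tokens per image.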
[0083] (13) Structured process knowledge encoding: a multi-layer fully connected network (MLP) performs a nonlinear mapping on the concatenated feature vector of the process parameters:
[0084] $$h^{(l)} = \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad h^{(0)} = x_s$$
[0085] where $x_s$ is the input structured process feature vector; $W^{(l)}$ and $b^{(l)}$ are the weight matrix and bias vector of the $l$-th layer, respectively; $\sigma$ is a nonlinear activation function; and the final output is the structured modality encoding vector $F_s$.
[0086] (14) Unstructured expert knowledge encoding: after the expert text is extracted into an entity-relationship graph, it is encoded using a graph attention network (GAT). In the $l$-th graph convolution layer, node information is aggregated as follows:
[0087] $$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l)}\Big)$$
[0088] where $h_i^{(l)}$ is the feature vector of node $i$ in the $l$-th layer; $\mathcal{N}(i)$ is the set of neighboring nodes of node $i$; $\alpha_{ij}$ is the calculated attention coefficient; $W^{(l)}$ is the weight matrix; $\sigma$ is a nonlinear activation function; and the output is the knowledge-graph modality encoding vector $F_k$.
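The neighbor aggregation in sub-step (14) can be sketched as a single-head graph attention layer. The attention scoring used here (LeakyReLU of a learned vector applied to the concatenated transformed features, normalized by softmax over the neighbors) follows the standard GAT formulation and is an assumption, since the source does not state how the attention coefficients are computed. Node names and weights are hypothetical.

```python
import math

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(v, slope=0.2):
    return v if v > 0 else slope * v

def gat_layer(features, neighbors, W, a):
    """One single-head graph attention layer.
    features:  {node: feature vector}
    neighbors: {node: list of neighbor nodes (include the node itself for a self-loop)}
    W:         weight matrix shared by all nodes
    a:         attention vector of length 2 * output dimension
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    out = {}
    for i, nbrs in neighbors.items():
        # Unnormalized attention score for each neighbor j of node i.
        scores = [leaky_relu(sum(p * q for p, q in zip(a, z[i] + z[j]))) for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]  # attention coefficients, sum to 1
        agg = [sum(al * z[j][c] for al, j in zip(alphas, nbrs))
               for c in range(len(z[i]))]
        out[i] = [max(0.0, v) for v in agg]  # ReLU activation
    return out

# Hypothetical three-node knowledge graph.
features = {"pressure": [1.0, 0.0], "edge_overpolish": [0.0, 1.0], "lower_edge_zone": [1.0, 1.0]}
neighbors = {n: list(features) for n in features}  # fully connected with self-loops
W = [[1.0, 0.0], [0.0, 1.0]]
encoded = gat_layer(features, neighbors, W, a=[0.1, -0.2, 0.3, 0.05])
```

A graph-level encoding vector could then be obtained by pooling the node outputs, for example by averaging; the source does not specify the pooling step.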
[0089] Step 2: Construct the cross-modal cross-attention fusion mechanism (CMAF) to generate a unified representation. As shown in Figure 3, the independent features output in Step 1 are semantically aligned. First, the semantic association matrix between any two modalities i and j is calculated:
[0090]
[0091] where the quantities involved are the encoded feature vectors of the i-th and j-th modalities; the learnable query projection matrix and key projection matrix; and the resulting semantic association score matrix between the two modalities.
[0092] Subsequently, attention information is propagated, and a gating mechanism is used for adaptive feature integration:
[0093]
[0094] where the quantities involved are the unified representation vector of the multi-source data produced by the final fusion; the modality index set; the feature vector of each modality updated after cross-modal attention propagation; the original feature vector of that modality; and the gating coefficient matrix generated by a learnable network.
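One possible reading of the CMAF step is sketched below. Everything specific in it is an assumption made for illustration: each modality is represented by one vector of equal length, the query and key projections are taken as identity, association scores are scaled dot products normalized by softmax over the other modalities, the gate is a given scalar per modality (in the described method it is produced by a learnable network), and the unified representation is the concatenation of the gated features.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_fuse(feats, gate):
    """feats: {modality: vector of length d}; gate: {modality: scalar in [0, 1]}.
    Returns the concatenated unified representation."""
    names = list(feats)
    d = len(feats[names[0]])
    fused = {}
    for i in names:
        others = [j for j in names if j != i]
        # Semantic association scores between modality i and every other modality.
        scores = [dot(feats[i], feats[j]) / math.sqrt(d) for j in others]
        attn = softmax(scores)
        # Attention propagation: features gathered from the other modalities.
        propagated = [sum(a * feats[j][c] for a, j in zip(attn, others))
                      for c in range(d)]
        # Gated integration of propagated and original features.
        g = gate[i]
        fused[i] = [g * p + (1.0 - g) * f for p, f in zip(propagated, feats[i])]
    return [v for i in names for v in fused[i]]

# Hypothetical 2-dimensional encodings for the four modalities.
feats = {"time": [1.0, 0.0], "image": [0.8, 0.2], "struct": [0.0, 1.0], "knowledge": [0.5, 0.5]}
gate = {name: 0.5 for name in feats}
unified = cross_modal_fuse(feats, gate)
```

A gate of 0 keeps the original feature untouched and a gate of 1 replaces it entirely with information propagated from the other modalities, which is what lets the network decide per modality how much cross-modal context to absorb.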
[0095] Step 3: Construct the physics and data dual-branch prediction network to achieve mechanism-enhanced inference. As shown in Figures 4 and 5, the unified representation is input into the dual-branch inference network, specifically:
[0096] (31) Physical mechanism branch: as shown in Figure 4, Hertz contact theory and the Reynolds equation are used to generate the physical prior features, namely the theoretical pressure field and the liquid film thickness. The network prediction is then combined with the extended Preston equation to calculate the prediction tensor of the physical branch:
[0097]
[0098] where the quantities involved are the physical baseline removal rate at a given wafer coordinate; an empirical coefficient; and the local pressure distribution, relative sliding speed, and lubrication efficiency factor extracted by the neural network.
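As an illustration of the physical branch, the sketch below evaluates a removal-rate map as the product of an empirical coefficient, local pressure, relative sliding speed, and a lubrication efficiency factor. This multiplicative form is an assumption based on the classical Preston equation (removal rate proportional to pressure times velocity); the exact extended form used in the method is not recoverable from the text. The non-uniformity measure (standard deviation divided by mean of the removal map) is a commonly used definition and is likewise not taken from the source.

```python
import statistics

def preston_removal_rate(k_p, pressure, velocity, eta):
    """Removal-rate map RR = k_p * P * V * eta, evaluated point by point.
    pressure, velocity, eta: 2D lists of the same shape (wafer grid)."""
    return [[k_p * p * v * e for p, v, e in zip(p_row, v_row, e_row)]
            for p_row, v_row, e_row in zip(pressure, velocity, eta)]

def nonuniformity(removal_map):
    """Within-wafer non-uniformity as population std / mean of the map."""
    values = [v for row in removal_map for v in row]
    return statistics.pstdev(values) / statistics.fmean(values)

# Hypothetical 2 x 2 wafer grid with a small pressure imbalance.
pressure = [[1.0, 1.1], [0.9, 1.0]]
velocity = [[1.0, 1.0], [1.0, 1.0]]
eta = [[1.0, 1.0], [1.0, 1.0]]
removal = preston_removal_rate(1.0, pressure, velocity, eta)
wiwnu = nonuniformity(removal)  # about 0.07 for this toy grid
```

In the described method the pressure, speed, and lubrication fields come from the neural network and the Hertz/Reynolds priors rather than from fixed grids as here.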
[0099] (32) Data-driven large model branch: as shown in Figure 5, based on the unified representation, relevant content is retrieved from the knowledge graph and formatted as prompt text P, which is input into a large language model (LLM) and a physics-informed Transformer (PIT) to output the data-driven prediction tensor together with its uncertainty variance:
[0100]
[0101] where the quantities involved are the prediction tensor generated by the data-driven branch; the variance of the confidence interval of the corresponding prediction; the hidden-state features; the formatted prompt text obtained by retrieval; and the feature vector concatenation operation. Specifically, the PIT explicitly injects a physical affinity matrix, characterizing the physical coupling across the wafer, during attention computation.
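The retrieval and prompt-formatting part of sub-step (32) can be sketched without any model call. The example assumes knowledge entries are stored with embedding vectors and ranked by cosine similarity to the query representation; the entries, embeddings, prompt wording, and function names are all hypothetical, and the LLM and PIT stages are not shown.

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, entries, top_k=2):
    """entries: list of (embedding, (process_parameter, nonuniformity_type, strategy)).
    Returns the top_k triples ranked by cosine similarity to the query."""
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [triple for _, triple in ranked[:top_k]]

def format_prompt(triples, question):
    """Format retrieved triples as plain prompt text."""
    facts = "\n".join(f"- {param}: {nu_type}; suggested action: {strategy}"
                      for param, nu_type, strategy in triples)
    return f"Known process knowledge:\n{facts}\n\nQuestion: {question}"

# Hypothetical knowledge entries with 2-dimensional embeddings.
knowledge = [
    ([1.0, 0.0], ("high edge pressure", "edge over-polishing", "reduce edge-zone pressure")),
    ([0.0, 1.0], ("worn polishing pad", "removal rate drift", "condition or replace the pad")),
    ([0.7, 0.7], ("low slurry flow", "center-slow profile", "increase slurry flow rate")),
]
hits = retrieve([0.9, 0.1], knowledge, top_k=2)
prompt = format_prompt(hits, "Predict the within-wafer non-uniformity trend.")
```

The triple layout mirrors the “process parameter, non-uniformity type, optimization strategy” structure of the knowledge graph described in the preprocessing paragraph.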
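[0102] (33) Prediction result fusion:
[0103] $$\hat{Y} = w_{\mathrm{phys}}\,\hat{Y}_{\mathrm{phys}} + w_{\mathrm{data}}\,\hat{Y}_{\mathrm{data}}$$
[0104] where $\hat{Y}$ is the final full-field prediction tensor for the non-uniformity indices (WIWNU and WIDNU); $\hat{Y}_{\mathrm{phys}}$ and $\hat{Y}_{\mathrm{data}}$ are the prediction tensors of the physical branch and the data-driven branch; and $w_{\mathrm{phys}}$ and $w_{\mathrm{data}}$ are the fusion weights of the physical and data-driven terms, learned automatically by a gating network.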
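The fusion of the two branches can be sketched as a weighted sum whose weights come from a softmax over two gating logits. The softmax normalization is an assumption (the source only says the weights are learned by a gating network), and the logits are passed in as plain numbers rather than produced by a network.

```python
import math

def fusion_weights(logit_phys, logit_data):
    """Softmax over two gating logits, so the two weights sum to 1."""
    m = max(logit_phys, logit_data)
    e_p, e_d = math.exp(logit_phys - m), math.exp(logit_data - m)
    return e_p / (e_p + e_d), e_d / (e_p + e_d)

def fuse(y_phys, y_data, logit_phys=0.0, logit_data=0.0):
    """Element-wise weighted fusion of the two branch predictions."""
    w_p, w_d = fusion_weights(logit_phys, logit_data)
    return [w_p * p + w_d * d for p, d in zip(y_phys, y_data)]

# Hypothetical branch outputs for two non-uniformity indices.
fused = fuse([0.030, 0.040], [0.034, 0.036])
```

With equal logits the result is the plain average of the two branches; a larger physical logit pulls the prediction toward the mechanism-driven value, which is the behavior one would want when the data branch is operating outside its training distribution.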
[0105] Step 4: Calculate the joint loss function based on multiple physical conservation laws to optimize the network model. Specifically, during the backpropagation training phase, the overall optimization objective, a loss function that includes physical regularization, is calculated:
[0106]
[0107] where the terms are the total loss for end-to-end training; the mean squared error of the data fit between predicted and actual values; the total physical penalty term, grouped in parentheses in the formula, which combines the mass conservation constraint loss, the energy conservation constraint loss, and the monotonicity constraint loss with their corresponding weights; the knowledge-retrieval loss; and the regularization term for uncertainty estimation. Optimizing this joint loss function drives the model to iterative convergence.
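The structure of the joint loss can be illustrated as below. The sketch assumes the total is a plain sum of the data term, the weighted physical penalty, the retrieval loss, and the uncertainty regularizer, with no extra coefficients on the last two; it also assumes squared residuals for the conservation terms and one concrete monotonicity penalty (predictions ordered by increasing pressure should not decrease). None of these specific forms is given in the source.

```python
def mse(pred, target):
    """Mean squared error of the data fit."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def monotonicity_penalty(pred_by_pressure):
    """Penalize any decrease in predicted removal as pressure increases.
    pred_by_pressure: predictions sorted by ascending applied pressure."""
    pairs = list(zip(pred_by_pressure, pred_by_pressure[1:]))
    if not pairs:
        return 0.0
    return sum(max(0.0, a - b) ** 2 for a, b in pairs) / len(pairs)

def joint_loss(pred, target, mass_residual, energy_residual,
               retrieval_loss, uncertainty_reg, lambdas=(1.0, 1.0, 1.0)):
    """Data term + (weighted physical penalties) + retrieval + uncertainty terms."""
    l_mass, l_energy, l_mono = lambdas
    physics = (l_mass * mass_residual ** 2
               + l_energy * energy_residual ** 2
               + l_mono * monotonicity_penalty(pred))
    return mse(pred, target) + physics + retrieval_loss + uncertainty_reg

# Hypothetical values: a small data error and a small mass-balance residual.
loss = joint_loss(pred=[1.0, 2.0, 3.1], target=[1.0, 2.0, 3.0],
                  mass_residual=0.1, energy_residual=0.0,
                  retrieval_loss=0.0, uncertainty_reg=0.0)
```

In training, each term would be computed on differentiable tensors so that gradient descent can update the network parameters; the scalar version here only shows how the terms combine.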
[0108] To verify the effectiveness and superiority of the mechanism-enhanced multimodal large model framework of this application in virtual measurement of non-uniformity in wafer CMP, the embodiments of this application carry out systematic verification from the dimensions of dataset construction, cross-modal alignment performance verification, inference prediction accuracy comparison, model generalization test and ablation experiment. The core objectives of the experiment include: 1) verifying the effectiveness of unified representation and cross-modal fusion of multi-source heterogeneous data; 2) evaluating the effect of physical mechanism embedding on improving prediction accuracy and physical consistency; 3) demonstrating the performance advantages of the framework compared with traditional data-driven models and pure large models; 4) verifying the generalization ability of the model under different process conditions.
[0109] The experimental hardware platform for this application is configured with an Intel Xeon Gold 6330 processor, equipped with four NVIDIA RTX 4060 GPUs with 80 GB of video memory each and 256 GB of DDR4 memory. The software environment is built on Python 3.9; deep learning model training relies on the PyTorch 2.0 framework; the multimodal embedding and Transformer networks are implemented with the Hugging Face Transformers library; knowledge graph construction and retrieval use the Neo4j 5.0 database; and data preprocessing and result visualization are done with Pandas, NumPy, and Matplotlib.
[0110] Because virtual measurement of CMP non-uniformity is a regression task, mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination (R²) are selected as the core accuracy indicators. Physical consistency error (PCE) is also introduced to assess the agreement between the model output and the physical mechanism; a smaller PCE indicates stronger physical consistency. For uncertainty quantification, prediction interval coverage probability (PICP) and mean prediction interval width (MPIW) are used. The indicators are defined as follows:
[0111] $$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|$$
[0112] $$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$
[0113] $$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$
[0114]
[0115] $$\mathrm{PICP} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left(L_i \le y_i \le U_i\right)$$
[0116] $$\mathrm{MPIW} = \frac{1}{N}\sum_{i=1}^{N}\left(U_i - L_i\right)$$
[0117] where $N$ is the sample size; $\hat{y}_i$ is the virtual measurement value output by the model; $y_i$ is the actual measured value; $\bar{y}$ is the mean of the actual values; the PCE is computed against the theoretical values calculated from the extended Preston equation; $\mathbb{1}(\cdot)$ is the indicator function; and $L_i$ and $U_i$ are the lower and upper bounds of the virtual measurement interval for the i-th sample.
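The accuracy and interval indicators can be computed with a few lines of plain Python. The sketch below uses the standard definitions of MAE, RMSE, R², PICP, and MPIW; PCE is left out because it depends on reference values from the extended Preston equation. Function names and the sample numbers are illustrative.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def picp(y_true, lower, upper):
    """Fraction of true values that fall inside their prediction interval."""
    covered = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return covered / len(y_true)

def mpiw(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Illustrative values only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
lower, upper = [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
    "PICP": picp(y_true, lower, upper),
    "MPIW": mpiw(lower, upper),
}
```

PICP and MPIW are best read together: a wide interval trivially raises coverage, so a good uncertainty estimate achieves high PICP with small MPIW.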
[0118] The experimental dataset for this embodiment comes from actual production data of a 12-inch wafer CMP production line at a semiconductor manufacturing company. It covers three typical CMP process scenarios: shallow trench isolation (STI), copper interconnect (Cu), and through-silicon via (TSV). A total of 1000 complete wafer processing batches were collected, each integrating four types of multimodal data samples. Time-series process signal data are collected by sensors, capturing real-time parameters such as polishing pressure, spindle speed, slurry flow rate, and polishing pad temperature at a sampling frequency of 10 Hz; each batch corresponds to 1200 data points, forming a 1200×6 time-series matrix. Spatial graphic feature data are acquired with an atomic force microscope (AFM) and an optical profilometer, and include 512×512 wafer surface topography images and non-uniformity distribution heat maps; each batch has three image samples from different measurement areas. Structured process knowledge data cover process parameters (such as polishing time, pressure level, and slurry type), wafer properties (such as material composition and thickness), and equipment parameters (such as polishing pad type and wear level), for a total of 18 structured feature dimensions. Unstructured expert knowledge data come from CMP process optimization manuals, fault diagnosis reports, and expert experience documents, yielding 5000 standardized knowledge entries after text extraction and structuring. To protect core process data, all raw data collected on site underwent feature extraction and numerical encryption. This processing does not affect the statistical characteristics of the data distribution or the reliability of subsequent analysis, so industrial confidentiality requirements are met while the authenticity and reproducibility of the conclusions are preserved.
[0119] Data preprocessing covers two parts: modality-specific processing and dataset partitioning. For time-series process signals, outliers are removed using the 3σ criterion, missing values are filled by linear interpolation, and, after Z-score standardization, time-domain and frequency-domain features are extracted using the short-time Fourier transform. For spatial graphic data, the images are first converted to grayscale and resized to 256×256, Gaussian filtering is applied to suppress noise, and deep texture and contour features are then extracted with a convolutional neural network. For structured process knowledge data, categorical variables are one-hot encoded, continuous variables are normalized, and redundant features are removed so that the retained features satisfy the criterion of a variance inflation factor below 10. For unstructured expert knowledge data, a BERT model performs text segmentation and entity recognition, and a knowledge graph built on triples of the form "process parameters-non-uniformity type-optimization strategy" is then constructed; this graph contains 12 entity types and 8 relationship types. After preprocessing of all modalities, the 1000 batches are divided in a 7:1:2 ratio into a training set of 700 batches, a validation set of 100 batches, and a test set of 200 batches.
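The time-series part of this workflow and the 7:1:2 partition can be illustrated with a minimal standard-library sketch. It assumes a single one-dimensional signal channel, treats 3σ outliers as missing values before interpolation, and splits batches in their given order; the function names, the synthetic signal, and the choice not to shuffle are all assumptions, since the text does not state these details.

```python
import statistics

def three_sigma_mask(values):
    """Replace points more than 3 standard deviations from the mean with None."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v if abs(v - mu) <= 3 * sigma else None for v in values]

def interpolate_missing(values):
    """Fill None entries linearly between the nearest valid neighbours.

    Assumes at least one valid entry; gaps at the edges copy the nearest value.
    """
    filled = list(values)
    valid = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled

def z_score(values):
    """Standardize to zero mean and unit variance (requires a non-constant signal)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def split_batches(batch_ids, ratios=(7, 1, 2)):
    """Partition batch identifiers into train / validation / test by ratio."""
    total = sum(ratios)
    n = len(batch_ids)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (batch_ids[:n_train],
            batch_ids[n_train:n_train + n_val],
            batch_ids[n_train + n_val:])

# Synthetic signal with one injected spike, for illustration only.
signal = [10.0 + 0.1 * (i % 5) for i in range(20)]
signal[7] = 100.0
cleaned = interpolate_missing(three_sigma_mask(signal))
standardized = z_score(cleaned)
train, val, test = split_batches(list(range(1000)))
```

Applied to 1000 batch identifiers, `split_batches` returns 700, 100, and 200 identifiers, matching the partition sizes stated above. Note that the 3σ rule only flags a spike when the series is long enough, which is why the synthetic signal uses 20 points.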
[0120] 1) Cross-modal alignment and embedding performance verification experiments
[0121] This experiment verifies the effectiveness of the proposed heterogeneous encoders and cross-modal attention fusion mechanism using three comparison schemes. Baseline scheme 1 takes only the time-series signals as input and builds a single-modal embedding without any cross-modal fusion. Baseline scheme 2 fuses multimodal features by simple concatenation, without a cross-modal attention alignment module. The proposed scheme uses the heterogeneous encoder architecture together with the cross-modal cross-attention mechanism: time-series signals are assigned to the Transformer encoder, image features to the CNN encoder, structured data to the MLP encoder, and text knowledge to the BERT encoder, so that each modality is encoded by an encoder suited to it. The experiment computes the cosine similarity between the features of different modalities, together with the preliminary accuracy of the downstream prediction task (MAE and R²), to quantitatively evaluate the cross-modal alignment and embedding performance of each scheme, as shown in Table 1.
[0122] Table 1 Performance Comparison of Different Cross-Modal Alignment and Embedding Schemes
[0123]
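The cosine-similarity measure used to assess cross-modal alignment can be sketched as follows. The sketch assumes each modality is summarized by one non-zero embedding vector of equal length and that alignment is reported as the average over all modality pairs; the aggregation rule, the function names, and the toy vectors are assumptions rather than details given in the text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all distinct modality pairs."""
    names = sorted(embeddings)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = sum(cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)

# Toy embeddings for illustration only.
toy = {
    "signal": [1.0, 0.0],
    "image": [1.0, 0.0],
    "text": [0.0, 1.0],
}
alignment = mean_pairwise_similarity(toy)
```

A higher average indicates that the embeddings of different modalities point in more similar directions after fusion.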
[0124] 2) Comparison Experiment of Reasoning and Prediction Performance
[0125] To comprehensively verify the virtual measurement performance of the proposed framework, this application selected five mainstream models for comparison, as shown in Table 2. They fall into three categories: traditional machine learning models, purely data-driven deep learning models, and mechanism-enhanced models. The traditional machine learning models are Random Forest (RF) and Extreme Gradient Boosting (XGBoost); the purely data-driven deep learning models are a Convolutional Neural Network-Long Short-Term Memory network (CNN-LSTM) and the basic Transformer; and the mechanism-enhanced model is a Physics-Informed Neural Network (PINN) that embeds only the Preston equation. To keep the comparison fair, all comparison models and the proposed framework were trained with the same training and validation sets, and the hyperparameters of each model were tuned by grid search. The performance of the models under the different CMP process scenarios is compared in Table 3.
[0126] Table 2 Comparison of CMP non-uniformity virtual measurement performance of different models
[0127]
[0128] Table 3 Performance comparison of different models under various CMP process scenarios (taking R² and MAE as examples)
[0129]
[0130] 3) Ablation experiment
[0131] To verify the necessity of each core module of the proposed framework, this application designed an ablation experiment. By sequentially removing the multimodal fusion module, the physical mechanism module embedding the Preston equation and Hertz contact theory, and the knowledge graph retrieval enhancement module, the performance differences of each ablation model and the complete framework were compared. The experimental results are shown in Table 4.
[0132] Table 4 Comparison of Ablation Test Performance Results
[0133]
[0134] 4) Generalization verification experiment
[0135] To systematically verify the generalization ability of the proposed framework under unknown process conditions, this application constructs an independent generalization test set and selects the basic Transformer, a typical pure data-driven model, as a control. The performance comparison between the two is used to quantitatively evaluate the adaptability of the proposed framework to new conditions. The specific experimental design and result analysis are as follows.
[0136] The generalization test set uses novel process conditions that were not involved in model training. It consists of 200 complete wafer processing batches obtained with a SiO2-based modified slurry, whose composition differs by 30% from the conventional slurry in the training set, and polishing pads of 600 mm diameter. The data composition and preprocessing procedures are the same as for the training set to keep the comparison fair. The evaluation metrics are MAE, RMSE, R², and PCE. All experiments were repeated three times to ensure the reliability of the results. The quantitative results of the generalization verification experiments are shown in Table 5.
[0137] Table 5 Performance comparison of different models on the generalization test set
[0138]
[0139] Example 2
[0140] This embodiment provides a wafer CMP virtual measurement system based on a mechanism-enhanced multimodal large model. The system runs the method of Embodiment 1 above and is deployed as software algorithm modules on semiconductor manufacturing workshop control equipment or on a cloud platform. The system includes:
[0141] Multi-source data unified representation and fusion module: connects to the CMP machine sensor network, independently encodes the features of the four types of modal data, and fuses them through CMAF cross-modal attention to generate a unified representation of the multi-source features.
[0142] Mechanism-data dual-branch hybrid modeling module: integrates a physical computing engine and a physics-informed neural network to perform prior extraction based on Hertz / Reynolds theory and physical-branch prediction based on the extended Preston equation.
[0143] Large-model retrieval-enhanced prediction and optimization module: incorporates a large language model and a PIT model to perform contextual retrieval and global non-uniformity index inference and, during model training, computes the joint loss function with mass / energy conservation constraints to close the system update loop.
[0144] Example 3
[0145] This embodiment provides a computer device including a memory and a processor. The memory stores a computer program (such as a deep learning algorithm script written in Python), and when the processor executes the program, it implements all the steps of heterogeneous feature extraction, CMAF fusion, dual-branch inference and prediction, and joint loss optimization and update described in Embodiment 1.
[0146] Example 4
[0147] This embodiment also provides a computer-readable storage medium (such as a solid-state drive, optical disk, etc.) that stores a corresponding computer instruction program. When the program is read and executed by a processor, it implements any of the method steps described in Embodiment 1.
[0148] Finally, it should be pointed out that the above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model, characterized in that it includes the following steps:
Step 1: Acquire multi-source heterogeneous data and perform feature extraction with heterogeneous encoders. Collect multimodal data in the CMP process, including time-series process signals, spatial graphic features, structured process knowledge, and unstructured expert knowledge. For these four types of data, use a hybrid 1D-CNN+Transformer architecture, a vision Transformer, a multilayer perceptron, and a graph attention network, respectively, as heterogeneous encoders to extract the corresponding single-modal representation vectors in parallel.
Step 2: Construct a cross-modal attention fusion mechanism to generate a unified representation. Using the single-modal representation vectors from Step 1 as input, calculate the semantic association matrix between the feature vectors of the modalities and propagate attention information. Introduce a learnable gating mechanism to adaptively weight and integrate the propagated modal features with the original features, generating a unified representation vector of the multi-source data.
Step 3: Construct a dual-branch prediction network combining physical mechanism and data-driven modeling. The unified representation vector output from Step 2 is input into two parallel prediction branches: the physical mechanism branch, which calculates the theoretical contact pressure using Hertz contact theory, estimates the slurry film thickness using the Reynolds equation as a prior feature, and, combined with the extended Preston equation, outputs a physical mechanism-driven prediction result; and the knowledge-enhanced data branch, which uses the unified representation to retrieve from the expert knowledge graph, formats the retrieved content into a prompt, and inputs it into the Large Language Model (LLM), while extracting spatial sequence features with a Physics-Informed Transformer (PIT) that introduces a physical affinity matrix, and outputs the data-driven prediction result and confidence interval. The prediction results of the two branches are weighted and fused using learnable weights to generate the non-uniformity prediction tensor.
Step 4: Optimize the network model by calculating a joint loss function based on multiple physical conservation laws. During the backpropagation training phase, construct and optimize the end-to-end joint loss function, which includes a data fitting term, multiple physical conservation constraints, a knowledge retrieval term, and an uncertainty regularization term. Minimize the joint loss function using the gradient descent algorithm to drive the iterative update of the network parameters until the model converges.
2. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that the time-series process signal mentioned in step one is acquired through a time-series signal sensor, an acoustic emission sensor, or a motor drive current sensor.
3. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that, in step three, the physical mechanism-driven prediction result may also be output by combining the Archard wear equation or a nonlinear removal rate equation based on the Tseng-Wang model; and the spatial sequence features may also be extracted by a Mamba or bidirectional LSTM network that introduces the physical affinity matrix as a physical attention bias, from which the data-driven prediction result and confidence interval are output.
4. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that, in step one, using the hybrid 1D-CNN+Transformer architecture, the vision Transformer, the multilayer perceptron, and the graph attention network as heterogeneous encoders to extract the corresponding single-modal representation vectors in parallel specifically proceeds as follows:
(11) Time-series process signal encoding: the acquired time-series feature matrix is input into a hybrid architecture consisting of one-dimensional convolution and a Transformer. Local features are first extracted by a one-dimensional convolutional layer and are then fed into the Transformer encoding layer for multi-head self-attention (MHSA) processing. In these operations, the input is the time-series feature matrix; the convolution kernel weights and biases are applied through the convolution operation *, followed by an activation function, to give the local feature matrix; the query, key, and value matrices are obtained from the local feature matrix by linear transformation; the attention computation uses the dimension of the key vector and the normalized exponential (softmax) function; and the output is the time-series modality encoding vector.
(12) Spatial graphic feature encoding: the acquired spatial feature maps are processed with a vision Transformer architecture. The image is first flattened into a sequence by a patch embedding layer and positional encoding is added; here the input is the spatial graphic feature map, a feature flattening operation is applied to it, and the positional encoding matrix is added to give the serialized features. After multiple layers of self-attention processing, the image modality encoding vector is output.
(13) Structured process knowledge encoding: a multi-layer fully connected network applies a nonlinear mapping to the concatenated feature vector of process parameters; here the input is the structured process feature vector, each layer has its own weight matrix and bias vector and is followed by a nonlinear activation function, and the final output is the structured modality encoding vector.
(14) Unstructured expert knowledge encoding: after the expert text is extracted into an entity-relationship graph, a graph attention network is used for encoding. In each graph convolution layer, node information is aggregated using the feature vector of node i in the l-th layer, the set of neighboring nodes of node i, the computed attention coefficients, and a weight matrix; the output is the knowledge graph modality encoding vector.
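For reference, the standard forms of the operations named in this claim, consistent with the quantities it defines, can be written as below. All symbol names are assumptions chosen for illustration: X_t and X_s for the time-series and spatial inputs, W_c and b_c for the convolution kernel weights and biases, H_local for the local feature matrix, E_pos for the positional encoding matrix, and α_ij for the attention coefficients over the neighbor set N(i). The formulas as filed may differ in notation or detail.

```latex
\begin{aligned}
H_{\mathrm{local}} &= \sigma\left(W_c * X_t + b_c\right) \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
Z_0 &= \mathrm{Flatten}(X_s) + E_{\mathrm{pos}} \\
h^{(l)} &= \sigma\left(W^{(l)} h^{(l-1)} + b^{(l)}\right) \\
h_i^{(l+1)} &= \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l)}\Big)
\end{aligned}
```

The first two lines correspond to item (11), the third to item (12), the fourth to item (13), and the last to item (14).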
5. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that, in step two, constructing the cross-modal cross-attention fusion mechanism to generate a unified representation specifically consists of semantically aligning the independent features output in step one. First, the semantic association matrix between any two modalities i and j is calculated from the encoded feature vectors of the i-th and j-th modalities, using a learnable query projection matrix and a learnable key projection matrix; the result is the semantic association score matrix between the modalities. Attention information is then propagated, and a gating mechanism adaptively integrates the features: over the set of modality indices, the feature vector updated by cross-modal attention propagation and the original feature vector are combined using a gating coefficient matrix generated by a learnable network, giving the unified representation vector of the multi-source data produced by the final fusion.
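One possible reading of this gated cross-modal attention step is sketched below in plain Python. Several details are assumptions rather than statements of the claim: each modality is reduced to a single vector of equal length, association scores are scaled dot products normalized with softmax over the other modalities, the gate is a learnable logit per modality and dimension passed through a sigmoid, and the fused output is the mean of the gated features over modalities. The function and parameter names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable normalized exponential."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_modal_fuse(features, w_query, w_key, gate_logits):
    """Return (fused vector, attention weights) for a dict of modality vectors."""
    names = sorted(features)
    dim = len(features[names[0]])
    fused = [0.0] * dim
    attention = {}
    for i in names:
        query = matvec(w_query, features[i])
        others = [j for j in names if j != i]
        # Association score between modality i and each other modality j.
        scores = []
        for j in others:
            key = matvec(w_key, features[j])
            scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(dim))
        weights = softmax(scores)
        attention[i] = dict(zip(others, weights))
        # Propagated feature: attention-weighted sum of the other modalities.
        propagated = [sum(w * features[j][d] for w, j in zip(weights, others))
                      for d in range(dim)]
        # Gate blends propagated and original features per dimension.
        for d in range(dim):
            gate = sigmoid(gate_logits[i][d])
            blended = gate * propagated[d] + (1.0 - gate) * features[i][d]
            fused[d] += blended / len(names)
    return fused, attention

# Toy inputs for illustration only.
features = {"signal": [1.0, 0.0], "image": [0.0, 1.0], "text": [1.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
open_gates = {name: [0.0, 0.0] for name in features}
fused, attention = cross_modal_fuse(features, identity, identity, open_gates)
```

A gate logit of zero gives equal weight to the propagated and original features; strongly negative logits recover the original features, which makes the gate's role easy to check.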
6. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that, in step three, constructing the dual-branch prediction network for physical mechanism and data specifically includes:
(31) Physical prior features, namely the theoretical pressure field and the liquid film thickness, are generated using Hertz contact theory and the Reynolds equation; the network prediction is combined with the extended Preston equation to calculate the prediction tensor of the physical branch. In this equation, the physical baseline removal rate at a given wafer coordinate is determined by an empirical coefficient together with the local pressure distribution, the relative sliding speed, and the lubrication efficiency factor extracted by the neural network.
(32) Based on the unified representation described in step two, data are retrieved from the knowledge graph and formatted as prompt text P, which is input into a Large Language Model (LLM) and a Physics-Informed Transformer (PIT) to output the data-driven prediction tensor together with its uncertainty variance. Here the quantities involved are the prediction tensor generated by the data-driven branch, the variance of the confidence interval of the corresponding prediction, the hidden state features, the formatted prompt text from retrieval, and the feature vector concatenation operation; the PIT explicitly injects the physical affinity matrix characterizing wafer physical coupling during attention computation.
(33) Prediction result fusion: the final output is the full-field prediction tensor of the non-uniformity indices (WIWNU and WIDNU), obtained with fusion weights for the physical term and the data-driven term that are learned automatically by a gating network.
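The physical branch and the fusion in items (31) and (33) can be illustrated with a short sketch. It assumes a purely multiplicative form of the extended Preston equation (removal rate equal to the empirical coefficient times local pressure, relative sliding speed, and lubrication efficiency factor) and a convex combination of the two branches controlled by one learnable logit. Both forms, and all names and numbers, are assumptions; the claimed equation may include exponents or further terms.

```python
import math

def preston_removal_rate(k, pressure, velocity, efficiency):
    """Assumed form: rate = k * P * V * eta at each wafer location."""
    return [k * p * v * e for p, v, e in zip(pressure, velocity, efficiency)]

def fuse_predictions(physical, data_driven, fusion_logit):
    """Blend the two branches with a weight in (0, 1) derived from a logit."""
    weight = 1.0 / (1.0 + math.exp(-fusion_logit))
    return [weight * a + (1.0 - weight) * b for a, b in zip(physical, data_driven)]

# Toy values for two wafer locations, for illustration only.
physical = preston_removal_rate(2.0, [1.0, 2.0], [3.0, 3.0], [0.5, 1.0])
data_driven = [5.0, 10.0]
fused = fuse_predictions(physical, data_driven, fusion_logit=0.0)
```

A logit of zero weights the two branches equally; in training, the logit would be one of the parameters updated by gradient descent.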
7. The virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model according to claim 1, characterized in that, in step four, calculating the joint loss function based on multiple physical conservation laws to optimize the network model specifically includes: during the backpropagation training phase, calculating the overall optimization objective, a loss function that incorporates physical regularization. Its terms are the total loss of end-to-end training; the mean squared error of the data fit between predicted and actual values; the total physical loss penalty term, written in parentheses, which is made up of the mass conservation constraint loss, the energy conservation constraint loss, and the monotonicity constraint loss, each with its corresponding weight; the knowledge retrieval loss; and the regularization term for uncertainty estimation. Optimizing the joint loss function drives the network model to iterative convergence.
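A form of the joint loss consistent with the terms listed in this claim is sketched below. The symbol names, the weights λ1 to λ3 on the physical terms, and the unweighted retrieval and uncertainty terms are assumptions; the claim states only which terms are present and that the three physical constraint losses carry corresponding weights.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{data}}
  + \left(\lambda_1 \mathcal{L}_{\mathrm{mass}}
  + \lambda_2 \mathcal{L}_{\mathrm{energy}}
  + \lambda_3 \mathcal{L}_{\mathrm{mono}}\right)
  + \mathcal{L}_{\mathrm{retr}}
  + \mathcal{L}_{\mathrm{unc}} \\
\mathcal{L}_{\mathrm{data}} &= \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2
\end{aligned}
```

Here N is the number of training samples, and the predicted and actual non-uniformity values of sample i are written as ŷ_i and y_i.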
8. A system for implementing the virtual measurement method for wafer CMP non-uniformity based on a data-mechanism fusion large model as described in claim 1, characterized in that the system, deployed as software algorithm modules on semiconductor manufacturing workshop control equipment or on a cloud platform, includes: a multi-source data unified representation and fusion module, connected to the CMP machine sensor network and used to independently encode the features of the multimodal data and fuse them through CMAF cross-modal attention to generate a unified representation of the multi-source features; a mechanism-data dual-branch hybrid modeling module, with a built-in physical calculation engine and physics-informed neural network, used to perform prior extraction based on Hertz / Reynolds theory and physical-branch prediction based on the extended Preston equation; and a large-model retrieval-enhanced prediction and optimization module, which incorporates a large language model and a PIT model to perform contextual retrieval and global non-uniformity index inference and, during model training, calculates a joint loss function that incorporates mass / energy conservation constraints to close the system update loop.
9. A computer device, characterized in that it includes a memory and a processor; the memory stores a computer program, and the processor, when executing the program, implements all the steps of heterogeneous feature extraction, CMAF fusion, dual-branch inference and prediction, and joint loss optimization and update as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, It stores a corresponding computer instruction program, which, when read and executed by a processor, implements any of the method steps described in claims 1-7.