A visual detection method for chip wire bonding appearance quality
By constructing a unified detection framework that supports multiple forms of supervision, and combining WideResNet50 and a synthetic anomaly generation strategy, the problems of strong labeling dependence and low accuracy of micro-defect detection in wire bonding appearance inspection are solved, achieving high-precision, real-time micro-defect detection, which meets the needs of the integrated circuit packaging industry.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 明益信(江苏)智能设备有限公司
- Filing Date
- 2026-04-20
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies for wire bonding appearance inspection suffer from problems such as strong labeling dependence, single supervision paradigm, low accuracy in detecting minute defects, and insufficient generalization and adaptability, making it difficult to meet the integrated circuit packaging industry's demand for high precision and high efficiency.
A unified detection framework supporting unsupervised, weakly supervised, and fully supervised detection is constructed. The WideResNet50 network is used for feature extraction, and a multi-scale feature fusion and synthetic anomaly generation strategy are combined to achieve high-precision detection of minute defects through a dual-branch detection model.
It achieves high-precision detection that can be flexibly adapted to different labeling conditions, effectively detects micro-scale targets such as neck cracks, meets real-time detection requirements, and has strong generalization ability and anti-interference performance.
Description
Technical Field
[0001] This invention relates to a visual inspection method for the appearance quality of chip wire bonding, belonging to the technical field of semiconductor packaging visual inspection.
Background Technology
[0002] Wire bonding is a core interconnection process in chip packaging bonding technology. In actual production, gold, copper, or aluminum wires are used to electrically connect the chip pads to the lead frame or substrate pins. It is a crucial step in forming an electrical path between the internal circuitry of the chip and the external package. The appearance quality of the bonding points directly affects the reliability, electrical performance, and service life of the packaged device. Defects in the bonding points can easily lead to device failure.
[0003] In the integrated circuit packaging manufacturing process, visual inspection of the appearance quality of wire bonding is quite challenging, requiring simultaneous inspection of multiple quality indicators such as solder joint location, gold wire arc height, arc shape, solder ball size, and neck integrity. In actual production, common bonding defects include solder joint misalignment, abnormal solder ball size, gold wire collapse, arc crossing, neck cracks, root damage, gold wire breakage, and excess solder wire residue. These defects are minute in size, and if missed, they can directly cause open circuits, short circuits, or reduce device reliability. Therefore, high-precision, non-destructive automated inspection of bonding appearance is a necessary step in integrated circuit packaging production.
[0004] Traditional wire bonding appearance inspection mainly relies on manual microscopic visual inspection or rule-based automated optical inspection (AOI). Manual inspection is greatly affected by operator experience, visual fatigue, and subjective judgment, resulting in low inspection efficiency and difficulty in ensuring consistent results, making it unsuitable for mass production needs. Existing AOI equipment mostly uses template matching and morphological image processing methods to identify defects through preset thresholds, which has significant limitations in practical applications: First, it is sensitive to changes in lighting and the high reflectivity of metal leads, resulting in poor imaging stability and a high risk of false detections; second, adjusting equipment parameters is cumbersome, requiring reconfiguration of inspection rules when product models or package specifications change, impacting production efficiency; third, it lacks the ability to detect weak defects, such as micron-sized neck cracks and minor root damage, leading to missed detections.
[0005] In recent years, deep learning technology has been gradually applied to the field of industrial defect detection, and has also been attempted for wire bonding appearance inspection. However, this scenario presents unique challenges, resulting in unsatisfactory application results. First, the cost of acquiring labeled data is high. Labeling micron-level bonding defects at the pixel level requires professionally trained engineers, resulting in long labeling cycles and poor consistency among different engineers' labeling results. Second, defect samples are scarce. In actual production lines, the proportion of good products is extremely high, and the types of bonding defects are diverse and occur infrequently, making it difficult to construct a balanced and comprehensive supervised training set. In addition, frequent product changes on production lines, along with different packaging types and wire bonding processes, place high demands on the generalization ability of the detection model, making it difficult for existing models to adapt quickly.
[0006] Existing deep learning-based detection methods still have significant limitations. Segmentation methods based on convolutional neural networks lack the ability to model the global context of tiny targets such as micron-level neck cracks, resulting in a high false negative rate. Transformer-based methods can capture long-range dependencies, but they have large model sizes, slow inference, and require a large number of labeled samples, so they fail to meet the real-time deployment needs of industrial production lines. More importantly, existing methods have fixed supervision paradigms: they support either fully supervised learning based on large amounts of labeled data or unsupervised learning based on normal samples, and lack a unified detection framework that can flexibly adapt to different labeling conditions. In actual production, it is common for a batch of products to have only a few labeled anomalous samples, or only image-level labels without pixel-level masks. In such cases, unsupervised methods cannot use the limited supervision information, and fully supervised segmentation models cannot be trained effectively, leading to poor detection results.
[0007] Meanwhile, existing methods lack the ability to model abnormal distributions, especially lacking an effective simulation mechanism for structural defects and real defect morphologies generated during the bonding process. This makes it impossible to accurately distinguish between normal fluctuations and real defects, further affecting the detection accuracy and making it difficult to meet the high-precision and high-efficiency requirements of the integrated circuit packaging industry for wire bonding appearance inspection.
Summary of the Invention
[0008] The purpose of this invention is to address the shortcomings of the prior art and to propose a visual inspection method for the appearance quality of chip wire bonding, which addresses the problems of strong labeling dependence, single supervision paradigm, low accuracy of small defect detection, and insufficient generalization and adaptability in traditional inspection.
[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A visual inspection method for the appearance quality of chip wire bonding includes the following steps:
S1, training dataset construction: construct a training dataset that includes unsupervised samples, weakly supervised samples, and fully supervised samples;
S2, data preprocessing: preprocess the input image;
S3, feature extraction: use the feature extraction backbone network to extract multi-scale feature maps from the preprocessed image, and upsample them to generate the first feature map F;
S4, feature adaptation: process the first feature map F with the feature adaptation module to generate a second feature map A for the segmentation task;
S5, synthetic anomaly generation: based on the supervision form of the current sample, configure a synthetic anomaly generation strategy to generate a synthetic anomaly mask and corresponding perturbation features, and inject them into the first feature map F and the second feature map A, respectively, to form the third feature map MixF and the fourth feature map MixA;
S6, dual-branch detection, divided into a training phase and an inference phase. The training phase includes: inputting the fourth feature map MixA into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the third feature map MixF with the pixel-level anomaly score map M0, inputting the concatenation into the classification head, and outputting an image-level anomaly score s. The inference phase includes: inputting the second feature map A into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the first feature map F with the pixel-level anomaly score map M0, inputting the concatenation into the classification head, and outputting an image-level anomaly score s;
S7, model training: calculate the loss function based on the output of the training phase in step S6, and update the parameters of the feature extraction backbone network, the feature adaptation module, the segmentation head, and the classification head based on the gradient of the loss function;
S8, inference detection: sequentially execute steps S2, S3, S4 and the inference phase of step S6 on the image of the wire bonding area of the chip to be inspected, and output the image-level anomaly score s of the inference phase as the defect judgment result.
[0010] Preferably, in step S3, the feature extraction backbone network adopts the WideResNet50 network, and the shallow convolution parameters of the feature extraction backbone network are kept frozen during training, with only the high-level feature layer parameters being updated.
[0011] Preferably, in step S4, the feature adaptation module includes a 1×1 convolutional layer, a 3×3 convolutional layer, and an attention module connected in sequence, wherein the 1×1 convolutional layer is used for channel compression and semantic information fusion, the 3×3 convolutional layer is used for extracting local structure and enhancing spatial information, and the attention module is used for focusing on key regions and suppressing background noise.
[0012] Preferably, in step S5, the synthetic anomaly generation strategy includes a dual-path random selection mechanism: during training, the diffusion model anomaly generation path based on region mask is selected with a 50% probability, and the noise-guided anomaly generation path based on region restriction is selected with a 50% probability.
[0013] Preferably, the anomaly generation path of the diffusion model based on region mask includes: randomly sampling normal images from the training dataset as input, using region masks to control the position and size of the generated anomaly images, injecting the embedding vector as a condition through a cross-attention module to generate the expected anomaly image, and in each step of the denoising process, the region within the mask box is preserved, while the region outside the box is replaced by a noise version.
[0014] Preferably, the noise-guided anomaly generation path based on region restriction includes: randomly generating anomalies of different shapes in the feature space using one of Gaussian noise, fractal noise, or simplex noise; obtaining an anomaly mask by thresholding the noise map; removing the actual anomaly region to generate a synthetic anomaly mask; using the synthetic anomaly mask to restrict Gaussian noise to a specific region; generating final noise and adding it to the features to create a synthetic anomaly.
[0015] Preferably, the anomaly generation path of the diffusion model based on region mask adopts a pre-trained diffusion model, which does not intervene in the first 30% of epochs during training, and then intervenes randomly with a 50% probability starting from the 31st epoch.
[0016] Preferably, in step S6, the segmentation head includes parallel 3×3 convolutional layers, 5×5 convolutional layers, dilated convolutional layers, and 1×1 convolutional layers, wherein the parallel convolutional layers are used to fuse multi-scale features and output a single-channel anomaly score map. The classification head includes a 5×5 convolutional block, a pooling layer, and a fully connected layer. The 5×5 convolutional block is used to capture global context information, the pooling layer is used for feature dimensionality reduction, and the fully connected layer is used to output image-level anomaly scores.
[0017] Preferably, in step S7, the loss function includes a segmentation loss and a classification loss; the segmentation loss includes a truncation loss, an edge loss, and a focal loss, and the classification loss adopts a binary cross-entropy loss or a focal loss.
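As an illustration of the two classification-loss options named above, the following is a minimal sketch that assumes the standard textbook definitions of binary cross-entropy and of the focal loss (a cross-entropy term down-weighted by a focusing exponent). The function names and the default values of `gamma` and `alpha` are assumptions chosen for illustration and are not taken from the patent; the truncation loss and edge loss are not sketched because the text does not define them.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """BCE for one predicted probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma so that
    well-classified (easy) examples contribute less.
    gamma and alpha defaults are illustrative assumptions."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_pixel_focal_loss(score_map, mask, gamma=2.0, alpha=0.25):
    """Average focal loss over a 2D score map (values in (0, 1)) and a 0/1 mask."""
    losses = [
        binary_focal_loss(p, y, gamma, alpha)
        for score_row, mask_row in zip(score_map, mask)
        for p, y in zip(score_row, mask_row)
    ]
    return sum(losses) / len(losses)
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the binary cross-entropy, which is a convenient sanity check when tuning the focusing exponent.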
[0018] Preferably, during the model training process in step S7, the loss function is calculated based on the supervision information of the current sample, the pixel-level anomaly score map M0 output during the training phase, and the image-level anomaly score s. For unsupervised samples, the synthetic anomaly mask generated in step S5 is used as supervision information; for weakly supervised samples, image-level anomaly labels are used as supervision information; and for fully supervised samples, pixel-level defect annotations are used as supervision information.
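The mapping from sample type to supervision signal described above can be sketched as follows. This is a hedged illustration only: the `Sample` class, its field names, and the string tags are hypothetical, and skipping the pixel-level target for weakly supervised samples is an assumption of this sketch rather than something the text states.

```python
from dataclasses import dataclass
from typing import List, Optional

Mask = List[List[int]]


@dataclass
class Sample:
    kind: str                          # "unsupervised", "weak", or "full" (hypothetical tags)
    image_label: Optional[int] = None  # 1 = abnormal, 0 = normal (weak samples)
    pixel_mask: Optional[Mask] = None  # pixel-level defect annotation (full samples)


def select_supervision(sample, synthetic_mask):
    """Return (mask_target, label_target) used to compute the loss."""
    if sample.kind == "unsupervised":
        # The synthetic anomaly mask from step S5 is the only supervision.
        label = int(any(any(row) for row in synthetic_mask))
        return synthetic_mask, label
    if sample.kind == "weak":
        # Only an image-level label exists; no pixel target in this sketch.
        return None, sample.image_label
    if sample.kind == "full":
        label = int(any(any(row) for row in sample.pixel_mask))
        return sample.pixel_mask, label
    raise ValueError(f"unknown sample kind: {sample.kind!r}")
```

Returning `None` as the mask target lets a training loop skip the segmentation loss for that sample while still applying the classification loss.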
[0019] Preferably, in step S2, the data preprocessing includes: minimum target size statistics, reflection component suppression, and mixed data augmentation, wherein the reflection component suppression adopts a multi-scale adaptive gain Retinex algorithm combined with frequency domain high-pass filtering, and the mixed data augmentation includes random flipping, rotation, scaling, saturation adjustment, and Gaussian noise injection.
[0020] Preferably, in step S6, the inference stage further includes a post-processing step: thresholding the pixel-level anomaly score map to obtain a defect segmentation mask, performing morphological closing operations on the defect segmentation mask to fill the micro-holes, and removing isolated noise regions with an area less than one-quarter of the minimum target size.
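The post-processing step above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: a 3×3 structuring element, 4-connectivity for regions, and the minimum target size interpreted as the area `min_w * min_h`; none of these choices is specified in the text, and all function names are hypothetical.

```python
from collections import deque


def threshold(score_map, thr):
    """Binarize the pixel-level anomaly score map."""
    return [[1 if v >= thr else 0 for v in row] for row in score_map]


def _morph(mask, op):
    """3x3 dilation (op=max) or erosion (op=min); out-of-image pixels are ignored."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = op(window)
    return out


def closing(mask):
    """Morphological closing = dilation followed by erosion (fills micro-holes)."""
    return _morph(_morph(mask, max), min)


def remove_small_regions(mask, min_area):
    """Clear 4-connected foreground regions with fewer than min_area pixels."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 0 or seen[y][x]:
                continue
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for cy, cx in region:
                    out[cy][cx] = 0
    return out


def postprocess(score_map, thr, min_w, min_h):
    """Threshold, close, then drop regions smaller than a quarter of the minimum target area."""
    mask = closing(threshold(score_map, thr))
    return remove_small_regions(mask, (min_w * min_h) / 4.0)
```

In a production setting the same three operations would typically be delegated to an image-processing library; the pure-Python version is only meant to make the order of operations explicit.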
[0021] The beneficial effects of this invention are mainly reflected in: 1. Unified Multi-Form Learning Capability. This invention introduces for the first time a unified detection framework supporting unsupervised, weakly supervised, and fully supervised training in the field of wire bonding appearance inspection. It allows for flexible switching of training methods based on the actual labeled data, solving the practical problem of scarce and diverse labeled data in industrial scenarios. This feature enables the method to work uniformly in various scenarios, from those with only normal samples (abundant on production lines) to those with a small number of labeled abnormal samples.
[0022] 2. High-precision detection of minute defects. For minute targets such as bond neck cracks and micrometer-level offsets of bonding points, a multi-scale feature fusion and edge-aware loss mechanism is designed. Combined with a synthetic anomaly generation strategy, diverse synthetic defects can be generated for training even in unsupervised mode, significantly improving the detection sensitivity for small targets. Experiments on typical bonding datasets show that this method achieves a pixel-level IoU of 0.86 for neck crack detection.
[0023] 3. Real-time inference performance. The backbone network adopts a lightweight design (frozen shallow layers), requiring only a single forward propagation during inference, with the segmentation-classification dual branches sharing the computational cost of feature extraction. Inference time on an NVIDIA RTX 4070 Ti is less than 45ms, meeting the real-time detection requirements of production lines.
[0024] 4. Strong generalization and anti-interference capabilities. Multi-scale Retinex reflection separation in the preprocessing stage effectively suppresses interference from high metallic reflectivity; the synthetic anomaly generation mechanism makes the model robust to changes in illumination and product type. It maintains stable performance in bonding appearance inspection tasks across various package types (QFN, SOP, BGA).
[0025] 5. Ease of use in engineering. Most hyperparameters can be automatically calculated from the dataset, reducing the burden of manual parameter tuning; the model supports incremental learning, and the production line can continuously optimize performance as data accumulates.
Attached Figure Description
[0026] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings: Figure 1 is a flowchart of a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0027] Figure 2 is a model framework diagram of a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0028] Figure 3 is a schematic diagram of feature adaptation in a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0029] Figure 4 is a schematic diagram of the diffusion model learning the anomaly embedding in a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0030] Figure 5 is a schematic diagram of the diffusion model generating anomalies in a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0031] Figure 6 is a schematic diagram of noise-guided anomaly generation in a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
[0032] Figure 7 is a schematic diagram of the segmentation head and the classification head in a visual inspection method for the appearance quality of chip wire bonding according to the present invention.
Detailed Implementation
[0033] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0034] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features described in the present application can be combined with each other.
[0035] This invention provides a visual inspection method for the appearance quality of chip wire bonding which, as shown in Figures 1 to 7, includes the following steps: Construct a training dataset, which includes unsupervised samples, weakly supervised samples, and fully supervised samples.
[0036] Collect t microscopic images of the wire bonding region (resolution not less than 5 million pixels), and construct a training set according to the annotation type: Unsupervised sample U: contains only normal sample images without any annotation; Weakly supervised sample W: contains image-level labels (normal / abnormal), annotates the abnormal category but does not contain pixel-level location information; Fully supervised sample S: contains pixel-level semantic segmentation annotations, accurately annotating the solder joint area, gold wire area and the boundaries of each defect type.
[0037] The proportions of the three subsets can be flexibly configured according to the actual application scenario to form a multi-form training sample set, train_multi. Based on this type of training data, the training modes are: unsupervised mode (using only normal images); weakly supervised mode, which includes anomalous images with image-level labels; and fully supervised mode, where all images are pixel-level labeled.
[0038] Data preprocessing involves preprocessing the input image.
[0039] Specifically, the data preprocessing includes: minimum target size statistics, reflection component suppression, and mixed data augmentation. The reflection component suppression uses a multi-scale adaptive gain Retinex algorithm combined with frequency domain high-pass filtering. The mixed data augmentation includes random flipping, rotation, scaling, saturation adjustment, and Gaussian noise injection.
[0040] More specifically, a dedicated preprocessing workflow was designed to address the characteristics of wire bonding images, such as high metallic reflectivity and complex backgrounds: Minimum target size statistics: When annotations exist, the width minw and height minh of the smallest defect target are statistically analyzed from the annotation samples and used as network design constraints.
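A minimal sketch of the minimum target size statistics is given below. It assumes that each annotated defect is available as a bounding box `(x_min, y_min, x_max, y_max)` in pixels and that "smallest" means smallest box area; both the input format and the function name are assumptions for illustration.

```python
def min_target_size(boxes):
    """Return (minw, minh) of the smallest-area annotated defect, or None
    when there are no annotations (for example, in unsupervised mode)."""
    boxes = list(boxes)
    if not boxes:
        return None

    def area(box):
        x0, y0, x1, y1 = box
        return (x1 - x0) * (y1 - y0)

    x0, y0, x1, y1 = min(boxes, key=area)
    return x1 - x0, y1 - y0
```

For example, `min_target_size([(10, 10, 30, 18), (5, 5, 9, 11)])` selects the second box and returns a width of 4 and a height of 6 pixels.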
[0041] Reflection component suppression: To address the issue of strong reflection in the gold wire / pad area, a multi-scale adaptive gain Retinex algorithm is used to separate the reflection component from the illumination component. This is combined with frequency domain high-pass filtering to remove interference from low-frequency illumination variations, thereby enhancing the visibility of weak defects on the metal surface.
[0042] Hybrid data augmentation: The training images are augmented by random flipping, rotation (±0~90°), scaling (0.8~1.2 times), saturation adjustment, Gaussian noise injection, etc. The corresponding labels are automatically generated for the labeled samples, and the augmented images are incorporated into the training set.
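The key point of the augmentation step, that the label must receive the same geometric transform as the image, can be sketched as follows. This is a simplified illustration: only flips are actually applied, while the rotation angle and scale are merely sampled from the stated ranges (applying them would require interpolation). Reading "±0~90°" as the interval [-90°, 90°] is an assumption, as are all function and key names.

```python
import random


def sample_augmentation(rng=None):
    """Draw one set of augmentation parameters from the stated ranges."""
    rng = rng or random.Random()
    return {
        "flip_h": rng.random() < 0.5,
        "flip_v": rng.random() < 0.5,
        "angle_deg": rng.uniform(-90.0, 90.0),  # assumed reading of the rotation range
        "scale": rng.uniform(0.8, 1.2),
    }


def flip(grid, horizontal, vertical):
    """Flip a 2D list left-right and/or top-bottom, returning a new grid."""
    rows = grid[::-1] if vertical else grid
    return [row[::-1] if horizontal else row[:] for row in rows]


def augment_pair(image, mask, params):
    """Apply identical flips to image and mask so the annotation stays aligned."""
    return (
        flip(image, params["flip_h"], params["flip_v"]),
        flip(mask, params["flip_h"], params["flip_v"]),
    )
```

Photometric changes such as saturation adjustment and Gaussian noise injection would be applied to the image only, since they do not move pixels and therefore leave the mask valid.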
[0043] The network model design is explained with reference to Figure 2. Features are first extracted and upsampled, and are then adapted in the segmentation branch. During training, anomalies are randomly generated at the image level; the randomly generated anomaly region mask is passed as a parameter to the noise-guided mask region generation module to constrain mask generation. In other cases, synthetic anomalies are generated in the latent space and restricted to the region defined by the binarized noise mask. The segmentation head predicts M0 from MixA, and the classification head then combines this mask with MixF to produce the anomaly score s. Both s and M0 are learned under supervision, using the anomaly mask M and the true anomaly label y, where y is set to 1 if the image contains anomalies (synthetic or real) and 0 otherwise. The inference phase skips anomaly generation and directly produces M0 and s.
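The difference between the training and inference data flows can be sketched as a single function. This is a schematic illustration, not the patented implementation: the heads and the anomaly injector are passed in as plain callables, the concatenation of the feature map with M0 is represented by handing both to the classification head, and all names are hypothetical.

```python
def dual_branch_forward(F, A, seg_head, cls_head, inject_anomaly=None):
    """Run the segmentation and classification branches.

    F, A           : first feature map and adapted second feature map
    seg_head       : callable(features) -> pixel-level score map M0
    cls_head       : callable(features, M0) -> image-level score s
    inject_anomaly : callable(F, A) -> (MixF, MixA, M); given only in training
    """
    if inject_anomaly is not None:
        # Training phase: heads see the perturbed features MixF / MixA.
        cls_features, seg_features, mask = inject_anomaly(F, A)
    else:
        # Inference phase: anomaly generation is skipped entirely.
        cls_features, seg_features, mask = F, A, None
    M0 = seg_head(seg_features)
    s = cls_head(cls_features, M0)
    return M0, s, mask
```

Because the backbone features are computed once and shared by both heads, inference needs only one forward pass, which matches the single-pass behavior described in the text.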
[0044] Specifically, with reference to Figures 2 to 7: a feature extraction backbone network is used to extract multi-scale feature maps from the preprocessed image, which are then upsampled to generate the first feature map F. The feature extraction backbone network adopts the WideResNet50 network, and the shallow convolution parameters of the feature extraction backbone network are frozen during training, with only the parameters of the higher-level feature layers being updated.
[0045] WideResNet50, pre-trained on a dataset collected with our own bonding inspection equipment, is used as the backbone feature extractor to extract multi-scale feature maps. To improve inference speed, features are extracted only from a subset of layers L (l∈L, L={2,3}); these features have relatively low spatial resolution. To enhance the model's ability to detect minor anomalies and improve localization accuracy, an upsampling layer is introduced before feature concatenation: for each l∈L, the feature map is resized by bilinear interpolation to twice the size of the largest extracted feature map. This effectively doubles the feature resolution. Specifically, the size of the third-layer features is increased by a factor of 4, and the size of the second-layer features is increased by a factor of 2, which ensures that all layers have a consistent spatial resolution.
[0046] The upsampled features are then concatenated. The neighborhood context of each feature location is captured using local average pooling with a 3×3 kernel, which ultimately yields the upsampled feature map.
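The size arithmetic and the local pooling described above can be checked with a short sketch. The strides of 8 and 16 for layers 2 and 3 are an assumption based on the usual ResNet-style layout and are not stated in the text; the border handling of the pooling (averaging only in-image values) is likewise an illustrative choice, and the function names are hypothetical.

```python
def upsample_plan(input_size, strides=None):
    """Return the common target size and the per-layer upsampling factor."""
    strides = strides or {2: 8, 3: 16}  # assumed output strides of layers 2 and 3
    sizes = {layer: input_size // s for layer, s in strides.items()}
    target = 2 * max(sizes.values())    # twice the largest extracted map
    factors = {layer: target // size for layer, size in sizes.items()}
    return target, factors


def avg_pool3x3(fmap):
    """3x3 average pooling with stride 1 on a 2D list of numbers.
    Border windows average only the values that lie inside the map."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [fmap[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(window) / len(window)
    return out
```

For a 256×256 input under these assumptions, layer 2 is 32×32 and layer 3 is 16×16, so the common target is 64×64 with factors of 2 and 4, consistent with the factors given in the text.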
[0047] The backbone network is pre-trained on its own dataset, and the shallow convolution parameters are frozen, with only the high-level feature layers being updated.
[0048] Feature adaptation involves performing feature adaptation processing on the first feature map F through a feature adaptation module to generate a second feature map A for the segmentation task. The feature adaptation module includes a 1×1 convolutional layer, a 3×3 convolutional layer, and an attention module connected in sequence. The 1×1 convolutional layer is used for channel compression and semantic information fusion, the 3×3 convolutional layer is used for extracting local structure and enhancing spatial information, and the attention module is used for focusing on key regions and suppressing background noise.
[0049] As shown in Figure 3, a 1×1 convolution is first used to compress channels, which fuses semantic information and reduces computation; a 3×3 convolution then extracts local structure and enhances spatial information; and the attention module focuses on key regions, suppresses background noise, and can improve anomaly contrast.
[0050] Synthetic anomaly generation involves configuring a synthetic anomaly generation strategy based on the supervision format of the current samples, generating a synthetic anomaly mask and corresponding perturbation features, and injecting them into the first feature map F and the second feature map A, respectively, to form the third feature map MixF and the fourth feature map MixA.
[0051] Synthetic anomalies are mainly generated in two ways: by a diffusion model and by noise injection. Accordingly, the synthetic anomaly generation strategy includes a dual-path random selection mechanism: during training, the diffusion model anomaly generation path based on region mask is selected with a 50% probability, and the noise-guided anomaly generation path based on region restriction is selected with a 50% probability.
[0052] Specifically, the anomaly generation path of the diffusion model based on region masks includes: randomly sampling normal images from the training dataset as input; using region masks to control the position and size of the generated anomalies; and injecting the embedding vector as a condition through a cross-attention module to generate the expected anomaly image. In each denoising step of image generation, the region within the mask box is preserved, while the region outside the box is replaced by a noised version.
[0053] As a further refinement of the diffusion model path, the anomaly generation path of the diffusion model based on region masking adopts a pre-trained diffusion model. The pre-trained diffusion model does not intervene in the first 30% of epochs of training, and then intervenes randomly with a 50% probability starting from the 31st epoch.
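The warm-up and the 50/50 path selection can be sketched as a small scheduling function. Because the text gives the warm-up both as "the first 30% of epochs" and as ending before "the 31st epoch", the warm-up length is left as an explicit parameter here; the function name, the 1-based epoch convention, and the string return values are assumptions for illustration.

```python
import random


def choose_anomaly_path(epoch, warmup_epochs, rng=None, p_diffusion=0.5):
    """Pick the synthetic anomaly path for one training step.

    epoch         : 1-based index of the current epoch
    warmup_epochs : number of initial epochs in which the diffusion path is off
    Returns "diffusion" or "noise".
    """
    rng = rng or random.Random()
    if epoch <= warmup_epochs:
        return "noise"  # the diffusion path does not intervene during warm-up
    return "diffusion" if rng.random() < p_diffusion else "noise"
```

For a 100-epoch run, `warmup_epochs=30` satisfies both readings of the text, since 30% of 100 epochs ends exactly before the 31st epoch.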
[0054] The noise-guided anomaly generation path based on region constraints includes: randomly generating anomalies of various shapes in the feature space using one of Gaussian noise, fractal noise, or simplex noise; obtaining an anomaly mask by thresholding the noise map; removing the actual anomaly region to generate a synthetic anomaly mask; using the synthetic anomaly mask to restrict Gaussian noise to a specific region; generating the final noise and adding it to the features to create a synthetic anomaly.
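The steps of the noise-guided path can be sketched on a single-channel feature map as follows. This is a simplified illustration under stated assumptions: Gaussian noise is used as the shape noise (one of the three options named in the text), and the threshold `thr` and perturbation strength `sigma` are hypothetical values, as are the function and parameter names.

```python
import random


def noise_guided_anomaly(features, real_mask, rng=None, thr=0.5, sigma=0.015):
    """Return (perturbed_features, synthetic_mask) for one 2D feature channel.

    features  : 2D list of floats
    real_mask : 2D list of 0/1 marking real annotated anomalies (all zeros if none)
    """
    rng = rng or random.Random()
    h, w = len(features), len(features[0])
    # 1. Random shape noise (Gaussian here; fractal or simplex noise are alternatives).
    shape_noise = [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]
    # 2. Threshold the noise map to obtain a binary anomaly mask.
    noise_mask = [[1 if v > thr else 0 for v in row] for row in shape_noise]
    # 3. Remove the real anomaly region so synthetic and real anomalies do not overlap.
    synthetic_mask = [[m if r == 0 else 0 for m, r in zip(mask_row, real_row)]
                      for mask_row, real_row in zip(noise_mask, real_mask)]
    # 4-5. Add Gaussian noise to the features only where the synthetic mask is set.
    perturbed = [[f + (rng.gauss(0.0, sigma) if m else 0.0)
                  for f, m in zip(feat_row, mask_row)]
                 for feat_row, mask_row in zip(features, synthetic_mask)]
    return perturbed, synthetic_mask
```

Features outside the synthetic mask are returned unchanged, so the synthetic mask can be used directly as the pixel-level target for the perturbed features.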
[0055] Image generation based on the diffusion model: The diffusion model (DM) is a probabilistic model used to learn the distribution of data, enabling the reconstruction of diverse samples from noise. This model treats the process of adding and removing noise as a Markov chain of length T, learning the data distribution through continuous noise addition and removal. The optimization objective of the diffusion model is to predict noise from a noisy image.
[0056] Conditional Diffusion Model. Given random noise, the diffusion model (DM) can generate diverse images through iterative denoising, but the semantics of the generated images cannot be controlled. To address this issue, the Latent Diffusion Model (LDM) proposes using a condition y to control the denoising process. LDM first transforms the image into a latent space, then injects the condition y into the model through a cross-attention module, enabling the model to generate images corresponding to y. The optimization objective of LDM can be simplified as:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, \tau_\theta(y)\right) \right\|_2^2\right]$$
where $\epsilon$ is noise sampled from a Gaussian distribution; $\mathcal{E}$ is an autoencoder used to transform $x$ into the latent space; $z_t$ is the noised latent at step $t$; $t$ is the time step (diffusion step); $\tau_\theta(y)$ is the encoding of condition $y$; $\epsilon_\theta$ is the U-Net; the loss is the L2 loss (mean squared error); the condition $y$ is text or a defect category label; and $\tau_\theta$ is the corresponding encoder.
[0057] Due to the scarcity of abnormal images, training an expert condition encoder to encode industrial-scenario prompts (such as "bonded ball head damaged") is difficult to apply in practice.
[0058] For the same reasons, fine-tuning the LDM is either impractical or requires significant resources and time, making it difficult to use in real-world industrial projects. To overcome these challenges and reduce reliance on anomalous images, this method employs a scheme that directly uses embedding vectors as the conditional variables.
[0059] In this way, only a small number of embedding vectors need to be learned. This method can effectively generate anomalous images while alleviating the limitations caused by the scarcity of anomalous data.
[0060] As shown in Figure 4, based on a pre-trained Latent Diffusion Model (LDM) with all parameters fixed, an embedding vector is learned from a small number of support anomalies, that is, real-world anomaly images together with their corresponding ground-truth masks. The number of real-world anomaly images can be as small as 1.
[0061] This method focuses on building embedded representations that can effectively capture the semantic features of real-world anomalous images, rather than relying on fine-tuning processes of complex models.
[0062] Specifically, the loss function of the noise-prediction process is used to learn the distribution characteristics of real anomalies. An embedding vector is first initialized and substituted for the conditional embedding in the LDM.
[0063] The support set contains several (or even just one) real anomalous images. As in the conditioning mechanism of the LDM, the embedding vector is injected into the intermediate layers of the U-Net through cross-attention. Rather than training the whole network, the pre-trained LDM is kept with its parameters fixed, so that only the embedding vector is updated during noise addition and denoising. In this way, the learned embedding vector captures the distribution characteristics of the provided real anomalous images and guides the subsequent image generation stage.
[0064] In many cases, the anomalous region occupies only a small portion of the object. Training on the entire image can therefore bias the learned distribution toward the background or the overall image rather than the anomaly itself. To address this, the segmentation mask of the anomalous image is used to guide the loss function: the LDM noise-prediction loss is computed only over the region selected by the mask.
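A minimal sketch of the mask-guided loss described above, assuming a squared-error noise-prediction objective averaged over the masked elements. The function name `masked_noise_loss`, the `(C, H, W)` array layout, and the normalization are illustrative assumptions rather than details stated by the method.

```python
import numpy as np


def masked_noise_loss(eps_true, eps_pred, mask):
    """Squared noise-prediction error restricted to the anomaly mask.

    eps_true, eps_pred: arrays of shape (C, H, W) (assumed layout).
    mask: array of shape (H, W), 1 inside the anomalous region, 0 elsewhere.
    """
    m = mask.astype(eps_true.dtype)[None, :, :]   # broadcast over channels
    sq_err = (eps_true - eps_pred) ** 2 * m       # zero outside the mask
    # Assumption: average over masked elements only; guard against empty masks.
    denom = m.sum() * eps_true.shape[0]
    return float(sq_err.sum() / max(denom, 1.0))
```

Errors outside the mask contribute nothing, so the gradient that reaches the embedding vector depends only on how well the anomalous region is reconstructed.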
[0065] This method learns a specified object by optimizing the embedding vector. However, it is not used for learning the entire object; instead, it uses a mask-guided loss function to capture the object's local, detailed features.
[0066] As shown in Figure 5, given a normal image and a mask, the learned embedding guides the LDM to generate an anomalous image. Because generation is controllable both semantically and spatially, the generated images closely match the intended specification.
[0067] Specifically, a normal image is randomly selected from the training set as the input image, and a mask is used to control the position and size of the generated anomaly. The learned embedding vector is frozen and injected as a condition through the cross-attention module to generate the expected anomalous image. At each step of the denoising process, the region inside the mask box is preserved, while the region outside the box is replaced by the noised version of the input image.
[0068] In this way, the generated anomalous region is confined to a specified area of the input image while the other areas remain unaffected. At each step t, the process combines two latent variables: the normal image, mapped into the latent space by the encoder and noised to step t, and the denoised latent variable. After denoising is complete, the latent variable is passed through the decoder to produce the anomalous image. To further enhance diversity, boxes of arbitrary position and size are used. Cut-and-paste and geometric misalignment are conventional generation methods in this field and are not elaborated further.
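The per-step blending described above can be sketched as follows. This is only an illustration under stated assumptions: `denoise_fn` and `add_noise_fn` are hypothetical stand-ins for the frozen LDM denoising step and the forward noising of the normal latent, and the exact step index at which the background is re-noised is not specified by the text.

```python
import numpy as np


def blend_step(z_denoised, z_normal_noised, box_mask):
    """Keep the denoised latent inside the box, the noised normal latent outside."""
    m = box_mask.astype(z_denoised.dtype)          # (H, W), broadcasts over channels
    return m * z_denoised + (1.0 - m) * z_normal_noised


def masked_denoise(z_start, z_normal, box_mask, denoise_fn, add_noise_fn, num_steps):
    """Run num_steps denoising steps with box-constrained blending.

    denoise_fn(z, t) and add_noise_fn(z, t) are hypothetical callables
    supplied by the caller; they are not part of any specific library.
    """
    z = z_start
    for t in reversed(range(num_steps)):
        z = denoise_fn(z, t)                       # conditional denoising step
        z_background = add_noise_fn(z_normal, t)   # normal latent at noise level t
        z = blend_step(z, z_background, box_mask)
    return z
```

Because the background is overwritten at every step, only the content inside the box is free to follow the anomaly embedding.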
[0069] When generating noise-guided anomalies, traditional noise generation methods are used in the deep feature space to simulate anomalies of varying shapes. Each time, one type of noise—Gabb noise, fractal noise, or simplex noise—is randomly used.
[0070] Two-branch detection is divided into a training phase and an inference phase. The training phase includes: inputting the fourth feature map MixA into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the third feature map MixF with the pixel-level anomaly score map M0 and inputting the concatenation into the classification head and outputting an image-level anomaly score s. The inference phase includes: inputting the second feature map A into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the first feature map F with the pixel-level anomaly score map M0 and inputting the concatenation into the classification head and outputting an image-level anomaly score s.
[0071] As shown in Figure 6, synthetic anomalies are generated as follows. The anomaly mask obtained by thresholding the noise map has the real anomaly regions (taken from the ground-truth mask) removed, which yields the synthetic anomaly mask. The synthetic anomaly mask is then used to confine Gaussian noise to the selected region, and the resulting noise is added to the features to create the synthetic anomaly. The final anomaly mask M is constructed from the synthetic anomaly mask and the ground-truth mask, so that it marks both the synthetic and the real anomalous regions. In weakly supervised and unsupervised scenarios the ground-truth mask is empty, so the thresholded mask directly becomes both the synthetic anomaly mask and the final mask M. Anomaly generation takes effect only during the training phase.
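A minimal sketch of this mask construction and feature perturbation, assuming features of shape `(C, H, W)` and masks of shape `(H, W)`. The threshold `0.5`, the noise scale `sigma`, the use of a union for the final mask, and the function name are illustrative assumptions.

```python
import numpy as np


def synthesize_feature_anomaly(features, noise_map, real_mask,
                               threshold=0.5, sigma=0.1, rng=None):
    """Return (perturbed features, synthetic mask, final mask M)."""
    rng = np.random.default_rng() if rng is None else rng
    thresholded = noise_map > threshold            # candidate anomaly regions
    real = real_mask.astype(bool)                  # empty when no pixel labels exist
    synth_mask = thresholded & ~real               # drop real anomaly regions
    gaussian = rng.normal(0.0, sigma, size=features.shape)
    perturbed = features + gaussian * synth_mask[None]   # noise only inside synth_mask
    final_mask = synth_mask | real                 # marks synthetic and real anomalies
    return perturbed, synth_mask, final_mask
```

With an empty `real_mask`, `synth_mask` and `final_mask` both equal the thresholded noise map, which matches the weakly supervised and unsupervised cases described above.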
[0072] The segmentation head includes parallel 3×3 convolutional layers, 5×5 convolutional layers, dilated convolutional layers, and 1×1 convolutional layers. The parallel convolutional layers are used to fuse multi-scale features and output a single-channel anomaly score map. The classification head includes 5×5 convolutional blocks, pooling layers, and fully connected layers. The 5×5 convolutional blocks are used to capture global contextual information, the pooling layers are used for feature dimensionality reduction, and the fully connected layers are used to output image-level anomaly scores.
[0073] As shown in Figure 7, the detection architecture places the segmentation head and the classification head in parallel. The segmentation head receives the adapted feature map A (or the feature map MixA during training) and fuses it through three parallel convolutional layers (3×3, 5×5, dilated convolution) and one 1×1 convolutional layer, finally outputting a single-channel anomaly score map M0.
[0074] The classification head receives the feature map F (or MixF during training) and consists of a single 5×5 convolutional block, subsequent pooling layers, and a final fully connected layer. It captures global contextual information, thereby reducing the false alarm rate and improving the detection capability for small and distributed defects.
[0075] The segmentation head first generates an anomaly map M0. This map is then concatenated with a feature map F (or a noise-enhanced feature map MixF generated during training) and used as input to the classification head's convolutional blocks. The output of the convolutional blocks and the anomaly map are pooled, concatenated, and then passed through a final fully connected layer to generate an image-level anomaly score s.
[0076] Model training involves calculating the loss function based on the output of the training phase, and updating the parameters of the feature extraction backbone network, feature adaptation module, segmentation head, and classification head based on the gradient of the loss function.
[0077] The loss function includes segmentation loss and classification loss. Segmentation loss includes truncation loss, edge loss, and focal loss, while classification loss uses either binary cross-entropy loss or focal loss. During model training, the supervisory information is chosen according to the type of sample: for unsupervised samples, the synthetic anomaly mask from the synthetic anomaly generation step is used as supervisory information; for weakly supervised samples, image-level anomaly labels are used as supervisory information; and for fully supervised samples, pixel-level defect annotations are used as supervisory information.
[0078] Specifically, a combined loss function is designed to adapt to the annotation characteristics of different supervision methods.
[0079] Segmentation loss. The individual terms are described below. In the truncation term, a cutoff value (0.6 in this method) prevents the fitted values from becoming excessively high, and the predicted anomaly mask output by the segmentation head is evaluated at each position (i, j). The total truncation loss is obtained by averaging over all elements of the predicted anomaly mask. This loss prompts the model to learn a soft decision boundary between anomalous and non-anomalous regions. Because of the soft decision boundary, the model does not overfit the data and therefore generalizes better.
[0080] In the edge loss, Sobel denotes Sobel edge extraction.
[0081] To account for the imbalance of samples, focal loss is added. Focal loss and Sobel are common algorithms in this field and will not be described in detail.
[0082] The weight of the edge loss in this method is 0.3.
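The exact formulas are not reproduced in this text, so the following is only a sketch under explicit assumptions: the truncation term is taken to be a hinge-style penalty on raw (unbounded) scores with cutoff 0.6, and the edge term is taken to be the mean absolute difference between Sobel edge magnitudes of the prediction and the target mask. The focal term is omitted, and all function names are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def _filter3x3(img, kernel):
    """3x3 correlation with edge-replicated padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out


def sobel_magnitude(img):
    return np.sqrt(_filter3x3(img, SOBEL_X) ** 2 + _filter3x3(img, SOBEL_Y) ** 2)


def truncated_loss(scores, target, th=0.6):
    """Assumed form: push normal scores below -th and anomalous scores above +th."""
    normal = scores[target == 0]
    anomalous = scores[target > 0]
    loss = 0.0
    if normal.size:
        loss += np.clip(normal + th, 0.0, None).mean()
    if anomalous.size:
        loss += np.clip(th - anomalous, 0.0, None).mean()
    return float(loss)


def edge_loss(scores, target):
    """Assumed form: L1 distance between Sobel edge maps."""
    diff = sobel_magnitude(scores) - sobel_magnitude(target.astype(float))
    return float(np.abs(diff).mean())


def segmentation_loss(scores, target, th=0.6, edge_weight=0.3):
    # Focal term intentionally left out of this sketch.
    return truncated_loss(scores, target, th) + edge_weight * edge_loss(scores, target)
```

Pixels whose scores are already beyond the cutoff contribute zero loss, which is what keeps the decision boundary soft instead of forcing scores toward extreme values.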
[0083] Classification loss. Binary cross-entropy (BCE) loss is adopted, computed between the anomaly probability output by the classification head and the image-level label (available in supervised mode). Focal loss is used in the other supervision modes.
[0084] The final total loss combines the segmentation loss and the classification loss through a hyperparameter that is adjusted dynamically according to the form of supervision. For normal images and in unsupervised mode, the hyperparameter is 1. For anomalous images with full (pixel-level) annotation, it is 1. For anomalous images annotated only at the image level (OK / NG), it is 0.
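The three cases can be written as a small selection rule. This sketch assumes the hyperparameter multiplies the segmentation loss and that the total is the sum of the classification loss and the weighted segmentation loss; the function and argument names are hypothetical.

```python
def seg_loss_weight(is_anomalous, has_pixel_mask):
    """Weight of the segmentation loss for one training image.

    Normal images (including unsupervised mode) and anomalous images with
    pixel-level masks get weight 1; anomalous images labeled only OK / NG
    at the image level get weight 0.
    """
    if is_anomalous and not has_pixel_mask:
        return 0.0
    return 1.0


def total_loss(cls_loss, seg_loss, is_anomalous, has_pixel_mask):
    # Assumed combination: classification loss plus weighted segmentation loss.
    return cls_loss + seg_loss_weight(is_anomalous, has_pixel_mask) * seg_loss
```

For example, `total_loss(0.5, 2.0, is_anomalous=True, has_pixel_mask=False)` reduces to the classification loss alone, so an image without a pixel-level mask never produces a misleading segmentation gradient.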
[0085] This allows the segmentation head to be trained on all images (excluding anomalous images without pixel-level labels), while the classification head remains in training mode. Thanks to the anomaly generation strategy, the model can still successfully complete segmentation training even in the complete absence of pixel-level labels.
[0086] This method also includes model training strategies. Multi-stage progressive training proceeds in three stages. Pre-training: self-supervised pre-training on a large-scale general defect dataset, which includes defect features from the diffusion model, learns general anomaly representation capabilities and establishes the feature extraction backbone network; this step is performed only once in this method and its result is used directly in subsequent steps. Domain adaptation: normal samples from the target production line are used for unsupervised domain adaptation to adjust the distribution. Fine-tuning: the appropriate form of supervision is selected based on the available label types and quantities.
[0087] Optimizer configuration: The AdamW optimizer is used with an initial learning rate of 1e-4. Cosine annealing is used for learning rate scheduling, along with a learning rate warm-up strategy (linearly increasing to the target learning rate over the first 20 epochs).
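The schedule described above (linear warm-up over the first 20 epochs to 1e-4, then cosine annealing) can be sketched as a framework-independent function. The total epoch count of 200 and the minimum learning rate of 0 are assumptions chosen for illustration; the text does not state them.

```python
import math


def learning_rate(epoch, total_epochs=200, base_lr=1e-4,
                  warmup_epochs=20, min_lr=0.0):
    """Learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The value returned for each epoch would be assigned to the optimizer's parameter groups before that epoch starts.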
[0088] Early stopping and model selection: Use the validation set for evaluation after every 5 iterations. Stop training when the mixed metric does not improve for 3 consecutive iterations and save the optimal model parameters.
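A minimal sketch of this early-stopping rule, reading "3 consecutive iterations" as three consecutive validation rounds without improvement of the mixed metric. That reading, the class name `EarlyStopper`, and the helper `run` are assumptions made for illustration.

```python
class EarlyStopper:
    """Track the best validation metric and signal when to stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, metric, step):
        """Return True when training should stop."""
        if metric > self.best:
            # Improvement: the caller would save model parameters here.
            self.best, self.best_step, self.bad_evals = metric, step, 0
            return False
        self.bad_evals += 1
        return self.bad_evals >= self.patience


def run(metrics_by_eval, eval_every=5):
    """Replay a list of validation metrics, one per evaluation round."""
    stopper = EarlyStopper(patience=3)
    for k, metric in enumerate(metrics_by_eval, start=1):
        if stopper.update(metric, step=k * eval_every):
            break
    return stopper.best, stopper.best_step
```

With the history `[0.5, 0.7, 0.6, 0.65, 0.69, 0.9]`, training stops after the fifth evaluation and the parameters saved at step 10 (metric 0.7) are kept; the later value 0.9 is never reached.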
[0089] Evaluation metrics: Considering that defects in wire bonding inspection are extremely small, a weighted mixed metric is adopted so that the evaluation more closely reflects actual inspection needs. The component metrics being weighted are commonly used evaluation metrics in this field.
[0090] In the inference detection process, the image of the wire bonding area of the chip to be detected is sequentially processed through data preprocessing, feature extraction backbone network, feature adaptation, and inference stage. The image-level anomaly score s of the inference stage is output as the defect judgment result.
[0091] Inference Flow: Images are preprocessed and input into the model; the segmentation head outputs pixel-level anomaly score maps, and the classification head outputs image-level anomaly probabilities; the anomaly score maps are thresholded (default threshold 0.5) to obtain a defect segmentation mask. Post-processing Optimization: Morphological closing operation: fills in tiny holes within the defect region; area filtering: removes isolated noise regions smaller than minw×minh / 4; multi-threaded processing: different defect categories are extracted in parallel, accelerating the post-processing flow.
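The thresholding, closing, and area-filtering steps can be sketched without external image libraries. The 3×3 structuring element, 4-connectivity, and function names below are assumptions; only the default threshold 0.5 and the area limit minw×minh / 4 come from the text.

```python
from collections import deque

import numpy as np


def _dilate(mask):
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask)
    for di in range(3):
        for dj in range(3):
            out |= padded[di:di + h, dj:dj + w]
    return out


def _erode(mask):
    # Pad with True so that regions touching the border are not eroded away.
    padded = np.pad(mask, 1, mode="constant", constant_values=True)
    h, w = mask.shape
    out = np.ones_like(mask)
    for di in range(3):
        for dj in range(3):
            out &= padded[di:di + h, dj:dj + w]
    return out


def closing(mask):
    """Morphological closing (dilation then erosion) with a 3x3 element."""
    return _erode(_dilate(mask))


def remove_small_regions(mask, min_area):
    """Drop 4-connected regions whose pixel count is below min_area."""
    h, w = mask.shape
    seen = np.zeros_like(mask)
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            if not mask[i, j] or seen[i, j]:
                continue
            queue, region = deque([(i, j)]), []
            seen[i, j] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_area:
                for y, x in region:
                    out[y, x] = True
    return out


def postprocess(score_map, min_w, min_h, threshold=0.5):
    mask = score_map > threshold                   # defect segmentation mask
    mask = closing(mask)                           # fill tiny holes
    return remove_small_regions(mask, min_area=min_w * min_h / 4)
```

In a production pipeline the same steps would normally be delegated to an optimized image-processing library; the pure-NumPy version here only makes the order of operations explicit.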
[0092] As described above, this invention, by unifying multi-form learning capabilities, introduces a unified detection framework supporting unsupervised, weakly supervised, and fully supervised methods for the first time in the field of wire bonding appearance inspection. It allows for flexible switching of training methods based on the actual labeled data, solving the practical problem of scarce and diverse labeled data in industrial scenarios. This characteristic enables the method to work uniformly in various scenarios, from those with only normal samples (abundant on production lines) to those with a small number of labeled abnormal samples.
[0093] High-precision detection of minute defects. For minute targets such as bond neck cracks and micrometer-level offsets of bonding points, a multi-scale feature fusion and edge-aware loss mechanism is designed. Combined with a synthetic anomaly generation strategy, diverse synthetic defects can be generated for training even in unsupervised mode, significantly improving the detection sensitivity for small targets. Experiments on typical bonding datasets show that this method achieves a pixel-level IoU of 0.86 for neck crack detection.
[0094] Real-time inference performance. The backbone network employs a lightweight design (frozen shallow layers), requiring only a single forward propagation during inference, with the segmentation-classification dual branches sharing the computational cost of feature extraction. Inference time on an NVIDIA RTX 4070 Ti is less than 45ms, meeting the real-time detection requirements of production lines.
[0095] Strong generalization and anti-interference capabilities. Multi-scale Retinex reflection separation in the preprocessing stage effectively suppresses interference from high metallic reflectivity; the synthetic anomaly generation mechanism makes the model robust to changes in illumination and product type. It maintains stable performance in bonding appearance inspection tasks across various package types (QFN, SOP, BGA).
[0096] Ease of use in engineering. Most hyperparameters can be automatically calculated from the dataset, reducing the burden of manual parameter tuning; the model supports incremental learning, and the production line can continuously optimize performance as data accumulates.
[0097] Experiments have shown that this invention not only achieves practical-level detection accuracy (defect detection AUROC of 0.917 in unsupervised mode) when using only normal samples, but also further improves performance after introducing a small amount of labeled data, demonstrating a good labeling efficiency ratio.
[0098] This invention is not only applicable to wire bonding appearance quality inspection, but its unified design concept can also be extended to other visual inspection scenarios in chip packaging (such as silver burn-off detection, substrate defect detection, etc.), and has broad application prospects and commercial value.
[0099] The term "comprising" or any other similar term is intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus / device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent in such process, method, article, or apparatus / device.
[0100] The technical solution of the present invention has been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will all fall within the scope of protection of the present invention.
Claims
1. A visual inspection method for the appearance quality of chip wire bonding, characterized in that... it includes the following steps:
S1, constructing a training dataset: the training dataset includes unsupervised samples, weakly supervised samples, and fully supervised samples;
S2, data preprocessing: the input image is preprocessed;
S3, feature extraction backbone network: multi-scale feature maps are extracted from the preprocessed image and then upsampled to generate the first feature map F;
S4, feature adaptation: the first feature map F is processed by the feature adaptation module to generate a second feature map A for the segmentation task;
S5, synthetic anomaly generation: based on the supervision method of the current sample, a synthetic anomaly generation strategy is configured to generate a synthetic anomaly mask and corresponding perturbation features, which are then injected into the first feature map F and the second feature map A, respectively, to form the third feature map MixF and the fourth feature map MixA;
S6, dual-branch detection, divided into a training phase and an inference phase; the training phase includes: inputting the fourth feature map MixA into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the third feature map MixF with the pixel-level anomaly score map M0, inputting the concatenation into the classification head, and outputting an image-level anomaly score s; the inference phase includes: inputting the second feature map A into the segmentation head and outputting a pixel-level anomaly score map M0; concatenating the first feature map F with the pixel-level anomaly score map M0, inputting the concatenation into the classification head, and outputting an image-level anomaly score s;
S7, model training: the loss function is calculated based on the output of the training phase in step S6, and the parameters of the feature extraction backbone network, the feature adaptation module, the segmentation head, and the classification head are updated based on the gradient of the loss function;
S8, inference detection: the inference stages of steps S2, S3, S4, and S6 are executed sequentially on the image of the wire bonding area of the chip to be inspected, and the image-level anomaly score s of the inference stage is output as the defect judgment result.
2. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S3, the feature extraction backbone network adopts the WideResNet50 network. The shallow convolution parameters of the feature extraction backbone network are kept frozen during training, and only the parameters of the high-level feature layers are updated.
3. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S4, the feature adaptation module includes a 1×1 convolutional layer, a 3×3 convolutional layer, and an attention module connected in sequence. The 1×1 convolutional layer is used for channel compression and semantic information fusion, the 3×3 convolutional layer is used for extracting local structure and enhancing spatial information, and the attention module is used for focusing on key regions and suppressing background noise.
4. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S5, the synthetic anomaly generation strategy includes a dual-path random selection mechanism: during training, the diffusion model anomaly generation path based on region mask is selected with a 50% probability, and the noise-guided anomaly generation path based on region restriction is selected with a 50% probability.
5. The visual inspection method for the appearance quality of chip wire bonding according to claim 4, characterized in that: The anomaly generation path of the diffusion model based on region masking includes: randomly sampling normal images from the training dataset as input, using region masks to control the position and size of the generated anomaly images, injecting the embedding vector as a condition through a cross-attention module to generate the expected anomaly image, and in each step of the denoising process, the region within the mask box is preserved, while the region outside the box is replaced by a noise version.
6. The visual inspection method for the appearance quality of chip wire bonding according to claim 4, characterized in that: The noise-guided anomaly generation path based on region restriction includes: randomly generating anomalies of different shapes in the feature space using one of Gaussian noise, fractal noise, or simplex noise; obtaining an anomaly mask by thresholding the noise map; removing the actual anomaly region to generate a synthetic anomaly mask; using the synthetic anomaly mask to restrict Gaussian noise to a specific region; generating the final noise and adding it to the features to create a synthetic anomaly.
7. The visual inspection method for the appearance quality of chip wire bonding according to claim 5, characterized in that: The anomaly generation path of the diffusion model based on region masking adopts a pre-trained diffusion model. The pre-trained diffusion model does not intervene in the first 30% of epochs during training, and then intervenes randomly with a 50% probability starting from the 31st epoch.
8. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S6, the segmentation head includes parallel 3×3 convolutional layers, 5×5 convolutional layers, dilated convolutional layers, and 1×1 convolutional layers. The parallel convolutional layers are used to fuse multi-scale features and output a single-channel anomaly score map. The classification head includes a 5×5 convolutional block, a pooling layer, and a fully connected layer. The 5×5 convolutional block is used to capture global context information, the pooling layer is used for feature dimensionality reduction, and the fully connected layer is used to output image-level anomaly scores.
9. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S7, the loss function includes segmentation loss and classification loss. The segmentation loss includes truncation loss, edge loss, and focal loss. The classification loss uses binary cross-entropy loss or focal loss.
10. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: During the model training process in step S7, the loss function is calculated based on the supervision information of the current sample, the pixel-level anomaly score map M0 output during the training phase, and the image-level anomaly score s. For unsupervised samples, the synthetic anomaly mask generated in step S5 is used as supervision information; for weakly supervised samples, image-level anomaly labels are used as supervision information; and for fully supervised samples, pixel-level defect annotations are used as supervision information.
11. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S2, data preprocessing includes: minimum target size statistics, reflection component suppression, and mixed data augmentation. The reflection component suppression uses a multi-scale adaptive gain Retinex algorithm combined with frequency domain high-pass filtering. The mixed data augmentation includes random flipping, rotation, scaling, saturation adjustment, and Gaussian noise injection.
12. The visual inspection method for the appearance quality of chip wire bonding according to claim 1, characterized in that: In step S6, the inference stage further includes a post-processing step: thresholding the pixel-level anomaly score map to obtain a defect segmentation mask, performing morphological closing operations on the defect segmentation mask to fill the micro-holes, and removing isolated noise regions with an area less than one-quarter of the minimum target size.