Multi-focal image fusion method and system for semiconductor package bonding process

By combining a heterogeneous computing power collaborative architecture and an active light source adaptive closed loop with a semantic prior-driven multifocal image fusion system, the problem of generating panoramic depth images in semiconductor packaging manufacturing has been solved, achieving high-speed and high-precision micro-defect detection.

CN122289014A · Pending · Publication Date: 2026-06-26 · SEMIBRIDGE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SEMIBRIDGE TECH CO LTD
Filing Date
2026-03-13
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies in semiconductor packaging manufacturing struggle to generate high-fidelity, artifact-free panoramic depth images at high-speed production line cycles, especially when faced with high reflectivity interference and complex structures, leading to missed detections of micro-defects and distortion of dimensional measurement data.

Method used

A heterogeneous computing power collaborative architecture is adopted, integrating an active light source adaptive closed loop and a semantic prior-driven multi-focal image fusion system. The system generates a mask by running a semantic segmentation model on the GPU, adjusts the light source in real time on the FPGA, and performs image fusion by combining a triple weight map and a multi-scale pyramid strategy.

Benefits of technology

It achieves high-speed (>5 pieces/second), high-precision (measurement error <1%), and highly robust (stable even with a reflective-area ratio of 25%) fully automatic multi-focal image fusion inspection, significantly improving image clarity and detail fidelity and addressing the problem of missed micro-defect detection.


Abstract

This invention relates to the fields of digital image processing and semiconductor automated inspection technology, and specifically discloses a multi-focus image fusion method and system for semiconductor packaging bonding processes. The invention uses a GPU to run a semantic segmentation model that generates a mask, which drives non-uniform layered sampling along the Z-axis: dense in key areas and sparse in background areas. It uses an FPGA to adjust the light source in real time based on the proportion of reflective area, forming an active light source adaptive closed loop, and intelligently schedules either an FPGA hardware interpolation branch or a GPU deep learning generation branch for reflection repair. Finally, it combines edge, semantic, and attention metrics to construct a triple weight map and uses a multi-scale pyramid strategy to separate and fuse high and low frequencies, generating high-quality panoramic depth images suitable for Wirebond process AOI. The invention addresses the missed detection of micro-defects in semiconductor gold wire bonding inspection caused by strong reflection interference, low sampling efficiency, timing mismatch, and fusion distortion.

Description

Technical Field

[0001] This invention relates to the fields of digital image processing and semiconductor automated inspection technology, specifically a multi-focal image fusion method and system for wirebond areas in semiconductor packaging processes. By fusing multiple images from different focal planes, a high-resolution, clear image with full depth of field is generated. This method is suitable for automated measurement and defect detection of key structures such as ball bonds (first solder joint), neck (the line segment after ball bonding), wire loop (the curved part of the lead wire), and stitch bond (second solder joint).

Background Technology

[0002] In semiconductor packaging manufacturing, wirebonding is a crucial step in connecting wafer electrodes to pin structures. The surface features of this process, such as solder joint morphology, bond wire curvature, collapse amount, ball diameter, and neck cracks, are minute in size and distributed in a non-flat three-dimensional space. Due to the physical depth-of-field limitations of optical systems, a single image cannot simultaneously capture a complete and clear image of different height levels (typically 100-400 μm) including the gold ball, neck, wire curvature, and second solder joint.

[0003] To address the issue of insufficient depth of field, existing technologies typically employ multi-focus image fusion algorithms. However, these algorithms still face the following significant challenges in practical industrial applications:

[0004] First, there is a timing conflict between high-speed inspection and depth stacking: to cover the entire height range, the system must perform Z-axis layered scanning, but traditional image acquisition and serial processing are too time-consuming to meet the stringent cycle-time requirements of modern high-throughput semiconductor packaging and testing lines (such as >5 pieces / second), so this stage can easily become a production bottleneck.

[0005] Second, traditional fusion algorithms are limited under complex conditions: the surfaces of metal bonding materials such as gold and copper are highly specular, and the bonding wire areas contain large low-texture regions. Traditional gradient-based sharpness evaluation algorithms (such as Laplacian energy, SML, etc.) are prone to misjudging overexposed highlight areas, resulting in artifacts, ghosting, or broken lines at the edges of gold balls and along thin wire loops in the fused image, which severely masks subtle defects.

[0006] Third, poor adaptability to lighting conditions: The complex three-dimensional structure of the bonding region makes it difficult to simultaneously ensure imaging quality at different height levels with a fixed illumination angle. Localized strong reflections not only lead to pixel saturation but also interfere with image feature extraction, causing unstable positioning in image content-based automatic alignment and precision measurement algorithms, severely affecting measurement repeatability.

[0007] Fourth, the detection of critical micro-defects is highly dependent on image quality: Inspection items in the Wirebond process (such as ball diameter, neck width, heel crack, wire sweep offset, etc.) have extremely high requirements for image edge sharpness and contrast. If a full depth-of-field image without artifacts and reflection interference cannot be obtained, it is very easy to miss fatal defects such as neck cracks and heel cracks, or to cause distortion of key dimensional measurement data such as ball diameter and wire arc height.

[0008] In summary, existing technologies struggle to overcome the effects of high reflectivity and complex structures to generate high-fidelity, artifact-free full-depth images under high-speed production line conditions. Therefore, there is an urgent need for a multifocal image fusion system that integrates dynamic light source adaptive control, semantic prior guidance, and heterogeneous computing acceleration to support high-precision AOI (Automated Optical Inspection) and precision measurement applications.

Summary of the Invention

[0009] This invention aims to overcome the shortcomings of existing technologies and provide a robust, efficient, and automated multifocal image fusion method and system. Based on a heterogeneous computing power collaborative architecture, it integrates an active light source adaptive closed loop and a semantic prior-driven adaptive strategy, effectively suppressing fusion artifacts and specular interference, significantly improving the overall image clarity and detail fidelity, and achieving high-quality panoramic depth image generation suitable for Wirebond process AOI.

[0010] To achieve the above objectives, the present invention adopts the following technical solution:

[0011] A multi-focus image fusion system for semiconductor packaging bonding processes includes a mechanical platform and motion control module, a Z-axis precision drive module, a dynamic light source adaptation system, a microscopic imaging module, and an image processing and control platform.

[0012] The image processing and control platform includes a central processing unit (CPU), a graphics processing unit (GPU), and a field-programmable gate array (FPGA) module. The FPGA module is connected to the CPU and GPU via a high-speed bus for real-time control and data interaction.

[0013] The image processing and control platform is configured to perform the following operations:

[0014] The GPU is used to run a semantic segmentation model to generate a semantic mask containing the gold ball, neck, and pad regions based on the acquired reference image.

[0015] Based on the semantic mask, a non-uniform hierarchical sampling strategy is generated, and the Z-axis precision drive module is controlled to perform dense sampling in the key area using a first step interval, and sparse sampling in the background area using a second step interval greater than the first step interval.

[0016] During the layered scanning process, the FPGA is used to calculate the proportion of reflective areas in the image in real time, and the dynamic light source adaptation system is controlled to adjust the brightness according to the calculation results in order to generate a multi-focal image sequence.

[0017] The multifocal image sequence is subjected to reflection restoration, image registration, and fusion processing.

[0018] The reflection repair includes intelligently scheduling FPGA hardware interpolation branches or GPU deep learning generation branches based on the proportion of reflective pixels in the image.

[0019] The fusion process includes constructing a triple weight map of edge guidance weights, semantic prior weights, and spatial attention weights, and combining a multi-scale pyramid strategy to separate and fuse high and low frequencies of the image to generate a panoramic depth-fusion image.

[0020] Preferably, the dynamic light source adaptation system is at least partially deployed in the hardware logic of the FPGA module, and is configured to parse the image stream, calculate the proportion of reflective areas and contrast index of the image, and generate lighting adjustment instructions based on the calculation results to control the brightness level of the light source.

[0021] The dynamic light source adaptation system is configured as follows:

[0022] Receive real-time image streams from the microscopic imaging module and calculate the pixel ratio of reflective areas and Michelson contrast in the image through hardware logic;

[0023] The pixel ratio and Michelson contrast of the reflective area are compared with the preset target reflective ratio and target contrast.

[0024] When the pixel ratio of the reflective area or the Michelson contrast does not meet the preset conditions, an adjustment command is generated to adjust the intensity of the blue light channel and / or the weight of the dark field light source in the multi-channel light source through a serial bus or parallel PWM signal until the pixel ratio of the reflective area and the Michelson contrast meet the preset conditions.
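The two metrics used by this closed loop can be sketched in a few lines. The following is a minimal software illustration, not the FPGA logic itself: it treats an image as a flat list of 16-bit gray values, uses the reflective threshold (grayscale ≥ 62000) and the targets (reflective ratio 18% ± 1%, contrast ≥ 8%) stated in the embodiment, and assumes the function names, which are hypothetical.

```python
from typing import Sequence


def reflective_ratio(pixels: Sequence[int], threshold: int = 62000) -> float:
    """Fraction of pixels whose gray value is at or above the reflective threshold."""
    if not pixels:
        raise ValueError("empty image")
    return sum(1 for p in pixels if p >= threshold) / len(pixels)


def michelson_contrast(pixels: Sequence[int]) -> float:
    """Michelson contrast (Imax - Imin) / (Imax + Imin), a value in [0, 1]."""
    i_max, i_min = max(pixels), min(pixels)
    if i_max + i_min == 0:
        return 0.0  # all-black frame: define contrast as zero
    return (i_max - i_min) / (i_max + i_min)


def needs_adjustment(
    pixels: Sequence[int],
    target_ratio: float = 0.18,   # target reflective-area ratio from the embodiment
    ratio_tol: float = 0.01,      # +/- 1% tolerance from the embodiment
    min_contrast: float = 0.08,   # contrast floor from the embodiment
) -> bool:
    """True when either metric misses its preset condition (hypothetical helper)."""
    ratio = reflective_ratio(pixels)
    contrast = michelson_contrast(pixels)
    return abs(ratio - target_ratio) > ratio_tol or contrast < min_contrast
```

In hardware both metrics can be accumulated while the pixel stream passes through (a counter plus running min/max), which is why no frame buffer is needed for this step.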

[0025] Preferably, the non-uniform hierarchical sampling strategy specifically includes:

[0026] For the gold ball and neck region where the semantic mask confidence is greater than or equal to the first preset threshold, the coverage height range is set to 40-60μm and the sampling step size is 1-3μm.

[0027] For pad regions with semantic mask confidence greater than or equal to the first preset threshold, the coverage height range is set to 80-120μm and the sampling step size is 15-25μm.

[0028] For background regions where the semantic mask confidence level is less than the first preset threshold, the sampling step size is set to 40-60μm;

[0029] The Z-axis precision drive module is controlled to perform unidirectional continuous scanning, and the total number of sampling frames is controlled within 32 frames.
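As an illustration of how such a strategy turns into a scan plan, the sketch below builds an ascending list of Z positions from per-region segments and enforces the 32-frame cap. The segment layout in the usage note (a pad region below a ball/neck region) and the name `build_z_plan` are assumptions made for the example; the source does not state how the regions are arranged along Z.

```python
from typing import List, Tuple

# (z_start_um, z_end_um, step_um) for one semantic region
Segment = Tuple[float, float, float]


def build_z_plan(segments: List[Segment], max_frames: int = 32) -> List[float]:
    """Return ascending Z positions (um) for a one-directional scan.

    Segments are visited from lowest to highest start height, and a position
    already covered by an earlier segment is not sampled twice.
    """
    positions: List[float] = []
    for z_start, z_end, step in sorted(segments):
        if step <= 0 or z_end < z_start:
            raise ValueError("invalid segment")
        n_steps = int((z_end - z_start) / step + 1e-9)
        for i in range(n_steps + 1):
            z = round(z_start + i * step, 3)
            if not positions or z > positions[-1]:
                positions.append(z)
    if len(positions) > max_frames:
        raise ValueError(
            f"{len(positions)} frames exceed the {max_frames}-frame cap"
        )
    return positions
```

For example, a 100 μm pad range at a 20 μm step followed by a 50 μm ball/neck range at a 2 μm step, `build_z_plan([(0.0, 100.0, 20.0), (100.0, 150.0, 2.0)])`, yields 31 frames, inside the cap. A 1 μm step over the same 50 μm range would need 51 frames and is rejected, which shows why the step size and coverage range have to be chosen together.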

[0030] The Z-axis precision drive module includes a piezoelectric actuator and a position sensing component, which is used to perform multi-layer stepping motion in the Z direction to acquire sample images at multiple focal planes, and output a Z-axis ready signal to the FPGA module after the position is stabilized.

[0031] Preferably, the step of intelligently scheduling the FPGA hardware interpolation branch or the GPU deep learning generation branch based on the proportion of reflective pixels in the image specifically includes:

[0032] The percentage R of reflective pixels in a single frame image is calculated.

[0033] When R is less than the preset threshold R0, the FPGA module uses a neighborhood weighted interpolation algorithm to fill pixels in the reflective area.

[0034] When R is greater than or equal to a preset threshold R0, the FPGA module triggers the GPU to call the generative adversarial network model to perform texture repair on the reflective area.
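The branch selection above reduces to a single threshold comparison. The sketch below shows that comparison together with one simple form of neighborhood weighted interpolation (inverse-distance weighting along a row). The value of R0, the one-dimensional simplification, and all names are assumptions for illustration; the source does not give R0 or the exact interpolation kernel.

```python
from enum import Enum
from typing import List, Sequence


class RepairBranch(Enum):
    FPGA_INTERPOLATION = "fpga_interpolation"
    GPU_GENERATIVE = "gpu_generative"


def select_repair_branch(reflective_mask: Sequence[bool], r0: float = 0.10) -> RepairBranch:
    """Pick the repair branch from the reflective pixel ratio R (r0 is a placeholder)."""
    r = sum(reflective_mask) / len(reflective_mask)
    return RepairBranch.FPGA_INTERPOLATION if r < r0 else RepairBranch.GPU_GENERATIVE


def fill_row(row: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Fill masked pixels from the nearest valid left/right neighbours.

    Each neighbour is weighted by the inverse of its distance, so closer
    valid pixels contribute more.
    """
    out = list(row)
    n = len(row)
    for i in range(n):
        if not mask[i]:
            continue
        left = next((j for j in range(i - 1, -1, -1) if not mask[j]), None)
        right = next((j for j in range(i + 1, n) if not mask[j]), None)
        if left is None and right is None:
            continue  # no valid pixel in this row to interpolate from
        if left is None:
            out[i] = row[right]
        elif right is None:
            out[i] = row[left]
        else:
            wl, wr = 1.0 / (i - left), 1.0 / (right - i)
            out[i] = (wl * row[left] + wr * row[right]) / (wl + wr)
    return out
```

Small reflective patches are well served by this kind of local fill, while large saturated regions have no nearby valid texture to borrow from, which is the motivation for handing them to a generative model.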

[0035] Preferably, the image processing and control platform is further configured to perform hybrid image registration, including:

[0036] The CPU is used to call the AVX-512 instruction set to accelerate ORB feature extraction and RANSAC coarse matching to obtain the initial transformation matrix;

[0037] The GPU is used to perform subpixel-level fine registration based on the initial transformation matrix using the pyramid Lucas-Kanade optical flow method.

[0038] Preferably, the construction method of the triple weight map includes:

[0039] Edge information is extracted using the Canny operator, and edge guiding weights are generated based on the consistency of edge direction.

[0040] The semantic mask output by the semantic segmentation model is reused and combined with dynamic light source contrast data to generate semantic prior weights.

[0041] Image features are extracted using a convolutional attention module, and spatial attention weights are generated that focus on regions of salient features.
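The source does not state how the three weights are merged into one map. One plausible reading, sketched below purely as an assumption, multiplies the three weights per pixel and then normalizes across the focal stack so that the weights of all frames sum to 1 at every pixel. Frames are represented as flat lists and `combine_weights` is a hypothetical name.

```python
from typing import List

WeightStack = List[List[float]]  # indexed as [frame][pixel]


def combine_weights(
    edge: WeightStack,
    semantic: WeightStack,
    attention: WeightStack,
    eps: float = 1e-12,
) -> WeightStack:
    """Multiply the three weight maps and normalize across frames per pixel."""
    n_frames, n_pixels = len(edge), len(edge[0])
    raw = [
        [edge[k][p] * semantic[k][p] * attention[k][p] for p in range(n_pixels)]
        for k in range(n_frames)
    ]
    totals = [sum(raw[k][p] for k in range(n_frames)) for p in range(n_pixels)]
    return [
        [
            # fall back to a uniform weight where no frame has any support
            raw[k][p] / totals[p] if totals[p] > eps else 1.0 / n_frames
            for p in range(n_pixels)
        ]
        for k in range(n_frames)
    ]
```

With a product, a frame must score well on all three cues to dominate a pixel; a weighted sum would be a more permissive alternative.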

[0042] Preferably, the mechanical platform and motion control module include an XYθ motion mechanism, a temperature sensor, and a nanoscale position feedback component, which are used to perform sample positioning in the XY plane and rotation direction according to the control instructions of the FPGA module, and output a ready signal to the FPGA module after the position is stable.

[0043] The FPGA module is configured to read the position data of the nanoscale position feedback component and the temperature data of the temperature sensor, calculate the position compensation amount based on Abbe error and thermal expansion error, and superimpose the compensation amount into the target coordinate command to correct the positioning of the mechanical platform in real time.
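For reference, the two error terms have standard first-order forms: the Abbe error is the measurement offset multiplied by the tangent of the angular error, and the thermal expansion error is the expansion coefficient times the length times the temperature deviation. The sketch below applies both to a target coordinate. The sign convention (subtracting both errors from the target) and all parameter values are assumptions, since the source gives no numbers.

```python
import math


def abbe_error_um(offset_mm: float, tilt_arcsec: float) -> float:
    """Abbe error = offset * tan(tilt), returned in micrometres."""
    tilt_rad = math.radians(tilt_arcsec / 3600.0)
    return offset_mm * 1000.0 * math.tan(tilt_rad)


def thermal_error_um(
    length_mm: float, cte_per_k: float, temp_c: float, ref_temp_c: float = 20.0
) -> float:
    """Thermal expansion error = CTE * length * (T - T_ref), in micrometres."""
    return length_mm * 1000.0 * cte_per_k * (temp_c - ref_temp_c)


def compensated_target_um(
    target_um: float,
    offset_mm: float,
    tilt_arcsec: float,
    length_mm: float,
    cte_per_k: float,
    temp_c: float,
    ref_temp_c: float = 20.0,
) -> float:
    """Superimpose the compensation onto the target coordinate (assumed sign)."""
    correction = abbe_error_um(offset_mm, tilt_arcsec) + thermal_error_um(
        length_mm, cte_per_k, temp_c, ref_temp_c
    )
    return target_um - correction
```

As a sense of scale, a 10 mm Abbe offset with 1 arcsecond of tilt gives roughly 0.05 μm of error, while a 100 mm steel-like member (11.5e-6 per K) that is 2 K warm expands by 2.3 μm, so the thermal term can dominate at sub-micrometre tolerances.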

[0044] Preferably, the microscopic imaging module includes a camera, an objective lens, and a multi-channel light source. The camera is configured to acquire images of multiple focal planes in response to a trigger signal from the FPGA module and transmit the image data to the FPGA module.

[0045] The camera is connected to the FPGA module via a Camera Link interface;

[0046] The FPGA module is configured to write the image data directly into the GPU's video memory via the PCIe bus using Direct Memory Access (DMA) to trigger the GPU to perform inference calculations.

[0047] A multi-focus image fusion method for semiconductor packaging bonding processes, implemented based on the aforementioned multi-focus image fusion system for semiconductor packaging bonding processes, includes the following steps:

[0048] Step 1: Based on the heterogeneous computing platform consisting of a central processing unit (CPU), a graphics processing unit (GPU), and a field-programmable gate array (FPGA), the mechanical platform is controlled to move the target area of the sample to the center of the imaging field of view.

[0049] Step 2: Acquire a reference image and run a semantic segmentation model using the GPU to generate a semantic mask containing the gold ball, neck, and pad regions;

[0050] Step 3: Determine the non-uniform layered sampling strategy based on the semantic mask, control the Z-axis actuator to perform layered scanning according to the strategy, and control the dynamic light source through the FPGA to adjust it in real time to generate a multi-focus image sequence;

[0051] Step 4: Perform reflection repair on the multifocal image sequence, including selecting hardware interpolation repair performed by FPGA or deep learning generation repair performed by GPU based on the proportion of reflective pixels in the image.

[0052] Step 5: Perform hybrid registration on the image sequence after reflection restoration, including CPU-accelerated coarse registration and GPU-accelerated subpixel-level fine registration.

[0053] Step 6: Construct a triple weight map based on the improved multi-scale Laplacian sharpness evaluation function and combining edge information, semantic mask, and spatial attention information;

[0054] Step 7: Decompose the image into low-frequency and high-frequency components using a multi-scale pyramid strategy, and fuse the low-frequency and high-frequency components based on the triple weight map to reconstruct a panoramic depth-fusion image.
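Step 7 can be illustrated with a small Laplacian-pyramid fusion. To stay dependency-free the sketch works on 1-D signals instead of images, blends every level with a weighted average, and uses a simple [1, 2, 1] / 4 smoothing kernel. All of these are assumptions, because the source does not specify the pyramid kernel or the per-band fusion rules. The weights are expected to sum to 1 across frames at every sample.

```python
from typing import List

Signal = List[float]


def _down(s: Signal) -> Signal:
    """Smooth with [1, 2, 1] / 4 (edges replicated), then keep every 2nd sample."""
    n = len(s)
    smooth = [(s[max(i - 1, 0)] + 2 * s[i] + s[min(i + 1, n - 1)]) / 4 for i in range(n)]
    return smooth[::2]


def _up(s: Signal, n: int) -> Signal:
    """Upsample to length n by nearest-neighbour repetition."""
    return [s[min(i // 2, len(s) - 1)] for i in range(n)]


def laplacian_pyramid(s: Signal, levels: int) -> List[Signal]:
    """Return [detail_0, ..., detail_{levels-1}, base]: high to low frequency."""
    pyramid: List[Signal] = []
    current = list(s)
    for _ in range(levels):
        smaller = _down(current)
        coarse = _up(smaller, len(current))
        pyramid.append([a - b for a, b in zip(current, coarse)])
        current = smaller
    pyramid.append(current)
    return pyramid


def weight_pyramid(w: Signal, levels: int) -> List[Signal]:
    """Downsample a weight map so it matches each pyramid level."""
    pyramid = [list(w)]
    for _ in range(levels):
        pyramid.append(_down(pyramid[-1]))
    return pyramid


def fuse(frames: List[Signal], weights: List[Signal], levels: int = 2) -> Signal:
    """Blend each frequency band with its weights, then collapse the pyramid."""
    frame_pyrs = [laplacian_pyramid(f, levels) for f in frames]
    weight_pyrs = [weight_pyramid(w, levels) for w in weights]
    fused: List[Signal] = []
    for lvl in range(levels + 1):
        n = len(frame_pyrs[0][lvl])
        fused.append([
            sum(wp[lvl][i] * fp[lvl][i] for fp, wp in zip(frame_pyrs, weight_pyrs))
            for i in range(n)
        ])
    current = fused[-1]  # start from the low-frequency base
    for lvl in range(levels - 1, -1, -1):
        up = _up(current, len(fused[lvl]))
        current = [d + u for d, u in zip(fused[lvl], up)]
    return current
```

Blending per band rather than per pixel is what suppresses visible seams: abrupt weight transitions are smoothed at coarse levels while fine detail is still taken from the sharpest frame. A practical variant might select the maximum-weight coefficient in the detail bands and average only the base.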

[0055] Preferably, Step 3 involves real-time adjustment of the dynamic light source via the FPGA, including:

[0056] The FPGA analyzes the image stream of the acquired image in real time and calculates the pixel ratio of the reflective area and the Michelson contrast ratio.

[0057] If the pixel percentage of the reflective area is not within the preset target percentage range, or the Michelson contrast ratio is lower than the preset contrast threshold, the light source parameters are adjusted via the I²C / SPI interface or PWM signal until the requirements are met.

[0058] The beneficial effects of this invention are as follows:

[0059] This invention uses a GPU to run a semantic segmentation model that generates a mask, which drives non-uniform layered sampling along the Z-axis: dense in key areas and sparse in background areas. It employs an FPGA to adjust the light source in real time based on the proportion of reflective area, forming an active adaptive closed loop for the light source, and intelligently schedules either an FPGA hardware interpolation branch or a GPU deep learning generation branch for reflection repair. Finally, it combines edge, semantic, and attention metrics to construct a triple weight map and uses a multi-scale pyramid strategy to separate and fuse high and low frequencies. This effectively addresses the missed detection of micro-defects caused by strong reflection interference, low sampling efficiency, timing mismatch, and fusion distortion in semiconductor gold wire bonding inspection, achieving high-speed (>5 pieces / second), high-precision (measurement error <1%), and highly robust (stable even with a reflective-area ratio of 25%) fully automatic multi-focal image fusion inspection.

Detailed Implementation

[0060] Example 1:

[0061] A multi-focus image fusion system for semiconductor packaging bonding processes comprises an image processing and control platform, a mechanical platform and motion control module, a Z-axis precision drive module, a dynamic light source adaptation system, and a microscopic imaging module. Each module contains dedicated functional components, and precise linkage between components and modules is achieved through standardized interfaces, forming a complete hardware and software collaborative architecture. The specific composition and functions are as follows:

[0062] (i) The physical carrier of the image processing and control platform is a customized industrial computer, which integrates an Intel i9-13900K motherboard, an NVIDIA RTX 4090 graphics card, a Camera Link image acquisition card, a Samsung 990 Pro NVMe SSD and a Xilinx KCU105 FPGA development board. All modules are interconnected through a PCIe 4.0 bus and are packaged in a 4U industrial control chassis with efficient heat dissipation and electromagnetic shielding.

[0063] This platform provides industrial interfaces such as CameraLink, RS422, and Gigabit Ethernet through the front or rear I / O panel. As a complete electromechanical-computer integrated unit, it supports high-speed image acquisition, real-time processing, and precise motion coordination control.

[0064] This platform adopts a heterogeneous computing architecture, with an Intel i9-13900K CPU as the system scheduling core, an NVIDIA RTX 4090 GPU as the parallel processing engine, and a Xilinx KCU105 FPGA as the real-time control and image acquisition hub; the Basler sprint spL2592-17c camera of the microscopic imaging module is connected to the FMC-Camera Link daughter card on the Xilinx KCU105 FPGA development board through the Camera Link Medium interface;

[0065] The FPGA's internal hard logic receives a 16-bit image stream in real time and performs reflective area detection and Michelson contrast calculation during transmission. At the same time, the FPGA writes the original image data directly to the GPU memory via the PCIe 4.0 ×8 interface in DMA mode, triggering the resident SegNet-Lite semantic segmentation model or MetalGAN reflective restoration model to perform inference.

[0066] The camera is directly triggered by the FPGA via a hardware TTL signal, and the entire acquisition and preprocessing process is completely decoupled from CPU scheduling, ensuring nanosecond-level timing determinism; the processing results are asynchronously cached on a Samsung 990 Pro NVMe SSD (read / write speed ≥7GB / s).

[0067] The FPGA utilizes a 128MB area allocated in its onboard DDR4 memory to store a lightweight structured prior dataset (including gold bonding wire geometry templates, typical ball bond size ranges, pad layout rules, and material optical parameter tables), supporting microsecond-level deterministic access and meeting the real-time requirements of hardware-accelerated matching algorithms.

[0068] (ii) The mechanical platform and motion control module is an integrated electromechanical device that integrates XYθ high-precision direct drive motion mechanism, nano-level grating ruler position feedback component, multi-axis servo drive and control PCB, and EtherCAT real-time communication interface.

[0069] This module receives instruction packets from the FPGA master station via the EtherCAT industrial bus, which contain the target coordinates (X,Y,θ), motion trajectory parameters, and synchronization enable flags.

[0070] Once the XYθ platform completes its positioning motion and the actual positions of each axis (sampled in real time by the built-in grating ruler) enter the preset tolerance window (e.g., ±0.2μm / ±0.001°) and remain stable, its built-in motion controller immediately outputs a platform ready signal through a dedicated TTL hardware pin. This signal is directly connected to the dedicated interrupt input pin of the FPGA via a shielded twisted pair cable. The end-to-end delay from the platform's ready signal output to the platform's ready signal is ≤200μs, ensuring strict timing alignment between the sample's spatial pose and the image acquisition time.

[0071] (iii) The Z-axis precision drive module is an integrated electromechanical device that integrates multi-layer PZT piezoelectric actuators, Hall position sensing components, high-voltage drive PCBs and RS422 communication interfaces.

[0072] This module receives instruction packets containing parameters such as target height and step interval through the RS422 interface. When the Z-axis reaches the target position and the actual position error enters the preset tolerance window, its built-in drive controller immediately outputs a ready signal through a dedicated hardware pin. This signal is directly connected to the dedicated interrupt input pin of the FPGA via a shielded twisted pair cable. After the FPGA comprehensively judges the system status, it triggers the camera exposure in a unified manner. The end-to-end delay is ≤100ns, ensuring strict timing alignment between Z-axis positioning and image acquisition.

[0073] During multi-focal plane scanning, the Z-axis module and the mechanical platform and motion control module work together based on the same global coordinate system, and the motion timing is uniformly scheduled by the image processing and control platform to ensure that the target area of the sample is always located in the center of the imaging field of view during the Z-axis layer-by-layer scanning, effectively avoiding three-dimensional image stitching errors caused by field of view offset.

[0074] (iv) The dynamic light source adaptation system is deployed inside the Xilinx KCU105 FPGA chip and is implemented by dedicated hardware logic circuits, including an FPGA light source control interface unit, an image feature extraction module, an illumination parameter optimization module and an effect verification module.

[0075] FPGA light source control interface unit: Receives adjustment commands from the lighting parameter optimization module and, depending on the hardware interface type of the light source driver board, either sends digital configuration commands over the I²C / SPI serial bus or outputs multiple parallel PWM signals (frequency ≥ 1kHz) to precisely control the brightness levels (0-255) of the RGBW ring light source and the 8 groups of dark field light sources.

[0076] Image feature extraction module: Implemented by internal FPGA hardware logic, it directly parses the 16-bit digital image stream input from Camera Link. During image transmission, it performs real-time pixel counting and Michelson contrast calculation for reflective areas (grayscale ≥ 62000), without the need for external analog acquisition or GPU participation, ensuring feature extraction latency ≤ 1ms and meeting the ≤ 30ms closed-loop control requirement. For scenarios requiring advanced features such as semantic segmentation, the system also supports the generation of a refined reflective area mask (i.e., a binary map aligned with the image, identifying the position of specular reflection pixels) by the GPU, which is fed back to the FPGA via the AXI4-Lite bus as a priori for illumination optimization in the next round of image acquisition, further improving image quality.

[0077] Lighting parameter optimization module: It combines a rapid initial adjustment based on empirical rules with fine mapping through a pre-calibrated lookup table (LUT). Using the process targets of reflective area ratio ≈18% and contrast ≥8% as optimization criteria, it dynamically adjusts the intensity of the blue light channel and / or the relative brightness weight of the dark field light source in the RGBW light source, generates digital adjustment instructions, and sends them to the FPGA light source control interface unit.

[0078] The effect verification module is implemented based on dedicated hardware logic inside the FPGA, including frame state marking, double buffer feature storage and Boolean decision unit. The entire process is completed in the hardware pipeline to ensure microsecond-level response and millisecond-level closed loop, fully meeting the real-time requirement of ≤30ms.

[0079] The effect verification module is the last link in the FPGA control closed loop. Together with feature extraction and light source control, it forms a complete hardware autonomous unit of perception-decision-execution-verification. Using strictly time-aligned pre-adjustment and post-adjustment frame pairs, it extracts the reflective-area ratio and Michelson contrast within the same ROI. Success is determined only when the adjusted reflective-area ratio is within 18%±1% and the contrast is ≥8%. To avoid single-frame noise interference, the system supports requiring two consecutive frames to meet the criteria before confirming success; otherwise, a secondary optimization is triggered.
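The two-consecutive-frame confirmation behaves like a small debounce counter. The sketch below models it in software under the stated criteria (ratio within 18% ± 1%, contrast ≥ 8%); the class name and interface are hypothetical, and the real module is described as FPGA hardware logic, not software.

```python
from dataclasses import dataclass


@dataclass
class EffectVerifier:
    """Confirms success only after N consecutive frames meet both criteria."""

    target_ratio: float = 0.18
    ratio_tol: float = 0.01
    min_contrast: float = 0.08
    required_consecutive: int = 2
    _streak: int = 0  # number of consecutive passing frames seen so far

    def update(self, ratio: float, contrast: float) -> bool:
        """Feed one post-adjustment frame's metrics; return True once confirmed."""
        ok = (
            abs(ratio - self.target_ratio) <= self.ratio_tol
            and contrast >= self.min_contrast
        )
        # a single failing frame resets the streak, so noise cannot confirm success
        self._streak = self._streak + 1 if ok else 0
        return self._streak >= self.required_consecutive
```

A caller would treat a `False` result after the allotted frames as the trigger for the secondary optimization mentioned above.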

[0080] (v) Microscopic imaging module: Its components include a Basler sprint spL2592-17c camera, an Olympus long working distance plan semi-apochromatic objective lens (LMPlanFLN50× / 0.80), a Smart Vision RGBW ring light source, 8 groups of 45° surround dark field light sources and a multi-interface light source driver board.

[0081] The Camera Link Medium signal output from the camera is directly connected to the FMC-Camera Link daughter card of the Xilinx KCU105 FPGA in the image processing platform, enabling hardware-level image acquisition and real-time analysis. The objective lens provides an effective depth of field of ≤1μm at NA=0.80, enabling clear imaging of 15μm-level wirebond structures. Ring blue light (450nm) effectively suppresses diffuse reflection from metal surfaces, while dark-field oblique illumination significantly enhances edge contour contrast. The light source driver board receives FPGA commands, independently adjusts the brightness of each light source, and supports bidirectional communication via the I²C / SPI interface.

[0082] The microscopic imaging module is triggered by a TTL signal output from an FPGA GPIO pin. The TTL trigger signal is connected to the Line0 (TriggerIn) pin of the camera's Hirose I / O interface via a shielded twisted pair cable. The camera response delay is ≤1μs. The image is transmitted in real time via the Camera Link Medium interface at a maximum effective rate of approximately 3.4Gbps. At the same time, an illumination-acquisition-feedback closed loop is constructed to dynamically optimize the imaging quality.

[0083] The overall workflow of the multifocal image fusion system for semiconductor packaging bonding processes is as follows: The host computer (Intel i9-13900K CPU) sends the detection coordinates and process parameters to the Xilinx KCU105 FPGA; the FPGA controls the mechanical platform and motion control module through EtherCAT to accurately move the target area of the chip to the center of the field of view and rotate and align it; the Z-axis drive module performs autofocus or multi-layer stepping (Z-stack); the dynamic light source adaptation system completes millisecond-level adaptive illumination optimization before image acquisition; the microscopic imaging module synchronously exposes and acquires images under unified triggering of the FPGA; the nanoscale grating ruler built into the mechanical platform provides real-time position feedback, and the FPGA compensates for temperature drift and vibration online based on this data; the GPU executes the CUDA-accelerated core multifocal fusion algorithm to generate a fully focused image; the CPU synchronously calculates the local height map based on the sharpness evaluation results during the fusion process, and collaboratively completes global contrast enhancement, multi-dimensional data encapsulation, and result output; the system finally outputs a fully focused image and a 3D depth map to support subsequent 3D reconstruction, dimensional measurement, and defect detection.

[0084] Example 2:

[0085] A multi-focal image fusion method for semiconductor packaging bonding processes is proposed. Built upon a heterogeneous computing architecture combining hardware and software, this method leverages core system capabilities such as nanometer-level precise positioning, millisecond-level dynamic light source adaptation, and multi-stage pipelined parallel processing to achieve high-speed acquisition and high-precision fusion of Wirebond bonding region images. Furthermore, through deep customization and optimization of the CTCFuse multi-scale fusion architecture, the system's anti-reflection interference capability, defocus robustness, and micro-defect identification accuracy under complex operating conditions are significantly enhanced. The specific technical process and algorithm design are as follows:

[0086] Step 1, system initialization and parameter loading, as follows:

[0087] (1-1) Hardware heterogeneous parallel initialization

[0088] The central processing unit (Intel Core i9-13900K) utilizes the high bandwidth of the PCIe 4.0 ×16 bus to perform parallel enumeration, function self-test, and logical wake-up of core peripherals:

[0089] On the FPGA side (Xilinx KCU105): a lightweight structured prior dataset (including gold bond line geometry templates, typical ball bond size ranges, pad layout rules, and material optical parameter tables, with a total capacity ≤ 4MB) is loaded from the onboard 1GB DDR4 high-speed memory area, and the preprocessing logic units (such as Hough transform parameter constraints, initial reflection threshold values, and initial ROI positioning anchor points) are dynamically configured; where ROI (Region of Interest) in this system specifically refers to the core detection area containing the ball bond gold ball, neck, and pads;

[0090] On the GPU side (NVIDIA RTX 4090): CUDA 12.1 computing environment and TensorRT 8.6 inference engine are deployed simultaneously; the system pre-allocates an 8GB VRAM resource pool and loads two types of INT8 quantized deep learning models in parallel through a high-speed NVMe SSD channel.

[0091] SegNet-Lite semantic segmentation model: To adapt to the real-time segmentation requirements of 512×512 ROIs guided by FPGA, a lightweight encoder-decoder architecture is designed.

[0092] The encoder reuses the MobileNetV2 backbone (input resolution 512×512), removes the last two fully connected layers, retains 16 inverted residual blocks, and outputs a feature map with stride=16.

[0093] The decoder uses the U-NetLite lightweight upsampling path, retaining only 4 skip connections (corresponding to encoder stage2-stage5). Each stage compresses the number of channels to 32 through 1×1 convolution, and then refines it through bilinear interpolation and 3×3 convolution.

[0094] To address the issue of sparse foreground on the gold ball/neck/pad, a weighted Dice Loss is introduced into the loss function (with a weight ratio of background:pad:neck:gold ball = 1:3:5:5).

[0095] The entire network uses BatchNorm during training; at deployment, the BatchNorm layers are folded into Scale/Bias to accelerate inference;

[0096] The output layer is a 1×1 convolution + Softmax layer, generating a 4-channel probability mask S_sem(x,y);

[0097] Model size: INT8 quantized size ≤ 3.8MB, number of parameters ≈ 920,000, single-frame inference time ≤ 4.7ms (RTX4090).
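
The weighted Dice Loss with the ratio background:pad:neck:gold ball = 1:3:5:5 can be illustrated with a minimal NumPy sketch. The function name, the (C, H, W) array layout, the smoothing constant eps, and the weighted-mean aggregation over classes are assumptions made for illustration; the text does not state how the per-class Dice terms are combined.

```python
import numpy as np

# Class order assumed: background, pad, neck, gold ball (ratio 1:3:5:5 from the text).
CLASS_WEIGHTS = np.array([1.0, 3.0, 5.0, 5.0])

def weighted_dice_loss(probs, onehot, weights=CLASS_WEIGHTS, eps=1e-6):
    """probs, onehot: float arrays of shape (C, H, W).

    Returns 1 minus the weighted mean of the per-class Dice coefficients.
    """
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)  # eps keeps absent classes at Dice = 1
    return float(1.0 - (weights * dice).sum() / weights.sum())

# Tiny 2x2 example with one pixel per class.
labels = np.array([[0, 1], [2, 3]])
onehot = np.eye(4)[labels].transpose(2, 0, 1)
perfect = weighted_dice_loss(onehot, onehot)
uniform = weighted_dice_loss(np.full((4, 2, 2), 0.25), onehot)
```

With this aggregation, errors on the neck and gold ball classes contribute five times as much to the loss as errors on the background, which is the intended counterweight to the sparse foreground.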

[0098] MetalGAN Reflection Restoration Model: To address the pixel saturation problem caused by strong reflections in metallic bonding scenes, a lightweight generative adversarial network is designed (INT8 quantized size ≤ 1.2MB, total parameters ≤ 780,000).

[0099] Generator (10-layer U-Net variant): The encoder uses 3×3 depthwise separable convolution downsampling (32→64→96→128); a lightweight CAM module is embedded in the bottleneck layer to enhance specular features; the decoder uses transposed depthwise separable convolution upsampling, and skip connections only transmit the gold ball edge and neck contour; the first layer uses 7×7 convolution to expand the receptive field, and two residual blocks are cascaded in the middle to improve edge sharpness; the output layer generates a [-1,1] repair map using Tanh.

[0100] Discriminator (3-layer PatchGAN): 4×4 depthwise separable convolution, outputting a 1×64×64 local real/fake map, with ≤190,000 parameters;

[0101] We employ a joint training method combining adversarial loss and L1 perceptual loss, and use instance normalization (IN) across the entire network to adapt to the characteristics of metal images.

[0102] After TensorRT INT8 quantization, the single-frame repair time is 7.2ms;

[0103] The repair results are smoothly fused by 5×5 guided filtering (λ=50) and have a built-in watchdog mechanism. In case of an anomaly, the system automatically falls back to the FPGA interpolation branch (≤1ms), and the end-to-end delay is ≤12ms.

[0104] After completing deserialization and kernel optimization, both models enter standby mode, waiting for FPGA trigger commands.

[0105] Storage side: the NVMe SSD completes I/O performance calibration to ensure high-throughput data read and write channels are ready;

[0106] (1-2) Dynamic injection and adaptation of process parameters

[0107] The CPU calls the built-in optical parameter library for gold bonding wires and dynamically issues imaging and detection configuration commands based on the material type of the current work order:

[0108] Optical parameters: Based on the typical surface reflectivity of gold wire (≈50%), combined with system calibration data, the reflection suppression threshold is set to 62,000 (16-bit linear grayscale space, corresponding to approximately the top 5% of the grayscale range nearest saturation, used to accurately capture overexposed pixels).

[0109] Motion control: The motion controller distributes high-precision positioning commands for the XY axes in real time via the EtherCAT bus; the Z-axis piezoelectric ceramic platform responds to the hardware trigger signal generated by the FPGA and starts synchronous closed-loop control via the RS-422 differential interface to ensure that the mechanical response delay of the whole machine is ≤1ms.

[0110] (1-3) Deterministic timing synchronization calibration

[0111] The FPGA logic layer and the mechanical motion platform establish a deterministic linkage mechanism through TTL level hardware interrupts. The end-to-end timing of positioning completion and camera triggering is strictly calibrated. After calibration, the system must meet the following hard constraints:

[0112] Link confirmation: After the XY platform arrives at the target position and outputs a ready signal, the FPGA must complete the internal state latching and confirmation within ≤2μs;

[0113] Trigger Response: After the Z-axis reaches a stable pose and outputs a ready signal, the FPGA must issue a camera hardware trigger command within ≤2μs to start the exposure timing.

[0114] This mechanism aims to ensure strict synchronization between image acquisition and the platform's dynamic state, fundamentally eliminating motion blur and sampling timing deviations;

[0115] (1-4) System Global Ready State Confirmation

[0116] After completing the above initialization, parameter injection and timing calibration, the system enters a stable and ready state. At this time, all computing modules, peripheral interfaces and motion axis parameters have completed global alignment, consistency verification and functional interlock verification. The system has all the prerequisites for carrying out high-precision sample detection and high-speed and reliable data acquisition.

[0117] Step 2: Sample positioning and high-precision alignment with the field of view. This step constructs a nanoscale positioning system that combines fully closed-loop motion control with visual servoing. Utilizing a high-precision position feedback loop built with a Renishaw grating ruler, and combined with a multi-source error compensation model, it achieves micron-level stable positioning at the mechanical level. Simultaneously, through the Camera Link high-speed image transmission link, a visual feedback mechanism based on Hough transform is introduced to drive the XY platform to perform sub-micron-level fine-tuning, precisely locking the bonded sample at the center of the field of view. This lays a solid spatial reference for subsequent Z-axis layered scanning and image fusion, as detailed below:

[0118] (2-1) Nanoscale position command issuance and closed-loop control

[0119] The FPGA serves as the logic control core, driving the XYθ three-axis precision motion platform to the target coordinates (X,Y,θ). The system uses a Renishaw high-precision grating ruler (measurement resolution 0.01μm) to construct a fully closed-loop position feedback system. By reading the grating ruler position data in real time and executing a dynamic error compensation algorithm, the absolute positioning accuracy of the system is calibrated to ±0.1μm, meeting the error tolerance requirements in industrial environments while maintaining high measurement resolution.

[0120] (2-2) Multi-dimensional closed-loop stability verification

[0121] The system implements a dual stability verification mechanism to ensure that the sample is in an absolutely static state before image acquisition, specifically:

[0122] Position closed-loop criterion: When the real-time positioning error fed back by the grating ruler is ≤ ±0.2μm and the stable holding time is ≥1ms, the system determines that the current pose has reached the stability standard;

[0123] Multi-source error comprehensive compensation: Simultaneously activate the comprehensive compensation model that integrates geometric, thermal and assembly deviations to correct three-dimensional spatial deviations caused by guide rail tilt, temperature drift and structural non-idealities in real time, ensuring sub-micron level stability of the bonding core area under repeated positioning.

[0124] The comprehensive compensation model is designed for typical error sources of the mechanical platform. Based on feedback data from a Renishaw grating ruler (0.01μm resolution), a high-precision temperature sensor, and pre-calibrated parameters, it calculates the total compensation in real time. The core compensation formula and parameter definitions are as follows:

[0125] The core compensation formula is:

[0126] ΔX = L × sinθ + α × L × ΔT + δ;

[0127] ΔY = W × sinθ + α × W × ΔT + δ;

[0128] The parameters are defined as follows:

[0129] ΔX and ΔY are the Abbe error compensation amounts for the X and Y axes, respectively (unit: μm);

[0130] L and W are the Abbe arm lengths of the X and Y axes, respectively (i.e., the vertical distance between the grating ruler reading head and the focal plane of the bonding target, with preset L=80mm and W=60mm).

[0131] θ is the guide rail tilt angle (sampled in real time by a level or dual grating differential, unit: rad, detection range ±0.0005rad).

[0132] α is the coefficient of linear expansion of the mechanical structure (11.8 × 10⁻⁶ /℃ for the material of this platform);

[0133] ΔT is the difference between the real-time temperature and the standard temperature (25℃) (sampled by a built-in high-precision sensor with an accuracy of ±0.1℃).

[0134] δ is the inherent deviation correction term (pre-calibrated fixed value of 0.02μm, used to compensate for assembly clearance and system zero deviation);

[0135] Explanation of compensation logic:

[0136] The first term, L×sinθ (or W×sinθ), is the classical Abbe error, which originates from the angular yaw coupling when there is an offset between the measurement reference and the target plane.

[0137] The second term, α×L×ΔT (or α×W×ΔT), represents the structural scale drift caused by thermal expansion, which is a thermally induced error.

[0138] The third term, δ, represents the systematic constant deviation, which is obtained through offline calibration.

[0139] The FPGA reads the position feedback of the grating ruler and the temperature sensor data in real time, calculates the sum of the above three items, obtains the total compensation amount ΔX and ΔY, and synchronously interpolates them into the target coordinate command of the motion controller to drive the actuator to perform real-time fine adjustment, ensuring that the repeatability of the bonding core area under complex working conditions is stable within ±0.1μm.
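
As a worked example of the three-term compensation, the following Python sketch evaluates ΔX and ΔY for one hypothetical operating point (θ = 1×10⁻⁶ rad, temperature 25.5℃). The mm-to-μm conversion of the arm length, the function name, and the sample inputs are assumptions for illustration; the constants α, δ, L, W, and the 25℃ reference come from the parameter definitions above.

```python
import math

ALPHA_PER_DEGC = 11.8e-6   # coefficient of linear expansion from the text
DELTA_UM = 0.02            # pre-calibrated inherent deviation term, in um
T_REF_DEGC = 25.0          # standard temperature from the text

def axis_compensation_um(arm_mm, theta_rad, temp_degc):
    """Return the total compensation for one axis in micrometres.

    arm_mm is the Abbe arm length (L or W); it is converted to um here,
    which is an assumption about how the units in the formula are reconciled.
    """
    arm_um = arm_mm * 1000.0
    abbe_term = arm_um * math.sin(theta_rad)
    thermal_term = ALPHA_PER_DEGC * arm_um * (temp_degc - T_REF_DEGC)
    return abbe_term + thermal_term + DELTA_UM

# Hypothetical operating point: 1 urad tilt, 0.5 degC above reference.
dx = axis_compensation_um(80.0, 1e-6, 25.5)  # approx. 0.080 + 0.472 + 0.020 = 0.572 um
dy = axis_compensation_um(60.0, 1e-6, 25.5)  # approx. 0.060 + 0.354 + 0.020 = 0.434 um
```

At this operating point the thermal term dominates, which suggests why the temperature sensor is sampled in real time rather than calibrated once.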

[0140] (2-3) Intelligent field calibration based on Hough transform

[0141] The FPGA outputs a TTL hard trigger signal to start the Basler Sprint SPL2592-17c industrial camera. The camera operates in full-resolution mode (2592×1944) and acquires images at a maximum frame rate of 17fps. The raw image is transmitted via the Camera Link Medium interface (effective bandwidth approximately 3.4Gbps) to the FMC-Camera Link daughter card on the Xilinx KCU105 FPGA development board. The FPGA's internal hard logic receives the 16-bit format image data in real time (based on 12-bit raw sampling) and performs preliminary processing (such as reflective area detection and Michelson contrast calculation). Simultaneously, it writes the raw image data directly to system memory via DMA over the PCIe 4.0 ×8 interface; the CPU calls the optimized circular Hough transform algorithm to accurately locate the center coordinates of the Ball Bond, calculates the coordinate offset from the center of the field of view, and feeds it back to the FPGA; the FPGA then drives the XY platform to perform sub-micron level fine-tuning (step size 0.1μm) until the target area is centered (offset ≤ ±5 pixels); in the subsequent Z-axis multi-focal scanning stage, the system enables the PartialScan function to transmit only the key ROI area to improve efficiency;

[0142] (2-4) Establishment of sample locking benchmark

[0143] The sample is precisely locked at the center of the imaging field of view, and the system then establishes a high-confidence positioning benchmark, which provides a solid spatial coordinate foundation for subsequent semantic partitioning extraction, Z-axis layered non-uniform scanning and multi-focal image acquisition processes.

[0144] Step 3, semantic prior-driven hierarchical non-uniform sampling and multi-focal acquisition: This step constructs a semantically perceptual-guided adaptive non-uniform sampling system. Spatial priors are acquired through GPU+FPGA collaborative inference, dynamically generating differentiated hierarchical sampling strategies (intensified key areas, sparse background areas). While maintaining the coverage density of the core structure, the total number of frames is compressed to 32. Simultaneously, dynamic light source closed-loop feedback and ROI narrowband transmission technology are integrated, achieving end-to-end collaborative optimization from optical imaging and mechanical motion to data transmission. The time consumed for a single Z-axis scan is strictly controlled within 160ms, providing high-quality multi-focal image sequences for high-speed detection, as detailed below:

[0145] (3-1) GPU+FPGA Collaborative ROI Intelligent Inference

[0146] After the field of view calibration is completed, the FPGA triggers the acquisition of one frame of coarse focusing reference image (by default located at the middle layer height) to ensure that the key semantic features (gold ball, neck) have basic outline visibility. After noise reduction and grayscale normalization preprocessing, the key region of interest (ROI, which in this system specifically refers to the core detection area containing the Ball Bond gold ball, neck and pads, 512×512 pixels) is cropped and transmitted to the GPU.

[0147] To ensure the robustness of the SegNet-Lite model segmentation, the system verifies the ROI reflectance R before transmission. If R ≥ 18%, the FPGA performs lightweight interpolation pre-repair to ensure that the input image meets the segmentation quality requirements.

[0148] The GPU calls the SegNet-Lite model and outputs a four-channel semantic mask S_sem(x,y), which provides a high-precision spatial prior for non-uniform sampling and multi-focus fusion along the Z-axis;

[0149] (3-2) Adaptive hierarchical non-uniform sampling strategy

[0150] The CPU calculates the coordinates of sampling points in each region based on material properties and semantic masks. After global deduplication and Z-axis sorting, the coordinates are fused to generate a unidirectional continuous scan trajectory containing 32 sampling points, which is then strictly executed by the FPGA.

[0151] Key areas (gold ball, neck): When the confidence level of S_sem(x,y) is ≥0.8, the coverage height is 50μm, and a dense 2μm step is used (25 sampling points).

[0152] Secondary area (pad): When the confidence level of S_sem(x,y) is ≥0.8, the coverage height is 100μm, and a standard step size of 20μm is used (5 sampling points).

[0153] Background area: When the confidence level of S_sem(x,y) is less than 0.8, the coverage height is 150μm, and the remaining intervals use a sparse step of 50μm (2 sampling points).

[0154] Total number of frames: strictly controlled within 32 frames, eliminating invalid reciprocating motion on the Z-axis through trajectory merging, maximizing scanning efficiency while ensuring fusion accuracy;
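
The merging of the three regional sampling plans into one unidirectional trajectory can be sketched as follows. The nesting of the three coverage ranges around a common focus height z0, the value z0 = 75μm, and the function names are hypothetical choices made so that the counts in the text (25 + 5 + 2) can be reproduced; the actual Z ranges are not given in the text.

```python
import numpy as np

def zone_points(z_low, height, step):
    """Sampling heights in [z_low, z_low + height) at a fixed step (um)."""
    return np.arange(z_low, z_low + height, step, dtype=float)

def build_scan_trajectory(z0=75.0):
    """Merge key, pad, and background sampling points into one sorted Z list."""
    key = zone_points(z0 - 25.0, 50.0, 2.0)     # 50 um coverage, 2 um step
    pad = zone_points(z0 - 50.0, 100.0, 20.0)   # 100 um coverage, 20 um step
    grid = np.arange(z0 - 75.0, z0 + 75.0 + 1e-9, 50.0)  # 150 um coverage, 50 um step
    # "Remaining intervals": keep only background points outside the pad coverage.
    background = grid[(grid < z0 - 50.0) | (grid > z0 + 50.0)]
    # np.unique performs the global deduplication and the ascending Z sort.
    return np.unique(np.concatenate([key, pad, background]))

trajectory = build_scan_trajectory()
```

Under these assumptions the trajectory contains 32 strictly increasing heights, so the Z-axis stage never has to reverse direction during the scan.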

[0155] (3-3) Dynamic light source adaptive closed loop and scanning execution

[0156] The FPGA has a built-in dynamic light source adaptation engine that uses a hardware pipeline to analyze the 16-bit image stream received by the Camera Link in real time and calculates Michelson contrast and highlight statistics online.

[0157] Pre-scan light source locking: Before Z-axis scanning starts, the system analyzes optical parameters based on the reference image and adopts a combined strategy of empirical initial adjustment + LUT fine adjustment, dynamically adjusting the blue light intensity of the RGBW light source and the weights of the 8 dark fields through the I²C/SPI bus; the dual buffer mechanism continuously monitors and compares the indicators, and only when the indicators are stable (reflection ratio ≈18%±1%, contrast ≥8%) is the light source confirmed to be ready; the parameters are then locked and applied to the subsequent full sequence acquisition, and the closed-loop adjustment time is ≤30ms.

[0158] Scanning execution: The Z-axis piezoelectric platform moves according to the above continuous trajectory. After each layer is reached, closed-loop verification is performed in combination with the feedback of the grating ruler (position error ≤ ±0.2μm, stabilization time ≥ 0.8ms); the system implements differentiated exposure, 25μs for the core area (high signal-to-noise ratio) and 20μs for the background area. The camera simultaneously enables ROI mode to transmit only valid data.
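
The light source readiness check (reflection ratio ≈18%±1%, contrast ≥8%) can be sketched in software as below. This is only an illustration of the two metrics; in the described system they are computed in the FPGA hardware pipeline. The use of the global image maximum and minimum for Michelson contrast, the reuse of the 62,000 threshold to count reflective pixels, and all function names are assumptions.

```python
import numpy as np

SATURATION_THRESHOLD = 62000  # 16-bit reflection suppression threshold from the text

def michelson_contrast(img):
    """(I_max - I_min) / (I_max + I_min), computed over the whole image."""
    i_max = float(img.max())
    i_min = float(img.min())
    if i_max + i_min == 0.0:
        return 0.0
    return (i_max - i_min) / (i_max + i_min)

def reflection_ratio(img, threshold=SATURATION_THRESHOLD):
    """Fraction of pixels at or above the reflection threshold."""
    return float((img >= threshold).mean())

def light_source_ready(img, target=0.18, tol=0.01, min_contrast=0.08):
    """True when both lock criteria from the text are met for this frame."""
    ratio_ok = abs(reflection_ratio(img) - target) <= tol
    contrast_ok = michelson_contrast(img) >= min_contrast
    return ratio_ok and contrast_ok

# Synthetic 100x100 frame with exactly 18% of pixels near saturation.
frame = np.full((100, 100), 1000, dtype=np.uint16)
frame.ravel()[:1800] = 65000
ready = light_source_ready(frame)
```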

[0159] (3-4) High-speed transmission and serialization storage

[0160] After enabling ROI mode, the data size per frame is reduced to 0.5MB. Transmitted via the Camera Link Medium interface, the time per frame is ≤0.6ms, and the total transmission time for 32 frames is ≤20ms. Images are written to memory via DMA and asynchronously cached to the NVMe SSD. The FPGA sorts the sequence by Z-axis coordinates and marks the region attributes to generate a standardized multifocal image sequence. The total time for a single Z-axis scan is ≤160ms.

[0161] Step 4, FPGA hardware-accelerated preprocessing and dual-branch reflection restoration: This step constructs an adaptive reflection restoration and high-concurrency pipeline system based on heterogeneous computing. It integrates real-time FPGA detection with GPU temporal tracking to generate dynamic and accurate masks, and combines material-adaptive filtering algorithms to optimize the basic image quality. The system intelligently schedules FPGA interpolation or GPU depth generation for dual-branch restoration paths based on the degree of reflection, ensuring microsecond-level response speed while maintaining high fidelity in texture reconstruction. Furthermore, it utilizes a fully pipelined parallel design to completely eliminate data transfer bottlenecks, ensuring the efficiency and robustness of the detection process, as detailed below:

[0162] (4-1) Dynamic mask generation for high-reflectivity areas

[0163] The FPGA utilizes hardware logic to parse and optimize the image stream in real time, generating an initial specular mask M0(x,y) based on a dynamic thresholding method. Simultaneously, it reads the specular region mask M1(x,y) obtained by binarizing the feature map of the intermediate layer of the MetalGAN generator, transmitted via the PCIe interface. The system fuses the two through pixel-level logic operations to construct an accurate binary mask M(x,y). This real-time detection + temporal tracking fusion strategy utilizes the correlation between adjacent frames, effectively solving the flickering problem that may occur in single-frame detection. Pure hardware logic ensures that the end-to-end latency of mask generation and update is ≤1ms.

[0164] (4-2) Material-adaptive brightness normalization and filtering

[0165] For the non-reflective area (M=0) marked by the mask, the FPGA performs adaptive gamma correction based on the material by looking up a table to compensate for the reflectivity difference of the gold material (gold wire γ=1.0) and achieve brightness normalization.

[0166] Noise reduction: The FPGA uses a LineBuffer pipeline architecture to implement 5×5 bilateral filtering;

[0167] Parameter configuration: The spatial standard deviation σ_s = 1.5 and the grayscale standard deviation σ_r = 20 are obtained by optimizing for the texture characteristics of wirebond bonding lines in gold materials; this configuration effectively filters out high-frequency sensor noise while preserving the edge texture information of the gold ball and neck to the greatest extent, with a single frame processing time of ≤3ms;
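
For reference, a plain NumPy version of a 5×5 bilateral filter with σ_s = 1.5 and σ_r = 20 is sketched below. It is a software illustration only and does not model the LineBuffer pipeline of the FPGA; the edge-replication padding, the function name, and the assumption that σ_r is expressed in the same grayscale units as the input are illustrative choices.

```python
import numpy as np

def bilateral_5x5(img, sigma_s=1.5, sigma_r=20.0):
    """5x5 bilateral filter: spatial Gaussian times range (intensity) Gaussian."""
    img = img.astype(np.float64)
    h, w = img.shape
    padded = np.pad(img, 2, mode="edge")
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            shifted = padded[2 + dy:2 + dy + h, 2 + dx:2 + dx + w]
            w_spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            w_range = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_spatial * w_range
            num += weight * shifted
            den += weight
    return num / den  # den >= 1 because the centre tap always has weight 1

# A 200-level step edge survives filtering almost unchanged.
step = np.zeros((8, 8))
step[:, 4:] = 200.0
filtered_step = bilateral_5x5(step)
```

Because the range weight collapses for neighbours that differ strongly from the centre pixel, smoothing is applied within flat regions but not across the gold ball or neck contours.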

[0168] (4-3) FPGA-GPU dual-branch collaborative repair (ROI)

[0169] The FPGA calculates the percentage R of reflective pixels in real time and intelligently schedules repair paths based on a preset threshold R0=18%.

[0170] Lightweight branch (R<18%): For small-area highlights, the FPGA directly uses a 3×3 neighborhood bidirectional weighted interpolation algorithm to fill pixels without the need for GPU participation, and the time is ≤1ms.

[0171] Deep branch (R≥18%): For large-area reflections, the FPGA triggers the GPU-side MetalGAN model via PCIe DMA. The original ROI image containing the reflection and the semantic mask S_sem(x,y) are concatenated into a dual-channel input, which serves as the conditional input for MetalGAN to generate a repaired image, effectively restoring the 15μm-level bonded texture that was masked by overexposure.
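
The threshold-based scheduling between the two branches, together with a simple stand-in for the lightweight fill, can be sketched as follows. The branch labels and function names are hypothetical, and the inverse-distance weighting over the 3×3 neighbourhood is a substitute chosen for illustration; the text does not define the exact form of the bidirectional weighted interpolation.

```python
import numpy as np

R0 = 0.18  # preset scheduling threshold from the text

def select_repair_branch(mask):
    """Return (branch_label, R), where R is the fraction of reflective pixels."""
    r = float(mask.mean())
    return ("fpga_interpolation" if r < R0 else "gpu_metalgan"), r

def interpolate_3x3(img, mask):
    """Fill masked pixels with an inverse-distance mean of unmasked 3x3 neighbours."""
    img = img.astype(np.float64)
    h, w = img.shape
    valid = (~mask).astype(np.float64)
    p_img = np.pad(img * valid, 1)   # zero padding: out-of-image taps carry no weight
    p_val = np.pad(valid, 1)
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            weight = 1.0 / np.hypot(dx, dy)
            num += weight * p_img[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            den += weight * p_val[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    out = img.copy()
    fill = mask & (den > 0)          # pixels with no valid neighbour stay untouched
    out[fill] = num[fill] / den[fill]
    return out

roi = np.full((5, 5), 100.0)
roi[2, 2] = 65535.0                  # one saturated pixel
roi_mask = roi >= 62000
branch, ratio = select_repair_branch(roi_mask)
repaired = interpolate_3x3(roi, roi_mask)
```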

[0172] (4-4) Efficient data flow and pipeline parallelism

[0173] The repaired image is written to system memory at high speed via PCIe 4.0 DMA; the CPU uses asynchronous I/O to persist the image to the NVMe SSD, while the GPU asynchronously initiates an image loading command, directly triggering the subsequent registration process; through this pipeline design that overlaps computation and transmission, the waiting time caused by data transfer and storage is completely eliminated, maximizing system throughput;

[0174] Step 5, Hybrid Image Registration: This step constructs a hierarchical, progressive image registration mechanism based on CPU-GPU heterogeneous collaboration. First, the CPU's AVX-512 instruction set accelerates ORB feature extraction and RANSAC coarse matching, quickly achieving macroscopic alignment of image sequences. Then, based on the CUDA parallel computing framework, the pyramid Lucas-Kanade optical flow method is used to perform sub-pixel-level fine registration, accurately eliminating mechanical vibration and calibration errors. Zero-copy memory technology maintains high-speed data flow. Under the strict constraint of a total time ≤70ms, high-performance alignment with a registration accuracy better than 0.1 pixels is achieved, providing seamless input data for subsequent fusion. Details are as follows:

[0175] (5-1) Robust coarse registration accelerated by AVX-512 SIMD

[0176] The CPU utilizes the Single Instruction Multiple Data (SIMD) parallel capabilities integrated into the AVX-512 instruction set to accelerate the execution of the ORB (Oriented FAST and Rotated BRIEF) feature extraction algorithm.

[0177] Feature extraction: Extract FAST corner points from the reference frame and 32 frames to be registered (detection threshold 20, maximum number of corner points per frame limited to 500), and generate 256-bit BRIEF descriptors;

[0178] Feature matching and model estimation: Fast matching of binary descriptors is performed using Hamming distance (matching threshold 50), and the RANSAC (random sample consensus) algorithm is used to remove mismatched points (inlier ratio requirement ≥80%). The initial affine transformation matrix or homography matrix is then solved.

[0179] Performance metrics: The overall coarse registration time for 32 frames of images is controlled within ≤40ms (equivalent to ≤1.2ms per frame), completing the macroscopic alignment of the image sequence;
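
The Hamming-distance matching of 256-bit binary descriptors can be illustrated with a brute-force NumPy sketch. It assumes that the matching threshold of 50 is an upper bound on the accepted Hamming distance and that each descriptor is stored as 32 bytes; the function name and the nearest-neighbour strategy are illustrative, and no AVX-512 acceleration is modelled.

```python
import numpy as np

def hamming_match(desc_a, desc_b, max_dist=50):
    """Match 256-bit descriptors stored as uint8 arrays of shape (N, 32).

    Returns a list of (index_a, index_b, distance) for nearest neighbours
    whose Hamming distance does not exceed max_dist.
    """
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)      # popcount of the XOR
    best = dist.argmin(axis=1)
    best_dist = dist[np.arange(len(desc_a)), best]
    keep = np.nonzero(best_dist <= max_dist)[0]
    return [(int(i), int(best[i]), int(best_dist[i])) for i in keep]

# Synthetic check: frame B holds the same descriptors in reverse order,
# with a single bit flipped in one of them.
rng = np.random.default_rng(0)
desc_ref = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)
desc_new = desc_ref[::-1].copy()
desc_new[0, 0] ^= 1
matches = hamming_match(desc_ref, desc_new)
```

The surviving pairs would then be passed to RANSAC, which rejects the remaining mismatches before the transformation matrix is estimated.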

[0180] (5-2) CUDA-accelerated subpixel fine registration

[0181] Based on coarse alignment, the GPU uses the Lucas-Kanade sparse optical flow method to perform batch fine registration, further eliminating tiny offsets caused by mechanical vibration or calibration errors.

[0182] Pyramid optical flow: Construct a 3-layer Gaussian pyramid and calculate the optical flow field within a 15×15 tracking window to avoid tracking failure caused by large displacements;

[0183] Subpixel correction: Subpixel-level displacement is solved by iterative gradient descent to ensure that the registration error is ≤0.1 pixels;

[0184] Performance metrics: Utilizing CUDA's massive thread parallel computing, the overall fine registration time for 32 frames is controlled within ≤30ms (equivalent to ≤0.9ms per frame); after adding coarse registration, the total time is ≤70ms, strictly adhering to the overall workflow time budget;

[0185] (5-3) Zero-copy data handover of video memory

[0186] After fine registration is completed, the aligned high-precision image sequence is directly stored in the GPU memory as the input source for the subsequent step 6 (sharpness evaluation). This process completely eliminates the overhead of large-scale image data backhaul (PCIe copy) between the GPU and the CPU, significantly reducing system latency and improving overall throughput.

[0187] Step 6, Improved MS-SML Sharpness Evaluation: This step constructs an improved MS-SML sharpness evaluation system that integrates semantic prior and gradient adaptive suppression mechanisms. Utilizing the parallel computing power of GPUs, it extracts texture detail features from images based on multi-scale Laplacian energy; it effectively suppresses specular artifacts through dynamic gradient weights W(x,y), and combines this with the semantic segmentation mask S_sem(x,y) to focus the evaluation calculation on the bonding core region (gold ball, neck). It completes the entire computation and zero-copy data transfer within GPU memory, achieving high-precision, interference-resistant real-time quantitative evaluation of image focus quality under complex lighting and material variations. This provides accurate weight maps for multi-focal image fusion, as detailed below:

[0188] (6-1) Parallel extraction of Laplace energy across multiple scales

[0189] The GPU, based on the CUDA 12.1 platform, constructs a multi-scale computing pipeline, defining a scale set S = {1, 2, 4, 8, 16}. A 3×3 Laplacian kernel is used to perform convolution operations on each preprocessed frame of the image, and the multi-scale response L_i^s(x,y) is computed in parallel; by leveraging the massively parallel computing capabilities of the RTX 4090, the time required for multi-scale feature extraction in a single frame is strictly controlled to within ≤1.6ms, providing basic data for subsequent real-time evaluation;

[0190] (6-2) Gradient-adaptive reflection suppression weights

[0191] Based on the high-reflectivity region mask M(x,y) generated in step 4 and the image gradient information, a spatially adaptive suppression weight W(x,y) = 1 / [1 + α × |∇I_i″(x,y)|] is dynamically generated;

[0192] where |∇I_i″(x,y)| is the magnitude of the second-order gradient calculated by the Sobel operator (or the gradient intensity of the preprocessed image), which characterizes the degree of texture change of the pixel; α is the material adaptation coefficient (e.g., 0.5 for gold wire);

[0193] This mechanism effectively suppresses the artificially high sharpness response caused by highlight artifacts by reducing the weight of high gradient regions (usually overexposed reflections or noise).
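
The suppression weight can be sketched in NumPy as below, using the gradient intensity of the preprocessed image (the alternative named in the text) computed with 3×3 Sobel kernels. The edge-replication padding and the function names are assumptions; α = 0.5 is the gold wire value from the text.

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels, with edge-replicated borders."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    gx = (p[:-2, 2:] + 2.0 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2.0 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2.0 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2.0 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def suppression_weight(img, alpha=0.5):
    """W(x,y) = 1 / (1 + alpha * |grad I(x,y)|); equals 1 in flat regions."""
    return 1.0 / (1.0 + alpha * sobel_magnitude(img))

# A sharp 100-level step receives a much smaller weight than the flat area.
step_img = np.zeros((8, 8))
step_img[:, 4:] = 100.0
w_map = suppression_weight(step_img)
```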

[0194] (6-3) Semantic prior correction and fusion evaluation

[0195] The GPU invokes the SegNet-Lite semantic segmentation model to perform semantic parsing on the single-frame image restored by MetalGAN, outputting a high-confidence mask S_sem(x,y) that provides regional weights for sharpness evaluation;

[0196] By incorporating the aforementioned weights into the MS-SML (Modified Sum of Modified Laplacians) framework, a pixel-level sharpness evaluation model is constructed:

[0197] C_i(x,y) = Σ_{s∈S} β_s × W(x,y) × |L_i^s(x,y)| × S_sem(x,y);

[0198] Specifically, for the texture characteristics of wirebond bonding lines in gold materials, the multi-scale weighting coefficients are set to β_s = [0.4, 0.3, 0.15, 0.1, 0.05], with weights decreasing as the scale increases to emphasize sensitivity to fine detail;

[0199] The model generates N sharpness maps (one per frame) in GPU memory, where they remain resident for zero-copy use by the subsequent fusion module. The model size is controlled to be ≤4MB, and the intersection over union (IoU) on the validation set is ≥0.92, ensuring the high accuracy and real-time performance of the evaluation algorithm.
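
A CPU-side sketch of the sharpness evaluation model is given below, assuming that each scale s in S is realized as a 4-neighbour Laplacian whose taps lie s pixels from the centre (a dilated 3×3 kernel). That interpretation of "scale", the edge-replication padding, and the function names are assumptions; the scale set and the β coefficients come from the text.

```python
import numpy as np

SCALES = (1, 2, 4, 8, 16)
BETAS = (0.4, 0.3, 0.15, 0.1, 0.05)

def laplacian_at_scale(img, s):
    """4-neighbour Laplacian response with taps s pixels away from the centre."""
    h, w = img.shape
    p = np.pad(img, s, mode="edge")
    centre = p[s:s + h, s:s + w]
    up, down = p[0:h, s:s + w], p[2 * s:2 * s + h, s:s + w]
    left, right = p[s:s + h, 0:w], p[s:s + h, 2 * s:2 * s + w]
    return up + down + left + right - 4.0 * centre

def sharpness_map(img, w_sup, s_sem):
    """C(x,y) = sum over s of beta_s * W * |L^s| * S_sem for one frame.

    W and S_sem do not depend on s, so they are applied once after the sum.
    """
    img = img.astype(np.float64)
    energy = np.zeros_like(img)
    for beta, s in zip(BETAS, SCALES):
        energy += beta * np.abs(laplacian_at_scale(img, s))
    return energy * w_sup * s_sem

# Vertical step edge at column 32 of a 64x64 frame.
edge_img = np.zeros((64, 64))
edge_img[:, 32:] = 1.0
ones = np.ones_like(edge_img)
c_map = sharpness_map(edge_img, ones, ones)
```

In this example the response is concentrated around the edge and vanishes in the flat region, and setting S_sem to zero outside the bonding core removes any contribution from the background.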

[0200] Step 7, Triple Weight Map Construction and High-Low Frequency Separation and Fusion: This step constructs an adaptive image fusion architecture based on triple weight decision-making and multi-scale frequency separation. The system first fuses edge guidance, semantic prior, and spatial attention information to construct a robust pixel-level decision weight map. Then, it employs the CTCFuse multi-scale pyramid strategy to decompose the image into low-frequency structural components and high-frequency detail components. Differential fusion rules are implemented for different frequencies: low-frequency components are weighted by triple weights to ensure background smoothness and structural fidelity; high-frequency components introduce gradient adaptive factors to dynamically sharpen bonded textures and suppress background noise. Finally, a high-definition fused image with no ghosting, full focus, and enhanced details is reconstructed, as detailed below:

[0201] (7-1) To ensure the structural integrity and semantic accuracy of the fused image, the system constructs a triple weighting system based on edge + semantics + attention, specifically as follows:

[0202] Edge-guided weight D_i(x,y): The edges of each frame image are extracted using the Canny operator, and the cosine of the angle between the edge and the corresponding pixel edge of the reference frame is calculated. The threshold is set to 30°. When the angle is less than the threshold, a high weight is given, and the weight is reduced otherwise. This mechanism effectively eliminates geometrically inconsistent edges caused by parallax or micro-motion, ensuring the geometric consistency of the fused image.

[0203] Semantic prior weight S′_sem(x,y): Reuse the semantic mask generated in step 6 and combine it with dynamic light source contrast data for adaptive enhancement; this weight focuses on strengthening the contribution of high-confidence regions (high contrast, high confidence) such as the gold ball and neck, and suppressing the influence of blurred or low-confidence regions.

[0204] Spatial attention weight A(x,y): Deploy a lightweight CBAM (Convolutional Block Attention Module) to generate a spatial attention map, dynamically focusing on salient feature regions in the image and suppressing unstructured background noise;

[0205] Weight fusion and normalization: The three weights above are combined with the sharpness map C_i(x,y) output from step 6 by weighted fusion and normalized in the spatial dimension to generate the final decision weight map; the entire process takes ≤25ms.
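
One possible reading of the weight fusion and normalization step is sketched below: the edge, semantic, and attention weights are multiplied with the per-frame sharpness map, and the result is normalized so that the weights of all frames sum to 1 at each pixel. The multiplicative combination, the per-pixel normalization across frames, the uniform fallback where all weights vanish, and the function names are assumptions; the text does not specify the exact combination rule.

```python
import numpy as np

def decision_weights(d, s_sem, a, c, eps=1e-12):
    """Combine weights into per-frame decision maps.

    d, c:      (N, H, W) edge-guided weights and sharpness maps per frame.
    s_sem, a:  (H, W) semantic prior and spatial attention maps.
    Returns (N, H, W) weights summing to 1 over the frame axis at each pixel.
    """
    raw = d * s_sem[None] * a[None] * c
    total = raw.sum(axis=0, keepdims=True)
    uniform = np.full_like(raw, 1.0 / raw.shape[0])   # fallback where all weights are 0
    return np.where(total > eps, raw / np.maximum(total, eps), uniform)

def fuse(frames, weights):
    """Weighted average of the frame stack, as used for the low-frequency components."""
    return (weights * frames).sum(axis=0)

# Two frames: frame 0 is sharp on the left half, frame 1 on the right half.
frames = np.stack([np.full((2, 4), 10.0), np.full((2, 4), 20.0)])
sharp = np.zeros((2, 2, 4))
sharp[0, :, :2] = 1.0
sharp[1, :, 2:] = 1.0
flat_w = np.ones((2, 4))
weights = decision_weights(np.ones((2, 2, 4)), flat_w, flat_w, sharp)
fused = fuse(frames, weights)
```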

[0206] (7-2) High- and low-frequency pyramid fusion based on the CTCFuse architecture, that is, adopting a multi-scale pyramid decomposition strategy to implement differentiated fusion for different frequency components, specifically:

[0207] Pyramid decomposition: Construct a 5-layer Gaussian pyramid (Gaussian kernel standard deviation σ=1.0) to decompose the original image into low-frequency components (layers 1-2) representing large-scale structures and high-frequency components (layers 3-5) representing texture details.

[0208] Low-frequency fusion (structure fidelity): For low-frequency components, a weighted average fusion strategy based on triple weights is adopted to ensure that the background of the fused image is smooth and the lighting transition is natural, avoiding block artifacts;

[0209] High-frequency fusion (detail enhancement): For high-frequency components, a spatial attention weight A(x,y) and a gradient adaptive factor λ(x,y)=exp(−|∇L| / 5.0) are introduced; where the gradient factor |∇L| is the magnitude of the Laplacian gradient; in areas with rich texture (large gradient), the value of λ decreases to reduce smoothing operations and sharpen edges; in flat areas (small gradient), the value of λ increases to enhance smoothing and suppress noise;

[0210] This strategy dynamically enhances the detail representation of key microstructures such as bond necks and cracks.
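The gradient adaptive factor λ(x,y)=exp(−|∇L| / 5.0) can be computed as in the following sketch. The 4-neighbour discrete Laplacian with replicated borders is an assumption, as are the function names; the source does not specify the discretization or how λ is subsequently applied to the high-frequency coefficients, so only λ itself is shown.

```python
import numpy as np

def laplacian(img):
    """4-neighbour discrete Laplacian with edge replication (assumed stencil)."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def gradient_adaptive_factor(img, scale=5.0):
    """lambda(x, y) = exp(-|Laplacian| / scale), with scale = 5.0 as in the text."""
    return np.exp(-np.abs(laplacian(img)) / scale)

# Usage: a flat image gives lambda = 1 everywhere (maximum smoothing),
# while an isolated bright pixel drives lambda towards 0 at that pixel.
flat = np.full((5, 5), 7.0)
spike = np.zeros((5, 5))
spike[2, 2] = 10.0
lam_flat = gradient_adaptive_factor(flat)
lam_spike = gradient_adaptive_factor(spike)
```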

[0211] Image reconstruction: Upsampling is performed layer by layer from the top of the pyramid, and each upsampled image is added to the corresponding Laplacian residual to reconstruct a high-definition, artifact-free, fully focused image; the time for a single fusion is ≤20ms, the structural similarity (SSIM) of the fused image is ≥0.98, and the SSIM of the core region (gold ball, neck) can be optimized to above 0.99. The specific value depends on the imaging optical parameters (such as the objective lens NA value and the light source brightness parameters);
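The decomposition and layer-by-layer reconstruction can be sketched as follows, using the 5 layers and σ=1.0 stated above. This is an illustration under assumptions rather than the CTCFuse implementation: the nearest-neighbour upsampling, the 3σ kernel radius, and all function names are choices made for this sketch, and no fusion of the residuals is performed.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur with replicated borders (3-sigma radius assumed)."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()
    p = np.pad(img, radius, mode="edge")
    # Horizontal pass, then vertical pass.
    tmp = sum(k[i] * p[:, i:i + img.shape[1]] for i in range(k.size))
    return sum(k[i] * tmp[i:i + img.shape[0], :] for i in range(k.size))

def upsample(img, shape):
    """Nearest-neighbour 2x upsampling, cropped to the target shape."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def build_pyramid(img, levels=5, sigma=1.0):
    """Return (Laplacian residuals from fine to coarse, coarsest Gaussian layer)."""
    gauss = [img.astype(np.float64)]
    for _ in range(levels - 1):
        gauss.append(gaussian_blur(gauss[-1], sigma)[::2, ::2])
    residuals = [g - upsample(gauss[i + 1], g.shape)
                 for i, g in enumerate(gauss[:-1])]
    return residuals, gauss[-1]

def reconstruct(residuals, top):
    """Upsample from the top of the pyramid and add each residual back."""
    out = top
    for residual in reversed(residuals):
        out = upsample(out, residual.shape) + residual
    return out

# Usage: without any fusion step, reconstruction returns the input image.
rng = np.random.default_rng(0)
image = rng.random((37, 50))
residuals, top = build_pyramid(image)
restored = reconstruct(residuals, top)
```

Because each residual is defined against the same upsampling operator used during reconstruction, the round trip is exact up to floating-point error; any fusion rule applied to the residuals is what changes the result.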

[0212] Step 8, Post-processing, Result Output, and System Closed Loop: This step completes the final closed loop and multi-dimensional output of intelligent detection. Through a CPU-GPU heterogeneous collaborative strategy, global CLAHE enhancement and attention map-based local sharpening are performed, improving image readability and defect identification. The system constructs a multidimensional dataset in parallel, including a panoramic depth fusion image, a 3D height map, and an attention map, which is then transferred at high speed via DMA and Ethernet to local storage and the MES system. Finally, based on the comprehensive judgment results, sorting execution and hardware reset are driven. Through a highly parallel pipeline design, the entire process time for a single sample is kept to ≤190ms, corresponding to a throughput of >5.2 pieces / second that meets the cycle time requirement of the high-speed production line, as detailed below:

[0213] (8-1) Local detail enhancement through heterogeneous collaboration, that is, using a CPU and GPU heterogeneous collaboration strategy to achieve targeted enhancement, specifically:

[0214] CPU Global Enhancement: The CPU calls an optimized Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm (8×8 tile grid, contrast limit coefficient Clip=2.0) to process the entire panoramic depth fusion image, significantly improving the local contrast of microstructures such as gold balls and pads;

[0215] GPU Precise Sharpening: The GPU uses the spatial attention map A(x,y) generated in the previous step as a priori guide to perform Laplacian sharpening on suspected defect areas (such as neck microcracks and deformation points) with A(x,y)≥0.7. This strategy effectively suppresses background noise in non-interested areas while enhancing the identification of key defect features, and the overall post-processing time is kept at an extremely low level.
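A CPU-side sketch of the attention-gated sharpening is given below. The threshold 0.7 comes from the text; the 4-neighbour Laplacian stencil, the `strength` parameter, and the function name are assumptions, and the output is left unclipped for clarity.

```python
import numpy as np

def attention_guided_sharpen(img, attention, thresh=0.7, strength=1.0):
    """Laplacian sharpening applied only where attention >= thresh.

    `strength` is a hypothetical gain; the result is not clipped to the
    original intensity range.
    """
    img = img.astype(np.float64)
    p = np.pad(img, 1, mode="edge")
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
           - 4.0 * p[1:-1, 1:-1])
    sharpened = img - strength * lap
    # Pixels below the attention threshold are passed through unchanged.
    return np.where(attention >= thresh, sharpened, img)

# Usage: a vertical step edge, with high attention on the left half only.
step = np.zeros((4, 6))
step[:, 3:] = 10.0
attn = np.full((4, 6), 0.5)
attn[:, :3] = 0.9
out = attention_guided_sharpen(step, attn)
```

Gating by the attention map means that noise in low-attention background regions is never amplified, which matches the stated aim of suppressing noise outside the regions of interest.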

[0216] (8-2) Multidimensional data storage and interaction

[0217] The system generates and outputs multi-dimensional detection data in parallel to meet measurement, archiving, and traceability requirements.

[0218] Panoramic depth fusion image: Outputs 8-bit (for display) and 16-bit (for analysis) grayscale images, showing the complete bonding morphology without blurring;

[0219] Local height map H(x,y): Based on the sharpness evaluation results, depth information is calculated pixel by pixel according to the formula H(x,y) = Z_0 + argmax_i{C_i(x,y)} × ΔZ, and the 3D morphology of the bonding points is reconstructed;
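The height-map formula amounts to a per-pixel argmax over the sharpness stack, as in this minimal sketch. The function name, the array layout (frames along the first axis), and the sample values of Z_0 and ΔZ are assumptions made for illustration.

```python
import numpy as np

def height_map(sharpness_stack, z0, dz):
    """H(x, y) = z0 + argmax_i C_i(x, y) * dz.

    sharpness_stack: array of shape (N, H, W) holding C_i(x, y) per frame
    z0:              Z position of the first frame
    dz:              uniform Z step between frames, as the formula implies
    """
    best_frame = np.argmax(sharpness_stack, axis=0)
    return z0 + best_frame * dz

# Usage: three frames, each sharpest at different pixels.
stack = np.zeros((3, 2, 2))
stack[0, 0, 0] = 1.0
stack[1, 0, 1] = 1.0
stack[2, 1, 0] = 1.0
stack[2, 1, 1] = 1.0
heights = height_map(stack, z0=100.0, dz=2.0)
```

The formula assumes a uniform step ΔZ. With the non-uniform layered sampling described elsewhere in the document, a lookup of the recorded Z position of each frame, indexed by the same argmax, would presumably take the place of the product; the source does not spell this out.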

[0220] Spatial attention map: Outputs a weighted map containing regions of interest in defects, assisting in manual review or algorithm debugging;

[0221] The above data is written to the NVMe SSD at high speed via DMA for local persistent storage, and simultaneously uploaded to the Manufacturing Execution System (MES) in real time via a gigabit Ethernet interface, completing the data interaction loop;

[0222] (8-3) Intelligent sorting trigger and system quick reset

[0223] Sorting decision: The CPU integrates image features and 3D dimensional measurement data (such as ball diameter, neck thickness, and height difference) to generate a high-confidence defect judgment report;

[0224] Execution and Reset: The MES system drives the sorting mechanism to execute NG / OK actions in real time based on the report; at the same time, the FPGA controls the motion platform and Z-axis module to reset to the initial state at high speed.

[0225] Cycle time statistics: The entire process time of a single sample, from positioning and scanning through fusion to output, is held to ≤190ms, corresponding to a throughput of >5.2 pieces / second. This meets the cycle time requirement of a high-speed production line running at 5 pieces / second (200ms per piece) and leaves a 10ms margin to absorb system scheduling fluctuations, ensuring long-term stable operation of the production line.
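The cycle-time figures can be checked with a short calculation. The function name is hypothetical; the inputs are the 190ms cycle time and the 5 pieces / second line rate stated above.

```python
def cycle_budget(cycle_ms, target_rate_per_s):
    """Return (throughput in pieces per second, margin in ms against the target)."""
    budget_ms = 1000.0 / target_rate_per_s   # time available per piece
    throughput = 1000.0 / cycle_ms           # pieces per second achieved
    return throughput, budget_ms - cycle_ms

throughput, margin_ms = cycle_budget(190.0, 5.0)
# throughput is about 5.26 pieces per second; margin_ms is 10.0
```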

[0226] Example 3:

[0227] This experiment constructs a multi-dimensional verification system based on full-scale working condition simulation. Relying on an industrial-grade heterogeneous hardware platform, it conducts systematic quantitative evaluation of gold materials and typical defect samples. The experiment is carried out from two dimensions: fusion accuracy and robustness under complex working conditions, verifying the effectiveness of semantic prior-driven hierarchical sampling and triple-weighted fusion mechanism.

[0228] I. Basic Experimental Setup

[0229] Experimental objective: To verify the accuracy, processing efficiency, and adaptability to complex working conditions of the multifocal image fusion method provided in Example 2 in the Wirebond packaging bonding process, to quantify the gain effect of semantic prior-driven hierarchical sampling and triple-weighted fusion mechanism on detection performance, and to provide reproducible and verifiable solid data support for the industrial application of the technical solution.

[0230] Experimental Hardware Platform: The experiment adopts the heterogeneous computing architecture described in Example 1. The core hardware configuration is as follows: the CPU is an Intel i9-13900K, the FPGA is a Xilinx KCU105, the GPU is an NVIDIA RTX 4090, and the platform is equipped with a Basler sprint spL2592-17c camera, a Renishaw grating scale (resolution 0.01μm), and a high-precision Z-axis piezoelectric platform; the interfaces are Camera Link Medium (bandwidth 4.0Gbps), PCIe 4.0, and an EtherCAT bus. This configuration matches the actual hardware of the mass production line, so that the experimental results have direct engineering reference value.

[0231] Experimental Samples and Operating Conditions: The sample consists of 50 gold-plated Wirebond devices, covering normal bonding components (ball diameter 200-300μm, neck width 50-80μm) and typical defective components (neck shrinkage, microcracks, overexposure, defect size 5-20μm), comprehensively covering typical operating conditions of the production line.

[0232] Experimental conditions: Simulate the normal temperature environment (25℃±2℃) of the production line, normal lighting and strong light reflection interference scene (reflection ratio 10%-25%), and truly replicate the complex optical conditions of the production site.

[0233] II. Fusion Accuracy Testing Experiment and Technical Effect Verification

[0234] Test method: Using a standard part calibrated by a laser interferometer (calibrated accuracy ±0.1μm) as a reference, the samples were subjected to multifocal image acquisition and fusion processing using the method provided in Example 2. The bonding ball diameter and neck width in the fused image were measured using Image-Pro Plus software, and the relative error was calculated. The recognition rate for micro-defects ≥5μm was also recorded, and the results were compared with those of the traditional single-focal imaging method and of a fusion method without semantic partitioning, to quantitatively evaluate the accuracy improvement of the method provided in Example 2.

[0235] The experimental results are as follows:

[0236] 1) Dimensional measurement accuracy: The relative error of the ball diameter of the gold bonding parts is 0.82%, and the relative error of the neck width is 0.91%, both strictly controlled within 1.0%, which fully meets the high-precision measurement index requirement of ±2μm of the production line;

[0237] 2) Defect identification performance: The typical defect identification rate of gold bonding parts is ≥98% (F1-score), and can reach 98.6% in the test of 50 exemplary samples, which is a significant improvement over the average identification rate of traditional methods (89.3%).

[0238] 3) Registration accuracy: After hybrid registration of 32 frames of images, the pixel offset is ≤0.08 pixels, which is better than the preset design threshold (≤0.1 pixels), providing a solid foundation for high-precision fusion;

[0239] 4) Fusion quality: The overall structural similarity (SSIM) of the fused image is ≥0.98. For core areas such as the gold ball and neck, SSIM can be optimized to above 0.99. The specific value is controlled by the imaging optical parameters to ensure high-fidelity reproduction of the core structural texture.

[0240] Conclusion: The significant improvement in accuracy performance is mainly attributed to the synergistic effect of two core designs. First, the semantic prior-driven hierarchical sampling strategy effectively reduces the interference of non-critical area data redundancy on fusion accuracy by selectively acquiring core area images in 32 frames. Second, the deep integration of the triple-weighted fusion framework and the dual-branch reflective repair mechanism strongly suppresses strong light reflective interference and ensures the accurate capture of micro-defect features and key size parameters.

[0241] III. Robustness Testing and Technical Effect Verification under Complex Working Conditions

[0242] Test method: A reflection interference condition with a reflection ratio of 15%-25% (covering the GPU repair branch trigger threshold) was set to verify the system robustness. 50 samples were processed, and the system accuracy stability and algorithm fault tolerance were statistically analyzed.

[0243] The experimental results show that when the reflection ratio reaches 25%, the relative error of size measurement is still ≤1.5%, the defect recognition rate is ≥95.3%, and the edge artifact elimination rate is over 92%. The dual-branch reflection repair mechanism effectively avoids the accuracy degradation caused by reflections.

[0244] Conclusion: The synergistic effect of three major technologies—dynamic light source adaptation, high-reflectivity area mask generation, and hardware-accelerated preprocessing—builds a highly robust anti-interference system. This system can adapt to the complex operating conditions of the packaging mass production line and is compatible with the complex operating condition detection of gold Wirebond devices without manual intervention to adjust parameters, fully meeting the core requirements of continuous and stable industrial-grade operation.

Claims

1. A multi-focus image fusion system for semiconductor packaging bonding processes, characterized in that it includes a mechanical platform and motion control module, a Z-axis precision drive module, a dynamic light source adaptation system, a microscopic imaging module, and an image processing and control platform. The image processing and control platform includes a central processing unit (CPU), a graphics processing unit (GPU), and a field-programmable gate array (FPGA) module. The FPGA module is connected to the CPU and GPU via a high-speed bus for real-time control and data interaction. The image processing and control platform is configured to perform the following operations: The GPU is used to run a semantic segmentation model to generate a semantic mask containing the gold ball, neck, and pad regions based on the acquired reference image. Based on the semantic mask, a non-uniform hierarchical sampling strategy is generated, and the Z-axis precision drive module is controlled to perform dense sampling in the key area using a first step interval, and sparse sampling in the background area using a second step interval greater than the first step interval. During the layered scanning process, the FPGA is used to calculate the proportion of reflective areas in the image in real time, and the dynamic light source adaptation system is controlled to adjust the brightness according to the calculation results in order to generate a multi-focal image sequence. The multi-focal image sequence is subjected to reflection repair, image registration, and fusion processing. The reflection repair includes intelligently scheduling FPGA hardware interpolation branches or GPU deep learning generation branches based on the proportion of reflective pixels in the image. The fusion process includes constructing a triple weight map of edge guidance weights, semantic prior weights, and spatial attention weights, and combining a multi-scale pyramid strategy to separate and fuse high and low frequencies of the image to generate a panoramic depth-fusion image.

2. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The dynamic light source adaptation system is at least partially deployed in the hardware logic of the FPGA module. It is configured to parse the image stream, calculate the proportion of reflective areas and contrast index of the image, and generate lighting adjustment instructions based on the calculation results to control the brightness level of the light source. The dynamic light source adaptation system is configured as follows: Receive real-time image streams from the microscopic imaging module and calculate the pixel ratio of reflective areas and Michelson contrast in the image through hardware logic; The pixel ratio and Michelson contrast of the reflective area are compared with the preset target reflective ratio and target contrast. When the pixel ratio of the reflective area or the Michelson contrast does not meet the preset conditions, an adjustment command is generated to adjust the intensity of the blue light channel and / or the weight of the dark field light source in the multi-channel light source through a serial bus or parallel PWM signal until the pixel ratio of the reflective area and the Michelson contrast meet the preset conditions.

3. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The non-uniform hierarchical sampling strategy specifically includes: For the gold ball and neck region where the semantic mask confidence is greater than or equal to the first preset threshold, the coverage height range is set to 40-60μm and the sampling step size is 1-3μm. For pad regions with semantic mask confidence greater than or equal to the first preset threshold, the coverage height range is set to 80-120μm and the sampling step size is 15-25μm. For background regions where the semantic mask confidence level is less than the first preset threshold, the sampling step size is set to 40-60μm; The Z-axis precision drive module is controlled to perform unidirectional continuous scanning, and the total number of sampling frames is controlled within 32 frames. The Z-axis precision drive module includes a piezoelectric actuator and a position sensing component, which is used to perform multi-layer stepping motion in the Z direction to acquire sample images at multiple focal planes, and output a Z-axis ready signal to the FPGA module after the position is stabilized.

4. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The intelligent scheduling of FPGA hardware interpolation branches or GPU deep learning generation branches based on the proportion of reflective pixels in the image specifically includes: The percentage R of reflective pixels in a single frame image is calculated. When R is less than the preset threshold R0, the FPGA module uses a neighborhood weighted interpolation algorithm to fill pixels in the reflective area. When R is greater than or equal to a preset threshold R0, the FPGA module triggers the GPU to call the generative adversarial network model to perform texture repair on the reflective area.

5. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The image processing and control platform is also configured to perform hybrid image registration, including: The CPU is used to call the AVX-512 instruction set to accelerate ORB feature extraction and RANSAC coarse matching to obtain the initial transformation matrix; The GPU is used to perform subpixel-level fine registration based on the initial transformation matrix using the pyramid Lucas-Kanade optical flow method.

6. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The construction methods of the triple weight graph include: Edge information is extracted using the Canny operator, and edge guiding weights are generated based on the consistency of edge direction. The semantic mask output by the semantic segmentation model is reused and combined with dynamic light source contrast data to generate semantic prior weights. Image features are extracted using a convolutional attention module, and spatial attention weights are generated that focus on regions of salient features.

7. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The mechanical platform and motion control module, including an XYθ motion mechanism, a temperature sensor, and a nanoscale position feedback component, are used to perform sample positioning in the XY plane and rotation direction according to the control instructions of the FPGA module, and output a ready signal to the FPGA module after the position is stable. The FPGA module is configured to read the position data of the nanoscale position feedback component and the temperature data of the temperature sensor, calculate the position compensation amount based on Abbe error and thermal expansion error, and superimpose the compensation amount into the target coordinate command to correct the positioning of the mechanical platform in real time.

8. The multi-focus image fusion system for semiconductor packaging bonding process according to claim 1, characterized in that, The microscopic imaging module includes a camera, an objective lens, and a multi-channel light source. The camera is configured to acquire images of multiple focal planes in response to a trigger signal from the FPGA module and transmit the image data to the FPGA module. The camera is connected to the FPGA module via a Camera Link interface; The FPGA module is configured to write the image data directly into the GPU's video memory via the PCIe bus using Direct Memory Access (DMA) to trigger the GPU to perform inference calculations.

9. A multi-focus image fusion method for semiconductor packaging bonding processes, implemented based on the multi-focus image fusion system for semiconductor packaging bonding processes as described in any one of claims 1-8, characterized in that it includes the following steps: Step 1: Based on the heterogeneous computing platform consisting of a central processing unit (CPU), a graphics processing unit (GPU), and a field-programmable gate array (FPGA), the mechanical platform is controlled to move the target area of the sample to the center of the imaging field of view; Step 2: Acquire a reference image and run a semantic segmentation model using the GPU to generate a semantic mask containing the gold ball, neck, and pad regions; Step 3: Determine the non-uniform layered sampling strategy based on the semantic mask, control the Z-axis actuator to perform layered scanning according to the strategy, and control the dynamic light source through the FPGA to adjust it in real time to generate a multi-focus image sequence; Step 4: Perform reflection repair on the multi-focus image sequence, including selecting hardware interpolation repair performed by the FPGA or deep learning generation repair performed by the GPU based on the proportion of reflective pixels in the image; Step 5: Perform hybrid registration on the image sequence after reflection repair, including CPU-accelerated coarse registration and GPU-accelerated subpixel-level fine registration; Step 6: Construct a triple weight map based on the improved multi-scale Laplacian sharpness evaluation function and combining edge information, semantic mask, and spatial attention information; Step 7: Decompose the image into low-frequency and high-frequency components using a multi-scale pyramid strategy, and fuse the low-frequency and high-frequency components based on the triple weight map to reconstruct a panoramic depth-fusion image.

10. The multi-focus image fusion method for semiconductor packaging bonding process according to claim 9, characterized in that, in Step 3, the real-time adjustment of the dynamic light source via the FPGA includes: The FPGA analyzes the image stream of the acquired images in real time and calculates the pixel percentage of the reflective area and the Michelson contrast. If the pixel percentage of the reflective area is not within the preset target percentage range, or the Michelson contrast is lower than the preset contrast threshold, the light source parameters are adjusted via the I²C / SPI interface or a PWM signal until the requirements are met.