PCB surface defect detection method and device, storage medium and computer equipment

By improving the YOLOv5 architecture and combining the C3MobileViT module with the deformable convolutional receptive field module, the method addresses missed small defects and interference from complex backgrounds in PCB surface defect detection, aiming for high-precision and highly robust detection results.

CN122243872A · Pending · Publication Date: 2026-06-19 · SHENZHEN HUAFU INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN HUAFU INFORMATION TECH CO LTD
Filing Date
2026-02-06
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing PCB surface defect detection methods fall short in detecting minute defects, suppressing interference from complex backgrounds, and adapting to multi-scale defects, making it difficult to meet the industrial requirements of high precision, high efficiency, and high robustness.

Method used

We adopt an improved YOLOv5 architecture, combining the C3MobileViT module, the deformable convolutional receptive field module, and the joint attention block. We extract features through a dual-branch structure and dynamically adjust the sampling position of the convolutional kernel to enhance the adaptability and fusion capability of the feature map.

Benefits of technology

It significantly improves the recall and localization accuracy of detecting minute defects, enhances the feature discrimination ability in complex backgrounds, and improves detection precision and robustness.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method, apparatus, storage medium, and computer equipment for detecting PCB surface defects, relating to the field of product quality inspection. The method includes: acquiring a PCB surface image; inputting the image into a pre-trained defect detection model, wherein the backbone network of the model integrates a C3MobileViT module, merging global dependent features and local features through a dual-branch structure; a feature pyramid network generating multi-scale feature maps; deformable convolutional receptive field modules sequentially arranged in the neck network dynamically adjusting the receptive field using multi-scale deformable convolution, and joint attention blocks applying channel attention to high-level feature maps and coordinate attention to low-level feature maps to optimize the feature maps; four detection heads processing the optimized multi-scale feature maps and outputting the defect bounding box and category information; and finally determining the defect detection result based on this information. This application can improve the detection accuracy of PCB surface defects, especially the detection accuracy of minute defects.

Description

Technical Field

[0001] This application relates to the field of product quality inspection, and in particular to a method, apparatus, storage medium and computer equipment for detecting defects on PCB surfaces.

Background Technology

[0002] As a core component of electronic information products, the surface quality of PCBs (Printed Circuit Boards) directly affects the reliability of the final product. Traditional inspection methods, such as manual visual inspection, flying probe testing, and automated optical inspection, have limitations in terms of inspection efficiency, accuracy, and overall cost, making it difficult to meet the production requirements of modern high-density, fine-pitch PCBs.

[0003] With the development of deep learning technology, data-driven detection methods have provided a new approach for the automatic identification of PCB defects. However, PCB surface defects themselves are characterized by large scale variations, irregular shapes, and low contrast with the background, posing a serious challenge to detection algorithms. Specifically, existing methods mainly face the following technical problems:

Missed detection of minute defects: Minor defects such as fine cracks and pinholes account for a very small proportion of an image. During the downsampling process of a standard convolutional neural network, their feature information is easily lost, resulting in a high rate of missed detection.

[0004] Complex background interference: The PCB surface has dense regular textures (wires, pads), and some defects have slight differences in color and grayscale from normal areas, making it difficult to effectively distinguish defects from the background.

[0005] Rigid feature fusion mechanism: The feature pyramid structure adopted by existing mainstream detection networks (such as YOLOv5) has a relatively fixed multi-scale feature fusion method and lacks the ability to adaptively model the diverse geometric shapes of PCB defects.

[0006] Insufficient robustness in industrial environments: Imaging on actual production lines is susceptible to uneven lighting, metal reflections, and noise interference, which further increases the difficulty of stable feature extraction and accurate identification.

[0007] To address these issues, existing research has explored various improvement directions, but all have significant limitations: attention mechanisms offer limited improvement in the accuracy of locating extremely small targets; lightweight designs often come at the cost of sacrificing shallow, detailed features; and adaptive methods such as dynamic convolution still lack sufficient discriminative ability in complex backgrounds. These methods generally struggle to achieve a good balance between detection accuracy, inference speed, and model complexity, failing to fully meet the industry's practical needs for high-precision, high-efficiency, and highly robust detection.

Summary of the Invention

[0008] This application provides a method, apparatus, storage medium, and computer equipment for detecting PCB surface defects, which can solve the technical problems of missed detection of minute PCB defects, interference from complex backgrounds, and adaptive detection of multi-scale defects in the prior art. The technical solution is as follows:

In a first aspect, embodiments of this application provide a method for detecting defects on the surface of a PCB, the method comprising:

Acquiring an image of the PCB surface to be inspected;

Inputting the PCB surface image into a pre-trained defect detection model, which is a model obtained by improving the YOLOv5 architecture, the defect detection model including:

A backbone network, used to extract features from the input PCB surface image. The backbone network integrates a C3MobileViT module, and at least one C3 module in the backbone network is replaced with the C3MobileViT module to output the extracted features through the dual-branch structure of the C3MobileViT module.

A feature pyramid network, connected to the backbone network, used to receive and fuse the extracted features output by the backbone network to generate multiple feature maps of different scales.

A neck network, connected to the feature pyramid network, in which a deformable convolutional receptive field module and a joint attention block are provided in sequence. The deformable convolutional receptive field module processes feature maps from multiple scales of the feature pyramid network, dynamically adjusting the sampling position of the convolution kernel using multi-scale deformable convolution to output an enhanced feature map. The joint attention block processes the enhanced feature map from the deformable convolutional receptive field module to obtain an optimized feature map.

A detection head, connected to the neck network, used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image;

Determining the defect detection result of the PCB surface image based on the bounding box and category information output by the defect detection model.

[0009] In a second aspect, embodiments of this application provide a PCB surface defect detection device, the device comprising:

An acquisition module, used to acquire an image of the PCB surface to be inspected.

An input module, used to input the PCB surface image into a pre-trained defect detection model. The defect detection model is a model obtained by improving the YOLOv5 architecture, and the defect detection model includes:

A backbone network, used to extract features from the input PCB surface image. The backbone network integrates a C3MobileViT module, and at least one C3 module in the backbone network is replaced with the C3MobileViT module to output the extracted features through the dual-branch structure of the C3MobileViT module.

A feature pyramid network, connected to the backbone network, used to receive and fuse the extracted features output by the backbone network to generate multiple feature maps of different scales.

A neck network, connected to the feature pyramid network, in which a deformable convolutional receptive field module and a joint attention block are provided in sequence. The deformable convolutional receptive field module processes feature maps from multiple scales of the feature pyramid network, dynamically adjusting the sampling position of the convolution kernel using multi-scale deformable convolution to output an enhanced feature map. The joint attention block processes the enhanced feature map from the deformable convolutional receptive field module to obtain an optimized feature map.

A detection head, connected to the neck network, used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image;

A determination module, used to determine the defect detection result of the PCB surface image based on the bounding box and the category information output by the defect detection model.

[0010] In a third aspect, embodiments of this application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor to execute the above-described method steps.

[0011] In a fourth aspect, embodiments of this application provide a computer device, which may include: a processor and a memory; wherein the memory stores a computer program, the computer program being adapted to be loaded by the processor to execute the above-described method steps.

[0012] The beneficial effects of the technical solutions provided in some embodiments of this application include at least the following: This application combines the global dependency modeling capability of Transformer with the local feature extraction advantage of convolutional neural network by designing the C3MobileViT module. This enables the model to capture both fine local details of defects and long-range contextual information, thereby significantly improving the model's feature discrimination ability to accurately distinguish between defective and non-defective regions in complex PCB backgrounds.

[0013] This application designs a deformable convolutional receptive field module, utilizing deformable convolution operations to enable the model to dynamically adjust the sampling position of the convolution kernel and the receptive field range according to the shape and size of the input defect. This adaptive mechanism allows the model to more accurately focus on the key feature regions of minute defects, effectively alleviating the problem of feature loss and missed detection of minute defects caused by fixed or insufficient receptive fields, thereby significantly improving the detection recall and localization accuracy of small-sized and irregularly shaped defects.

[0014] This application proposes a joint attention module that differentially applies channel attention and coordinate attention mechanisms to feature maps of different network depths. This design enhances the semantic information discrimination capability of high-level features while simultaneously improving the ability to locate spatial details in low-level features, thus achieving a more efficient and targeted fusion of high-level semantic information and low-level spatial information. This reduces information loss during cross-scale feature fusion, ultimately improving the model's classification confidence for defect categories and the accuracy of bounding box localization.

[0015] In summary, the technical solution provided in this application, through the synergistic improvement of the above modules, effectively improves the detection accuracy, robustness, and feature utilization efficiency of the final defect detection model in PCB surface defect detection tasks, especially when dealing with challenges such as small defects, complex backgrounds, and multi-scale changes.

Brief Description of the Drawings

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a schematic diagram of the network architecture provided in the embodiments of this application;

Figure 2 is a schematic flowchart of the PCB surface defect detection method provided in the embodiments of this application;

Figure 3 is a schematic diagram of the feature extraction process provided in an embodiment of this application;

Figure 4 is a schematic diagram of the process for generating enhanced feature maps provided in an embodiment of this application;

Figure 5 is a schematic diagram of the structure of a PCB surface defect detection device provided in this application;

Figure 6 is a schematic diagram of a computer storage medium provided in an embodiment of this application;

Figure 7 is a schematic diagram of the structure of a computer device provided in this application.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0019] It should be noted that the PCB surface defect detection method provided in this application is generally executed by computer equipment, and correspondingly, the PCB surface defect detection device is generally installed in the computer equipment.

[0020] Figure 1 shows an exemplary network architecture that can be applied to the PCB surface defect detection method or PCB surface defect detection device of this application.

[0021] As shown in Figure 1, the network architecture may include computer device 101 and server 102. Computer device 101 and server 102 can communicate with each other via the network, which serves as the medium for providing communication links between the various units. The network may include various types of wired or wireless communication links, such as wired communication links including fiber optic cables, twisted-pair cables, or coaxial cables, and wireless communication links including Bluetooth communication links, Wi-Fi communication links, or microwave communication links.

[0022] It should be noted that computer device 101 and server 102 can be either hardware or software. When computer device 101 and server 102 are hardware, they can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When computer device 101 and server 102 are software, they can be implemented as multiple software programs or software modules (for example, used to provide distributed services), or as a single software program or software module; no specific limitations are made here.

[0023] The pre-trained defect detection model in this application can be deployed locally on computer device 101 or on server 102.

[0024] The computer device described in this application can be equipped with various communication client applications, such as video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients, social platform software, etc.

[0025] Computer devices can be either hardware or software. When a computer device is hardware, it can be any computer device with a display screen, including but not limited to smartphones, tablets, laptops, and desktop computers. When a computer device is software, it can be installed on the computer devices listed above. It can be implemented as multiple software programs or software modules (e.g., used to provide distributed services) or as a single software program or software module; no specific limitation is made here.

[0026] When a computer device is used as hardware, it can also be equipped with a display device and a camera. The display device can be any device capable of displaying information, and the camera is used to capture video streams. Examples of display devices include cathode ray tube (CRT) displays, light-emitting diode (LED) displays, e-ink screens, liquid crystal displays (LCD), and plasma display panels (PDP). Users can utilize the display devices on their computer devices to view displayed text, images, videos, and other information.

[0027] It should be understood that the number of computer devices, networks, and servers shown in Figure 1 is for illustrative purposes only. Any number of computer devices, networks, and servers may be used, depending on the implementation requirements.

[0028] The PCB surface defect detection method provided in the embodiments of this application is described in detail below with reference to Figure 2. The PCB surface defect detection device in the embodiments of this application can be the computer device shown in Figure 1.

[0029] Please refer to Figure 2, which is a flowchart of a method for detecting surface defects on a PCB provided in an embodiment of this application. As shown in Figure 2, the method described in this embodiment may include the following steps:

S1. Obtain an image of the PCB surface to be inspected.

[0030] In this system, computer equipment establishes a connection with the image data source and completes data exchange through its input / output system. Image data sources are mainly divided into two categories. The first category is image acquisition devices that are directly connected to the computer equipment, such as industrial cameras or line scan scanners. These devices convert the captured optical signals into digital signals and transmit them in real time. The second category is the storage system of the computer equipment, which stores the pre-acquired image files.

[0031] When acquiring data from an image acquisition device, the computer needs to call the corresponding device driver and control interface to send acquisition commands to the device, controlling it to complete focusing, exposure, and shooting operations. Then, it receives the raw image data stream in the form of a stream or packets via the data bus. When reading from the storage system, the computer uses a file management program to locate and load the image file in a specific directory, decoding the file's binary content into an image data array that can be directly accessed in memory.

[0032] Regardless of the source, the final image data must be formatted into a three-dimensional array with a defined height, width, and number of channels in the computer's memory. Each element in this array corresponds to the intensity information of a pixel in the image. For color images, this is typically achieved by superimposing data from three channels representing the intensities of red, green, and blue light, respectively. The sharpness and detail resolution of an image are primarily determined by its spatial resolution, i.e., the number of pixels in the width and height directions, and its quantization bit depth. The acquired image should meet basic requirements for sharpness and uniform illumination to ensure the reliability of subsequent automated analysis processes.

[0033] For example, in an inspection station integrated at the end of an assembly line, a computer is connected to a fixed-mount area-array CCD (Charge-Coupled Device) industrial camera via a standard Ethernet interface. The computer sends a soft-trigger command to the camera. After receiving a circuit board arrival signal from a photoelectric sensor, the camera performs a global shutter exposure, generating a raw image with a resolution of 5 megapixels. This image data is transmitted via Ethernet cable to a memory buffer designated by the computer, where it is formatted as an 8-bit unsigned integer three-dimensional array with a width of 2448 pixels and a height of 2048 pixels, containing red, green, and blue color channels. After verifying the integrity of this array data, the computer marks it as the PCB surface image to be processed in this round.
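The figures in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the height-first array layout and the helper names `image_stats` and `is_complete` are assumptions, not part of the application.

```python
# Values taken from the example: 2448 x 2048 pixels, 3 channels, 8-bit samples.
WIDTH, HEIGHT, CHANNELS = 2448, 2048, 3
BYTES_PER_SAMPLE = 1  # 8-bit unsigned integer


def image_stats(width, height, channels, bytes_per_sample):
    """Return (pixel_count, megapixels, buffer_bytes) for a dense image array."""
    pixels = width * height
    buffer_bytes = pixels * channels * bytes_per_sample
    return pixels, pixels / 1e6, buffer_bytes


def is_complete(buffer, expected_bytes):
    """Hypothetical integrity check: the received buffer has the expected length."""
    return len(buffer) == expected_bytes


pixels, megapixels, buffer_bytes = image_stats(WIDTH, HEIGHT, CHANNELS, BYTES_PER_SAMPLE)
shape = (HEIGHT, WIDTH, CHANNELS)  # height-first layout is an assumption
print(shape, pixels, round(megapixels, 2), buffer_bytes)
```

The pixel count comes out at roughly 5.01 million, consistent with the "5 megapixels" stated in the example, and one frame occupies about 15 MB of memory.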

[0034] S2. Input the PCB surface image into the pre-trained defect detection model.

[0035] The defect detection model is an improved version of the YOLOv5 (You Only Look Once version 5, a real-time object detection algorithm based on deep learning) architecture. The defect detection model includes: The backbone network is used to extract features from the input PCB (Printed Circuit Board) surface image. The backbone network integrates a C3MobileViT module (a module that integrates a lightweight visual transformer mechanism into the C3 module structure), and at least one C3 module in the backbone network is replaced with a C3MobileViT module to output the extracted features through the dual-branch structure of the C3MobileViT module. The first branch extracts global dependency features through convolutional layers and the MobileViT module, and the second branch extracts local features through convolutional layers. The outputs of the first branch and the second branch are merged into the extracted features.

[0036] The C3MobileViT module is a core component designed in this application to improve the feature extraction capabilities of the YOLOv5 backbone network. It is implemented by replacing at least one standard C3 module in the original network. This module adopts a dual-branch parallel structure: in the first branch, the input features are processed sequentially through convolutional layers, bottleneck blocks, and the MobileViT module. The MobileViT module captures the global long-range dependencies of features by combining the locality advantage of convolution with the self-attention mechanism of the Transformer. In the second branch, the input features are processed only through convolutional layers to efficiently preserve and transmit local detail information. Finally, the feature maps output from the two branches are concatenated and fused along the channel dimension to generate an enhanced feature representation that is rich in both local detail and global context. Through this design, the C3MobileViT module effectively improves the model's ability to represent and distinguish defect features in PCB images, especially small and low-contrast defects. In a preferred embodiment, the C3 modules in the third and fourth stages of the YOLOv5 backbone network are replaced with the C3MobileViT module described in this invention.

[0037] The C3 module is one of the core building blocks of the YOLOv5 backbone network. It is a feature extraction module that adopts a cross-stage partial network design. Specifically, a standard C3 module typically contains two parallel paths: one path consists of multiple stacked bottleneck blocks for deep nonlinear feature transformation and extraction; the other path is a simple shortcut connection. The outputs of the two paths are eventually fused (e.g., concatenated or added). This dual-path design not only promotes the effective propagation of gradients in deeper networks and alleviates the gradient vanishing problem through residual connections, but also reduces computational complexity to some extent. As a key stage submodule in the network, the C3 module is responsible for effectively abstracting local features and integrating information from the input feature map, and its output provides more representative features for subsequent network layers. In this invention, the C3MobileViT module is an improvement and functional enhancement based on this standard C3 module structure.

[0038] The MobileViT module is a lightweight visual neural network component. Its core design effectively integrates the advantages of convolutional operations and the Transformer architecture to achieve collaborative modeling of local details and global contextual information in image features. This module typically performs the following operations sequentially: First, it extracts local spatial information of the input features through standard convolutional layers; next, it reduces the feature channel dimension using point convolutions to decrease computational cost; then, it unfolds the feature map spatially into a sequence of image patches and inputs them into a lightweight Transformer encoder, establishing long-range dependencies between these patches through a self-attention mechanism to capture global contextual information; subsequently, it folds the encoded sequence back into the spatial dimension to restore the feature map structure; finally, it adjusts the number of channels using point convolutions. This design allows MobileViT to achieve both the locality advantage of convolution and the global modeling capability of the Transformer with lower computational overhead. In this invention, the MobileViT module, as a core branch of the C3MobileViT component, is specifically designed to enhance the backbone network's understanding of the global semantics and contextual relationships of PCB image defects.
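The unfold and fold steps described above can be sketched on a single-channel grid. This is a minimal illustration, assuming non-overlapping square patches; the function names `unfold` and `fold` are hypothetical, and the Transformer encoder that would run between them is omitted.

```python
def unfold(fmap, p):
    """Split an H x W grid (list of rows) into a sequence of flattened p x p patches."""
    h, w = len(fmap), len(fmap[0])
    assert h % p == 0 and w % p == 0, "grid must be divisible by the patch size"
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append([fmap[top + i][left + j] for i in range(p) for j in range(p)])
    return patches


def fold(patches, h, w, p):
    """Inverse of unfold: rebuild the H x W grid from the patch sequence."""
    fmap = [[None] * w for _ in range(h)]
    patches_per_row = w // p
    for idx, patch in enumerate(patches):
        top = (idx // patches_per_row) * p
        left = (idx % patches_per_row) * p
        for k, value in enumerate(patch):
            fmap[top + k // p][left + k % p] = value
    return fmap


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
seq = unfold(grid, 2)          # 4 patches, each with 4 values
restored = fold(seq, 4, 4, 2)  # identical to grid when nothing is changed in between
```

Because `fold` exactly inverts `unfold`, any change in the restored map would come from the encoder applied to the sequence, which is where the long-range dependencies between patches are modeled.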

[0039] The feature pyramid network, connected to the backbone network, is used to receive and fuse the extracted features from the backbone network output to generate feature maps of multiple different scales. The neck network is connected to the feature pyramid network. Deformable convolutional receptive field modules and joint attention blocks are set sequentially in the neck network. The deformable convolutional receptive field module is used to process feature maps from multiple scales of the feature pyramid network. It uses multi-scale deformable convolution to dynamically adjust the sampling position of the convolution kernel to output enhanced feature maps. The joint attention block is used to process the enhanced feature maps from the deformable convolutional receptive field module, where a channel attention mechanism is applied to the high-level feature maps to enhance semantic information, and a coordinate attention mechanism is applied to the low-level feature maps to enhance spatial location information, thereby outputting an optimized feature map. The detection head, connected to the neck network, is used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image. There are four detection heads, which respectively correspond to at least four of the multiple different scales.

[0040] The purpose of this step is to input the acquired digital image data into a pre-trained neural network model based on the improved YOLOv5 framework, performing forward propagation computation to extract defect information. The computer device first performs preprocessing on the image tensor in memory, typically including scaling the size to the model's preset input specifications and normalizing pixel values. Subsequently, this normalized tensor is loaded into the model's computation graph and begins to flow sequentially through the model's various components.
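As an illustration of the preprocessing step, the sketch below computes the resize and padding needed to fit an image into a square model input, then normalizes pixel values. Aspect-preserving resizing with padding ("letterboxing") is an assumption here; the application only states that the image is scaled and normalized. The 640-pixel target and the function names are likewise illustrative.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale factor, resized size, and padding that fit src into a dst x dst square."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_w, pad_h = dst - new_w, dst - new_h
    # Split padding between the two sides; any odd pixel goes to the right/bottom.
    left, top = pad_w // 2, pad_h // 2
    return scale, (new_w, new_h), (left, pad_w - left, top, pad_h - top)


def normalize(pixel_values):
    """Map 8-bit intensities (0..255) to floats in [0, 1]."""
    return [v / 255.0 for v in pixel_values]


# A 2448 x 2048 source image, as in the acquisition example.
scale, new_size, padding = letterbox_params(2448, 2048)
print(new_size, padding)  # padding order: (left, right, top, bottom)
```

A plain stretch to 640 by 640 would also satisfy the description; the padded variant is shown because it keeps defect shapes undistorted.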

[0041] First, image data enters the backbone network of the model. This network, composed of multiple stacked convolutional modules, is responsible for deep feature extraction. The core improvement lies in replacing at least one of the original standard C3 (Cross Stage Partial bottleneck with three convolutions) modules in the backbone with a C3MobileViT module. This module employs a parallel two-branch structure to process the input features. In the first branch, the input features undergo preliminary transformation and channel adjustment via a convolutional block, then pass through a bottleneck structure composed of stacked residual units for stable training, and are finally fed into the MobileViT (Mobile Vision Transformer) module. The MobileViT module combines the locality advantage of convolutional operations with the global modeling capability of the Transformer architecture, enabling it to capture long-range contextual dependencies within the feature map. In the second branch, the input features are processed by only one convolutional block, aiming to efficiently preserve and transmit local detail features. Finally, the feature maps output from the two branches are concatenated along the channel dimension to form an enhanced feature map that integrates local details and global context, which serves as the output of the backbone network at this stage.

[0042] The multi-level features output from the backbone network are then fed into the feature pyramid network. This network amplifies deep, low-resolution but semantically rich features through a top-down upsampling path, and then adds or merges them element-wise with the corresponding shallow, high-resolution but semantically weak features from the bottom-up path. This process generates feature map pyramids of multiple scales, such as large, medium, and small spatial sizes, with each layer possessing both good semantic information and spatial detail.
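The top-down fusion step can be sketched on single-channel grids. This is a simplified illustration, assuming nearest-neighbor 2x upsampling and element-wise addition; the actual interpolation and fusion operator are not specified in the text, and the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D grid (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def fuse_add(deep, shallow):
    """Upsample the deep (low-resolution) map and add it to the shallow map."""
    up = upsample2x(deep)
    assert len(up) == len(shallow) and len(up[0]) == len(shallow[0])
    return [[a + b for a, b in zip(row_up, row_sh)] for row_up, row_sh in zip(up, shallow)]


deep = [[1.0, 2.0], [3.0, 4.0]]          # 2 x 2, semantically rich
shallow = [[0.5] * 4 for _ in range(4)]  # 4 x 4, spatially detailed
fused = fuse_add(deep, shallow)
```

Each position in `fused` combines the shallow value at that position with the deep value covering it, which is how spatial detail and semantic information end up in the same map.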

[0043] Next, the multi-scale feature maps are refined in the neck network. The neck network contains two custom modules. The first is a deformable convolutional receptive field module. This module applies several convolutional branches with different kernel sizes in parallel to each scale of the input feature map. Crucially, the standard convolutions in these branches are replaced with deformable convolutions. The sampling locations of a deformable convolution no longer form a regular rectangular grid; instead, each sampling point is shifted by an additional two-dimensional offset learned by the network. This allows the convolutional kernels to adaptively fit the irregular geometry of the defect target, thus focusing more accurately on the relevant feature regions. The outputs of the branches are concatenated and added to the original input features passed through a shortcut connection, resulting in a feature map with an enhanced receptive field.
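To make the offset sampling concrete, the sketch below evaluates a 3x3 deformable convolution at one output location on a single-channel grid. It is a simplified illustration: in a trained network the offsets would be predicted by an additional convolution, whereas here they are supplied by hand, and the names `bilinear` and `deform_conv_at` are hypothetical.

```python
import math


def bilinear(fmap, y, x):
    """Sample a 2-D grid at a fractional (y, x); positions outside the grid read as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                total += weight * fmap[yi][xi]
    return total


def deform_conv_at(fmap, cy, cx, kernel, offsets):
    """3x3 deformable convolution response at output location (cy, cx).

    kernel: 3x3 weights; offsets: 3x3 grid of (dy, dx) pairs, one per kernel tap.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(fmap, cy + i - 1 + dy, cx + j - 1 + dx)
    return out


fmap = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernel = [[1 / 9] * 3 for _ in range(3)]             # averaging kernel
zero = [[(0.0, 0.0)] * 3 for _ in range(3)]          # regular grid
shifted = [[(0.5, 0.5)] * 3 for _ in range(3)]       # every tap moved half a pixel
regular = deform_conv_at(fmap, 2, 2, kernel, zero)
moved = deform_conv_at(fmap, 2, 2, kernel, shifted)
```

With zero offsets the result matches an ordinary 3x3 convolution; non-zero offsets move each tap independently, and bilinear interpolation keeps the sampling differentiable so the offsets can be learned.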

[0044] The second module is the Joint Attention Block module. This module applies differential attention weighting to the feature maps enhanced by the receptive field. It first distinguishes between high-level and low-level feature maps. For high-level feature maps, a channel attention mechanism is primarily applied. This mechanism globally compresses the spatial dimension of the feature maps to generate a one-dimensional channel weight vector. This vector is used to recalibrate the importance of each feature channel, thereby enhancing the model's discriminative power regarding defect category semantics. For low-level feature maps, a coordinate attention mechanism is primarily applied. This mechanism encodes features along both the horizontal and vertical coordinate directions, generating a pair of orientation-aware feature maps that can perceive the target location. These are then synthesized into a two-dimensional spatial attention weight map, which highlights key regions in the feature maps related to the spatial location of the defect. The feature maps modulated by channel attention and coordinate attention are then concatenated to output a fused feature map with optimized semantic and spatial information.
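The two attention mechanisms can be contrasted in a small sketch. This is a heavily simplified illustration: the learned layers that a real channel attention or coordinate attention block places between pooling and gating are omitted, so the gates here depend only on pooled means. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmaps):
    """Squeeze each channel to its spatial mean, gate it, and rescale the channel.

    fmaps: list of channels, each an H x W list of rows.
    """
    out = []
    for ch in fmaps:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = sigmoid(mean)  # one scalar weight per channel
        out.append([[gate * v for v in row] for row in ch])
    return out


def coordinate_attention(ch):
    """Pool one channel along each axis and combine the gates into a 2-D weight map."""
    h, w = len(ch), len(ch[0])
    row_gate = [sigmoid(sum(row) / w) for row in ch]  # one value per y
    col_gate = [sigmoid(sum(ch[y][x] for y in range(h)) / h) for x in range(w)]  # per x
    weights = [[row_gate[y] * col_gate[x] for x in range(w)] for y in range(h)]
    scaled = [[weights[y][x] * ch[y][x] for x in range(w)] for y in range(h)]
    return scaled, weights


# A 3 x 3 channel with a single strong response at row 1, column 2.
spot = [[0.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 0.0, 0.0]]
_, weights = coordinate_attention(spot)
gated = channel_attention([[[0.0, 0.0]], [[2.0, 2.0]]])
```

The channel gate is a single number per channel, so it can only say which channels matter; the coordinate weights vary over rows and columns, so they can also say where in the map the response lies.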

[0045] Finally, the optimized multi-scale fused features are fed into the detection head. The model has four detection heads that work in parallel, each responsible for processing feature maps of different scales from the neck network. Each detection head is essentially a small convolutional network that performs convolution operations on the input feature map, ultimately outputting a three-dimensional prediction tensor. This tensor corresponds to several predefined-shaped anchor boxes at each location in space, and predicts a set of data for each anchor box, including the center coordinates, width, and height of the bounding box, a confidence score indicating the presence of an object within the box, and a probability distribution vector covering all defect categories.

[0046] For example, a computer device scales and normalizes an acquired PCB image into a tensor of size 640 pixels by 640 pixels. This tensor is input into the model. In the backbone network, the original C3 module in the third stage is replaced by the C3MobileViT module. Assuming the feature map input to this module is 80 pixels by 80 pixels with 256 channels, this feature map is processed in two branches. The first branch, after convolution, a bottleneck structure, and the MobileViT module, outputs an 80 pixels by 80 pixels feature map with 128 channels, which encodes global context information. The second branch, after another set of convolutions, outputs an 80 pixels by 80 pixels feature map with 128 channels, which preserves local texture details. The two are concatenated to form a fused feature map of 80 pixels by 80 pixels with 256 channels.

[0047] The feature pyramid network may fuse this feature map with features from other layers, ultimately generating a four-layer feature map pyramid with scales of 80*80, 40*40, 20*20, and 10*10 pixels. For the 80*80 feature map, a 3x3 convolutional branch in the deformable convolutional receptive field module may shift its sampling points from a fixed grid to a location closer to the edge contour of a small "mouse bite" defect, depending on the image content. The joint attention block applies channel attention to the 20x20 high-level features, potentially increasing the weights of feature channels associated with "short circuit" defects; simultaneously, it applies coordinate attention to the 80x80 low-level features, potentially creating a high-response region around a "hole" location.

[0048] Four detection heads receive optimized features at four different scales. For example, a detection head responsible for 80*80 scale features outputs a prediction tensor with a spatial size of 80*80. Assuming three anchor boxes are preset at each location, and each anchor box predicts four coordinate values, one confidence score, and a category score for each of six defect types, each location carries 3*(4+1+6) = 33 values, so the detection head outputs 80*80*33 = 211,200 predicted values in total. This application indicates that the defect types mainly include: missing hole (Mh), mouse bite (Mb), open circuit (Oc), short circuit (Sh), spur (Sp), and spurious copper (Sc). Of course, other defect types can also be included, and can be labeled in the training data according to actual needs.
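The arithmetic in this example can be checked with a short sketch. The function name `head_output_shape` and its parameter names are illustrative assumptions, not part of the application; the default values (three anchors, six defect types) are taken from the example above.

```python
def head_output_shape(grid_size, num_anchors=3, num_classes=6):
    """Return (height, width, channels) of one detection head's prediction tensor."""
    # Each anchor predicts 4 box values, 1 confidence score, and one score per class.
    values_per_anchor = 4 + 1 + num_classes
    return (grid_size, grid_size, num_anchors * values_per_anchor)


shape = head_output_shape(80)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)  # (80, 80, 33) 211200
```

Adding a defect type only changes `num_classes`, which widens the channel dimension by one value per anchor.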

[0049] The training process of the defect detection model in this application is based on a given dataset of PCB surface defect images. The aim is to optimize the network parameters so that the model can learn to identify and locate various types of defects from the images. The training process is described in detail below.

[0050] Training Data Preparation and Preprocessing. Training begins with dataset acquisition and partitioning. The computer acquires a dataset containing images of PCB surface defects and their corresponding annotations, such as the publicly available PKU-PCB (Peking University Printed Circuit Board Surface Defect Detection Dataset), which contains images and annotations for various defect types. Subsequently, the computer randomly partitions all data into three mutually exclusive subsets: a training set, a validation set, and a test set, with a common partition ratio of 8:1:1. During the training phase, the training set is primarily used to learn the model parameters.
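A minimal sketch of the random 8:1:1 partition follows. The function name `split_dataset`, the fixed seed, and the use of integer indices in place of image files are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into mutually exclusive train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical dataset of 1000 samples, identified by index.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```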

[0051] To improve the model's generalization ability and alleviate class imbalance, two data augmentation operations were applied online during training. The first was Mosaic augmentation, where the computer randomly selected four images from the training set, randomly cropped each image, and then stitched the cropped regions onto a background image of the same size as the model input, forming a new synthetic image and correspondingly synthesizing its annotation information. The second was Mixup augmentation, where the computer randomly selected two images and their corresponding annotations from the training set, and linearly interpolated the pixel values and annotation information of the two images using a mixing coefficient λ randomly generated in the range of 0 to 1, generating a new mixed sample. The augmented image data was then used as training samples input into the model.
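The Mixup step can be sketched as follows. This is an illustration under stated assumptions: the two images already have the same size, boxes are hypothetical tuples of (class, cx, cy, w, h), and the annotations are merged by keeping every box together with the blend weight of its source image, which is one common way to interpolate detection labels and is not specified by the application.

```python
import numpy as np


def mixup(image_a, boxes_a, image_b, boxes_b, rng):
    """Blend two same-sized images with a random coefficient lam drawn from [0, 1)."""
    lam = float(rng.uniform(0.0, 1.0))
    mixed = lam * image_a.astype(np.float32) + (1.0 - lam) * image_b.astype(np.float32)
    # Assumption: keep every box and attach the blend weight of the image it came from.
    weighted_boxes = [(box, lam) for box in boxes_a] + [(box, 1.0 - lam) for box in boxes_b]
    return mixed, weighted_boxes, lam


rng = np.random.default_rng(0)
image_a = np.zeros((640, 640, 3), dtype=np.uint8)      # placeholder black image
image_b = np.full((640, 640, 3), 255, dtype=np.uint8)  # placeholder white image
mixed, weighted_boxes, lam = mixup(
    image_a, [(0, 100, 100, 20, 20)], image_b, [(3, 300, 300, 40, 10)], rng
)
```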

[0052] Model Construction and Initialization. The defect detection model is an improvement on the standard YOLOv5 network architecture, named MDUA-YOLO (MobileViT, Deformable-convolution, and Union-Attention based Improved YOLO). Its core improvements include: Backbone network improvement: In the backbone network, at least one of the original C3 modules was replaced with a C3MobileViT module. This module adopts a two-branch structure. The first branch extracts global context features sequentially through convolutional blocks, bottleneck blocks, and the MobileViT module; the second branch extracts local detail features through convolutional blocks; finally, the outputs of the two branches are concatenated and fused through convolutional blocks to form an enhanced feature representation.

[0053] Neck network improvement: Two new modules are introduced sequentially after the feature pyramid network.

[0054] Deformable Convolutional Receptive Field Module: This module replaces the original dilated convolution with multi-scale deformable convolution. Its core operation is defined as follows. The output feature value $y(p)$ at position $p$ on the output feature map is calculated as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k)$$

where $K$ is the total number of sampling points in the convolution kernel; $w_k$ is the convolution kernel weight corresponding to the k-th sampling point; $p_k$ is the predefined offset of the k-th sampling point within the regular convolutional grid; $x(p + p_k + \Delta p_k)$ is the input feature value at the adjusted position $p + p_k + \Delta p_k$ on the input feature map; and $\Delta p_k$ is a learnable offset obtained through network learning, used to dynamically adjust the position of the k-th sampling point. This allows the convolutional kernel to dynamically adapt to the defect shape.
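The sampling rule for a single output position can be illustrated with a small NumPy sketch. It assumes a single-channel feature map and bilinear interpolation for fractional sampling positions; the interpolation scheme is the usual choice for deformable convolution but is not stated in this paragraph. The names `bilinear_sample` and `deformable_conv_at` are hypothetical.

```python
import numpy as np


def bilinear_sample(x, row, col):
    """Sample the 2-D map x at a fractional (row, col); positions outside the map read as 0."""
    h, w = x.shape
    r0, c0 = int(np.floor(row)), int(np.floor(col))
    value = 0.0
    for r in (r0, r0 + 1):
        for c in (c0, c0 + 1):
            if 0 <= r < h and 0 <= c < w:
                weight = (1.0 - abs(row - r)) * (1.0 - abs(col - c))
                value += weight * x[r, c]
    return value


def deformable_conv_at(x, p, weights, grid_offsets, learned_offsets):
    """Compute y(p) = sum_k w_k * x(p + p_k + delta_p_k) for one output position p."""
    y = 0.0
    for w_k, p_k, dp_k in zip(weights, grid_offsets, learned_offsets):
        row = p[0] + p_k[0] + dp_k[0]
        col = p[1] + p_k[1] + dp_k[1]
        y += w_k * bilinear_sample(x, row, col)
    return y


x = np.arange(25, dtype=np.float64).reshape(5, 5)            # toy feature map, x[r, c] = 5r + c
grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # regular 3x3 grid offsets p_k
w = [1.0 / 9.0] * 9                                          # averaging kernel weights w_k
no_shift = [(0.0, 0.0)] * 9                                  # delta_p_k = 0: standard convolution
half_right = [(0.0, 0.5)] * 9                                # every point moved half a pixel right
y_regular = deformable_conv_at(x, (2, 2), w, grid, no_shift)
y_shifted = deformable_conv_at(x, (2, 2), w, grid, half_right)
```

With all offsets set to zero the result equals an ordinary 3x3 convolution at that position. In a trained model the offsets would be predicted per position from the input features rather than fixed by hand as in this toy example.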

[0055] Joint Attention Block: This module performs differential processing on features. A channel attention mechanism is applied to high-level feature maps, using global average pooling, fully connected layers, and a sigmoid function to generate channel weights for weighting. A coordinate attention mechanism is applied to low-level feature maps, using encoding along the horizontal and vertical directions to generate spatial attention weights for weighting. Finally, the two weighted feature sets are concatenated.
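The two attention paths can be sketched in NumPy as follows. This is a simplified illustration, not the application's implementation: the fully connected weights `w1` and `w2` are random placeholders, a ReLU between the two fully connected layers and a reduction ratio of 16 are assumed, and the coordinate attention variant omits the learned layers that normally sit between the directional pooling and the sigmoid.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(feature, w1, w2):
    """feature: (C, H, W); w1: (C // r, C); w2: (C, C // r). Returns weighted map and weights."""
    squeezed = feature.mean(axis=(1, 2))       # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # first fully connected layer + ReLU
    weights = sigmoid(w2 @ hidden)             # second fully connected layer + sigmoid
    return feature * weights[:, None, None], weights


def coordinate_attention_simplified(feature):
    """Directional pooling only; the learned transforms are omitted (assumption)."""
    pooled_rows = feature.mean(axis=2, keepdims=True)  # (C, H, 1): one value per row
    pooled_cols = feature.mean(axis=1, keepdims=True)  # (C, 1, W): one value per column
    # Broadcasting the two direction-aware maps yields a 2-D spatial weight map per channel.
    return feature * sigmoid(pooled_rows) * sigmoid(pooled_cols)


rng = np.random.default_rng(0)
high = rng.standard_normal((256, 20, 20))   # placeholder high-level feature map
low = rng.standard_normal((256, 80, 80))    # placeholder low-level feature map
w1 = rng.standard_normal((16, 256)) * 0.1   # untrained placeholder weights
w2 = rng.standard_normal((256, 16)) * 0.1
channel_enhanced, channel_weights = channel_attention(high, w1, w2)
spatial_enhanced = coordinate_attention_simplified(low)
```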

[0056] Detection head adjustment: The model is equipped with four detection heads, each responsible for predicting feature maps at different scales, in order to improve the detection capability of multi-scale defects.

[0057] Once the model is built, its weight parameters are typically initialized using pre-trained weights or a specific initialization method.

[0058] Training process and optimization configuration. The training process is iterative, and the computer equipment performs the following key steps: Forward propagation: A batch of preprocessed training images are input into the model. The data flows sequentially through the backbone network, feature pyramid network, improved neck network and detection head, and finally outputs the predicted bounding box, category and confidence score.

[0059] Loss Calculation: The model's predicted output is compared with the ground truth annotations of the batch of images to calculate the loss value. To address the class imbalance problem in PCB defect data, the FocalLoss function is used as the classification loss function during training; for bounding box regression, the CIoU loss function is used. The total loss is the weighted sum of the individual losses.
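To show why a focal loss helps with class imbalance, the following sketch implements the binary form, FL = -α_t (1 - p_t)^γ log(p_t). The values γ = 2 and α = 0.25 are commonly used defaults and are assumptions here; the application does not state its settings.

```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)                 # avoid log(0)
    p_t = np.where(targets == 1, probs, 1.0 - probs)       # probability of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)   # class-balancing factor
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


easy = focal_loss(np.array([0.9]), np.array([1]))  # confident, correct prediction
hard = focal_loss(np.array([0.1]), np.array([1]))  # confident, wrong prediction
```

The hard example contributes far more to the loss than the easy one, so abundant easy samples do not dominate the gradient.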

[0060] Backpropagation and parameter update: Using an optimization algorithm, the gradient is backpropagated along the network based on the calculated total loss, and all weight parameters of the model are updated, including the weights w_k and the learnable offsets Δp_k in the deformable convolutions.

[0061] Training cycle: Repeat the above steps until the preset training cycle is reached or the early stop condition is met.

[0062] The specific training hyperparameter configuration is as follows: Optimizer: The Adam (Adaptive Moment Estimation) optimizer is used, with its momentum parameter set to 0.937 and weight decay coefficient set to 0.0005 to prevent overfitting.

[0063] Learning rate strategy: A dynamic learning rate strategy is adopted. The initial learning rate is set to 0.0001 and decayed according to the cosine annealing rule. For example, the learning rate is decayed to 10% of the original value every 50 training cycles to balance convergence speed and stability.
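One way to realize a cosine annealing rule is sketched below. The initial learning rate of 0.0001 comes from the text; the total of 200 epochs matches the training period stated for this configuration; the floor value `final_lr` (10% of the initial rate) and the function name are assumptions for illustration only.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=200, initial_lr=1e-4, final_lr=1e-5):
    """Learning rate that follows half a cosine period from initial_lr down to final_lr."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 50, 100, 150, 200):
    print(epoch, cosine_annealed_lr(epoch))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the end of training.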

[0064] Training period and batch size: The total training period is set to 200, and the number of images processed per batch (batch size) is set to 16.

[0065] Early stopping mechanism: During training, the mean average precision (mAP) metric on the validation set is continuously monitored. If this metric does not improve for 20 consecutive training epochs, training is terminated, and the currently best-performing model weights are saved.
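The early stopping rule can be sketched with a small helper. The class name `EarlyStopping` and the validation scores below are hypothetical; only the patience of 20 epochs comes from the text.

```python
class EarlyStopping:
    """Track the best validation score and signal a stop after `patience` stale epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def update(self, epoch, score):
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.stale_epochs = 0  # a real trainer would save the model weights here
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


stopper = EarlyStopping(patience=20)
scores = [0.50, 0.60] + [0.55] * 30  # made-up validation scores that stop improving
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper.update(epoch, score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch)  # 21 1
```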

[0066] Model Evaluation and Saving. Throughout the training process, model performance is periodically evaluated on an independent validation set. After training, a final evaluation is performed on an unseen test set to verify the model's generalization ability. The trained model, i.e., its final weight parameters, is saved as a "pre-trained defect detection model" for subsequent inference detection of PCB surface defects.

[0067] Through the training process described above, which includes specific data augmentation, improved model structure, customized loss function, and detailed optimization configuration, the MDUA-YOLO model can effectively learn from the data and ultimately obtain high-precision PCB surface defect detection capabilities.

[0068] S3. Based on the bounding box and category information output by the defect detection model, determine the defect detection results of the PCB surface image.

[0069] The purpose of this step is to parse, filter, and integrate the raw prediction data output by the defect detection model, transforming it into a structured final detection report that can be used by subsequent systems. The computer equipment first obtains the set of raw prediction tensors output by all detection heads from the model's computational graph. Each prediction tensor corresponds to a specific feature scale, containing a large amount of dense, overlapping, and redundant preliminary prediction information.

[0070] The first step in processing this data by the computer equipment is to perform confidence threshold filtering. It iterates through every candidate bounding box in all prediction tensors, reading its associated confidence score. This score characterizes the degree of certainty that the model believes the box contains a defective object. The computer equipment then discards all candidate bounding boxes with confidence scores below a pre-set threshold, such as 0.5 or 0.6. This operation aims to filter out a large number of low-quality predictions triggered by background or noise, significantly reducing the amount of data required for subsequent computations.

[0071] After initial filtering, the computer device groups the remaining predicted bounding boxes at all scales according to their predicted defect categories. For each defect category group, the computer device needs to address the issue of the same defect being repeatedly detected by multiple overlapping boxes. To this end, it applies a non-maximum suppression algorithm. This algorithm first sorts all predicted bounding boxes within a group in descending order based on their confidence scores. Then, it selects the box with the highest confidence score as the baseline and calculates the intersection over union (IoU) of this box with all other boxes in the group. IoU is the ratio of the area of overlap between two rectangular boxes to the area of their union. The algorithm identifies other predicted bounding boxes with an IoU exceeding another preset threshold (e.g., 0.5) as repeated detections of the same target and removes them from the group. Next, the algorithm selects the next highest-confidence box from the remaining boxes as the new baseline and repeats the above calculation and suppression process until all boxes in the category group have been processed. This process ensures that for a potential defect instance in an image, only one predicted bounding box with the highest confidence and the most accurate localization is ultimately retained.
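The confidence filtering and per-category suppression described above can be sketched in NumPy. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2) with non-zero area; the function names and the sample boxes are illustrative.

```python
import numpy as np


def iou(box, others):
    """Overlap ratio of one box against an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)


def nms_per_class(boxes, scores, classes, conf_thresh=0.5, iou_thresh=0.5):
    """Return indices of boxes kept after confidence filtering and class-wise suppression."""
    keep = []
    candidates = np.flatnonzero(scores >= conf_thresh)  # drop low-confidence boxes first
    for cls in np.unique(classes[candidates]):
        idx = candidates[classes[candidates] == cls]
        idx = idx[np.argsort(-scores[idx])]             # descending confidence
        while idx.size > 0:
            best = idx[0]
            keep.append(int(best))
            rest = idx[1:]
            if rest.size == 0:
                break
            overlaps = iou(boxes[best], boxes[rest])
            idx = rest[overlaps <= iou_thresh]          # remove repeated detections
    return keep


boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140],
                  [0, 0, 5, 5], [10, 10, 50, 50]], dtype=float)
scores = np.array([0.94, 0.80, 0.70, 0.30, 0.60])
classes = np.array([1, 1, 1, 1, 2])
keep = nms_per_class(boxes, scores, classes)
print(keep)  # [0, 2, 4]
```

Box 1 is suppressed by box 0 (same class, IoU of about 0.82), box 3 fails the confidence threshold, and box 4 survives even though it coincides with box 0 because it belongs to a different category group.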

[0072] The computer then integrates these final predicted bounding boxes, which have undergone non-maximum suppression and belong to different categories. Each predicted bounding box typically includes: a category label representing the defect type, a corrected confidence score, and bounding box coordinates in the coordinate system of the model's input image. The bounding box coordinates are typically defined by four parameters: the x and y coordinates of the box's center point, and the box's width and height. The computer can transform these coordinates back to the pixel coordinate system of the original input PCB image as needed.
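Mapping a box back to the original image can be sketched as below. The sketch assumes the image was resized directly to a 640 by 640 input with no padding (a letterboxed input would additionally require subtracting the padding), and it returns corner coordinates. The function name and sample numbers are hypothetical.

```python
def to_original_pixels(cx, cy, w, h, original_width, original_height, input_size=640):
    """Convert a center-format box in model-input coordinates to corners in original pixels."""
    # Assumption: plain resize without padding, so each axis has a single scale factor.
    scale_x = original_width / input_size
    scale_y = original_height / input_size
    x1 = (cx - w / 2) * scale_x
    y1 = (cy - h / 2) * scale_y
    x2 = (cx + w / 2) * scale_x
    y2 = (cy + h / 2) * scale_y
    return (x1, y1, x2, y2)


corners = to_original_pixels(320, 320, 64, 32, original_width=1280, original_height=960)
print(corners)  # (576.0, 456.0, 704.0, 504.0)
```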

[0073] Finally, the computer system encapsulates all the integrated defect information into a structured list or report. This result is the final defect detection result of the PCB surface image, clearly listing the type, location, and confidence level of all identified defects. This result can be directly output to a user interface for visual annotation, archived in a database, or passed to an actuator to trigger the corresponding quality control process.

[0074] For example, the computer device receives approximately 8500 initial predicted bounding boxes from four detection heads. After filtering with a confidence threshold of 0.5, approximately 120 predicted bounding boxes remain. These boxes are divided into six groups according to their categories. When processing the "mouse bite" defect group, this group initially has 15 predicted bounding boxes with confidence scores between 0.52 and 0.94. The non-maximum suppression algorithm first selects box A with a confidence score of 0.94 as the baseline, and calculates the intersection over union (IoU) of boxes B, C, and D with A, finding them to be 0.85, 0.78, and 0.10, respectively. Since the IoU values of the first two exceed the suppression threshold of 0.5, boxes B and C are suppressed and removed, while box D is retained for the next round. Subsequently, the algorithm selects box E (0.82), which has the highest confidence score among the remaining boxes (including D), as the new baseline, and continues to suppress other boxes that highly overlap with it. Ultimately, this "mouse bite" group may retain only boxes A and E (for instance, if box D overlaps heavily with E and is suppressed in this second round), which correspond to two different "mouse bite" defect instances in the image.

[0075] After performing the above operations on all category groups, the computer device may obtain a result list containing eight final predicted bounding boxes. For example, the first result entry might be: category "short circuit", confidence score 0.96, bounding box center coordinates, width, and height. The computer device serializes this list into JSON (JavaScript Object Notation) format or stores it in a specific data structure, marking the completion of the automated defect detection task for the current PCB image. This structured data is the final detection result.
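A possible serialization of the result list is sketched below. The field names (`image_id`, `defects`, `bbox_cxcywh`) and the bounding box numbers are placeholders invented for illustration; the application does not specify a schema or concrete coordinates.

```python
import json


def build_report(image_id, detections):
    """Serialize (category, confidence, (cx, cy, w, h)) tuples into a JSON string."""
    report = {
        "image_id": image_id,
        "defects": [
            {
                "category": category,
                "confidence": round(confidence, 4),
                "bbox_cxcywh": list(bbox),
            }
            for category, confidence, bbox in detections
        ],
    }
    return json.dumps(report, indent=2)


# Hypothetical single detection; the coordinates are made-up placeholders.
report_json = build_report("pcb_0001", [("short circuit", 0.96, (412.0, 188.5, 36.0, 14.0))])
print(report_json)
```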

[0076] In one possible embodiment, see Figure 3, which is a schematic diagram of the feature extraction process provided in an embodiment of this application. The C3MobileViT module outputs the extracted features through the following steps: A1. Input the input feature map into the first branch and the second branch respectively; wherein, in the first branch, the input feature map is processed by the first convolutional block, the bottleneck block and the MobileViT module in sequence to obtain the first feature map; in the second branch, the input feature map is processed by the second convolutional block to obtain the second feature map.

[0077] A2. Concatenate the first feature map with the second feature map to obtain the merged feature map.

[0078] A3. Input the merged feature map into the third convolutional block for processing to output the extracted features.

[0079] Step A1 marks the beginning of the data flow within the C3MobileViT module. The computer device simultaneously copies the received input feature map into two data replicas, feeding them into two structurally independent processing branches for parallel computation. The input feature map is a three-dimensional tensor, with dimensions including spatial height, spatial width, and the number of channels.

[0080] In the first branch, the input feature map first enters the first convolutional block. This first convolutional block contains at least one convolutional layer, one normalization layer, and one non-linear activation function layer. Its main function is to perform preliminary spatial feature extraction and channel number mapping transformation on the input features. Next, the feature map is fed into a bottleneck block. This bottleneck block employs a design with residual connections, maintaining the stability of gradient flow while extracting deep features by first reducing and then restoring the number of channels. Finally, the feature map flows into the MobileViT module. This module first extracts local features using convolutional operations, then reassembles the feature map into a series of image patches and inputs them into a lightweight Transformer encoder structure. This establishes long-range dependencies across the entire feature map, thereby capturing and outputting a feature representation rich in global contextual information. The tensor finally output by this branch is called the first feature map.

[0081] In the second branch, another copy of the input feature map is fed into the second convolutional block for processing. This second convolutional block typically also consists of convolutional layers, normalization layers, and activation function layers. Its design focuses on efficiently transforming and adjusting the channel count of the input features, while aiming to preserve the original, fine-grained local spatial details in the feature map to the greatest extent possible. The tensor output after this branch's processing is called the second feature map.

[0082] For example, the feature map input to this module is a 3D tensor with a spatial size of 80 pixels * 80 pixels and 256 channels. The computer device copies this tensor, and one copy is fed into the first branch. This feature map first passes through the first convolutional block, converting its channel count to 128. Subsequently, it undergoes further processing through a bottleneck block. Finally, via the MobileViT module, which may divide the feature map into multiple image blocks and compute using a self-attention mechanism, it outputs a first feature map with a spatial size of 80 * 80 pixels and 128 channels, encoding global semantic relationships. Simultaneously, another copy of the feature map is fed into the second branch, processed by the second convolutional block, and outputs a second feature map with a spatial size of 80 * 80 pixels and 128 channels, which retains key local details such as line edges and corners.

[0083] The purpose of step A2 is to integrate the complementary features extracted from the two parallel branches. The computer device retrieves the first feature map output from the first branch and the second feature map output from the second branch from memory. These two feature maps have the same spatial height and width dimensions. The computer device performs a concatenation operation along the channel dimension on these two three-dimensional tensors, that is, it sequentially arranges all channel data of the second feature map after all channel data of the first feature map, combining them into a new three-dimensional tensor. This newly generated tensor is called the merged feature map, and its spatial dimensions are consistent with the input feature maps, while its number of channels is equal to the sum of the number of channels of the first and second feature maps.

[0084] Following the previous example, the computer device stitches together the first feature map (80*80 pixels, 128 channels) and the second feature map (80*80 pixels, 128 channels). The stitching operation is performed along the channel dimension, generating a new merged feature map. The spatial size of this merged feature map remains 80 pixels * 80 pixels, but its number of channels becomes 256.

[0085] Step A3 is the final processing stage of the C3MobileViT module. The computer device takes the merged feature map obtained in step A2 as input and feeds it into the third convolutional block for processing. This third convolutional block contains at least one convolutional layer, one normalization layer, and one non-linear activation function layer. Its core function is to perform cross-channel information integration and non-linear transformation on the concatenated and fused features to learn and output a more representative fused feature. This convolutional block may also adjust the number of channels of the feature map to the dimension desired by the downstream network. The final three-dimensional tensor output after processing by this convolutional block is the extracted feature output by the C3MobileViT module.

[0086] Continuing the previous example, the computer device inputs the merged feature map (80*80 pixels, 256 channels) into the third convolutional block. This third convolutional block may contain a 3*3 convolutional layer, which consolidates the number of channels to 256 and outputs it. Finally, the computer device outputs a feature map with a spatial size of 80*80 pixels and 256 channels, as the final result processed by the C3MobileViT module. This result integrates the global contextual information extracted via the first branch and the local detail information preserved via the second branch.
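The tensor shapes in steps A1 to A3 can be traced with a NumPy sketch. It only follows the data flow: every block, including the bottleneck, the MobileViT module, and the third convolutional block, is replaced by an untrained 1x1 convolution stand-in, so the values carry no meaning. The function name `pointwise_conv` is hypothetical.

```python
import numpy as np


def pointwise_conv(x, out_channels, rng):
    """1x1 convolution on a (C, H, W) tensor using random, untrained weights."""
    in_channels, height, width = x.shape
    weights = rng.standard_normal((out_channels, in_channels)) * 0.01
    return (weights @ x.reshape(in_channels, -1)).reshape(out_channels, height, width)


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 80, 80))           # module input: 256 channels, 80x80

# A1: two parallel branches (stand-ins for the real blocks).
first = pointwise_conv(x, 128, rng)              # conv block + bottleneck + MobileViT stand-in
second = pointwise_conv(x, 128, rng)             # second convolutional block stand-in

# A2: concatenate along the channel dimension, 128 + 128 = 256.
merged = np.concatenate([first, second], axis=0)

# A3: third convolutional block (stand-in) fuses channels and sets the output width.
extracted = pointwise_conv(merged, 256, rng)
print(first.shape, second.shape, merged.shape, extracted.shape)
```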

[0087] In one possible embodiment, see Figure 4, which is a schematic flowchart of the process for generating enhanced feature maps provided in an embodiment of this application. The deformable convolutional receptive field module outputs enhanced feature maps through the following steps: B1. First perform 1×1 convolution processing on the input feature map, and then perform standard convolution processing with kernel sizes of 1×1, 3×3 and 5×5 respectively to obtain multiple intermediate feature maps.

[0088] B2. Each intermediate feature map is input into a deformable convolutional layer for processing to obtain multiple deformable feature maps. The sampling position of the convolution kernel of the deformable convolutional layer is dynamically adjusted according to the input features.

[0089] B3. Concatenate the multiple deformable feature maps to obtain a concatenated feature map.

[0090] B4. The concatenated feature map is added to and fused with the input feature map to output an enhanced feature map.

[0091] Step B1 is the initial stage of the deformable convolutional receptive field module's processing flow. The computer acquires the input feature map to be processed, which is a three-dimensional tensor. The module first processes this input feature map using a 1x1 convolutional layer. The main function of this 1x1 convolution is to perform cross-channel information integration and dimensionality reduction or expansion transformations to adjust the number of channels in the feature map, preparing a suitable feature representation foundation for subsequent parallel computing branches.

[0092] Subsequently, the computer device copies this feature map, processed by 1x1 convolution, multiple times and feeds them into three independent parallel computation branches. Each branch uses a standard convolutional layer with a fixed, regular sampling grid, but the spatial dimensions of the standard convolutional kernels used in the three branches are different: 1x1, 3x3, and 5x5, respectively. These different sized convolutional kernels are designed to capture feature patterns within different spatial scales of the input feature map. Each branch performs convolution computation independently and outputs a feature map with the same spatial dimensions as the input, but the number of channels may be adjusted by design. The feature maps output by these three branches, together with the initial feature map processed by 1x1 convolution, constitute multiple intermediate feature maps.

[0093] For example, the input feature map of this module has a spatial size of 40 pixels by 40 pixels and 256 channels. The computer device first processes it using a 1x1 convolutional layer, which may reduce the number of channels from 256 to 64, outputting a 40x40 pixel, 64-channel feature map. Next, the computer device copies this feature map three times. The first copy is input to a standard convolutional layer with a 1x1 kernel, outputting a 40x40 pixel, 64-channel feature map. The second copy is input to a standard convolutional layer with a 3x3 kernel, outputting a 40x40 pixel, 64-channel feature map. The third copy is input to a standard convolutional layer with a 5x5 kernel, outputting a 40x40 pixel, 64-channel feature map. Together with the original 40x40 pixel, 64-channel feature map, the computer device now has four intermediate feature maps.

[0094] Step B2 is the core step in enabling dynamic receptive field adjustment. The computer device processes each intermediate feature map obtained in step B1 by inputting it into an independent deformable convolutional layer. Unlike standard convolutional layers that use a fixed rectangular sampling grid, deformable convolutional layers, when performing convolution operations, attach a two-dimensional spatial offset learned by the network to each sampling point position of their convolution kernel on the feature map. This offset is typically predicted in real time from the content of the current input feature map by an additional lightweight sub-network (such as a separate convolutional layer). Therefore, for each intermediate feature map, the deformable convolutional layer dynamically and adaptively warps the sampling position grid of the convolution kernel according to its feature content, allowing the sampling points to more accurately focus on key regions in the feature map or adapt to the non-rigid geometric deformation of the target. After each deformable convolutional layer completes processing, it outputs a new feature map with the same spatial size and number of channels as the input intermediate feature map; these new feature maps are called deformable feature maps.

[0095] Continuing the previous example, the computer feeds four intermediate feature maps (each 40x40 pixels, 64 channels) into four independent deformable convolutional layers. For one of the intermediate feature maps, the corresponding deformable convolutional layer (e.g., with a kernel size of 3x3) no longer uses a regular 3x3 grid for sampling at each location on the output feature map. For example, when processing a region containing an arc-shaped defect edge, the offset predicted by the network might cause these sampling points to bend along the arc, thus capturing the features of the edge more accurately. The four deformable convolutional layers ultimately output four deformable feature maps with a spatial size of 40x40 pixels and 64 channels.

[0096] Step B3 aims to aggregate the multi-scale feature information captured from different receptive field branches after dynamic adjustments. The computer device acquires all deformable feature maps output from step B2. These feature maps have the same spatial height and width dimensions, as well as the same number of channels. The computer device performs a concatenation operation along the channel dimension, arranging all channel data of the second deformable feature map after the first, then arranging all channel data of the third after the second, and so on. This operation produces a new three-dimensional tensor with a significantly increased number of channels, called the concatenated feature map. Its spatial dimensions remain unchanged, while the number of channels equals the sum of the number of channels of all input deformable feature maps.

[0097] Continuing the previous example, the computer device concatenates four deformable feature maps (each 40x40 pixels, 64 channels) along the channel dimension. The concatenation generates a new feature map with the same spatial dimensions of 40x40 pixels, but with 256 channels (4x64). This concatenated feature map incorporates dynamically adjusted multi-scale features extracted from deformable convolutions with different initial receptive fields (1x1, 3x3, 5x5, etc.).

[0098] Step B4 is the final processing stage of the deformable convolutional receptive field module, designed to fuse the new features learned by the module with the original input features to preserve necessary original information and stabilize network training. The computer device takes the concatenated feature map obtained in step B3 as one input and the original input feature map initially received by the module as the other. To achieve additive fusion, the computer device needs to ensure that the two feature maps have exactly the same spatial dimensions and number of channels. This is typically achieved by adding an additional 1x1 convolutional layer to adjust the number of channels in the concatenated feature map, making it consistent with the number of channels in the original input feature map. Subsequently, the computer device performs an element-wise addition operation on the two aligned feature maps. This addition operation forms a residual connection, enabling the module to learn an incrementally enhanced representation of the input feature map. The final three-dimensional tensor obtained after addition is the enhanced feature map output by the module.

[0099] Continuing the previous example, the concatenated feature map has a size of 40*40 pixels and 256 channels. The original input feature map also has a size of 40*40 pixels and 256 channels. Since the number of channels is already matched, the computer directly performs an element-wise addition operation on these two feature maps. The final output is an enhanced feature map with a spatial size of 40*40 pixels and 256 channels. This feature map contains both the original input information and the enhanced features dynamically extracted by multi-scale deformable convolution.
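The data flow of steps B1 to B4 can be traced in the same way. This is a shape-level sketch under heavy simplification: the standard convolutions of all kernel sizes are replaced by untrained 1x1 stand-ins, and the deformable convolutional layers are replaced by an identity copy, so only the branching, concatenation, and residual addition are illustrated. All function names are hypothetical.

```python
import numpy as np


def pointwise_conv(x, out_channels, rng):
    """1x1 convolution on a (C, H, W) tensor using random, untrained weights."""
    in_channels, height, width = x.shape
    weights = rng.standard_normal((out_channels, in_channels)) * 0.01
    return (weights @ x.reshape(in_channels, -1)).reshape(out_channels, height, width)


def receptive_field_module(x, rng, mid_channels=64):
    """Trace B1-B4 on a (C, H, W) input; convolution contents are stand-ins."""
    # B1: 1x1 reduction, then three parallel branches (1x1 stand-ins for 1x1, 3x3, 5x5).
    reduced = pointwise_conv(x, mid_channels, rng)
    intermediates = [reduced] + [pointwise_conv(reduced, mid_channels, rng) for _ in range(3)]
    # B2: deformable convolution omitted here; an identity copy keeps the shapes.
    deformed = [m.copy() for m in intermediates]
    # B3: concatenate along the channel dimension (4 * 64 = 256).
    concatenated = np.concatenate(deformed, axis=0)
    # B4: align channels with an extra 1x1 convolution only when they differ, then add.
    if concatenated.shape[0] != x.shape[0]:
        concatenated = pointwise_conv(concatenated, x.shape[0], rng)
    return x + concatenated


rng = np.random.default_rng(0)
enhanced = receptive_field_module(rng.standard_normal((256, 40, 40)), rng)
enhanced_128 = receptive_field_module(rng.standard_normal((128, 40, 40)), rng)
print(enhanced.shape, enhanced_128.shape)
```

The second call exercises the channel alignment path: with a 128-channel input, the 256-channel concatenated map is projected back to 128 channels before the element-wise addition.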

[0100] In one possible embodiment, the joint attention block processes the enhanced feature map through the following steps: C1. Based on spatial resolution, divide the enhanced feature maps into high-level feature maps and low-level feature maps; C2. Apply a channel attention mechanism to the high-level feature map, including: after upsampling the high-level feature map, perform global average pooling, the first fully connected layer, the ReLU (Rectified Linear Unit) activation function, the second fully connected layer and the Sigmoid function in sequence to generate channel attention weights, and then weight the high-level feature map according to the channel attention weights to obtain the channel-enhanced feature map; C3. Apply a coordinate attention mechanism to the low-level feature map, including: encoding the low-level feature map along the horizontal and vertical directions respectively, generating spatial attention weights, and weighting the low-level feature map according to the spatial attention weights to obtain a spatially enhanced feature map; C4. Concatenate the channel-enhanced feature map with the spatially enhanced feature map to output the optimized feature map.

[0101] Step C1 is the initial stage of the joint attention block processing flow. The computer device acquires a set of enhanced feature maps to be processed, which contains multiple feature maps with different spatial resolutions from the outputs of the aforementioned network layers. Based on the spatial resolution of each feature map, that is, its height and width in pixels, the computer device divides the entire feature map set into two logical groups. One group is classified as high-level feature maps, typically those that have lower spatial resolution, have undergone more downsampling operations in the network, and carry richer high-level semantic information. The other group is classified as low-level feature maps, typically those that have higher spatial resolution, have undergone fewer downsampling operations in the network, and retain more original spatial detail and location information. The division is based on preset rules, such as classifying all feature maps with spatial dimensions below a certain threshold into the high-level group and the rest into the low-level group.

[0102] For example, the enhanced feature map set input to this module contains four feature maps with spatial dimensions of 80 pixels * 80 pixels, 40 pixels * 40 pixels, 20 pixels * 20 pixels, and 10 pixels * 10 pixels, respectively. The computer device, according to preset rules, divides the lower-resolution 20*20 pixel and 10*10 pixel feature maps into a high-level feature map group, and the higher-resolution 80*80 pixel and 40*40 pixel feature maps into a low-level feature map group.
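The threshold rule of step C1 can be sketched in a few lines of Python. The function name `split_by_resolution` and the threshold value of 40 are assumptions chosen so that the result matches the example above; the application itself only states that a preset rule is used.

```python
def split_by_resolution(shapes, threshold=40):
    """Split (height, width) pairs into high-level and low-level groups.

    Maps whose larger side is below `threshold` (a hypothetical preset
    value) are treated as high-level; the rest are low-level.
    """
    high = [s for s in shapes if max(s) < threshold]
    low = [s for s in shapes if max(s) >= threshold]
    return high, low

high, low = split_by_resolution([(80, 80), (40, 40), (20, 20), (10, 10)])
print(high)  # [(20, 20), (10, 10)]
print(low)   # [(80, 80), (40, 40)]
```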

[0103] Step C2 applies a channel attention mechanism to the high-level feature map group to enhance its semantic discriminative ability. The computer device first upsamples the high-level feature maps, increasing their spatial resolution to a scale compatible with subsequent processing, for example, uniformly upsampling them to a size comparable to the highest resolution among the low-level feature maps. Next, the computer device performs global average pooling on each upsampled high-level feature map, averaging all pixel values over the spatial dimensions to compress the two-dimensional feature map of each channel into a single scalar value, resulting in a one-dimensional channel description vector. This vector captures the global response of each feature channel.

[0104] Subsequently, the channel description vector is fed into a subnetwork consisting of two fully connected layers. The first fully connected layer performs a dimensionality reduction transformation on the vector and introduces non-linearity, typically by connecting a ReLU activation function. The second fully connected layer restores the dimensionality to the original number of channels. This two-layer network learns the non-linear interaction relationships between channels. Its output is then processed by a Sigmoid activation function, mapping the weight value of each channel to between 0 and 1 and generating the final channel attention weight vector.

[0105] Finally, the computer device combines the generated channel attention weight vectors with the original high-level feature maps. Specifically, each scalar value in the weight vector is multiplied by the entire feature map of the corresponding channel, achieving channel dimension recalibration. The new feature map obtained after this weighting operation is called the channel-enhanced feature map, in which important semantic feature channels are enhanced, while secondary channels are suppressed.

[0106] Taking the aforementioned 20*20 pixel high-level feature map as an example, assuming it has 512 channels, the computer device first upsamples it to 80*80 pixels. Next, it calculates the average value of all pixels for each channel's 80*80 feature map, resulting in a 512-dimensional vector. This vector is compressed to 128 dimensions by a fully connected layer, activated by ReLU, and then restored to 512 dimensions by another fully connected layer. Finally, a Sigmoid function is used to generate 512 weight values between 0 and 1. The computer device multiplies these 512 weights by the corresponding channels of the original 20*20 pixel feature map, outputting a channel-enhanced 20*20 pixel feature map.
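The channel attention computation of step C2 can be sketched with NumPy as follows. This is a minimal sketch under assumptions: nearest-neighbor upsampling is used because the application does not name the upsampling method, the fully connected layers omit biases, the weights `w1` and `w2` are random placeholders rather than trained parameters, and the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) array (assumed method)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def channel_attention(feat, w1, w2, up_factor=4):
    """Channel attention on a (C, H, W) map.

    w1: (C // r, C) first fully connected layer (reduction)
    w2: (C, C // r) second fully connected layer (restoration)
    """
    up = upsample_nearest(feat, up_factor)
    desc = up.mean(axis=(1, 2))                      # global average pooling -> (C,)
    hidden = np.maximum(w1 @ desc, 0.0)              # first FC layer + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC layer + Sigmoid
    # Recalibrate the original (not upsampled) map channel by channel.
    return feat * weights[:, None, None], weights

# Example from the text: 512 channels, 20*20, reduced to 128 dimensions.
rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 20, 20))
w1 = rng.standard_normal((128, 512)) * 0.05
w2 = rng.standard_normal((512, 128)) * 0.05
enhanced, weights = channel_attention(feat, w1, w2)
print(enhanced.shape, weights.shape)
```

Note that with nearest-neighbor upsampling the per-channel average is unchanged by the upsampling step; other interpolation methods may give slightly different channel descriptors.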

[0107] Step C3 applies a coordinate attention mechanism to the low-level feature map set to enhance its spatial localization capability. The computer device first processes the low-level feature maps, performing global pooling operations along both the horizontal and vertical coordinate axes. Specifically, for a feature map with height H and width W, pooling (usually average pooling) is performed over the W pixels of each row along the horizontal direction, generating a feature vector with height H and width 1 that encodes global information for each vertical position. Similarly, pooling is performed over the H pixels of each column along the vertical direction, generating a feature vector with height 1 and width W that encodes global information for each horizontal position.

[0108] Subsequently, the two direction-aware feature vectors are concatenated and fused through a shared 1x1 convolutional layer, followed by non-linear activation. Next, this fused feature is decomposed again into two independent feature vectors corresponding to the horizontal and vertical directions, respectively. These two vectors are processed by another 1x1 convolutional layer and a sigmoid function, transforming them into two independent spatial attention weight maps: a horizontal weight map with height H and width 1, and a vertical weight map with height 1 and width W.

[0109] Finally, the computer device applies these two weight maps to the original low-level feature map. Specifically, the horizontal weight map is broadcast-multiplied with the feature map in the width direction, and the vertical weight map is broadcast-multiplied with the feature map in the height direction, thereby achieving feature recalibration based on coordinate location. The new feature map obtained after this weighting operation is called the spatially enhanced feature map, in which the features of key spatial regions are significantly enhanced.

[0110] Taking the aforementioned 80*80 pixel low-level feature map as an example, assuming its number of channels is 256, the computer device first performs global average pooling in the horizontal direction to generate an 80*1*256 tensor; simultaneously, it performs global average pooling in the vertical direction to generate a 1*80*256 tensor. These two tensors are concatenated, passed through a 1x1 convolution and an activation function, and then split into two tensors with spatial sizes of 80*1 and 1*80, respectively. Weights are generated for each tensor using the Sigmoid function. Finally, these two weight tensors are broadcast and multiplied back onto the original 80*80*256 feature map, outputting a spatially enhanced 80*80 pixel feature map.
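The coordinate attention computation of step C3 can be sketched with NumPy as follows. This is an illustrative sketch only: the 1x1 convolutions are modeled as bias-free matrices, ReLU stands in for the unspecified non-linear activation, the intermediate channel count of 32 is an assumed value, and all weights are random placeholders. Arrays are stored channel-first as (C, H, W).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(feat, w_shared, w_h, w_w):
    """Coordinate attention on a (C, H, W) map.

    w_shared: (M, C) shared 1x1 convolution
    w_h, w_w: (C, M) separate 1x1 convolutions for the two directions
    """
    c, h, w = feat.shape
    pooled_h = feat.mean(axis=2)                           # average over width  -> (C, H)
    pooled_w = feat.mean(axis=1)                           # average over height -> (C, W)
    joint = np.concatenate([pooled_h, pooled_w], axis=1)   # (C, H + W)
    joint = np.maximum(w_shared @ joint, 0.0)              # shared 1x1 conv + ReLU (assumed)
    part_h, part_w = joint[:, :h], joint[:, h:]            # split back per direction
    a_h = sigmoid(w_h @ part_h)                            # (C, H) weights, one per row
    a_w = sigmoid(w_w @ part_w)                            # (C, W) weights, one per column
    # Broadcast the row weights across the width and the column weights across the height.
    return feat * a_h[:, :, None] * a_w[:, None, :]

# Example from the text: 256 channels, 80*80.
rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 80, 80))
w_shared = rng.standard_normal((32, 256)) * 0.05
w_h = rng.standard_normal((256, 32)) * 0.05
w_w = rng.standard_normal((256, 32)) * 0.05
spatial = coordinate_attention(feat, w_shared, w_h, w_w)
print(spatial.shape)
```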

[0111] Step C4 is the final processing stage of the joint attention block, designed to fuse features optimized by two different attention mechanisms. The computer device acquires the channel-enhanced feature map output from step C2 and the spatial-enhanced feature map output from step C3. Since they may have different spatial resolutions, the computer device typically needs to upsample the channel-enhanced feature map to the same resolution as the spatial-enhanced feature map first. Subsequently, the computer device concatenates these two spatially identical feature maps along the channel dimension, generating a new three-dimensional tensor with the number of channels equal to the sum of the two. This final output tensor is the optimized feature map, which simultaneously fuses rich semantic information from high-level features and precise spatial localization information from low-level features.

[0112] Continuing with the previous example, the computer device upsamples the channel-enhanced 20x20 pixel feature map to 80x80 pixels. Assume its channel count is 512, and the spatially enhanced 80x80 pixel feature map has 256 channels. The two are then concatenated along the channel dimension to generate an optimized 80x80 pixel feature map with 768 channels, which serves as the output of this module.
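The fusion of step C4 can be sketched with NumPy as follows. The function name `fuse_attention_outputs` is hypothetical, and nearest-neighbor upsampling is an assumption, since the application does not specify the interpolation method.

```python
import numpy as np

def fuse_attention_outputs(channel_feat, spatial_feat):
    """Upsample the channel-enhanced map and concatenate along channels.

    channel_feat: (C1, h, w); spatial_feat: (C2, H, W), with H and W
    assumed to be integer multiples of h and w.
    """
    fh, rh = divmod(spatial_feat.shape[1], channel_feat.shape[1])
    fw, rw = divmod(spatial_feat.shape[2], channel_feat.shape[2])
    if rh or rw:
        raise ValueError("resolutions must differ by an integer factor")
    up = channel_feat.repeat(fh, axis=1).repeat(fw, axis=2)
    return np.concatenate([up, spatial_feat], axis=0)

# Example from the text: 512 channels at 20*20 and 256 channels at 80*80.
channel_feat = np.zeros((512, 20, 20), dtype=np.float32)
spatial_feat = np.zeros((256, 80, 80), dtype=np.float32)
fused = fuse_attention_outputs(channel_feat, spatial_feat)
print(fused.shape)  # (768, 80, 80)
```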

[0113] This joint attention block, by applying channel attention and coordinate attention mechanisms in a differentiated manner, precisely enhances and fuses the semantic information of high-level features and the spatial information of low-level features, respectively. This effectively improves the model's accuracy in classifying PCB defects and its location accuracy, and is particularly beneficial for identifying and locating small or low-contrast defect targets in complex backgrounds.

[0114] In one possible embodiment, the four detection heads detect feature maps at four different scales. When the input size of the PCB surface image is 640×640 pixels, the feature map sizes corresponding to the four scales are 160×160, 80×80, 40×40 and 20×20, respectively.

[0115] This step, executed by the processing unit of the computer device, is the final prediction stage of the defect detection model. The computer device assigns the optimized multi-scale feature maps output by the neck network to four independent detection heads for processing according to a specific correspondence. This correspondence is based on the spatial resolution of the feature maps, with each detection head specifically responsible for processing feature maps at a particular spatial scale. Typically, larger spatially sized feature maps (e.g., 160×160) contain richer, finer-grained spatial details and are assigned to detect smaller-scale defects in the image; while smaller spatially sized feature maps (e.g., 20×20) contain stronger semantic information and are assigned to detect larger-scale defects. The four detection heads are structurally identical, each containing several convolutional layers to convert the input feature maps into dense prediction tensors containing bounding box coordinates, confidence scores, and class probabilities. The computer device runs these four detection heads in parallel, each performing forward computation on its assigned feature map.

[0116] For example, when the input image size is 640 pixels × 640 pixels, the neck network outputs four optimized feature maps with different spatial scales: 160 pixels × 160 pixels, 80 pixels × 80 pixels, 40 pixels × 40 pixels, and 20 pixels × 20 pixels. The computer assigns the 160 × 160 feature map to the first detection head, the 80 × 80 feature map to the second, the 40 × 40 feature map to the third, and the 20 × 20 feature map to the fourth. Each detection head consists of several convolutional layers. Taking the first detection head as an example, it performs convolution operations on the input 160 × 160 feature map, ultimately outputting a three-dimensional prediction tensor with a spatial dimension of 160 × 160. Each spatial location corresponds to several predefined anchor boxes, and the tensor predicts the bounding box coordinate offset, a confidence score, and the probability distribution of all defect categories for each anchor box. Other detection heads work in a similar way, but because their input feature map scales are different, the preset anchor frame size is also adjusted accordingly to adapt to targets of different sizes.
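The relationship between the input size and the four prediction grids can be checked with a short calculation. The strides 4, 8, 16, and 32 follow from dividing 640 by the stated grid sizes; the values of 3 anchor boxes per location and 6 defect categories are assumptions for illustration only, as the application does not state them here.

```python
def head_output_shapes(input_size=640, strides=(4, 8, 16, 32),
                       num_anchors=3, num_classes=6):
    """Return (grid_h, grid_w, channels) for each detection head.

    Each anchor box predicts 4 box offsets, 1 confidence score, and
    num_classes class probabilities. num_anchors and num_classes are
    hypothetical values.
    """
    shapes = []
    for stride in strides:
        grid = input_size // stride
        shapes.append((grid, grid, num_anchors * (5 + num_classes)))
    return shapes

shapes = head_output_shapes()
for shape in shapes:
    print(shape)
```

Under these assumed values, every head emits 33 channels per grid cell, and only the grid size changes from head to head.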

[0117] By setting up four detection heads corresponding to feature maps at different scales, the model achieves multi-scale collaborative detection of defects on PCB surfaces. This design enables the model to generate predictions simultaneously at multiple resolution levels. It can capture the fine features of small defects using high-resolution feature maps, and grasp the global context of large defects using low-resolution feature maps. This significantly improves the detection recall and localization accuracy for defects of different sizes and, in particular, improves on the ability of traditional single-scale or fewer-scale detection methods to detect small target defects.

[0118] In one possible embodiment, the method further includes, before inputting the PCB surface image into the defect detection model: D1. Preprocess the PCB surface image. Preprocessing includes scaling the image to the fixed input size used during defect detection model training and normalizing it.

[0119] This step, performed by a computer, standardizes and formats the input raw PCB surface image before the defect detection model performs core analysis. The primary preprocessing operation is image scaling. The computer reads the pixel matrix data of the original image from memory, which has the original height and width dimensions. Based on the uniform input dimensions preset and fixed during the defect detection model's training phase, the computer transforms the pixel matrix of the original image to the target size using a specific image resampling algorithm. Commonly used algorithms include bilinear interpolation, which calculates the color value of each pixel in the new-sized image based on the weighted average of neighboring pixels in the original image, preserving image content as smoothly as possible. This scaling operation ensures that all input images are fully compatible with the model's weight parameters in terms of spatial dimensions.

[0120] After scaling, the computer performs normalization. Normalization is applied to the pixel intensity values of each color channel in the scaled image. The computer first converts the pixel values from their original integer representation, such as an 8-bit unsigned integer ranging from 0 to 255, to a floating-point representation. Then, a linear transformation is performed on each pixel value using a set of preset constants obtained from the training dataset before model training, typically including the mean and standard deviation of each color channel. Specifically, the corresponding channel mean is subtracted from the channel value of each pixel, and the result is divided by the corresponding channel standard deviation. This operation adjusts the distribution of pixel intensity values for each channel to be centered at approximately zero with a standard deviation of approximately one, thereby reducing intensity distribution deviations caused by differences in lighting and camera response between different images and ensuring that the input data distribution is consistent with the distribution seen during model training.

[0121] For example, a computer acquires a color image of a PCB with a raw resolution of 2448 pixels × 2048 pixels from a camera. The model training uses a fixed input size of 640 × 640 pixels. The computer uses bilinear interpolation to scale the original image proportionally, preserving the ratio of its long and short sides, and pads the remaining area with background so that the scaled content is centered, ultimately generating an image matrix of size 640 × 640 pixels. Then, for the red, green, and blue channels of this image, the computer converts the pixel values of each channel from 0-255 to floating-point numbers between 0.0 and 1.0. Assume the pre-stored statistical constants of the training set are: red channel mean 0.485, standard deviation 0.229; green channel mean 0.456, standard deviation 0.224; blue channel mean 0.406, standard deviation 0.225. The computer subtracts 0.485 from the red channel value of each pixel in the image and then divides it by 0.229, performing similar operations for the green and blue channels. Ultimately, the image data is converted into a floating-point tensor with a size of 640×640×3 and a numerical range of approximately -2.1 to +2.6, which can be directly input into the model.
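The arithmetic of this example can be reproduced with a short NumPy sketch. It computes only the scaled size, the padding, and the normalization; the bilinear resampling itself is left to an image library and is not shown. The function names and the choice to split the padding evenly on both sides are assumptions.

```python
import numpy as np

# Per-channel constants quoted in the example above (R, G, B order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def letterbox_geometry(height, width, target=640):
    """Return (new_h, new_w, pad_top, pad_left) for proportional scaling."""
    scale = min(target / height, target / width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Remaining rows and columns are padding; half goes before the content.
    return new_h, new_w, (target - new_h) // 2, (target - new_w) // 2

def normalize(image_u8):
    """Convert an (H, W, 3) uint8 image to a standardized float32 tensor."""
    x = image_u8.astype(np.float32) / 255.0
    return (x - MEAN) / STD

geometry = letterbox_geometry(height=2048, width=2448)
print(geometry)

darkest = normalize(np.zeros((1, 1, 3), dtype=np.uint8))
brightest = normalize(np.full((1, 1, 3), 255, dtype=np.uint8))
print(float(darkest.min()), float(brightest.max()))
```

For the 2448 × 2048 image, the scaled content occupies 640 × 535 pixels, so padding is needed only above and below it.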

[0122] This preprocessing step ensures that the input data received by the model has a constant format and stable distribution by standardizing the size and numerical distribution of the input images. This directly eliminates the interference caused by differences in input image resolution and color distribution on model performance, improves the repeatability and stability of the model's inference process, and is an important foundation for ensuring high accuracy and robustness in subsequent defect detection.

[0123] In one possible embodiment, the deformable convolution operation in the deformable convolutional layer is performed as follows: the output feature value y(p) at position p on the output feature map is calculated using the following formula: $y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k)$; where K represents the total number of sampling points in the convolution kernel, $w_k$ is the weight of the convolution kernel corresponding to the k-th sampling point, $p_k$ is the predefined offset of the sampling point within the regular convolutional grid, $x(p + p_k + \Delta p_k)$ is the input feature value at the adjusted position $p + p_k + \Delta p_k$ on the input feature map, and $\Delta p_k$ is the learnable offset obtained through network learning, used to dynamically adjust the position of the k-th sampling point.

[0124] This step, executed by the processing unit of the computer device, is the core numerical calculation process for the deformable convolutional layer to complete feature transformation. When the computer device processes each position on the output feature map, the calculation of its corresponding output feature value follows a weighted summation process that includes dynamic position adjustments.

[0125] First, the computer determines the total number of sampling points based on the size of the convolution kernel. Each sampling point has a predefined fixed two-dimensional offset relative to the center point of the convolution in a standard convolution operation, and these offsets form a regular sampling grid.

[0126] The core improvement of deformable convolution operations is the introduction of an additional set of learnable dynamic offsets. These dynamic offsets are calculated in real-time by the preceding convolutional layers in the network based on the content of the current input feature map. Specifically, an auxiliary convolutional layer, running parallel to the main convolutional layer, processes the input feature map. Its output channel number is designed to predict a two-dimensional dynamic offset for each sampling point at each spatial location on the output feature map. Therefore, for different locations on the output feature map, the sampling grid shape of its convolutional kernel can adaptively and irregularly deform spatially according to the local content of the image.

[0127] When calculating the feature value at a specific location on the output feature map, the computer device performs the following steps: First, it obtains the set of dynamic offsets predicted for that location. Next, it adds the predefined fixed offset of each sampling point to the corresponding dynamic offset, obtaining the adjusted actual sampling offset for that sampling point. Then, based on the current location coordinates on the output feature map and the actual sampling offset, it determines the corresponding sampling coordinates on the input feature map. Since these coordinates are usually non-integer, the computer device uses a bilinear interpolation algorithm to calculate the feature value at that location as the sampling input, based on the feature values of the four nearest integer coordinates around that coordinate. Finally, it multiplies this sampling input value by the weight coefficient of the corresponding sampling point in the convolution kernel, and sums the products over all sampling points to obtain the final output feature value at that location.
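The steps above can be sketched for a single output location with NumPy. This is a simplified illustration under assumptions: a single-channel input, a 3x3 kernel, offsets supplied directly rather than predicted by an auxiliary convolutional layer, and zero padding outside the map. The function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D map at fractional (y, x); out-of-range neighbors count as zero."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feat[yy, xx]
    return value

def deform_conv_at(feat, weights, p, offsets):
    """Weighted sum over 9 sampling points, each shifted by a fixed grid
    offset plus a dynamic offset.

    weights: (9,) kernel weights; offsets: (9, 2) dynamic (dy, dx) offsets.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed offsets
    total = 0.0
    for k, (gy, gx) in enumerate(grid):
        sy = p[0] + gy + offsets[k, 0]   # adjusted sampling row
        sx = p[1] + gx + offsets[k, 1]   # adjusted sampling column
        total += weights[k] * bilinear_sample(feat, sy, sx)
    return total

feat = np.arange(25, dtype=float).reshape(5, 5)
weights = np.full(9, 1.0 / 9.0)
regular = deform_conv_at(feat, weights, (2, 2), np.zeros((9, 2)))
shifted = deform_conv_at(feat, weights, (2, 2), np.tile([0.0, 0.5], (9, 1)))
print(regular, shifted)
```

With all dynamic offsets set to zero, the computation reduces to an ordinary 3x3 convolution at that location; shifting every sampling point half a pixel to the right moves the sampled neighborhood accordingly.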

[0128] The following are embodiments of the apparatus described in this application, which can be used to execute the embodiments of the method described in this application. For details not disclosed in the apparatus embodiments of this application, please refer to the embodiments of the method described in this application.

[0129] Please refer to Figure 5, which shows a schematic diagram of a PCB surface defect detection device provided in an exemplary embodiment of this application, hereinafter referred to as device 5. Device 5 can be implemented as all or part of a computer device through software, hardware, or a combination of both. Device 5 includes: an acquisition module 501, used to acquire an image of the PCB surface to be inspected; an input module 502, used to input the PCB surface image into a pre-trained defect detection model, the defect detection model being a model obtained based on an improvement of the YOLOv5 architecture, the defect detection model comprising: a backbone network, used to extract features from an input PCB surface image, wherein the backbone network integrates a C3MobileViT module, and at least one C3 module in the backbone network is replaced with the C3MobileViT module to output the extracted features through the dual-branch structure of the C3MobileViT module; a feature pyramid network, connected to the backbone network, used to receive and fuse the extracted features output by the backbone network to generate multiple feature maps of different scales; a neck network, connected to the feature pyramid network and provided with a deformable convolutional receptive field module and a joint attention block in sequence, wherein the deformable convolutional receptive field module processes feature maps from multiple scales of the feature pyramid network, dynamically adjusting the sampling position of the convolution kernel using multi-scale deformable convolution to output an enhanced feature map, and the joint attention block processes the enhanced feature map from the deformable convolutional receptive field module to obtain an optimized feature map; and a detection head, connected to the neck network, used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image; and a determination module 503, used to determine the defect detection result of the PCB surface image based on the bounding box and the category information output by the defect detection model.

[0130] For other details regarding the implementation of the above technical solution by each module in the above PCB surface defect detection device, please refer to the description of the PCB surface defect detection method provided in the above invention embodiments, which will not be repeated here.

[0131] It should be noted that the device 5 provided in the above embodiments is only illustrated by the division of the above functional modules when performing the PCB surface defect detection method. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the above functions. In addition, the PCB surface defect detection device and the PCB surface defect detection method embodiments provided in the above embodiments belong to the same concept, and the implementation process is detailed in the method embodiments, which will not be repeated here.

[0132] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0133] Please refer to Figure 6, which is a schematic diagram of a computer storage medium provided in an embodiment of this application. The computer storage medium can store multiple instructions (i.e., the computer program shown in Figure 6), and the instructions are adapted to be loaded by a processor to execute the method steps of the embodiment shown in Figure 2. For the specific execution process, refer to the description of the embodiment shown in Figure 2, which will not be elaborated here.

[0134] This application also provides a computer program product that stores at least one instruction, which is loaded and executed by the processor to implement the PCB surface defect detection method as described in the above embodiments.

[0135] Please refer to Figure 7, which provides a schematic diagram of the structure of a computer device according to an embodiment of this application. As shown in Figure 7, the computer device 700 may include: at least one processor 701, at least one network interface 704, a user interface 703, a memory 705, and at least one communication bus 702.

[0136] The communication bus 702 is used to enable communication between these components.

[0137] The user interface 703 may include input units such as a mouse and keyboard.

[0138] The network interface 704 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).

[0139] The processor 701 may include one or more processing cores. The processor 701 connects to various parts of the computer device 700 using various interfaces and lines, and performs various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in memory 705, and by calling data stored in memory 705. Optionally, the processor 701 may be implemented using at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). The processor 701 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content to be displayed on the screen; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 701 and may be implemented as a separate chip.

[0140] The memory 705 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 705 may include a non-transitory computer-readable storage medium. The memory 705 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 705 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch function, sound playback function, image playback function, etc.), instructions for implementing the above-described method embodiments, etc.; the data storage area may store data involved in the above-described method embodiments, etc. Optionally, the memory 705 may also be at least one storage device located remotely from the aforementioned processor 701. As shown in Figure 7, the memory 705, which serves as a computer storage medium, may include an operating system, a network communication module, a user interface module, and application programs.

[0141] In the computer device 700 shown in Figure 7, the user interface 703 is mainly used to provide an input interface for the user and to obtain user input data, while the processor 701 can be used to call the application program stored in the memory 705 and specifically execute the method shown in Figure 2. For the specific execution process, refer to the description of Figure 2, which will not be elaborated further here.

[0142] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory, or random access memory, etc.

[0143] The above-disclosed embodiments are merely preferred embodiments of this application and should not be construed as limiting the scope of this application. Therefore, any equivalent variations made in accordance with the claims of this application shall still fall within the scope of this application.

Claims

1. A method for detecting defects on the surface of a PCB, characterized in that, The method includes: Acquire an image of the PCB surface to be inspected; The PCB surface image is input into a pre-trained defect detection model, which is a model obtained based on an improvement of the YOLOv5 architecture. The defect detection model includes: A backbone network is used to extract features from an input PCB surface image. The backbone network integrates a C3MobileViT module, and at least one C3 module in the backbone network is replaced with the C3MobileViT module to output the extracted features through the dual-branch structure of the C3MobileViT module. A feature pyramid network, connected to the backbone network, is used to receive and fuse the extracted features output by the backbone network to generate multiple feature maps of different scales. A neck network, connected to the feature pyramid network, is provided with a deformable convolutional receptive field module and a joint attention block in sequence. The deformable convolutional receptive field module processes feature maps from multiple scales of the feature pyramid network, dynamically adjusting the sampling position of the convolution kernel using multi-scale deformable convolution to output an enhanced feature map. The joint attention block processes the enhanced feature map from the deformable convolutional receptive field module to obtain an optimized feature map. The detection head, connected to the neck network, is used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image; Based on the bounding box and category information output by the defect detection model, the defect detection result of the PCB surface image is determined.

2. The PCB surface defect detection method according to claim 1, characterized in that, The C3MobileViT module outputs the extracted features through the following steps: The input feature maps are fed into the first branch and the second branch respectively; In the first branch, the input feature map is processed sequentially through the first convolutional block, the bottleneck block, and the MobileViT module to obtain the first feature map; In the second branch, the input feature map is processed by the second convolutional block to obtain the second feature map; The first feature map and the second feature map are concatenated to obtain a merged feature map; The merged feature map is input into the third convolutional block for processing to output the extracted features.

3. The PCB surface defect detection method according to claim 1 or 2, characterized in that, The deformable convolutional receptive field module outputs the enhanced feature map through the following steps: The input feature maps are sequentially processed by 1×1 convolution and standard convolution with kernel sizes of 1×1, 3×3 and 5×5 to obtain multiple intermediate feature maps. Each of the intermediate feature maps is input into a deformable convolutional layer for processing to obtain multiple deformable feature maps. The sampling position of the convolution kernel of the deformable convolutional layer is dynamically adjusted according to the input features. The multiple deformable feature maps are concatenated to obtain a concatenated feature map. The concatenated feature map is added to and fused with the input feature map to output the enhanced feature map.

4. The PCB surface defect detection method according to claim 3, characterized in that, The joint attention block processes the enhanced feature map through the following steps: Based on the spatial resolution, the enhanced feature map is divided into high-level feature maps and low-level feature maps; Applying a channel attention mechanism to the high-level feature map includes: upsampling the high-level feature map, then sequentially performing global average pooling, a first fully connected layer, a ReLU activation function, a second fully connected layer, and a Sigmoid function to generate channel attention weights, and then weighting the high-level feature map according to the channel attention weights to obtain a channel-enhanced feature map. Applying a coordinate attention mechanism to the low-level feature map includes: encoding the low-level feature map along the horizontal and vertical directions respectively to generate spatial attention weights, and weighting the low-level feature map according to the spatial attention weights to obtain a spatially enhanced feature map; The channel-enhanced feature map is concatenated with the spatial-enhanced feature map to output the optimized feature map.

5. The PCB surface defect detection method according to claim 1, characterized in that the four detection heads detect feature maps at four different scales; when the input size of the PCB surface image is 640×640 pixels, the feature map sizes corresponding to the four scales are 160×160, 80×80, 40×40 and 20×20, respectively.

6. The PCB surface defect detection method according to claim 5, characterized in that, before inputting the PCB surface image into the defect detection model, the method further includes: preprocessing the PCB surface image, the preprocessing including scaling the image to the fixed input size used during training of the defect detection model and normalizing it.
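
As a non-limiting illustration, the preprocessing might be sketched as below for a single-channel image stored as nested lists. The claim does not specify the interpolation method or the normalization scheme; nearest-neighbour scaling and division by 255 are assumptions chosen for brevity, the default size of 640 is taken from the example input size in claim 5, and all function names are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image given as a list of rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image, max_value=255.0):
    """Map pixel values from [0, max_value] to [0, 1] (assumed scheme)."""
    return [[px / max_value for px in row] for row in image]


def preprocess(image, size=640):
    """Scale to the fixed square input size, then normalize."""
    return normalize(resize_nearest(image, size, size))


tiny = [[0, 255], [128, 64]]
model_input = preprocess(tiny)          # 640 x 640 values in [0, 1]
print(len(model_input), len(model_input[0]))  # 640 640
```

A production pipeline would more likely use an image library with bilinear or letterbox scaling; the sketch only shows the order of the two steps.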

7. The PCB surface defect detection method according to claim 1, characterized in that the deformable convolution operation in the deformable convolutional layer is performed in the following manner: the output feature value $y(p)$ at position $p$ on the output feature map is calculated using the following formula:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k)$$

where $K$ represents the total number of sampling points in the convolution kernel; $w_k$ is the weight of the convolution kernel corresponding to the $k$-th sampling point; $p_k$ is the predefined offset of the $k$-th sampling point within the regular convolutional grid; $x(p + p_k + \Delta p_k)$ is the input feature value at the adjusted position $p + p_k + \Delta p_k$ on the input feature map; and $\Delta p_k$ is the learnable offset obtained through network learning, used to dynamically adjust the position of the $k$-th sampling point.
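
As a non-limiting illustration, the deformable convolution at a single output position can be sketched as a weighted sum over K sampling points, where each point reads the input at the output position plus its regular grid offset plus its learned offset. Because learned offsets are generally fractional, the sketch reads the input with bilinear interpolation and treats out-of-range positions as zero; both choices are assumptions not recited in the claim, as are the function names and the single-channel nested-list layout.

```python
import math


def bilinear_sample(x, r, c):
    """Read 2-D map `x` at fractional (row, col); outside the map counts as 0."""
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


def deformable_conv_at(x, p, weights, grid_offsets, learned_offsets):
    """Output value at position p = (row, col).

    weights[k]         : kernel weight of the k-th sampling point
    grid_offsets[k]    : predefined (d_row, d_col) within the regular grid
    learned_offsets[k] : learned (d_row, d_col), supplied here as plain input
    """
    total = 0.0
    for w_k, (gr, gc), (lr, lc) in zip(weights, grid_offsets, learned_offsets):
        total += w_k * bilinear_sample(x, p[0] + gr + lr, p[1] + gc + lc)
    return total


# Regular 3x3 grid: K = 9 sampling points around the centre.
GRID_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
# With all learned offsets at zero this reduces to an ordinary 3x3 convolution.
print(deformable_conv_at(x, (1, 1), [1.0] * 9, GRID_3X3, [(0.0, 0.0)] * 9))  # 45.0
```

In a trained network the learned offsets would be predicted from the input features by an additional convolution, which is what lets the sampling positions adapt to the defect shape.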

8. A PCB surface defect detection device, characterized in that the device includes: an acquisition module, used to acquire a PCB surface image to be inspected; an input module, used to input the PCB surface image into a pre-trained defect detection model, wherein the defect detection model is obtained by improving the YOLOv5 architecture and includes: a backbone network, used to extract features from the input PCB surface image, wherein the backbone network integrates a C3MobileViT module, at least one C3 module in the backbone network being replaced with the C3MobileViT module so that the extracted features are output through the dual-branch structure of the C3MobileViT module; a feature pyramid network, connected to the backbone network and used to receive and fuse the extracted features output by the backbone network to generate multiple feature maps of different scales; a neck network, connected to the feature pyramid network and provided with a deformable convolutional receptive field module and a joint attention block in sequence, wherein the deformable convolutional receptive field module processes the feature maps of multiple scales from the feature pyramid network, dynamically adjusting the sampling positions of the convolution kernel using multi-scale deformable convolution to output an enhanced feature map, and the joint attention block processes the enhanced feature map from the deformable convolutional receptive field module to obtain an optimized feature map; and a detection head, connected to the neck network and used to process the optimized feature map and output the bounding box and category information of defects in the PCB surface image; and a determination module, used to determine the defect detection result of the PCB surface image based on the bounding box and the category information output by the defect detection model.

9. A computer storage medium, characterized in that the computer storage medium stores multiple instructions, the instructions being adapted to be loaded by a processor to execute the steps of the PCB surface defect detection method according to any one of claims 1 to 7.

10. A computer device, characterized in that it includes: a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to execute the steps of the PCB surface defect detection method according to any one of claims 1 to 7.