A high-precision lightweight PCB fault detection method for improving product reliability

By constructing a heterogeneous knowledge distillation architecture and multi-scale feature extraction, and combining the spatial attention mask and dynamic weight matrix generated by the teacher network, the problem of insufficient accuracy of deep learning models in detecting minute defects in PCB inspection is solved, and a real-time detection effect with high accuracy and low computational cost is achieved.

CN122199480A · Pending · Publication Date: 2026-06-12 · BEIHANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2026-03-14
Publication Date: 2026-06-12

Application Information

IPC: G06T7/00; G06N3/0495; G06N3/045; G06N3/096; G06T7/73; G06V10/82; G06V10/764

AI Tagging

Application Domain

Image analysis; Biological models

AI Technical Summary

Technical Problem

Existing deep learning models struggle to accurately capture subtle geometric features in complex backgrounds during PCB inspection, resulting in insufficient accuracy in detecting minute defects. Furthermore, the computational load is high, making it difficult to meet the real-time requirements of industrial production lines.

Method used

By constructing a heterogeneous knowledge distillation architecture, the teacher network generates spatial attention masks and dynamic weight matrices. Combined with multi-scale feature extraction and lightweight student network training, the supervision information is enhanced, ensuring the preservation of high-resolution feature maps and suppression of background noise, thus achieving high-precision detection of minute defects.

Benefits of technology

While reducing computational load, it significantly improves the detection accuracy and robustness of minute defects, meets the real-time response requirements of industrial quality inspection, and reduces the risk of missed detection.

✦ Generated by Eureka AI based on patent content.

Figure CN122199480A_ABST

Abstract

The application discloses a high-precision light-weight PCB fault detection method for improving product reliability, relates to the technical field of computer vision and industrial automatic detection, and comprises the following steps: acquiring a PCB defect image and a label, and extracting a high-resolution first feature map and a low-resolution second feature map set; inputting the first feature map into a teacher network to generate an attention mask and a probability distribution, calculating spatial prediction uncertainty according to the attention mask and the probability distribution, and mapping the spatial prediction uncertainty into a dynamic weight matrix; constructing a light-weight student network, using the second feature map and the label as basic supervision, using the first feature map, the mask, the probability distribution and the dynamic weight matrix as enhanced supervision to perform heterogeneous knowledge distillation training; and optimizing the network through a loss function containing dynamic weight-based weighted calculation, and finally realizing high-precision detection of PCB faults by using the trained student network.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and industrial automation inspection technology, specifically to a high-precision, lightweight PCB fault detection method to improve product reliability.

Background Technology

[0002] With the rapid development of electronic information technology, printed circuit boards (PCBs) are evolving towards high integration and miniaturization. The reliability of their circuits and solder joints directly determines the quality foundation of the final product. In the field of industrial quality inspection, automated optical inspection technology based on deep learning has become the mainstream solution for surface defect detection. It usually relies on convolutional neural networks to extract image features through layer-by-layer downsampling, aiming to obtain high-level semantic information by expanding the receptive field, thereby achieving the classification and localization of targets.

[0003] However, because deep neural networks must undergo multiple pooling or stride convolutions to compress data dimensionality, the spatial resolution of feature maps decreases exponentially. This mechanism is effective for detecting large-scale conventional targets, but when applied to PCB inspection, target defects (such as hairline scratches and tiny solder joints) often have extremely small physical dimensions and highly irregular geometric shapes. As the network depth increases, the pixel information of these tiny defects is over-compressed or even completely lost in the deep feature map, causing the high-frequency edge features of the defects to be submerged in complex background noise such as substrate ink and silkscreen text. Attempting to forcibly recover the information loss by simply retaining high-resolution shallow features or deploying complex models with a large number of parameters will inevitably lead to a surge in computation, making it difficult to meet the stringent requirements of industrial production lines for millisecond-level real-time inference.

[0004] Therefore, how to solve the problem of attenuation and annihilation of small target signals during feature downsampling on edge devices with limited computing power, and break through the bottleneck that lightweight models cannot accurately capture subtle geometric features in complex backgrounds, is a core technical challenge that urgently needs to be solved in this field.

Summary of the Invention

[0005] To address the shortcomings of existing technologies, this invention provides a high-precision, lightweight PCB fault detection method that improves product reliability.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] In a first aspect, the present invention discloses a high-precision, lightweight PCB fault detection method for improving product reliability, comprising the following steps:

[0008] Obtain PCB defect sample images and their corresponding fault annotations;

[0009] Multi-scale feature extraction is performed on PCB defect sample images to obtain a first feature map and a second feature map set; the spatial resolution of the first feature map is higher than the spatial resolution of any feature map in the second feature map set.

[0010] The first feature map is input into the pre-constructed teacher network, and then the teacher spatial attention mask and teacher probability distribution map are generated by sequentially performing geometric deformation alignment, channel semantic filtering, and spatial feature focusing.

[0011] Spatial prediction uncertainty is calculated based on the teacher probability distribution map, and the spatial prediction uncertainty is mapped to a dynamic weight matrix with the same size as the first feature map; the region with higher spatial prediction uncertainty has a larger corresponding weight value in the dynamic weight matrix.

[0012] A lightweight student network is constructed, using the second feature map set and the real fault annotations as basic supervision information; the first feature map, teacher spatial attention mask, teacher probability distribution map and dynamic weight matrix are used to construct enhanced supervision information, and heterogeneous knowledge distillation training is performed on the student network.

[0013] The training loss function is configured as follows: it includes a first loss term for the first feature map and a second loss term for the second feature map set; the first loss term is based on a dynamic weight matrix and performs a weighted calculation on the differences in probability distribution between the student network and the teacher network.

[0014] Acquire the target PCB image to be detected, perform fault detection on the target PCB image based on the trained student network, and output the detection results including fault category and location coordinates.
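The entropy-to-weight step in the method above can be illustrated with a small sketch. It assumes Shannon entropy as the uncertainty measure and a linear mapping from normalized entropy to weight; the function names and the `base` and `gain` parameters are hypothetical, since the text only requires that regions with higher uncertainty receive larger weights.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one per-pixel class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dynamic_weight_matrix(prob_map, base=1.0, gain=1.0):
    """Map teacher entropy to a weight per pixel (higher entropy, larger weight).

    prob_map is an H x W grid of class-probability lists. `base` and `gain`
    are hypothetical parameters of an assumed linear mapping.
    """
    max_entropy = math.log(len(prob_map[0][0]))  # entropy of a uniform distribution
    return [
        [base + gain * entropy(probs) / max_entropy for probs in row]
        for row in prob_map
    ]


# Toy 2 x 2 teacher probability map with 3 classes per pixel.
teacher_probs = [
    [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]],
    [[0.90, 0.05, 0.05], [0.50, 0.50, 0.00]],
]
weights = dynamic_weight_matrix(teacher_probs)
```

In this toy map the nearly uniform pixel receives the largest weight and the confident pixel the smallest, which is the ordering the method requires.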

[0015] In a second aspect, the present invention discloses a high-precision, lightweight PCB fault detection system for improving product reliability, comprising:

[0016] The data acquisition module is used to acquire PCB defect sample images and their corresponding actual fault annotations;

[0017] The multi-scale feature extraction module is used to perform multi-scale feature extraction on PCB defect sample images to obtain a first feature map and a second feature map set; the spatial resolution of the first feature map is higher than the spatial resolution of any feature map in the second feature map set.

[0018] The teacher network processing module is used to input the first feature map into the pre-constructed teacher network, and generate the teacher spatial attention mask and teacher probability distribution map by sequentially performing geometric deformation alignment, channel semantic filtering and spatial feature focusing;

[0019] The dynamic weight generation module is used to calculate the spatial prediction uncertainty based on the teacher probability distribution map and map the spatial prediction uncertainty into a dynamic weight matrix with the same size as the first feature map; the region with higher spatial prediction uncertainty has a larger corresponding weight value in the dynamic weight matrix.

[0020] The heterogeneous knowledge distillation training module is used to construct a lightweight student network. It uses the second feature map set and the real fault annotations as basic supervision information, and simultaneously uses the first feature map, teacher spatial attention mask, teacher probability distribution map and dynamic weight matrix to construct enhanced supervision information to train the student network. The training loss function is configured to include a first loss term for the first feature map and a second loss term for the second feature map set. The first loss term is based on the dynamic weight matrix and performs a weighted calculation on the difference in probability distribution between the student network and the teacher network.

[0021] The fault detection module is used to acquire the target PCB image to be detected, and to perform fault detection on the target PCB image based on the trained student network, and output the detection results including the fault category and location coordinates.
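The first loss term described for the training module can be sketched as a per-pixel divergence scaled by the dynamic weight matrix. This is a minimal illustration, not the patented loss: it assumes KL divergence as the measure of difference between the teacher and student distributions and assumes normalization by the sum of weights; all function names are hypothetical.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for one pixel's class distributions."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0.0
    )


def weighted_distillation_loss(teacher_map, student_map, weights):
    """Weighted mean of per-pixel KL terms (assumed form of the first loss term)."""
    total = 0.0
    weight_sum = 0.0
    for t_row, s_row, w_row in zip(teacher_map, student_map, weights):
        for t, s, w in zip(t_row, s_row, w_row):
            total += w * kl_divergence(t, s)
            weight_sum += w
    return total / weight_sum


# One row of two pixels: the student disagrees with the teacher only on the first.
teacher_map = [[[0.5, 0.5], [0.9, 0.1]]]
student_map = [[[0.9, 0.1], [0.9, 0.1]]]

uniform_loss = weighted_distillation_loss(teacher_map, student_map, [[1.0, 1.0]])
focused_loss = weighted_distillation_loss(teacher_map, student_map, [[3.0, 1.0]])
```

Raising the weight of the mismatched pixel raises the loss, so gradients concentrate on the locations where the teacher is uncertain instead of on easy background pixels.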

[0022] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0023] 1. In the feature extraction stage, the first feature map with high spatial resolution (P2 layer) is preserved, and combined with the geometric deformation alignment in the teacher network, the spatial offset predicted by the bypass convolution is used to drive the convolution kernel to adaptively fit irregular small defects (such as hairline scratches and small cold solder joints). This mechanism effectively solves the problem of weak pixel information loss caused by continuous pooling in traditional deep networks, and ensures that complete geometric texture features can still be extracted on faulty targets that only account for a very small proportion of the image, thereby greatly reducing the risk of missed detection in industrial quality inspection.

[0024] 2. By using channel semantic filtering and spatial feature focusing in the teacher network, the feature map is cleaned and enhanced in multiple dimensions. The system can automatically identify and suppress the channel response corresponding to high-frequency background noise such as silkscreen text and solder resist ink on the PCB board. At the same time, it uses spatial attention masking to force the model to focus on potential defect areas. This denoising and focusing mechanism significantly improves the signal-to-noise ratio of the features, enabling the detection algorithm to maintain a high level of confidence in fault discrimination when facing complex textured circuit board backgrounds.

[0025] 3. By calculating the information entropy of the probability distribution of the teacher network, the high uncertainty of the prediction (usually corresponding to defect edges or fuzzy difficult examples) is mapped to a high-weight dynamic matrix. During distillation training, this matrix forces the student network to focus on and imitate the teacher's judgment logic on these difficult points, rather than simply fitting a large number of simple background samples. This mechanism effectively prevents gradient vanishing or being dominated by simple samples, and significantly improves the recognition accuracy of the student network for difficult defects such as blurred edges and strange shapes.

[0026] 4. A heterogeneous knowledge distillation architecture was constructed, which uses a complex teacher network to extract high-order implicit knowledge (such as attention topology and entropy weight features) and transfers it losslessly to a simplified and lightweight student network by enhancing supervision information. In the inference stage, the student network removes additional overhead such as auxiliary mapping layers and retains only the backbone and detection head, thereby replicating the high-precision detection capability at the teacher level while maintaining extremely low computational and memory usage, meeting the real-time response requirements of industrial production lines.

Attached Figure Description

[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0028] Figure 1 is an overall block diagram of the method in Embodiment 1 of the present invention;

[0029] Figure 2 is an overall execution flowchart of the method in Embodiment 1 of the present invention;

[0030] Figure 3 is an overall block diagram of the system in Embodiment 2 of the present invention.

Detailed Implementation

[0031] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0032] Application Overview: In the field of precision quality inspection in modern electronic manufacturing, especially in the surface defect inspection of highly integrated printed circuit boards (PCBs), the completeness and discriminative power of multi-scale features are regarded as the core indicators for measuring the reliability of the inspection system. The generation of such high-quality inspection results is essentially a process of effective feature extraction and non-destructive mapping at the deep learning level. That is, through the hierarchical structure of convolutional neural networks, while filtering out complex background noise such as substrate ink and silkscreen text, the geometric texture features of small defects (such as cold solder joints and scratches) are accurately transmitted to the decision layer, thereby retaining activation responses with specific semantic orientation in the feature map.

[0033] However, existing technologies lack a sophisticated mechanism to reconcile the contradiction between the computational constraints of lightweight models and the preservation of features of small targets. This results in an inability to effectively address the feature annihilation and attention drift issues that occur in student networks during large-scale channel pruning and downsampling. Feature annihilation manifests as the lightweight network being able to mimic the large-scale structure recognition of the teacher network, but losing high-frequency edge information when processing pixel-level defects due to the conflict between the receptive field and resolution. Attention drift manifests as the student network, lacking dynamic weight guidance, focusing excessively on the distribution fitting of simple background samples, leading to the dilution of gradient contributions for difficult samples. Consequently, a strict topological correspondence cannot be established between the high-order implicit knowledge of the teacher network and the feature extraction capabilities of the student network, causing the system to miss weak fault signals or have insufficient confidence, thus affecting the final detection rate of edge detection equipment.

[0034] For example, in real-time inspection scenarios on industrial production lines, when a lightweight model deployed on an edge device attempts to learn to identify hairline scratches through knowledge distillation, traditional teaching systems, through conventional soft-label supervision, can only constrain the student network to fit the overall probability distribution of the teacher network, but cannot distinguish whether this fit stems from a true understanding of the defect features or from mechanical memorization of large-area background pixels. Furthermore, when faced with the complex texture of PCB solder mask boundaries, the system only records the numerical approximation of the student network in terms of classification probability, failing to detect the high spatial prediction uncertainty (entropy value) caused by feature ambiguity in the teacher network at this point, nor can it transform this uncertainty into a stronger supervisory signal. Specifically, the system misjudges the student network's successful fitting of the background as mastery of detection skills, or incorrectly classifies the loss of features of minor defects as normal prediction fluctuations. This leads to the student network continuously solidifying the erroneous attention pattern of emphasizing the background and neglecting defects, failing to form a dynamic attention mechanism that meets the needs of small target detection.

[0035] If the above problems are not addressed, the lightweight detection system will continue to lose its ability to detect minute faults. Specifically, the failure to suppress feature annihilation will cause the model to completely lose the geometric index of minute targets in the deep network, causing the detection logic to deviate from the principle of high precision, thereby weakening the ability to intercept early reliability risks. At the same time, the failure to correct the attention drift problem will result in the ineffective allocation of computing resources, making it impossible for limited computing power to focus on difficult areas with high entropy values, ultimately causing the detection results to lose their due robustness.

[0036] Embodiment 1:

[0037] As shown in Figures 1-2, a high-precision, lightweight PCB fault detection method for improving product reliability includes the following steps:

[0038] Step S1: Obtain PCB defect sample images and their corresponding actual fault annotations;

[0039] Specifically, the system first uses an industrial-grade high-resolution line scan camera to scan PCB boards on the production line and acquire data in real time. Because PCB defects (such as tiny solder joints or hairline scratches) often occupy only a very small fraction of the pixels in the image, this embodiment constructs a dedicated defect sample database. This database is not simple image storage; it adopts an index structure based on multidimensional metadata, containing the original image data, ambient lighting parameters, and the corresponding real fault annotations. The real fault annotations are calibrated using a human expert system and stored in XML or TXT format, precisely defining the bounding box coordinates and fault category of each defect, where the bounding box is expressed as the normalized coordinates of its center point together with its normalized width and height. To address the overfitting caused by the scarcity of samples with minor defects, the system applies Mosaic data augmentation before the images are input to the network: four different local PCB images are stitched into a single image through random scaling, cropping, and arrangement to form a standard input tensor. This enriches the semantic complexity of the image background and forces the model to learn more robust features.
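As an illustration of the normalized annotation format, the sketch below converts one TXT label line into pixel corner coordinates. The field order `class cx cy w h` follows the common YOLO text-label convention and is an assumption here, as are the function name and the 640 x 640 image size used in the example.

```python
def parse_yolo_label(line, image_width, image_height):
    """Convert 'class cx cy w h' (normalized) into a class id and pixel corners.

    The field order is the usual YOLO TXT convention, assumed for illustration.
    """
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * image_width, float(w) * image_width
    cy, h = float(cy) * image_height, float(h) * image_height
    x_min, y_min = cx - w / 2.0, cy - h / 2.0
    x_max, y_max = cx + w / 2.0, cy + h / 2.0
    return int(class_id), (x_min, y_min, x_max, y_max)


# Hypothetical label: class 2, centered in the image, 10% wide and 20% tall.
label_class, label_box = parse_yolo_label("2 0.5 0.5 0.1 0.2", 640, 640)
```

Because the coordinates are normalized, the same label stays valid after the image is rescaled, which is convenient when augmentation resizes and crops the source images.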

[0040] It is worth noting that before performing the subsequent feature extraction and distillation steps, this embodiment pre-constructs heterogeneous teacher and student network architectures, which is a core prerequisite for balancing high accuracy with a lightweight design. For the teacher network, this embodiment does not use a general model, but instead performs a deeply customized topology reconstruction based on the YOLOv11 architecture. Specifically, the backbone of the teacher network is configured as an improved version based on CSPDarknet, with its core component replaced by the C3k2 module (Cross Stage Partial with Kernel 3). Compared to the traditional C2f module, the C3k2 module optimizes the propagation efficiency of gradient flow in deep networks by introducing convolutional branches with different kernel sizes and residual connections, giving the model stronger feature representation capabilities when extracting complex PCB circuit textures. In addition, at the end of the backbone network (i.e., the P5 layer with 32x downsampling), the system embeds the C2PSA (Cross Stage Partial with Spatial Attention) module, which uses a multi-head attention mechanism to capture the global spatial context information of large-scale defects such as PCB board fractures and missing components.

[0041] More importantly, to address the feature loss that minor defects suffer during downsampling in existing technologies, the teacher network architecture in this embodiment departs from the conventional YOLO series practice of using only the P3 to P5 layers (i.e., 8x to 32x downsampling) for detection. After the backbone network has processed the image through the Stem layer and the shallow C3k2 module, at the node where the image resolution has been downsampled 4x (the size becomes 1/4 of the original image), the system creates a dedicated shallow feature branch, defined as the P2 layer. This P2 layer feature map preserves the richest edge textures and minute geometric contours of the original image.

[0042] Meanwhile, for the student network, this embodiment configures its architecture as a simplified mapping of the teacher network. To adapt to the computing power limitations of edge computing devices, the number of channels in the backbone layer of the student network is pruned (e.g., the channel scaling factor is set to 0.5), but the P2 layer detector head, consistent with the topology of the teacher network, is strictly retained. This heterogeneous design with aligned structures and asymmetric parameters ensures that the student network has the physical ability to receive higher-order knowledge from the teacher network (such as attention masks and entropy weight features). Through the above data preparation and initial construction of the network topology, the system completes the mapping preparation from the original signals of the physical world to the high-dimensional tensor space of deep learning, laying an analytical foundation for subsequent multi-scale feature extraction and heterogeneous distillation.

[0043] Step S2: Perform multi-scale feature extraction on the PCB defect sample image to obtain a first feature map and a second feature map set; the spatial resolution of the first feature map is higher than the spatial resolution of any feature map in the second feature map set;

[0044] In the specific implementation of this embodiment, step S2 maps the high-dimensional image pixel matrix into a set of feature tensors containing rich semantic and geometric information. This step is not a single convolution operation, but consists of two cascaded sub-processes: backbone network feature encoding and neck network feature fusion. It aims to resolve the contradiction in spatial resolution between small defects (such as objects of interest in the P2 layer) and large-scale structural defects (such as objects of interest in the P3-P5 layers) in PCB inspection.

[0045] Specifically, this embodiment first constructs a backbone network based on an improved CSPDarknet architecture. This backbone network is configured to consist of several C3k2 modules (Cross Stage Partial with Kernel 3) stacked in series.

[0046] To ensure efficient feature extraction by the C3k2 module, this embodiment configures its internal parameters. The C3k2 module contains several cascaded Bottleneck units (preferably 3 or 6 depending on the layer depth), and the internal expansion ratio is set to 0.5. Both branch paths within the module employ standard convolution kernels. Compared with YOLOv8's C2f module, C3k2 introduces a variable kernel size in some of the Bottleneck units (supporting switching between kernel sizes), which enhances the flexibility of the receptive field.

[0047] For the C2PSA module embedded at the end of the P5 layer, to balance computational complexity and global awareness capability, this embodiment sets the number of heads in its multi-head attention mechanism to 4, with each head having an embedding dimension of 64, for a total embedding dimension of 256. Simultaneously, to reduce the computational complexity of full-image self-attention, a windowed attention mechanism is employed, with the window size set to... .

[0048] Unlike traditional residual modules, the C3k2 module introduces a cross-stage partial connection structure, which splits the input feature map along the channel dimension. One part undergoes a non-linear transformation through dense convolutional layers, while the other part is passed directly through a residual edge. Finally, the two parts are concatenated at the output. This design ensures lossless propagation of gradient flow in deep networks. Assume the input feature map is $X$. The core convolution operation within the C3k2 module follows the formula below:

[0049] $Y = \sigma(\mathrm{BN}(W * X + b))$;

[0050] where $Y$ is the output feature map, $*$ represents the convolution operation, $W$ and $b$ are the kernel weights and biases, respectively, $\mathrm{BN}$ is the batch normalization process, and $\sigma$ is the SiLU activation function.
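The convolution, batch normalization, and SiLU sequence described above can be traced on a one-dimensional toy signal. This is a sketch under simplifying assumptions: a single channel, a single sample (so the normalization statistics are taken over spatial positions only), and a hand-picked kernel; the function names are hypothetical.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def conv1d(signal, kernel, bias):
    """Valid 1D convolution in the deep-learning sense (cross-correlation)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def conv_bn_silu(signal, kernel, bias):
    """Apply convolution, then batch normalization, then SiLU."""
    return [silu(v) for v in batch_norm(conv1d(signal, kernel, bias))]


# A bump in the signal, filtered with an edge-like kernel.
activated = conv_bn_silu([0.0, 1.0, 4.0, 1.0, 0.0], [1.0, 0.0, -1.0], 0.0)
```

The rising edge of the bump yields a negative response that SiLU damps, while the falling edge yields a positive response that passes almost unchanged.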

[0051] Based on the aforementioned backbone network, this embodiment performs the following hierarchical extraction process:

[0052] Step S201: Specific extraction of the first feature map (P2 layer):

[0053] The system inputs the preprocessed image into the backbone network. The image first passes through the Stem layer (consisting of two convolutional layers with a stride of 2) for initial downsampling and then enters the first C3k2 feature extraction stage. When the network reaches the 4x downsampling node, the feature map is 1/4 of the input size in each spatial dimension. At this point, instead of continuing to downsample without producing an output as in the traditional YOLO model, the system forcibly extracts a feature branch and directly defines the tensor output by that node as the first feature map.

[0054] During this process, the spatial resolution of the feature map remains at a high level, so tiny solder joints or burrs that span only a few pixels still retain an effective response region in the feature map, which prevents their information from being annihilated in subsequent downsampling. For example, if there is a scratch with a width of 4 pixels in the input image, its width in the P2 layer (4x downsampling) is about 1 pixel, which can still be captured by the convolution kernel; in the P3 layer (80x80, 8x downsampling), however, its width falls to about 0.5 pixels, causing the feature to disappear.
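The scratch-width arithmetic in this example can be checked with a few lines of code. The 640-pixel input side is an assumption inferred from the 80x80 size quoted for the 8x layer; the function names are hypothetical.

```python
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
INPUT_SIDE = 640  # assumption: implied by an 80x80 map at 8x downsampling


def feature_map_side(input_side, stride):
    """Side length of the feature map at a given downsampling stride."""
    return input_side // stride


def width_on_feature_map(width_px, stride):
    """Width of an image-space structure measured in feature-map cells."""
    return width_px / stride


for name, stride in STRIDES.items():
    side = feature_map_side(INPUT_SIDE, stride)
    width = width_on_feature_map(4, stride)
    print(f"{name}: {side}x{side} map, 4 px scratch spans {width} cells")
```

A structure that spans a full cell at 4x downsampling already covers only half a cell at 8x, which is why the shallow branch is kept for the smallest defects.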

[0055] Step S202: Extraction and fusion of the second feature map set (P3-P5 layers):

[0056] Subsequently, the backbone network continues to perform downsampling, passing sequentially through convolutional layers with a stride of 2 and extracting preliminary features at the 8x, 16x, and 32x downsampling nodes. It is worth noting that at the end of the P5 layer with 32x downsampling, this embodiment embeds a C2PSA (Cross Stage Partial with Spatial Attention) module. This module introduces a multi-head self-attention mechanism to calculate the correlation strength matrix between global spatial locations. The formula is expressed as: $A = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$;

[0057] where $Q$ and $K$ are the query and key vectors, and $\sqrt{d_k}$ is the scaling factor. The C2PSA module enhances the model's ability to detect global defects such as overall breakage of PCB materials or large-area missing components.
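A minimal sketch of the correlation strength matrix follows, assuming standard scaled dot-product attention with a row-wise softmax for a single head. The toy vectors have dimension 2 for readability, whereas the embodiment uses 4 heads with a per-head dimension of 64; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Row-normalized correlation between every query and every key."""
    scale = math.sqrt(len(keys[0]))  # square root of the key dimension
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Two spatial locations with orthogonal embeddings (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0]]
attn = attention_matrix(tokens, tokens)
```

Each row sums to 1 and places most of its mass on the location whose key best matches the query; dividing by the scaling factor keeps the scores from saturating the softmax as the embedding dimension grows.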

[0058] Subsequently, the aforementioned 8x (P3), 16x (P4), and 32x (P5) feature maps were input into the feature fusion network (Neck). This network adopts the PANet (Path Aggregation Network) structure, which includes a top-down upsampling path and a bottom-up downsampling path.

[0059] Top-down approach: Upsample the features of layer P5 and concatenate them with the features of layer P4, then upsample them again and concatenate them with the features of layer P3, thereby transmitting deep semantic information to the shallow layer.

[0060] Bottom-up path: The fused features of layer P3 are downsampled and fused with layer P4, then downsampled and fused with layer P5, thereby transmitting shallow positioning information to deeper layers.

[0061] After the above fusion process, the network outputs fused feature maps at three scales, corresponding to the P3, P4, and P5 layers. These three feature maps together constitute the second feature map set.
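The two fusion paths can be traced on toy feature maps to see how the sizes line up. This sketch makes several assumptions: nearest-neighbour upsampling, 2x2 max pooling as a stand-in for the stride-2 downsampling, plain concatenation without the convolution blocks a real neck places after each fusion, and 8/4/2-cell maps standing in for the real P3/P4/P5 sizes. All names are hypothetical.

```python
def make_map(side, value):
    """One-channel feature map: a list containing a single side x side grid."""
    return [[[value] * side for _ in range(side)]]


def upsample2x(channels):
    """Nearest-neighbour 2x upsampling of every channel."""
    return [
        [[v for v in row for _ in range(2)] for row in ch for _ in range(2)]
        for ch in channels
    ]


def downsample2x(channels):
    """2x2 max pooling, a stand-in for the stride-2 convolution in the text."""
    return [
        [
            [max(ch[y][x], ch[y][x + 1], ch[y + 1][x], ch[y + 1][x + 1])
             for x in range(0, len(ch[0]), 2)]
            for y in range(0, len(ch), 2)
        ]
        for ch in channels
    ]


def concat(a, b):
    """Channel-wise concatenation; spatial sizes must match."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def panet(p3, p4, p5):
    td4 = concat(upsample2x(p5), p4)          # top-down: P5 -> P4
    td3 = concat(upsample2x(td4), p3)         # top-down: P4 -> P3
    out3 = td3
    out4 = concat(downsample2x(out3), td4)    # bottom-up: P3 -> P4
    out5 = concat(downsample2x(out4), p5)     # bottom-up: P4 -> P5
    return out3, out4, out5


out3, out4, out5 = panet(make_map(8, 3.0), make_map(4, 4.0), make_map(2, 5.0))
```

Each output keeps the spatial size of its level while accumulating channels from the other levels, which is how deep semantic information and shallow positioning information end up in the same tensor.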

[0062] In summary, through the processing in step S2, the system outputs two sets of feature data: one is the first feature map, which has not been fused by PANet and maintains extremely high spatial resolution; the other is the second feature map set, which has undergone deep feature fusion and provides basic supervision for routine inspection tasks. Together they achieve comprehensive coverage of PCB micro-defects and routine defects at the data level.

[0063] Step S3: Input the first feature map into the pre-constructed teacher network, and generate the teacher spatial attention mask and teacher probability distribution map by sequentially performing geometric deformation alignment, channel semantic filtering, and spatial feature focusing;

[0064] Specifically, the system first inputs the first feature map, of size $C \times H \times W$, into the geometric deformation alignment unit, where $C$ represents the number of channels of the first feature map (channel depth). In the preferred YOLOv11 teacher network configuration of this embodiment, considering the need to preserve rich micro-texture features, $C$ is set to 128 (corresponding to the number of channels in the P2 layer under the standard width factor). Because scratches on the PCB surface often follow irregular curves or oblique directions, traditional fixed-grid convolutional kernels cannot extract their complete features. Therefore, this embodiment first applies a bypass convolutional layer (Offset Conv) to the first feature map to learn and output a spatial offset field with the same spatial size as the input. To ensure that the offset field can accurately drive the subsequent deformable convolution, the bypass convolutional layer is configured as a standard convolutional layer with a kernel size of $3 \times 3$, a stride of 1, and a padding of 1, which keeps the input and output spatial dimensions equal. Most importantly, the number of output channels of this layer is strictly defined as $2 \times K \times K$, where $K$ is the size of the deformable convolutional kernel. In this embodiment, $K = 3$, so the number of output channels of the bypass convolutional layer is 18. These 18 channels correspond to the independent offset scalars, along the x and y axes, of the 9 sampling points within the $3 \times 3$ receptive field. During model initialization, the weights of this layer are preferably initialized to 0, so that the initial offset is 0. This ensures that the deformable convolution degenerates into a standard convolution in the early stages of training and gradually learns nonlinear offsets that follow the defect morphology as training progresses.
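The two size constraints on the bypass convolutional layer can be verified with the standard convolution output-size formula. The helper names are hypothetical, and the 160-pixel side is only an example value.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def offset_channels(kernel_size):
    """One (x, y) offset pair for each of the K * K sampling points."""
    return 2 * kernel_size * kernel_size


same_size = conv_output_size(160, kernel=3, stride=1, padding=1)
channels = offset_channels(3)
```

With a stride of 1 and a padding of 1, a kernel of size 3 returns the input size unchanged, so every input pixel receives its own set of 18 offset values.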

[0065] This spatial offset field $\Delta p$ contains displacement parameters for each pixel along the x and y axes. Based on the spatial offset field $\Delta p$, the system then performs deformable convolution according to the formula: $F_{geo}(p_0) = \sum_{p_n \in R} w(p_n) \cdot F_1(p_0 + p_n + \Delta p_n)$;

[0066] where $R$ is the sampling grid of a regular convolution kernel (e.g., $(-1,-1)$ to $(1,1)$), $p_0$ is the center position of the pixel currently being computed, $p_n$ is a relative position within the grid, $w(p_n)$ is the kernel weight at that position, $F_1$ is the input feature map, $F_{geo}$ is the output feature map, and $\Delta p_n$ is the non-integer offset value extracted from the spatial offset field $\Delta p$. Pixel values at the offset positions are obtained through bilinear interpolation, allowing the effective receptive field of the convolutional kernel to deform adaptively and fit the geometric edges of irregular defects closely, thereby generating a geometrically aligned feature map. For example, for a hairline scratch at a 45-degree angle, the sampling points shift diagonally so that the convolution kernel captures continuous scratch features rather than background noise.
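The sampling rule can be sketched for a single channel and a single output position as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the names `bilinear`, `GRID`, and `deform_conv_at` are hypothetical, the feature map is a plain nested list, and samples that fall outside the map are read as 0.

```python
import math


def bilinear(img, y, x):
    """Sample a 2-D list at fractional (y, x); out-of-range points read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy][xx]
    return value


# Regular 3 x 3 sampling grid R, from (-1, -1) to (1, 1).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def deform_conv_at(img, weights, offsets, cy, cx):
    """Output at centre (cy, cx): sum over R of w(p_n) * img(p_0 + p_n + dp_n)."""
    total = 0.0
    for (dy, dx), w, (oy, ox) in zip(GRID, weights, offsets):
        total += w * bilinear(img, cy + dy + oy, cx + dx + ox)
    return total
```

With all nine offsets equal to (0, 0), `deform_conv_at` reduces to an ordinary 3 × 3 convolution at that position, which matches the zero-initialized behavior described for the bypass layer.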

[0067] After geometric alignment is completed, the data stream enters the channel semantic filtering unit. To address the severe high-frequency background noise caused by silkscreen text, solder resist ink, and similar structures in PCB images, the system applies global max pooling (GMP) and global average pooling (GAP) to the geometrically aligned feature map in parallel, extracting the extreme response and the background mean of each channel, respectively. These two one-dimensional vectors are concatenated and input into a multilayer perceptron (MLP) with shared weights. This MLP contains a dimensionality reduction layer and a dimensionality expansion layer: a first fully connected layer compresses the channel dimension to $1/r$ of its original value (the scaling ratio $r$ is preferably 16 in this embodiment), and, after ReLU activation, a second fully connected layer restores the original number of channels. By learning the importance of each channel for defect representation, the MLP outputs a normalized channel weight vector.

[0068] The system then performs a channel-by-channel weighted calculation, multiplying each channel of the geometrically aligned feature map by its channel weight. By suppressing channels that respond strongly to background noise (such as the green ink channel) and amplifying channels that respond strongly to metallic luster or cracks, this step generates the semantically enhanced feature map.
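The channel semantic filtering step can be sketched as below. Several points are assumptions made only for illustration: the function names `channel_weights` and `reweight` are hypothetical, biases are omitted, and "shared weights" is read here as applying the same two-layer MLP to each pooled vector and summing the two outputs before a sigmoid. The description does not fix this detail, so other readings are possible.

```python
import math


def channel_weights(feat, w1, w2):
    """feat: C x H x W nested lists; w1: (C // r) x C; w2: C x (C // r)."""
    gmp = [max(max(row) for row in ch) for ch in feat]  # extreme response
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]  # mean

    def mlp(vec):
        # Reduction layer with ReLU, then expansion back to C channels.
        hidden = [max(0.0, sum(a * b for a, b in zip(row, vec))) for row in w1]
        return [sum(a * b for a, b in zip(row, hidden)) for row in w2]

    summed = [a + b for a, b in zip(mlp(gmp), mlp(gap))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]  # weights in (0, 1)


def reweight(feat, weights):
    """Multiply every channel of feat by its scalar channel weight."""
    return [[[weights[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feat)]
```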

[0069] Furthermore, in order to accurately locate the core region of the defect spatially, the system performs spatial feature focusing on the semantically enhanced feature map. Specifically, max pooling and average pooling are performed on the semantically enhanced feature map along the channel dimension, generating two two-dimensional feature descriptors of size $1 \times H \times W$. These two descriptors are then concatenated along the channel dimension to obtain an intermediate feature map of size $2 \times H \times W$. Finally, a standard convolutional layer with a kernel size of $7 \times 7$, a stride of 1, and a padding of 3 compresses the number of channels from 2 to 1, and a sigmoid activation function then generates the teacher spatial attention mask. In this mask, the closer a pixel value is to 1, the higher the probability that a real defect exists at that location; conversely, a value closer to 0 represents the background.
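A minimal sketch of the spatial feature focusing step follows, assuming a 7 × 7 kernel (the size that preserves resolution with stride 1 and padding 3), zero padding, and nested lists as tensors. The function name `spatial_mask` and the argument layout are hypothetical.

```python
import math


def spatial_mask(feat, kernel, bias=0.0):
    """feat: C x H x W; kernel: 2 x 7 x 7 weights. Returns an H x W mask in (0, 1)."""
    n, h, w = len(feat), len(feat[0]), len(feat[0][0])
    cmax = [[max(ch[y][x] for ch in feat) for x in range(w)] for y in range(h)]
    cavg = [[sum(ch[y][x] for ch in feat) / n for x in range(w)] for y in range(h)]
    planes = (cmax, cavg)  # the 2 x H x W intermediate feature map
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = bias
            for p in range(2):
                for ky in range(7):
                    for kx in range(7):
                        yy, xx = y + ky - 3, x + kx - 3  # padding 3, stride 1
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernel[p][ky][kx] * planes[p][yy][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid
        mask.append(row)
    return mask
```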

[0070] Finally, the system performs feature fusion and probability mapping. The generated mask is multiplied element-wise with the semantically enhanced feature map to obtain the final enhanced feature tensor. This tensor is then fed into the detection head of the teacher network, which consists of convolutional layers responsible for mapping the number of feature channels to the total number of fault categories N. The output is activated by either a sigmoid or a softmax function to generate the teacher probability distribution map. Each element of this distribution map quantifies the confidence that a fault of the k-th type exists at coordinates $(i, j)$, providing direct data support for the subsequent calculation of spatial prediction uncertainty.

[0071] Step S4: Calculate the spatial prediction uncertainty based on the teacher probability distribution map, and map the spatial prediction uncertainty to a dynamic weight matrix with the same size as the first feature map; the region with higher spatial prediction uncertainty has a larger corresponding weight value in the dynamic weight matrix;

[0072] In this specific implementation, step S4 aims to transform the probability distribution output by the teacher network at layer P2 into quantifiable supervisory signal weights. Essentially, this step establishes a perception mechanism that dynamically adjusts the gradient contribution weights of different spatial regions during subsequent distillation by quantifying the spatial prediction uncertainty of the teacher model for the current pixel classification. This process includes two steps: pixel-level information entropy calculation and nonlinear weight mapping.

[0073] Specifically, the system first receives the teacher probability distribution map $P_T$ generated in step S3, whose data dimensions are $N \times H \times W$, where N represents the total number of predefined fault categories (including the background class). The system performs a pixel-level traversal: for each spatial coordinate point $(i, j)$ on the feature map, it extracts the corresponding class probability vector $[p_1(i,j), \dots, p_N(i,j)]$. To quantify the spatial prediction uncertainty at this location, this embodiment uses Shannon entropy as the metric. Compared with statistics such as variance or range, information entropy captures the flatness of the probability distribution more sensitively. The system calculates the information entropy value of this coordinate point according to the following formula: $E(i,j) = -\sum_{k=1}^{N} p_k(i,j) \log p_k(i,j)$. In this calculation, when a pixel is located in a clear background area, the teacher network typically gives a high-confidence prediction (e.g., ...), and the calculated entropy value approaches 0; however, when the pixel is located at the blurred edge of a minor defect or at a complex texture boundary, the teacher network's prediction often exhibits a multi-peak distribution (e.g., ...), and the entropy value increases significantly. The system recombines the calculation results of all coordinate points to generate a single-channel spatial entropy map $E$.
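The pixel-level entropy calculation can be illustrated as follows. The function name `entropy_map` is hypothetical, the natural logarithm is an assumption (the base is not stated here), and terms with zero probability are skipped because their contribution to Shannon entropy is 0.

```python
import math


def entropy_map(prob):
    """prob: N x H x W class probabilities. Returns the H x W Shannon entropy map."""
    n, h, w = len(prob), len(prob[0]), len(prob[0][0])
    return [[-sum(prob[k][y][x] * math.log(prob[k][y][x])
                  for k in range(n) if prob[k][y][x] > 0.0)
             for x in range(w)]
            for y in range(h)]
```

A one-hot prediction yields an entropy of 0, while a uniform prediction over N classes yields the maximum value log N.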

[0074] Subsequently, to turn this uncertainty into the focus of the loss function during training, the system constructs a dynamic weight generation module. This module does not use the raw entropy value directly; instead, it converts the spatial entropy map into a dynamic weight matrix through a nonlinear mapping function. The mapping assigns extra weight to regions of high uncertainty, forcing the student network to invest more fitting capacity in these areas. This embodiment uses the following formula for mapping: $\Omega(i,j) = 1 + \alpha \cdot E(i,j)^2$;

[0075] where $\Omega(i,j)$ is the weight value at coordinates $(i,j)$ and $E(i,j)$ is the information entropy value at that location. The constant 1 ensures basic gradient backpropagation and prevents gradient vanishing in low-entropy regions (simple samples). The squaring operation further widens the difference between high-entropy and low-entropy regions, thereby suppressing low-amplitude noise fluctuations. In the preferred configuration of this embodiment, the preset adjustment coefficient $\alpha$ is set between 2.0 and 5.0 to control the magnification of difficult areas.

[0076] For example, suppose that one coordinate lies in the background area and its calculated entropy value is 0.1. With the adjustment coefficient chosen for this example, the weight of that point remains approximately 1, that is, the standard weight is maintained. A second coordinate lies on the fuzzy edge of a cold solder joint defect, with an entropy value of 1.2, and its weight is correspondingly much larger. This means that in subsequent loss calculations, the error in the defect edge region is amplified by more than 6 times, guiding the student network to prioritize optimizing the parameters for this region during backpropagation and significantly improving the model's ability to capture small, blurry defects. The final output dynamic weight matrix serves as key supervisory information and participates directly in the heterogeneous knowledge distillation in step S5.
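The weight mapping can be illustrated numerically under two loudly stated assumptions: the mapping is taken to have the form 1 + α·E², which is one reading of the description of the constant 1 and the squaring operation, and α = 4.0 is a hypothetical value inside the stated 2.0 to 5.0 range. The function name `dynamic_weights` is likewise hypothetical.

```python
def dynamic_weights(entropy, alpha=4.0):
    """Map an H x W entropy map to weights 1 + alpha * E^2 (assumed form)."""
    return [[1.0 + alpha * e * e for e in row] for row in entropy]


# Entropy 0.1 (background) and 1.2 (blurred defect edge), as in the example above.
weights = dynamic_weights([[0.1, 1.2]])
print([round(v, 2) for v in weights[0]])  # [1.04, 6.76]
```

Under these assumptions the background weight stays near the baseline of 1, while the edge weight exceeds 6, which is consistent with the amplification described in the example.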

[0077] Step S5: Construct a lightweight student network, using the second feature map set and fault reality annotations as basic supervision information, and construct enhanced supervision information using the first feature map, teacher spatial attention mask, teacher probability distribution map, and dynamic weight matrix. Perform heterogeneous knowledge distillation training on the student network; the training loss function is configured as follows: it includes a first loss term for the first feature map and a second loss term for the second feature map set; the first loss term is based on the dynamic weight matrix, and performs weighted calculation on the differences in probability distribution between the student network and the teacher network;

[0078] In this specific implementation, step S5 performs heterogeneous knowledge distillation training. This aims to seamlessly transfer the implicit higher-order knowledge (i.e., attention patterns to minor defects and uncertainty cognition) extracted by the teacher network at layer P2 to the lightweight student network by constructing a composite loss function system. Simultaneously, conventional label supervision ensures the model's generalization ability to multi-scale targets. This process involves lightweight configuration of the student network architecture, physical alignment of teacher and student feature dimensions, and the calculation and backpropagation of the weighted composite loss function.

[0079] Before formally performing heterogeneous knowledge distillation training and loss calculation, this embodiment performs rigorous feature alignment and dimensionality unification to overcome the physical differences between the teacher network and the student network (lightweight pruning) in terms of feature channel number and spatial representation. This process ensures that the two establish an accurate point-to-point mapping relationship in the tensor space.

[0080] Specifically, the system first obtains the original student feature map $F_S$ of the student network at the level corresponding to the first feature map (P2 layer). Assuming the student network undergoes channel pruning, the tensor dimension of its P2 layer output is $C_s \times H \times W$ (for example, ...), whereas the teacher spatial attention mask $M_T$ generated by the teacher network is a single-channel tensor ($1 \times H \times W$). Since this mismatch in channel dimensions directly hinders the calculation of the attention topology loss, the system inputs $F_S$ into the preset auxiliary mapping layer.

[0081] In terms of structure, this auxiliary mapping layer is configured as a $1 \times 1$ convolutional layer: its kernel size is fixed at 1, and it applies no padding and no stride change. Through this convolution, the system linearly compresses the channel dimension of the student feature map from $C_s$ to 1; the result is then processed by the sigmoid activation function to generate the student virtual attention map $M_S$. The core calculation is as follows:

[0082] $M_S = \sigma(W_a * F_S + b_a)$;

[0083] where $F_S$ is the original student feature map, $W_a$ is the learnable weight matrix of the auxiliary mapping layer, $b_a$ is a bias term, $\sigma$ denotes the sigmoid function, and $*$ denotes the convolution operation.
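Because a 1 × 1 convolution to a single channel is a per-pixel weighted sum over channels, the auxiliary mapping layer can be sketched in a few lines. The function name `student_attention` and the nested-list tensor layout are hypothetical choices for illustration.

```python
import math


def student_attention(feat, weight, bias=0.0):
    """feat: C_s x H x W; weight: C_s scalars. 1x1 conv to one channel, then sigmoid."""
    h, w = len(feat[0]), len(feat[0][0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            z = bias + sum(wc * ch[y][x] for wc, ch in zip(weight, feat))
            row.append(1.0 / (1.0 + math.exp(-z)))
        out.append(row)
    return out
```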

[0084] Furthermore, the system performs dimensional consistency processing on the generated student virtual attention map. Although the P2 layer of the student network is designed with the same size as that of the teacher network in this embodiment, in practical applications a different downsampling strategy in the student network may cause the spatial resolution of the student virtual attention map to deviate slightly from that of the teacher spatial attention mask. In that case, the system automatically triggers a bilinear interpolation algorithm to resample the student virtual attention map dynamically to dimensions fully consistent with the teacher spatial attention mask. Through the aforementioned channel compression and spatial resampling, the system ensures that the student virtual attention map and the teacher spatial attention mask are fully aligned in tensor dimension ($1 \times H \times W$), thus establishing a point-to-point mapping relationship between them in tensor space and providing the mathematical premise for the subsequent calculation of the attention topology loss.
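The spatial resampling can be sketched as follows. The corner-aligned coordinate convention used here is an assumption, since the description names only bilinear interpolation, and the function name `resize_bilinear` is hypothetical.

```python
def resize_bilinear(src, out_h, out_w):
    """Resample a 2-D map to out_h x out_w, mapping corner pixels to corner pixels."""
    in_h, in_w = len(src), len(src[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = src[y0][x0] * (1 - fx) + src[y0][x1] * fx
            bottom = src[y1][x0] * (1 - fx) + src[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```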

[0085] After establishing feature alignment, the system computes the training loss in parallel using the enhanced and basic supervision information. First, a first loss term is calculated for the first feature map (P2 layer), which is further subdivided into an entropy-weighted soft-label loss and an attention topology loss. For the entropy-weighted soft-label loss, the system uses the dynamic weight matrix $\Omega$ generated in step S4 as a modulating factor and calculates the relative entropy (KL divergence) between the student probability distribution $P_S$ and the teacher probability distribution $P_T$ according to the following formula: $L_{soft} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \Omega(i,j) \cdot D_{KL}\big(P_S(i,j) \,\|\, P_T(i,j)\big)$;

[0086] where W and H represent the width and height of the feature map, respectively, $D_{KL}$ represents the relative entropy function, and $P_S(i,j)$ and $P_T(i,j)$ represent the probability distributions of the student network and the teacher network at coordinates $(i,j)$, respectively.

[0087] In this formula, $\Omega(i,j)$ provides spatial weighting: for areas that the teacher network identifies as highly uncertain (i.e., small defect edges with larger $\Omega(i,j)$ values), if the student network's prediction distribution differs significantly from the teacher's, the system applies a multiplied gradient penalty; conversely, in background regions, the gradient penalty remains at the baseline level. This mechanism forces the student network to concentrate on imitating the teacher's judgment logic on difficult examples. Simultaneously, the system calculates the attention topology loss between the student virtual attention map and the teacher spatial attention mask. By minimizing the L2 norm distance between the two, the focus of the student network is constrained to be consistent with that of the teacher in geometric space; that is, the student network is forced to "look" at the defective area on which the teacher is focusing.
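The two sub-terms of the first loss term can be sketched as below. This is an illustration under stated assumptions rather than the embodiment's formula: the weighted divergence is averaged over the H × W positions, the KL argument order follows the order in which the student and teacher distributions are mentioned in the text, and a small `eps` is added for numerical safety. The function names are hypothetical.

```python
import math


def weighted_kl_loss(p_student, p_teacher, weights, eps=1e-12):
    """Mean over pixels of weight(i, j) * KL(P_S(i, j) || P_T(i, j))."""
    n, h, w = len(p_student), len(p_student[0]), len(p_student[0][0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            kl = sum(p_student[k][y][x]
                     * math.log((p_student[k][y][x] + eps)
                                / (p_teacher[k][y][x] + eps))
                     for k in range(n))
            total += weights[y][x] * kl
    return total / (h * w)


def attention_loss(m_student, m_teacher):
    """Squared L2 distance between the two single-channel attention maps."""
    return sum((s - t) ** 2
               for row_s, row_t in zip(m_student, m_teacher)
               for s, t in zip(row_s, row_t))
```

Doubling the weight at a position doubles that position's contribution to the loss, which is the gradient amplification effect described above for high-uncertainty regions.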

[0088] Meanwhile, for the second feature map set (layers P3, P4, and P5), the system calculates the second loss term using basic supervised information. This process does not require the teacher network's participation; instead, it directly matches the student network's predicted outputs at these scales with the ground truth fault labels. The system uses CIoU (Complete IoU) loss to calculate the bounding box regression error and DFL (Distribution Focal Loss) to calculate the classification error, defining the sum of the two as the second loss term. Finally, the system performs a weighted sum of the first and second loss terms to obtain the total loss and updates the student network's weight parameters using the backpropagation algorithm. The formula for the total loss function constructed in this embodiment is as follows:

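[0089] $L_{total} = L_{det} + \lambda_1 L_{soft} + \lambda_2 L_{att}$;

[0090] where $L_{det}$ is the second loss term, $L_{soft}$ and $L_{att}$ are the entropy-weighted soft-label loss and the attention topology loss that make up the first loss term, and $\lambda_1$ and $\lambda_2$ are hyperparameters that balance the gradients of the loss terms. Considering the difference in magnitude between the entropy-weighted soft-label loss and the attention topology loss in the early stages of training, this embodiment preferably sets these hyperparameters so that the model learns minor defect features without compromising its basic detection capabilities.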

[0091] To ensure the convergence and robustness of model training, this embodiment employs a specific hyperparameter configuration strategy. Specifically, the training process is based on a stochastic gradient descent (SGD) optimizer, with a momentum factor set to 0.937 and a weight decay factor set to 0.0005. The total number of training epochs is set to 300. The learning rate is dynamically adjusted using a cosine annealing strategy, with an initial learning rate set to 0.01 and a final learning rate decaying to 0.001. Furthermore, to fully utilize the parallel computing power of the GPU, the batch size is set to 32. In the first three epochs of training, a warm-up strategy is also implemented, allowing the learning rate to slowly and linearly increase from 0 to the initial value, thus preventing gradient oscillations in the early stages of training from disrupting the distribution characteristics of the pre-trained weights.
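The learning rate schedule described above can be sketched as a single function. How the warm-up and the cosine annealing are joined is an assumption: here the cosine curve starts at the end of the three warm-up epochs and reaches the final rate at epoch 300. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, total=300, warmup=3, lr0=0.01, lr_final=0.001):
    """Learning rate at a (possibly fractional) epoch index."""
    if epoch < warmup:
        return lr0 * epoch / warmup  # linear warm-up from 0 to lr0
    progress = (epoch - warmup) / (total - warmup)  # 0 after warm-up, 1 at the end
    return lr_final + 0.5 * (lr0 - lr_final) * (1 + math.cos(math.pi * progress))
```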

[0092] It is worth noting that after training, the aforementioned auxiliary mapping layer is removed, leaving only the main structure of the student network for inference. The method thus combines high-dimensional knowledge guidance during the training phase with a fully lightweight model during the inference phase.

[0093] Step S6: Obtain the target PCB image to be detected, perform fault detection on the target PCB image based on the trained student network, and output the detection results including fault category and location coordinates;

[0094] Specifically, before performing detection, the system first restructures the trained student network. The auxiliary mapping layer used for feature alignment during training is removed, and the system retains only the backbone layer, neck fusion layer, and multi-scale detection heads from P2 to P5 of the student network. This structural simplification ensures that the model does not incur any additional parameter computation overhead during the inference phase, thereby maximizing the operating efficiency of edge devices.

[0095] Subsequently, the target PCB image to be inspected is normalized to a standard size (e.g., ...) and input into the adjusted student network. The network performs forward propagation and outputs detection tensors corresponding to four scales: P2, P3, P4, and P5. It is worth noting that this embodiment employs an anchor-free detection mechanism; therefore, the output tensor at each scale does not depend on preset anchor boxes, but directly predicts the distance distribution vectors from the current grid center point to the top, bottom, left, and right edges of the target bounding box. These distance distribution vectors are subsequently converted, using an integral regression decoding algorithm, into specific bounding box center coordinates $(x, y)$ and width and height dimensions $(w, h)$. Ultimately, each candidate detection box is represented by a vector containing the bounding box center coordinates $(x, y)$, the width and height dimensions $(w, h)$, a confidence score $s$ indicating the presence of a fault within the box, and the specific fault category $c$.
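Integral regression decoding can be illustrated as follows: each distance is the expectation of a softmax distribution over discrete bins, and the four distances are then converted into a center and a size. The function names, the use of grid units for the grid center, and the `stride` argument that converts grid units to pixels are assumptions for illustration.

```python
import math


def expected_distance(logits):
    """Softmax over bins, then the expectation of the bin index."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


def decode_box(cx, cy, stride, left, top, right, bottom):
    """Grid centre (cx, cy) in grid units plus four logit vectors -> (x, y, w, h) in pixels."""
    d_left, d_top, d_right, d_bottom = (
        expected_distance(v) for v in (left, top, right, bottom))
    x1, y1 = (cx - d_left) * stride, (cy - d_top) * stride
    x2, y2 = (cx + d_right) * stride, (cy + d_bottom) * stride
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```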

[0096] After obtaining the initial set of candidate detection boxes, the system enters the post-processing filtering stage. First, confidence threshold filtering is performed with a preset confidence threshold (for example, 0.25). The system iterates through all candidate boxes, and any box whose confidence score is below the threshold is considered invalid background noise and is discarded directly.

[0097] The candidate boxes remaining after this initial screening often contain duplicate predictions of the same defect (i.e., redundant overlapping boxes). To remove duplicates and lock in the optimal result, the system executes the Non-Maximum Suppression (NMS) algorithm. The core of this algorithm is to calculate the Intersection over Union (IoU) between candidate detection boxes as an overlap index. Assume there are two candidate boxes $A$ and $B$. The formula for calculating the IoU is: $IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$;

[0098] where $|A \cap B|$ represents the area of the overlapping region between the two boxes, and $|A \cup B|$ represents the area of the union of the two boxes.

[0099] The specific execution logic of NMS is as follows. The system first sorts all remaining candidate boxes by confidence score from highest to lowest. It selects the box with the highest confidence as a retained result and then calculates, in sequence, the IoU value between this box and every other box in the list. If the IoU value is greater than the preset overlap threshold (for example, 0.45), the other box is judged redundant and suppressed (deleted); otherwise, the box is retained and proceeds to the next round of filtering. This process is repeated until the candidate list is empty.
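The post-processing steps above (confidence filtering, IoU, and NMS) can be sketched together. The corner format `(x1, y1, x2, y2)` and the function names `iou` and `detect_filter` are hypothetical; the thresholds 0.25 and 0.45 are the example values given in the description.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect_filter(boxes, scores, conf_thr=0.25, iou_thr=0.45):
    """Return indices of retained boxes, highest confidence first."""
    # Step 1: drop boxes below the confidence threshold, then sort by score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: repeatedly keep the best box and suppress boxes overlapping it.
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep
```

For example, with boxes `(0, 0, 10, 10)` and `(1, 1, 11, 11)` scored 0.9 and 0.8, the IoU is 81/119, which exceeds 0.45, so only the first box is retained.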

[0100] Finally, the set of detection boxes retained after NMS processing is the final detection result. The system maps the normalized coordinates of these detection boxes back to the pixel coordinate system of the original image, and outputs structured data containing fault category (such as cold solder joint, scratch, etc.), location coordinates and confidence level, thus completing the detection.

[0101] In summary, this embodiment, by constructing a heterogeneous distillation architecture including a high-resolution branch at the P2 layer, essentially establishes a lossless transmission channel from the high-order cognition of the teacher network to the lightweight features of the student network. Through the teacher network's geometric calibration of irregular defects and semantic suppression of background noise, as well as the gradient amplification of high-uncertainty regions by a dynamic weight matrix based on information entropy, the system successfully forces the parameter-pruned student network to overcome the physical bottleneck of low-computing-power models in extracting features from minute targets. This deep supervision mechanism ensures that after removing auxiliary structures during the inference phase, the student network can reproduce the teacher network's keen ability to capture hair-level scratches and minute solder joints with extremely low computational overhead, thus achieving a performance balance between real-time detection and high-precision recognition in industrial edge computing scenarios.

[0102] Example 2:

[0103] As shown in Figure 3, a high-precision, lightweight PCB fault detection system for improving product reliability includes:

[0104] The data acquisition module is used to acquire PCB defect sample images and their corresponding actual fault annotations;

[0105] The multi-scale feature extraction module is used to perform multi-scale feature extraction on PCB defect sample images to obtain a first feature map and a second feature map set; the spatial resolution of the first feature map is higher than the spatial resolution of any feature map in the second feature map set.

[0106] The teacher network processing module is used to input the first feature map into the pre-constructed teacher network, and generate the teacher spatial attention mask and teacher probability distribution map by sequentially performing geometric deformation alignment, channel semantic filtering and spatial feature focusing;

[0107] The dynamic weight generation module is used to calculate the spatial prediction uncertainty based on the teacher probability distribution map and map the spatial prediction uncertainty into a dynamic weight matrix with the same size as the first feature map; the region with higher spatial prediction uncertainty has a larger corresponding weight value in the dynamic weight matrix.

[0108] The heterogeneous knowledge distillation training module is used to construct a lightweight student network. It uses the second feature map set and fault reality annotations as basic supervision information, and simultaneously uses the first feature map, teacher spatial attention mask, teacher probability distribution map and dynamic weight matrix to construct enhanced supervision information to train the student network. The training loss function is configured to include a first loss term for the first feature map and a second loss term for the second feature map set. The first loss term is based on the dynamic weight matrix and performs a weighted calculation on the difference in probability distribution between the student network and the teacher network.

[0109] The fault detection module is used to acquire the target PCB image to be detected, and to perform fault detection on the target PCB image based on the trained student network, and output the detection results including the fault category and location coordinates.

[0110] The above description is merely an example and illustration of the structure of the present invention. Those skilled in the art can make various modifications or additions to the specific embodiments described, or use similar methods to replace them, as long as they do not deviate from the structure of the invention or exceed the scope defined in the claims, all of which should fall within the protection scope of the present invention.

[0111] In the description of this specification, references to terms such as "an embodiment," "example," "specific example," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0112] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A high-precision, lightweight PCB fault detection method for improving product reliability, characterized in that it includes the following steps: PCB defect sample images and their corresponding fault annotations are obtained; multi-scale feature extraction is performed on the PCB defect sample images to obtain a first feature map and a second feature map set, the spatial resolution of the first feature map being higher than the spatial resolution of any feature map in the second feature map set; the first feature map is input into the pre-constructed teacher network, and the teacher spatial attention mask and teacher probability distribution map are generated by sequentially performing geometric deformation alignment, channel semantic filtering, and spatial feature focusing; the spatial prediction uncertainty is calculated based on the teacher probability distribution map and mapped to a dynamic weight matrix with the same size as the first feature map, wherein a region with higher spatial prediction uncertainty has a larger weight value in the dynamic weight matrix; a lightweight student network is constructed, the second feature map set and the real fault annotations are used as basic supervision information, enhanced supervision information is constructed from the first feature map, the teacher spatial attention mask, the teacher probability distribution map, and the dynamic weight matrix, and heterogeneous knowledge distillation training is performed on the student network, wherein the training loss function is configured to include a first loss term for the first feature map and a second loss term for the second feature map set, and the first loss term performs, based on the dynamic weight matrix, a weighted calculation on the difference in probability distribution between the student network and the teacher network; and the target PCB image to be detected is obtained, fault detection is performed on the target PCB image based on the trained student network, and a detection result containing the fault category and location coordinates is output.

2. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 1, characterized in that generating the teacher spatial attention mask and the teacher probability distribution map by sequentially performing geometric deformation alignment, channel semantic filtering, and spatial feature focusing includes the following steps: the spatial offset field is calculated based on the first feature map using the bypass convolutional layer in the teacher network; a deformable convolution operation is performed on the first feature map based on the spatial offset field to generate a geometrically aligned feature map; global max pooling and global average pooling are performed on the geometrically aligned feature map to generate channel descriptors; the channel descriptors are input into a multilayer perceptron to calculate channel weight vectors; the channel weight vectors are multiplied channel-by-channel with the geometrically aligned feature map to generate the semantically enhanced feature map; the semantically enhanced feature map is compressed along the channel dimension and then processed through a convolutional layer and an activation function to generate the teacher spatial attention mask; and the teacher spatial attention mask is multiplied element-wise with the semantically enhanced feature map, and the product is input into the detection head of the teacher network to generate the teacher probability distribution map.

3. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 1, characterized in that calculating the spatial prediction uncertainty based on the teacher probability distribution map includes: traversing each pixel coordinate point in the teacher probability distribution map and extracting the category probability vector corresponding to that coordinate point; and calculating the information entropy value of the category probability vector using the Shannon entropy algorithm, and combining the information entropy values of all coordinate points to generate a spatial entropy map, wherein the information entropy value in the spatial entropy map is the quantitative representation of the spatial prediction uncertainty, and the formula for calculating the information entropy value is as follows: $E(i,j) = -\sum_{k=1}^{N} p_k(i,j) \log p_k(i,j)$; where $E(i,j)$ represents the information entropy value at coordinates $(i,j)$, N represents the total number of fault categories, and $p_k(i,j)$ represents the probability value that the fault at coordinates $(i,j)$ belongs to the k-th type.

4. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 3, characterized in that the spatial prediction uncertainty is mapped to a dynamic weight matrix with the same size as the first feature map, satisfying the following nonlinear mapping: $\Omega(i,j) = 1 + \alpha \cdot E(i,j)^2$; where $\Omega(i,j)$ represents the weight value of the dynamic weight matrix at coordinates $(i,j)$, $E(i,j)$ represents the information entropy value at coordinates $(i,j)$, and $\alpha$ is a preset adjustment coefficient used to amplify the gradient contribution of high-uncertainty regions during model training.

5. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 1, characterized in that, before heterogeneous knowledge distillation training is performed on the student network, the method further includes a step of aligning the features of the student network with those of the teacher network, which includes: obtaining the original student feature map of the student network at the level corresponding to the first feature map; inputting the original student feature map into a preset auxiliary mapping layer, and compressing the number of channels of the original student feature map to a single channel through a convolution operation to generate a student virtual attention map; and performing dimensional consistency processing on the student virtual attention map so that it is consistent with the teacher spatial attention mask in spatial resolution and channel dimension, thereby establishing a point-to-point mapping relationship between the two in tensor space.

6. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 5, characterized in that: the first loss term includes an entropy-weighted soft-label loss sub-term and an attention topology loss sub-term; the entropy-weighted soft-label loss sub-term is configured as follows: using the dynamic weight matrix, a spatially weighted calculation is performed on the relative entropy between the student probability distribution map output by the student network and the teacher probability distribution map; the calculation formula is as follows: [formula not present in the source text]; where the formula involves the entropy-weighted soft-label loss sub-term, the width $W$ and height $H$ of the feature map, the relative entropy function, and the probability distributions of the student network and the teacher network at coordinates $(x,y)$.
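
Illustrative sketch (not part of the claim): the entropy-weighted soft-label loss of claim 6 could be approximated as below. Because the formula is missing from this text, two details are assumptions: the relative entropy is taken as KL(teacher || student), the common convention in knowledge distillation, and the weighted sum is averaged over the W x H positions. Function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def entropy_weighted_soft_label_loss(student_probs, teacher_probs, weights):
    """Weighted mean of per-pixel KL(teacher || student) (assumed form)."""
    height, width = len(weights), len(weights[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            total += weights[y][x] * kl_divergence(teacher_probs[y][x],
                                                   student_probs[y][x])
    return total / (width * height)

teacher = [[[0.5, 0.5], [0.9, 0.1]]]
student = [[[0.5, 0.5], [0.6, 0.4]]]
loss_uniform = entropy_weighted_soft_label_loss(student, teacher, [[1.0, 1.0]])
loss_boosted = entropy_weighted_soft_label_loss(student, teacher, [[1.0, 2.0]])
```

Doubling the weight at the only mismatched pixel doubles the loss, which is how uncertain regions receive a larger share of the gradient.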

7. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 6, characterized in that: the attention topology loss sub-term is configured to calculate the geometric distribution difference between the student virtual attention map and the teacher spatial attention mask, and the calculation formula is as follows: $L_{att} = \lVert A_S - A_T \rVert_2^2$; where $L_{att}$ is the attention topology loss sub-term, $A_S$ is the student virtual attention map, $A_T$ is the teacher spatial attention mask, and $\lVert \cdot \rVert_2^2$ represents the square of the L2 norm.
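
Illustrative sketch (not part of the claim): a squared L2 distance between two equally sized attention maps, as described in claim 7, could be computed as follows. The function name and the nested-list layout are assumptions, and no normalisation is applied because the claim text does not mention any.

```python
def attention_topology_loss(student_att, teacher_mask):
    """Squared L2 norm of the element-wise difference of two 2D maps."""
    return sum((s - t) ** 2
               for s_row, t_row in zip(student_att, teacher_mask)
               for s, t in zip(s_row, t_row))

student_att = [[0.0, 1.0], [0.5, 0.5]]
teacher_mask = [[0.0, 0.0], [0.5, 1.0]]
att_loss = attention_topology_loss(student_att, teacher_mask)
```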

8. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 1, characterized in that: the second loss term is obtained as follows: obtaining the multi-scale prediction results output by the student network based on the second feature map set; based on the actual fault annotations, calculating the bounding box regression loss value and the classification loss value for each scale in the multi-scale prediction results; and summing the bounding box regression loss values and the classification loss values over all scales to obtain the second loss term.
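
Illustrative sketch (not part of the claim): the summation in claim 8 could be organised as below. The claim does not name the specific loss functions, so smooth L1 for box regression and cross-entropy for classification are assumptions, as are the dictionary keys and the premise that each prediction is already matched to its annotation.

```python
import math

def smooth_l1(pred_box, true_box):
    """Smooth L1 loss summed over the box coordinates (assumed choice)."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def cross_entropy(class_probs, true_class, eps=1e-12):
    """Negative log-probability of the annotated class (assumed choice)."""
    return -math.log(class_probs[true_class] + eps)

def second_loss(predictions_per_scale):
    """Sum regression and classification losses over every scale."""
    total = 0.0
    for scale_preds in predictions_per_scale:
        for pred in scale_preds:
            total += smooth_l1(pred["box"], pred["true_box"])
            total += cross_entropy(pred["probs"], pred["true_class"])
    return total

# Two scales, one matched prediction each.
predictions = [
    [{"box": [0.0, 0.0, 1.0, 1.0], "true_box": [0.0, 0.0, 1.0, 1.0],
      "probs": [1.0, 0.0], "true_class": 0}],
    [{"box": [0.5, 0.0, 1.0, 1.0], "true_box": [0.0, 0.0, 1.0, 1.0],
      "probs": [0.5, 0.5], "true_class": 1}],
]
loss_2 = second_loss(predictions)
```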

9. The high-precision lightweight PCB fault detection method for improving product reliability according to claim 1, characterized in that: performing fault detection on the target PCB image using the trained student network includes: inputting the target PCB image into the student network and outputting a multi-scale detection tensor through parallel computation, wherein the multi-scale detection tensor contains several candidate detection boxes, and each candidate detection box includes the corresponding fault category, location coordinates, and confidence score; filtering all candidate detection boxes based on the confidence score, and removing candidate detection boxes whose confidence scores are lower than a preset confidence threshold; calculating the intersection over union (IoU) among the remaining candidate detection boxes as an overlap index; and, for multiple candidate detection boxes whose overlap index is higher than a preset overlap threshold, retaining the one with the highest confidence score as the final detection result.
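
Illustrative sketch (not part of the claim): the post-processing in claim 9 corresponds to confidence filtering followed by greedy non-maximum suppression. The example below assumes boxes in `(x1, y1, x2, y2)` format, class-agnostic suppression, and default thresholds of 0.5; the function names, dictionary keys, and category labels are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_and_suppress(candidates, conf_threshold=0.5, iou_threshold=0.5):
    """Drop low-confidence boxes, then keep the best box of each overlap group."""
    remaining = sorted(
        (c for c in candidates if c["score"] >= conf_threshold),
        key=lambda c: c["score"],
        reverse=True,
    )
    kept = []
    for cand in remaining:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(cand["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept

candidates = [
    {"category": "short", "box": (10.0, 10.0, 50.0, 50.0), "score": 0.9},
    {"category": "short", "box": (12.0, 12.0, 52.0, 52.0), "score": 0.8},
    {"category": "open", "box": (100.0, 100.0, 140.0, 140.0), "score": 0.7},
    {"category": "spur", "box": (200.0, 200.0, 220.0, 220.0), "score": 0.3},
]
detections = filter_and_suppress(candidates)
```

Here the 0.3-score box is removed by the confidence threshold, and the 0.8-score box is suppressed because it overlaps heavily with the 0.9-score box.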

10. A high-precision lightweight PCB fault detection system for improving product reliability, characterized in that: the system uses the high-precision lightweight PCB fault detection method for improving product reliability according to any one of claims 1-9, and comprises: a data acquisition module, used to acquire PCB defect sample images and their corresponding actual fault annotations; a multi-scale feature extraction module, used to perform multi-scale feature extraction on the PCB defect sample images to obtain a first feature map and a second feature map set, wherein the spatial resolution of the first feature map is higher than the spatial resolution of any feature map in the second feature map set; a teacher network processing module, used to input the first feature map into a pre-constructed teacher network and to generate a teacher spatial attention mask and a teacher probability distribution map by sequentially performing geometric deformation alignment, channel semantic filtering, and spatial feature focusing; a dynamic weight generation module, used to calculate the spatial prediction uncertainty based on the teacher probability distribution map and to map the spatial prediction uncertainty into a dynamic weight matrix with the same size as the first feature map, wherein a region with higher spatial prediction uncertainty has a larger weight value in the dynamic weight matrix; a heterogeneous knowledge distillation training module, used to construct a lightweight student network, to use the second feature map set and the actual fault annotations as basic supervision information, and to use the first feature map, the teacher spatial attention mask, the teacher probability distribution map, and the dynamic weight matrix to construct enhanced supervision information for training the student network, wherein the training loss function is configured to include a first loss term for the first feature map and a second loss term for the second feature map set, and the first loss term performs, based on the dynamic weight matrix, a weighted calculation on the difference in probability distribution between the student network and the teacher network; and a fault detection module, used to acquire the target PCB image to be detected, to perform fault detection on the target PCB image based on the trained student network, and to output a detection result including the fault category and location coordinates.