A method, device, and medium for detecting defects in mobile phone screens based on machine vision.
By combining adaptive nonlocal mean filtering and frequency domain visual highlighting techniques with a dynamic multi-task recognition network, the problems of difficulty in separating background and defects and insufficient contrast in existing methods are solved, achieving high-precision defect detection, reducing false alarm rate and improving detection efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 厦门工学院
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing machine vision-based mobile phone screen defect detection methods struggle to accurately separate the background from the defects, resulting in false defects in areas with drastic background changes and the masking of weak defect signals in areas with smooth backgrounds. Furthermore, they lack effective utilization of image frequency domain features, leading to insufficient contrast between defects and the background in low signal-to-noise ratio environments, making it difficult to extract effective features for analysis.
Adaptive nonlocal mean filtering is used to accurately separate the background from potential defect signals. Frequency domain visual highlighting and adaptive contrast enhancement, with enhancement parameters adjusted dynamically at multiple scales, then raise defect contrast, and a dynamic multi-task recognition network with dual-channel input performs the defect analysis.
The method improves the signal-to-noise ratio of defect detection, reduces the false alarm rate, and outputs defect type, shape, and size data to meet the high-precision and high-efficiency quality inspection needs of modern industrial production.
Figure CN122306822A_ABST
Description
Technical Field
[0001] This invention relates to the field of defect detection technology, and in particular to a method, device and medium for detecting defects in mobile phone screens based on machine vision.
Background Technology
[0002] With the rapid development of mobile terminal technology, consumers have increasingly stringent requirements for the display quality of smartphone screens. As the core medium for human-computer interaction, mobile phone screens are highly susceptible to scratches, dark spots, bright spots, foreign objects, or Mura defects due to process defects or environmental factors during the manufacturing process. In the past, mobile phone screen defect detection mainly relied on manual visual inspection. However, due to limitations such as human eye fatigue, subjective judgment differences, and the invisibility of minor defects (such as low-contrast Mura defects), it is no longer able to meet the needs of modern high-precision and high-efficiency industrial production. Therefore, automated optical inspection (AOI) technology based on machine vision has gradually become mainstream. Existing AOI technology usually uses high-resolution industrial cameras to acquire screen images under specific light sources, extracts features through image processing algorithms, and performs defect identification. In recent years, with the breakthrough of deep learning technology, convolutional neural networks (CNNs) have been widely used in the field of defect detection. By training classifiers or segmentation networks to identify screen anomalies, this has improved the accuracy and generalization ability of detection to a certain extent.
[0003] Nevertheless, existing defect detection methods still have room for improvement. First, previous background modeling methods often fail to accurately separate the background from the defect, resulting in a large number of false defects in areas with drastic background changes, while weak defect signals are easily masked in areas with smooth backgrounds. Second, existing algorithms based on spatial domain processing usually lack effective utilization of image frequency domain features, and cannot effectively enhance the visual saliency of small defects globally. This results in insufficient contrast between defects and the background in low signal-to-noise ratio environments, making it difficult to extract effective features for subsequent analysis.
Summary of the Invention
[0004] In view of the aforementioned existing problems, the present invention is proposed.
[0005] Therefore, this invention provides a machine vision-based method for detecting defects in mobile phone screens to solve the problem of difficulty in accurately separating the background from the defects.
[0006] To solve the above technical problems, the present invention provides the following technical solution. In a first aspect, the present invention provides a method for detecting defects in mobile phone screens based on machine vision, comprising: acquiring the original image of the mobile phone screen under test and performing noise reduction to obtain a smoothed image; inputting the smoothed image into a deep background generation network for reconstruction to obtain a preliminary background reconstruction map; applying nonlocal mean filtering to the preliminary background reconstruction map to obtain the final background image; calculating the grayscale difference between corresponding pixels of the smoothed image and the final background image, and taking the absolute value of the grayscale difference to generate a preliminary residual map; performing spectral residual saliency calculation on the preliminary residual map to obtain a frequency domain visual saliency map; calculating a contrast limiting parameter based on the frequency domain visual saliency map, and performing contrast-limited histogram equalization on the preliminary residual map according to the contrast limiting parameter to generate an enhanced residual map; multiplying the frequency domain visual saliency map and the enhanced residual map element by element and normalizing the result to obtain a weighted defect response map; performing threshold segmentation and morphological processing on the weighted defect response map to extract defect candidate regions; and, for each defect candidate region, extracting dual-channel image patches from the smoothed image and the final background image and inputting them into a dynamic multi-task recognition network for defect analysis to obtain defect detection results.
[0007] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, the smooth image is obtained by defining a two-dimensional Gaussian kernel function and performing a convolution operation with the original image of the mobile phone screen to be tested.
[0008] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, the preliminary background reconstruction map is obtained by inputting a smooth image into a deep background generation network. The encoder part of the deep background generation network downsamples the smooth image to extract multi-scale features, and the decoder part upsamples the multi-scale features to reconstruct the image, thereby generating a preliminary background reconstruction map.
[0009] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, the generation of the preliminary residual map specifically includes: Calculate the grayscale variance in the neighborhood of each pixel in the preliminary background reconstruction image; The adaptive filtering parameters for each pixel position are calculated based on the gray-level variance and the preset adjustment value. Based on the adaptive filtering parameters corresponding to each pixel position, nonlocal mean filtering is performed on the preliminary background reconstruction image to obtain the final background image. For corresponding pixel positions with the same coordinates in the smooth image and the final background image, the grayscale values of the corresponding pixel positions are read and the difference is calculated. After taking the absolute value of the difference calculation results of all pixel positions, the absolute value results of all pixel positions are arranged according to the original spatial position to form a new grayscale image. The new grayscale image is used as the preliminary residual image.
[0010] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, obtaining the frequency domain visual saliency map specifically includes: performing a two-dimensional discrete Fourier transform on the preliminary residual map to obtain the logarithmic amplitude spectrum and phase spectrum of the preliminary residual map; calculating the average gradient of the preliminary residual map, and determining the cutoff frequency of the low-pass filter based on the average gradient, the preset base cutoff frequency, and the adjustment factor; filtering the logarithmic amplitude spectrum of the preliminary residual map according to the cutoff frequency of the low-pass filter to obtain the smoothed logarithmic amplitude spectrum; calculating the difference between the logarithmic amplitude spectrum of the preliminary residual map and the smoothed logarithmic amplitude spectrum to obtain the frequency domain residual spectrum; reconstructing the complex spectrum from the frequency domain residual spectrum and the phase spectrum; and performing a two-dimensional inverse discrete Fourier transform on the reconstructed complex spectrum to obtain the frequency domain visual saliency map.
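The sequence in [0010] can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not give the formula that derives the cutoff frequency from the average gradient, so `cutoff` is passed in directly; the Gaussian shape of the low-pass filter, the `eps` guard, and the squared magnitude taken after the inverse transform are likewise assumptions, and the function name is hypothetical.

```python
import numpy as np

def spectral_residual_saliency(residual, cutoff=0.1, eps=1e-8):
    """Spectral-residual saliency of a 2-D residual map (illustrative sketch)."""
    spectrum = np.fft.fft2(residual.astype(np.float64))
    log_amp = np.log(np.abs(spectrum) + eps)   # logarithmic amplitude spectrum
    phase = np.angle(spectrum)                 # phase spectrum

    # Assumed Gaussian low-pass filter, applied to the log-amplitude spectrum
    # treated as an image; cutoff is in cycles per sample.
    fy = np.fft.fftfreq(residual.shape[0])[:, None]
    fx = np.fft.fftfreq(residual.shape[1])[None, :]
    lowpass = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * cutoff ** 2))
    smoothed = np.real(np.fft.ifft2(np.fft.fft2(log_amp) * lowpass))

    spec_residual = log_amp - smoothed         # frequency domain residual spectrum
    rebuilt = np.exp(spec_residual + 1j * phase)  # reconstructed complex spectrum
    # Assumption: squared magnitude turns the complex inverse transform into a real map.
    return np.abs(np.fft.ifft2(rebuilt)) ** 2
```

A smaller `cutoff` smooths the log-amplitude spectrum more strongly, so more of the spectrum ends up in the residual.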
[0011] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, the generation of the enhanced residual map specifically includes: constructing a Gaussian pyramid for the preliminary residual map, and downsampling the frequency domain visual saliency map to the same size as the image at each scale in the Gaussian pyramid; calculating, from the value at each pixel position in the downsampled frequency domain visual saliency map, the contrast limiting parameter of that pixel position in the image at the corresponding scale; using the calculated contrast limiting parameters, performing contrast-limited histogram equalization independently on the image at each scale in the Gaussian pyramid to obtain a contrast-enhanced image at each scale; and upsampling the contrast-enhanced images at all scales back to the original size of the preliminary residual map, then performing a linear weighted summation of the pixel grayscale values at the same pixel coordinates in all upsampled contrast-enhanced images to generate the enhanced residual map.
[0012] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, wherein: the extraction of defect candidate regions specifically includes: Perform pixel-by-pixel multiplication on the pixel grayscale values of the frequency domain visual saliency map and the enhanced residual map at the same spatial coordinates to generate a dot product result map. Normalize the grayscale values of all pixels in the dot product result map to obtain a weighted defect response map. Based on the distribution of gray values of each pixel in the weighted defect response map, the Otsu method is used to calculate the global optimal binarization threshold. Based on the globally optimal binarization threshold, the grayscale value of each pixel in the weighted defect response map is converted into a binary image pixel value to form a binary image; The binary image is subjected to morphological closing and morphological opening operations in sequence to obtain the morphologically processed binary image. Connectivity labeling analysis is performed on the morphologically processed binary image to identify all independent connected regions, and the minimum bounding rectangle of each connected region is calculated. Each minimum bounding rectangle constitutes a defect candidate region.
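The first part of [0012], the element-wise product, normalization, and Otsu threshold, can be sketched as below. This is a hedged illustration: the function names, the min-max normalization to [0, 1], and the 256-bin histogram are assumptions, and the morphological closing/opening and connected-region labeling steps are left out.

```python
import numpy as np

def weighted_response(saliency, enhanced):
    """Pixel-by-pixel product of the two maps, min-max normalized to [0, 1]."""
    product = saliency.astype(np.float64) * enhanced.astype(np.float64)
    lo, hi = product.min(), product.max()
    if hi == lo:                       # flat map: nothing stands out
        return np.zeros_like(product)
    return (product - lo) / (hi - lo)

def otsu_threshold(response, bins=256):
    """Global threshold that maximizes the between-class variance."""
    hist, edges = np.histogram(response, bins=bins, range=(0.0, 1.0))
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(prob)               # weight of the class at or below each bin
    w1 = 1.0 - w0
    cum_mean = np.cumsum(prob * centers)
    total_mean = cum_mean[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    # Return the upper edge of the best bin so that "response > t" splits after it.
    return float(edges[int(np.argmax(between)) + 1])
```

The binary image would then be `response > otsu_threshold(response)`, followed by the closing, opening, and labeling steps.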
[0013] As a preferred embodiment of the machine vision-based mobile phone screen defect detection method of the present invention, the step of obtaining the defect detection result specifically includes: For each defect candidate region, based on the position and size of the corresponding minimum bounding rectangle, image blocks are cropped from the corresponding coordinate positions of the smooth image and the final background image, and these two image blocks are stitched together in the channel dimension to form a dual-channel image block. After scaling the dual-channel image patch to a preset size and normalizing it, it is input into the dynamic multi-task recognition network to generate convolution kernel parameters in real time. The backbone of the dynamic multi-task recognition network uses convolution kernel parameters to convolve and extract features from dual-channel image blocks, and then passes them to the multi-task output layer to obtain the category probability distribution of defects, pixel-level segmentation mask and quantization parameters. Based on preset judgment rules, the defect detection results are generated by fusing the defect category probability distribution, pixel-level segmentation mask, and quantization parameters.
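The cropping, channel stacking, scaling, and normalization in [0013] might look like the following sketch. The `(x, y, w, h)` box layout, the 64×64 preset size, nearest-neighbor resampling, and division by 255 for 8-bit grayscale input are all assumptions chosen for illustration; the recognition network itself is not sketched.

```python
import numpy as np

def dual_channel_patch(smoothed, background, box, out_size=64):
    """Crop the same rectangle from both images and stack it as a 2-channel patch."""
    x, y, w, h = box                                   # hypothetical (x, y, width, height) layout
    patch = np.stack(
        [smoothed[y:y + h, x:x + w], background[y:y + h, x:x + w]], axis=0
    ).astype(np.float64)                               # shape (2, h, w)

    # Nearest-neighbor resize to the preset size (assumed interpolation method).
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    resized = patch[:, rows[:, None], cols[None, :]]   # shape (2, out_size, out_size)

    return resized / 255.0                             # assumed 8-bit input range
```

Stacking the smoothed image with the final background image gives the network both the observed appearance and the defect-free estimate at the same location.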
[0014] In a second aspect, the present invention provides a computer device, including a memory and a processor, wherein the memory stores a computer program, wherein when the computer program is executed by the processor, it implements any step of the machine vision-based mobile phone screen defect detection method described in the first aspect of the present invention.
[0015] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, it implements any step of the machine vision-based mobile phone screen defect detection method described in the first aspect of the present invention.
[0016] The beneficial effects of this invention are as follows: By utilizing adaptive nonlocal mean filtering to accurately separate background and potential defect signals, the problem of false defects easily generated under complex textures in previous methods is solved. Subsequently, frequency domain visual highlighting and adaptive contrast enhancement technology are introduced. By dynamically adjusting the enhancement parameters at multiple scales, the problem of small and low-contrast defects being difficult to detect is effectively solved, and the signal-to-noise ratio is improved. Finally, a dynamic multi-task recognition network with dual-channel input is adopted. By fusing classification results, segmentation results, defect area, and quantization parameters, a refined description of defect attributes is achieved. This not only significantly reduces the false alarm rate caused by background interference, but also breaks through the limitations of a single detection task. It can accurately output the type, shape, and size data of defects, meeting the stringent requirements of modern industrial production for high-precision and high-efficiency quality inspection.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart of a machine vision-based method for detecting defects in mobile phone screens.
[0019] Figure 2 is a flowchart for obtaining an enhanced residual map.
[0020] Figure 3 is a flowchart for obtaining a preliminary residual map.
[0021] Figure 4 is a flowchart for obtaining a frequency domain visual saliency map.
Detailed Implementation
[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
[0023] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.
[0024] The term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor does it refer to a separate or alternative embodiment that is mutually exclusive with other embodiments.
[0025] Referring to Figures 1-4, one embodiment of the present invention provides a machine vision-based method for detecting defects in mobile phone screens, comprising the following steps: S1. Acquire the original image of the screen of the mobile phone under test and perform noise reduction to obtain a smoothed image.
[0026] It should be noted that, in a controlled darkroom environment, a high-resolution area array industrial camera is optically aligned with the surface of the mobile phone screen under test. A standard test image, consisting of a series of grayscale images covering a range from pure black to pure white, is displayed on the screen via a control circuit. While the screen stably displays the current grayscale image from the standard test image, a uniform backlight and a low-angle ring light source are simultaneously activated. The uniform backlight is used for transmitted illumination to highlight internal light-emitting defects, while the low-angle ring light source is used for lateral illumination to highlight surface morphology defects. The industrial camera then captures an image of the mobile phone screen under the combined illumination of these two light sources, obtaining the original image of the screen under test.
[0027] Define a two-dimensional Gaussian kernel function with preset fixed values for its size and standard deviation, and convolve it with the original image in the spatial domain. During the convolution operation, the two-dimensional Gaussian kernel function acts as a weight template that slides pixel by pixel across the original image. For each pixel location in the original image, calculate the weighted average of the gray values of all pixels within the area covered by the two-dimensional Gaussian kernel function, and assign this weighted average to the corresponding pixel location in the output image. After the convolution calculation has been completed for all pixel locations in the original image, the smoothed image is obtained.
[0028] Furthermore, the size of the Gaussian kernel function is set to limit the neighborhood range in which each pixel participates in the weighted average calculation during smoothing, and the standard deviation is set to limit the rate at which the Gaussian weights decay with increasing spatial distance, thereby controlling the smoothing intensity. In one embodiment, the size of the two-dimensional Gaussian kernel function can be set to 5×5, and the standard deviation can be set to 1.0. This can effectively suppress acquisition noise and local brightness fluctuations while preserving the gray-scale abrupt change characteristics of small defect areas, avoiding the reduction in the accuracy of subsequent background reconstruction, residual extraction, and defect localization due to excessive smoothing.
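The smoothing in S1 can be sketched in NumPy with the 5×5 kernel and standard deviation 1.0 of this embodiment. The function names and the reflect padding at the image border are assumptions; the text does not say how borders are handled.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized size x size Gaussian weight template."""
    half = size // 2
    coords = np.arange(-half, half + 1, dtype=np.float64)
    g1d = np.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g1d, g1d)
    return kernel / kernel.sum()          # weights sum to 1, so brightness is preserved

def gaussian_smooth(image, size=5, sigma=1.0):
    """Weighted average of each pixel's neighborhood, written to a new array."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    rows, cols = image.shape
    # Assumption: reflect padding so that border pixels have a full neighborhood.
    padded = np.pad(image.astype(np.float64), half, mode="reflect")
    out = np.zeros((rows, cols), dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out
```

Writing the result to a separate array matters: updating the original image in place would feed already-smoothed values into later neighborhoods.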
[0029] S2. Input the smoothed image into the deep background generation network for reconstruction to obtain a preliminary background reconstruction image.
[0030] S2.1, it should be noted that multiple sets of mobile phone screen training sample images and multiple sets of defect-free mobile phone screen reference images are pre-collected. The mobile phone screen training sample images are mobile phone screen images used as input to the deep background generation network for background reconstruction training, containing background information of the mobile phone screen and allowing for defects and perturbations. Both the mobile phone screen training sample images and the defect-free mobile phone screen reference images are acquired under the same display conditions, shooting parameters, and lighting conditions. Denoising processing is performed on the multiple sets of mobile phone screen training sample images and the multiple sets of defect-free mobile phone screen reference images according to the denoising method in S1, respectively, to obtain corresponding smoothed sample images and smoothed reference images. The smoothed sample images serve as training input, and the smoothed reference images serve as supervision labels, together constituting the training set used to train the deep background generation network.
[0031] The deep background generation network consists of an input layer, an encoder, a bottleneck feature processing layer, a decoder, and an output layer. The input layer receives a single-channel smoothed image of uniform size; in one embodiment, the input size of the single-channel smoothed image can be set to 512×512. The encoder has four downsampling levels arranged from shallow to deep, with each level performing two convolution operations and one downsampling operation sequentially. The number of feature channels at each level is set to 32, 64, 128, and 256, respectively. The bottleneck feature processing layer can have 512 feature channels. The decoder has four upsampling levels, where each upsampling level first upsamples the features from the previous layer, then concatenates them with the corresponding layer's output features from the encoder in the channel dimension, and then performs convolution calculations to gradually restore the image's spatial resolution. The output layer performs a 1×1 convolution operation on the final decoding result, outputting a single-channel reconstruction result with the same size as the input. The input size, downsampling level, number of feature channels, and bottleneck feature dimension are set based on the characteristics of mobile phone screen background images, which have both global brightness smooth changes and local subtle texture changes. This ensures that the deep background generation network can obtain a sufficiently large receptive field while retaining the local grayscale structure information required for subsequent residual extraction.
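To make the layer dimensions concrete, the sketch below traces channel counts and spatial sizes through the encoder, bottleneck, and decoder for the 512×512 embodiment. It assumes each downsampling or upsampling step changes the resolution by a factor of 2 and that each decoder level outputs the same channel count as its mirrored encoder level; neither is stated in the text, and `trace_shapes` is a hypothetical helper.

```python
def trace_shapes(input_size=512, encoder_channels=(32, 64, 128, 256),
                 bottleneck_channels=512):
    """Return (stage, channels, spatial size) tuples for the encoder-decoder."""
    stages = []
    skips = []
    size = input_size
    for level, ch in enumerate(encoder_channels, start=1):
        stages.append((f"enc{level}", ch, size))   # two convolutions at this size
        skips.append((ch, size))                   # retained for the decoder concatenation
        size //= 2                                 # assumed 2x downsampling
    stages.append(("bottleneck", bottleneck_channels, size))
    for level in range(len(skips), 0, -1):
        ch, _ = skips[level - 1]
        size *= 2                                  # assumed 2x upsampling
        stages.append((f"dec{level}", ch, size))   # after concatenation and convolution
    stages.append(("output", 1, size))             # 1x1 convolution to a single channel
    return stages

for stage in trace_shapes():
    print(stage)
```

Under these assumptions the bottleneck works at 32×32, which is where the large receptive field mentioned above comes from.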
[0032] The training set is input into the deep background generation network for forward propagation to obtain the output image. The output image is then compared pixel by pixel with the supervision label in the training set (i.e., the smoothed reference image) to calculate the reconstruction loss value, which is a weighted combination of a mean absolute error loss and a structural similarity loss. Based on the reconstruction loss value, the gradients of the learnable parameters of each layer in the deep background generation network are calculated using the backpropagation algorithm, and the learnable parameters of each layer are updated using the gradient descent method. The forward propagation, reconstruction loss calculation, and parameter update are repeated until the maximum number of iterations is reached, yielding the trained deep background generation network.
[0033] Furthermore, in one embodiment, the maximum number of iterations can be set to 100. This allows the deep background generation network to fully learn the reconstruction rules of the mobile phone screen background, while avoiding the added training cost and the overfitting risk that come with excessive iteration.
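A reconstruction loss of the kind described in [0032] can be sketched as below. The text gives neither the weights nor the SSIM variant, so the equal weights, the SSIM computed once over the whole image rather than over local windows, the conventional constants 0.01 and 0.03, and the [0, 1] intensity range are all assumptions for illustration.

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error over all pixels."""
    return float(np.mean(np.abs(pred - target)))

def global_ssim(pred, target, data_range=1.0):
    """SSIM evaluated once over the whole image (simplification of windowed SSIM)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = np.mean((pred - mu_p) * (target - mu_t))
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return float(num / den)

def reconstruction_loss(pred, target, w_mae=0.5, w_ssim=0.5):
    """Weighted sum of MAE and (1 - SSIM); the weights are hypothetical."""
    return w_mae * mae_loss(pred, target) + w_ssim * (1.0 - global_ssim(pred, target))
```

The MAE term penalizes per-pixel brightness error, while the SSIM term penalizes loss of structure, which matters because the residual step later depends on the background being reconstructed faithfully.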
[0034] S2.2. It should be noted that, after the smoothed image of the mobile phone screen under test obtained in step S1 is fed to the input layer of the deep background generation network, the encoder performs convolution calculations layer by layer, from shallow to deep. At each level, the corresponding convolution kernel performs a local weighted summation on its input and a bias term is added to obtain the convolution result; a non-linear activation is then applied to the convolution result to obtain the feature map of the current layer. The feature map of the current layer is downsampled to compress its spatial size and increase the number of feature channels, giving the input feature map of the next layer. As the encoder proceeds level by level, the deep background generation network extracts local grayscale texture information, mesoscale brightness change information, and overall background distribution information from the smoothed image, and retains the encoded feature map of each level. When the encoder completes the last level of downsampling, the deepest feature map is sent to the bottleneck feature processing layer, which applies further convolution and non-linear activation operations to obtain high-level features containing global background contours and large-scale brightness distribution information. These high-level features are input into the decoder, which performs upsampling layer by layer, from deep to shallow. At each upsampling level, the spatial size of the previous layer's feature map is enlarged.
The enlarged feature map is then concatenated with the encoded feature map retained at the corresponding level in the encoder, forming a fused feature map. Convolution and non-linear activation operations are performed on the fused feature map to obtain the current level's decoded feature map. Through progressive upsampling and concatenation, the deep background generation network restores the image's spatial resolution while combining high-level background semantic information with shallow edge structure information. This ensures that the feature map output by the decoder retains both the overall brightness distribution of the normal background area of the mobile phone screen and the edge transition relationships of that area. After the decoder completes the final level of feature restoration, the final level's decoded feature map is input into the output layer. The output layer performs convolution mapping operations on the final level's decoded feature map to generate a single-channel reconstruction result corresponding to the smoothed image's spatial size. This single-channel reconstruction result is then used as the preliminary background reconstruction map.
[0035] Furthermore, the layer order from shallow to deep refers to processing the downsampling structures in the encoder section according to their arrangement. The first downsampling structure corresponds to the shallow layer, and the last downsampling structure corresponds to the deep layer. The shallow layer is used to extract local grayscale texture information and edge transition information in the smoothed image, while the deep layer is used to extract mesoscale brightness variation information and overall background distribution information in the smoothed image.
[0036] S3. Perform nonlocal mean filtering on the preliminary background reconstruction image to obtain the final background image. Calculate the grayscale difference between the corresponding pixels of the smoothed image and the final background image, and take the absolute value of the grayscale difference to generate a preliminary residual image.
[0037] S3.1 It should be noted that a fixed-size square neighborhood window centered on the current pixel is set, for example, a window with a side length of 7 pixels. For each pixel in the preliminary background reconstruction image: extract the gray values of all pixels within the neighborhood window centered on that pixel's coordinates; calculate the arithmetic mean of these gray values; calculate the difference between the gray value of each pixel within the neighborhood and the arithmetic mean, and square each difference; sum the squared differences of all pixels within the neighborhood; and divide the sum by the total number of pixels within the neighborhood to obtain the grayscale variance corresponding to the pixel position.
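The per-pixel variance of S3.1 can be sketched as follows, using the 7-pixel window of the example. The reflect padding at the border and the function name are assumptions.

```python
import numpy as np

def local_variance(image, window=7):
    """Grayscale variance in the window x window neighborhood of every pixel."""
    half = window // 2
    rows, cols = image.shape
    # Assumption: reflect padding so that border pixels have a full neighborhood.
    padded = np.pad(image.astype(np.float64), half, mode="reflect")
    n = window * window

    # Pass 1: arithmetic mean of each neighborhood.
    mean = np.zeros((rows, cols))
    for dy in range(window):
        for dx in range(window):
            mean += padded[dy:dy + rows, dx:dx + cols]
    mean /= n

    # Pass 2: sum of squared differences from that mean, divided by the pixel count.
    sq_sum = np.zeros((rows, cols))
    for dy in range(window):
        for dx in range(window):
            sq_sum += (padded[dy:dy + rows, dx:dx + cols] - mean) ** 2
    return sq_sum / n
```

Flat background regions yield values near zero, while textured or edge regions yield larger values, which is what the adaptive filtering parameter in S3.2 responds to.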
[0038] S3.2 It should be noted that, after a basic smoothing intensity coefficient and an adjustment value are set, the adaptive filtering parameter corresponding to each pixel position is calculated from the grayscale variance corresponding to that pixel position. The formula is as follows: [formula not reproduced]; where the symbols denote, in order, the adaptive filtering parameter corresponding to the pixel position, the basic smoothing intensity coefficient, the adjustment value, and the grayscale variance corresponding to the pixel position.
[0039] Furthermore, the base smoothing intensity coefficient is set to control the overall smoothing intensity of the adaptive filter. In one embodiment, the base smoothing intensity coefficient can be set to 1.5 to ensure that the smoothness and continuity of the screen background are maintained to the greatest extent while effectively suppressing background noise and residual artifacts.
[0040] The adjustment value is used to adjust the variation of the filtering intensity according to the gray-level variance corresponding to the pixel position. In one embodiment, the adjustment value can be set to 0.5, which can achieve gentle noise reduction in flat areas of the image and automatically enhance the smoothing force in textured or edge areas.
[0041] S3.3. It should be noted that a fixed-size search window and a fixed-size similarity block window are set for each target pixel position in the preliminary background reconstruction image (for example, in one embodiment, the size of the search window is fixed at 21 pixels × 21 pixels, and the size of the similarity block window is fixed at 7 pixels × 7 pixels). For each target pixel position, traverse each candidate pixel position in the corresponding search window (each pixel position in the search window is a candidate pixel position), extract the gray values of all pixels in the similarity block window centered on the target pixel position to form a target pixel block, and extract the gray values of all pixels in the similarity block window centered on the candidate pixel position to form a candidate pixel block. Calculate the squared difference between the gray values of corresponding pixels in the target pixel block and the candidate pixel block, and sum all the squared differences to obtain the sum of squared inter-block distances. Calculate the quotient of the sum of squared inter-block distances with the square of the adaptive filtering parameter corresponding to the target pixel position to obtain the normalized distance. Perform a negative exponential operation on the normalized distance, that is, an exponential operation with the natural constant e as the base and the negative normalized distance as the exponent, to obtain the weight value of the candidate pixel position relative to the target pixel position. After traversing all candidate pixel positions within the search window and calculating their corresponding weight values, the original grayscale value of each candidate pixel position in the preliminary background reconstruction image is multiplied by its corresponding weight value. All products are summed to obtain a weighted grayscale sum, and all weight values are summed to obtain the total weight. 
The weighted grayscale sum is then divided by the total weight to obtain the grayscale value of the target pixel position in the final background image. The above weight calculation, weighted summation, and normalization operations based on the corresponding adaptive filtering parameters are sequentially performed on all target pixel positions in the preliminary background reconstruction image to generate the complete final background image.
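The weight calculation, weighted summation, and normalization of step S3.3 can be sketched as follows. This is a minimal, unoptimized illustration rather than the claimed implementation: the function name `adaptive_nlm`, the per-pixel parameter array `h_map`, and the reflective border padding are assumptions introduced only for this example.

```python
import numpy as np

def adaptive_nlm(image, h_map, search=21, patch=7):
    """Non-local means where h_map[y, x] is the adaptive filtering parameter."""
    img = image.astype(np.float64)
    rows, cols = img.shape
    pr, sr = patch // 2, search // 2
    # Border handling is not specified in the text; reflection is assumed here.
    padded = np.pad(img, pr + sr, mode="reflect")
    out = np.empty_like(img)
    for y in range(rows):
        for x in range(cols):
            cy, cx = y + pr + sr, x + pr + sr
            target = padded[cy - pr:cy + pr + 1, cx - pr:cx + pr + 1]
            weighted_sum = 0.0
            total_weight = 0.0
            for dy in range(-sr, sr + 1):
                for dx in range(-sr, sr + 1):
                    qy, qx = cy + dy, cx + dx
                    cand = padded[qy - pr:qy + pr + 1, qx - pr:qx + pr + 1]
                    # Sum of squared inter-block distances.
                    dist = np.sum((target - cand) ** 2)
                    # Negative exponential of the normalized distance.
                    w = np.exp(-dist / (h_map[y, x] ** 2))
                    weighted_sum += w * padded[qy, qx]
                    total_weight += w
            out[y, x] = weighted_sum / total_weight
    return out
```

Because every output value is a weighted average of input values, a constant image passes through unchanged, which is a convenient sanity check.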
[0042] For each pair of pixel positions with the same coordinates in the smoothed image and the final background image, the two grayscale values are read and their difference is calculated. The absolute values of these differences are arranged according to their original spatial positions to form a new grayscale image, which is used as the preliminary residual image.
[0043] S4. Perform spectral residual saliency calculation on the preliminary residual image to obtain a frequency domain visual saliency map.
[0044] S4.1 It should be noted that the pixel array of the preliminary residual image is used as the input two-dimensional grayscale matrix. For each row of the two-dimensional grayscale matrix, all grayscale values of the corresponding row are extracted to form a one-dimensional real number sequence. A one-dimensional fast Fourier transform algorithm is applied to the one-dimensional real number sequence. The one-dimensional fast Fourier transform algorithm, through a butterfly operation structure, converts the real number sequence into a one-dimensional complex number sequence representing the frequency domain components. After completing the independent one-dimensional fast Fourier transform of all rows, all the resulting one-dimensional complex number sequences are arranged in the original row order to form an intermediate complex number matrix after row transformation. For each column of the intermediate complex number matrix, all complex elements of the corresponding column are extracted to form a one-dimensional complex number sequence. The one-dimensional fast Fourier transform algorithm is applied again to the one-dimensional complex number sequence to complete the frequency domain transformation in the column direction. After performing independent one-dimensional fast Fourier transforms on all columns, arrange all the resulting one-dimensional complex sequences in the original column order to form the final complex matrix after column transformation; use the final complex matrix as the complex spectrum of the preliminary residual map; each complex element in the complex spectrum contains a real part and an imaginary part.
[0045] For each complex element, square the real part and the imaginary part, sum the two squares, and take the square root of the sum to obtain the amplitude value of that element; arrange the amplitude values according to their positions in the complex spectrum to form the amplitude spectrum of the preliminary residual image. Take the natural logarithm (base e) of each amplitude value and arrange the results in their original positions to form the logarithmic amplitude spectrum of the preliminary residual image. Calculate the phase angle of each complex element in the complex spectrum using the four-quadrant arctangent function, which takes the imaginary part and the real part of the element as its two arguments, and arrange the phase angles according to their positions in the complex spectrum to form the phase spectrum of the preliminary residual image.
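The row-then-column transform of S4.1 and the derived spectra can be sketched with NumPy as below. The function name `spectra` and the small constant `eps`, which guards against taking the logarithm of zero, are assumptions made for this illustration and are not stated in the text.

```python
import numpy as np

def spectra(residual, eps=1e-12):
    """Return the complex spectrum, log-amplitude spectrum and phase spectrum."""
    r = residual.astype(np.float64)
    row_fft = np.fft.fft(r, axis=1)          # 1-D FFT applied to every row
    spectrum = np.fft.fft(row_fft, axis=0)   # 1-D FFT applied to every column
    amplitude = np.sqrt(spectrum.real ** 2 + spectrum.imag ** 2)
    log_amplitude = np.log(amplitude + eps)  # eps is an assumption (avoids log(0))
    # Four-quadrant arctangent: imaginary part first, real part second.
    phase = np.arctan2(spectrum.imag, spectrum.real)
    return spectrum, log_amplitude, phase
```

Applying the one-dimensional transform along rows and then along columns gives the same result as a direct two-dimensional FFT such as `np.fft.fft2`.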
[0046] S4.2 It should be noted that the gray-level gradient components in the horizontal and vertical directions are calculated at each pixel position in the preliminary residual image. [The two gradient formulas are not reproduced in this text.] In these formulas, $G_x(x, y)$ denotes the horizontal gray-level gradient component calculated at pixel position $(x, y)$; $G_y(x, y)$ denotes the vertical gray-level gradient component calculated at pixel position $(x, y)$; $x$ denotes the column coordinate of the pixel in the preliminary residual image; $y$ denotes the row coordinate of the pixel in the preliminary residual image; and $R(x, y)$ denotes the grayscale value of the preliminary residual image at pixel position $(x, y)$.
[0047] For each pixel location, the absolute values of the corresponding horizontal and vertical gradient components are summed to obtain the gradient magnitude at that pixel location. The arithmetic mean of the gradient magnitudes at all pixel locations in the preliminary residual map is calculated and used as the average gradient of the preliminary residual map. A preset base cutoff frequency and adjustment factor are established. The average gradient is multiplied by the adjustment factor to obtain the adjustment amount. The adjustment amount is summed with the value 1 to obtain the adjustment coefficient. The base cutoff frequency is multiplied by the adjustment coefficient to obtain the cutoff frequency of the low-pass filter.
[0048] Furthermore, the base cutoff frequency is set to determine the default strength benchmark for the low-pass filter to suppress high-frequency background noise; in one embodiment, the base cutoff frequency may be set to 0.05 to provide an initial smoothing scale for the filter.
[0049] The adjustment factor sets how strongly the base cutoff frequency is scaled according to the average gradient of the preliminary residual image. In one embodiment, the adjustment factor can be set to 0.1, so that the cutoff frequency responds moderately, stably, and linearly to the average gradient, avoiding over-adjustment that could make the filtering behavior unstable.
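The cutoff-frequency rule, base cutoff frequency × (1 + adjustment factor × average gradient), can be illustrated with the short sketch below. The gradient operator is an assumption: forward differences between adjacent pixels are used here purely for illustration, and the function name `adaptive_cutoff` is hypothetical.

```python
import numpy as np

def adaptive_cutoff(residual, base_cutoff=0.05, factor=0.1):
    """Cutoff frequency = base_cutoff * (1 + factor * average gradient)."""
    r = residual.astype(np.float64)
    gx = np.zeros_like(r)
    gy = np.zeros_like(r)
    # Assumed gradient operator: forward differences along columns and rows.
    gx[:, :-1] = r[:, 1:] - r[:, :-1]
    gy[:-1, :] = r[1:, :] - r[:-1, :]
    magnitude = np.abs(gx) + np.abs(gy)      # sum of absolute components
    mean_gradient = magnitude.mean()         # average gradient of the image
    return base_cutoff * (1.0 + factor * mean_gradient)
```

For a perfectly flat image the average gradient is zero and the function returns the base cutoff frequency of 0.05. Note that the result depends on the gray-value scale of the input (for example 0 to 1 versus 0 to 255), which the text does not fix.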
[0050] S4.3. It should be noted that, based on the determined cutoff frequency of the low-pass filter, a frequency response function of a two-dimensional ideal low-pass filter or a Gaussian low-pass filter is constructed. The center position of the frequency response function in the frequency domain coordinate system is defined as the origin. For each frequency coordinate point in the logarithmic amplitude spectrum of the preliminary residual image, the normalized radial frequency of the frequency coordinate point from the frequency domain origin is calculated using the following formula: $D(u, v) = \sqrt{\left(\frac{u - u_0}{W}\right)^2 + \left(\frac{v - v_0}{H}\right)^2}$; where $D(u, v)$ denotes the normalized radial frequency calculated at frequency coordinate point $(u, v)$; $u$ denotes the horizontal coordinate (column index) of a frequency point in the frequency domain (Fourier spectrum); $u_0$ denotes the horizontal coordinate of the frequency domain origin; $v$ denotes the vertical coordinate (row index) of a frequency point in the frequency domain; $v_0$ denotes the vertical coordinate of the frequency domain origin; $H$ denotes the height of the logarithmic amplitude spectrum of the preliminary residual image; and $W$ denotes the width of the logarithmic amplitude spectrum of the preliminary residual image.
[0051] The normalized radial frequency is compared with the low-pass filter cutoff frequency: if the normalized radial frequency is less than or equal to the low-pass filter cutoff frequency, the value of the frequency response function at the corresponding coordinate point is set to 1; if the normalized radial frequency is greater than the low-pass filter cutoff frequency, the value of the frequency response function at the corresponding coordinate point is set to 0. The logarithmic amplitude spectrum of the preliminary residual plot is multiplied element-wise with the value of the frequency response function at the same frequency coordinate point. After completing the multiplication operation at all frequency coordinate points, a new spectrum matrix is obtained, i.e., the smoothed logarithmic amplitude spectrum. The difference between the logarithmic amplitude spectrum of the preliminary residual plot and the smoothed logarithmic amplitude spectrum is calculated to obtain the frequency domain residual spectrum.
[0052] Based on the amplitude (real number) of the residual spectrum at each frequency coordinate point and the phase angle (real number) of the phase spectrum at the corresponding frequency coordinate point, the corresponding complex numbers are calculated using Euler's formula. After completing the complex number calculations for all frequency coordinate points, a reconstructed complex spectrum is obtained. A two-dimensional inverse fast Fourier transform is applied to the reconstructed complex spectrum. The two-dimensional inverse fast Fourier transform performs a one-dimensional inverse transform on each row of the reconstructed complex spectrum, and then performs a one-dimensional inverse transform on each column of the intermediate results, converting the signal from the frequency domain back to the spatial domain, resulting in an inversely transformed complex matrix. The magnitude of each complex element in the inversely transformed complex matrix is calculated (i.e., the square root is calculated after summing the squares of the real and imaginary parts of each complex element). All the calculated magnitudes are arranged to form a real value matrix. The real value matrix is then normalized (min-max normalization) to obtain a frequency domain visual saliency map.
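Steps S4.1 to S4.3 can be put together in one sketch using an ideal low-pass mask. Several details are assumptions made for this example: the spectrum is shifted with `np.fft.fftshift` so that the origin sits at the center, the radial frequency is normalized by dividing horizontal offsets by the width and vertical offsets by the height, and `eps` guards the logarithm. The function name is hypothetical.

```python
import numpy as np

def spectral_residual_saliency(residual, cutoff, eps=1e-12):
    r = residual.astype(np.float64)
    rows, cols = r.shape
    # Shift so that the zero-frequency component is at (rows // 2, cols // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(r))
    log_amp = np.log(np.abs(spectrum) + eps)
    phase = np.angle(spectrum)
    v0, u0 = rows // 2, cols // 2
    v, u = np.mgrid[0:rows, 0:cols]
    # Assumed normalization of the radial frequency (see lead-in).
    radius = np.sqrt(((u - u0) / cols) ** 2 + ((v - v0) / rows) ** 2)
    response = (radius <= cutoff).astype(np.float64)   # ideal low-pass: 1 or 0
    smoothed = log_amp * response                      # element-wise product
    spectral_residual = log_amp - smoothed
    # Euler's formula: amplitude * (cos(phase) + i * sin(phase)).
    rebuilt = spectral_residual * (np.cos(phase) + 1j * np.sin(phase))
    back = np.fft.ifft2(np.fft.ifftshift(rebuilt))
    magnitude = np.abs(back)
    lo, hi = magnitude.min(), magnitude.max()
    if hi == lo:
        return np.zeros_like(magnitude)
    return (magnitude - lo) / (hi - lo)                # min-max normalization
```

The returned array has the same shape as the input and values in [0, 1], so it can be used directly as a per-pixel weight in later steps.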
[0053] S5. Based on the frequency domain visual saliency map, calculate the contrast limiting parameter, and perform contrast-limited histogram equalization on the preliminary residual map according to the contrast limiting parameter to generate an enhanced residual map.
[0054] S5.1 It should be noted that the preliminary residual image is used as the 0th layer image of the Gaussian pyramid. Gaussian filtering is applied to the 0th layer image to obtain a smoothed image, which is then downsampled, typically by keeping every other row and every other column, so that its width and height are each halved. The resulting image is used as the 1st layer image of the Gaussian pyramid. This process is repeated, applying Gaussian filtering and downsampling to the kth layer image to generate the (k+1)th layer image, until the preset number of pyramid layers (e.g., 3 layers) is reached. The resulting set of images, from the original scale to progressively coarser scales, is the Gaussian pyramid of the preliminary residual image. The same downsampling operation is performed on the frequency domain visual saliency map to generate downsampled frequency domain visual saliency maps that match the scale of each layer of the Gaussian pyramid. Specifically, for the kth layer image of the Gaussian pyramid, the frequency domain visual saliency map is downsampled k times in succession (Gaussian filtering can be performed before each downsampling to prevent aliasing), so that the downsampled frequency domain visual saliency map has the same width and height as the kth layer image of the Gaussian pyramid.
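A minimal sketch of the pyramid construction follows. The 5-tap kernel [1, 4, 6, 4, 1] / 16 and the reflective border padding are assumptions, since the text does not specify the Gaussian kernel; the function names are hypothetical.

```python
import numpy as np

# Assumed separable Gaussian approximation; the text does not fix the kernel.
KERNEL = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0

def gaussian_blur(image):
    rows, cols = image.shape
    padded = np.pad(image.astype(np.float64), 2, mode="reflect")
    # Horizontal pass, then vertical pass.
    horiz = sum(KERNEL[k] * padded[:, k:k + cols] for k in range(5))
    return sum(KERNEL[k] * horiz[k:k + rows, :] for k in range(5))

def build_pyramid(image, levels=3):
    """Layer 0 is the input; each further layer is blurred and halved."""
    pyramid = [image.astype(np.float64)]
    for _ in range(levels - 1):
        blurred = gaussian_blur(pyramid[-1])
        pyramid.append(blurred[::2, ::2])    # keep every other row and column
    return pyramid
```

Calling `build_pyramid` on both the preliminary residual image and the saliency map yields two lists whose layers have matching sizes, which is what the per-pixel contrast limit lookup requires.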
[0055] Read the value at the current pixel position in the downsampled frequency domain visual saliency map (a real number in the range [0,1]). A base contrast limit value and an adjustment gain coefficient are preset. The contrast limiting parameter to be used when performing contrast-limited histogram equalization on the corresponding scale image at the current pixel position is calculated by the formula: $C = C_0 \left(1 + \alpha S\right)$; where $C$ denotes the contrast limiting parameter; $C_0$ denotes the base contrast limit value; $\alpha$ denotes the adjustment gain coefficient; and $S$ denotes the value at the current pixel position in the downsampled frequency domain visual saliency map.
[0056] Furthermore, the base contrast limit is set to provide a global, default upper limit for contrast stretching for contrast-limited histogram equalization; in one embodiment, the base contrast limit may be set to 2.0, providing a moderate global contrast stretching baseline and avoiding excessive amplification of noise in flat areas.
[0057] The adjustment gain coefficient sets how strongly the base contrast limit value is scaled according to the local information provided by the frequency domain visual saliency map. In one embodiment, the adjustment gain coefficient can be set to 1.5, so that the contrast limiting parameter responds linearly and with sufficient sensitivity to the saliency information, achieving stronger contrast enhancement in defect areas without compromising the smoothness of non-salient areas.
[0058] S5.2 It should be noted that each layer image of the Gaussian pyramid is divided into several non-overlapping rectangular blocks. For each pixel position in the image, the corresponding contrast limiting parameter is read. Taking the rectangular block containing the current pixel position together with its adjacent rectangular blocks as the local region, the gray-level histogram of all pixels in the local region is calculated. Specifically, the gray values of the pixels in the local region are counted according to preset gray-level intervals. In one embodiment, the total number of gray levels is set to 256, and 256 gray-level statistical intervals are established for gray values from 0 to 255. The gray value of each pixel in the local region is read in turn, the statistical interval into which it falls is determined, and the pixel count of that interval is incremented by 1. After all pixels in the local region have been counted, the pixel counts of the statistical intervals constitute the gray-level histogram of the local region. Based on the calculated contrast limiting parameter, the histogram is clipped so that the pixel count at any gray level does not exceed the contrast limiting parameter. The excess pixel count removed by clipping is then redistributed evenly across all gray levels. Based on the redistributed histogram, the pixel counts are accumulated in ascending order of gray level to obtain the cumulative pixel count for each gray level. The cumulative pixel count for each gray level is then divided by the total number of pixels in the local region to obtain the cumulative distribution value for each gray level. These cumulative distribution values form the cumulative distribution function, which is used to map the original grayscale value at the current pixel position to a new grayscale value. This process of histogram clipping, redistribution, and mapping based on the corresponding contrast limiting parameter is repeated for all pixel positions in the current scale image, completing the contrast-limited histogram equalization of that scale image and yielding a contrast-enhanced image. The same processing is performed in turn on the images at all scales of the Gaussian pyramid to obtain a contrast-enhanced image at each scale.
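The clipping, redistribution, and mapping for a single pixel can be sketched as below. Three points are assumptions for illustration only: the contrast limiting parameter is taken to follow the linear form base × (1 + gain × saliency), the clip limit is applied as an absolute per-bin pixel count as the text literally states, and the cumulative distribution value is scaled by 255 to produce the new gray value. Function names are hypothetical.

```python
import numpy as np

def contrast_limit(saliency, base=2.0, gain=1.5):
    # Assumed linear form of the contrast limiting parameter.
    return base * (1.0 + gain * saliency)

def clip_and_map(region, gray, clip_limit, levels=256):
    """Map one gray value through the clipped-histogram CDF of its local region."""
    values = region.ravel().astype(np.int64)
    hist = np.bincount(values, minlength=levels).astype(np.float64)
    # Pixel count above the limit, summed over all gray levels.
    excess = np.sum(np.maximum(hist - clip_limit, 0.0))
    # Clip, then spread the excess evenly over every gray level.
    hist = np.minimum(hist, clip_limit) + excess / levels
    cdf = np.cumsum(hist) / region.size
    # Assumed output scaling: CDF value times the maximum gray level.
    return cdf[gray] * (levels - 1)
```

With the embodiment values, `contrast_limit` ranges from 2.0 where saliency is 0 to 5.0 where saliency is 1. Because the excess is redistributed only once, a bin may end up slightly above the limit afterwards; production implementations often iterate the redistribution.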
[0059] Upsampling is performed on the contrast-enhanced images at each scale of the Gaussian pyramid. Upsampling uses bilinear interpolation, interpolating and enlarging each scale image in both width and height until the image size is exactly the same as the original size of the initial residual image, resulting in a set of uniformly sized upsampled contrast-enhanced images. A preset fixed fusion weight is assigned to each upsampled contrast-enhanced image, with the sum of the weights across all layers being 1. The grayscale value of each pixel in each upsampled contrast-enhanced image is multiplied by its corresponding fusion weight to obtain a weighted pixel value. The weighted pixel values at the same pixel coordinate position in all upsampled contrast-enhanced images are summed; the sum is the grayscale value of the corresponding pixel position in the enhanced residual image. This weighted summation operation is repeated for all pixel positions to finally generate the complete enhanced residual image.
[0060] Furthermore, the fixed fusion weights are pre-set based on the emphasis on representing local defect details and overall smooth background information in the images at each scale in the Gaussian pyramid. In one embodiment with a preset pyramid of 3 layers, the fixed fusion weights of the upsampled contrast-enhanced image corresponding to the original scale, the upsampled contrast-enhanced image corresponding to the first scale, and the upsampled contrast-enhanced image corresponding to the second scale can be set to 0.50, 0.30, and 0.20, respectively.
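The upsampling and weighted fusion can be sketched as follows. The bilinear resize uses pixel-center alignment, which is an assumed convention, and the function names are hypothetical; the weights are the 0.50, 0.30, and 0.20 of the three-layer embodiment.

```python
import numpy as np

def bilinear_resize(image, out_h, out_w):
    in_h, in_w = image.shape
    # Source coordinates for each output pixel (pixel-center alignment assumed).
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def fuse(enhanced_levels, weights=(0.50, 0.30, 0.20)):
    """Upsample every layer to the layer-0 size and take the weighted sum."""
    h, w = enhanced_levels[0].shape
    return sum(
        wt * bilinear_resize(img.astype(np.float64), h, w)
        for wt, img in zip(weights, enhanced_levels)
    )
```

Since the weights sum to 1, the fused image stays within the gray-value range of its inputs.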
[0061] S6. Multiply the frequency domain visual saliency map and the enhanced residual map element by element and normalize the result to obtain a weighted defect response map. Perform threshold segmentation and morphological processing on the weighted defect response map to extract defect candidate regions.
[0062] S6.1 It should be noted that, for each pixel position, the grayscale value of the frequency domain visual saliency map and the grayscale value of the enhanced residual map at that position are read and multiplied to obtain the product for that position. The products of all pixel positions are arranged according to their original spatial positions to form an element-wise product image. The minimum and maximum of all pixel values in the element-wise product image are found, and min-max normalization is performed. The normalized values of all pixel positions are arranged according to their original positions, and the resulting image is the weighted defect response map.
[0063] Compute the histogram of all pixel grayscale values in the weighted defect response map to obtain the frequency of each gray level between 0 and 1 (or over the corresponding normalized integer range). Initialize a threshold variable and calculate the overall mean grayscale value of the weighted defect response map. Starting from the lowest gray level, take each gray level in turn as a candidate threshold that divides the image pixels into two classes: foreground (grayscale value greater than the candidate threshold) and background (grayscale value less than or equal to the candidate threshold). Calculate the proportion of foreground pixels among all pixels in the image, the mean grayscale value of the foreground pixels, the proportion of background pixels among all pixels in the image, and the mean grayscale value of the background pixels. Calculate the inter-class variance for the current candidate threshold using the inter-class variance formula: $\sigma^2(t) = \omega_f(t)\,\omega_b(t)\,\left[\mu_f(t) - \mu_b(t)\right]^2$; where $\sigma^2(t)$ denotes the inter-class variance calculated when the candidate threshold is $t$; $\omega_f(t)$ denotes the proportion of foreground pixels among all pixels when the candidate threshold is $t$; $\omega_b(t)$ denotes the proportion of background pixels among all pixels when the candidate threshold is $t$; $\mu_f(t)$ denotes the mean grayscale value of the foreground pixels when the candidate threshold is $t$; and $\mu_b(t)$ denotes the mean grayscale value of the background pixels when the candidate threshold is $t$.
[0064] Iterate through all possible candidate thresholds and repeatedly calculate the inter-class variance. Compare the inter-class variances corresponding to all candidate thresholds, and determine the candidate threshold that maximizes the inter-class variance as the globally optimal binarization threshold for the weighted defect response map.
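The element-wise weighting and the exhaustive threshold search can be sketched as below. Quantizing the normalized response to 256 gray levels is an assumption, and the inter-class variance is written in the standard product form, foreground proportion × background proportion × squared difference of the class means. Function names are hypothetical.

```python
import numpy as np

def weighted_response(saliency, enhanced):
    """Element-wise product followed by min-max normalization."""
    product = saliency * enhanced
    lo, hi = product.min(), product.max()
    if hi == lo:
        return np.zeros_like(product)
    return (product - lo) / (hi - lo)

def otsu_threshold(response, levels=256):
    """Return the threshold in [0, 1] that maximizes the inter-class variance."""
    q = np.round(response * (levels - 1)).astype(int)   # assumed quantization
    hist = np.bincount(q.ravel(), minlength=levels) / q.size
    gray = np.arange(levels)
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_b = hist[:t + 1].sum()      # background: gray value <= t
        w_f = hist[t + 1:].sum()      # foreground: gray value > t
        if w_b == 0 or w_f == 0:
            continue
        mu_b = (gray[:t + 1] * hist[:t + 1]).sum() / w_b
        mu_f = (gray[t + 1:] * hist[t + 1:]).sum() / w_f
        var = w_f * w_b * (mu_f - mu_b) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t / (levels - 1)
```

For a clearly bimodal response the returned threshold falls between the two modes; when several candidate thresholds tie, this sketch keeps the lowest one.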
[0065] S6.2 It should be noted that, when reading the grayscale value of a pixel in the weighted defect response image, the corresponding grayscale value is compared with the global optimal binarization threshold. If the grayscale value of the corresponding pixel is greater than or equal to the global optimal binarization threshold, the output value of the corresponding pixel is set to 1 (representing the foreground); if the grayscale value of the corresponding pixel is less than the global optimal binarization threshold, the output value of the corresponding pixel is set to 0 (representing the background). After completing the above comparison and assignment operations for all pixel positions in the weighted defect response image, the output values of all pixel positions are arranged according to their original spatial positions to form a new image, i.e., a binary image.
[0066] S6.3. It should be noted that morphological closing is performed on the binary image. Morphological closing is composed of a sequential combination of morphological dilation and morphological erosion operations. First, a morphological dilation operation is performed on the binary image, followed by a morphological erosion operation on the dilated result. The morphological dilation operation is as follows: Define a structuring element (e.g., a 3×3 square structuring element), slide the structuring element on the binary image, and if at least one pixel in the binary image region covered by the structuring element has a value of 1, then set the pixel position in the output image corresponding to the center of the structuring element to 1. The morphological erosion operation is as follows: Slide the same structuring element on the binary image, and only set the pixel position in the output image corresponding to the center of the structuring element to 1 when all pixel values in the binary image region covered by the structuring element are 1. After completing the closing operation, an intermediate binary image is obtained. A morphological opening operation is then performed on the intermediate binary image. Morphological opening is a combination of morphological erosion and morphological dilation operations. First, a morphological erosion operation is performed on the intermediate binary image, followed by a morphological dilation operation on the eroded result. The structuring element used is the same as that used in closing. After the opening operation, the final output image is the morphologically processed binary image.
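The closing-then-opening sequence with a 3×3 square structuring element can be sketched as follows. Treating everything outside the image as background (zero padding) is an assumption about border handling that the text does not address; function names are hypothetical.

```python
import numpy as np

def dilate(binary, size=3):
    """Output is 1 where at least one covered pixel is 1."""
    r = size // 2
    padded = np.pad(binary, r, mode="constant", constant_values=0)
    rows, cols = binary.shape
    out = np.zeros_like(binary)
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + rows, dx:dx + cols]
    return out

def erode(binary, size=3):
    """Output is 1 only where every covered pixel is 1."""
    r = size // 2
    # Assumed border rule: pixels outside the image count as background.
    padded = np.pad(binary, r, mode="constant", constant_values=0)
    rows, cols = binary.shape
    out = np.ones_like(binary)
    for dy in range(size):
        for dx in range(size):
            out &= padded[dy:dy + rows, dx:dx + cols]
    return out

def close_then_open(binary, size=3):
    closed = erode(dilate(binary, size), size)    # closing fills small holes
    return dilate(erode(closed, size), size)      # opening removes small specks
```

The input is expected to be an integer array of zeros and ones (for example `np.uint8`). On a solid block with a one-pixel hole plus an isolated pixel elsewhere, the closing fills the hole and the opening removes the isolated pixel.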
[0067] Scan the morphologically processed binary image to find the first unvisited foreground pixel as a seed. Using region growing or a two-pass scanning method, starting from the corresponding seed, search and label all foreground pixels connected to it according to the 4-connectivity or 8-connectivity rule, assigning each pixel a unique label to form a connected region. Continue scanning the morphologically processed binary image to find the next unlabeled foreground pixel as a new seed, repeating the connected region labeling process until all foreground pixels in the image have been visited and labeled. At this point, the set of pixels with the same label in the morphologically processed binary image constitutes an independent connected region. For each identified connected region, read the row and column coordinates of all pixels within the region, and find the minimum and maximum values of these row coordinates, as well as the minimum and maximum values of the column coordinates. The rectangle defined by these four values (minimum row coordinate, maximum row coordinate, minimum column coordinate, maximum column coordinate) is the minimum bounding rectangle of the corresponding connected region. The minimum bounding rectangle is completely determined by the row and column coordinates of its top-left vertex, as well as the height and width of the rectangle. Each of the smallest bounding rectangles calculated in this way represents an independent location of a suspected defect, thus constituting a defect candidate region.
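The seed-based labeling and bounding-rectangle extraction can be sketched in plain Python as below. This version uses breadth-first region growing with 8-connectivity, one of the options the text allows; the function name and the (top, left, height, width) tuple layout are choices made for this example.

```python
from collections import deque

def bounding_boxes(binary):
    """Return (top, left, height, width) for every 8-connected foreground region."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # New seed: grow the region and track its coordinate extremes.
            queue = deque([(r, c)])
            seen[r][c] = True
            rmin = rmax = r
            cmin = cmax = c
            while queue:
                y, x = queue.popleft()
                rmin, rmax = min(rmin, y), max(rmax, y)
                cmin, cmax = min(cmin, x), max(cmax, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((rmin, cmin, rmax - rmin + 1, cmax - cmin + 1))
    return boxes
```

Regions are reported in raster-scan order of their seeds, so the output order is deterministic for a given image.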
[0068] S7. For each defect candidate region, extract dual-channel image patches from the smoothed image and the final background image, and input them into the dynamic multi-task recognition network for defect analysis to obtain defect detection results.
[0069] S7.1 It should be noted that, the defect candidate region is read, and the corresponding pixel region in the smoothed image and the final background image is determined based on the coordinates of the upper left corner, height, and width of the corresponding minimum bounding rectangle. From the smoothed image, an image block with the corresponding pixel region as its boundary is cropped; this image block is a single-channel grayscale image block. From the final background image, an image block with the same position and size is cropped; this is also a single-channel grayscale image block. The image blocks cropped from the smoothed image and the image blocks cropped from the final background image are stacked in the channel dimension. In the stacking operation, the image block cropped from the smoothed image is used as the first channel, and the image block cropped from the final background image is used as the second channel. The new image block generated after stacking contains two channels, where each pixel position contains two grayscale values, respectively from the smoothed image and the final background image; the new image block is a dual-channel image block. The above cropping and stacking operations are repeated for all defect candidate regions to obtain multiple dual-channel image blocks with the same number as the defect candidate regions.
[0070] The dual-channel image patch is scaled to a preset fixed size, such as 64 pixels by 64 pixels, using bilinear interpolation. The two channels of the scaled dual-channel image patch are then normalized (Z-score normalization).
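Cropping, channel stacking, and per-channel Z-score normalization can be sketched as follows. The channel-first layout and the small constant `eps` that protects against division by zero are assumptions; the bilinear resize to 64×64 is left out for brevity and would sit between stacking and normalization. The function name is hypothetical.

```python
import numpy as np

def dual_channel_patch(smoothed, background, box, eps=1e-8):
    """Crop the same rectangle from both images and stack as channels 1 and 2."""
    top, left, height, width = box
    window = (slice(top, top + height), slice(left, left + width))
    # Channel 0: smoothed image; channel 1: final background image.
    patch = np.stack([smoothed[window], background[window]], axis=0)
    patch = patch.astype(np.float64)
    # Z-score normalization computed separately for each channel.
    mean = patch.mean(axis=(1, 2), keepdims=True)
    std = patch.std(axis=(1, 2), keepdims=True)
    return (patch - mean) / (std + eps)    # eps is an assumption
```

Each returned channel has approximately zero mean and unit standard deviation, which keeps the two inputs on a comparable scale for the network.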
[0071] S7.2 It should be noted that a large number of accurately annotated historical dual-channel image patches are collected, and their annotation information is used as supervision labels to form a training set. The annotation information includes defect categories, pixel-level segmentation masks, and quantitative parameters (such as contrast and aspect ratio).
[0072] The dynamic multi-task recognition network consists of a weight generation subnetwork, a dynamic convolutional backbone, and a multi-task output layer. The weight generation subnetwork comprises two convolutional layers and one fully connected layer, and generates the kernel parameters required by the dynamic convolutional layer from the input image patch. The first layer of the dynamic convolutional backbone is a dynamic convolutional layer, whose kernel parameters are provided in real time by the weight generation subnetwork. It is followed by three static convolutional layers, each comprising convolution, batch normalization, and ReLU activation. The multi-task output layer includes three parallel branches: a classification branch (global average pooling followed by a fully connected layer, outputting class probabilities), a segmentation branch (three transposed convolutional layers, outputting a pixel-level segmentation mask), and a regression branch (global average pooling followed by a fully connected layer, outputting quantitative parameters). The input to the dynamic multi-task recognition network is a 64×64 dual-channel image patch, i.e., an input with 2 channels. The dynamic convolutional layer uses a 3×3 convolutional kernel with 2 input channels and 32 output channels. The three subsequent static convolutional layers have 64, 128, and 256 output channels, respectively. The classification branch outputs C categories (including the "normal" class), the segmentation branch outputs a single-channel mask, and the regression branch outputs a 2-dimensional vector (contrast and aspect ratio).
[0073] The training set is input into the dynamic multi-task recognition network for forward propagation to obtain prediction results. The prediction results are compared with the corresponding supervision labels, and the classification cross-entropy loss, the segmentation Dice loss, and the regression smooth L1 loss are calculated. The three losses are weighted and summed to obtain the total loss value. The backpropagation algorithm is used to calculate the gradients of all learnable parameters in the dynamic multi-task recognition network, and the learnable parameters are updated using the adaptive moment estimation (Adam) algorithm to gradually reduce the total loss value, completing one parameter update iteration. The forward propagation, loss calculation, and parameter update process is repeated until the maximum number of iterations is reached, yielding the trained dynamic multi-task recognition network.
[0074] Furthermore, in one embodiment, the weight of the classification cross-entropy loss can be set to 1.0, the weight of the segmentation Dice loss to 1.5, and the weight of the regression smooth L1 loss to 0.2. This allows the total loss to decrease smoothly during multi-task optimization and prevents any single task's loss from dominating the training direction, so that the dynamic multi-task recognition network improves classification accuracy, segmentation precision, and regression consistency together and stably, and finally converges to a balanced state that performs well on all three tasks.
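The weighted total loss for a single sample can be sketched with NumPy as below. The exact loss definitions are not given in the text, so the commonly used forms are assumed here: cross-entropy on the predicted probability of the true class, a soft Dice loss with a small smoothing term, and smooth L1 with a transition point `beta` of 1.0. All function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-12):
    return -np.log(probs[label] + eps)

def dice_loss(pred_mask, true_mask, eps=1e-6):
    inter = np.sum(pred_mask * true_mask)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred_mask) + np.sum(true_mask) + eps)

def smooth_l1(pred, target, beta=1.0):
    d = np.abs(pred - target)
    # Quadratic below beta, linear above it.
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def total_loss(probs, label, pred_mask, true_mask, pred_params, true_params,
               w_cls=1.0, w_seg=1.5, w_reg=0.2):
    """Weighted sum of the three task losses, using the embodiment's weights."""
    return (w_cls * cross_entropy(probs, label)
            + w_seg * dice_loss(pred_mask, true_mask)
            + w_reg * smooth_l1(pred_params, true_params))
```

A perfect prediction on all three tasks gives a total loss of approximately zero, and the larger segmentation weight means mask errors are penalized most heavily.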
[0075] In one embodiment, the maximum number of iterations can be set to 1000 to avoid underfitting of the model due to overly sensitive early stopping conditions or insufficient training.
[0076] S7.3 Input the normalized dual-channel image patch into the trained dynamic multi-task recognition network. The first convolutional layer of the weight generation sub-network uses a 3×3 convolutional kernel to extract features from the input, generating a 16-channel feature map. The second convolutional layer of the weight generation sub-network uses a 3×3 convolutional kernel to downsample the feature map, generating a 32-channel feature map. The fully connected layer of the weight generation sub-network flattens the feature map into a vector and performs a linear transformation, outputting a one-dimensional vector of length 576. The one-dimensional vector is reshaped into a weight tensor of size [3,3,2,32]. The weight tensor is the convolutional kernel parameter required by the dynamic convolutional layer, where the first two dimensions 3 and 3 represent the height and width of the convolutional kernel, the third dimension 2 represents the number of input channels, and the fourth dimension 32 represents the number of output channels.
[0077] The first layer of the backbone network is the dynamic convolutional layer. It uses the kernel parameters of size [3,3,2,32] to convolve the input dual-channel image patch with a stride of 1 and padding of 1, generating an initial feature map with 32 channels. The second layer of the backbone network is the first static convolutional layer, which uses kernel parameters [3,3,32,64] to convolve the initial feature map with a stride of 2 and padding of 1, followed by batch normalization and ReLU activation, generating a 64-channel feature map. The third layer of the backbone network is the second static convolutional layer, which uses kernel parameters [3,3,64,128] to convolve the feature map with a stride of 2 and padding of 1, followed by batch normalization and ReLU activation, generating a 128-channel feature map. The fourth layer of the backbone network is the third static convolutional layer, which uses kernel parameters [3,3,128,256] to convolve the feature map with a stride of 2 and padding of 1, followed by batch normalization and ReLU activation, generating a 256-channel feature map. At this point, the backbone network completes feature extraction and outputs a high-level feature tensor of size [8,8,256].
[0078] The classification branch performs global average pooling on the high-level feature tensor, converting it into a 256-dimensional feature vector. The 256-dimensional feature vector is then linearly transformed through a fully connected layer (with weight dimensions of [256, C], where C is the total number of defect categories) to generate a C-dimensional raw score. The Softmax function is applied to the raw score to convert it into a category probability distribution, where each value represents the probability of belonging to the corresponding defect category.
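As a rough illustration of the classification branch, the following NumPy sketch applies global average pooling, a linear map with weights of shape [256, C], and Softmax. The value C = 4, the random inputs, and the function name `classify` are assumptions made for the example; they do not come from the embodiment.

```python
import numpy as np

def classify(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Global average pooling -> linear map -> Softmax over C categories."""
    pooled = features.mean(axis=(0, 1))   # [H, W, 256] -> [256]
    scores = pooled @ weights             # [256] @ [256, C] -> [C] raw scores
    scores = scores - scores.max()        # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()                # probabilities summing to 1

rng = np.random.default_rng(0)
C = 4                                      # hypothetical number of defect categories
features = rng.standard_normal((8, 8, 256))
weights = rng.standard_normal((256, C)) * 0.01
probs = classify(features, weights)
print(probs, probs.sum())
```

Subtracting the maximum score before exponentiation leaves the Softmax result unchanged while avoiding overflow for large raw scores.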
[0079] The high-level feature tensor output from the backbone network is input into the first transposed convolutional layer of the segmentation branch. The first transposed convolutional layer uses kernel parameters of [4, 4, 256, 128], with a stride of 2 and padding of 1, to perform a transposed convolution operation, generating a feature map of size [16, 16, 128], to which the ReLU activation function is applied. The output of the first transposed convolutional layer is then input into the second transposed convolutional layer, which uses kernel parameters of [4, 4, 128, 64], with a stride of 2 and padding of 1, to perform a transposed convolution operation, generating a feature map of size [32, 32, 64], to which the ReLU activation function is applied. The output of the second transposed convolutional layer is input into the third transposed convolutional layer, which uses kernel parameters of [4, 4, 64, 1], with a stride of 2 and padding of 1, to perform a transposed convolution operation, generating an output image of size [64, 64, 1]. The Sigmoid function is applied to each pixel value in the output image to obtain a pixel-level segmentation mask, where each pixel value represents the probability that the corresponding position belongs to the defect region.
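The feature-map sizes in the segmentation branch follow from the usual transposed-convolution size formula. The sketch below assumes no output padding and a dilation of 1, neither of which is stated in the text; the helper name `deconv_out` is hypothetical.

```python
def deconv_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Transposed-convolution output size, assuming no output padding and dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel

size = 8  # spatial size of the [8, 8, 256] backbone output
sizes = []
for channels in (128, 64, 1):  # output channels of the three transposed layers
    size = deconv_out(size)
    sizes.append((size, size, channels))
    print(f"[{size}, {size}, {channels}]")
```

With a 4×4 kernel, stride 2, and padding 1, each layer exactly doubles the spatial size, so three layers restore the 8×8 tensor to a 64×64 mask.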
[0080] The regression branch performs a global average pooling operation on the high-level feature tensor output by the backbone network; that is, the feature values at all positions in the spatial dimensions (height and width) of the high-level feature tensor are averaged to obtain a 256-dimensional feature vector. The 256-dimensional feature vector is then input into the fully connected layer of the regression branch. The fully connected layer multiplies the 256-dimensional feature vector by the weight matrix and adds the bias vector to output the quantization parameters, including contrast and aspect ratio.
[0081] The category index with the highest probability value is extracted from the category probability distribution output by the classification branch. The corresponding category is recorded as the predicted defect type, and the corresponding probability value is recorded as the confidence score. Simultaneously, the pixel-level segmentation probability mask output by the segmentation branch is binarized using a threshold of 0.5. Pixels with a probability value greater than or equal to 0.5 are set to 1, and the rest are set to 0, resulting in an accurate binarized defect region mask. Based on the binarized defect region mask, the actual pixel area of the defect and the precise coordinates and dimensions of its minimum bounding rectangle are calculated. A comprehensive judgment is made according to preset judgment rules: if the predicted defect type is "normal" or "no defect," and the corresponding confidence score is higher than the threshold (the threshold is set to distinguish between non-defects and real defects, for example, 0.9), the current defect candidate region is determined to be non-defect; otherwise, it is determined to be a real defect. The predicted defect category, the precise shape and position defined by the binarized defect region mask, the calculated defect area, and the contrast and aspect ratio in the quantization parameters are combined to form a structured data record. This data record represents the complete defect detection result for the current defect candidate region. Repeat the above steps of fusion, judgment and result generation for all defect candidate regions, summarize the structured data of all real defects, and form a defect detection result list containing all defect information.
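The fusion and judgment steps for a single candidate region might look like the sketch below. It assumes NumPy and an axis-aligned bounding rectangle (the text does not say whether the minimum bounding rectangle may be rotated); the function name `fuse`, the record keys, and the class names are hypothetical, while the 0.5 and 0.9 thresholds come from the paragraph above.

```python
import numpy as np

def fuse(probs, mask_prob, quant, class_names, normal_label="normal", normal_conf=0.9):
    """Combine the three branch outputs into one structured record."""
    idx = int(np.argmax(probs))
    label, confidence = class_names[idx], float(probs[idx])
    binary = (mask_prob >= 0.5).astype(np.uint8)   # binarize at 0.5
    area = int(binary.sum())                       # defect area in pixels
    bbox = None
    if area > 0:
        ys, xs = np.nonzero(binary)
        x0, y0 = int(xs.min()), int(ys.min())
        # Axis-aligned rectangle as (x, y, width, height): an assumption.
        bbox = (x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1)
    # Non-defect only if predicted "normal" with confidence above the threshold.
    is_defect = not (label == normal_label and confidence > normal_conf)
    return {"type": label, "confidence": confidence, "is_defect": is_defect,
            "area": area, "bbox": bbox, **quant}

names = ["normal", "scratch", "spot"]              # hypothetical categories
mask = np.zeros((64, 64))
mask[10:14, 20:30] = 0.8                           # a 4x10 high-probability region
quant = {"contrast": 0.3, "aspect_ratio": 2.5}     # hypothetical regression outputs
record = fuse(np.array([0.05, 0.85, 0.10]), mask, quant, names)
clean = fuse(np.array([0.95, 0.03, 0.02]), np.zeros((64, 64)), quant, names)
print(record)
```

Only records with `is_defect` set to true would be appended to the final defect detection result list.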
[0082] This embodiment also provides a computer device applicable to the machine vision-based mobile phone screen defect detection method, including: a memory and a processor; the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions to implement the machine vision-based mobile phone screen defect detection method proposed in the above embodiment. The computer device can be a terminal, comprising a processor, memory, communication interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse.
[0083] This embodiment also provides a storage medium storing a computer program, which, when executed by a processor, implements the machine vision-based mobile phone screen defect detection method proposed in the above embodiments. The storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0084] In summary, this invention solves the problem of false defects easily generated by previous methods under complex textures by using adaptive nonlocal mean filtering to accurately separate background and potential defect signals. Furthermore, it introduces frequency domain visual highlighting and adaptive contrast enhancement techniques, effectively addressing the difficulty in detecting small and low-contrast defects by dynamically adjusting enhancement parameters at multiple scales, thus improving the signal-to-noise ratio. Finally, it employs a dual-channel input dynamic multi-task recognition network, fusing classification results, segmentation results, defect area, and quantization parameters to achieve a refined description of defect attributes. This not only significantly reduces the false alarm rate caused by background interference but also overcomes the limitations of a single detection task, accurately outputting defect type, shape, and size data, meeting the stringent requirements of modern industrial production for high-precision and high-efficiency quality inspection.
[0085] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.
Claims
1. A method for detecting defects in mobile phone screens based on machine vision, characterized in that it comprises: acquiring the original image of the mobile phone screen under test and performing noise reduction to obtain a smoothed image; inputting the smoothed image into a deep background generation network for reconstruction to obtain a preliminary background reconstruction map; applying nonlocal mean filtering to the preliminary background reconstruction map to obtain a final background image; calculating the gray-level difference between corresponding pixels of the smoothed image and the final background image, and taking the absolute value of the gray-level difference to generate a preliminary residual map; performing spectral residual saliency calculation on the preliminary residual map to obtain a frequency domain visual saliency map; calculating a contrast limiting parameter based on the frequency domain visual saliency map, and performing contrast-limited histogram equalization on the preliminary residual map according to the contrast limiting parameter to generate an enhanced residual map; multiplying the frequency domain visual saliency map and the enhanced residual map element-wise and normalizing the result to obtain a weighted defect response map; performing threshold segmentation and morphological processing on the weighted defect response map to extract defect candidate regions; and, for each defect candidate region, extracting a dual-channel image patch from the smoothed image and the final background image and inputting it into a dynamic multi-task recognition network for defect analysis to obtain defect detection results.
2. The machine vision-based mobile phone screen defect detection method as described in claim 1, characterized in that: The smoothed image is obtained by defining a two-dimensional Gaussian kernel function and performing a convolution operation with the original image of the mobile phone screen under test.
3. The machine vision-based mobile phone screen defect detection method as described in claim 2, characterized in that: The preliminary background reconstruction map is obtained by inputting the smoothed image into the deep background generation network. The encoder part of the deep background generation network downsamples the smoothed image to extract multi-scale features, and the decoder part upsamples the multi-scale features to reconstruct the image, thereby generating the preliminary background reconstruction map.
4. The machine vision-based mobile phone screen defect detection method as described in claim 3, characterized in that: The generation of the preliminary residual map specifically involves: The grayscale variance in the neighborhood of each pixel in the preliminary background reconstruction map is calculated; The adaptive filtering parameter for each pixel position is calculated based on the grayscale variance and a preset adjustment value. Based on the adaptive filtering parameter corresponding to each pixel position, nonlocal mean filtering is performed on the preliminary background reconstruction map to obtain the final background image. For pixel positions with the same coordinates in the smoothed image and the final background image, the grayscale values are read and their difference is calculated. The absolute values of the differences at all pixel positions are arranged according to their original spatial positions to form a new grayscale image, which is used as the preliminary residual map.
5. The machine vision-based mobile phone screen defect detection method as described in claim 4, characterized in that: The acquisition of the frequency domain visual saliency map specifically involves: A two-dimensional discrete Fourier transform is performed on the preliminary residual map to obtain the logarithmic amplitude spectrum and phase spectrum of the preliminary residual map; The average gradient of the preliminary residual map is calculated, and the cutoff frequency of the low-pass filter is determined based on the average gradient, the preset base cutoff frequency, and the adjustment factor. The average gradient is calculated as follows: for each pixel in the preliminary residual map, the absolute value of the horizontal gradient component and the absolute value of the vertical gradient component are calculated to obtain the gradient magnitude of each pixel, and the arithmetic mean of the gradient magnitudes of all pixels is used as the average gradient. Based on the cutoff frequency of the low-pass filter, the logarithmic amplitude spectrum of the preliminary residual map is filtered to obtain the smoothed logarithmic amplitude spectrum. The difference between the logarithmic amplitude spectrum of the preliminary residual map and the smoothed logarithmic amplitude spectrum is calculated to obtain the frequency domain residual spectrum. The complex spectrum is reconstructed from the frequency domain residual spectrum and the phase spectrum. A two-dimensional inverse discrete Fourier transform is performed on the reconstructed complex spectrum to obtain the frequency domain visual saliency map.
6. The machine vision-based mobile phone screen defect detection method as described in claim 5, characterized in that: The generation of the enhanced residual map specifically involves: A Gaussian pyramid is constructed for the preliminary residual map, and the frequency domain visual saliency map is downsampled to the same size as the image at each scale in the Gaussian pyramid. Based on the value at each pixel position in the downsampled frequency domain visual saliency map, the contrast limiting parameter for that pixel position in the image at the corresponding scale is calculated. Using the calculated contrast limiting parameters, contrast-limited histogram equalization is performed independently on the image at each scale in the Gaussian pyramid to obtain a contrast-enhanced image at each scale. The contrast-enhanced images at all scales are upsampled back to the original size of the preliminary residual map, and a linear weighted summation of the pixel grayscale values at the same pixel coordinates in all upsampled contrast-enhanced images is then performed to generate the enhanced residual map.
7. The machine vision-based mobile phone screen defect detection method as described in claim 6, characterized in that: The extraction of the defect candidate region specifically involves: Perform pixel-by-pixel multiplication on the pixel grayscale values of the frequency domain visual saliency map and the enhanced residual map at the same spatial coordinates to generate a dot product result map. Normalize the grayscale values of all pixels in the dot product result map to obtain a weighted defect response map. Based on the distribution of gray values of each pixel in the weighted defect response map, the Otsu method is used to calculate the global optimal binarization threshold. Based on the globally optimal binarization threshold, the grayscale value of each pixel in the weighted defect response map is converted into a binary image pixel value to form a binary image; The binary image is subjected to morphological closing and morphological opening operations in sequence to obtain the morphologically processed binary image. Connectivity labeling analysis is performed on the morphologically processed binary image to identify all independent connected regions, and the minimum bounding rectangle of each connected region is calculated. Each minimum bounding rectangle constitutes a defect candidate region.
8. The machine vision-based mobile phone screen defect detection method as described in claim 7, characterized in that: The specific steps for obtaining the defect detection results are as follows: For each defect candidate region, based on the position and size of the corresponding minimum bounding rectangle, image patches are cropped from the corresponding coordinate positions of the smoothed image and the final background image, and these two image patches are concatenated along the channel dimension to form a dual-channel image patch. After the dual-channel image patch is scaled to a preset size and normalized, it is input into the dynamic multi-task recognition network to generate convolution kernel parameters in real time. The backbone of the dynamic multi-task recognition network uses the convolution kernel parameters to convolve the dual-channel image patch and extract features, and then passes the features to the multi-task output layer to obtain the category probability distribution of defects, the pixel-level segmentation mask, and the quantization parameters. Based on preset judgment rules, the defect detection results are generated by fusing the defect category probability distribution, pixel-level segmentation mask, and quantization parameters.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: When the processor executes the computer program, it implements the steps of the machine vision-based mobile phone screen defect detection method according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that: When the computer program is executed by the processor, it implements the steps of the machine vision-based mobile phone screen defect detection method according to any one of claims 1 to 8.
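The Otsu thresholding step recited in claim 7 can be illustrated with a short NumPy sketch. It assumes the weighted defect response map has been normalized to 8-bit grayscale (0 to 255), which the claims do not specify; the function name `otsu_threshold` and the synthetic test image are hypothetical.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold t maximizing between-class variance (classes: <= t and > t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # weight of the class <= t, for each t
    mu = np.cumsum(prob * np.arange(256))    # cumulative first moment up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0                # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

# Synthetic response map: dark background with one small bright region.
response = np.full((64, 64), 10, dtype=np.uint8)
response[20:24, 30:40] = 200
t = otsu_threshold(response)
binary = (response > t).astype(np.uint8)     # 1 marks candidate defect pixels
print(t, int(binary.sum()))
```

The resulting binary image would then go through the morphological closing, opening, and connected-region labeling that the claim describes.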