A cylinder hole flaw detection system and method based on DINO improvement

CN122244055A · Pending · Publication Date: 2026-06-19 · HANGZHOU XINGWANG INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU XINGWANG INTELLIGENT TECH CO LTD
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for cylinder porosity detection suffer from problems such as low recognition accuracy, weak anti-interference ability, insufficient generalization ability in small sample scenarios, and poor performance in detecting tiny pores in high-resolution images.

Method used

A cylinder porosity defect detection system based on DINO is adopted. Through a dual-branch feature extraction module combined with a DINOv3 model with frozen parameters and a lightweight backbone network, multi-scale feature extraction and fusion are performed. Prior attention masks for porosity regions are introduced, and combined with a difference feature decoding module and a detection result verification module, high-precision and robust porosity defect detection is achieved.

Benefits of technology

It significantly improves the detection accuracy and robustness of micropores, solves the problem of insufficient model generalization ability in small sample scenarios, and improves the reliability and credibility of the detection system.



Abstract

This invention discloses a cylinder porosity defect detection system and method based on DINO improvement, including an image registration module that achieves pixel-level alignment of dual-temporal images through SIFT and RANSAC algorithms; a dual-branch feature extraction module that includes a local detail branch and a DINOv3 global semantic branch with frozen parameters; a feature fusion module that achieves deep feature fusion through a lightweight adaptation unit and a dense feature fusion unit, and introduces a prior attention mask for porosity region weighting; a differential feature decoding module that generates a multi-scale auxiliary prediction map through differential attention mechanism and cross-scale gating fusion; and a porosity defect type output module that optimizes the prediction results through the opening and closing operations of a learnable structure kernel and identifies five defect types: extra, occlusion, positional deviation, semi-submerged hole, and blockage. This invention solves the problems of low accuracy and weak anti-interference ability of traditional methods for porosity defect identification, and has the advantages of high detection accuracy, strong robustness, and good small sample adaptability.

Description

Technical Field

[0001] This invention relates to the field of industrial visual inspection technology, and more specifically to a cylinder bore defect detection system and method based on DINO improvement.

Background Technology

[0002] The cylinder is a core component of power machinery and hydraulic equipment, and the quality of its surface pores directly affects the sealing performance, pressure resistance, and service life of the product. Missing pores (omitted pores, forgotten drilling) are a common and serious defect in cylinder manufacturing. Failure to detect them in time can lead to pressure imbalances, leaks, and other malfunctions during equipment operation. Therefore, strict quality inspection of cylinder pores is necessary. Existing inspection methods mainly rely on manual observation or traditional machine vision technology, which suffer from low efficiency, poor robustness, and insensitivity to subtle changes. In recent years, deep learning has been applied to industrial defect detection, but its limitations with small sample sizes, its insufficient ability to recognize fine-grained targets, and its sensitivity to complex interference (such as reflections, stains, and deformation) make it difficult to meet the high-precision, high-reliability requirements of cylinder pore defect detection. The DINO (self-distillation with no labels) model possesses powerful self-supervised semantic feature extraction capabilities, but its original architecture only supports 256×256 resolution input. Directly applying it to high-resolution industrial images results in a significant loss of detail information, and the detection of change-type defects such as missing micropores is especially poor.

[0003] A search revealed a system and method for detecting surface defects in an engine block, published under publication number CN118566238A. This method uses a multi-axis moving camera in conjunction with a light source to acquire images of different height surfaces of the cylinder block. After segmenting the images into sub-images, image recognition algorithms are used to highlight defects such as porosity and pinholes. However, this technical solution is still based on traditional image processing or shallow AI algorithms, lacking a deep understanding of semantic levels. It is difficult to effectively distinguish between missing porosity and artifacts caused by surface texture, stains, or changes in lighting. Furthermore, it relies on precise mechanical positioning and multi-angle shooting, resulting in high system complexity. It also fails to address the problem of weak model generalization ability in small sample scenarios, limiting the detection rate and robustness for missing minute porosity.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, the present invention aims to provide a cylinder porosity defect detection system and method based on DINO improvement, in order to solve the technical problems of low accuracy in porosity defect identification, weak anti-interference ability, insufficient generalization ability in small sample scenarios, and poor detection effect of small porosity in high-resolution images by traditional detection methods.

[0005] To achieve the above objectives, the present invention provides the following technical solution:

[0006] A cylinder porosity defect detection system based on DINO improvement, comprising:

An image registration module, which extracts and matches the feature points of the target image of the cylinder surface area to be inspected and the preset standard image through the scale-invariant feature transform (SIFT) algorithm. After screening by the random sample consensus (RANSAC) algorithm, the homography matrix is estimated, and the target image is subjected to rigid transformation to output pixel-level aligned dual-temporal images.

A dual-branch feature extraction module, which includes a parallel local detail branch and a global semantic branch. The local detail branch generates multi-scale local feature maps by combining a lightweight backbone network with a feature pyramid. The global semantic branch extracts global semantic feature maps by using a DINOv3 model with frozen parameters.

A feature fusion module, which performs channel alignment on the global semantic feature map through a lightweight adaptation unit and then deeply fuses it with the local feature map through a dense feature fusion unit. It also introduces a prior attention mask for the pore region based on the design drawing for region weighting, and outputs semantically and contextually enriched multi-scale feature pyramids for the standard image and the target image.

A differential feature decoding module, which calculates the absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map, and processes the change prior map through a Transformer decoder that includes a differential attention mechanism and cross-scale gating fusion to generate a multi-scale auxiliary prediction map.

A porosity defect type output module, which performs morphological operations on the auxiliary prediction map and outputs a binary defect detection mask by weighted fusion with the original prediction through learnable weights. Based on the geometric and positional features of the binary defect detection mask, the module combines the preset pore layout template to determine the defect type.

[0007] Furthermore, it also includes a detection result verification module. This module is used to obtain the detected defect type, size, location, and angle information from the porosity defect type output module. Based on the defect information, it retrieves one or more defect feature samples matching the detected defect type from a preset defect feature library. Then, it stitches the retrieved defect feature samples to the corresponding area of the standard image according to the size, location, and angle information to generate a simulated defect image. It calculates the similarity index between the simulated defect image and the target image to be detected. Based on the comparison result of the similarity index and a preset verification threshold, it determines whether the detection result of the porosity defect type output module is accurate. When the similarity index is lower than the preset verification threshold, it triggers a re-detection process or marks the current detection result as low confidence.
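The verification loop described above can be illustrated with a minimal sketch. It is only an assumption-laden illustration: the source does not name the similarity index, so normalized cross-correlation is swapped in here, rotation by the detected angle is omitted, and the function names and the 0.9 threshold are hypothetical.

```python
import numpy as np

def paste_patch(standard, patch, top, left):
    """Stitch a defect sample into a copy of the standard image (no rotation)."""
    simulated = standard.copy()
    h, w = patch.shape
    simulated[top:top + h, left:left + w] = patch
    return simulated

def normalized_cross_correlation(a, b):
    """Assumed similarity index: zero-mean normalized cross-correlation in [-1, 1]."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def verify_detection(standard, target, patch, top, left, threshold=0.9):
    """Return (accepted, score); a low score would trigger re-detection or a low-confidence flag."""
    simulated = paste_patch(standard, patch, top, left)
    score = normalized_cross_correlation(simulated, target)
    return score >= threshold, score
```

If the detection is correct, the simulated image should closely resemble the target and the score approaches 1; an unrelated target yields a score near 0 and fails the check.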

[0008] Furthermore, the local detail branch in the dual-branch feature extraction module uses MobileNetV2 as the backbone network. MobileNetV2 extracts image features through inverted residual blocks and linear bottleneck structures, outputting four-level feature maps with strides of 1, 2, 4, and 8 respectively. The feature pyramid structure fuses the semantic information of high-level features with the edge details of low-level features layer by layer through top-down upsampling operations and lateral connections, generating a local detail feature pyramid containing four levels.

[0009] Furthermore, the dual-branch feature extraction module also includes an image block preprocessing strategy. The image block preprocessing strategy includes dividing the original input image into multiple image blocks that conform to the model input specifications according to the fixed input size requirements of the DINOv3 model. A preset ratio of overlapping regions is set between adjacent image blocks to ensure feature continuity. After each image block is independently input into the DINOv3 model to extract global semantic features, the features of adjacent image blocks are smoothly transitioned through the overlapping region weighted fusion algorithm to reconstruct a global semantic feature map with full resolution.

[0010] Furthermore, the feature fusion module includes a feature channel alignment strategy, which includes compressing or expanding the channel dimension of the global semantic feature map through a 1×1 convolutional layer to match the number of channels of the local feature map at the corresponding level, then performing distribution calibration on the aligned features through a batch normalization layer, and finally introducing a nonlinear transformation through the SiLU activation function to output the adapted semantic feature map.

[0011] Furthermore, the feature fusion module also includes a feature deep fusion and region enhancement strategy. The feature deep fusion and region enhancement strategy includes channel concatenation of the adapted semantic feature map output by the lightweight adaptation unit and the local feature map of the corresponding level, and fusion and dimensionality reduction of the concatenated features through depthwise separable convolution. Then, a prior attention mask for the pore region is generated from the cylinder design drawings; the mask assigns a first weight value to the theoretical distribution area of the pores and a second weight value to the non-pore regions, and the first weight value is higher than the second weight value. The fused feature map is then multiplied element-wise with the prior attention mask to achieve feature response enhancement of the pore region and feature suppression of the non-pore region, and a multi-scale feature pyramid with semantic and context enrichment is output.

[0012] Furthermore, the differential feature decoding module includes a differential feature decoding strategy, which includes calculating the element-wise absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map. The multi-scale change prior map is input into the Transformer decoder, which includes a differential attention mechanism. The differential attention mechanism enhances the feature response of the real change region through differential calculation of two sets of self-attention matrices. A cross-scale gated fusion operator is used to adaptively interact with the features of adjacent layers. The contribution ratio of the current scale feature and the upper-layer fusion feature is dynamically adjusted through learnable gate weights. A fully convolutional head is set at the output of each scale to map the fusion feature of the corresponding scale to a single-channel auxiliary prediction map.

[0013] Furthermore, the porosity defect type output module includes a prediction fusion strategy. The prediction fusion strategy includes performing a differentiable morphological operation sequence on the auxiliary prediction map. The operation sequence includes opening and closing operations implemented through a learnable structural kernel. The morphologically optimized prediction result is transformed to logit space through an inverse sigmoid function. The optimization result in logit space is then weighted and fused with the original auxiliary prediction map through learnable weights. The fusion result is mapped back to probability space through a sigmoid function, and a binary defect detection mask is output using a preset binarization threshold.

[0014] Furthermore, the image registration module includes a mismatch removal submodule. The mismatch removal submodule iteratively filters the feature point pairs that are initially matched using a random sampling consensus algorithm. In each iteration, several matching pairs are randomly selected to estimate the homography matrix, and the projection error of the remaining matching points under the matrix is calculated. Matching pairs with projection errors less than a preset inlier threshold are determined to be inliers. After a preset number of iterations, the homography matrix containing the most inliers is selected as the final estimation result, and the corresponding outlier matching pairs are removed.

[0015] A method for detecting cylinder porosity defects based on DINO improvement includes the following steps:

An image registration step, which extracts and matches the feature points of the target image of the cylinder surface area to be inspected and the preset standard image through the scale-invariant feature transform (SIFT) algorithm. After screening by the random sample consensus (RANSAC) algorithm, the homography matrix is estimated, and the target image is subjected to rigid transformation to output pixel-level aligned dual-temporal images.

A dual-branch feature extraction step, which includes a parallel local detail branch and a global semantic branch. The local detail branch generates multi-scale local feature maps by combining a lightweight backbone network with a feature pyramid. The global semantic branch extracts global semantic feature maps by using a DINOv3 model with frozen parameters.

A feature fusion step, which performs channel alignment on the global semantic feature map through a lightweight adaptation unit and then deeply fuses it with the local feature map through a dense feature fusion unit. A prior attention mask for the pore region based on the design drawing is introduced for region weighting, and semantically and contextually enriched multi-scale feature pyramids for the standard image and the target image are output.

A differential feature decoding step, which calculates the absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map, and processes the change prior map through a Transformer decoder that includes a differential attention mechanism and cross-scale gating fusion to generate a multi-scale auxiliary prediction map.

A pore defect type output step, which performs morphological operations on the auxiliary prediction map and then fuses the result with the original prediction using learnable weights to output a binary defect detection mask. Based on the geometric and positional features of the binary defect detection mask, and combined with a preset pore layout template, the defect type is determined.

[0016] The beneficial effects of this invention are as follows:

1. By constructing a dual-branch feature extraction module, the global semantic features of DINOv3 with frozen parameters are deeply fused with the local detail features of the lightweight backbone network. This not only preserves the strong semantic representation ability and domain adaptability of DINOv3 obtained in large-scale pre-training, but also ensures that the edge details of small pores are not lost through the fine-grained features extracted by MobileNetV2. This effectively solves the problem of difficult detection of small pores in high-resolution images. At the same time, the strategy of freezing DINOv3 parameters avoids the overfitting problem caused by the scarcity of defect samples in industrial scenarios, and significantly improves the model's generalization ability in small sample scenarios.

2. By introducing a prior attention mask for the pore region based on design drawings into the feature fusion module, high weights are assigned to the theoretical distribution area of pores and low weights to non-pore regions, achieving region-selective enhancement of feature responses. This effectively suppresses the influence of interference factors such as cylinder surface reflection, oil stains, and processing textures on the detection results, significantly improving the robustness of the system in complex industrial environments. In addition, through the differential attention mechanism and cross-scale gating fusion strategy in the differential feature decoding module, the subtle differences in the presence or absence of pores between two temporal images are accurately captured. Combined with multi-scale deep supervision design, the model can learn effective defect features at each scale, significantly improving the localization accuracy of defects such as missing pores.

[0017] 3. By introducing a defect feature library comparison mechanism through the detection result verification module, the detected defect features are stitched back into the standard image to generate a simulation image, and then compared with the target image for similarity. This realizes secondary verification of the model output results. When the similarity is lower than the threshold, a re-detection is triggered or a low confidence level is marked, which further improves the reliability and credibility of the detection system.

Attached Figure Description

[0018] Figure 1 is the overall system framework diagram of the present invention; Figure 2 is a system algorithm framework diagram of the present invention; Figure 3 is a block diagram of the dual-branch feature extraction and fusion module of the present invention; Figure 4 is a flowchart of the difference decoding process of the present invention.

Detailed Implementation

[0019] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. Identical components are denoted by the same reference numerals. It should be noted that the terms "front," "rear," "left," "right," "upper," and "lower" used in the following description refer to directions in the accompanying drawings, and the terms "bottom surface," "top surface," "inner," and "outer" refer to directions toward or away from the geometric center of a specific component, respectively.

[0020] Example 1: As shown in Figure 1, the system architecture includes an image registration module, a dual-branch feature extraction module, a feature fusion module, a differential feature decoding module, and a porosity defect type output module. The modules work together through a collaborative and asynchronous processing mechanism to achieve accurate and automated detection of cylinder porosity defects.

[0021] The image registration module is used to precisely align the target image of the surface area of the cylinder to be inspected with a preset standard image at the pixel level. Specifically, the standard image is a qualified cylinder sample image without missing pores, and the target image is the image of the cylinder to be inspected. Both types of images are acquired using an industrial camera. In practice, the image registration module first uses the scale-invariant feature transform (SIFT) algorithm to extract feature points and corresponding feature descriptors from the standard image and the target image respectively. Each feature point has three attributes (position, scale, and orientation), which makes it robust to changes in image scale and rotation. Then, a fast nearest-neighbor search library (such as FLANN) is used to perform preliminary matching of feature points and select candidate matching pairs.

[0022] To address the mismatch issue caused by image noise and surface texture interference, the image registration module includes a mismatch removal submodule. This submodule uses a random sampling consensus algorithm to iteratively filter the initially matched feature point pairs. Specifically, in each iteration, several matching pairs are randomly selected to estimate the homography matrix, and the projection error of the remaining matching points under this matrix is calculated. Matching pairs with projection errors less than a preset inlier threshold are identified as inliers. After a preset number of iterations, the homography matrix containing the most inliers is selected as the final estimation result, and the corresponding outlier matching pairs are removed. Finally, based on the estimated homography matrix, a three-dimensional rigid transformation of translation, rotation, and scaling is performed on the target image to achieve accurate alignment with the standard image, ensuring that the spatial positional error of the corresponding pore regions in the two images after alignment does not exceed 1 pixel.
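The mismatch removal loop can be sketched in NumPy as follows. This is a minimal illustration rather than the patented implementation: it starts from already matched point arrays (SIFT extraction and nearest-neighbor matching are not shown), estimates the homography with a plain direct linear transform, and uses hypothetical defaults for the iteration count and the inlier threshold.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares H with dst ~ H @ src (needs >= 4 pairs)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)          # null vector of the constraint matrix
    return h / h[2, 2]

def project(h, pts):
    """Apply homography h to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]

def ransac_homography(src, dst, n_iters=500, inlier_thresh=1.0, seed=0):
    """Keep the model with the most inliers, then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=4, replace=False)
        with np.errstate(all="ignore"):   # degenerate samples may divide by zero
            h = fit_homography(src[sample], dst[sample])
            err = np.linalg.norm(project(h, src) - dst, axis=1)
        inliers = err < inlier_thresh     # NaN errors compare as False
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 4:
        raise ValueError("not enough inliers to estimate a homography")
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The returned boolean array marks which candidate pairs survived, so the outlier pairs can be discarded before the target image is warped.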

[0023] The dual-branch feature extraction module employs a symmetrical structure and processes the registered standard image and the target image to be detected in parallel. It includes a local detail branch and a global semantic branch. Specifically, as shown in Figure 3, the local detail branch uses MobileNetV2 as the backbone network. This network retains fine-grained features while reducing computational cost through inverted residual blocks and a linear bottleneck structure. In this embodiment, MobileNetV2 extracts features from the registered image and outputs four local feature maps at different scales with strides of 1, 2, 4, and 8, corresponding to sizes of 1024×1024, 512×512, 256×256, and 128×128, respectively. Subsequently, these four feature maps are input into the feature pyramid network. Through top-down upsampling and lateral connections, the semantic information of high-level features is fused with the edge and shape details of low-level features layer by layer to generate a multi-scale local detail feature pyramid. The key is to preserve fine structural information such as the edges and contours of the pores.
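The top-down pathway with lateral connections can be sketched as below. The sketch assumes the four backbone maps have already been projected to a common channel count (the lateral 1×1 convolutions are omitted) and uses nearest-neighbor upsampling for brevity; the function names are hypothetical.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def top_down_pyramid(laterals):
    """laterals: fine-to-coarse list of (C, H, W) maps with equal C and halving H, W.

    Starting from the coarsest level, each finer level adds the upsampled
    result of the level above it, so semantics flow down to the detail maps.
    """
    outputs = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        outputs.append(lateral + upsample2x(outputs[-1]))
    return outputs[::-1]  # back to fine-to-coarse order
```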

[0024] The global semantic branch introduces a pre-trained DINOv3 base model as the global semantic feature extractor. All parameters of the DINOv3 model are frozen during training to avoid compromising its strong semantic representation capabilities obtained through large-scale pre-training, while effectively mitigating overfitting caused by the scarcity of pore and defect samples in industrial scenarios. Considering that the original DINOv3 model only supports a fixed input size of 256×256, directly applying it to a 1024×1024 high-resolution image would result in significant detail loss. Therefore, the global semantic branch employs an image block preprocessing strategy. This strategy divides the original input image into multiple image blocks (e.g., 224×224) that conform to the model's input specifications, based on the fixed input size requirements of the DINOv3 model. A preset overlap ratio is set between adjacent image blocks to ensure feature continuity. After each image block is independently input into the DINOv3 model to extract global semantic features, a weighted fusion algorithm for overlapping regions is used to smoothly transition the features of adjacent image blocks, reconstructing a full-resolution global semantic feature map.
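The image block strategy can be sketched as follows. Several simplifications are assumed: `extract` is a hypothetical stand-in for the frozen DINOv3 forward pass and is taken to return one feature vector per pixel of the block (a real ViT returns lower-resolution patch tokens that would need upsampling), the overlap ratio of 0.25 is illustrative, and a separable triangular window is one possible choice for the overlap weighting.

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Start offsets covering [0, length), with the last tile flush to the edge."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

def tiled_features(image, extract, tile=224, overlap=0.25):
    """Run `extract` on overlapping tiles and blend overlaps with a tapered window.

    Assumes the image is at least `tile` pixels in each dimension.
    """
    h, w = image.shape[:2]
    stride = max(1, int(tile * (1.0 - overlap)))
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1)).astype(float)
    window = np.outer(ramp, ramp)          # larger weight near the tile center
    acc = None
    weight = np.zeros((h, w))
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            feat = extract(image[y:y + tile, x:x + tile])   # (tile, tile, C)
            if acc is None:
                acc = np.zeros((h, w, feat.shape[-1]))
            acc[y:y + tile, x:x + tile] += feat * window[..., None]
            weight[y:y + tile, x:x + tile] += window
    return acc / weight[..., None]         # weighted average in overlaps
```

Because every pixel is a weighted average of the tiles that cover it, features change smoothly across tile seams instead of jumping at block boundaries.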

[0025] The feature fusion module is used to deeply fuse the local detail features extracted by the dual branches with the global semantic features, and to introduce prior knowledge to enhance the pore region response. This module includes a lightweight adaptation unit and a dense feature fusion unit. Specifically, because the channel dimension of the output features of the DINOv3 model (768 dimensions) does not match the channel dimension of the output features of MobileNetV2 (which varies with the layer, e.g., 320 dimensions for a 1024×1024 layer), a lightweight adaptation unit is designed to achieve channel alignment. The lightweight adaptation unit executes a feature channel alignment strategy: it compresses or expands the channel dimension of the global semantic feature map through a 1×1 convolutional layer to match the number of channels of the local feature map at the corresponding layer; then, a batch normalization layer performs distribution calibration on the aligned features; finally, a nonlinear transformation is introduced through the SiLU activation function to output the adapted semantic feature map.
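The channel alignment chain (1×1 convolution, batch normalization, SiLU) can be written out in NumPy as a minimal sketch. It uses batch statistics as in training mode, omits the convolution bias and running averages, and the function name is hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def adapt_channels(feat, weight, gamma, beta, eps=1e-5):
    """feat: (N, C_in, H, W); weight: (C_out, C_in), acting as a 1x1 convolution."""
    y = np.einsum("oc,nchw->nohw", weight, feat)       # per-pixel channel mixing
    mean = y.mean(axis=(0, 2, 3), keepdims=True)       # batch statistics per channel
    var = y.var(axis=(0, 2, 3), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)
    y = gamma[None, :, None, None] * y + beta[None, :, None, None]
    return silu(y)
```

With the dimensions quoted above, `weight` would have shape (320, 768), compressing the 768-channel semantic map to the 320 channels of the matching local feature map.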

[0026] The dense feature fusion unit executes the feature deep fusion and region enhancement strategy. First, the adapted semantic feature map output by the lightweight adaptation unit is concatenated along the channel dimension with the local feature map of the corresponding level, and the concatenated features are fused and reduced in dimensionality using depthwise separable convolution to reduce computational complexity. Second, a prior attention mask for the pore region, generated from the cylinder design drawings, is introduced. The mask explicitly marks the theoretical distribution range of pores, assigning a first weight value to the theoretical pore distribution area and a second weight value to non-pore areas (such as shells, edges, etc.), with the first weight value being higher than the second weight value. Finally, the fused feature map is multiplied element-wise with the prior attention mask to enhance the feature response of the pore region and suppress the features of the non-pore region, outputting semantically and contextually enriched multi-scale fused feature pyramids for the standard image and the target image, respectively.
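The region weighting step can be illustrated with a small sketch. It assumes the design drawing has been reduced to a list of pore centers and a common radius in feature-map coordinates; the weight values 1.0 and 0.2 are hypothetical and only satisfy the stated requirement that the first weight exceeds the second.

```python
import numpy as np

def pore_prior_mask(shape, centers, radius, w_pore=1.0, w_background=0.2):
    """Build an (H, W) mask with a high weight inside each designed pore circle."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    mask = np.full(shape, w_background, dtype=float)
    for cy, cx in centers:
        mask[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = w_pore
    return mask

def apply_prior(feature_map, mask):
    """feature_map: (C, H, W); the (H, W) mask is broadcast over channels."""
    return feature_map * mask[None]
```

For a multi-scale pyramid, one mask per level would be needed, for example by rescaling the centers and radius to each level's resolution.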

[0027] The differential feature decoding module accurately captures the differences in pores between the two temporal features through difference modeling, so as to achieve initial localization of defect areas. As shown in Figure 4, a change prior is generated first: for the fused feature pyramids of the standard image and the target image, denoted here $F_s^{(l)}$ and $F_t^{(l)}$, the element-wise absolute difference is calculated at each corresponding scale level $l$ to obtain the multi-scale change prior map $D^{(l)} = \left| F_s^{(l)} - F_t^{(l)} \right|$. The change prior map directly highlights the differences between the two pyramids in the pore regions: if the target image contains a pore defect, the feature difference at the corresponding location is significantly higher than in normal areas, providing a direct clue for subsequent defect localization.
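As a minimal sketch of the change prior, the per-level absolute difference can be computed as below. The function name and the list-of-arrays pyramid representation are assumptions made for illustration.

```python
import numpy as np

def change_prior(pyramid_standard, pyramid_target):
    """Element-wise absolute difference at each scale level of two feature pyramids."""
    return [np.abs(fs - ft) for fs, ft in zip(pyramid_standard, pyramid_target)]

# Example: identical pyramids except for one location at the finest level.
standard = [np.zeros((4, s, s)) for s in (64, 32, 16, 8)]
target = [level.copy() for level in standard]
target[0][:, 10, 12] = 3.0                       # simulated feature change at a pore
prior = change_prior(standard, target)
peak = np.unravel_index(prior[0][0].argmax(), prior[0][0].shape)
```

In this toy case the change prior is zero everywhere except at the altered location, which is where the decoder would be expected to concentrate.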

[0028] Next, the multi-scale change prior map is fed into a customized S²DT decoder, which focuses on differences in the pore regions through an optimized attention mechanism. Specifically, the decoder includes a differential attention mechanism, with the number of Transformer attention heads set to h=8 and the per-head feature dimension set to d=64, so that the computational load is controlled while feature representation capability is maintained. The optimized differential attention is computed from two sets of self-attention matrices, a learnable coefficient matrix (initialized as an identity matrix), and the value vectors. Element-wise multiplication within this computation forces the attention mechanism to focus on the feature positions corresponding to the pores, enhancing the response to cross-temporal differences and significantly suppressing interference from irrelevant areas such as the shell and edges. In addition, the decoder adopts a cascaded structure, processing features step by step from the top-level features (l=4, 128×128) to the bottom-level features (l=1, 1024×1024). A gated fusion operator is introduced to achieve adaptive interaction between features at adjacent scales. Dynamic weights are generated through the sigmoid function, and the contribution ratio between the current-scale features and the upper-level fused features is automatically adjusted through learnable gate weights to ensure the effective combination of semantic information and spatial details, thereby improving the localization accuracy of missing micropores.
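The cross-scale gated fusion can be sketched as follows. The exact gate parameterization is not given in the text, so this sketch assumes a per-channel affine gate passed through a sigmoid and a convex blend of the two inputs; `w_cur`, `w_up`, and `bias` are hypothetical learnable parameters, and nearest-neighbor upsampling stands in for whatever resampling the decoder actually uses.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def gated_fusion(current, coarser_fused, w_cur, w_up, bias):
    """Blend the current scale with the upsampled coarser fusion result.

    The gate lies in (0, 1): values near 1 favor the current-scale detail,
    values near 0 favor the semantics passed down from the coarser level.
    """
    up = upsample2x(coarser_fused)
    gate = sigmoid(w_cur[:, None, None] * current
                   + w_up[:, None, None] * up
                   + bias[:, None, None])
    return gate * current + (1.0 - gate) * up
```

Applied level by level from l=4 down to l=1, each output becomes the `coarser_fused` input of the next finer level.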

[0029] Finally, a fully convolutional head is set at the output of the decoder at each scale. The fully convolutional head consists of two 3×3 convolutional layers (128 and 1 channels respectively) and a batch normalization layer, which maps the fused features of the corresponding scale to a single-channel auxiliary prediction map. By scaling the auxiliary prediction maps at each scale to the original image scale of 1024×1024 through bilinear interpolation, and calculating the loss for each auxiliary prediction map during training, deep supervision is achieved. This design can guide the model to learn effective defect features at each scale, stabilize the training process, and improve the model's convergence speed and final detection accuracy.

[0030] The porosity defect type output module includes a prediction fusion strategy. This strategy eliminates prediction noise and repairs defect boundaries through morphological optimization, and improves the reliability of detection results through a weighted fusion strategy. Specifically, as shown in Figure 4, the l=1 level auxiliary prediction map output by the differential feature decoding module (1024×1024 scale, containing the richest spatial details) is selected, and a differentiable morphological operation sequence is performed on it. The sequence includes opening and closing operations implemented through learnable structural kernels. A fixed window size combination of "3×3 opening + 5×5 closing" is used to adapt to the circular contour features of the cylinder bore: the 3×3 opening (erosion followed by dilation) eliminates minor noise in the bore area (such as false features caused by surface reflection and slight stains); the 5×5 closing (dilation followed by erosion) repairs fragmented defect boundaries and avoids breaks in the defect area caused by incomplete feature extraction. The structural element weights of the opening and closing operations are adaptively updated through end-to-end training, so that they can accurately match the morphological features of the cylinder bore.
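The opening and closing sequence can be illustrated with grayscale morphology in NumPy. This is a sketch under stated assumptions: the structuring kernel enters additively (max of value plus kernel for dilation, min of value minus kernel for erosion), which is one common way to make the kernel learnable, and zero-valued kernels reproduce ordinary flat morphology. NumPy has no automatic differentiation, so the training of the kernels is not shown.

```python
import numpy as np

def _windows(x, k, pad_value):
    """All k x k neighborhoods of a 2-D map, shape (H, W, k, k)."""
    padded = np.pad(x, k // 2, mode="constant", constant_values=pad_value)
    return np.lib.stride_tricks.sliding_window_view(padded, (k, k))

def dilate(x, kernel):
    return (_windows(x, kernel.shape[0], -np.inf) + kernel).max(axis=(2, 3))

def erode(x, kernel):
    return (_windows(x, kernel.shape[0], np.inf) - kernel).min(axis=(2, 3))

def opening(x, kernel):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return dilate(erode(x, kernel), kernel)

def closing(x, kernel):
    """Dilation followed by erosion: fills gaps smaller than the kernel."""
    return erode(dilate(x, kernel), kernel)

k3 = np.zeros((3, 3))   # flat 3x3 kernel (the initial state of a learnable kernel)
k5 = np.zeros((5, 5))   # flat 5x5 kernel

pred = np.zeros((24, 24))
pred[8:17, 8:17] = 1.0  # a 9x9 defect region
pred[2, 20] = 1.0       # an isolated noise pixel
refined = closing(opening(pred, k3), k5)
```

In this toy example the isolated pixel is removed by the opening while the 9×9 region passes through both operations unchanged.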

[0031] The prediction results are then optimized using a logit inverse transformation and weighted fusion strategy: the morphologically optimized prediction is transformed into logit space by the inverse sigmoid function, fused with the original auxiliary prediction through a learnable weight, and mapped back to probability space by the sigmoid function. Here, the sigmoid function converts predicted values into a probability map in the [0,1] interval; the inverse sigmoid function transforms the probability map back to logit space, preserving more gradient information; the two structure kernels are the 3×3 and 5×5 learnable structure kernels, respectively; and the fusion weight is learnable, so that training achieves a dynamic balance between the accuracy of the original prediction and the smoothness of the optimized prediction. Finally, the fused result is binarized with a threshold of 0.5, and a binary defect detection image is output. White areas (pixel value = 1) mark the locations of defects such as missing or omitted pores, while black areas (pixel value = 0) are normal areas.
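The exact fusion formula is not legible in this text, so the following is only one plausible reading. It assumes a convex combination in logit space with a single scalar weight `alpha`. The names `fuse_and_binarize`, `p_raw`, `p_morph` and `alpha` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p, eps=1e-6):
    """Inverse sigmoid; probabilities are clipped to avoid infinities at 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def fuse_and_binarize(p_raw, p_morph, alpha=0.5, threshold=0.5):
    """Blend the raw and morphologically refined maps in logit space.

    alpha is a stand-in for the learnable fusion weight (assumed scalar here).
    """
    z = alpha * logit(p_morph) + (1.0 - alpha) * logit(p_raw)
    p_final = sigmoid(z)
    mask = (p_final > threshold).astype(np.uint8)  # 1 = defect, 0 = normal
    return p_final, mask
```

With `alpha = 0.5`, a pixel is marked as a defect when the two logits sum to a positive value, so a confident refined prediction can override a weak raw one and vice versa.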

[0032] The defect type output module is also used to determine the defect type based on the geometric and positional features of the binarized defect detection mask, combined with the preset pore layout template. This embodiment supports the automatic output of five defect types, and the network directly outputs the defect type. Defect 1: Extra pores. The network judgment criteria are: in a reference coordinate area where no pore is reserved in the design, a distinct pore segmentation mask and bounding box are detected, indicating more pores than designed, and interference pixel features are excluded (confidence > 0.95); this is a moderate defect. Extra pores are likely to cause local stress concentration and damage the integrity of the cylinder structure. Defect 2: Pore occlusion. The network judgment criteria are: the segmentation mask is partially covered by high-grayscale occlusion pixels (occlusion area > 30%), the bounding box is incomplete, and the overall outline of the pore can be restored by a texture completion algorithm; this is a mild defect. Occlusion is likely to conceal the true size and depth of the pore and requires manual verification. Defect 3: Pore position deviation. The network judgment criteria are: the actually detected coordinates of a pore in a designed reserved pore area deviate from the reference coordinates by more than 0.5 mm, the segmentation mask is complete, and there are no other defect characteristics; this is a moderate defect. Position deviation can lead to cylinder assembly misalignment and poor sealing. Defect 4: Semi-submerged pore. The network judgment criteria are: the segmentation mask has a semi-recessed feature, the depth is 1/4 to 1/2 of the cylinder wall thickness (not penetrated), and the edge has a stepped texture; this is a moderate defect. Semi-submerged pores easily accumulate impurities and moisture, inducing local corrosion and reducing surface flatness. Defect 5: Pore blockage. The network judgment criteria are: the segmentation mask is filled with high-density impurity pixels, there are no cavity features, and the local grayscale value is uniform and low; this is a mild defect that affects the cylinder's heat dissipation or ventilation function. In addition to the binary defect detection mask and the defect type, the system simultaneously outputs quantitative defect information, including the number of defects, the center coordinates of each defect, and the bounding box size, providing standardized data support for subsequent quality assessment and production traceability.
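The quantitative outputs named above (defect count, center coordinates and bounding box size) can be derived from the binary mask by connected-component labelling. The sketch below is a plain breadth-first labelling with 8-connectivity. The connectivity choice and the function name `quantify_defects` are assumptions, and the text does not state how the system computes these values.

```python
import numpy as np
from collections import deque

def quantify_defects(mask):
    """Return one record per 8-connected region of a binary mask."""
    mask = np.asarray(mask).astype(bool)
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    defects = []
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            # Breadth-first flood fill starting from an unvisited defect pixel.
            queue = deque([(r0, c0)])
            seen[r0, c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < h and 0 <= cc < w
                                and mask[rr, cc] and not seen[rr, cc]):
                            seen[rr, cc] = True
                            queue.append((rr, cc))
            rows, cols = zip(*pixels)
            defects.append({
                "center": (float(np.mean(rows)), float(np.mean(cols))),
                # Inclusive bounds: (min_row, min_col, max_row, max_col).
                "bbox": (min(rows), min(cols), max(rows), max(cols)),
                "area": len(pixels),
            })
    return defects
```

The number of defects is simply `len(quantify_defects(mask))`. Converting pixel coordinates to millimetres, as the 0.5 mm deviation criterion requires, would need a calibration factor that is not given here.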

[0033] Example 2: Building on Example 1, the system further includes a detection result verification module for secondary verification of the detection results from the porosity defect type output module, thereby enhancing the system's reliability and credibility. This module executes a detection result verification strategy. Specifically, it first obtains the detected defect type, size, location, and angle information from the porosity defect type output module. Second, based on this defect information, it retrieves one or more defect feature samples matching the detected defect type from a preset defect feature library, which stores feature data of typical defect samples collected in historical detections. It then stitches the retrieved defect feature samples onto the corresponding areas of a standard image according to the detected size, location, and angle information to generate a simulated defect image.

[0034] Next, the similarity index between the simulated defect image and the target image to be detected is calculated. The similarity index can include a weighted combination of various measurement methods such as structural similarity index and feature space cosine similarity to comprehensively evaluate the consistency of the two images in terms of texture, edge structure and semantic content. Finally, based on the comparison result of the similarity index and the preset verification threshold, it is determined whether the detection result of the porosity defect type output module is accurate. When the similarity index is lower than the preset verification threshold, it indicates that the credibility of the detection result is low. At this time, the re-detection process is triggered or the current detection result is marked as a low confidence state pending review.
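A weighted similarity index of this kind might look like the sketch below. It is an illustration under stated assumptions: SSIM is computed once over the whole image rather than with a sliding window, the feature vectors are taken as given, and the weight `w_ssim`, the threshold `0.8` and the function names are hypothetical values that do not come from the text.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """Structural similarity computed over the whole image as a single window."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

def cosine_similarity(u, v, eps=1e-12):
    u = np.ravel(u).astype(np.float64)
    v = np.ravel(v).astype(np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def verify(sim_img, target_img, sim_feat, target_feat, w_ssim=0.5, threshold=0.8):
    """Weighted similarity index and the resulting verification status."""
    score = (w_ssim * global_ssim(sim_img, target_img)
             + (1.0 - w_ssim) * cosine_similarity(sim_feat, target_feat))
    status = "accepted" if score >= threshold else "low_confidence"
    return score, status
```

A result marked `low_confidence` corresponds to the case in which re-detection is triggered or the result is held for review.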

[0035] Example 3: Based on Example 1 and/or Example 2, loss function optimization is implemented to address the severe class imbalance between porosity defect samples (minority class) and background samples (majority class) in industrial scenarios. A weighted combined loss function is designed and combined with deep supervision to improve the model's ability to learn from minority defect samples. Specifically, the main loss is a weighted combination of Focal loss and Dice loss, computed between the prediction and the ground-truth label mask at a scale of 1024×1024. The Focal loss uses a focusing parameter γ=2 and focuses on hard-to-classify defect samples by reducing the weight of easily classified background samples. The Dice loss uses a weighting coefficient β=0.5 and focuses on optimizing the boundary detection accuracy of defect regions. An auxiliary loss provides deep supervision of the multi-scale auxiliary prediction maps. The total loss is a weighted sum of the main loss and the auxiliary loss, balancing the training weights of the main prediction and the multi-scale auxiliary predictions. This total loss function can effectively alleviate the class imbalance problem, guide the model to accurately learn defect features, and improve detection accuracy and generalization ability.
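Since the loss expressions themselves are not legible here, the sketch below shows one common way to combine these terms. The text states only γ=2 and β=0.5. The additive form `focal + beta * dice`, the averaging over auxiliary maps and the auxiliary weight `lam` are assumptions, as are all function names.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: well-classified pixels are down-weighted by (1 - p_t)^gamma."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def dice_loss(p, y, eps=1e-7):
    """Soft Dice loss: 1 minus the overlap ratio between prediction and label."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def main_loss(p, y, gamma=2.0, beta=0.5):
    # Assumed form: focal term plus beta-weighted Dice term.
    return focal_loss(p, y, gamma) + beta * dice_loss(p, y)

def total_loss(p_final, aux_preds, y, lam=0.4):
    """Main loss plus lam times the mean loss of the upsampled auxiliary maps.

    lam is a hypothetical value; aux_preds are assumed to be already resized
    to the label resolution.
    """
    aux = sum(main_loss(p, y) for p in aux_preds) / len(aux_preds)
    return main_loss(p_final, y) + lam * aux
```

A prediction identical to the label gives a loss close to zero, while an inverted prediction is penalized by both terms.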

[0036] Example 4: Regarding the detection method, based on Example 1 and/or Example 2, this example further provides a specific implementation process for a cylinder porosity defect detection method based on DINO improvement. As shown in Figure 2, it includes the following steps: Step S1, image registration step: feature points of the target image of the surface area of the cylinder to be inspected and of the preset standard image are extracted and matched by the scale-invariant feature transform (SIFT) algorithm. The matched feature point pairs are then screened by the random sample consensus (RANSAC) algorithm, the homography matrix is estimated from the retained pairs, and the target image is subjected to rigid transformation to output pixel-level aligned dual-temporal images.
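The screening step can be sketched in NumPy as follows, covering only the RANSAC part (feature extraction and matching are assumed to have produced the point arrays `src` and `dst` already). The homography is fitted with an unnormalized direct linear transform for brevity. The iteration count, inlier threshold and function names are hypothetical.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: H such that dst ~ H @ src, up to scale."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)  # right singular vector of the smallest singular value

def project(h, pts):
    """Apply homography h to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return homog[:, :2] / homog[:, 2:3]

def ransac_homography(src, dst, n_iters=200, inlier_thresh=2.0, seed=0):
    """Keep the 4-point model with the most inliers, then refit on those inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=4, replace=False)
        h = fit_homography(src[sample], dst[sample])
        err = np.linalg.norm(project(h, src) - dst, axis=1)
        inliers = err < inlier_thresh  # NaN errors from degenerate samples count as outliers
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 4:
        raise ValueError("not enough inliers to estimate a homography")
    h = fit_homography(src[best], dst[best])
    return h / h[2, 2], best
```

The returned boolean array marks which matches were retained, and the matrix can then be used to warp the target image onto the standard image.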

[0037] Step S2, the dual-branch feature extraction step, includes a parallel local detail branch and a global semantic branch. The local detail branch generates multi-scale local feature maps by combining the MobileNetV2 backbone network with the feature pyramid. The global semantic branch extracts global semantic feature maps by using the DINOv3 model with frozen parameters. According to the fixed input size requirements of the DINOv3 model, an image block preprocessing strategy is executed to divide the original input image into multiple image blocks that meet the model input specifications. A preset ratio of overlapping regions is set between adjacent image blocks. After each image block extracts features independently, the global semantic feature map with full resolution is reconstructed by weighted fusion of the overlapping regions.
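The image block strategy can be sketched as below. A per-tile function `fn` stands in for the frozen feature extractor and is assumed, for simplicity, to return a map of the same spatial size as its input tile. The tile size, overlap ratio and triangular blending window are hypothetical choices, and the image is assumed to be at least one tile in each dimension.

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Start offsets so that tiles of size `tile` cover [0, length) completely."""
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # final tile is shifted back to reach the border
    return starts

def tiled_apply(image, fn, tile=256, overlap=0.25):
    """Run fn on overlapping tiles and blend the results by weighted averaging."""
    h, w = image.shape
    stride = max(1, int(tile * (1.0 - overlap)))
    # Triangular window: tile centres get high weight, tile edges low weight.
    ramp = 1.0 - np.abs(np.linspace(-1.0, 1.0, tile)) + 1e-3
    window = np.outer(ramp, ramp)
    out = np.zeros((h, w))
    weight = np.zeros((h, w))
    for r in tile_starts(h, tile, stride):
        for c in tile_starts(w, tile, stride):
            feat = fn(image[r:r + tile, c:c + tile])
            out[r:r + tile, c:c + tile] += feat * window
            weight[r:r + tile, c:c + tile] += window
    return out / weight
```

Because every output pixel is a weighted average of the tiles that cover it, seams between adjacent blocks are smoothed rather than abrupt.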

[0038] Step S3, the feature fusion step, aligns the global semantic feature map through a lightweight adaptation unit. Specifically, it includes: compressing or expanding the channel dimension through a 1×1 convolutional layer to match the number of channels in the local feature map of the corresponding level; then performing distribution calibration through a batch normalization layer; and finally introducing a nonlinear transformation through the SiLU activation function. Then, the adapted semantic feature map and local feature map are deeply fused through a dense feature fusion unit. Specifically, this includes: channel concatenation, dimensionality reduction through depthwise separable convolution fusion; introducing a prior attention mask for pore regions based on design drawings, assigning a first weight value to the theoretical distribution region of pores and a second weight value to non-pore regions, with the first weight value being higher than the second weight value; and performing element-wise multiplication of the fused feature map with the prior attention mask to achieve region weighting, outputting a multi-scale feature pyramid enriched with semantics and context.
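A simplified view of the adaptation and region weighting is sketched below. It models the 1×1 convolution as a channel-mixing matrix, replaces batch normalization with per-channel normalization over a single sample (no learnable scale or shift), and omits the depthwise separable convolution that follows concatenation. The weight values `1.0` and `0.3` and all names are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def adapt(semantic, weight, eps=1e-5):
    """Channel alignment of a (C_in, H, W) map with a (C_out, C_in) mixing matrix."""
    x = np.einsum("oc,chw->ohw", weight, semantic)   # 1×1 convolution
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return silu((x - mean) / np.sqrt(var + eps))     # normalise, then activate

def prior_mask(pore_region, w_pore=1.0, w_background=0.3):
    """Higher weight where the design drawing places pores, lower elsewhere."""
    return np.where(pore_region, w_pore, w_background)

def fuse(local, semantic, weight, pore_region):
    """Concatenate local and adapted semantic features, then apply the prior mask."""
    stacked = np.concatenate([local, adapt(semantic, weight)], axis=0)
    return stacked * prior_mask(pore_region)[None]   # broadcast over channels
```

Responses outside the expected pore regions are scaled down by the second weight rather than removed, so unexpected pores in non-pore regions remain detectable.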

[0039] Step S4, the differential feature decoding step, calculates the element-wise absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map. This multi-scale change prior map is then input into the Transformer decoder, which includes a differential attention mechanism that enhances the feature response of truly changed regions through differential calculation of two sets of self-attention matrices. A cross-scale gated fusion operator is employed to adaptively interact features from adjacent levels, with learnable gate weights dynamically adjusting the contribution ratio between current-scale features and upper-level fused features. A fully convolutional head is set at the output of each scale to map the fused features of the corresponding scale to a single-channel auxiliary prediction map, enabling multi-scale deep supervision.
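The three operations in this step can be sketched as follows. The text does not give the exact form of the differential attention or the gate, so the subtraction with a scalar factor `lam`, the scalar sigmoid gate and the nearest-neighbour 2× upsampling are assumptions, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def change_prior(std_pyramid, tgt_pyramid):
    """Element-wise absolute difference at every pyramid level."""
    return [np.abs(s - t) for s, t in zip(std_pyramid, tgt_pyramid)]

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Difference of two softmax attention maps applied to the values.

    Responses common to both maps cancel, which leaves the responses
    that only one of the maps attends to.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v

def gated_fusion(current, upper, gate_logit=0.0):
    """Blend current-scale features with upsampled upper-level features."""
    up = upper.repeat(2, axis=-2).repeat(2, axis=-1)  # nearest-neighbour 2× upsampling
    g = 1.0 / (1.0 + np.exp(-gate_logit))             # gate value in (0, 1)
    return g * current + (1.0 - g) * up
```

In a trained model `gate_logit` and `lam` would be learned. Here they are fixed scalars so the data flow can be followed.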

[0040] Step S5, the porosity defect type output step: the highest-resolution auxiliary prediction map is selected, and a sequence of differentiable morphological operations is performed on it, including a 3×3 opening operation and a 5×5 closing operation implemented through learnable structure kernels. The opening operation is used to eliminate small noise regions, and the closing operation is used to repair fragmented defect boundaries. The prediction results are then optimized using a logit inverse transformation and weighted fusion strategy, in which the morphologically optimized prediction is fused with the original prediction through learnable weights, and 0.5 is finally used as the binarization threshold to output a binarized defect detection mask. Finally, based on the geometric and positional features of the binarized defect detection mask, and combined with the preset pore layout template, the defect type is determined, and at least one defect type and its corresponding confidence level are output. The defect types include extra pores, pore occlusion, pore position deviation, semi-submerged pores, and pore blockage. At the same time, quantitative information such as the number of defects, the center coordinates of each defect, and the bounding box size is output.

[0041] Step S6, the detection result verification step, obtains the detected defect type, size, location, and angle information from the porosity defect type output step. Based on the defect information, it retrieves one or more defect feature samples matching the detected defect type from a preset defect feature library. The retrieved defect feature samples are then stitched to the corresponding areas of a standard image according to their size, location, and angle information to generate a simulated defect image. The similarity index between the simulated defect image and the target image to be detected is calculated. Based on the comparison between the similarity index and a preset verification threshold, the accuracy of the detection result is determined. When the similarity index is lower than the preset verification threshold, a re-detection process is triggered or the current detection result is marked as low confidence.

[0042] The above are merely preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principle of the present invention should also be considered within the scope of protection of the present invention.

Claims

1. A cylinder porosity defect detection system based on DINO improvement, characterized in that it includes: an image registration module, which matches feature points between the target image of the surface area of the cylinder to be inspected and a preset standard image, and then transforms the target image based on the matched feature points to output pixel-level aligned dual-temporal images; a dual-branch feature extraction module, which includes a parallel local detail branch and a global semantic branch, the local detail branch being used to generate multi-scale local feature maps and the global semantic branch being used to extract global semantic feature maps; a feature fusion module, which aligns the global semantic feature map by channel and then deeply fuses it with the local feature map, introduces a prior attention mask for the pore region based on the design drawings for region weighting, and outputs semantically and contextually enriched multi-scale feature pyramids of the standard image and the target image; a differential feature decoding module, which calculates the absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map, the change prior map then being processed by a decoder that includes a differential attention mechanism and cross-scale gated fusion to generate multi-scale auxiliary prediction maps; and a porosity defect type output module, which performs morphological operations on the auxiliary prediction map, outputs a binary defect detection mask by weighted fusion with the original prediction through learnable weights, and determines the defect type based on the geometric and positional features of the binary defect detection mask in combination with the preset pore layout template.

2. The cylinder porosity defect detection system based on DINO improvement according to claim 1, characterized in that: it also includes a detection result verification module, which is used to obtain the detected defect type, size, location, and angle information from the porosity defect type output module, retrieve one or more defect feature samples matching the detected defect type from a preset defect feature library based on the defect information, and then stitch the retrieved defect feature samples onto the corresponding area of the standard image according to the size, location, and angle information to generate a simulated defect image; calculate the similarity index between the simulated defect image and the target image to be detected, and determine whether the detection result of the porosity defect type output module is accurate based on the comparison of the similarity index with a preset verification threshold; and, when the similarity index is lower than the preset verification threshold, trigger a re-detection process or mark the current detection result as low confidence.

3. The cylinder porosity defect detection system based on DINO improvement according to claim 1 or 2, characterized in that: the local detail branch in the dual-branch feature extraction module uses MobileNetV2 as the backbone network. MobileNetV2 extracts image features through inverted residual blocks and linear bottleneck structures, and outputs four-level feature maps with strides of 1, 2, 4 and 8 respectively. The feature pyramid structure integrates the semantic information of high-level features with the edge details of low-level features through top-down upsampling and lateral connections, generating a local detail feature pyramid with four levels.

4. The cylinder porosity defect detection system based on DINO improvement according to claim 3, characterized in that: the dual-branch feature extraction module also includes an image block preprocessing strategy. The image block preprocessing strategy includes dividing the original input image into multiple image blocks that conform to the model input specifications according to the fixed input size requirements of the DINOv3 model. A preset ratio of overlapping regions is set between adjacent image blocks to ensure feature continuity. After each image block is independently input into the DINOv3 model to extract global semantic features, the features of adjacent image blocks are smoothly blended through an overlapping-region weighted fusion algorithm to reconstruct a global semantic feature map with full resolution.

5. The cylinder porosity defect detection system based on DINO improvement according to claim 1 or 2, characterized in that: the feature fusion module includes a feature channel alignment strategy executed by a lightweight adaptation unit, which involves compressing or expanding the channel dimension of the global semantic feature map through a 1×1 convolutional layer to match the number of channels of the local feature map at the corresponding level, then performing distribution calibration on the aligned features through a batch normalization layer, and finally introducing a nonlinear transformation through the SiLU activation function to output the adapted semantic feature map.

6. The cylinder porosity defect detection system based on DINO improvement according to claim 5, characterized in that: the feature fusion module also includes a feature deep fusion and region enhancement strategy. The feature deep fusion and region enhancement strategy includes concatenating the adapted semantic feature map output by the lightweight adaptation unit with the local feature map of the corresponding level through channel concatenation, and fusing and reducing the dimensionality of the concatenated features through depthwise separable convolution. Then, a prior attention mask for the pore region is generated based on the cylinder design drawings; the mask assigns a first weight value to the theoretical distribution area of the pores and a second weight value to the non-pore regions, with the first weight value being higher than the second weight value. The fused feature map is then multiplied element-wise with the prior attention mask to enhance the feature response of the pore region and suppress the features of the non-pore region, outputting a semantically and contextually enriched multi-scale feature pyramid.

7. The cylinder porosity defect detection system based on DINO improvement according to claim 6, characterized in that: The differential feature decoding module includes a differential feature decoding strategy, which involves calculating the element-wise absolute difference between the feature pyramids of the standard image and the target image layer by layer to generate a multi-scale change prior map. The multi-scale change prior map is then input into the Transformer decoder, which includes a differential attention mechanism. This mechanism enhances the feature response of the real change region through differential calculation of two sets of self-attention matrices. A cross-scale gated fusion operator is used to adaptively interact with features at adjacent levels, dynamically adjusting the contribution ratio of the current scale feature and the upper-level fusion feature through learnable gate weights. A fully convolutional head is set at the output of each scale to map the fusion feature of the corresponding scale to a single-channel auxiliary prediction map.

8. The cylinder porosity defect detection system based on DINO improvement according to claim 7, characterized in that: the porosity defect type output module includes a prediction fusion strategy. The prediction fusion strategy includes performing a differentiable morphological operation sequence on the highest-resolution auxiliary prediction map. The operation sequence includes opening and closing operations implemented through learnable structural kernels. The morphologically optimized prediction result is transformed to logit space through an inverse sigmoid function. The optimization result in logit space is then weighted and fused with the original auxiliary prediction map through learnable weights. The fusion result is mapped back to probability space through a sigmoid function, and a binary defect detection mask is output using a preset binarization threshold.

9. The cylinder porosity defect detection system based on DINO improvement according to claim 8, characterized in that: The image registration module includes a mismatch removal submodule. The mismatch removal submodule uses a random sampling consensus algorithm to iteratively filter the feature point pairs that are initially matched. In each iteration, several matching pairs are randomly selected to estimate the homography matrix, and the projection error of the remaining matching points under the matrix is ​​calculated. Matching pairs with projection errors less than a preset inlier threshold are determined to be inliers. After a preset number of iterations, the homography matrix containing the most inliers is selected as the final estimation result, and the corresponding outlier matching pairs are removed.

10. A method for detecting cylinder porosity defects based on DINO improvement, characterized in that it includes the following steps: an image registration step, in which feature points are matched between the target image of the surface area of the cylinder to be inspected and a preset standard image, and the target image is then transformed based on the matched feature points to output pixel-level aligned dual-temporal images; a dual-branch feature extraction step, which includes a parallel local detail branch and a global semantic branch, the local detail branch being used to generate multi-scale local feature maps and the global semantic branch being used to extract global semantic feature maps; a feature fusion step, in which the global semantic feature map is aligned by channel and then deeply fused with the local feature map, a prior attention mask for the pore region based on the design drawings is introduced for region weighting, and semantically and contextually enriched multi-scale feature pyramids of the standard image and the target image are output; a differential feature decoding step, in which the absolute difference between the feature pyramids of the standard image and the target image is calculated layer by layer to generate a multi-scale change prior map, and the change prior map is then processed by a decoder that includes a differential attention mechanism and cross-scale gated fusion to generate multi-scale auxiliary prediction maps; and a porosity defect type output step, in which morphological operations are performed on the auxiliary prediction map, a binary defect detection mask is output by weighted fusion with the original prediction through learnable weights, and the defect type is determined based on the geometric and positional features of the binary defect detection mask in combination with the preset pore layout template.