Bearing surface defect detection method and electronic device

The MSF-YOLO network model addresses cross-scale feature extraction, preservation of small-defect features, and complex-background interference in bearing surface defect detection, achieving high-precision detection.

CN122243933A · Pending · Publication Date: 2026-06-19 · NANCHANG HANGKONG UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: NANCHANG HANGKONG UNIVERSITY
Filing Date: 2026-03-18
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing bearing surface defect detection technologies have shortcomings in cross-scale feature extraction, protection of minute defect features, feature fusion adaptability, and resistance to complex background interference, resulting in poor detection accuracy and robustness.

Method used

The MSF-YOLO network model is adopted, which enhances feature extraction through multi-scale partial convolutional modules and small defect sensitive branches. Combined with feature fusion modules and collaborative attention mechanisms, it achieves full-scale defect feature capture, small defect feature enhancement and background noise suppression.

Benefits of technology

It significantly improves the accuracy and robustness of bearing surface defect detection, effectively solves the problems of cross-scale feature extraction, missed detection of small defects and interference from complex backgrounds, and achieves high-precision defect detection.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the field of deep learning and machine vision technology, and provides a method and electronic device for detecting surface defects in bearings. The method includes: acquiring a bearing surface image and standardizing it to obtain an initial feature map, which is then input into an MSF-YOLO network model; the backbone network uses a multi-scale partial convolution module for dual-path processing, with multi-size parallel convolution branches capturing full-scale defect features and a micro-defect sensitive branch enhancing micro-defect features; the two are then concatenated and fused, and residual connections are used to obtain backbone-level features; the neck network first uses a feature fusion module to adaptively fuse deep and shallow features to obtain a fused feature map, and then uses a collaborative attention module to lock the defect location, filter noise, and output an enhanced feature map; finally, a multi-scale decoupled detection head is used for classification and regression to output the category, confidence level, and location coordinates of the bearing surface defect. This invention solves the core technical bottleneck of existing bearing surface defect detection, significantly improving detection accuracy, robustness, and effectiveness.

Description

Technical Field

[0001] This invention belongs to the field of deep learning and machine vision technology, and provides a method and electronic device for detecting surface defects in bearings.

Background Technology

[0002] Rolling bearings, as core components of modern machinery, are crucial for ensuring equipment operating accuracy, efficiency, and lifespan, and are widely used in industrial fields such as intelligent manufacturing, rail transportation, and aerospace. During production, processing, and assembly, bearing surfaces are prone to defects such as scratches, indentations, micro-pits, and electrolytic erosion holes. These defects range in size from sub-millimeter micro-cracks to macroscopic scratches visible to the naked eye, and exhibit diverse forms. If not accurately detected and removed, they can cause vibration and noise during equipment operation, and in severe cases, lead to catastrophic shutdowns. Therefore, achieving high-precision, high-efficiency, automated detection of bearing surface defects has significant engineering application value and practical significance for ensuring safe industrial production and promoting predictive maintenance of equipment.

[0003] For a long time, bearing surface defect detection has evolved from manual visual inspection to automated inspection based on machine vision. Manual visual inspection is the mainstream method in traditional industrial settings, relying on inspectors' naked eye to determine the presence of defects on the bearing surface. However, this method has inherent drawbacks such as strong subjectivity, susceptibility to visual fatigue, high false negative and false positive rates, and low inspection efficiency, making it unsuitable for the full inspection requirements of modern high-speed production lines. With the rapid development of machine vision and deep learning technologies, deep learning-based object detection algorithms, with their end-to-end inference capabilities and high detection speed, have become the mainstream technical solution for industrial surface defect detection. Among them, single-stage object detection algorithms, represented by the YOLO series, SSD, and RT-DETR, are widely used in surface quality inspection in industries such as steel, electronic components, and textiles. YOLOv8, as an advanced version of the YOLO series, demonstrates excellent detection performance on general object detection datasets, becoming one of the preferred benchmark models for defect detection deployment in industry.

[0004] Although deep learning-based automated inspection has matured and general object detection algorithms have been applied successfully in many fields, applying them directly to bearing surface defect detection still faces several technical bottlenecks arising from the structure of the bearing itself, the complex industrial environment, and the inherent architecture of the algorithms. Existing detection technologies have significant shortcomings in both hardware imaging and algorithm processing, making it difficult to meet the requirements of high-precision, high-robustness detection. The specific problems are as follows. First, single-scale convolution limits cross-scale feature extraction. The backbone networks of existing detection algorithms generally use fixed-size convolution kernels, resulting in a single receptive field. Bearing surface defects span a very wide range of scales, from micron-level dot-like pits to millimeter-level long scratches, so these networks cannot extract local fine detail and global semantic information at the same level simultaneously and adapt poorly to multi-scale defects.

[0005] Second, deep networks are prone to losing features of minute defects. The downsampling process of deep learning models will gradually reduce the spatial resolution of the image. The pixel ratio of minute defects on the bearing surface in the original image is extremely low. After multiple layers of convolution and downsampling, their weak feature signals are easily submerged by background noise or disappear in feature mapping. Existing models lack feature protection mechanisms for small targets, resulting in a high rate of missed detection of minute defects.

[0006] Third, feature fusion lacks adaptability. Existing feature pyramid networks often use simple concatenation when fusing deep and shallow features, which assumes that all channel features are equally important. They cannot automatically adjust the contribution weight of features at different levels according to the defect scale, so key detail features are easily diluted by redundant information.

[0007] Fourth, resistance to interference from complex backgrounds is weak. Oil stains and water stains are easily left on bearing surfaces at industrial sites, and the texture of the metal itself and changes in lighting conditions can form pseudo-features that resemble defects. Existing algorithms lack an efficient spatial-channel collaborative attention mechanism and cannot automatically suppress background noise and highlight real defect areas during the feature extraction stage, resulting in a high false alarm rate and poor detection robustness in practical applications.

Summary of the Invention

[0008] To address the aforementioned technical problems, this invention provides a bearing surface defect detection method and electronic device, which can overcome the core technical bottlenecks of existing bearing surface defect detection and significantly improve detection accuracy, robustness, and effectiveness.

[0009] The technical solution of the present invention includes: The bearing surface image is acquired and standardized to obtain an initial feature map.

[0010] The initial feature map is input into the MSF-YOLO network model. The backbone network performs dual-path processing through a multi-scale partial convolution module. One path is a multi-scale parallel convolution branch that captures full-scale defect features to obtain a multi-scale feature map. The other path generates a spatial attention weight map by entering a micro-defect sensitive branch to enhance micro-defect features and obtain a sensitive feature map. The multi-scale feature map and the sensitive feature map are concatenated and fused, and then fused through residual connections to obtain the backbone layer features. The neck network first introduces learnable channel weights into the backbone layer features through a feature fusion module, adaptively fusing shallow P2 high-resolution detail features and deep semantic features in the backbone network to obtain a fused feature map. Then, the collaborative attention module generates spatial attention masks along the height and width dimensions of the fused feature map to lock the defect location. After filtering noise through a channel self-attention mechanism, the enhanced feature map is output. After classification and regression by a multi-scale decoupled detection head, the category, confidence, and location coordinates of the bearing surface defect are output.

[0011] Furthermore, the bearing surface image acquisition device includes an area array industrial camera, a fixed-focus lens, a coaxial light source assembly, and a precision stage.

[0012] The fixed-focus lens is aligned with the optical axis of the area array industrial camera. The coaxial light source assembly is mounted on the end of the fixed-focus lens facing the precision stage. The precision stage is used to support the bearing under test, and its supporting surface is perpendicular to the optical axis of the area array industrial camera. The coaxial light source assembly has a built-in 45° beam splitter, which is used to convert the horizontal incident light into vertical incident light parallel to the camera's optical axis, which vertically illuminates the curved surface of the outer metal ring of the bearing under test. The area array industrial camera is used to acquire an image of the bearing surface.

[0013] Furthermore, the preprocessing involves: filtering the acquired bearing surface images to remove blurry images, unifying the image size through bilinear interpolation, and normalizing the pixel values.

[0014] Furthermore, the processing method for the backbone network is as follows: The input initial feature map is copied into two paths, one fed to the multi-scale parallel convolution branch and the other fed to the micro-defect sensitive branch.

[0015] The multi-scale parallel convolution branch splits the input initial feature map along the channel dimension and extracts fine-grained features, local contextual features, and wide-domain semantic features using 1×1, 3×3, and 5×5 convolutional kernels, respectively. These features are then concatenated along the channel dimension to generate the multi-scale feature map.

[0016] The micro-defect sensitive branch passes the input initial feature map sequentially through 1×1 convolution dimensionality reduction, a BN layer, ReLU activation, 3×3 convolution feature extraction, a BN layer, and ReLU activation, and then generates a spatial attention weight map with values in the range [0,1] through 1×1 convolution and Sigmoid activation. The spatial attention weight map is multiplied element-wise with the original input initial feature map to yield the sensitive feature map.

[0017] The multi-scale feature map and the sensitive feature map are concatenated along the channel dimension and fused via 1×1 convolution; the result is then combined with the original input feature map through a residual connection to output the backbone-level features.
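The dual-path processing in paragraphs [0014] to [0017] can be summarized compactly. The symbols below are assumptions introduced only for illustration, since the original symbols did not survive in this text: X is the input feature map, X_1 to X_3 are its channel-wise splits, M is the spatial attention weight map, F_ms and F_s are the multi-scale and sensitive feature maps, and Y is the backbone-level output. This is a sketch of the described data flow, not the patent's own notation.

```latex
\begin{align}
  (X_1, X_2, X_3) &= \mathrm{Split}_{\text{channel}}(X) \\
  F_{ms} &= \mathrm{Concat}\big(\mathrm{Conv}_{1\times 1}(X_1),\ \mathrm{Conv}_{3\times 3}(X_2),\ \mathrm{Conv}_{5\times 5}(X_3)\big) \\
  M &= \sigma\Big(\mathrm{Conv}_{1\times 1}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X))))))\big)\Big),
      \qquad M \in [0,1] \\
  F_{s} &= M \odot X \\
  Y &= X + \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(F_{ms},\ F_{s})\big)
\end{align}
```

Here σ denotes the Sigmoid function and ⊙ denotes element-wise multiplication with broadcasting over channels.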

[0018] Furthermore, the processing method for the neck network is as follows: The feature fusion module L_FFM receives the P2 high-resolution feature map from the shallow layers of the backbone network and the upsampled semantic feature map from the deep layers of the network.

[0019] A learnable weight vector is initialized whose length equals the total number of channels of the two feature maps. Softmax normalization is applied to the weight vector to obtain the normalized weight coefficient of each channel.

[0020] The normalized weight coefficients are broadcast to the two feature maps according to channel correspondence, and channel-level weighting is applied to obtain the two weighted feature maps.

[0021] The two weighted feature maps are concatenated along the channel dimension to output the fused feature map.
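The channel-weighted fusion in paragraphs [0018] to [0021] can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: feature maps are nested lists indexed as [channel][row][column], and the names `softmax`, `scale_channel`, and `weighted_concat` are hypothetical. In a trained model the raw weights would be learned parameters; here they are passed in as plain numbers.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scale_channel(channel, weight):
    """Multiply every pixel of one H x W channel by a scalar weight."""
    return [[weight * px for px in row] for row in channel]


def weighted_concat(shallow, deep, raw_weights):
    """Fuse two feature maps stored as [channel][row][col] nested lists.

    raw_weights is a hypothetical stand-in for the learnable weight vector:
    one entry per channel of `shallow` followed by one per channel of `deep`.
    Returns the fused map and the normalized per-channel coefficients.
    """
    if len(raw_weights) != len(shallow) + len(deep):
        raise ValueError("need one weight per channel of both inputs")
    coeffs = softmax(raw_weights)  # coefficients sum to 1 over all channels
    n_shallow = len(shallow)
    weighted_shallow = [scale_channel(ch, w)
                        for ch, w in zip(shallow, coeffs[:n_shallow])]
    weighted_deep = [scale_channel(ch, w)
                     for ch, w in zip(deep, coeffs[n_shallow:])]
    # Channel-wise concatenation of the two weighted maps.
    return weighted_shallow + weighted_deep, coeffs
```

With all raw weights equal, every channel receives the same coefficient, so the module reduces to a uniformly scaled concatenation; training would move the weights away from that starting point.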

[0022] Furthermore, the method for obtaining the enhanced feature map is as follows: Global average pooling is performed on the input feature map along the height and width directions respectively to generate projection vectors in the height and width directions.

[0023] Each of the two projection vectors is divided evenly into four groups along the channel dimension. The groups are processed by one-dimensional convolutional layers with kernel sizes of 3, 5, 7, and 9, concatenated, and passed through Sigmoid activation to generate a height-direction spatial attention mask and a width-direction spatial attention mask. The two masks are multiplied element-wise with the input feature map to output the spatially calibrated feature map.
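The spatial calibration step in paragraphs [0022] and [0023] can be written out as follows. The symbols are assumptions made for illustration, because the original notation is missing from this text: X is the input feature map with C channels, height H, and width W; z^h and z^w are the directional projection vectors; A^h and A^w are the two attention masks; and X_s is the spatially calibrated output. The subscript g indexes the four channel groups.

```latex
\begin{align}
  z^{h}_{c}(i) &= \frac{1}{W}\sum_{j=1}^{W} X_{c}(i,j),
  &
  z^{w}_{c}(j) &= \frac{1}{H}\sum_{i=1}^{H} X_{c}(i,j) \\
  A^{h} &= \sigma\Big(\mathrm{Concat}_{g=1}^{4}\ \mathrm{Conv1d}_{k_g}\big(z^{h}_{g}\big)\Big),
  &
  A^{w} &= \sigma\Big(\mathrm{Concat}_{g=1}^{4}\ \mathrm{Conv1d}_{k_g}\big(z^{w}_{g}\big)\Big),
  \qquad (k_1,k_2,k_3,k_4) = (3,5,7,9) \\
  X_{s} &= A^{h} \odot A^{w} \odot X
\end{align}
```

The masks A^h and A^w are broadcast along the width and height axes respectively before the element-wise products are taken.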

[0024] The spatially calibrated feature map is downsampled, and then the query matrix Q, key matrix K, and value matrix V are generated through three parallel 1×1 convolutional layers; Q and... The dot product of V is then processed by Softmax to generate a channel correlation matrix. The channel correlation matrix is then used to perform weighted aggregation on V to obtain the channel context features.

[0025] The channel context features are upsampled to the same size as the spatially calibrated feature map, and channel weights are generated by Sigmoid activation. The channel weights are multiplied element-wise with the spatially calibrated feature map to output the enhanced feature map after feature calibration.

[0026] Furthermore, the training and optimization method for the MSF-YOLO network model is as follows: The initial feature map is randomly rotated, horizontally flipped, its brightness is randomly perturbed, and Gaussian noise is added to construct a defect dataset. The defect dataset is then divided into training, validation, and test sets according to the proportions.

[0027] Images from the training set are input into the MSF-YOLO network model, and a joint loss function is calculated. The joint loss function includes Varifocal Loss for classification accuracy evaluation, CIoU Loss for bounding box regression accuracy evaluation, and Distribution Focal Loss for bounding box distribution optimization.
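Of the three loss terms named in paragraph [0027], the CIoU term is the easiest to illustrate in isolation. The sketch below uses the standard published CIoU definition for a single pair of boxes; it is not taken from the patent, the function name `ciou_loss` and the corner-coordinate box format are assumptions, and Varifocal Loss and Distribution Focal Loss are omitted.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) corner coordinates."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Squared centre distance, normalized by the squared diagonal of the
    # smallest box that encloses both inputs.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    diag_sq = enclose_w ** 2 + enclose_h ** 2 + eps
    dist_sq = ((px1 + px2 - tx1 - tx2) ** 2
               + (py1 + py2 - ty1 - ty2) ** 2) / 4.0

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + dist_sq / diag_sq + alpha * v
```

The loss is close to zero for identical boxes and stays informative for disjoint boxes, because the centre-distance term still changes as the prediction moves toward the target even when the IoU is zero.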

[0028] The Stochastic Gradient Descent (SGD) optimizer is used to update the convolutional kernel weights, channel fusion weights of the feature fusion module L_FFM, and attention parameters of the collaborative attention module C2f_SCSA through backpropagation based on the joint loss value.

[0029] Training runs for 200 epochs, with model performance checked on the validation set, until the loss converges, yielding the trained MSF-YOLO network model.

[0030] Furthermore, the detection method using the multi-scale decoupled detection head is as follows: The coaxially illuminated image of the bearing to be inspected is input into the trained MSF-YOLO network model and processed stage by stage by the four MSPConv modules of the backbone network to extract multi-scale feature maps.

[0031] The multi-scale decoupled detection head performs independent classification and regression on feature maps at four scales: P2, P3, P4, and P5, predicting the category confidence and bounding box coordinates of defects.

[0032] The prediction results for all scales are summarized, and the non-maximum suppression (NMS) algorithm is used to remove redundant detection boxes with excessive overlap. Finally, the category, confidence level and precise location coordinates of the bearing surface defects are output.
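The non-maximum suppression step in paragraph [0032] can be sketched as a greedy procedure. The tuple layout, the function names, the per-class suppression, and the 0.5 overlap threshold are all assumptions chosen for illustration; the text does not state which NMS variant or threshold is used.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS.

    detections: list of (x1, y1, x2, y2, score, class_id) tuples pooled
    from all detection scales. Returns the kept detections, highest
    score first.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        # Keep a box unless it overlaps too much with an already kept
        # box of the same class.
        overlaps = any(
            det[5] == k[5] and box_iou(det[:4], k[:4]) > iou_threshold
            for k in kept
        )
        if not overlaps:
            kept.append(det)
    return kept
```

In this sketch two boxes of different classes never suppress each other, so a scratch that crosses a groove would still yield both detections.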

[0033] Furthermore, the bearing surface images include at least one type of defect such as abrasions, scratches, grooves, elongated cracks, and electrolytic erosion holes.

[0034] The present invention also provides an electronic device, comprising: a memory for storing a computer program; and

[0035] a processor that, when executing the computer program stored in the memory, implements the above-described method for detecting surface defects in bearings.

[0036] Compared with the prior art, the technical solution provided by this invention has the following advantages. By constructing the MSF-YOLO network model and using the dual-path processing of the multi-scale partial convolution module in the backbone network, the multi-size parallel convolution branches capture full-scale defect features simultaneously, which solves the insufficient extraction of cross-scale bearing defect features by single-scale convolution in existing technologies. The spatial attention weight map of the micro-defect sensitive branch selectively enhances micro-defect features and, combined with residual connections, prevents their loss in deep networks, which addresses the high false negative rate for micro-defects. The neck network adaptively fuses shallow P2 high-resolution detail features with deep semantic features through learnable channel weights in the feature fusion module, which solves the lack of adaptability and the dilution of key details by redundant information in existing feature fusion techniques. The spatial attention mask of the collaborative attention module then locks the defect location and the channel self-attention mechanism filters noise, suppressing pseudo-features from background interference such as oil stains and water stains at industrial sites; this addresses the weak anti-interference ability and high false alarm rate under complex backgrounds. Finally, classification and regression by the multi-scale decoupled detection head output the category, confidence level, and location coordinates of bearing surface defects. Together these measures overcome the core technical bottlenecks of existing bearing surface defect detection and improve detection accuracy, robustness, and effectiveness.

[0037] Other advantages, objectives and features of the present invention will become apparent in part from the following description, and in part to those skilled in the art through study and practice of the invention.

Brief Description of the Drawings

[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0039] Figure 1 shows the light source illumination structures, where (a) is oblique illumination and (b) is coaxial illumination.

[0040] Figure 2 is a schematic diagram of the overall device structure.

[0041] Figure 3 shows the distribution of defect samples in the dataset, where (a) is a bar chart of the number of defect samples and (b) is a two-dimensional histogram of the joint distribution of sample width and height in the bearing defect dataset.

[0042] Figure 4 shows bearing defect samples from the dataset, where (a) represents wear; (b) represents scratches; (c) represents grooves; (d) represents long, thin cracks; and (e) represents electrolytic erosion holes.

[0043] Figure 5 is a diagram of the MSF-YOLO architecture.

[0044] Figure 6 is a diagram of the MSPConv architecture.

[0045] Figure 7 is a diagram of the C2f_SCSA structure.

[0046] Figure 8 is a diagram of the FFM_concat structure.

[0047] Figure 9 shows the evolution of the evaluation metrics during model training on the self-built dataset.

[0048] Figure 10 is a comparison chart of the mAP0.5 curves of the algorithms.

[0049] Figure 11 is a visualization of the detection results, where (a) represents abrasions; (b) represents scratches; (c) represents grooves; (d) represents elongations; (e) represents holes; (I) is the original image; (II) is the YOLOv8n detection result; and (III) is the MSF-YOLO detection result. Figure 12 is a visualization of the detection results, where (I) is the original image; (II) is the YOLOv8n detection result; and (III) is the MSF-YOLO detection result. The areas marked by red circles indicate the locations of missed defects.

[0050] Figure 13 is a comparison of the heat maps of MSF-YOLO and YOLOv8n, where (b) represents abrasion; (c) represents scratch; (d) represents groove; (e) represents elongation; (i) represents hole; (ii) represents the original image; and (iii) represents the YOLOv8n detection result.

Detailed Description

[0051] The following detailed description of a specific embodiment of the present invention is provided in conjunction with the accompanying drawings. However, it should be understood that the scope of protection of the present invention is not limited to the specific embodiment.

[0052] In the description of this invention, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "axial," "radial," and "circumferential" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing the technical solution of this invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.

[0053] In the description of the embodiments of the present invention, unless otherwise stated, "a plurality of" means two or more.

[0054] As shown in Figures 1 to 13, the present invention provides a method and electronic device for detecting surface defects in bearings, comprising: acquiring a bearing surface image and standardizing it to obtain an initial feature map.

[0055] The initial feature map is input into the MSF-YOLO network model. The backbone network performs dual-path processing through a multi-scale partial convolution module. One path is a multi-scale parallel convolution branch that captures full-scale defect features to obtain a multi-scale feature map. The other path generates a spatial attention weight map by entering a micro-defect sensitive branch to enhance micro-defect features and obtain a sensitive feature map. The multi-scale feature map and the sensitive feature map are concatenated and fused, and then fused through residual connections to obtain the backbone layer features. The neck network first introduces learnable channel weights into the backbone layer features through a feature fusion module, adaptively fusing shallow P2 high-resolution detail features and deep semantic features in the backbone network to obtain a fused feature map. Then, the collaborative attention module generates spatial attention masks along the height and width dimensions of the fused feature map to lock the defect location. After filtering noise through a channel self-attention mechanism, the enhanced feature map is output. After classification and regression by a multi-scale decoupled detection head, the category, confidence, and location coordinates of the bearing surface defect are output.

[0056] This invention systematically improves the YOLOv8 model as follows. For feature enhancement in the backbone network, the MSPConv module replaces the standard Conv module in four downsampling layers (layers 1, 3, 5, and 7); it strengthens feature extraction through multi-scale convolution branches and introduces a micro-defect sensitive branch to improve the detection of subtle defects. For feature fusion network optimization, the SCSA attention mechanism is embedded into all C2f modules of the neck network (layers 18, 21, 24, and 27) to construct the C2f_SCSA module, which performs channel-spatial collaborative feature calibration and sharpens the model's focus on key defect regions. For the feature fusion mechanism, an FFM module is introduced at the P2 high-resolution feature fusion point (layer 17) to form the L_FFM module, which performs adaptive weighted feature fusion through learnable weights and improves the utilization of multi-scale features. As shown in Figure 5, these three improvements form a complete feature enhancement path that improves the detection accuracy of bearing surface defects while maintaining real-time performance.

[0057] In the embodiments provided by the present invention, the bearing surface image acquisition device includes an area array industrial camera, a fixed-focus lens, a coaxial light source assembly, and a precision stage.

[0058] The fixed-focus lens is aligned with the optical axis of the area array industrial camera. The coaxial light source assembly is mounted on the end of the fixed-focus lens facing the precision stage. The precision stage is used to support the bearing under test, and its supporting surface is perpendicular to the optical axis of the area array industrial camera. The coaxial light source assembly has a built-in 45° beam splitter, which is used to convert the horizontal incident light into vertical incident light parallel to the camera's optical axis, which vertically illuminates the curved surface of the outer metal ring of the bearing under test. The area array industrial camera is used to acquire an image of the bearing surface.

[0059] The coaxial light source assembly is a high-brightness white LED coaxial light source with a built-in diffuse reflector; the area array industrial camera has a resolution of 2448×2048, and the fixed-focus lens has a focal length of 8mm.

[0060] Specifically, by utilizing the characteristics of coaxial optical paths, an imaging model based on the relationship between surface tilt angle and lens aperture was established, and the gray-scale attenuation mechanism of defect areas was quantified.

[0061] Reflection angle doubling model: In a coaxial optical path, the incident ray L is perpendicular to the ideal bearing tangent plane. Let α be the tilt angle of the defect surface (such as the inner wall of a tiny pit) relative to the ideal plane. According to the law of reflection, the angle by which the reflected ray R deviates from the optical axis satisfies the doubling relationship:

$$\angle(R,\ \text{optical axis}) = 2\alpha$$

[0062] Physical meaning: This formula shows that coaxial optical paths are extremely sensitive to changes in surface slope. Even a tiny tilt α on a defective surface deflects the reflected light path by twice that angle, 2α, which is the physical basis for the amplification of defect features.

[0063] Light intensity threshold discrimination model: Consider the numerical aperture angle of the industrial camera lens, i.e., the maximum half-angle of light that the lens can receive. The pixel grayscale value received by the imaging sensor depends on whether the reflected light falls outside the lens aperture. The imaging criterion is as follows:

[0064] When the tilt angle α of the defect surface is large enough that the deflection 2α exceeds this aperture angle, the reflected light exits completely outside the lens entrance pupil.

[0065] Technical effect: Because the aperture angle of a fixed-focus lens is typically small (approximately 5° to 10°), even a small change in slope (>2.5°) at the defect edge can cause the received light intensity to drop sharply to zero, producing a high-contrast image in which dark defects stand out against a bright background. This effectively eliminates the reflection interference seen under traditional oblique lighting.
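The two imaging models above reduce to a simple geometric test, sketched here. The function names are hypothetical, and the hard on/off cutoff is an idealization that follows from the doubling relationship and the quoted figures (a 5° aperture half-angle giving a 2.5° slope limit); a real lens would show a gradual falloff near the limit.

```python
def deflection_angle_deg(tilt_deg):
    """Angle between the reflected ray and the optical axis.

    Under coaxial illumination a surface tilted by `tilt_deg` deflects
    the reflected ray by twice that angle (law of reflection).
    """
    return 2.0 * abs(tilt_deg)


def reflection_enters_lens(tilt_deg, aperture_half_angle_deg):
    """True if the reflected ray still falls inside the lens aperture.

    Idealized hard threshold: the pixel is bright when this returns True
    and dark when it returns False.
    """
    return deflection_angle_deg(tilt_deg) <= aperture_half_angle_deg


if __name__ == "__main__":
    # With a 5 degree aperture half-angle, slopes above 2.5 degrees go dark.
    for tilt in (0.0, 1.0, 2.5, 3.0, 10.0):
        state = "bright" if reflection_enters_lens(tilt, 5.0) else "dark"
        print(f"tilt {tilt:4.1f} deg -> {state}")
```

Under this idealization a smaller aperture half-angle makes the system sensitive to shallower defect slopes, at the cost of also darkening gently curved regions of a defect-free surface.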

[0066] As shown in Figure 1, a multi-scale image acquisition scheme was adopted to construct a bearing defect dataset suited to industrial scenarios. By adjusting the relative pose of the monocular camera and the stage, high-resolution close-up images covering few azimuths were acquired at close range, while complex multi-azimuth scene images were acquired at a distance. This scheme balances the clarity of defect features against the complexity of multi-object distributions in the data, providing a comprehensive and reliable basis for model training. The bearing defect dataset contains images captured under different environmental conditions: one subset was acquired by the bearing surface defect visual inspection system to obtain clear defect depictions, and the other was captured under complex indoor multi-source lighting to simulate challenging industrial environments. When building a visual inspection system for bearing surface defects, choosing an appropriate lighting scheme is crucial to image quality. Because the bearing under test presents a highly reflective curved metal side, traditional oblique lighting easily produces strong, uneven highlight bands on the arc surface that severely interfere with defect features. A comparative experiment was therefore conducted on two lighting schemes for highly reflective curved surfaces. The first is oblique illumination, shown in Figure 1(a): two light sources are placed symmetrically to illuminate the bearing obliquely from above, with the camera directly above the bearing. In this setup, the specular highlight from the curved side of the bearing enters the camera lens directly, forming a concentrated arc-shaped bright spot that easily conceals the true shape of small defects such as holes. The second is coaxial illumination [51], whose working principle is shown in Figure 1(b): light emitted from the light source is reflected by a beam splitter set at a specific angle on the camera's optical axis and redirected to illuminate the side of the bearing along (or approximately along) the optical axis. Because the illumination direction is closely aligned with the viewing direction, the light produces a uniform, soft reflection in the relatively flat curved area, which the camera captures clearly. Conversely, the geometric changes at a defect disrupt this regular reflection and create a contrast between light and shadow, making the defect stand out against a uniform background.

[0067] Based on the above analysis, the coaxial illumination scheme was chosen because it effectively suppresses irregular strong glare on the curved side of the bearing and yields a uniformly illuminated image. As shown in Figure 2, the image acquisition system in this study mainly consists of a Basler acA2440-75um area-scan camera, an f = 8 mm fixed-focus lens, a coaxial illumination unit, and its controller. This configuration ensures that defects on the outer ring of the bearing are captured clearly and consistently.

[0068] As shown in Figure 3 and Figure 4, the bearing acquisition system was used to collect defect images and construct a bearing surface defect dataset of 2194 images covering five common defect types: wear, scratches, grooves, elongations, and holes. The dataset aims to simulate a real industrial acquisition environment, with images ranging from close-ups of single bearings to densely distributed multiple bearings. It was randomly split into training and validation sets in an 8:2 ratio, with 1755 images in the training set and 438 images in the validation set. The dataset presents challenges such as varied defect morphologies, significant size variations, and positive/negative sample imbalance. This setup closely replicates the complexity of real industrial environments and provides a platform for validating the robustness and generalization ability of the algorithm under realistic conditions.
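The 8:2 random split can be illustrated with a minimal Python sketch. It assumes the images are identified by hypothetical file names, that the training count is rounded down, and that a fixed seed is used; none of these details are stated in the text. Under these assumptions 2194 images yield 1755 training and 439 validation images, whereas the stated counts of 1755 and 438 sum to 2193, one fewer than the stated total.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of items and split it into (train, val)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)  # assumption: round down
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the 2194 dataset images.
images = [f"bearing_{i:04d}.png" for i in range(2194)]
train_set, val_set = split_dataset(images)
print(len(train_set), len(val_set))
```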

[0069] In the embodiments provided by the present invention, the preprocessing is as follows: the acquired bearing surface images are filtered to remove blurry images, the image size is unified by bilinear interpolation, and the pixel values are normalized.

[0070] Specifically, the process involves: filtering the acquired raw bearing surface images to remove blurry images; using a bilinear interpolation algorithm to resize every image to the size required by the model input (e.g., ... pixels); and normalizing the pixel values of the image.
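The three preprocessing steps can be sketched in plain Python on grayscale images stored as nested lists. This is an illustration under assumptions, not the patented implementation: the blur criterion (variance of a 4-neighbour Laplacian), the corner-aligned sampling grid, and division by 255 for normalization are all choices made here, since the text names none of them.

```python
def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian; low values suggest blur (assumed criterion)."""
    h, w = len(img), len(img[0])
    vals = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def resize_bilinear(img, out_h, out_w):
    """Resize a 2D image by bilinear interpolation on a corner-aligned grid."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def normalize(img, max_value=255.0):
    """Scale pixel values to [0, 1] (assumed normalization)."""
    return [[p / max_value for p in row] for row in img]
```

In use, an image would be discarded when `laplacian_variance` falls below a threshold chosen for the camera setup, and the remaining images would pass through `resize_bilinear` and then `normalize`.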

[0071] In the embodiments provided by this invention, the processing method of the backbone network is as follows: the input initial feature map is copied into two paths, one fed to the multi-scale parallel convolution branch and the other to the small-defect-sensitive branch.

[0072] The multi-scale parallel convolution branch splits the input initial feature map along the channel dimension and extracts fine-grained features, local contextual features, and wide-area semantic features using 1×1, 3×3, and 5×5 convolution kernels, respectively. These features are then concatenated along the channel dimension to generate the multi-scale feature map.

[0073] The small-defect-sensitive branch passes the input initial feature map sequentially through a 1×1 convolution for dimensionality reduction, a BN layer, ReLU activation, a 3×3 convolution for feature extraction, a BN layer, and ReLU activation; a 1×1 convolution followed by Sigmoid activation then generates a spatial attention weight map with values in [0,1]. The spatial attention weight map is multiplied element-wise with the original input initial feature map to yield the sensitive feature map.

[0074] The multi-scale feature map and the sensitive feature map are concatenated along the channel dimension and fused by a 1×1 convolution; the result is combined with the original input feature map through a residual connection to output the backbone-level features.
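The dual-path computation of paragraphs [0071] to [0074] can be summarized in the equations below. All symbols are assumptions introduced here for illustration, because the original symbols did not survive in this text: $X$ is the input feature map, $X_1, X_2, X_3$ are its channel-wise splits, $\sigma$ is the Sigmoid function, $\odot$ is element-wise multiplication, and $[\cdot\,;\cdot]$ is channel concatenation. The residual sum assumes that the fused output has the same shape as $X$; how the module handles the stride of a downsampling layer is not stated here.

```latex
\begin{aligned}
[X_1;\ X_2;\ X_3] &= \mathrm{Split}(X) \\
F_{ms} &= \left[\mathrm{Conv}_{1\times1}(X_1);\ \mathrm{Conv}_{3\times3}(X_2);\ \mathrm{Conv}_{5\times5}(X_3)\right] \\
U &= \mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{3\times3}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{1\times1}(X)\big)\big)\big)\big)\big) \\
A &= \sigma\big(\mathrm{Conv}_{1\times1}(U)\big), \qquad A \in [0,1] \\
F_{s} &= A \odot X \\
Y &= \mathrm{Conv}_{1\times1}\big([F_{ms};\ F_{s}]\big) + X
\end{aligned}
```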

[0075] Specifically, during the downsampling stage of the backbone network (layers 1, 3, 5, and 7), the feature maps are no longer processed through standard convolutions but are instead input into the MSPConv module for processing, in order to simultaneously acquire multi-scale and minute defect features. The signal processing flow of this module is as follows:

[0076] Feature input and splitting: the initial feature map received by the module is first copied into two streams: one stream enters the multi-scale parallel convolution branch, and the other enters the small-defect-sensitive branch (SOM).

[0077] Data processing in the multi-scale parallel branch: in this branch, the input features are further split along the channel dimension and fed into three parallel convolutional layers: the first path applies a 1×1 convolution to extract pixel-level fine-grained features; the second path applies a 3×3 convolution to extract local contextual features; the third path applies a 5×5 convolution to extract wide-area semantic features.

[0078] Subsequently, the three output features are concatenated along the channel dimension to generate the multi-scale feature map.

[0079] Data processing in the small-defect-sensitive branch: in this branch, the input features are processed by an attention mechanism that aims to enhance the contrast of small targets.

[0080] First, the data passes through a 1×1 convolution for dimensionality reduction, batch normalization (BN), ReLU activation, a 3×3 convolution for feature extraction, BN, and ReLU activation to form intermediate features.

[0081] Then, a 1×1 convolution and the Sigmoid activation function map the intermediate features into a spatial attention weight map with values in [0,1].

[0082] Finally, the weight map is multiplied element-wise with the original input to output the enhanced feature map.

[0083] Module output generation: the multi-scale features and the enhanced features are concatenated, fused across channels by a 1×1 convolution, and summed with the module input through a residual connection to output the final backbone-level features, which are passed to the next layer of the network. Although the fused feature map retains details, it may contain noise from residual oil or water stains in coaxial-light imaging. The data stream therefore enters the C2f_SCSA module, where it is "cleaned" and calibrated by a collaborative attention mechanism.

[0084] Multi-scale calibration of the spatial dimension (Stage 1): Input: the feature map X output by the previous stage is received.

[0085] Processing: first, X is projected by average pooling along the height H and width W directions, respectively.

[0086] Then, the projection vectors are grouped along the channel dimension and fed into one-dimensional convolutional layers with different kernel sizes to capture spatial dependencies across different spans.

[0087] Mask generation: the convolution output is activated by a Sigmoid function to generate a height attention map and a width attention map.

[0088] Spatial weighting: the height and width attention maps are multiplied with the input feature X to obtain the spatially enhanced feature. This step amplifies the feature response of the defect area (dark spot) and suppresses background noise.
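The spatial stage can be sketched in plain Python on a feature map stored as nested lists (channels × height × width). This is a minimal sketch under assumptions: the kernel sizes 3, 5, 7, and 9 and the four channel groups are taken from paragraph [0106]; the uniform averaging kernels are placeholders for learned one-dimensional convolution weights; and zero padding is an assumed boundary treatment. The function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1d_same(vec, kernel):
    """1D cross-correlation with zero padding; output has the input length."""
    pad = len(kernel) // 2
    n = len(vec)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - pad
            if 0 <= j < n:
                acc += weight * vec[j]
        out.append(acc)
    return out

def spatial_attention(x, kernel_sizes=(3, 5, 7, 9)):
    """x: C x H x W nested lists, with C divisible by len(kernel_sizes)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    if c % len(kernel_sizes) != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    group = c // len(kernel_sizes)
    out = []
    for ci in range(c):
        k = kernel_sizes[ci // group]
        kernel = [1.0 / k] * k  # placeholder for learned weights (assumption)
        # Average pooling along the width gives the height projection, and vice versa.
        proj_h = [sum(x[ci][y]) / w for y in range(h)]
        proj_w = [sum(x[ci][y][xx] for y in range(h)) / h for xx in range(w)]
        a_h = [sigmoid(v) for v in conv1d_same(proj_h, kernel)]
        a_w = [sigmoid(v) for v in conv1d_same(proj_w, kernel)]
        # Weight every pixel by its row mask and its column mask.
        out.append([[x[ci][y][xx] * a_h[y] * a_w[xx] for xx in range(w)]
                    for y in range(h)])
    return out
```

Because both masks lie in (0, 1), the output never exceeds the input in magnitude; positions whose row and column projections respond strongly are attenuated least.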

[0089] Channel-dimension self-attention calibration (Stage 2): Input: the spatially enhanced feature is received.

[0090] Mapping: after the feature is downsampled, a query matrix Q, a key matrix K, and a value matrix V are generated through linear mapping.

[0091] Correlation calculation: the covariance matrix between channels is calculated by dot-product operations, and the channel attention map is generated by Softmax.

[0092] Channel reconstruction: the channel attention map is used to weight and aggregate the value matrix V, generating a channel weight vector.

[0093] Collaborative output: the channel weights are applied to the spatially enhanced feature to output the final calibrated feature. This feature map is passed to the detection head for final bounding box regression and class determination.
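The channel stage can likewise be sketched in plain Python. Several simplifications are assumptions made here for brevity: the downsampling step is omitted, identity projections stand in for the learned mappings that produce Q, K, and V, and the aggregated context is reduced to one weight per channel by averaging before the Sigmoid. The function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def channel_self_attention(x):
    """x: C x H x W nested lists. Returns (calibrated feature, channel weights)."""
    # Flatten each channel into a vector. Q, K and V are all taken to be this
    # vector (identity projections replace the learned mappings).
    flat = [[val for row in ch for val in row] for ch in x]
    q_mat = k_mat = v_mat = flat
    n = len(flat[0])
    # Channel-to-channel similarity by dot products, then row-wise Softmax.
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k_mat] for qi in q_mat]
    attn = [softmax(row) for row in scores]
    # Weighted aggregation of V: one context vector per channel.
    context = [[sum(attn[i][j] * v_mat[j][p] for j in range(len(v_mat)))
                for p in range(n)] for i in range(len(v_mat))]
    # Assumed reduction: average each context vector, then apply Sigmoid.
    weights = [1.0 / (1.0 + math.exp(-sum(c) / n)) for c in context]
    out = [[[wgt * val for val in row] for row in ch] for wgt, ch in zip(weights, x)]
    return out, weights
```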

[0094] In the embodiments provided by the present invention, the processing method of the neck network is as follows: the feature fusion module L_FFM receives the P2 high-resolution feature map from the shallow layers of the backbone network and the upsampled semantic feature map from the deep layers of the network.

[0095] A learnable weight vector is initialized, whose length equals the total number of channels of the two feature maps; Softmax normalization is applied to the weight vector to obtain the normalized weight coefficient of each channel.

[0096] The normalized weight coefficients are broadcast to the two feature maps according to channel correspondence, and the channel-level weighting operation is completed to obtain two weighted feature maps.

[0097] The two weighted feature maps are concatenated along the channel dimension to output the fused feature map.

[0098] Specifically, when the network enters the neck stage, to recover the subtle defect information lost during downsampling, the data stream is introduced into a high-resolution P2 layer, and learnable weighted fusion is performed through the L_FFM module. The signal processing flow of this module is as follows:

[0099] Feature input: the module receives two features to be fused: one is the P2 high-resolution feature from the shallow layers of the backbone network (containing rich texture details); the other is the semantic feature obtained by upsampling the deep layers (containing category information).

[0100] Weight generation and normalization: the network internally maintains a learnable weight vector parameter whose length equals the total number of channels. During each forward pass, the weight vector is first normalized by the Softmax mechanism to compute the current normalized weight coefficient of each channel, where a tiny constant (such as 0.0001) is used to prevent a zero denominator.

[0101] Channel-level weighting operation: the normalized weight coefficients are broadcast to the respective input features according to channel correspondence.

[0102] Each channel of the shallow feature is multiplied by its corresponding weight to obtain the weighted shallow feature; each channel of the deep feature is multiplied by its corresponding weight to obtain the weighted deep feature.

[0103] This process enables the network to automatically suppress the response of background channels and amplify the response of defect channels according to the defect scale.

[0104] Fusion output: the two weighted features are concatenated to output the fused feature map, which serves as the input for subsequent processing.
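The L_FFM flow above can be sketched in plain Python, with each feature map stored as a list of channels. This is an illustration under assumptions: the function name `l_ffm_fuse` is hypothetical, and the placement of the small constant in the Softmax denominator is a guess, because the normalization formula itself did not survive in this text.

```python
import math

def l_ffm_fuse(shallow, deep, weights, eps=1e-4):
    """Channel-weighted concatenation of two feature maps.

    shallow, deep: lists of channels, each channel a 2D list (H x W).
    weights: one learnable scalar per channel of shallow + deep.
    """
    channels = shallow + deep
    if len(weights) != len(channels):
        raise ValueError("need one weight per channel")
    # Softmax over the weight vector; eps in the denominator follows the
    # remark about avoiding a zero denominator (assumed placement).
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps) + eps
    coeffs = [e / total for e in exps]
    # Broadcast each coefficient over its channel; the list order keeps
    # the shallow channels first, which realizes the concatenation.
    fused = [[[c * val for val in row] for row in ch]
             for c, ch in zip(coeffs, channels)]
    return fused, coeffs

p2 = [[[1.0, 2.0], [3.0, 4.0]]]          # one shallow channel, 2 x 2
up = [[[10.0, 10.0], [10.0, 10.0]]]      # one upsampled deep channel
fused, coeffs = l_ffm_fuse(p2, up, weights=[0.0, 0.0])
```

With equal weights both channels receive the same coefficient, just under 0.5 because of the small constant; during training the weights would diverge so that informative channels dominate the fused map.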

[0105] In the embodiments provided by the present invention, the method for obtaining the enhanced feature map is as follows: global average pooling is performed on the input initial feature map along the height and width directions, respectively, to generate projection vectors in the height and width directions.

[0106] The two projection vectors are each divided evenly into four groups along the channel dimension. The groups are processed by one-dimensional convolutional layers with kernel sizes of 3, 5, 7, and 9, respectively, and then concatenated; Sigmoid activation generates a spatial attention mask in the height direction and a spatial attention mask in the width direction. The two masks are each multiplied element-wise with the initial input feature map to output the spatially calibrated feature map.

[0107] The spatially calibrated feature map is downsampled, and the query matrix Q, key matrix K, and value matrix V are generated through three parallel 1×1 convolutional layers. The dot product of Q and K is processed by Softmax to generate a channel correlation matrix, which is then used to perform weighted aggregation on V to obtain the channel context features.

[0108] The channel context features are upsampled to match the size of the spatially calibrated feature map, and channel weights of the same size are generated by Sigmoid activation. The channel weights are multiplied element-wise with the spatially calibrated feature map to output the enhanced feature map after feature calibration.

[0109] In the embodiments provided by this invention, the method for training and optimizing the MSF-YOLO network model is as follows: To expand sample diversity, the initial feature map is randomly rotated, horizontally flipped, and subjected to random brightness perturbation, and Gaussian noise is added to construct a defect dataset. The defect dataset is then divided into training, validation, and test sets according to the proportions.
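The four augmentations can be sketched on a grayscale image stored as a nested list. The sketch makes several assumptions that the text does not specify: rotation is restricted to multiples of 90 degrees to keep the code short, and the flip probability, brightness range, and noise standard deviation are illustrative values. In a detection dataset the bounding-box labels would need the same geometric transforms, which this sketch omits.

```python
import random

def augment(img, rng, brightness=0.2, noise_sigma=5.0):
    """img: 2D list of gray values in [0, 255]. Returns an augmented copy."""
    out = [row[:] for row in img]
    # Random rotation, restricted here to multiples of 90 degrees.
    for _ in range(rng.randrange(4)):
        out = [list(r) for r in zip(*out[::-1])]
    # Random horizontal flip.
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    # Random brightness gain followed by additive Gaussian noise, clipped to range.
    gain = 1.0 + rng.uniform(-brightness, brightness)
    return [[min(255.0, max(0.0, p * gain + rng.gauss(0.0, noise_sigma)))
             for p in row] for row in out]

rng = random.Random(0)
sample = [[0.0, 64.0, 128.0], [192.0, 255.0, 32.0]]
augmented = augment(sample, rng)
```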

[0110] Images from the training set are input into the MSF-YOLO network model, and a joint loss function is calculated. The joint loss function includes Varifocal Loss for classification accuracy evaluation, CIoU Loss for bounding box regression accuracy evaluation, and Distribution Focal Loss for bounding box distribution optimization.
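Of the three loss terms, the CIoU term has a compact closed form and can be sketched directly. The function below follows the commonly used CIoU definition (IoU minus a normalized centre-distance penalty minus an aspect-ratio penalty); the box format `(x1, y1, x2, y2)` and the small stabilizing constant are assumptions. How the three terms are weighted in the joint loss is not stated in the text, so no weighting is shown.

```python
import math

def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union
    # Squared centre distance over the squared diagonal of the enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

The loss is close to zero for identical boxes and exceeds 1 for disjoint boxes, because the centre-distance penalty still provides a gradient when the IoU is zero.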

[0111] The Stochastic Gradient Descent (SGD) optimizer is used to update the convolutional kernel weights, channel fusion weights of the feature fusion module L_FFM, and attention parameters of the collaborative attention module C2f_SCSA through backpropagation based on the joint loss value.

[0112] The number of training epochs is set to 200, and model performance is validated on the validation set until the loss converges, thus obtaining the trained MSF-YOLO network model.

[0113] In the embodiments provided by this invention, the detection method of the multi-scale decoupled detection head is as follows: the coaxial-light image of the bearing to be detected is input into the trained MSF-YOLO network model and processed step by step by the four MSPConv modules of the backbone network to extract multi-scale feature maps.

[0114] The multi-scale decoupled detection head performs independent classification and regression on feature maps at four scales: P2, P3, P4, and P5, predicting the category confidence and bounding box coordinates of defects.

[0115] The prediction results for all scales are summarized, and the non-maximum suppression (NMS) algorithm is used to remove redundant detection boxes with excessive overlap. Finally, the category, confidence level and precise location coordinates of the bearing surface defects are output.
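The final filtering step can be sketched as a greedy non-maximum suppression in plain Python. The detection format `(box, score, class_id)`, the IoU threshold of 0.5, and the choice to suppress only within the same class are assumptions; the text does not give these details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy class-wise NMS over (box, score, class_id) tuples."""
    kept = []
    # Visit detections from highest to lowest confidence.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box unless it overlaps too much with a kept box of its class.
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Hypothetical predictions gathered from all scales.
detections = [
    ((0, 0, 10, 10), 0.92, 0),
    ((1, 1, 11, 11), 0.80, 0),    # overlaps the first box, same class
    ((50, 50, 60, 60), 0.70, 1),
]
kept = nms(detections)
```

Here the second box has an IoU of 81/119 with the first and is removed, so two detections remain.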

[0116] In the embodiments provided by the present invention, the bearing surface image includes at least one defect type among scratches, grooves, elongated cracks, and electrolytic erosion holes.

[0117] Example 1: Hardware and software co-detection system. 1. System setup: A testing platform was built using a Basler acA2440-75um area-scan camera with an 8 mm fixed-focus lens.

[0118] A coaxial light source is installed in front of the lens, and its angle is adjusted so that the light strikes the side of the outer ring of the bearing perpendicularly.

[0119] The computing unit is an industrial computer equipped with an NVIDIA RTX 4060 graphics card, on which the MSF-YOLO algorithm of this invention is deployed.

[0120] 2. Data collection process: Scenario description: the side of the bearing under test has a tiny impact dent about 0.2 mm in diameter, and the bearing surface carries slight oil stains.

[0121] Comparison with the traditional solution: if an ordinary ring light is used, the pit is covered by the arc-shaped highlight band on the metal surface and the image is washed out to white, so the algorithm cannot recognize it.

[0122] Advantage of this invention: under coaxial illumination, the smooth surface of the bearing reflects light back to the lens and appears uniformly bright white, while light is scattered at tiny pits, which appear as clear black spots in the image. Oil stains, owing to their different reflectivity, appear as light gray interference.

[0123] 3. Algorithm processing: The image is input into the MSF-YOLO network.

[0124] The 1×1 branch and SOM branch of the MSPConv module keenly capture the high-frequency edge information of the black spots (pits).

[0125] The C2f_SCSA module uses channel attention analysis to identify light gray areas as oil stains (not defects), thereby suppressing the weight of these areas; at the same time, spatial attention locks onto black spot areas.

[0126] The L_FFM module ensures that blob features at high resolution are not lost during downsampling.

[0127] Final result: The system accurately outlined the 0.2mm pit with a confidence level of 0.92, and did not falsely report the oil stain as a defect.

[0128] Example 2: Visualization of training results. Experimental analysis was conducted under standardized training conditions. To ensure comparability, all experiments used the same parameter settings. Based on the summary and analysis of the experimental results, 200 epochs were determined to be suitable for training, as shown in Figure 9.

[0129] Example 3: Algorithm comparison experiment of different models. This study constructs a quantitative evaluation system along two dimensions: detection accuracy and model complexity. For detection accuracy, precision, recall, F1 score, and mean average precision (mAP) are selected as the core evaluation indicators. For computational efficiency, the number of model parameters is used as the main indicator of model complexity. Precision measures the correctness of the model's defect predictions, representing the proportion of correct predictions among all results predicted as defects. It is defined as: $P = \frac{TP}{TP + FP}$

[0130] Here, TP (true positives) is the number of detections that predict the class correctly and have an intersection-over-union (IoU) greater than or equal to a set threshold. FP (false positives) is the number of detections that are predicted as defects but are classified incorrectly or have an IoU below the threshold; the denominator TP + FP is the total number of positive predictions given by the model.

[0131] Recall measures the model's ability to find real defects, representing the proportion of true defects that are successfully detected. It is defined as: $R = \frac{TP}{TP + FN}$, where FN (false negatives) is the number of samples that actually contain defects but that the model failed to detect; the denominator TP + FN is the total number of real defects in the dataset.

[0132] Average precision (AP) comprehensively evaluates the detection performance of a single class across confidence thresholds. It is defined as the area under the precision-recall (P-R) curve: $AP = \int_0^1 P(R)\,dR$, where P(R) is precision as a function of recall, dR is the differential of recall, and the integral computes the area under the P-R curve over the entire recall interval.

[0133] In the experiment, mAP$_{50}$ (IoU threshold of 0.5) and mAP$_{50:95}$ (IoU thresholds from 0.5 to 0.95 with a step size of 0.05) were used as evaluation indicators for the overall detection performance of the model. They are calculated as: $mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$, where N is the total number of defect categories in the dataset, $AP_i$ is the average precision of the i-th defect category, and ∑ denotes summation of the AP values over all categories. The F1 score is the harmonic mean of precision and recall, used to comprehensively measure the detection performance of the model, and is defined as: $F1 = \frac{2 \times P \times R}{P + R}$

[0134] Here, P is precision and R is recall. This metric balances precision and recall and reflects the detection performance of the model more comprehensively.
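The metric definitions above translate directly into a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and the area under the P-R curve is approximated with a monotone precision envelope, which is one common convention and not necessarily the one used in the experiments.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(recalls, precisions):
    """Area under a P-R curve given points sorted by increasing recall."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

# F1 implied by the precision (92.2%) and recall (83.1%) reported for MSF-YOLO.
print(round(f1_score(0.922, 0.831), 3))
```

For the reported precision and recall, the harmonic mean works out to roughly 0.874.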

[0135] To verify the superiority of the MSF-YOLO model, this study conducted comparative experiments using Faster R-CNN, SSD, RT-DETR, and nine different versions of the YOLO model, including YOLOv6n, the lightweight versions YOLOv9-Tinyh and YOLOv3-Tiny, the mainstream versions YOLOv5n and YOLOv8n, and the latest versions YOLOv11n and YOLOv12n released after YOLOv10n. All experiments maintained consistency in configuration, environment, and other hyperparameters. The comparative experimental results of model evaluation metrics are detailed in Table 1 below.

Table 1: Results of the ablation study

[0136] Based on the comparative analysis of the experimental results, as shown in Table 2 below, the proposed MSF-YOLO model shows clear advantages in all core metrics of bearing defect detection: its precision (92.2%), recall (83.1%), mAP50 (88.1%), and mAP50:95 (53.1%) are better than those of all the comparison models. Owing to architectural limitations, classic detectors such as Faster R-CNN and SSD struggle to meet the requirements of real-time, accurate detection in both accuracy and speed. Although lightweight YOLO models (such as YOLOv5n and YOLOv8n) have advantages in speed and parameter count, their detection accuracy (mAP50 below 82%) is still insufficient for subtle defects. Notably, although YOLOv10n achieves the highest inference speed (333 FPS), its accuracy (mAP50 of 73.8%) is excessively compromised; new-paradigm models such as RT-DETR-L also did not show the expected advantages in this task. With its multi-scale fusion and attention calibration modules, the MSF-YOLO model has 7.9M parameters and a computational cost of 22.1 GFLOPs while maintaining real-time performance at 101 FPS. It also improves mAP50 by more than 6 percentage points over the strong baseline model (YOLOv8n). This achieves a favorable balance between detection accuracy, robustness, and inference efficiency, demonstrating its practicality for real-time bearing defect detection in industrial scenarios. The mAP50 variation curve is shown in Figure 10.

Table 2: Comparative experiment results

[0137] Example 4: Comparison of the same model before and after improvement. Figure 11 shows the detection results for the five types of defects using the YOLOv8 and MSF-YOLO models. As shown, under the same conditions, both models exhibited satisfactory detection performance. At the same confidence threshold, both models avoided most of the missed detections caused by strong light reflection and high-contrast shadows, but YOLOv8n occasionally missed small-target defects. Experimental results indicate that the improved MSF-YOLO model achieves higher detection accuracy and a lower missed detection rate.

[0138] Figure 12 presents the detection performance of YOLOv8n and MSF-YOLO in relatively complex environments with various defects, at the same confidence threshold.

[0139] The figure covers multi-reflection scenes on the side of the bearing sample (Figure 12(a), (e)) and scenes where shadows cause partial loss of the defect field of view (Figure 12(b), (c), (d)). In Figure 12(a)-(e), the false negatives of YOLOv8n are 2, 1, 1, 3, and 2, respectively, while those of MSF-YOLO are 2, 0, 0, 2, and 0, respectively. Experimental results show that the improved model is comparatively robust in complex environments, captures the detailed features of small targets more accurately, and effectively reduces the missed detection rate.

[0140] Example 5: Comparison of heatmaps before and after improvement for the same model. In target detection, heatmaps visually show the high-response areas of the model. Comparing the heatmaps of the original YOLOv8 model and the improved MSF-YOLO model allows their attention quality to be evaluated: the more focused the attention (the darker the color and the tighter the focus on the actual defect area), the more accurate the feature extraction and localization and the lower the probability of background false detection, which verifies the effectiveness of the introduced multi-scale and attention mechanisms. This study uses Grad-CAM++ [60] to generate heatmaps for comparison. The results show that the MSF-YOLO heatmaps exhibit a more focused and accurate attention distribution. In the five types of defect scenarios, the thermal response concentrates on the actual boundary and key morphological features of the defect, while activation from the background and irrelevant structures is effectively suppressed. The MSF-YOLO heatmaps overlap more closely with the real contour of the defect, and diffuse activation in defect-free areas is significantly reduced. From the perspective of the attention mechanism, this visual evidence confirms the effectiveness of the MSPConv and C2f_SCSA modules: these improvements enable the model to focus more accurately on the features most relevant to defect identification and enhance the localization of small and vague defects through a collaborative understanding of multi-scale context. The results intuitively explain the model's decision-making process, further supporting the superior feature perception and localization accuracy of MSF-YOLO in bearing defect detection tasks. As shown in Figure 13, the YOLOv8n heatmap focuses only weakly on defects, and there is obvious thermal interference around the defect area.

[0141] The present invention also provides an electronic device, comprising: a memory for storing a computer program; and

[0142] a processor which, when executing the computer program stored in the memory, implements the above-described method for detecting surface defects in bearings.

[0143] It should be noted that any parts not disclosed or specifically described in this invention are existing technology or conventional configurations, and their specific structures and working principles will not be elaborated further. In this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0144] Although embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the specification and embodiments. It can be applied to various fields suitable for the present invention. Other modifications can be readily implemented by those skilled in the art. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details and examples shown and described herein.

Claims

1. A method for detecting surface defects in bearings, characterized by comprising: acquiring an image of the bearing surface and performing standardization processing to obtain an initial feature map; inputting the initial feature map into the MSF-YOLO network model, in which the backbone network performs dual-path processing through a multi-scale partial convolution module: one path is a multi-scale parallel convolution branch that captures full-scale defect features to obtain a multi-scale feature map, and the other path is a small-defect-sensitive branch that generates a spatial attention weight map to enhance small defect features and obtain a sensitive feature map; the multi-scale feature map and the sensitive feature map are concatenated and fused, and then combined through a residual connection to obtain the backbone-level features; the neck network first introduces learnable channel weights into the backbone-level features through a feature fusion module and adaptively fuses the shallow P2 high-resolution detail features and the deep semantic features of the backbone network to obtain a fused feature map; a collaborative attention module then generates spatial attention masks along the height and width dimensions of the fused feature map to lock the defect location and, after filtering noise through a channel self-attention mechanism, outputs an enhanced feature map; and after classification and regression by the multi-scale decoupled detection head, the category, confidence level, and location coordinates of the bearing surface defect are output.

2. The bearing surface defect detection method according to claim 1, characterized in that the device for acquiring images of the bearing surface includes an area-array industrial camera, a fixed-focus lens, a coaxial light source assembly, and a precision stage; the fixed-focus lens is aligned with the optical axis of the area-array industrial camera; the coaxial light source assembly is mounted on the end of the fixed-focus lens facing the precision stage; the precision stage supports the bearing under test, and its supporting surface is perpendicular to the optical axis of the area-array industrial camera; the coaxial light source assembly has a built-in 45° beam splitter that converts horizontally incident light into vertically incident light parallel to the camera's optical axis, which illuminates the curved surface of the outer metal ring of the bearing under test perpendicularly; and the area-array industrial camera acquires the image of the bearing surface.

3. The bearing surface defect detection method according to claim 1, characterized in that the preprocessing includes: filtering the acquired bearing surface images to remove blurry images, unifying the image size by bilinear interpolation, and normalizing the pixel values.

4. The bearing surface defect detection method according to claim 1, characterized in that the processing method of the backbone network is as follows: the input initial feature map is copied into two paths, one fed to the multi-scale parallel convolution branch and the other to the small-defect-sensitive branch; the multi-scale parallel convolution branch splits the input initial feature map along the channel dimension, extracts fine-grained features, local contextual features, and wide-area semantic features using 1×1, 3×3, and 5×5 convolution kernels, respectively, and concatenates these features along the channel dimension to generate the multi-scale feature map; the small-defect-sensitive branch passes the input initial feature map sequentially through a 1×1 convolution for dimensionality reduction, a BN layer, ReLU activation, a 3×3 convolution for feature extraction, a BN layer, and ReLU activation, after which a 1×1 convolution and Sigmoid activation generate a spatial attention weight map with values in [0,1]; the spatial attention weight map is multiplied element-wise with the original input initial feature map to yield the sensitive feature map; the multi-scale feature map and the sensitive feature map are concatenated along the channel dimension and fused by a 1×1 convolution, and the result is combined with the original input feature map through a residual connection to output the backbone-level features.

5. The bearing surface defect detection method according to claim 4, characterized in that the processing method of the neck network is as follows: the feature fusion module L_FFM receives the P2 high-resolution feature map from the shallow layers of the backbone network and the upsampled semantic feature map from the deep layers of the network; a learnable weight vector is initialized, whose length equals the total number of channels of the two feature maps, and Softmax normalization is applied to the weight vector to obtain the normalized weight coefficient of each channel; the normalized weight coefficients are broadcast to the two feature maps according to channel correspondence, and the channel-level weighting operation is completed to obtain two weighted feature maps; the two weighted feature maps are concatenated along the channel dimension to output the fused feature map.

6. The bearing surface defect detection method according to claim 5, characterized in that the enhanced feature map is obtained as follows: global average pooling is performed on the input initial feature map along the height direction and the width direction, respectively, to generate projection vectors in the height and width directions; each of the two projection vectors is divided evenly into four groups along the channel dimension, and the groups are processed by one-dimensional convolutional layers with kernel sizes of 3, 5, 7, and 9, concatenated, and passed through Sigmoid activation to generate a height-direction spatial attention mask and a width-direction spatial attention mask; the two masks are each multiplied element-wise with the initial input feature map to output a spatially calibrated feature map; the spatially calibrated feature map is downsampled, and a query matrix Q, a key matrix K, and a value matrix V are then generated through three parallel 1×1 convolutional layers; the dot product of Q and K is processed by Softmax to generate a channel correlation matrix, which is used to perform weighted aggregation on V to obtain channel context features; the channel context features are upsampled to the same size as the spatially calibrated feature map, and channel weights are generated by Sigmoid activation; the channel weights are multiplied element-wise with the spatially calibrated feature map to output the enhanced feature map after feature calibration.
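
The spatial calibration stage of claim 6 (directional pooling, grouped one-dimensional convolutions, Sigmoid masks) might look like the sketch below. It covers only that first stage, not the Q/K/V channel attention. It assumes the channel count is divisible by four, that each channel is filtered independently, and that averaging kernels stand in for learned ones; `directional_masks` and `spatial_calibrate` are hypothetical names.

```python
import math

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def conv1d_same(vec, kernel):
    """1D convolution with zero padding (output length = input length)."""
    k, n = len(kernel), len(vec)
    pad = k // 2
    return [sum(kernel[d] * vec[i + d - pad] for d in range(k) if 0 <= i + d - pad < n)
            for i in range(n)]

def directional_masks(fmap, kernel_sizes=(3, 5, 7, 9)):
    """Return per-channel height and width attention masks for a C x H x W map."""
    c = len(fmap)
    if c % len(kernel_sizes) != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    h, w = len(fmap[0]), len(fmap[0][0])
    group_size = c // len(kernel_sizes)
    mask_h, mask_w = [], []
    for idx, ch in enumerate(fmap):
        k = kernel_sizes[idx // group_size]      # kernel size of this channel's group
        kernel = [1.0 / k] * k                   # placeholder for learned weights
        proj_h = [sum(row) / w for row in ch]    # average over width  -> length H
        proj_w = [sum(ch[i][j] for i in range(h)) / h for j in range(w)]  # length W
        mask_h.append([sigmoid(v) for v in conv1d_same(proj_h, kernel)])
        mask_w.append([sigmoid(v) for v in conv1d_same(proj_w, kernel)])
    return mask_h, mask_w

def spatial_calibrate(fmap):
    """Multiply every pixel by its row mask and its column mask."""
    mask_h, mask_w = directional_masks(fmap)
    return [[[ch[i][j] * mask_h[c][i] * mask_w[c][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for c, ch in enumerate(fmap)]
```

Using a different kernel size per channel group lets some channels respond to short-range structure along an axis and others to longer-range structure, which is the apparent motivation for the 3, 5, 7, 9 split.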

7. The bearing surface defect detection method according to claim 6, characterized in that the MSF-YOLO network model is trained and optimized as follows: random rotation, horizontal flipping, random brightness perturbation, and Gaussian noise addition are applied to the initial feature map to construct a defect dataset; the defect dataset is divided proportionally into a training set, a validation set, and a test set; the images in the training set are input into the MSF-YOLO network model and a joint loss function is calculated, the joint loss function comprising Varifocal Loss for evaluating classification accuracy, CIoU Loss for evaluating bounding box regression accuracy, and Distribution Focal Loss for optimizing the bounding box distribution; based on the joint loss value, a stochastic gradient descent (SGD) optimizer updates, through backpropagation, the convolution kernel weights, the channel fusion weights of the feature fusion module L_FFM, and the attention parameters of the collaborative attention module C2f_SCSA; training is set to 200 epochs, and model performance is verified on the validation set until the loss converges, thereby obtaining the trained MSF-YOLO network model.
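
Of the three terms in the joint loss, the CIoU Loss is the easiest to show in isolation. The sketch below follows the standard published CIoU definition (IoU, minus a normalised centre-distance penalty, minus an aspect-ratio penalty) for a single box pair in (x1, y1, x2, y2) format. The claim does not state the box format, the loss weights, or the epsilon used, so those are assumptions, and `ciou_loss` is a hypothetical name.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for one predicted box and one ground-truth box, both (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Squared distance between box centres.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

Unlike a plain IoU loss, the centre-distance term still gives a useful gradient signal when the predicted box and the ground truth do not overlap at all, which matters for very small defects.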

8. The bearing surface defect detection method according to claim 7, characterized in that detection with the multi-scale decoupled detection head proceeds as follows: a coaxial-light image of the bearing to be inspected is input into the trained MSF-YOLO network model and processed stage by stage by the four MSPConv modules of the backbone network to extract multi-scale feature maps; the multi-scale decoupled detection head performs independent classification and regression on the feature maps at the four scales P2, P3, P4, and P5, predicting the class confidence and bounding box coordinates of defects; the prediction results of all scales are aggregated, and the non-maximum suppression (NMS) algorithm removes redundant detection boxes with excessive overlap; finally, the category, confidence, and precise location coordinates of the bearing surface defects are output.
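
The final aggregation and suppression step of claim 8 can be sketched with a greedy NMS in plain Python. The claim does not give an overlap threshold or say whether suppression is per class, so the 0.5 threshold and the class-wise behaviour here are assumptions, as are the names `merge_scales` and `nms` and the (box, score, label) tuple layout.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_scales(per_scale_detections):
    """Pool the predictions of all detection scales (e.g. P2..P5) into one list."""
    return [det for scale in per_scale_detections for det in scale]

def nms(detections, iou_threshold=0.5):
    """Greedy class-wise NMS over (box, score, label) tuples.

    Detections are visited from highest to lowest score; one is kept only if it
    does not overlap an already kept detection of the same label by more than
    the threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, label = det
        if all(label != k[2] or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Usage: two overlapping "scratch" boxes from different scales and one "groove" box.
p2 = [((0.0, 0.0, 10.0, 10.0), 0.9, "scratch")]
p3 = [((1.0, 1.0, 11.0, 11.0), 0.8, "scratch"), ((20.0, 20.0, 30.0, 30.0), 0.7, "groove")]
final = nms(merge_scales([p2, p3]))
```

In this usage example the second scratch box overlaps the first with an IoU of about 0.68, so it is suppressed, while the groove box survives because it has a different label and no overlap.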

9. The bearing surface defect detection method according to claim 1, characterized in that the bearing surface image includes at least one of the following defect types: scratches, grooves, elongated cracks, and electrolytic erosion holes.

10. An electronic device, characterized in that it comprises: a memory for storing a computer program; and a processor which, when executing the computer program stored in the memory, implements the bearing surface defect detection method according to any one of claims 1 to 9.