Real-time monitoring system for production line material blockage based on visual recognition
A closed-loop automation system built on an improved Swin Transformer V2 U-Net and an improved YOLO11 network structure addresses the shortcomings of traditional material blockage monitoring methods. The system enables early warning and graded control, improves detection accuracy and robustness, significantly reduces the rate of missed and false detections, and improves production efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING MACH TIANCHENG TECH CO LTD
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-12
AI Technical Summary
Traditional material blockage monitoring methods cannot identify progressive blockages, have a high false alarm rate, and cannot meet the requirements of industrial production for high reliability, high real-time performance, and forward-looking capabilities. Furthermore, existing vision solutions have unstable image quality in industrial environments and lack a complete technical solution.
Image denoising is performed using an improved Swin Transformer V2 U-Net network structure, combined with an improved YOLO11 network structure for recognition, forming a closed-loop automation system from perception to execution, including image acquisition, preprocessing, state recognition, and decision control unit, to achieve early warning and hierarchical control.
It enables early warning and graded control of material blockage, improves detection accuracy and robustness, forms a complete closed-loop automation, significantly reduces the rate of missed detection and false detection, and improves production efficiency and intelligence level.
Abstract
Description
Technical Field
[0001] This invention relates to the fields of industrial automation and computer vision technology, specifically to a real-time monitoring system for material blockage on production lines based on visual recognition.
Background Technology
[0002] In modern industrial production, automated production lines are the core of achieving efficient and continuous operation. However, during the material conveying process, blockages often occur at critical points due to uneven material characteristics, improper feeding rates, or mechanical failures. If blockages are not addressed in a timely manner, they can cause the entire production line to shut down, resulting in significant economic losses and potentially even safety accidents such as equipment damage.
[0003] Traditional methods for monitoring material blockages mainly rely on mechanical limit switches or photoelectric sensors. These methods are mostly contact or point-based detection methods, which have inherent drawbacks such as easy wear, short lifespan, and limited detection range. More importantly, they cannot identify gradual, incomplete blockages (i.e., "congestion warning" states), and can only issue alarms after the blockage has completely occurred and physical contact has been formed, thus missing the best opportunity for early intervention and essentially acting as a "post-event handling."
[0004] In recent years, computer vision-based monitoring solutions have been explored. These methods acquire images from the field using cameras and attempt to identify blockages using image processing techniques. However, industrial environments are complex, with numerous factors such as changing lighting, equipment vibration, and dust interference, leading to unstable image quality and high noise levels. Directly applying general image recognition algorithms results in poor performance and a high false alarm rate. Existing vision solutions often focus on simple image processing or directly applying unoptimized general models, lacking a complete and rigorous technical solution for enhancing industrial image quality, effectively extracting blockage features, and closely integrating with actual control systems. Their preprocessing capabilities are insufficient, the models are insensitive to subtle and variable blockage features, and their decision-making logic is simplistic, failing to meet the stringent requirements of industrial production for high reliability, real-time performance, and proactive early warning.
[0005] To address the aforementioned issues, there is an urgent need for a real-time monitoring system for material blockage on production lines based on visual recognition, which can solve the problems existing in traditional methods and achieve real-time monitoring of material blockage on production lines.
Summary of the Invention
[0006] The purpose of this invention is to provide a real-time monitoring system for material blockage on a production line based on visual recognition. This system enables early warning and graded control of material blockage. Through advanced deep learning image denoising and recognition models, it significantly improves detection accuracy and robustness, and forms a closed-loop automation from perception to execution, thereby greatly improving production efficiency and intelligence.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A real-time monitoring system for material blockage on a production line based on visual recognition, comprising:
- An image acquisition unit, fixedly installed at key monitoring points in the material conveying section of the production line, used to acquire real-time video streams from the key monitoring points and output raw image frames.
- An image processing unit, connected to the image acquisition unit, used to receive the raw image frames, perform preprocessing operations on them, and output a standardized image. The preprocessing operations include denoising the raw image frames based on an improved Swin Transformer V2 U-Net network structure.
- A status recognition unit, connected to the image processing unit, used to recognize the standardized images based on an improved YOLO11 network structure to determine the material blockage status of the production line.
- A decision control unit, connected to the status recognition unit, used to issue control commands to the actuators of the production line according to the material blockage status, using preset decision logic.
[0008] Furthermore, the image acquisition unit includes multiple industrial cameras, which are respectively set at key monitoring points in the material conveying section of the production line. The key monitoring points include directly above the feed inlet of the discharge hopper, directly in front of the discharge end of the conveying mechanism, and directly above the conveying mechanism.
[0009] Furthermore, the improved Swin Transformer V2 U-Net network structure includes a feature embedding module, an encoder downsampling and feature extraction module, a decoder upsampling and feature fusion module, and an image reconstruction module.
[0010] Furthermore, the preprocessing operation also includes: After denoising the image, the region of interest is extracted to obtain a region of interest image containing only the monitored area; The image size and numerical values of the region of interest are standardized to obtain a standardized image.
[0011] Furthermore, the region of interest (ROI) is extracted from the denoised image to obtain a ROI image containing only the monitored area, specifically: Based on the fixed installation position, viewing angle, and focal length of the industrial camera on the production line, the coordinates of the vertices of the polygonal region, i.e., the ROI, are predefined in the image coordinate system during the system initialization phase. A binary mask is generated based on the vertex coordinates of a predefined polygon region. In the mask, the pixel value inside the ROI is 1, and the pixel value outside the ROI is 0. The image after denoising is then ANDed with the binary mask to extract the image region inside the ROI, thus obtaining the region of interest image.
[0012] Furthermore, based on the improved YOLO11 network structure, standardized images are recognized, specifically as follows: A production line material blockage identification model was constructed based on an improved YOLO11 network structure. The production line material blockage identification model is trained based on a pre-set dataset; The standardized image is input into the trained production line material blockage recognition model to obtain the production line material blockage status.
[0013] Furthermore, the specific improvements to the YOLO11 network structure are as follows: In the C3k2 module of the backbone network of the original YOLO11 network structure, switchable dilated convolution (SAC) is used to replace the standard convolution; In the neck network of the original YOLO11 network structure, the DySample dynamic upsampling module is used to replace the traditional upsampling; The ASFFHead detection head was used to replace the detection head of the original YOLO11 network structure.
[0014] Furthermore, the system also includes a data recording unit, which is connected to the state recognition unit and the decision control unit, and is used to acquire and record relevant information.
[0015] In summary, the present invention has at least one of the following beneficial technical effects: 1. Achieving true early warning and tiered control: Through an improved YOLO11 model, this invention can accurately distinguish between three states: "normal," "congestion warning," and "complete blockage." In the "congestion warning" state in particular, the system can issue an early warning and automatically adjust the feeding device as soon as material flow becomes abnormal but before it completely stops, achieving "pre-emptive intervention," avoiding production interruptions, and minimizing losses.
[0016] 2. Superior image preprocessing capabilities enhance system robustness: A deep learning denoising network based on an improved Swin Transformer V2 U-Net is applied to industrial scenarios. This network extracts global and local features in parallel through the DB-Transformer and combines feature fusion modules with targeted optimizations such as mirror padding to suppress complex noise in industrial environments. Even under extreme conditions, it can output clear, high-quality images, laying a solid foundation for subsequent accurate recognition and greatly improving the system's adaptability to harsh industrial environments.
[0017] 3. High-precision and high-efficiency blockage identification and localization: Three key improvements were made to the YOLO11 model: SAC convolution was introduced to enhance multi-scale feature extraction capabilities, DySample upsampling was used to preserve key details, and ASFFHead was used to optimize multi-scale feature fusion. This makes the model extremely sensitive to small, early-stage blockages, while accurately locating the blockage area, significantly reducing the false negative and false positive rates, and providing a reliable basis for accurate decision-making.
[0018] 4. Forming a complete closed-loop automation of perception-decision-execution: This invention is not an isolated identification algorithm, but a complete system. Starting from the physical structure of the production line, it clarifies the deployment of monitoring points and directly converts the identification results into control instructions for actuators such as the feeding device and the main control PLC through rigorous decision-making logic (such as "slow down when warning" and "stop immediately when completely blocked"). This forms a fully automated closed loop from "seeing the problem" to "solving the problem," greatly reducing manual intervention and improving production efficiency and intelligence.
[0019] 5. Possesses a robust foundation for data traceability and system optimization: The system integrates data recording units, completely preserving images, recognition results, control commands, and timestamps for each event. This not only provides an immutable data chain for tracing production accidents but also offers valuable data support for subsequent analysis of blockage patterns and continuous optimization of production line processes and model performance.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of the network structure of the denoising algorithm;
Figure 2 is a schematic diagram of the DB-Transformer module structure;
Figure 3 is a schematic diagram of the P-Swin Transformer V2 module structure;
Figure 4 is a schematic diagram of the YOLO11 network structure;
Figure 5 is a schematic diagram of the switchable dilated convolution (SAC);
Figure 6 is a schematic diagram of the C3k2_SAC module;
Figure 7 is a schematic diagram of the DySample upsampling module;
Figure 8 is a schematic diagram of the ASFFHead detection head module;
Figure 9 is a schematic diagram of the improved YOLO11 network structure;
Figure 10 is a schematic diagram of the system structure of the present invention.
Detailed Implementation
[0021] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0022] First, the production line as a whole is introduced:
1. Feeding device: Used to feed materials quantitatively and evenly into the conveying mechanism, such as a vibrating feeder or a rotary feeder valve.
2. Conveying mechanism: The core material conveying channel, usually a conveyor belt, chain conveyor, or screw conveyor. The inlet end of the conveying mechanism is connected to the outlet of the feeding device to receive materials.
3. Transition components: Components at specific points in the conveying path that are prone to causing material accumulation, such as a discharge hopper (whose inlet is located below the discharge end of the conveying mechanism to receive and guide the material) or a pipe deflector.
4. Power and actuator: A drive motor and a reducer. The drive motor is connected to the reducer via a coupling. The output shaft of the reducer is connected to the drive roller of the conveying mechanism via a chain or belt to provide power to it.
The discharge port of the feeding device is located above the inlet of the conveying mechanism. The discharge end of the conveying mechanism delivers the material to the inlet of the discharge hopper. The drive motor drives the conveying mechanism through the reducer and the transmission chain.
[0023] As shown in Figure 10, this invention provides a real-time monitoring system for material blockage on a production line based on visual recognition, comprising:
- An image acquisition unit, fixedly installed at key monitoring points in the material conveying section of the production line, used to acquire real-time video streams from the key monitoring points and output raw image frames.
- An image processing unit, connected to the image acquisition unit, used to receive the raw image frames, perform preprocessing operations on them, and output a standardized image. The preprocessing operations include denoising the raw image frames based on an improved Swin Transformer V2 U-Net network structure.
- A status recognition unit, connected to the image processing unit, used to recognize the standardized images based on an improved YOLO11 network structure to determine the material blockage status of the production line.
- A decision control unit, connected to the status recognition unit, used to issue control commands to the actuators of the production line according to the material blockage status, using preset decision logic.
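The preset decision logic of the decision control unit can be pictured as a small lookup from recognized state to actuator command. The sketch below is only an illustration: the three states and the "slow down when warning" / "stop immediately when completely blocked" rules come from the summary of beneficial effects, while the names `BlockageState`, `ControlCommand`, `DECISION_TABLE`, `decide`, and the actuator and action strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class BlockageState(Enum):
    NORMAL = "normal"
    CONGESTION_WARNING = "congestion warning"
    COMPLETE_BLOCKAGE = "complete blockage"


@dataclass(frozen=True)
class ControlCommand:
    target: str   # hypothetical actuator name
    action: str   # hypothetical command name
    alarm: bool   # whether an alarm or early warning is raised


# Hypothetical mapping. The source only states "slow down when warning"
# and "stop immediately when completely blocked".
DECISION_TABLE = {
    BlockageState.NORMAL: ControlCommand("feeding_device", "keep_rate", False),
    BlockageState.CONGESTION_WARNING: ControlCommand("feeding_device", "slow_down", True),
    BlockageState.COMPLETE_BLOCKAGE: ControlCommand("main_plc", "stop", True),
}


def decide(state: BlockageState) -> ControlCommand:
    """Translate a recognized blockage state into an actuator command."""
    return DECISION_TABLE[state]
```

Keeping the rules in a table rather than in branching code makes it straightforward to log each decision alongside the recognition result, which is what the data recording unit is described as doing.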
[0024] The image acquisition unit includes multiple industrial cameras, which are mounted at key monitoring points in the material conveying section of the production line via rigid brackets or universal joint gimbals. The key monitoring points include directly above the feed inlet of the discharge hopper, directly in front of the discharge end of the conveying mechanism, and directly above the conveying mechanism. Each is explained separately:
- Directly above the feed inlet of the discharge hopper: the lens points vertically downwards or at a certain angle, aimed at the feed inlet plane of the hopper, to monitor whether the material forms an arch-shaped blockage at the hopper inlet.
- Directly in front of the discharge end of the conveying mechanism: the lens is horizontal or tilted slightly downwards, aimed at the trajectory area of the material being thrown from the conveying mechanism into the discharge hopper, to monitor whether the material flow is smooth and whether there is any accumulation.
- Directly above the conveyor belt or chain conveyor: the camera monitors accumulation on the conveyor surface caused by belt slippage, material adhesion, or other reasons.
[0025] The improved Swin Transformer V2 U-Net network structure includes a feature embedding module, an encoder downsampling and feature extraction module, a decoder upsampling and feature fusion module, and an image reconstruction module. Its structure is described in detail below.

This invention uses the U-Net architecture to construct a 5-layer neural network for image denoising. Although pooling operations enhance the extraction of high-level semantic information during downsampling, they may also cause partial loss of low-level detail. Therefore, the number of downsampling operations is limited to two, so that image detail is retained as far as possible while abstract features are still obtained. Figure 1 shows the network structure of the denoising algorithm. The noisy image is denoted $I_{noised} \in \mathbb{R}^{B \times 3 \times H \times W}$, where B is the batch size, 3 is the number of color channels of the color image, and H and W are the spatial dimensions of the image. The image is first passed through a non-overlapping embedding layer $M_{embed}(\cdot)$ to obtain the low-level feature map $F_0 \in \mathbb{R}^{B \times C \times H \times W}$, where C is the number of channels of the embedded feature map. The process is expressed as follows:

$$F_0 = M_{embed}(I_{noised}) \tag{1}$$

In the formula, $M_{embed}(\cdot)$ contains a 3×3 convolutional layer with a stride of 1 and layer normalization. Through the embedding layer, the image is segmented into non-overlapping blocks and mapped to the feature space. Adding normalization to this embedding layer helps avoid vanishing or exploding gradients. The feature map $F_0$ obtained after the embedding layer is input into the first-layer encoder module, yielding that layer's output feature map $F_1$.
The five-layer encoder and decoder designed in this invention employ skip connections, which integrate feature information from each layer and promote the complementarity between shallow details and deep semantics, thereby improving the accuracy and detail recovery of the reconstructed image. The feature map $F_5$ obtained from the last layer is input into the de-embedding module $M_{unembed}(\cdot)$ to reconstruct the image; $M_{unembed}(\cdot)$ contains a convolution with a kernel size of 3×3 and a stride of 1, as well as a layer normalization module. The de-embedding module restores the number of channels of the feature map to match the input image. Because the noisy image and the prediction result share a large amount of information, a residual connection is introduced at the end, allowing the network to focus on learning the difference between the two and thereby restore the details of the original image more accurately. The output predicted image $I_{estimate}$ is expressed as:

$$I_{estimate} = M_{unembed}(F_5) + I_{noised} \tag{2}$$

Next, the DB-Transformer module is introduced. In the encoder, this invention uses the DB-Transformer module in place of the traditional Transformer module. It is a module with two feature extraction branches, and its structure is shown in Figure 2. Specifically, the module includes a Transformer branch T and a convolutional branch D in parallel. First, the input features are passed through a convolutional layer to expand the number of channels, providing richer feature representations for the subsequent branches. Then, the feature map is evenly divided (chunked) into two parts along the channel dimension, which are fed into Transformer branch T and convolutional branch D respectively. This process is represented as follows:

$$F_T^{in},\ F_D^{in} = \mathrm{Chunk}(\mathrm{Conv}_{1\times1}(f_{in})) \tag{3}$$

In the formula, $F_T^{in}$ represents the subset of features input to the Transformer branch, $F_D^{in}$ represents the subset of features input to the convolution branch, $f_{in}$ represents the input features, and $\mathrm{Conv}_{1\times1}(\cdot)$ denotes a convolution with a kernel size of 1×1.

For the convolutional branch D, a 1×1 convolutional layer is first used to further increase the number of channels; a 3×3 depthwise separable convolution (DConv) then extracts local spatial features; another 1×1 convolutional layer restores the original number of channels; and a non-linear transformation is introduced using the GELU (Gaussian error linear unit) activation function, which avoids the problem that neurons may stop learning under the ReLU (rectified linear unit) activation function. Finally, the input of the convolutional branch is added to the result of these operations through a residual connection to obtain the final output $F_D$ of the convolutional branch. The process is denoted as:

$$F_D = F_D^{in} + \mathrm{GELU}(\mathrm{Conv}_{1\times1}(\mathrm{DConv}_{3\times3}(\mathrm{Conv}_{1\times1}(F_D^{in})))) \tag{4}$$

To prevent the mean subtraction and division by standard deviation in layer normalization from affecting image contrast, normalization is removed from all convolution operations here. For the Transformer branch T, the processing includes a multi-head self-attention mechanism and a feedforward neural network module, which output the features $F_T$ after a series of transformations. Finally, $F_T$ and $F_D$ are concatenated along the channel dimension and fused through a 1×1 convolutional layer to obtain the output feature $F_{out}$ of this module. The mathematical expression for this process is:

$$F_{out} = \mathrm{Conv}_{1\times1}(\mathrm{Concat}(F_T, F_D)) \tag{5}$$

By processing local details and global dependencies in parallel, this design combines the advantages of convolutional neural networks and Transformers, which helps improve the model's performance in detail recovery and global information capture.

Next, the P-Swin Transformer V2 module is introduced. It makes adaptive improvements to Swin Transformer V2 for the image denoising task.
As shown in Figure 3, the specific design of this module follows the traditional "Norm-MHSA-Norm-FFN" structure and mainly includes two sub-modules: a multi-head self-attention layer with a partitioned window and a multilayer perceptron (MLP) module. In the figure, the sub-modules marked with an asterisk are only used in the P-Swin Transformer block with the shifted window partition. In the multi-head self-attention part, the cosine attention of Swin Transformer V2 is adopted: cosine similarity (cos) replaces the dot product, and the attention distribution is adjusted dynamically through a learnable scaling factor τ. The attention is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{\cos(Q, K)}{\tau} + B\right)V \tag{6}$$

In the formula, Q, K, and V are matrices obtained by linear transformation of the feature map, and Q and K are L2-normalized so that the magnitude of each vector is 1. B represents the positional bias generated by a small MLP network. The attention output is obtained by multiplying the result of the softmax function with V. After the self-attention calculation, the feature map is fed into the multilayer perceptron module. This module uses convolution operations to enhance the ability to capture local features. First, the first convolution is applied to the input. Then, the Leaky ReLU (leaky rectified linear unit) activation function performs a non-linear transformation on the convolution result. Finally, the second convolution layer maps the features back to the original dimension.

In order to train a large-scale general model to handle advanced vision tasks such as image segmentation, Swin Transformer V2 replaces the traditional dot product attention of the original Swin Transformer with scaled cosine similarity attention. Furthermore, it performs layer normalization (LayerNorm) after calculating self-attention, thereby stabilizing the training process of large models.
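As a rough illustration of the scaled cosine attention described above, the following sketch computes it for a handful of token vectors in plain Python. It assumes a single head and a fixed `tau`, takes the positional bias `bias` as a ready-made matrix rather than generating it with a small MLP, and omits window partitioning; the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_attention(q, k, v, bias, tau=0.1):
    """Single-head attention: Softmax(cos(Q, K) / tau + B) V.

    q, k, v are lists of token vectors; bias is an n x n matrix standing in
    for the positional bias B; tau is the scaling factor (fixed here,
    learnable in the described network).
    """
    qn = [l2_normalize(row) for row in q]
    kn = [l2_normalize(row) for row in k]
    out = []
    for i, qi in enumerate(qn):
        # Dot product of unit vectors equals their cosine similarity.
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau + bias[i][j]
                  for j, kj in enumerate(kn)]
        weights = softmax(logits)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out
```

Because Q and K are normalized before the similarity is taken, rescaling a query vector leaves the attention weights unchanged, which is the property that keeps the logits bounded regardless of feature magnitude.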
Since scaled cosine similarity attention already normalizes its input, Swin Transformer V2 places layer normalization after the self-attention module to stabilize the output of each layer. In a very deep network, this suppresses the tendency of each layer's output to keep growing, thereby reducing the gap between shallow and deep layer outputs. However, the original Swin Transformer V2 was designed for training a large model with more than 3 billion parameters; for most small and medium-sized models of much smaller scale, post-layer normalization can significantly slow down convergence in practice because of the different normalization methods and update strategies. For this reason, in order to train faster while maintaining the accuracy and stability of the model, the P-Swin Transformer V2 module returns to placing layer normalization before the input, combining the efficient training process of the original Swin Transformer with the good adaptability of Swin Transformer V2 to high-resolution inputs.

Looking back at the original Swin Transformer, when the shifted window strategy is used to calculate window self-attention, the edge regions of the feature map cannot form complete regular windows. To solve this problem, the original Swin Transformer used a cyclic shift to reposition edge blocks, piecing together complete windows by cyclically moving the leftover feature blocks to the opposite side. However, this operation destroys the real spatial topology of the feature blocks, resulting in discontinuous positional encoding within the recombined windows. To eliminate the resulting spurious correlations, the Swin Transformer introduced a mask mechanism to shield the unwanted attention weights. This artificially suppresses feature learning in edge regions.
In image restoration tasks, edge regions often carry equally crucial structural information, such as object contours and texture boundaries. Therefore, a mirror padding strategy is proposed to replace the traditional cyclic shift method. It expands the feature map by mirror symmetry at its boundaries, so that each edge window can form a complete window from real features. This mechanism preserves the real feature distribution of the edge regions, which helps to reconstruct edge details accurately. After the self-attention calculation, the feature map is cropped to restore its original size.

This invention deploys DB-Transformer blocks only in the downsampling stage, while the upsampling image reconstruction stage uses a decoder containing only Transformer blocks. In the upsampling stage, a multi-feature map fusion module is placed before each decoder. This module dynamically selects feature maps from different stages and branches through a channel attention mechanism, so that the network learns which of the features extracted at each stage are more useful for image denoising. First, a convolution with a kernel size of 1×1 and a stride of 1 is performed on the feature maps from the previous stage, as expressed in equation (7), where $l^*$ refers to the layer number of the encoder located at the same level as the l-th layer fusion module. In this invention, the network has 5 layers; therefore, the 5th layer feature fusion module receives the feature map from the 1st layer parallel feature encoder. Average pooling $AP(\cdot)$ is then used to calculate the channel-level statistics s, as expressed in equation (8), where C equals the number of channels of the feature map. Next, s is passed through a multilayer perceptron, which consists of a pointwise convolution with a kernel size of 1×1 and a stride of 1, a LeakyReLU activation function, and a 1×1 convolution that doubles the number of channels.
Then, the softmax function is applied to the output of the multilayer perceptron to predict the weights $\{a_1, a_2\}$:

$$\{a_1, a_2\} = \mathrm{Softmax}(\mathrm{MLP}(s)) \tag{9}$$

Finally, these weights are used to calculate a weighted sum of the feature maps, thereby suppressing or enhancing the features of each stage, and the result is added to the output feature map $F_{l-1}$ of the previous layer through a residual connection. With $F_{fusion}$ denoting the output features of the feature fusion module, this process is defined by equation (10).

The preprocessing operation also includes: after the image is denoised, the region of interest is extracted to obtain a region of interest image containing only the monitored area; the image size and pixel values of the region of interest image are then standardized to obtain a standardized image.
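The channel-attention fusion described above (pooling, a small perceptron, a softmax over two weights, a weighted sum, and a residual connection) might look roughly like the sketch below. Several points are assumptions rather than details from the source: the statistics s are pooled from the element-wise sum of the two inputs, the two weighted inputs are the encoder skip features and the previous decoder output, and the perceptron is passed in as a hypothetical callable `mlp` that returns 2C logits. Feature maps are stored as one flat list of pixels per channel.

```python
import math


def channel_means(feat):
    """Global average pooling: one mean per channel."""
    return [sum(channel) / len(channel) for channel in feat]


def fuse(skip_feat, prev_feat, mlp):
    """Weight two C-channel feature maps per channel, then add prev_feat as a residual.

    mlp is a hypothetical stand-in for the small perceptron: it maps the C pooled
    statistics to 2*C logits (first C for skip_feat, last C for prev_feat).
    """
    channels = len(skip_feat)
    # Assumption: s is pooled from the element-wise sum of both inputs.
    summed = [[a + b for a, b in zip(sc, pc)]
              for sc, pc in zip(skip_feat, prev_feat)]
    logits = mlp(channel_means(summed))
    fused = []
    for c in range(channels):
        # Softmax across the two branches, independently for each channel.
        e1, e2 = math.exp(logits[c]), math.exp(logits[channels + c])
        a1, a2 = e1 / (e1 + e2), e2 / (e1 + e2)
        fused.append([a1 * s + a2 * p + p
                      for s, p in zip(skip_feat[c], prev_feat[c])])
    return fused
```

With an `mlp` that returns all-zero logits, both weights are 0.5, so the module reduces to averaging the two inputs before the residual is added; a trained perceptron would instead shift weight towards whichever input is more useful for each channel.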
[0026] After denoising the image, the region of interest (ROI) is extracted to obtain an ROI image containing only the monitored area, specifically: Based on the fixed installation position, viewing angle, and focal length of the industrial camera on the production line, the coordinates of the vertices of the polygonal region, i.e., the ROI, are predefined in the image coordinate system during the system initialization phase. A binary mask is generated based on the vertex coordinates of a predefined polygon region. In the mask, the pixel value inside the ROI is 1, and the pixel value outside the ROI is 0. The image after denoising is then ANDed with the binary mask to extract the image region inside the ROI, thus obtaining the region of interest image.
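A minimal sketch of the mask-and-AND step described above, in plain Python on a single-channel image stored as nested lists. The even-odd ray casting test, the choice to test pixel centres, and the function names are illustrative assumptions; the source only specifies a polygon-derived binary mask (1 inside, 0 outside) combined with the denoised image.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting test for point (x, y) against a polygon."""
    inside = False
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_roi_mask(width, height, vertices):
    """Binary mask: 1 where the pixel centre lies inside the ROI polygon, else 0."""
    return [[1 if point_in_polygon(col + 0.5, row + 0.5, vertices) else 0
             for col in range(width)]
            for row in range(height)]


def apply_roi(image, mask):
    """Keep pixels where the mask is 1 and zero the rest (the AND step)."""
    return [[pixel if keep else 0 for pixel, keep in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


# The mask is built once at initialization because the camera is fixed.
roi_vertices = [(1, 1), (3, 1), (3, 3), (1, 3)]  # hypothetical ROI
roi_mask = build_roi_mask(4, 4, roi_vertices)
roi_image = apply_roi([[9] * 4 for _ in range(4)], roi_mask)
```

Since the camera position, viewing angle, and focal length are fixed, the mask only needs to be computed once and can then be reused for every frame.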
[0027] The standardized image is recognized based on an improved YOLO11 network structure, specifically as follows: A production line material blockage identification model was constructed based on an improved YOLO11 network structure. The production line material blockage identification model is trained based on a pre-set dataset; The standardized image is input into the trained production line material blockage recognition model to obtain the production line material blockage status.
[0028] The specific improvements to the YOLO11 network structure are as follows: YOLO11, a recent object detection algorithm in the YOLO series, is an optimization and upgrade of the YOLOv8 model architecture. It inherits the efficient and fast detection characteristics of YOLOv8 and extends its functional range to cover computer vision tasks such as object detection and tracking, instance segmentation, image classification, and pose estimation. It mainly consists of the input network, backbone network, neck network, and head. Its network structure is shown in Figure 4.

The YOLO11 model preprocesses the input image, including resizing and normalization. Its backbone network integrates the Conv, C3k2, C2PSA, and SPPF modules. Compared to YOLOv8, it adds the C2PSA module and replaces the C2f module with C3k2. It uses different convolutional kernel sizes and channel separation strategies to optimize the extraction of more complex features. The C3k2 block performs feature extraction at different stages of the backbone network. By splitting the feature map and applying a series of smaller convolutional kernels, it improves speed and reduces computational cost. The C2PSA module introduces a position-sensitive attention (PSA) mechanism, combined with multi-head attention and feedforward neural networks, to enhance feature extraction. The neck network adopts an FPN structure, which helps to fuse features of different scales and optimize feature propagation. The head network, based on the original decoupled head, introduces an anchor-free design for the classification and detection head, separating the classification and detection tasks.
Next, the improved YOLO11 network structure is described in detail. It includes the following improvements: 1. Switchable dilated (atrous) convolution (SAC). The receptive field of a standard convolutional kernel is fixed within a given convolutional layer and cannot be adjusted dynamically to the scale of the input features. SAC is therefore introduced into the model, with the structure shown in Figure 5, and the C3k2_SAC network module is constructed by replacing the ordinary convolution in the original C3k2 module with SAC. SAC expands the receptive field by applying convolutions with different dilation rates within the same convolutional layer, enabling the model to capture features at both large and small scales. This dynamic adjustment improves the model's feature extraction capability for targets of different scales and complexities. SAC has three main components: the SAC component itself and two global context modules attached before and after it. Each global context module first compresses the input features with a global average pooling layer and then passes them through a convolutional layer. The SAC component converts each 3×3 convolutional layer into a switchable one, which allows soft switching of the convolution between different dilation rates: a spatially dependent switching function dynamically fuses the results of the convolutions with different dilation rates to generate the final feature output.
[0029] The C3k2_SAC module is shown in Figure 6. It includes C3k and the switchable dilated convolution SAC. The C3k2_SAC module replaces some of the C3k2 modules in the network, enhancing the network's feature representation capability.
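The soft switching between dilation rates described above can be illustrated with a minimal one-dimensional sketch. It is not the C3k2_SAC module itself: the dilation rates 1 and 3, the weight difference `delta_w`, and the function names are assumptions made for illustration, and the switch values are passed in directly instead of being computed from the input by a learned switching function.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], w: Sequence[float], dilation: int) -> List[float]:
    """Same-padded 1D convolution with a 3-tap kernel and zero padding."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, wk in enumerate(w):           # k = 0, 1, 2 maps to taps -1, 0, +1
            j = i + (k - 1) * dilation
            if 0 <= j < n:                   # positions outside the signal count as zero
                acc += wk * x[j]
        out.append(acc)
    return out


def switchable_conv1d(x, w, delta_w, switch, small_rate=1, large_rate=3):
    """Soft switch per position: y = S * conv(x, w, r_small) + (1 - S) * conv(x, w + dw, r_large).

    `switch` holds one value in [0, 1] per position (assumed given here).
    """
    w_large = [a + b for a, b in zip(w, delta_w)]
    y_small = dilated_conv1d(x, w, small_rate)
    y_large = dilated_conv1d(x, w_large, large_rate)
    return [s * a + (1.0 - s) * b for s, a, b in zip(switch, y_small, y_large)]


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7]
    kernel = [1, 1, 1]
    # A switch of 1 selects the small receptive field, 0 the large one.
    print(switchable_conv1d(signal, kernel, [0, 0, 0], [1.0] * 7))
    print(switchable_conv1d(signal, kernel, [0, 0, 0], [0.0] * 7))
```

With intermediate switch values, each position receives a blend of the small and large receptive fields, which is the behaviour the spatially dependent switching function is described as providing.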
[0030] 2. DySample upsampling module. Traditional upsampling methods tend to blur boundaries and lose detail. To address this, this invention introduces the DySample dynamic upsampling module in the neck. DySample upsamples by point resampling, which reduces computation and latency by avoiding time-consuming dynamic convolutions and the additional sub-networks otherwise needed to generate dynamic kernels. The module adaptively upsamples the input feature map while retaining sufficient detail, so the model captures fine features more accurately and detection accuracy improves. DySample generates sampling points using static and dynamic range factors, as shown in Figure 7: the feature map X is passed through a linear layer to produce a feature map of the corresponding size; pixel shuffle combined with the range factor then yields an offset O, which is added to the original grid position G to obtain the sampling set S.
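The relationship S = G + O can be sketched in one dimension as follows. This is an illustrative toy under stated assumptions, not the DySample module: the raw offsets are supplied by the caller (in the module they would come from the linear layer and pixel shuffle), the static range factor of 0.25 is an assumed value, and `dysample_1d` and `linear_sample` are hypothetical names.

```python
import math
from typing import List, Sequence


def linear_sample(x: Sequence[float], pos: float) -> float:
    """Linearly interpolate x at a fractional position, clamping to the borders."""
    pos = min(max(pos, 0.0), len(x) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(x) - 1)
    frac = pos - lo
    return (1.0 - frac) * x[lo] + frac * x[hi]


def dysample_1d(x: Sequence[float], raw_offsets: Sequence[float],
                scale: int = 2, scope: float = 0.25) -> List[float]:
    """Upsample x by `scale` using point resampling at positions S = G + O."""
    n_out = len(x) * scale
    if len(raw_offsets) != n_out:
        raise ValueError("one raw offset is required per output position")
    out = []
    for k in range(n_out):
        g = (k + 0.5) / scale - 0.5        # original grid position G in input coordinates
        o = scope * raw_offsets[k]         # offset O, limited by the range factor
        out.append(linear_sample(x, g + o))
    return out


if __name__ == "__main__":
    features = [0.0, 10.0, 20.0, 30.0]
    # Zero offsets reduce to plain linear upsampling.
    print(dysample_1d(features, [0.0] * 8))
    # Non-zero offsets shift each sampling point away from the regular grid.
    print(dysample_1d(features, [1.0] * 8))
```

With all offsets set to zero the result is ordinary interpolation; content-dependent offsets move each sampling point, which is how detail near boundaries can be retained.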
[0031] 3. ASFFHead detection head module. To address the shortcomings of traditional detection heads in multi-scale feature fusion, this invention proposes a detection head optimization based on adaptive spatial feature fusion (ASFF). An ASFFHead module, whose structure is shown in Figure 8, replaces the original YOLO11 detection head. ASFF uses learnable weight parameters to dynamically adjust the semantic contribution of the multi-scale feature maps, achieving adaptive calibration of cross-level features in the spatial dimension. This filters out conflicting information between features of different scales, enhances the model's scale invariance, and lets the model integrate multi-scale features more effectively. The detection head receives feature maps at multiple scales as input and outputs predictions at three scales. The sampling factors for the feature levels Level 1, Level 2, and Level 3 are 32, 16, and 8, respectively, and ASFF-1, ASFF-2, and ASFF-3 denote the ASFF fusion at the corresponding levels. Taking the first fusion layer, ASFF-1, as an example, the features from the other levels ($X^{2\to 1}$, $X^{3\to 1}$) are resized to the same dimensions as the first-level features ($X^{1\to 1}$), and the learned weight maps are then used for weighted fusion to generate the fused features used for prediction, as shown in the following formula:
$$y^{1} = \alpha \cdot X^{1\to 1} + \beta \cdot X^{2\to 1} + \gamma \cdot X^{3\to 1} \qquad (2)$$
In the formula, α, β, and γ are learnable parameters.
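The weighted fusion at a single level can be sketched as below. The sketch assumes the three feature maps have already been resized to the same height and width, and it assumes that the per-position weights are obtained by a softmax over three learnable logits so that they sum to 1; the text only states that α, β, and γ are learnable, so the normalization and the names `softmax3` and `asff_fuse` are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def softmax3(a: float, b: float, c: float) -> Tuple[float, float, float]:
    """Numerically stable softmax over three logits."""
    m = max(a, b, c)
    ea, eb, ec = math.exp(a - m), math.exp(b - m), math.exp(c - m)
    total = ea + eb + ec
    return ea / total, eb / total, ec / total


def asff_fuse(x1: Grid, x2: Grid, x3: Grid,
              logits: Sequence[Sequence[Tuple[float, float, float]]]) -> Grid:
    """Fuse three same-sized feature maps with per-position weights alpha, beta, gamma."""
    fused = []
    for i, row in enumerate(x1):
        fused_row = []
        for j in range(len(row)):
            alpha, beta, gamma = softmax3(*logits[i][j])
            fused_row.append(alpha * x1[i][j] + beta * x2[i][j] + gamma * x3[i][j])
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    level1 = [[3.0, 3.0]]
    level2 = [[6.0, 6.0]]
    level3 = [[9.0, 9.0]]
    # Equal logits average the levels; a dominant logit selects one level.
    weights = [[(0.0, 0.0, 0.0), (50.0, 0.0, 0.0)]]
    print(asff_fuse(level1, level2, level3, weights))
```

Because the weights vary by spatial position, one region of the fused map can rely mainly on the coarse level while another relies on the fine level.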
[0032] This invention improves the C3k2 module, the upsampling, and the detection head of the original network. In the backbone, C3k2 is replaced with C3k2_SAC to improve feature extraction; in the neck, DySample is used to preserve detail; and in the head, ASFFHead is used to enhance multi-scale feature fusion. The improved YOLO11 structure is shown in Figure 9.
[0033] The input to the production line material blockage recognition model is the standardized image, and its output is a material conveying status label (e.g., "Normal", "Congestion Warning", or "Complete Blockage") and an optional bounding box (BBox) that identifies the blocked area (typically in the form [x_min, y_min, x_width, y_height]).
[0034] The dataset for training is built as follows: collect historical video recordings and images from the key monitoring points of the production line (such as the material discharge hopper and conveyor belt outlet defined in step 1); then use a labeling tool (such as LabelImg) to annotate the blockage conditions in the images. The annotation information includes: category labels, where each frame is labeled "Normal", "Congestion Warning", or "Complete Blockage" based on expert experience or operational records; and bounding boxes, where for images in the "Congestion Warning" and "Complete Blockage" states a rectangular box is drawn tightly around the blocked material area. The labeled dataset is randomly divided into training, validation, and test sets in a ratio of approximately 8:1:1. For example, a dataset of 4000 images can be divided into a training set of 3200 images, a validation set of 400 images, and a test set of 400 images.
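The random 8:1:1 division can be sketched as follows. The function name, the fixed seed, and the placeholder file names are assumptions for illustration; any remainder from integer division is assigned to the test set.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(items: Sequence[T], ratios: Tuple[int, int, int] = (8, 1, 1),
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the labeled samples and split them into train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)     # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]         # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    samples = [f"frame_{i:04d}.jpg" for i in range(4000)]   # placeholder file names
    train, val, test = split_dataset(samples)
    print(len(train), len(val), len(test))
```

For the 4000-image example in the text, this yields 3200, 400, and 400 samples with no overlap between the three sets.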
[0035] Based on the material blockage status of the production line, a preset decision logic is used to issue corresponding control commands to the actuators of the production line. Specifically, the recognition results from step 3 are received, and corresponding control commands and early warning information are output according to the preset decision logic, realizing an automated response to the material blockage status. The core of this step is to transform the "perception" results of the visual recognition model into "execution" actions in the physical world. Its input is the material conveying status label ("Normal", "Congestion Warning", or "Complete Blockage") and the corresponding bounding box (BBox) from step 3. Its output includes command signals sent to the production line control system and warning information for the human-machine interface. The steps are as follows:
Step 401: State resolution and decision logic execution. The system receives the status information output by step 3 in real time and triggers the preset decision logic, a conditional judgment process that ensures a timely and accurate response:
1. Conditional judgment A: Label == "Normal". Decision: material flow is smooth and no intervention is required. Execution path: the system maintains the current monitoring state and does not generate any output command. It may record a "normal status" heartbeat signal in the system log for system health monitoring.
2. Conditional judgment B: Label == "Congestion Warning". Decision: slow material flow or initial accumulation has been detected, and early warning intervention is required to prevent the situation from worsening. Execution path: (1) Information output: immediately display a yellow warning pop-up on the human-machine interface (HMI) in the central control room, together with the real-time image that triggered the warning. The warning information should include: "Congestion warning", "Location: [Automatically generated based on camera ID and BBox location information]", and "Time: [Current system time]". (2) Control command output: send the "decelerate" or "intermittent operation" command CMD_SlowDown to the upstream feeding device (such as a vibrating feeder) through an industrial communication protocol (such as OPC UA or Modbus TCP) to reduce the amount of material fed in at the source, so that the congestion can clear on its own.
3. Conditional judgment C: Label == "Complete Blockage". Decision: material flow has completely stopped, and emergency measures must be taken immediately to avoid equipment damage or production interruption. Execution path: (1) Information output: immediately display a red emergency alarm pop-up on the HMI in the central control room and trigger the on-site audible and visual alarms. The alarm information must include: "Complete blockage!", "Location: [Automatically generated based on camera ID and BBox location information]", and "Time: [Current system time]". (2) Control command output: send the "emergency stop" command CMD_EmergencyStop to the production line main control system (PLC) through industrial I/O modules or communication protocols. This command stops the downstream equipment and the upstream feeding device in sequence according to the preset safety process, and maintenance personnel are notified to handle the blockage immediately.
Step 402: Distribution of control commands and early warning information. This step distributes the control commands and warning information generated in step 401 to the corresponding execution units over a reliable industrial communication link:
1. Command distribution: the commands CMD_SlowDown and CMD_EmergencyStop are sent to the predefined actuators (such as the frequency converter of the feeder or the PLC of the production line) via industrial Ethernet or fieldbus.
2. Information distribution: warning and alarm information is pushed in real time to the HMI screen in the central control room over the factory's internal network and, depending on the configuration, can also be pushed to the mobile terminals of the responsible persons via SMS or a dedicated app.
Step 403: System status update and log recording. After each decision and execution, the system updates its own status and leaves an immutable record, which is crucial for production traceability and system optimization.
Status update: the system's internal state machine is updated according to the current alarm level (e.g., Normal -> Warning -> Normal; or Normal -> Complete Blockage -> Pending processing).
Log recording: the complete chain of the event is recorded in the database, including: the original image frame that triggered the event, or its storage path on the server; the Label and BBox output by the model in step 3; the specific decision-making action taken in step 4 (such as "issuing a deceleration command"); a precise timestamp; and the operator's subsequent confirmation or processing records. Log recording is implemented by a data recording unit, which is connected to the status recognition unit and the decision control unit.
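The conditional judgments of step 401 can be sketched as a small pure function. The command names CMD_SlowDown and CMD_EmergencyStop, the HMI colours, and the state names come from the description above; the `Recognition` and `Decision` types, the `camera_id` field, the message layout, and the label spelling "Complete Blockage" are assumptions made for illustration. Sending the command over OPC UA or Modbus TCP and writing the log record are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_width, y_height)


@dataclass(frozen=True)
class Recognition:
    label: str                     # "Normal", "Congestion Warning", or "Complete Blockage"
    bbox: Optional[BBox]
    camera_id: str                 # hypothetical field used to build the location text


@dataclass(frozen=True)
class Decision:
    hmi_level: Optional[str]       # None, "yellow", or "red"
    command: Optional[str]         # None, "CMD_SlowDown", or "CMD_EmergencyStop"
    message: Optional[str]
    next_state: str                # state machine value after this decision


def decide(rec: Recognition, timestamp: str) -> Decision:
    """Map one recognition result to HMI output, control command, and next state."""
    location = f"camera {rec.camera_id}, bbox {rec.bbox}"
    if rec.label == "Normal":
        # No command is issued; the caller may still log a heartbeat.
        return Decision(None, None, None, "Normal")
    if rec.label == "Congestion Warning":
        msg = f"Congestion warning | Location: {location} | Time: {timestamp}"
        return Decision("yellow", "CMD_SlowDown", msg, "Warning")
    if rec.label == "Complete Blockage":
        msg = f"Complete blockage! | Location: {location} | Time: {timestamp}"
        return Decision("red", "CMD_EmergencyStop", msg, "Pending processing")
    raise ValueError(f"unknown status label: {rec.label!r}")


if __name__ == "__main__":
    result = Recognition("Congestion Warning", (120, 80, 200, 150), "C1")
    print(decide(result, "2026-01-01 08:00:00"))
```

Keeping the decision step free of I/O makes it easy to test each branch in isolation; distribution (step 402) and logging (step 403) can then consume the returned `Decision`.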
[0036] To verify the effectiveness and general applicability of this invention, it was deployed and tested on a centralized carbon black feeding and automatic batching production line of a large rubber products company. Specifically: I. Implementation Environment: Production line: the company uses a centralized feeding system that pneumatically conveys carbon black from storage tanks to the intermediate silos of multiple internal mixers. Before entering each intermediate silo, the conveying pipeline passes through a pipeline reversing valve and a square screen filter. Historically, damp carbon black easily caked and clogged the screen, and the reversing valves malfunctioned due to material blockage, leading to conveying interruptions. The previous approach relied on timed inspections and pressure sensor alarms, so by the time a problem was detected, multiple production lines had often already shut down for lack of material.
[0037] Monitoring point selection: the transparent observation window of the screen filter most prone to clogging was selected as the key monitoring point.
[0038] II. System Deployment: An 8-megapixel industrial dustproof camera with a wide-angle lens and built-in fill light is installed directly in front of the screen viewing window to ensure complete coverage of the screen area and the visible parts of the upstream and downstream pipelines.
[0039] System Integration: Image processing, status recognition, and decision control units are integrated into a workshop-level edge computing industrial PC. This industrial PC connects to the camera via gigabit Ethernet and interacts with the enterprise's MES system and material supply system PLC via the OPC UA protocol.
[0040] III. Implementation Process and Results: After training and fine-tuning the system using the image dataset (normal, half-blocked screen, and fully blocked screen) collected by the company, it was put into continuous operation online.
[0041] Case study of "Congestion Warning" handling: On the 5th day of system operation, the model detected from the real-time video that a small amount of carbon black was accumulating on the screen surface and that the material throughput speed had decreased slightly (output status label: "Congestion Warning", with the bounding box accurately locating the accumulation area). The decision control unit immediately executed the preset logic: a yellow warning message, including the location and real-time video, was pushed to the MES operation interface in the central control room.
[0042] At the same time, a "backflush cleaning pulse enhancement" command (CMD_Increase_Purge) was sent via OPC UA to the PLC of the centralized feeding system, automatically triggering the screen's backflush cleaning device to perform high-intensity, short-duration pulse cleaning.
[0043] Result: After enhanced backflushing, the accumulated carbon black was removed within 30 seconds, screen permeability was restored, and the system automatically returned to "normal" status. Conveying pressure remained stable, and downstream internal mixer production was unaffected. A potential conveying interruption was successfully averted at its inception.
[0044] Case study of "Complete Blockage" handling: On the 18th day of system operation, a batch of carbon black was abnormally damp. Although the system issued a warning and intensified backflushing, the material rapidly clumped on the screen, causing a sudden increase in pipeline pressure, and the material flow visibly stopped completely. The model accurately identified this state (status label: "Complete Blockage").
[0045] Result: The decision control unit immediately triggered the on-site audible and visual alarms and sent the highest-level alarm and the CMD_Emergency_Stop&Switch_Line command to the MES and the feeding PLC. Within seconds, the PLC executed the command: 1) stopping this feeding line; 2) automatically switching to the backup material supply path; and 3) locating the faulty screen and notifying maintenance. Maintenance personnel, guided by the precise location information and real-time video feed from the alarm, went to the site with specialized tools to handle the issue. Because detection and isolation were so rapid, the blockage affected only a single feed line for approximately 20 minutes, and production continuity was maintained through the backup line, minimizing the overall impact.
[0046] During the two-month testing period, the system accurately triggered 34 alarms at the monitoring point, including 28 "congestion warnings," which, through automatic triggering of enhanced cleaning, successfully prevented 26 complete blockages; and 6 "complete blockage" alarms, all of which were quickly isolated and handled. Compared with traditional pressure sensor alarms, the visual warnings were provided an average of 3-5 minutes earlier, enabling preventative maintenance. This successful case has been extended to the company's automated batching system's weighing hopper inlet monitoring and the intelligent automated warehouse's outbound conveyor blockage monitoring. The system has demonstrated good adaptability and reliability, improving the company's production continuity, equipment utilization, and intelligent management level, bringing significant economic benefits.
[0047] Embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0048] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0049] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0050] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0051] Contents not described in detail in this specification are prior art known to those skilled in the art. It is hereby indicated that the above description is intended to help those skilled in the art understand this invention, but does not limit the scope of protection of this invention. Any equivalent substitutions, modifications, improvements, or simplifications of the above descriptions that do not depart from the essential content of this invention fall within the scope of protection of this invention.
Claims
1. A real-time monitoring system for material blockage on a production line based on visual recognition, characterized in that it comprises: an image acquisition unit, fixedly installed at key monitoring points in the material conveying section of the production line, used to acquire real-time video streams from the key monitoring points and output raw image frames; an image processing unit, connected to the image acquisition unit, used to receive the raw image frames, perform preprocessing operations on them, and output a standardized image, wherein the preprocessing operations include image denoising of the raw image frames based on an improved Swin Transformer V2 U-Net network structure, and the improved Swin Transformer V2 U-Net network structure includes a feature embedding module, an encoder downsampling and feature extraction module, a decoder upsampling and feature fusion module, and an image reconstruction module; a status recognition unit, connected to the image processing unit, used to recognize the standardized image based on an improved YOLO11 network structure to determine the material blockage status of the production line, wherein the specific improvements to the YOLO11 network structure are as follows: in the C3k2 module of the backbone network of the original YOLO11 network structure, switchable dilated convolutions replace the standard convolutions; in the neck network of the original YOLO11 network structure, the DySample dynamic upsampling module replaces the traditional upsampling; and the ASFFHead detection head replaces the detection head of the original YOLO11 network structure; and a decision control unit, connected to the status recognition unit, used to issue control commands to the actuators of the production line according to the material blockage status of the production line using preset decision logic.
2. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 1, characterized in that the image acquisition unit includes multiple industrial cameras, which are respectively installed at the key monitoring points in the material conveying section of the production line; the key monitoring points include directly above the feed inlet of the discharge hopper, directly in front of the discharge end of the conveying mechanism, and directly above the conveying mechanism.
3. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 2, characterized in that the preprocessing operations further include: after image denoising, extracting the region of interest to obtain a region of interest image containing only the monitored area; and processing the region of interest image to obtain the standardized image.
4. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 3, characterized in that extracting the region of interest (ROI) after image denoising to obtain an ROI image containing only the monitored area specifically comprises: based on the fixed installation position, viewing angle, and focal length of the industrial camera on the production line, predefining the vertex coordinates of a polygonal region, i.e., the ROI, in the image coordinate system during the system initialization phase; generating a binary mask from the vertex coordinates of the predefined polygonal region, in which pixels inside the ROI have the value 1 and pixels outside the ROI have the value 0; and performing an AND operation between the denoised image and the binary mask to extract the image region inside the ROI, thereby obtaining the region of interest image.
5. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 4, characterized in that processing the region of interest image to obtain the standardized image specifically comprises: standardizing the size and pixel values of the region of interest image to obtain the standardized image.
6. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 5, characterized in that recognizing the standardized image based on the improved YOLO11 network structure specifically comprises: constructing a production line material blockage recognition model based on the improved YOLO11 network structure; training the model on a preset dataset; and inputting the standardized image into the trained model to obtain the material blockage status of the production line.
7. The real-time monitoring system for material blockage on a production line based on visual recognition according to claim 6, characterized in that the system further includes a data recording unit, which is connected to the status recognition unit and the decision control unit and is used to acquire and record relevant information.
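For illustration only, the ROI extraction recited in claim 4 (a binary mask generated from predefined polygon vertices, then combined with the denoised image) can be sketched as follows. The sketch is not part of the claims. It assumes a grayscale image stored as nested lists, tests pixel centres against the polygon with a ray-casting rule, and uses hypothetical function names; a production system would more likely use an image library for rasterization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Image = List[List[int]]


def point_in_polygon(px: float, py: float, vertices: Sequence[Point]) -> bool:
    """Ray-casting test: count crossings of a ray cast to the right of (px, py)."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > py) != (y2 > py):                        # edge straddles the ray's height
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def polygon_mask(width: int, height: int, vertices: Sequence[Point]) -> Image:
    """Binary mask: 1 where the pixel centre lies inside the ROI polygon, else 0."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, vertices) else 0
             for x in range(width)]
            for y in range(height)]


def apply_mask(image: Image, mask: Image) -> Image:
    """Keep pixels where the mask is 1 and zero out everything else."""
    return [[pixel if m == 1 else 0 for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


if __name__ == "__main__":
    denoised = [[9] * 4 for _ in range(4)]                # stand-in for a denoised frame
    roi = [(1, 1), (3, 1), (3, 3), (1, 3)]                # vertices fixed at initialization
    for row in apply_mask(denoised, polygon_mask(4, 4, roi)):
        print(row)
```

Because the camera position is fixed, the mask only needs to be computed once at initialization and can then be reused for every frame.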