A method for recognizing and positioning monitoring of high-altitude operation safety helmet wearing based on pattern recognition
By combining camera calibration and a dual-path feature extraction module with an adaptive illumination-color decoupling algorithm and pseudo-label learning, the method addresses the accuracy and cross-scene adaptability problems of safety helmet wearing recognition in high-altitude operations. It enables accurate recognition and three-dimensional positioning of safety helmet wearing status in high-altitude operation scenarios, improving supervision accuracy and on-site practicality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI WATER CONSERVANCY DEV CO LTD
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies for helmet wearing recognition in high-altitude work scenarios suffer from problems such as weak small target detection capability, severe light interference, low model accuracy, inability to accurately locate targets, and poor cross-scenario adaptability, leading to frequent false detections, missed detections, and invalid warnings, resulting in insufficient effectiveness of on-site supervision.
Multi-coordinate-system transformations are established through camera calibration, and a semantics-depth coupled 3D hierarchical electronic fence model is constructed. A dual-path feature extraction module and an adaptive illumination-color decoupling algorithm are combined to build an end-to-end helmet wearing recognition model that captures fine-grained features and adapts across scenes. The model is further optimized by combining pseudo-label learning with a pattern drift perception mechanism.
It achieves accurate identification and three-dimensional positioning of the helmet wearing status in high-altitude operation scenarios, reduces false detections and missed detections, improves the accuracy of supervision and on-site practicality, and can quickly adapt to changes in scenarios while maintaining high recognition accuracy.
Figure CN122200536A_ABST
Description
Technical Field
[0001] This invention belongs to the field of machine vision technology, and specifically relates to a method for high-altitude work safety helmet wearing recognition and positioning monitoring based on pattern recognition.
Background Technology
[0002] Falls from heights and being struck by objects are the most common types of accidents in high-altitude operations. Wearing safety helmets correctly is a core protective measure to reduce the probability of injury or death in these accidents. Currently, machine vision-based helmet-wearing recognition technology is gradually being applied to on-site engineering supervision; however, in high-altitude work scenarios, existing technology still has the following technical problems: High-altitude work monitoring cameras need to cover a large working area, are deployed over long distances, and have special shooting angles. Safety helmets are often small targets in images, and general target detection models have weak ability to capture the fine-grained features of small targets. At the same time, high-altitude scenes are subject to interference from strong light and backlight, low light and rain, and the sky / building facade being similar in color to the safety helmet, which can easily lead to missed detections and false detections by the model, making it impossible to meet the accuracy requirements for high-altitude scene supervision.
[0003] Most existing technologies can only perform binary classification of wearing versus not wearing a helmet, and cannot make fine distinctions regarding the compliance of helmet wearing. They cannot reliably separate correct wearing from incorrect wearing, or from false-compliance scenarios that are easily misjudged as compliant, such as holding the helmet over the head. The effectiveness of supervision is therefore seriously insufficient.
[0004] Existing technologies can only perform planar target recognition in images, and cannot achieve precise three-dimensional spatial positioning of workers. They cannot spatially bind the results of helmet wearing status recognition to dangerous work areas such as high-altitude edges, openings, and cantilevered scaffolds. They cannot provide accurate graded warnings for high-risk scenarios in dangerous areas where workers are not wearing safety helmets in compliance with regulations. As a result, there are a large number of invalid and false warnings, and the technology is not very practical in the field.
[0005] Existing general recognition models suffer from severe domain shift when applied to high-altitude work scenarios. Recognition accuracy drops significantly after the work scenario changes, and restoring it requires collecting a large number of samples for full retraining, resulting in extremely high deployment costs. At the same time, there is no closed-loop iteration mechanism for recognition data, so the model cannot be continuously optimized based on false positives and false negatives. As the work scenario changes dynamically, the model's recognition accuracy continues to decline.
Summary of the Invention
[0006] The purpose of this invention is to provide a method for identifying and monitoring the wearing of safety helmets for high-altitude operations based on pattern recognition, so as to solve one or more problems mentioned in the background art.
[0007] To achieve the above objective, the present invention provides the following technical solution: a method for high-altitude work safety helmet wearing recognition and positioning monitoring based on pattern recognition, comprising the following specific steps. Preferably, in the calibration and modeling stage, the intrinsic and extrinsic parameters of the monitoring camera for high-altitude operations are calibrated, lens distortion is corrected, and a transformation matrix is established between the image pixel coordinate system, the camera coordinate system, and the scene's three-dimensional world coordinate system. Based on the calibrated camera parameters, a semantic segmentation model for high-altitude scenes is constructed, pixel-level semantic recognition is performed on the scene's reference image, and semantic categories such as the work area, edge, opening, cantilever, work level, sky, and building facade are extracted. A monocular vision depth estimation algorithm then assigns three-dimensional spatial depth attributes to each semantic category. This algorithm builds a depth estimation model from the intrinsic and extrinsic parameters obtained from camera calibration together with prior spatial information about the high-altitude operation scenario. Each semantic region obtained from semantic segmentation is matched to the corresponding prior depth range of the scene: the work level, edge, opening, and cantilever regions use the design elevation of the corresponding floor as the reference depth, and the sky and building facade regions are matched to the far-distance depth range.
Based on the pinhole imaging model of camera imaging, the algorithm combines the pixel position and size ratio of the semantic region in the image to calculate the corresponding continuous depth value for each pixel in the semantic region, generating a pixel-level depth map with the same size as the input image, realizing the one-to-one binding of semantic category and three-dimensional spatial depth attribute.
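The binding of a pixel to a reference depth taken from a floor's design elevation can be illustrated as a ray-plane intersection under the pinhole model. The following is a minimal sketch, not the patented algorithm: it assumes a distortion-corrected pixel, a horizontal reference plane, and a camera-to-world rotation matrix; the function name and all parameter names are hypothetical.

```python
def pixel_to_world_on_plane(u, v, fx, fy, cx, cy, r_cam_to_world, cam_center, plane_z):
    """Intersect the viewing ray of pixel (u, v) with the plane z = plane_z.

    Assumptions (not from the source): the pixel is already undistorted,
    r_cam_to_world is a 3x3 rotation given as nested lists, and cam_center
    is the camera position in world coordinates.
    """
    # Ray direction in the camera frame, scaled so that its z-component is 1.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the ray direction into the world frame.
    d_world = tuple(
        sum(r_cam_to_world[i][j] * d_cam[j] for j in range(3)) for i in range(3)
    )
    if abs(d_world[2]) < 1e-9:
        raise ValueError("viewing ray is parallel to the reference plane")
    # Because d_cam has z = 1, this scale factor equals depth along the optical axis.
    depth = (plane_z - cam_center[2]) / d_world[2]
    if depth <= 0:
        raise ValueError("reference plane is behind the camera")
    point = tuple(cam_center[i] + depth * d_world[i] for i in range(3))
    return point, depth


# Example: a camera 10 m above the floor, looking straight down.
R_DOWN = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]
point, depth = pixel_to_world_on_plane(
    1060.0, 540.0, 1000.0, 1000.0, 960.0, 540.0, R_DOWN, (0.0, 0.0, 10.0), 0.0
)
```

Repeating this for every pixel of a semantic region would give a dense depth map for that region, under the stated planar assumption.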
[0008] By verifying the consistency of spatiotemporal semantic features across continuous frames, the boundaries of dynamically advancing work surfaces are automatically corrected. Vector boundary annotation is completed with the help of manual calibration, and a semantics-depth coupled three-dimensional hierarchical electronic fence pattern library is constructed. The three-dimensional spatial boundaries and semantic attributes of regular work areas, controlled work areas, and high-risk danger areas are defined, completing the spatial semantic pattern modeling of high-altitude work scenarios. The high-risk danger area covers locations with a risk of falling from height, such as openings near edges, cantilevered scaffolds, and unclosed work platforms. Its boundary is a three-dimensional spatial range extending 2 meters outward from the edge of the opening. The controlled work area is the main work area within the construction work level, and its boundary is the construction range of the work surface. The regular work area consists of completed safe passages and material storage areas with complete protective measures, and its boundary is the range enclosed by the on-site protective facilities. The three-dimensional boundary of each area is defined by combining the floor elevation with the depth information calibrated by the camera.
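The three-level fence lookup can be sketched as a simple geometric test. This is only an illustration under strong assumptions that the source does not state: openings and the work surface are modeled as axis-aligned rectangles on one floor, a head counts as being on the floor if it lies within a hypothetical `max_height` above the floor elevation, and everything else on the floor is treated as the regular area. Only the 2-meter margin comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in world x/y coordinates (meters)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def distance_to(self, x, y):
        # Zero inside the rectangle, Euclidean distance to its border outside.
        dx = max(self.x_min - x, 0.0, x - self.x_max)
        dy = max(self.y_min - y, 0.0, y - self.y_max)
        return (dx * dx + dy * dy) ** 0.5


def classify_zone(x, y, z, floor_z, openings, work_surface, margin=2.0, max_height=2.5):
    """Return 'high_risk', 'controlled', 'regular', or 'outside' for a 3D point."""
    if not (floor_z <= z <= floor_z + max_height):
        return "outside"  # not on this floor (max_height is an assumed tolerance)
    if any(o.distance_to(x, y) <= margin for o in openings):
        return "high_risk"  # within 2 m of an opening edge
    if work_surface.distance_to(x, y) == 0.0:
        return "controlled"
    return "regular"


openings = [Rect(0.0, 0.0, 1.0, 1.0)]
work_surface = Rect(-10.0, -10.0, 10.0, 10.0)
zone = classify_zone(2.5, 0.5, 1.7, 0.0, openings, work_surface)
```

A production fence would use the annotated vector boundaries instead of rectangles; the priority order (high risk first) is a design choice made here so that overlapping areas resolve to the more dangerous level.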
[0009] The calibration process uses Zhang's calibration method to improve parameter accuracy, and the semantic segmentation model is pre-trained with small samples to adapt to high-altitude scenarios, ensuring that the electronic fence modeling matches the actual operation scenario.
[0010] Preferably, the sample preprocessing stage is based on the scene semantic pattern library constructed in the calibration and modeling stage. It collects images of all types of high-altitude operation scenes, such as building construction, bridges, power, and wind power, covering different shooting distances, downward angles, lighting conditions, and degrees of occlusion. It constructs a hierarchical fine-grained sample pattern library of main categories and sub-feature patterns. The main categories include standard wearing, non-standard wearing, not wearing, holding the helmet and obscuring the head, and similar color interference. Each main category is decomposed and labeled with dedicated sub-feature patterns. Standard wearing is decomposed into features such as the helmet body being intact, the chin strap fitting the face, and the brim being in the correct position. Non-standard wearing is decomposed into features such as the chin strap being suspended and not fastened, the helmet body being tilted at an angle, and the helmet body being worn backwards. Feature-decoupled adaptive preprocessing is performed on the sample images. An adaptive illumination-color decoupling algorithm is used to separate the illumination and color components of the image. The illumination component is balanced, and the color component is enhanced with helmet-specific color features. Redundant backgrounds are filtered based on a scene semantic pattern library to retain effective image data of the work area. A fine-grained feature-aware super-resolution enhancement algorithm is used, guided by the helmet edge, chin strap lines, and brim texture, to perform targeted super-resolution enhancement on the effective feature areas of the helmet, strengthening the fine-grained features of small targets and completing the sample preprocessing. Invalid samples are removed after preprocessing by feature similarity screening.
[0011] Preferably, during the model training phase, a hierarchical fine-grained sample pattern library obtained during the sample preprocessing phase is used to construct an end-to-end fine-grained helmet wearing pattern recognition model. A residual network is used as the backbone feature extraction network, and a dedicated dual-path feature coupling extraction module for helmets is embedded. The feature coupling extraction module includes a global semantic path and a fine-grained detail path. The global semantic path extracts the overall semantic features of the helmet, while the fine-grained detail path extracts subtle texture features such as the chin strap, brim, and helmet posture. The two paths adopt a cross-path feature-guided coupling mechanism, where the global semantic path locates the target region and guides the fine-grained detail path to complete feature enhancement and extraction within the target region. Cross-level feature fusion enhances the ability to capture fine-grained features of small targets. For scenarios where the head is obscured by a safety helmet or other objects, the model extracts the overall features of the person's head region through the global semantic path, and combines the helmet outline and edge features extracted through the fine-grained detail path to complete the features of the obscured area. By using pre-learned standard safety helmet feature templates, the model matches and completes the missing features in the local occlusion scenario, distinguishing between pseudo-compliant scenarios where the head is obscured by a handheld safety helmet and normal wearing scenarios, and avoiding misjudging a handheld safety helmet that obscures the head as being worn correctly.
[0012] The detection head of the helmet wearing pattern recognition model is equipped with a multi-task coupled output branch, which simultaneously outputs the target box pixel coordinates, the helmet's three-dimensional pose angle, and the hierarchical classification confidence. This enables main category determination and sub-feature compliance verification, and distinguishes between all scene modes such as standard wearing, the various kinds of non-standard wearing, handheld helmets obscuring the head, and similar color interference. End-to-end training is performed using the hierarchical fine-grained sample pattern library. A multi-task joint loss function is constructed, which integrates object detection regression loss, pose estimation angle loss, fine-grained classification hierarchy loss, and feature consistency loss. The focal loss function is improved to address the sample class imbalance problem, and model convergence and accuracy are optimized to obtain a pre-trained benchmark model for helmet wearing pattern recognition. An adaptive learning rate adjustment strategy is adopted during training to help the model converge, and cross-validation is used to reduce the influence of outlier samples.
[0013] The improved focal loss function sets differentiated category weights for the main categories and sub-feature categories of the hierarchical classification system of this invention. Rare sample categories, such as handheld helmets obscuring the head and unfastened chin straps, receive higher weights, while common standard-wearing sample categories receive lower weights. In addition, for the small target detection scenario of high-altitude helmets, the loss weight of small-sized target samples is increased and the loss weight of large-sized, easily identified samples is reduced. This guides the model to focus on small targets that are difficult to identify and on rare, non-standard wearing scenarios, addressing the model bias caused by sample class imbalance.
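The weighting scheme described above can be sketched on top of the standard focal loss term. The source does not give the exact formula, so the size weight below (a linear boost for boxes smaller than a reference area) and all default values are assumptions chosen for illustration.

```python
import math


def weighted_focal_loss(p_true, class_weight, box_area, gamma=2.0,
                        small_area=32 * 32, small_boost=1.0):
    """Focal loss for one sample with a class weight and a small-target weight.

    p_true:       predicted probability of the ground-truth class
    class_weight: larger for rare categories, smaller for common ones
    box_area:     target box area in pixels
    The size weighting formula is a hypothetical choice, not from the source.
    """
    p = min(max(p_true, 1e-7), 1.0)
    # Standard focal term: down-weights samples that are already well classified.
    focal = (1.0 - p) ** gamma * -math.log(p)
    # Boxes smaller than small_area get up to (1 + small_boost) times the loss.
    size_weight = 1.0 + small_boost * max(0.0, 1.0 - box_area / small_area)
    return class_weight * size_weight * focal


rare_small = weighted_focal_loss(0.5, class_weight=2.0, box_area=256)
common_large = weighted_focal_loss(0.5, class_weight=0.5, box_area=4096)
```

At the same predicted probability, the rare small-target sample contributes a larger loss than the common large-target sample, which is the intended bias correction.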
[0014] The feature consistency loss is used to constrain the feature consistency of the same safety helmet target extracted by the global semantic path and the fine-grained detail path, while also constraining the feature distribution stability of the same target in consecutive frames. For the same target, the cosine similarity between the global semantic features and the fine-grained detail features is calculated, and a loss penalty is applied to samples with similarity below a threshold. This guides the features extracted by the two paths to focus on the effective features of the same safety helmet target, while reducing feature fluctuations caused by scene changes and lighting changes, and improving the feature extraction stability of the model.
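The cross-path part of this loss can be sketched as a hinge on cosine similarity. The source does not state the similarity threshold, so the value 0.8 below is a placeholder, and the sketch assumes both paths have already been projected to feature vectors of the same length.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(global_feat, detail_feat, threshold=0.8):
    """Penalize a target only when its two path features disagree.

    threshold is a hypothetical value; the source only says 'a threshold'.
    """
    return max(0.0, threshold - cosine_similarity(global_feat, detail_feat))


aligned = consistency_loss([1.0, 0.0], [1.0, 0.0])
orthogonal = consistency_loss([1.0, 0.0], [0.0, 1.0])
```

The same function could be applied to features of one target in consecutive frames to express the temporal stability constraint.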
[0015] Preferably, in the model optimization stage, a small number of unlabeled images are collected from the target high-altitude operation scenario being deployed, forming an incremental learning adaptation sample set that is used together with the pre-trained benchmark model obtained in the model training stage. A semantic-feature dual-domain alignment unsupervised domain adaptation algorithm aligns the source scene and target scene distributions of the pre-trained benchmark model. Semantic domain alignment unifies the distribution of semantic categories for high-altitude operations, eliminating semantic differences between scene backgrounds. Feature domain alignment aligns the local feature distributions of fine-grained features such as the helmet body, chin strap, and posture, eliminating cross-scene domain shift. Semi-supervised pseudo-label learning is used to adapt the model to the specific scenario. A pattern drift perception closed-loop calibration mechanism and a feature pattern library of false positives and false negatives are established. False positives and false negatives generated during operation are automatically decomposed into feature patterns, classified, and archived. The drift state of the model feature distribution is monitored in real time. When the scene changes or the false positive and false negative rates exceed the threshold, incremental learning is automatically triggered to update the classifier decision boundary and the local weights of the feature extraction module for the corresponding feature patterns, suppressing the decay of model accuracy. Pseudo-label generation adopts a confidence threshold screening mechanism, and the false positive and false negative drift threshold is set at 5%.
[0016] The semi-supervised pseudo-label learning uses the domain-adapted pre-trained baseline model as the inference model to perform inference on the unlabeled adaptation sample set of the target scene, outputting the target detection box, wearing status classification result and corresponding confidence score for each sample; a confidence score threshold of 0.8 is set, and high-confidence samples with classification confidence scores higher than the threshold are selected and pseudo-labels consistent with the manual annotation format are automatically generated for them; the generated pseudo-label samples are mixed with a small number of manually annotated target scene samples to construct an incremental training set, and fine-tuning training is performed on the model output branch to enable the model to quickly adapt to the imaging environment, background features and operation mode of the target scene.
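The confidence screening step can be sketched as a simple filter. The dictionary keys and the `source` tag are hypothetical; only the 0.8 threshold and the strict "higher than" comparison come from the text.

```python
def select_pseudo_labels(predictions, threshold=0.8):
    """Keep predictions whose classification confidence is above the threshold.

    Each prediction is a dict with hypothetical keys:
    'image', 'box', 'label', 'confidence'.
    """
    return [
        {"image": p["image"], "box": p["box"], "label": p["label"], "source": "pseudo"}
        for p in predictions
        if p["confidence"] > threshold
    ]


def build_incremental_set(pseudo_labeled, manually_labeled):
    """Mix pseudo-labeled samples with the small manually annotated set."""
    return list(manually_labeled) + list(pseudo_labeled)


predictions = [
    {"image": "a.jpg", "box": (10, 10, 40, 40), "label": "standard_wearing", "confidence": 0.95},
    {"image": "b.jpg", "box": (5, 5, 20, 20), "label": "not_wearing", "confidence": 0.80},
    {"image": "c.jpg", "box": (0, 0, 15, 15), "label": "not_wearing", "confidence": 0.50},
]
pseudo = select_pseudo_labels(predictions)
```

A sample at exactly 0.8 is excluded here because the text asks for scores higher than the threshold.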
[0017] The pattern drift perception and monitoring uses the target scene feature distribution at the initial deployment of the model as the baseline distribution. It collects the helmet target features and scene semantic features extracted by the model during online inference at fixed intervals, calculates the cosine distance between the current feature distribution and the baseline distribution as the feature distribution drift degree, and simultaneously calculates the false detection rate and false negative rate of the model inference results every day and compares them with the preset 5% drift threshold. When the feature distribution drift degree exceeds the preset threshold of 0.2 (cosine distance), or the false detection rate and false negative rate exceed the 5% threshold for 3 consecutive days, it is determined that pattern drift has occurred, and the incremental learning process is automatically triggered to extract the false detection and false negative samples and the newly added scene samples in the corresponding period to complete the model weight update and pattern calibration.
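The trigger condition can be sketched as follows. Two points are assumptions rather than statements from the source: each feature distribution is summarized by a single mean feature vector, and a day counts toward the 3-day streak when either the false detection rate or the false negative rate exceeds 5%.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def drift_detected(baseline_mean, current_mean, daily_rates,
                   dist_threshold=0.2, rate_threshold=0.05, days=3):
    """Return True when either drift condition from the text is met.

    daily_rates: list of (false_detection_rate, false_negative_rate) per day,
    oldest first.
    """
    if cosine_distance(baseline_mean, current_mean) > dist_threshold:
        return True
    streak = 0
    for fp_rate, fn_rate in daily_rates:
        # Assumption: either rate above the threshold marks the day as bad.
        if fp_rate > rate_threshold or fn_rate > rate_threshold:
            streak += 1
            if streak >= days:
                return True
        else:
            streak = 0
    return False


three_bad_days = drift_detected(
    [1.0, 0.0], [1.0, 0.1], [(0.06, 0.01), (0.02, 0.07), (0.06, 0.06)]
)
```

If the intended rule requires both rates to exceed the threshold, the `or` becomes `and`; the rest of the logic is unchanged.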
[0018] Preferably, the spatial positioning stage is based on the scene-adapted pattern recognition model obtained in the model optimization stage, combined with the camera calibration parameters and the 3D electronic fence pattern library from the calibration and modeling stage. It reads the real-time video stream of the monitoring camera, extracts single-frame images at a preset frame rate, and performs end-to-end inference with the scene-adapted pattern recognition model, outputting the hierarchical classification results of the worker's safety helmet wearing, the pixel coordinates of the target box, the helmet pose estimation results, and the coordinates of the fine-grained feature points. Based on the camera calibration transformation matrix, and combined with prior information such as the multi-dimensional features of the safety helmet, the national standard helmet size, and the work level elevation, the spatial coordinates of the worker's head in the three-dimensional world coordinate system are calculated using a monocular vision PnP (Perspective-n-Point) algorithm. The depth information of the three-dimensional spatial coordinates is then used to correct the target box detection confidence, filtering out false detection targets with inconsistent depths. This achieves bidirectional coupling optimization of wearing status recognition and three-dimensional spatial positioning, binding all features of the wearing status to the three-dimensional spatial coordinates one by one, and at the same time associating the positioning results with the scene semantic information.
[0019] Based on the helmet wearing status classification results, helmet posture estimation results, and fine-grained feature point coordinates output by the model inference, the calculated 3D spatial coordinates are optimized and corrected. For targets with proper wearing and stable helmet posture, the coordinate calculation results are optimized with the feature point coordinates of the center point of the top of the helmet as the core. For targets with improper wearing and abnormal posture, the 3D coordinate matching relationship of the feature points is corrected by combining the helmet posture angle, eliminating the coordinate calculation error caused by the abnormal posture, and improving the 3D spatial positioning accuracy.
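The role of the known helmet size as a depth prior can be illustrated with a similar-triangles estimate. This is deliberately simpler than the PnP solution named in the text and is not a substitute for it: it assumes a fronto-parallel helmet, returns camera-frame coordinates only, and uses a placeholder helmet width of 0.25 m and a placeholder tolerance of 20% that are not taken from the source or from any standard.

```python
def head_position_from_helmet_width(u, v, pixel_width, fx, fy, cx, cy,
                                    helmet_width_m=0.25):
    """Estimate the camera-frame position of a helmet from its apparent width.

    (u, v) is the helmet center in undistorted pixels; pixel_width is the
    width of the helmet in the image. helmet_width_m is a hypothetical value.
    """
    if pixel_width <= 0:
        raise ValueError("pixel_width must be positive")
    z = fx * helmet_width_m / pixel_width  # similar triangles: w = f * W / Z
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z


def depth_consistent(estimated_z, expected_z, rel_tolerance=0.2):
    """Flag detections whose depth disagrees with the scene prior (assumed tolerance)."""
    return abs(estimated_z - expected_z) <= rel_tolerance * expected_z


x, y, z = head_position_from_helmet_width(1060.0, 540.0, 25.0, 1000.0, 1000.0, 960.0, 540.0)
```

A detection whose size-based depth is far from the depth expected for its work level would be a candidate for filtering, which is the kind of confidence correction the text describes.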
[0020] Preferably, the violation determination stage integrates the worker's wearing status, three-dimensional spatial coordinates, and fine-grained feature data obtained in the spatial positioning stage. Based on continuous frame image information, a spatiotemporal feature coupled multi-target tracking algorithm is adopted, and a unique fixed ID is assigned to each worker target based on the three-dimensional motion trajectory, wearing status, helmet posture, and fine-grained feature points. The real-time 3D spatial coordinates and semantic attributes of the target are matched against the 3D hierarchical electronic fence pattern library constructed in the calibration and modeling stage. Combined with the hierarchical wearing status classification results, a two-dimensional hierarchical judgment rule for violations is established, covering a spatial risk mode and a wearing compliance mode. Violation events are divided into high, medium, and low levels, forming a dual verification of feature evidence and spatial evidence. Invalid warnings and false warnings are filtered out to complete the hierarchical judgment of violation events. The tracking algorithm uses a Kalman filter to predict the target's motion trajectory, the association threshold based on feature and trajectory similarity is set to 0.7, and time accumulation verification is added to the judgment rule.
[0021] The time-cumulative verification sets corresponding duration thresholds for different levels of violations, calculated based on a video stream frame rate of 15-25fps. The duration threshold for high-level violations is set to 3 consecutive frames, for medium-level violations to 5 consecutive frames, and for low-level violations to 10 consecutive frames. Only when the target meets the corresponding level of violation judgment conditions in consecutive frames and the duration exceeds the corresponding threshold is it finally judged as a valid violation event. Abnormal states where the number of single frames or consecutive frames does not reach the threshold are judged as invalid warnings and are directly filtered.
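The consecutive-frame check can be sketched as a per-track counter. The class name and interface are hypothetical, and the sketch assumes that a violation becomes valid once the streak reaches the threshold (3, 5, or 10 frames) and that a change of level restarts the count.

```python
FRAME_THRESHOLDS = {"high": 3, "medium": 5, "low": 10}


class ViolationDebouncer:
    """Confirm a violation only after enough consecutive frames at one level."""

    def __init__(self):
        self._streaks = {}  # track_id -> (level, consecutive frame count)

    def update(self, track_id, level):
        """Feed one frame; level is 'high', 'medium', 'low', or None if compliant.

        Returns True when the streak has reached the threshold for that level.
        """
        if level is None:
            self._streaks.pop(track_id, None)  # streak is broken
            return False
        prev_level, count = self._streaks.get(track_id, (level, 0))
        count = count + 1 if prev_level == level else 1
        self._streaks[track_id] = (level, count)
        return count >= FRAME_THRESHOLDS[level]


debouncer = ViolationDebouncer()
results = [debouncer.update(7, "high") for _ in range(3)]
```

At 15 to 25 fps, three frames correspond to roughly 0.12 to 0.2 seconds, so high-level violations are confirmed almost immediately while low-level ones need a longer streak.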
[0022] Preferably, the closed-loop handling stage receives the graded violation event results output by the violation judgment stage, and triggers the corresponding graded handling mechanism according to the violation event level. High-level violations immediately trigger on-site audio and visual alarms and push warning information including personnel ID, three-dimensional spatial location, violation screenshot, and characteristic evidence chain to the management personnel. For medium and low-level violations, the entire process of event and evidence chain archiving is performed, and violation statistical reports and trend analysis are pushed out periodically. All violation samples and false positives and false negatives confirmed on-site are automatically decomposed into feature patterns and archived into the incremental learning sample library and the false positive and false negative feature pattern library, which are then fed back to the model optimization stage. Based on the pattern drift perception closed-loop calibration mechanism, the incremental learning and pattern calibration of the model are adaptively triggered.
[0023] The beneficial effects of this invention are as follows: 1. This invention eliminates interference from lighting and similar color backgrounds through feature decoupling adaptive preprocessing, while enhancing the fine-grained features of small targets on the safety helmet. Combined with a dual-path feature coupling extraction module, it accurately captures the global semantics and subtle texture features of the safety helmet. A hierarchical fine-grained sample pattern library is constructed to achieve multi-dimensional classification of safety helmet wearing status. It can accurately distinguish between standard wearing, various non-standard wearing, and false compliance scenarios such as holding the safety helmet to cover the head, making the regulatory coverage of high-altitude operation safety helmet identification more comprehensive.
[0024] 2. This invention establishes a multi-coordinate system transformation relationship through camera calibration, calculates the three-dimensional spatial coordinates of workers using a monocular vision algorithm, and deeply binds the wearing status characteristics with spatial coordinates. At the same time, it completes spatial matching by combining a three-dimensional hierarchical electronic fence. It achieves continuous and stable tracking of personnel through a spatiotemporal feature coupling tracking algorithm, establishes a dual-dimensional violation judgment rule of spatial risk and wearing compliance, and forms a dual verification mechanism to filter invalid warnings, making the safety supervision of high-altitude operations more targeted and improving the accuracy and practicality of warnings.
[0025] 3. This invention achieves rapid scene adaptation through a semantic-feature dual-domain alignment unsupervised domain adaptation algorithm, eliminating cross-scene domain shift without collecting a large number of samples for full retraining. At the same time, it establishes a pattern drift perception closed-loop calibration mechanism, which archives violations, false detections, and missed detections that occur during operation and feeds them back to the model optimization stage, automatically triggering incremental learning to update the model parameters and forming a full-process data closed loop. This enables the model to adapt to the dynamic changes of high-altitude operation scenarios, continuously suppressing accuracy decay and maintaining high recognition accuracy over a long period of time.
Attached Figure Description
[0026] Figure 1 is an overall flowchart of the method of the present invention; Figure 2 is a flowchart of the calibration modeling and 3D hierarchical electronic fence construction process of the present invention; Figure 3 is a flowchart of the training process for the fine-grained helmet wearing pattern recognition model of the present invention; Figure 4 is a flowchart of the spatial positioning and violation classification judgment process of the present invention.
Detailed Implementation
[0027] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0028] As shown in Figures 1 to 4, this embodiment of the invention provides a method for high-altitude work safety helmet wearing recognition and positioning monitoring based on pattern recognition, including the following specific steps. The calibration and modeling stage involves calibrating the intrinsic and extrinsic parameters of the monitoring camera for high-altitude operations, correcting lens distortion, and establishing a transformation matrix between the image pixel coordinate system, the camera coordinate system, and the scene's three-dimensional world coordinate system. Based on the calibrated camera parameters, a semantic segmentation model for high-altitude scenes is constructed. Pixel-level semantic recognition is performed on the scene's reference image to extract semantic categories such as the work area, edge, opening, cantilever, work level, sky, and building facade. A monocular vision depth estimation algorithm is then used to assign three-dimensional spatial depth attributes to each semantic category. The high-altitude scene semantic segmentation model adopts a lightweight encoder-decoder architecture. The encoder uses MobileNetV2 as the backbone network to extract multi-scale semantic features of the image, and the decoder uses bilinear interpolation upsampling combined with skip connections to restore the feature map resolution to the input image size. The model input is a calibrated and distortion-corrected 3-channel RGB image, uniformly scaled to 512×512 pixels, with pixel values normalized to the [0, 1] interval. The model output is a single-channel semantic segmentation mask with the same size as the input image, in which each pixel value corresponds to one of 8 preset high-altitude scene semantic labels.
[0029] By verifying the consistency of spatiotemporal semantic features across continuous frames, the boundaries of dynamically advancing work surfaces are automatically corrected. Vector boundary annotation is completed with the help of manual calibration, and a semantics-depth coupled three-dimensional hierarchical electronic fence pattern library is constructed. The three-dimensional spatial boundaries and semantic attributes of regular work areas, controlled work areas, and high-risk danger areas are defined, completing the spatial semantic pattern modeling of high-altitude work scenarios. The consistency verification first extracts the semantic segmentation results of 10 consecutive frames. For core semantic categories such as work area, edge, and work level, the boundary pixel coordinates of the corresponding region in each frame are extracted, and the boundary overlap of same-category regions between adjacent frames is calculated. Boundary regions with an overlap of more than 90% are determined to be stable boundaries and are directly retained and fixed. For dynamic boundary regions whose overlap falls below the threshold, the boundary change trend across the continuous frames is used to eliminate abnormal boundaries that change abruptly in a single frame, and a continuous, smooth work surface advancement boundary is fitted, so that the dynamic work surface is automatically corrected and updated during construction.
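The adjacent-frame overlap test can be sketched with pixel sets. The source does not define how overlap is measured, so intersection over union of the boundary pixel sets is used here as an assumption; only the 10-frame window and the 90% threshold come from the text.

```python
def overlap_ratio(pixels_a, pixels_b):
    """Intersection over union of two collections of (x, y) boundary pixels."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_stable_boundary(frames, threshold=0.9):
    """True when every pair of adjacent frames overlaps by more than the threshold.

    frames: list of boundary pixel collections for one semantic category,
    one entry per consecutive frame.
    """
    return all(overlap_ratio(p, q) > threshold for p, q in zip(frames, frames[1:]))


base = {(x, 0) for x in range(100)}
shifted = {(x, 0) for x in range(20, 120)}
stable = is_stable_boundary([base] * 10)
unstable = is_stable_boundary([base] * 5 + [shifted] + [base] * 4)
```

Boundaries that fail this test would then go through the outlier removal and smoothing step described above, which is not shown here.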
[0030] The calibration process uses Zhang's calibration method to improve parameter accuracy, and the semantic segmentation model is pre-trained with small samples to adapt to high-altitude scenarios, ensuring that the electronic fence modeling matches the actual operation scenario.
[0031] Pre-training initializes the model with weights pre-trained on a public dataset of high-altitude scenes. Small-sample adaptation training uses 100-200 labeled high-altitude operation images of the target scene, with a batch size of 8, the AdamW optimizer, an initial learning rate of 1e-4, a cosine annealing learning rate decay strategy, 50 training epochs, and the cross-entropy loss as the training loss. After training, the intersection over union (IoU) of the model's semantic segmentation is no less than 85%.
[0032] Zhang's calibration method is applied here to high-altitude work scenarios in which the monitoring camera looks down at a large angle. It uses a 200mm×200mm, 9×6 checkerboard calibration board. Within the work area covered by the camera, the calibration board is placed in three distance intervals (near, medium, and far) and at different horizontal and pitch angles, and at least 15 calibration images are collected in each interval. The calibration process first solves the camera's intrinsic parameters and distortion coefficients, and then solves the extrinsic parameters using the prior information of the work level elevation. After calibration, the reprojection error of the results is checked. Calibration images with a reprojection error greater than 0.5 pixels are discarded and re-acquired to ensure that the accuracy of the calibration parameters meets the requirements of three-dimensional spatial positioning.
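The acceptance check on calibration images can be sketched as below. This is an illustrative sketch only: the per-image reprojection errors are assumed to have been computed already by the calibration routine, and the function and variable names are hypothetical.

```python
def filter_calibration_images(errors_by_interval, max_error=0.5, min_images=15):
    """Keep images within the error limit and report how many must be re-acquired.

    errors_by_interval maps an interval name ("near", "medium", "far") to a
    dict of {image_name: reprojection_error_in_pixels}.
    """
    kept, to_reacquire = {}, {}
    for interval, errors in errors_by_interval.items():
        good = {name: err for name, err in errors.items() if err <= max_error}
        kept[interval] = good
        # Each interval needs at least min_images accepted images.
        to_reacquire[interval] = max(0, min_images - len(good))
    return kept, to_reacquire


# Illustrative errors; min_images is lowered to 2 to keep the example short.
sample_errors = {"near": {"img_a": 0.2, "img_b": 0.7}}
kept_images, missing = filter_calibration_images(sample_errors, min_images=2)
```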
[0033] The sample preprocessing stage is based on the scene semantic pattern library built in the calibration and modeling stage. It collects images of all types of high-altitude operation scenes such as building construction, bridges, power, and wind power, covering different shooting distances, downward angles, lighting conditions and occlusion degrees, and constructs a fine-grained sample pattern library with a hierarchical structure of main categories and sub-feature patterns. The sample annotation adopts a binding specification of main category and sub-feature level. First, the target box and main category label are labeled for the safety helmet target in each sample image. Then, the corresponding sub-feature labels are labeled for the targets within the same target box. The sub-feature labels are bound to the main category labels one by one. During the annotation process, the fine-grained feature point coordinates of the helmet edge, chin strap line and brim corner are annotated simultaneously. Together with the target box and classification label, they form a complete sample annotation file, ensuring that the annotation content corresponds one-to-one with the output content of the multi-task coupled output branch of the model.
[0034] The main categories are proper wearing, improper wearing, not wearing, a hand-held helmet covering the head, and similar-color interference. Each main category is broken down into, and labeled with, dedicated sub-feature patterns. Proper wearing is broken down into features such as an intact helmet body, a chin strap fitted to the face, and a forward-facing brim. Improper wearing is broken down into features such as a loose, unfastened chin strap, a tilted helmet body, and a helmet worn backwards. The sample images undergo feature-decoupling adaptive preprocessing, in which an adaptive illumination-color decoupling algorithm separates the illumination component from the color component of the image. The illumination component is equalized to eliminate interference from strong light, backlight, and weak light, and the color component is enhanced for the characteristic colors of safety helmets to suppress interference from similarly colored backgrounds. The algorithm first converts the input RGB sample image to the YCbCr color space, separating the Y component, which carries illumination information, from the Cb and Cr components, which carry color information. Adaptive histogram equalization is applied to the Y component to compress the dynamic range of strongly lit areas and raise the brightness of weakly lit areas, eliminating the interference of uneven illumination with target recognition. For the Cb and Cr components, color enhancement intervals are set based on the five standard colors commonly used for safety helmets: red, yellow, blue, white, and orange. Contrast-oriented enhancement is applied to color components within these intervals, while smoothing suppression is applied to background color components outside them, such as building facades and the sky, reducing the probability of misidentification caused by similarly colored backgrounds. 
After processing, the Y, Cb, and Cr components are converted back to the RGB color space, and the preprocessed image is output.
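The decoupling step rests on a reversible RGB to YCbCr conversion. The sketch below assumes the full-range BT.601 (JFIF) coefficients, which the text does not specify, and leaves the illumination adjustment as a caller-supplied function `adjust_y`, a hypothetical stand-in for the adaptive histogram equalization. The Cb/Cr interval enhancement is not shown.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) conversion; inputs in [0, 255]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion, clamped to the valid [0, 255] range."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return tuple(min(255.0, max(0.0, c)) for c in (r, g, b))


def decouple_and_adjust(pixels, adjust_y):
    """Adjust only the illumination (Y) channel; chroma is left untouched."""
    result = []
    for r, g, b in pixels:
        y, cb, cr = rgb_to_ycbcr(r, g, b)
        result.append(ycbcr_to_rgb(adjust_y(y), cb, cr))
    return result


# With an identity adjustment the pixel is recovered up to rounding error.
roundtrip = decouple_and_adjust([(200, 50, 30)], lambda y: y)
```

Because brightness lives only in Y, any equalization applied there leaves the helmet hue in Cb and Cr unchanged, which is the property the algorithm relies on.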
[0035] Redundant backgrounds are filtered out based on the scene semantic pattern library so that only effective image data of the work area is retained. A fine-grained feature-aware super-resolution enhancement algorithm, guided by the helmet edge, chin strap lines, and brim texture, then performs targeted super-resolution enhancement on the effective feature areas of the safety helmet, strengthening the fine-grained features of small targets and completing sample preprocessing. The algorithm first extracts key feature regions such as the helmet edge, chin strap lines, and brim texture from the sample image using an edge detection operator, generating a fine-grained feature guiding mask. Based on the guiding mask, it locks onto the effective feature region where the safety helmet target is located and performs 4x super-resolution reconstruction on this region to raise the image resolution of small target regions. The background region outside the mask is not super-resolved; it is only slightly blurred to reduce the interference of redundant background information with feature extraction. During reconstruction, the edge details of the helmet outline and chin strap lines are preserved preferentially, so that the enhanced image fully presents the fine-grained features needed to judge whether the safety helmet is worn compliantly.
[0036] The number of samples collected should be no less than 10,000, and the proportion of samples in each main category should be balanced. The proportion of samples with niche features such as improper wearing should be no less than 15%. After preprocessing, invalid samples should be removed by feature similarity screening to ensure the quality of the sample library.
[0037] The preprocessed sample images are uniformly scaled to 640×640 pixels in 3-channel RGB format, and the pixel values are normalized to the range of [-1, 1]. The corresponding annotation files are matched synchronously to form a standardized training sample set, which is then directly input into the subsequent end-to-end fine-grained helmet wearing pattern recognition model.
[0038] The feature similarity screening process involves first extracting the safety helmet feature vector from the preprocessed sample image and calculating the cosine similarity with the standard feature template of the corresponding category in the hierarchical fine-grained sample pattern library. A similarity threshold of 0.3 is set (based on the cosine similarity statistics of the standard safety helmet feature template). Samples with similarity below the threshold are determined to be invalid samples without effective safety helmet features and are directly removed. For samples with similarity above the threshold, the feature repetition of the sample is further compared with that of the existing samples in the library. Redundant samples with repetition higher than 0.95 are removed. The final retained samples are all qualified samples with effective features and no redundancy.
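The two-step screening can be sketched as follows. The text does not say how feature repetition is measured, so this sketch assumes cosine similarity for both steps; the function names and the toy two-dimensional vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def screen_samples(candidates, template, library, valid_thr=0.3, dup_thr=0.95):
    """Drop samples without helmet features, then drop near-duplicates."""
    existing = list(library)
    accepted = []
    for vec in candidates:
        # Step 1: similarity to the category template below 0.3 -> invalid.
        if cosine_similarity(vec, template) < valid_thr:
            continue
        # Step 2: repetition above 0.95 against retained samples -> redundant.
        if any(cosine_similarity(vec, other) > dup_thr for other in existing):
            continue
        existing.append(vec)
        accepted.append(vec)
    return accepted


template_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.11], [1.0, 1.0]]
qualified = screen_samples(candidate_vecs, template_vec, library=[])
```

Here the first candidate is rejected as invalid, the third as a near-duplicate of the second, and the remaining two are retained.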
[0039] In the model training phase, a hierarchical fine-grained sample pattern library obtained in the sample preprocessing phase is used to construct an end-to-end fine-grained helmet wearing pattern recognition model. A residual network is used as the backbone feature extraction network, and a dedicated dual-path feature coupling extraction module for helmets is embedded. The feature coupling extraction module includes a global semantic path and a fine-grained detail path. The global semantic path extracts the overall semantic features of the helmet to achieve long-distance small-sized target localization, while the fine-grained detail path extracts subtle texture features such as the chin strap, brim, and helmet posture. The two paths adopt a cross-path feature-guided coupling mechanism, where the global semantic path locates the target area and guides the fine-grained detail path to complete feature enhancement and extraction within the target area. Cross-level feature fusion enhances the ability to capture fine-grained features of small targets. The global semantic path consists of three consecutive residual bottleneck modules. Each module contains a 1×1 convolutional dimensionality reduction layer, a 3×3 convolutional feature extraction layer, and a 1×1 convolutional dimensionality increase layer, outputting a high-level semantic feature map with a stride of 16 relative to the input image, used for target region localization and candidate box generation. The fine-grained detail path consists of four consecutive convolutional modules. Each module contains two 3×3 convolutional layers and one max pooling layer, maintaining a stride of 4 for the output feature map and preserving high-resolution fine-grained features. The cross-path feature-guided coupling mechanism is as follows: the target region feature map output by the global semantic path is used by the spatial attention module to generate a target region weight mask. 
The weight mask is then multiplied pixel-wise with the feature map output by the fine-grained detail path to achieve targeted enhancement of fine-grained features within the target region. The enhanced fine-grained features and global semantic features are then fused across layers through a feature concatenation layer and output to the model detection head.
[0040] The spatial attention module first performs global max pooling and global average pooling on the target region feature map output by the global semantic path, aggregating the global context information of the feature map to obtain two pooled feature vectors. After fusing the two feature vectors, they are processed by convolutional layers and activation functions to generate a spatial weight mask with the same size as the fine-grained detail path feature map. In the weight mask, the weight value of the safety helmet target candidate region is close to 1, and the weight value of the background region is close to 0, thereby achieving directional focusing on the safety helmet target region, suppressing invalid features in the background region, and enhancing the fine-grained feature extraction effect of distant small targets.
[0041] The detection head of the helmet wearing pattern recognition model is equipped with a multi-task coupled output branch, which simultaneously outputs the pixel coordinates of the target box, the three-dimensional pose angles of the helmet, and hierarchical classification confidences. This enables main category determination and sub-feature compliance verification, and distinguishes all scene modes, such as standard wearing, the various kinds of non-standard wearing, a hand-held helmet covering the head, and similar-color interference. The multi-task coupled output branch contains three parallel decoding heads that share the fused feature map output by the backbone network and the dual-path feature coupling extraction module: 1. a target detection decoding head with an output dimension of 4×N, where N is the preset number of detection anchor points, giving the center coordinates and normalized width and height of the target box, used to locate the safety helmet target area; 2. a pose estimation decoding head with an output dimension of 3×N, giving normalized values of the pitch, yaw, and roll angles of the helmet's three-dimensional pose, used to judge whether the wearing pose is compliant; 3. a hierarchical classification decoding head with an output dimension of C×N, where C is the total number of preset classification categories, comprising the hierarchical classification labels of 5 main categories and 8 sub-features, used to output the classification confidence of the wearing status. The outputs of the three decoding heads correspond one-to-one to the same detection anchor points, so that the localization, pose, and classification results of the same target are tightly bound together.
[0042] End-to-end training is performed using the hierarchical fine-grained sample pattern library. A multi-task joint loss function is constructed that integrates the target detection regression loss, the pose estimation angle loss, the fine-grained hierarchical classification loss, and the feature consistency loss. The focal loss function is improved to address sample class imbalance, and model convergence and accuracy optimization are completed to obtain a pre-trained benchmark model for safety helmet wearing pattern recognition. The multi-task joint loss function is the weighted sum of the four losses with fixed weights. The target detection regression loss uses the CIoU loss function with a weight coefficient of 0.4; the pose estimation angle loss uses the smooth L1 loss function with a weight coefficient of 0.2; the fine-grained hierarchical classification loss uses the improved focal loss function with a focusing coefficient of 2, a balance factor of 0.25, and a weight coefficient of 0.3; and the feature consistency loss uses the cosine similarity loss function with a weight coefficient of 0.1.
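The fixed-weight combination can be sketched as below. The text does not detail how the focal loss is improved, so the sketch uses the standard focal loss with the stated focusing coefficient and balance factor; the individual loss values passed to `joint_loss` are assumed to be computed elsewhere, and all names are hypothetical.

```python
import math

# Fixed weight coefficients stated in the description.
LOSS_WEIGHTS = {
    "detection": 0.4,       # CIoU regression loss
    "pose": 0.2,            # smooth L1 angle loss
    "classification": 0.3,  # focal loss
    "consistency": 0.1,     # cosine similarity loss
}


def focal_loss(p_true, gamma=2.0, alpha=0.25):
    """Standard focal loss for the probability assigned to the true class."""
    p = min(max(p_true, 1e-7), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def joint_loss(losses):
    """Weighted sum of the four task losses."""
    return sum(LOSS_WEIGHTS[name] * losses[name] for name in LOSS_WEIGHTS)


example_total = joint_loss({
    "detection": 1.0, "pose": 1.0, "classification": 1.0, "consistency": 1.0,
})
```

The four weights sum to 1, so the joint loss stays on the same scale as its components, and the factor (1 - p)^2 shrinks the contribution of samples that are already classified confidently.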
[0043] The training process employs an adaptive learning rate adjustment strategy and runs for 300 epochs. Training is terminated early when the model loss converges to below 0.01 and shows no significant decrease for 10 consecutive epochs, ensuring that the model converges to its optimal state. Cross-validation is used at the same time to eliminate the influence of outlier samples and improve the model's generalization ability.
[0044] Model training initializes the backbone network with pre-trained ResNet50 weights and uses a batch size of 16. The optimizer is SGD with momentum, with a momentum factor of 0.937, a weight decay coefficient of 0.0005, and an initial learning rate of 0.01. The adaptive learning rate adjustment strategy applies linear warm-up for the first 3 epochs and cosine annealing decay thereafter, with the minimum learning rate set to 0.01 times the initial learning rate. During training, the samples are split into a training set and a validation set at a ratio of 8:2, and cross-validation is performed every 5 epochs to remove outlier samples whose accuracy on the validation set is below 60%.
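The warm-up and cosine schedule can be written as a function of the epoch index. This is a minimal sketch under stated assumptions: the schedule is evaluated once per epoch, warm-up ramps linearly up to the initial learning rate, and the total of 300 epochs is taken from the preceding paragraph. The function name is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=0.01, warmup_epochs=3,
                  total_epochs=300, min_ratio=0.01):
    """Learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    min_lr = base_lr * min_ratio
    # Progress through the cosine phase, from 0.0 to 1.0.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - 1 - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(300)]
```

The schedule starts at one third of 0.01, reaches 0.01 after warm-up, and decays to 0.0001 at the final epoch.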
[0045] The cross-validation adopts a 5-fold cross-validation method, which randomly divides the training sample set into 5 non-overlapping subsets. Each time, 4 subsets are selected as the training set and 1 subset is selected as the validation set, and the training is repeated 5 times. For each sample, its recognition accuracy and classification confidence in the 5 validations are calculated. If a sample has a classification error or a target detection box regression deviation of more than 2 pixels in 3 or more validations, it is judged as an abnormal sample and removed from the training sample set to avoid abnormal samples interfering with the model convergence direction.
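The removal rule for abnormal samples can be sketched as follows. The record layout (`correct`, `box_dev_px`) is a hypothetical representation of one sample's result in each of the 5 validations.

```python
def is_abnormal_sample(results, max_box_dev_px=2.0, min_failures=3):
    """Flag a sample that fails in 3 or more of its validation results.

    A result fails when the class is wrong or the detection box regression
    deviates by more than 2 pixels.
    """
    failures = sum(
        1 for r in results
        if (not r["correct"]) or r["box_dev_px"] > max_box_dev_px
    )
    return failures >= min_failures


# Two misclassifications plus one box deviation: three failures in total.
noisy_sample = [
    {"correct": False, "box_dev_px": 0.5},
    {"correct": True, "box_dev_px": 3.1},
    {"correct": False, "box_dev_px": 0.4},
    {"correct": True, "box_dev_px": 0.6},
    {"correct": True, "box_dev_px": 1.0},
]
clean_sample = [{"correct": True, "box_dev_px": 0.8} for _ in range(5)]
```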
[0046] In the model optimization phase, targeting the actual high-altitude operation scenario, a small number of unlabeled images are collected to construct an incremental learning adaptation sample set, based on the pre-trained benchmark model obtained in the model training phase. A semantic-feature dual-domain alignment unsupervised domain adaptive algorithm is adopted to perform source scene and target scene distribution alignment on the pre-trained benchmark model. Semantic domain alignment achieves unified distribution of semantic categories specific to high-altitude operations, eliminating semantic differences between scene backgrounds. Feature domain alignment performs local feature distribution alignment for fine-grained features such as the helmet body, chin strap, and posture, achieving cross-scene domain offset elimination. The dual-domain alignment algorithm is implemented based on an adversarial learning architecture, adding two parallel domain discriminators: a semantic domain discriminator and a feature domain discriminator. The semantic domain discriminator takes the global semantic features output by the model's backbone network as input and outputs the domain classification probabilities of the source and target scenes. It achieves global semantic distribution alignment between the source and target scenes through min-max adversarial training. The feature domain discriminator takes the fine-grained features output by the dual-path feature coupling extraction module as input and performs domain classification only on features within the candidate region of the safety helmet target. It achieves local fine-grained feature distribution alignment through adversarial training. The algorithm takes labeled samples from the source scene and unlabeled samples from the target scene as input and outputs the domain-adapted model weights without modifying the original inference structure of the model.
[0047] Semi-supervised pseudo-label learning adapts the model to the specific scene without full-sample retraining or manual annotation, reducing the cost of scene deployment. A closed-loop calibration mechanism for pattern drift perception and a feature pattern library of false positives and false negatives are established. False positives and false negatives produced during operation are automatically decomposed into feature patterns, classified, and archived, and the drift of the model's feature distribution is monitored in real time. When the scene changes or the false positive and false negative rates exceed the threshold, incremental learning is triggered automatically to update the classifier decision boundary and the local weights of the feature extraction module for the corresponding feature patterns, suppressing the decay of model accuracy and improving recognition accuracy and anti-interference capability in the target scene. The incremental learning steps are as follows: 1. Extract the incremental samples of the corresponding feature pattern from the archived sample library, with no fewer than 50 samples, and split them into a training set and a validation set at a ratio of 7:3. 2. Freeze the bottom-layer weights of the model backbone network and the dual-path feature extraction module, and unfreeze only the weights of the top feature extraction layer and the multi-task coupled output branch. 3. Fine-tune with the AdamW optimizer, an initial learning rate of 1e-5, a batch size of 8, 20 training epochs, and the aforementioned multi-task joint loss function. 4. After fine-tuning, if the overall recognition accuracy of the model is not less than 98% of its accuracy before the update, and the recognition accuracy on samples of the corresponding feature pattern has improved by not less than 10%, update the model weights and replace the online inference model.
[0048] After incremental learning is completed, a fixed validation sample set is used to perform accuracy verification on the updated model. The validation sample set includes samples from regular scenarios and corresponding feature pattern samples that trigger incremental learning. The model is considered to be qualified for update only when the overall recognition accuracy of the model is not lower than 98% of the accuracy of the model before the update, and the recognition accuracy of the corresponding feature pattern samples is improved by not less than 10%. The online inference model is then replaced. For models that fail the verification, the sample is expanded again to perform incremental learning, but no online replacement is performed.
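The acceptance gate for an updated model can be expressed as a single predicate. The sketch assumes that "improved by not less than 10%" means an absolute gain of 10 percentage points, which the text leaves open; the function name is hypothetical.

```python
def accept_update(old_overall, new_overall, old_pattern, new_pattern,
                  retain_ratio=0.98, min_gain=0.10):
    """Decide whether the fine-tuned model may replace the online model.

    Accuracies are fractions in [0, 1]. min_gain is interpreted here as an
    absolute improvement (an assumption, see the lead-in).
    """
    keeps_overall = new_overall >= retain_ratio * old_overall
    improves_pattern = (new_pattern - old_pattern) >= min_gain
    return keeps_overall and improves_pattern


accepted = accept_update(0.95, 0.94, 0.70, 0.85)
rejected_overall = accept_update(0.95, 0.92, 0.70, 0.85)
rejected_gain = accept_update(0.95, 0.95, 0.70, 0.75)
```

Both conditions must hold, so a model that fixes the drifting pattern at the cost of overall accuracy is not deployed.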
[0049] The number of samples collected for the adaptation sample set is only 50-100. The pseudo-label generation adopts a confidence threshold screening mechanism, and the false detection and false negative rate drift threshold is set to 5% to ensure the model adaptation efficiency and optimization accuracy. The optimized model can be directly used for subsequent spatial localization inference.
[0050] Automatic feature pattern decomposition first performs feature extraction on false positives and false negatives, yielding five core feature patterns: scene background features, lighting features, target size features, helmet pose features, and occlusion features. The decomposed feature patterns are then compared with the existing templates in the false positive and false negative feature pattern library. If a pattern matches an existing feature category, the sample is added to the incremental sample set of that category. If it matches no existing template, it is added as a new feature pattern category, and the feature pattern library is updated synchronously to provide an accurate basis for classifying samples in subsequent incremental learning.
[0051] The spatial positioning stage is based on the scene-adapted pattern recognition model obtained in the model optimization stage. It combines the camera calibration parameters and the 3D electronic fence pattern library from the calibration modeling stage, reads the real-time video stream of the monitoring camera, extracts single-frame images at a preset frame rate, and uses the scene-adapted pattern recognition model to perform end-to-end inference. It outputs the hierarchical classification results of workers' safety helmet wearing, the pixel coordinates of the target box, the helmet pose estimation results, and the coordinates of fine-grained feature points. Based on the camera calibration transformation matrix, and combined with prior information such as the multi-dimensional features of the safety helmet, national standard size, and work level elevation, the spatial coordinates of the worker's head in the three-dimensional world coordinate system are calculated using a monocular vision PNP algorithm. The target bounding box pixel coordinates and fine-grained feature point coordinates output by the model inference correspond to the four corner points of the helmet body and the two feature points of the chin strap, totaling six two-dimensional pixel feature points, which serve as the input two-dimensional coordinates for the monocular vision PNP algorithm. Combined with the three-dimensional standard dimensions of the helmet body corresponding to the national standard size of the helmet, the three-dimensional world coordinate system reference coordinates corresponding to the six feature points are constructed to form the input three-dimensional reference point set of the PNP algorithm.
[0052] By using depth information from three-dimensional spatial coordinates to reverse correct the target bounding box detection confidence, and filtering out false detection targets with inconsistent depth, a two-way coupling optimization of wearing status recognition and three-dimensional spatial positioning is achieved, binding all-dimensional features of wearing status to three-dimensional spatial coordinates one by one. The video stream frame rate is set to 15-25fps to ensure a balance between real-time performance and recognition accuracy. The PNP algorithm incorporates robustness optimization to eliminate abnormal coordinate points, keeping the average positioning error within 5cm. At the same time, the positioning results are associated with scene semantic information.
[0053] The depth information reverse correction logic first defines the depth range of the effective high-altitude operation area based on a three-dimensional hierarchical electronic fence pattern library. For each target's three-dimensional spatial coordinates obtained from the calculation, it is determined whether its depth value is within the depth range of the effective operation area. If the target's depth value exceeds the effective range, it is determined to be a false background target, and the detection confidence of the target is directly cleared to zero and filtered out. If the target's depth value is within the effective range, based on the depth value and the target's pixel size in the image, the actual physical size of the target is reverse-checked to see if it conforms to the national standard size range for safety helmets. The detection confidence of targets with inconsistent sizes is reduced, and targets below the confidence threshold are directly filtered out. Finally, the valid targets retained are all real safety helmet targets with matching depth and size.
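The reverse correction logic can be sketched with the pinhole relation between pixel size, depth, and physical size. All numeric values below (depth range, helmet size range, focal length, penalty factor) are illustrative placeholders rather than values from this embodiment or from the national standard, and the function name is hypothetical.

```python
def correct_confidence(conf, depth_m, box_width_px, fx,
                       depth_range_m, size_range_m, penalty=0.5):
    """Adjust detection confidence using depth and physical size checks."""
    min_depth, max_depth = depth_range_m
    # Targets outside the effective work area depth range are background.
    if not (min_depth <= depth_m <= max_depth):
        return 0.0
    # Pinhole model: physical width = pixel width * depth / focal length.
    width_m = box_width_px * depth_m / fx
    min_size, max_size = size_range_m
    if min_size <= width_m <= max_size:
        return conf
    # Size mismatch: reduce confidence (the penalty factor is an assumption).
    return conf * penalty


DEPTH_RANGE = (5.0, 30.0)   # placeholder effective work area, meters
SIZE_RANGE = (0.20, 0.35)   # placeholder helmet width range, meters
outside = correct_confidence(0.9, 50.0, 25, 1000.0, DEPTH_RANGE, SIZE_RANGE)
plausible = correct_confidence(0.9, 10.0, 25, 1000.0, DEPTH_RANGE, SIZE_RANGE)
oversized = correct_confidence(0.9, 10.0, 100, 1000.0, DEPTH_RANGE, SIZE_RANGE)
```

A downstream confidence threshold then filters out the penalized targets, as the paragraph describes.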
[0054] The robustness optimization first obtains multiple initial 3D coordinate solutions based on the PNP algorithm. Then, combined with the prior information of the scene's work layer elevation, it selects valid solutions whose depth values are within the effective work area. The valid solutions are then subjected to reprojection error calculation, and abnormal coordinate points whose reprojection errors exceed a preset threshold are removed. The remaining valid coordinate points are then iteratively optimized using a random sampling consensus algorithm to remove outlier interference. Finally, the optimal 3D spatial coordinates are output, ensuring the stability and accuracy of coordinate calculation in complex scenes.
[0055] The violation determination stage integrates the worker's wearing status, three-dimensional spatial coordinates, and fine-grained feature data obtained in the spatial positioning stage. Based on consecutive frame image information, a spatiotemporal feature-coupled multi-target tracking algorithm is adopted. The three-dimensional motion trajectory, wearing status, helmet pose, and fine-grained feature points serve as the basis for association, and a unique fixed ID is assigned to each worker target, achieving stable tracking across consecutive frames and spatiotemporal consistency verification of the wearing status. The tracking algorithm first uses a Kalman filter to predict the target's three-dimensional position in the current frame from its three-dimensional coordinates and velocity in the previous frame, producing a trajectory prediction that serves as the spatial basis for temporal association. At the same time, it extracts the feature vectors of the target's wearing status classification result, helmet pose angles, and fine-grained feature points in the current frame and calculates their cosine similarity with the corresponding feature vectors of the same-ID target in the previous frame, which serves as the feature basis for temporal association. The spatial matching degree of the trajectory prediction and the feature vector similarity are combined in a weighted sum with weights of 0.6 and 0.4 to obtain the total correlation degree. Targets whose total correlation degree is higher than the threshold of 0.7 retain their original unique ID, achieving stable tracking across consecutive frames. For targets whose wearing status changes abruptly during tracking, a spatiotemporal consistency check over 3 consecutive frames is performed to avoid state jumps caused by misjudgment in a single frame.
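The association score can be sketched as below. The text does not define how the spatial matching degree is computed from the Kalman prediction, so it is taken here as a precomputed value in [0, 1]; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def total_correlation(spatial_match, feat_prev, feat_curr,
                      w_spatial=0.6, w_feature=0.4):
    """Weighted sum of spatial matching degree and feature similarity (6:4)."""
    return (w_spatial * spatial_match
            + w_feature * cosine_similarity(feat_prev, feat_curr))


def keeps_original_id(score, threshold=0.7):
    """The target keeps its ID only when the score exceeds the threshold."""
    return score > threshold


same_worker = total_correlation(0.9, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
other_worker = total_correlation(0.5, [1.0, 0.0], [0.0, 1.0])
```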
[0056] The spatiotemporal consistency verification of wearing status involves establishing a continuous frame wearing status time sequence for each worker target with a fixed ID. The wearing status classification result output in the current frame is compared with the time sequence of the previous 5 frames. If the current frame state changes abruptly from the stable state of the previous consecutive frames, and the abrupt change lasts for less than 3 frames, it is determined to be a single-frame misjudgment, and the stable state result of the previous frames is retained. The wearing status result of the target is only updated when the change in wearing status lasts for more than 3 frames, to avoid misjudgment of status caused by blurry single-frame images or temporary occlusion.
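The per-target status smoothing can be sketched as a small state machine. The text leaves a change lasting exactly 3 frames undefined; this sketch assumes the status is updated only when the change persists for more than 3 frames. The class name and status strings are hypothetical.

```python
class WearingStatusFilter:
    """Suppress short-lived status changes for one tracked worker ID."""

    def __init__(self, min_persist_frames=3):
        self.min_persist_frames = min_persist_frames
        self.stable = None      # last confirmed status
        self.candidate = None   # status currently being observed
        self.count = 0          # consecutive frames of the candidate

    def update(self, observed):
        if self.stable is None:
            self.stable = observed
        elif observed == self.stable:
            # Back to the stable status: discard the pending change.
            self.candidate, self.count = None, 0
        else:
            if observed == self.candidate:
                self.count += 1
            else:
                self.candidate, self.count = observed, 1
            # Accept the change only after more than min_persist_frames.
            if self.count > self.min_persist_frames:
                self.stable = observed
                self.candidate, self.count = None, 0
        return self.stable


flicker = WearingStatusFilter()
flicker_out = [flicker.update(s) for s in ["worn"] * 5 + ["missing"] * 2 + ["worn"] * 2]
removed = WearingStatusFilter()
removed_out = [removed.update(s) for s in ["worn"] + ["missing"] * 4]
```

A two-frame flicker leaves the status unchanged, while four consecutive frames of the new status cause an update on the fourth frame.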
[0057] The real-time three-dimensional coordinates and semantic attributes of the target are matched against the three-dimensional hierarchical electronic fence pattern library built in the calibration and modeling stage. Combined with the hierarchical wearing status classification results, a dual-dimensional hierarchical judgment rule for violations is established, covering a spatial risk mode and a wearing compliance mode. Violation events are divided into high, medium, and low levels, forming a dual verification by feature evidence and spatial evidence that filters out invalid and false warnings and completes the hierarchical judgment of violation events. The tracking algorithm uses a Kalman filter to predict the target's motion trajectory, and the correlation threshold based on feature and trajectory similarity is set to 0.7 to ensure stable tracking in multi-target scenes with no occlusion or light occlusion. Time accumulation verification is added to the judgment rule to avoid invalid warnings caused by single-frame misjudgment and to improve the reliability of the judgment.
[0058] The dual-dimensional hierarchical violation judgment rule first divides the spatial risk mode of the three-dimensional hierarchical electronic fence into three levels: the high-risk danger area covers locations with a high risk of falls from height, such as edges, openings, and cantilevered scaffolds; the controlled work area is the routine operating area within the work level; and the regular work area is the safe passage area for non-operational activities. The wearing compliance mode is then divided into three categories: the compliant status is wearing a safety helmet correctly; the general non-compliance status covers improper wearing such as an unfastened chin strap or a tilted helmet; and the serious non-compliance status covers not wearing a safety helmet and false compliance such as holding the safety helmet over the head. A high-level violation is defined as the target being in a high-risk danger area with a seriously non-compliant wearing status. A medium-level violation is defined as the target being in a controlled work area with a seriously non-compliant wearing status, or in a high-risk danger area with a generally non-compliant wearing status. A low-level violation is defined as the target being in a controlled work area with a generally non-compliant wearing status, or in a regular work area with a seriously non-compliant wearing status. A target is judged to have committed a violation of a given level only when it simultaneously meets the corresponding conditions of the spatial risk mode and the wearing compliance mode. An abnormal wearing status in a single frame without a matching spatial risk is not judged to be a violation event.
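The grading rule amounts to a lookup over the two dimensions. In the sketch below the string keys are hypothetical labels. Combinations that the rule does not list, including a regular work area with general non-compliance, return `None`, meaning no violation event; treating that unlisted combination this way is an assumption.

```python
# (spatial risk level, wearing compliance status) -> violation level
VIOLATION_RULES = {
    ("high_risk", "serious"): "high",
    ("controlled", "serious"): "medium",
    ("high_risk", "general"): "medium",
    ("controlled", "general"): "low",
    ("regular", "serious"): "low",
}


def violation_level(area, wearing_status):
    """Return 'high', 'medium', 'low', or None when no rule matches."""
    return VIOLATION_RULES.get((area, wearing_status))


edge_without_helmet = violation_level("high_risk", "serious")
passage_loose_strap = violation_level("regular", "general")
compliant_at_edge = violation_level("high_risk", "compliant")
```

Keeping the rule as a table makes the dual verification explicit: a level is produced only when both the spatial key and the wearing key match an entry.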
[0059] The closed-loop handling stage receives the graded violation event results output by the violation judgment stage, and triggers the corresponding graded handling mechanism according to the violation event level. High-level violations immediately trigger on-site audio and visual alarms and push warning information including personnel ID, three-dimensional spatial location, violation screenshots, and characteristic evidence chains to management personnel. Medium and low-level violations are archived with the entire process of events and evidence chains, and violation statistical reports and trend analysis are pushed out regularly. The audible and visual alarms are linked through the intelligent monitoring PTZ cameras and audible and visual alarms deployed on-site. The alarm command is transmitted to the on-site equipment via Ethernet. Once triggered, the alarm will continuously emit audible and visual prompts until the violation is eliminated or the management personnel manually confirm the cancellation. The warning information pushed to the management personnel is sent simultaneously through three methods: platform pop-up window, mobile APP push, and SMS reminder, to ensure that the management personnel receive and handle the situation as soon as possible. The warning information is also archived in the platform event database and cannot be tampered with.
[0060] All violation samples and all false positives and false negatives confirmed on-site are automatically decomposed into feature patterns, archived into the incremental learning sample library and the false positive / false negative feature pattern library, and fed back to the model optimization stage as fresh samples for incremental learning. The pattern drift perception closed-loop calibration mechanism adaptively triggers incremental learning and pattern calibration, so the model is iterated and optimized continuously without manual intervention. This forms a closed data loop of pattern recognition, spatial matching, violation judgment, handling feedback, pattern calibration, and model optimization. The closed-loop results also feed the features of the violations, false positives, and false negatives actually observed on-site back to the earlier stages, providing real scene data for electronic fence boundary optimization in the calibration and modeling stage, for expansion of the feature pattern library in the sample preprocessing stage, and for iterative updates of the training samples in the model training stage. This improves the accuracy of each stage and achieves full-link optimization. The response time from completion of violation determination to triggering of the alarm is no more than 1 second. Reports are generated automatically on a daily, weekly, and monthly basis and include violation type, regional distribution, and frequency statistics, providing data support for on-site safety management.
[0061] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0062] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for high-altitude work safety helmet wearing recognition and positioning monitoring based on pattern recognition, characterized in that the method comprises the following steps:
Calibration and modeling stage: perform parameter calibration and distortion correction on the monitoring camera of the high-altitude operation scene, establish a multi-coordinate system transformation relationship, construct a high-altitude scene semantic segmentation model, perform pixel-level semantic recognition on scene images, assign three-dimensional spatial depth attributes to the recognition results, construct a three-dimensional hierarchical electronic fence pattern library through continuous-frame spatiotemporal semantic feature consistency verification and manual calibration, and complete scene spatial semantic modeling;
Sample preprocessing stage: collect images of all types of high-altitude operation scenarios based on the scene semantic pattern library, construct a hierarchical fine-grained sample pattern library with decomposed and labeled features, and strengthen small target features and remove invalid samples through feature decoupling adaptive preprocessing;
Model training stage: use the preprocessed hierarchical fine-grained sample pattern library to construct an end-to-end fine-grained helmet wearing pattern recognition model, set multi-task coupled output branches to achieve full-scene recognition, and complete model training with a multi-task joint loss function and an adaptive training strategy to obtain a pre-trained benchmark model;
Model optimization stage: adapt the pre-trained benchmark model to the target high-altitude operation scenario, establish a closed-loop calibration mechanism, update model parameters through incremental learning, and suppress model accuracy decay;
Spatial positioning stage: have the scene-adapted pattern recognition model read the monitoring video stream and perform inference, outputting the worker's safety helmet wearing status and feature coordinates; combine the camera calibration parameters and scene prior information to calculate the three-dimensional spatial coordinates through a monocular vision algorithm, achieving bidirectional coupling optimization of recognition and positioning and binding the features to the spatial data;
Violation determination stage: integrate the wearing status and three-dimensional coordinate data to achieve continuous tracking and status verification of operators, and combine them with the three-dimensional hierarchical electronic fence pattern library to complete the hierarchical determination of violation events and filter invalid warnings;
Closed-loop handling stage: trigger the corresponding handling mechanism according to the level of the violation determination result, and at the same time archive the violation, false detection, and missed detection samples and feed them back to the model optimization stage to trigger automatic iterative optimization of the model.
2. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 1, characterized in that, in the calibration and modeling stage, the intrinsic and extrinsic parameters of the monitoring camera are calibrated, lens distortion is corrected, and transformation matrices are established between the image pixel coordinate system, the camera coordinate system, and the three-dimensional world coordinate system of the scene. Based on the calibrated camera parameters, semantic categories including the work area, edge, opening, cantilever, work level, sky, and building facade are extracted by the high-altitude scene semantic segmentation model. A monocular vision depth estimation algorithm assigns three-dimensional spatial depth attributes to each semantic category. The boundary of the work surface is automatically corrected by continuous-frame spatiotemporal semantic feature consistency verification, and vector boundary annotation is completed by manual calibration. A semantic-depth coupled three-dimensional hierarchical electronic fence pattern library is constructed to define the three-dimensional spatial boundaries and semantic attributes of regular work areas, controlled work areas, and high-risk danger areas. The calibration process adopts Zhang's calibration method to improve parameter accuracy. The high-altitude scene semantic segmentation model is adapted to high-altitude scenes through small-sample pre-training.
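As an illustration of the coordinate system transformation recited in claim 2, the following Python sketch applies the standard pinhole camera model to map a world point to a pixel. It is a sketch under assumptions rather than the claimed implementation: lens distortion is assumed to be already corrected, the function names are hypothetical, and the intrinsic values in the usage note are placeholders rather than calibration results.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def world_to_camera(p_world: Vec3, R: Sequence[Sequence[float]], t: Vec3) -> Vec3:
    """Apply the extrinsic parameters: X_c = R * X_w + t."""
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )


def camera_to_pixel(
    p_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float]:
    """Apply the intrinsic parameters (pinhole projection, no distortion)."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)
```

With an identity rotation, zero translation, and placeholder intrinsics `fx = fy = 1000`, `cx = 960`, `cy = 540`, the world point `(1, 2, 10)` projects to pixel `(1060, 740)`.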
3. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 2, characterized in that, in the sample preprocessing stage, based on the scene semantic pattern library, images of all types of high-altitude operation scenes, including building construction, bridges, electric power, and wind power, are collected, covering different shooting distances, downward angles, lighting conditions, and degrees of occlusion. A hierarchical fine-grained sample pattern library of main categories and sub-feature patterns is constructed. The main categories include standard wearing, non-standard wearing, not wearing, holding the helmet over the head, and similar-color interference. Each main category is decomposed and labeled with dedicated sub-feature patterns. Standard wearing is decomposed into features such as an intact helmet body, a chin strap fitting the face, and a forward-facing brim. Non-standard wearing is decomposed into features such as an unfastened chin strap, a tilted helmet body, and a helmet body worn backwards. Feature decoupling adaptive preprocessing is performed on the sample images. The illumination and color components of the image are separated by an adaptive illumination-color decoupling algorithm. The illumination component is balanced, and the color component is enhanced with helmet-specific color features. Redundant backgrounds are filtered based on the scene semantic pattern library. A fine-grained feature-aware super-resolution enhancement algorithm selectively enhances the fine-grained features of small targets, guided by the helmet edge, chin strap lines, and brim texture. After preprocessing, invalid samples are removed by feature similarity screening.
4. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 3, characterized in that, in the model training stage, the constructed end-to-end fine-grained helmet wearing pattern recognition model uses a residual network as the backbone feature extraction network and embeds a dedicated dual-path feature coupling extraction module for helmets. The feature coupling extraction module includes a global semantic path and a fine-grained detail path. The global semantic path extracts the overall semantic features of the helmet, while the fine-grained detail path extracts the subtle texture features of the chin strap, brim, and helmet posture. The two paths adopt a cross-path feature-guided coupling mechanism, in which the global semantic path locates the target region and guides the fine-grained detail path to complete feature enhancement and extraction within the target region. Cross-level feature fusion enhances the ability to capture fine-grained features of small targets. The multi-task coupled output branch of the model detection head synchronously outputs the target box pixel coordinates, the three-dimensional pose angle of the helmet, and the hierarchical classification confidence, realizing main category determination and sub-feature compliance verification. The training process constructs a multi-task joint loss function that integrates object detection regression loss, pose estimation angle loss, fine-grained hierarchical classification loss, and feature consistency loss. Class imbalance among samples is addressed with an improved focal loss function, an adaptive learning rate adjustment strategy is adopted during training, and the influence of outlier samples is eliminated through cross-validation.
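To make the loss construction in claim 4 concrete, the sketch below combines four loss terms as a weighted sum and shows the standard focal loss for a single sample. These are assumptions for illustration: the claim refers to an improved focal loss whose form is not given, so the standard formulation FL(p) = -alpha * (1 - p)^gamma * log(p) is substituted here, and the weights, `alpha`, `gamma`, and function names are hypothetical.

```python
import math
from typing import Sequence


def focal_loss(p_true_class: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Standard focal loss for one sample.

    p_true_class is the predicted probability of the correct class. The
    (1 - p)^gamma factor down-weights easy, well-classified samples, which
    is how focal loss counteracts class imbalance.
    """
    p = min(max(p_true_class, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def joint_loss(
    detection: float,
    pose: float,
    classification: float,
    consistency: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the four loss terms named in claim 4 (weights assumed)."""
    terms = (detection, pose, classification, consistency)
    return sum(w * term for w, term in zip(weights, terms))
```

A confidently correct prediction contributes far less than a hard one: `focal_loss(0.9)` is much smaller than `focal_loss(0.1)`.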
5. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 4, characterized in that, in the model optimization stage, starting from the pre-trained benchmark model, an adaptation sample set is constructed by collecting unlabeled images of the target high-altitude operation scenario. A semantic-feature dual-domain alignment unsupervised domain adaptation algorithm aligns the distributions of the source scene and the target scene and eliminates cross-scene domain shift. Semi-supervised pseudo-label learning is used to adapt the model to specific scenarios. A pattern drift perception closed-loop calibration mechanism and a feature pattern library of false positives and false negatives are established. False positives and false negatives generated during operation are automatically decomposed into feature patterns, classified, and archived. The drift status of the model feature distribution is monitored in real time. When the scenario changes or the false positive and false negative rates exceed the threshold, incremental learning is automatically triggered to update the model parameters for the corresponding feature patterns and suppress accuracy decay. Pseudo-label generation adopts a confidence threshold screening mechanism, and the drift threshold for the false positive and false negative rates is set to 5%.
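The two thresholds in claim 5 can be sketched as follows. Only the 5% drift threshold comes from the claim; everything else is an assumption for illustration, including the function names, the 0.9 pseudo-label confidence cutoff, the definition of each rate as a count divided by the number of reviewed detections, and the choice to trigger when either rate exceeds the threshold.

```python
from typing import List, Tuple

DRIFT_THRESHOLD = 0.05  # 5%, as stated in claim 5


def select_pseudo_labels(
    predictions: List[Tuple[str, str, float]], min_confidence: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep only (sample_id, label) pairs whose confidence passes the cutoff.

    min_confidence is a hypothetical value; the claim does not specify it.
    """
    return [(sid, label) for sid, label, conf in predictions if conf >= min_confidence]


def should_trigger_incremental_learning(
    false_positives: int,
    false_negatives: int,
    total_reviewed: int,
    scene_changed: bool = False,
    drift_threshold: float = DRIFT_THRESHOLD,
) -> bool:
    """Trigger on a scene change or when either error rate exceeds the threshold."""
    if scene_changed:
        return True
    if total_reviewed <= 0:
        return False
    fp_rate = false_positives / total_reviewed
    fn_rate = false_negatives / total_reviewed
    return fp_rate > drift_threshold or fn_rate > drift_threshold
```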
6. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 5, characterized in that, in the spatial positioning stage, single-frame images of the monitoring video stream are extracted at a preset frame rate, and the hierarchical classification results of the workers' safety helmet wearing, the pixel coordinates of the target box, the helmet posture estimation results, and the coordinates of fine-grained feature points are output through model inference. Based on the camera calibration transformation matrix, combined with the prior information of the safety helmet's national standard size and the working layer elevation, the spatial coordinates of the worker's head in the three-dimensional world coordinate system are calculated using a monocular vision PnP (Perspective-n-Point) algorithm. The depth information of the three-dimensional spatial coordinates is used in reverse to correct the target box detection confidence and filter out false detection targets with inconsistent depth. The full set of wearing status features is bound one-to-one to the three-dimensional spatial coordinates, and the positioning results are associated with the scene semantic information.
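Claim 6 relies on a PnP solver, which is not reproduced here. As a simplified stand-in that shows why a known helmet size makes monocular depth recoverable, the sketch below uses the similar-triangles relation Z = f * W / w and then back-projects the pixel into camera coordinates. The function names, the 0.25 m helmet width, and the 20% depth tolerance are placeholders chosen for illustration, not values from the source.

```python
from typing import Tuple


def depth_from_known_width(focal_px: float, real_width_m: float, pixel_width: float) -> float:
    """Similar triangles: an object of real width W that appears w pixels
    wide under focal length f (in pixels) lies at depth Z = f * W / w."""
    if pixel_width <= 0:
        raise ValueError("pixel_width must be positive")
    return focal_px * real_width_m / pixel_width


def back_project(
    u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Invert the pinhole projection at a known depth (camera coordinates)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def is_depth_consistent(depth: float, expected_depth: float, rel_tol: float = 0.2) -> bool:
    """Flag detections whose estimated depth disagrees with the scene prior,
    for example the distance implied by the working layer elevation."""
    return abs(depth - expected_depth) <= rel_tol * expected_depth
```

With a placeholder focal length of 1000 pixels, a helmet assumed to be 0.25 m wide that spans 25 pixels is estimated at a depth of 10 m.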
7. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 6, characterized in that, in the violation determination stage, a spatiotemporal feature coupled multi-target tracking algorithm is adopted. Based on the three-dimensional motion trajectory, wearing status, helmet posture, and fine-grained feature points, a unique fixed ID is assigned to each worker target to achieve continuous and stable tracking and status verification of workers. The target's real-time three-dimensional spatial coordinates and semantic attributes are matched against the three-dimensional hierarchical electronic fence pattern library. Combined with the hierarchical wearing status classification results, a dual-dimensional hierarchical judgment rule for violations is established, which includes spatial risk patterns and wearing compliance patterns. Violations are classified into high, medium, and low levels, forming a dual verification mechanism of feature evidence and spatial evidence that filters out invalid and false warnings and completes the judgment of violations. The tracking algorithm uses a Kalman filter to predict the target's motion trajectory, sets the association threshold based on feature and trajectory similarity to 0.7, and incorporates time accumulation verification into the judgment rule.
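The association threshold and the time accumulation verification in claim 7 can be sketched as below. Only the 0.7 threshold comes from the claim. The equal weighting of feature and trajectory similarity, the requirement of five consecutive frames, and all names are assumptions for illustration, and the Kalman filter prediction step is not implemented here.

```python
ASSOCIATION_THRESHOLD = 0.7  # from claim 7


def association_score(
    feature_sim: float, trajectory_sim: float, feature_weight: float = 0.5
) -> float:
    """Blend the two similarities (both in [0, 1]); the weighting is assumed."""
    return feature_weight * feature_sim + (1.0 - feature_weight) * trajectory_sim


def is_same_target(feature_sim: float, trajectory_sim: float) -> bool:
    """Associate a detection with an existing track when the score passes 0.7."""
    return association_score(feature_sim, trajectory_sim) >= ASSOCIATION_THRESHOLD


class TimeAccumulator:
    """Confirm a violation only after it persists for N consecutive frames,
    so that a single abnormal frame does not raise a warning."""

    def __init__(self, min_consecutive_frames: int = 5) -> None:
        self.min_consecutive_frames = min_consecutive_frames
        self.count = 0

    def update(self, violating: bool) -> bool:
        # Any non-violating frame resets the streak.
        self.count = self.count + 1 if violating else 0
        return self.count >= self.min_consecutive_frames
```

One `TimeAccumulator` would be kept per tracked worker ID, and its `update` method called once per processed frame with the per-frame judgment.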
8. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 7, characterized in that, in the closed-loop handling stage, a corresponding graded handling mechanism is triggered according to the level of the violation. High-level violations immediately trigger on-site audible and visual alarms and push warning information, including personnel ID, three-dimensional spatial location, violation screenshots, and characteristic evidence chains, to management personnel. Medium-level and low-level violations are archived with the full event record and evidence chain, and violation statistical reports and trend analyses are pushed out regularly. All violation samples and all false positives and false negatives confirmed on-site are automatically decomposed into feature patterns, archived into the incremental learning sample library and the false positive and false negative feature pattern library, and fed back to the model optimization stage. The closed-loop calibration mechanism triggers the automatic iterative optimization of the model.
9. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 8, characterized in that the end-to-end fine-grained helmet wearing pattern recognition model completes the features missing in occluded scenes by using pre-learned standard helmet feature templates.
10. The high-altitude work safety helmet wearing recognition and positioning monitoring method based on pattern recognition according to claim 9, characterized in that the incremental learning process freezes the lower-layer weights of the model's backbone network.