Multi-level dynamic fence detection method and system based on multi-modal adaptive fusion
By employing a dynamic fence detection method based on multimodal adaptive fusion and spatiotemporal correlation, the problems of static judgment, weak model generalization ability, and neglect of spatiotemporal context in existing electronic fence systems are solved, enabling accurate risk assessment and flexible early warning in factory safety monitoring.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA NAT BUILDING MATERIALS TECH CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing electronic fence systems in factory safety monitoring suffer from problems such as static judgment, weak model generalization ability, crude judgment logic, and neglect of spatiotemporal context, resulting in alarm redundancy, delayed response, and decreased detection accuracy.
A multi-modal adaptive fusion multi-level dynamic fence detection method is adopted. By simultaneously acquiring visible light video streams and infrared video streams, target detection is performed using a multi-modal adaptive fusion detection network model, and risk assessment is performed by combining a spatiotemporal correlation module, so as to achieve accurate and intelligent risk assessment of personnel behavior.
It enables accurate detection and intelligent risk assessment of human behavior in complex environments, reduces false alarms, and improves detection accuracy and the flexibility and practicality of risk assessment.
Figure CN122244608A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and intelligent security technology, and in particular to a multi-level dynamic fence detection method and system based on multimodal adaptive fusion.
Background Technology
[0002] Electronic fences, as an essential safety hazard detection technology, have shown great application potential in fields such as factory safety and public place monitoring. Today, video-based electronic fence systems have become a key technology for ensuring safe production. However, existing technologies have the following main limitations: 1. Static and single-level judgment: Most systems use fixed, single-level fenced areas, so any entry by personnel triggers an alarm of the same level. Warnings cannot be graded by depth of intrusion, resulting in alarm redundancy or delayed response.
[0003] 2. Weak model generalization ability: Models that rely on pre-training on general datasets (such as the YOLO series) have significantly reduced detection accuracy for specific factory scenarios (such as workers wearing specific work clothes and protective equipment, or nighttime infrared mode), resulting in a high false negative rate.
[0004] 3. Crude judgment logic: Judgment is usually based on the intersection of the bounding box and the fence area. The bounding box contains a lot of background noise and cannot accurately reflect the human posture and the actual space occupied, which can easily lead to false alarms (such as an arm extending to trigger an alarm) or missed alarms (such as a part of the human body entering without triggering).
[0005] 4. Ignoring spatiotemporal context: Traditional methods analyze each frame independently, neglecting the continuity of actions and behavioral intent within the video sequence. For example, lingering near a warning line should be treated differently from rushing quickly into a danger zone, but existing systems struggle to perform such advanced behavioral analysis.
[0006] Therefore, there is an urgent need for a dynamic fence detection method and device that can adapt to complex environments, accurately perceive human body contours, and integrate spatiotemporal information for intelligent risk assessment.
Summary of the Invention
[0007] The purpose of this invention is to provide a multi-level dynamic fence detection method and system based on multimodal adaptive fusion, aiming to solve the above-mentioned problems in the prior art.
[0008] This invention provides a multi-level dynamic fence detection method based on multimodal adaptive fusion, comprising: synchronously acquiring a visible light video stream and an infrared video stream of the monitored area to obtain corresponding visible light and infrared image pairs; inputting each visible light and infrared image pair into a pre-constructed multimodal adaptive fusion detection network model, and outputting the detection results for the target person; and calculating the comprehensive risk assessment level of the target person based on the detection results, judging the comprehensive risk assessment level against a preset threshold, and triggering corresponding early warning measures according to the judgment result.
[0009] This invention provides a multi-level dynamic fence detection system based on multimodal adaptive fusion, comprising: a data module, used to synchronously acquire a visible light video stream and an infrared video stream of the monitored area and obtain corresponding visible light and infrared image pairs; a detection module, used to input each visible light and infrared image pair into a pre-constructed multimodal adaptive fusion detection network model and output the detection results for the target person; and a judgment module, used to calculate the comprehensive risk assessment level of the target person based on the detection results, judge the comprehensive risk assessment level against a preset threshold, and trigger corresponding early warning measures according to the judgment result.
[0010] This invention also provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the steps of the multi-level dynamic fence detection method based on multimodal adaptive fusion described above.
[0011] This invention also provides a computer-readable storage medium storing an information transmission implementation program, which, when executed by a processor, implements the steps of the multi-level dynamic fence detection method based on multimodal adaptive fusion described above.
[0012] The following beneficial effects can be achieved by adopting the embodiments of the present invention: The embodiments of the present invention provide a multi-level dynamic fence detection method based on multimodal adaptive fusion and spatiotemporal correlation. The method uses a novel multimodal feature adaptive fusion network to efficiently process visible light and infrared images under a unified framework, and introduces a spatiotemporal correlation module to analyze the movement trajectory, speed and behavior patterns of personnel, thereby achieving a more accurate and intelligent hierarchical risk assessment of intrusion behavior.
Attached Figure Description
[0013] To more clearly illustrate the technical solutions in one or more embodiments of this specification or in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0014] Figure 1 is a flowchart of a multi-level dynamic fence detection method based on multimodal adaptive fusion according to an embodiment of the present invention; Figure 2 is a schematic diagram of a multi-level dynamic fence detection system based on multimodal adaptive fusion according to an embodiment of the present invention.
Detailed Implementation
[0015] To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions in one or more embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of the embodiments. Based on one or more embodiments of this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of this document.
Method Implementation Examples
[0016] According to embodiments of the present invention, a multi-level dynamic fence detection method based on multimodal adaptive fusion is provided. Figure 1 is a flowchart of this method. As shown in Figure 1, the method specifically includes:
Step S101: Synchronously acquire the visible light video stream and infrared video stream of the monitored area to obtain corresponding visible light and infrared image pairs;
Step S102: Input each visible light and infrared image pair into a pre-constructed multimodal adaptive fusion detection network model, and output the detection result of the target person. The multimodal adaptive fusion detection network model includes a dual-stream feature extraction module, an adaptive feature fusion module, and a multi-task output module connected in sequence, wherein the adaptive feature fusion module includes a learnable adaptive channel attention fusion gate. Specifically:
The dual-stream feature extraction module extracts the first modality feature map and the second modality feature map from the visible light image and the infrared image, respectively;
The first modality feature map and the second modality feature map are input into the adaptive feature fusion module. The adaptive channel attention fusion gate dynamically generates a fusion weight vector based on the modal characteristics of the two feature maps, and channel-weighted feature fusion is then performed on them based on the fusion weight vector to obtain a fused feature map. Specifically: the gate applies global average pooling to the first and second modality feature maps to obtain the first and second channel description vectors; the two channel description vectors are concatenated and mapped through a multilayer perceptron to output the initial weight vector; the initial weight vector is separated into a first weight vector corresponding to the visible light modality and a second weight vector corresponding to the infrared modality, each of which is then normalized; the first modality feature map is channel-weighted using the normalized first weight vector, and the second modality feature map is channel-weighted using the normalized second weight vector; the two weighted feature maps are added together to obtain the fused feature map;
The fused feature map is input to the multi-task output module, which performs target detection, human keypoint estimation, and semantic segmentation in parallel on the fused feature map, and outputs the detection results, including the human detection boxes, human keypoints, and segmentation masks of the target person.
[0017] Step S103: Calculate the comprehensive risk assessment level of the target person based on the detection results, judge it against preset thresholds, and trigger corresponding early warning measures according to the judgment result. Specifically:
The instantaneous risk value of the target person is calculated based on the segmentation mask and the preset risk gradient field, wherein the risk gradient field has its highest risk value in the core danger area and decreases outwards; the instantaneous risk value is obtained by summing the risk values of all pixels covered by the segmentation mask and then normalizing the sum;
Based on the detection results of multiple consecutive frames, trajectory tracking of the target person is performed. The component of the target person's movement velocity in the direction pointing towards the danger zone is calculated from the obtained motion trajectory, and a trend risk value is calculated in conjunction with the distance from the current location to the danger zone. Specifically: for each detected person in the current frame, obtain the human detection box and segmentation mask; match the detected persons in the current frame with existing tracking trajectories, where the matching cost includes at least two of the following metrics: a geometric similarity metric based on the intersection over union (IoU) of human detection boxes, an IoU metric based on segmentation masks, and a risk consistency metric; update the motion state of each successfully matched tracking trajectory using a Kalman filter, where the observed values of the motion state are determined from the center point coordinates of the human detection box and/or the geometric centroid coordinates of the segmentation mask; for each successfully maintained tracking trajectory, record its position sequence across multiple consecutive frames, and calculate its motion speed and direction from the position sequence for use in calculating the trend risk value;
Based on the human keypoints, identify whether the target person's behavior matches a predefined high-risk behavior pattern and, if so, assign a corresponding behavior pattern risk value;
The instantaneous risk value, trend risk value, and behavior pattern risk value are weighted and fused to obtain the comprehensive risk assessment level of the target person;
At least two risk level thresholds are preset, and the risk range is divided by these thresholds into multiple continuous or discontinuous risk level intervals, including a low-risk zone, a medium-risk zone, a high-risk zone, and an emergency risk zone;
The comprehensive risk assessment level is compared with the risk level thresholds to determine the risk level interval to which it belongs;
Early warning measures corresponding to the determined risk level interval are triggered, wherein the early warning measures are progressively escalating response strategies configured for the different risk level intervals.
[0018] The following describes in detail the above-mentioned technical solution of the present invention with reference to the specific circumstances of the multi-level dynamic fence detection method based on multimodal adaptive fusion in the embodiments of the present invention.
[0019] This invention proposes a multi-level dynamic fence detection method based on multimodal adaptive fusion and spatiotemporal correlation, comprising the following steps:
S1: Synchronously acquire the visible light video stream and infrared video stream of the monitored area;
S2: Input the visible light image and the infrared image into the multimodal adaptive fusion detection network; the network dynamically weights and fuses the dual-modal features through an adaptive channel attention fusion gate, and outputs the human detection box, keypoints, and high-precision segmentation mask of the target;
S3: Calculate the instantaneous intrusion risk value of the target based on the segmentation mask and the preset risk gradient field, where the risk gradient field is a two-dimensional scalar field whose value is highest in the preset core danger area and decreases outwards, either continuously or piecewise;
S4: Perform spatiotemporal trajectory association and tracking of the same target across consecutive frames, analyze its motion speed, direction, and behavior pattern, and calculate the trend risk value and behavior pattern risk value;
S5: Obtain the comprehensive risk assessment level of the target by combining the instantaneous risk value, trend risk value, and behavior pattern risk value;
S6: When the comprehensive risk assessment level exceeds the preset threshold, trigger an alarm corresponding to that level.
[0020] Specifically, the technical solutions adopted in the embodiments of the present invention are described as follows:
I. Multimodal Adaptive Fusion Detection Network
1. Dual-stream feature extraction: Construct a dual-branch backbone network, taking visible light and infrared images from a camera as inputs. The two branches share some of the lower-level convolutional layers to extract general features, but use independent paths in the higher-level network to learn modality-specific features (such as color, texture, and thermal radiation contours).
[0021] 2. Adaptive Feature Fusion Module: A learnable adaptive channel attention fusion gate is introduced into the key layers of the network. This module dynamically generates weight vectors based on the quality of the current input image pair (e.g., calculating the sharpness and contrast of the visible light image, and the signal-to-noise ratio of the infrared image), guiding the weighted fusion of feature maps from the two modalities. For example, in well-lit daytime conditions, the weight of visible light features automatically increases; at night or in foggy weather, the weight of infrared features dominates. This method replaces simple image switching or early / late fixed fusion strategies, achieving pixel-level and semantic-level adaptive complementarity.
[0022] 3. Unified Detection and Segmentation Head: An improved, lightweight decoupled head is connected to the fused feature map, simultaneously outputting human detection bounding boxes, key points (such as head, shoulders, and feet), and high-precision human semantic segmentation masks. This integrated design ensures consistency of detection, localization, and segmentation information at the feature level, making it more efficient and accurate than sequentially performing detection and segmentation.
[0023] II. Multi-level dynamic risk assessment based on spatiotemporal correlation
1. Spatiotemporal trajectory modeling: The system maintains a list of personnel tracks. Each person detected in a frame is associated with a track using the improved DeepSORT algorithm, and the track records not only the person's position (based on the centroid of the segmentation mask) but also their movement speed, direction, and behavioral segments (such as standing, walking, running, climbing) within the past N frames.
[0024] 2. Dynamic Fence and Risk Field Modeling: The preset static polygonal fence is expanded into a continuous risk gradient field. The core hazardous area (such as an operating machine tool) has the highest risk value (e.g., 100), and the risk value decreases gradually towards the periphery. The precise segmentation mask of personnel is mapped onto this risk field.
[0025] 3. Correlated risk assessment engine:
Instantaneous risk: Calculate the sum of the risk values of all pixels covered by the personnel segmentation mask, and normalize it to obtain the instantaneous intrusion risk value of the current frame.
[0026] Trend Risk: Analyze the individual's movement trajectory and speed over the past few seconds. If their movement direction is pointing towards a high-risk area and their speed is increasing, a trend risk gain is triggered; if they are leaving or lingering, the risk value will be suppressed.
[0027] Behavioral pattern risk: Combine posture estimation (from key points) to identify specific high-risk behavioral patterns, such as reaching into a restricted area or bending over to approach dangerous equipment, triggering specific risk amplification coefficients.
[0028] 4. Tiered Alarm Decision-Making: The system comprehensively assesses instantaneous risk, trend risk, and behavioral pattern risk through a configurable decision function, outputting a final overall risk level (e.g., low, medium, high, emergency). An alarm of the corresponding level is triggered only when the overall risk exceeds a preset threshold. Simultaneously, the system can predict risk trends; for example, if personnel are rushing towards a danger zone at high speed, a high-level warning can be triggered even before they enter the highest-risk zone.
[0029] In this embodiment of the invention, the software system of the above method is deployed on a general-purpose server.
[0030] 1. System hardware and software environment
Computing Unit: Servers equipped with NVIDIA RTX A6000 or higher-level GPUs for running deep learning models.
[0031] Acquisition Unit: Dual-spectrum network cameras that support simultaneous output of visible light (RGB) and thermal infrared (IR) video streams, such as the Hikvision DS-2TD series.
[0032] Software stack: Ubuntu 20.04 LTS operating system, CUDA 12.1, cuDNN 8.9, deep learning framework PyTorch 2.0+, and multimedia processing libraries such as OpenCV and FFmpeg.
[0033] 2. Data Preparation and Model Training
A. Dataset Construction: In the target factory environment (such as a chemical plant or machining workshop), collect visible light and infrared image pairs covering daytime, nighttime, different weather conditions, and different workstation scenarios. Annotate in detail the workers in the images, who wear various work clothes (anti-static clothing, acid-resistant clothing) and protective equipment (safety helmets, face shields). Annotation information includes:
Bounding Box: used for target detection;
Keypoints: at least 14 keypoints, such as the top of the head, both shoulders, both knees, and both feet, used for pose estimation;
Segmentation Mask: a pixel-level human body segmentation mask.
[0034] B. Training of the multimodal adaptive fusion detection network:
a. Backbone network: An improved CSPDarknet is used as the shared low-level feature extractor for both streams, with two independent branches, $B_{rgb}$ and $B_{ir}$, branching off from the middle layer.
[0035] b. Adaptive channel attention fusion gate design: The gate is located after the 3rd and 4th CSP modules in the network. The structure of the gate unit G is as follows:
Input: feature map $F_{rgb}$ from $B_{rgb}$ and feature map $F_{ir}$ from $B_{ir}$.
[0036] Operation: Perform global average pooling (GAP) on the two feature maps respectively to obtain two channel description vectors $v_{rgb}, v_{ir} \in \mathbb{R}^{C}$.
[0037] Concatenation and mapping: The two vectors are concatenated into $v = [v_{rgb}; v_{ir}] \in \mathbb{R}^{2C}$, which is passed through a small MLP containing two fully connected layers (with ReLU activation in between) to output a 2C-dimensional weight vector $w$.
[0038] Weight separation and activation: The first C dimensions and the last C dimensions of $w$ are separated and normalized to the (0, 1) interval using the Sigmoid function $\sigma$ to obtain the RGB weight vector $w_{rgb}$ and the IR weight vector $w_{ir}$.
[0039] Fusion output: The fused feature map is $F_{fused} = w_{rgb} \otimes F_{rgb} + w_{ir} \otimes F_{ir}$, where $\otimes$ denotes channel-wise multiplication (with broadcasting).
[0040] The specific formulas can be expressed as:
$v_{rgb} = \mathrm{GAP}(F_{rgb})$, $v_{ir} = \mathrm{GAP}(F_{ir})$;
$w = \mathrm{MLP}([v_{rgb}; v_{ir}])$;
$w_{rgb} = \sigma(w_{1:C})$, $w_{ir} = \sigma(w_{C+1:2C})$;
$F_{fused} = w_{rgb} \otimes F_{rgb} + w_{ir} \otimes F_{ir}$;
c. Decoupled head: After the fused feature $F_{fused}$, three independent lightweight convolutional subnetworks are attached:
Detection head: predicts bounding box coordinates, confidence, and category;
Keypoint head: predicts heatmaps of the 14 human keypoints;
Segmentation head: predicts the human semantic segmentation mask.
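The gate described above (global average pooling, a two-layer MLP with ReLU, Sigmoid activation, then a channel-wise weighted sum) can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names, the nested-list tensor layout (C channels of H x W values), and the MLP parameters passed in as arguments are all assumptions made for illustration.

```python
import math

def global_avg_pool(fmap):
    """fmap: list of C channels, each an H x W nested list. Returns C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def linear(x, weight, bias):
    """Fully connected layer; weight has shape out_dim x in_dim."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fusion_gate(f_rgb, f_ir, w1, b1, w2, b2):
    """Fuse two C x H x W feature maps with per-channel learned weights.

    w1/b1 and w2/b2 are the (hypothetical) parameters of the two FC layers:
    w1 maps 2C -> hidden, w2 maps hidden -> 2C.
    """
    c = len(f_rgb)
    v = global_avg_pool(f_rgb) + global_avg_pool(f_ir)    # concatenation, length 2C
    hidden = [max(0.0, h) for h in linear(v, w1, b1)]      # FC + ReLU
    w = linear(hidden, w2, b2)                             # FC, 2C raw weights
    w_rgb = [sigmoid(z) for z in w[:c]]                    # first C dims -> RGB weights
    w_ir = [sigmoid(z) for z in w[c:]]                     # last C dims -> IR weights
    fused = [
        [[w_rgb[k] * a + w_ir[k] * b for a, b in zip(row_rgb, row_ir)]
         for row_rgb, row_ir in zip(f_rgb[k], f_ir[k])]
        for k in range(c)
    ]
    return fused, w_rgb, w_ir
```

With all MLP parameters set to zero, both weight vectors equal 0.5 and the gate reduces to a plain average of the two modalities; training would move the weights away from this neutral point. In a PyTorch deployment the same steps would normally be expressed with tensor operations rather than nested lists.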
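[0041] d. Loss function: The total loss is $L_{total} = L_{det} + \lambda_1 L_{kpt} + \lambda_2 L_{seg}$, where $L_{det}$ is YOLO's CIoU loss plus the classification focal loss, $L_{kpt}$ is the mean squared error loss of the keypoint heatmaps, $L_{seg}$ is the Dice loss for the segmentation task, and $\lambda_1$ and $\lambda_2$ are balancing hyperparameters, which can be set to 0.5 and 1.0.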
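As a rough illustration of how the loss terms combine, the sketch below computes a Dice loss and a mean squared error on flat lists and adds them to a detection loss with the two balancing weights. The function names are hypothetical, the detection loss is taken as a precomputed number, and the assignment of 0.5 to the keypoint term and 1.0 to the segmentation term is an assumption about which value belongs to which hyperparameter.

```python
def dice_loss(pred, target, eps=1e-6):
    """pred: flat list of probabilities in [0, 1]; target: flat list of 0/1 labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def mse_loss(pred, target):
    """Mean squared error between two flat lists (e.g., flattened keypoint heatmaps)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(l_det, l_kpt, l_seg, lambda_kpt=0.5, lambda_seg=1.0):
    """Detection loss plus weighted keypoint and segmentation losses.

    Assumption: 0.5 weights the keypoint term and 1.0 the segmentation term.
    """
    return l_det + lambda_kpt * l_kpt + lambda_seg * l_seg
```

A perfect mask prediction yields a Dice loss of 0, so only the detection and keypoint terms would contribute in that case.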
[0042] e. Training process: Using the AdamW optimizer with an initial learning rate of 1e-3, end-to-end training was performed for 150 epochs on a self-built dataset, and data augmentation (blending, flipping, color jittering, etc.) was applied.
[0043] 3. Spatiotemporal correlation and risk assessment
Trajectory tracking: An improved DeepSORT algorithm is employed. The geometric centroid of the segmentation mask is used as the observation location, and the mask IoU is used as a supplementary measure of appearance similarity. For each tracked target $T_j$, the system maintains a state vector (containing position $(x, y)$, velocity $(v_x, v_y)$, and other parameters), the position sequence of the past K frames (e.g., K = 30, corresponding to 1 second), and a behavior code.
[0044] The improvements to the DeepSORT algorithm are as follows: A. Introduce segmentation mask IoU as a core correlation metric. In factory environments, personnel wear similar clothing (uniform work clothes) and there may be occlusion, resulting in insufficient distinguishability of appearance features relied upon by traditional DeepSORT, which easily leads to ID switching. Therefore, in addition to using the Intersection over Union (Bbox IoU) of the detection box and the appearance cosine distance, this embodiment of the invention adds a more accurate geometric association metric—the segmentation mask IoU.
[0045] Calculation method: For the target mask $M_{det}$ detected in the current frame and the predicted target mask $M_{trk}$ in the tracking list, calculate their intersection over union: $IoU_{mask} = \frac{|M_{det} \cap M_{trk}|}{|M_{det} \cup M_{trk}|}$;
B. Motion model based on mask centroid
Traditional DeepSORT uses the center point of the detection box for Kalman filter motion prediction and update. However, the center of the detection box is strongly affected by target pose and occlusion and is therefore unstable. This embodiment of the invention instead uses the geometric centroid of the segmentation mask as the motion state observation.
[0046] Centroid calculation: For a segmentation mask M, its centroid $(c_x, c_y)$ is calculated as follows: $c_x = \frac{1}{A} \sum_{(i,j) \in M} j$, $c_y = \frac{1}{A} \sum_{(i,j) \in M} i$; where A is the area of the mask (total number of pixels), $(i, j) \in M$ indicates that the pixel belongs to the segmentation mask M (i.e., the pixel is identified as "human body"), i is the row index of the pixel in the matrix, and j is the column index.
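Both quantities above, the mask intersection over union and the mask centroid, are easy to compute directly from binary masks. The sketch below uses nested lists of 0/1 values; the function names are hypothetical, and a production system would more likely use array operations.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0

def mask_centroid(mask):
    """Return (x, y) = (mean column index, mean row index) of the mask pixels."""
    area = sum_i = sum_j = 0
    for i, row in enumerate(mask):
        for j, val in enumerate(row):
            if val:
                area += 1
                sum_i += i
                sum_j += j
    if area == 0:
        return None  # empty mask: no centroid defined
    return (sum_j / area, sum_i / area)
```

For example, two 2 x 2 blobs that overlap in one column share 2 pixels out of a union of 6, giving a mask IoU of 1/3.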
[0047] C. Association matching cost incorporating risk state
Traditional association only considers geometry and appearance, ignoring the behavioral semantics of the target. For example, in a fence intrusion scenario, a person leaving a low-risk area and another person entering a high-risk area should be treated differently, even if their appearance and positions are similar. Therefore, this embodiment of the invention incorporates the instantaneous risk state of the target into the cost matrix of association matching.
[0048] Risk state vector: For each tracked target $T_j$, maintain a short risk history vector $h_j$ (for example, the instantaneous risk values of the past two frames).
[0049] Risk consistency cost: When calculating the association cost between the current detection $D_i$ and the tracked target $T_j$, a risk consistency cost $d_{risk}$ is added to the appearance distance $d_{app}$ and the motion distance $d_{motion}$ (Mahalanobis distance). $d_{risk}$ is computed from $r_i$, the instantaneous risk of the current detection, and $\bar{r}_j$, the expected value (mean) of the target's risk history, with $\lambda_{risk}$ as the weighting coefficient of the risk cost.
[0050] Total association cost: The total cost combines the appearance distance, the motion distance, and the weighted risk consistency cost. Association matching is accomplished by minimizing the total cost using the Hungarian algorithm.
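The following sketch shows one way such a risk-aware association could be wired together. Several parts are assumptions rather than the patent's formulas: the risk consistency cost is taken as the absolute difference between the detection's instantaneous risk and the mean of the track's risk history, the total cost is a plain sum with a single weight `lam_risk`, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same optimum on small matrices but does not scale).

```python
from itertools import permutations

def risk_cost(det_risk, track_history):
    """Assumed form: |instantaneous risk - mean of the track's risk history|."""
    mean_hist = sum(track_history) / len(track_history)
    return abs(det_risk - mean_hist)

def total_cost(d_app, d_motion, d_risk, lam_risk=0.5):
    """Assumed combination: appearance + motion + weighted risk consistency."""
    return d_app + d_motion + lam_risk * d_risk

def match(cost):
    """Minimum-cost assignment of detections (rows) to tracks (columns).

    Brute force over permutations; requires len(cost) <= len(cost[0]).
    """
    n_det, n_trk = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_trk), n_det):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(enumerate(best)), best_total
```

For realistic numbers of detections and tracks, an optimized solver such as `scipy.optimize.linear_sum_assignment` would typically replace the brute-force `match` function.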
[0051] Risk gradient field modeling: In the system configuration interface, the administrator can draw the core hazardous area H (such as a rotating machine tool) on a map or video frame. The system automatically generates a risk field, for example using a combination of a two-dimensional Gaussian kernel function and a distance decay function.
A. Core and adjacent areas (close to the center): $R(p) = R_{max} \exp\left(-\frac{d(p)^2}{2\sigma^2}\right)$; this formula describes a Gaussian risk field centered on the hazard source, in which the risk value starts from the maximum value $R_{max}$ at the center and decays smoothly outwards.
[0052] B. Outer area (when the distance exceeds the limit $d_0$): the risk value decays according to a power law of the distance, with decay exponent $\alpha$.
Wherein: $R(p)$ is the risk value at point $p$; $R_{max}$ is the highest risk value (e.g., 100); $\sigma$ is the standard deviation of the Gaussian kernel, which controls the diffusion range of the high-risk area: the larger $\sigma$ is, the wider the high-risk zone and the flatter the attenuation; the smaller $\sigma$ is, the more concentrated the high-risk area and the more rapid the decline; $d(p)$ is the shortest distance from point $p$ to the boundary of the core danger zone; $d_0$ is the reference distance or attenuation start distance, i.e., the distance at which the peripheral formula takes its reference value and from which the power-law decay begins to be applied, and it is usually matched to the effective range of the Gaussian field; $\alpha$ is the decay exponent, which controls the rate at which the peripheral risk decays: the larger $\alpha$ is, the faster the risk decreases with increasing distance; the smaller $\alpha$ is, the wider the scope of the risk's impact and the slower the decay.
[0053] In practical systems, the two formulas mentioned above are not used simultaneously; one is selected according to the position of the current calculation point $p$: a. Determine distance: calculate the distance $D$ from point $p$ to the hazard center; b. Select formula: if $D \le d_0$ (for example, $d_0 = 3\sigma$), the Gaussian formula is used, which covers the core and adjacent areas and gives smooth risk decay.
[0054] If $D > d_0$, the power-law formula is used, which applies to the more distant peripheral areas and ensures that the risk does not drop to zero while its influence gradually diminishes.
[0055] Preferably, a smooth transition can also be selected: near the switching point $d_0$, the results of the two formulas can be weighted and averaged to achieve a smooth and continuous transition of risk values and avoid abrupt changes.
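The piecewise risk field with an optional blended transition can be sketched as below. The exact power-law expression is not recoverable from the text, so this sketch assumes the form `r_ref * (d0 / d) ** alpha`, with `r_ref` chosen as the Gaussian value at `d0` so that the two pieces agree at the switching point; `d0 = 3 * sigma`, the linear blend, and all parameter names are likewise illustrative assumptions.

```python
import math

def gaussian_risk(d, r_max=100.0, sigma=1.0):
    """Gaussian decay from r_max at distance 0."""
    return r_max * math.exp(-d * d / (2.0 * sigma * sigma))

def power_law_risk(d, r_ref, d0, alpha=2.0):
    """Assumed power-law form: equals r_ref at d = d0, never reaches zero."""
    return r_ref * (d0 / d) ** alpha

def risk_at(d, r_max=100.0, sigma=1.0, alpha=2.0, blend_width=0.0):
    """Risk at distance d; blend_width (assumed < d0) widens the transition band."""
    d0 = 3.0 * sigma                              # assumed switching distance
    r_ref = gaussian_risk(d0, r_max, sigma)       # assumption: pieces agree at d0
    if d <= d0 - blend_width:
        return gaussian_risk(d, r_max, sigma)
    if d >= d0 + blend_width:
        return power_law_risk(d, r_ref, d0, alpha)
    # Inside the band: linear weighted average of the two formulas.
    t = (d - (d0 - blend_width)) / (2.0 * blend_width)
    return (1.0 - t) * gaussian_risk(d, r_max, sigma) + t * power_law_risk(d, r_ref, d0, alpha)
```

With these assumptions the field is continuous at `d0` even without blending, and doubling the distance beyond `d0` divides the risk by `2 ** alpha`.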
[0056] Risk assessment calculation:
A. Instantaneous risk $R_{inst}$: For the target $T_j$ with segmentation mask $M$ in the current frame, calculate the sum of the risk values $R(p)$ of all pixels it covers and divide by the mask area for normalization: $R_{inst} = \frac{1}{|M|} \sum_{p \in M} R(p)$.
B. Trend risk $R_{trend}$: Calculate the target's movement trend toward the danger zone H over the past K frames: $v_{proj} = \vec{v} \cdot \vec{u}$; $R_{trend} = k \cdot \frac{v_{proj}}{d + \epsilon}$. In the formula, $v_{proj}$ is the scalar projection of the velocity onto the danger direction; its physical meaning is: a positive value indicates that the velocity component points towards the hazard center (approaching danger); zero indicates that the velocity is perpendicular to the line to the hazard center or that the target is stationary (lingering); a negative value indicates that the velocity component points away from the hazard center (moving away from danger). $d$ is the shortest Euclidean distance from the target's current location to the boundary (not the center) of the preset danger zone H. $\epsilon$ is a distance smoothing factor, a small positive constant (e.g., 1.0) that prevents the formula value from becoming too large and numerically unstable when $d$ is very small. $k$ is the gain coefficient. $\vec{v}$ is the instantaneous velocity vector of the target in the current frame. $\vec{u}$ is the unit direction vector pointing from the target's current location to the center of the core hazard area.
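A small sketch of the two calculations in this paragraph follows. The instantaneous risk is the mean risk-field value over the mask pixels. For the trend risk, the sketch assumes the gain-scaled velocity projection is divided by the boundary distance plus the smoothing constant, which matches the described roles of the gain coefficient and smoothing factor but is a reconstruction, not a confirmed formula. Function and parameter names are hypothetical.

```python
import math

def instantaneous_risk(mask, risk_field):
    """Mean risk-field value over the pixels where mask is non-zero."""
    total = 0.0
    area = 0
    for mask_row, risk_row in zip(mask, risk_field):
        for m, r in zip(mask_row, risk_row):
            if m:
                total += r
                area += 1
    return total / area if area else 0.0

def trend_risk(position, velocity, hazard_center, boundary_distance, k=1.0, eps=1.0):
    """Assumed form: k * (velocity projected on the hazard direction) / (d + eps)."""
    dx = hazard_center[0] - position[0]
    dy = hazard_center[1] - position[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0  # already at the hazard center: direction undefined
    ux, uy = dx / norm, dy / norm                  # unit vector towards the hazard
    v_proj = velocity[0] * ux + velocity[1] * uy   # >0 approaching, <0 leaving
    return k * v_proj / (boundary_distance + eps)
```

A target moving at 2 units per frame straight towards a hazard whose boundary is 4 units away gets a trend risk of 0.4 with the default parameters; the same speed in the opposite direction gives -0.4, which lowers the combined risk.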
[0057] C. Behavioral pattern risk: Skeletal vectors are calculated from the 14 keypoints output by the keypoint head and compared against a predefined library of high-risk behavior templates: Climbing: Detect the angle between the torso and the horizontal plane, combined with hand keypoints that are higher than the shoulders.
[0058] Reaching in: Detects whether the arm bone vector extends into the boundary of the high-risk area.
[0059] Quick squat / bend: Combine speed at key points with changes in torso proportions.
[0060] When a matching behavior is detected, the behavioral pattern risk value is assigned a fixed risk premium B; otherwise it is 0.
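As one possible illustration of the "reaching in" template, the sketch below tests whether a wrist keypoint lies inside the hazard polygon using a standard ray-casting point-in-polygon test and, if so, returns a fixed premium. This simplifies the arm-vector check to its endpoint; the keypoint names `left_wrist` and `right_wrist` and the premium value of 20 are hypothetical.

```python
def point_in_polygon(pt, polygon):
    """Ray casting: count crossings of a horizontal ray from pt with polygon edges."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def behavior_risk(keypoints, hazard_polygon, premium=20.0):
    """Return the fixed premium if either wrist lies inside the hazard polygon, else 0."""
    reaching = any(
        point_in_polygon(keypoints[name], hazard_polygon)
        for name in ("left_wrist", "right_wrist")       # hypothetical keypoint names
        if name in keypoints
    )
    return premium if reaching else 0.0
```

Other templates such as climbing or quick squatting would need their own rules over torso angle and keypoint velocity, which are not sketched here.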
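[0061] Comprehensive Risk Decision: The final risk value is the weighted sum of the instantaneous risk, the trend risk, and the behavioral pattern risk, where the three weights are configurable coefficients (e.g., 0.7, 0.2, 0.1).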
[0062] Tiered alarms: Set different risk thresholds for the final risk value. When the value enters a given threshold interval, an alarm of the corresponding level (low, medium, high, emergency) is triggered. Preferably, for the emergency level, in addition to local audible and visual alarms and push notifications, the system can be linked to equipment to perform emergency shutdown and other operations.
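The weighted decision and the tiered mapping can be sketched as follows. The default weights are the example values from the text; the threshold values and the extra "none" level for sub-threshold risk are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# "none" is an assumed label for risk below the lowest threshold.
LEVELS = ("none", "low", "medium", "high", "emergency")


def final_risk(
    r_inst: float,
    r_trend: float,
    r_behavior: float,
    weights: Tuple[float, float, float] = (0.7, 0.2, 0.1),
) -> float:
    """Weighted sum of the three risk terms."""
    w_inst, w_trend, w_behavior = weights
    return w_inst * r_inst + w_trend * r_trend + w_behavior * r_behavior


def alarm_level(
    risk: float,
    thresholds: Sequence[float] = (0.2, 0.4, 0.6, 0.8),  # hypothetical values
) -> str:
    """Map a risk value to a level; reaching a threshold counts as entering it."""
    return LEVELS[bisect_right(thresholds, risk)]
```

Keeping the thresholds in a sorted sequence means levels can be retuned without touching the lookup logic.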
[0063] 4. System Workflow: The system initializes, loads the trained model, and configures the risk gradient field and alarm parameters.
[0064] A dual-spectral camera continuously inputs (RGB, IR) image pairs.
[0065] For each pair of image frames, perform steps S2 to S5.
[0066] If the final risk value reaches a risk threshold, an alarm process is triggered, and the alarm event (including time, location, risk level, snapshot, and short video clip) is recorded to the database.
[0067] Managers can view real-time video, risk heatmaps, and alarm lists, and review historical events, via a web client.
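The per-frame workflow can be outlined as a simple loop. This is only a sketch: `AlarmEvent`, `run_monitoring`, and the `assess` callable (which stands in for the detection and risk assessment steps) are hypothetical names, and a plain list stands in for the database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Iterable, List, Optional, Tuple

Location = Tuple[float, float]


@dataclass
class AlarmEvent:
    """Fields mirror the alarm record listed in the text."""
    time: datetime
    location: Location
    risk_level: str
    snapshot_path: Optional[str] = None
    clip_path: Optional[str] = None


def run_monitoring(
    frame_pairs: Iterable[Tuple[Any, Any]],
    assess: Callable[[Any, Any], List[Tuple[Location, str]]],
) -> List[AlarmEvent]:
    """Process (RGB, IR) pairs and collect an event for every alarm.

    assess(rgb, ir) is assumed to return one (location, level) tuple per
    detected person, with level "none" when no threshold is reached.
    """
    events: List[AlarmEvent] = []
    for rgb, ir in frame_pairs:
        for location, level in assess(rgb, ir):
            if level != "none":
                events.append(
                    AlarmEvent(datetime.now(timezone.utc), location, level)
                )
    return events
```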
[0068] Preferably, for edge devices with limited computing resources (such as factory network edge servers and smart cameras), this embodiment of the invention also provides a lightweight implementation scheme.
[0069] 1. Lightweight Model: Replace the backbone network with MobileNetV3 or EfficientNet-Lite; keep the adaptive fusion gate and decoupling head structure unchanged, but compress the number of channels; use knowledge distillation or neural network architecture search techniques to reduce the number of model parameters and computational cost while maintaining accuracy.
[0070] 2. Simplified risk assessment: At the edge, only a simplified calculation is required (e.g., judging the movement trend solely by velocity magnitude and direction quadrant); complex behavior identification and risk field overlay calculations can be performed in the cloud. Edge devices upload detection results, masks, and trajectory data to the cloud, where the final comprehensive risk assessment and decision-making are completed, and the cloud then sends instructions to the edge devices to execute alarms. This cloud-edge collaborative model balances real-time performance against computational complexity.
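One way to judge the trend "solely by velocity magnitude and direction quadrant" is sketched below. The speed threshold, the 45-degree quadrant boundaries, and the three output values are assumptions chosen for illustration.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def coarse_trend(velocity: Vec, to_hazard: Vec, speed_threshold: float = 0.5) -> float:
    """Quadrant-based trend estimate for resource-limited edge devices.

    Returns 1.0 when moving toward the hazard, -1.0 when moving away,
    and 0.0 when slow (lingering) or moving sideways.
    """
    vx, vy = velocity
    hx, hy = to_hazard
    if math.hypot(vx, vy) < speed_threshold or math.hypot(hx, hy) == 0.0:
        return 0.0
    # Signed angle between the velocity and the direction to the hazard.
    dot = vx * hx + vy * hy
    cross = vx * hy - vy * hx
    angle = abs(math.degrees(math.atan2(cross, dot)))
    if angle <= 45.0:
        return 1.0   # heading into the quadrant that faces the hazard
    if angle >= 135.0:
        return -1.0  # heading into the opposite quadrant
    return 0.0       # lateral movement
```

This avoids the per-frame division and projection averaging of the full trend term, at the cost of a much coarser signal.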
[0071] System Implementation Examples: According to embodiments of the present invention, a multi-level dynamic fence detection system based on multimodal adaptive fusion is provided. Figure 2 is a schematic diagram of this system. As shown in Figure 2, the system specifically includes: a data module 20, used to synchronously acquire visible light video streams and infrared video streams in the monitored area and obtain corresponding visible light image and infrared image pairs; a detection module 22, used to input the visible light image and infrared image pairs into a pre-constructed multimodal adaptive fusion detection network model and output the detection results of the target person; and a judgment module 24, used to calculate the comprehensive risk assessment level of the target person based on the detection results, judge the comprehensive risk assessment level against a preset threshold, and trigger corresponding early warning measures according to the judgment result.
[0072] Preferably, the system can also provide a graphical management interface, the main functions of which include: 1. Fence and Risk Field Drawing: Administrators can directly draw polygonal, circular, or irregularly shaped areas on the video screen, and draw risk contour lines or set center and attenuation parameters for each area, visualizing the risk gradient field in real time.
[0073] 2. Model Management: Supports uploading new labeled data, starting incremental training, and updating online models (supports A / B testing).
[0074] 3. Alarm rule configuration: Flexible adjustment of risk thresholds at all levels, risk assessment weights, and behavioral pattern rule base.
[0075] 4. Alarm Handling and Reporting: Displays real-time alarms, supports marking false alarms and confirming their handling, and generates various statistical reports (such as alarm frequency distribution, high-risk period analysis, etc.).
[0076] The embodiments of the present invention are system embodiments corresponding to the above method embodiments. The specific operation of each module can be understood by referring to the description of the method embodiments, and will not be repeated here.
[0077] In summary, compared with the prior art, the beneficial effects of the embodiments of the present invention include: 1. Significantly improved detection accuracy: The multimodal adaptive fusion network can make full use of effective information under different imaging conditions, significantly improving the detection and segmentation accuracy in complex lighting, harsh weather, and special clothing scenarios.
[0078] 2. Intelligent risk assessment: By introducing spatiotemporal correlation analysis, the system is upgraded from static-image intrusion judgment to dynamic behavior risk assessment, which greatly reduces false alarms caused by momentary postures and spurious detection boxes, and can provide early warning of potential dangers.
[0079] 3. Enhanced flexibility and practicality: The risk gradient field model is more flexible and easier to configure and adjust than a fixed multi-level fence.
[0080] Device Example 1: This invention provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it performs the steps described in the method embodiment.
[0081] Device Example 2: This invention provides a computer-readable storage medium storing an information transmission implementation program, which, when executed by a processor, performs the steps described in the method embodiment.
[0082] The computer-readable storage media described in this embodiment include, but are not limited to, ROM, RAM, disk, or optical disk.
[0083] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A multi-level dynamic fence detection method based on multimodal adaptive fusion, characterized in that it includes: simultaneously acquiring visible light video streams and infrared video streams from the monitored area to obtain corresponding visible light image and infrared image pairs; inputting the visible light image and infrared image pairs into a pre-constructed multimodal adaptive fusion detection network model, and outputting the detection results of the target person; and calculating the comprehensive risk assessment level of the target person based on the detection results, judging the comprehensive risk assessment level against a preset threshold, and triggering corresponding early warning measures based on the judgment result.
2. The method according to claim 1, characterized in that, The multimodal adaptive fusion detection network model includes a dual-stream feature extraction module, an adaptive feature fusion module, and a multi-task output module connected in sequence. The adaptive feature fusion module includes a learnable adaptive channel attention fusion gate.
3. The method according to claim 2, characterized in that, The visible light image and infrared image are input into a pre-constructed multimodal adaptive fusion detection network model, and the output detection results of the target person specifically include: The dual-stream feature extraction module extracts the first modality feature map and the second modality feature map from the visible light image and the infrared image, respectively. The first modality feature map and the second modality feature map are input into the adaptive feature fusion module. The adaptive channel attention fusion gate dynamically generates a fusion weight vector based on the modal characteristics of the first modality feature map and the second modality feature map. Based on the fusion weight vector, channel-weighted feature fusion is performed on the first modality feature map and the second modality feature map to obtain a fused feature map. The fused feature map is input to the multi-task output module, which performs target detection, human key point estimation, and semantic segmentation tasks in parallel based on the fused feature map, and outputs the detection results including human detection boxes, human key points, and segmentation masks of the target person.
4. The method according to claim 3, characterized in that, The adaptive channel attention fusion gate dynamically generates a fusion weight vector based on the modal characteristics of the first and second modal feature maps, and performs channel-weighted feature fusion on the first and second modal feature maps based on the fusion weight vector to obtain the fused feature map, specifically including: The first modality feature map and the second modality feature map are subjected to global average pooling through the adaptive channel attention fusion gate to obtain the first channel description vector and the second channel description vector. The first channel description vector and the second channel description vector are concatenated and mapped through a multilayer perceptron to output the initial weight vector; The initial weight vector is separated into a first weight vector corresponding to the visible light mode and a second weight vector corresponding to the infrared mode, and then normalized respectively. The first modality feature map is channel-weighted using the normalized first weight vector, and the second modality feature map is channel-weighted using the normalized second weight vector. The two weighted feature maps are added together to obtain the fused feature map.
5. The method according to claim 3, characterized in that, The calculation of the target person's comprehensive risk assessment level based on the detection results specifically includes: The instantaneous risk value of the target person is calculated based on the segmentation mask and the preset risk gradient field; wherein, the risk gradient field has the highest risk value in the core danger area and decreases outwards; the instantaneous risk value is obtained by summing the risk values corresponding to all pixels covered by the segmentation mask and then normalizing the sum; Based on the detection results of multiple consecutive frames, the trajectory of the target person is tracked. The component of the target person's movement speed in the direction pointing to the danger zone is calculated according to the obtained movement trajectory. The trend risk value is calculated in combination with the distance from the current position to the danger zone. Based on the human body key points, it is identified whether the target person's behavior matches a predefined high-risk behavior pattern. If it matches, a corresponding behavior pattern risk value is assigned. The instantaneous risk value, trend risk value, and behavioral pattern risk value are weighted and integrated to obtain the comprehensive risk assessment level of the target person.
6. The method according to claim 5, characterized in that, Tracking the trajectory of a target person based on the detection results of multiple consecutive frames specifically includes: For each detected person in the current frame, obtain their human body detection bounding box and segmentation mask; The detected persons in the current frame are matched with existing tracking trajectories; the cost calculation of the matching includes at least two of the following metrics: a geometric similarity metric based on the intersection-over-union (IoU) of human detection boxes, an IoU metric based on segmentation masks, and a risk consistency metric. The motion state of each successfully matched tracking trajectory is updated using a Kalman filter; wherein the observed values of the motion state are determined based on the coordinates of the center point of the human detection box and / or the geometric centroid coordinates of the segmentation mask. For each successfully maintained tracking trajectory, record its position sequence across multiple consecutive frames, and calculate its motion speed and direction based on the position sequence for use in calculating the trend risk value.
7. The method according to claim 1, characterized in that, The comprehensive risk assessment level is determined based on a preset threshold, and corresponding early warning measures are triggered based on the determination result, specifically including: At least two risk level thresholds are preset, and the risk range is divided into multiple continuous or discontinuous risk level intervals, including low-risk zone, medium-risk zone, high-risk zone and emergency risk zone, based on the risk level thresholds. The comprehensive risk assessment level is compared with the risk level threshold to determine the risk level range to which it belongs; Trigger early warning measures corresponding to the determined risk level range; wherein the early warning measures are progressively enhanced early warning response strategies configured for different risk level ranges.
8. A multi-level dynamic fence detection system based on multimodal adaptive fusion, characterized in that it includes: a data module, used to synchronously acquire visible light video streams and infrared video streams in the monitored area and obtain corresponding visible light image and infrared image pairs; a detection module, used to input the visible light image and infrared image pairs into a pre-constructed multimodal adaptive fusion detection network model and output the detection results of the target person; and a judgment module, used to calculate the comprehensive risk assessment level of the target person based on the detection results, judge the comprehensive risk assessment level against a preset threshold, and trigger corresponding early warning measures according to the judgment result.
9. An electronic device, characterized in that it includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the multi-level dynamic fence detection method based on multimodal adaptive fusion as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores an implementation program for information transmission, which, when executed by a processor, implements the steps of the multi-level dynamic fence detection method based on multimodal adaptive fusion as described in any one of claims 1-7.
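As an illustration of the fusion gate recited in claim 4, the steps (global average pooling, concatenation, multilayer perceptron, split, normalization, channel-weighted sum) can be sketched with NumPy. The two-layer MLP with ReLU, the use of a sigmoid as the per-modality normalization, and all names are assumptions; the claim does not fix these details.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def fusion_gate(f_rgb, f_ir, w1, b1, w2, b2):
    """Fuse two (C, H, W) feature maps with channel attention weights.

    w1: (hidden, 2C), b1: (hidden,), w2: (2C, hidden), b2: (2C,).
    """
    c = f_rgb.shape[0]
    # Global average pooling gives one descriptor value per channel.
    d_rgb = f_rgb.mean(axis=(1, 2))
    d_ir = f_ir.mean(axis=(1, 2))
    # Concatenate the descriptors and map them through the MLP.
    z = np.concatenate([d_rgb, d_ir])
    hidden = np.maximum(0.0, w1 @ z + b1)  # ReLU (assumed activation)
    raw = w2 @ hidden + b2                 # initial weight vector, length 2C
    # Split per modality and normalize each part (sigmoid is an assumption).
    a_rgb, a_ir = sigmoid(raw[:c]), sigmoid(raw[c:])
    # Channel-wise weighting, then element-wise addition.
    return a_rgb[:, None, None] * f_rgb + a_ir[:, None, None] * f_ir


# Usage with random, untrained parameters (illustration only).
rng = np.random.default_rng(0)
C, H, W, HIDDEN = 4, 3, 5, 8
rgb_feat = rng.normal(size=(C, H, W))
ir_feat = rng.normal(size=(C, H, W))
fused = fusion_gate(
    rgb_feat, ir_feat,
    rng.normal(size=(HIDDEN, 2 * C)), np.zeros(HIDDEN),
    rng.normal(size=(2 * C, HIDDEN)), np.zeros(2 * C),
)
```

In a trained network the MLP parameters would be learned so that the gate favors whichever modality is more informative under the current imaging conditions.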