A park safety intelligent early warning method and system based on digital twinning
By establishing a digital twin 3D model and a spatiotemporal graph convolutional network, the invention addresses the fragmented field of view in park monitoring, enabling continuous tracking of targets across regions and high-precision behavior recognition, and thereby improving the real-time performance and accuracy of park security management.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN SILVER CHAIN SECURITY TECH CO LTD
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-16
AI Technical Summary
Existing park monitoring solutions rely on independent cameras, resulting in fragmented monitoring views. This makes it difficult to achieve continuous spatiotemporal tracking and logical association of targets across regions. Such solutions also lack verification of multi-dimensional spatiotemporal topological relationships, so the accuracy and real-time performance of early warning results are insufficient.
A digital twin 3D model is established, a unified world coordinate system is set, the 3D coordinates of pedestrian targets are extracted through video streams, a dynamic spatiotemporal graph is constructed, and a spatiotemporal graph convolutional network is used to extract high-dimensional spatiotemporal behavioral features. Combined with the digital twin space, the physical layer devices are driven to capture and collect evidence.
It enables continuous target tracking across cameras, enhances the overall situational awareness of the park, accurately identifies complex behavioral logic, reduces misjudgment, shortens response time, and achieves seamless integration of early warning and evidence collection.
Figure CN122223652A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of digital twin intelligent early warning technology, specifically involving a digital twin-based intelligent early warning method and system for park security.
Background Technology
[0002] With the booming development of smart cities and the Industrial Internet, park security management, as a core link in ensuring asset safety and production order, is undergoing a profound transformation from traditional manual patrols to digital and intelligent monitoring. Modern parks are deploying large-scale sensor networks and video surveillance matrices to build a comprehensive security protection system to cope with increasingly complex security environments and sudden risks. As the perception cornerstone of the security system, video surveillance systems play a crucial role in real-time risk identification and emergency command and dispatch.
[0003] Among them, the digital twin-based park monitoring technology provides a high-precision spatial reference and multi-source data fusion framework for security early warning by constructing a real-time digital mapping of the physical environment. This technology aims to map discrete video perception information onto a unified world coordinate system, and reconstruct the motion state of targets in a three-dimensional virtual space to achieve accurate perception and behavioral logic analysis of the overall situation of the park.
[0004] However, existing park monitoring solutions primarily rely on independent camera nodes, resulting in severe fragmentation of the monitoring field of view and making it difficult to achieve continuous spatiotemporal tracking and logical correlation of targets across regions. Furthermore, conventional recognition algorithms focus on feature extraction from single-frame images or local video segments, lacking deep integration with the geospatial semantics of the park, and thus failing to effectively identify complex abnormal behaviors constrained by spatial location. In addition, single-point monitoring modes suffer from poor robustness and high false alarm rates due to fluctuations in lighting, environmental occlusion, and viewing angle deviations. Moreover, existing systems lack verification mechanisms based on multi-dimensional spatiotemporal topological relationships, making it difficult to meet practical requirements for the accuracy and real-time performance of early warning results. Therefore, a digital twin-based intelligent early warning solution for park security is desired.
Summary of the Invention
[0005] The purpose of this invention is to provide a digital twin-based intelligent early warning method and system for park security, which can effectively solve the problems mentioned in the background art.
[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A digital twin-based intelligent early warning method for park security includes the following specific steps:
Establish a digital twin 3D model of the park and set a unified world coordinate system;
Acquire real-time video streams from various surveillance cameras within the park and extract the pixel coordinates of the center point of the foot of the pedestrian target in the image;
Based on the pre-calibrated camera intrinsic and extrinsic parameters and homography matrix, map the pixel coordinates of the foot center point to three-dimensional spatial coordinates in the world coordinate system, and perform elevation fusion by combining the digital elevation model and the prior value of the pedestrian's height to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system;
Based on the three-dimensional coordinates, perform temporal smoothing on the same pedestrian target in consecutive frames to generate continuous motion trajectories; define multiple pedestrian targets as graph nodes, construct spatial edges based on the spatial distance between nodes, and construct temporal edges based on the trajectory association of the same target to form a dynamic spatiotemporal graph;
Input the dynamic spatiotemporal graph into the spatiotemporal graph convolutional network, which contains multiple stacked spatiotemporal convolutional blocks, where each spatiotemporal convolutional block introduces spatial attention and temporal attention mechanisms to weight the spatial and temporal features in the dynamic spatiotemporal graph, respectively, and extract high-dimensional spatiotemporal behavioral features;
Identify security risks based on the high-dimensional spatiotemporal behavioral features; when a security risk is identified, visualize the abnormal target in the digital twin 3D model and, at the same time, based on the 3D coordinates of the abnormal target in the world coordinate system, reverse-drive the nearest PTZ camera in the physical layer to adjust its rotation angle, pitch angle and focal length parameters to capture and collect evidence of the abnormal target.
[0007] Furthermore, the pixel coordinates of the foot center point are mapped to three-dimensional spatial coordinates in the world coordinate system, and elevation fusion is performed by combining the digital terrain elevation model and the prior value of the pedestrian's height to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system, specifically including: The intrinsic parameter matrix and distortion coefficient of the surveillance camera are obtained by Zhang Zhengyou's calibration method. By selecting calibration points with known geographical coordinates, the rotation matrix and translation vector of the camera relative to the world coordinate system are solved by the direct linear transformation method to complete the extrinsic parameter calibration. Then, the homography matrix between the image coordinate system and the ground plane is solved. After distortion correction of the extracted foot center point pixel coordinates, the world coordinates on the ground plane are obtained through homography matrix linear transformation. The elevation of the ground at world coordinates on the ground plane is obtained by querying the digital elevation model. Combined with the preset shoe sole compensation value, the elevation of the pedestrian's feet is determined. Then, the statistical prior value of the pedestrian's height is used for verification and constraint, and finally the three-dimensional spatial coordinates of the pedestrian target are obtained.
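The mapping described above can be illustrated with a minimal sketch. It assumes the pixel has already been distortion-corrected, that the homography is given as a 3x3 nested list, and that the digital elevation model is exposed as a lookup function. The names `pixel_to_ground`, `foot_to_world`, `dem_lookup`, `sole_offset_m` and all numeric values are hypothetical and not taken from the application; the height-prior check is omitted.

```python
from typing import Callable, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def pixel_to_ground(H: Matrix3, u: float, v: float) -> Tuple[float, float]:
    """Apply a 3x3 homography to an undistorted pixel and dehomogenise."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    return x / w, y / w


def foot_to_world(
    H: Matrix3,
    u: float,
    v: float,
    dem_lookup: Callable[[float, float], float],
    sole_offset_m: float = 0.03,  # assumed shoe-sole compensation value
) -> Tuple[float, float, float]:
    """Ground-plane position from the homography, elevation from the DEM."""
    X, Y = pixel_to_ground(H, u, v)
    Z = dem_lookup(X, Y) + sole_offset_m
    return X, Y, Z


# Hypothetical calibration: 1 pixel = 1 cm, with an offset, over flat terrain.
H_example = [[0.01, 0.0, -3.2], [0.0, 0.01, -2.4], [0.0, 0.0, 1.0]]
flat_dem = lambda X, Y: 12.5  # constant ground elevation in metres
world_point = foot_to_world(H_example, 640.0, 480.0, flat_dem)
```

In a deployed system the homography and the DEM query would come from the calibration and modeling steps; the sketch only shows how the two sources combine into one 3D coordinate.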
[0008] Furthermore, based on three-dimensional coordinates, temporal smoothing processing is performed on the same pedestrian target in consecutive frames to generate continuous motion trajectories, specifically including: The Kalman filter algorithm is used to construct the system state vector with the three-dimensional spatial coordinates and velocity components of the pedestrian target. Through the state prediction equation and the observation equation, the three-dimensional world coordinates acquired in continuous time frames are temporally smoothed to eliminate coordinate jumps. Based on the change in the continuous position after smoothing, the real-time displacement vector, instantaneous velocity and direction of movement of the pedestrian in three-dimensional space are calculated.
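As a rough illustration of the smoothing described above, the sketch below runs one constant-velocity Kalman filter per coordinate axis. This is a simplification of the single state vector of 3D position and velocity named in the text. The noise parameters `q` and `r`, the frame interval, and the sample track are assumed values chosen only for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one axis (state: position, velocity)."""
    pos: float
    vel: float = 0.0
    p00: float = 1.0  # entries of the 2x2 state covariance matrix
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0

    def step(self, z: float, dt: float, q: float = 0.01, r: float = 0.25) -> float:
        # Prediction with F = [[1, dt], [0, 1]] and diagonal process noise q.
        pos = self.pos + self.vel * dt
        vel = self.vel
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + q
        # Update with observation matrix [1, 0] and measurement noise r.
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        resid = z - pos
        self.pos = pos + k0 * resid
        self.vel = vel + k1 * resid
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p10 = p10 - k1 * p00
        self.p11 = p11 - k1 * p01
        return self.pos


def smooth_track(points: Sequence[Tuple[float, float, float]], dt: float) -> List[tuple]:
    """Smooth a sequence of 3D world coordinates, one filter per axis."""
    filters = [AxisKalman(pos=c) for c in points[0]]
    smoothed = [tuple(points[0])]
    for p in points[1:]:
        smoothed.append(tuple(f.step(z, dt) for f, z in zip(filters, p)))
    return smoothed


# Hypothetical track with one coordinate jump at index 3 (x = 5.0).
track = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 0.0, 0.0), (0.4, 0.0, 0.0)]
smoothed = smooth_track(track, dt=0.04)
```

The filter pulls the jump toward the predicted position rather than discarding it; how strongly depends on the assumed ratio between `q` and `r`.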
[0009] Furthermore, multiple pedestrian targets are defined as graph nodes. Spatial edges are constructed based on the spatial distance between nodes, and temporal edges are constructed based on the trajectory associations of the same target, forming a dynamic spatiotemporal graph, specifically including: Each pedestrian target detected in each frame is treated as a graph node. The initial features of the node include at least its three-dimensional coordinates in the world coordinate system, its motion velocity, motion direction, and acceleration. Within the same time frame, all node pairs are traversed. When the Euclidean distance between two nodes is less than a preset spatial neighborhood threshold, a spatial edge is established to represent the instantaneous interaction relationship between pedestrians. A multi-target tracking algorithm is used to assign a unique tracking identifier to each pedestrian target, and a time edge is established between nodes with the same tracking identifier in adjacent frames to connect the motion trajectory of the same target.
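The edge construction rules above can be sketched as follows. The sketch assumes each frame is a list of `(track_id, (x, y, z))` detections that already carry tracking identifiers; node features other than position are left out. The function name `build_graph` and the threshold value are illustrative only.

```python
import math
from typing import List, Tuple

Detection = Tuple[int, Tuple[float, float, float]]  # (track_id, world xyz)
Node = Tuple[int, int]                              # (frame_index, track_id)


def build_graph(frames: List[List[Detection]], radius: float):
    """Return (spatial_edges, temporal_edges) for a sequence of frames."""
    spatial: List[Tuple[Node, Node]] = []
    temporal: List[Tuple[Node, Node]] = []
    for t, frame in enumerate(frames):
        # Spatial edges: node pairs in the same frame closer than the threshold.
        for i in range(len(frame)):
            for j in range(i + 1, len(frame)):
                if math.dist(frame[i][1], frame[j][1]) < radius:
                    spatial.append(((t, frame[i][0]), (t, frame[j][0])))
        # Temporal edges: same tracking identifier in adjacent frames.
        if t > 0:
            prev_ids = {tid for tid, _ in frames[t - 1]}
            for tid, _ in frame:
                if tid in prev_ids:
                    temporal.append(((t - 1, tid), (t, tid)))
    return spatial, temporal


# Hypothetical two-frame scene; target 3 is far away and leaves after frame 0.
frames = [
    [(1, (0.0, 0.0, 0.0)), (2, (1.0, 0.0, 0.0)), (3, (10.0, 0.0, 0.0))],
    [(1, (0.5, 0.0, 0.0)), (2, (1.0, 0.0, 0.0))],
]
spatial_edges, temporal_edges = build_graph(frames, radius=2.0)
```

The pairwise loop is quadratic in the number of pedestrians per frame, which is acceptable for a sketch; a spatial index would be a natural replacement for crowded scenes.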
[0010] Furthermore, each spatiotemporal convolutional block of the spatiotemporal graph convolutional network contains a spatial graph convolutional layer, a temporal convolutional layer, and a nonlinear activation function. The spatial graph convolutional layer uses the Laplacian matrix to aggregate spatial neighborhood features to extract the spatial structural features of the crowd. The temporal convolutional layer uses a one-dimensional convolutional kernel to perform sliding window calculations along the time axis to capture the temporal evolution features of the target's motion trajectory.
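The two operations inside a block can be sketched without learned weights. For the spatial step, the sketch assumes the commonly used symmetric normalization of the adjacency matrix with self-loops, which is closely related to the normalized graph Laplacian; the text does not specify which Laplacian form is used. The temporal step is a plain sliding-window 1D convolution. Learned weight matrices, the activation function and the attention weights are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Compute D^(-1/2) (A + I) D^(-1/2) for a symmetric adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def spatial_aggregate(adj: Matrix, feats: Matrix) -> Matrix:
    """Mix each node's features with those of its spatial neighbours."""
    norm = normalized_adjacency(adj)
    n, d = len(feats), len(feats[0])
    return [[sum(norm[i][j] * feats[j][k] for j in range(n)) for k in range(d)] for i in range(n)]


def temporal_conv(series: List[float], kernel: List[float]) -> List[float]:
    """Slide a 1D kernel along the time axis (no padding)."""
    k = len(kernel)
    return [sum(kernel[m] * series[t + m] for m in range(k)) for t in range(len(series) - k + 1)]


# Two connected pedestrians with one feature each, then a 2-tap temporal kernel.
mixed = spatial_aggregate([[0.0, 1.0], [1.0, 0.0]], [[2.0], [4.0]])
trend = temporal_conv([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
```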
[0011] Furthermore, the spatial attention mechanism learns weights for graph nodes in different regions, enabling the network to automatically focus on preset high-risk locations; the temporal attention mechanism assigns higher weights to time periods of drastic change in actions, enabling the network to focus on behavioral turning points; through the weighting of the spatial and temporal attention mechanisms, the ability to represent behavioral features strongly correlated with security events is enhanced.
[0012] Furthermore, security risks are identified based on high-dimensional spatiotemporal behavioral characteristics, specifically including: The high-dimensional spatiotemporal behavioral feature vector is input into a fully connected classification network, and mapped to a preset behavioral category probability space through the Softmax function. The output is the probability value of the pedestrian target belonging to at least one behavioral category among normal walking, abnormal loitering, illegal intrusion, illegal gathering, fall injury, or violent running. When the probability value of a specific anomaly category is greater than a preset probability threshold, and the duration of this high-probability state exceeds a preset duration threshold, a security risk is identified.
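The decision rule above combines a probability threshold with a duration threshold. A minimal sketch, assuming the classifier emits one logit vector per frame and that duration is measured as consecutive frames multiplied by a fixed frame interval; the threshold values and the function name `detect_risk` are hypothetical.

```python
import math
from typing import List, Optional

CATEGORIES = [
    "normal walking", "abnormal loitering", "illegal intrusion",
    "illegal gathering", "fall injury", "violent running",
]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect_risk(
    logit_frames: List[List[float]],
    prob_threshold: float = 0.8,       # assumed probability threshold
    duration_threshold_s: float = 1.0, # assumed duration threshold
    frame_interval_s: float = 0.5,     # assumed time between classified frames
) -> Optional[str]:
    """Return the anomaly category once it stays above the probability
    threshold for longer than the duration threshold, else None."""
    run_label, run_len = None, 0
    for logits in logit_frames:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != 0 and probs[best] > prob_threshold:
            label = CATEGORIES[best]
            run_len = run_len + 1 if label == run_label else 1
            run_label = label
            if run_len * frame_interval_s > duration_threshold_s:
                return run_label
        else:
            run_label, run_len = None, 0
    return None


intrusion = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0]
normal = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
sustained = detect_risk([intrusion, intrusion, intrusion])
interrupted = detect_risk([intrusion, intrusion, normal, intrusion])
```

Requiring a sustained high-probability state is what suppresses single-frame misclassifications; the interrupted sequence above never raises a risk.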
[0013] Furthermore, when a security risk is identified, the abnormal target is visualized in the digital twin 3D model, specifically including: Extract the real-time 3D coordinates of the abnormal target node in the world coordinate system, and in the digital twin 3D model, change the color of the representative point of the abnormal target from the first preset color representing the normal state to the second preset color representing the abnormal state and generate a bright halo. At the same time, pop up an early warning window to display the category of abnormal behavior, the time of occurrence, the geographical location, the motion trajectory playback and the associated real-time video footage.
[0014] Furthermore, based on the three-dimensional coordinates of the anomalous target in the world coordinate system, the nearest PTZ camera in the physical layer is driven in reverse to adjust its rotation angle, pitch angle, and focal length parameters to capture and collect evidence of the anomalous target. Specifically, this includes: Based on the three-dimensional coordinates of the abnormal target in the world coordinate system, the required rotation angle, pitch angle and focal length parameters of the PTZ camera closest to the target in the physical layer are calculated. Control commands are sent to the PTZ camera via a standard protocol to drive it to rotate automatically and adjust its focus, placing the abnormal target in the center of the frame for high-definition capture. The captured images and video clips are then uploaded to the early warning center database and pushed to the security personnel's terminals.
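The geometric part of this linkage can be sketched as below. The sketch assumes pan is measured counterclockwise from the world +X axis and tilt is positive upward; real PTZ devices define their own zero directions and ranges. It returns the distance to the target instead of a focal length, since the mapping from distance to zoom is device-specific. No control protocol is shown, and all names and coordinates are hypothetical.

```python
import math
from typing import Dict, Tuple

Point3 = Tuple[float, float, float]


def nearest_camera(target: Point3, cameras: Dict[str, Point3]) -> str:
    """Pick the camera whose mounting position is closest to the target."""
    return min(cameras, key=lambda cam_id: math.dist(cameras[cam_id], target))


def pan_tilt_distance(camera_pos: Point3, target: Point3) -> Tuple[float, float, float]:
    """Pan and tilt in degrees plus line-of-sight distance in metres."""
    dx = target[0] - camera_pos[0]
    dy = target[1] - camera_pos[1]
    dz = target[2] - camera_pos[2]
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt, math.sqrt(dx * dx + dy * dy + dz * dz)


# Hypothetical camera positions in world coordinates (metres).
cameras = {"ptz-a": (0.0, 0.0, 10.0), "ptz-b": (200.0, 0.0, 10.0)}
target = (10.0, 0.0, 0.0)
chosen = nearest_camera(target, cameras)
pan, tilt, dist = pan_tilt_distance(cameras[chosen], target)
```

The angles would still need to be converted into the camera's own pan and tilt offsets relative to its installed orientation before being sent as a control command.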
[0015] A digital twin-based intelligent early warning system for park security includes:
The digital twin model building module, which is used to create a digital twin 3D model of the park and set a unified world coordinate system;
The data acquisition and processing module, which is used to acquire real-time video streams from various surveillance cameras in the park and extract the pixel coordinates of the center point of the foot of the pedestrian target in the image;
The coordinate mapping and fusion module, which is used to map the pixel coordinates of the foot center point to the three-dimensional spatial coordinates in the world coordinate system based on the pre-calibrated camera parameters and homography matrix, and to perform elevation fusion by combining the digital elevation model and the prior value of the pedestrian's height to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system;
The dynamic spatiotemporal graph construction module, which is used to perform temporal smoothing on the same pedestrian target in consecutive frames based on three-dimensional coordinates to generate continuous motion trajectories, to define multiple pedestrian targets as graph nodes, to construct spatial edges based on the spatial distance between nodes, and to construct temporal edges based on the trajectory association of the same target to form a dynamic spatiotemporal graph;
The behavior feature extraction module, which has a built-in spatiotemporal graph convolutional network containing multiple stacked spatiotemporal convolutional blocks, where each spatiotemporal convolutional block introduces spatial attention and temporal attention mechanisms to weight the spatial and temporal features in the dynamic spatiotemporal graph, respectively, in order to extract high-dimensional spatiotemporal behavior features;
The early warning and linkage module, which is used to determine security risks based on high-dimensional spatiotemporal behavioral characteristics; when a security risk is determined to exist, the abnormal target is visualized in the digital twin 3D model and, at the same time, based on the 3D coordinates of the abnormal target in the world coordinate system, the module drives the nearest PTZ camera in the physical layer to adjust its rotation angle, pitch angle and focal length parameters to capture and collect evidence of the abnormal target.
[0016] In summary, this application includes at least one of the following beneficial technical effects: 1. This invention integrates the previously fragmented and independent field of view of cameras under a unified spatial reference by constructing a digital twin 3D model and establishing a unified world coordinate system. Using coordinate mapping technology, pedestrian targets in the video are transformed into 3D coordinates in physical space, eliminating the logical breakage problem in cross-regional target tracking in traditional surveillance. Through the construction of a dynamic spatiotemporal map, the continuous movement trajectory of targets throughout the entire park can be accurately depicted, greatly enhancing the perception depth of security management personnel regarding the overall situation of the park.
[0017] 2. Compared to traditional algorithms that only focus on the visual features of a single frame image or local video clip, this invention introduces geospatial semantic information. By using a spatiotemporal graph convolutional network to extract the displacement vectors and interrelationship features of targets in the real geographic coordinate system, the system can accurately identify complex behavioral logic closely related to spatial location. For the determination of abnormal behavior in specific areas, this invention provides higher-dimensional feature support, effectively addressing the shortcomings of traditional algorithms in spatial semantic understanding.
[0018] 3. This invention utilizes a multi-dimensional constraint mechanism of digital twin space to verify the recognition results. By analyzing the kinematic characteristics of the target in physical space and the spatiotemporal topological relationships between targets, it can effectively eliminate recognition errors caused by environmental factors or camera perspective deviations. Deep mining of time-series features through multi-layer spatiotemporal convolutional blocks ensures that the warning logic is based on a continuous behavioral evolution process, significantly improving the system's operational stability in complex environments.
[0019] 4. This invention uses digital twin spatial coordinates to drive physical layer devices, achieving seamless integration of early warning and evidence collection. When the system identifies a security hazard, it can automatically drive the corresponding camera for precise tracking and capture, completing the closed loop from risk assessment to on-site evidence collection without manual intervention. This precise linkage mechanism based on geospatial coordinates significantly reduces response time and provides technical support for emergency command in the park.
Attached Figure Description
[0020] Figure 1 is an overall schematic diagram of a digital twin-based intelligent early warning method for park security; Figure 2 is a schematic diagram illustrating the principle of feature extraction based on dynamic spatiotemporal graph construction and the spatiotemporal graph convolutional network; Figure 3 is a logical flowchart of the mapping from the pixel coordinate system of video images to the digital twin three-dimensional world coordinate system; Figure 4 is a schematic diagram of the multi-level interaction and data flow between the digital twin early warning center and the physical layer security linkage equipment.
Detailed Implementation
[0021] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.
[0022] The intelligent early warning method for park security based on digital twins is implemented according to the following steps: First, for step S1, a digital twin 3D model of the park is established and a unified world coordinate system is set. Step S1 aims to construct a high-precision digital mirror of the park's physical environment and provide a unified spatial reference for all subsequent monitoring data. By integrating UAV LiDAR scanning, oblique photogrammetry, and high-precision coordinate transformation technology, step S1 transforms the park's buildings, terrain, and infrastructure into a 3D digital model with accurate geographic coordinates. This model is not only the foundation for visualization but also the core support for achieving spatiotemporal fusion and precise spatial analysis of multi-source video data.
[0023] To achieve the above objective, step S1 is specifically completed through the following sub-steps: Step S101: Acquire and process 3D point cloud data of the park. First, a full-coverage aerial scan of the park is conducted using a UAV-borne LiDAR system. This system integrates a high-precision inertial navigation unit and a global positioning system. During flight, it emits laser beams to the ground at a preset pulse frequency and receives the reflected echoes. By accurately recording the round-trip time of the laser pulses, the real-time spatial displacement of the UAV, and its attitude angles, high-precision 3D point cloud data of the surface objects in the park can be obtained. To ensure the accuracy of subsequent modeling, the average point cloud density should be greater than a preset density threshold during the acquisition process; for example, the number of points per square meter should not be less than 100.
[0024] After acquiring the raw point cloud data, data preprocessing is performed; statistical outlier filtering algorithms are used to remove isolated noise points caused by atmospheric suspended matter or sensor noise; cloth simulation filtering algorithms or morphological filtering algorithms are used to classify the point cloud into ground points and non-ground points, thereby accurately extracting the digital elevation model and digital surface model of the park.
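The statistical outlier filtering mentioned above is commonly implemented by comparing each point's mean distance to its nearest neighbours against the global distribution of that quantity. The sketch below assumes that variant; the parameters `k` and `std_ratio` are illustrative, and the brute-force neighbour search is only suitable for small examples, not for real LiDAR scans.

```python
import math
import statistics
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def remove_statistical_outliers(points: List[Point3], k: int = 3, std_ratio: float = 1.0) -> List[Point3]:
    """Drop points whose mean k-nearest-neighbour distance exceeds
    mean + std_ratio * standard deviation over the whole cloud."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    limit = statistics.fmean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, d in zip(points, mean_knn) if d <= limit]


# A 3x3 grid of ground points plus one isolated noise point high above it.
grid = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
cloud = grid + [(100.0, 100.0, 100.0)]
cleaned = remove_statistical_outliers(cloud)
```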
[0025] For key features such as buildings, roads, vegetation, and security infrastructure, a point cloud semantic segmentation technology based on deep learning is used for target recognition and classification, and the outline and height information of buildings are initially extracted.
[0026] Step S102: Reconstruct the three-dimensional geometric model of the building and complete the texture mapping; based on the point cloud data processed in step S101, start the refined geometric modeling.
[0027] For buildings with regular structures, a plane fitting algorithm is used to extract the wall and roof planes to construct a white model structure; for buildings with complex irregular shapes, multi-view images obtained by oblique photogrammetry are combined with a multi-source data fusion algorithm to reconstruct the grid and accurately restore its geometric shape.
[0028] After the geometric model is constructed, high-resolution texture mapping is performed. By establishing a mapping relationship between the high-resolution image pixels of the drone and the coordinates of the 3D model surface, the image is attached to the model surface to ensure that the digital twin model maintains a high degree of consistency with the physical world in visual presentation.
[0029] Throughout the modeling process, it is necessary to strictly control the spatial position accuracy to ensure that the geometric error of the 3D model is within the preset allowable error range, such as a planar error of less than 0.1 meters and an elevation error of less than 0.15 meters.
[0030] Step S103: Set a unified world coordinate system and complete the model coordinate transformation; In order to unify the constructed digital twin model under a spatial reference that can be referenced by multiple sources of data, a unified world coordinate system needs to be set.
[0031] The specific operation involves selecting national-level control points or self-established benchmark reference points within the park as the origin of the coordinate system, and transforming the local modeling coordinate system to the geodetic coordinate system through rotation matrices and translation vectors, such as using the 2000 National Geodetic Coordinate System, i.e., CGCS2000.
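The transformation above is a rigid transform of each model vertex. A minimal sketch, assuming the world frame is a projected metric frame in which a rotation about the vertical axis plus a translation is sufficient; the rotation angle and the translation values are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def rotation_z(yaw_deg: float) -> List[List[float]]:
    """Rotation matrix for a rotation of yaw_deg about the vertical (Z) axis."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def local_to_world(
    point: Sequence[float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Tuple[float, ...]:
    """Compute R * p + t for one vertex of the local modeling frame."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical alignment: local frame rotated 90 degrees, origin at (100, 200, 5).
world_vertex = local_to_world((1.0, 0.0, 0.0), rotation_z(90.0), (100.0, 200.0, 5.0))
```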
[0032] This coordinate transformation ensures that every vertex and texture pixel in the digital twin model has unique longitude, latitude, and altitude parameters corresponding to the physical world. This unified coordinate benchmark provides a physical basis for the subsequent precise fusion and positioning of multiple surveillance video data in the digital twin space, and is a key foundation for solving the problem of fragmented field of view in traditional surveillance.
[0033] In summary, step S1 completes the precise mapping from the physical park to the digital twin space. This model not only possesses high-precision geometric structure and visual texture, but more importantly, it has a unified spatial coordinate system strictly aligned with the physical world. This foundational work enables the accurate localization and fusion of pedestrian targets from discrete camera video streams into this three-dimensional space, thus providing the necessary technical prerequisites for advanced applications such as continuous target tracking across cameras and geospatial semantic-based anomaly behavior analysis.
[0034] The next step, S2, involves acquiring real-time video streams from various surveillance cameras within the park and extracting the image coordinates of pedestrians. Step S2 aims to establish a connection with the physical cameras within the park and accurately detect pedestrian targets from the real-time video streams, extracting their pixel coordinates in the images. This provides the foundational data for subsequently mapping pedestrians to the digital twin space. Through unified network protocol access, device identifier binding, and deep learning-based pedestrian detection and keypoint localization, this step achieves the conversion from dispersed video sources to structured image coordinate data, serving as a crucial data entry point connecting physical surveillance equipment and the digital twin model. To achieve the above objectives, step S2 is specifically completed through the following sub-steps: Step S201: Connect the surveillance cameras and bind device identifiers; the system connects to multiple surveillance cameras distributed within the park via network protocols. Upon connection, the system generates a unique device identifier for each camera. This identifier uses a 64-bit integer encoding and includes the camera's region code, device type code, and device serial number.
[0035] The system binds the identifier to the virtual camera pose in the digital twin model. Specifically, a device mapping table is established in the data layer of the digital twin model. This table uses the device identifier as the primary key and stores the spatial coordinates, pitch angle, yaw angle, and roll angle parameters of the corresponding virtual camera. The video acquisition module acquires video data frames in real time at a preset frame rate and sends the acquired video frames to the edge computing unit or central processing server for subsequent analysis.
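The identifier and the mapping table can be sketched as follows. The text states only that the 64-bit identifier contains a region code, a device type code and a serial number; the bit widths chosen here (16, 8 and 40) are an assumption for illustration, as are the field names of the mapping table entry.

```python
from typing import Dict, Tuple

REGION_BITS, TYPE_BITS, SERIAL_BITS = 16, 8, 40  # assumed layout, 64 bits in total


def encode_device_id(region: int, dev_type: int, serial: int) -> int:
    """Pack the three fields into one 64-bit integer."""
    if not (0 <= region < 1 << REGION_BITS):
        raise ValueError("region code out of range")
    if not (0 <= dev_type < 1 << TYPE_BITS):
        raise ValueError("device type code out of range")
    if not (0 <= serial < 1 << SERIAL_BITS):
        raise ValueError("serial number out of range")
    return (region << (TYPE_BITS + SERIAL_BITS)) | (dev_type << SERIAL_BITS) | serial


def decode_device_id(device_id: int) -> Tuple[int, int, int]:
    """Unpack a device identifier into (region, dev_type, serial)."""
    serial = device_id & ((1 << SERIAL_BITS) - 1)
    dev_type = (device_id >> SERIAL_BITS) & ((1 << TYPE_BITS) - 1)
    region = device_id >> (TYPE_BITS + SERIAL_BITS)
    return region, dev_type, serial


# Device mapping table keyed by identifier; pose values are hypothetical.
device_id = encode_device_id(region=12, dev_type=3, serial=4567)
device_table: Dict[int, dict] = {
    device_id: {"xyz": (105.2, 48.7, 6.0), "pitch": -15.0, "yaw": 90.0, "roll": 0.0},
}
```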
[0036] Step S202: Pedestrian detection and bounding box localization based on deep learning; during image coordinate extraction, a pedestrian detection model based on a deep residual network ResNet as the backbone feature extraction network is used. This model has been pre-trained on a large-scale security dataset and has robust detection capabilities under complex lighting, occlusion, and background interference. The specific detection process is as follows: Multi-scale feature fusion operators are used to process video frames. A typical implementation is the Feature Pyramid Network (FPN). By fusing feature maps of different scales, pedestrian targets of different distances and sizes can be effectively captured. The model locates the spatial extent of a pedestrian in an image by predicting bounding boxes. Each bounding box is defined by its top-left pixel coordinates, width, and height. For each detected pedestrian target, the system records its bounding box information and generates a corresponding target tracking identifier.
[0037] To ensure the reliable operation of the pedestrian detection model described above, it needs to be adequately trained. The training data is constructed as follows: at least 100,000 historical video frames from the park's surveillance scenes are collected. Pedestrian targets in each frame are manually labeled, with the label including the top-left pixel coordinates, width, and height of each pedestrian's bounding box, as well as the target's category label, which includes "pedestrian" and "non-pedestrian." The labeled data is divided into training, validation, and test sets in an 8:1:1 ratio.
[0038] The loss function is a joint form of bounding box regression loss and classification loss. The bounding box regression loss uses the Smooth L1 loss function, which is calculated as follows:
$$L_{reg} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}\left(t_i - t_i^{*}\right)$$
where $t_i$ represents the parametric coordinates of the bounding box predicted by the network, $t_i^{*}$ represents the parametric coordinates of the actual bounding box, $x$ and $y$ are the coordinates of the center point of the bounding box, and $w$ and $h$ represent the width and height of the bounding box. The specific form of the Smooth L1 loss function is:
$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5\,z^{2}, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$
The classification loss uses the cross-entropy loss function, which is calculated as follows:
$$L_{cls} = -\log p_u$$
where $p$ is the probability distribution vector of the classes predicted by the network, $u$ is the real category label, and $p_u$ is the probability value that the network predicts for the correct category $u$. The joint loss function is:
$$L = L_{cls} + \lambda L_{reg}$$
where $\lambda$ is a balance coefficient, with a value of 1. The training objective is to use a stochastic gradient descent optimization algorithm to iteratively update the network weight parameters, minimizing the joint loss function value, and ultimately obtaining a detection model that can accurately detect pedestrian targets in video frames and output their bounding box positions.
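For a single detection, the joint loss described in this paragraph can be evaluated with a few lines of code. The sketch assumes the standard Smooth L1 definition with a transition point at 1 and a cross-entropy term computed from an already normalized probability vector; the function names and sample values are illustrative.

```python
import math
from typing import Sequence


def smooth_l1(z: float) -> float:
    """Quadratic near zero, linear for large residuals."""
    az = abs(z)
    return 0.5 * z * z if az < 1.0 else az - 0.5


def bbox_regression_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of Smooth L1 terms over the (x, y, w, h) parametric coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])


def joint_loss(pred_box, true_box, probs, true_class, lam: float = 1.0) -> float:
    """Classification loss plus lam times the regression loss."""
    return cross_entropy(probs, true_class) + lam * bbox_regression_loss(pred_box, true_box)


# Hypothetical prediction: small error in x, large error in w, uncertain class.
loss = joint_loss((0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0), (0.5, 0.5), true_class=0)
```

The linear branch keeps gradients bounded for badly mislocalized boxes, which is the usual reason for preferring Smooth L1 over a plain squared error in box regression.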
[0039] Step S203: Extract the center point of the pedestrian's foot as the spatial positioning anchor point. To achieve accurate mapping from image coordinates to geographic coordinates, a key point that is stable in contact with the ground in physical space and easily identifiable in the image needs to be selected. In this step, the midpoint of the bottom edge of the pedestrian bounding box is selected as the center point of the pedestrian's foot. This point represents the contact position between the pedestrian and the ground in physical space and is the key anchor point for subsequent image coordinate to geographic coordinate conversion.
[0040] The system records the horizontal pixel coordinate $u$ and the vertical pixel coordinate $v$ of the point in the image pixel grid, where $u$ ranges from 0 to the image width minus 1 and $v$ ranges from 0 to the image height minus 1. The system encapsulates this coordinate sequence along with the current timestamp and the corresponding camera device identifier into a data packet to be processed. This data packet will be used as input to step S3 to perform coordinate mapping calculations.
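The anchor point extraction and packaging can be sketched as follows, assuming the bounding box is given by its top-left pixel coordinates, width and height as described for step S202. The clamping to the valid pixel range, the `FootPacket` structure and its field names are illustrative choices, not part of the application.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FootPacket:
    """Data packet handed to the coordinate mapping step."""
    u: float          # horizontal pixel coordinate of the foot center
    v: float          # vertical pixel coordinate of the foot center
    timestamp: float  # capture time in seconds
    device_id: int    # identifier of the source camera


def foot_center(left: float, top: float, width: float, height: float,
                img_w: int, img_h: int) -> Tuple[float, float]:
    """Midpoint of the bounding box bottom edge, clamped to the image grid."""
    u = left + width / 2.0
    v = top + height
    u = min(max(u, 0.0), img_w - 1.0)
    v = min(max(v, 0.0), img_h - 1.0)
    return u, v


# Hypothetical detection in a 1920x1080 frame.
u, v = foot_center(100.0, 50.0, 40.0, 120.0, 1920, 1080)
packet = FootPacket(u=u, v=v, timestamp=1718000000.0, device_id=42)
clamped = foot_center(1900.0, 1000.0, 40.0, 100.0, 1920, 1080)
```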
[0041] Through the aforementioned sub-steps of step S2, the transformation from physical camera video streams to structured image coordinate data is completed. Each detected pedestrian target is assigned the coordinates of its foot center point in the image, and associated with a unique timestamp and camera identifier. This data structure provides the necessary input parameters for subsequently mapping pedestrians from the two-dimensional image space to the digital twin's three-dimensional world coordinate system, ensuring that target tracking across cameras can be performed under a unified spatiotemporal reference.
[0042] For step S3, the real-time spatial displacement vector of the target is generated. The core of this step is to establish a precise mapping relationship between the two-dimensional image space and the three-dimensional geographic space, converting the pedestrian foot image coordinates extracted in step S2 into three-dimensional spatial coordinates in the digital twin world coordinate system, thereby generating a real-time displacement vector reflecting the target's motion state. Through camera calibration, coordinate mapping, elevation fusion, and temporal smoothing, this step realizes the transformation from discrete image detection points to continuous spatial motion trajectories, providing basic data for subsequent spatiotemporal behavior analysis. To achieve the above objectives, step S3 is specifically completed through the following sub-steps: Step S301, Camera Parameter Calibration: Before coordinate mapping, the intrinsic and extrinsic parameters of each surveillance camera need to be obtained in advance. The intrinsic parameter calibration uses the Zhang Zhengyou calibration method, specifically as follows: Prepare a checkerboard calibration board with 11 rows by 8 columns of corner points, and each square has a side length of 30 mm. Collect multiple images of the calibration board in the camera's field of view under different orientations, at least 15 images. Extract the sub-pixel level image coordinates of the checkerboard corner points in each image using a corner detection algorithm, establish the correspondence between the corner point image coordinates and their three-dimensional coordinates in the calibration board's physical coordinate system, and solve for the camera's intrinsic parameter matrix using the least squares method. 
The intrinsic parameter matrix has the form $K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$, where $f_x$ and $f_y$ are the normalized focal lengths in the horizontal and vertical directions of the image, respectively, in pixels, and $(c_x, c_y)$ are the principal point coordinates, that is, the pixel coordinates of the intersection of the optical axis and the image plane. The radial distortion coefficients $k_1$, $k_2$ and the tangential distortion coefficients $p_1$, $p_2$ of the lens are solved for at the same time and are used for distortion correction of subsequent image coordinates.
[0043] Extrinsic parameter calibration is used to determine the pose of the camera relative to the coordinate system of the digital twin world. At least four feature calibration points should be selected in the physical space of the park. These feature calibration points should be fixed locations that are clearly identifiable in the image and have known geographic coordinates, such as the corner of a paving stone, the center of a street lamp base, or the edge of a manhole cover.
[0044] The geodetic coordinates of these calibration points are measured using a differential global positioning system and then transformed into three-dimensional coordinates in the digital twin world coordinate system. Simultaneously, the pixel coordinates of these calibration points are manually marked or automatically detected in the images captured by each camera. According to the pinhole camera model, world coordinates and pixel coordinates satisfy the following relationship: $s [u, v, 1]^T = K [R \mid t] [X_w, Y_w, Z_w, 1]^T$, where $s$ is a scale factor, $R$ is a 3x3 rotation matrix, and $t$ is a 3x1 translation vector. The perspective projection matrix is solved using the direct linear transformation method from at least four pairs of corresponding points, and the rotation matrix $R$ and translation vector $t$ are then obtained by decomposition, which completes the extrinsic parameter calibration. Based on the extrinsic parameter calibration results, the homography matrix $H$ between the ground plane in the image coordinate system and the digital twin world coordinate system is then solved using the least squares method. The homography matrix is a 3x3 matrix that satisfies $w [X_w, Y_w, 1]^T = H [u, v, 1]^T$, where $(X_w, Y_w)$ are the world coordinates on the ground plane and $w$ is the scale factor.
[0045] Step S302: Mapping image coordinates to 3D world coordinates and elevation fusion. For the pixel coordinates $(u, v)$ of the pedestrian's foot center point extracted in step S203, distortion correction is first performed using the distortion coefficients obtained in step S301 to eliminate the effects of radial and tangential lens distortion, yielding the corrected pixel coordinates $(u', v')$. The homography matrix $H$ is then applied as a linear transformation to map the corrected pixel coordinates to world coordinates on the ground plane: $[X', Y', w]^T = H [u', v', 1]^T$, where $X'$ and $Y'$ are the homogeneous world coordinate components on the ground plane and $w$ is the scale factor. The final ground projection coordinates are obtained through normalization: $(X_w, Y_w) = (X'/w, Y'/w)$.
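The homography mapping and normalization in step S302 can be sketched as below. The sketch assumes the input pixel has already been distortion-corrected and that the homography is given as a 3x3 nested list mapping pixel coordinates to ground-plane coordinates; `pixel_to_ground` is a hypothetical helper name.

```python
def pixel_to_ground(H, u, v):
    """Map a corrected pixel (u, v) to ground-plane world coordinates.

    H is a 3x3 homography (nested lists) from the image to the ground plane.
    """
    xp = H[0][0] * u + H[0][1] * v + H[0][2]
    yp = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]   # homogeneous scale factor
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity on the ground plane")
    return xp / w, yp / w                     # normalize by the scale factor
```

Dividing by the scale factor is what turns the linear matrix product into the perspective mapping; skipping it would only be correct for pixels where the scale happens to equal 1.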
[0046] To determine the altitude $Z_w$ of the pedestrian target, the system fuses the camera installation height, the park's digital elevation model (DEM), and prior statistical values of pedestrian height. The DEM provides, for each ground-plane coordinate $(X_w, Y_w)$, the surface elevation $Z_{dem}(X_w, Y_w)$ of that location. Since the center point of a pedestrian's feet is located on the ground surface, its altitude is theoretically equal to $Z_{dem}(X_w, Y_w)$. However, considering the thickness of the shoe sole between the pedestrian's foot and the ground, in practical applications it is recommended to take $Z_w = Z_{dem}(X_w, Y_w) + \delta$, where $\delta$ is the preset sole compensation value of 0.02 meters.
[0047] Simultaneously, prior statistical values of pedestrian height are used as validation constraints: if the difference between the altitude inferred from the head position and the foot altitude in subsequent consecutive frames falls outside the normal adult height range of 1.2 meters to 2.0 meters, a relocation check is triggered. Ultimately, the pedestrian's three-dimensional coordinates $(X_w, Y_w, Z_w)$ in the digital twin world coordinate system are obtained.
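The elevation fusion and the height plausibility check can be illustrated as follows. The DEM is represented here by a hypothetical callable `dem_lookup(x, y)` returning the surface elevation in meters; how the head altitude is inferred is not specified in the text, so it is simply taken as an input.

```python
SOLE_OFFSET_M = 0.02           # preset sole compensation value from the text
HEIGHT_RANGE_M = (1.2, 2.0)    # normal adult height range from the text

def fuse_elevation(x, y, dem_lookup, sole_offset=SOLE_OFFSET_M):
    """Return (x, y, z), with z taken from the DEM plus the sole compensation."""
    return (x, y, dem_lookup(x, y) + sole_offset)

def needs_relocation_check(head_z, foot_z, height_range=HEIGHT_RANGE_M):
    """True when the implied body height falls outside the plausible range."""
    height = head_z - foot_z
    return not (height_range[0] <= height <= height_range[1])
```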
[0048] Step S303, temporal smoothing and displacement vector calculation: Due to potential errors in single-frame detection, coordinate jumps may occur between adjacent frames. To eliminate this effect, temporal smoothing is performed on the world coordinates acquired from consecutive time frames.
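[0049] Using the Kalman filter algorithm, the system state vector is defined as $x_t = [X_t, Y_t, Z_t, v_{x,t}, v_{y,t}, v_{z,t}]^T$, where $X_t$, $Y_t$, $Z_t$ are the three-dimensional coordinates in frame $t$ and $v_{x,t}$, $v_{y,t}$, $v_{z,t}$ are the velocity components in the corresponding directions. The state prediction equation is $x_t = F x_{t-1} + q_t$ and the observation equation is $z_t = H_{obs} x_t + r_t$, where $F$ is the state transition matrix given by a uniform motion model, $z_t$ is the observed three-dimensional coordinate, $H_{obs}$ is the observation matrix, and $q_t$ and $r_t$ are the process and observation noise. Smoothed 3D coordinates are output through the Kalman filter prediction and update steps.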
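A compact sketch of this smoothing step is shown below. It is a simplification under stated assumptions: with a uniform motion model, position-only observations, and diagonal noise, the six-state filter decouples into three independent two-state (position, velocity) filters, one per axis. The noise values `q` and `r` and the class names are hypothetical tuning choices, not values from the application.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate axis."""

    def __init__(self, p0, q=1e-2, r=1e-1):
        self.p, self.v = p0, 0.0              # position and velocity estimates
        self.P = [[1.0, 0.0], [0.0, 1.0]]     # state covariance
        self.q, self.r = q, r                 # process / observation noise

    def step(self, z, dt):
        # Predict with F = [[1, dt], [0, 1]]: P <- F P F^T + Q.
        p = self.p + self.v * dt
        v = self.v
        (a, b), (c, d) = self.P
        a, b, c, d = (a + dt * (b + c) + dt * dt * d + self.q,
                      b + dt * d, c + dt * d, d + self.q)
        # Update with a position-only observation (observation row [1, 0]).
        s = a + self.r                        # innovation covariance
        k0, k1 = a / s, c / s                 # Kalman gain
        y = z - p                             # innovation
        self.p, self.v = p + k0 * y, v + k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
        return self.p

class TrackSmoother:
    """Smooths (X, Y, Z) world coordinates of one tracked pedestrian."""

    def __init__(self, xyz, **noise):
        self.axes = [AxisKalman(c, **noise) for c in xyz]

    def step(self, xyz, dt):
        return tuple(f.step(z, dt) for f, z in zip(self.axes, xyz))
```

A single outlier observation is pulled toward the predicted position rather than followed exactly, which is how coordinate jumps between adjacent frames are damped.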
[0050] Based on the change in consecutive smoothed positions, the real-time displacement vector of the pedestrian in three-dimensional space is calculated. Let the smoothed coordinates of frame $t$ be $P_t = (X_t, Y_t, Z_t)$ and those of frame $t-1$ be $P_{t-1} = (X_{t-1}, Y_{t-1}, Z_{t-1})$. The displacement vector is then defined as $d_t = P_t - P_{t-1}$. The instantaneous velocity is calculated by dividing the displacement vector by the time interval: $v_t = d_t / \Delta t$, where $\Delta t$ is the time difference between adjacent frames, in seconds. The direction of motion is determined by the projection angle of the displacement vector onto the horizontal plane: $\theta_t = \mathrm{atan2}(Y_t - Y_{t-1}, X_t - X_{t-1})$, where atan2 is the four-quadrant arctangent function and the output angle ranges from $-\pi$ to $\pi$.
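The displacement, velocity, and heading computation can be sketched as follows, assuming smoothed positions are given as (X, Y, Z) tuples in meters; `motion_state` is a hypothetical helper name.

```python
import math

def motion_state(prev, curr, dt):
    """Return (displacement, velocity, speed, heading) between two smoothed positions.

    heading is the four-quadrant angle of the horizontal projection, in radians.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    displacement = tuple(c - p for p, c in zip(prev, curr))
    velocity = tuple(x / dt for x in displacement)
    speed = math.sqrt(sum(x * x for x in velocity))
    heading = math.atan2(displacement[1], displacement[0])
    return displacement, velocity, speed, heading
```

For example, a move from (0, 0, 0) to (3, 4, 0) in 0.5 s gives a speed of 10 m/s.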
[0051] Through the sub-steps described above in step S3, a precise mapping from image pixel coordinates to three-dimensional coordinates in the digital twin world coordinate system is completed, generating a real-time displacement vector containing position, velocity, and direction information. This data not only eliminates coordinate jumps caused by single-point detection errors but also transforms discrete detection results into continuous motion trajectories, providing high-quality foundational data for the construction of the dynamic spatiotemporal map in step S4 and subsequent behavioral pattern analysis.
[0052] Step S4: Construct a dynamic spatiotemporal graph. This step aims to organize the pedestrian's 3D spatial coordinates and motion state generated in Step S3 into a graph structure that can simultaneously characterize individual motion trajectories and group interaction relationships. By defining each pedestrian target as a graph node, and establishing interaction edges based on distance in the spatial dimension and association edges based on tracking results in the temporal dimension, this step constructs a multi-level dynamic spatiotemporal graph, providing a structured data carrier for subsequent spatiotemporal behavioral feature extraction. To achieve the above objective, Step S4 is specifically completed through the following sub-steps: Step S401: Define graph nodes and initialize node features. The system treats each detected pedestrian target as a node in the graph structure. The node set is defined as $V = \{ v_t^i \}$, where $v_t^i$ denotes the $i$-th pedestrian target detected in frame $t$; each node is associated with the 3D world coordinates $P_t^i$ generated in step S303. The initial features of a node include the target's three-dimensional coordinates, as well as its velocity, direction of motion, and acceleration, obtained by calculating the first and second derivatives of the coordinates with respect to time.
[0053] Step S402: Construct spatial edges to represent the instantaneous interaction relationships between pedestrians. Within the same time frame $t$, the system iterates through all pairs of active pedestrian nodes and calculates the Euclidean distance between any two nodes $v_t^i$ and $v_t^j$ to determine their spatial neighborhood. If this Euclidean distance is less than a preset spatial neighborhood threshold, a spatial connection edge is established between the two nodes. The spatial neighborhood threshold can be dynamically adjusted based on the historical pedestrian density in different areas of the park, and is used to characterize possible interaction, following, or clustering relationships between pedestrians.
[0054] Step S403: Construct temporal edges to connect the motion trajectory of the same target. To characterize the motion evolution of pedestrian targets over time, the system uses a multi-target tracking algorithm to associate the same target across adjacent frames. Specifically, the DeepSORT or ByteTrack algorithm can be used. These algorithms combine the position and appearance features of the target detection box with motion prediction results to assign a unique tracking identifier to each detected pedestrian target. Based on the tracking results, the system establishes a temporal connection edge between node $v_t^i$ in frame $t$ and the node in frame $t+1$ that has the same tracking identifier. The weight of the temporal edge can be assigned according to the continuity of the displacement and the degree of overlap with the predicted position.
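Steps S401 to S403 can be sketched together as a small graph builder. The sketch assumes tracking has already been done, so each frame arrives as a dictionary from tracking identifier to smoothed (x, y, z) coordinates; a fixed `radius` stands in for the spatial neighborhood threshold, and edge weights are omitted. All names are hypothetical.

```python
import math
from itertools import combinations

def build_spatiotemporal_graph(frames, radius):
    """Build nodes, spatial edges, and temporal edges.

    frames: time-ordered list of dicts {track_id: (x, y, z)}.
    A node is identified by the pair (frame_index, track_id).
    """
    nodes, spatial_edges, temporal_edges = [], [], []
    for t, frame in enumerate(frames):
        nodes.extend((t, tid) for tid in frame)
        # Spatial edges: pairs in the same frame closer than the threshold.
        for a, b in combinations(sorted(frame), 2):
            if math.dist(frame[a], frame[b]) < radius:
                spatial_edges.append(((t, a), (t, b)))
        # Temporal edges: same tracking identifier in adjacent frames.
        if t > 0:
            for tid in frame:
                if tid in frames[t - 1]:
                    temporal_edges.append(((t - 1, tid), (t, tid)))
    return nodes, spatial_edges, temporal_edges
```

The pairwise loop is quadratic in the number of pedestrians per frame, which is acceptable for a sketch; a spatial index would be the natural replacement in crowded scenes.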
[0055] Through the sub-steps described in step S4, the system constructs a multi-level dynamic spatiotemporal graph. This graph structure uses pedestrian targets as nodes, spatial edges to depict the instantaneous interaction relationships between groups, and temporal edges to depict the temporal evolution of individual movements. It can completely record the dynamic evolution of people in the geographic space within the park, providing standardized input data for feature extraction by the spatiotemporal graph convolutional network in the subsequent step S5.
[0056] Next, for step S5, the spatiotemporal behavioral features of the target are extracted. The dynamic spatiotemporal graph constructed in step S4 is input into the spatiotemporal graph convolutional network, and the high-dimensional spatiotemporal behavioral features contained therein are extracted through a deep learning model. Step S5 is specifically completed through the following sub-steps: Step S501: Construct a spatiotemporal graph convolutional network architecture. The system uses the STGCN (Spatiotemporal Graph Convolutional Network) as the feature extraction model, which consists of multiple stacked spatiotemporal convolutional blocks. Each spatiotemporal convolutional block contains a spatial graph convolutional layer, a temporal convolutional layer, and a nonlinear activation function. The spatial graph convolutional layer is responsible for aggregating the features of nodes in their spatial neighborhood, while the temporal convolutional layer is responsible for extracting the evolution patterns of motion trajectories along the time axis. Multiple spatiotemporal convolutional blocks are connected sequentially to form a hierarchical feature extraction structure, enabling the network to gradually abstract global behavioral semantics from local spatiotemporal patterns.
[0057] Step S502: Extract spatial interaction features through spatial graph convolution. The spatial graph convolutional layer aggregates spatial neighborhood features using a Laplacian matrix. Specifically, for the input spatial graph, an adjacency matrix is first constructed to describe the connections between nodes. The adjacency matrix is then symmetrically normalized to obtain a normalized Laplacian matrix. For each node, spatial graph convolution computes a weighted sum of the features of its neighboring nodes, aggregating the information of adjacent nodes at the central node and thereby extracting spatial structural features such as the local density distribution, relative speed differences, and spatial aggregation patterns of the crowd.
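The neighborhood aggregation in step S502 can be illustrated with a single graph convolution on plain nested lists. This sketch assumes the commonly used propagation rule with self-loops and symmetric degree normalization of the adjacency matrix; the text does not fix the exact normalization, so this choice, like the function names, is an assumption. The nonlinear activation is left out.

```python
def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def graph_conv(adj, features, weight):
    """One spatial graph convolution: normalized adjacency @ features @ weight."""
    return matmul(matmul(normalized_adjacency(adj), features), weight)
```

With two connected nodes carrying features 2 and 4 and an identity weight, both outputs become 3, the average over each node's neighborhood including itself.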
[0058] Step S503: Extract motion evolution features through temporal convolution. After spatial graph convolution completes the spatial dimension feature aggregation, the temporal convolutional layer uses a one-dimensional convolutional kernel to perform sliding window calculations along the time axis. Each temporal convolutional kernel covers several consecutive frames. By performing convolution operations on the features of the same node at different times, it captures the temporal evolution features of the target's motion trajectory, such as dynamic patterns like sudden changes in velocity, changes in direction, and continuous changes in acceleration. By stacking multiple convolutional kernels of different scales, the temporal convolutional layer can simultaneously extract short-term rapid motion features and long-term behavioral trend features.
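The sliding-window computation along the time axis can be sketched for one node and one scalar feature. As in most deep learning frameworks, the "convolution" here is implemented as cross-correlation without kernel flipping; the kernels used in the usage note are hand-picked for illustration, whereas in the network they would be learned.

```python
def temporal_conv(series, kernel):
    """Valid-mode 1D sliding-window correlation of a per-node time series."""
    k = len(kernel)
    return [
        sum(w * series[i + j] for j, w in enumerate(kernel))
        for i in range(len(series) - k + 1)
    ]

def multi_scale_temporal_features(series, kernels):
    """Apply several kernels of different lengths to the same series."""
    return [temporal_conv(series, kernel) for kernel in kernels]
```

For a position series 1, 2, 4, 7, 11, the kernel [-1, 1] yields the frame-to-frame change (a speed-like signal) and [1, -2, 1] yields its second difference (an acceleration-like signal), which mirrors the short-window features named in the text.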
[0059] Step S504 introduces an attention mechanism to enhance key features. To improve the network's sensitivity to key behaviors, an attention mechanism is introduced into the spatiotemporal convolutional block. The attention mechanism has two dimensions: spatial attention and temporal attention. Spatial attention learns the weights of nodes in different regions, enabling the network to automatically focus on targets in sensitive areas, such as warehouse entrances, fence edges, or high-risk locations around core equipment. Temporal attention assigns higher weights to time periods of drastic change in action, allowing the network to focus on behavioral turning points, such as the sudden transition from normal walking to rapid running. Through the weighted approach of the attention mechanism, the network can more effectively extract behavioral features strongly correlated with security events.
[0060] Step S505: Output the high-dimensional spatio-temporal behavior feature vector. Through the hierarchical feature extraction of the multi-layer spatio-temporal convolutional block, the network maps the input dynamic spatio-temporal graph into a high-dimensional feature vector. This vector deeply integrates geographical semantic information, individual kinematic features, and group interaction relationships, and can comprehensively represent the behavior patterns of pedestrians in the three-dimensional geographical space, providing high-quality input for the safety hazard classification in Step S6.
[0061] To ensure the reliable operation of the above spatio-temporal graph convolutional network, it needs to be fully trained. The training data is constructed as follows: a large amount of historical video data is collected in the park monitoring scenario, each video is processed into a dynamic spatio-temporal graph sample according to the procedures of Steps S1 to S4, and each sample is labeled with the corresponding behavior category label. The behavior categories include normal walking, abnormal loitering, illegal intrusion, illegal gathering, fall injury, violent running, etc. The labeled data is divided into a training set, a validation set, and a test set according to the ratio of 8:1:1.
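The 8:1:1 division of labeled samples can be sketched as follows. Shuffling with a fixed seed is an assumption made here for reproducibility; the text does not say how samples are assigned to the three sets.

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle and split samples into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)        # deterministic shuffle
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # remainder goes to the test set
    return train, val, test
```

In practice it may be preferable to split by video or by camera rather than by sample, so that near-identical clips do not land in both the training and test sets.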
[0062] The loss function uses the cross-entropy loss function, and its calculation formula is: $L = -\sum_{c=1}^{C} y_c \log(p_c)$, where $C$ is the total number of behavior categories; $y_c$ is the indicator function, which takes the value 1 if the true category of the sample is equal to $c$ and 0 otherwise; and $p_c$ is the probability that the network predicts that the sample belongs to category $c$. The training objective is to iteratively update the network weight parameters through the backpropagation algorithm to minimize the value of the loss function, and finally obtain a network model that can accurately extract spatio-temporal behavior features and perform behavior classification.
[0063] In summary, through the above sub-steps of Step S5, the system inputs the dynamic spatio-temporal graph constructed in Step S4 into the spatio-temporal graph convolutional network, and extracts a high-dimensional spatio-temporal behavior feature vector that integrates geographical semantics, kinematic features, and social relationships. This feature vector comprehensively depicts the behavior patterns of pedestrians in the three-dimensional geographical space, providing an accurate discrimination basis for determining safety hazards and triggering visual warnings in Step S6.
[0064] Finally, in Step S6, determine safety hazards and conduct visual warnings. The core of this step is to convert the high-dimensional spatio-temporal behavior features extracted in the previous steps into executable safety warning instructions. After the system completes the determination of abnormal behaviors through the classification network, it synchronously presents them visually in the digital twin three-dimensional interface, and automatically drives the security equipment in the physical layer to perform precise linkage, thus forming a complete closed-loop from feature analysis, risk confirmation, visual warning to on-site evidence collection and emergency response. This process ensures the accuracy of the warning results and the real-time nature of the disposal, and is specifically achieved through the following sub-steps: Step S601: Classify safety hazards based on spatiotemporal behavioral feature vectors; the system inputs the high-dimensional spatiotemporal behavioral feature vectors extracted in step S505 into a fully connected classification network. This network maps the feature vectors to a preset behavioral category probability space through the Softmax function, and finally outputs the probability values of each pedestrian target belonging to multiple behavioral categories such as normal walking, abnormal loitering, illegal intrusion (entering a restricted area), illegal gathering, fall injury, and violent running.
[0065] To ensure the reliable operation of the aforementioned classification network, it needs to be adequately trained. The training data is constructed as follows: a large amount of historical video data from park surveillance scenarios is collected, and each video segment is processed into a dynamic spatiotemporal graph sample strictly according to steps S1 to S4; subsequently, each sample is manually labeled with the corresponding behavior category label. The labeled data is divided into training set, validation set, and test set in an 8:1:1 ratio.
[0066] The loss function used during training is the cross-entropy loss function, and its calculation formula is as follows: $L = -\sum_{i} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})$, where $C$ indicates the total number of behavior categories; $y_{i,c}$ is the indicator function, which takes the value 1 if the true category of sample $i$ equals $c$ and 0 otherwise; and $p_{i,c}$ is the probability predicted by the network that sample $i$ belongs to category $c$. The loss function value is minimized by iteratively updating the weight parameters in the network through backpropagation, thereby obtaining a network model capable of accurately classifying behavior based on spatiotemporal behavioral characteristics.
[0067] Step S602: Perform abnormal behavior determination; the probability values output by the classification network reflect the behavioral tendencies of each pedestrian target. The system sets a preset probability threshold; when the probability value of a specific abnormal category exceeds this threshold, it is determined that a safety hazard exists.
[0068] For example, if the probability of a target being judged as "abnormally loitering" near a core classified area exceeds 0.85, and this high-probability state lasts for more than a preset duration threshold, the system will trigger an alert. By introducing continuous judgment in the time dimension, false alarms caused by single-frame false detections can be effectively filtered out, improving the reliability of the alert.
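The combination of a probability threshold with a duration threshold can be sketched as a small stateful gate, one instance per target and abnormal category. The 0.85 threshold comes from the example above; the 5-second duration and the class name are hypothetical.

```python
class DurationGate:
    """Raise an alert only after the probability stays high long enough."""

    def __init__(self, prob_threshold=0.85, min_duration_s=5.0):
        self.prob_threshold = prob_threshold
        self.min_duration_s = min_duration_s
        self._since = None  # timestamp at which the high-probability run began

    def update(self, prob, timestamp):
        """Feed one classification result; return True when an alert should fire."""
        if prob > self.prob_threshold:
            if self._since is None:
                self._since = timestamp
            return timestamp - self._since > self.min_duration_s
        self._since = None  # run broken: a single low frame resets the timer
        return False
```

Resetting on the first low-probability frame is the strictest policy; tolerating short dips would reduce missed alerts at the cost of more false alarms.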
[0069] Step S603: Initiate digital twin 3D visualization early warning; upon determining the existence of a security risk, the system immediately extracts the real-time 3D coordinates of the abnormal target node in the digital twin space and simultaneously executes visualization early warning in the digital twin 3D interface.
[0070] In operation, the interface changes the color of the representative point or 3D model of the abnormal target from a first preset color (e.g., green) representing a normal state to a flashing second preset color (e.g., red), and generates a bright halo around the target for visual guidance. An alert window automatically pops up, clearly displaying the type of abnormal behavior, the time of occurrence, the specific geographical location (latitude and longitude), floor information, a playback of the target's movement trajectory, and related real-time video footage. Through this intuitive presentation in 3D space, security personnel can quickly grasp the situation on-site.
[0071] Step S604: Trigger the security linkage mechanism and precise capture. When the system determines that there is a security risk, it will automatically activate the security linkage mechanism. First, based on the three-dimensional coordinates of the abnormal target in the digital twin world coordinate system, the system calculates the rotation angle Pan, pitch angle Tilt, and focal length parameter Zoom of the PTZ camera closest to the target in the physical layer.
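The geometric part of this step can be sketched as below. The sketch assumes camera and target positions are given in the same world coordinate system, that pan is measured counter-clockwise from the world X axis, and that tilt is negative when looking downward; a real PTZ unit would additionally need its mounting offset from the extrinsic calibration. The zoom setting is not derived here: the distance is returned so that a device-specific zoom table (not specified in the text) could be applied. Function names are hypothetical.

```python
import math

def nearest_camera(cameras, target_xyz):
    """Return the id of the camera closest to the target.

    cameras: dict {camera_id: (x, y, z)} of PTZ camera positions.
    """
    return min(cameras, key=lambda cid: math.dist(cameras[cid], target_xyz))

def pan_tilt_to_target(camera_xyz, target_xyz):
    """Return (pan_deg, tilt_deg, distance_m) needed to aim at the target."""
    dx, dy, dz = (t - c for c, t in zip(camera_xyz, target_xyz))
    pan = math.degrees(math.atan2(dy, dx))                   # horizontal angle
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # vertical angle
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    return pan, tilt, distance
```

For a camera mounted 5 m above the origin and a target at (5, 5, 0), this gives a pan of 45 degrees and a downward tilt of about 35.26 degrees.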
[0072] Subsequently, the drive control module sends commands to the PTZ camera via standard protocols such as PELCO-D or CGI, controlling its automatic rotation and focusing to precisely center the abnormal target in the frame for high-definition capture. The captured images and video clips are uploaded to the early warning center database in real time and simultaneously pushed to the mobile terminals of security personnel. This process achieves seamless integration from risk identification to on-site evidence collection, significantly shortening emergency response time.
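As an illustration of the command layer, the sketch below assembles a 7-byte PELCO-D frame as the protocol is commonly documented: a 0xFF sync byte, the camera address, two command bytes, two data bytes, and a checksum equal to the sum of the five preceding bytes modulo 256. The helper names are hypothetical, and the serial or network transport is omitted; the byte layout should be checked against the documentation of the actual PTZ device.

```python
SYNC = 0xFF
PAN_LEFT = 0x04   # command byte 2, pan-left bit as commonly documented

def pelco_d_frame(address, cmd1, cmd2, data1, data2):
    """Build a 7-byte PELCO-D frame with its modulo-256 checksum."""
    body = [address, cmd1, cmd2, data1, data2]
    if any(not 0 <= b <= 0xFF for b in body):
        raise ValueError("all fields must fit in one byte")
    checksum = sum(body) % 256
    return bytes([SYNC, *body, checksum])

def pan_left(address, speed):
    """Frame that pans the addressed camera left at the given speed."""
    return pelco_d_frame(address, 0x00, PAN_LEFT, speed, 0x00)
```

Panning camera 1 left at speed 0x3F produces the bytes FF 01 00 04 3F 00 44.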
[0073] Step S6, as the core decision-making and execution link of the entire early warning method, realizes a complete closed loop from spatiotemporal behavioral feature analysis to safety hazard handling. This step first inputs the high-dimensional spatiotemporal behavioral features extracted in the previous step into a classification network, and accurately identifies the abnormal behavior category through a fully trained classification model. After determining that a safety hazard exists, the system simultaneously executes a visual early warning in the digital twin 3D interface, intuitively presenting the status and location of the abnormal target through color changes, bright halos, and information pop-ups. Based on this, the system automatically drives physical layer security equipment, precisely controlling nearby PTZ cameras to complete capture and evidence collection based on the world coordinates of the abnormal target. The entire process integrates feature classification, spatiotemporal constraint verification, 3D visualization interaction, and equipment linkage control, ensuring the accuracy of early warning results, the real-time nature of handling, and the traceability of risk records, providing park safety management personnel with full-process technical support from risk identification to emergency response.
[0074] Thus, this application constructs a complete technology chain from physical park digital twin modeling, multi-source video data spatiotemporal fusion, dynamic spatiotemporal map construction, spatiotemporal behavioral feature extraction to intelligent early warning and joint handling of security hazards. It effectively solves the problems of fragmented monitoring field of view, lack of spatial semantics and insufficient early warning accuracy in existing technologies, and realizes intelligent and precise park security management.
[0075] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention. Therefore, the embodiments should be regarded as exemplary and non-limiting in all respects.
[0076] Furthermore, it should be understood that although this specification describes embodiments, not every embodiment contains only one independent technical solution. This narrative style is merely for clarity. Those skilled in the art should consider the specification as a whole, and the technical solutions in each embodiment can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.
Claims
1. A digital twin-based intelligent early warning method for park security, characterized in that it includes the following steps: establishing a digital twin 3D model of the park and setting a unified world coordinate system; acquiring real-time video streams from the surveillance cameras within the park and extracting the pixel coordinates of the foot center point of each pedestrian target in the image; based on the pre-calibrated camera intrinsic and extrinsic parameters and the homography matrix, mapping the pixel coordinates of the foot center point to three-dimensional spatial coordinates in the world coordinate system, and performing elevation fusion with the digital elevation model and the prior value of the pedestrian's height to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system; based on the three-dimensional coordinates, performing temporal smoothing on the same pedestrian target in consecutive frames to generate continuous motion trajectories; defining multiple pedestrian targets as graph nodes, constructing spatial edges based on the spatial distance between nodes, and constructing temporal edges based on the trajectory association of the same target, to form a dynamic spatiotemporal graph; inputting the dynamic spatiotemporal graph into a spatiotemporal graph convolutional network that contains multiple stacked spatiotemporal convolutional blocks, wherein each spatiotemporal convolutional block introduces spatial attention and temporal attention mechanisms to weight the spatial and temporal features in the dynamic spatiotemporal graph, respectively, and extracts high-dimensional spatiotemporal behavioral features; identifying security risks based on the high-dimensional spatiotemporal behavioral features; and, when a security risk is identified, visualizing the abnormal target in the digital twin 3D model and, at the same time, based on the 3D coordinates of the abnormal target in the world coordinate system, driving the nearest PTZ camera in the physical layer in reverse to adjust its rotation angle, pitch angle and focal length parameters to capture and collect evidence of the abnormal target.
2. The method according to claim 1, characterized in that, The pixel coordinates of the foot center point are mapped to three-dimensional spatial coordinates in the world coordinate system, and then combined with the digital elevation model and the prior value of the pedestrian's height for elevation fusion to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system, specifically including: The intrinsic parameter matrix and distortion coefficient of the surveillance camera are obtained by Zhang Zhengyou's calibration method. By selecting calibration points with known geographical coordinates, the rotation matrix and translation vector of the camera relative to the world coordinate system are solved by the direct linear transformation method to complete the extrinsic parameter calibration. Then, the homography matrix between the image coordinate system and the ground plane is solved. After distortion correction of the extracted foot center point pixel coordinates, the world coordinates on the ground plane are obtained through homography matrix linear transformation. The elevation of the ground at world coordinates on the ground plane is obtained by querying the digital elevation model. Combined with the preset shoe sole compensation value, the elevation of the pedestrian's feet is determined. Then, the statistical prior value of the pedestrian's height is used for verification and constraint, and finally the three-dimensional spatial coordinates of the pedestrian target are obtained.
3. The method according to claim 1, characterized in that, Based on three-dimensional coordinates, temporal smoothing is performed on the same pedestrian target in consecutive frames to generate continuous motion trajectories, specifically including: The Kalman filter algorithm is used to construct the system state vector with the three-dimensional spatial coordinates and velocity components of the pedestrian target. Through the state prediction equation and the observation equation, the three-dimensional world coordinates acquired in continuous time frames are temporally smoothed to eliminate coordinate jumps. Based on the change in the continuous position after smoothing, the real-time displacement vector, instantaneous velocity and direction of movement of the pedestrian in three-dimensional space are calculated.
4. The method according to claim 1, characterized in that, Multiple pedestrian targets are defined as graph nodes. Spatial edges are constructed based on the spatial distance between nodes, and temporal edges are constructed based on the trajectory associations of the same target, forming a dynamic spatiotemporal graph, which specifically includes: Each pedestrian target detected in each frame is treated as a graph node. The initial features of the node include at least its three-dimensional coordinates in the world coordinate system, its motion velocity, motion direction, and acceleration. Within the same time frame, all node pairs are traversed. When the Euclidean distance between two nodes is less than a preset spatial neighborhood threshold, a spatial edge is established to represent the instantaneous interaction relationship between pedestrians. A multi-target tracking algorithm is used to assign a unique tracking identifier to each pedestrian target, and a time edge is established between nodes with the same tracking identifier in adjacent frames to connect the motion trajectory of the same target.
5. The method according to claim 1, characterized in that, Each spatiotemporal convolutional block in the spatiotemporal graph convolutional network contains a spatial graph convolutional layer, a temporal convolutional layer, and a nonlinear activation function. The spatial graph convolutional layer uses the Laplacian matrix to aggregate spatial neighborhood features to extract the spatial structural features of the crowd. The temporal convolutional layer uses a one-dimensional convolutional kernel to perform sliding window calculations along the time axis to capture the temporal evolution characteristics of the target's motion trajectory.
6. The method according to claim 5, characterized in that the spatial attention mechanism learns weights for graph nodes in different regions, enabling the network to automatically focus on preset high-risk locations; the temporal attention mechanism assigns higher weights to time periods in which actions change drastically, enabling the network to focus on behavioral turning points; and weighting by the spatial attention mechanism and the temporal attention mechanism enhances the ability to represent behavioral features strongly correlated with security incidents.
7. The method according to claim 1, characterized in that identifying security risks based on the high-dimensional spatiotemporal behavioral features specifically includes: inputting the high-dimensional spatiotemporal behavioral feature vector into a fully connected classification network and mapping it to a preset behavioral category probability space through the Softmax function; outputting the probability that the pedestrian target belongs to at least one behavioral category among normal walking, abnormal loitering, illegal intrusion, illegal gathering, fall injury, and violent running; and identifying a security risk when the probability of a specific abnormal category is greater than a preset probability threshold and the duration of this high-probability state exceeds a preset duration threshold.
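For illustration, the decision rule of claim 7 might be sketched as follows. This is a minimal sketch assuming that the classification network has already produced one vector of raw scores (logits) per frame, that frames arrive at a fixed interval `dt`, and that category index 0 is the normal class. The names `CATEGORIES`, `softmax`, and `detect_risk` are hypothetical.

```python
import math

CATEGORIES = ["normal_walking", "abnormal_loitering", "illegal_intrusion",
              "illegal_gathering", "fall_injury", "violent_running"]


def softmax(logits):
    """Map raw scores to probabilities that sum to 1 (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detect_risk(logit_frames, dt, prob_threshold, min_duration):
    """Return (category, frame_index) of the first identified risk, or None.

    A risk is identified only when the same abnormal category stays above
    prob_threshold for longer than min_duration seconds.
    """
    run_label, run_time = None, 0.0
    for i, logits in enumerate(logit_frames):
        probs = softmax(logits)
        # Most probable abnormal category (index 0 is normal walking).
        best = max(range(1, len(probs)), key=probs.__getitem__)
        if probs[best] > prob_threshold:
            label = CATEGORIES[best]
            run_time = run_time + dt if label == run_label else dt
            run_label = label
            if run_time > min_duration:
                return label, i
        else:
            run_label, run_time = None, 0.0
    return None
```

Requiring the high-probability state to persist filters out single-frame misclassifications, which is how the duration threshold reduces false alarms.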
8. The method according to claim 1, characterized in that visualizing the abnormal target in the digital twin 3D model when a security risk is identified specifically includes: extracting the real-time 3D coordinates of the abnormal target node in the world coordinate system; in the digital twin 3D model, changing the color of the point representing the abnormal target from a first preset color representing the normal state to a second preset color representing the abnormal state, and generating a highlighted halo; and, at the same time, popping up an early warning window that displays the category of the abnormal behavior, the time of occurrence, the geographical location, a playback of the motion trajectory, and the associated real-time video footage.
9. The method according to claim 1, characterized in that reverse-driving the nearest PTZ camera in the physical layer, based on the three-dimensional coordinates of the abnormal target in the world coordinate system, to adjust its rotation angle, pitch angle, and focal length parameters so as to capture and collect evidence of the abnormal target specifically includes: calculating, based on the three-dimensional coordinates of the abnormal target in the world coordinate system, the required rotation angle, pitch angle, and focal length parameters of the PTZ camera in the physical layer that is closest to the target; sending control commands to the PTZ camera via a standard protocol to drive it to rotate automatically and adjust its focus, placing the abnormal target in the center of the frame for high-definition capture; and uploading the captured images and video clips to the early warning center database and pushing them to the security personnel's terminals.
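For illustration, the parameter calculation of claim 9 might be sketched as follows. This is a minimal sketch under assumptions the application does not state: each camera's zero rotation angle points along the world +x axis, zero pitch is horizontal, mounting offsets and mechanical limits are ignored, and the focal length follows a pinhole model that fits a scene of width `desired_width_m` onto a sensor of width `sensor_width_mm`. The name `aim_nearest_ptz` and both default values are hypothetical, and sending the resulting command over the control protocol is out of scope.

```python
import math


def aim_nearest_ptz(cameras, target, desired_width_m=2.0, sensor_width_mm=6.4):
    """Pick the closest camera and compute its rotation, pitch, and focal length.

    cameras: dict mapping camera id to its (x, y, z) position in world coordinates.
    target: (x, y, z) of the abnormal target in the same coordinate system.
    Returns (camera_id, rotation_deg, pitch_deg, focal_length_mm).
    """
    cam_id, cam_pos = min(cameras.items(), key=lambda kv: math.dist(kv[1], target))
    dx, dy, dz = (t - c for t, c in zip(target, cam_pos))
    ground = math.hypot(dx, dy)                    # horizontal distance to target
    rotation = math.degrees(math.atan2(dy, dx))    # angle about the vertical axis
    pitch = math.degrees(math.atan2(dz, ground))   # negative when looking down
    distance = math.hypot(ground, dz)
    # Pinhole model: scene width = sensor width * distance / focal length.
    focal_mm = sensor_width_mm * distance / desired_width_m
    return cam_id, rotation, pitch, focal_mm
```

For example, a camera mounted 10 m high at the origin and a target at (10, 10, 0) yields a rotation of 45 degrees and a downward pitch of about 35.3 degrees.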
10. A digital twin-based intelligent early warning system for park security, used to perform the method according to any one of claims 1 to 9, characterized in that it comprises: a digital twin model building module, used to create a digital twin 3D model of the park and set a unified world coordinate system; a data acquisition and processing module, used to acquire real-time video streams from the surveillance cameras in the park and extract the pixel coordinates of the foot center point of each pedestrian target in the image; a coordinate mapping and fusion module, used to map the pixel coordinates of the foot center point to three-dimensional spatial coordinates in the world coordinate system based on pre-calibrated camera parameters and a homography matrix, and to perform elevation fusion by combining a digital elevation model with a prior value of pedestrian height so as to generate the three-dimensional coordinates of the pedestrian target in the world coordinate system; a dynamic spatiotemporal graph construction module, used to perform temporal smoothing on the same pedestrian target in consecutive frames based on the three-dimensional coordinates to generate continuous motion trajectories, to define multiple pedestrian targets as graph nodes, to construct spatial edges based on the spatial distance between nodes, and to construct temporal edges based on the trajectory association of the same target, thereby forming a dynamic spatiotemporal graph; a behavior feature extraction module with a built-in spatiotemporal graph convolutional network, the spatiotemporal graph convolutional network comprising multiple stacked spatiotemporal convolutional blocks, each of which introduces a spatial attention mechanism and a temporal attention mechanism to weight the spatial and temporal features in the dynamic spatiotemporal graph, respectively, so as to extract high-dimensional spatiotemporal behavioral features; and an early warning and linkage module, used to determine security risks based on the high-dimensional spatiotemporal behavioral features, to visualize the abnormal target in the digital twin 3D model when a security risk is determined to exist, and, at the same time, based on the 3D coordinates of the abnormal target in the world coordinate system, to reverse-drive the nearest PTZ camera in the physical layer to adjust its rotation angle, pitch angle, and focal length parameters so as to capture and collect evidence of the abnormal target.