An AR terminal-based operation risk cooperative alarm method and system in a multi-person operation scenario

Multi-dimensional data collected through AR terminals is used to construct a multi-person temporal behavior graph, and a spatiotemporal graph transformer model predicts operational conflicts in low-voltage uninterrupted power operations. This addresses the problem of unpredictable safety hazards in multi-person collaborative scenarios and enables accurate early warning and improved safety.

CN122244741A · Pending · Publication Date: 2026-06-19 · ELECTRIC POWER RES INST OF GUANGXI POWER GRID CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ELECTRIC POWER RES INST OF GUANGXI POWER GRID CO LTD
Filing Date
2026-02-25
Publication Date
2026-06-19


Abstract

This invention belongs to the field of live-line work safety control technology, specifically disclosing a collaborative alarm method and system for operational risks in multi-person work scenarios based on AR terminals. The method includes: collecting perception data such as video frame streams, relative device poses, and timestamps through AR terminals worn by workers; generating node feature vectors containing human keypoint embeddings, object interaction vectors, and velocity vectors based on the perception data using edge computing nodes; forming a multi-person temporal behavior graph by constructing edges based on the spatial relationships of different personnel and the adjacent time step relationships of the same personnel; inputting the graph into a pre-trained spatiotemporal graph transformer model, and outputting the classification probability of personnel behavioral intentions and the probability of conflict between personnel operational intentions through joint inference via spatial and temporal self-attention mechanisms; generating an early warning command when the conflict probability exceeds a preset threshold, thus achieving early warning before conflict occurs and effectively improving the safety of multi-person work.

Description

Technical Field

[0001] This invention relates to the field of live-line working safety control technology, specifically to a collaborative alarm method and system for operational risks in multi-person working scenarios based on AR terminals.

Background Technology

[0002] In multi-person collaborative work scenarios within the field of safety management for low-voltage uninterrupted power operations, the division of labor among workers is clear, and their operations are highly interconnected. Dangers often stem from conflicting intentions among multiple personnel rather than from a single violation. For example, if a maintenance worker is working inside equipment and another worker mistakenly closes the circuit breaker, a serious safety accident can result.

[0003] Current intelligent recognition systems mostly focus on action recognition or posture recognition, which can only perceive a single behavior of a person and cannot uncover the operational intentions behind the actions or the potential conflicts between multiple intentions. They are also unable to provide early warnings of impending but not yet occurring conflicts in multi-person collaboration.

[0004] Traditional algorithms have obvious limitations. First, they lack predictive capability: they can identify risks only after they occur and cannot avoid them in advance. Second, they neglect collaboration: they analyze the behavior of a single individual and do not consider interactions among multiple people. Third, they rely on static recognition and cannot infer causal logic from the temporal evolution of actions.

[0005] Therefore, there is a need for a collaborative alarm method and system for operational risks in multi-person operation scenarios based on AR terminals.

Summary of the Invention

[0006] To address this issue, the present invention provides a collaborative alarm method and system for operational risks in multi-person operation scenarios based on AR terminals, in order to solve the aforementioned technical problems.

[0007] This invention provides a collaborative alarm method for operational risks in multi-person work scenarios based on AR terminals, comprising the following steps: collecting in real time, through AR terminals worn by multiple workers, perception data including one or more of video frame streams, relative device pose, and timestamps; generating, at an edge computing node and based on the perception data, a node feature vector for each worker containing one or more of a human keypoint embedding, an object interaction vector, and a velocity vector; using the node feature vector of each worker at each time step within a preset time window as a node, and constructing edges between the nodes based on the spatial relationship between workers and the relationship between adjacent time steps of the same worker, thereby constructing a multi-person temporal behavior graph; inputting the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model, which performs joint reasoning on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism and simultaneously outputs the behavioral intention classification probability of each worker for a future time period and the operational intention conflict probability between any two workers; and, when the operational intention conflict probability exceeds a preset threshold, generating a corresponding conflict warning instruction and sending it to the corresponding AR terminal as a prompt, thereby providing a warning before the conflict occurs.

[0008] Preferably, constructing an edge specifically includes constructing a spatial edge representing the spatial relationship between people at the same time step, constructing a temporal edge representing the continuity of the same person's actions, and constructing an interaction edge that marks the detected interaction event.

[0009] Preferably, the spatiotemporal graph transformer model sequentially includes a node embedding layer for mapping multimodal node features to a unified dimension, a graph transformer block for modeling spatial relationships between multiple subjects, a temporal transformer block for modeling the temporal evolution of actions, and a fusion layer for realizing cross-attention between the spatial and temporal representations.

[0010] Preferably, the graph transformer block uses a multi-head self-attention mechanism in which edge features are added as relative position biases to the calculation of attention weights, and an adjacency mask restricts each node to attending only to the neighboring nodes connected to it by edges.
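The masked attention described in [0010] can be sketched for a single head as follows. This is a minimal illustration under stated assumptions, not the model of this disclosure: the function name `masked_attention_weights`, the use of one scalar bias per edge, and the all-zero row returned for a node with no permitted neighbors are hypothetical choices. A multi-head version would repeat the same computation per head.

```python
import math

def masked_attention_weights(queries, keys, edge_bias, adjacency):
    """Single-head attention weights with an additive edge bias and an adjacency mask.

    queries, keys: lists of equal-length float vectors, one per node.
    edge_bias[i][j]: scalar bias derived from the edge features of (i, j) (assumed form).
    adjacency[i][j]: True if node i may attend to node j.
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if adjacency[i][j]:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(dim) + edge_bias[i][j])
            else:
                scores.append(None)  # masked out: not connected by an edge
        visible = [s for s in scores if s is not None]
        if not visible:
            # Assumption: an isolated node gets all-zero attention weights.
            weights.append([0.0] * len(keys))
            continue
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [0.0 if s is None else math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Masked positions receive a weight of exactly zero, so each row is a softmax over the node's graph neighbors only.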

[0011] Preferably, the spatiotemporal graph transformer model is trained using a joint loss function, which includes at least: an intent classification cross-entropy loss for supervising intent prediction, and a conflict binary classification cross-entropy loss for supervising conflict discrimination.

[0012] Preferably, the joint loss function further includes a temporal consistency loss term, which is used to impose a smoothness constraint on the prediction results of adjacent time steps.
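A joint loss of the kind described in [0011] and [0012] might be assembled as in the sketch below. The weighting factors `w_conflict` and `w_smooth`, the squared-difference form of the smoothness term, and all function names are assumptions for illustration; the disclosure only states which terms the loss contains.

```python
import math

def cross_entropy(probs, label):
    """Intent classification loss for one worker: -log p(true class)."""
    return -math.log(max(probs[label], 1e-12))

def binary_cross_entropy(prob, label):
    """Conflict discrimination loss for one worker pair (label is 0 or 1)."""
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def temporal_consistency(prob_sequence):
    """Mean squared change between predictions at adjacent time steps (assumed form)."""
    if len(prob_sequence) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(prob_sequence, prob_sequence[1:]):
        total += sum((a - b) ** 2 for a, b in zip(prev, curr))
    return total / (len(prob_sequence) - 1)

def joint_loss(intent_probs, intent_label, conflict_prob, conflict_label,
               prob_sequence, w_conflict=1.0, w_smooth=0.1):
    """Weighted sum of the three terms; the weights are hypothetical hyperparameters."""
    return (cross_entropy(intent_probs, intent_label)
            + w_conflict * binary_cross_entropy(conflict_prob, conflict_label)
            + w_smooth * temporal_consistency(prob_sequence))
```

With identical predictions at adjacent time steps the smoothness term vanishes, so the loss reduces to the two cross-entropy terms.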

[0013] Preferably, the method further includes occlusion compensation and spatiotemporal consistency discrimination. Specifically, it receives multi-view observation data of the same worker from different AR terminals, performs timestamp alignment and coordinate system unification, calculates the consistency score of each view observation, and performs weighted fusion of the multi-view data based on the consistency score to compensate for occlusion or observation loss under a single view.

[0014] Preferably, when the confidence level of the observation data from all perspectives is lower than a second preset threshold, the edge computing node triggers a conservative prediction mode, reducing the trigger sensitivity of the conflict warning or generating only low-priority prompt information.
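The weighted fusion in [0013] and the conservative fallback in [0014] can be sketched as follows. The sketch assumes that timestamp alignment and coordinate unification have already been done upstream, that each view carries a separate `consistency` score and `confidence` value, and that the second preset threshold is 0.3; the field names and the threshold value are hypothetical.

```python
def fuse_views(views, min_confidence=0.3):
    """Fuse aligned observations of one worker from several AR terminals.

    views: list of dicts with keys
        'xy'          -> (x, y) position in the unified coordinate system
        'consistency' -> non-negative consistency score used as the fusion weight
        'confidence'  -> observation confidence of that view
    Returns (fused_xy or None, conservative_mode).
    """
    # Conservative mode: every view is below the (hypothetical) second threshold.
    conservative = all(v["confidence"] < min_confidence for v in views)
    total = sum(v["consistency"] for v in views)
    if total <= 0:
        return None, True  # nothing usable to fuse
    x = sum(v["consistency"] * v["xy"][0] for v in views) / total
    y = sum(v["consistency"] * v["xy"][1] for v in views) / total
    return (x, y), conservative
```

A caller could use the returned flag to lower the warning sensitivity or to emit only low-priority prompts.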

[0015] Preferably, generating the conflict warning instruction specifically involves generating warning prompts of different levels according to the probability range in which the operational intention conflict probability falls. The levels include at least non-intrusive information prompts, voice and visual prompts containing suggested actions, and immediate blocking alarms that trigger linkage control.
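The graded warning in [0015] amounts to mapping the conflict probability onto ranges. In the sketch below, the three boundaries (0.5, 0.7, 0.9) and the level names are placeholders chosen for illustration; the disclosure does not specify the probability ranges.

```python
def warning_level(conflict_probability, info=0.5, advise=0.7, block=0.9):
    """Map a conflict probability to a warning level (thresholds are hypothetical)."""
    if conflict_probability > block:
        return "blocking_alarm"        # immediate alarm that triggers linkage control
    if conflict_probability > advise:
        return "voice_visual_prompt"   # prompt containing a suggested action
    if conflict_probability > info:
        return "info_prompt"           # non-intrusive information prompt
    return None                        # below the preset threshold: no warning
```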

[0016] In another aspect, this application also provides a collaborative alarm system for operational risks in multi-person work scenarios based on AR terminals, including: a perception data acquisition module, used to collect in real time, through AR terminals worn by multiple operators, one or more types of perception data including video frame streams, relative device pose, and timestamps; a node feature vector generation module, used to generate at the edge computing node, based on the perception data, a node feature vector for each worker that includes one or more of a human keypoint embedding, an object interaction vector, and a velocity vector; a multi-person temporal behavior graph construction module, used to construct the multi-person temporal behavior graph by using the node feature vector of each worker at each time step within a preset time window as a node and constructing the edges between the nodes based on the spatial relationship between workers and the relationship between adjacent time steps of the same worker; an operation intention conflict probability output module, used to input the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model, which performs joint reasoning on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism and simultaneously outputs the behavioral intention classification probability of each operator for a future time period and the operational intention conflict probability between any two operators; and an early warning module, used to generate a corresponding conflict warning instruction when the operational intention conflict probability exceeds a preset threshold and send it to the corresponding AR terminal as a prompt, thereby providing early warning before the conflict occurs.

[0017] This disclosure also provides an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to execute the collaborative alarm method for operational risks in a multi-person operation scenario based on an AR terminal as described above.

[0018] In another aspect, this disclosure provides a computer-readable storage medium having stored thereon computer program instructions that can be executed by a processor to implement the collaborative alarm method for operational risks in a multi-person operation scenario based on an AR terminal, as described above.

[0019] In another aspect, the present disclosure provides a computer program product, including a computer program which, when executed by a processor, implements the collaborative alarm method for operational risks in a multi-person operation scenario based on an AR terminal as described above.

[0020] The present invention collects multi-dimensional perception data through AR terminals, constructs a multi-person temporal behavior graph in combination with edge computing, uses a spatiotemporal graph transformer model to predict behavioral intentions and operational conflicts in advance, uses multi-view occlusion compensation to ensure data reliability, and then generates graded warning prompts according to the conflict probability while supporting system linkage control. Compared with traditional solutions, it can issue accurate warnings before conflicts occur, effectively improve operational safety, help avoid equipment failures and casualties, and adapt to the needs of different operation scenarios, offering good practicability and reliability.

Brief Description of the Drawings

[0021] In order to more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the following will briefly introduce the drawings required for use in the description of the specific embodiments or the prior art. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.

[0022] Figure 1 is a flowchart of a collaborative alarm method for operational risks in a multi-person operation scenario based on an AR terminal provided by an embodiment of the present invention; Figure 2 is a schematic structural diagram of a spatiotemporal graph transformer model provided by an embodiment of the present invention; Figure 3 is a schematic diagram of an occlusion compensation process provided by an embodiment of the present invention; Figure 4 is a schematic diagram of a collaborative alarm system for operational risks in a multi-person operation scenario based on an AR terminal provided by an embodiment of the present invention; Figure 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description of the Embodiments

[0023] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.

[0024] It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.

[0025] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0026] It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0027] As shown in Figure 1, this embodiment of the invention discloses a collaborative alarm method 100 for operational risks in multi-person operation scenarios based on AR terminals, including the following steps: S1, collecting in real time, through AR terminals worn by multiple operators, one or more of the following perception data: video frame stream, relative device pose, and timestamp; S2, generating, at the edge computing node and based on the perception data, a node feature vector for each worker that includes one or more of the following: human keypoint embedding, object interaction vector, and velocity vector; S3, using the node feature vector of each worker at each time step within a preset time window as a node, constructing the edges between the nodes based on the spatial relationship between workers and the relationship between adjacent time steps of the same worker, and thereby constructing a multi-person temporal behavior graph; S4, inputting the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model, which performs joint reasoning on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism and simultaneously outputs the behavioral intention classification probability of each operator for a future time period and the operational intention conflict probability between any two operators; S5, when the operational intention conflict probability exceeds a preset threshold, generating a corresponding conflict warning instruction and sending it to the corresponding AR terminal as a prompt, thereby providing a warning before the conflict occurs.

[0028] In one embodiment, for step S1, for multi-person collaborative operation scenarios, such as live-line power operation, one AR terminal is configured for each operator, and each AR terminal is pre-bound to the corresponding operator's identity information and assigned a unique terminal ID to ensure one-to-one correspondence between the terminal and the operator, so as to realize the subsequent association and traceability of perception data with the operator's identity.

[0029] Specifically, for video frame stream acquisition, the AR terminal is equipped with an RGB image sensor to acquire video frame streams from the work site in real time. The acquisition resolution can be adjusted according to the clarity requirements of the actual work scene, and the frame rate is controlled at 10 to 15 fps, which meets the timeliness requirements of motion capture while avoiding the transmission and processing pressure caused by excessive data volume. During acquisition, each video frame carries a timestamp of its acquisition time to ensure the time synchronization of data from multiple terminals.

[0030] For relative pose acquisition, the AR terminal has a built-in Inertial Measurement Unit (IMU) and a visual odometry module. The IMU acquires motion parameters such as acceleration and angular velocity in real time. Combined with the visual odometry's identification and matching of environmental feature points, the relative pose data of the AR terminal relative to the working equipment is calculated. This includes the relative distance and relative angle between the AR terminal and the working equipment. The sampling frequency of this relative pose data is consistent with the video frame stream acquisition frequency, and each set of relative pose data is associated with a corresponding timestamp and terminal ID. The relative angle includes azimuth and pitch angles.

[0031] The AR terminal has a built-in clock module that generates a unique timestamp for each acquisition operation of video frame streams and device relative pose. At the same time, the terminal periodically synchronizes its time with edge computing nodes to ensure that the timestamps generated by different AR terminals are on the same time base, avoiding errors in subsequent data fusion and map construction due to time deviations.

[0032] Optionally, the AR terminal can perform preliminary preprocessing on the acquired video frame stream, including image noise reduction and image cropping; remove outliers from the device relative pose data, and smooth data fluctuations through sliding window filtering to ensure the stability of the pose data; after preprocessing, associate and encapsulate the video frame stream, device relative pose data, and corresponding timestamps, terminal IDs, and bound personnel identity information to form structured perception data units.

[0033] Specifically, the AR terminal can use a Gaussian filtering algorithm to remove environmental noise, and can crop away meaningless background areas so that only the effective area containing workers and equipment is retained.
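The sliding window filtering of pose data mentioned in [0032] could, for instance, be a trailing moving average over the most recent samples. The window length of 5 and the function name are assumptions; the disclosure does not specify the filter type or its parameters.

```python
from collections import deque

def sliding_mean(samples, window=5):
    """Smooth a sequence of scalar pose readings with a trailing moving average."""
    buffer = deque(maxlen=window)  # keeps only the most recent `window` samples
    smoothed = []
    for value in samples:
        buffer.append(value)
        smoothed.append(sum(buffer) / len(buffer))
    return smoothed
```

For example, `sliding_mean(distances, window=5)` would smooth a stream of relative distance readings one sample at a time, so it can also be applied incrementally on the terminal.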

[0034] In some embodiments, for step S2, the edge computing node receives structured perception data units from each AR terminal via a wireless communication module. The perception data units include video frame streams, device relative poses, timestamps, terminal IDs, and bound personnel identity information.

[0035] Edge computing nodes invoke a pre-trained human keypoint detection model to process the pre-processed video frame stream frame by frame to extract human keypoints. The human keypoint detection model can be a lightweight model based on HRNet.

[0036] Specifically, for each frame of an image, the coordinates of 17 core human body key points are detected and output, and the coordinates are represented in the image pixel coordinate system.

[0037] The human body key points include the head, neck, shoulders, elbows, wrists, hips, knees, and ankles.

[0038] The image pixel coordinate system has its origin at the top left corner.

[0039] The coordinates of the detected key points are normalized: the x and y coordinates of each key point are divided by the width and height of the video frame image, respectively, and converted into normalized values in the range [0,1] to eliminate the influence of different image resolutions on key point positions. A detection confidence score is also calculated for each key point. If the confidence score of a key point is below 0.6, the key point is marked as needing compensation, which is supplied later through multi-view fusion. This confidence threshold can be adjusted according to the needs of the scenario.
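The normalization and low-confidence marking in [0039] can be sketched as follows. The tuple layout `(x_px, y_px, confidence)` and the output dictionary keys are assumptions for illustration; the 0.6 threshold is the value given above.

```python
def normalize_keypoints(keypoints, width, height, min_confidence=0.6):
    """Normalize pixel keypoints to [0, 1] and flag low-confidence detections.

    keypoints: list of (x_px, y_px, confidence) tuples in image pixel coordinates.
    Returns a list of dicts, one per keypoint.
    """
    result = []
    for x_px, y_px, confidence in keypoints:
        result.append({
            "x": x_px / width,    # normalized horizontal position
            "y": y_px / height,   # normalized vertical position
            "confidence": confidence,
            # Marked for later compensation through multi-view fusion.
            "needs_compensation": confidence < min_confidence,
        })
    return result
```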

[0040] The normalized coordinates of human key points and the corresponding detection confidence scores are combined to form a feature matrix, which is then input into a small convolutional neural network (CNN) or a multi-layer perceptron (MLP) to achieve human key point embedding.

[0041] Specifically, the CNN contains two convolutional layers with a kernel size of 3×3 and 32 and 64 output channels, respectively; alternatively, the MLP contains two fully connected layers with 128 and 64 neurons, respectively, and uses ReLU as the activation function.

[0042] Through the above network mapping, the high-dimensional keypoint features are compressed into a 64-dimensional dense vector, which is the human keypoint embedding. This embedding vector retains the core features of human posture while reducing the computational complexity of subsequent operations.
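The MLP variant of the embedding in [0040] to [0042] can be illustrated with a forward pass in plain Python. The sketch assumes the input is the 17 key points flattened to 17 × 3 = 51 values (x, y, confidence), applies ReLU after both layers, and uses untrained random weights purely to show the shapes; a deployed model would use trained parameters.

```python
import random

def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with random (untrained) weights."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense_relu(inputs, layer):
    """Fully connected layer followed by ReLU."""
    weights, biases = layer
    return [max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def embed_keypoints(keypoints, layers):
    """Map 17 (x, y, confidence) key points to a 64-dimensional embedding."""
    flat = [value for keypoint in keypoints for value in keypoint]
    hidden = dense_relu(flat, layers[0])   # 51 -> 128
    return dense_relu(hidden, layers[1])   # 128 -> 64

rng = random.Random(0)
layers = [make_layer(17 * 3, 128, rng), make_layer(128, 64, rng)]
```

Calling `embed_keypoints(keypoints, layers)` on one worker's normalized key points yields the 64-dimensional vector that later becomes part of the node feature.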

[0043] Then, the object interaction vector can be obtained through interactive device identification and localization followed by object interaction vector assembly.

[0044] Specifically, the edge computing node can perform target detection on the working equipment in the video frame stream, output the equipment category, and assign a unique one-hot code to each type of equipment, with the coding dimension set according to the number of equipment types on site. Then, combined with the device relative pose data transmitted by the AR terminal, the relative distance and relative direction angle between the operator and the target equipment are extracted and normalized.

[0045] For example, the relative distance between the operator and the target equipment can be taken as the distance between the center point of the chest in the human body and the geometric center of the target equipment.

[0046] For example, the relative direction angle is taken as 0° when the target equipment is directly in front of the person.

[0047] Then, the device's one-hot encoding can be concatenated with the normalized relative distance and relative direction angle to form an object interaction vector.

[0048] The object interaction vector has a dimension of "number of device categories + 2". For example, 10 device categories correspond to a 12-dimensional vector. Each device category is assigned its own dimension through one-hot encoding, giving 10 dimensions for 10 device categories; a value of 1 in a dimension indicates that the current interaction is with that type of device, and the other dimensions are 0. The remaining two dimensions correspond to the relative distance between the person and the device and to either the contact state or the relative direction angle between the person and the device. This ensures that the vector can fully represent the interaction state between the person and the device. Either the contact state or the relative direction angle can be selected according to the scenario requirements; the core purpose is to quantify the spatial relationship.

[0049] If, based on the target detection results and relative pose, the operator does not interact with any equipment (e.g., the relative distance between the operator and the equipment is greater than 2 meters and there is no contact action with the equipment), then the one-hot encoding in the object interaction vector is all 0, and the relative distance and relative direction angle are set to 0.
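The assembly of the object interaction vector in [0047] to [0049] can be sketched as follows. The normalization scheme (distance divided by a 2-meter range and clipped to [0, 1], angle divided by 180°) and the convention of passing `category=None` for "no interaction" are assumptions; the disclosure states only that the two quantities are normalized.

```python
def object_interaction_vector(category, num_categories,
                              distance_m=0.0, angle_deg=0.0, max_distance_m=2.0):
    """Build the (num_categories + 2)-dimensional object interaction vector.

    category: index of the device type being interacted with, or None when the
    worker is not interacting with any device.
    """
    one_hot = [0.0] * num_categories
    if category is None:
        # No interaction: all-zero one-hot code, distance and angle set to 0.
        return one_hot + [0.0, 0.0]
    one_hot[category] = 1.0
    norm_distance = min(max(distance_m / max_distance_m, 0.0), 1.0)  # assumed scaling
    norm_angle = angle_deg / 180.0                                   # assumed scaling
    return one_hot + [norm_distance, norm_angle]
```

With 10 device categories the result is the 12-dimensional vector from the example in [0048].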

[0050] The generation of the velocity vector includes displacement calculation and velocity calculation.

[0051] For displacement calculation, the hip key point among the worker's body key points can be used as the motion reference point. Based on the time-synchronized continuous video frames, the normalized coordinates (x1, y1) and (x2, y2) of the hip key point of the same worker in two adjacent frames are extracted. Using the device relative pose calibration provided by the AR terminal, the conversion relationship between pixel coordinates and physical coordinates is established, and the normalized coordinates are mapped back to the actual physical coordinate system to obtain the physical coordinates (X1, Y1) and (X2, Y2). The displacement between adjacent frames is calculated as ΔX = X2 - X1, ΔY = Y2 - Y1.

[0052] For velocity calculation, the velocity is calculated from the displacement and the time interval Δt between the two frames. Specifically, the two components of the velocity are the horizontal velocity v_x = ΔX / Δt and the vertical velocity v_y = ΔY / Δt, in meters per second (m/s). The velocity vector is clipped so that each component lies in the range [-5, 5] m/s, which covers the normal movement speed range of power field workers and avoids abnormal velocity values caused by detection errors.

[0053] The velocity vector is a 2-dimensional vector used to represent the direction and magnitude of the worker's movement.
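The displacement and velocity calculation in [0051] to [0053] reduces to a few lines. The sketch assumes the hip positions have already been mapped to physical coordinates in meters; the function name is hypothetical.

```python
def velocity_vector(prev_xy, curr_xy, dt, limit=5.0):
    """Velocity (v_x, v_y) in m/s between two adjacent frames, clipped to [-limit, limit].

    prev_xy, curr_xy: hip positions (X1, Y1) and (X2, Y2) in meters.
    dt: time interval between the two frames in seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    v_x = (curr_xy[0] - prev_xy[0]) / dt   # ΔX / Δt
    v_y = (curr_xy[1] - prev_xy[1]) / dt   # ΔY / Δt

    def clip(value):
        return max(-limit, min(limit, value))

    return (clip(v_x), clip(v_y))
```

At 12 fps the interval would be `dt = 1 / 12` seconds, so a 0.1 m shift between frames corresponds to about 1.2 m/s.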

[0054] The generated 64-dimensional human keypoint embedding, the "number of device categories + 2"-dimensional object interaction vector, and the 2-dimensional velocity vector are concatenated with additional fixed features and a terminal confidence score to form the complete node feature vector.

[0055] Optionally, the fixed feature is the operator's role or identity code, with 16 dimensions, predefined based on the personnel identity information bound to the terminal; the terminal confidence score is a 1-dimensional scalar taken from the local confidence score transmitted by the AR terminal. In this case, the total dimension of the node feature vector is 64 + (number of device categories + 2) + 2 + 16 + 1, that is, the number of device categories plus 85. The node feature vector can be kept between 100 and 128 dimensions, balancing feature completeness and computational efficiency.
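The concatenation in [0054] and [0055] can be checked with a small helper. The argument order and the length checks are illustrative assumptions; the component sizes (64, number of device categories + 2, 2, 16, 1) are those stated above.

```python
def build_node_feature(keypoint_embedding, interaction_vector, velocity,
                       role_code, terminal_confidence):
    """Concatenate the per-worker components into one node feature vector."""
    if len(keypoint_embedding) != 64:
        raise ValueError("keypoint embedding must have 64 dimensions")
    if len(velocity) != 2:
        raise ValueError("velocity must have 2 dimensions")
    if len(role_code) != 16:
        raise ValueError("role code must have 16 dimensions")
    return (list(keypoint_embedding) + list(interaction_vector)
            + list(velocity) + list(role_code) + [float(terminal_confidence)])
```

With 10 device categories the interaction vector has 12 dimensions and the node feature has 64 + 12 + 2 + 16 + 1 = 95 dimensions; reaching the 100 to 128 range mentioned above corresponds to 15 to 43 device categories.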

[0056] Each node's feature vector is associated with a unique worker ID and time step identifier. The time step is determined by the edge computing node's temporal caching strategy to ensure that the vector accurately corresponds to the state of a specific worker at a specific moment. This temporal caching strategy is matched with the time window of the subsequent multi-person temporal behavior graph.

[0057] In some embodiments, for step S3, the edge computing node can pre-configure the time window W of the timing graph, with a value range of 3 to 5 seconds. The specific time window is adjusted according to the action response requirements of the operation scenario. For example, 5 seconds is used for power equipment maintenance scenarios to ensure coverage of the complete operation process; 3 seconds is used for simple closing scenarios to improve real-time performance.

[0058] Then, the number of time steps T within the time window can be calculated from the AR terminal's video frame rate using the formula T = W × frame rate. For example, when W = 4 seconds and the frame rate is 12 fps, T = 48, meaning that each time window contains 48 consecutive time steps.

[0059] In one embodiment, an edge computing node can perform the following spatial relationship calculation for each time step t, for all valid nodes corresponding to the operators at that time step, in order to construct a spatial edge.

[0060] For spatial relationships between people, the center point of the chest among the key points of the worker's body can be selected as the spatial position reference point. Based on the coordinate data in the physical coordinate system in step S2, the relative distance d and relative direction angle θ between any two workers p1 and p2 at time step t are calculated. The relative distance d is calculated using the Euclidean distance formula, and the relative direction angle θ can be calculated using the arctangent function, i.e., θ=arctan[(Y4-Y3) / (X4-X3)], where (X3,Y3) are the coordinates of the center point of p1's chest, and (X4,Y4) are the coordinates of the center point of p2's chest.

[0061] Based on the above spatial relationship parameters, spatial edges between corresponding nodes can be constructed and edge attributes can be assigned.

[0062] Specifically, for any two workers p1 and p2, a spatial edge E_space(p1,p2,t) is constructed at time step t. The edge attributes include relative distance d and relative direction angle θ. If the relative distance d is greater than 5 meters, it is determined that there is no direct spatial relationship and the spatial edge is not constructed.
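The spatial edge construction in [0060] to [0062] can be sketched as follows. Note one deliberate substitution: the sketch uses `math.atan2` instead of the plain arctangent of the quotient, so that the angle keeps its quadrant and the case X4 = X3 does not divide by zero. The edge dictionary layout is an assumption.

```python
import math
from itertools import combinations

def build_spatial_edges(positions, t, max_distance_m=5.0):
    """Build spatial edges E_space(p1, p2, t) for all worker pairs at time step t.

    positions: dict mapping worker id -> (X, Y) chest-center position in meters.
    Pairs farther apart than max_distance_m get no edge.
    """
    edges = []
    for p1, p2 in combinations(sorted(positions), 2):
        x3, y3 = positions[p1]
        x4, y4 = positions[p2]
        distance = math.hypot(x4 - x3, y4 - y3)          # Euclidean distance d
        if distance > max_distance_m:
            continue                                     # no direct spatial relationship
        angle = math.degrees(math.atan2(y4 - y3, x4 - x3))  # relative direction angle θ
        edges.append({"nodes": ((p1, t), (p2, t)),
                      "distance": distance,
                      "angle_deg": angle})
    return edges
```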

[0063] In one embodiment, for temporal edge construction, for each worker p, the valid nodes N(p,t) and N(p,t+1) at consecutive time steps t and t+1 within the preset time window can be extracted.

[0064] Specifically, the cosine similarity (cosα) of the velocity vectors in N(p,t) and N(p,t+1) can be calculated. If cosα is greater than or equal to 0.7, it indicates that the velocity direction is continuous, and the action is considered to be continuous. Then, the Euclidean distance (dist) between the human keypoint embeddings in N(p,t) and N(p,t+1) can be calculated. If dist is less than or equal to 0.5, it indicates that the change in human posture is small, and the action is considered to be continuous.

[0065] It should be understood that the above judgment thresholds can be dynamically adjusted based on the judgment results.

[0066] If worker p meets the action continuity conditions at time steps t and t+1, a temporal edge E_time(p,t,t+1) is constructed, with edge attributes including the velocity cosine similarity cosα and the keypoint embedding distance dist; if the conditions are not met, the temporal edge is not constructed, to avoid misrepresenting the action continuity relationship.
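The continuity test for temporal edges in [0064] to [0066] can be sketched as follows. Two points are assumptions rather than statements of the disclosure: both conditions (cosα ≥ 0.7 and dist ≤ 0.5) are required together, and a worker who is stationary in both frames is treated as continuous because the cosine similarity is undefined for zero vectors.

```python
import math

def temporal_edge(worker, t, node_a, node_b, cos_min=0.7, dist_max=0.5):
    """Return the temporal edge E_time(worker, t, t+1), or None if not continuous.

    node_a, node_b: dicts with 'velocity' (v_x, v_y) and 'embedding' (list of floats)
    for time steps t and t+1.
    """
    va, vb = node_a["velocity"], node_b["velocity"]
    norm_a, norm_b = math.hypot(*va), math.hypot(*vb)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: stationary in both frames counts as continuous motion.
        cos_alpha = 1.0 if norm_a == norm_b else 0.0
    else:
        cos_alpha = (va[0] * vb[0] + va[1] * vb[1]) / (norm_a * norm_b)
    dist = math.dist(node_a["embedding"], node_b["embedding"])
    if cos_alpha >= cos_min and dist <= dist_max:
        return {"nodes": ((worker, t), (worker, t + 1)),
                "cos_alpha": cos_alpha, "dist": dist}
    return None
```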

[0067] In one embodiment, for the construction of interaction edges, edge computing nodes can detect interaction events between operators and equipment/tools based on the object interaction vectors from step S2 and motion analysis of the video frame stream.

[0068] Specifically, if the device one-hot encoding in the object interaction vector is not all zeros, and the relative distance d' is less than or equal to the contact threshold, and at the same time, actions such as a hand approaching the device operating part or a hand grasping the operating part are detected in the video frame stream, then it is determined to be a device operation interaction event; and whether the device operation interaction event has occurred is added as an additional attribute to the corresponding node.

[0069] If the relative distance d between two operators p1 and p2 is less than or equal to the interaction threshold, and collaborative actions such as "p1 points to the device, p2 observes the device" or "p1 passes a tool, p2 receives the tool" are detected in the video frame stream, then it can be determined as a personnel collaborative interaction event, such as collaborative operation guidance or collaborative tool transfer. This collaborative action can be determined using an action classification model and posture matching.

[0070] For personnel collaboration interaction events, an interaction edge E_interact(p1,p2,t) can be constructed, with the edge attributes including the interaction event type and the interaction confidence.

[0071] Then, all valid nodes, spatial edges, temporal edges, and interaction edges within the preset time window are integrated to form an initial multi-person temporal behavior graph G=(V,E), where V is the set of nodes and E is the set of edges. The graph is stored using an adjacency list data structure, where each node is associated with the indices and attributes of all its adjacent edges, facilitating rapid retrieval by subsequent models.

[0072] Optionally, edge computing nodes use a sliding window mechanism to update the temporal graph: when a new time step arrives, all nodes and associated edges of the earliest time step in the original time window are deleted, and the valid nodes corresponding to the new time step and newly constructed spatial edges, temporal edges, and interaction edges are added to form an updated multi-person temporal behavior graph G'. The update frequency is consistent with the time step interval to ensure that the graph can reflect changes in personnel behavior and interaction relationships at the work site in real time.
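
A rough sketch of the adjacency-list storage and the sliding-window update might look like the following. The class name, the `(worker_id, t)` node key, and the choice to store each edge in both directions are illustrative assumptions rather than details given in the text.

```python
from collections import defaultdict, deque

class SlidingBehaviorGraph:
    """Adjacency-list graph covering the most recent `window` time steps."""

    def __init__(self, window):
        self.window = window
        self.steps = deque()          # time steps currently inside the window
        self.nodes = {}               # (worker_id, t) -> node feature vector
        self.adj = defaultdict(dict)  # node -> {neighbour: edge attributes}

    def add_node(self, worker_id, t, features):
        self.nodes[(worker_id, t)] = features

    def add_edge(self, u, v, **attrs):
        # Stored in both directions so every node lists all of its edges.
        self.adj[u][v] = attrs
        self.adj[v][u] = attrs

    def advance(self, t):
        """Register new time step t; evict the earliest step once the window is full."""
        self.steps.append(t)
        if len(self.steps) > self.window:
            old = self.steps.popleft()
            expired = [n for n in self.nodes if n[1] == old]
            for n in expired:
                # Remove the expired node from each neighbour's list first.
                for nb in list(self.adj.get(n, {})):
                    self.adj[nb].pop(n, None)
                self.adj.pop(n, None)
                del self.nodes[n]
```

Calling `advance(t)` once per time step keeps the update frequency equal to the time step interval, as the paragraph above requires.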

[0073] In some embodiments, for step S4, firstly, the dynamically constructed multi-person temporal behavior graph is decomposed into a node matrix and an edge feature matrix, which are used as inputs to a pre-trained spatiotemporal graph transformer model.

[0074] Specifically, the node feature vectors of all valid nodes in the graph are extracted to form a node matrix $X \in \mathbb{R}^{N \times D}$, where N is the total number of nodes, given by the number of workers in the time window × the number of time steps, and D is the dimension of the feature vector of a single node. The node matrix is normalized so that all feature values are mapped to the interval [0,1], eliminating the influence of scale differences between feature dimensions on model inference.

[0075] The attribute information on the spatial edges, temporal edges, and interaction edges in the graph is then aggregated, and an edge feature matrix $E \in \mathbb{R}^{M \times K}$ is constructed according to edge type and connectivity, where M is the total number of edges and K is the attribute dimension of a single edge. At the same time, an adjacency matrix A is generated: if node i and node j are connected by an edge, then $A_{ij} = 1$; otherwise $A_{ij} = 0$. The adjacency matrix is used for subsequent adjacency mask generation.
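
As an illustration of the two preprocessing steps just described, the sketch below normalizes a node matrix and builds a 0/1 adjacency matrix. The text only says that feature values are mapped to [0,1]; per-column min-max scaling, mapping constant columns to 0, and treating edges as undirected are assumptions made here.

```python
def min_max_normalize(X):
    """Scale each column of the N x D node matrix X into [0, 1]."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        # A constant column has no spread, so it is mapped to 0.0 (assumption).
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in X
    ]

def adjacency_matrix(n, edges):
    """Build a symmetric N x N 0/1 matrix from (i, j) node-index pairs."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1
    return A
```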

[0076] In one embodiment, this application also provides the spatiotemporal graph transformer model shown in Figure 2. Figure 2 shows the structure of the model, which includes, in sequence, a node embedding layer for mapping multimodal node features to a unified dimension, a graph transformer block for modeling spatial relationships between multiple subjects, a temporal transformer block for modeling the temporal evolution of actions, a fusion layer for bidirectional cross-attention between the spatial and temporal representations, and an output layer for intent classification and conflict probability calculation.

[0077] Specifically, the node embedding layer receives the preprocessed node matrix X and maps the multimodal node features to a unified dimension through a fully connected neural network.

[0078] The node embedding layer consists of two fully connected layers. The first layer has an input dimension of D and an output dimension of 256, while the second layer has an input dimension of 256 and an output dimension of 128. Each layer is followed by a ReLU activation function and a LayerNorm layer to prevent gradient vanishing and accelerate model convergence. The second layer generates a node embedding matrix of uniform dimensions. This ensures that subsequent spatial and temporal attention mechanisms can be computed based on features of the same dimension.

[0079] The graph transformer block takes the node embedding matrix and the edge feature matrix E as input and captures the spatial relationships between multiple subjects through a multi-head self-attention mechanism.

[0080] Specifically, the graph transformer block generates an adjacency mask matrix M from the adjacency matrix A, with each mask entry set according to whether $A_{ij} = 1$ (nodes i and j are connected by an edge) or not. By incorporating the adjacency mask matrix into the self-attention computation, each node is forced to attend only to the neighboring nodes it is connected to by edges, reducing interference from irrelevant nodes and improving inference efficiency.

[0081] The model employs a 4-head self-attention mechanism, which splits the node embedding matrix into 4 submatrices according to the number of heads. For each submatrix, the attention weights of the corresponding head are calculated as follows. A query vector Q, a key vector K, and a value vector V are generated. The base attention score is then calculated as $QK^{\top}/\sqrt{d}$, where the square root of the feature dimension d is used to scale the score. Next, the edge features are used as relative position biases: the feature vector of the edge between node i and node j is mapped to a bias value by a small MLP, and this bias is added to the base attention score.

[0082] Then, the adjacency mask matrix M is applied to the biased scores, and the attention weights W are obtained by Softmax normalization. The attention output of each head is calculated from W and V, and the outputs of the four heads are concatenated and passed through a fully connected layer to obtain the spatial attention features.
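
The per-head computation can be sketched in a few lines. This is a single-head illustration only: the mask is assumed to be additive, with masked positions set to negative infinity before the Softmax (a common convention that the text does not spell out), each node is assumed to be allowed to attend to itself, and the edge biases are passed in as a precomputed matrix instead of being produced by an MLP.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def masked_attention(Q, K, V, bias, A):
    """Single-head attention restricted to graph neighbours.

    Q, K, V: N x d lists; bias: N x N edge-derived biases; A: N x N 0/1 adjacency.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if A[i][j] == 1 or i == j:  # assumption: self-attention is allowed
                base = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                scores.append(base + bias[i][j])
            else:
                scores.append(-math.inf)  # masked out before the softmax
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))])
    return out
```

A node with no neighbours simply reproduces its own value vector, which is why the self-attention assumption matters: without it, a fully masked row would have no finite score to normalize.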

[0083] Then, the spatial attention features are combined with the node embedding matrix through a residual connection. After passing through the LayerNorm layer and the MLP layer, the final spatial relationship features are output. The graph transformer block is executed twice to enhance the modeling of spatial relationships among multiple subjects.

[0084] The temporal transformer block applies a temporal self-attention mechanism to the node features of the same worker across time steps in order to capture the temporal evolution of actions.

[0085] Here, an absolute time encoding is generated for each time step t, and a relative time encoding is calculated for adjacent time steps to represent time interval information. The absolute time encoding uses sinusoidal position encoding, generated by sine and cosine functions of different frequencies, so that the model can distinguish differences between time steps.
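
One common way to realize such an encoding is the sinusoidal scheme from the original Transformer formulation, sketched below. The base of 10000, the interleaving of sine and cosine terms, and the default dimension of 128 (chosen to match the node embedding dimension) are assumptions; the text only states that sine and cosine functions of different frequencies are used.

```python
import math

def absolute_time_encoding(t, dim=128, base=10000.0):
    """Sinusoidal encoding of time step t: sine at even indices, cosine at odd ones."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))  # frequency decreases as i grows
        enc.append(math.sin(t * freq))
        enc.append(math.cos(t * freq))
    return enc[:dim]
```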

[0086] Then, the spatial relationship features are grouped by worker ID, with each group containing all node features of a single worker within the time window. 4-head temporal self-attention is performed on each group of features, specifically: 1. generate the query vectors, key vectors, and value vectors, and superimpose the absolute time encoding on them; 2. calculate the base temporal attention score, map the relative time encoding to a bias value through an MLP, and add the bias to the base score; 3. apply an upper triangular mask to prevent the model from attending to features of future time steps, obtain the temporal attention weights through Softmax normalization, and calculate the temporal attention output; 4. concatenate the outputs of the four heads and pass them through a fully connected layer to obtain the temporal features of a single worker. The temporal features of all workers are then aggregated into the overall temporal features.

[0087] Then, the temporal features are combined with the spatial relationship features through a residual connection and passed through the LayerNorm and MLP layers to output the final temporal evolution features. The temporal transformer block is executed twice to enhance the ability to capture the temporal patterns of actions.

[0088] The fusion layer uses a cross-attention mechanism to let the spatial relationship features and the temporal evolution features attend to each other, capturing dependencies across people and time.

[0089] First, the spatial relationship features serve as the query vector and the temporal evolution features serve as the key vector and value vector for single-head cross-attention: the cross-attention score is calculated, the cross-attention weights are obtained through Softmax normalization, and the cross-attention output is calculated. This allows the spatial features to attend to the temporal features.

[0090] Then, using the temporal features as the query vector and the spatial features as the key and value vectors, the above calculation is repeated to obtain the bidirectional cross-attention output.

[0091] Then, the bidirectional cross-attention output is residually connected with the spatial and temporal features. After passing through the LayerNorm layer and the MLP layer, the final fused features are output.

[0092] Finally, the output layer performs intent classification and conflict probability calculation.

[0093] Specifically, the fused features of an individual worker at a given time step are input to the intent prediction head, which contains a fully connected layer and a Softmax normalization function: the fully connected layer maps the 128-dimensional fused features to the intent category dimension, and the Softmax function normalizes the output values into a probability distribution, yielding the behavioral intent classification probability of each worker in the future time period, indexed by the worker ID p, the time step t, and the intent category c.

[0094] For any two workers, the fused features of the two individuals at the same time step are extracted and input to the conflict inference head, which is a small neural network that outputs a binary conflict probability.

[0095] The specific inference process may include the conflict inference head calculating the cosine similarity of the fused features of the two individuals and the difference between their intent classification probabilities. The cosine similarity and the intent classification probability difference are then concatenated into a primary association vector, which is input to a multilayer perceptron that may include multiple fully connected layers. The output of the multilayer perceptron is passed through a sigmoid function to map the result to the [0,1] interval, yielding the probability of conflict between the two workers' operational intentions.

[0096] For example, the multilayer perceptron includes three fully connected layers: the first fully connected layer has an input dimension of 2 and an output dimension of 16, followed by a ReLU activation function; the second fully connected layer has an input dimension of 16 and an output dimension of 8, followed by a ReLU activation function; and the third fully connected layer, which is also the output layer, has an input dimension of 8 and an output dimension of 1.
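
A forward pass through this conflict inference head can be sketched as follows. The weights here are random placeholders rather than trained parameters, and because the first layer has an input dimension of 2, the intent probability difference is reduced to a scalar; the mean absolute difference used for that reduction is an assumption, as are all helper names.

```python
import math
import random

def linear(x, W, b):
    """Fully connected layer: W is n_out x n_in, b has n_out entries."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a trained model would load real parameters."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else sum(a * b for a, b in zip(u, v)) / (nu * nv)

def conflict_probability(f1, f2, intent1, intent2, layers):
    """Map two workers' fused features and intent distributions to a value in (0, 1)."""
    cos = cosine(f1, f2)
    # Assumption: the intent difference is the mean absolute difference.
    diff = sum(abs(a - b) for a, b in zip(intent1, intent2)) / len(intent1)
    (W1, b1), (W2, b2), (W3, b3) = layers
    h = relu(linear([cos, diff], W1, b1))   # 2 -> 16
    h = relu(linear(h, W2, b2))             # 16 -> 8
    z = linear(h, W3, b3)[0]                # 8 -> 1
    return 1.0 / (1.0 + math.exp(-z))       # sigmoid
```

Because both inputs to the perceptron are symmetric in the two workers, the resulting probability does not depend on the order in which the pair is presented.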

[0097] The conflict probabilities of all personnel pairs form a conflict probability matrix. Let P be the total number of workers, and the matrix elements satisfy... .

[0098] In one embodiment, the model training phase is based on a labeled dataset, and the model parameters are optimized using a joint loss function. The labeled dataset includes videos of the work scenario, manually annotated intent categories, and conflict labels.

[0099] The joint loss function includes intention classification cross-entropy loss, conflict binary classification cross-entropy loss, and temporal consistency loss.

[0100] The intent classification cross-entropy loss supervises the accuracy of behavioral intent prediction. For a single prediction it takes the form $L_{intent} = -\sum_{c=1}^{C} w_c\, y_c \log \hat{p}_c$, where $y_c$ is the one-hot encoded intent category label (1 for the correct category and 0 for the rest), C is the number of intent categories, $\hat{p}_c$ is the predicted probability of intent category c, and $w_c$ is a class weight that assigns higher weights to intent categories with fewer samples, balancing the uneven class distribution.

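[0101] The conflict binary classification cross-entropy loss supervises the accuracy of operational intent conflict prediction. For each pair of personnel i and j it contributes the term $-\left[y_{ij}\log p_{ij} + (1-y_{ij})\log(1-p_{ij})\right]$, accumulated over the personnel pairs, where the conflict label $y_{ij} = 1$ indicates that a conflict will occur in the future and $y_{ij} = 0$ indicates that no conflict will occur, P is the total number of people, and $p_{ij}$ is the conflict probability between personnel i and j.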

[0102] The temporal consistency loss applies a smoothness constraint to the prediction results of adjacent time steps to avoid abrupt changes in prediction: $L_{temp} = \sum_{t=1}^{T-1} \lVert \hat{p}_{t+1} - \hat{p}_{t} \rVert_1$, where t ranges from 1 to T-1, T is the number of time steps, and $\hat{p}_t$ is the prediction at time step t. The prediction difference between adjacent time steps is thus penalized by an L1 loss.

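[0103] The joint loss function is the weighted sum $L = \lambda_1 L_{intent} + \lambda_2 L_{conflict} + \lambda_3 L_{temp}$, where $L_{intent}$, $L_{conflict}$, and $L_{temp}$ denote the intent classification, conflict binary classification, and temporal consistency losses, and, for example, $\lambda_1 = 1.0$, $\lambda_2 = 1.5$, and $\lambda_3 = 0.5$ are the loss weights. The model uses the Adam optimizer to minimize the total loss until the model's intent classification accuracy on the validation set is greater than or equal to 92% and the conflict prediction F1 score is greater than or equal to 90%, at which point pre-training is complete.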
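
The three loss terms and their combination can be illustrated numerically as below. This is a per-sample sketch under stated assumptions: the weights 1.0, 1.5, and 0.5 are assigned to the intent, conflict, and temporal terms in the order the losses are listed, the temporal term is averaged over the T-1 adjacent pairs, and probabilities are clipped at 1e-12 to keep the logarithms finite. None of these details is confirmed by the text.

```python
import math

def intent_ce(y_onehot, p, class_weights):
    """Class-weighted cross-entropy for one sample."""
    return -sum(w * y * math.log(max(q, 1e-12))
                for w, y, q in zip(class_weights, y_onehot, p))

def conflict_bce(y, p):
    """Binary cross-entropy for one personnel pair (y is 0 or 1)."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def temporal_l1(preds):
    """Mean L1 difference between predictions at adjacent time steps."""
    T = len(preds)
    total = 0.0
    for t in range(T - 1):
        total += sum(abs(a - b) for a, b in zip(preds[t + 1], preds[t]))
    return total / (T - 1)

def joint_loss(l_intent, l_conflict, l_temporal, w=(1.0, 1.5, 0.5)):
    """Weighted sum of the three loss terms."""
    return w[0] * l_intent + w[1] * l_conflict + w[2] * l_temporal
```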

[0104] Preferably, in one embodiment, the edge computing node can also receive multi-view observation data of the same worker from different AR terminals, perform timestamp alignment and coordinate system unification, calculate the consistency score of each view observation, and perform weighted fusion of the multi-view data based on the consistency score to compensate for occlusion or observation loss under single view.

[0105] Specifically, Figure 3 shows the occlusion compensation process. First, in S301, multi-view observation data is received and preprocessed.

[0106] The process begins by receiving observation data transmitted from all AR terminals covering the same work scenario. This observation data must be associated with unique identification information, including: terminal ID, bound worker ID, data collection timestamp, observation target, and original observation content. The observation target is the observed worker ID, and the original observation content includes the coordinates of key human body points, the relative pose of the equipment, and the terminal's local detection confidence level.

[0107] Then, invalid data with missing target identification or with local detection confidence below 0.3 are removed; valid data that target the same worker and whose collection timestamps fall within a preset time window are retained to form a multi-view observation dataset of m observations, where m is the number of viewpoints with valid observation data, i.e., the number of AR terminals transmitting valid data, and the k-th element is the observation data from the k-th viewpoint.

[0108] Then, time synchronization is performed based on the acquisition timestamps of the observation data from each viewpoint. Using the system clock of the edge computing node as the reference, the timestamp of the observation data from each viewpoint is obtained, and the deviation between each viewpoint timestamp and the reference timestamp is calculated. If the deviation exceeds the maximum allowable threshold, linear interpolation is used to correct the observation data of that viewpoint in time, generating observation data aligned with the reference timestamp. After alignment, the observation data from all viewpoints refer to the same worker at the same time step, with the time deviation controlled within ±50 ms, meeting the data synchronization requirements for spatiotemporal consistency judgment.
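
The alignment rule can be sketched for a single scalar observation channel as follows. The function name, the use of 50 ms as the maximum allowable deviation (the text gives ±50 ms only as the post-alignment bound), and the decision not to extrapolate outside the sampled range are assumptions.

```python
def align_to_reference(samples, t_ref, max_dev=0.05):
    """Return one viewpoint's observation value aligned to t_ref (seconds).

    samples: list of (timestamp, value) pairs sorted by timestamp.
    The nearest sample is used directly if it lies within max_dev of t_ref;
    otherwise the value is linearly interpolated between the bracketing samples.
    """
    nearest = min(samples, key=lambda s: abs(s[0] - t_ref))
    if abs(nearest[0] - t_ref) <= max_dev:
        return nearest[1]
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t_ref <= t1:
            a = (t_ref - t0) / (t1 - t0)
            return v0 + a * (v1 - v0)
    return None  # t_ref lies outside the sampled range; no extrapolation
```

For keypoint coordinates, the same interpolation would be applied to each coordinate independently.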

[0109] Then, based on the relative pose data of the devices transmitted by the AR terminal, the observation data from different perspectives are mapped to a unified physical coordinate system. Specifically, a three-dimensional Cartesian coordinate system is established with the preset fixed equipment reference point in the work scene as the origin, the X-axis along the direction of equipment arrangement, the Y-axis perpendicular to the X-axis, and the Z-axis perpendicular to the ground.

[0110] The pose parameters of each AR terminal are extracted, including the terminal's position coordinates in the unified coordinate system and its three pose angles (roll angle, pitch angle, and yaw angle). For the human keypoint coordinates in the k-th viewpoint observation data, a rigid transformation based on the terminal intrinsic parameter matrix and the pose parameters converts them into three-dimensional coordinates in the terminal coordinate system; these coordinates are then superimposed with the terminal position coordinates to obtain the keypoint coordinates in the unified physical coordinate system.

[0111] Then, the same coordinate transformation is performed on the relative pose data of the device to ensure that the key points of the human body and the position of the device are in the same coordinate system under multiple perspectives, thus eliminating the positional deviation caused by the difference in perspective.

[0112] Second, in S302, the consistency scores are calculated. For the multi-view observation data after coordinate system unification, the consistency score of each viewpoint is calculated in the spatiotemporal dimension to quantify the credibility of the observation data. Specifically, this includes calculating and fusing scores in the following three dimensions.

[0113] 1. Historical trajectory consistency score.

[0114] Based on the historical movement trajectories of the workers, the consistency between the current observation data and the trajectory trend can be determined.

[0115] Specifically, historical observation data of the worker from the previous five consecutive time steps can be extracted, and the historical trajectory can be fitted using linear fitting or a Kalman filter prediction model to obtain the predicted keypoint coordinates for the current time step. The Euclidean distance between the keypoint coordinates currently observed from the k-th viewpoint and the predicted coordinates is calculated, and a sigmoid function, parameterized by a distance threshold, maps this distance to a consistency score in the range [0,1]. The smaller the distance, the higher the score, indicating stronger consistency with the historical trajectory. The distance threshold defaults to 0.5 meters and can be adjusted according to the worker's movement speed.

[0116] 2. Local occlusion indication score.

[0117] Based on the image features of the video frame stream, it is possible to determine whether there is local occlusion in the observation viewpoint, which affects the accuracy of the observation.

[0118] Specifically, for the video frame stream from the k-th viewpoint, an image segmentation algorithm is used to identify occluding objects within the frame. The area ratio of the overlapping region between the occluding object and the observed worker is calculated, that is, the ratio of the overlapping area to the area of the bounding rectangle of the worker within the frame. The local occlusion indication score is then calculated from this ratio, with a score range of [0,1]. The lower the overlap ratio, the higher the score, indicating that the viewpoint is less affected by occlusion and its data is more reliable.

[0119] 3. Detection confidence score.

[0120] The local detection confidence output by the AR terminal can be standardized and used directly. The local detection confidence is extracted from the observation data of the k-th viewpoint, with an original range of [0,1]. If the value is less than 0.3, it is corrected to 0.3 to avoid excessive influence from extremely low confidence data; if the value is greater than 0.9, it is adjusted to 0.9 to avoid the weights concentrating excessively on data with extremely high confidence.

[0121] Then, the corrected confidence is normalized to obtain the detection confidence score, which ranges over [0.3, 0.9] and directly reflects the terminal detection algorithm's assessment of the reliability of the observation data.

[0122] Finally, a weighted average combines the scores from the three dimensions into the overall consistency score for the k-th viewpoint. The weighting coefficients are, for example, 0.4, 0.3, and 0.3, and can be optimized on a validation set to prioritize the influence of the historical trajectory and the detection confidence. The overall score range is [0,1]; the higher the score, the more reliable the observation data from that viewpoint.
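
The three scores and their weighted combination might be computed as in the sketch below. Several choices are hypothetical and made only so the example runs: the sigmoid-shaped mapping `2 / (1 + exp(d / d0))`, the occlusion score taken as one minus the overlap ratio, and the assignment of the weights 0.4, 0.3, and 0.3 to the history, occlusion, and confidence scores in that order. The clamping range [0.3, 0.9] and the 0.5 m default threshold come from the text.

```python
import math

def history_score(distance, d0=0.5):
    """Hypothetical sigmoid-shaped mapping: 1.0 at distance 0, falling toward 0."""
    return 2.0 / (1.0 + math.exp(distance / d0))

def occlusion_score(overlap_ratio):
    """Assumed form: a lower overlap ratio gives a higher score."""
    return 1.0 - overlap_ratio

def confidence_score(conf):
    """Clamp the terminal's local detection confidence to [0.3, 0.9]."""
    return min(max(conf, 0.3), 0.9)

def consistency_score(distance, overlap_ratio, conf, weights=(0.4, 0.3, 0.3)):
    """Weighted average of the three per-viewpoint scores."""
    parts = (history_score(distance),
             occlusion_score(overlap_ratio),
             confidence_score(conf))
    return sum(w * s for w, s in zip(weights, parts))
```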

[0123] Finally, in S303, multi-view data weighted fusion and occlusion compensation are performed. Based on the overall consistency score of each viewpoint, the fusion weight of each viewpoint is calculated so that trusted data receives a higher weight.

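[0124] For example, if the overall scores of all viewpoints are 0, an equal weight of 1/m is assigned to each of the m viewpoints. Otherwise, normalized weights are used, in which each viewpoint's weight is its overall score divided by the sum of the overall scores of all viewpoints. The weight range is [0,1].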

[0125] For multi-view observation data from the same operator at the same time step, weighted fusion is performed according to weights to generate complete observation data after compensation.

[0126] Specifically, for each core human keypoint, a weighted average of the coordinates from the viewpoints is calculated and used as the fused keypoint coordinates (x, y, z). The relative distance and relative angle between the workers and the equipment are then fused using the same weighted average method, and the confidence level of the fused observation data is determined, reflecting the overall credibility of the fused data.
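
The weight normalization and the per-keypoint weighted average can be sketched as follows; the function names and the tuple layout of the coordinates are illustrative assumptions.

```python
def fusion_weights(scores):
    """Normalize consistency scores into fusion weights that sum to 1."""
    total = sum(scores)
    m = len(scores)
    if total == 0:
        return [1.0 / m] * m  # all viewpoints equally (un)trusted
    return [s / total for s in scores]

def fuse_keypoint(coords, weights):
    """Weighted average of one keypoint's (x, y, z) across viewpoints."""
    return tuple(sum(w * c[axis] for w, c in zip(weights, coords))
                 for axis in range(3))
```

For example, with scores 0.75 and 0.25, a keypoint seen at (0, 0, 0) from the first viewpoint and at (4, 8, 0) from the second fuses to (1, 2, 0), so the more trusted viewpoint dominates the result.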

[0127] To further eliminate fluctuation noise in the fused data, a Kalman filter algorithm is used to perform temporal smoothing on the fused observation data. Preferably, using the fused keypoint coordinates and relative device pose as observation values, the state equation and observation equation of the Kalman filter are established, with the state vector including position and velocity parameters. Based on the filtering results of the previous time step, the fused data of the current time step is predicted and updated, and the smoothed observation data is output to ensure the continuity of the data in the temporal dimension and avoid abrupt changes caused by single-frame observation errors.
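
A minimal one-dimensional version of such a filter, with a state vector of position and velocity and a position-only observation, is sketched below. The constant-velocity motion model, the diagonal process noise, and the noise values `q` and `r` are assumptions chosen for illustration; a real deployment would run one filter per coordinate (or a joint multi-dimensional filter) with tuned parameters.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate of the fused data."""

    def __init__(self, pos, vel=0.0, p=1.0, q=1e-3, r=1e-2):
        self.x = [pos, vel]               # state: position and velocity
        self.P = [[p, 0.0], [0.0, p]]     # state covariance
        self.q, self.r = q, r             # process / measurement noise (assumed)

    def step(self, z, dt=1.0):
        """Predict one time step ahead, then update with measurement z."""
        # Predict: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = self.x[0] + dt * self.x[1]
        vel = self.x[1]
        P = self.P
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        # Update: only the position is observed, so H = [1, 0].
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        y = z - pos
        self.x = [pos + k0 * y, vel + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]
```

A single outlier measurement moves the smoothed position only part of the way toward the outlier, which is the behavior the paragraph above relies on to suppress single-frame observation errors.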

[0128] Preferably, when the overall consistency scores of all viewpoints are below the second preset threshold, the situation is determined to be a low-confidence scenario across all viewpoints, and the edge computing node triggers the conservative prediction mode.

[0129] Among them, edge computing nodes can lower the original conflict warning threshold by 20% to ensure that potential conflicts are not missed; at the same time, they only generate low-priority information prompts and do not trigger voice prompts or linkage controls to avoid false warnings caused by unreliable data.

[0130] In some embodiments, for step S5, the edge computing node is pre-configured with a three-level conflict probability threshold, the threshold range of which is set based on historical operation conflict case data and security risk assessment results.

[0131] For example, the low-level warning threshold is set to the range of [0.5, 0.7), and the corresponding warning information is a non-intrusive information prompt, which is suitable for scenarios with low conflict risk and only require personnel attention; the medium-level warning threshold is set to the range of [0.7, 0.9), and the corresponding warning information is a suggested action prompt combining voice and vision, which is suitable for scenarios with medium conflict risk and require personnel to adjust operations; the high-level warning threshold is set to the range of [0.9, 1], and the corresponding warning information is an immediate blocking alarm, which is suitable for scenarios with extremely high conflict risk and require mandatory intervention.

[0132] It will be understood that all thresholds support dynamic adjustment to adapt to the safety requirements of different work scenarios.

[0133] The edge computing node receives the operation intent conflict probability matrix output by the spatiotemporal graph transformer model. The conflict probability matrix is traversed and all off-diagonal elements are extracted, excluding the invalid diagonal values that pair a worker with themselves. The extracted conflict probabilities are then compared one by one with the preset three-level thresholds to determine the conflict warning level of each personnel pair.
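
The traversal and threshold matching can be sketched as below, using the example ranges [0.5, 0.7), [0.7, 0.9), and [0.9, 1] and the level names "LOW", "MIDDLE", and "HIGH" that appear in the text. The sketch assumes the matrix is symmetric, so only the upper triangle is scanned and each pair is reported once; the function names are hypothetical.

```python
def warning_level(p):
    """Map a conflict probability to a warning level, or None below 0.5."""
    if p >= 0.9:
        return "HIGH"
    if p >= 0.7:
        return "MIDDLE"
    if p >= 0.5:
        return "LOW"
    return None

def collect_warnings(matrix):
    """Scan the upper triangle of a P x P conflict probability matrix."""
    n = len(matrix)
    out = []
    for i in range(n):
        for j in range(i + 1, n):  # skips the diagonal and duplicate pairs
            level = warning_level(matrix[i][j])
            if level is not None:
                out.append((i, j, level))
    return out
```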

[0134] For each pair of personnel matched with an alert level, their corresponding personnel identity information, current operation object, and location are associated. Based on the matched alert level, the edge computing node generates a structured conflict alert instruction, which includes core fields such as instruction identifier, alert level, conflict subject information, prompt content, display parameters, and validity period.

[0135] For example, the field configuration of a low-level warning instruction includes the following. The warning level is "LOW". The conflict subject information includes A (ID: 001, operation object: maintenance compartment 2) and B (ID: 002, operation object: switch 2). The prompt content is "Potential conflict risk: A is undergoing maintenance, B is approaching the switch, please coordinate". The display parameters set the visual prompt type to border highlight, which highlights the real-time position frames of the conflicting parties in the AR terminal's field of view, with the color set to yellow, the display duration set to 3 seconds, and voice prompts disabled. The validity period is set to the current time step + 2 seconds, after which the instruction is automatically cleared.

[0136] The field configuration of a medium-level warning instruction includes the following. The warning level is "MIDDLE". The conflict subject information additionally includes the relative distance between the conflicting parties. The prompt content includes text and a voice script suggesting actions, such as "Conflict risk: Party A is undergoing maintenance, Party B is preparing to close the circuit breaker. It is recommended that Party B suspend its operation and confirm Party A's location." The voice script uses a standard text-to-speech format, and the speech rate and volume can be adjusted by the user. The display parameters set the visual prompt type to a combination of a directional arrow and a text pop-up: the arrow points to the location of the conflicting party and is colored orange; the pop-up is located in the lower right corner of the AR field of view and occupies 15% of it; the display duration is 5 seconds; and the voice prompt and visual prompt are triggered synchronously. The validity period is set to the current time step + 5 seconds; if the conflict probability is still higher than the threshold after the timeout, the instruction is regenerated.

[0137] The field configuration of a high-level warning instruction includes the following. The warning level is "HIGH". The conflict subject information additionally includes a conflict risk assessment, such as "It is estimated that a closing-maintenance conflict may occur within 3 seconds, resulting in equipment short circuits and electric shock to personnel". The prompt content is a strong warning in text and voice; an example text is "Emergency alarm: Closing is prohibited! A is currently performing maintenance inside the equipment, stop operation immediately!", and the voice script is played at fast speed and maximum volume with intermittent repetition. The visual prompt type is a full-screen flashing warning combined with a red highlighted box: the edge of the screen flashes red and the conflict location box flashes red, at a frequency of 2 flashes per second. The display duration is unlimited until the conflict is resolved or manually confirmed, and the instruction is automatically terminated when the conflict probability falls below 0.5.
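
One possible in-memory representation of such a structured instruction is sketched below. The class, field, and preset names are hypothetical; only the level names, colors, display durations, and validity periods are taken from the three examples above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConflictWarning:
    instruction_id: str
    level: str                    # "LOW", "MIDDLE" or "HIGH"
    subjects: list                # conflict subject information
    prompt: str                   # prompt content (text and/or voice script)
    display: dict                 # display parameters
    valid_for_s: Optional[float]  # None: valid until resolved or confirmed

# Display presets distilled from the three example configurations.
DISPLAY_PRESETS = {
    "LOW": {"visual": "border_highlight", "color": "yellow",
            "duration_s": 3, "voice": False},
    "MIDDLE": {"visual": "arrow_and_popup", "color": "orange",
               "duration_s": 5, "voice": True},
    "HIGH": {"visual": "fullscreen_flash", "color": "red",
             "duration_s": None, "voice": True},
}
VALIDITY_S = {"LOW": 2.0, "MIDDLE": 5.0, "HIGH": None}

def build_warning(instruction_id, level, subjects, prompt):
    """Assemble a warning instruction from the presets for the given level."""
    return ConflictWarning(instruction_id, level, subjects, prompt,
                           dict(DISPLAY_PRESETS[level]), VALIDITY_S[level])
```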

[0138] After generating the aforementioned conflict warning command, the edge computing node can send the conflict warning command to the AR terminal of the operator corresponding to the conflict subject, thereby providing early warning before the conflict occurs.

[0139] Figure 4 shows a collaborative alarm system 400 for operational risks in a multi-person work scenario based on an AR terminal. This system embodiment corresponds to the method embodiment illustrated in Figure 1 and specifically includes the following modules. The perception data acquisition module 401 is used to collect one or more types of perception data, including video frame streams, relative device poses, and timestamps, in real time through AR terminals deployed on multiple workers. The node feature vector generation module 402 is used to generate, at the edge computing node and based on the perception data, a node feature vector for each worker, which includes one or more of a human keypoint embedding, an object interaction vector, and a velocity vector. The multi-person temporal behavior graph construction module 403 is used to construct the multi-person temporal behavior graph by taking the node feature vector of each worker at each time step within a preset time window as a node and constructing the edges between the nodes based on the spatial relationships between workers, the spatial relationships between workers and objects, and the relationships between adjacent time steps of the same worker. The operation intent conflict probability output module 404 is used to input the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model; the model performs joint inference on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism, and synchronously outputs the behavioral intent classification probability of each worker in the future time period as well as the operation intent conflict probability between any two workers. The early warning module 405 is used to, when the operation intent conflict probability exceeds a preset threshold, generate a corresponding conflict warning instruction and send it to the corresponding AR terminal for prompting, thereby providing an early warning before the conflict occurs.

[0140] Those skilled in the art will clearly understand that the technical solutions of the embodiments of this application can be implemented by means of software and / or hardware. In this specification, "unit" and "module" refer to software and / or hardware that can independently complete or cooperate with other components to complete a specific function, wherein the hardware may be, for example, a field-programmable gate array (FPGA), an integrated circuit (IC), etc.

[0141] Each processing unit and / or module in the embodiments of this application can be implemented by an analog circuit that implements the functions described in the embodiments of this application, or by software that executes the functions described in the embodiments of this application.

[0142] Figure 5 shows a schematic diagram of the structure of an electronic device according to an embodiment of this application, which can be used to implement the method in the embodiment illustrated in Figure 1. As shown in Figure 5, the electronic device 500 may include at least one processor 501, at least one network interface 504, a user interface 503, a memory 505, and at least one communication bus 502. The communication bus 502 is used to enable connection and communication between the components. The user interface 503 may include buttons, and optionally a standard wired or wireless interface. The network interface 504 may include, but is not limited to, a Bluetooth module, an NFC module, a Wi-Fi module, etc.

[0143] The processor 501 may include one or more processing cores and connect to various parts within the device 500 via various interfaces and lines. It implements the various functions and data processing of the device 500 by running or executing instructions, programs, code sets, or instruction sets stored in the memory 505, and by accessing data in the memory 505. Optionally, the processor 501 may be implemented using at least one hardware form of DSP, FPGA, or PLA. The processor 501 may also integrate one or more combinations of CPU, GPU, and modem. The CPU is mainly used to handle the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content required for display on the screen; and the modem is used for wireless communication. It is understood that the modem may not be integrated into the processor 501, but may be implemented through a separate chip.

[0144] Memory 505 may include random access memory (RAM) or read-only memory (ROM). Optionally, memory 505 includes a non-transitory computer-readable medium for storing instructions, programs, code, code sets, or instruction sets. Memory 505 may be divided into a program storage area and a data storage area, wherein the program storage area may be used to store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, audio playback functionality, image playback functionality, etc.), and instructions for implementing the aforementioned method embodiments; the data storage area may be used to store data involved in the relevant method embodiments. Memory 505 may also be at least one storage device located remotely from processor 501. As shown in Figure 5, the memory 505, which serves as a computer storage medium, may contain an operating system, a network communication module, a user interface module, and program instructions.

[0145] In particular, the methods and/or embodiments in this application can be implemented as computer software programs. For example, the embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowchart. When the computer program is executed by processor 501, it performs the functions defined in the methods of this application.

[0146] Another embodiment of this application provides a computer-readable storage medium having computer program instructions stored thereon, which can be executed by a processor to implement the methods and/or technical solutions of any one or more embodiments of this application described above.

[0147] The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives, as well as magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic cards or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and/or data.

[0148] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0149] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A collaborative alarm method for operational risks in multi-person work scenarios based on AR terminals, characterized in that it includes the following steps: collecting, in real time through AR terminals deployed at multiple operators, perception data including one or more of video frame streams, relative device pose, and timestamps; generating, at the edge computing node and based on the perception data, a node feature vector for each worker, the node feature vector including one or more of a human keypoint embedding, an object interaction vector, and a velocity vector; constructing a multi-person temporal behavior graph by taking the node feature vector of each worker at each time step within a preset time window as a node, and constructing edges between the nodes based on the spatial relationship between workers and the relationship between adjacent time steps of the same worker; inputting the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model, which performs joint reasoning on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism and simultaneously outputs the behavioral intention classification probability of each operator for a future time period and the operational intention conflict probability between any two operators; and, when the operational intention conflict probability exceeds a preset threshold, generating a corresponding conflict warning instruction and sending it to the corresponding AR terminal as a prompt, thereby providing a warning before the conflict occurs.

2. The method according to claim 1, characterized in that the edges include spatial edges used to characterize the spatial relationship between people at the same time step, temporal edges used to characterize the continuity of the same person's actions, and interaction edges used to mark detected interaction events.

3. The method according to claim 1, characterized in that the spatiotemporal graph transformer model sequentially includes a node embedding layer for mapping multimodal node features to a unified dimension, a graph transformer block for modeling spatial relationships between multiple subjects, a temporal transformer block for modeling the temporal evolution of actions, and a fusion layer for realizing cross-attention between the spatial and temporal representations.

4. The method according to claim 3, characterized in that, in the graph transformer block, a multi-head self-attention mechanism is used and edge features are added as relative position biases to the calculation of attention weights, while an adjacency mask restricts each node to attending only to neighboring nodes that are connected to it by an edge.
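
The attention computation recited in claim 4 can be illustrated with a minimal single-head Python sketch. The claim recites a multi-head mechanism and gives no formula, so the scaled dot-product score, the scalar per-pair `edge_bias`, the rule that a node always attends to itself, and the function name `masked_attention` are all assumptions made for illustration.

```python
from math import exp, sqrt

def masked_attention(q, k, v, edge_bias, adjacency):
    """Single-head attention with additive edge bias and adjacency mask.

    q, k, v:    lists of n vectors (lists of floats), one per node.
    edge_bias:  n x n floats added to the attention scores (assumed scalar).
    adjacency:  n x n booleans; True where an edge connects two nodes.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Only neighbours connected by an edge are scored; a node always
        # attends to itself so the softmax is never empty (an assumption).
        allowed = [j for j in range(n) if adjacency[i][j] or j == i]
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / sqrt(d) + edge_bias[i][j]
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the value vectors of the allowed nodes.
        out.append([
            sum(w / total * v[j][c] for w, j in zip(weights, allowed))
            for c in range(len(v[0]))
        ])
    return out

# Three nodes; only nodes 0 and 1 share an edge, node 2 is isolated.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[False, True, False], [True, False, False], [False, False, False]]
bias = [[0.0] * 3 for _ in range(3)]
result = masked_attention(x, x, x, bias, adjacency)
```

Because node 2 has no edges, the mask leaves it attending only to itself and its output equals its own value vector, while nodes 0 and 1 mix each other's values and ignore node 2.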

5. The method according to claim 1, characterized in that the spatiotemporal graph transformer model is trained using a joint loss function, which includes at least an intent classification cross-entropy loss for supervising intent prediction and a conflict binary classification cross-entropy loss for supervising conflict discrimination.

6. The method according to claim 5, characterized in that the joint loss function also includes a temporal consistency loss term, which is used to impose a smoothness constraint on the prediction results of adjacent time steps.
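
The joint loss recited in claims 5 and 6 can be illustrated with a minimal Python sketch for a single training sample. The claims name the three terms but not their exact form or weighting, so the squared-difference smoothness term, the weights `w_conflict` and `w_smooth`, and the choice to supervise the conflict probability at the last time step are assumptions made for illustration.

```python
from math import log

def cross_entropy(probs, label):
    """Negative log-likelihood of the true intent class index."""
    return -log(probs[label])

def binary_cross_entropy(p, label):
    """Binary cross-entropy for a conflict probability p in (0, 1)."""
    return -(label * log(p) + (1 - label) * log(1 - p))

def temporal_consistency(conflict_probs):
    """Mean squared change between adjacent time steps (assumed form)."""
    diffs = [(b - a) ** 2 for a, b in zip(conflict_probs, conflict_probs[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def joint_loss(intent_probs, intent_label, conflict_probs, conflict_label,
               w_conflict=1.0, w_smooth=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    l_intent = cross_entropy(intent_probs, intent_label)
    # Supervise the conflict prediction at the last time step (assumption).
    l_conflict = binary_cross_entropy(conflict_probs[-1], conflict_label)
    l_smooth = temporal_consistency(conflict_probs)
    return l_intent + w_conflict * l_conflict + w_smooth * l_smooth

loss = joint_loss(
    intent_probs=[0.7, 0.2, 0.1], intent_label=0,
    conflict_probs=[0.2, 0.4, 0.8], conflict_label=1,
)
```

The smoothness term penalizes large jumps in the predicted conflict probability between adjacent time steps, which is one plausible reading of the smoothness constraint in claim 6.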

7. The method according to claim 1, characterized in that generating the corresponding conflict warning instruction includes: generating conflict warning instructions of different levels according to the probability range in which the operational intention conflict probability falls, wherein the conflict warning instructions of different levels include at least non-intrusive information prompts, voice and visual prompts containing suggested actions, and immediate blocking alarms that trigger linkage control.
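
The tiered warning recited in claim 7 can be illustrated with a minimal Python sketch. The claim names the three level types but not the probability ranges, so the cut-off values 0.5, 0.7, and 0.9, the level labels, and the function name `warning_level` are hypothetical.

```python
def warning_level(conflict_prob, thresholds=(0.5, 0.7, 0.9)):
    """Map a conflict probability to a warning level.

    The three cut-off values are hypothetical; the claim names the level
    types but not the probability ranges.
    """
    info, advise, block = thresholds
    if conflict_prob > block:
        return "blocking_alarm"        # immediate alarm, triggers linkage control
    if conflict_prob > advise:
        return "voice_visual_prompt"   # prompt that includes a suggested action
    if conflict_prob > info:
        return "info_prompt"           # non-intrusive information prompt
    return None                        # preset threshold not exceeded: no warning

levels = [warning_level(p) for p in (0.3, 0.55, 0.75, 0.95)]
```

Checking the highest threshold first keeps the ranges mutually exclusive, so each probability maps to exactly one level.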

8. A collaborative alarm system for operational risks in multi-person work scenarios based on AR terminals, characterized in that it includes: a perception data acquisition module, used to collect, in real time through AR terminals deployed at multiple operators, perception data including one or more of video frame streams, relative device pose, and timestamps; a node feature vector generation module, used to generate, at the edge computing node and based on the perception data, a node feature vector for each worker, the node feature vector including one or more of a human keypoint embedding, an object interaction vector, and a velocity vector; a multi-person temporal behavior graph construction module, used to construct the multi-person temporal behavior graph by taking the node feature vector of each worker at each time step within a preset time window as a node, and constructing the edges between the nodes based on the spatial relationship between workers and the relationship between adjacent time steps of the same worker; an operation intention conflict probability output module, used to input the multi-person temporal behavior graph into a pre-trained spatiotemporal graph transformer model, which performs joint reasoning on the graph through a spatial self-attention mechanism and a temporal self-attention mechanism and simultaneously outputs the behavior intention classification probability of each operator for a future time period and the operation intention conflict probability between any two operators; and an early warning module, used to generate, when the operation intention conflict probability exceeds a preset threshold, a corresponding conflict early warning command and send it to the corresponding AR terminal as a prompt, thereby providing a warning before the conflict occurs.

9. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

10. A computer-readable medium having computer program instructions stored thereon, characterized in that the computer program instructions can be executed by a processor to implement the method as described in any one of claims 1-7.