A human body heterogeneous action data prediction method driven by a graph attention network
This graph attention network-driven method for predicting heterogeneous human motion data addresses the rapid and accurate prediction of unsafe worker behaviors in human-machine collaborative environments, improving prediction accuracy and generalization ability while enhancing safety and production efficiency.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: HEFEI UNIV OF TECH
- Filing Date: 2024-03-27
- Publication Date: 2026-06-23
AI Technical Summary
In human-machine collaborative environments, existing technologies struggle to quickly and accurately predict unsafe worker behaviors, and the high computational complexity of existing models means that safety risks persist, impacting production efficiency.
A graph attention network-driven method for predicting heterogeneous human motion data is proposed. The method acquires human motion images in real time, constructs skeletal image sequences, and applies a graph attention network for motion prediction. It combines multi-category, small-sample training sets and data augmentation with a few-shot deep learning model to improve prediction accuracy and generalization ability.
It improves the accuracy of human motion prediction and the generalization ability of the model, reduces unnecessary work interruptions, and enhances the safety and productivity of the human-machine collaborative environment.
[Figure: abstract drawing, CN118230415B]
Description
Technical Field
[0001] This invention belongs to the field of human-computer interaction and intelligent sensing technology for industrial robots. Specifically, it relates to a graph attention network-driven method for predicting heterogeneous human motion data.
Background Technology
[0002] In the context of human-robot collaboration, the application of collaborative robots is gradually increasing as the demand for cost-effective and versatile automation partners in the industrial sector continues to rise. Therefore, human-robot collaboration solutions are constantly emerging in industrial settings. Nevertheless, achieving autonomous robot operation while ensuring safety and seamless coordination with factory workers remains a challenge.
[0003] Given the potential risks of injury to operators, industrial safety regulations have established clear rules to prevent accidents. These rules delineate areas where robots and workers work together, ensuring that authorized personnel can only enter the robot's operating area to perform tasks requiring human intervention when the robot is not operating autonomously. While this approach ensures worker safety to some extent, it also reduces productivity and is not conducive to the optimal use of industrial space resources. Furthermore, physical isolation alone cannot completely eliminate safety hazards; serious accidents can still occur in extreme cases, causing irreversible harm to workers, and safety risks remain. Against this backdrop, introducing advanced human motion prediction technology has become crucial for improving the efficiency and safety of human-robot collaboration. By predicting operator movements in real time, robot response strategies can be optimized, unnecessary work interruptions can be reduced, and workplace safety can be ensured, which is essential for the efficient operation of modern automated production lines.
[0004] In the course of developing the invention, the inventors found that deep learning, and popular time-series prediction models in particular, has proven effective for human motion prediction. However, in advanced manufacturing, unsafe worker behaviors in human-machine collaborative environments are highly sensitive to the environment. First, these behaviors are varied and samples of each are scarce, which makes data collection difficult. Second, the high computational complexity of existing behavior recognition models prevents them from reacting quickly enough to identify unsafe worker behaviors in real time. Therefore, improving the generalization and transferability of motion prediction, so that human movements can be predicted quickly and accurately from heterogeneous data, is a key direction for improving small-sample heterogeneous algorithms.
Summary of the Invention
[0005] This invention provides a graph attention network-driven method for predicting human heterogeneous motion data, in order to solve safety issues in human-computer collaborative environments.
[0006] To achieve the above objectives, the technical solution of the present invention is as follows:
[0007] A graph attention network-driven method for predicting heterogeneous human motion data includes the following steps:
[0008] S1: Real-time acquisition of human motion images in the human-machine collaborative production environment to obtain a skeletal image sequence;
[0009] S11: Multiple high-definition KinectV2 cameras are positioned at key locations within the human-machine collaborative production environment so that human movements are captured from multiple perspectives. S12: Real-time video capture is used to synchronously record the dynamic behavior of personnel in the production environment, generating a continuous video stream. S13: The video stream captured in step S12 is preprocessed with video noise reduction, background removal, and image enhancement to improve image quality and ensure accurate skeleton detection. S14: A human pose estimation algorithm is applied to the preprocessed video frames to identify and extract human keypoints and construct a skeletal image sequence. S15: The skeletal image sequence constructed in step S14 is converted into graph-structured data $G = (V, E, A)$, in which human keypoints are the nodes and the physical connections between them are the edges. Here E is the set of edges, the vertex set V consists of C nodes, and $A \in \mathbb{R}^{C \times C}$ is a symmetric adjacency matrix representing the edge type between nodes; if nodes i and j are connected by an edge, the entry $A_{ij}$ is nonzero.
[0010] S2: Process the collected human motion data, split the data into multiple heterogeneous forms, and obtain a small number of training samples of multiple categories;
[0011] S3: Apply graph attention network to calculate the weight distribution between nodes based on graph structure data;
[0012] S4: Use the training samples obtained in S2 to train the model, and use the resulting heterogeneous human motion data prediction model to predict the future motion sequence;
[0013] S5: Calculate the angle of deviation between the predicted sequence result and the actual action sequence, and output the prediction result.
[0014] A further preferred technical solution provided by the present invention is:
[0015] Step S14 specifically includes:
[0016] S141: Load the OpenPose model and configure its parameters;
[0017] S142: Apply the OpenPose algorithm to the input video frame to detect the posture of each person in the image in real time. The key points of the human body are set as head, shoulders, elbows, hands, torso, knees and feet.
[0018] S143: Extract human keypoint information from the output of step S142: transform the coordinate system, filter out keypoints with low detection confidence, and collect the connection information between keypoints, then construct the human skeletal image sequence.
[0019] A further preferred technical solution provided by the present invention is:
[0020] Step S2 specifically includes:
[0021] S21: Classify the collected skeletal image sequences according to action type and performer identity to form multi-category data subsets;
[0022] S22: Apply data augmentation techniques to the dataset of each category formed in S21. These techniques include rotation, scaling, and cropping, which increase data diversity during model training, and produce a standard graph dataset. Every sample of a given task contains the same set of motion nodes V with an identical adjacency matrix A.
[0023] S23: Randomly select a small number of samples from the augmented dataset of each category formed in S22 to form a multi-category, small-sample training set for subsequent model learning.
[0024] S24: Each time the training samples formed in S23 are used, a sensor node is randomly selected as the initial root node, and a subset of neighboring vertices is then recursively added to the graph together with all edges whose endpoints both lie in the current subset. This samples an induced subgraph of the original human skeleton graph and yields a dataset containing heterogeneously structured nodes. In other words, the underlying graph structure and vertex set differ between tasks, while the support data and query data of the same task share one graph. The number of motion nodes C is therefore not fixed but depends on the current task.
[0025] A further preferred technical solution provided by the present invention is:
[0026] Step S3 specifically includes:
[0027] S31: Design a graph attention layer that learns the dependencies between nodes (i.e., human keypoints) through a self-attention mechanism, strengthening important features and suppressing irrelevant information so that key action features are identified and enhanced automatically. The input to the graph attention layer is a set of node features $h = \{h_1, h_2, \dots, h_N\}$, $h_i \in \mathbb{R}^{K}$, where N is the number of nodes and K is the number of features per node. The output of this layer is a new set of node features $h' = \{h'_1, h'_2, \dots, h'_N\}$;
[0028] S32: Inject the graph structure into this mechanism by performing masked attention, that is, computing $e_{ij}$ only for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is a neighborhood of node i in the graph. The layer is parameterized by a weight matrix $W$ applied to every node, after which a shared self-attention mechanism $a$ is applied to the nodes to compute the attention coefficients:
[0029] $$e_{ij} = a\left(W h_i, W h_j\right)$$
[0030] where $e_{ij}$ indicates the importance of the features of node j to node i;
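[0031] S33: To make the coefficients comparable across different nodes, they are normalized over all choices of j with the softmax function:
[0032] $$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
[0033] The attention mechanism $a$ is a single-layer feedforward neural network parameterized by a weight vector $\vec{a}$ and followed by the LeakyReLU nonlinearity. The attention coefficients are therefore expressed as:
[0034] $$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W h_i \,\|\, W h_k\right]\right)\right)}$$
[0035] where $\cdot^{T}$ denotes transposition and $\|$ denotes concatenation;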
[0036] S34: Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final output feature of each node. The layer is then extended to multi-head attention to stabilize the self-attention learning process. Specifically, the features produced by K independent attention mechanisms are concatenated to give the output representation:
[0037] $$h'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\Big)$$
[0038] where $\Vert$ denotes concatenation, $\sigma$ is a nonlinear activation function, $\alpha_{ij}^{k}$ is the normalized attention coefficient computed by the k-th attention mechanism $a^{k}$, and $W^{k}$ is the weight matrix of the corresponding linear transformation of the input. Note that in this setting the final output $h'$ contains K times as many features per node as a single attention head produces;
[0039] S35: For heterogeneous graphs, the model proposed in this invention can handle graphs with different types of nodes and edges. Its core idea is to represent the nodes and edges of the heterogeneous graph as low-dimensional vectors and then use those vectors in the downstream task.
[0040] A further preferred technical solution provided by the present invention is:
[0041] Step S35 specifically includes:
[0042] S351: Using heterogeneous attention to compute the weights between nodes:
[0043]
[0044] where the symbols in the equation denote, respectively, the type of edge, the representation of node i with respect to that edge type, and the weight matrix;
[0045] S352: Integrating different types of nodes and edges through Heterogeneous Attention Pooling (HAP):
[0046]
[0047] where the neighborhood symbol in the equation denotes the set of neighboring nodes of node i with respect to the given edge type.
[0048] A further preferred technical solution provided by the present invention is:
[0049] Step S4 specifically includes:
[0050] S41: Based on the feature representations constructed with the graph attention network, develop an action sequence prediction model that can learn the spatiotemporal features of actions from a small number of training samples. A set of M tasks is treated as a meta-dataset; each task comprises support data and query data, and the graph is shared across the support and query instances of a given task. Given the labeled support data and the query data, the model is trained to minimize the prediction loss on the query data of the task, as shown below:
[0051] ;
[0052] S42: The model parameters are optimized by applying few-shot learning techniques, enabling the model to effectively learn and predict action sequences under limited data conditions. The model structure consists of two main parts: an inference network and a prediction network.
[0053] A further preferred technical solution provided by the present invention is:
[0054] Step S42 specifically includes:
[0055] S421: First, the inference network processes the predictor and target data of the task's support set to generate a latent task representation, which should contain information useful for predicting the query instances.
[0056] S422: Next, the prediction network computes the actual motion prediction for the query set of the current task from its predictors and the task embedding produced by the inference network;
[0057] S423: Both networks are built mainly from stacked deep set (DS) blocks, which process the input data as a set of attributes. For the support data, an embedding is computed for each vertex c∈C over the instance set I. A single layer in such a block is therefore a deep set layer, as shown below:
[0058]
[0059] where K is the latent output dimension. Applying an inner function to each element of the instance set X and aggregating the set with an outer function yields a permutation-invariant layer that operates on the set of instances;
[0060] S424: The single-layer definition in the attention block used in the model is as follows:
[0061]
[0062] where the neighborhood of c is the set of all vertices adjacent to vertex c (known for each vertex from the graph), [·] denotes concatenation along the latent feature axis, the inner function prepares the neighborhood embeddings, and the outer function updates the vertex features of the corresponding node based on the aggregated neighbor information;
[0063] S425: The inference network processes the complete support data and computes task embeddings across instances and motion nodes; the prediction network processes the query data and outputs the final prediction result. The model's prediction for task m is given by the following formula:
[0064]
[0065] The inference network consists of a GAT block between two DS blocks and is used to capture information across instances. The prediction network consists of stacked GAT blocks and GRU layer blocks, with the GRU blocks used to compute the target motion predictions.
[0066] A further preferred technical solution provided by the present invention is:
[0067] Step S5 specifically includes:
[0068] S51: Calculate the deviation between the action sequence predicted by the model and the actual action sequence, and use the angle deviation as the evaluation criterion to quantify the prediction accuracy;
[0069] S52: The mean absolute angle error (MAAE) is used as the performance evaluation index. The accuracy of the prediction results is considered comprehensively in order to optimize and adjust the model, thereby reducing prediction bias and improving prediction performance. It is calculated as follows:
[0070] $$\mathrm{MAAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\theta_i - \hat{\theta}_i\right|$$
[0071] where n is the total number of samples, $\theta_i$ is the i-th actual angle value, $\hat{\theta}_i$ is the i-th predicted angle value, and $\left|\theta_i - \hat{\theta}_i\right|$ is the absolute value of the difference between the actual and predicted angle values. When the predicted MAAE is greater than 80% of the set value, the model outputs an action check instruction.
[0072] The beneficial effects of this invention are:
[0073] This invention focuses on human movements during collaboration between humans and collaborative robots within a production unit. Optimization goals include improving the accuracy and sophistication of the prediction model for human movements. The invention also proposes a graph attention network-driven method for predicting heterogeneous human movement data by processing the collected human movement datasets to address their heterogeneous structure. This provides a reasonable and effective approach to improving safety in human-robot collaboration. Furthermore, to address the challenges of diverse collaboration scenarios and worker movements, a few-sample deep learning model capable of handling heterogeneous data is employed as an optimization method to enhance the accuracy of worker movement prediction.
[0074] This invention improves the performance of human motion prediction models by refining them. First, it employs a graph attention network approach for feature extraction, integrating information from adjacent sensors to address the limitations of existing models in handling graph-structured data. Second, it utilizes a small-sample heterogeneous data time-series model to enhance the model's predictive generalization ability. Compared to other traditional models, the improved model demonstrates enhanced generalization ability and accuracy.
[0075] This invention targets data with complex time dependencies and heterogeneous channel characteristics, and employs a few-shot deep learning model to enhance generalization ability, so as to adapt to action recognition tasks in variable environments.
Attached Figure Description
[0076] Figure 1 is a flowchart of the graph attention network-driven method for predicting heterogeneous human motion data provided by the present invention;
[0077] Figure 2 is a schematic diagram of human motion capture within a human-machine collaborative production environment, provided as an example of the present invention;
[0078] Figure 3 is a flowchart of the deep learning algorithm within the graph attention network-driven method for predicting heterogeneous human motion data, as described in an example of the present invention.
Detailed Implementation
[0079] The technical solutions proposed by the present invention are described in detail below with reference to the accompanying drawings of the embodiments. It should be noted that the embodiments discussed herein illustrate only a portion of the implementations of the present invention, not its entire scope. All variations and modifications that those skilled in the art can derive or implement from the embodiments disclosed in this document without inventive work should be considered to fall within the protection scope of the present invention.
[0080] The present invention is described in detail below with reference to Figures 1 to 3:
[0081] A graph attention network-driven method for predicting heterogeneous human motion data includes the following steps:
[0082] S1: Real-time acquisition of human motion images in the human-machine collaborative production environment to obtain a skeletal image sequence;
[0083] Step S1 specifically includes:
[0084] S11: Multiple high-definition KinectV2 cameras are placed at key locations within the human-machine collaborative production environment to ensure that human movements are captured from multiple perspectives;
[0085] S12: Utilize real-time video capture technology to synchronously record the dynamic behavior of personnel in the production environment and generate a continuous video stream;
[0086] S13: Preprocess the video stream captured in S12. The preprocessing includes video noise reduction, background removal, and image enhancement to improve image quality and ensure the accuracy of skeleton detection;
[0087] S14: Apply human pose estimation algorithm to analyze the preprocessed video frames, identify and extract human key point information, and construct a skeletal image sequence;
[0088] In step S14, the human pose estimation algorithm is the OpenPose real-time multi-person pose estimation algorithm, and the steps include:
[0089] S141: Load the OpenPose model and configure its parameters;
[0090] S142: Apply the OpenPose algorithm to the input video frame to detect the posture of each person in the image in real time. The key points of the human body are set as head, shoulders, elbows, hands, torso, knees and feet.
[0091] S143: Extract human keypoint information from the output of step S142: transform the coordinate system, filter out keypoints with low detection confidence, and collect the connection information between keypoints, then construct the human skeletal image sequence.
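The keypoint clean-up in S143 can be illustrated with a minimal sketch. It assumes each detected keypoint arrives as an (x, y, confidence) triple in pixel coordinates; the function name `filter_and_normalize`, the dictionary layout, and the 0.3 confidence threshold are illustrative assumptions rather than values taken from the invention or from the OpenPose API.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-frame detection: keypoint name -> (x_pixels, y_pixels, confidence).
Detection = Dict[str, Tuple[float, float, float]]


def filter_and_normalize(
    detection: Detection,
    frame_width: int,
    frame_height: int,
    min_confidence: float = 0.3,  # assumed threshold, not specified in the source
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Drop low-confidence keypoints and map pixel coordinates to [0, 1]."""
    cleaned: Dict[str, Optional[Tuple[float, float]]] = {}
    for name, (x, y, confidence) in detection.items():
        if confidence < min_confidence:
            # Keep the slot as None so the skeleton layout stays stable per frame.
            cleaned[name] = None
        else:
            cleaned[name] = (x / frame_width, y / frame_height)
    return cleaned


frame = {
    "head": (320.0, 60.0, 0.95),
    "left_hand": (100.0, 240.0, 0.10),  # unreliable detection, will be dropped
    "right_knee": (480.0, 360.0, 0.80),
}
skeleton = filter_and_normalize(frame, frame_width=640, frame_height=480)
```

Keeping a `None` placeholder for a rejected keypoint is one possible design choice; it preserves the node ordering that later graph construction relies on.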
[0092] S15: The skeletal image sequence constructed in S14 is transformed into graph-structured data $G = (V, E, A)$, in which human keypoints are the nodes of the graph and the physical connections between them are the edges. Here E is the set of edges, the vertex set V consists of C nodes, and $A \in \mathbb{R}^{C \times C}$ is a symmetric adjacency matrix representing the edge type between nodes; if nodes i and j are connected by an edge, the entry $A_{ij}$ is nonzero.
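As a rough illustration of S15, the sketch below builds a symmetric adjacency matrix from a list of skeleton connections. The keypoint names follow S142, but the specific connection list `BONES` and the use of 1 as the entry for a connected pair are assumptions made for the example only.

```python
from typing import List, Tuple

# Keypoint names follow S142; the connection list is an illustrative assumption.
KEYPOINTS = ["head", "shoulders", "elbows", "hands", "torso", "knees", "feet"]
BONES = [
    ("head", "shoulders"),
    ("shoulders", "elbows"),
    ("elbows", "hands"),
    ("shoulders", "torso"),
    ("torso", "knees"),
    ("knees", "feet"),
]


def build_adjacency(nodes: List[str], edges: List[Tuple[str, str]]) -> List[List[int]]:
    """Return a symmetric C x C matrix with entry 1 for each connected pair."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    for u, v in edges:
        i, j = index[u], index[v]
        adjacency[i][j] = 1
        adjacency[j][i] = 1  # the skeleton is undirected, so A stays symmetric
    return adjacency


A = build_adjacency(KEYPOINTS, BONES)
```

If several edge types are needed, the entries could carry a type code instead of 1; the symmetric layout stays the same.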
[0093] S2: Process the collected human motion data, split the data into multiple heterogeneous forms, and obtain a small number of training samples of multiple categories;
[0094] Step S2 specifically includes:
[0095] S21: Classify the collected skeletal image sequences according to action type and performer identity to form multi-category data subsets;
[0096] S22: Apply data augmentation techniques to the dataset of each category formed in S21. These techniques include rotation, scaling, and cropping, which increase data diversity during model training, and produce a standard graph dataset in which m and m' denote the numbers of nodes in the graph datasets of different tasks; every sample of a given task contains the same set of motion nodes V with an identical adjacency matrix A;
[0097] S23: Randomly select a small number of samples from the augmented datasets of each category formed in S22 to form a small training sample set for multiple categories, which will be used for subsequent model learning.
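The augmentation and sampling in S22 and S23 might look like the following sketch for 2D keypoint coordinates. Rotation and scaling are shown; cropping is omitted for brevity. The function names, the category labels, and the number of shots are hypothetical choices for the example.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Skeleton = List[Point]


def rotate_and_scale(skeleton: Skeleton, angle_deg: float, scale: float) -> Skeleton:
    """Rotate 2D keypoints about the origin, then scale them uniformly."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        (scale * (x * cos_t - y * sin_t), scale * (x * sin_t + y * cos_t))
        for x, y in skeleton
    ]


def sample_few_shot(
    by_category: Dict[str, List[Skeleton]], shots: int, rng: random.Random
) -> Dict[str, List[Skeleton]]:
    """Pick `shots` samples per category without replacement."""
    return {label: rng.sample(items, shots) for label, items in by_category.items()}


rng = random.Random(0)
base = [(1.0, 0.0), (0.0, 1.0)]
augmented = {
    "reach": [rotate_and_scale(base, angle, 1.0) for angle in (0, 15, 30, 45)],
    "bend": [rotate_and_scale(base, 0, s) for s in (0.8, 0.9, 1.0, 1.1)],
}
few_shot = sample_few_shot(augmented, shots=2, rng=rng)
```

Because rotation and scaling act only on coordinates, every augmented sample keeps the same node set and adjacency matrix, which matches the requirement stated in S22.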
[0098] S24: Each time the training samples formed in S23 are used, a sensor node is randomly selected as the initial root node, and a subset of neighboring vertices is then recursively added to the graph together with all edges whose endpoints both lie in the current subset. This samples an induced subgraph of the original human skeleton graph and yields a dataset containing heterogeneously structured nodes. In other words, the underlying graph structure and vertex set differ between tasks, while the support data and query data of the same task share one graph. The number of motion nodes C is therefore not fixed but depends on the current task.
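The induced-subgraph sampling of S24 can be sketched as a randomized breadth-first growth from a random root. This is only one plausible reading of the step: the size cap `max_nodes`, the way a random subset of neighbors is chosen, and the path-shaped example graph are all assumptions.

```python
import random
from typing import List, Tuple


def sample_induced_subgraph(
    adjacency: List[List[int]], max_nodes: int, rng: random.Random
) -> Tuple[List[int], List[List[int]]]:
    """Grow a connected vertex subset from a random root, then induce its edges."""
    n = len(adjacency)
    root = rng.randrange(n)
    selected = [root]
    frontier = [root]
    while frontier and len(selected) < max_nodes:
        current = frontier.pop(0)
        candidates = [
            j for j in range(n) if adjacency[current][j] and j not in selected
        ]
        if not candidates:
            continue
        rng.shuffle(candidates)
        take = rng.randint(1, len(candidates))  # add a random subset of neighbors
        for j in candidates[:take]:
            if len(selected) < max_nodes:
                selected.append(j)
                frontier.append(j)
    selected.sort()
    # Induced subgraph: keep every edge whose two endpoints were both selected.
    induced = [[adjacency[i][j] for j in selected] for i in selected]
    return selected, induced


# Example graph: a 5-node path 0-1-2-3-4 standing in for a skeleton.
chain = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
nodes, sub = sample_induced_subgraph(chain, 3, random.Random(1))
```

Each call can return a different vertex set and size, which is how tasks end up with different numbers of motion nodes C.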
[0099] S3: Apply graph attention network to calculate the weight distribution between nodes based on graph structure data to enhance the recognition of key action features;
[0100] Step S3 specifically includes:
[0101] S31: Design the graph attention layer, which learns the dependencies between nodes (i.e., human keypoints) through a self-attention mechanism, strengthening important features and suppressing irrelevant information so that key action features are identified and enhanced automatically. The input to the graph attention layer is a set of node features $h = \{h_1, h_2, \dots, h_N\}$, where N is the number of nodes, K is the number of features per node, i is the node index, and T is the time dimension of the node features. The output of this layer is a new set of node features $h' = \{h'_1, h'_2, \dots, h'_N\}$;
[0102] S32: Inject the graph structure into this mechanism by performing masked attention, that is, computing $e_{ij}$ only for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is a neighborhood of node i in the graph. The layer is parameterized by a weight matrix $W$ applied to every node, after which a shared self-attention mechanism $a$ is applied to the nodes to compute the attention coefficients, expressed as:
[0103] $$e_{ij} = a\left(W h_i, W h_j\right)$$
[0104] where $e_{ij}$ indicates the importance of the features of node j to node i, $h_i$ is the feature vector of the previous node, $h_j$ is the feature vector of the next node, and $h'$ is the output set of new node feature vectors;
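[0105] S33: To make the coefficients comparable across different nodes, they are normalized over all choices of j with the softmax function:
[0106] $$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
[0107] The attention mechanism $a$ is a single-layer feedforward neural network parameterized by a weight vector $\vec{a}$ and followed by the LeakyReLU nonlinearity; the attention coefficients are therefore expressed as:
[0108] $$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T}\left[W h_i \,\|\, W h_k\right]\right)\right)}$$
[0109] where $\cdot^{T}$ denotes transposition and $\|$ denotes concatenation;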
[0110] S34: Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features as the final output feature of each node, and the layer is extended to multi-head attention to stabilize the self-attention learning process; specifically, the features produced by K independent attention mechanisms are concatenated to give the output representation:
[0111] $$h'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\Big)$$
[0112] where $\Vert$ denotes concatenation, $\sigma$ is a nonlinear activation function, and $\alpha_{ij}^{k}$ is the normalized attention coefficient computed by the k-th attention mechanism;
[0113] $a^{k}$ is the k-th attention mechanism, and $W^{k}$ is the weight matrix of the corresponding linear transformation of the input;
[0114] the final output $h'$ contains K times as many features per node as a single attention head produces;
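The masked attention of S32 to S34 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each node is assumed to attend to itself as well as its neighbors, the output nonlinearity is omitted, the LeakyReLU slope of 0.2 is an assumed value, and the weights in the example are arbitrary.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Multiply matrix W by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_head(
    h: List[Vector], adjacency: List[List[int]], W: Matrix, a: Vector
) -> List[Vector]:
    """One head: scores only over each node's neighborhood, then softmax."""
    Wh = [matvec(W, h_i) for h_i in h]
    out = []
    for i in range(len(h)):
        # Assumption: the neighborhood includes node i itself.
        nbrs = [j for j in range(len(h)) if adjacency[i][j] or j == i]
        # Wh[i] + Wh[j] is list concatenation, i.e. [W h_i || W h_j].
        scores = [
            leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j]))) for j in nbrs
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # shifted for stability
        total = sum(exps)
        alpha = [e / total for e in exps]
        out.append(
            [sum(alpha[n] * Wh[j][d] for n, j in enumerate(nbrs)) for d in range(len(W))]
        )
    return out


def multi_head(
    h: List[Vector], adjacency: List[List[int]], heads: List[Tuple[Matrix, Vector]]
) -> List[Vector]:
    """Concatenate the per-node outputs of several independent heads."""
    per_head = [attention_head(h, adjacency, W, a) for W, a in heads]
    return [sum((head[i] for head in per_head), []) for i in range(len(h))]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [0.0] * 4), (identity, [0.1, 0.2, 0.3, 0.4])]
out = multi_head(h, adj, heads)
```

With two heads of two features each, every node ends up with four output features, which illustrates why concatenation multiplies the per-node feature count by the number of heads.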
[0115] S35: For heterogeneous graphs, represent the nodes and edges in the heterogeneous graph as low-dimensional vectors, and then use the low-dimensional vectors to continue the task.
[0116] Step S35 specifically includes:
[0117] S351: Using heterogeneous attention to compute the weights between nodes:
[0118]
[0119] where the symbols in the equation denote, respectively, the type of edge, the representation of node i with respect to that edge type, and the weight matrix;
[0120] S352: Integrating different types of nodes and edges through Heterogeneous Attention Pooling (HAP):
[0121]
[0122] where the neighborhood symbol in the equation denotes the set of neighboring nodes of node i with respect to the given edge type.
[0123] S4: Use the training samples obtained in S2 to train the model, and use the resulting heterogeneous human motion data prediction model to predict the future motion sequence;
[0124] Step S4 specifically includes:
[0125] S41: Based on the feature representations constructed with the graph attention network, develop an action sequence prediction model that can learn the spatiotemporal features of actions from a small number of training samples. A set of M tasks is treated as a meta-dataset; each task comprises a support set and a query set, and the graph is shared across the support and query instances of a given task;
[0126] Given the labeled support data and the query data, the model is trained to minimize the prediction loss on the query data of the task, expressed as:
[0127]
[0128] where the terms of the equation denote, respectively, the prediction loss, the loss function, the meta-dataset, and the actual sequence data of the query set;
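The task-level objective in S41 can be illustrated with a small sketch that averages a query-set loss over the tasks of a meta-dataset. The exact loss used by the invention is not recoverable from this text, so mean squared error, the task dictionary layout, and the toy predictor `mean_offset_model` are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical task layout: support/query inputs ("_x") and targets ("_y").
Task = Dict[str, List[float]]


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed stand-in loss."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def meta_loss(tasks: List[Task], model: Callable[[Task], List[float]]) -> float:
    """Average the query-set loss over all tasks in the meta-dataset."""
    return sum(mse(model(task), task["query_y"]) for task in tasks) / len(tasks)


def mean_offset_model(task: Task) -> List[float]:
    """Toy predictor: shift each query input by the mean support offset."""
    offsets = [y - x for x, y in zip(task["support_x"], task["support_y"])]
    shift = sum(offsets) / len(offsets)
    return [x + shift for x in task["query_x"]]


tasks = [
    {"support_x": [0.0, 1.0], "support_y": [1.0, 2.0], "query_x": [2.0], "query_y": [3.0]},
    {"support_x": [0.0, 2.0], "support_y": [2.0, 4.0], "query_x": [1.0], "query_y": [4.0]},
]
loss = meta_loss(tasks, mean_offset_model)
```

The point of the sketch is the structure: the model sees labeled support data for each task, and only the error on that task's query data contributes to the objective.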
[0129] S42: The model parameters are optimized by applying few-shot learning techniques, enabling the model to effectively learn and predict action sequences under limited data conditions. The model structure consists of two main parts: an inference network and a prediction network.
[0130] Step S42 specifically includes:
[0131] S421: First, the inference network processes the predictor and target data of the task's support set to generate a latent task representation, which should contain information useful for predicting the query instances.
[0132] S422: Next, the prediction network computes the actual motion prediction for the query set of the current task from its predictors and the task embedding produced by the inference network;
[0133] S423: Both networks are built mainly from stacked deep set (DS) blocks, which process the input data as a set of attributes. For the support data, the embedding of each vertex c∈C is computed over the instance set I. A single layer in such a block is a deep set layer, as shown below:
[0134]
[0135] where the equation gives the embedding of each vertex, c is the vertex, I is the set of instances, and K is the latent output dimension;
[0136] Applying an inner function to each element of the instance set X and aggregating the set with an outer function yields a permutation-invariant layer that operates on the set of instances;
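The permutation-invariant layer described in S423 can be sketched as an inner function applied per instance, a pooling step over the set, and an outer function applied to the pooled result. Mean pooling and the particular toy functions below are assumptions; the text only states that an inner and an outer function are combined to obtain permutation invariance.

```python
from typing import Callable, List

Vector = List[float]


def deep_set_layer(
    instances: List[Vector],
    inner: Callable[[Vector], Vector],
    outer: Callable[[Vector], Vector],
) -> Vector:
    """Apply `inner` per instance, mean-pool over the set, then apply `outer`."""
    embedded = [inner(x) for x in instances]
    width = len(embedded[0])
    pooled = [sum(e[d] for e in embedded) / len(embedded) for d in range(width)]
    return outer(pooled)


def square_features(x: Vector) -> Vector:
    """Toy inner function standing in for a learned per-instance network."""
    return [v * v for v in x]


def affine(z: Vector) -> Vector:
    """Toy outer function standing in for a learned network on the pooled vector."""
    return [2.0 * v + 1.0 for v in z]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
forward = deep_set_layer(X, square_features, affine)
shuffled = deep_set_layer(list(reversed(X)), square_features, affine)
```

Since the mean does not depend on the order of its inputs, `forward` and `shuffled` agree, which is the permutation invariance the layer is meant to provide.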
[0137] S424: The single-layer definition in the attention block used in the model is as follows:
[0138]
[0139] where the terms of the equation denote, respectively, the summarized information of each sensor, the aggregated neighbor information, and the neighborhood embedding information; j is any vertex in the neighborhood of c, which is the set of all vertices adjacent to vertex c; and [·] denotes concatenation along the latent feature axis;
[0140] the inner function prepares the neighborhood embeddings, and the outer function updates the vertex features of the corresponding nodes based on the aggregated neighbor information;
[0141] S425: The inference network processes the complete support data and computes task embeddings across instances and motion nodes; the prediction network processes the query data and outputs the final prediction result;
[0142] The model's prediction for task m is given by the following formula:
[0143]
[0144] where the left-hand side is the final predicted sequence of task m. The inference network consists of a GAT block between two DS blocks and is used to capture information across instances; the prediction network consists of stacked GAT blocks and GRU layer blocks, with the GRU blocks used to compute the target motion predictions.
[0145] S5: Calculate the angle of deviation between the predicted sequence result and the actual action sequence, and output the prediction result.
[0146] Step S5 specifically includes:
[0147] S51: Calculate the deviation between the action sequence predicted by the model and the actual action sequence, and use the angle deviation as the evaluation criterion to quantify the prediction accuracy;
[0148] S52: The mean absolute angle error (MAAE) is used as the performance evaluation index. The accuracy of the prediction results is considered comprehensively in order to optimize and adjust the model, thereby reducing prediction bias and improving prediction performance. It is calculated as follows:
[0149] $$\mathrm{MAAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\theta_i - \hat{\theta}_i\right|$$
[0150] where n is the total number of samples, $\theta_i$ is the i-th actual angle value, $\hat{\theta}_i$ is the i-th predicted angle value, and $\left|\theta_i - \hat{\theta}_i\right|$ is the absolute value of the difference between the actual and predicted angle values. When the predicted MAAE is greater than 80% of the set value, the model outputs an action check instruction.
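The evaluation in S52 can be sketched as below. The function names and the example angles are hypothetical, the set value of 2.0 degrees is an arbitrary illustration, and angles are treated as plain numbers without wrap-around handling at 360 degrees.

```python
from typing import Sequence


def mean_absolute_angle_error(
    actual: Sequence[float], predicted: Sequence[float]
) -> float:
    """Average of |actual_i - predicted_i| over all n samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def needs_action_check(maae: float, set_value: float, ratio: float = 0.8) -> bool:
    """Flag the prediction when MAAE exceeds 80% of the configured set value."""
    return maae > ratio * set_value


actual_deg = [10.0, 20.0, 30.0, 40.0]
predicted_deg = [12.0, 18.0, 33.0, 39.0]
maae = mean_absolute_angle_error(actual_deg, predicted_deg)
flag = needs_action_check(maae, set_value=2.0)
```

In this example the absolute errors are 2, 2, 3, and 1 degrees, so the MAAE is 2.0 degrees, which exceeds 80% of the assumed set value and would trigger an action check.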
[0151] This invention focuses on human movements during collaboration between humans and collaborative robots within a production unit. Optimization goals include improving the accuracy and sophistication of the prediction model for human movements. The invention also proposes a graph attention network-driven method for predicting heterogeneous human movement data by processing the collected human movement datasets to address their heterogeneous structure. This provides a reasonable and effective approach to improving safety in human-robot collaboration. Furthermore, to address the challenges of diverse collaboration scenarios and worker movements, a few-sample deep learning model capable of handling heterogeneous data is employed as an optimization method to enhance the accuracy of worker movement prediction.
[0152] This invention improves the performance of human motion prediction models by refining them. First, it employs a graph attention network approach for feature extraction, integrating information from adjacent sensors to address the limitations of existing models in handling graph-structured data. Second, it utilizes a small-sample heterogeneous data time-series model to enhance the model's predictive generalization ability. Compared to other traditional models, the improved model demonstrates enhanced generalization ability and accuracy.
[0153] This invention targets data with complex time dependencies and heterogeneous channel characteristics, and employs a few-shot deep learning model to enhance generalization ability, so as to adapt to action recognition tasks in variable environments.
[0154] The examples shown in this specification only illustrate a portion of the specific implementations of this technical solution and are not a complete description of all possible embodiments of the present invention. All variations and modifications that can be derived or implemented by those skilled in the art based on the embodiments disclosed in this document without innovative work should be considered to fall within the protection scope of this invention.
Claims
1. A graph attention network-driven method for predicting human heterostructure motion data, characterized in that the steps include the following:
S1: Acquire human motion images in real time in the human-machine collaborative production environment to obtain a skeletal image sequence;
S2: Process the collected human motion data, split the data into multiple heterogeneous forms, and obtain a small number of training samples of multiple categories;
S3: Apply a graph attention network to calculate the weight distribution between nodes based on graph structure data, to enhance the recognition of key action features;
S4: Use the training samples obtained in S2 to train the model, and use the human heterostructure motion data prediction model to predict the future motion sequence;
Step S4 specifically includes:
S41: Based on the feature representations constructed by the graph attention network, develop an action sequence prediction model that can learn the spatiotemporal features of actions from a small number of training samples. A set of M tasks is taken as the meta-dataset; each task includes a support set and a query set, and the graph is shared across the support and query instances of a given task. Given the labeled support data and the query data, the model is required to achieve the lowest prediction loss on the query data of the task; the loss expression is defined in terms of the prediction loss, the loss function, the meta-dataset, and the actual sequence data of the query set;
S42: Optimize the model parameters by applying few-shot learning techniques, enabling the model to learn and predict action sequences effectively under limited data conditions. The model structure consists of two main parts: an inference network and a prediction network;
Step S42 specifically includes:
S421: The inference network first processes the predictor and target data of the task support set to generate a latent task representation, which should contain information useful for predicting the query instances;
S422: Next, the prediction network computes the actual motion prediction for the query set of the current task, based on its predictors and the task embedding computed from the support set;
S423: The main components of both networks are multiple stacked deep set (DS) blocks, which process the input support data as a set. The embedding of each vertex c∈C is computed over the set of instance nodes I. A single layer in such a block is a deep set layer that outputs an embedding for each vertex, where c is the vertex, I is the set of instance nodes, and K is the latent output dimension; an inner function is applied to each element of the instance set X, and an outer function aggregates the set, establishing a permutation-invariant layer that operates on the set of instances;
S424: A single layer of the graph attention (GAT) block used in the model aggregates the information from each sensor by combining the vertex features with the aggregated neighborhood embedding information, where j is any vertex belonging to the neighborhood of vertex c, the neighborhood is the set of all vertices adjacent to c, and [·] denotes concatenation along the latent feature axis; the inner function prepares the neighborhood embeddings, and the outer function updates the vertex features of the corresponding node based on the aggregated neighbor information;
S425: The inference network processes the full support data and computes task embeddings across instances and motion nodes; the prediction network processes the query data and outputs the final prediction result, which is the final predicted sequence of task m; the inference network is composed of GAT blocks between two DS blocks and is used to capture information across instances; the prediction network consists of stacked GAT blocks and GRU layer blocks, with the GRU blocks used to compute the target motion prediction;
S5: Calculate the angle deviation between the predicted sequence and the actual action sequence, and output the prediction result.
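The permutation-invariant deep set layer of S423 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `inner_fn` and `outer_fn` are hypothetical stand-ins for learned networks, and mean pooling is assumed as the aggregation, which the text does not specify.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def deep_set_layer(
    instances: Sequence[Vector],
    inner: Callable[[Vector], Vector],
    outer: Callable[[Vector], Vector],
) -> Vector:
    """Apply `inner` to every instance, mean-pool, then apply `outer`."""
    if not instances:
        raise ValueError("the instance set must not be empty")
    transformed = [inner(x) for x in instances]
    dim = len(transformed[0])
    # Mean pooling is an assumption; any symmetric aggregation keeps invariance.
    pooled = [sum(t[k] for t in transformed) / len(transformed) for k in range(dim)]
    return outer(pooled)


# Hypothetical inner/outer functions standing in for learned networks.
def inner_fn(x: Vector) -> Vector:
    return [2.0 * v + 1.0 for v in x]


def outer_fn(z: Vector) -> Vector:
    return [max(0.0, v) for v in z]  # ReLU


support_instances = [[1.0, -2.0], [3.0, 0.5], [-1.0, 4.0]]
embedding = deep_set_layer(support_instances, inner_fn, outer_fn)
shuffled = deep_set_layer(list(reversed(support_instances)), inner_fn, outer_fn)
```

Because the pooling step is symmetric, `embedding` and `shuffled` agree up to floating-point rounding, which is the property that lets the layer operate on an unordered set of support instances.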
2. The method for predicting human heterostructure motion data driven by graph attention network according to claim 1, characterized in that: Step S1 specifically includes:
S11: Place multiple high-definition KinectV2 cameras at key locations within the human-machine collaborative production environment to ensure that human movements are captured from multiple perspectives;
S12: Use real-time video capture technology to synchronously record the dynamic behavior of personnel in the production environment and generate a continuous video stream;
S13: Preprocess the video stream captured in S12, the preprocessing including video noise reduction, background removal, and image enhancement, to improve image quality and ensure the accuracy of skeleton detection;
S14: Apply a human pose estimation algorithm to analyze the preprocessed video frames, identify and extract human key point information, and construct a skeletal image sequence;
S15: Transform the skeletal image sequence constructed in S14 into graph structure data, in which the human key points are treated as nodes of the graph and the physical connections between them are treated as edges, where E represents the set of edges, the vertex set V consists of C nodes, and A is a symmetric adjacency matrix representing the edge type between nodes. If nodes i and j are connected by an edge, then ...
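The conversion in S15 from key points and physical connections to a symmetric adjacency matrix can be sketched as below. The key point names, the bone list, and the use of 1 to mark a connected pair are assumptions made for illustration; the claim does not state the entry value.

```python
from typing import Dict, List, Tuple

# Hypothetical key points and bone connections, for illustration only.
KEYPOINTS: List[str] = ["head", "torso", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow"]
BONES: List[Tuple[str, str]] = [
    ("head", "torso"),
    ("torso", "l_shoulder"),
    ("torso", "r_shoulder"),
    ("l_shoulder", "l_elbow"),
    ("r_shoulder", "r_elbow"),
]


def build_adjacency(nodes: List[str], edges: List[Tuple[str, str]]) -> List[List[int]]:
    """Return a symmetric C x C matrix with 1 where two key points share a bone."""
    index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}
    adj = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        i, j = index[u], index[v]
        adj[i][j] = 1
        adj[j][i] = 1  # undirected skeleton, so the matrix is symmetric
    return adj


A = build_adjacency(KEYPOINTS, BONES)
```

If several edge types need to be distinguished, the entries could hold a type code instead of 1 without changing the structure of the sketch.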
3. The method for predicting human heterostructure motion data driven by graph attention network according to claim 2, characterized in that: In step S14, the human pose estimation algorithm is the OpenPose real-time multi-person pose estimation algorithm, and the steps include:
S141: Load the OpenPose model and configure its parameters;
S142: Apply the OpenPose algorithm to the input video frame to detect the posture of each person in the image in real time, with the key points of the human body set as the head, shoulders, elbows, hands, torso, knees, and feet;
S143: Extract the key point information of the human body from the output of step S142, which includes transforming the coordinate system, filtering out key points with low detection confidence, and extracting the connection information between key points, and finally construct a human skeletal image sequence.
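The post-processing in S143 can be sketched on plain `(x, y, confidence)` tuples, the per-key-point form that OpenPose reports. The sketch does not call OpenPose itself. The threshold of 0.3, the root-relative transform chosen as the coordinate change, and the sample detections are all assumptions for illustration.

```python
from typing import Dict, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def filter_keypoints(
    frame: Dict[str, Keypoint], min_confidence: float = 0.3
) -> Dict[str, Tuple[float, float]]:
    """Keep only key points whose detection confidence reaches the threshold."""
    return {
        name: (x, y)
        for name, (x, y, conf) in frame.items()
        if conf >= min_confidence
    }


def to_root_relative(
    points: Dict[str, Tuple[float, float]], root: str = "torso"
) -> Dict[str, Tuple[float, float]]:
    """Shift coordinates so that the root key point sits at the origin."""
    rx, ry = points[root]  # raises KeyError if the root itself was filtered out
    return {name: (x - rx, y - ry) for name, (x, y) in points.items()}


# Hypothetical detections for one frame, in pixel coordinates.
frame = {
    "head": (320.0, 80.0, 0.95),
    "torso": (318.0, 200.0, 0.90),
    "l_hand": (250.0, 260.0, 0.12),  # low confidence, will be dropped
}
clean = to_root_relative(filter_keypoints(frame))
```

Applying the same two functions to every frame would yield the sequence of cleaned key point sets from which the skeletal sequence is built.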
4. The method for predicting human heterostructure motion data driven by graph attention network according to claim 1, characterized in that: Step S2 specifically includes:
S21: Classify the collected skeletal image sequences according to action type and performer identity to form multi-category data subsets;
S22: Apply data augmentation techniques, including rotation, scaling, and cropping, to the dataset of each category formed in S21, to increase data diversity during model training and create a standard graph dataset, where m and m' are the numbers of nodes in the graph dataset for different tasks, and every sample of every task contains the same set of motion nodes V with exactly the same adjacency matrix A;
S23: Randomly select a small number of samples from the augmented dataset of each category formed in S22 to form a multi-category few-shot training sample set for subsequent model learning;
S24: Each time the training samples formed in S23 are used, randomly select a sensor as the initial root node, then recursively add a subset of neighboring vertices to the graph, including all edges whose endpoints are in the current subset, thereby sampling an induced subgraph of the original human skeleton graph and forming a dataset containing heterostructure nodes. That is, the underlying graph structure and vertex set differ between tasks, while the support data and query data of the same task share the same graph structure and vertex set. The number of motion nodes C is not fixed but depends on the current task.
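The subgraph sampling in S24 can be sketched as follows. This is a sketch under assumptions: the graph, the `max_nodes` cap, and the rule for choosing how many neighbors to keep are hypothetical, and the recursion is written as an iterative breadth-first expansion.

```python
import random
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # node -> set of neighbouring nodes


def sample_induced_subgraph(
    graph: Graph, max_nodes: int, rng: random.Random
) -> Tuple[Set[int], Set[Tuple[int, int]]]:
    """Grow a connected node subset from a random root and return its induced edges."""
    root = rng.choice(sorted(graph))
    nodes: Set[int] = {root}
    frontier: List[int] = [root]
    while frontier and len(nodes) < max_nodes:
        current = frontier.pop(0)
        candidates = sorted(graph[current] - nodes)
        rng.shuffle(candidates)
        # Keep a random, non-empty subset of the unseen neighbours (if any).
        keep = candidates[: rng.randint(1, len(candidates))] if candidates else []
        for n in keep:
            if len(nodes) < max_nodes:
                nodes.add(n)
                frontier.append(n)
    # Induced subgraph: every original edge whose two endpoints were both kept.
    edges = {(u, v) for u in nodes for v in graph[u] if v in nodes and u < v}
    return nodes, edges


# Hypothetical 6-node skeleton graph (a small tree).
skeleton: Graph = {0: {1}, 1: {0, 2, 3}, 2: {1, 4}, 3: {1, 5}, 4: {2}, 5: {3}}
nodes, edges = sample_induced_subgraph(skeleton, max_nodes=4, rng=random.Random(0))
```

Drawing a fresh subgraph per task gives each task its own vertex set and structure, while the support and query data within that task would reuse the same sampled `nodes` and `edges`.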
5. The method for predicting human heterostructure motion data driven by graph attention network according to claim 1, characterized in that: Step S3 specifically includes:
S31: Design the graph attention layer, which learns the dependencies between nodes (i.e., the key points of the human body) through a self-attention mechanism, strengthening important features and suppressing irrelevant information so that key action features are automatically identified and enhanced. The input to the graph attention layer is a set of node features, where N is the number of nodes, K is the number of features per node, i is the node index, and T is the time dimension of the node features. The output of the layer is a new set of node features;
S32: Inject the graph structure into the mechanism by performing masked attention, that is, computing attention coefficients only for nodes j in the neighborhood of node i in the graph. A shared linear transformation parameterized by a weight matrix is applied to every node, and a shared self-attention mechanism is then executed on the nodes to calculate the attention coefficient, which indicates the importance of the features of node j to node i. The coefficient is computed from the feature vector of the preceding node and the feature vector of the following node, and the result is the new node feature vector;
S33: To make the coefficients comparable across different nodes, normalize them over all choices of node j using the softmax function. The attention mechanism a is a single-layer feedforward neural network, parameterized by a weight vector and followed by the LeakyReLU nonlinear function. In the resulting expression for the attention coefficients, the superscript T denotes transposition and the remaining operator denotes concatenation;
S34: Once obtained, the normalized attention coefficients are used to compute a linear combination of the corresponding features, which serves as the final output feature of each node. To stabilize the self-attention learning process, the layer is extended to a multi-head attention mechanism, in which the features produced by K independent attention mechanisms are concatenated to form the output feature representation. In this representation, the normalized attention coefficients are those computed by the k-th attention mechanism, each of which has its own weight matrix for the corresponding linear transformation of the input. The final returned output contains the concatenated features for each node;
S35: For heterogeneous graphs, represent the nodes and edges of the heterogeneous graph as low-dimensional vectors, and then use these vectors to carry out the subsequent task.
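Steps S32 to S34 can be sketched for a single attention head as below. The sketch follows the commonly published graph attention formulation (a shared linear transform, a LeakyReLU-scored concatenation, and a softmax masked to each neighborhood), which the wording of the claim appears to match; it is an assumption that the claimed formulas take this form. The helper names, the LeakyReLU slope of 0.2, the self-loops, and the toy inputs are all hypothetical, and the output nonlinearity and multi-head concatenation are omitted.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, h: Vector) -> Vector:
    return [sum(wk * hk for wk, hk in zip(row, h)) for row in w]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_layer(
    features: Dict[int, Vector],
    neighbours: Dict[int, List[int]],
    weight: Matrix,
    attn: Vector,
) -> Dict[int, Vector]:
    """Single-head graph attention: masked softmax over each node's neighbourhood."""
    transformed = {i: matvec(weight, h) for i, h in features.items()}
    output: Dict[int, Vector] = {}
    for i, nbrs in neighbours.items():
        # Unnormalised score: LeakyReLU(a^T [W h_i || W h_j]) for j in N_i only.
        scores = [
            leaky_relu(sum(a * v for a, v in zip(attn, transformed[i] + transformed[j])))
            for j in nbrs
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        dim = len(transformed[i])
        output[i] = [
            sum(alpha * transformed[j][k] for alpha, j in zip(alphas, nbrs))
            for k in range(dim)
        ]
    return output


# Hypothetical 3-node chain 0 - 1 - 2; each neighbourhood includes the node itself.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity transform, for illustration
a = [0.0, 0.0, 0.0, 0.0]      # zero attention vector gives uniform weights
new_features = gat_layer(features, neighbours, W, a)
```

A multi-head variant in the sense of S34 would run this function K times with separate `weight` and `attn` parameters and concatenate the per-node outputs.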
6. The method for predicting human heterostructure motion data driven by graph attention network according to claim 5, characterized in that: Step S35 specifically includes:
S351: Use heterogeneous attention to compute the weights between nodes, where the computation depends on the type of the edge, the representation of node i with respect to that edge type, and a weight matrix;
S352: Integrate the different types of nodes and edges through heterogeneous attention pooling, which aggregates over the set of neighboring nodes of node i with respect to each edge type.
7. The method for predicting human heterostructure motion data driven by graph attention network according to claim 1, characterized in that: Step S5 specifically includes:
S51: Calculate the deviation between the action sequence predicted by the model and the actual action sequence, using the angle deviation as the evaluation criterion to quantify the prediction accuracy;
S52: Use the mean absolute angle error (MAAE) as the performance evaluation index, and take the accuracy of the prediction results into account when optimizing and adjusting the model, so as to reduce prediction bias and improve prediction performance. The specific calculation method is as follows:
$$\mathrm{MAAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\theta_i - \hat{\theta}_i\right|$$
where n is the total number of samples, $\theta_i$ is the i-th actual angle value, $\hat{\theta}_i$ is the i-th predicted angle value, and $\left|\theta_i - \hat{\theta}_i\right|$ is the absolute value of the difference between the actual and predicted angle values.