Human motion prediction method, device, equipment and storage medium
A target motion prediction model constructed with a joint attention mechanism in the encoding unit addresses the low accuracy of human motion prediction in existing technology and achieves more accurate human motion prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN UNIV
- Filing Date
- 2022-12-16
- Publication Date
- 2026-06-12
Figure CN116189284B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the technical field of motion prediction, and more particularly to a method, apparatus, device, and storage medium for predicting human motion.
Background Technology
[0002] Human motion prediction is the process of predicting future human motion based on observations of human movements over a past period. It can predict future posture sequences, playing a crucial role in numerous fields, such as autonomous driving, human-computer interaction, and pedestrian tracking.
[0003] In recent years, neural networks have been increasingly used in human motion prediction. Currently, existing neural network models typically take a single past action sequence as input and predict a single future action sequence, processing different time series of different actions sequentially. This results in the neural network model used for motion prediction only focusing on a single action sequence during the prediction process, leading to low accuracy in human motion prediction results.
Summary of the Invention
[0004] The main objective of this invention is to provide a method, apparatus, device, and computer-readable storage medium for predicting human motion, with the aim of improving the accuracy of human motion prediction.
[0005] To achieve the above objectives, the present invention provides a method for predicting human motion, which includes the following steps:
[0006] Obtain the observed action sequence of the action to be predicted;
[0007] Obtain the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequence of each training action in the training dataset used to train the target motion prediction model;
[0008] The observed action sequence of the action to be predicted and the observed action sequence of each of the training actions are input into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions.
[0009] The prediction sequence corresponding to the action to be predicted in each of the prediction sequences is determined as the motion prediction result.
[0010] Optionally, before the step of obtaining the observed action sequence of the action to be predicted, the method further includes:
[0011] Obtain the initial motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequences of each category of training actions in the training dataset;
[0012] The observed action sequences of each training action are input into the initial motion prediction model, and feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action.
[0013] Each of the learning features is input into the decoding unit of the initial motion prediction model to obtain the training result corresponding to each of the training actions.
[0014] Based on the loss function, the model parameters in the initial motion prediction model are adjusted according to the training results to obtain the target motion prediction model.
[0015] Optionally, the encoding unit includes an attention network and a cascaded network;
[0016] The step of inputting the observed action sequences of each of the training actions into the initial motion prediction model, and extracting features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each of the training actions includes:
[0017] The observed action sequences of each of the training actions are input together into the encoding unit of the initial motion prediction model;
[0018] Based on the observed action sequences of each training action, the sequence features of each training action and the fusion features corresponding to each training action are extracted by the attention network of the encoding unit.
[0019] The sequence features of each training action and the fusion features corresponding to each training action are input into the cascaded network for learning, thereby obtaining the learning features of each training action.
[0020] Optionally, the step of extracting the sequence features of each training action and the corresponding fusion features of each training action through the attention network of the encoding unit based on the observed action sequences of each training action includes:
[0021] The observed action sequence of each training action is divided into three parts: query, key, and value.
[0022] For any target action sequence in the observed action sequence of each training action, the correlation weight between each first action sequence and the target action sequence is calculated based on the query of the target action sequence and the key of each first action sequence, wherein the first action sequence is the action sequence other than the target action sequence in the observed action sequence of each training action.
[0023] The values of the first action sequence corresponding to each of the aforementioned correlation weights are weighted by the respective correlation weights.
[0024] The weighted values of each of the first action sequences are fused to obtain the fusion feature of the training action corresponding to the target action sequence.
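The four attention steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes each observed action sequence has already been summarized as one query, one key, and one value vector, and it assumes scaled dot-product scores normalized with a softmax, since the text only states that the correlation weight is calculated from the query and the keys. The function name `fuse_across_actions` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_across_actions(queries, keys, values):
    """Fuse information from the other action sequences into each target.

    queries, keys, values: arrays of shape (num_actions, dim), one row per
    observed action sequence (assumed pre-computed; num_actions >= 2).
    Returns (fusion_features, weights).
    """
    n, d = queries.shape
    # Correlation score between every target (row) and every other sequence (column).
    scores = queries @ keys.T / np.sqrt(d)
    # A target sequence does not attend to itself: only "first action sequences" count.
    np.fill_diagonal(scores, -np.inf)
    weights = softmax(scores, axis=1)
    # Weight the values of the other sequences and sum them into one fusion feature.
    return weights @ values, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
fusion, weights = fuse_across_actions(q, k, v)
```

Each row of `weights` sums to one and has a zero on the diagonal, so the fusion feature of a target sequence is built only from the other sequences.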
[0025] Optionally, the cascaded network includes a GCN (Graph Convolutional Network) network and a GRU (Gated Recurrent Unit) network; the step of inputting the sequence features of each of the training actions and the fusion features corresponding to each of the training actions into the cascaded network for learning to obtain the learning features of each of the training actions includes:
[0026] The sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network.
[0027] By using the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action are learned, so as to obtain the reinforcement features of each training action.
[0028] Each of the aforementioned reinforcement features is input into the GRU network of the cascaded network;
[0029] The time dependency information of each reinforcement feature is learned through the GRU network of the cascaded network to obtain the learning features of each training action.
[0030] Optionally, a specific bias matrix is introduced into the GCN network of the cascaded network; wherein, the specific bias matrix is obtained based on the cosine correlation between the velocity vectors of each joint node in the training action; the step of learning the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action through the GCN network of the cascaded network to obtain the reinforcement features of each training action includes:
[0031] By introducing a GCN network with a specific bias matrix, the correlation between each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, and the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, so as to obtain the reinforcement features of each training action.
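One way to compute a bias matrix from the cosine correlation between joint velocity vectors is sketched below. The details are assumptions made for illustration: velocities are taken as frame-to-frame position differences, each joint's velocities are flattened over time into a single vector, and the function name `velocity_cosine_bias` is hypothetical. How the matrix is combined with the GCN weights is not specified in this passage and is left out.

```python
import numpy as np

def velocity_cosine_bias(positions, eps=1e-8):
    """positions: (frames, joints, 3) joint coordinates of one training action.

    Returns a (joints, joints) matrix whose entry (i, j) is the cosine
    similarity between the velocity vectors of joint i and joint j.
    """
    # Velocity as the difference between consecutive frames (assumption).
    vel = np.diff(positions, axis=0)                      # (frames-1, joints, 3)
    # One flattened velocity vector per joint.
    per_joint = vel.transpose(1, 0, 2).reshape(vel.shape[1], -1)
    norms = np.linalg.norm(per_joint, axis=1, keepdims=True)
    unit = per_joint / np.maximum(norms, eps)             # avoid division by zero
    return unit @ unit.T

# Toy example: joints 0 and 1 move in the same direction, joint 2 in the opposite one.
t = np.arange(4, dtype=float)
positions = np.zeros((4, 3, 3))
positions[:, 0, 0] = t
positions[:, 1, 0] = 2 * t
positions[:, 2, 0] = -t
bias = velocity_cosine_bias(positions)
```

In the toy example, joints moving together receive a value near 1 and joints moving in opposite directions a value near -1, which is the kind of correlation the bias matrix is meant to expose to the GCN.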
[0032] Optionally, the decoding unit includes a GRU network; the step of inputting each of the learning features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each of the training actions includes:
[0033] Each of the learning features is input into the GRU network of the decoding unit;
[0034] The GRU network of the decoding unit recursively generates the training results corresponding to each training action.
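Recursive generation by a GRU decoder can be sketched as below. This is an illustrative sketch under stated assumptions: a standard GRU cell without bias terms, a linear output projection `Wo` from hidden state to pose, and the last observed pose as the first input. The parameter names and the `decode` function are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    # Standard GRU update (bias terms omitted for brevity).
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])            # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])            # reset gate
    h_new = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])  # candidate state
    return (1 - z) * h + z * h_new

def decode(last_pose, h, p, steps):
    """Generate `steps` future poses, feeding each output back as the next input."""
    poses, x = [], last_pose
    for _ in range(steps):
        h = gru_cell(x, h, p)
        x = h @ p["Wo"]          # project the hidden state to the next pose
        poses.append(x)
    return np.stack(poses)

rng = np.random.default_rng(0)
pose_dim, hidden = 6, 8
shapes = {"Wz": (pose_dim, hidden), "Uz": (hidden, hidden),
          "Wr": (pose_dim, hidden), "Ur": (hidden, hidden),
          "Wh": (pose_dim, hidden), "Uh": (hidden, hidden),
          "Wo": (hidden, pose_dim)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
# The initial hidden state stands in for a learning feature from the encoding unit.
future = decode(np.zeros(pose_dim), rng.normal(size=hidden), params, steps=5)
```

Because each predicted pose becomes the input of the next step, the prediction horizon can be chosen at decoding time without changing the parameters.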
[0035] Furthermore, to achieve the above objectives, the present invention also provides a human motion prediction device, which includes:
[0036] The acquisition module is used to acquire the observed action sequence of the action to be predicted;
[0037] The acquisition module is also used to acquire the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and to acquire the observed action sequence of each training action in the training dataset used to train the target motion prediction model.
[0038] The prediction module is used to input the observed action sequence of the action to be predicted and the observed action sequence of each of the training actions into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions.
[0039] The determination module is used to determine the prediction sequence corresponding to the action to be predicted in each prediction sequence as the motion prediction result.
[0040] In addition, to achieve the above objectives, the present invention also provides a human motion prediction device, which includes a memory, a processor, and a human motion prediction program stored in the memory and executable on the processor. When the human motion prediction program is executed by the processor, it implements the steps of the above-described human motion prediction method.
[0041] In addition, to achieve the above objectives, the present invention also provides a computer-readable storage medium storing a human motion prediction program, which, when executed by a processor, implements the steps of the human motion prediction method described above.
[0042] In this invention, the observed action sequence of the action to be predicted is obtained, the target motion prediction model constructed by the joint attention mechanism of the encoding unit is obtained, and the observed action sequence of each training action in the training dataset used to train the target motion prediction model is obtained. The observed action sequence of the action to be predicted and the observed action sequences of each training action are input into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each training action. The prediction sequence corresponding to the action to be predicted in each prediction sequence is determined as the motion prediction result.
[0043] In this invention, the target motion prediction model simultaneously predicts the action to be predicted and the training actions. During the prediction process, the encoding unit of the joint attention mechanism in the target motion prediction model can focus on the correlation between the action to be predicted and each training action. Each training action in the training dataset provides auxiliary information for the action to be predicted, making the prediction process through the target motion prediction model more consistent with the actual movement patterns of the human body and improving the accuracy of human motion prediction.
Attached Figure Description
[0044] Figure 1 is a flowchart illustrating the first embodiment of the human motion prediction method of the present invention;
[0045] Figure 2 is a schematic diagram of the framework of an embodiment of the human motion prediction method of the present invention;
[0046] Figure 3 is a schematic diagram of the structure of an embodiment of the human motion prediction method of the present invention;
[0047] Figure 4 is a schematic diagram of the functional modules of the human motion prediction device involved in the embodiments of the present invention;
[0048] Figure 5 is a schematic diagram of the human motion prediction device involved in the embodiments of the present invention;
[0049] Figure 6 is a schematic diagram of the structure of a computer-readable storage medium involved in an embodiment of the present invention.
[0050] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0051] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0052] This invention provides a method for predicting human motion. Refer to Figure 1, which is a flowchart illustrating the first embodiment of the human motion prediction method of the present invention.
[0053] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0054] In this embodiment, the device executing the human motion prediction method of this invention can be a sensor for detecting human motion, such as an image sensor; or it can be a device that establishes a communication connection with the sensor for detecting human motion, such as a smartphone, PC (Personal Computer), tablet computer, portable computer, etc. For ease of description, the executing entity is omitted below. The human motion prediction method of this embodiment includes:
[0055] Step S10: Obtain the observed action sequence of the action to be predicted;
[0056] In this embodiment, the action for which the corresponding future action needs to be predicted is called the action to be predicted, and the sequence of that action that has already been observed is called the observed action sequence.
[0057] Specifically, in this embodiment, the observed action sequence of the action to be predicted is acquired. In a specific implementation, the observed action sequence can be acquired through an image sensor. In this implementation, the observed action sequence can be a series of consecutive frames of images of the action to be predicted.
[0058] Step S20: Obtain the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequences of each training action in the training dataset used to train the target motion prediction model;
[0059] In this embodiment, the trained human motion prediction model is referred to as the target motion prediction model, and the target motion prediction model is obtained. Specifically, in this embodiment, the target motion prediction model is constructed using a joint attention mechanism within an encoder-decoder framework. More specifically, it is a joint attention mechanism between the encoder units in the encoder-decoder framework. The target motion prediction model constructed using this joint attention mechanism can simultaneously predict future action sequences of multiple actions. The attention mechanism allows the target motion prediction model to focus on the correlation between the action to be predicted and the training action.
[0060] Specifically, in this embodiment, the observed action sequences of each action (hereinafter referred to as training actions for distinction) in the training dataset used to train the target motion prediction model are obtained.
[0061] Step S30: Input the observed action sequence of the action to be predicted and the observed action sequence of each of the training actions into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions.
[0062] In this embodiment, the target motion prediction model can simultaneously predict future action sequences of multiple actions. The observed action sequence of the action to be predicted, along with the observed action sequences of each training action, is input into the target motion prediction model. The target motion prediction model then outputs the prediction sequences corresponding to the action to be predicted and each training action.
[0063] In a specific implementation, the process of inputting the observed action sequence of the action to be predicted and the observed action sequences of each training action into the target motion prediction model to obtain the corresponding prediction sequences can be as follows. First, the observed action sequence of the action to be predicted and the observed action sequences of each training action are input into the target motion prediction model. Next, features are extracted from each observed action sequence through the encoding unit of the joint attention mechanism in the target motion prediction model to obtain the features of each training action and of the action to be predicted, wherein the attention mechanism can focus on the correlation between the observed action sequence of the action to be predicted and the observed action sequence of each training action. Finally, the features of each training action and of the action to be predicted are input into the decoding unit of the target motion prediction model to obtain the prediction sequences corresponding to each training action and the action to be predicted.
[0064] Step S40: Determine the prediction sequence corresponding to the action to be predicted in each prediction sequence as the motion prediction result.
[0065] In this embodiment, after obtaining the prediction sequence corresponding to the action to be predicted and each training action, the prediction sequence corresponding to the action to be predicted is determined from each prediction sequence as the motion prediction result.
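Steps S10 to S40 can be sketched as a short inference routine. The sketch assumes that all observed action sequences share the same shape so they can be stacked into one batch, and that the trained model is a callable returning one prediction sequence per input. `predict_motion` and `dummy_model` are hypothetical names, and `dummy_model` is only a stand-in that repeats the last observed frame, not the target motion prediction model.

```python
import numpy as np

def predict_motion(model, query_seq, training_seqs):
    """Predict the action to be predicted jointly with the training actions.

    query_seq: (frames, dims) observed sequence of the action to be predicted.
    training_seqs: list of (frames, dims) observed sequences of training actions.
    """
    # S30: feed the query and all training sequences to the model together.
    batch = np.stack([query_seq, *training_seqs])   # query sits at index 0
    predictions = model(batch)                      # one prediction per input
    # S40: keep only the prediction that belongs to the action to be predicted.
    return predictions[0]

def dummy_model(batch, horizon=5):
    # Stand-in model: repeat each sequence's last frame for `horizon` frames.
    return np.repeat(batch[:, -1:, :], horizon, axis=1)

rng = np.random.default_rng(0)
query = rng.normal(size=(10, 6))
training = [rng.normal(size=(10, 6)) for _ in range(3)]
result = predict_motion(dummy_model, query, training)
```

The predictions for the training actions are computed but discarded, which matches their role as auxiliary information for the action to be predicted.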
[0066] In this embodiment, the observed action sequence of the action to be predicted is acquired, the target motion prediction model constructed by the joint attention mechanism of the encoding unit is acquired, and the observed action sequence of each training action in the training dataset used to train the target motion prediction model is acquired. The observed action sequence of the action to be predicted and the observed action sequences of each training action are input into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each training action. The prediction sequence corresponding to the action to be predicted in each prediction sequence is determined as the motion prediction result.
[0067] In this embodiment, the target motion prediction model simultaneously predicts the action to be predicted and the training actions. During the prediction process, the encoding unit of the joint attention mechanism in the target motion prediction model can focus on the correlation between the action to be predicted and each training action. Each training action in the training dataset provides auxiliary information for the action to be predicted, making the prediction process through the target motion prediction model more consistent with the actual movement patterns of the human body and improving the accuracy of human motion prediction.
[0068] Furthermore, based on the first embodiment described above, a second embodiment of the human motion prediction method of the present invention is proposed. In this embodiment, before step S10 described above, the human motion prediction method further includes:
[0069] Step S50: Obtain the initial motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequences of each category of training actions in the training dataset;
[0070] In this embodiment, the motion prediction model that has not been trained is referred to as the initial motion prediction model. In this embodiment, the initial prediction model is an encoder-decoder framework, wherein the encoder unit is constructed using a joint attention mechanism. When performing feature extraction, the encoder unit using the joint attention mechanism enables the initial motion prediction model to simultaneously learn multiple different observed action sequences. Specifically, in this embodiment, the initial motion prediction model constructed by the encoder unit using the joint attention mechanism is obtained.
[0071] In this embodiment, the observed action sequences for each training action in the training dataset are obtained. The training dataset includes training actions of multiple categories, such as walking, running, and jumping. In a specific implementation, the training dataset can be constructed from the classified training actions once the action classification network has been trained. The construction method is not limited here and can be configured according to actual needs.
[0072] Step S60: Input the observed action sequences of each training action into the initial motion prediction model, and extract features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action.
[0073] In this embodiment, after obtaining the initial motion prediction model and the observed action sequences of each training action in the training dataset, the observed action sequences of each training action are input into the initial motion prediction model, and the initial motion prediction model is trained using the observed action sequences of each training action.
[0074] Specifically, in this embodiment, the observed action sequences of each training action are input into the initial motion prediction model, and feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the features of each training action (hereinafter referred to as learning features for distinction).
[0075] Step S70: Input each of the learning features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each of the training actions;
[0076] In this embodiment, feature extraction is performed by the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action. Then, each learning feature is input into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each training action.
[0077] Step S80: Based on the loss function, adjust the model parameters in the initial motion prediction model according to the training results to obtain the target motion prediction model.
[0078] In this embodiment, after inputting all observed action sequences into the initial motion prediction model to obtain the training results corresponding to each training action, the model parameters in the initial motion prediction model are adjusted based on the loss function and the training results to obtain the target motion prediction model.
[0079] Furthermore, in some feasible embodiments, before step S80 above (adjusting the model parameters in the initial motion prediction model based on the loss function and according to each of the training results to obtain the target motion prediction model), the human motion prediction method further includes:
[0080] The training results output from the initial motion prediction model are input into the trained action classification network, and the training actions with known action categories corresponding to each training result are also input into the action classification network. In a specific implementation, the action classification network can be a network built based on an MLP (Multilayer Perceptron), and the specific configuration can be set according to actual needs, without limitation here.
[0081] The action classification network outputs the category of each training result (hereinafter referred to as the training category for distinction), and also outputs the action category of the corresponding training action for each training result (hereinafter referred to as the observed category for distinction).
[0082] Based on the training category and observed category corresponding to each training result, the cross-entropy loss of the action classification network is determined. In this embodiment, the cross-entropy loss is added to the loss function.
[0083] Specifically, refer to Figure 2, which is a schematic diagram of the framework of an embodiment of the human motion prediction method of the present invention. As shown in Figure 2, in this embodiment, multiple observed action sequences are simultaneously input into the initial motion prediction model to obtain the training results corresponding to each training action.
[0084] The training results (i.e., the PF (prediction frames) shown in Figure 2) are input into the trained action classification network, and the observed actions of known action categories corresponding to each training result (i.e., the GT (ground truth) shown in Figure 2) are also input into the action classification network.
[0085] Based on the training results, the action classification network outputs the training category for each training result (i.e., the label (category) shown in Figure 2), and it also outputs the observed category corresponding to each training result (i.e., the label (category) shown for the GT in Figure 2).
[0086] In this embodiment, the loss function can be constructed by combining the cross-entropy loss of the action classification network with inherent information about the human body, such as the invariant length of the human skeleton. Specifically, in a feasible embodiment, a skeleton length invariance constraint based on this inherent information is added to the loss function. The loss function after adding the cross-entropy loss of the action classification network and the skeleton length invariance constraint is as follows:
[0087]
[0088] The loss function consists of two parts. The first part calculates the L2 norm of the difference between the predicted joint node coordinates and the ground truth coordinates, where T is the prediction time length and L is the number of joints. The second part calculates the average bone length error of the predicted action over time T, where N is the total number of bones, B is the bone length in the action sequence of the training action, and L is the bone length in the resulting action sequence. C represents the cross-entropy loss of the action classification network.
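The formula itself is not reproduced in this text, so the following is only a sketch of a loss that matches the verbal description: a mean L2 joint position error over T frames and L joints, a mean absolute bone length error over T frames and N bones, and the cross-entropy term C added on top. Equal weighting of the three terms, the mean normalization, the representation of bones as (parent, child) joint index pairs, and the name `prediction_loss` are all assumptions.

```python
import numpy as np

def prediction_loss(pred, target, bones, cross_entropy=0.0):
    """pred, target: (T, L, 3) joint coordinates; bones: list of (parent, child) pairs.

    cross_entropy: value C produced by the action classification network.
    """
    # Part 1: L2 distance between predicted and ground-truth joints, averaged over T and L.
    joint_err = np.linalg.norm(pred - target, axis=-1).mean()
    # Part 2: bone length error, averaged over T frames and N bones.
    parents, children = np.array(bones).T
    pred_len = np.linalg.norm(pred[:, children] - pred[:, parents], axis=-1)      # (T, N)
    true_len = np.linalg.norm(target[:, children] - target[:, parents], axis=-1)  # (T, N)
    bone_err = np.abs(pred_len - true_len).mean()
    # Equal weighting of the three terms is an assumption.
    return joint_err + bone_err + cross_entropy

# Toy skeleton: 3 joints in a chain, 4 frames.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3, 3))
bones = [(0, 1), (1, 2)]
shifted = target + np.array([1.0, 0.0, 0.0])   # rigid translation keeps bone lengths
loss_same = prediction_loss(target, target, bones, cross_entropy=0.25)
loss_shift = prediction_loss(shifted, target, bones)
```

A rigid translation of the whole skeleton changes the joint term but leaves the bone length term at zero, which shows that the second term penalizes only violations of the skeleton length constraint.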
[0089] It should be noted that in this embodiment, a skeleton length invariance constraint is added to the loss function, so that the initial motion prediction model can be adjusted to better conform to the human body structure during training. Furthermore, a cross-entropy loss is added to the loss function, and the training results of the initial motion prediction model are further corrected by the action classification network, so that the prediction results of the target motion prediction model are more accurate.
[0090] It should be noted that in this embodiment, an initial motion prediction model is constructed through a joint attention mechanism, which enables the initial motion prediction model to learn multiple different observed action sequences simultaneously. When learning a training action through the initial motion prediction model, it can pay attention to the information of other training actions, so that the initial motion prediction model can pay attention to the correlation between various training actions. This makes the prediction results of the target motion prediction model obtained from the training more in line with the movement laws of the human body, thereby improving the accuracy of motion prediction.
[0091] Furthermore, in some feasible embodiments, the encoding unit in the initial motion prediction model includes an attention network and a cascaded network.
[0092] In this embodiment, step S60 above (inputting the observed action sequences of each training action into the initial motion prediction model, and extracting features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action) includes:
[0093] Step S601: Input the observed action sequences of each of the training actions into the encoding unit of the initial motion prediction model;
[0094] In this embodiment, after obtaining the initial motion prediction model constructed by the joint attention mechanism of the encoding unit and obtaining the observed action sequences of each training action in the training dataset, the observed action sequences of each training action are input into the encoding unit of the initial motion prediction model to obtain the training results corresponding to each training action.
[0095] In this embodiment, the initial motion prediction model includes an encoding unit and a decoding unit. The encoding unit includes an attention network and a cascaded network for feature extraction. Specifically, all observed action sequences are input into the encoding unit of the initial motion prediction model.
[0096] Step S602: Based on the observed action sequences of each training action, the sequence features of each training action and the fusion features corresponding to each training action are extracted through the attention network of the encoding unit.
[0097] In this embodiment, the features of the observed action sequence of each training action are called sequence features, and the correlation features between each training action and other training actions are called fusion features.
[0098] Specifically, after inputting the observed action sequences of each training action into the encoding unit of the initial motion prediction model, the sequence features of each training action and the corresponding fusion features of each training action are extracted through the attention network of the encoding unit based on the observed action sequences of each training action.
[0099] In a specific implementation, the attention network is used to extract the fusion features of any action sequence (hereinafter referred to as the target action sequence for distinction) among the observed action sequences. This can be achieved by fusing the target action sequence with each first action sequence, that is, each observed action sequence other than the target action sequence, to obtain the fusion features. No specific restrictions are imposed here, and the settings can be made according to actual needs.
[0100] Step S603: Input the sequence features of each training action and the fusion features corresponding to each training action into the cascaded network for learning to obtain the learning features of each training action.
[0101] In this embodiment, based on the observed action sequences of each training action, the attention network of the encoding unit extracts the sequence features of each training action and the fusion features corresponding to each training action. Then, the sequence features of the observed action sequences of each training action and the fusion features corresponding to each training action are input into the cascaded network for learning, so as to obtain the features of each training action for training (hereinafter referred to as learning features for distinction).
[0102] In a specific implementation, the cascaded network can be a GCN network and a GRU network. In this implementation, the spatial dependency information of each joint node in the observed action is learned through the GCN network in the cascaded network, and the temporal dependency information of the training action is learned through the GRU network in the cascaded network.
[0103] It should be noted that the motion prediction model in this embodiment includes an encoding unit and a decoding unit. The encoding unit includes an attention network and a cascaded network. The attention network extracts the corresponding fusion features from the observed action sequences of each different training action, so that when learning a training action through the initial motion prediction model, it can pay attention to the information of other training actions. This allows the initial motion prediction model to pay attention to the correlation between various training actions, so that the prediction results of the trained target motion prediction model can better conform to the movement laws of the human body and improve the accuracy of motion prediction.
[0104] Further, in some feasible embodiments, step S602 above: based on the observed action sequences of each training action, extracting the sequence features of each training action and the fusion features corresponding to each training action through the attention network of the encoding unit, includes:
[0105] Step S6021: Divide the observed action sequence of each training action into three parts: query, key, and value;
[0106] In this embodiment, the fusion features corresponding to each training action are extracted through the attention network of the encoding unit. Specifically, in this embodiment, the observed action sequence of each training action is divided into three parts: query, key, and value.
[0107] In a specific implementation, the proportions of query, key, and value in the observed action sequence can be set according to actual needs. For example, in one implementation, the query can be the entire observed action sequence, the key can be the first 80% of the observed action sequence, and the value can be the last 20% of the observed action sequence. No specific restrictions are imposed here.
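The example proportions above can be sketched as simple slicing along the time axis. This is only an illustration under stated assumptions: the array layout (frames by features), the function name `split_qkv`, and the parameter `key_ratio` are hypothetical and do not come from the embodiment.

```python
import numpy as np

def split_qkv(observed, key_ratio=0.8):
    """Split one observed action sequence of shape (frames, features).

    Follows the example proportions in the text: the query is the whole
    sequence, the key is the first `key_ratio` of the frames, and the value
    is the remaining frames. `key_ratio` is a hypothetical parameter name.
    """
    num_frames = observed.shape[0]
    cut = int(round(num_frames * key_ratio))
    query = observed            # entire observed sequence
    key = observed[:cut]        # first 80% of the frames by default
    value = observed[cut:]      # last 20% of the frames by default
    return query, key, value

# 10 observed frames with 9 features per frame (illustrative sizes).
observed = np.arange(10 * 9, dtype=float).reshape(10, 9)
query, key, value = split_qkv(observed)
```

Other proportions can be tried by changing `key_ratio`, in line with the statement that no specific restriction is imposed.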
[0108] Step S6022: For any target action sequence in the observed action sequence of each training action, calculate the correlation weight between each first action sequence and the target action sequence based on the query of the target action sequence and the key of each first action sequence, wherein the first action sequence is the action sequence other than the target action sequence in the observed action sequence of each training action.
[0109] In this embodiment, after dividing each observed action sequence into three parts—query, key, and value—for any target action sequence in each observed action sequence, the correlation weight between each first action sequence and the target action sequence is calculated based on the query of the target action sequence and the key of each first action sequence. In this embodiment, the action sequences other than the target action sequence in the observed action sequences of each training action are referred to as first action sequences.
[0110] In a specific implementation, the correlation weight can be calculated as follows: the query of the target action sequence is compared with the key of each first action sequence to calculate the correlation score between each first action sequence and the target action sequence, and the correlation scores are then normalized to obtain the correlation weight between each first action sequence and the target action sequence.
[0111] Step S6023: Weight each element in the value of each first action sequence by the correlation weight corresponding to that first action sequence;
[0112] In this embodiment, for any target action sequence in each observed action sequence, based on the query of the target action sequence and the key of each first action sequence, the correlation weight between each first action sequence and the target action sequence is calculated, and then each element in the value of the first action sequence corresponding to each correlation weight is weighted by each correlation weight.
[0113] Step S6024: The weighted values of each of the first action sequences are fused to obtain the fusion feature of the training action corresponding to the target action sequence.
[0114] In this embodiment, after weighting each element in the value of the first action sequence corresponding to each correlation weight, the weighted values of each first action sequence are fused to obtain the fusion feature of the target action.
[0115] In a specific implementation, the extraction of the target action sequence through an attention network can refer to the formula of the attention mechanism, as follows:
[0116]
[0117] Where Q represents the query of the target action sequence, K is the key of a first action sequence, V is the value of that first action sequence, and softmax is the normalization operation. In this embodiment, the query of the target action sequence is compared with the key of each first action sequence to calculate the correlation score between each first action sequence and the target action sequence. After the correlation scores are normalized, the correlation weight between each first action sequence and the target action sequence is obtained. The elements in the value of the first action sequence corresponding to each correlation weight are weighted according to the above formula. It can be seen that the greater the similarity between Q and K, the greater the weight between the first action sequence and the target action sequence, that is, the greater the correlation between the first action sequence and the target action sequence.
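Steps S6022 to S6024 can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes the commonly used scaled dot-product form, softmax(QK^T / sqrt(d)) V. Three further assumptions are made for brevity: each sequence's query, key, and value has already been reduced to a fixed-length vector, the fusion is a weighted sum, and the name `fusion_feature` is hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def fusion_feature(queries, keys, values, target):
    """Fuse the values of all other sequences into one feature for `target`.

    queries, keys: (num_sequences, d); values: (num_sequences, d_v).
    Each row is assumed to be a fixed-length summary of one observed action
    sequence (an assumption made to keep the sketch short).
    """
    # Indices of the first action sequences (every sequence except the target).
    others = [i for i in range(len(queries)) if i != target]
    d = queries.shape[1]
    # Correlation scores: target query against each first-sequence key.
    scores = keys[others] @ queries[target] / np.sqrt(d)
    weights = softmax(scores)               # correlation weights, sum to 1
    # Weight each value by its correlation weight and sum (assumed fusion).
    fused = weights @ values[others]
    return weights, fused

rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 4))
keys = rng.normal(size=(3, 4))
values = rng.normal(size=(3, 5))
weights, fused = fusion_feature(queries, keys, values, target=0)
```

With three observed action sequences and target index 0, `weights` holds two correlation weights, one for each first action sequence, and `fused` is the corresponding fusion feature.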
[0118] It should be noted that the attention network extracts corresponding fusion features for each different observed action sequence, enabling it to pay attention to information about other training actions when learning a training action through the initial motion prediction model. This allows the initial motion prediction model to focus on the correlation between various training actions, making the prediction results of the trained target motion prediction model more consistent with the laws of human movement and improving the accuracy of motion prediction.
[0119] Furthermore, in some feasible embodiments, the cascaded network includes a GCN network and a GRU network. In this embodiment, step S603 above: inputting the sequence features of each training action and the fusion features corresponding to each training action into the cascaded network for learning, to obtain the learning features of each training action, includes:
[0120] Step S6031: Input the sequence features of each training action and the fusion features corresponding to each training action into the GCN network of the cascaded network;
[0121] In this embodiment, the cascaded network includes a GCN network and a GRU network. In this embodiment, the sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network.
[0122] Step S6032: Through the GCN network of the cascaded network, learn the sequence features of each training action and the spatial dependency information of each joint node in the fusion features corresponding to each training action, and obtain the reinforcement features of each training action.
[0123] In this embodiment, after the sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network, the GCN network learns the spatial dependency information of each joint node in the sequence features and fusion features corresponding to each training action, so as to obtain the features of each training action after the spatial dependency information has been strengthened (hereinafter referred to as reinforcement features for distinction).
[0124] In this embodiment, the calculation formula for the GCN network in the coding unit is as follows:
[0125] $X^{(l+1)} = \sigma\left(A X^{(l)} W^{(l)} + b^{(l)}\right)$
[0126] Where $X^{(l)}$ represents the sequence features and fusion features at the $l$-th layer, $\sigma$ represents the sigmoid nonlinear transformation, $A$ represents a trainable adjacency matrix used to characterize the correlation between nodes, and $W^{(l)}$ and $b^{(l)}$ represent the weights and biases of the $l$-th layer, respectively.
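A single layer of this GCN formula can be written in a few lines of NumPy. The sketch below is illustrative only: the matrix sizes are arbitrary, and the adjacency matrix is drawn at random as a stand-in for a trained one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcn_layer(X, A, W, b):
    """One graph-convolution layer: sigmoid(A @ X @ W + b).

    X: (num_joints, in_features) features of layer l
    A: (num_joints, num_joints) trainable adjacency matrix
    W: (in_features, out_features) layer weights
    b: (out_features,) layer bias
    """
    return sigmoid(A @ X @ W + b)

rng = np.random.default_rng(0)
num_joints, in_features, out_features = 5, 6, 4   # illustrative sizes
X = rng.normal(size=(num_joints, in_features))
A = rng.normal(size=(num_joints, num_joints))
W = rng.normal(size=(in_features, out_features))
b = np.zeros(out_features)
X_next = gcn_layer(X, A, W, b)
```

Multiplying by A mixes the features of correlated joints, which is how the layer captures spatial dependency between joint nodes.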
[0127] Step S6033: Input each of the reinforcement features into the GRU network of the cascaded network;
[0128] In this embodiment, the GCN network of the cascaded network learns the sequence features of each training action and the spatial dependency information of each joint node in the fusion features corresponding to each training action. After obtaining the reinforcement features of each training action, the reinforcement features are input into the GRU network of the cascaded network.
[0129] Step S6034: Learn the time dependency information of each reinforcement feature through the GRU network of the cascaded network to obtain the learning features of each training action.
[0130] In this embodiment, each reinforcement feature is input into the GRU network of the cascaded network, and the time dependency information of each reinforcement feature is learned through the GRU network of the cascaded network to obtain the learning features of each training action.
[0131] Specifically, in this embodiment, the calculation formula for the GRU network in the coding unit is as follows:
[0132] $r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t] + b_r\right)$
[0133] $z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t] + b_z\right)$
[0134]
[0135]
[0136] Here, $z$ represents the update gate, which, as a weight, controls the degree to which information from the previous state is added to the current state. A larger update gate value indicates that more information from the previous state is added. $r$ represents the reset gate, which controls how much information from the previous state is written into the current candidate hidden state. $h_t$ represents the learning features at time $t$. In the last step of the formula, the information to be retained and updated is determined by the weight of the update gate and passed to the next unit. $W$ and $b$ represent the weights and biases of their respective gates.
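One GRU step matching the two gate formulas and the description above can be sketched as follows. The last two formulas are not reproduced in this text, so the candidate state and the final combination below are assumptions based on the standard GRU and on the statement that a larger update gate keeps more of the previous state. The names `W_h`, `b_h`, and `init_params` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, params):
    """One GRU step over the concatenation [h_{t-1}, x_t]."""
    W_r, b_r, W_z, b_z, W_h, b_h = params
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat + b_r)          # reset gate
    z_t = sigmoid(W_z @ concat + b_z)          # update gate
    # Assumed candidate state: the reset gate scales the previous state.
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_cand = np.tanh(W_h @ concat_reset + b_h)
    # Assumed combination: a larger z_t keeps more of the previous state.
    return z_t * h_prev + (1.0 - z_t) * h_cand

def init_params(hidden, inputs, rng):
    """Hypothetical helper: small random weights and zero biases."""
    shape = (hidden, hidden + inputs)
    params = []
    for _ in range(3):
        params += [0.1 * rng.normal(size=shape), np.zeros(hidden)]
    return tuple(params)

rng = np.random.default_rng(0)
hidden, inputs = 4, 3                          # illustrative sizes
params = init_params(hidden, inputs, rng)
h = np.zeros(hidden)
for x_t in rng.normal(size=(6, inputs)):       # 6 time steps of input features
    h = gru_step(h, x_t, params)
```

After the loop, `h` plays the role of the learning feature at the final time step.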
[0137] It should be noted that by using the cascaded GCN network, the spatial dependency information of each joint node in the fusion features of each training action is learned, thus obtaining the reinforcement features of each training action. The temporal dependency information of each reinforcement feature is then learned by the cascaded GRU network. This allows the initial motion prediction model and the trained target motion prediction model to take into account the temporal and spatial correlation information of each action sequence in the model. As a result, the target motion prediction model is more consistent with the temporal and spatial laws of human movement when making predictions, thereby improving the accuracy of predictions using the target motion prediction model.
[0138] Furthermore, in some feasible embodiments, a specific bias matrix is introduced into the GCN network of the cascaded network described above, wherein the specific bias matrix is obtained based on the cosine correlation between the velocity vectors of each joint node in the training action.
[0139] In this embodiment, step S6032 above: through the GCN network of the cascaded network, learning the sequence features of each training action and the spatial dependency information of each joint node in the fusion features corresponding to each training action, to obtain the reinforcement features of each training action, includes:
[0140] Step S60321: By introducing a GCN network with a specific bias matrix, learn the correlation between each joint node in the sequence features of each training action and the fusion features corresponding to each training action, and learn the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action, to obtain the reinforcement features of each training action.
[0141] Because human limbs differ in length, the joints also move at different speeds when the human body performs actions. Therefore, in this embodiment, the cosine correlation between the node velocities of the joint nodes is used as a specific bias matrix that is spatially and temporally related to the observed actions, and is added to the adjacency matrix of the GCN network.
[0142] Specifically, in this embodiment, the differential information between each joint node of the observed action sequence can be calculated, the node velocity vector can be calculated based on the differential information, and the cosine correlation can be calculated based on the node velocity vector. Specifically, the calculation of the cosine correlation is as follows:
[0143]
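The cosine correlation between joint velocity vectors can be sketched as below. The formula is not reproduced in this text, so the sketch assumes the usual cosine similarity, the dot product of two vectors divided by the product of their norms. It also assumes that the differential information is the difference between consecutive frames and that each joint's velocities are flattened over time into one vector. The name `cosine_bias_matrix` is hypothetical.

```python
import numpy as np

def cosine_bias_matrix(positions, eps=1e-8):
    """Cosine correlation between joint velocity vectors.

    positions: (frames, joints, 3) joint coordinates of an observed sequence.
    Returns a (joints, joints) matrix with entries in [-1, 1].
    """
    # Velocity as the difference between consecutive frames (an assumption).
    velocity = np.diff(positions, axis=0)               # (frames-1, joints, 3)
    # One velocity vector per joint, flattened over time and coordinates.
    per_joint = velocity.transpose(1, 0, 2).reshape(positions.shape[1], -1)
    norms = np.linalg.norm(per_joint, axis=1, keepdims=True)
    unit = per_joint / np.maximum(norms, eps)           # avoid division by zero
    return unit @ unit.T

rng = np.random.default_rng(0)
positions = rng.normal(size=(10, 5, 3))                 # illustrative sizes
A_cos_sim = cosine_bias_matrix(positions)
```

The result is symmetric with ones on the diagonal, since every joint's velocity vector is perfectly correlated with itself.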
[0144] In this embodiment, the GCN network into which the specific bias matrix is introduced learns the correlation between the joint nodes in the sequence features of each training action and the fusion features corresponding to each training action, and learns the spatial dependency information of each joint node in those features, so as to obtain the reinforcement features of each training action.
[0145] Specifically, in this embodiment, the GCN network formula after adding cosine correlation as a specific bias is as follows:
[0146] $X^{(l+1)} = \sigma\left(\left(A_{\text{cos\_sim}} + A\right) X^{(l)} W^{(l)} + b^{(l)}\right)$
[0147] Where $X^{(l)}$ represents the sequence features and fusion features at the $l$-th layer, $\sigma$ represents the sigmoid nonlinear transformation, $A_{\text{cos\_sim}}$ represents the specific bias matrix obtained from the cosine correlation, $A$ represents a trainable adjacency matrix used to characterize the correlation between nodes, and $W^{(l)}$ and $b^{(l)}$ represent the weights and biases of the $l$-th layer, respectively.
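The biased layer differs from the plain GCN layer only in the matrix that multiplies the features. A minimal sketch follows, with an identity matrix standing in for the cosine bias matrix; the function name `biased_gcn_layer` and all sizes are illustrative assumptions.

```python
import numpy as np

def biased_gcn_layer(X, A, A_cos_sim, W, b):
    """GCN layer whose adjacency is the trainable A plus a fixed bias matrix."""
    pre_activation = (A_cos_sim + A) @ X @ W + b
    return 1.0 / (1.0 + np.exp(-pre_activation))   # sigmoid

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 6))          # 5 joints, 6 features (illustrative)
A = rng.normal(size=(5, 5))          # trainable adjacency matrix
A_cos_sim = np.eye(5)                # stand-in for the cosine bias matrix
W = rng.normal(size=(6, 4))
b = np.zeros(4)
X_next = biased_gcn_layer(X, A, A_cos_sim, W, b)
```

Because the bias matrix is computed from the observed sequence rather than trained, it can differ from one action to another while A stays shared.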
[0148] It should be noted that, in this embodiment, considering the different correlations of joint nodes in different actions, a specific bias is added to the adjacency matrix of the GCN, so that the GCN can pay attention to the differences between different actions and sequences, thereby improving the prediction accuracy of the target motion prediction model.
[0149] Furthermore, in some feasible embodiments, the decoding unit includes a GRU network. In this embodiment, step S70 above: inputting each of the learned features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each of the training actions, includes:
[0150] Step S701: Input each of the learned features into the GRU network of the decoding unit;
[0151] In this embodiment, the decoding unit may include a GRU network. Specifically, after inputting all observed action sequences into the initial motion prediction model and extracting features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action, the learning features are input into the GRU network of the decoding unit.
[0152] Step S702: The training results corresponding to each training action are recursively generated through the GRU network of the decoding unit.
[0153] In this embodiment, after inputting all observed action sequences into the initial motion prediction model, feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action. Then, the learning features are input into the GRU network of the decoding unit to obtain the training results corresponding to each training action.
[0154] Specifically, in this embodiment, the training results corresponding to each training action are recursively generated by the GRU network of the decoding unit. The specific calculation formula of the GRU network can be referred to in step S6034 above, and will not be repeated here.
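Recursive generation by the decoder can be sketched as a loop that feeds each predicted pose back in as the next input. The embodiment does not state how the hidden state is initialized or how a pose is read out, so the sketch assumes the hidden state starts from the learning feature and uses a hypothetical linear readout `W_out`. Biases are omitted to keep it short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, W_r, W_z, W_h):
    """Bias-free GRU step, kept minimal for the sketch."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)
    z_t = sigmoid(W_z @ concat)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return z_t * h_prev + (1.0 - z_t) * h_cand

def decode(learning_feature, last_pose, weights, W_out, steps):
    """Recursively generate `steps` future poses.

    The hidden state starts from the encoder's learning feature, and each
    predicted pose is fed back as the next input. `W_out` is a hypothetical
    linear readout from hidden state to pose.
    """
    h, pose, predicted = learning_feature, last_pose, []
    for _ in range(steps):
        h = gru_step(h, pose, *weights)
        pose = W_out @ h              # predicted pose for this step
        predicted.append(pose)
    return np.stack(predicted)

rng = np.random.default_rng(0)
hidden, pose_dim, steps = 8, 6, 5     # illustrative sizes
weights = tuple(0.1 * rng.normal(size=(hidden, hidden + pose_dim)) for _ in range(3))
W_out = 0.1 * rng.normal(size=(pose_dim, hidden))
future = decode(rng.normal(size=hidden), rng.normal(size=pose_dim), weights, W_out, steps)
```

Each row of `future` is one generated frame, so the number of rows equals the prediction horizon.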
[0155] Furthermore, in a feasible embodiment, refer to Figure 3, which is a schematic diagram of an embodiment of the human motion prediction method of the present invention. In this embodiment, three observed actions are used as model inputs. As shown in Figure 3, the initial motion prediction model is an encoder-decoder framework. The encoding unit includes an attention network and a cascaded network of GCN and GRU, and the decoding unit includes a GRU network. In this embodiment, for the observed action sequence of each training action, the attention network fuses in the features of the other training actions. Specifically, when processing the observed action sequence of the first training action (i.e., the target action sequence), the attention network compares the query of the target action sequence with the key of the observed action sequence of each other training action (i.e., each first action sequence) to obtain the correlation weight of each first action sequence with respect to the target action sequence. Each correlation weight is then used to weight the value of the corresponding first action sequence, and the weighted values of the first action sequences serve as the fusion feature. The fusion feature and the sequence features of the target action sequence are fed into the subsequent cascaded network for learning.
[0156] It is understood that in this embodiment, when performing motion prediction for the action to be predicted, the observed action sequence of the training action can be processed by referring to the training process of this embodiment. Specifically, in this embodiment, the observed action sequence of the action to be predicted and the observed action sequence of each training action are input into the target motion prediction model.
[0157] The sequence features and fusion features of the action to be predicted and each training action are extracted by the coding units in the target motion prediction model. These features, along with their corresponding fusion features, are then input into a cascaded network for learning, resulting in the learned features of the action to be predicted and each training action. In this embodiment, the specific process of obtaining the learned features of the action to be predicted and each training action can be referred to in the various implementation methods described herein, and will not be elaborated upon here.
[0158] In this embodiment, to calculate the learning features of the action to be predicted and each training action, the sequence features and corresponding fusion features of the action to be predicted and each training action are input into the GCN network of the cascaded network. Through the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features and corresponding fusion features of the action to be predicted and each training action is learned, and the reinforcement features of the action to be predicted and each training action are obtained.
[0159] Each reinforcement feature is input into the GRU network of the cascaded network. The GRU network of the cascaded network learns the time dependency information of each reinforcement feature, and obtains the learning features of the action to be predicted and each training action.
[0160] Each learning feature is input into the GRU network of the decoding unit. The GRU network of the decoding unit recursively generates the prediction sequences corresponding to the action to be predicted and each training action. Among these prediction sequences, the one corresponding to the action to be predicted is determined as the motion prediction result.
[0161] In this embodiment, an initial motion prediction model constructed by the joint attention mechanism of the encoding unit is acquired, together with the observed action sequences of each category of training action in the training dataset. The observed action sequences of each training action are input into the initial motion prediction model to obtain the training results corresponding to each training action. Based on the loss function, the model parameters in the initial motion prediction model are adjusted according to the training results to obtain the target motion prediction model.
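The prediction flow described above, joint input followed by selection of one result, can be sketched as follows. The function `placeholder_model` is a hypothetical stand-in that only mimics the interface of the trained model (a batch of observed sequences in, one prediction sequence per input out). It is not the target motion prediction model, and the batch layout is an assumption.

```python
import numpy as np

def placeholder_model(sequences, horizon):
    """Stand-in for the trained model: repeats each sequence's last frame.

    It only mimics the interface (a batch of observed sequences in, one
    prediction sequence per input out); it is not the patented model.
    """
    last = sequences[:, -1:, :]                       # (batch, 1, features)
    return np.repeat(last, horizon, axis=1)           # (batch, horizon, features)

def predict_motion(to_predict, training_sequences, model, horizon):
    """Predict jointly with the training actions, then keep one result."""
    batch = np.concatenate([to_predict[None], training_sequences], axis=0)
    predictions = model(batch, horizon)               # one sequence per input
    return predictions[0]                             # index 0 = action to be predicted

rng = np.random.default_rng(0)
to_predict = rng.normal(size=(10, 9))                 # 10 frames, 9 features
training_sequences = rng.normal(size=(3, 10, 9))      # 3 training actions
result = predict_motion(to_predict, training_sequences, placeholder_model, horizon=4)
```

Placing the action to be predicted at a fixed index in the batch makes the final selection step a simple lookup.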
[0162] In this embodiment, a joint attention mechanism is used to construct an initial motion prediction model, which allows the initial motion prediction model to learn multiple different observed action sequences simultaneously. When learning a training action through the initial motion prediction model, it can pay attention to information about other training actions besides that training action, enabling the initial motion prediction model to pay attention to the correlation between various training actions. This makes the prediction results of the target motion prediction model obtained from the training more consistent with the movement patterns of the human body, thus improving the accuracy of motion prediction.
[0163] In addition, the present invention also provides a human motion prediction device. Refer to Figure 4, which is a functional module diagram of the human motion prediction device according to an embodiment of the present invention. The human motion prediction device of the present invention includes:
[0164] The acquisition module 10 is used to acquire the observed action sequence of the action to be predicted;
[0165] The aforementioned acquisition module 10 is also used to acquire the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and to acquire the observed action sequence of each training action in the training dataset used to train the target motion prediction model.
[0166] Prediction module 20 is used to input the observed action sequence of the action to be predicted and the observed action sequence of each of the training actions into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions.
[0167] The determining module 30 is used to determine the prediction sequence corresponding to the action to be predicted in each prediction sequence as the motion prediction result.
[0168] Furthermore, the aforementioned human motion prediction device also includes a training module, which is used for:
[0169] Obtain the initial motion prediction model constructed by the joint attention mechanism of the encoding units, and obtain the observed action sequences of each category of training actions in the training dataset;
[0170] The observed action sequences of each training action are input into the initial motion prediction model, and feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action.
[0171] Each of the learning features is input into the decoding unit of the initial motion prediction model to obtain the training result corresponding to each of the training actions.
[0172] Based on the loss function, the model parameters in the initial motion prediction model are adjusted according to the training results to obtain the target motion prediction model.
[0173] Furthermore, the encoding unit includes an attention network and a cascaded network, and the training module is also used for:
[0174] The observed action sequences of each of the training actions are input together into the encoding unit of the initial motion prediction model;
[0175] Based on the observed action sequences of each training action, the sequence features of each training action and the fusion features corresponding to each training action are extracted by the attention network of the encoding unit.
[0176] The sequence features of each training action and the fusion features corresponding to each training action are input into the cascaded network for learning, thereby obtaining the learning features of each training action.
[0177] Furthermore, the aforementioned training module is also used for:
[0178] The observed action sequence of each training action is divided into three parts: query, key, and value.
[0179] For any target action sequence in the observed action sequence of each training action, the correlation weight between each first action sequence and the target action sequence is calculated based on the query of the target action sequence and the key of each first action sequence, wherein the first action sequence is the action sequence other than the target action sequence in the observed action sequence of each training action.
[0180] The values of the first action sequence corresponding to each of the aforementioned correlation weights are weighted by the respective correlation weights.
[0181] The weighted values of each of the first action sequences are fused to obtain the fusion feature of the training action corresponding to the target action sequence.
[0182] Furthermore, the cascaded network includes a GCN network and a GRU network, and the training module is also used for:
[0183] The sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network.
[0184] By using the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action are learned, so as to obtain the reinforcement features of each training action.
[0185] Each of the aforementioned reinforcement features is input into the GRU network of the cascaded network;
[0186] The time dependency information of each reinforcement feature is learned through the GRU network of the cascaded network to obtain the learning features of each training action.
[0187] Furthermore, a specific bias matrix is introduced into the GCN network of the cascaded network;
[0188] The specific bias matrix is obtained based on the cosine correlation between the velocity vectors of each joint node in the training action;
[0189] The above training module is also used for:
[0190] By introducing a GCN network with a specific bias matrix, the correlation between each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, and the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, so as to obtain the reinforcement features of each training action.
[0191] Furthermore, the decoding unit includes a GRU network, and the training module is also used for:
[0192] Each of the learned features is input into the GRU network of the decoding unit;
[0193] The GRU network of the decoding unit recursively generates the training results corresponding to each training action.
[0194] Each functional module of the human motion prediction device performs the steps of the human motion prediction method described above during operation.
[0195] Furthermore, the present invention also provides a human motion prediction device. Refer to Figure 5, which is a schematic diagram of the human motion prediction device according to an embodiment of the present invention. Specifically, the human motion prediction device according to an embodiment of the present invention can be a device that runs the human motion prediction system locally.
[0196] As shown in Figure 5, the human motion prediction device of this embodiment may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).
[0197] The memory 1005 is disposed on the main body of the human motion prediction device. The memory 1005 stores a program that, when executed by the processor 1001, performs corresponding operations. The memory 1005 also stores parameters used by the human motion prediction device. The memory 1005 can be a high-speed RAM or a stable, non-volatile memory, such as a disk storage device. Optionally, the memory 1005 can also be a storage device independent of the aforementioned processor 1001.
[0198] Those skilled in the art will understand that the structure of the human motion prediction device shown in Figure 5 does not constitute a limitation on the human motion prediction device. It may include more or fewer components than shown, or combine certain components, or have different component arrangements.
[0199] As shown in Figure 5, the memory 1005, which serves as a storage medium, may include an operating system, a network processing module, a user interface module, and a human motion prediction program.
[0200] In the human motion prediction device shown in Figure 5, the processor 1001 can be used to call the human motion prediction program stored in the memory 1005 and perform the following operations:
[0201] Obtain the observed action sequence of the action to be predicted;
[0202] Obtain the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequence of each training action in the training dataset used to train the target motion prediction model;
[0203] The observed action sequence of the action to be predicted and the observed action sequence of each of the training actions are input into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions.
[0204] The prediction sequence corresponding to the action to be predicted in each of the prediction sequences is determined as the motion prediction result.
[0205] Furthermore, the processor 1001 can also be used to call the human motion prediction program stored in the memory 1005 and perform the following operations:
[0206] Obtain the initial motion prediction model constructed by the joint attention mechanism of the encoding units, and obtain the observed action sequences of each category of training actions in the training dataset;
[0207] The observed action sequences of each training action are input into the initial motion prediction model, and feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action.
[0208] Each of the learning features is input into the decoding unit of the initial motion prediction model to obtain the training result corresponding to each of the training actions.
[0209] Based on the loss function, the model parameters in the initial motion prediction model are adjusted according to the training results to obtain the target motion prediction model.
[0210] Further, the encoding unit includes an attention network and a cascaded network. The operation of inputting the observed action sequences of each training action into the initial motion prediction model, and extracting features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action, includes:
[0211] The observed action sequences of each of the training actions are input together into the encoding unit of the initial motion prediction model;
[0212] Based on the observed action sequences of each training action, the sequence features of each training action and the fusion features corresponding to each training action are extracted by the attention network of the encoding unit.
[0213] The sequence features of each training action and the fusion features corresponding to each training action are input into the cascaded network for learning, thereby obtaining the learning features of each training action.
[0214] Further, the operation of extracting the fusion features corresponding to each of the training actions through the attention network of the encoding unit includes:
[0215] The observed action sequence of each training action is divided into three parts: query, key, and value.
[0216] For any target action sequence in the observed action sequence of each training action, the correlation weight between each first action sequence and the target action sequence is calculated based on the query of the target action sequence and the key of each first action sequence, wherein the first action sequence is the action sequence other than the target action sequence in the observed action sequence of each training action.
[0217] The value of each first action sequence is weighted by the correlation weight calculated for that first action sequence.
[0218] The weighted values of each of the first action sequences are fused to obtain the fusion feature of the training action corresponding to the target action sequence.
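The fusion steps above may be sketched as follows. The sketch assumes that the query, key, and value of every observed action sequence are already available as flat vectors, that the correlation weight is a softmax over scaled dot products, and that fusion is a weighted sum. These choices and the function name `fusion_features` are assumptions; the description does not fix them.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_features(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """For each target sequence i, fuse the values of all OTHER sequences j,
    weighted by the correlation between query i and key j."""
    n = len(queries)
    if n < 2:
        raise ValueError("need at least two action sequences")
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    fused = []
    for i in range(n):
        others = [j for j in range(n) if j != i]          # the "first action sequences"
        weights = softmax([dot(queries[i], keys[j]) / scale for j in others])
        fused.append([
            sum(w * values[j][d] for w, j in zip(weights, others))
            for d in range(dim)
        ])
    return fused
```

Because the target sequence is excluded from `others`, the fusion feature of a training action carries information only from the other training actions.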
[0219] Further, the cascaded network includes a GCN network and a GRU network. The operation of inputting the sequence features of each training action and the fusion features corresponding to each training action into the cascaded network for learning to obtain the learning features of each training action includes:
[0220] The sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network.
[0221] By using the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, so as to obtain the enhancement features of each training action.
[0222] Each of the enhancement features is input into the GRU network of the cascaded network;
[0223] The time dependency information of each enhancement feature is learned through the GRU network of the cascaded network to obtain the learning features of each training action.
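The cascade of a GCN network followed by a GRU network may be sketched as follows. The sketch assumes that each frame is a matrix with one row of features per joint (for example, sequence features and fusion features concatenated per joint), that the graph convolution is `tanh(A_norm @ X @ W)` with a row-normalized skeleton adjacency, and that the GRU has no bias terms. The parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`) and the function `encode` are hypothetical.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Add self-loops, then row-normalize so each joint averages over its neighbours."""
    n = len(adj)
    loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in loops]


def gcn_layer(frame: Matrix, adj_norm: Matrix, weight: Matrix) -> Matrix:
    """Spatial step on one frame: tanh(A_norm @ X @ W)."""
    mixed = matmul(matmul(adj_norm, frame), weight)
    return [[math.tanh(v) for v in row] for row in mixed]


def gru_step(x: List[float], h: List[float], p: Dict[str, Matrix]) -> List[float]:
    """Temporal step: one GRU update without bias terms."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode(frames: List[Matrix], adj: Matrix, gcn_w: Matrix,
           gru_p: Dict[str, Matrix], hidden_size: int) -> List[float]:
    """GCN on every frame (enhancement features), then GRU across frames."""
    adj_norm = normalize_adjacency(adj)
    h = [0.0] * hidden_size
    for frame in frames:
        enhanced = gcn_layer(frame, adj_norm, gcn_w)
        x = [v for row in enhanced for v in row]  # flatten joints into one vector
        h = gru_step(x, h, gru_p)
    return h  # final hidden state plays the role of the learning feature
```

The split mirrors the description: the graph convolution mixes information between joints within a frame, and the recurrent update carries information from frame to frame.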
[0224] Furthermore, a specific bias matrix is introduced into the GCN network of the cascaded network, wherein the specific bias matrix is obtained based on the cosine correlation between the velocity vectors of each joint node in the training action;
[0225] The operation of learning, through the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action to obtain the enhancement features of each training action includes:
[0226] Through the GCN network into which the specific bias matrix is introduced, the correlations between the joint nodes and the spatial dependency information of each joint node are learned from the sequence features of each training action and the fusion features corresponding to each training action, so as to obtain the enhancement features of each training action.
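One possible way to compute such a bias matrix is sketched below. The sketch assumes that the velocity vector of a joint is the concatenation of its frame-to-frame displacements over the observed sequence, that a motionless joint is given a correlation of zero, and that the bias is added element-wise to the skeleton adjacency. The description states only that the matrix is based on the cosine correlation between joint velocity vectors, so these details and all function names are assumptions.

```python
import math
from typing import List

Frame = List[List[float]]   # frame[j] holds the coordinates of joint j
Matrix = List[List[float]]


def joint_velocities(seq: List[Frame]) -> List[List[float]]:
    """One velocity vector per joint: its displacements over consecutive frames."""
    velocities = []
    for j in range(len(seq[0])):
        v: List[float] = []
        for prev, cur in zip(seq, seq[1:]):
            v.extend(c - p for p, c in zip(prev[j], cur[j]))
        velocities.append(v)
    return velocities


def cosine(a: List[float], b: List[float], eps: float = 1e-8) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        return 0.0  # assumption: a motionless joint is treated as uncorrelated
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def bias_matrix(seq: List[Frame]) -> Matrix:
    """Entry (i, j) is the cosine correlation of the velocities of joints i and j."""
    v = joint_velocities(seq)
    return [[cosine(a, b) for b in v] for a in v]


def biased_adjacency(adj: Matrix, bias: Matrix) -> Matrix:
    """Assumed combination rule: element-wise sum of adjacency and bias."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(adj, bias)]
```

Joints that move in the same direction receive a bias close to 1 even when they are not connected in the skeleton, which is how a matrix of this kind can expose correlations that the fixed adjacency does not encode.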
[0227] Further, the decoding unit includes a GRU network, and the operation of inputting each of the learning features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each of the training actions includes:
[0228] Each of the learning features is input into the GRU network of the decoding unit;
[0229] The GRU network of the decoding unit recursively generates the training results corresponding to each training action.
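Recursive generation by the decoding unit may be sketched as follows. The sketch assumes that the learning feature initializes the hidden state of the GRU, that the last observed pose is the first input, that each output is produced as the previous pose plus a linear projection of the hidden state, and that each predicted pose is fed back as the next input. The residual output, the absence of bias terms, and the parameter names (including the output matrix `Wo`) are assumptions made for illustration.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: List[float], h: List[float], p: Dict[str, Matrix]) -> List[float]:
    """One GRU update without bias terms."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def decode(learning_feature: List[float], last_pose: List[float],
           params: Dict[str, Matrix], horizon: int) -> List[List[float]]:
    """Generate `horizon` future poses, feeding each prediction back as input."""
    h = list(learning_feature)
    pose = list(last_pose)
    predictions = []
    for _ in range(horizon):
        h = gru_step(pose, h, params)
        delta = matvec(params["Wo"], h)             # project hidden state to a pose offset
        pose = [p + d for p, d in zip(pose, delta)]  # residual update (assumption)
        predictions.append(pose)
    return predictions
```

Since every step depends only on the previous pose and hidden state, a longer prediction horizon extends a shorter one without changing its earlier frames.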
[0230] Furthermore, the present invention also provides a computer-readable storage medium. Referring to Figure 6, Figure 6 is a schematic diagram of the structure of a computer-readable storage medium according to an embodiment of the present invention. The computer-readable storage medium stores a human motion prediction program, which, when executed by a processor, implements the steps of the human motion prediction method described above.
[0231] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.
[0232] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0233] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a computer-readable storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a human motion prediction device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0234] The above are merely preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent structural or procedural transformations made based on the description and drawings of the present invention, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of the present invention.
Claims
1. A method for predicting human motion, characterized in that, The human motion prediction method includes: Obtain the observed action sequence of the action to be predicted; Obtain the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and obtain the observed action sequence of each training action in the training dataset used to train the target motion prediction model; The observed action sequence of the action to be predicted and the observed action sequence of each of the training actions are input into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions. The prediction sequence corresponding to the action to be predicted in each of the prediction sequences is determined as the motion prediction result; Prior to the step of obtaining the observed action sequence of the action to be predicted, the method further includes: Obtain the initial motion prediction model constructed by the joint attention mechanism of the encoding units, and obtain the observed action sequences of each category of training actions in the training dataset; The observed action sequences of each training action are input into the initial motion prediction model, and feature extraction is performed through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action. Each of the learning features is input into the decoding unit of the initial motion prediction model to obtain the training result corresponding to each of the training actions. Based on the loss function, the model parameters in the initial motion prediction model are adjusted according to the training results to obtain the target motion prediction model.
2. The human motion prediction method as described in claim 1, characterized in that, The encoding unit includes an attention network and a cascaded network; The step of inputting the observed action sequences of each of the training actions into the initial motion prediction model, and extracting features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each of the training actions includes: The observed action sequences of each of the training actions are input together into the encoding unit of the initial motion prediction model; Based on the observed action sequences of each training action, the sequence features of each training action and the fusion features corresponding to each training action are extracted by the attention network of the encoding unit. The sequence features of each training action and the fusion features corresponding to each training action are input into the cascaded network for learning, thereby obtaining the learning features of each training action.
3. The human motion prediction method as described in claim 2, characterized in that, The step of extracting the sequence features of each training action and the corresponding fusion features of each training action from the observed action sequence of each training action through the attention network of the encoding unit includes: The observed action sequence of each training action is divided into three parts: query, key, and value. For any target action sequence in the observed action sequence of each training action, the correlation weight between each first action sequence and the target action sequence is calculated based on the query of the target action sequence and the key of each first action sequence, wherein the first action sequence is the action sequence other than the target action sequence in the observed action sequence of each training action. The values of the first action sequence corresponding to each of the aforementioned correlation weights are weighted by the respective correlation weights. The weighted values of each of the first action sequences are fused to obtain the fusion feature of the training action corresponding to the target action sequence.
4. The human motion prediction method as described in claim 2, characterized in that, The cascaded network includes a GCN network and a GRU network; The step of inputting the sequence features of each training action and the fusion features corresponding to each training action into the cascaded network for learning to obtain the learning features of each training action includes: The sequence features of each training action and the fusion features corresponding to each training action are input into the GCN network of the cascaded network. By using the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, so as to obtain the enhancement features of each training action. Each of the enhancement features is input into the GRU network of the cascaded network; The time dependency information of each enhancement feature is learned through the GRU network of the cascaded network to obtain the learning features of each training action.
5. The human motion prediction method as described in claim 4, characterized in that, A specific bias matrix is introduced into the GCN network of the cascaded network; The specific bias matrix is obtained based on the cosine correlation between the velocity vectors of each joint node in the training action; The step of learning, through the GCN network of the cascaded network, the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action to obtain the enhancement features of each training action includes: Through the GCN network into which the specific bias matrix is introduced, the correlation between each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, and the spatial dependency information of each joint node in the sequence features of each training action and the fusion features corresponding to each training action is learned, so as to obtain the enhancement features of each training action.
6. The human motion prediction method as described in any one of claims 2 to 5, characterized in that, The decoding unit includes a GRU network; The step of inputting each of the learning features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each of the training actions includes: Each of the learning features is input into the GRU network of the decoding unit; The GRU network of the decoding unit recursively generates the training results corresponding to each training action.
7. A human motion prediction device, characterized in that, The human motion prediction device includes: The acquisition module is used to acquire the observed action sequence of the action to be predicted; The acquisition module is also used to acquire the target motion prediction model constructed by the joint attention mechanism of the encoding unit, and to acquire the observed action sequence of each training action in the training dataset used to train the target motion prediction model. The prediction module is used to input the observed action sequence of the action to be predicted and the observed action sequence of each of the training actions into the target motion prediction model to obtain the prediction sequence corresponding to the action to be predicted and each of the training actions. A determination module is used to determine the prediction sequence corresponding to the action to be predicted in each of the prediction sequences as the motion prediction result; The human motion prediction device further includes a training module, which is used to: acquire an initial motion prediction model constructed by the encoding unit and the joint attention mechanism, and acquire the observed action sequences of each category of training actions in the training dataset; input the observed action sequences of each training action into the initial motion prediction model, and extract features through the encoding unit of the joint attention mechanism in the initial motion prediction model to obtain the learning features of each training action; input the learning features into the decoding unit of the initial motion prediction model to obtain the training results corresponding to each training action; and adjust the model parameters in the initial motion prediction model according to the training results based on the loss function to obtain the target motion prediction model.
8. A human motion prediction device, characterized in that, The human motion prediction device includes: a memory, a processor, and a human motion prediction program stored in the memory and executable on the processor, the human motion prediction program being configured to implement the steps of the human motion prediction method as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a human motion prediction program, which, when executed by a processor, implements the steps of the human motion prediction method as described in any one of claims 1 to 6.