Method of recognizing action, method of training model, electronic device, program product
By combining graph convolutional networks and convolutional neural networks, and utilizing differential weight matrices and node-level residual processing, the problem of susceptibility to interference in action recognition in existing technologies is solved, achieving higher recognition accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING JINKANG NEW ENERGY VEHICLE CO LTD
- Filing Date
- 2026-05-12
- Publication Date
- 2026-06-19
Figure CN122244955A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a method for recognizing actions, a model training method, an electronic device, and a program product.
Background Technology
[0002] Human motion recognition, as an important technology in the field of automated video analytics, has applications in various scenarios, such as fall detection, dangerous behavior detection, and automotive aftermarket behavior detection.
[0003] The core challenge of action recognition technology lies in effectively extracting highly discriminative features from continuous video frame sequences. Early research mainly relied on handcrafted feature designs, such as optical flow histograms, motion boundary histograms, and 3D gradient orientation histograms.
[0004] In recent years, with the rise of deep learning methods, deep learning models have been used to perform action recognition tasks in order to improve their performance. However, deep learning models based on depth images are still susceptible to interference from real-world factors such as changes in lighting conditions, background occlusion, and camera shake.
[0005] Therefore, current methods use a skeleton-driven approach to abstract human movements into joint sequences and skeletal topology, effectively stripping away superficial information and focusing on essential movement features. Then, a Graph Convolutional Network (GCN) is used to map the non-Euclidean skeleton data to Euclidean space through convolution operations on the graph structure, and finally, action recognition is performed using the skeleton data features. However, the accuracy of action recognition using this method remains relatively low.
[0006] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0007] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not intended as a general commentary, nor is it intended to identify key / important components or describe the scope of protection of these embodiments, but rather as a prelude to the detailed description that follows.
[0008] This application provides a method for recognizing actions, a model training method, an electronic device, and a program product to improve the accuracy of recognizing human actions.
[0009] This application provides a method for recognizing actions, including: extracting features from multiple input human body structure diagrams using a graph convolutional network to obtain multiple first features, where different human body structure diagrams are determined from different video frames of a video to be judged, and each human body structure diagram uses human skeletal joints as graph nodes and human skeletal connections as graph edges; determining each first difference weight matrix based on each difference feature, where one difference feature is the difference between the two first features corresponding to a group of temporally adjacent human body structure diagrams; for each group of temporally adjacent human body structure diagrams: inputting the group into a convolutional neural network to obtain an off-diagonal difference weight matrix; calculating the node-level residuals of the group; adding the node-level residuals to the diagonal elements of the off-diagonal difference weight matrix to obtain a second update matrix; and symmetrizing and row-normalizing the second update matrix to obtain the second difference weight matrix corresponding to the group; based on each first difference weight matrix and each human body structure diagram, performing feature extraction using the graph convolutional network to obtain each first difference weight feature; based on each second difference weight matrix and each human body structure diagram, performing feature extraction using the graph convolutional network to obtain each second difference weight feature; and based on each first difference weight feature and each second difference weight feature, processing with a preset classification layer to obtain the human action.
[0010] In the above implementation, the graph convolutional network (GCN) can mine the relationships between nodes through information transmission between nodes. The first difference weight matrix, generated from the features produced by the GCN, can effectively eliminate noise caused by jitter and occlusion while maintaining the structural constraints of the pose skeleton. The convolutional neural network (CNN) can better focus on the dynamic changes of local features. Node-level residuals reflect the displacement amplitude of a single joint between two frames. Adding node-level residuals to the diagonal elements of the off-diagonal difference weight matrix allows the second update matrix to adaptively highlight joint pairs that undergo significant movement between adjacent frames. Symmetrizing and row-normalizing the second update matrix maintains the original feature distribution during the calculation. Combining the first and second difference weight matrices retains more feature information. Therefore, classifying on the first difference weight features, determined by the first difference weight matrix, together with the second difference weight features, determined by the second difference weight matrix, makes the recognized human actions more accurate.
[0011] Furthermore, a graph convolutional network is used to extract features from the input multiple human body structure images to obtain multiple first features, including: for each human body structure image: obtaining the adjacency matrix corresponding to the human body structure image; adding an identity matrix to the adjacency matrix to obtain a first update matrix; normalizing the first update matrix to obtain a first normalized matrix; and substituting the multiplication result of the first normalized matrix, the human body structure image, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the first feature corresponding to the human body structure image.
[0012] In the above implementation, considering that using only the adjacency matrix would lead to the neglect of node features, adding an identity matrix to the adjacency matrix ensures the transmission of node information. Furthermore, considering that an unnormalized matrix might alter the original distribution of features during matrix multiplication, normalizing the first update matrix allows subsequent calculations to more accurately reflect the human anatomy diagram.
[0013] Further, determining the first difference weight matrix based on the difference features includes: for each of the human body structure diagrams: calculating the norm square of the difference between the difference feature corresponding to the i-th node and the difference feature corresponding to the j-th node in the human body structure diagram to obtain a first candidate element; dividing the first candidate element by a preset standard deviation parameter to obtain a second candidate element; and substituting the second candidate element into a preset negative exponential function to obtain the element value in the i-th row and j-th column of the first difference weight matrix corresponding to the human body structure diagram.
[0014] The first difference weight matrix determined through the above implementation method shows that edges with larger values have stronger information propagation capabilities, and nodes with similar motion patterns will mutually enhance each other's feature representations. Therefore, the first difference weight matrix better reflects the node features of the human body structure diagram.
[0015] Furthermore, based on each of the first difference weight matrices and each of the human body structure diagrams, feature extraction is performed using the graph convolutional network to obtain each of the first difference weight features, including: for each of the first difference weight matrices: normalizing the first difference weight matrix to obtain a second normalized matrix; and substituting the multiplication result of the second normalized matrix, the human body structure diagram corresponding to the first difference weight matrix, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the first difference weight feature corresponding to the first difference weight matrix.
[0016] In the above implementation, by normalizing the first difference weight matrix, the original feature distribution of the first difference weight matrix can be maintained during the calculation process. The second normalized matrix, the human body structure image corresponding to the first difference weight matrix, and the preset learnable weight matrix are then recalculated using a graph convolutional network to perform feature extraction again, thereby enhancing the information.
[0017] Furthermore, based on each of the second difference weight matrices and each of the human body structure diagrams, feature extraction is performed using the graph convolutional network to obtain each of the second difference weight features, including: for each of the second difference weight matrices: normalizing the second difference weight matrix to obtain a third normalized matrix; and substituting the multiplication result of the third normalized matrix, the human body structure diagram corresponding to the second difference weight matrix, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the second difference weight feature corresponding to the second difference weight matrix.
[0018] In the above implementation, by normalizing the second difference weight matrix, the original feature distribution of the second difference weight matrix can be maintained during the calculation process. The graph convolutional network is then used to recalculate the third normalized matrix, the human body structure map corresponding to the second difference weight matrix, and the preset learnable weight matrix to perform feature extraction again. This allows the extracted features to focus more on the actual motion changes between the two frames.
[0019] Furthermore, the method also includes: using a preset Top-K algorithm to retain the K edges with the largest weights in each of the first difference weight matrices to obtain each of the first key difference matrices; and using a preset Top-K algorithm to retain the K edges with the largest weights in each of the second difference weight matrices to obtain each of the second key difference matrices.
[0020] Correspondingly, based on each of the first difference weight features and each of the second difference weight features, a preset classification layer is used to process and obtain human motion, including: fusing each of the first key difference matrices with the corresponding first difference weight features to obtain each first fusion matrix; fusing each of the second key difference matrices with the corresponding second difference weight features to obtain each second fusion matrix; concatenating each of the first fusion matrices and each of the second fusion matrices to obtain each first fusion feature; inputting the second fusion feature into a fully connected layer to obtain a first output feature; the second fusion feature is obtained by concatenating each of the first fusion features; and inputting the first output feature into a preset classification layer for processing to obtain human motion.
[0021] Traditional graph convolution is prone to oversmoothing after multi-layer propagation: homogeneous aggregation "smooths out", layer by layer, the cross-node differences that should remain prominent, which significantly dilutes key difference features and reduces discriminability. In the above implementation, a Top-K mechanism is used to filter the first difference weight matrix and the second difference weight matrix to obtain the first and second key difference matrices. The key difference matrices enhance the local change features of nodes with large action changes, explicitly amplifying their gradient response to offset the information loss caused by oversmoothing.
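The Top-K retention described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `keep_top_k_edges` is hypothetical, and it assumes that the K largest entries are selected over the whole matrix (rather than per row) and that all other entries are set to zero.

```python
import numpy as np

def keep_top_k_edges(weights, k):
    """Keep the k largest entries of a difference weight matrix, zero the rest.

    Assumption: selection is global over all entries of the matrix.
    Ties are broken arbitrarily by np.argpartition.
    """
    weights = np.asarray(weights, dtype=float)
    flat = weights.ravel()
    out = np.zeros_like(flat)
    k = min(int(k), flat.size)
    if k > 0:
        # Indices of the k largest values (order among them is unspecified).
        top = np.argpartition(flat, -k)[-k:]
        out[top] = flat[top]
    return out.reshape(weights.shape)
```

For example, calling `keep_top_k_edges([[0.1, 0.9], [0.5, 0.3]], 2)` keeps only the entries 0.9 and 0.5.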
[0022] The action recognition model executes the above method. This application embodiment also provides a training method comprising: acquiring a sequence of human anatomy diagrams with action labels; training the action recognition model using a preset optimization algorithm with a target loss function based on the sequence of human anatomy diagrams with action labels; until a preset stopping condition is reached or the training iterations reach a set number, thereby obtaining an updated action recognition model; wherein the target loss function is determined based on a preset loss function and a constraint function; the constraint function is determined based on a second feature corresponding to a first key difference matrix and a third feature corresponding to a second key difference matrix.
[0023] In the above implementation, the constraint function ensures that the first and second key difference matrices remain semantically consistent under the same task. By combining the constraint function and a preset loss function as the final target loss function for model training, the sensitivity and recognition ability of the action recognition model to minor changes in key action nodes can be improved.
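One way to combine a preset loss function with a constraint function is sketched below. The text does not give the form of either function, so every specific here is an assumption made for illustration: cross-entropy stands in for the preset loss, a mean squared difference between the second and third features stands in for the constraint, and the weighting factor `lam` and all function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample (assumed stand-in for the preset loss)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()              # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def target_loss(logits, label, second_feature, third_feature, lam=0.1):
    """Preset loss plus a weighted consistency constraint (assumed form).

    second_feature / third_feature: features derived from the first and
    second key difference matrices; the constraint penalizes their mismatch.
    """
    second_feature = np.asarray(second_feature, dtype=float)
    third_feature = np.asarray(third_feature, dtype=float)
    constraint = np.mean((second_feature - third_feature) ** 2)
    return cross_entropy(logits, label) + lam * constraint
```

Under these assumptions the constraint term vanishes when the two features agree, so the target loss reduces to the preset loss.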
[0024] This application also provides an electronic device, including a processor and a memory. The memory stores computer-executable instructions that can be executed by the processor. The processor executes the computer-executable instructions to implement the above-described method for recognizing actions or the above-described training method.
[0025] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.
Attached Figure Description
[0026] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings are considered similar elements, and the drawings are not necessarily to scale. In the drawings:
Figure 1 is a flowchart of a method for recognizing actions provided in an embodiment of this application;
Figure 2 is a schematic diagram of extracting a human body structure diagram provided in an embodiment of this application;
Figure 3 is a schematic diagram of extracting a human body structure sequence provided in an embodiment of this application;
Figure 4 is a flowchart of a training method for an action recognition model provided in an embodiment of this application;
Figure 5 is a schematic diagram of the test accuracy curve of an action recognition model provided in an embodiment of this application;
Figure 6 is a loss curve of the training process of an action recognition model provided in an embodiment of this application;
Figure 7 is a loss curve of the testing process of an action recognition model provided in an embodiment of this application;
Figure 8 is a schematic diagram of the accuracy of an AM-GCN model as a function of rounds, provided in an embodiment of this application;
Figure 9 is a schematic diagram of the accuracy of a DNDGN model as a function of rounds, provided in an embodiment of this application;
Figure 10 is a schematic diagram of the accuracy of a GAT model as a function of rounds, provided in an embodiment of this application;
Figure 11 is a schematic diagram of the accuracy of a GCN model as a function of rounds, provided in an embodiment of this application;
Figure 12 is a schematic diagram of the accuracy of an MCNN-LSTM model as a function of rounds, provided in an embodiment of this application;
Figure 13 is a schematic diagram of the accuracy of an MRF-GCN model as a function of rounds, provided in an embodiment of this application;
Figure 14 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
[0027] Reference numerals: 1: Processor; 2: Memory; 3: Communication interface; 4: Bus.
Detailed Implementation
[0028] To provide a more detailed understanding of the features and technical content of the embodiments of this application, the implementation of the embodiments of this application will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this application. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.
[0029] The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.
[0030] Unless otherwise stated, the term "multiple" means two or more.
[0031] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.
[0032] With the continuous increase in China's car ownership, the demand for quality and safety supervision in the automotive after-sales service sector is becoming increasingly urgent. Traditional quality control models mainly rely on on-site manual inspections by technicians and random post-incident checks. These methods suffer from numerous shortcomings, including high costs, varying personnel qualifications, numerous regulatory blind spots, and significant delays, so they cannot meet the demands of increasingly complex repair scenarios and diverse safety hazard investigations. Especially in critical operations such as changing oil filters and checking tire pressure, even minor errors can lead to vehicle performance degradation or even serious safety accidents.
[0033] Human motion recognition has significant application potential in automotive aftermarket service scenarios. By combining human motion recognition technology with the body posture and dynamic movement sequences of individuals in surveillance videos, it is possible to monitor and verify in real time whether their operational procedures meet predetermined standards, fundamentally replacing traditional quality control methods that rely on human experience and subjective judgment.
[0034] With the increasing standardization and refinement of automotive repair processes, intelligent quality inspection systems based on human motion recognition technology can not only efficiently capture minute motion deviations but also completely record work trajectories, providing a reliable basis for repair quality optimization and accountability. Furthermore, compared to traditional static image analysis methods, dynamic motion recognition technology is more suited to the continuous and complex nature of the repair process, effectively addressing practical challenges such as occlusion, changing perspectives, and environmental interference at the repair site, significantly improving the accuracy and robustness of the quality inspection system. Through motion recognition technology, automotive repair processes can transition from experience-driven to data-driven quality management, promoting the intelligent and lean management of repair workshop operations.
[0035] In related technologies, skeleton-driven methods are used to abstract human movements into joint sequences and skeletal topology, and then graph convolutional networks are used to extract skeleton data features for action recognition. However, this approach cannot effectively capture subtle changes in key action nodes, thus limiting the model's recognition accuracy and computational efficiency.
[0036] This application provides a method for recognizing actions that addresses at least some of the deficiencies in related technologies. Figure 1 is a flowchart of the action recognition method provided in this embodiment of the application. As shown in Figure 1, the method includes: Step S101: Use a graph convolutional network to extract features from the multiple input human body structure diagrams to obtain multiple first features.
[0037] In this embodiment of the application, the human body structure diagram uses the joints of the human skeleton as graph nodes and the connections of the human skeleton as graph edges.
[0038] A diagram of human anatomy can be represented as a graph G=(V, E), where V represents the set of graph nodes and E represents the set of graph edges.
[0039] The system can acquire a video to be analyzed, process its frames, and determine the corresponding human anatomy diagrams. A video to be analyzed consists of multiple frames, thus allowing the generation of multiple human anatomy diagrams in chronological order.
[0040] In one alternative approach, for each video frame: the coordinates of the two-dimensional human keypoints in the video frame are determined using the MMPose (pose estimation) algorithm; the keypoints are used as graph nodes, and the preset human skeleton connections are used as graph edges, to obtain the human body structure diagram corresponding to the video frame.
[0041] Optionally, graph edges can be defined based on the connectivity of the human skeleton. For example, skeletal structures include connections such as nose-eye, shoulder-elbow, elbow-wrist, hip-knee, and knee-ankle. Intersecting body edges can also be added as supplementary graph edges to enhance structural coherence. Examples of intersecting body edges include left shoulder to right shoulder, left hip to right hip, etc.
[0042] For example, the coordinates of the keypoints extracted by the pose estimation algorithm can correspond to the 17 predefined anatomical landmarks of the COCO (Common Objects in Context) keypoint standard. These anatomical landmarks are the nose and the left and right eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. The default anatomical topology connection rules that the COCO standard defines for these 17 landmarks are used as graph edges.
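The graph construction described above can be sketched as follows. This is an illustrative sketch only: the keypoint order follows the usual COCO 17-keypoint convention, while the edge list contains just the connections named in the text (including the two cross-body edges) and is not the full default COCO skeleton. The names `COCO_KEYPOINTS`, `EXAMPLE_EDGES`, and `build_structure_diagram` are hypothetical.

```python
import numpy as np

# COCO 17-keypoint order (index = graph node id).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Illustrative subset of edges: only the connections named in the text.
EXAMPLE_EDGES = [
    (0, 1), (0, 2),        # nose-eye
    (5, 7), (6, 8),        # shoulder-elbow
    (7, 9), (8, 10),       # elbow-wrist
    (11, 13), (12, 14),    # hip-knee
    (13, 15), (14, 16),    # knee-ankle
    (5, 6), (11, 12),      # cross-body: shoulder-shoulder, hip-hip
]

def build_structure_diagram(keypoints_xy):
    """Return (node_features, edges) for one video frame.

    keypoints_xy: array of shape (17, 2) with the 2D keypoint coordinates,
    e.g. as produced by a pose estimator.
    """
    nodes = np.asarray(keypoints_xy, dtype=float)
    if nodes.shape != (len(COCO_KEYPOINTS), 2):
        raise ValueError("expected 17 keypoints with (x, y) coordinates")
    return nodes, list(EXAMPLE_EDGES)
```

Applying `build_structure_diagram` to every frame in chronological order would give the time-ordered sequence of diagrams used in the later steps.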
[0043] As shown in Figures 2 and 3, suppose there is a video clip of car maintenance. This video is used as the video to be judged and is decomposed into multiple video frames. As shown in Figure 2, the coordinates of the key human body points are extracted from a single video frame, and these key points are connected according to the graph edges to obtain the human body structure diagram corresponding to that video frame. As shown in Figure 3, after each video frame in the video has been processed, multiple human body structure diagrams in chronological order are obtained, which gives the human skeleton data indexed by time t.
[0044] In some embodiments, step S101 may include: for each human body structure image: obtaining the adjacency matrix corresponding to the human body structure image; adding an identity matrix to the adjacency matrix to obtain a first update matrix; normalizing the first update matrix to obtain a first normalized matrix; and substituting the multiplication result of the first normalized matrix, the human body structure image, and the preset learnable weight matrix into a preset nonlinear activation function to obtain a first feature corresponding to the human body structure image.
[0045] In an alternative approach, the adjacency matrix of the human anatomy diagram can be obtained as follows: for each pair of nodes in the diagram, if there is a direct anatomical connection between node $v_i$ and node $v_j$, the corresponding element is set to 1; otherwise, it is set to 0. Therefore, the value of the element $A_{ij}$ of the adjacency matrix $A$ can be represented as:
$$A_{ij} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ are directly connected} \\ 0, & \text{otherwise} \end{cases}$$
[0046] Once all nodes in a single human anatomy diagram have been evaluated using the above formula, the adjacency matrix of that diagram can be obtained.
[0047] For example, consider a human anatomy diagram containing only three nodes: node 1, node 2, and node 3. When i is node 1: there is a connection between node 1 and node 2, so $A_{12}$ is 1; there is a connection between node 1 and node 3, so $A_{13}$ is 1. When i is node 2: there is a connection between node 2 and node 1, so $A_{21}$ is 1; there is no connection between node 2 and node 3, so $A_{23}$ is 0. When i is node 3: there is a connection between node 3 and node 1, so $A_{31}$ is 1; there is no connection between node 3 and node 2, so $A_{32}$ is 0. Since no node is connected to itself, $A_{11}$, $A_{22}$, and $A_{33}$ are all 0. In this way, the complete adjacency matrix corresponding to the human anatomy diagram is obtained.
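The three-node example can be reproduced with a short sketch. The helper name `adjacency_matrix` is hypothetical, and nodes are numbered from 0 in the code, so node 1 in the text is index 0.

```python
import numpy as np

def adjacency_matrix(num_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    a = np.zeros((num_nodes, num_nodes), dtype=int)
    for i, j in edges:
        a[i, j] = 1   # connection from node i to node j
        a[j, i] = 1   # and the same connection seen from node j
    return a

# Node 1 is connected to node 2 and to node 3; nodes 2 and 3 are not connected.
A = adjacency_matrix(3, [(0, 1), (0, 2)])
print(A)
# [[0 1 1]
#  [1 0 0]
#  [1 0 0]]
```

The diagonal stays at 0 because no node is listed as connected to itself.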
[0048] In one alternative approach, the identity matrix $I$ is added to the adjacency matrix $A$, and the sum $A + I$ is taken as the first update matrix.
[0049] In one alternative approach, normalizing the first update matrix to obtain the first normalized matrix $\hat{A}$ can be achieved by scaling the first update matrix $A + I$ with the degree matrix $D$.
[0050] Here, $\hat{A}$ is the first normalized matrix, $D$ is the degree matrix of the adjacency matrix, and $I$ is an identity matrix with the same dimensions as the adjacency matrix $A$. Since the diagonal of the adjacency matrix $A$ consists entirely of 0s, using only $A$ would cause the features of the nodes themselves to be ignored. Adding an identity matrix to $A$ addresses this and ensures that each node's own information is transmitted. Furthermore, an unnormalized matrix might alter the original distribution of the features when multiplied by the feature matrix, potentially causing unpredictable problems; normalizing the first update matrix preserves the original feature distribution during subsequent calculations.
[0051] In one alternative approach, the result of multiplying the first normalized matrix, the human anatomy diagram, and the preset learnable weight matrix is substituted into a preset nonlinear activation function to obtain the first feature corresponding to the human anatomy diagram. This can be achieved by calculating $H^{(t)} = \sigma\left(\hat{A}\, X^{(t)}\, W^{(t)}\right)$.
[0052] Here, $\sigma(\cdot)$ represents a nonlinear activation function, such as the rectified linear unit (ReLU); $\hat{A}$ is the first normalized matrix; $X^{(t)}$ is the human body structure diagram at time t; $W^{(t)}$ is the learnable weight matrix at time t; and $H^{(t)}$ is the first feature corresponding to the human body structure diagram at time t.
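[0053] At this point, the first feature corresponding to a single node in the human anatomy diagram is $h_i^{(t)} = \sigma\left(\sum_{j=1}^{n} \hat{A}_{ij}\, x_j^{(t)}\, W^{(t)}\right)$, where $\sigma(\cdot)$ is the nonlinear activation function and $W^{(t)}$ is the learnable weight matrix at time t. Here, $x_j^{(t)}$ is the feature representation of the j-th node in the human anatomy diagram $X^{(t)}$ at time t. At t=0, $X^{(0)}$ represents the first human anatomy diagram in chronological order; at t=1, $X^{(1)}$ represents the second human anatomy diagram in chronological order, and so on for other values of t. $h_i^{(t)}$ is the first feature corresponding to the i-th node in the human anatomy diagram at time t; $n$ is the total number of nodes in the human anatomy diagram; and $\hat{A}_{ij}$ is the value of the element in the i-th row and j-th column of the first normalized matrix $\hat{A}$.
[0054] If no normalization is performed and the identity matrix is not added, $h_i^{(t)} = \sigma\left(\sum_{j=1}^{n} A_{ij}\, x_j^{(t)}\, W^{(t)}\right)$, where $A_{ij}$ is the value of the element in the i-th row and j-th column of the adjacency matrix $A$.
[0055] If the identity matrix is added but no normalization is performed, $h_i^{(t)} = \sigma\left(\sum_{j=1}^{n} (A + I)_{ij}\, x_j^{(t)}\, W^{(t)}\right)$.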
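The feature extraction of step S101 can be sketched as a single graph convolution layer. This is a minimal sketch under stated assumptions: the symmetric normalization $D^{-1/2}(A+I)D^{-1/2}$ common in graph convolutional networks is assumed, with the degrees taken from $A + I$, and ReLU is assumed as the activation; the text may use a different normalization. The function names are hypothetical.

```python
import numpy as np

def normalize_with_self_loops(A):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2}, degrees from A + I."""
    A = np.asarray(A, dtype=float)
    A_tilde = A + np.eye(A.shape[0])       # first update matrix
    degree = A_tilde.sum(axis=1)           # >= 1 thanks to the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A, X, W):
    """First feature of one diagram: ReLU(A_hat @ X @ W).

    A: (n, n) adjacency matrix, X: (n, c_in) node features,
    W: (c_in, c_out) learnable weight matrix.
    """
    A_hat = normalize_with_self_loops(A)
    return np.maximum(A_hat @ np.asarray(X, dtype=float) @ W, 0.0)
```

With the self-loops in place, every row of `A_hat` has a nonzero diagonal entry, so each node's own feature contributes to its output.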
[0056] In some other embodiments, step S101 may include: for each human body structure image: obtaining the adjacency matrix corresponding to the human body structure image; normalizing the adjacency matrix to obtain a fourth normalized matrix; and substituting the multiplication result of the fourth normalized matrix, the human body structure image, and the preset learnable weight matrix into a preset nonlinear activation function to obtain the first feature corresponding to the human body structure image.
[0057] Step S102: Determine the first difference weight matrix based on each difference feature.
[0058] In this embodiment of the application, a difference feature is the difference between two first features corresponding to a set of temporally adjacent human body structure diagrams.
[0059] For example, when t is greater than or equal to 1, calculating $\Delta H^{(t)} = H^{(t)} - H^{(t-1)}$ gives the difference feature corresponding to the human anatomy diagram at time t. When t equals 0, $\Delta H^{(0)} = H^{(0)}$. Here, $\Delta H^{(t)}$ is the difference feature at time t, $H^{(t-1)}$ is the first feature corresponding to the human body structure diagram at time t-1, and $H^{(t)}$ is the first feature corresponding to the human body structure diagram at time t.
[0060] At this point, for a single node, when t is greater than or equal to 1, the calculation is $\Delta h_i^{(t)} = h_i^{(t)} - h_i^{(t-1)}$; when t equals 0, $\Delta h_i^{(0)} = h_i^{(0)}$. Here, $\Delta h_i^{(t)}$ is the difference feature corresponding to the i-th node at time t; $h_i^{(t)}$ and $h_i^{(t-1)}$ are the first features corresponding to the i-th node in the human body structure diagram at time t and at time t-1; $\Delta h_i^{(0)}$ is the difference feature corresponding to the i-th node at time 0; and $h_i^{(0)}$ is the first feature corresponding to the i-th node in the human body structure diagram at time 0.
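The difference features can be sketched as a frame-to-frame subtraction over the sequence of first features. The sketch assumes that the difference is taken as the current frame minus the previous one and that the difference feature at t = 0 equals the first feature itself, which is the same as treating the frame before the first as all zeros. The function name is hypothetical.

```python
import numpy as np

def temporal_difference(H):
    """Difference features for a sequence of first features.

    H: array of shape (T, n, c), the first features for T frames.
    Returns D with D[0] = H[0] and D[t] = H[t] - H[t-1] for t >= 1.
    """
    H = np.asarray(H, dtype=float)
    D = np.empty_like(H)
    D[0] = H[0]                 # assumed convention for the first frame
    D[1:] = H[1:] - H[:-1]      # adjacent-frame differences
    return D
```

Each slice `D[t]` then has one row per node, which is the per-node difference feature used to build the first difference weight matrix.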
[0061] In some embodiments, determining the first difference weight matrix based on the difference features includes: for each human body structure diagram: calculating the norm square of the difference between the difference feature corresponding to the i-th node and the difference feature corresponding to the j-th node in the human body structure diagram to obtain a first candidate element; dividing the first candidate element by a preset standard deviation parameter to obtain a second candidate element; and substituting the second candidate element into a preset negative exponential function to obtain the element value in the i-th row and j-th column of the first difference weight matrix corresponding to the human body structure diagram.
[0062] For example, calculating $S_{1,ij}^{(t)} = \exp\left(-\dfrac{\left\| \Delta h_i^{(t)} - \Delta h_j^{(t)} \right\|^2}{\sigma_w}\right)$ gives the element value in the i-th row and j-th column of the first difference weight matrix.
[0063] Here, $S_{1,ij}^{(t)}$ represents the value of the element in the i-th row and j-th column of the first difference weight matrix $S_1^{(t)}$. $\sigma_w$ is the standard deviation parameter, used to control the distance measurement; its value can be set by technicians based on experience, for example $\sigma_w = 2$. $\Delta h_i^{(t)}$ and $\Delta h_j^{(t)}$ are the difference features corresponding to the i-th node and the j-th node at time t.
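The computation of the first difference weight matrix can be sketched as follows, following the three steps in the text: squared norm of the pairwise difference, division by the standard deviation parameter, and a negative exponential. The function name and the default value 2 for `sigma` (taken from the example value in the text) are illustrative choices.

```python
import numpy as np

def first_difference_weight_matrix(delta, sigma=2.0):
    """Pairwise weights exp(-||delta_i - delta_j||^2 / sigma).

    delta: (n, c) array whose i-th row is the difference feature of node i.
    """
    delta = np.asarray(delta, dtype=float)
    pairwise = delta[:, None, :] - delta[None, :, :]   # (n, n, c)
    sq_dist = np.sum(pairwise ** 2, axis=-1)           # first candidate element
    return np.exp(-sq_dist / sigma)                    # negative exponential
```

Nodes whose difference features are close receive weights near 1, and nodes that move very differently receive weights near 0, which matches the intuition that nodes with similar motion patterns reinforce each other.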
[0064] Step S103: Process each human body structure diagram using a convolutional neural network to obtain each second difference weight matrix.
[0065] In some embodiments, step S103 may include: for each group of temporally adjacent human anatomy images: inputting the group of temporally adjacent human anatomy images into a convolutional neural network to obtain an off-diagonal difference weight matrix; calculating the node-level residuals of the group of temporally adjacent human anatomy images; adding node-level residuals to the diagonal elements of the off-diagonal difference weight matrix to obtain a second update matrix; and performing symmetry and row normalization on the second update matrix to obtain a second difference weight matrix corresponding to the group of temporally adjacent human anatomy images.
[0066] Row normalization can be understood as normalization based on rows.
[0067] Let $X_t$ denote the human body structure diagram at time t. $X_{-1}$ is an all-zero matrix with the same dimensions as $X_0$. That is, the human body structure diagram that chronologically precedes the first human body structure diagram is taken to be an all-zero matrix with the same dimensions as the first human body structure diagram.
[0068] For example, when t is greater than or equal to 1, the two consecutive human body structure diagrams $X_{t-1}$ and $X_t$ can be input directly into the convolutional neural network to obtain the off-diagonal difference weight matrix. When t equals 0, the two human body structure diagrams $X_{-1}$ and $X_0$ are input into the convolutional neural network to obtain the off-diagonal difference weight matrix.
[0069] For example, two consecutive frames of human body structure diagrams are stacked along the time dimension to yield a feature representation of size 2×M×C, where 2 is the size of the time dimension, M is the number of nodes, and C is the number of feature dimensions. A two-dimensional convolution kernel then spans the time dimension in a single pass and performs local convolution operations on the joint sequence to extract spatiotemporal local difference features, outputting a feature map F of size 1×M×C. Finally, a 1×1 convolution compresses the number of channels to M, resulting in the off-diagonal difference weight matrix.
[0070] Optionally, the size of the 2D convolutional kernel is set to 2×3. The kernel size in the time dimension is 2, used to cover two consecutive frames of skeleton data at once, i.e., two consecutive frames of human body structure diagrams at once. The kernel size in the node dimension is 3, used to cover the current node and its adjacent nodes. The convolution stride is set to (1, 1), i.e., the stride in both the time and node dimensions is 1. The padding method is set to no padding in the time dimension and padding with one zero value on each side of the node dimension, i.e., padding is set to (0, 1). The number of convolutional output channels is set to C, the activation function is a linear rectified function, and the output feature size after 2D convolution is 1×M×C.
[0071] Then, 1×1 convolution is used to compress the features of size 1×M×C output by the 2D convolution. The kernel size of the 1×1 convolution is 1×1, the stride is (1,1), the padding is no padding (0,0), the number of input channels is C, the number of output channels is M, and the activation function is the linear rectified function, thus obtaining an M×M matrix.
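The shape bookkeeping of the two convolutions can be checked with a small pure-Python sketch. Everything named here (`conv_2x3`, `conv_1x1`, the random weights, M = 4 and C = 3) is hypothetical; "adjacent nodes" are taken to mean neighbouring positions in the joint sequence, and a trained model would of course use learned weights in a deep learning framework rather than explicit loops.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def conv_2x3(frames, weight, bias):
    """frames[t][m][c]: 2 frames, M nodes, C channels.
    weight[o][c][kt][km]: C_out x C_in x 2 x 3 kernel. Returns out[m][o]."""
    num_nodes = len(frames[0])
    c_in = len(frames[0][0])
    c_out = len(weight)
    out = [[0.0] * c_out for _ in range(num_nodes)]
    for m in range(num_nodes):
        for o in range(c_out):
            acc = bias[o]
            for kt in range(2):          # both frames, no padding in time
                for km in range(3):      # current node and its two neighbours
                    src = m + km - 1     # one zero-padded position per side
                    if 0 <= src < num_nodes:
                        for c in range(c_in):
                            acc += weight[o][c][kt][km] * frames[kt][src][c]
            out[m][o] = relu(acc)
    return out


def conv_1x1(features, weight, bias):
    """features[m][c] -> out[m][o] with weight[o][c]; maps C channels to M."""
    return [
        [relu(bias[o] + sum(w * x for w, x in zip(weight[o], row)))
         for o in range(len(weight))]
        for row in features
    ]


random.seed(0)
M, C = 4, 3  # hypothetical node and feature counts
frames = [[[random.uniform(-1, 1) for _ in range(C)] for _ in range(M)]
          for _ in range(2)]
w1 = [[[[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
       for _ in range(C)] for _ in range(C)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(M)]
F = conv_2x3(frames, w1, [0.0] * C)   # size 1 x M x C, stored as M x C
A_off = conv_1x1(F, w2, [0.0] * M)    # size M x M
```

With stride 1, no padding in time and one zero on each side of the node dimension, the time dimension collapses from 2 to 1 while the node dimension stays at M, which is why the final result is an M×M matrix.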
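[0072] Calculate the node-level residuals and add them to the diagonal elements of the off-diagonal difference weight matrix to obtain the second update matrix $\tilde{A}_t$. A calculation that combines $\tilde{A}_t$ with its transpose and then applies a normalization algorithm achieves symmetry and row normalization of the second update matrix.
[0073] Here, $\tilde{A}_t^{\top}$ represents the transpose of the matrix $\tilde{A}_t$, and the output of the normalization algorithm is the second difference weight matrix $A_t^{(2)}$.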
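A minimal sketch of this post-processing is given below. The exact formulas are not recoverable from the text, so three choices here are assumptions of the sketch rather than the method of this application: the node-level residual is taken as the Euclidean norm of each node's change between the two frames, symmetry is obtained by averaging the matrix with its transpose, and row normalization divides each row by its sum.

```python
import math


def second_difference_weight_matrix(a_off, frame_prev, frame_curr):
    n = len(a_off)
    # Assumption: residual of node i = norm of its frame-to-frame change.
    residual = [
        math.sqrt(sum((c - p) ** 2
                      for c, p in zip(frame_curr[i], frame_prev[i])))
        for i in range(n)
    ]
    # Second update matrix: residuals added to the diagonal elements.
    updated = [row[:] for row in a_off]
    for i in range(n):
        updated[i][i] += residual[i]
    # Assumption: symmetrize by averaging with the transpose.
    sym = [[0.5 * (updated[i][j] + updated[j][i]) for j in range(n)]
           for i in range(n)]
    # Row normalization: each row is divided by its sum (zero rows are kept).
    result = []
    for row in sym:
        total = sum(row)
        result.append([v / total for v in row] if total > 0 else row[:])
    return result


# Toy inputs: only node 1 moves between the two frames.
a_off = [[0.0, 0.2, 0.0], [0.4, 0.0, 0.1], [0.0, 0.3, 0.0]]
x_prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
x_curr = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
A2 = second_difference_weight_matrix(a_off, x_prev, x_curr)
```

In this toy case the moving node gets the largest entry of its own row. Note that, under these assumptions, row normalization after symmetrization generally makes the final matrix non-symmetric again.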
[0074] In this way, joint pairs that undergo significant motion between adjacent frames are adaptively highlighted, providing discriminative spatiotemporal topological information for the subsequent graph convolutional network.
[0075] Step S104: Based on each first difference weight matrix and each human body structure diagram, feature extraction is performed using a graph convolutional network to obtain each first difference weight feature.
[0076] In some embodiments, step S104 may be: for each first difference weight matrix: normalize the first difference weight matrix to obtain a second normalized matrix; substitute the multiplication result of the second normalized matrix, the human body structure diagram corresponding to the first difference weight matrix, and the preset learnable weight matrix into a preset nonlinear activation function to obtain the first difference weight feature corresponding to the first difference weight matrix.
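[0077] For example, calculate $Z_t^{(1)} = f\left(\tilde{D}_t^{-1/2}\,(A_t^{(1)} + I)\,\tilde{D}_t^{-1/2}\,X_t W\right)$ to obtain the first difference weight feature corresponding to the first difference weight matrix. Here, $Z_t^{(1)}$ is the first difference weight feature at time t, $A_t^{(1)}$ is the first difference weight matrix at time t, $X_t$ is the human body structure diagram corresponding to that matrix, $W$ is the preset learnable weight matrix, and $f(\cdot)$ is the preset nonlinear activation function. $\tilde{D}_t$ is the degree matrix of the first difference weight matrix involved in the calculation at time t; for example, when calculating $Z_0^{(1)}$, $\tilde{D}_0$ is the degree matrix of $A_0^{(1)}$; when calculating $Z_1^{(1)}$, $\tilde{D}_1$ is the degree matrix of $A_1^{(1)}$, and so on. $I$ is an identity matrix with the same dimension as the first difference weight matrix used in the calculation at time t.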
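The normalize, multiply and activate sequence of step S104 can be sketched as follows. This is an illustration under assumptions: the symmetric normalization with self-loops common in graph convolutional networks is used, the degree is computed after adding the identity matrix, the activation is a rectified linear unit, and all names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Symmetric normalization with self-loops (assumed form)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # positive for non-negative weights
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, x, w):
    """Normalized matrix times node features times weights, then ReLU."""
    z = matmul(matmul(normalize_adjacency(adj), x), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy example: two nodes, two input features, two output features.
adj = [[0.0, 1.0], [1.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, -1.0], [2.0, 0.5]]
out = gcn_layer(adj, x, w)
```

The same routine applies unchanged to a second difference weight matrix in step S105; only the matrix passed as `adj` differs.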
[0078] Step S105: Based on each second difference weight matrix and each human body structure diagram, feature extraction is performed using a graph convolutional network to obtain each second difference weight feature.
[0079] In some embodiments, step S105 may be: for each second difference weight matrix: normalize the second difference weight matrix to obtain a third normalized matrix; substitute the multiplication result of the third normalized matrix, the human body structure diagram corresponding to the second difference weight matrix, and the preset learnable weight matrix into a preset nonlinear activation function to obtain the second difference weight feature corresponding to the second difference weight matrix.
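[0080] For example, calculate $Z_t^{(2)} = f\left(\tilde{D}_t^{-1/2}\,(A_t^{(2)} + I)\,\tilde{D}_t^{-1/2}\,X_t W\right)$ to obtain the second difference weight feature corresponding to the second difference weight matrix. Here, $Z_t^{(2)}$ represents the second difference weight feature at time t, $A_t^{(2)}$ is the second difference weight matrix at time t, $X_t$ is the human body structure diagram corresponding to that matrix, $W$ is the preset learnable weight matrix, and $f(\cdot)$ is the preset nonlinear activation function. $\tilde{D}_t$ is the degree matrix of the second difference weight matrix used in the calculation at time t, and $I$ is an identity matrix with the same dimension as the second difference weight matrix used in the calculation at time t.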
[0081] Step S106: Based on each first difference weight feature and each second difference weight feature, the human body motion is obtained by processing using a preset classification layer.
[0082] For example, in the context of car maintenance, human actions include: changing the oil filter, checking tire pressure, changing the oil, and cleaning the tires.
[0083] Optionally, the classification layer is used to determine the probability of different human actions based on the input features. The classification layer can output different human actions and their corresponding probabilities, or it can output the human action with the highest probability.
[0084] In some embodiments, step S106 may include: concatenating the first difference weight feature and the second difference weight feature at the same time to obtain the third fusion feature at each time; concatenating the third fusion features at each time to obtain the fourth fusion feature; inputting the fourth fusion feature into a fully connected layer to obtain the second output feature; and inputting the second output feature into a preset classification layer for processing to obtain human motion.
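The concatenate-then-classify flow of step S106 can be sketched in a few lines. As a simplification assumed for this sketch, the fully connected layer and the classification layer are collapsed into one linear map followed by softmax; the function names, the weights and the toy features are hypothetical, while the four action labels are the car maintenance examples given above.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(first_feats, second_feats, fc_weight, fc_bias, labels):
    """first_feats[t][i] and second_feats[t][i] are per-node feature vectors."""
    fused = []
    for f1, f2 in zip(first_feats, second_feats):
        # Third fusion feature: both branches concatenated at the same time.
        for node1, node2 in zip(f1, f2):
            fused.extend(node1 + node2)
    # `fused` is now the fourth fusion feature (all time steps concatenated).
    logits = [b + sum(w * v for w, v in zip(row, fused))
              for row, b in zip(fc_weight, fc_bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


labels = ["changing the oil filter", "checking tire pressure",
          "changing the oil", "cleaning the tires"]
first = [[[1.0]], [[0.0]]]    # two time steps, one node, one feature
second = [[[0.0]], [[2.0]]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
action, probs = classify(first, second, identity, [0.0] * 4, labels)
```

Returning both the most probable action and the full probability list mirrors the two output options described for the classification layer.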
[0085] In some embodiments, the method for identifying actions further includes: using a preset Top-K algorithm to retain the K edges with the largest weights in each first difference weight matrix to obtain each first key difference matrix; and using a preset Top-K algorithm to retain the K edges with the largest weights in each second difference weight matrix to obtain each second key difference matrix.
[0086] Correspondingly, processing each first difference weight feature and each second difference weight feature using a preset classification layer to obtain the human motion includes: fusing each first key difference matrix with its corresponding first difference weight feature to obtain each first fusion matrix; fusing each second key difference matrix with its corresponding second difference weight feature to obtain each second fusion matrix; concatenating each first fusion matrix and each second fusion matrix to obtain each first fusion feature; concatenating the first fusion features to obtain a second fusion feature; inputting the second fusion feature into a fully connected layer to obtain a first output feature; and inputting the first output feature into a preset classification layer for processing to obtain the human motion.
[0087] The first difference weight feature at the same time is called the first difference weight feature corresponding to the first key difference matrix. The second difference weight feature at the same time is called the second difference weight feature corresponding to the second key difference matrix.
[0088] Optionally, K is less than the total number of edges in the first difference weight matrix. K can be determined by the technician based on experience, for example, K equals 5. Considering that traditional graph convolution is prone to oversmoothing after multiple layers of propagation, homogeneous aggregation will "smooth out" the cross-node differences that should be prominent layer by layer, causing key difference features to be significantly diluted and their recognizability reduced. Therefore, this application uses a Top-K (an algorithm for finding the top K most important or highest-ranking elements from a large amount of data) mechanism to filter the dual difference weight matrix. In the first difference weight matrix of GCN, a larger weight indicates a more similar movement trend of the nodes; in the second difference weight matrix of CNN (convolutional neural network), a larger weight indicates a greater change in the nodes. This application only retains the K edges with the largest weights in the difference weight matrix, thus obtaining the key difference matrix. Subsequently, it can perform local change feature enhancement on nodes with large action changes, explicitly amplifying their gradient response to offset the information loss caused by oversmoothing.
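A minimal sketch of the Top-K filtering follows. It assumes, for illustration only, that an "edge" is an off-diagonal entry of the weight matrix, that discarded edges are set to zero, and that ties are broken by row-major order; the function name and the toy matrix are hypothetical.

```python
def top_k_edges(weights, k):
    """Keep the k largest off-diagonal weights; zero everything else."""
    n = len(weights)
    edges = [(weights[i][j], i, j)
             for i in range(n) for j in range(n) if i != j]
    # Stable sort: equal weights keep their row-major order.
    edges.sort(key=lambda e: e[0], reverse=True)
    key_matrix = [[0.0] * n for _ in range(n)]
    for value, i, j in edges[:k]:
        key_matrix[i][j] = value
    return key_matrix


W = [[1.0, 0.9, 0.1],
     [0.2, 1.0, 0.8],
     [0.3, 0.4, 1.0]]
K1 = top_k_edges(W, k=2)
```

The result plays the role of a key difference matrix: only the two strongest connections, (0, 1) and (1, 2), survive in this toy input.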
[0089] In one alternative approach, a single first key difference matrix is fused with its corresponding first difference weight feature to obtain a first fusion matrix as follows: the second feature at time t is calculated from the first key difference matrix at time t, using the degree matrix of that first key difference matrix and an identity matrix with the same dimension as that first key difference matrix; the second feature is then added to the corresponding first difference weight feature to obtain the first fusion matrix.
[0090] In one alternative approach, a single second key difference matrix is fused with its corresponding second difference weight feature to obtain a second fusion matrix as follows: the third feature at time t is calculated from the second key difference matrix at time t, using the degree matrix of that second key difference matrix and an identity matrix with the same dimension as that second key difference matrix; the third feature is then added to the corresponding second difference weight feature to obtain the second fusion matrix.
[0091] In one alternative approach, concatenating the first fusion matrix and the second fusion matrix to obtain the first fusion feature can be: generating a first attention coefficient and a second attention coefficient through the softmax layer (classification layer) of an attention mechanism, and calculating the first fusion feature at time t using the two attention coefficients together with the second feature and the third feature. The classification layer in the attention mechanism is used to generate the attention coefficients.
[0092] In this way, adding the second feature to the first difference weight feature and adding the third feature to the second difference weight feature explicitly amplifies the key difference signals and keeps them aligned throughout, thus effectively counteracting the information dilution caused by oversmoothing in graph convolution.
[0093] In another alternative approach, the first fusion feature is obtained by a direct calculation on the first fusion matrix and the second fusion matrix.
[0094] For example, the DNDGN model can be divided into three parts. The first part is the dynamic neighborhood difference perception mechanism, which involves using a graph convolutional network to process multiple input human body structure images to determine a first difference weight matrix, using a convolutional neural network to process the same images to determine a second difference weight matrix, and determining first and second difference weight features based on the first and second difference weight matrices. The second part is the difference feature enhancement mechanism, which involves using the TOP-K algorithm to determine a first and second key difference matrix, and fusing the first and second key difference matrices with the first and second difference weight features. The third part is the consistency constraint, which involves determining a constraint function based on the first and second key difference matrices, combining the constraint function with a preset loss function to obtain a target loss function, and then using the target loss function for training.
[0095] The overall process of this application using a dynamic neighborhood difference-aware graph convolutional network is roughly as follows: First, a pose estimation algorithm is applied to process car maintenance video frames to extract the coordinates of two-dimensional human key points, thereby constructing a temporal human structure map corresponding to the video frames. Second, the temporal human structure map is input into the dynamic neighborhood difference-aware mechanism. Consistent difference features are extracted from adjacent temporal human structure maps using GCN and CNN respectively, resulting in a dual-difference neighborhood weight connection matrix, which consists of the first and second difference weight features. Furthermore, local change feature enhancement is performed on nodes with significant action changes. Top-K is used to extract multi-scale difference features from the consistent difference features, and consistency constraints effectively improve the sensitivity and recognition ability for minor changes in key action nodes. Then, attention coefficients are generated based on the first and second key difference matrices extracted by Top-K to enhance the fusion effect of the dual-difference neighborhood weight connection matrix, thus obtaining a dynamic neighborhood difference adjacency matrix. Finally, through classification layer operations, accurate human action recognition is output, achieving accurate assessment of car maintenance quality.
[0096] The aforementioned action recognition method can be encapsulated as a DNDGN (Dynamic Neighborhood Differential Aware Graph Convolutional Network) model. That is, the DNDGN model itself performs the aforementioned action recognition method. The DNDGN model, which is also the action recognition model of this application, can be pre-trained to recognize human actions.
[0097] As shown in Figure 4, an embodiment of this application provides a training method for an action recognition model, including: Step S201: Obtain a sequence of human body structure diagrams with action labels.
[0098] For example, surveillance video can be used as the training video. The training video is segmented according to human movements to obtain multiple sub-videos. Each sub-video corresponds to one human movement. For each sub-video, a human anatomy map is extracted from each frame of the sub-video to obtain a sequence of human anatomy maps corresponding to that sub-video. This sequence of human anatomy maps carries a label for the human movement corresponding to the sub-video, thus obtaining a sequence of human anatomy maps with movement labels.
[0099] Step S202: Based on the sequence of human body structure diagrams with action labels, the action recognition model is trained using a target loss function and a preset optimization algorithm until a preset stopping condition is met or the number of training iterations reaches a set number, thereby obtaining an updated action recognition model.
[0100] In this embodiment, the target loss function is determined based on a preset loss function and a constraint function; the constraint function is determined based on the second feature corresponding to the first key difference matrix and the third feature corresponding to the second key difference matrix.
[0101] In an optional approach, the preset loss function can be the cross-entropy loss function $L_{CE} = -\dfrac{1}{N}\sum_{i=1}^{N}\log p_{i,y_i}$, where N represents the number of training samples, $p_{i,y_i}$ is the predicted probability that sample i belongs to the true class $y_i$, and $y_i$ is the true class label of sample i, indicating the correct class of the sample.
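The cross-entropy loss averages the negative log of the probability assigned to each sample's true class. The sketch below uses hypothetical names and toy probabilities over four classes.

```python
import math


def cross_entropy(probabilities, labels):
    """probabilities[i][c]: predicted probability of class c for sample i.
    labels[i]: index of the true class of sample i."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / n


# Two toy samples: one fairly confident, one uniform over four classes.
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
labels = [0, 2]
loss = cross_entropy(probs, labels)
```

A perfectly confident correct prediction contributes zero, and the contribution grows as the probability of the true class shrinks.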
[0102] In an alternative approach, the constraint function can be determined based on the second feature corresponding to the first key difference matrix and the third feature corresponding to the second key difference matrix in the following manner: the constraint function is calculated over M samples, where M is the number of samples participating in the consistency constraint calculation, that is, the number of samples corresponding to the M time points; for the r-th sample, the calculation uses the second feature corresponding to its first key difference matrix and the third feature corresponding to its second key difference matrix. This regularizes the direction of the same node $V_i$ in the two feature spaces, suppressing random high-frequency noise caused by jitter while preserving amplitude differences for subsequent discrimination, and effectively improving the sensitivity and recognition ability for subtle changes in key action nodes.
[0103] Correspondingly, the target loss function can be determined based on the preset loss function and the constraint function by calculating $L = L_{CE} + \lambda L_{con}$ to obtain the target loss function.
[0104] Here, $L$ is the target loss function, $L_{CE}$ is the preset loss function, $L_{con}$ is the constraint function, and $\lambda$ is a preset trade-off coefficient, which can be set by technical personnel based on experience; for example, $\lambda$ is 0.2.
[0105] Thus, $L_{CE}$ is responsible for driving the network to learn discriminative features that distinguish different action categories, while $L_{con}$ applies consistency constraints among the multiple difference features to suppress noise and amplify subtle differences in key actions. The combination of the two simultaneously ensures classification accuracy and action sensitivity, effectively mitigating the oversmoothing phenomenon of deep graph convolution and improving action recognition accuracy and generalization ability in complex automotive maintenance scenarios.
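One way to realize a constraint that regularizes direction while leaving amplitude untouched is a cosine-based penalty, sketched below. The form (one minus cosine similarity, averaged over the M samples), the weighted-sum combination and all names are assumptions of this sketch and are not confirmed as the formula of this application; the trade-off value 0.2 is the example value given in the text.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes non-zero vectors


def consistency_constraint(second_feats, third_feats):
    """Mean of (1 - cosine similarity) over the M paired samples."""
    m = len(second_feats)
    return sum(1.0 - cosine(u, v)
               for u, v in zip(second_feats, third_feats)) / m


def target_loss(ce_loss, constraint, trade_off=0.2):
    """Assumed combination: preset loss plus weighted constraint."""
    return ce_loss + trade_off * constraint


# Sample 0: same direction, different amplitude. Sample 1: orthogonal.
second = [[1.0, 0.0], [0.0, 2.0]]
third = [[3.0, 0.0], [1.0, 0.0]]
penalty = consistency_constraint(second, third)
total = target_loss(1.0, penalty)
```

Because cosine similarity ignores vector length, the first pair incurs no penalty despite the threefold amplitude difference, which matches the stated goal of aligning direction while preserving amplitude.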
[0106] For example, this application is applied to action recognition in a car maintenance scenario. This application can obtain maintenance monitoring data from a domestic car dealership to obtain training videos. For instance, the maintenance monitoring recorded 280 video segments, with a total duration of approximately 100 hours.
[0107] Based on four typical procedures—oil filter replacement, tire pressure monitoring, oil change, and tire cleaning—the video was cropped and manually segmented. It was assumed that each typical procedure contained 100 complete motion sequence samples after segmentation. For each complete motion sequence sample, frames were extracted at a rate of 30 frames per second, and a pose estimation algorithm was used to automatically annotate 17 key points, which were then reviewed by two senior technicians. The coordinates of all skeleton sequences were then normalized to [-1, 1], thus constructing a human skeleton motion dataset for automotive maintenance, which was then divided into training and testing sets in a 4:1 ratio.
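The coordinate normalization and the 4:1 split can be illustrated as follows. Min-max scaling per coordinate axis is an assumption of this sketch (the text only states the target range [-1, 1]), the pixel values are made up, and a practical split would normally shuffle or stratify by action class first; the total of 400 samples follows from four procedures with 100 samples each.

```python
def normalize_to_unit_range(values):
    """Min-max scale a list of coordinates into [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: constant coordinate
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def split_train_test(samples, train_parts=4, test_parts=1):
    """Split in the ratio train_parts : test_parts, keeping sample order."""
    cut = len(samples) * train_parts // (train_parts + test_parts)
    return samples[:cut], samples[cut:]


# Hypothetical x coordinates (in pixels) of one key point over three frames.
xs = [120.0, 360.0, 600.0]
normalized = normalize_to_unit_range(xs)

# 4 procedures x 100 samples each = 400 sample indices.
train_set, test_set = split_train_test(list(range(400)))
```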
[0108] During model training, the target loss function determined by the cross-entropy loss function and the constraint function is used as the loss function, and Adam (an optimization algorithm) is used as the optimizer to iterate the action recognition model. The initial learning rate is set to 0.001, the batch size is 64, and the training is conducted for 300 rounds.
[0109] Optionally, the action recognition model of this application can be written and implemented based on Python 3.8 (a code language version) and PyTorch (a deep learning framework). The model training and experimentation of this application can both be processed on a server with an Intel Xeon E5 2680V2 CPU (a processor model) and an Nvidia RTX 3080 GPU (a processor model).
[0110] During training, the performance of the action recognition model is also validated. The model is trained on the human skeletal motion dataset from the automotive maintenance scenario for 300 epochs on the training set, and it is evaluated on the test set after each epoch to obtain the test accuracy curve shown in Figure 5, in which the horizontal axis represents the number of epochs and the vertical axis represents the accuracy. The action recognition model performed well on the test set: over the last 10 epochs, the maximum accuracy was 94.98%, the minimum accuracy was 94.30%, and the average accuracy reached 94.81%, indicating that the model can efficiently complete the task of recognizing car maintenance actions. Figure 6 shows the loss curve during training of the action recognition model, and Figure 7 shows the loss curve during testing; in both figures, the horizontal axis represents the number of epochs and the vertical axis represents the loss value. As Figures 6 and 7 show, the loss decreases rapidly and converges as training proceeds. The average loss on the test set over the last 50 epochs is 0.54, with the fluctuation range controlled within 0.03, which means that the action recognition model performs very stably on the test data. This stability improves the reliability and robustness of the model in practical applications.
[0111] To further demonstrate the effectiveness and superiority of the action recognition model in this application, it is compared with several representative deep learning methods. The comparison objects include standard GCN; standard GAT (Graph Attention Network); MCNN-LSTM (a hybrid deep learning architecture combining convolutional neural networks and long short-term memory networks), which excels in handling temporal data; AM-GCN (Adaptive Multi-channel Graph Convolutional Network), which considers a dual-graph structure; and MRF-GCN (Combined Multi-view Graph Convolutional Network), which employs a multi-view approach.
[0112] During model training, the same training strategy was used to train each network model for 300 epochs, and the accuracy of maintenance action recognition was then plotted. Figures 8 to 13 compare the accuracy of the different models over the training epochs; in each figure, the horizontal axis represents the number of epochs and the vertical axis represents the accuracy. Figure 8 shows the accuracy curve of the AM-GCN model, Figure 9 that of the DNDGN model, Figure 10 that of the GAT model, Figure 11 that of the GCN model, Figure 12 that of the MCNN-LSTM model, and Figure 13 that of the MRF-GCN model.
[0113] It can be clearly seen from Figures 8 to 13 that the DNDGN model proposed in this application converges faster and has significantly better accuracy than the other models, highlighting the good performance and superior stability of the DNDGN model in action recognition based on human skeletal motion data in automobile maintenance.
[0114] To better evaluate model performance, the overall classification accuracy of the last 10 generations of the model was selected as the evaluation metric, including maximum accuracy, minimum accuracy, average accuracy, and volatility. The specific results are shown in Table 1. It is evident that traditional GCN relies on a fixed neighborhood averaging strategy, ignoring fine-grained differences between key joints, thus limiting its overall performance. While GAT introduces a self-attention mechanism to distinguish neighborhood importance, it only extracts features from node data at the same time point, resulting in a volatility of up to 7.23%. MCNN-LSTM effectively captures temporal features using convolutional and long short-term memory networks, but lacks explicit modeling of the skeletal spatial structure, making it difficult to fully utilize pose information. AM-GCN alleviates the oversmoothing problem through a dual-graph design, but fails to effectively extract differential information, leading to limited accuracy. MRF-GCN strengthens local collaboration within multi-scale receptive fields, but still fails to simultaneously model temporal dependencies, resulting in an accuracy of only 91.23%.
[0115] In contrast, the DNDGN proposed in this application, on the one hand, locks a stable skeletal topology through the GCN difference branch, suppressing jitter noise. On the other hand, it focuses on the real displacement between two frames through the CNN difference branch. At the same time, it explicitly amplifies high-amplitude joint pairs through Top-K selection, and the joint consistency constraint effectively counteracts the oversmoothing effect in deep propagation of graph convolution. Thanks to this dynamic, difference-based neighborhood modeling strategy, DNDGN not only achieves the highest average accuracy, 94.81%, but also fluctuates by only 0.68% over the last 10 epochs, verifying its robustness and superior generalization ability in complex automotive maintenance scenarios.
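The four evaluation metrics can be computed as in the sketch below. Defining volatility as the maximum accuracy minus the minimum accuracy is an assumption, chosen because it is consistent with the reported figures (94.98 - 94.30 = 0.68). Only the maximum and minimum values in the toy list come from the text; the middle value is a made-up placeholder, so the mean of this list is not a reported result.

```python
def summarize_accuracy(accuracies):
    """Max, min, mean and volatility (assumed: max minus min) in percent."""
    highest = max(accuracies)
    lowest = min(accuracies)
    return {
        "max": highest,
        "min": lowest,
        "mean": sum(accuracies) / len(accuracies),
        "volatility": highest - lowest,
    }


# 94.30 and 94.98 are the reported extremes; 94.70 is a placeholder.
last_epochs = [94.30, 94.70, 94.98]
summary = summarize_accuracy(last_epochs)
volatility = round(summary["volatility"], 2)
```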
[0116] It is evident that the DNDGN proposed in this application significantly enhances the model's sensitivity and recognition ability to minor changes in key action nodes through the synergistic effect of the dynamic neighborhood difference perception mechanism and the difference feature enhancement mechanism.
[0117] Table 1
[0118] To verify the effectiveness of the different components of the model, the influence of each module on the performance of the DNDGN model was analyzed using the controlled-variable method. Specifically, this included: (1) removing the dual difference neighborhood weight matrix in the dynamic neighborhood difference perception mechanism to examine its impact on the overall model; (2) removing the difference feature enhancement mechanism; and (3) removing the consistency constraint. The specific results are shown in Table 2.
[0119] Table 2
[0120] Experimental results show that removing the dual-difference neighborhood weight matrix reduces the accuracy by 3.88%, indicating that the dual-difference neighborhood weight connection matrix is indispensable in both locking the stable topology of the skeleton and capturing the real displacement. Without this dynamic modeling, GCN reverts to a fixed neighborhood average, making it difficult to distinguish fine-grained action differences, resulting in a significant increase in the misclassification rate.
[0121] When the difference feature enhancement mechanism is removed, the model can still utilize the difference neighborhood, but it lacks Top-K key difference amplification. The deep graph convolution is oversmoothed, and the small changes of key action nodes are diluted, resulting in a 2.18% decrease in accuracy. This also confirms the value of Top-K key difference amplification in counteracting oversmoothing and improving action sensitivity.
[0122] After removing the consistency constraint, the GCN branch and the CNN branch are no longer aligned in the same semantic space, and jitter noise cannot be suppressed, resulting in a decrease in information utilization and a 1.39% drop in accuracy when fusing cross-branch features. This shows that the consistency regularization plays a key role in ensuring semantic consistency of multi-path features and suppressing noise.
[0123] As shown in Figure 14, this application provides an electronic device including a processor 1 and a memory 2. Optionally, the device may further include a communication interface 3 and a bus 4. The processor 1, the communication interface 3, and the memory 2 can communicate with each other via the bus 4. The communication interface 3 can be used for information transmission. The processor 1 can call logical instructions in the memory 2 to execute the action recognition method described in the above embodiments.
[0124] Furthermore, the logical instructions in the aforementioned memory 2 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0125] Memory 2, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this application. Processor 1 executes functional applications and data processing by running the program instructions / modules stored in memory 2, that is, it implements the action recognition method in the above embodiments.
[0126] The memory 2 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 2 may include high-speed random access memory and may also include non-volatile memory.
[0127] Among them, electronic devices can be computers or servers, etc.
[0128] This application provides a storage medium storing computer-executable instructions configured to perform the aforementioned action recognition method.
[0129] This application provides a computer program product, which includes a computer program stored on a storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer performs the aforementioned action recognition method.
[0130] The aforementioned computer-readable storage medium may be a transient computer-readable storage medium or a non-transitory computer-readable storage medium.
[0131] The technical solutions of the embodiments of this application can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned storage medium can be a non-transitory storage medium, including various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory, random access memory, magnetic disks, or optical disks; it can also be a transient storage medium.
[0132] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
[0133] The above are merely embodiments of this application and are not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application. Furthermore, the above embodiments can be combined with each other to form new embodiments without conflict.
Claims
1. A method for recognizing actions, characterized in that it comprises:
using a graph convolutional network to extract features from multiple input human body structure diagrams to obtain multiple first features, wherein different human body structure diagrams are determined from different video frames of the video to be judged, and each human body structure diagram uses human skeletal joints as graph nodes and human skeletal connections as graph edges;
determining each first difference weight matrix based on each difference feature, wherein one difference feature is the difference between the two first features corresponding to one group of temporally adjacent human body structure diagrams;
for each group of temporally adjacent human body structure diagrams: inputting the group into a convolutional neural network to obtain an off-diagonal difference weight matrix; calculating the node-level residuals of the group; adding the node-level residuals to the diagonal elements of the off-diagonal difference weight matrix to obtain a second update matrix; and symmetrizing and row-normalizing the second update matrix to obtain the second difference weight matrix corresponding to the group;
performing feature extraction using the graph convolutional network, based on each first difference weight matrix and each human body structure diagram, to obtain each first difference weight feature;
performing feature extraction using the graph convolutional network, based on each second difference weight matrix and each human body structure diagram, to obtain each second difference weight feature; and
processing each first difference weight feature and each second difference weight feature using a preset classification layer to obtain the human action.
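The construction of the second difference weight matrix in claim 1 can be sketched as follows. This is a minimal illustration and not the applicant's implementation. The claim does not define how the node-level residual is computed or how the matrix is symmetrized, so the sketch assumes the residual is the L2 norm of each joint's change between the two frames and that symmetrization averages the matrix with its transpose. The function name, the `eps` guard, and the stand-in CNN output `off_diag` are hypothetical.

```python
import numpy as np

def second_difference_weight_matrix(off_diag, frame_t, frame_t1, eps=1e-8):
    """Build a second difference weight matrix from a CNN output and two frames.

    off_diag : (N, N) off-diagonal difference weight matrix (stand-in for the CNN output)
    frame_t, frame_t1 : (N, C) joint features of two temporally adjacent frames
    """
    # Assumption: node-level residual = L2 norm of each joint's change.
    residual = np.linalg.norm(frame_t1 - frame_t, axis=1)
    # Add the residuals to the diagonal to obtain the second update matrix.
    updated = off_diag + np.diag(residual)
    # Assumption: symmetrize by averaging with the transpose.
    symmetric = 0.5 * (updated + updated.T)
    # Row-normalize so each row sums to (approximately) one.
    row_sum = symmetric.sum(axis=1, keepdims=True)
    return symmetric / (row_sum + eps)

# Toy example with three joints and 2-D coordinates.
off_diag = np.array([[0.0, 0.2, 0.4],
                     [0.6, 0.0, 0.1],
                     [0.0, 0.3, 0.0]])
frame_t = np.zeros((3, 2))
frame_t1 = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
W2 = second_difference_weight_matrix(off_diag, frame_t, frame_t1)
```

Because row normalization is applied last, the result is generally no longer symmetric even though the intermediate matrix is.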
2. The method according to claim 1, characterized in that using the graph convolutional network to extract features from the multiple input human body structure diagrams to obtain the multiple first features comprises:
for each human body structure diagram:
obtaining the adjacency matrix corresponding to the human body structure diagram;
adding an identity matrix to the adjacency matrix to obtain a first update matrix;
normalizing the first update matrix to obtain a first normalized matrix; and
substituting the product of the first normalized matrix, the human body structure diagram, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the first feature corresponding to the human body structure diagram.
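The steps of claim 2 correspond to a single graph-convolution layer, which can be sketched as below. The claim names neither the normalization nor the activation, so the sketch assumes symmetric degree normalization and ReLU. The function name and the toy three-joint skeleton are hypothetical.

```python
import numpy as np

def gcn_first_feature(adjacency, node_features, weight):
    """Compute activation(normalize(A + I) @ X @ W) for one skeleton graph."""
    n = adjacency.shape[0]
    # First update matrix: adjacency plus self-loops.
    updated = adjacency + np.eye(n)
    # Assumption: symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
    degree = updated.sum(axis=1)  # at least 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    normalized = d_inv_sqrt[:, None] * updated * d_inv_sqrt[None, :]
    # Assumption: ReLU as the preset nonlinear activation function.
    return np.maximum(normalized @ node_features @ weight, 0.0)

# Toy skeleton: three joints connected in a chain, two channels per joint.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.arange(6, dtype=float).reshape(3, 2)
W = np.eye(2)  # stand-in for the learnable weight matrix
H = gcn_first_feature(A, X, W)
```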
3. The method according to claim 1, characterized in that determining each first difference weight matrix based on each difference feature comprises:
for each human body structure diagram:
calculating the squared norm of the difference between the difference feature corresponding to the i-th node and the difference feature corresponding to the j-th node in the human body structure diagram to obtain a first candidate element;
dividing the first candidate element by a preset standard deviation parameter to obtain a second candidate element; and
substituting the second candidate element into a preset negative exponential function to obtain the element in the i-th row and j-th column of the first difference weight matrix corresponding to the human body structure diagram.
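The element-wise rule in claim 3 amounts to a Gaussian-style kernel over per-joint difference features, sketched below. The sketch treats the preset standard deviation parameter as a single scalar divisor, here named `sigma_param`. Whether that scalar stands for σ, σ², or 2σ² is not stated in the claim, so this reading is an assumption. The function name is hypothetical.

```python
import numpy as np

def first_difference_weight_matrix(diff_features, sigma_param=1.0):
    """W[i, j] = exp(-||d_i - d_j||^2 / sigma_param) for per-joint difference features.

    diff_features : (N, C) array, one difference feature per joint
    """
    # Pairwise differences between joints via broadcasting: shape (N, N, C).
    delta = diff_features[:, None, :] - diff_features[None, :, :]
    sq_norm = np.sum(delta ** 2, axis=-1)  # first candidate elements
    scaled = sq_norm / sigma_param         # second candidate elements
    return np.exp(-scaled)                 # negative exponential

# The difference feature is the difference of two temporally adjacent first features.
feat_t = np.zeros((3, 2))
feat_t1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
D = feat_t1 - feat_t
W1 = first_difference_weight_matrix(D, sigma_param=1.0)
```

Under this reading the matrix is symmetric with ones on the diagonal, and joints whose difference features are similar receive weights close to one.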
4. The method according to claim 1, characterized in that performing feature extraction using the graph convolutional network, based on each first difference weight matrix and each human body structure diagram, to obtain each first difference weight feature comprises:
for each first difference weight matrix:
normalizing the first difference weight matrix to obtain a second normalized matrix; and
substituting the product of the second normalized matrix, the human body structure diagram corresponding to the first difference weight matrix, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the first difference weight feature corresponding to the first difference weight matrix.
5. The method according to claim 1, characterized in that performing feature extraction using the graph convolutional network, based on each second difference weight matrix and each human body structure diagram, to obtain each second difference weight feature comprises:
for each second difference weight matrix:
normalizing the second difference weight matrix to obtain a third normalized matrix; and
substituting the product of the third normalized matrix, the human body structure diagram corresponding to the second difference weight matrix, and a preset learnable weight matrix into a preset nonlinear activation function to obtain the second difference weight feature corresponding to the second difference weight matrix.
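Claims 4 and 5 describe the same propagation pattern applied to the first and second difference weight matrices respectively. A minimal sketch of that shared step follows. The claims do not name the normalization or the activation, so the sketch assumes row normalization and ReLU. The function name and the `eps` guard are hypothetical.

```python
import numpy as np

def difference_weight_feature(weight_matrix, node_features, theta, eps=1e-8):
    """Compute activation(normalize(W) @ X @ Theta) for one difference weight matrix."""
    # Assumption: row normalization, so each row of W sums to about one.
    row_sum = weight_matrix.sum(axis=1, keepdims=True)
    normalized = weight_matrix / (row_sum + eps)
    # Assumption: ReLU as the preset nonlinear activation function.
    return np.maximum(normalized @ node_features @ theta, 0.0)

W_diff = np.array([[1.0, 1.0],
                   [0.0, 2.0]])   # stand-in difference weight matrix
X = np.array([[2.0, -4.0],
              [4.0, 0.0]])        # joint features of the corresponding diagram
theta = np.eye(2)                 # stand-in for the learnable weight matrix
F = difference_weight_feature(W_diff, X, theta)
```

The same function can be called with either a first or a second difference weight matrix, since the two claims differ only in which matrix is supplied.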
6. The method according to claim 1, characterized in that the method further comprises:
retaining, using a preset Top-K algorithm, the K edges with the largest weights in each first difference weight matrix to obtain each first key difference matrix; and
retaining, using the preset Top-K algorithm, the K edges with the largest weights in each second difference weight matrix to obtain each second key difference matrix;
correspondingly, processing each first difference weight feature and each second difference weight feature using the preset classification layer to obtain the human action comprises:
fusing each first key difference matrix with the corresponding first difference weight feature to obtain each first fusion matrix;
fusing each second key difference matrix with the corresponding second difference weight feature to obtain each second fusion matrix;
concatenating the first fusion matrix and the second fusion matrix to obtain a first fused feature;
inputting a second fused feature into a fully connected layer to obtain a first output feature, wherein the second fused feature is obtained by concatenating the first fused features; and
inputting the first output feature into the preset classification layer for processing to obtain the human action.
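The Top-K retention step in claim 6 can be sketched as follows. The claim does not say whether diagonal entries count as edges or how ties are broken. The sketch assumes every matrix entry is a candidate edge and zeroes everything outside the K largest entries. The function name is hypothetical.

```python
import numpy as np

def keep_top_k_edges(weights, k):
    """Return a copy of `weights` with all but the k largest entries set to zero."""
    flat = weights.ravel()
    if k <= 0:
        return np.zeros_like(weights)
    if k >= flat.size:
        return weights.copy()
    # argpartition places the indices of the k largest entries in the last k slots.
    keep = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return np.where(mask.reshape(weights.shape), weights, 0.0)

W = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.3],
              [0.4, 0.7, 0.6]])
key_matrix = keep_top_k_edges(W, k=3)
```

`np.argpartition` avoids a full sort, which is sufficient here because only membership in the top K matters and not the order within it.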
7. A training method for an action recognition model, characterized in that the action recognition model performs the method as described in any one of claims 1 to 6, and the training method comprises:
obtaining a sequence of human body structure diagrams with action labels; and
training the action recognition model with a target loss function, using a preset optimization algorithm and based on the sequence of human body structure diagrams with action labels, until a preset stopping condition is met or the number of training iterations reaches a set number, thereby obtaining an updated action recognition model;
wherein the target loss function is determined based on a preset loss function and a constraint function, and the constraint function is determined based on the second feature corresponding to the first key difference matrix and the third feature corresponding to the second key difference matrix.
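The composition of the target loss in claim 7 can be illustrated with the sketch below. The claim states only that the target loss combines a preset loss function with a constraint function built from the second and third features, and it gives the form of neither. The sketch therefore assumes cross-entropy as the preset loss, a mean squared difference between the two features as the constraint, and a hypothetical weighting coefficient `lam`. All three choices are illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def target_loss(logits, label, second_feature, third_feature, lam=0.1):
    """Assumed form: preset loss + lam * constraint."""
    # Assumption: the constraint penalizes disagreement between the two features.
    constraint = np.mean((second_feature - third_feature) ** 2)
    return cross_entropy(logits, label) + lam * constraint

logits = np.array([0.0, 0.0])  # two action classes, uninformative prediction
loss_equal = target_loss(logits, 0, np.ones(4), np.ones(4))
loss_apart = target_loss(logits, 0, np.ones(4), np.zeros(4))
```

With identical features the constraint vanishes and only the assumed preset loss remains, while diverging features raise the total by `lam` times their mean squared difference.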
8. An electronic device, characterized in that it includes a processor and a memory, the memory storing computer-executable instructions that can be executed by the processor, and the processor executing the computer-executable instructions to implement the method for recognizing actions as described in any one of claims 1 to 6, or to implement the training method as described in claim 7.
9. A computer program product, characterized in that the computer program product includes a computer program that, when executed by a processor, implements the method for recognizing actions as described in any one of claims 1 to 6, or the training method as described in claim 7.