Action Recognition Method and System Based on Multi-Scale Spatiotemporal Topic Graph Convolutional Network
A multi-scale spatiotemporal topic graph convolutional network models both the physical and non-physical connections between human skeletal nodes. This addresses the neglect of joint relationships in existing methods and improves both the accuracy and the efficiency of action recognition in real-world environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2023-04-07
- Publication Date
- 2026-06-30
AI Technical Summary
Existing skeletal motion recognition methods ignore the non-physical connections between joints, which makes them difficult to apply to arbitrary skeleton structures. They also pay too little attention to densely connected regions and cannot effectively capture the spatial information of the limbs.
A multi-scale spatiotemporal topic graph convolutional network is adopted. Topic graph convolutional modules and multi-scale graph convolutional modules model the physical and non-physical connections of human skeletal nodes to extract feature information between joints. A matrix defined by Euclidean distance and higher-order network structure feature descriptors further strengthen feature extraction.
The method improves the accuracy and efficiency of human motion recognition, especially in real-world environments, by capturing complex relationships between joints and strengthening the model's feature extraction.
Figure CN116453218B_ABST (abstract drawing)
Description
Technical Field
[0001] This invention belongs to the field of posture recognition and relates to human action recognition and computer vision. Specifically, it relates to a method and system for human action recognition based on a multi-scale spatiotemporal topic graph convolutional network.
Background Technology
[0002] With the popularization of real-time human motion recognition technology, its applications in fields such as intelligent monitoring, smart homes, and motion-sensing games are becoming increasingly widespread. In recent years, more and more researchers have focused on recognizing human movements from the human skeleton. The human body can be regarded as a natural skeletal topology, and skeletal sequences represent human behavior well while greatly reducing the computational redundancy of semantic analysis. Compared with analyzing human activity from raw RGB data, recognition from skeletal data is more stable and robust to changes in lighting, body appearance, and complex backgrounds.
[0003] Early methods for skeletal motion recognition encoded the positions of all body joints in each frame into feature vectors or pseudo-images, which were then fed into RNNs or CNNs to generate predictions. However, skeleton data encoded as feature vectors cannot adequately represent the correlations between joints. These methods often ignore the rich motion information between joints, which makes them difficult to apply to arbitrary skeleton forms. As research on graph convolutional networks has deepened, such networks have proven effective at extracting information from skeleton data. Yan et al. first used graph theory to model the correlations between human joints, constructing a spatiotemporal graph convolutional network (ST-GCN) that automatically learns the spatial and temporal information of skeletal data and uses graph convolution and temporal convolution to extract motion features.
[0004] However, manually defined topologies make it difficult to establish connections between joints that are not naturally connected. Existing recognition algorithms use static adjacency matrices, and their partitioning strategies pay insufficient attention to skeletal joints in dense regions, which hinders the capture of spatial information from the limbs. For example, in most movements, the relative positions of the left hand and left shoulder are more relevant than those of the left hand and right foot. Measuring the potential correlation between joints requires considering the spatial distance between each pair of nodes; the non-physical connections between human skeletal nodes are latent and may change with movement. Both physical and non-physical connections between joints should therefore be considered. Based on this, this invention introduces the concept of biological motifs (the "topic" in "topic graph" refers to such a motif), using higher-order network structure feature descriptors to define a motif as a connected subgraph consisting of a small number of nodes, which is used to extract physical and non-physical connections between skeletal nodes.
[0005] Therefore, based on existing skeletal action recognition methods, this invention constructs a human action recognition system based on a multi-scale spatiotemporal topic graph convolutional network. For the physical and non-physical connections between different skeletal nodes in the human body, this invention simulates the connection and disconnection of skeletal nodes through a multi-scale topic graph convolutional network, obtaining the physical and non-physical connection features between skeletal nodes, thereby improving the model's ability to extract features between human skeletal nodes. Simultaneously, the action recognition model is designed for a real-world human action recognition system, ensuring the accuracy of human action recognition while effectively improving recognition efficiency in real-world environments.
[0006] CN113657349A discloses a human behavior recognition method based on a multi-scale spatiotemporal graph convolutional neural network, belonging to the field of neural network technology. The method includes extracting and preprocessing a dataset from the human skeleton sequence to be recognized; creating a deep neural network model containing a multi-scale graph convolutional module and a multi-temporal feature fusion module, enabling the model to better extract the spatial features of the human skeleton and the temporal features of the skeleton sequence; training and testing the deep neural network to obtain a human behavior recognition neural network model; and using the trained model to classify the video images to be recognized and output the classification results. The method provided by that invention enables the neural network model to better extract the spatiotemporal features of the skeleton sequence, achieving automatic recognition of human behavior and improving recognition accuracy.
[0007] The aforementioned patent CN113657349A focuses on the feature information of physically connected nodes in the human skeleton, specifically the information of connected skeletal nodes. The multi-scale graph convolution module also targets the extraction of features related to physical connections. However, it neglects the non-physically connected features of human skeletal nodes, thus limiting the definition of skeletal node features. Assessing the potential correlation between joints should consider the spatial distance between each pair of nodes, not just the physical connections. This invention, however, starts from the spatial distance between each pair of nodes, defining human skeletal nodes as either physically connected or non-physically connected. The extracted human skeletal features include the features of physically connected nodes (i.e., the skeletal features extracted by patent CN113657349A) and the features between each pair of non-connected nodes (i.e., the non-physically connected features mentioned in this patent). The difference between this invention and CN113657349A is that CN113657349A has limitations, extracting skeletal features only from fixed physically adjacent nodes. This invention, on the other hand, extracts skeletal features not only from fixed physically adjacent nodes but also, more importantly, from non-physically adjacent nodes.
[0008] In this invention, a topic graph convolutional network is used to extract features of non-physical connections between human skeletal nodes, and a multi-scale improved graph convolutional network is used to extract features of physical connections between skeletal nodes. The features extracted by the multi-scale graph convolution are used to enhance the information of the features extracted by the topic graph convolution. Finally, the multi-scale topic graph convolutional network enhances the entire network for the physical and non-physical connections of human skeletal nodes.
Summary of the Invention
[0009] This invention aims to solve the problems of the prior art. It proposes an action recognition method based on a multi-scale spatiotemporal topic graph convolutional network. The technical solution of this invention is as follows:
[0010] An action recognition method based on a multi-scale spatiotemporal topic graph convolutional network includes the following steps:
[0011] Step S1, generating the action recognition feature map: a keypoint confidence and affinity prediction network extracts skeletal key points from the input image data to generate the feature map to be recognized;
[0012] Step S2, establishing a topic graph convolutional network: for neighboring joints and the possible relationships between each pair of joints, construct a topic-based adjacency matrix of the human skeleton, add a partitioning strategy for the nodes of the limb regions, and establish a topic graph convolutional network over the connections between each pair of joints;
[0013] Step S3, establishing a multi-scale topic graph convolutional network: apply multi-scale feature extraction to the skeletal features of adjacent nodes of the human skeleton to establish a multi-scale topic graph convolutional network;
[0014] Step S4, establishing a multi-scale spatiotemporal topic graph convolutional network model: train the model using training samples to verify its feasibility and robustness;
[0015] Step S5, detecting common actions in daily environments and promptly issuing an alarm when a person in the video image performs a dangerous action.
[0016] Furthermore, step S1, which generates the action recognition feature map, specifically includes the following steps:
[0017] S11: The keypoint confidence and affinity prediction network is used to acquire human skeletal information. The keypoint confidence and affinity prediction network represents human skeletal information through the Gaussian distance and affinity of human joints, thereby acquiring data of 18 joints of the human skeleton.
[0018] S12: The skeletal sequence is represented by the 2D or 3D coordinates of the human skeletal joints in each frame image. Each 2D or 3D coordinate is represented as a vector. A complete video motion sequence is represented as a sequence of vectors between different frames.
[0019] S13: Based on the action content in the video, annotate the video samples to facilitate subsequent model training and prediction.
[0020] Furthermore, step S2, which involves establishing a topic graph convolutional network, specifically includes:
[0021] S21. Based on the natural connections of the human body, connect the joints to form a human skeleton graph; represent the skeletal joints as vertices of the graph, and represent the edges between corresponding vertices in two adjacent frames as temporal edges, thus completing the construction of the physically connected human skeleton graph.
[0022] S22. Based on the Euclidean distance relationship between human joints, human skeleton nodes are divided into physically connected nodes and non-physically connected nodes; the human skeleton graph is transformed into an adjacency matrix based on Euclidean distance, and the adjacency matrix describes the connection relationship between physically connected nodes and non-physically connected nodes in the human skeleton graph.
[0023] S23: Establish a topic graph convolution module and a graph convolution module to simulate the physical connection and disconnection of joints in the human skeleton; extract the physical connection feature information and non-physical connection feature information of the human skeleton;
[0024] S24: The graph convolution module extracts the adjacency relationships of joints in the human skeleton, and obtains the physical connection features of the human joints.
[0025] S25: The topic graph convolution module uses a high-order network structure feature description operator. By calculating the average Euclidean distance between each pair of nodes, it assigns larger weights to joints with shorter distances to define disconnected joints and extracts possible relationships in non-physically connected joints in the human skeleton; thus obtaining non-physically connected features in human joints.
[0026] Furthermore, step S23 establishes a topic graph convolutional module and a graph convolutional module to simulate the physical connections and disconnections of joints in the human skeleton; extracts physical connection feature information and non-physical connection feature information of the human skeleton, specifically including:
[0027] In the topic-based partitioning strategy for human skeleton nodes, a strategy based on the spatial distance of physically connected nodes is used for node partitioning. In defining the adjacency matrix of physically connected nodes, node extraction of the human limb region is introduced to enhance the features of limb extraction. Nodes in the limb region are defined as subsets of the limb region. In the new skeleton node partitioning strategy, adjacent nodes are divided into four subsets: 1) root node subset; 2) proximal subset; 3) distal subset; 4) limb region subset. The new skeleton node partitioning strategy is as follows:
[0028] $$L_i(v_j) = \begin{cases} 0, & r_j = r_i \\ 1, & r_j < r_i \\ 2, & r_j > r_i \\ 3, & v_j, v_i \in V_s \end{cases}$$
[0029] In the formula, $L_i(v_j)$ is the partitioning function of the human skeleton graph, $r$ is the distance between a joint and the centroid (center of gravity) of the skeleton, and $v_j$ is a first-order neighbor of the root node $v_i$. When the distance $r_j$ from $v_j$ to the centroid equals the distance $r_i$ from the root node to the centroid, $v_j$ is labeled 0; when $r_j$ is less than $r_i$, $v_j$ is labeled 1; when $r_j$ is greater than $r_i$, $v_j$ is labeled 2. $V_s$ is the set of nodes in the limb region; when nodes $v_j$ and $v_i$ both belong to $V_s$, they are labeled 3.
[0030] Furthermore, step S3, which establishes a multi-scale topic graph convolutional network, specifically includes the following steps:
[0031] S31: The adjacency matrix based on the topic is divided into physically connected nodes and non-physically connected nodes. For physically connected nodes, different convolutions are set for the nodes to extract features, and the importance weights W1, W2, and W3 of different features are obtained. After obtaining the outputs of three different action features, they are added to the output of the topic graph convolution module to obtain the final output.
[0032] S32: For each frame's input $f_{in} \in \mathbb{R}^{N \times D}$, a set of N joints with D-dimensional coordinates, the multi-scale topic graph convolutional network is defined as follows:
[0033]
[0034] In the formula, $C_v$ denotes the number of convolution kernels and $K_M$ denotes the semantic roles. The remaining terms are the adjacency matrix defined for the topic graph convolution, which encodes the semantic roles between all physically connected nodes in the skeleton graph; the diagonal matrix of that adjacency matrix; and a weighting function.
[0035] Furthermore, step S4, which establishes a multi-scale spatiotemporal topic graph convolutional network model, specifically includes the following steps:
[0036] S41: Construct a multi-scale spatiotemporal topic graph convolutional network model by stacking multiple multi-scale topic graph convolutional networks and temporal convolutional networks; during training, the model is optimized with a gradient descent algorithm and a cross-entropy loss function;
[0037] S42: Process the RGB images containing action features, convert them into corresponding image files, store the image files separately, and divide them into training and test sets for action recognition;
[0038] S43: Input the processed action recognition dataset into a multi-scale spatiotemporal topic graph convolutional network for training, and then use softmax classification to obtain multiple training results.
[0039] An action recognition system based on a multi-scale spatiotemporal topic graph convolutional network includes a data acquisition module, a human skeleton node extraction module, a human action recognition module, an action recognition display module, an early warning module, and a client app module; wherein,
[0040] The data acquisition module is used to acquire video information captured by the camera and process the data into the format required by the motion recognition system, realizing video format conversion and resolution adjustment;
[0041] The human skeleton node extraction module is used to extract human skeleton node information from the data collected by the data acquisition module, obtain human skeleton motion information in the data, and perform human body tracking on people appearing in the video.
[0042] The human motion recognition module is used to determine the type of human motion based on the human skeletal motion information processed by the human skeletal node extraction module, and to save the determined motion information.
[0043] The motion recognition display module provides an intuitive interface for displaying the motion information determined by the human motion recognition module.
[0044] The early warning module determines whether the human motion recognition module identifies a dangerous action. If a dangerous action is detected, it immediately sends an early warning to the client app, notifying the user to take necessary safety measures. If no dangerous action is detected, the motion recognition system continues to monitor the area captured by the camera.
[0045] The client app module allows users to view real-time video data captured by the camera and receive timely alerts in case of dangerous actions.
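The early warning flow described in [0044] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the action labels, the `DANGEROUS_ACTIONS` set, and the `notify` callback are hypothetical names introduced here.

```python
from typing import Callable, Iterable, List

# Hypothetical set of action labels treated as dangerous (an assumption of this sketch).
DANGEROUS_ACTIONS = {"fall", "fight"}

def early_warning(recognized: Iterable[str], notify: Callable[[str], None]) -> List[str]:
    """Send an alert for each dangerous action; otherwise keep monitoring."""
    alerts = []
    for action in recognized:
        if action in DANGEROUS_ACTIONS:
            notify(action)        # push a warning to the client app
            alerts.append(action)
        # Non-dangerous actions need no handling: monitoring simply continues.
    return alerts

sent: List[str] = []
alerts = early_warning(["walk", "fall", "sit"], sent.append)
```

Here `sent.append` stands in for whatever push channel the client app would actually use.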
[0046] The advantages and beneficial effects of this invention are as follows:
[0047] This invention improves the adjacency matrix of the human skeleton based on graph convolutional network action recognition algorithms. It uses topic graph convolutional modules and multi-scale graph convolutional modules to simulate the disconnection and connection of the human skeleton, extracting both physical and non-physical connection information between skeletal joints. This improves the model's accuracy without increasing its complexity.
[0048] (1) Existing skeletal motion recognition methods use fixed human physical structures for modeling, resulting in poor model flexibility. This invention adopts a new human skeletal diagram design, using Euclidean distances between disconnected joints to define a matrix for modeling the spatial features of motion. Simultaneously, secondary extraction of node features from physically connected limb regions effectively enhances the model's feature extraction capability.
[0049] (2) In view of the problem that existing skeletal motion recognition methods ignore the features of non-physical connections between skeletal joints, this invention proposes a topic graph convolution module. The topic graph convolution module uses a connected subgraph composed of a small number of nodes, considers the spatial distance between each pair of nodes, and measures the possible correlation between skeletal joints, thereby effectively capturing efficient structural features in skeletal nodes.
[0050] (3) To address the problem of insufficient action extraction, this invention proposes a multi-scale graph convolution module. At the nodes of physical connection of human skeleton, multiple convolutions of different scales are used to extract various skeleton features, thereby enriching the features of skeleton nodes.
[0051] (4) While improving the accuracy of action recognition, this invention combines the needs of real life, establishes a human action recognition system based on a multi-scale spatiotemporal topic graph convolutional network, and builds a client app for action recognition, which is beneficial for action recognition in real life.
Attached Figure Description
[0052] Figure 1 is a flowchart of a preferred embodiment of the human action recognition method based on a multi-scale spatiotemporal topic graph convolutional network provided by the present invention;
[0053] Figure 2 is a structural diagram of the topic graph convolutional module;
[0054] Figure 3 is a structural diagram of the multi-scale topic graph convolutional module;
[0055] Figure 4 is a diagram of the overall structure of the multi-scale spatiotemporal topic graph convolutional network model;
[0056] Figure 5 is a flowchart of the human motion recognition system;
[0057] Figure 6 shows the results of human motion recognition.
Detailed Implementation
[0058] The technical solutions of the present invention will be clearly and thoroughly described below with reference to the accompanying drawings. The described embodiments are merely some embodiments of the present invention.
[0059] The technical solution of the present invention to solve the above-mentioned technical problems is:
[0060] This embodiment provides a method and system for human action recognition based on multi-scale spatiotemporal topic graph convolution. As shown in Figure 1, the specific steps include:
[0061] S1: Establish an action recognition feature map, which includes the following steps:
[0062] S11: A keypoint confidence and affinity prediction network is used to acquire skeletal information of the human body, obtaining data of 18 joint points of the human skeleton.
[0063] S12: Skeletal sequences are usually represented by 2D or 3D coordinates of human skeletal joints in each frame. Each 2D or 3D coordinate can be represented as a vector, so a complete video motion sequence can be represented as a sequence of vectors between different frames.
[0064] S13: Based on the action content in the video, annotate the video samples to facilitate subsequent model training and prediction.
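The representation in S11 to S13 can be sketched as an array layout. The example below is illustrative only: the frame count, the random keypoints, and the `label` value are hypothetical, and only the 18-joint count comes from the text.

```python
import numpy as np

T, N, C = 30, 18, 2            # frames, joints (18 as in S11), coordinates per joint (2D)
rng = np.random.default_rng(0)
# Hypothetical keypoint detections: one (x, y) pair per joint per frame.
sequence = rng.random((T, N, C))

# Each frame flattens to one vector of length N*C; the video is the sequence of these vectors.
frame_vectors = sequence.reshape(T, N * C)
label = "wave"                 # S13: a hypothetical annotation for the whole clip
```

For 3D skeletons the same layout applies with `C = 3`.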
[0065] S2: Establish a topic graph convolutional network, which includes the following steps:
[0066] S21: For each joint and the possible relationships between each pair of joints, construct an undirected graph $G_t = \{V_t, E_t\}$ representing a skeleton sequence with N joints and T frames, where $V_t$ is the set of N key points and $E_t$ encodes the relationship between each pair of nodes. The neighborhood set of node $v_{ti}$ is defined as $N(v_{ti}) = \{v_{tj} \mid d(v_{ti}, v_{tj}) \le D\}$, where $d(v_{ti}, v_{tj})$ is the shortest path from $v_{ti}$ to $v_{tj}$ and $D$ is the neighborhood distance of adjacent nodes.
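The neighborhood set defined in S21 can be computed with a breadth-first search over the skeleton graph. The sketch below assumes the shortest path is measured in hops; the function name `neighborhood` and the 5-joint chain are hypothetical.

```python
from collections import deque

def neighborhood(edges, num_nodes, root, max_dist):
    """Return {j : d(root, j) <= max_dist}, with d the shortest-path hop count."""
    adj = [[] for _ in range(num_nodes)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the neighborhood radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Hypothetical 5-joint chain: 0-1-2-3-4
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
one_hop = neighborhood(chain, 5, root=2, max_dist=1)
two_hop = neighborhood(chain, 5, root=2, max_dist=2)
```

The root itself is included because its distance to itself is 0.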
[0067] S22: A mapping function $l: V_t \to \{1, 2, \dots, K\}$ is designed that assigns a label in $\{1, 2, \dots, K\}$ to each graph vertex $v_{ti} \in V_t$, dividing the neighbor set $N(v_{ti})$ of node $v_{ti}$ into a fixed number of K subsets. The edge set $E_t$ consists of two subsets. The first subset, $E_S = \{v_{ti} v_{tj} \mid (i, j) \in H\}$, represents the connections within each frame's skeleton, where H is the set of naturally connected human joints. The second subset, $E_F = \{v_{ti} v_{(t+1)i}\}$, represents the edges connecting consecutive frames; all edges in $E_F$ for a specific joint i represent its trajectory over time.
[0068] S23: In the topic-based partitioning strategy for human skeleton nodes, the physically connected nodes are partitioned according to a spatial distance strategy. This invention introduces node extraction of the human limb region into the adjacency matrix of physically connected nodes, enhancing the extraction of limb features. Nodes in the limb region are defined as the limb region subset. In the new skeleton node partitioning strategy, adjacent nodes are divided into four subsets: 1) root node subset; 2) centripetal (proximal) subset; 3) centrifugal (distal) subset; 4) limb region subset. The new skeleton node partitioning strategy is as follows:
[0069] $$L_i(v_j) = \begin{cases} 0, & r_j = r_i \\ 1, & r_j < r_i \\ 2, & r_j > r_i \\ 3, & v_j, v_i \in V_s \end{cases}$$
[0070] In the formula, $L_i(v_j)$ is the partitioning function of the human skeleton graph, $r$ is the distance between a joint and the centroid (center of gravity) of the skeleton, and $v_j$ is a first-order neighbor of the root node $v_i$. When the distance $r_j$ from $v_j$ to the centroid equals the distance $r_i$ from the root node to the centroid, $v_j$ is labeled 0; when $r_j$ is less than $r_i$, $v_j$ is labeled 1; when $r_j$ is greater than $r_i$, $v_j$ is labeled 2. $V_s$ is the set of nodes in the limb region; when nodes $v_j$ and $v_i$ both belong to $V_s$, they are labeled 3.
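The four-way partitioning rule can be sketched as a small function. The text does not say which label wins when a limb-region pair also satisfies one of the distance conditions, so this sketch assumes the limb-region label takes precedence; `partition_label` and the tolerance `tol` are hypothetical.

```python
import math

def partition_label(r_i, r_j, i_in_limb, j_in_limb, tol=1e-6):
    """Label neighbor v_j of root v_i: 0 same distance, 1 closer, 2 farther, 3 limb region."""
    if i_in_limb and j_in_limb:
        return 3      # assumption: the limb-region subset takes precedence
    if math.isclose(r_j, r_i, abs_tol=tol):
        return 0      # same distance to the centroid as the root
    return 1 if r_j < r_i else 2

labels = [
    partition_label(1.0, 1.0, False, False),
    partition_label(1.0, 0.5, False, False),
    partition_label(1.0, 1.5, False, False),
    partition_label(1.0, 1.5, True, True),
]
```

The floating-point tolerance replaces the exact equality in the formula, since measured joint distances are rarely exactly equal.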
[0071] S24: In the human skeleton, each joint and its neighboring nodes are divided into root nodes, parent nodes, and child nodes. For physically connected nodes, the topic graph convolutional module encodes the nodes using different semantic roles between adjacent nodes and defines the node weights using three semantic roles ($K_{M1} = 3$) to obtain the physical connection features of the human joints. For possible non-physical connections between human skeleton nodes, two semantic roles ($K_{M2} = 2$) are used to define the features of non-physical relationships. The non-physical connection relationships between nodes are determined by the weighted adjacency matrix (WAM). The WAM is calculated from the Euclidean distance between each pair of nodes and defines disconnected joints by assigning larger weights to joints with shorter distances.
[0072] S25: To compute the weighted adjacency matrix (WAM), the input skeleton sequence $X_{t=1 \dots F}$ is averaged along the time dimension to obtain an averaged skeleton $x$, and the Euclidean distance between each pair of nodes is calculated. Let $e(i,j)$ denote the Euclidean distance between non-physically connected joints i and j of $x$. When i and j are 0, $e(i,j)$ is 0.
[0073] S26: The WAM is defined from the Euclidean distance between non-physically connected joints i and j as follows:
[0074] $$\alpha_{i,j} = \max e - e(i,j)$$
[0075] In the formula, $e(i,j)$ is the $(i,j)$-th element of the matrix $e \in \mathbb{R}^{V \times V}$, which holds the distance between each pair of non-physically connected joints.
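A minimal NumPy sketch of the WAM computation follows. It assumes distances are measured on the time-averaged skeleton and that entries for physically connected pairs and the diagonal are set to zero; the function name `weighted_adjacency` and the toy skeleton are hypothetical.

```python
import numpy as np

def weighted_adjacency(seq, physical_edges):
    """seq: (F, V, C) skeleton sequence; returns the (V, V) WAM."""
    mean_skeleton = seq.mean(axis=0)                   # average over time -> (V, C)
    diff = mean_skeleton[:, None, :] - mean_skeleton[None, :, :]
    e = np.linalg.norm(diff, axis=-1)                  # pairwise Euclidean distances
    alpha = e.max() - e                                # shorter distance -> larger weight
    # Keep weights only for non-physically connected pairs (assumption of this sketch).
    np.fill_diagonal(alpha, 0.0)
    for i, j in physical_edges:
        alpha[i, j] = alpha[j, i] = 0.0
    return alpha

# Hypothetical 4-joint skeleton on a line at x = 0, 1, 2, 4, constant over 3 frames.
pose = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
seq = np.stack([pose] * 3)
wam = weighted_adjacency(seq, physical_edges=[(0, 1), (1, 2), (2, 3)])
```

In this toy case the largest distance is 4, so the pair (0, 2) at distance 2 receives weight 2 and the farthest pair (0, 3) receives weight 0.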
[0076] S27: Combining the above formula, for the skeleton input $X_t$ of the t-th frame, the topic graph convolution output for the semantic roles $K_{M2}$ is:
[0077]
[0078] In the formula, $A_{m,k} \in \mathbb{R}^{V \times V}$ is the adjacency matrix for a specific semantic role of the topic ($K_{M2}$), and the accompanying filter matrix is trainable.
[0079] S28: For physical and non-physical connections of skeletal nodes, this invention uses a function $\Phi_m$ to construct a unified topic adjacency matrix, where the weights of each topic adjacency matrix define the importance of neighboring nodes. The formula for the unified topic graph convolution adjacency matrix is as follows:
[0080] $$A_{m,k} = \Phi_m(A_{m,k}) = (D_{m,k})^{-1} A_{m,k}$$
[0081] In the formula, $D_{m,k} \in \mathbb{R}^{V \times V}$ is a diagonal matrix.
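The normalization $(D_{m,k})^{-1} A_{m,k}$ can be illustrated with NumPy. The definition of the diagonal matrix did not survive in the text, so this sketch assumes the usual degree matrix with $D_{ii} = \sum_j A_{ij}$; `normalize_adjacency` is a hypothetical helper.

```python
import numpy as np

def normalize_adjacency(a):
    """Return D^{-1} A, with D the diagonal degree matrix D_ii = sum_j A_ij (assumed)."""
    degree = a.sum(axis=1)
    inv = np.zeros_like(degree, dtype=float)
    nonzero = degree > 0
    inv[nonzero] = 1.0 / degree[nonzero]   # isolated nodes stay as all-zero rows
    return inv[:, None] * a

a = np.array([[0.0, 2.0, 2.0],
              [1.0, 0.0, 3.0],
              [0.0, 0.0, 0.0]])
a_norm = normalize_adjacency(a)
```

Under this assumption every non-empty row of the result sums to 1, so each node aggregates a weighted average of its neighbors.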
[0082] S29: Figure 2 shows the structure of the topic graph convolution. After optimization using Chebyshev's formula, for each frame's input $f_{in} \in \mathbb{R}^{N \times D}$, a set of N nodes with D coordinates, the topic graph convolution is defined as:
[0083]
[0084] In the formula, $K_M$ denotes the semantic roles; the adjacency matrix defined for the topic graph convolution encodes the semantic roles between all physically connected nodes in the human skeletal graph; the diagonal matrix is that of this adjacency matrix; and $M_k$ is a learnable matrix.
[0085] S3: Establish a multi-scale topic graph convolutional network, which includes the following steps:
[0086] S31: The topic-based adjacency matrix is divided into physically connected nodes and non-physically connected nodes. For the physically connected nodes, different convolutions are applied to extract features, yielding importance weights W1, W2, and W3 for the different features. The outputs of the three action features are then added to the output of the topic graph convolution module to obtain the final output. Figure 3 shows the structure of the multi-scale topic graph convolutional module, which consists, from bottom to top, of the topic graph convolutional module and the multi-scale graph convolutional module.
[0087] S32: For each frame's input $f_{in} \in \mathbb{R}^{N \times D}$, which contains N nodes with D coordinates, the multi-scale topic graph convolutional network is defined as follows:
[0088]
[0089] In the formula, $C_v$ denotes the number of convolution kernels and $K_M$ denotes the semantic roles; the adjacency matrix defined for the topic graph convolution encodes the semantic roles between all physically connected nodes in the human skeletal graph; the diagonal matrix is that of this adjacency matrix; $M_k$ is a learnable matrix; and the final term is the weight.
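The combination described in S31 and S32 (three weighted physical-connection branches added to the topic branch) can be sketched as below. This is not the patent's formula, which is missing from the text: the sketch assumes that the three scales are 1-hop, 2-hop, and 3-hop powers of a normalized physical adjacency matrix, and all names and shapes are hypothetical.

```python
import numpy as np

def multi_scale_topic_conv(a_phys, a_topic, f_in, w_topic, w_scales, importance):
    """Topic branch plus importance-weighted physical branches at increasing hop scales."""
    out = a_topic @ f_in @ w_topic                    # topic graph convolution branch
    a_power = np.eye(a_phys.shape[0])
    for w_imp, w_k in zip(importance, w_scales):
        a_power = a_power @ a_phys                    # next scale: one more hop (assumed)
        out = out + w_imp * (a_power @ f_in @ w_k)    # weighted physical branch
    return out

rng = np.random.default_rng(0)
N, D, D_out = 5, 3, 4
a_phys = np.full((N, N), 1.0 / N)    # hypothetical normalized physical adjacency
a_topic = np.eye(N)                  # hypothetical normalized topic adjacency
f_in = rng.random((N, D))
w_topic = rng.random((D, D_out))
w_scales = [rng.random((D, D_out)) for _ in range(3)]
f_out = multi_scale_topic_conv(a_phys, a_topic, f_in, w_topic, w_scales,
                               importance=[0.5, 0.3, 0.2])
```

The `importance` values play the role of W1, W2, and W3; in a trained model they would be learned rather than fixed.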
[0090] S4: Training a multi-scale spatiotemporal topic graph convolutional network model, specifically including the following steps:
[0091] S41: The multi-scale spatiotemporal topic graph convolutional network model consists of 10 identical multi-scale spatiotemporal topic graph convolutional modules, a pooling layer, a fully connected layer, and a softmax function. As shown in Figure 4, the 10 modules are denoted B1-B10. Each module contains, from left to right, a multi-scale topic graph convolutional module, a batch normalization (BN) layer, a ReLU activation function, a Dropout layer, a temporal convolution, a BN layer, and a ReLU activation function. The output of the last module is passed to the pooling layer, the fully connected layer, and the softmax function.
[0092] S42: In the multi-scale spatiotemporal topic graph convolution module, temporal convolution uses a fixed-size kernel. Temporal convolution is defined as:
[0093]
[0094]
[0095] In the formula, $\Gamma$ is the temporal convolution kernel size, which controls the temporal range included in the human skeleton graph; $v_t$ is the joint in frame $t$; $S(v_t)$ is the sampling region of the temporal convolution; and $w(v_q)$ is the weight of the temporal convolution.
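A fixed-size temporal convolution over one joint's trajectory can be sketched as follows. The sketch assumes an odd kernel size $\Gamma$, zero padding so the number of frames is preserved, and a single feature channel; `temporal_conv` is a hypothetical name.

```python
import numpy as np

def temporal_conv(x, kernel):
    """x: (T, N), one feature per joint over T frames; kernel: (Gamma,) with odd Gamma.
    Each joint is convolved along time only, with zero padding to keep T frames."""
    gamma = kernel.shape[0]
    pad = gamma // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        window = padded[t:t + gamma]      # frames t - pad .. t + pad
        out[t] = kernel @ window          # weighted sum over the temporal window
    return out

x = np.arange(10, dtype=float).reshape(5, 2)    # 5 frames, 2 joints
y = temporal_conv(x, np.ones(3) / 3.0)          # moving average with Gamma = 3
```

With the uniform kernel used here the operation reduces to a moving average, which makes the effect of the temporal window easy to check by hand.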
[0096] S43: To accelerate model convergence, batch normalization (BN), ReLU, and dropout layers are placed between the graph convolutional modules. The BN layer normalizes the data as follows:
[0097] $$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$
[0098] $$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
[0099] $$Z_j = \frac{x_j - \mu_j}{\sqrt{\sigma_j^2 + \varepsilon}}$$
[0100] In the formulas, $\mu_j$ is the mean, $\sigma_j^2$ is the variance, $Z_j$ is the normalized data, and $\varepsilon$ prevents invalid computations due to zero variance; $x_j^{(i)}$ is the $j$-th feature of the $i$-th of $m$ samples in the batch. To sparsify the network and reduce parameter dependencies, an activation is applied to the data passing through the ReLU layer.
[0101] $$\mathrm{ReLU}(x) = \max(0, x)$$
[0102] When the value $x > 0$, it is output unchanged; when $x \le 0$, the output is set to 0. Finally, dropout with a rate of 0.5 is applied to the output of the spatial convolution module to avoid overfitting.
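The BN, ReLU, and dropout sequence of S43 can be sketched in NumPy. The sketch omits any learnable scale and shift in BN and uses inverted dropout (rescaling the kept activations), both of which are assumptions not stated in the text.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) of x to zero mean and unit variance over the batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def dropout(x, p, rng):
    """Inverted dropout for training: zero each element with probability p, rescale the rest."""
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
z = batch_norm(x)
h = dropout(relu(z), p=0.5, rng=rng)
```

At inference time dropout would be disabled; only the training-time behavior is shown here.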
[0103] S44: Convert the RGB images containing action features to grayscale, export the resulting image data, and store it in separate CSV files. Split the data into training and test sets for action recognition at an 8:2 ratio.
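The preprocessing in S44 can be sketched with the standard library alone. Several details are assumptions, since the text does not give them: the grayscale conversion uses the ITU-R BT.601 luma weights, each image row becomes one CSV line, and the split is a seeded random shuffle. The function names are hypothetical.

```python
import csv
import io
import random

def to_gray(pixel):
    """Convert one (R, G, B) pixel to a gray level using BT.601 luma weights
    (an assumed choice; the text does not name a conversion formula)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

def image_to_csv(rgb_rows):
    """Serialize the grayscale image as CSV text, one image row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rgb_rows:
        writer.writerow([to_gray(p) for p in row])
    return buffer.getvalue()

def split_8_2(samples, seed=0):
    """Shuffle the samples and split them into 80% training and 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]

csv_text = image_to_csv([[(255, 255, 255), (0, 0, 0)],
                         [(255, 0, 0), (0, 0, 255)]])
train_set, test_set = split_8_2(range(10))
```

For a real dataset the CSV text would be written to one file per image, and the split would usually be made per video or per subject so that frames from the same recording do not appear in both sets.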
[0104] S45: Input the processed action recognition dataset into a multi-scale spatiotemporal topic graph convolutional network model for training, and then use softmax classification. The output of the softmax classifier is the probability distribution of each action category, where the category with the highest probability is regarded as the prediction result of the model.
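The final classification step in S45 amounts to a softmax followed by an arg-max. The sketch below shows that step in isolation; the raw scores and the action labels are made-up illustration values, and `predict` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """Return the label with the highest softmax probability and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical scores for three hypothetical action categories.
label, probs = predict([1.0, 3.0, 0.5], ["walk", "fall", "sit"])
```

Here `label` is `"fall"` because its score is the largest, and `probs` sums to 1 as a probability distribution must.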
[0105] S5: Establish a human action recognition system based on the multi-scale spatiotemporal topic graph convolutional network, as shown in Figure 5. The specific steps include:
[0106] S51: The human motion recognition system consists of a data acquisition module, a human skeleton node extraction module, a human motion recognition module, a motion recognition display module, an early warning module, and a client app module.
[0107] S52: Data acquisition module, used to acquire video information captured by the camera and process the data into the format required by the motion recognition system, realizing video format conversion, resolution adjustment, etc.
[0108] S53: Human skeleton node extraction module, used to extract human skeleton node information from the data collected by the data acquisition module, obtain human skeleton motion information in the data, and perform human body tracking on people appearing in the video.
[0109] S54: Human motion recognition module, used to determine the type of human motion based on the human skeleton motion information processed by the human skeleton node extraction module, and save the determined motion information.
[0110] S55: Motion recognition display module, used to provide an intuitive interface display of the motion information determined by the human motion recognition module. Figure 6 shows the action recognition results as presented in the motion recognition display module.
[0111] S56: Early warning module, used to determine whether the human motion recognition module identifies a dangerous motion. When a dangerous motion is detected, an early warning message is immediately sent to the client's app module, notifying the user to take necessary safety measures. When no dangerous motion is detected, the motion recognition system continues to monitor the area captured by the camera.
[0112] S57: Client app module, which allows users to view video data captured by the camera in real time and receive timely alarm information in case of dangerous actions.
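The decision logic of the early warning module (S56) can be sketched as follows. The set of dangerous labels, the message format, and the `notify` callback that stands in for the client app module are all hypothetical, since the text does not enumerate dangerous actions or define a messaging interface.

```python
# Hypothetical labels treated as dangerous; the text does not enumerate them.
DANGEROUS_ACTIONS = {"fall", "fight"}

def early_warning(action, notify):
    """Send an alert through `notify` if the recognized action is dangerous.

    Returns True when an alert was sent, False when monitoring simply continues.
    """
    if action in DANGEROUS_ACTIONS:
        notify(f"Dangerous action detected: {action}")
        return True
    return False

def monitor(recognized_actions, notify):
    """Run the warning check over a stream of recognized actions and count alerts."""
    return sum(early_warning(a, notify) for a in recognized_actions)

alerts = []  # stands in for the client app's alert inbox
sent = monitor(["walk", "fall", "sit"], alerts.append)
```

Passing the notifier in as a callback keeps the warning check independent of how the client app actually receives messages.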
[0113] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions.
[0114] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0115] The above embodiments should be understood as illustrative only and not as limiting the scope of protection of the present invention. After reading the description of the present invention, those skilled in the art can make various alterations or modifications to the present invention, and these equivalent changes and modifications also fall within the scope defined by the claims of the present invention.
Claims
1. An action recognition method based on a multi-scale spatiotemporal topic graph convolutional network, characterized in that it includes the following steps:
Step S1, generating the action recognition feature map: using the keypoint confidence and affinity prediction network to extract skeletal feature keypoints from the input image data and generate the feature map to be recognized;
Step S2, establishing a topic graph convolutional network: for neighboring joints and the possible relationships between each pair of joints, constructing a topic-based adjacency matrix of the human skeleton, adding a partitioning strategy for the nodes of the limb regions, and establishing a topic graph convolutional network for the connections between each pair of joints;
Step S3, establishing a multi-scale topic graph convolutional network: using multi-scale feature extraction methods to establish a multi-scale topic graph convolutional network based on the skeletal features of adjacent nodes of the human skeleton;
Step S4, establishing a multi-scale spatiotemporal topic graph convolutional network model: training the multi-scale spatiotemporal topic graph convolutional network model using training samples to verify the feasibility and robustness of the model;
Step S5, detecting common actions in daily environments and promptly issuing an alarm when a person in the video image performs a dangerous action;
wherein step S3, establishing the multi-scale topic graph convolutional network, specifically includes the following steps:
S31: The topic-based adjacency matrix is divided into physically connected nodes and non-physically connected nodes. For physically connected nodes, different convolutions are set for the nodes to extract features, and the importance weights W1, W2, and W3 of the different features are obtained. After the outputs of the three different action features are obtained, they are added to the output of the topic graph convolution module to obtain the final output;
S32: Each input frame is represented as a set of N joints with D-dimensional coordinates, and the multi-scale topic graph convolutional network is defined by a formula whose terms denote, respectively: the number of convolution kernels; c, the convolution; k, the semantic role; the adjacency matrix of the defined topic graph convolution; the semantic roles encoded between all physically connected nodes in the skeleton graph; the diagonalized matrix; a weighting function; and a learnable matrix;
and wherein step S4, establishing the multi-scale spatiotemporal topic graph convolutional network model, specifically includes the following steps:
S41: Construct a multi-scale spatiotemporal topic graph convolutional network model composed of multiple multi-scale topic graph convolutional networks and temporal convolutional networks stacked on one another; during model training, a gradient descent algorithm and a cross-entropy loss function are used for optimization;
S42: Process the RGB images containing action features, convert them into corresponding image files, store the image files separately, and divide them into training and test sets for action recognition;
S43: Input the processed action recognition dataset into the multi-scale spatiotemporal topic graph convolutional network for training, and then use softmax classification to obtain multiple training results.
2. The action recognition method based on a multi-scale spatiotemporal topic graph convolutional network according to claim 1, characterized in that step S1, generating the action recognition feature map, specifically includes the following steps:
S11: The keypoint confidence and affinity prediction network is used to acquire human skeletal information. The network represents human skeletal information through the Gaussian distance and affinity of human joints, thereby acquiring data for 18 joints of the human skeleton;
S12: The skeletal sequence is represented by the 2D or 3D coordinates of the human skeletal joints in each frame. Each 2D or 3D coordinate is represented as a vector, and a complete video motion sequence is represented as a sequence of vectors across frames;
S13: Based on the action content in the video, the video samples are annotated to facilitate subsequent model training and prediction.
3. The action recognition method based on a multi-scale spatiotemporal topic graph convolutional network according to claim 1, characterized in that step S2, establishing the topic graph convolutional network, specifically includes the following steps:
S21: Based on the natural connection relationships of the human body, connect the joints to form a human skeleton connection graph; represent the skeletal joints as vertices of the graph, and represent the edges between corresponding vertices of two adjacent spatiotemporal graphs as temporal edges, thus completing the construction of the physically connected human skeleton graph;
S22: Based on the Euclidean distance relationship between human joints, divide the human skeleton nodes into physically connected nodes and non-physically connected nodes; transform the human skeleton graph into an adjacency matrix based on Euclidean distance, the adjacency matrix describing the connection relationships between physically connected nodes and non-physically connected nodes in the human skeleton graph;
S23: Establish a topic graph convolution module and a graph convolution module to simulate the physical connections and disconnections of joints in the human skeleton, and extract the physical connection feature information and non-physical connection feature information of the human skeleton;
S24: The graph convolution module extracts the adjacency relationships of joints in the human skeleton and obtains the physical connection features of the human joints;
S25: The topic graph convolution module uses a high-order network structure feature description operator. By calculating the average Euclidean distance between each pair of nodes, it assigns larger weights to joints separated by shorter distances in order to define disconnected joints, and extracts the possible relationships among non-physically connected joints in the human skeleton, thereby obtaining the non-physical connection features of the human joints.
4. The action recognition method based on a multi-scale spatiotemporal topic graph convolutional network according to claim 3, characterized in that step S23, which establishes a topic graph convolution module and a graph convolution module to simulate the physical connections and disconnections of joints in the human skeleton and extracts the physical connection feature information and non-physical connection feature information of the human skeleton, specifically includes:
In the topic-based partitioning strategy for human skeleton nodes, nodes are partitioned according to the spatial distance of physically connected nodes. In defining the adjacency matrix of physically connected nodes, node extraction for the human limb region is introduced to enhance the extraction of limb features, and the nodes in the limb region are defined as the limb region subset.
In the new skeleton node partitioning strategy, adjacent nodes are divided into four subsets: 1) the root node subset; 2) the proximal subset; 3) the distal subset; 4) the limb region subset.
The new skeleton node partitioning strategy is defined by a function that partitions the human skeleton node graph according to the distance between a joint and the center of gravity of the skeleton. For a first-order neighbor node of the root node: when its distance from the center of gravity is equal to the distance from the root node to the center of gravity, it is labeled 0; when its distance is less than that of the root node, it is labeled 1; when its distance is greater than that of the root node, it is labeled 2; and when the nodes belong to the set of nodes in the limb region, they are labeled 3.
5. An action recognition system employing the method described in any one of claims 1-4, characterized in that it includes a data acquisition module, a human skeleton node extraction module, a human motion recognition module, a motion recognition display module, an early warning module, and a client app module, wherein:
The data acquisition module is used to acquire video information captured by the camera and process the data into the format required by the motion recognition system, realizing video format conversion and resolution adjustment;
The human skeleton node extraction module is used to extract human skeleton node information from the data collected by the data acquisition module, obtain human skeleton motion information in the data, and perform human body tracking on people appearing in the video;
The human motion recognition module is used to determine the type of human motion based on the human skeleton motion information processed by the human skeleton node extraction module, and to save the determined motion information;
The motion recognition display module provides an intuitive interface for displaying the motion information determined by the human motion recognition module;
The early warning module is used to determine whether the human motion recognition module has recognized a dangerous motion. When a dangerous motion is detected, an early warning message is immediately sent to the client app module to notify the user to take safety measures. When no dangerous motion is detected, the motion recognition system continues to monitor the area captured by the camera;
The client app module allows users to view video data captured by the camera in real time and to receive timely alerts in case of dangerous actions.
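For illustration, the four-way node labeling recited in claim 4 can be sketched as below. This is a reading of the claim under stated assumptions, not a definitive implementation: the limb-region check is assumed to take precedence and to require both the neighbor and the root node to be in the limb set, joints are given as 2D points, and equality of distances is tested with a small tolerance. The name `partition_label` and all joint ids are hypothetical.

```python
import math

def partition_label(neighbor, root, center, limb_nodes, tol=1e-9):
    """Label a first-order neighbor of `root` as 0, 1, 2, or 3.

    neighbor, root: (joint_id, (x, y)) pairs.
    center:         (x, y) of the skeleton's center of gravity.
    limb_nodes:     ids of joints that belong to the limb region.
    """
    n_id, n_pos = neighbor
    r_id, r_pos = root
    # Assumption: limb membership of both nodes takes precedence over distance.
    if n_id in limb_nodes and r_id in limb_nodes:
        return 3
    d_neighbor = math.dist(n_pos, center)
    d_root = math.dist(r_pos, center)
    if abs(d_neighbor - d_root) <= tol:
        return 0                      # same distance as the root node
    return 1 if d_neighbor < d_root else 2  # closer (1) or farther (2)

center = (0.0, 0.0)
limbs = {5, 6}                        # hypothetical ids of limb-region joints
root = (1, (1.0, 0.0))
labels = [
    partition_label((1, (1.0, 0.0)), root, center, limbs),  # same distance
    partition_label((0, (0.5, 0.0)), root, center, limbs),  # closer to center
    partition_label((2, (2.0, 0.0)), root, center, limbs),  # farther from center
    partition_label((6, (4.0, 0.0)), (5, (3.0, 0.0)), center, limbs),  # limb region
]
```

The four calls exercise one case per subset, so `labels` comes out as `[0, 1, 2, 3]`.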