Method, system, device and medium for skeleton action classification based on multi-scale space-time interaction of action jitter and skeleton noise suppression

By employing a multi-scale spatiotemporal interactive skeleton action classification method, the problems of noise and jitter in skeleton action recognition are solved, achieving highly accurate and robust action classification, which is applicable to motion analysis, health monitoring, virtual reality, and security fields.

CN117671353B · Active · Publication Date: 2026-06-26 · XIDIAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIDIAN UNIV
Filing Date
2023-12-04
Publication Date
2026-06-26


Abstract

A method, system, device, and medium for multi-scale spatiotemporal interactive skeleton motion classification with motion jitter and skeleton noise suppression. First, skeleton video samples to be trained are obtained. Second, the sparse-level selection module of a skeleton motion recognition model extracts multi-scale features from those samples. In the spatial domain, a cascaded interaction matrix is constructed from the original joint features and the sparse-level joint features, and multi-head self-attention is applied for the interaction and diffusion of channel features. In the temporal domain, the temporal multi-scale convolution module of the model combines multi-scale filters with a temporal self-attention channel interaction algorithm to extract differentiated temporal features and to promote the interaction and updating of global temporal information. Based on this method, the system, device, and medium perform skeleton-based action classification with a Transformer network structure and achieve high-precision classification on existing datasets.

Description

Technical Field

[0001] This invention relates to the field of skeleton-based human motion recognition technology, and in particular to a multi-scale spatiotemporal interactive skeleton motion classification method, system, device and medium for motion jitter and skeleton noise suppression.

Background Technology

[0002] Actions and behaviors fall under the category of human biometrics, and recognizing and understanding the actions and behaviors of observed individuals is a fundamental attribute of human visual perception and cognition. Human action recognition is a fundamental problem in the field of computer vision. Graph convolutional networks (GCNs) are typically designed to learn features from the local neighbors of each node; their pooling and convolution operations mainly involve connections between a node and its neighboring nodes. This inherently local nature limits them: GCNs often cannot provide sufficient global information abstraction and processing capabilities when handling skeleton data.

[0003] In recent years, Transformers have been widely used in computer vision and have demonstrated significant performance advantages. Due to their powerful global information extraction capabilities, some researchers have applied Transformers to human skeleton action recognition. However, existing methods are limited by the lack of scale-level spatiotemporal correlation information, hindering their ability to flexibly capture multi-layered semantic information between joints. Although many models apply Transformers to skeleton-based action recognition, they still cannot surpass the accuracy of many excellent graph convolution-based works. First, it is worth noting that noise in human skeleton data is unavoidable. The acquisition of human skeleton data is affected by various factors such as sensor accuracy and pose estimation algorithms. All of these factors contribute to noise in the data. Second, in the presence of noise, the model may struggle to accurately capture discriminative features in the human skeleton data. Noise may blur or lose these features, thus affecting model performance. Finally, balancing jitter and motion joints in long-range modeling is crucial. Jittery joints cause distortion and deformation of skeleton data, compromising the integrity and stability of the skeleton, leading to inaccurate topological joint connections, and thus disrupting the overall structure of the action. Under the influence of long-range modeling, this results in errors in information transmission, ultimately affecting model performance.

Summary of the Invention

[0004] To overcome the problems of the prior art, the present invention discloses a multi-scale spatiotemporal interactive skeleton motion classification method, system, device, and medium for motion jitter and skeleton noise suppression. By proposing multiple node selection strategies (fixed joint selection, fixed edge joint selection, average selection, and random selection), the invention aims to minimize the impact of jittered joints and to improve the robustness and accuracy of skeleton human motion recognition; these strategies can be flexibly selected and combined according to the specific recognition task and dataset characteristics. A cascaded interaction matrix promotes information exchange and mutual attention between information joints and their features, enhancing spatial dependence and forcing the network to focus on information joints with discriminative features, thus eliminating the cumulative adverse effects of jittered joints. By combining multi-scale temporal convolution and temporal Transformers in a temporal multi-scale TCN structure, and employing different dilation rates, the invention flexibly promotes the learning of multi-scale global features in the temporal domain. The present invention has the advantages of strong recognition ability and high accuracy.

[0005] To achieve the above objectives, the present invention adopts the following technical solution:

[0006] A multi-scale spatiotemporal interactive skeleton action classification method with motion jitter and skeleton noise suppression is proposed. First, training skeleton video samples are acquired. Second, the sparse selection module of the skeleton action recognition model extracts sparse joint features and original joint features from the training skeleton video samples. Then, through the fusion interaction module and the temporal multi-scale TCN module of the skeleton action recognition model, long-distance random dependencies between skeleton joints are modeled to extract multi-scale spatiotemporal features. The interaction and updating of global spatiotemporal information enable skeleton-based human action classification.

[0007] A multi-scale spatiotemporal interactive skeleton motion classification method that includes motion jitter and skeleton noise suppression, specifically comprising the following steps:

[0008] Step 1, Obtain the skeleton video samples to be trained: Collect publicly available human skeleton motion datasets from the Internet, and perform preprocessing operations on the skeleton data, including raw skeleton acquisition, noise removal and viewpoint normalization, to obtain the preprocessed dataset.

[0009] Step 2: The dataset obtained from the preprocessing in Step 1 is used as input for multi-scale feature extraction, including sparse-level feature extraction and original-level joint feature extraction. By proposing a variety of node selection strategies, including fixed joint selection, fixed edge joint selection, average selection and random selection strategies, the influence of shaking joints is reduced, and the robustness and accuracy of skeletal human motion recognition are improved.

[0010] Step 3: Based on the multi-scale features obtained in Step 2, the long-distance random dependencies of skeleton joints are modeled using the fusion interaction module and the temporal multi-scale TCN module to achieve multi-scale spatiotemporal feature extraction. In addition, based on the learning objective of information bottleneck, the Infoloss loss function is used to learn the information-rich but compact latent representation in the skeleton data to guide the model to understand the maximum information representation.

[0011] Step 2 includes:

[0012] First, the input feature $x_{in}$ is converted into sparse-level features $x_s$ and original-level features $x_p$. The calculation formula is as follows:

[0013] $x_s, x_p = \mathrm{Conv2D}_{1\times 1}(x_{in})$,

[0014]

[0015] where the number of sparse-level feature channels is set to $1/m$ of the output channels, and $C_{in}$, $C_{out}$, $T$, and $V$ denote the number of input channels, output channels, frames, and joints, respectively. From $x_s$, the $V_1$ nodes that contain the most action information are selected; the features of these $V_1$ nodes are aggregated into a feature map, and the corresponding adjacency matrix is constructed at the same time.

[0016] The node selection strategy employs the following approach to reduce the impact of joint shaking and improve the robustness and accuracy of skeletal human motion recognition:

[0017] a) Average joint selection: Group all nodes by location and perform an average calculation on all nodes in the node group;

[0018] b) Selection of fixed joints: Select a set of non-edge nodes that can effectively represent the body structure;

[0019] c) Fixed edge joint selection: Select joints located at the edge of the human body;

[0020] d) Randomly select different nodes.
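
The four node selection strategies a) to d) can be illustrated with a minimal sketch. It is not the patented implementation: the 12-joint toy skeleton, the six location groups, and the index sets `FIXED_JOINTS` and `EDGE_JOINTS` are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical 12-joint toy skeleton; real datasets use different layouts.
NUM_JOINTS = 12
# Hypothetical grouping of joints by body location (six groups).
GROUPS = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
# Hypothetical index sets for the two fixed strategies.
FIXED_JOINTS = [0, 2, 4, 6, 8, 10]   # assumed non-edge joints
EDGE_JOINTS = [1, 3, 5, 7, 9, 11]    # assumed joints at the body edge


def select_nodes(frame, strategy, v1=6, rng=None):
    """Reduce one frame of per-joint feature vectors to sparse-level nodes.

    frame: list of NUM_JOINTS feature vectors (lists of floats).
    """
    if strategy == "average":
        # Average every feature dimension over the joints of each group.
        out = []
        for group in GROUPS:
            dims = zip(*(frame[j] for j in group))
            out.append([sum(d) / len(group) for d in dims])
        return out
    if strategy == "fixed":
        return [frame[j] for j in FIXED_JOINTS]
    if strategy == "fixed_edge":
        return [frame[j] for j in EDGE_JOINTS]
    if strategy == "random":
        rng = rng or random.Random()
        picked = sorted(rng.sample(range(len(frame)), v1))
        return [frame[j] for j in picked]
    raise ValueError(f"unknown strategy: {strategy}")


# Toy frame: joint j carries the 2-D feature [j, 2*j].
frame = [[float(j), 2.0 * j] for j in range(NUM_JOINTS)]
averaged = select_nodes(frame, "average")
fixed = select_nodes(frame, "fixed")
edge = select_nodes(frame, "fixed_edge")
rand = select_nodes(frame, "random", rng=random.Random(0))
```

Averaging pools several joints into one node, so a single jittery joint is damped by its neighbors, whereas the fixed and random strategies pass individual joints through unchanged.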

[0021] Step 3 includes:

[0022] 3.1 Dimensional compression and inter-channel feature interaction fusion are performed through the sparse-level fusion interaction module M2 in the fusion interaction module;

[0023] 3.1.1 The sparse-level fusion interaction module M2 extracts global features through its global feature extraction part $\mathrm{GF}_{Extra}$ and compresses the channel and temporal dimensions to facilitate the subsequent attention operations. Specifically, the original-level and sparse-level skeleton sequences are processed by $\mathrm{GF}_{Extra}$ to obtain the outputs $Q_p$ and $K_s$:

[0024] $Q_p = \mathrm{GF}_{Extra}(x_s) \in \mathbb{R}^{V\times 1\times 1}$,

[0025]

[0026] Then, $Q_p$ and $K_s$ are multiplied to achieve cascaded interaction fusion within the channel, and the resulting weights are normalized with Softmax to obtain the cascaded interaction matrix $A_{ps}$, which maps the original level to the sparse level. Matrix multiplication of the input features $x_p$ with $A_{ps}$ then achieves interactive fusion of spatial features within the channel:

[0027] $A_{ps} = \mathrm{Softmax}(\mathrm{atten}(Q_p K_s))$,

[0028]

[0029] where $Q_p K_s$ denotes matrix multiplication and MHSA is the multi-head self-attention mechanism. The weights are normalized with Softmax; a global parameterized matrix is introduced to help obtain joint relationships from the correlation matrix; and MHSA is used to realize the interaction and diffusion of channel features;
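
The construction of a cascaded interaction matrix can be sketched as follows. This is an illustrative sketch under stated assumptions: `global_pool` is a mean-pooling stand-in for the global feature extraction part, the attention score is a plain product of pooled values, the shapes ($V$ original joints, $V_1$ sparse nodes) are toy values, and which feature tensor feeds the query and which feeds the key is chosen here only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_pool(x):
    """Assumed stand-in for GF_Extra: mean over channels and frames.

    x has shape [C][T][V]; the result holds one scalar per joint (length V).
    """
    c, t, v = len(x), len(x[0]), len(x[0][0])
    return [sum(x[ci][ti][vi] for ci in range(c) for ti in range(t)) / (c * t)
            for vi in range(v)]


def cascaded_interaction(q, k):
    """A[i][j] = softmax over j of q[i] * k[j]: a V x V1 row-stochastic matrix."""
    return [softmax([qi * kj for kj in k]) for qi in q]


def apply_interaction(x, a):
    """Multiply features [C][T][V] by A [V][V1] along the joint axis."""
    v1 = len(a[0])
    return [[[sum(frame[i] * a[i][j] for i in range(len(frame)))
              for j in range(v1)]
             for frame in channel]
            for channel in x]


# Toy tensors: C=2 channels, T=3 frames, V=4 original joints, V1=2 sparse nodes.
x_orig = [[[float(c + t + v) for v in range(4)] for t in range(3)] for c in range(2)]
x_sparse = [[[float(c * t + v) for v in range(2)] for t in range(3)] for c in range(2)]

q = global_pool(x_orig)      # one value per original joint
k = global_pool(x_sparse)    # one value per sparse node
a_ps = cascaded_interaction(q, k)        # shape V x V1
fused = apply_interaction(x_orig, a_ps)  # shape [C][T][V1]
```

Each row of `a_ps` sums to one, so every original joint distributes its features over the sparse nodes as a convex combination.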

[0030] 3.2 The original-level fusion interaction module M3 uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, realizing cross-channel information diffusion;

[0031] The original skeleton sequence passes through the sparse fusion interaction module M2 and a self-attention layer, which perform cascaded fusion of intra-channel features and model long-distance joint dependencies in the skeleton sequence. The output is then fed to the global feature extraction part $\mathrm{GF}_{Extra}$, which performs global feature extraction and dimensionality compression:

[0032] $Q_s = \mathrm{GF}_{Extra}(\mathrm{SA}(x_{in}))$, $K_p = \mathrm{GF}_{Extra}(\mathrm{SF}(x_{in}))$,

[0033]

[0034] In the above formula, SF denotes the sparse fusion interaction module M2 and SA denotes the self-attention layer. Using the attention mechanism, the sparse-level skeleton sequence is mapped to the original-level skeleton sequence to generate the cascaded interaction matrix $A_{sp}$. Subsequently, a graph-like convolution operation is performed on the output of the sparse-level fusion interaction module to promote the interaction and fusion of features at different scales. The output is then processed by a linear transformation layer and a gating unit, and finally a residual connection is established with the output of the self-attention layer:

[0035]

[0036] $x_{out} = \lambda \cdot \mathrm{Linear}(\mathrm{SF}(x_{in}) A_{sp}) + \mathrm{SA}(x_{in})$,

[0037] where $d$ denotes the feature dimension, $\lambda$ denotes the fusion coefficient, and the attention map models the joint correlation between the front and rear channels;
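
The residual fusion $x_{out} = \lambda \cdot \mathrm{Linear}(\mathrm{SF}(x_{in}) A_{sp}) + \mathrm{SA}(x_{in})$ can be worked through on small matrices. The sketch below assumes a bias-free linear layer, treats $\lambda$ as a scalar, and uses hand-picked toy values for the module outputs; the function name `gated_residual_fusion` is hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gated_residual_fusion(sf_out, a_sp, weight, sa_out, lam):
    """x_out = lam * Linear(SF(x_in) A_sp) + SA(x_in), with a bias-free Linear.

    sf_out: [d][V1] sparse-level features, a_sp: [V1][V],
    weight: [d][d] linear weights, sa_out: [d][V] self-attention output.
    """
    lifted = matmul(sf_out, a_sp)        # map sparse nodes back to original joints
    projected = matmul(weight, lifted)   # linear transformation over features
    return [[lam * p + s for p, s in zip(p_row, s_row)]
            for p_row, s_row in zip(projected, sa_out)]


# Toy sizes: d=2 features, V1=2 sparse nodes, V=3 original joints.
sf_out = [[1.0, 2.0], [3.0, 4.0]]
a_sp = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
sa_out = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

x_out = gated_residual_fusion(sf_out, a_sp, identity, sa_out, lam=0.5)
residual_only = gated_residual_fusion(sf_out, a_sp, identity, sa_out, lam=0.0)
```

With `lam=0.0` the sparse branch is switched off and the output equals the self-attention output, which shows how the fusion coefficient controls the contribution of the sparse-level path.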

[0038] 3.3 In the sparse-level fusion interaction module M2 and the original-level fusion interaction module M3, multi-head self-attention (MHSA) is used to obtain global features of spatial joints. Specifically, position embedding (PE) is used to encode spatial joint information, labeling each joint, and using sine and cosine functions of different frequencies as encoding functions.

[0039]

[0040]

[0041] where $p$ and $i$ denote the joint position and the dimension index of the position encoding vector, respectively. A linear transformation is then applied to the input feature map $x$ to generate $Q$, $K$, and $V$; $QK^T$ is computed and normalized with Softmax; a learnable adjacency matrix $A_l$ is introduced; and the final output $X_{out}$ is obtained through a linear transformation, as shown below:

[0042] Q, K, V = Linear(x),

[0043]

[0044]

[0045] where $A_l, A \in \mathbb{R}^{V\times V}$ and Linear(·) denotes a linear transformation. This module is also used in the sparse stage, with the input being ... and the output being ...
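
The position embedding and attention step of 3.3 can be sketched as follows. Two details are assumptions rather than statements of the source: the frequency base of 10000 follows the standard Transformer encoding, and the learnable matrix is added to the attention weights after Softmax. The projections that produce Q, K, and V are replaced by the identity for brevity.

```python
import math


def sinusoidal_position_embedding(num_joints, dim, base=10000.0):
    """Even dimensions get sin(p / base^(i/dim)), odd dimensions the matching cos.

    i is the even dimension index; base=10000 is an assumed value.
    """
    pe = [[0.0] * dim for _ in range(num_joints)]
    for p in range(num_joints):
        for i in range(0, dim, 2):
            angle = p / (base ** (i / dim))
            pe[p][i] = math.sin(angle)
            if i + 1 < dim:
                pe[p][i + 1] = math.cos(angle)
    return pe


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(q, k, v, a_learn):
    """Single-head attention over joints: (Softmax(Q K^T / sqrt(d)) + A_l) V.

    Adding A_l after the Softmax is an assumed placement.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = [sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d)
                  for j in range(n)]
        weights = [w + a_learn[i][j] for j, w in enumerate(softmax(scores))]
        out.append([sum(weights[j] * v[j][c] for j in range(n))
                    for c in range(len(v[0]))])
    return out


NUM_JOINTS, DIM = 5, 4
pe = sinusoidal_position_embedding(NUM_JOINTS, DIM)
# Toy joint features with the position embedding added.
x = [[0.1 * (p + c) + pe[p][c] for c in range(DIM)] for p in range(NUM_JOINTS)]
zero_adj = [[0.0] * NUM_JOINTS for _ in range(NUM_JOINTS)]
out = joint_attention(x, x, x, zero_adj)   # Q = K = V = x (identity projections)
```

Because the embedding depends only on the joint index, two joints with identical features still receive distinct inputs, which lets the attention distinguish them.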

[0046] 3.4 The temporal multi-scale TCN module employs multi-scale temporal convolution, fixing the kernel size to 3×1 and using different dilation rates (dilation = 1, 2) to expand the receptive field, reducing the computational cost caused by additional branches, as shown below:

[0047] $x_d, x_m, x_t, x_r = \mathrm{Conv}_{1\times 1}(x)$,

[0048]

[0049] where max pooling is used to obtain focal joint features:

[0050] $x_m = \mathrm{MaxPool}(x_m)$,

[0051] where $\mathrm{Conv}_{1\times 1}$ is used for feature transformation;
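
The effect of a fixed kernel size of 3 with dilation rates 1 and 2, together with the max-pooling branch, can be seen on a single joint trajectory. This is a minimal sketch: the all-ones kernel is a fixed stand-in for learned weights, and the zero padding and stride of 1 are assumptions.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1-D convolution (cross-correlation) along time."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j * dilation - pad
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def max_pool1d(signal, size=3):
    """Stride-1 max pooling with a centered window, clipped at the borders."""
    half = size // 2
    return [max(signal[max(0, t - half): t + half + 1]) for t in range(len(signal))]


def receptive_field(kernel_size, dilation):
    """Number of frames spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# One joint coordinate over 8 frames with a single-frame spike (toy data).
trajectory = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0]
ones_kernel = [1.0, 1.0, 1.0]   # fixed weights for illustration; real kernels are learned

branch_d1 = dilated_conv1d(trajectory, ones_kernel, dilation=1)
branch_d2 = dilated_conv1d(trajectory, ones_kernel, dilation=2)
branch_max = max_pool1d(trajectory)
```

The dilation-2 branch spans five frames with the same three weights as the dilation-1 branch, which is how the receptive field grows without adding parameters.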

[0052] The temporal multi-scale TCN module utilizes the temporal multi-scale interaction mechanism T-Former, employing Transformer to aggregate the feature changes of the skeleton across the time dimension from a global perspective. It introduces a channel grouping strategy to reduce the number of model parameters, providing independent feature learning capabilities for different channel groups and enhancing the interactivity between features within the model; as shown below:

[0053] $V_t = \mathrm{Linear}(x_t)$,

[0054] $K_t, Q_t = \mathrm{ChannelGrouping}(x_t)$,

[0055]

[0056]

[0057] where Linear(·) denotes a linear transformation; finally, ... and $x_d'$ are assembled along the channel dimension;

[0058] 3.5 Training Loss

[0059] Based on the information bottleneck learning objective, this method uses Infoloss to learn information-rich yet compact latent representations. Specifically, given a predicted label ... and the true label $z$, the $L_{info}$ loss can be written as:

[0060]

[0061]

[0062] Here, $\alpha$ and $N$ denote the confidence coefficient and the number of action categories, respectively. Element-wise multiplication is used to calculate the predicted score of the true class; ... and $z_i$ are both regularized. In addition, the cross-entropy loss is as follows:

[0063]

[0064] where $M$ denotes the number of categories; $y_{ic}$ is an indicator (0 or 1) that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise; and $p_{ic}$ denotes the predicted probability that sample $i$ belongs to class $c$. Finally, the $L_{CE}$ loss and the proposed $L_{info}$ loss are incorporated into the complete learning objective function:

[0065] $L = L_{info} + \omega L_{CE}$.

[0066] Here, $L_{info}$ and $L_{CE}$ are as defined above, and $\omega$ is the balancing hyperparameter for $L_{CE}$.
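
The combined objective $L = L_{info} + \omega L_{CE}$ can be illustrated with a small numerical sketch. The cross-entropy is written in its standard form averaged over samples, which is an assumption about the normalization; the $L_{info}$ term is represented only by a hypothetical placeholder value, since its formula is not reproduced in this text.

```python
import math


def cross_entropy(probs, labels):
    """Mean over samples of -log p[i][c_i], i.e. -(1/n) sum_i sum_c y_ic log p_ic."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[c] + eps) for p, c in zip(probs, labels)) / len(labels)


def total_loss(l_info, l_ce, omega):
    """L = L_info + omega * L_CE."""
    return l_info + omega * l_ce


# Two samples, three classes; each row is a predicted probability distribution.
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
labels = [0, 1]

l_ce = cross_entropy(probs, labels)
l_info_placeholder = 0.3   # hypothetical value standing in for the L_info term
loss = total_loss(l_info_placeholder, l_ce, omega=0.5)
```

Here both samples assign probability 0.5 to the true class, so the cross-entropy equals $\ln 2$ and $\omega$ scales how strongly it contributes relative to the information term.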

[0067] A multi-scale spatiotemporal interactive skeleton motion classification system for motion jitter and skeleton noise suppression includes:

[0068] The feature extraction and node selection module, corresponding to Step 2, is used to extract multi-scale features, including sparse-level features and original-level features, and to select information joints, so as to minimize the impact of shaking joints and improve the robustness and accuracy of skeletal human motion recognition;

[0069] The fusion interaction module, corresponding to Step 3, includes a sparse-level fusion interaction module M2 and an original-level fusion interaction module M3. The sparse-level fusion interaction module M2 performs dimensionality compression and interaction fusion of related features within the channel; the original-level fusion interaction module M3 uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, thereby balancing the noise caused by motion jitter and fusing features at different scales.

[0070] The temporal multi-scale TCN module, corresponding to Step 3, uses a Transformer to aggregate the feature changes of the skeleton along the time dimension from a global perspective. It introduces a channel grouping strategy to reduce the number of model parameters, provides independent feature learning capabilities for different channel groups, and improves the interactivity between features within the model. It is used to extract multi-scale temporal features, realizing the extraction of differential temporal features and the interaction and updating of global temporal information. Finally, the proposed Infoloss loss function is used to learn the information-rich but compact latent representation in the skeleton data, guiding the model toward the maximally informative representation.

[0071] A multi-scale spatiotemporal interactive skeleton motion classification device for motion jitter and skeleton noise suppression includes:

[0072] Memory, used to store computer programs;

[0073] A processor is used to implement the skeleton motion classification method described in steps 1 to 3 when executing the computer program.

[0074] A computer-readable storage medium storing a computer program, which, when executed by a processor, is capable of performing multi-scale spatiotemporal interactive skeleton motion classification based on the skeleton motion classification method described in steps 1 to 3, which includes motion jitter and skeleton noise suppression.

[0075] The beneficial effects of this invention are as follows:

[0076] 1. Solving the inaccuracy of scale-level spatiotemporal correlation: Traditional skeleton motion recognition methods lack scale-level spatiotemporal correlation information due to inaccurate joint prediction algorithms; this invention introduces CI-STFormer, a spatiotemporal cascaded interactive network based on Transformer, to solve this problem, allowing for better capture of multi-layer semantic information between joints and improving the flexibility of the model.

[0077] 2. Multi-scale feature capture: This invention includes a spatial sparse-level optimization and interaction module and a spatial original-level fusion and interaction module. These modules realize cross-channel multi-scale joint feature update, interaction and fusion, which helps the system capture multi-scale semantic information between joints, thereby improving the accuracy of action recognition.

[0078] 3. Temporal scale-level grouping and interaction: In the temporal domain, this invention introduces scale-level grouping and interaction to extract temporal features and realize global information interaction and updating. This helps to obtain the potential differential features of the spatiotemporal association of the skeleton, thereby improving the performance of action classification.

[0079] 4. Validation Experiments and Performance Advantages: Extensive ablation experiments demonstrate the effectiveness of the proposed innovative submodule, reinforcing the feasibility and performance of the invention. Comparative experiments with existing GCN-based and Transformer-based methods show performance advantages on four different datasets, indicating that the CI-STFormer model achieves better results in action classification.

[0080] This invention introduces an innovative skeleton action recognition method, which improves the accuracy and robustness of action classification by solving the problems of inaccurate joint prediction and multi-scale feature capture.

[0081] Compared with the prior art, the present invention has the following advantages:

[0082] 1. Effective Solution to Jitter and Noise Problems: This invention focuses on addressing motion jitter and node noise issues caused by sensor errors and inaccuracies in joint prediction algorithms. This is one of the key challenges of traditional skeleton motion recognition methods. This invention employs multiple node selection strategies, namely fixed joint selection, fixed edge joint selection, average selection, and random selection strategies, as well as a cascaded interaction matrix to promote information exchange between information joints and joint features, in order to address these problems and eliminate the cumulative adverse effects of jittery joints.

[0083] 2. Application of Transformer Network Structure: The Transformer network structure, a highly successful model in natural language processing and image processing, is employed to achieve skeleton-based human action classification. This enables the invention to better capture key features and contextual information from skeleton data.

[0084] 3. High-precision action classification: A temporal multi-scale TCN structure combining multi-scale temporal convolution and temporal Transformers is used, employing different dilation rates to flexibly promote the learning of multi-scale global features in the temporal domain. Furthermore, channel grouping is applied to reduce the number of parameters and improve information exchange between different channels. A global feature extraction module is used to extract low-level global information, which enhances the discriminative features of information joints and improves overall recognition capability. As a result, this invention achieves high-precision action classification on existing datasets: it can more reliably distinguish different actions and can achieve high accuracy even in the presence of jitter and noise.

[0085] 4. Wide range of application potential: Given the high performance and noise robustness of this invention, it has wide range of application potential, including motion analysis, health monitoring, virtual reality, posture recognition and security fields; its application prospects are broad.

[0086] Unlike methods that involve data augmentation and predefined noise frames during data preprocessing, this invention focuses on specialized model design. First, it proposes multiple node selection strategies (fixed joint selection, fixed edge joint selection, average selection, and random selection) to minimize the impact of jittered joints and improve the robustness and accuracy of the skeleton-based human motion recognition model. Second, it creates a cascaded interaction matrix to effectively and comprehensively display the dynamic connections between joints. This cascaded interaction matrix promotes information exchange and mutual attention between information joints and their features, enhancing spatial dependencies and forcing the network to focus on information joints with discriminative features, thus eliminating the cumulative adverse effects of jittered joints.

[0087] This invention also performs noise removal during the data preprocessing stage, further ensuring the robustness of the model. This operation enables our model to better filter out potential motion jitter and node noise from sensors and keypoint prediction algorithms, thereby improving the model's reliability in practical applications. Overall, this invention, through innovative thinking in model design and learning mechanisms, successfully implements a human skeleton motion recognition method aimed at suppressing motion jitter and skeleton noise. This method not only has theoretical universality but also demonstrates significant performance advantages in practical applications.

[0088] In summary, this invention employs a self-attention mechanism, enabling the model to self-discover within the data and learn the connections between noisy and informative joints during human motion recognition, thus avoiding the influence of accumulated noise from non-informative joints. This design demonstrates universality across different datasets, requiring no personalized adjustments for each dataset. By focusing attention on focal joints, this invention not only more effectively recognizes human motions but also automatically adapts to the feature differences across various datasets. Compared to traditional data augmentation methods, our model prioritizes learning key information from the data. By adopting a Transformer network structure, it addresses issues such as motion jitter and node noise, improving generalization ability in real-world scenarios and achieving high-precision skeleton motion classification results, providing an efficient and reliable solution for applications in multiple fields.

Attached Figure Description

[0089] Figure 1 is a flowchart of the skeleton motion classification method.

[0090] Figure 2 is a diagram of the node selection.

[0091] Figure 3 is a diagram of the overall architecture of the model.

[0092] Figure 4 is a diagram of the fusion interaction module architecture.

[0093] Figure 5 is a diagram of the temporal multi-scale TCN (TS-TCN) module architecture.

Detailed Implementation

[0094] This embodiment presents a multi-scale spatiotemporal interactive skeleton action classification method that combines motion jitter and skeleton noise suppression. The network architecture used is shown in Figure 3:

[0095] Step 1, Obtain training skeleton video samples: Collect publicly available human skeleton motion datasets from the internet. Here, we use four publicly available large-scale datasets: NTU-RGB+D, NTU-RGB+D120, NW-UCLA, and UAV-Human. Preprocess the skeleton data, including raw skeleton acquisition, noise removal, and viewpoint normalization.

[0096] 1.1 Skeleton Data Preprocessing

[0097] First, raw skeleton acquisition refers to data captured from different sensors or cameras, typically in various formats such as joint coordinates, depth images, or RGB images. During the preprocessing stage, this raw data needs to be extracted and standardized for subsequent analysis, including calibrating and synchronizing data from different sensors to ensure they are available in the same temporal and spatial coordinate system.

[0098] Secondly, noise removal is a crucial step because raw data typically contains noise from sensor errors, environmental interference, or motion blur. This noise can affect the accuracy of subsequent analysis and applications. Therefore, data preprocessing usually includes filtering and smoothing operations to remove this noise, thereby improving the reliability of the skeleton data.

[0099] Finally, viewpoint normalization is to ensure consistency of skeleton data across different viewpoints and camera settings. This typically involves mapping the skeleton data to a standard coordinate system so that data from different scenes can be more easily compared and analyzed.

[0100] In summary, skeleton data preprocessing is a crucial step that helps ensure data quality and consistency, thereby improving the performance and accuracy of computer vision and human motion analysis applications.
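
Two of the preprocessing operations described above, smoothing and mapping to a standard coordinate system, can be sketched in a few lines. The moving-average filter and the translation to a root joint are common choices assumed here for illustration; the source does not specify which filter or normalization is used, and a full viewpoint normalization would typically also include a rotation, which is omitted.

```python
def moving_average(track, window=3):
    """Smooth one coordinate track over time with a centered window."""
    half = window // 2
    out = []
    for t in range(len(track)):
        chunk = track[max(0, t - half): t + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def center_on_root(frame, root=0):
    """Translate all joints so the chosen root joint sits at the origin."""
    rx, ry, rz = frame[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in frame]


# Toy clip: 4 frames, 2 joints, (x, y, z) per joint.
clip = [
    [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0)],
    [(1.0, 1.0, 1.0), (5.0, 1.0, 1.0)],   # jittery x value on joint 1
    [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0)],
    [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0)],
]

joint1_x = [frame[1][0] for frame in clip]
smoothed_x = moving_average(joint1_x)
centered = [center_on_root(frame) for frame in clip]
```

The spike at the second frame is spread over its neighbors by the filter, and after centering every frame expresses joint positions relative to the root rather than to the camera.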

[0101] Step 2: Perform multi-scale feature extraction on the dataset obtained in Step 1:

[0102] First, the input feature $x_{in}$ is converted into sparse-level features $x_s$ and original-level features $x_p$. The calculation formula is as follows:

[0103] $x_s, x_p = \mathrm{Conv2D}_{1\times 1}(x_{in})$,

[0104]

[0105] where the number of sparse-level feature channels is set to $1/m$ of the output channels, and $C_{in}$, $C_{out}$, $T$, and $V$ denote the number of input channels, output channels, frames, and joints, respectively. From $x_s$, the $V_1$ nodes that contain the most action information are selected; the features of these $V_1$ nodes are aggregated into a feature map, and the corresponding adjacency matrix is constructed at the same time.

[0106] The node selection adopts the following strategies:

[0107] To further optimize model performance, this invention employs four different node selection strategies: average selection, fixed joint selection, fixed edge joint selection, and random selection. a) Average joint selection: this strategy divides all nodes into six groups based on their location, as shown in Figure 2, and averages all nodes within each group. b) Fixed joint selection: this strategy selects a set of non-edge nodes that can effectively represent the body structure and is relatively good at preserving key information. c) Fixed edge joint selection: this strategy selects joints located at the edges of the human body. However, edge nodes are generally more susceptible to motion jitter, so this strategy may not perform as well as the others. d) Random selection: this strategy has no explicit node selection rule, so the chosen nodes are uncertain. These strategies provide flexibility in model design, allowing performance to be optimized according to specific requirements.

[0108] Step 3: Based on the multi-scale features obtained in Step 2, model the long-distance random dependencies of skeletal joints, extract multi-scale discriminative spatiotemporal features, and achieve skeletal motion classification.

[0109] 3.1 Sparse-level fusion interaction module M2

[0110] As shown in Figure 4, dimensionality compression and interactive fusion of inter-channel related features are performed through the sparse-level fusion interaction module M2 in the fusion interaction module;

[0111] 3.1.1 The sparse-level fusion interaction module M2 extracts global features through its global feature extraction part $\mathrm{GF}_{Extra}$ and compresses the channel and temporal dimensions to facilitate the subsequent attention operations. Specifically, the original-level and sparse-level skeleton sequences are processed by $\mathrm{GF}_{Extra}$ to obtain the outputs $Q_p$ and $K_s$:

[0112]

[0113]

[0114] Then, $Q_p$ and $K_s$ are multiplied to achieve cascaded interaction fusion within the channel, and the resulting weights are normalized with Softmax to obtain the cascaded interaction matrix $A_{ps}$, which maps the original level to the sparse level. Matrix multiplication of the input features $x_p$ with $A_{ps}$ then achieves interactive fusion of spatial features within the channel:

[0115] $A_{ps} = \mathrm{Softmax}(\mathrm{atten}(Q_p K_s))$

[0116]

[0117] where $Q_p K_s$ denotes matrix multiplication and MHSA is the multi-head self-attention mechanism. The weights are normalized with Softmax. A global parameterized matrix is introduced to help the correlation matrix explore more flexible joint relationships. Finally, MHSA is used to realize the interaction and diffusion of channel features;

[0118] 3.2 The original-level fusion interaction module M3 (shown as M3 in Figure 3) uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, thereby achieving cross-channel information diffusion;

[0119] Specifically, the original skeleton sequence passes through the sparse fusion interaction module M2 and a self-attention layer, which perform cascaded fusion of intra-channel features and model long-distance joint dependencies in the skeleton sequence. The outputs are then fed to the global feature extraction part $GF_{Extra}$, which performs global feature extraction and dimensionality compression:

[0120] $Q_s = GF_{Extra}(\mathrm{SA}(x_{in}))$, $K_p = GF_{Extra}(\mathrm{SF}(x_{in}))$,

[0121]

[0122] In the above formula, SF represents the sparse fusion interaction module M2, and SA represents the self-attention layer; using the attention mechanism, the sparse-level skeleton sequence is mapped to the original-level skeleton sequence to generate the cascaded interaction matrix $A_{sp}$. Subsequently, a graph-like convolution operation is performed on the output of the sparse-level fusion interaction module M2 to promote the interaction and fusion of features at different scales. The result is then processed by a linear transformation layer and a gating mechanism, and finally a residual connection is established with the output of the self-attention layer:

[0123]

[0124] $x_{out} = \lambda \cdot \mathrm{Linear}(\mathrm{SF}(x_{in}) A_{sp}) + \mathrm{SA}(x_{in})$,

[0125] where d represents the feature dimension, λ represents the fusion coefficient, and the attention map models the joint correlation between the front and rear channels;
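The gated residual fusion of 3.2 can be sketched as follows, taking the outputs of SF and SA as given inputs. This is an illustration under stated assumptions: the shapes (channels by joints), the bias-free channel-mixing `weight` standing in for Linear, and the names `gated_residual_fusion`, `sf_out`, `sa_out`, and `lam` are hypothetical and do not come from the text.

```python
def matmul(a, b):
    # Plain matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gated_residual_fusion(sf_out, a_sp, sa_out, weight, lam):
    # sf_out: (C, V_s) output of the sparse fusion module SF(x_in).
    # a_sp:   (V_s, V_p) cascaded interaction matrix, sparse level to original level.
    # sa_out: (C, V_p) output of the self-attention layer SA(x_in).
    # weight: (C, C) channel-mixing matrix used here in place of Linear (assumption).
    # lam:    scalar fusion coefficient (lambda).
    mapped = matmul(sf_out, a_sp)        # graph-like convolution: (C, V_p)
    projected = matmul(weight, mapped)   # linear transformation over channels
    return [[lam * p + s for p, s in zip(p_row, s_row)]
            for p_row, s_row in zip(projected, sa_out)]

sf_out = [[1.0, 2.0]]                        # 1 channel, 2 sparse joints
a_sp = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]    # 2 sparse joints -> 3 original joints
sa_out = [[0.5, 0.5, 0.5]]
x_out = gated_residual_fusion(sf_out, a_sp, sa_out, [[2.0]], lam=0.5)
```

With the fusion coefficient set to zero the output reduces to the self-attention branch, which shows that the sparse-level branch acts as a weighted correction on top of the residual path.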

[0126] 3.3 In the sparse-level fusion interaction module M2 and the original-level fusion interaction module M3, multi-head self-attention (MHSA) is used to obtain the global features of spatial joints; specifically, position embedding (PE) is used to encode the spatial joint information and label each joint, with sine and cosine functions of different frequencies as the encoding functions:

[0127]

[0128]

[0129] where p and i represent the joint position and the dimension of the position encoding vector, respectively; then, a linear transformation is performed on the input feature map x to generate Q, K, and V, and $QK^T$ is calculated. Softmax is used to normalize $QK^T$, and a learnable adjacency matrix $A_l$ is introduced. The final output $X_{out}$ is obtained through a linear transformation, as shown below:

[0130] Q, K, V = Linear(x),

[0131]

[0132]

[0133] where Linear(·) represents a linear transformation. This module is also used in the sparse stage, with the sparse-level features as its input and output;
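The joint position encoding of 3.3 can be sketched with the sinusoidal scheme of the standard Transformer. This is an assumption: the text only states that sine and cosine functions of different frequencies are used, so the base of 10000, the interleaving of sine and cosine on even and odd dimensions, and the function name `joint_position_encoding` are illustrative choices rather than the patented formula.

```python
import math

def joint_position_encoding(num_joints, dim, base=10000.0):
    # Returns a (num_joints, dim) table: even dimensions hold sin, odd hold cos.
    # p is the joint position; i walks over the even dimensions of the vector.
    table = [[0.0] * dim for _ in range(num_joints)]
    for p in range(num_joints):
        for i in range(0, dim, 2):
            angle = p / (base ** (i / dim))
            table[p][i] = math.sin(angle)
            if i + 1 < dim:
                table[p][i + 1] = math.cos(angle)
    return table

pe = joint_position_encoding(num_joints=25, dim=8)  # e.g. 25 joints per skeleton
```

Each joint receives a distinct vector that is added to its features before attention, so that the attention layers can tell joints apart even though attention itself is order-agnostic.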

[0134] 3.4 The temporal multi-scale TCN module (shown on the left of Figure 5) uses multi-scale temporal convolution with the kernel size fixed at 3×1 and different dilation rates (dilation = 1, 2) to enlarge the receptive field, reducing the computational cost caused by extra branches, as shown below:

[0135] $x_d, x_m, x_t, x_r = \mathrm{Conv}_{(1\times 1)}(x)$,

[0136]

[0137] where max pooling is used to obtain focal joint features:

[0138] $x_m = \mathrm{MaxPool}(x_m)$,

[0139] where $\mathrm{Conv}_{(1\times 1)}$ is used for feature transformation;
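The effect of the two dilation rates on a fixed kernel of size 3 can be shown with a small sketch of a one-dimensional temporal convolution over the frames of a single joint and channel. The zero padding, the cross-correlation form, and the function names are assumptions for illustration; the patent does not specify them.

```python
def dilated_conv1d(x, kernel, dilation):
    # Zero-padded temporal convolution (cross-correlation form) that keeps
    # the number of frames unchanged. x is a list of per-frame values.
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j * dilation - pad
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilation):
    # Number of frames spanned by one dilated kernel.
    return dilation * (kernel_size - 1) + 1

impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
branch_d1 = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=1)
branch_d2 = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
```

With dilation 1 the kernel spans 3 frames and with dilation 2 it spans 5 frames, while both branches keep the same three weights per kernel, which is how the receptive field grows without a larger kernel.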

[0140] The temporal multi-scale TCN module utilizes the temporal multi-scale interaction mechanism T-Former (shown on the right of Figure 5), which uses a Transformer to aggregate the feature changes of the skeleton over time from a global perspective. A channel grouping strategy is introduced to reduce the number of model parameters, provide independent feature learning capabilities for different channel groups, and improve the interactivity between features within the model, as shown below:

[0141] $V_t = \mathrm{Linear}(x_t)$,

[0142] $K_t, Q_t = \mathrm{ChannelGrouping}(x_t)$,

[0143]

[0144]

[0145] where Linear(·) represents a linear transformation; finally, the branch outputs, including $x_d$, are concatenated along the channel dimension;
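The parameter saving claimed for channel grouping can be illustrated with a short sketch. The text does not define the ChannelGrouping operator, so the equal-size split below, the per-group square projection used for the parameter count, and the names `channel_grouping` and `projection_params` are all assumptions.

```python
def channel_grouping(x, groups):
    # x: list of C channel rows (each row holds per-frame values).
    # Splits the channels into `groups` contiguous groups of equal size.
    channels = len(x)
    if channels % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    size = channels // groups
    return [x[g * size:(g + 1) * size] for g in range(groups)]

def projection_params(channels, groups=1):
    # Weights of a bias-free channel projection applied independently per group.
    per_group = channels // groups
    return groups * per_group * per_group

x_t = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # 4 channels, 2 frames
grouped = channel_grouping(x_t, groups=2)
full = projection_params(64)          # one projection over all 64 channels
split = projection_params(64, 4)      # four independent 16-channel projections
```

Under these assumptions, splitting C channels into G groups divides the projection weights by G, and each group learns its own temporal attention pattern independently.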

[0146] 3.5 Training Loss

[0147] Based on the information bottleneck learning objective, this method uses Infoloss to learn information-rich yet compact latent representations. Specifically, given a predicted label ... and the real label z, the $L_{info}$ loss can be written as:

[0148]

[0149]

[0150] Here, α and N represent the confidence coefficient and the number of action categories, respectively. Element-wise multiplication is used to calculate the predicted score of the true class; the predicted label and $z_i$ are both regularized; additionally, the cross-entropy loss is as follows:

[0151]

[0152] where M represents the number of categories; $y_{ic}$ is an indicator function (0 or 1) that takes the value 1 if the true class of sample i equals c and 0 otherwise; $p_{ic}$ represents the predicted probability that the observed sample i belongs to class c; finally, the $L_{CE}$ loss and the proposed $L_{info}$ loss are incorporated into the complete learning objective function:

[0153] $L = L_{info} + \omega L_{CE}$.

[0154] Here, $L_{info}$ and $L_{CE}$ are as defined above, and ω is the balancing hyperparameter for $L_{CE}$.
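The combined objective can be sketched as below. The cross-entropy term follows the definitions of the indicator and the predicted probability given above, averaged over samples (the averaging is an assumption). The Infoloss term is passed in as a precomputed number because its formula is not reproduced in the text; the function names are hypothetical.

```python
import math

def cross_entropy(probs, labels):
    # probs[i][c] is the predicted probability that sample i belongs to class c.
    # labels[i] is the true class of sample i, so the one-hot indicator y_ic
    # keeps only the log-probability of the true class.
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def total_loss(l_info, l_ce, omega):
    # L = L_info + omega * L_CE, with omega balancing the cross-entropy term.
    return l_info + omega * l_ce

l_ce = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
loss = total_loss(l_info=1.0, l_ce=l_ce, omega=0.5)
```

A larger ω gives the classification term more weight relative to the information bottleneck term; the text treats ω as a hyperparameter to be tuned.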

[0155] A multi-scale spatiotemporal interactive skeleton motion classification system for motion jitter and skeleton noise suppression includes:

[0156] The feature extraction and node selection module, which is used in step 2 to extract multi-scale features, including sparse-level features and original-level features, and to select informative joints, so as to minimize the impact of jittering joints and improve the robustness and accuracy of skeleton-based human motion recognition.

[0157] The fusion interaction module, which includes a sparse-level fusion interaction module M2 and an original-level fusion interaction module M3. In step 3, the sparse-level fusion interaction module M2 performs dimensionality compression and interaction fusion of related features within the channel; the original-level fusion interaction module M3 uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, thereby balancing the noise caused by motion jitter and fusing features at different scales.

[0158] The temporal multi-scale TCN module, which in step 3 uses a Transformer to aggregate the feature changes of the skeleton in the time dimension from a global perspective. It introduces a channel grouping strategy to reduce the number of model parameters, provides independent feature learning capabilities for different channel groups, and improves the interactivity between features within the model. The module extracts multi-scale temporal features, realizing the extraction of differential temporal features and the interaction and updating of global temporal information. Finally, the proposed Infoloss loss function is used to learn the information-rich but compact latent representation in the skeleton data, guiding the model to understand the maximum information representation.

[0159] A multi-scale spatiotemporal interactive skeleton motion classification device for motion jitter and skeleton noise suppression includes:

[0160] Memory, used to store computer programs;

[0161] A processor is used to implement the skeleton motion classification method described in steps 1 to 3 when executing the computer program.

[0162] A computer-readable storage medium storing a computer program which, when executed by a processor, performs multi-scale spatiotemporal interactive skeleton motion classification with motion jitter and skeleton noise suppression, based on the skeleton motion classification method described in steps 1 to 3.

[0163] Experimental analysis was conducted to optimize the model hyperparameters. Four publicly available large-scale datasets were used: NTU-RGB+D, NTU-RGB+D120, NW-UCLA, and UAV-Human. Preprocessing operations were performed on the skeleton data, including raw skeleton acquisition, noise removal, and viewpoint normalization.

[0164] The processed dataset is fed into the skeleton action classification model. First, the skeleton data passes through a sparse feature extraction module to extract sparse and raw features. These extracted multi-scale features are then fed into a fusion and interaction module to achieve interaction and fusion of features at different scales within the overall network. Finally, based on the information bottleneck learning objective, we improve the training loss to learn information-rich yet compact latent representations in the skeleton data, guiding our model to understand the maximum information representation.

[0165] The operating system used in the experiment was Ubuntu 22.04.1, and the deep learning framework used was PyTorch. The specific configurations involved in the experiment are shown in Table 1.

[0166] Table 1 Experimental configuration

[0167]
Item | Configuration
Processor (CPU) | 12th Gen Intel(R) Core(TM) i9-12900KF CPU @ 3.20GHz
Graphics card (GPU) | GeForce RTX 3090
Operating system | Ubuntu 22.04.1
Framework | PyTorch

[0168] Extensive ablation experiments demonstrated the effectiveness of the innovative submodule proposed in this invention. This reinforces the feasibility and performance of the invention. Comparative experiments with existing GCN-based and Transformer-based methods showed performance advantages on four different datasets. This implies that the CI-STFormer model achieves better results in action classification.

Claims

1. A multi-scale spatiotemporal interactive skeleton motion classification method for motion jitter and skeleton noise suppression, characterized in that: first, skeleton video samples to be trained are obtained; second, the sparse selection module of the skeleton action recognition model is applied to extract the sparse joint features and original joint features of the skeleton samples to be trained from the skeleton video samples; then, the fusion interaction module and the temporal multi-scale graph convolution module of the skeleton action recognition model are used to model the long-distance random dependencies between skeleton joints in order to extract multi-scale spatiotemporal features, and the interaction and updating of global spatiotemporal information enables the skeleton-based human motion classification task; wherein modeling the long-distance random dependencies between skeleton joints with the fusion interaction module and the temporal multi-scale graph convolution module to extract multi-scale spatiotemporal features specifically includes:

3.1 Dimensionality compression and interaction fusion of inter-channel related features are performed by the sparse-level fusion interaction module M2 of the fusion interaction module;

3.1.1 The sparse-level fusion interaction module M2 extracts global features through its global feature extraction part; furthermore, the channel and temporal dimensions are compressed to facilitate subsequent attention operations; specifically, the original-level and sparse-level skeleton sequences are processed by the global feature extraction part to obtain the outputs, in which one term denotes the sparse-level feature and another denotes the feature mapping; the two outputs are then multiplied to achieve cascaded interaction fusion within the channel, and the resulting weights are adjusted using Softmax to obtain the cascaded interaction matrix that maps the original level to the sparse level; the input features and the cascaded interaction matrix are matrix-multiplied to achieve interactive fusion of spatial features within the channel, where MHSA is a multi-head self-attention mechanism and the weights are adjusted using Softmax; a global parameterization matrix is introduced to help obtain joint relationships from the correlation matrix; the multi-head self-attention mechanism MHSA is used to realize the interaction and diffusion of channel features;

3.2 The original-level fusion interaction module M3 uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, realizing cross-channel information diffusion; the original skeleton sequence passes through the sparse fusion interaction module M2 and a self-attention layer, which perform cascaded fusion of intra-channel features and model long-distance joints in the skeleton sequence; the outputs are then fed to the global feature extraction part to achieve global feature extraction and dimensionality compression, where SF represents the sparse fusion interaction module M2 and SA represents the self-attention layer; using the attention mechanism, the sparse-level skeleton sequence is mapped to the original-level skeleton sequence to generate a cascaded interaction matrix; subsequently, a graph-like convolution operation is performed on the output of the sparse-level fusion interaction module to promote the interaction and fusion of features at different scales; the output then passes through a linear transformation layer and a gating mechanism, and finally a residual connection is established with the output of the self-attention layer, where d represents the feature dimension, λ represents the fusion coefficient, and the attention map models the joint correlation between the front and rear channels;

3.3 In the sparse-level fusion interaction module M2 and the original-level fusion interaction module M3, multi-head self-attention (MHSA) is used to obtain global features of spatial joints; specifically, position embedding (PE) is used to encode spatial joint information and label each joint, with sine and cosine functions of different frequencies as the encoding functions, where p and i represent the joint position and the dimension of the position encoding vector, respectively; then, a linear transformation is performed on the input feature map to generate Q, K, and V, and $QK^T$ is calculated; Softmax is used to normalize $QK^T$, and a learnable adjacency matrix $A_l$ is introduced; the final output $X_{out}$ is obtained through a linear transformation, where Linear(·) represents a linear transformation; this module is also used in the sparse stage, with the sparse-level features as its input and output;

3.4 The temporal multi-scale TCN module employs multi-scale temporal convolution, fixing the kernel size to 3×1 and using different dilation rates to enlarge the receptive field, thus reducing the computational cost caused by additional branches; max pooling is used to obtain the focal joint features, and a 1×1 convolution is used for feature transformation; the temporal multi-scale TCN module utilizes the temporal multi-scale interaction mechanism T-Former, employing a Transformer to aggregate the feature changes of the skeleton across the time dimension from a global perspective; a channel grouping strategy is introduced to reduce the number of model parameters, provide independent feature learning capabilities for different channel groups, and enhance the interactivity between features within the model, where Linear(·) represents a linear transformation; finally, the branch outputs are concatenated along the channel dimension.

2. The multi-scale spatiotemporal interactive skeleton motion classification method for motion jitter and skeleton noise suppression according to claim 1, characterized in that the method specifically includes the following steps: Step 1, obtaining the skeleton video samples to be trained: publicly available human skeleton motion datasets are collected from the Internet, and preprocessing operations are performed on the skeleton data, including raw skeleton acquisition, noise removal and viewpoint normalization, to obtain the preprocessed dataset. Step 2: the dataset obtained from the preprocessing in Step 1 is used as input for multi-scale feature extraction, including sparse-level feature extraction and original-level joint feature extraction; a variety of node selection strategies, including fixed joint selection, fixed edge joint selection, average selection and random selection, are used to reduce the influence of jittering joints and improve the robustness and accuracy of skeleton-based human motion recognition. Step 3: based on the multi-scale features obtained in Step 2, the long-distance random dependencies of skeleton joints are modeled using the fusion interaction module and the temporal multi-scale TCN module to achieve multi-scale spatiotemporal feature extraction; in addition, based on the information bottleneck learning objective, the Infoloss loss function is used to learn the information-rich but compact latent representation in the skeleton data to guide the model to understand the maximum information representation.

3. The multi-scale spatiotemporal interactive skeleton motion classification method for motion jitter and skeleton noise suppression according to claim 2, characterized in that Step 2 includes: first, the input features are converted into sparse-level features and original-level features, respectively, according to a calculation formula in which the number of sparse-level feature channels is set to the number of output channels and the remaining symbols represent the number of input channels, output channels, frames, and joints, respectively; the nodes carrying more motion information are selected; the features of the selected nodes are aggregated into a feature map; and the corresponding adjacency matrix is constructed at the same time.

4. The multi-scale spatiotemporal interactive skeleton motion classification method for motion jitter and skeleton noise suppression according to claim 3, characterized in that the node selection strategy employs the following approaches to reduce the impact of joint jitter and improve the robustness and accuracy of skeleton-based human motion recognition: a) Average joint selection: group all nodes by location and average all nodes in each node group; b) Fixed joint selection: select a set of non-edge nodes that can effectively represent the body structure; c) Fixed edge joint selection: select joints located at the edge of the human body; d) Random selection: randomly select different nodes.

5. The multi-scale spatiotemporal interactive skeleton motion classification method for motion jitter and skeleton noise suppression according to claim 3, characterized in that the training loss in step 3 is as follows: based on the information bottleneck learning objective, the method uses Infoloss to learn information-rich yet compact latent representations; specifically, given a predicted label and a real label, the Infoloss is written in terms of a confidence coefficient and the number of action categories; element-wise multiplication is used to calculate the predicted score of the true class, and the predicted label and the real label are both regularized; additionally, in the cross-entropy loss, M represents the number of categories, the indicator function takes the value 1 if the true class of sample i equals c and 0 otherwise, and the predicted probability is the probability that the observed sample i belongs to class c; finally, the cross-entropy loss and the proposed Infoloss are incorporated into the complete learning objective function, in which the two losses are as defined above and a balancing hyperparameter weights the cross-entropy loss.

6. A multi-scale spatiotemporal interactive skeleton motion classification system for motion jitter and skeleton noise suppression, characterized in that it is used to implement the skeleton motion classification method according to any one of claims 1-5 and comprises: a feature extraction and node selection module, which is used to extract multi-scale features, including sparse-level features and original-level features, and to select informative joints, so as to minimize the impact of jittering joints and improve the robustness and accuracy of skeleton-based human motion recognition; a fusion interaction module, which includes a sparse-level fusion interaction module M2 and an original-level fusion interaction module M3, wherein the sparse-level fusion interaction module M2 performs dimensionality compression and interaction fusion of related features within the channel, and the original-level fusion interaction module M3 uses multi-head cross-attention to interact with and diffuse the features of the two sets of channels, thereby balancing the noise caused by motion jitter and fusing features at different scales; and a temporal multi-scale TCN module, which employs a Transformer to aggregate the feature changes of the skeleton in the time dimension from a global perspective, introduces a channel grouping strategy to reduce the number of model parameters, provides independent feature learning capabilities for different channel groups, and enhances the interactivity between features within the model; this module is used to extract multi-scale temporal features, enabling the extraction of differential temporal features and the interaction and updating of global temporal information; finally, the proposed Infoloss loss function is used to learn the information-rich yet compact latent representation in the skeleton data, guiding the model to understand the maximum information representation.

7. A multi-scale spatiotemporal interactive skeleton motion classification device for motion jitter and skeleton noise suppression, characterized in that it includes: a memory, used to store a computer program; and a processor, configured to implement the skeleton motion classification method according to any one of claims 1-5 when executing the computer program.

8. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it can perform multi-scale spatiotemporal interactive skeleton motion classification based on the skeleton motion classification method described in any one of claims 1-5, which includes motion jitter and skeleton noise suppression.