A human activity recognition method, device, equipment and readable storage medium

Skeletal joint features and acceleration features are extracted with a preset spatiotemporal skeleton model and a Transformer model, respectively, and then fused. This multimodal fusion addresses the low accuracy of single-modal recognition and improves the accuracy of human activity recognition.

CN116229579B · Active · Publication Date: 2026-06-23 · WUHAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: WUHAN UNIV
Filing Date: 2023-03-21
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing technologies that rely on a single modality for human activity recognition suffer from low accuracy.

Method used

A multimodal fusion method is adopted, which extracts and fuses skeletal joint features and acceleration features, extracts features using a pre-set spatiotemporal skeletal model and a Transformer model, and performs recognition through a convolutional neural network.

Benefits of technology

It improves the accuracy of human activity recognition, overcomes the problem of poor single-modal recognition performance, and achieves improved accuracy in multimodal recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a human activity recognition method and device, equipment and a readable storage medium, and relates to the technical field of intelligent sensing. The application comprises the following steps: performing feature extraction on original skeleton joint features of a target action to obtain skeleton joint features; performing feature extraction on original acceleration features of the target action to obtain acceleration features; performing fusion processing on the skeleton joint features and the acceleration features to obtain multi-modal fusion features; and performing recognition based on the multi-modal fusion features to obtain a recognition result corresponding to the target action. The application realizes multi-modal recognition of human activity by fusing skeleton joint data and acceleration data, so as to solve the problem of poor single-modal recognition effect and effectively improve the accuracy of human activity recognition.

Description

Technical Field

[0001] This application relates to the field of intelligent sensing technology, and in particular to a method, apparatus, device, and readable storage medium for human activity recognition.

Background Technology

[0002] Human activity recognition is a classic problem in the field of intelligent sensing technology, which involves using various sensing methods to acquire information related to activities. However, many related technologies rely on a single modality for human activity recognition, resulting in poor recognition performance and low accuracy. Therefore, improving the accuracy of human activity recognition is a pressing issue that needs to be addressed.

Summary of the Invention

[0003] This application provides a method, apparatus, device, and readable storage medium for human activity recognition, in order to solve the problem of low recognition accuracy caused by human activity recognition through a single modality in related technologies.

[0004] In a first aspect, a method for recognizing human activity is provided, comprising the following steps:

[0005] Feature extraction is performed on the original skeletal joint features of the target action to obtain skeletal joint features;

[0006] Feature extraction is performed on the original acceleration features of the target action to obtain acceleration features;

[0007] The skeletal joint features and the acceleration features are fused to obtain multimodal fused features;

[0008] Based on the multimodal fusion features, the recognition result corresponding to the target action is obtained.

[0009] In some embodiments, the step of extracting features from the original skeletal joint features of the target action to obtain skeletal joint features includes:

[0010] Based on the channel compression excitation block in the preset spatiotemporal skeleton model, the original skeleton joint features are compressed, excited, and weighted in the channel dimension to obtain the first feature;

[0011] The first feature is compressed, excited, and weighted in the time dimension based on the time compression excitation block in the preset spatiotemporal skeleton model to obtain the second feature.

[0012] The skeletal joint features are obtained based on the second feature and the original skeletal joint features.

[0013] In some embodiments, the original skeletal joint features are compressed, excited, and weighted along the channel dimension based on the channel compression excitation block in the preset spatiotemporal skeletal model to obtain a first feature, including:

[0014] The channel compression excitation block compresses and excites the original skeletal joint features in both length and width dimensions to obtain the first activated features.

[0015] The channel compression excitation block compresses and excites the original skeletal joint features in the time, length, and width dimensions to obtain the second activated features.

[0016] The first activated feature and the second activated feature are combined to obtain the combined feature;

[0017] The combined feature and the original skeletal joint feature are weighted to obtain the first feature.

[0018] In some embodiments, the step of extracting acceleration features from the raw acceleration features of the target action to obtain acceleration features includes:

[0019] Feature extraction is performed on the original acceleration features of the target action based on the Transformer model to obtain acceleration features.

[0020] In some embodiments, the fusion processing of the skeletal joint features and the acceleration features to obtain multimodal fusion features includes:

[0021] The skeletal joint features are subjected to convolution and average pooling to obtain new key vectors and new query vectors, and new value vectors are obtained based on the skeletal joint features and their corresponding learning parameter matrices.

[0022] The key vector and value vector in the Transformer model are updated based on the new key vector and the new value vector, respectively, to obtain the first Transformer model;

[0023] The acceleration features are reshaped using the first Transformer model to obtain the fused acceleration features;

[0024] The query vector in the Transformer model is updated based on the new query vector to obtain the second Transformer model;

[0025] The skeletal joint features are reshaped using the second Transformer model to obtain the fused skeletal joint features.

[0026] In some embodiments, the step of identifying the target action based on the multimodal fusion features includes:

[0027] The fused acceleration features and the fused skeletal joint features are cascaded to obtain cascaded features.

[0028] The cascaded features are identified using a convolutional neural network to obtain the recognition result corresponding to the target action.

[0029] In some embodiments, before the step of extracting features from the original skeletal joint features of the target action to obtain the skeletal joint features, the method further includes:

[0030] Acquire raw acceleration data and raw skeletal joint data corresponding to multiple target actions performed by the human body;

[0031] The original acceleration data is segmented according to the video frames corresponding to each target action in the original skeletal joint data to obtain multiple original acceleration groups;

[0032] Alignment processing is performed on the video frames corresponding to each target action in multiple raw acceleration groups and raw skeletal joint data to obtain the raw skeletal joint features and raw acceleration features corresponding to each target action.

[0033] In a second aspect, a human activity recognition device is provided, comprising:

[0034] The feature extraction unit is used to extract features from the original skeletal joint features of the target action to obtain skeletal joint features; and to extract features from the original acceleration features of the target action to obtain acceleration features.

[0035] The feature fusion unit is used to fuse the skeletal joint features and the acceleration features to obtain multimodal fused features;

[0036] An action recognition unit is used to perform recognition based on the multimodal fusion features to obtain the recognition result corresponding to the target action.

[0037] In a third aspect, a human activity recognition device is provided, comprising: a memory and a processor, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the aforementioned human activity recognition method.

[0038] In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program which, when executed by a processor, implements the aforementioned human activity recognition method.

[0039] The beneficial effects of the technical solution provided in this application include: effectively improving the accuracy of human activity recognition.

[0040] This application provides a method, apparatus, device, and readable storage medium for human activity recognition. The method includes extracting features from the original skeletal joint features of a target action to obtain skeletal joint features; extracting features from the original acceleration features of the target action to obtain acceleration features; fusing the skeletal joint features and the acceleration features to obtain multimodal fusion features; and performing recognition based on the multimodal fusion features to obtain a recognition result corresponding to the target action. This application achieves multimodal recognition of human activity by fusing skeletal joint data and acceleration data, thereby solving the problem of poor performance in single-modal recognition and effectively improving the accuracy of human activity recognition.

Description of the Drawings

[0041] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0042] Figure 1 is a flowchart of a human activity recognition method provided in an embodiment of this application;

[0043] Figure 2 is a schematic flowchart of the human activity recognition method provided in an embodiment of this application;

[0044] Figure 3 is a schematic diagram of the overall model structure provided in an embodiment of this application;

[0045] Figure 4 is a schematic diagram of human skeletal joint data provided in an embodiment of this application;

[0046] Figure 5 is a schematic diagram of the structure of the 3D-SEResnet provided in an embodiment of this application;

[0047] Figure 6 is a schematic diagram of the lightweight Transformer provided in an embodiment of this application;

[0048] Figure 7 is a schematic diagram of the structure for fusing skeletal joint data and acceleration data provided in an embodiment of this application;

[0049] Figure 8 is a schematic diagram of the structure of a human activity recognition device provided in an embodiment of this application.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0051] This application provides a method, apparatus, device, and readable storage medium for human activity recognition, which can solve the problem of low recognition accuracy caused by human activity recognition through a single modality in related technologies.

[0052] As shown in Figures 1 to 3, an embodiment of this application provides a method for human activity recognition, including the following steps:

[0053] Step S10: Extract features from the original skeletal joint features of the target action to obtain skeletal joint features;

[0054] It should be understood, as an example, that not only can body sensors be used to collect information such as acceleration and angular velocity during human activity for single-modal activity perception, but computer vision technology based on video images can also be used for single-modal activity perception. However, in actual sensor-based recognition processes, interference from environmental noise is common, while video image-based recognition technology is often affected by various factors such as changes in viewpoint, lighting, and background. Therefore, single-modal human activity recognition achieved through sensors or computer vision technology often has poor recognition performance, resulting in low accuracy. Thus, human activity recognition in real-world scenarios still faces many challenges.

[0055] In this embodiment, to improve recognition accuracy, a multimodal fusion method for human activity recognition is provided. Specifically, it fuses human skeletal joint data and acceleration data, and recognizes human activities based on the fused data. Therefore, before data fusion, feature extraction of the original skeletal joint features of the target human movement is required to obtain skeletal joint features.

[0056] Furthermore, before the step of extracting features from the original skeletal joint features of the target action to obtain the skeletal joint features, the method further includes:

[0057] Acquire raw acceleration data and raw skeletal joint data corresponding to multiple target actions performed by the human body;

[0058] The original acceleration data is segmented according to the video frames corresponding to each target action in the original skeletal joint data to obtain multiple original acceleration groups;

[0059] Alignment processing is performed on the video frames corresponding to each target action in multiple raw acceleration groups and raw skeletal joint data to obtain the raw skeletal joint features and raw acceleration features corresponding to each target action.

[0060] As an example, in this embodiment, before recognizing the target action, skeletal joint data and acceleration data of human activity are collected, and the collected data are preprocessed to establish a dataset.

[0061] Specifically, sensing modules such as accelerometers can be attached to the human body. While the human body performs the specified target actions, the accelerometer collects the acceleration of the human body along the x, y, and z axes at a frequency of 4 Hz to obtain the raw acceleration data corresponding to the multiple target actions. At the same time, camera modules such as infrared motion capture cameras collect skeletal data from the front of the human body to obtain the raw skeletal joint data corresponding to the multiple target actions. For example, a motion capture camera can capture data from 21 skeletal joints (i.e., the key points numbered 1 to 21 in Figure 4) at a frequency of 100 Hz.

[0062] Since the raw acceleration data of the multiple target actions is continuous, it is necessary to segment the continuous raw acceleration data to obtain the acceleration data corresponding to each target action. Understandably, since the skeletal joint data is obtained from video captured by the camera module, the skeletal joint data corresponding to each target action can be determined based on the video frames.

[0063] Therefore, in this embodiment, the raw acceleration data is segmented according to the video frames to obtain multiple raw acceleration groups. Each raw acceleration group is then aligned with the video frames of the corresponding target action to obtain the raw skeletal joint features and raw acceleration features corresponding to each target action. To ensure data integrity, mean interpolation can be used to fill in missing values in each raw acceleration group. Subsequently, all raw acceleration group data are standardized and normalized. Following the above steps, data from multiple users are collected for u specified target actions to establish a human activity recognition dataset. The dataset can be divided into a training set and a test set, so that the model in this embodiment can be trained on the training set and its performance evaluated on the test set.
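The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the implementation of this embodiment: the function names, the inclusive `(first_frame, last_frame)` representation of each action, and the choice to standardize with statistics pooled over all groups are assumptions made for the example. The default sampling rates are the 4 Hz and 100 Hz values given above.

```python
import numpy as np


def segment_by_frames(accel, frame_ranges, accel_hz=4.0, video_hz=100.0):
    """Cut a continuous (n, 3) acceleration stream into one group per action.

    frame_ranges holds (first_frame, last_frame) pairs, inclusive, read off
    the skeletal joint video. Frame indices are mapped to sample indices
    through the ratio of the two sampling rates.
    """
    groups = []
    for first, last in frame_ranges:
        start = int(np.floor(first * accel_hz / video_hz))
        stop = int(np.ceil((last + 1) * accel_hz / video_hz))
        groups.append(accel[start:stop].copy())
    return groups


def fill_missing_with_mean(group):
    """Replace NaN samples with the mean of the observed samples on that axis."""
    filled = group.copy()
    axis_mean = np.nanmean(filled, axis=0)
    rows, cols = np.nonzero(np.isnan(filled))
    filled[rows, cols] = axis_mean[cols]
    return filled


def standardize(groups, eps=1e-8):
    """Scale every group with the mean and standard deviation of all groups."""
    stacked = np.concatenate(groups, axis=0)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0)
    return [(g - mean) / (std + eps) for g in groups]
```

With these defaults, an action covering video frames 0 to 299 (three seconds) maps to the first 12 acceleration samples.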

[0064] Furthermore, the process of extracting features from the original skeletal joint features of the target action to obtain skeletal joint features includes:

[0065] Based on the channel compression excitation block in the preset spatiotemporal skeleton model, the original skeleton joint features are compressed, excited, and weighted in the channel dimension to obtain the first feature;

[0066] The first feature is compressed, excited, and weighted in the time dimension based on the time compression excitation block in the preset spatiotemporal skeleton model to obtain the second feature.

[0067] The skeletal joint features are obtained based on the second feature and the original skeletal joint features.

[0068] As an example, in this embodiment, a spatiotemporal skeleton model is established by connecting multiple 3D-SEResnet blocks, and the final skeletal joint features FI are extracted from the skeletal joint data using this model. As shown in Figure 5, each 3D-SEResnet block includes a residual block (i.e., ResNet), a channel compression excitation block, and a time compression excitation block. The residual block performs two convolution operations on the original skeletal joint features to learn residual features, which allows deeper network layers and prevents vanishing gradients. It should be noted that whether a residual block is included in the 3D-SEResnet block can be determined according to actual needs and is not limited here.

[0069] In this embodiment, taking a 3D-SEResnet block that includes a residual block as an example: the channel compression excitation block compresses, excites, and weights, in the channel dimension, the original skeletal joint features processed by the residual block to obtain the first feature; the time compression excitation block compresses, excites, and weights the first feature in the time dimension to obtain the second feature; the second feature and the original skeletal joint features are then spliced together to obtain the skeletal joint features FI of the target action.

[0070] Furthermore, the original skeletal joint features are compressed, excited, and weighted along the channel dimension using the channel compression excitation block in the preset spatiotemporal skeletal model to obtain the first feature, including:

[0071] The channel compression excitation block compresses and excites the original skeletal joint features in both length and width dimensions to obtain the first activated features.

[0072] The channel compression excitation block compresses and excites the original skeletal joint features in the time, length, and width dimensions to obtain the second activated features.

[0073] The first activated feature and the second activated feature are combined to obtain the combined feature;

[0074] The combined feature and the original skeletal joint feature are weighted to obtain the first feature.

[0075] As an example, in this embodiment, assume that the original skeletal joint feature matrix of a target action is $F$, with $F \in \mathbb{R}^{f \times c \times h \times w}$, where $f$ represents time, $c$ represents channel, $h$ represents length, and $w$ represents width. A channel compression excitation block is constructed, which mainly includes three processes: compression, excitation, and weighting.

[0076] After $F$ is input into the channel compression excitation block, $F$ is compressed in the length and width dimensions using global average pooling to obtain $p_{1t}$, where $p_{1t} \in \mathbb{R}^{f \times c}$ represents the compressed feature. The compression process formula is as follows:

[0077] $$p_{1t}(t, k) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F(t, k, i, j)$$

[0078] The dimension of $p_{1t}$ is reduced by a first fully connected layer and activated by an activation function, and the dimension is then increased by another fully connected layer to obtain $g_{1t}$, where $g_{1t} \in \mathbb{R}^{f \times c}$ represents the activated feature. The excitation process formula is as follows:

[0079] $$g_{1t} = \mathrm{sigmoid}(W_1\,\mathrm{ReLU}(W_2\,p_{1t}))$$

[0080] where $W_1$ and $W_2$ both represent parameter matrices, $r$ represents a hyperparameter, ReLU is the activation function, and sigmoid is the nonlinear function of the neuron.

[0081] At the same time, after $F$ is input into the channel compression excitation block, $F$ is compressed in the three dimensions of time, length, and width through global average pooling to obtain $p_1$. The compression process formula is as follows:

[0082] $$p_1(k) = \frac{1}{f \times h \times w} \sum_{t=1}^{f} \sum_{i=1}^{h} \sum_{j=1}^{w} F(t, k, i, j)$$

[0083] where $p_1 \in \mathbb{R}^{c}$ represents the feature after compression along the channel.

[0084] Similarly, after the excitation process described above, $p_1$ yields $g_1$, where $g_1 \in \mathbb{R}^{c}$ represents the activated feature. The excitation process formula is as follows:

[0085] $$g_1 = \mathrm{sigmoid}(W_1\,\mathrm{ReLU}(W_2\,p_1))$$

[0086] After $g_{1t}$ and $g_1$ are obtained, they are combined to form a combined feature $G_1$ that incorporates both time and channel information, with $G_1 \in \mathbb{R}^{f \times c}$. The formula for $G_1$ is as follows:

[0087] $$G_1 = g_{1t} \times g_1$$

[0088] Finally, $G_1$ is used to weight $F$ to obtain the first feature $F_c$. The formula for $F_c$ is as follows:

[0089] $$F_c(t, k, i, j) = G_1(t, k)\,F(t, k, i, j)$$
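The compression, excitation, and weighting steps of the channel compression excitation block can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions: the shapes `W1: (c, c_red)` and `W2: (c_red, c)` are chosen so that `W2` lowers and `W1` restores the dimension, the same `W1` and `W2` are shared between the two excitation paths because the text uses the same symbols for both, and weighting is taken to be element-wise multiplication with broadcasting over length and width. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def excite(p, W1, W2):
    """sigmoid(W1 ReLU(W2 p)), applied along the last axis of p."""
    reduced = np.maximum(p @ W2.T, 0.0)   # W2 lowers the dimension, then ReLU
    return sigmoid(reduced @ W1.T)        # W1 restores the dimension


def channel_se_block(F, W1, W2):
    """Channel compression excitation for F of shape (f, c, h, w)."""
    p_1t = F.mean(axis=(2, 3))            # compress length, width     -> (f, c)
    p_1 = F.mean(axis=(0, 2, 3))          # compress time as well      -> (c,)
    g_1t = excite(p_1t, W1, W2)           # (f, c)
    g_1 = excite(p_1, W1, W2)             # (c,)
    G_1 = g_1t * g_1                      # g_1 broadcasts over time   -> (f, c)
    return G_1[:, :, None, None] * F      # weight F                   -> (f, c, h, w)
```

Because both gates lie strictly between 0 and 1, every entry of the output is a scaled-down copy of the corresponding entry of `F`, which is what lets the block attenuate less useful channels.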

[0090] Similarly, a time compression excitation block is constructed, which also includes three processes: compression, excitation, and weighting. $F_c$ is used as the feature matrix input to the time compression excitation block. $F_c$ is compressed in the length and width dimensions using global average pooling to obtain $p_{2t}$, with $p_{2t} \in \mathbb{R}^{f \times c}$. The compression process formula is as follows:

[0091] $$p_{2t}(t, k) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_c(t, k, i, j)$$

[0092] The dimension of $p_{2t}$ is reduced by a first fully connected layer and activated by an activation function, and the dimension is then increased by another fully connected layer to obtain $g_{2t}$, with $g_{2t} \in \mathbb{R}^{f \times c}$. The excitation process formula is as follows:

[0093] $$g_{2t} = \mathrm{sigmoid}(W_1\,\mathrm{ReLU}(W_2\,p_{2t}))$$

[0094] At the same time, after $F_c$ is input into the time compression excitation block, $F_c$ is compressed in the three dimensions of channel, length, and width using global average pooling to obtain $p_2$. The compression process formula is as follows:

[0095] $$p_2(t) = \frac{1}{c \times h \times w} \sum_{k=1}^{c} \sum_{i=1}^{h} \sum_{j=1}^{w} F_c(t, k, i, j)$$

[0096] where $p_2 \in \mathbb{R}^{f}$ represents the feature after compression along time.

[0097] After excitation, $p_2$ yields $g_2$, where $g_2 \in \mathbb{R}^{f}$ represents the activated feature. The excitation process formula is as follows:

[0098] $$g_2 = \mathrm{sigmoid}(w_1\,\mathrm{ReLU}(w_2\,p_2))$$

[0099] After $g_{2t}$ and $g_2$ are obtained, they are combined to form a combined feature $G_2$ that incorporates both time and channel information, with $G_2 \in \mathbb{R}^{f \times c}$. The formula for $G_2$ is as follows:

[0100] $$G_2 = g_2 \times g_{2t}$$

[0101] Finally, $G_2$ is used to weight $F_c$ to obtain the second feature $F_t$. The formula for $F_t$ is as follows:

[0102] $$F_t(t, k, i, j) = G_2(t, k)\,F_c(t, k, i, j)$$

[0103] Through the above steps of compression along the time dimension, $F_t$ is obtained.
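The time compression excitation block differs from the channel block mainly in which axes are pooled for the second path: channel, length, and width are averaged so that one value per time step remains. A hedged sketch follows. The parameter shapes, the use of separate matrices `w1` and `w2` acting on the time axis, and weighting by element-wise multiplication are assumptions for illustration; the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def excite(p, W_up, W_down):
    """sigmoid(W_up ReLU(W_down p)), applied along the last axis of p."""
    return sigmoid(np.maximum(p @ W_down.T, 0.0) @ W_up.T)


def time_se_block(F_c, W1, W2, w1, w2):
    """Time compression excitation for F_c of shape (f, c, h, w).

    W1 (c, c_red) and W2 (c_red, c) act on the channel axis of p_2t;
    w1 (f, f_red) and w2 (f_red, f) act on the time axis of p_2.
    """
    p_2t = F_c.mean(axis=(2, 3))          # compress length, width      -> (f, c)
    p_2 = F_c.mean(axis=(1, 2, 3))        # compress channel as well    -> (f,)
    g_2t = excite(p_2t, W1, W2)           # (f, c)
    g_2 = excite(p_2, w1, w2)             # (f,)
    G_2 = g_2[:, None] * g_2t             # g_2 broadcasts over channel -> (f, c)
    return G_2[:, :, None, None] * F_c    # weight F_c                  -> (f, c, h, w)
```

Applying a channel block and then this function gives the second feature, which keeps the input shape and can therefore be combined with the original skeletal joint features afterwards.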

[0104] It should be understood that the compression excitation network used in this embodiment helps the model prioritize useful features and suppress features that are of little use for the current task. Specifically, the channel compression excitation block explicitly models the interdependencies between feature channels, while the time compression excitation block filters out redundant information along the time dimension. Therefore, the 3D-SEResnet in this embodiment filters information along both the channel and time dimensions and can be easily integrated with ResNet.

[0105] Step S20: Extract features from the original acceleration features of the target action to obtain acceleration features;

[0106] As an example, in this embodiment, the provided multimodal fusion human activity recognition method preferably fuses human skeletal joint data and acceleration data. Therefore, before data fusion, it is necessary to extract the original acceleration features of the target action performed by the human body to obtain acceleration features.

[0107] Furthermore, the step of extracting acceleration features from the raw acceleration features of the target action to obtain acceleration features includes:

[0108] Feature extraction is performed on the original acceleration features of the target action based on the Transformer model to obtain acceleration features.

[0109] As an example, in this embodiment, acceleration features are extracted from the raw acceleration data using a lightweight Transformer model. For instance, suppose the raw acceleration feature matrix of a target action is $C$, with $C \in \mathbb{R}^{m \times d}$, where $m$ represents length and $d$ represents dimension. As shown in Figure 6, the lightweight Transformer in this embodiment preferably employs a multi-head self-attention mechanism, the specific formula of which is as follows:

[0110] $$q = CW_q, \quad k = CW_k, \quad v = CW_v$$

[0111] $$A = Sv$$

[0112] where $q$, $k$, and $v$ represent the query vector, key vector, and value vector, respectively; $W_q, W_k, W_v \in \mathbb{R}^{d \times e}$ all represent learning parameters; $e$ represents the dimension of $q$ and $k$; $S$ represents the weights; $m$ represents the network parameter (used to prevent excessively large values during matrix multiplication); and $A$ represents the feature matrix. Then, the acceleration features $CI$ of the target action are extracted from $C$ using the lightweight Transformer model described above, as shown in the following formulas:

[0113] $$A_l = \mathrm{MSA}(\mathrm{LN}(A_{l-1})) + A_{l-1}, \quad l = 1, 2, \ldots, L$$

[0114] $$A'_l = \mathrm{MLP}(\mathrm{LN}(A_l)) + A_l, \quad l = 1, 2, \ldots, L$$

[0115] $$CI = \mathrm{LN}(A'_l)$$

[0116] where $L$ represents the number of iterations or layers, MSA represents the multi-head self-attention mechanism, MLP represents a fully connected layer, and LN represents a normalization layer.
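The layer recursion above can be sketched in NumPy as follows. This is a simplified illustration, not the model of this embodiment. Several points are assumptions: the attention weights are computed as a softmax of scaled dot products with a square root of `e` scaling (the text describes the scaling term only as a network parameter), the output of one layer's MLP step is fed to the next layer, the MLP is two fully connected layers with a ReLU, the normalization has no learnable gain or bias, and the names `init_layer`, `W_o`, `W_in`, and `W_out` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    shifted = np.exp(x - x.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)


def attention(x, W_q, W_k, W_v):
    q, k, v = x @ W_q, x @ W_k, x @ W_v            # each (m, e)
    S = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # assumed sqrt(e) scaling
    return S @ v                                   # A = S v


def encoder_layer(A_prev, params):
    x = layer_norm(A_prev)
    heads = [attention(x, *head) for head in params["heads"]]
    A = np.concatenate(heads, axis=-1) @ params["W_o"] + A_prev   # MSA + residual
    hidden = np.maximum(layer_norm(A) @ params["W_in"], 0.0)
    return hidden @ params["W_out"] + A                           # MLP + residual


def extract_acceleration_features(C, layers):
    A = C
    for params in layers:
        A = encoder_layer(A, params)
    return layer_norm(A)                           # final normalization -> CI


def init_layer(rng, d, e, n_heads, hidden):
    scale = 0.1
    return {
        "heads": [tuple(scale * rng.standard_normal((d, e)) for _ in range(3))
                  for _ in range(n_heads)],
        "W_o": scale * rng.standard_normal((n_heads * e, d)),
        "W_in": scale * rng.standard_normal((d, hidden)),
        "W_out": scale * rng.standard_normal((hidden, d)),
    }
```

The residual additions require every layer to keep the input shape `(m, d)`, which is why the concatenated heads are projected back to `d` columns before being added to `A_prev`.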

[0117] Therefore, the lightweight Transformer in this embodiment makes the model more portable; the attention mechanism increases the weight of relevant acceleration sequences and decreases the weight of irrelevant ones, while the normalization layer speeds up convergence during training.

[0118] Step S30: Perform fusion processing on the skeletal joint features and the acceleration features to obtain multimodal fusion features;

[0119] As an example, in this embodiment, after obtaining the skeletal joint features and acceleration features corresponding to the target action, the modified Transformer model is used to fuse the skeletal joint features and acceleration features respectively to obtain multimodal fused features. It can be understood that the multimodal fused features include acceleration features CZ fused with skeletal joint features FI and skeletal joint features FZ fused with acceleration features CI.

[0120] Furthermore, the fusion processing of the skeletal joint features and the acceleration features to obtain multimodal fusion features includes:

[0121] The skeletal joint features are subjected to convolution and average pooling to obtain new key vectors and new query vectors, and new value vectors are obtained based on the skeletal joint features and their corresponding learning parameter matrices.

[0122] The key vector and value vector in the Transformer model are updated based on the new key vector and the new value vector, respectively, to obtain the first Transformer model;

[0123] The acceleration features are reshaped using the first Transformer model to obtain the fused acceleration features;

[0124] The query vector in the Transformer model is updated based on the new query vector to obtain the second Transformer model;

[0125] The skeletal joint features are reshaped using the second Transformer model to obtain the fused skeletal joint features.

[0126] As an example, in this embodiment, as shown in Figure 7, the lightweight Transformer model used for extracting $CI$ is first modified: a convolution operation is performed on the feature matrix of one frame in the skeletal joint features $FI$, using $e$ convolution kernels to convolve each channel, followed by average pooling to obtain $k_{FI}$, with $k_{FI} \in \mathbb{R}^{c \times e}$. At the same time, $v_{FI}$ is obtained by multiplying the skeletal joint features $FI$ by their corresponding learning parameter matrix. Then $k_{FI}$ replaces $k$ and $v_{FI}$ replaces $v$ in the Transformer model of step S20 to form a new Transformer model. With this new Transformer model, the fused features can be calculated and the fused acceleration features $CZ$ can be reshaped, as shown in the following formulas:

[0127] $$q = CW_q, \quad v_{FI} = FI_i W_{FI}$$

[0128] $$A = Sv_{FI}$$

[0129] $$A_l = \mathrm{MSA}(\mathrm{LN}(A_{l-1})) + A_{l-1}, \quad l = 1, 2, \ldots, L$$

[0130] $$A'_l = \mathrm{MLP}(\mathrm{LN}(A_l)) + A_l, \quad l = 1, 2, \ldots, L$$

[0131] $$CZ_i = \mathrm{LN}(A'_l)$$

[0132] where $W_{FI} \in \mathbb{R}^{hw \times e}$ represents the learning parameter matrix, $FI_i$ represents the skeletal joint feature matrix of the i-th frame, and $CZ_i$ represents the acceleration feature matrix of the i-th frame after incorporating skeletal joint features.

[0133] Furthermore, when reshaping $FZ$, the lightweight Transformer model used to extract $CI$ also needs to be modified: a convolution operation is performed on the feature matrix of one frame in $FI$, using $e$ convolution kernels to convolve each channel, followed by average pooling to obtain $q_{FI}$, with $q_{FI} \in \mathbb{R}^{c \times e}$. Then $q_{FI}$ replaces $q$ in the Transformer model of step S20 to form a new Transformer model. With this new Transformer model, the fused features can be calculated and the fused skeletal joint features $FZ$ can be reshaped, as shown in the following formulas:

[0134] $$k = CW_k, \quad v = CW_v$$

[0135] $$A = Sv$$

[0136] $$A_l = \mathrm{MSA}(\mathrm{LN}(A_{l-1})) + A_{l-1}, \quad l = 1, 2, \ldots, L$$

[0137] $$A'_l = \mathrm{MLP}(\mathrm{LN}(A_l)) + A_l, \quad l = 1, 2, \ldots, L$$

[0138] $$FZ_i = \mathrm{LN}(A'_l)$$

[0139] where $FZ_i$ represents the skeletal joint feature matrix of the i-th frame after incorporating acceleration features.
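The two cross-attention directions can be sketched for a single frame as follows. This illustrates only the attention core: the residual, MLP, and normalization steps are omitted. The assumptions are a square root of `e` scaling inside the softmax, "valid" sliding-window convolution (cross-correlation, as in most deep learning frameworks) followed by global average pooling, and a frame of shape `(c, h, w)` flattened to `(c, h*w)` before multiplication by $W_{FI}$. The dictionary keys and function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def softmax(x):
    shifted = np.exp(x - x.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)


def conv_avg_pool(FI_i, kernels):
    """Convolve every channel of FI_i (c, h, w) with e kernels (e, kh, kw),
    then average-pool each response map to a single number -> (c, e)."""
    windows = sliding_window_view(FI_i, kernels.shape[1:], axis=(1, 2))
    responses = np.einsum("cxyij,eij->cexy", windows, kernels)
    return responses.mean(axis=(2, 3))


def cross_attention(q, k, v):
    S = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # assumed sqrt(e) scaling
    return S @ v


def fuse_frame(FI_i, C, p):
    c = FI_i.shape[0]
    k_FI = conv_avg_pool(FI_i, p["key_kernels"])          # (c, e)
    q_FI = conv_avg_pool(FI_i, p["query_kernels"])        # (c, e)
    v_FI = FI_i.reshape(c, -1) @ p["W_FI"]                # (c, e)
    # Acceleration queries attend to skeleton keys and values -> basis of CZ_i.
    A_cz = cross_attention(C @ p["W_q"], k_FI, v_FI)      # (m, e)
    # Skeleton queries attend to acceleration keys and values -> basis of FZ_i.
    A_fz = cross_attention(q_FI, C @ p["W_k"], C @ p["W_v"])   # (c, e)
    return A_cz, A_fz
```

The shapes show why the replacement works without changing the rest of the model: swapping in $k_{FI}$ and $v_{FI}$ keeps the output at `m` rows (one per acceleration step), while swapping in $q_{FI}$ yields `c` rows (one per skeleton channel).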

[0140] Therefore, it can be seen that the cross-attention mechanism used in this embodiment can effectively integrate information from different modalities and perform cross-modal feature transformation in a more granular way. The process of injecting skeletal joint features into the acceleration feature extraction process and the process of injecting acceleration features into the skeletal joint feature extraction process both rely on the attention mechanism, thus appropriately enhancing the features.

[0141] Step S40: Based on the multimodal fusion features, perform identification to obtain the identification result corresponding to the target action.

[0142] As an example, in this embodiment, after fusing skeletal joint features and acceleration features separately using the modified Transformer model, the target action is identified based on the multimodal fusion features such as the fused acceleration features and fused skeletal joint features. That is, multimodal recognition of human activities is achieved by fusing skeletal joint data and acceleration data to obtain the recognition result of the target action. This overcomes the problem of data heterogeneity of different modalities and solves the problem of poor single-modal recognition effect, thereby effectively improving the accuracy of human activity recognition.

[0143] Furthermore, the step of identifying the target action based on the multimodal fusion features includes:

[0144] The fused acceleration features and the fused skeletal joint features are cascaded to obtain cascaded features.

[0145] The cascaded features are identified using a convolutional neural network to obtain the recognition result corresponding to the target action.

[0146] As an example, in this embodiment, the fused acceleration feature CZ and the fused skeletal joint feature FZ are concatenated, and the final motion recognition result t is obtained through a convolutional neural network. The specific formula is as follows:

[0147] t = MLP(concat([CZ,FZ]))

[0148] Here, concat denotes the concatenation (cascading) operation, and MLP denotes a three-layer convolutional neural network. Its first layer is a convolutional layer with f*(c+m) kernels of size h×w, followed by average pooling. The last two layers are fully connected layers that reduce the output to size u*1*1, giving the probabilities of the u actions; the action with the highest probability is taken as the recognition result of the target action.
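As a rough illustration of the recognition step t = MLP(concat([CZ, FZ])), the following NumPy sketch concatenates the two fused features along the channel axis, applies one convolution layer with average pooling, and then two fully connected layers. Several details are assumptions rather than statements of this embodiment: CZ and FZ are taken to have m and c channels with the same spatial size, f kernels are applied to each of the c+m channels (one reading of "f*(c+m)"), a ReLU sits between the fully connected layers, and a softmax produces the probabilities. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    ex = np.exp(x - x.max())
    return ex / ex.sum()

def recognize(CZ, FZ, kernels, W1, b1, W2, b2):
    """Return (probabilities over u actions, index of the most likely action)."""
    x = np.concatenate([CZ, FZ], axis=0)             # (c+m, H, W)
    kh, kw = kernels.shape[1:]
    patches = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(1, 2))
    # f kernels per channel give f*(c+m) feature maps (assumed interpretation)
    maps = np.einsum("cxyij,fij->cfxy", patches, kernels)
    pooled = maps.mean(axis=(2, 3)).reshape(-1)      # average pooling, (f*(c+m),)
    hidden = np.maximum(0.0, W1 @ pooled + b1)       # first FC layer, ReLU assumed
    probs = softmax(W2 @ hidden + b2)                # second FC layer, u outputs
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(0)
m, c, H, W, f, hidden_dim, u = 3, 5, 8, 8, 2, 16, 4  # hypothetical sizes
CZ = rng.normal(size=(m, H, W))
FZ = rng.normal(size=(c, H, W))
kernels = rng.normal(size=(f, 3, 3))
W1 = rng.normal(size=(hidden_dim, f * (c + m)))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(u, hidden_dim))
b2 = np.zeros(u)

probs, action = recognize(CZ, FZ, kernels, W1, b1, W2, b2)
print(probs.shape, action)
```

In a trained model the weights would be learned; random weights are used here only to show the data flow and the final arg-max selection.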

[0149] In summary, this embodiment employs a multimodal fusion approach to improve the accuracy of human activity recognition. Specifically, it first collects skeletal joint data and acceleration data from human activity and preprocesses the data; then, it establishes a spatiotemporal skeletal model (3D-SEResnet) to extract skeletal joint features from the skeletal joint data and uses a Transformer model to extract acceleration features from the acceleration data; next, it fuses the skeletal joint features and the acceleration features using a cross-attention mechanism to obtain the fused features; finally, it concatenates the fused skeletal joint features and the fused acceleration features, and uses a convolutional neural network to obtain the final motion recognition result. Therefore, this embodiment, based on neural networks, the Transformer, and multimodal fusion technology, overcomes the heterogeneity of data from different modalities and addresses the poor performance of single-modal recognition, thereby effectively improving the accuracy of human activity recognition.

[0150] This application also provides a human activity recognition device, including:

[0151] The feature extraction unit is used to extract features from the original skeletal joint features of the target action to obtain skeletal joint features; and to extract features from the original acceleration features of the target action to obtain acceleration features.

[0152] The feature fusion unit is used to fuse the skeletal joint features and the acceleration features to obtain multimodal fused features;

[0153] An action recognition unit is used to perform recognition based on the multimodal fusion features to obtain the recognition result corresponding to the target action.

[0154] Furthermore, the feature extraction unit is specifically used for:

[0155] Based on the channel compression excitation block in the preset spatiotemporal skeleton model, the original skeleton joint features are compressed, excited, and weighted in the channel dimension to obtain the first feature;

[0156] The first feature is compressed, excited, and weighted in the time dimension based on the time compression excitation block in the preset spatiotemporal skeleton model to obtain the second feature.

[0157] The skeletal joint features are obtained based on the second feature and the original skeletal joint features.
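The channel-then-time compression, excitation, and weighting described in the three paragraphs above resembles a squeeze-and-excitation design. The sketch below shows one possible reading in NumPy. The two-layer bottleneck with ReLU and sigmoid is borrowed from the standard squeeze-and-excitation block, and the residual addition of the original features at the end is an assumption about how the second feature and the original features are combined. Function names, the (C, T, H, W) layout, and the reduction ratio are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(x, axis, W1, W2):
    """Compress, excite, and weight x along one axis.

    x:    tensor of shape (C, T, H, W) (assumed layout)
    axis: 0 for the channel dimension, 1 for the time dimension
    """
    other = tuple(i for i in range(x.ndim) if i != axis)
    z = x.mean(axis=other)                           # compression: one value per index
    s = sigmoid(W2 @ np.maximum(0.0, W1 @ z))        # excitation: weights in (0, 1)
    shape = [1] * x.ndim
    shape[axis] = -1
    return x * s.reshape(shape)                      # weighting

def skeleton_features(x, ch_params, t_params):
    first = squeeze_excite(x, 0, *ch_params)         # first feature (channel block)
    second = squeeze_excite(first, 1, *t_params)     # second feature (time block)
    return second + x                                # residual combination (assumption)

rng = np.random.default_rng(0)
C, T, H, W, r = 4, 6, 5, 5, 2                        # hypothetical sizes, reduction r
x = rng.normal(size=(C, T, H, W))
ch_params = (rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r)))
t_params = (rng.normal(size=(T // r, T)), rng.normal(size=(T, T // r)))

out = skeleton_features(x, ch_params, t_params)
print(out.shape)                                     # (4, 6, 5, 5)
```

The output keeps the input shape, so blocks of this kind can be stacked inside a residual network.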

[0158] Furthermore, the feature extraction unit is specifically used for:

[0159] The channel compression excitation block compresses and excites the original skeletal joint features in both length and width dimensions to obtain the first activated features.

[0160] The channel compression excitation block compresses and excites the original skeletal joint features in the time, length, and width dimensions to obtain the second activated features.

[0161] The first activated feature and the second activated feature are combined to obtain the combined feature;

[0162] The combined feature and the original skeletal joint feature are weighted to obtain the first feature.

[0163] Furthermore, the feature extraction unit is specifically used for:

[0164] Based on the Transformer model, the original acceleration features of the target action are extracted to obtain acceleration features.

[0165] Furthermore, the feature fusion unit is specifically used for:

[0166] The skeletal joint features are subjected to convolution and average pooling to obtain new key vectors and new query vectors, and new value vectors are obtained based on the skeletal joint features and their corresponding learning parameter matrices.

[0167] The key vector and value vector in the Transformer model are updated based on the new key vector and the new value vector, respectively, to obtain the first Transformer model;

[0168] The acceleration features are reshaped using the first Transformer model to obtain the fused acceleration features;

[0169] The query vector in the Transformer model is updated based on the new query vector to obtain the second Transformer model;

[0170] The skeletal joint features are reshaped using the second Transformer model to obtain the fused skeletal joint features.
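The two directions of fusion listed above can be sketched together in NumPy. This is only an illustration of the data flow under stated assumptions: single-head scaled dot-product attention stands in for the full Transformer layers, the new key and query vectors k_f and q_f are assumed to have already been produced by convolution and average pooling, and all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention (assumed form)
    S = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return S @ v

def fuse(F_tokens, q_f, k_f, C, Wq_c, Wk_c, Wv_c, Wv_f):
    """Return (fused acceleration features CZ, fused skeletal joint features FZ).

    F_tokens: (c, d_f) skeletal joint features, one row per channel (assumed)
    q_f, k_f: (c, e)   new query and key vectors from convolution + pooling
    C:        (n, d_c) acceleration tokens
    """
    # First Transformer: key and value replaced by skeleton-derived vectors
    CZ = attend(C @ Wq_c, k_f, F_tokens @ Wv_f)      # (n, e)
    # Second Transformer: query replaced by the skeleton-derived vector
    FZ = attend(q_f, C @ Wk_c, C @ Wv_c)             # (c, e)
    return CZ, FZ

rng = np.random.default_rng(0)
c, e, n, d_c, d_f = 4, 8, 10, 5, 12                  # hypothetical sizes
F_tokens = rng.normal(size=(c, d_f))
q_f = rng.normal(size=(c, e))
k_f = rng.normal(size=(c, e))
C = rng.normal(size=(n, d_c))
Wq_c, Wk_c, Wv_c = (rng.normal(size=(d_c, e)) for _ in range(3))
Wv_f = rng.normal(size=(d_f, e))                     # learning parameter matrix

CZ, FZ = fuse(F_tokens, q_f, k_f, C, Wq_c, Wk_c, Wv_c, Wv_f)
print(CZ.shape, FZ.shape)                            # (10, 8) (4, 8)
```

Each modality keeps its own token count while drawing its attention context from the other modality.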

[0171] Furthermore, the action recognition unit is specifically used for:

[0172] The fused acceleration features and the fused skeletal joint features are cascaded to obtain cascaded features.

[0173] The cascaded features are identified using a convolutional neural network to obtain the recognition result corresponding to the target action.

[0174] Furthermore, the device also includes a data acquisition unit, which is used for:

[0175] Acquire raw acceleration data and raw skeletal joint data corresponding to multiple target actions performed by the human body;

[0176] The original acceleration data is segmented according to the video frames corresponding to each target action in the original skeletal joint data to obtain multiple original acceleration groups;

[0177] The multiple raw acceleration groups are aligned with the video frames corresponding to each target action in the raw skeletal joint data to obtain the raw skeletal joint features and raw acceleration features corresponding to each target action.
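The segmentation and alignment performed by the data acquisition unit might look like the following sketch. It assumes that the accelerometer and the video share a common clock, that each action is given as a range of video frame indices, and that alignment means resampling the acceleration stream to one vector per video frame by linear interpolation. None of these details is specified above, and the function and parameter names are hypothetical.

```python
import numpy as np

def segment_and_align(acc_t, acc_xyz, frame_t, actions):
    """Split the acceleration stream per action and resample it to video frames.

    acc_t:   (N,) accelerometer timestamps
    acc_xyz: (N, 3) accelerometer samples
    frame_t: (M,) video frame timestamps on the same clock (assumption)
    actions: list of (start_frame, end_frame) pairs, end exclusive
    """
    aligned = []
    for start, end in actions:
        t = frame_t[start:end]
        # Segmentation: keep the samples that fall inside the action's time span
        mask = (acc_t >= t[0]) & (acc_t <= t[-1])
        seg_t, seg = acc_t[mask], acc_xyz[mask]
        # Alignment: one interpolated acceleration vector per video frame
        per_frame = np.stack(
            [np.interp(t, seg_t, seg[:, k]) for k in range(seg.shape[1])], axis=1
        )
        aligned.append(per_frame)
    return aligned

acc_t = np.arange(0, 100, dtype=float)               # toy timestamps
acc_xyz = np.stack([acc_t, 2 * acc_t, np.zeros_like(acc_t)], axis=1)
frame_t = np.arange(0, 100, 5, dtype=float)          # one frame every 5 time units
actions = [(0, 4), (4, 10)]                          # two actions, by frame index

groups = segment_and_align(acc_t, acc_xyz, frame_t, actions)
print([g.shape for g in groups])                     # [(4, 3), (6, 3)]
```

The sketch assumes every action's time span contains at least one accelerometer sample; an empty segment would need separate handling.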

[0178] It should be noted that those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the device and each unit described above can be referred to the corresponding process in the aforementioned human activity recognition method embodiments, and will not be repeated here.

[0179] The apparatus provided in the above embodiments can be implemented as a computer program, which can run on, for example, the human activity recognition device shown in Figure 8.

[0180] This application also provides a human activity recognition device, including: a memory, a processor, and a network interface connected via a system bus. The memory stores at least one instruction, which is loaded and executed by the processor to implement all or part of the steps of the aforementioned human activity recognition method.

[0181] The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0182] A processor can be a CPU, or other general-purpose processors, DSPs (Digital Signal Processors), ASICs (Application Specific Integrated Circuits), FPGAs (Field Programmable Gate Arrays), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor, or any conventional processor. The processor is the control center of a computer device, connecting all parts of the computer device through various interfaces and lines.

[0183] Memory can be used to store computer programs and/or modules. The processor implements various functions of the computer device by running or executing the computer programs and/or modules stored in the memory, and by accessing data stored in the memory. Memory can mainly include a program storage area and a data storage area. The program storage area can store the operating system and at least one application program required for a function (such as video playback or image playback); the data storage area can store data created through use of the device (such as video data or image data). Furthermore, memory can include high-speed random access memory (RAM), and can also include non-volatile memory, such as hard disks, internal memory, plug-in hard disks, SMC (Smart Media Card), SD (Secure Digital) cards, flash cards, at least one disk storage device, flash memory device, or other solid-state storage devices.

[0184] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements all or part of the steps of the aforementioned human activity recognition method.

[0185] The embodiments of this application can implement all or part of the aforementioned processes, or they can be accomplished by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various methods described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, ROM (Read-Only memory), RAM (Random Access memory), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added to or subtracted according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0186] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, servers, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.

[0187] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0188] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

[0189] The above description is merely a specific embodiment of this application, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.

Claims

1. A method for recognizing human activity, characterized in that it includes the following steps: feature extraction is performed on the original skeletal joint features of the target action to obtain skeletal joint features; feature extraction is performed on the original acceleration features of the target action to obtain acceleration features; the skeletal joint features and the acceleration features are fused to obtain multimodal fused features; recognition is performed based on the multimodal fusion features to obtain the recognition result corresponding to the target action; wherein the step of fusing the skeletal joint features and the acceleration features to obtain multimodal fused features includes: the skeletal joint features are subjected to convolution and average pooling to obtain new key vectors and new query vectors, and new value vectors are obtained based on the skeletal joint features and their corresponding learning parameter matrices; the key vector and value vector in the Transformer model are updated based on the new key vector and the new value vector, respectively, to obtain the first Transformer model; the acceleration features are reshaped using the first Transformer model to obtain the fused acceleration features; the query vector in the Transformer model is updated based on the new query vector to obtain the second Transformer model; and the skeletal joint features are reshaped using the second Transformer model to obtain the fused skeletal joint features.

2. The human activity recognition method as described in claim 1, characterized in that the step of performing feature extraction on the original skeletal joint features of the target action to obtain skeletal joint features includes: based on the channel compression excitation block in the preset spatiotemporal skeleton model, the original skeletal joint features are compressed, excited, and weighted in the channel dimension to obtain the first feature; based on the time compression excitation block in the preset spatiotemporal skeleton model, the first feature is compressed, excited, and weighted in the time dimension to obtain the second feature; and the skeletal joint features are obtained based on the second feature and the original skeletal joint features.

3. The human activity recognition method as described in claim 2, characterized in that the step of compressing, exciting, and weighting the original skeletal joint features in the channel dimension, based on the channel compression excitation block in the preset spatiotemporal skeleton model, to obtain the first feature includes: the channel compression excitation block compresses and excites the original skeletal joint features in both length and width dimensions to obtain the first activated features; the channel compression excitation block compresses and excites the original skeletal joint features in the time, length, and width dimensions to obtain the second activated features; the first activated feature and the second activated feature are combined to obtain the combined feature; and the combined feature and the original skeletal joint feature are weighted to obtain the first feature.

4. The human activity recognition method as described in claim 1, characterized in that the step of performing feature extraction on the original acceleration features of the target action to obtain acceleration features includes: based on the Transformer model, feature extraction is performed on the original acceleration features of the target action to obtain acceleration features.

5. The human activity recognition method as described in claim 1, characterized in that the step of performing recognition based on the multimodal fusion features to obtain the recognition result corresponding to the target action includes: the fused acceleration features and the fused skeletal joint features are cascaded to obtain cascaded features; and the cascaded features are recognized using a convolutional neural network to obtain the recognition result corresponding to the target action.

6. The human activity recognition method as described in claim 1, characterized in that, before the step of performing feature extraction on the original skeletal joint features of the target action to obtain the skeletal joint features, the method further includes: raw acceleration data and raw skeletal joint data corresponding to multiple target actions performed by the human body are acquired; the raw acceleration data is segmented according to the video frames corresponding to each target action in the raw skeletal joint data to obtain multiple raw acceleration groups; and the multiple raw acceleration groups are aligned with the video frames corresponding to each target action in the raw skeletal joint data to obtain the raw skeletal joint features and raw acceleration features corresponding to each target action.

7. A human activity recognition device, characterized in that it includes: a feature extraction unit, used to perform feature extraction on the original skeletal joint features of the target action to obtain skeletal joint features, and to perform feature extraction on the original acceleration features of the target action to obtain acceleration features; a feature fusion unit, used to fuse the skeletal joint features and the acceleration features to obtain multimodal fused features; and an action recognition unit, used to recognize the target action based on the multimodal fusion features; wherein the feature fusion unit is specifically used for the following: the skeletal joint features are subjected to convolution and average pooling to obtain new key vectors and new query vectors, and new value vectors are obtained based on the skeletal joint features and their corresponding learning parameter matrices; the key vector and value vector in the Transformer model are updated based on the new key vector and the new value vector, respectively, to obtain the first Transformer model; the acceleration features are reshaped using the first Transformer model to obtain the fused acceleration features; the query vector in the Transformer model is updated based on the new query vector to obtain the second Transformer model; and the skeletal joint features are reshaped using the second Transformer model to obtain the fused skeletal joint features.

8. A human activity recognition device, characterized in that it includes a memory and a processor, wherein the memory stores at least one instruction, which is loaded and executed by the processor to implement the human activity recognition method according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the human activity recognition method according to any one of claims 1 to 6.