A cross-target gesture recognition method based on multi-modal fusion
A cross-target gesture recognition method based on multimodal fusion applies contrastive fusion learning and a generative network to WiFi and video data. It addresses feature loss and modality heterogeneity when recognizing new users and achieves higher recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHONGBEI UNIV
- Filing Date
- 2024-06-04
- Publication Date
- 2026-06-23
AI Technical Summary
Existing gesture recognition methods based on WiFi and video suffer from decreased accuracy when identifying new users due to a lack of features, and they also struggle to address the heterogeneity issues between WiFi and video.
A multimodal fusion cross-target gesture recognition method is adopted. Multimodal data, including WiFi data and video data, is acquired, and gesture features are extracted through contrastive fusion learning. A generative network performs cross-modal generation, a loss function is constructed for this generation, and the gesture recognition model is then trained. The method addresses the heterogeneity between WiFi and video and generates WiFi data for new users.
It effectively addresses the heterogeneity among multimodal data, improves recognition performance on cross-target tasks, and outperforms existing methods, with particularly good accuracy when recognizing new users.
Figure CN118568663B_ABST (abstract drawing; image not reproduced)
Description
Technical Field
[0001] This application relates to the field of gesture recognition technology, and more specifically, to a cross-target gesture recognition method based on multimodal fusion.
Background Technology
[0002] In recent years, with the rapid development of deep learning and artificial intelligence, gesture recognition technology has seen increasingly wider applications in fields such as smart homes, healthcare, and autonomous driving. Advances in this technology have led to continuous improvements in gesture recognition methods, bringing higher accuracy and efficiency to various fields. These methods primarily encompass a variety of gesture recognition approaches, including those based on Wi-Fi and video.
[0003] With the widespread deployment of WiFi worldwide, the role of WiFi signals has expanded from wireless communication alone to wireless sensing. Wireless sensing relies on the fact that WiFi signals undergo reflection, scattering, and diffraction during propagation, forming multiple propagation paths. The movement of objects affects this propagation and thereby alters the phase and amplitude of the Channel State Information (CSI) at the physical layer of the WiFi signal. By analyzing changes in the CSI, wireless sensing can perceive the surrounding space and human activity. To improve sensing performance, researchers are exploring both traditional machine learning algorithms and deep learning-based methods. Traditional machine learning approaches analyze the time-domain and frequency-domain features of CSI data and use algorithms such as SVM and random forests to build classification models for gesture recognition. Deep learning-based methods design deep neural networks, comprising multiple convolutional, pooling, and fully connected layers, to extract and classify environment-independent behavioral features. Furthermore, some methods employ generative models to generate CSI data for different scenarios and targets so that the model remains robust across domains.
With the rise of deep learning, video-based gesture recognition has also made significant progress. Most methods use large-scale datasets and deep learning models to achieve accurate recognition by learning the semantic information of gestures in videos, and the introduction of transfer learning and self-supervised learning improves the generalization and adaptability of the models. Video-based gesture recognition therefore has broad application prospects in areas such as smart homes and game control.
[0004] While existing gesture recognition methods based on WiFi and video have achieved good accuracy, WiFi relies on physical phenomena such as reflection for sensing, which results in lower spatial resolution and limits its recognition capability. Video data contains rich spatial information, but it is often affected by factors such as lighting changes and a limited viewing angle. Integrating the advantages of both modalities therefore helps improve gesture recognition accuracy. With the widespread application of multimodal learning in human-computer interaction, multimodal fusion gesture recognition methods based on WiFi and video have also received considerable attention. However, existing methods struggle to address the heterogeneity between WiFi and video, and the lack of features when new users join leads to a decrease in gesture recognition accuracy.
Summary of the Invention
[0005] To overcome at least one deficiency in the prior art, this application provides a cross-target gesture recognition method based on multimodal fusion.
[0006] In a first aspect, a cross-target gesture recognition method based on multimodal fusion is provided, including:
[0007] Acquire multimodal data, which includes WiFi data and video data. The WiFi data is the WiFi data of the source domain user when performing gesture actions, and the video data includes gesture video data of the source domain user when performing gesture actions and daily life video data of the target domain user.
[0008] Contrastive fusion learning is performed on the WiFi data and gesture video data of the source domain user performing gesture actions to extract gesture features.
[0009] Human skeleton point data is extracted from the gesture video data of the source domain user when performing gesture actions, and the spatial matrix of the source domain user is determined based on the skeleton point data; human skeleton point data is extracted from the daily life video data of the target domain user, and the spatial matrix of the target domain user is determined based on the skeleton point data.
[0010] Based on the generative network, cross-modal generation is performed on the WiFi data of the source domain users to obtain the WiFi data of the target domain users; during the cross-modal generation process, the loss function is constructed based on the spatial matrix of the source domain users and the spatial matrix of the target domain users.
[0011] The gesture recognition model is trained based on gesture features and WiFi data of target domain users to obtain the trained gesture recognition model.
[0012] The WiFi data of the target domain user to be identified is input into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
[0013] In one embodiment, performing contrastive fusion learning on the WiFi data and gesture video data of the source domain user performing gesture actions to extract gesture features includes:
[0014] WiFi features are obtained by performing 2D convolution on WiFi data.
[0015] 3D convolution is used to extract features from gesture video data to obtain video features;
[0016] The fused features are obtained by weighting and combining WiFi features and video features;
[0017] Based on the fusion features, a contrastive learning method is used to extract gesture features.
[0018] In one embodiment, extracting human skeleton point data from gesture video data when a source domain user performs a gesture action, and determining the spatial matrix of the source domain user based on the skeleton point data, includes:
[0019] Based on gesture video data, a 3D human pose estimation method was used to extract data from 17 skeletal points of the human body; each skeletal point data includes coordinates in three dimensions: X, Y, and Z.
[0020] The spatial matrix of the source domain user is determined based on 17 skeletal point data. The spatial matrix of the source domain user includes spatial matrices with three dimensions: X, Y, and Z. The spatial matrix of the X dimension is expressed by the following formula:
[0021]
[0022] where $X'_{i,j}$ is the value of the element in the i-th row and j-th column of the X-dimension spatial matrix, $x_i$ is the X-dimension coordinate of the i-th skeletal point, and $x_j$ is the X-dimension coordinate of the j-th skeletal point.
[0023] In one embodiment, the overall loss function in the cross-modal generation process is:
[0024]
[0025] where the terms are, in order, the total loss function, the first adversarial loss, the second adversarial loss, the cycle consistency loss, and the spatial difference loss; λ1 and λ2 are two hyperparameters.
[0026] The first adversarial loss and the second adversarial loss are expressed by the following formulas:
[0027]
[0028] where G1 and G2 are two generators, and $D_Y$ and $D_X$ are two discriminators; X is the dataset of source domain users, Y is the dataset of target domain users, x is a sample in X, and y is a sample in Y; $P_{data}(y)$ denotes the distribution of y, $P_{data}(x)$ denotes the distribution of x, and E denotes the expectation; $D_Y(y)$ denotes the output obtained by inputting y into $D_Y$, G1(x) denotes the output obtained by inputting x into G1, $D_X(x)$ denotes the output obtained by inputting x into $D_X$, and G2(y) denotes the output obtained by inputting y into G2;
[0029] The cycle consistency loss is expressed by the following formula:
[0030]
[0031] where $\|\cdot\|_1$ is the 1-norm;
[0032] The spatial difference loss is expressed by the following formula:
[0033]
[0034] where n represents the number of dimensions, $X'_t$ is the X-dimension spatial matrix of the target domain user, $X'_s$ is the X-dimension spatial matrix of the source domain user, $Y'_t$ is the Y-dimension spatial matrix of the target domain user, $Y'_s$ is the Y-dimension spatial matrix of the source domain user, $Z'_t$ is the Z-dimension spatial matrix of the target domain user, and $Z'_s$ is the Z-dimension spatial matrix of the source domain user.
[0035] In one embodiment, the method further includes:
[0036] The WiFi data is denoised, and the start and end points of the gesture data in the WiFi data are determined in order to extract the gesture data in the WiFi data.
[0037] Background removal is performed on the video data, and the start and end points of the gesture data in the video data are determined in order to extract the gesture data from the video data.
[0038] In a second aspect, a cross-target gesture recognition device based on multimodal fusion is provided, comprising:
[0039] The data acquisition module is used to acquire multimodal data, which includes WiFi data and video data. The WiFi data is the WiFi data when the source domain user performs a gesture action, and the video data includes gesture video data when the source domain user performs a gesture action and daily life video data of the target domain user.
[0040] The contrastive fusion learning module is used to perform contrastive fusion learning on the WiFi data and gesture video data of the source domain user performing gesture actions to extract gesture features.
[0041] The spatial matrix determination module is used to extract human skeleton point data from the gesture video data of the source domain user when performing gesture actions, and determine the spatial matrix of the source domain user based on the skeleton point data; and to extract human skeleton point data from the daily life video data of the target domain user, and determine the spatial matrix of the target domain user based on the skeleton point data.
[0042] The cross-modal generation module is used to generate WiFi data for target domain users based on the source domain users' WiFi data using the generation network. During the cross-modal generation process, the loss function is constructed based on the spatial matrix of the source domain users and the spatial matrix of the target domain users.
[0043] The model training module is used to train the gesture recognition model based on gesture features and WiFi data of target domain users to obtain the trained gesture recognition model.
[0044] The recognition module is used to input the WiFi data of the target domain user to be identified into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
[0045] In one embodiment, the contrastive fusion learning module is further configured to:
[0046] WiFi features are obtained by performing 2D convolution on WiFi data.
[0047] 3D convolution is used to extract features from gesture video data to obtain video features;
[0048] The fused features are obtained by weighting and combining WiFi features and video features;
[0049] Based on the fusion features, a contrastive learning method is used to extract gesture features.
[0050] In one embodiment, the spatial matrix determination module is further configured to:
[0051] Based on gesture video data, a 3D human pose estimation method was used to extract data from 17 skeletal points of the human body; each skeletal point data includes coordinates in three dimensions: X, Y, and Z.
[0052] The spatial matrix of the source domain user is determined based on 17 skeletal point data. The spatial matrix of the source domain user includes spatial matrices with three dimensions: X, Y, and Z. The spatial matrix of the X dimension is expressed by the following formula:
[0053]
[0054] where $X'_{i,j}$ is the value of the element in the i-th row and j-th column of the X-dimension spatial matrix, $x_i$ is the X-dimension coordinate of the i-th skeletal point, and $x_j$ is the X-dimension coordinate of the j-th skeletal point.
[0055] In a third aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the aforementioned cross-target gesture recognition method based on multimodal fusion.
[0056] In a fourth aspect, a computer program product is provided, including a computer program/instructions which, when executed by a processor, implement the aforementioned cross-target gesture recognition method based on multimodal fusion.
[0057] Compared with existing technologies, this application has the following advantages: the cross-target gesture recognition method based on multimodal fusion uses a contrastive fusion learning method to extract gesture features from WiFi data and video data, effectively addressing the heterogeneity of the two modalities. Furthermore, it uses a cross-modal generation method to obtain WiFi data of new users, thus addressing the problem of missing target features caused by changes in the sensed target. The method shows good performance in cross-target tasks and is superior to existing multimodal fusion methods.
Attached Figure Description
[0058] This application can be better understood by referring to the description given below in conjunction with the accompanying drawings, which, together with the detailed description below, are incorporated in and form part of this specification. In the drawings:
[0059] Figure 1 shows a flowchart of a cross-target gesture recognition method based on multimodal fusion according to an embodiment of this application;
[0060] Figure 2 shows the experimental scenario;
[0061] Figure 3 shows performance comparisons between the method of this application and four other multimodal fusion methods, wherein (a) uses the collected dataset MixGR and (b) uses the publicly available dataset MM-Fi;
[0062] Figure 4 shows a performance comparison between the method of this application and single-modal gesture recognition methods;
[0063] Figure 5 shows the gesture recognition performance of the method of this application in a cross-target scenario, wherein (a) uses the collected dataset MixGR and (b) uses the publicly available dataset MM-Fi;
[0064] Figure 6 compares the accuracy of cross-target gesture recognition using WiFi data generated with cross-modal generation versus WiFi data generated without it, wherein (a) shows the comparison under different orientations and (b) shows the comparison under the same orientation;
[0065] Figure 7 compares the accuracy of cross-target gesture recognition with and without the contrastive fusion learning method, wherein (a) shows the comparison under different orientations and (b) shows the comparison under the same orientation;
[0066] Figure 8 compares gesture recognition accuracy under different methods, wherein (a) shows the comparison under different orientations and (b) shows the comparison under the same orientation;
[0067] Figure 9 shows a structural block diagram of a cross-target gesture recognition device based on multimodal fusion according to an embodiment of this application.
Detailed Implementation
[0068] Exemplary embodiments of the present application will be described below with reference to the accompanying drawings. For clarity and brevity, not all features of the actual embodiments are described in the specification. However, it should be understood that many embodiment-specific decisions can be made in the development of any such actual embodiment to achieve the developer’s specific objectives, and these decisions may vary as the embodiments differ.
[0069] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the device structure closely related to the solution according to this application is shown in the accompanying drawings, while other details that are not closely related to this application are omitted.
[0070] It should be understood that this application is not limited to the described embodiments by virtue of the following description with reference to the accompanying drawings. In this document, embodiments may be combined with each other, features may be substituted or borrowed between different embodiments, and one or more features may be omitted in one embodiment, where feasible.
[0071] This application provides a cross-target gesture recognition method based on multimodal fusion. Figure 1 shows a flowchart of the method according to an embodiment of this application. Referring to Figure 1, the method includes:
[0072] Step S1: Acquire multimodal data, which includes WiFi data and video data. The WiFi data is the WiFi data of the source domain user when performing the gesture action, and the video data includes gesture video data of the source domain user when performing the gesture action and daily life video data of the target domain user.
[0073] In terms of data acquisition, WiFi data is obtained by extracting Channel State Information (CSI) from its physical layer. This information provides valuable data such as the amplitude and phase of subcarriers. However, due to environmental noise interference, CSI becomes unstable, making it impossible to accurately reflect gesture characteristics. Therefore, a Butterworth filter is used to denoise the WiFi data. Since performing gestures causes significant fluctuations in the CSI signal, a variance thresholding method is used to determine the start and end points of gesture data in the WiFi data to extract the gesture data.
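The denoising and segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the filter order (second), cutoff frequency, sampling rate, window length, and threshold are all assumed values, and the function names are hypothetical. The sketch low-pass filters one CSI amplitude stream with a Butterworth filter and then takes the gesture to be the span where the sliding-window variance exceeds a threshold.

```python
import math


def butterworth_lowpass(signal, cutoff_hz, sample_rate_hz):
    """Second-order Butterworth low-pass filter (bilinear-transform biquad)."""
    if not signal:
        return []
    k = math.tan(math.pi * cutoff_hz / sample_rate_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b1 = 2.0 * b0
    b2 = b0
    a1 = 2.0 * (k * k - 1.0) * norm
    a2 = (1.0 - math.sqrt(2.0) * k + k * k) * norm
    # Start the filter state at the first sample to avoid a start-up transient.
    x1 = x2 = y1 = y2 = signal[0]
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def sliding_variance(signal, window):
    """Variance of each length-`window` slice, one value per start index."""
    variances = []
    for start in range(len(signal) - window + 1):
        chunk = signal[start:start + window]
        mean = sum(chunk) / window
        variances.append(sum((v - mean) ** 2 for v in chunk) / window)
    return variances


def gesture_bounds(signal, window, threshold):
    """Return (start, end) sample indices of the high-variance span, or None."""
    variances = sliding_variance(signal, window)
    active = [i for i, v in enumerate(variances) if v > threshold]
    if not active:
        return None
    return active[0], active[-1] + window - 1


# Synthetic CSI amplitude: static, then an oscillating "gesture", then static.
csi = (
    [1.0] * 100
    + [1.0 + 0.5 * math.sin(2 * math.pi * 2 * n / 100) for n in range(100)]
    + [1.0] * 100
)
clean = butterworth_lowpass(csi, cutoff_hz=10.0, sample_rate_hz=100.0)
bounds = gesture_bounds(clean, window=20, threshold=0.01)
```

In practice the threshold would need tuning per environment, since the static-period variance of real CSI depends on the noise level.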
[0074] The video data was acquired as RGB video. To reduce the data volume while preserving the original gesture features, it was converted to grayscale video. Subsequently, a segmentation method was used to separate the user from the background to eliminate background interference with the gesture features. Then, optical flow technology was used to calculate the average optical flow intensity of each gesture frame, determining the start and end points of the gesture data in the video data to extract the gesture data.
[0075] Step S2: Perform contrastive fusion learning on the WiFi data and gesture video data of the source domain user performing gesture actions to extract gesture features.
[0076] After data preprocessing, high-quality WiFi and video data were obtained. However, WiFi data is two-dimensional, while video data is three-dimensional, and their data formats are different. Therefore, this step employs a contrastive fusion learning method to address the heterogeneity of the two types of data and enable the model to learn the features of different types of gestures.
[0077] Step S3: Extract human skeleton point data from the gesture video data of the source domain user when performing gesture actions, and determine the spatial matrix of the source domain user based on the skeleton point data; extract human skeleton point data from the daily life video data of the target domain user, and determine the spatial matrix of the target domain user based on the skeleton point data.
[0078] Step S4: Based on the generative network, cross-modal generation is performed on the WiFi data of the source domain users to obtain the WiFi data of the target domain users; during the cross-modal generation process, the loss function is constructed based on the spatial matrix of the source domain users and the spatial matrix of the target domain users.
[0079] Step S5: Train the gesture recognition model based on gesture features and WiFi data of users in the target domain to obtain the trained gesture recognition model. Here, the gesture recognition model can use traditional network models such as Convolutional Neural Networks (CNN) and classifiers.
[0080] Step S6: Input the WiFi data of the target domain user to be identified into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
[0081] The multimodal fusion cross-target gesture recognition method of this embodiment uses a contrastive fusion learning method to extract gesture features from WiFi data and video data, effectively solving the heterogeneity of the two modalities. Furthermore, it uses a cross-modal generation method to obtain WiFi data of new users to solve the problem of missing target features caused by changes in perceived targets. The method of this application has shown good performance in cross-target tasks and is superior to existing multimodal fusion methods.
[0082] In one embodiment, step S2, performing contrastive fusion learning on the WiFi data and gesture video data of the source domain user performing gesture actions to extract gesture features, includes:
[0083] WiFi features are obtained by performing 2D convolution on WiFi data.
[0084] 3D convolution is used to extract features from gesture video data to obtain video features;
[0085] The fused features are obtained by weighting and combining WiFi features and video features;
[0086] Based on the fusion features, a contrastive learning method is used to extract gesture features.
[0087] In contrastive learning, any enhanced feature can be selected as an anchor. Throughout training, the similarity between features is learned by minimizing the distance between the anchor and positive features while maximizing the distance with negative features. Contrastive learning enhances the similarity between features of the same category and highlights the differences between features of different categories. This approach helps to better preserve the semantic information of different categories during feature fusion, thereby extracting gesture features.
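The fusion and contrastive steps above can be illustrated with a small sketch. The text does not give the fusion weights or the exact contrastive objective, so the following are assumptions: a single scalar weight `alpha`, cosine similarity, and a supervised InfoNCE-style loss with a temperature parameter. The feature vectors stand in for the outputs of the 2D and 3D convolutions, and all names are hypothetical.

```python
import math


def fuse(wifi_feat, video_feat, alpha=0.5):
    """Weighted combination of two equal-length feature vectors."""
    return [alpha * w + (1.0 - alpha) * v for w, v in zip(wifi_feat, video_feat)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(features, labels, anchor, temperature=0.1):
    """Supervised InfoNCE-style loss for one anchor index.

    Samples sharing the anchor's label are positives; all others are negatives.
    The loss is small when positives are much closer to the anchor than negatives.
    """
    others = [i for i in range(len(features)) if i != anchor]
    exp_sims = {
        i: math.exp(cosine(features[anchor], features[i]) / temperature)
        for i in others
    }
    denom = sum(exp_sims.values())
    positives = [i for i in others if labels[i] == labels[anchor]]
    if not positives:
        raise ValueError("anchor needs at least one positive sample")
    return -sum(math.log(exp_sims[i] / denom) for i in positives) / len(positives)


# Four fused samples: the first two point one way, the last two another.
features = [
    fuse([1.0, 0.0], [0.9, 0.1]),
    fuse([0.95, 0.05], [1.0, 0.0]),
    fuse([0.0, 1.0], [0.1, 0.9]),
    fuse([0.05, 0.95], [0.0, 1.0]),
]
loss_good = contrastive_loss(features, [0, 0, 1, 1], anchor=0)  # labels match clusters
loss_bad = contrastive_loss(features, [0, 1, 0, 1], anchor=0)   # labels cut across clusters
```

Here `loss_good` is lower than `loss_bad`, which is the behavior the paragraph describes: training pulls same-gesture features toward the anchor and pushes other gestures away.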
[0088] In one embodiment, step S3 involves extracting human skeleton point data from the gesture video data of the source domain user performing the gesture action, and determining the spatial matrix of the source domain user based on the skeleton point data, including:
[0089] Based on gesture video data, a 3D human pose estimation method was used to extract data from 17 skeletal points of the human body; each skeletal point data includes coordinates in three dimensions: X, Y, and Z.
[0090] To strengthen the connections between skeletal points, a spatial matrix for the source domain user was determined based on data from 17 skeletal points. This spatial matrix comprises three dimensions: X, Y, and Z. Integrating these three dimensions reflects the spatial variations of the gesture. The X-dimensional spatial matrix is represented by the following formula:
[0091]
[0092] where $X'_{i,j}$ is the value of the element in the i-th row and j-th column of the X-dimension spatial matrix, $x_i$ is the X-dimension coordinate of the i-th skeletal point, and $x_j$ is the X-dimension coordinate of the j-th skeletal point.
[0093] It should be noted that the process of extracting human skeleton point data from the daily life video data of the target domain user in step S3 and determining the spatial matrix of the target domain user based on the skeleton point data is the same as the process of determining the spatial matrix of the source domain user in this embodiment, and will not be repeated here.
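As an illustration of how a spatial matrix could be built from the 17 skeletal points, the sketch below fills entry (i, j) with the difference between the i-th and j-th coordinates in one dimension. The formula itself is not reproduced in this text, so the pairwise-difference form is an assumption inferred from the definitions of $x_i$ and $x_j$; the function names and the sample skeleton are hypothetical.

```python
def spatial_matrix(coords):
    """Pairwise matrix: entry (i, j) is coords[i] - coords[j] (assumed form)."""
    return [[ci - cj for cj in coords] for ci in coords]


def spatial_matrices(skeleton):
    """Map a list of (x, y, z) skeletal points to one matrix per dimension."""
    xs, ys, zs = zip(*skeleton)
    return {
        "X": spatial_matrix(xs),
        "Y": spatial_matrix(ys),
        "Z": spatial_matrix(zs),
    }


# Made-up skeleton with 17 points, for illustration only.
skeleton = [(0.1 * k, 0.05 * k, 1.0) for k in range(17)]
mats = spatial_matrices(skeleton)
```

Each of the three matrices is 17 x 17, and the same routine applies unchanged to skeletons extracted from the target domain user's daily life video.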
[0094] In one embodiment, in step S4, the generative network learns the distribution of the data and generates data that closely resembles real data. Most generative networks are applied in fields such as images and text. In this embodiment, step S4 uses generated WiFi data to compensate for the missing features. To meet the needs of cross-target tasks, a cycle-consistent generative adversarial network (CycleGAN) is used as the backbone, which mainly consists of three parts: a generator, a discriminator, and a loss function.
[0095] The generator consists of two parts: G1 and G2. G1 converts source domain WiFi data into target domain WiFi data, while G2 converts target domain WiFi data back into source domain WiFi data. This bidirectional conversion process ensures the integrity and consistency of data from different domains, thereby guaranteeing the quality of the generated WiFi data. G1 and G2 employ a residual network structure to avoid losing gesture features during the generation process.
[0096] The discriminator plays a crucial role in generative adversarial networks (GANs), responsible for evaluating the authenticity of generated gesture samples. Discriminators were designed within the generative network to differentiate the authenticity of generated samples in the source and target domains of WiFi data. The discriminator captures local and global features of the WiFi data using methods such as convolution and normalization, and then utilizes these features to evaluate the authenticity of the generated WiFi signals.
[0097] The loss function consists of three parts: adversarial loss, cycle consistency loss, and spatial difference loss. The adversarial loss measures the difference between the WiFi data generated by the generator and real WiFi data. As the generator is continuously optimized, its generated data gradually approximates real data and thus deceives the discriminator. The cycle consistency loss ensures that the generated WiFi data can be restored to the original data after passing through generators G1 and G2, which means that the generator must not only generate realistic WiFi data but also retain the gesture features in the WiFi data. The spatial difference loss measures the difference between the spatial matrices of the source and target domain users. Because directly subtracting the matrices cannot effectively highlight significant changes in spatial features, the mean squared error between the matrices is used instead.
[0098] The overall loss function in the cross-modal generation process:
[0099]
[0100] where the terms are, in order, the total loss function, the first adversarial loss, the second adversarial loss, the cycle consistency loss, and the spatial difference loss; λ1 and λ2 are two hyperparameters.
[0101] The first adversarial loss and the second adversarial loss are expressed by the following formulas:
[0102]
[0103] where G1 and G2 are two generators, and $D_Y$ and $D_X$ are two discriminators; X is the dataset of source domain users, Y is the dataset of target domain users, x is a sample in X, and y is a sample in Y; $P_{data}(y)$ denotes the distribution of y, $P_{data}(x)$ denotes the distribution of x, and E denotes the expectation; $D_Y(y)$ denotes the output obtained by inputting y into $D_Y$, G1(x) denotes the output obtained by inputting x into G1, $D_X(x)$ denotes the output obtained by inputting x into $D_X$, and G2(y) denotes the output obtained by inputting y into G2; $D_Y(G1(x))$ denotes the output obtained by inputting G1(x) into $D_Y$, and $D_X(G2(y))$ denotes the output obtained by inputting G2(y) into $D_X$.
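The quantities defined above match the terms of the standard adversarial objective for a cycle-consistent GAN. Assuming that standard form applies here (the formula is not reproduced in this text, and the loss symbols below are illustrative notation, with G1 and G2 written as $G_1$ and $G_2$), the two adversarial losses would read:

```latex
\mathcal{L}_{adv}^{(1)}(G_1, D_Y, X, Y)
  = \mathbb{E}_{y \sim P_{data}(y)}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x \sim P_{data}(x)}\big[\log\big(1 - D_Y(G_1(x))\big)\big]

\mathcal{L}_{adv}^{(2)}(G_2, D_X, Y, X)
  = \mathbb{E}_{x \sim P_{data}(x)}\big[\log D_X(x)\big]
  + \mathbb{E}_{y \sim P_{data}(y)}\big[\log\big(1 - D_X(G_2(y))\big)\big]
```

Under this form, each discriminator is trained to maximize its loss while the corresponding generator is trained to minimize it.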
[0104] The cycle consistency loss is expressed by the following formula:
[0105]
[0106] where $\|\cdot\|_1$ is the 1-norm;
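A cycle consistency term built on the 1-norm can be sketched as below. It assumes the usual two-direction form, in which a source sample is passed through G1 and then G2 and compared with itself, and likewise for a target sample through G2 and then G1. The toy generators are placeholders for the residual networks described earlier, and all names are hypothetical.

```python
def l1_distance(a, b):
    """1-norm of the element-wise difference between two vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cycle_consistency_loss(g1, g2, source_batch, target_batch):
    """Mean L1 reconstruction error over both translation cycles."""
    forward = sum(l1_distance(g2(g1(x)), x) for x in source_batch) / len(source_batch)
    backward = sum(l1_distance(g1(g2(y)), y) for y in target_batch) / len(target_batch)
    return forward + backward


# Toy generators: g1 and g2 are exact inverses, g2_bad is not.
def g1(x):
    return [v + 1.0 for v in x]


def g2(y):
    return [v - 1.0 for v in y]


def g2_bad(y):
    return [v - 0.5 for v in y]


source_batch = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
target_batch = [[1.0, 2.0, 3.0]]
loss_consistent = cycle_consistency_loss(g1, g2, source_batch, target_batch)
loss_inconsistent = cycle_consistency_loss(g1, g2_bad, source_batch, target_batch)
```

With exact inverses the loss is zero; with `g2_bad` every element is off by 0.5 after a round trip, so each three-element sample contributes 1.5 per direction.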
[0107] The spatial difference loss is expressed by the following formula:
[0108]
[0109] where n represents the number of dimensions, $X'_t$ is the X-dimension spatial matrix of the target domain user, $X'_s$ is the X-dimension spatial matrix of the source domain user, $Y'_t$ is the Y-dimension spatial matrix of the target domain user, $Y'_s$ is the Y-dimension spatial matrix of the source domain user, $Z'_t$ is the Z-dimension spatial matrix of the target domain user, and $Z'_s$ is the Z-dimension spatial matrix of the source domain user.
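The spatial difference loss and its place in the total loss can be sketched as follows. Two assumptions are made because the formulas are not reproduced in this text: the spatial term averages the per-dimension mean squared error over the n = 3 dimensions, and the total loss adds the two adversarial terms to the cycle consistency and spatial terms weighted by λ1 and λ2 respectively. The function names and numeric values are hypothetical.

```python
def matrix_mse(a, b):
    """Mean squared error between two equally sized matrices."""
    total = sum((p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def spatial_difference_loss(target, source):
    """Average the per-dimension MSE over the n dimensions (assumed form)."""
    dims = sorted(target)
    return sum(matrix_mse(target[d], source[d]) for d in dims) / len(dims)


def total_loss(adv1, adv2, cyc, spa, lambda1, lambda2):
    """Assumed weighting: adversarial terms plus weighted cycle and spatial terms."""
    return adv1 + adv2 + lambda1 * cyc + lambda2 * spa


# Tiny 2 x 2 matrices stand in for the 17 x 17 spatial matrices.
source = {
    "X": [[0.0, 1.0], [-1.0, 0.0]],
    "Y": [[0.0, 2.0], [-2.0, 0.0]],
    "Z": [[0.0, 0.0], [0.0, 0.0]],
}
target = {
    "X": [[1.0, 2.0], [0.0, 1.0]],  # every entry differs from source by 1
    "Y": source["Y"],
    "Z": source["Z"],
}
spa = spatial_difference_loss(target, source)
loss = total_loss(adv1=0.5, adv2=0.5, cyc=2.0, spa=spa, lambda1=10.0, lambda2=3.0)
```

Squaring the differences, rather than subtracting the matrices directly, makes large structural mismatches dominate the term, which is the motivation the description gives for using the mean squared error.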
[0110] To further verify the effectiveness of the cross-target gesture recognition method based on multimodal fusion proposed in this application, the following experimental analysis was conducted.
[0111] The proposed cross-target gesture recognition method was evaluated using the collected multimodal dataset MixGR and the publicly available multimodal dataset MM-Fi. As shown in Table 1, MixGR includes two modalities: WiFi and video, while MM-Fi includes five modalities: WiFi, video, depth video, LiDAR, and millimeter-wave radar.
[0112] Table 1: Datasets
[0113]
| Dataset | Modalities | Volunteers | Gesture categories | Sample size |
|---|---|---|---|---|
| MixGR | WiFi, video | 10*5 | 6 | 1800 |
| MM-Fi | WiFi, video, depth video, LiDAR, millimeter-wave radar | 10*4 | 9 | 8640 |
[0114] 1. Introduction to the dataset
[0115] The MixGR dataset was collected from 10 volunteers (including 3 women) at 5 different locations, with each volunteer performing 6 gestures. Figure 2 illustrates the experimental setup. During data collection, a camera recorded video of the volunteers' gestures from different positions. In addition, two microcomputers equipped with Intel 5300 network cards served as the WiFi transmitter and receiver, respectively. The transmitter and the receiver were each equipped with three antennas, and each transmit-receive antenna pair recorded 30 subcarriers, giving 3 x 3 x 30 = 270 subcarriers per WiFi recording.
[0116] MM-Fi includes data from five modalities: video, depth video, WiFi, LiDAR, and millimeter-wave radar. The dataset was collected in four different environments, each with 10 volunteers performing 27 activities (including 9 gestures), each containing 297 frames. The experiments primarily used two modalities: video and WiFi. Video data was collected using a camera at 640x480 resolution. WiFi data was collected using a TP-Link N750 and the Atheros CSI tool on the 5 GHz band with a 40 MHz bandwidth. The transmitter and receiver formed three antenna pairs, each recording 114 subcarriers, so the WiFi data has dimensions 3x114x10. Finally, the data for the 9 gestures among the 27 activities were selected for the experiments.
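The counts quoted for the two datasets can be checked with a few lines of arithmetic. This sketch reads the "10*5" and "10*4" entries of Table 1 as volunteers times locations or environments, and it assumes samples are spread evenly over every volunteer, setting, and gesture combination; the text does not state this, so the per-combination figures are only a consistency check.

```python
# MixGR WiFi: 3 transmit antennas x 3 receive antennas x 30 subcarriers per pair.
mixgr_subcarriers = 3 * 3 * 30

# MM-Fi WiFi: the data is reported with dimensions 3 x 114 x 10.
mmfi_values = 3 * 114 * 10

# Table 1: samples per (volunteer, setting, gesture) combination,
# assuming an even spread (an assumption, not stated in the text).
mixgr_per_combo = 1800 / (10 * 5 * 6)
mmfi_per_combo = 8640 / (10 * 4 * 9)

print(mixgr_subcarriers, mmfi_values, mixgr_per_combo, mmfi_per_combo)
# prints: 270 3420 6.0 24.0
```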
[0117] 2. Experimental Setup
[0118] First, the cross-target gesture recognition method based on WiFi and video multimodal fusion proposed in this application is compared with four other multimodal fusion methods: Attnsense, Deepsense, Concat, and Cosmo. Second, to demonstrate the effectiveness of the method in heterogeneous multimodal fusion, it is compared with WiFi-based and video-based unimodal gesture recognition methods WiGr and Snapture. WiGr uses a dual-path prototype network for WiFi cross-domain gesture recognition, while Snapture develops a gesture recognition network that combines static and dynamic gestures. Simultaneously, to evaluate the performance of the method in cross-target gesture recognition scenarios, each user is tested sequentially as a target domain user.
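Testing each user in turn as the target domain user corresponds to a leave-one-user-out protocol. The sketch below shows one way to enumerate such splits; the function name and user identifiers are hypothetical, and the actual data partitioning used in the experiments is not specified beyond this description.

```python
def leave_one_user_out(user_ids):
    """Yield (target_user, source_users) pairs, one per user."""
    for target in user_ids:
        sources = [u for u in user_ids if u != target]
        yield target, sources

# Hypothetical identifiers for 10 volunteers.
users = [f"user{i:02d}" for i in range(1, 11)]
splits = list(leave_one_user_out(users))

for target, sources in splits[:2]:
    print(target, "held out;", len(sources), "source users")
```

Each split keeps the target user's data entirely out of the source domain, which is what makes the evaluation cross-target rather than a random sample split.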
[0119] To generate WiFi data from video data across modalities, the WiFi data of the target domain user is generated using the gesture data of nine individuals. The quality of the generated target domain user WiFi data is then evaluated using SenseFi, a benchmark library for deep-learning-based WiFi sensing research. During testing, the generated target domain user WiFi data is randomly mixed with real WiFi data, and the mixed WiFi data is input into the SenseFi benchmark for testing.
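The random mixing of generated and real WiFi data before benchmarking can be sketched as follows. The mixing ratio and the sample format are not given in the text, so the helper below simply concatenates both sets and shuffles them with a fixed seed; `mix_samples` and the string placeholders are hypothetical, and no SenseFi API is called.

```python
import random

def mix_samples(generated, real, seed=0):
    """Tag each sample with its origin, then shuffle the combined list."""
    mixed = [(s, "generated") for s in generated] + [(s, "real") for s in real]
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    rng.shuffle(mixed)
    return mixed

# Placeholder samples; real inputs would be CSI arrays.
generated = [f"gen_{i}" for i in range(4)]
real = [f"real_{i}" for i in range(6)]
mixed = mix_samples(generated, real)
```

Keeping the origin tag alongside each sample makes it possible to report benchmark accuracy separately for generated and real data afterwards.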
[0120] The proposed method primarily involves a contrastive fusion learning approach and a cross-modal generation approach. To demonstrate the effectiveness of these two methods, ablation experiments were designed to evaluate their performance. For the contrastive fusion learning approach, the accuracy of cross-target gesture recognition with and without contrastive fusion learning was compared. For the cross-modal generation approach, experiments were designed to perform cross-target gesture recognition using only contrastive fusion learning and contrastive fusion learning combined with cross-modal generation. Furthermore, the accuracy of cross-target gesture recognition using contrastive fusion learning combined with a generative adversarial network (GAN) to generate target domain user WiFi data was compared.
[0121] 3. Experimental Environment
[0122] Python 3.8 was used as the programming language and PyTorch v1.8 as the backend framework. The server was equipped with an Intel® Xeon® CPU E5-2698 v4 processor, four Tesla V100 graphics processors, and 256GB of memory, and ran the Ubuntu 20 operating system. This configuration provided sufficient computing resources for the training and inference tasks.
[0123] 4. Evaluation Results
[0124] In the experiment, the same software, hardware, and data partitioning methods were used to ensure fairness. Figure 3 shows performance comparison charts of the proposed method against four other multimodal fusion methods: (a) uses the collected dataset MixGR, and (b) uses the publicly available dataset MM-Fi. The experimental results show that the gesture recognition accuracy of the proposed method (hereinafter referred to as MixGR) is superior to that of the other four multimodal fusion methods. Furthermore, the gesture recognition accuracy of the proposed method exceeds 90% on both the collected dataset MixGR and the publicly available MM-Fi dataset.
[0125] Figure 4 shows a performance comparison between the proposed method and single-modal gesture recognition methods. Compared with the WiFi-based gesture recognition method WiGr and the video-based gesture recognition method Snapture, the proposed method MixGR improves gesture recognition accuracy by 21.74% and 18.21%, respectively. By fusing WiFi data and video data, MixGR achieves higher gesture recognition accuracy, demonstrating its multimodal fusion learning capability.
[0126] Figure 5 illustrates the gesture recognition performance of the method of this application in a cross-target scenario, where (a) uses the collected dataset MixGR and (b) uses the publicly available dataset MM-Fi. Figure 5 shows the accuracy of cross-target gesture recognition when each user in turn is treated as the target domain user, demonstrating the effectiveness of the method in cross-target gesture recognition tasks.
[0127] The WiFi data generated across modalities from the video data is crucial to the recognition performance of the method of this application. Figure 6 compares the accuracy of cross-target gesture recognition using WiFi data generated by cross-modal generation with the accuracy obtained without cross-modal generation: (a) shows the comparison at different orientations, and (b) shows the comparison at the same orientation. Figure 6(a) presents the experimental results obtained using the MixGR dataset, and Figure 6(b) presents those obtained using the MM-Fi dataset. The experimental results show that the MixGR method of this application has a significant advantage when cross-modal generated WiFi data is used for cross-target gesture recognition tasks, which indicates that the cross-modal generated WiFi data is close to real WiFi data.
[0128] Figure 7 compares the accuracy of cross-target gesture recognition with and without contrastive fusion learning: (a) shows the comparison at different orientations, and (b) shows the comparison at the same orientation. Figure 7(a) presents the experimental results obtained using the MixGR dataset, and Figure 7(b) presents those obtained using the MM-Fi dataset. Compared with fusion learning without the contrastive component, the cross-target gesture recognition accuracy of the proposed method is significantly higher, highlighting the key role of contrastive fusion learning in promoting multimodal fusion and improving gesture recognition accuracy.
[0129] Figure 8 compares gesture recognition accuracy under different methods: (a) shows the comparison at different positions, and (b) shows the comparison at the same position. Figure 8(a) presents the experimental results obtained using the MixGR dataset, and Figure 8(b) presents those obtained using the MM-Fi dataset. The experimental results show that generating WiFi data with the cross-modal generation method has a significant advantage over the other two methods, which demonstrates the effectiveness of the cross-modal generation method.
[0130] Based on the same inventive concept as the cross-target gesture recognition method based on multimodal fusion, this embodiment also provides a corresponding cross-target gesture recognition device based on multimodal fusion. Figure 9 shows a structural block diagram of a cross-target gesture recognition device based on multimodal fusion according to an embodiment of this application. Referring to Figure 9, the device includes:
[0131] The data acquisition module 91 is used to acquire multimodal data, which includes WiFi data and video data. The WiFi data is the WiFi data when the source domain user performs a gesture action, and the video data includes gesture video data when the source domain user performs a gesture action and daily life video data of the target domain user.
[0132] The contrastive fusion learning module 92 is used to perform contrastive fusion learning on the WiFi data and the gesture video data when the source domain users perform gesture actions, in order to extract gesture features.
[0133] The spatial matrix determination module 93 is used to extract human skeleton point data from the gesture video data when the source domain user performs a gesture action, and determine the spatial matrix of the source domain user based on the skeleton point data; and to extract human skeleton point data from the daily life video data of the target domain user, and determine the spatial matrix of the target domain user based on the skeleton point data.
[0134] The cross-modal generation module 94 is used to generate WiFi data of target domain users based on the source domain users' WiFi data using the generation network. During the cross-modal generation process, the loss function is constructed based on the spatial matrix of the source domain users and the spatial matrix of the target domain users.
[0135] The model training module 95 is used to train the gesture recognition model based on gesture features and WiFi data of target domain users to obtain the trained gesture recognition model.
[0136] The recognition module 96 is used to input the WiFi data of the target domain user to be identified into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
[0137] The cross-target gesture recognition device based on multimodal fusion in this embodiment has the same inventive concept as the cross-target gesture recognition method based on multimodal fusion described above. Therefore, the specific implementation of this device can be found in the embodiment section of the cross-target gesture recognition method based on multimodal fusion described above, and its technical effects correspond to the technical effects of the above method, so it will not be repeated here.
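The weighted fusion and contrastive learning performed by the contrastive fusion learning module can be illustrated with a small sketch. It starts from already extracted WiFi and video feature vectors and skips the convolutional feature extractors. The fusion weight `alpha`, the use of cosine similarity, and the InfoNCE form of the contrastive loss are assumptions chosen for illustration; the application only states that features are weighted, combined, and passed to a contrastive learning method.

```python
import math

def fuse(wifi_feat, video_feat, alpha=0.5):
    """Weighted combination of two equal-length feature vectors."""
    return [alpha * w + (1.0 - alpha) * v for w, v in zip(wifi_feat, video_feat)]

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: small when the anchor is closer to the positive than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy fused features: two samples of the same gesture and one of a different gesture.
anchor = fuse([1.0, 0.0], [1.0, 0.2])
same_gesture = fuse([0.9, 0.1], [1.0, 0.0])
other_gesture = fuse([0.0, 1.0], [0.1, 1.0])

loss_good = info_nce(anchor, same_gesture, [other_gesture])
loss_bad = info_nce(anchor, other_gesture, [same_gesture])
```

Pairing the anchor with a sample of the same gesture yields a much smaller loss than pairing it with a different gesture, which is the signal that pulls same-gesture features together during training.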
[0138] This application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the above-described cross-target gesture recognition method based on multimodal fusion.
[0139] This application provides a computer program product, including a computer program / instruction, which, when executed by a processor, implements the aforementioned cross-target gesture recognition method based on multimodal fusion.
[0140] The above descriptions are merely various embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A cross-target gesture recognition method based on multimodal fusion, characterized in that it includes:
acquiring multimodal data, which includes WiFi data and video data, wherein the WiFi data is the WiFi data when a source domain user performs a gesture action, and the video data includes gesture video data when the source domain user performs a gesture action and daily life video data of a target domain user;
performing contrastive fusion learning on the WiFi data and the gesture video data when the source domain user performs a gesture action, to extract gesture features;
extracting human skeleton point data from the gesture video data when the source domain user performs a gesture action, and determining the spatial matrix of the source domain user based on the skeleton point data;
extracting human skeleton point data from the daily life video data of the target domain user, and determining the spatial matrix of the target domain user based on the skeleton point data;
performing, based on a generation network, cross-modal generation on the WiFi data of the source domain user to obtain the WiFi data of the target domain user, wherein during the cross-modal generation process the loss function is constructed based on the spatial matrix of the source domain user and the spatial matrix of the target domain user;
training the gesture recognition model based on the gesture features and the WiFi data of the target domain user to obtain the trained gesture recognition model; and
inputting the WiFi data of the target domain user to be identified into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
2. The method as described in claim 1, characterized in that performing contrastive fusion learning on the WiFi data and the gesture video data when the source domain user performs a gesture action, to extract gesture features, includes:
performing 2D convolution on the WiFi data for feature extraction to obtain WiFi features;
performing 3D convolution on the gesture video data for feature extraction to obtain video features;
weighting and combining the WiFi features and the video features to obtain fused features; and
extracting the gesture features from the fused features using a contrastive learning method.
3. The method as described in claim 1, characterized in that extracting human skeleton point data from the gesture video data when the source domain user performs a gesture action, and determining the spatial matrix of the source domain user based on the skeleton point data, includes:
extracting, based on the gesture video data, 17 skeletal point data of the human body using a 3D human pose estimation method, wherein each skeletal point data includes coordinates in three dimensions: X, Y, and Z; and
determining the spatial matrix of the source domain user based on the 17 skeletal point data, wherein the spatial matrix comprises three dimensions: X, Y, and Z, and the X-dimensional spatial matrix is represented by the following formula:
where X′_{i,j} is the value of the element in the i-th row and j-th column of the X-dimensional spatial matrix, x_i is the X-dimensional coordinate of the i-th skeletal point, and x_j is the X-dimensional coordinate of the j-th skeletal point.
4. The method as described in claim 1, characterized in that the overall loss function in the cross-modal generation process is:
where the terms denote, in order, the total loss function, the first adversarial loss, the second adversarial loss, the cycle consistency loss, and the spatial difference loss, and λ1 and λ2 are two hyperparameters;
the first adversarial loss and the second adversarial loss are expressed using the following formula:
where G1 and G2 are two generators, D_Y and D_X are two discriminators, X is the dataset of source domain users, Y is the dataset of target domain users, x is a sample in X, y is a sample in Y, P_data(y) is the distribution of y, P_data(x) is the distribution of x, E denotes the expectation, D_Y(y) denotes the output obtained by inputting y into D_Y, G1(x) denotes the output obtained by inputting x into G1, D_X(x) denotes the output obtained by inputting x into D_X, and G2(y) denotes the output obtained by inputting y into G2;
the cycle consistency loss is expressed using the following formula:
where ||·||1 is the 1-norm;
the spatial difference loss is expressed using the following formula:
where n represents the number of dimensions, X′_t is the spatial matrix of the target domain users in the X dimension, X′_s is the spatial matrix of the source domain users in the X dimension, Y′_t is the spatial matrix of the target domain users in the Y dimension, Y′_s is the spatial matrix of the source domain users in the Y dimension, Z′_t is the spatial matrix of the target domain users in the Z dimension, and Z′_s is the spatial matrix of the source domain users in the Z dimension.
5. The method as described in claim 1, characterized in that the method further includes:
denoising the WiFi data, and determining the start and end points of the gesture data in the WiFi data in order to extract the gesture data from the WiFi data; and
performing background removal on the video data, and determining the start and end points of the gesture data in the video data in order to extract the gesture data from the video data.
6. A cross-target gesture recognition device based on multimodal fusion, characterized in that it includes:
a data acquisition module, used to acquire multimodal data, which includes WiFi data and video data, wherein the WiFi data is the WiFi data when a source domain user performs a gesture action, and the video data includes gesture video data when the source domain user performs a gesture action and daily life video data of a target domain user;
a contrastive fusion learning module, used to perform contrastive fusion learning on the WiFi data and the gesture video data when the source domain user performs a gesture action, to extract gesture features;
a spatial matrix determination module, used to extract human skeleton point data from the gesture video data when the source domain user performs a gesture action and determine the spatial matrix of the source domain user based on the skeleton point data, and to extract human skeleton point data from the daily life video data of the target domain user and determine the spatial matrix of the target domain user based on the skeleton point data;
a cross-modal generation module, used to generate the WiFi data of the target domain user from the WiFi data of the source domain user through a generation network, wherein during the cross-modal generation process the loss function is constructed based on the spatial matrix of the source domain user and the spatial matrix of the target domain user;
a model training module, used to train the gesture recognition model based on the gesture features and the WiFi data of the target domain user to obtain the trained gesture recognition model; and
a recognition module, used to input the WiFi data of the target domain user to be identified into the trained gesture recognition model to obtain the gesture recognition result of the target domain user.
7. The apparatus as claimed in claim 6, characterized in that the contrastive fusion learning module is also used for:
performing 2D convolution on the WiFi data for feature extraction to obtain WiFi features;
performing 3D convolution on the gesture video data for feature extraction to obtain video features;
weighting and combining the WiFi features and the video features to obtain fused features; and
extracting the gesture features from the fused features using a contrastive learning method.
8. The apparatus as claimed in claim 6, characterized in that the spatial matrix determination module is also used for:
extracting, based on the gesture video data, 17 skeletal point data of the human body using a 3D human pose estimation method, wherein each skeletal point data includes coordinates in three dimensions: X, Y, and Z; and
determining the spatial matrix of the source domain user based on the 17 skeletal point data, wherein the spatial matrix comprises three dimensions: X, Y, and Z, and the X-dimensional spatial matrix is represented by the following formula:
where X′_{i,j} is the value of the element in the i-th row and j-th column of the X-dimensional spatial matrix, x_i is the X-dimensional coordinate of the i-th skeletal point, and x_j is the X-dimensional coordinate of the j-th skeletal point.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the cross-target gesture recognition method based on multimodal fusion as described in any one of claims 1-5.
10. A computer program product, characterized in that it includes a computer program or instructions which, when executed by a processor, implement the cross-target gesture recognition method based on multimodal fusion as described in any one of claims 1-5.