A multi-source domain adaptation based cross-user gesture recognition method
By using a multi-source domain adaptive method, LSTM and SENET are used to obtain temporal features, and a common domain feature extractor is constructed. This solves the problem of decreased accuracy in cross-user gesture recognition and achieves higher recognition accuracy and real-time performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YANSHAN UNIV
- Filing Date
- 2022-09-30
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, gesture recognition models suffer from a decline in accuracy across users, especially due to differences in electromyographic signals between healthy individuals and people with disabilities, which leads to weak model generalization ability, long training time, and poor action recognition performance among multiple users.
By employing a multi-source domain adaptive approach, we utilize LSTM and SENET to acquire temporal features, construct a common feature extractor and a domain feature extractor, combine CORAL and SoftMax classifiers, perform multi-domain adaptive training, and use multi-source domain transfer learning and data augmentation techniques to optimize feature extraction and classification.
It improves the accuracy of cross-user gesture recognition, meets the requirements of real-time performance and speed, and enhances the model's recognition accuracy on new users.
Figure CN115512440B_ABST
Description
Technical Field
[0001] This invention relates to the field of human-computer interaction technology in artificial intelligence, and in particular to a cross-user gesture recognition method based on multi-source domain adaptation.
Background Technology
[0002] Gesture recognition technology has demonstrated significant research value and application prospects in numerous human-computer interaction fields. Scientific research has confirmed that hand rehabilitation robots, which integrate gesture recognition and robotics technologies, can assist physicians in rehabilitating patients' hands, helping to reshape the nervous system and promote the recovery of normal hand motor function. In the consumer electronics sector, the application scenarios for gesture recognition in human-computer interaction are even more diverse, such as remotely controlled home service robots and virtual reality interaction. In the industrial sector, using gesture recognition technology combined with robotic arms can ensure worker safety in extreme working environments. In the military field, language and facial expressions are often unsuitable for communication and command among soldiers; gestures can safely and clearly convey combat orders.
[0003] Due to individual differences in signals, action recognition can be divided into single-user action recognition algorithms and cross-user action recognition algorithms. The difference lies in whether the users in the training dataset and the users in the test dataset are the same.
[0004] Gesture recognition technology using classic classifiers generally involves four steps: signal detection, signal preprocessing and segmentation, feature extraction, and gesture classification. Feature extraction is the key step determining the effectiveness of gesture recognition. Early research focused on extracting features from the signal in the time domain, such as using root mean square (RMS) values and waveform length. As research progressed, researchers used more diverse methods for feature extraction, such as using frequency domain features like peak frequency and median frequency, and time-frequency domain features like the absolute mean and average energy of wavelet coefficients. However, manually designed features are prone to redundancy, leading to weak model generalization and long training times. To address redundancy, a common approach is feature selection, filtering extracted features to retain high-information and high-value features, thus constructing a feature subset with smaller dimensionality and higher information content. Common feature selection methods can be categorized into filtering, wrapper, and embedded approaches. While feature selection addresses feature redundancy to some extent, its effectiveness and applicability still require further improvement.
[0005] Due to differences in muscle development and contraction patterns, the electromyographic (EMG) signals generated when performing the same movements vary among individuals, especially between healthy individuals and people with disabilities. This difference limits the applicability of models across multiple users. Therefore, solving the problem of cross-user action recognition has attracted widespread attention from researchers. Current common methods include using transfer learning to update the weight factors of the recognition model for known users with a small amount of new user data, enabling the model to quickly adapt to new user data. Another approach is domain adaptation, which maps known and unknown user data to the same feature space through certain transformations, allowing for rapid matching of new users through model fine-tuning. However, since the distribution differences in action data among different users are inevitable, even the best-trained models will experience a decrease in accuracy when faced with new user data due to these data distribution differences.
Summary of the Invention
[0006] The purpose of this invention is to address the shortcomings of the existing technology by designing a cross-user gesture recognition method based on multi-source domain adaptation, so as to solve the technical defects existing in the prior art.
[0007] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0008] S1: Acquire electromyographic and inertial signals from the forearms of multiple subjects using sensors to construct a multi-source domain dataset.
[0009] S2: Preprocess the electromyographic and inertial signals obtained in step S1, including bandpass filtering and power frequency notch filtering; perform data processing, including data label correction and data augmentation.
[0010] S3: Construct a common feature extractor to obtain temporal features through Long Short-Term Memory (LSTM) and a squeeze-and-excitation network (SENet).
[0011] S4: Construct domain feature extractors. Fully connected layers act as feature extractors for each domain, further extracting common features from both the source and target domains to obtain unique features for each source domain. This maps data from each source domain to a specific feature space for that domain. Domain-specific feature alignment is performed using CORAL as a measure of the distributional differences between domain features. The distance between features is calculated by measuring the covariance.
[0012] S5: Domain classifiers. Each domain classifier is followed by a SoftMax classifier. The unique features of each domain are used to obtain the classification result through the domain classifiers. Domain classifier alignment is performed using the absolute value of the cross-entropy difference between the outputs of the target domain and all domain classifiers as the alignment distance.
[0013] S6: Loss estimation for multi-domain adaptive methods, including classification loss, domain-specific feature difference loss, and domain classifier difference loss. The obtained data is fed into the model, and the model is trained until the model's loss function no longer improves. The model is then saved.
[0014] The improvements of this invention are as follows:
[0015] The LSTM in step S3 consists of a forget gate, an update (input) gate, and an output gate. The LSTM uses the tanh function for the final output and a function whose output value lies in the interval [0, 1] as the activation function of each gate. The forget gate determines whether the previous cell state $C_{t-1}$ is forgotten: $f_t = 0$ indicates forgetting, and $f_t = 1$ indicates retention. The SENet used here is an improved squeeze-and-excitation network that introduces the short-time average energy (SMA) into the SE Block. The SMA describes the electromyographic signal well, conveys the strength and contraction of each muscle, and reflects the importance of each channel. The formula for SMA is as follows:
[0016]
[0017] Traditional transfer learning algorithms include: 1. Single-source-domain optimization: using a set of source domain data from the source domain dataset for transfer, traversing the entire dataset, and selecting the optimal experimental result as the result of the single-source-domain optimization method. 2. Source domain combination method: combining all source domain data that are not in the target domain into a dataset as the source domain for the adaptive algorithm.
[0018] The method of this invention acquires temporal features using LSTM and SENet, solves the problem of needing to collect a large amount of labeled source domain data by using multi-source domain transfer learning, and improves the recognition accuracy of the model through data augmentation, thus meeting the requirements for real-time and fast gesture recognition.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of the multi-source domain adaptive gesture recognition method of the present invention.
[0021] Figure 2 shows the operation process of the improved SE Block of the present invention.
[0022] Figure 3 shows the multi-source domain adaptive network structure of the present invention.
Detailed Implementation
[0023] The present invention will be further described in detail below with reference to embodiments:
[0024] To provide a detailed account of the technical content, structural features, achieved objectives, and effects of this invention, an embodiment of the invention will be described below in conjunction with the accompanying drawings:
[0025] S1: Signal Acquisition and Data Set. Electromyography (EMG) signals are acquired using a differential amplifier circuit as the signal input, which amplifies the signal and suppresses common-mode interference during acquisition. The input uses three electrodes: a differential electrode pair and a reference electrode. The differential electrode pair is placed at the center of the muscle, and the reference electrode is placed away from the target muscle or at a location without muscle. Inertial signals are also acquired. Inertial information is calculated based on the relative displacement between a fixed coordinate system and the inertial sensor coordinate system. The angle between the two coordinate axes mainly consists of the heading angle, pitch angle, and roll angle. Based on the coordinate system mapping relationship and the output angle values, the inertial information of the acquired object can be calculated.
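The coordinate mapping between the fixed frame and the inertial sensor frame can be illustrated with a rotation matrix built from the heading, pitch, and roll angles. The sketch below assumes a Z-Y-X rotation order and angles in radians, neither of which the text specifies; `rotation_from_euler` and `rotate` are hypothetical helper names.

```python
import math

def rotation_from_euler(heading, pitch, roll):
    """Return the 3x3 matrix R = Rz(heading) * Ry(pitch) * Rx(roll).

    Assumption: Z-Y-X rotation order, angles in radians.
    R maps a vector expressed in the sensor frame into the fixed frame.
    """
    ch, sh = math.cos(heading), math.sin(heading)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [ch * cp, ch * sp * sr - sh * cr, ch * sp * cr + sh * sr],
        [sh * cp, sh * sp * sr + ch * cr, sh * sp * cr - ch * sr],
        [-sp, cp * sr, cp * cr],
    ]

def rotate(R, v):
    """Apply the 3x3 matrix R to the 3-vector v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]
```

With all three angles at zero the matrix is the identity, and a heading of 90 degrees alone turns the sensor x-axis onto the fixed y-axis.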
[0026] The experimental data collected in this invention targeted six muscles: flexor carpi ulnaris, flexor carpi radialis, flexor pollicis longus, extensor digitorum, extensor digiti minimi, and extensor indicis. The dataset included 53 hand gestures from 10 healthy subjects. Each subject repeated each gesture 6 times, holding each gesture for 5 seconds, followed by a 3-second rest period. The collected gestures included 12 fine finger movements, 17 wrist movements, 23 grasping gestures, and the resting gesture, totaling 318 gesture repetitions per subject (53 gestures, 6 repetitions each). The data were segmented with a sliding window of 260 ms (52 points) and a sliding step of 20 ms (4 points).
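As a minimal sketch of the segmentation just described, the function below cuts a recording into overlapping windows of 52 points with a step of 4 points. The name `sliding_windows` and the list-of-samples data layout are assumptions made for illustration.

```python
def sliding_windows(signal, window=52, step=4):
    """Split a sequence of samples into overlapping windows.

    `signal` may be a list of scalars or a list of per-sample channel rows.
    Trailing samples that do not fill a whole window are dropped.
    """
    last_start = len(signal) - window
    return [signal[s:s + window] for s in range(0, last_start + 1, step)]
```

A recording of 100 samples yields 13 windows under these defaults, the last one starting at sample 48.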
[0027] S2: A Butterworth filter is used to perform bandpass filtering on the electromyography signal from 0.1 to 200 Hz, with a passband attenuation of 0.5 dB and a stopband attenuation of 40 dB. The bandpass filtered data is then passed through a 50 Hz notch filter to obtain the preprocessed signal.
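The 50 Hz notch stage can be sketched with a second-order (biquad) notch filter. This is an illustration under stated assumptions, not the filter design of the invention: the coefficient formulas are the common audio-cookbook biquad notch, the quality factor `q=30` is an arbitrary choice, and the 2000 Hz sampling rate used in the usage note is taken from the label-correction step. The Butterworth band-pass stage is not shown.

```python
import math

def notch_coefficients(f0, fs, q=30.0):
    """Biquad notch centred on f0 (Hz) for sampling rate fs (Hz).

    Returns (b, a) normalised so that a[0] == 1. The value of q is an assumption.
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def biquad_filter(b, a, x):
    """Run the second-order difference equation over the samples in x."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

Calling `biquad_filter(*notch_coefficients(50.0, 2000.0), samples)` suppresses a 50 Hz component once the start-up transient has decayed, while components well away from 50 Hz pass almost unchanged.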
[0028] Label correction uses a data window of 200 ms at a sampling frequency of 2000 Hz, giving 400 data points per window. The average short-time energy within this window is:
[0029]
[0030] The active segment thresholds T1 and T2 are determined from the results of multiple experiments, where T1 is the starting threshold of the active segment and T2 is its ending threshold. When the window average short-time energy Pi > T1, the window is regarded as the start of the active segment, and its first point is used as the starting point of the active segment. When the window average short-time energy Pi < T2, the window is regarded as the end of the active segment.
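The two-threshold rule can be sketched as follows. Because the energy formula is not reproduced in the text, the sketch assumes that the window average short-time energy is the mean of the squared samples, and that windows are consecutive and non-overlapping; the function names are hypothetical.

```python
def window_energy(window):
    """Average short-time energy, assumed here to be the mean of squared samples."""
    return sum(v * v for v in window) / len(window)

def find_active_segment(signal, t1, t2, window=400):
    """Return (start, end) sample indices of the first active segment, or None.

    start: first sample of the first window whose energy exceeds t1.
    end:   first sample of the next window whose energy falls below t2.
    If the signal ends while still active, end is len(signal).
    """
    start = None
    for i in range(0, len(signal) - window + 1, window):
        p = window_energy(signal[i:i + window])
        if start is None and p > t1:
            start = i
        elif start is not None and p < t2:
            return start, i
    return (start, len(signal)) if start is not None else None
```

The labels of samples outside the returned range can then be reset to the rest class, which is one way to apply the correction.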
[0031] Data augmentation includes three operations. Adding random noise: Gaussian random noise is added to increase the diversity of the data. Data axis scaling: scaling simulates the data changes caused by differences in the duration of the same gesture. Channel flipping: flipping the channels also increases the diversity of the data.
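A minimal sketch of the three augmentations, with a window stored as a list of time steps, each a list of channel values. The noise level `sigma`, the reading of "data axis scaling" as linear resampling along the time axis, and the reading of "channel flipping" as reversing the channel order are all assumptions.

```python
import random

def add_gaussian_noise(window, sigma=0.01, rng=random):
    """Add zero-mean Gaussian noise to every value (sigma is an assumed level)."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in window]

def scale_time_axis(window, factor):
    """Resample along time by linear interpolation (factor > 1 stretches the gesture)."""
    n = len(window)
    m = max(2, round(n * factor))
    out = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(window[lo], window[hi])])
    return out

def flip_channels(window):
    """Reverse the channel order at every time step."""
    return [row[::-1] for row in window]
```

A stretched or compressed window would normally be cropped or padded back to the fixed window length before it is fed to the network.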
[0032] S3: Construct a common feature extractor to obtain temporal features through long short-term memory (LSTM) and a squeeze-and-excitation network (SENet).
[0033] The LSTM consists of a forget gate, an update (input) gate, and an output gate. The LSTM uses the tanh function for the final output and a function whose output value lies in the range [0, 1] (the sigmoid) as the activation function of each gate. The forget gate determines whether the previous cell state $C_{t-1}$ is forgotten: when $f_t$ is 0 the state is forgotten, and when $f_t$ is 1 it is retained. The mathematical expression is as follows:
[0034] $f_t = \mathrm{sigmoid}(W_f \cdot [h_{t-1}, x_t] + b_f)$
[0035] In the formula, $x_t$ is the input value of the network at the current moment, $h_{t-1}$ is the output value of the LSTM at the previous moment, $W_f$ is the weight matrix of the forget gate, and $b_f$ is the bias term of the forget gate.
[0036] The latest cell state $C_t$ is determined jointly by the previous cell state $C_{t-1}$ and the new candidate cell state $\tilde{C}_t$; $f_t$ and $i_t$ are the weight coefficients of $C_{t-1}$ and $\tilde{C}_t$, respectively, and reflect the forgetting or updating of the cell. The mathematical expressions are as follows:
[0037] $i_t = \mathrm{sigmoid}(W_i \cdot [h_{t-1}, x_t] + b_i)$
[0038] $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
[0039] $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
[0040] In the formula, $W_i$ is the weight matrix of the input gate, $b_i$ is the bias term of the input gate, $W_c$ is the weight matrix for calculating the cell state, $b_c$ is the bias term for calculating the cell state, $i_t$ is the input gate, and $C_{t-1}$ is the previous cell state.
[0041] The output gate determines how much of the current cell state $C_t$ is output to the output value $h_t$. The mathematical expressions are as follows:
[0042] $o_t = \mathrm{sigmoid}(W_o \cdot [h_{t-1}, x_t] + b_o)$
[0043] $h_t = o_t * \tanh(C_t)$
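The gate equations can be traced in a single time step of a standard LSTM cell. The sketch below uses plain Python lists; the function names and the `params` dictionary layout (gate name mapped to a weight matrix and bias acting on the concatenation of the previous output and the current input) are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W @ v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'c', 'o' to (W, b)."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(*params["f"], z)]        # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], z)]        # input gate
    c_hat = [math.tanh(v) for v in affine(*params["c"], z)]  # candidate state
    o = [sigmoid(v) for v in affine(*params["o"], z)]        # output gate
    c_t = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t
```

With a large positive forget-gate bias and a large negative input-gate bias, the cell state passes through almost unchanged, which is the retention case of the forget gate.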
[0044] SENet treats each feature channel as a whole and uses a "feature recalibration" method to calibrate the weights of model channels using global channel features. This allows the model to learn the importance of different feature channels during training, strengthen important feature channels, suppress unimportant feature channels, and achieve adaptive fusion of multi-channel feature data.
[0045] The key module for adaptive channel fusion in SENet is the SE Block, which consists of two operations: compression and excitation. Step S3 employs an improved SE Block excitation method.
[0046] The short-time average energy is introduced into the SE Block, and the formula for the short-time average energy is as follows:
[0047]
[0048] Where n represents the sequence length and N represents the window length.
[0049] As shown in Figure 2, in the SENet, E represents the input image formed by fusing the electromyographic and inertial signals, and the output image likewise represents the fused electromyographic and inertial signals. The input data of each channel are aggregated by the short-time average energy, reducing the dimensionality of the original image from $T \times C$ to $1 \times C$ to obtain the global temporal features of each channel.
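The aggregation from T×C to 1×C can be sketched as a per-channel energy computation. Since the energy formula is not reproduced in the text, the sketch assumes the mean of squared values over the whole window (window length equal to T); `energy_squeeze` is a hypothetical name.

```python
def energy_squeeze(feature_map):
    """Collapse a T x C map (list of T rows, C values each) to C channel energies.

    Assumption: energy is the mean of squared values over all T time steps.
    """
    T = len(feature_map)
    C = len(feature_map[0])
    return [sum(feature_map[t][c] ** 2 for t in range(T)) / T for c in range(C)]
```

This takes the place of the global average pooling used in a conventional SE Block, so a channel with strong muscle activity produces a larger descriptor than an idle channel.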
[0050] S4: A domain feature extractor is set up for each source domain to further extract common features from both the source and target domains, thereby obtaining unique features for each source domain. This maps the data from each source domain to a specific feature space for each domain. Fully connected layers serve as the domain feature extractors for each domain. In this invention, two fully connected layers are used to increase and decrease the dimensionality of the data. The first fully connected layer performs dimensionality reduction, lowering the channel dimension of the feature map from C to C / r, where r is a variable hyperparameter. ReLU is used as the activation function after this fully connected layer. The second fully connected layer performs dimensionality increase, raising the channel dimension of the feature map back to C from C / r. Sigmoid is used as the activation function after this fully connected layer, limiting its output range to 0-1.
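The two fully connected layers described above form a bottleneck, which can be sketched as follows. The weights are passed in explicitly as lists of rows; the function names and the example sizes in the usage note are illustrative assumptions.

```python
import math

def dense(W, b, v):
    """Compute W @ v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def bottleneck(v, W1, b1, W2, b2):
    """Map C -> C/r with ReLU, then C/r -> C with sigmoid.

    W1 has C/r rows of length C and W2 has C rows of length C/r.
    Every output lies strictly between 0 and 1.
    """
    hidden = [max(0.0, z) for z in dense(W1, b1, v)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(W2, b2, hidden)]
```

For C = 4 and r = 2, `W1` is a 2×4 matrix and `W2` is a 4×2 matrix; a larger r reduces the number of parameters at the cost of a narrower hidden layer.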
[0051] To achieve domain-specific feature alignment, CORAL is used as a measure of the distributional differences between domain features. This method is an alignment method that uses second-order statistics, specifically calculating the distance between features by calculating the covariance. The difference measure of this method is:
[0052] $\ell_{CORAL} = \frac{1}{4d^2} \left\| Cov_S - Cov_T \right\|_F^2$
[0053] $Cov_S = \frac{1}{n_S - 1} \left( D_S^\top D_S - \frac{1}{n_S} (\mathbf{1}^\top D_S)^\top (\mathbf{1}^\top D_S) \right)$
[0054] $Cov_T = \frac{1}{n_T - 1} \left( D_T^\top D_T - \frac{1}{n_T} (\mathbf{1}^\top D_T)^\top (\mathbf{1}^\top D_T) \right)$
[0055] Among them, $Cov_S$ is the covariance matrix of the source domain features, $Cov_T$ is the covariance matrix of the target domain features, $d$ is the number of neurons in the feature layer, $\| Cov_S - Cov_T \|_F^2$ is the squared F-norm that represents the covariance distance, $D_S$ represents the features of the source domain data, $D_T$ represents the features of the target domain data, $n_S$ is the amount of data in the source domain, $n_T$ is the amount of data in the target domain, and $\mathbf{1}$ is a column vector of ones. The difference between the two distributions is measured by calculating the covariance distance between the source domain features and the target domain features.
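The covariance distance can be sketched in a few lines. The sketch follows the form of the CORAL loss commonly given in the literature, the squared Frobenius norm of the covariance difference scaled by 1/(4d²); that scaling and the use of the unbiased covariance are assumed to match the method here. Features are given as lists of rows, one row per sample.

```python
def covariance(features):
    """Unbiased covariance matrix of an n x d feature matrix (list of rows)."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    return [
        [sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in features) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]

def coral_distance(source, target):
    """Squared Frobenius norm of the covariance difference, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq = sum((cs[a][b] - ct[a][b]) ** 2 for a in range(d) for b in range(d))
    return sq / (4.0 * d * d)
```

The distance is zero when both batches have the same covariance, regardless of their means, which is why it serves as a second-order alignment term and not a full distribution match.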
[0056] S5: Set up a domain classifier for each source domain, and obtain the classification result by passing the unique features of each domain through the corresponding domain classifier. A SoftMax classifier is added after each domain classifier.
[0057] The domain classifiers are trained on data from different source domains, so they may misclassify target samples, especially target domain samples that lie near class boundaries. In this invention, the domain classifiers output cross-entropy. To align the domain classifiers, this invention uses the absolute value of the difference in cross-entropy between the target domain's outputs on all domain classifiers as the alignment distance function:
[0058]
[0059] In the formula, $C_j$ is a softmax classifier, and $H(F(x))$ is the output of the domain feature extractor.
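Because the alignment formula is not reproduced in the text, the following is only one plausible reading, offered as a sketch: the mean absolute difference between the softmax outputs of every pair of domain classifiers on the same batch of target samples. This pairwise form is a substitute commonly used for classifier alignment in multi-source adaptation, not a confirmed transcription of the formula; `classifier_discrepancy` is a hypothetical name.

```python
def classifier_discrepancy(probs_per_classifier):
    """Mean absolute difference between classifier outputs on one target batch.

    probs_per_classifier[k][n][c] is the softmax probability that
    classifier k assigns to class c for target sample n.
    Returns the average over all classifier pairs (0.0 for a single classifier).
    """
    K = len(probs_per_classifier)
    total, pairs = 0.0, 0
    for a in range(K - 1):
        for b in range(a + 1, K):
            diffs = [
                abs(pa - pb)
                for ra, rb in zip(probs_per_classifier[a], probs_per_classifier[b])
                for pa, pb in zip(ra, rb)
            ]
            total += sum(diffs) / len(diffs)
            pairs += 1
    return total / pairs if pairs else 0.0
```

Minimizing this value pushes the classifiers trained on different source domains toward the same prediction for each target sample.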
[0060] S6: Loss estimation for multi-domain adaptive methods, including classification loss, domain-specific feature difference loss, and domain classifier difference loss.
[0061]
[0062] In the formula, the two sample symbols denote the source domain samples and the target domain samples, respectively.
[0063]
[0064] In the formula, the three terms are the classification loss, the domain-specific feature loss, and the domain classifier loss, and L is the sum of all losses in the model.
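The total loss and the stopping rule of step S6 (train until the loss no longer improves) can be sketched as below. The unweighted sum follows the statement that L is the sum of all losses; the `patience` and `min_delta` values and both function names are assumptions.

```python
def total_loss(cls_loss, feature_loss, classifier_loss):
    """Sum of classification, domain-specific feature, and domain classifier losses."""
    return cls_loss + feature_loss + classifier_loss

def should_stop(loss_history, patience=5, min_delta=1e-4):
    """True when the last `patience` epochs did not beat the earlier best loss.

    patience and min_delta are assumed values, not taken from the text.
    """
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_recent > best_before - min_delta
```

In a training loop, the total loss of each epoch is appended to `loss_history`, and the model is saved once `should_stop` returns True.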
[0065] To better illustrate the effectiveness of multi-source domain transfer learning, the following table shows the accuracy of the model in recognizing gestures from new users before and after using multi-source domain adaptation:
[0066]
[0067] For the same individual, using the 1st, 3rd, and 5th repetitions of each gesture as training data and the 2nd and 4th repetitions as evaluation and testing data, the model accuracy reached its highest value of 98.26% after 60 epochs. However, when the target subject changed and no transfer method was applied, the accuracy dropped sharply to 59.2%, which degrades the user experience. Using the adaptive method of this invention for model transfer, the model's accuracy in the target domain can be improved by 21.61%.
[0068] Two comparison methods are introduced: ① Single-source-domain optimal method. This method uses a set of source domain data from one source domain dataset for migration, iterates through the entire dataset, and selects the optimal experimental result as the single-source-domain optimal result. For example, if the target user is S6, then five single-source-domain migration methods (S1→S6, S2→S6, S3→S6, S4→S6, S5→S6) are compared, and the result with the highest accuracy is selected. ② Source domain combination method: This method combines all source domain data (excluding the target domain) into a single dataset as the source domain for the adaptive algorithm. The source domain combination method is the most direct way to apply the single-source-domain method to multiple source domains. For example, if the target user is S6, then the migration accuracy of S1, S2, S3, S4, S5→S6 is calculated as the final result. See the table below:
[0069]
[0070] For the same target domain, models using multi-source domain data outperform models built with single-source domain data. This result demonstrates that using multiple source datasets can improve classification accuracy in the target domain, and that introducing other source datasets can further improve single-source domain adaptive methods. As shown in the table, the adaptive method of this invention outperforms both single-source and multi-source domain combinations in terms of transfer learning results, proving that using multi-source domain adaptive methods can leverage multiple datasets to achieve higher cross-user classification accuracy.
[0071] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A cross-user gesture recognition method based on multi-source domain adaptation, characterized in that it includes the following steps:
S1. Acquire electromyographic and inertial signals from the forearms of multiple subjects using sensors to construct a multi-source domain dataset;
S2. Preprocess the electromyographic and inertial signals obtained in step S1, including bandpass filtering and power frequency notch filtering; perform data processing, including data label correction and data augmentation. In step S2, a Butterworth filter is used to perform bandpass filtering on the electromyography signal from 0.1 to 200 Hz, with a passband attenuation of 0.5 dB and a stopband attenuation of 40 dB. The bandpass filtered data is then passed through a 50 Hz notch filter to obtain the preprocessed signal. Data augmentation includes: adding random noise, using Gaussian random noise to increase data diversity; data axis scaling, which simulates the data variations caused by differences in the duration of the same gesture; and channel flipping, which also increases data diversity;
S3. Construct a common feature extractor to obtain temporal features through Long Short-Term Memory (LSTM) and the squeeze-and-excitation network SENet. In step S3, the LSTM consists of a forget gate, an update (input) gate, and an output gate; the LSTM uses the tanh function for the final output and selects a function whose output value lies in the interval [0, 1] as the activation function of each gate; when $f_t$ is 0, the previous cell state $C_{t-1}$ is forgotten; when $f_t$ is 1, the previous cell state $C_{t-1}$ is retained; temporal features are obtained through the LSTM network. The mathematical expression is as follows: $f_t = \mathrm{sigmoid}(W_f \cdot [h_{t-1}, x_t] + b_f)$. In the formula, $x_t$ is the network's input value at the current moment, $h_{t-1}$ is the output value of the LSTM at the previous time step, $W_f$ is the weight matrix of the forget gate, and $b_f$ is the bias term of the forget gate. The latest cell state $C_t$ is determined jointly by the previous cell state $C_{t-1}$ and the new candidate cell state $\tilde{C}_t$; $f_t$ and $i_t$ are the weighting coefficients of $C_{t-1}$ and $\tilde{C}_t$, respectively, and represent the cell's forgetting or updating process. The mathematical expressions are as follows: $i_t = \mathrm{sigmoid}(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$, $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$. In the formula, $W_i$ is the weight matrix of the input gate, $b_i$ is the bias term of the input gate, $W_c$ is the weight matrix for calculating the cell state, $b_c$ is the bias term for calculating the cell state, $i_t$ is the input gate, and $C_{t-1}$ is the previous cell state. The output gate determines how much of the current cell state $C_t$ is output to the output value $h_t$. The mathematical expressions are as follows: $o_t = \mathrm{sigmoid}(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t * \tanh(C_t)$. SENet treats each feature channel as a whole and uses feature recalibration, calibrating the weights of the model channels with the global features of each channel. This allows the model to learn the importance of different feature channels during training, strengthen important feature channels, suppress unimportant feature channels, and achieve adaptive fusion of multi-channel feature data. The key module for adaptive channel fusion in SENet is the SE Block, which consists of two operations: compression and excitation. Step S3 employs an improved SE Block excitation method. The short-time average energy is introduced into the SE Block; in its formula, n represents the sequence length and N represents the window length. The input data of each channel are aggregated using the short-time average energy, reducing the original image from $T \times C$ to $1 \times C$ to obtain the global temporal features of each channel;
S4. Construct domain feature extractors. Fully connected layers serve as the domain feature extractors for each domain, further extracting the common features of the source and target domains to obtain unique features for each source domain, thus mapping the data of each source domain to a feature space specific to that domain. Domain-specific feature alignment is achieved by using CORAL as a measure of the distribution difference between domain features, with the distance between features calculated from the covariance. In step S4, the difference measure of this method is: $\ell_{CORAL} = \frac{1}{4d^2} \left\| Cov_S - Cov_T \right\|_F^2$, with $Cov_S = \frac{1}{n_S - 1} \left( D_S^\top D_S - \frac{1}{n_S} (\mathbf{1}^\top D_S)^\top (\mathbf{1}^\top D_S) \right)$ and $Cov_T = \frac{1}{n_T - 1} \left( D_T^\top D_T - \frac{1}{n_T} (\mathbf{1}^\top D_T)^\top (\mathbf{1}^\top D_T) \right)$, in which $Cov_S$ is the covariance matrix of the source domain features, $Cov_T$ is the covariance matrix of the target domain features, $d$ is the number of neurons in the feature layer, $\| Cov_S - Cov_T \|_F^2$ is the squared F-norm that represents the covariance distance, $D_S$ represents the features of the source domain data, $D_T$ represents the features of the target domain data, $n_S$ is the amount of data in the source domain, $n_T$ is the amount of data in the target domain, and $\mathbf{1}$ is a column vector of ones; the distribution difference is measured by calculating the covariance distance between the source domain features and the target domain features;
S5. Construct domain classifiers, and add a SoftMax classifier after each domain classifier. The unique features of each domain are passed through the domain classifiers to obtain the classification results. Domain classifier alignment uses the absolute value of the cross-entropy difference between the target domain's outputs on all domain classifiers as the alignment distance. In step S5, in the alignment distance function, $C_j$ is a softmax classifier, $H(F(x))$ is the output of the domain feature extractor, and the resulting value is the domain classifier loss;
S6. Loss estimation for the multi-domain adaptive method, including the classification loss, the domain-specific feature difference loss, and the domain classifier difference loss; feed the obtained data into the model, train the model until the model loss function no longer improves, and save the model. In step S6, the losses are computed over the source domain samples and the target domain samples, and L, the sum of the classification loss, the domain-specific feature loss, and the domain classifier loss, is the sum of all losses in the model.
2. The cross-user gesture recognition method based on multi-source domain adaptation according to claim 1, characterized in that: Step S1 uses a differential amplifier circuit as the signal input terminal. The input terminal uses three electrodes: a differential electrode pair and a reference electrode. The differential electrode pair is placed at the center of the muscle, and the reference electrode is placed away from the target muscle or at a location without muscle. Inertial signals are acquired. The inertial information is calculated by the relative displacement relationship between the fixed coordinate system and the inertial sensor coordinate system. The angle formed between the two coordinate axes is mainly composed of the heading angle, pitch angle, and roll angle. Based on the coordinate system mapping relationship and the above-mentioned angle values output, the inertial information of the acquired object can be calculated.