Method for accurate estimation of six-dimensional interaction force in human screwing motion based on multimodal temporal deep learning
By employing a multimodal temporal deep learning method that integrates electromyographic signals, visual skeleton data, and contact force data, and by designing a dimension-weighted loss function, the gradient dominance problem in six-dimensional force/torque estimation for twisting tasks is solved, achieving high-precision and stable six-dimensional force/torque prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN INST OF TECH
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to achieve high-precision six-dimensional force/torque estimation in complex tasks such as twisting, especially in dynamic twisting scenarios where the ability to capture time-varying torque is insufficient. Furthermore, multi-task regression suffers from gradient dominance and learning imbalance issues.
A multimodal temporal deep learning approach is adopted. A fusion model of electromyographic signals, visual skeleton data, and contact force data is constructed, and a dimension-weighted temporal prediction combined loss function is designed. A multi-head attention mechanism and a bidirectional LSTM network are used for feature extraction and prediction to achieve accurate mapping to the six-dimensional force/torque.
It achieves high-fidelity reconstruction of complex mechanical properties, improves the accuracy and stability of six-dimensional force estimation, enhances time series modeling capabilities, and has strong robustness and noise resistance.
Description
Technical Field
[0001] This invention belongs to the fields of human-computer interaction, robotics and artificial intelligence, and specifically relates to a method for accurately estimating the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning.
Background Technology
[0002] As robots undertake increasingly complex tasks in unstructured environments, achieving precise end-effector force control has become a core challenge in dexterous manipulation. Fine manipulation, exemplified by screwing, not only requires accurate position trajectory tracking but also relies on real-time, precise adjustment of the six-dimensional forces / torques at the end-effector. This high-precision "force-position synergy" in nonlinear contact environments is currently key to overcoming the bottlenecks in robot manipulation. However, establishing a precise mapping from motion states to complex contact dynamics remains a significant challenge. Simple position control is insufficient for tasks involving complex contact dynamics; high-precision interaction force estimation is essential to truly achieve human-like dexterous manipulation.
[0003] Research on manipulation force estimation has evolved from biomechanical models to data-driven methods. Traditional estimation methods are mostly based on human musculoskeletal models, using surface electromyography (sEMG) signals combined with anatomical parameters to infer joint torque. Existing models perform well for static wrist movements but cannot effectively predict joint torque in complex dynamic movements. Although existing techniques have verified the role of muscle coactivation in end-point stiffness regulation and established accurate mapping models, their dependence on high-precision sensors limits their scope of application. In recent years, deep learning-based multimodal methods have gradually become mainstream. Existing techniques attempt to establish a nonlinear mapping between sEMG and human arm interaction forces, but they are limited to estimating the force amplitude rather than the complete six-dimensional force vector, and their experiments are conducted only at fixed locations, so they cannot capture time-varying torques in dynamic twisting scenarios. Existing multimodal fusion research, although it incorporates visual or radar signals, often struggles to achieve high-fidelity six-dimensional force reconstruction in highly dynamic contact tasks.
[0004] Force estimation modeling for dexterous manipulation is essentially a multimodal time series analysis problem. Early fusion strategies often employed simple feature concatenation, which struggled to capture the temporal dependencies between modalities. With the development of deep learning, architectures based on LSTM, GRU, and others have demonstrated powerful capabilities. One existing technique proposes a method that simultaneously predicts hand grasp and forearm movements using bioimpedance measurements of target muscles and LSTM regression models, directly mapping changes in muscle bioimpedance to predicted angles, specifically for real-time target grasping tasks. Another existing technique proposes a system called FeelTheForce (FTF) for learning fine force control manipulation skills from human tactile demonstrations. It uses a Transformer architecture to learn a closed-loop policy that simultaneously predicts the trajectory of the robot's end effector and the desired contact force. While these architectures perform well in specific tasks, they still face significant optimization challenges in six-dimensional force/torque (6D force/torque) regression for complex tasks such as twisting. Because forces (Fx, Fy, Fz, in N) and torques (Tx, Ty, Tz, in N·m) differ significantly in magnitude and dynamic range, the "gradient dominance" phenomenon occurs: the larger or more easily learned dimensions dominate the descent direction of the loss function, severely compromising the prediction accuracy of the other dimensions. Existing multi-task learning methods often overlook this optimization imbalance caused by differences in physical dimensions.
Summary of the Invention
[0005] This invention provides a method for accurate estimation of six-dimensional interactive forces in human twisting motion based on multimodal temporal deep learning, which solves the problems of gradient dominance and learning imbalance in multi-task regression and achieves high-fidelity reconstruction of complex mechanical properties.
[0006] This invention is achieved through the following technical solution: A method for accurate estimation of six-dimensional interactive forces in human twisting motion based on multimodal temporal deep learning, the method comprising the following steps: Step 1: Multimodal data input and preprocessing; Step 2: Extract features from the multimodal data from Step 1; Step 3: Fuse the features extracted in Step 2 and perform temporal modeling; Step 4: A multi-head output performs independent per-dimension output mapping on the temporal features of Step 3; Step 5: Based on the prediction output of Step 4, design a dimension-weighted time series prediction combined loss function. Specifically, the combined loss function $L_{total}$ is defined as:
$$L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$$
[0007] where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per output dimension by a dimension weight $w_d$; Step 6: Based on the dimension-weighted temporal prediction combined loss function in Step 5, achieve accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning.
[0008] Furthermore, step 1 specifically involves: Three parallel input branches are constructed to process multimodal temporal data. The EMG branch concatenates muscle activations and timestamps over 450 time steps, with a tensor structure of (450, 13). The visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450, 35, 4), to capture the 3D coordinate trajectories of 35 body keypoints over 450 time steps. The contact force branch concatenates 15-dimensional contact force data and timestamps, recording the force sensor readings at the same time steps in a (450, 16) tensor.
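The three input layouts above can be pictured with a minimal sketch. It assumes that each branch appends the timestamp as its last feature (so 12 EMG channels + 1, (x, y, z) + 1 per keypoint, and 15 force values + 1); that split, the zero-filled values, and the helper names `make_window` and `shape` are illustrative assumptions, not taken from the patent text.

```python
T = 450            # time steps per window (from the text)
N_KEYPOINTS = 35   # body keypoints (from the text)

def make_window(t0=0.0, dt=0.01):
    """Build one zero-filled sample with the three branch layouts.

    Assumption: each branch appends the timestamp as its last feature,
    so EMG = 12 channels + 1, keypoint = (x, y, z) + 1, force = 15 + 1.
    """
    stamps = [t0 + i * dt for i in range(T)]
    emg = [[0.0] * 12 + [t] for t in stamps]
    vision = [[[0.0, 0.0, 0.0, t] for _ in range(N_KEYPOINTS)] for t in stamps]
    force = [[0.0] * 15 + [t] for t in stamps]
    return emg, vision, force

def shape(x):
    """Return the nested-list shape as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

emg, vision, force = make_window()
print(shape(emg), shape(vision), shape(force))
```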
[0009] Furthermore, step 2 specifically involves: Three independent feature extraction modules are designed to process the different data modalities. The EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer. The vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it to 32-dimensional features. The force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability.
[0010] Furthermore, the feature fusion in step 3 specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling in step 3 specifically involves: Deep temporal dependencies of the feature sequence are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states per direction, and the second layer maintains the same structure, yielding 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
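The dimension bookkeeping in this paragraph can be checked with a short sketch. It assumes concatenation happens per time step and that the 128-dimensional temporal feature comes from joining the forward and backward 64-dimensional hidden states; the function names `fuse` and `bilstm_out_dim` are hypothetical.

```python
T = 450  # time steps per window (from the text)

def fuse(emg_feat, vis_feat, force_feat):
    """Concatenate per-time-step feature vectors from the three branches."""
    return [e + v + f for e, v, f in zip(emg_feat, vis_feat, force_feat)]

def bilstm_out_dim(hidden, bidirectional=True):
    """Output width of an LSTM layer: both directions are concatenated."""
    return hidden * (2 if bidirectional else 1)

# Zero-filled placeholder features with the widths stated in the text.
emg_feat = [[0.0] * 32 for _ in range(T)]
vis_feat = [[0.0] * 32 for _ in range(T)]
force_feat = [[0.0] * 16 for _ in range(T)]

joint = fuse(emg_feat, vis_feat, force_feat)
print(len(joint), len(joint[0]), bilstm_out_dim(64))
```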
[0011] Furthermore, step 4 specifically involves: The multi-head output layer uses six independent fully connected network heads to process shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128 to 64 to 32 to 1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of each head are concatenated to form a prediction result of (450, 6).
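As a rough sense of scale for the six independent heads, the sketch below counts the trainable weights of one 128→64→32→1 head. It assumes plain fully connected layers with biases and ignores LayerNorm parameters, which the text does not size; `linear_params` and `head_params` are illustrative names.

```python
HEAD_LAYERS = [128, 64, 32, 1]  # layer widths from the text
N_HEADS = 6                     # one head per force/torque dimension

def linear_params(n_in, n_out, bias=True):
    """Weights (plus optional biases) of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def head_params(layers=HEAD_LAYERS):
    """Total parameters of one output head, LayerNorm excluded (assumption)."""
    return sum(linear_params(a, b) for a, b in zip(layers, layers[1:]))

per_head = head_params()
total = N_HEADS * per_head
print(per_head, total)
```

Because no parameters are shared, the six heads together cost six times one head; this is the price of letting each dimension receive its own gradient updates.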
[0012] Furthermore, the loss function in step 5 is specifically as follows: The dimension-weighted time series prediction combined loss function includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. The dimension weights are assigned according to the coefficient of determination R²: dimensions with a lower fit receive greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.
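One way such R²-driven weighting could work is sketched below. The text does not give the rule that turns R² into weights, so the proportional-to-(1 − R²) rule, the normalization to a mean of 1, and the names `r2_score` and `weights_from_r2` are purely illustrative assumptions.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one output dimension."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def weights_from_r2(r2_values, eps=1e-6):
    """Hypothetical rule: weight proportional to (1 - R²), mean weight = 1."""
    raw = [max(1.0 - r, eps) for r in r2_values]
    scale = len(raw) / sum(raw)
    return [x * scale for x in raw]

# Illustrative R² values for three dimensions (not from the patent).
weights = weights_from_r2([0.95, 0.98, 0.90])
print([round(w, 3) for w in weights])
```

The worst-fitting dimension (R² = 0.90) ends up with the largest weight, which is the behavior the paragraph describes.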
[0013] A system for accurately estimating the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning, characterized in that the system uses the aforementioned method for accurately estimating the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning, and the system includes: Multimodal data preprocessing module: multimodal data input and preprocessing; Multimodal data feature extraction module: extracts features from the multimodal data provided by the multimodal data preprocessing module; Multimodal data temporal modeling module: fuses the features extracted by the multimodal data feature extraction module and performs temporal modeling; Output module: a multi-head output performs independent per-dimension output mapping on the temporal features. Based on the predicted output, a dimension-weighted time series prediction combined loss function is designed. Specifically, the combined loss function $L_{total}$ is defined as:
$$L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$$
[0014] where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per output dimension by a dimension weight $w_d$. Accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning is achieved by using the dimension-weighted temporal prediction combined loss function.
[0015] Furthermore, the multimodal data preprocessing module specifically comprises: Three parallel input branches are constructed to process multimodal temporal data. The EMG branch concatenates muscle activations and timestamps over 450 time steps, with a tensor structure of (450, 13). The visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450, 35, 4), to capture the 3D coordinate trajectories of 35 body keypoints over 450 time steps. The contact force branch concatenates 15-dimensional contact force data and timestamps, recording the force sensor readings at the same time steps in a (450, 16) tensor.
[0016] Furthermore, the multimodal data feature extraction module specifically comprises: Three independent feature extraction modules are designed to process the different data modalities. The EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer. The vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it to 32-dimensional features. The force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability.
[0017] Furthermore, the feature fusion of the multimodal data temporal modeling module specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling of the multimodal data temporal modeling module specifically involves: Deep temporal dependencies of the feature sequence are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states per direction, and the second layer maintains the same structure, yielding 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
[0018] Furthermore, the output module specifically comprises: The multi-head output layer uses six independent fully connected network heads to process shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128 to 64 to 32 to 1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of each head are concatenated to form a prediction result of (450, 6).
[0019] The designed loss function is specifically as follows: The dimension-weighted time series prediction combined loss function includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. The dimension weights are assigned according to the coefficient of determination R²: dimensions with a lower fit receive greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.
[0020] The beneficial effects of this invention are: This invention constructs a multimodal temporal deep learning framework for six-dimensional interactive force estimation. This framework deeply integrates multi-source heterogeneous information such as electromyographic signals, visual skeletons, and contact forces, establishing a robust end-to-end temporal regression architecture that achieves accurate mapping from multimodal inputs to the final six-dimensional force / torque. Based on this, an innovative dimension-weighted temporal prediction combined loss function is designed. By adjusting the learning weights of each output dimension, it effectively solves the gradient dominance and learning imbalance problems in multi-task regression, achieving high-fidelity reconstruction of complex mechanical properties.
[0021] This invention designs a dimension-weighted temporal prediction combined loss function, which achieves high-precision and high-stability six-dimensional force estimation.
[0022] Under the multimodal fusion mechanism, the system of this invention has strong robustness to single-modal noise or missing data.
[0023] This invention significantly enhances the temporal modeling capability, enabling accurate capture of dynamic force interaction processes.
[0024] The six-dimensional output prediction of this invention is superior to traditional methods in terms of accuracy, continuity, and consistency.
[0025] This invention employs multimodal inputs (EMG signals, visual key points, and contact force) and constructs a deep neural network that includes feature extraction, fusion, temporal modeling, and multi-head output to predict six-dimensional force / torque. It is particularly emphasized that this method is not merely a simple concatenation, but rather incorporates a specific architectural design featuring "independent feature extraction, a fusion module (including an attention mechanism), bidirectional LSTM temporal modeling, and independent multi-head output."
[0026] This invention addresses the problem of predicted values "converging" or "overly smoothing" due to a single MSE loss function. The proposed solution involves constructing a composite loss function with four constraints. Specific components include: Mean Squared Error Loss (MSE): ensuring basic point-to-point accuracy; Gradient Consistency Loss: constraining the first-order difference to ensure the predicted trend aligns with the true value; Correlation Loss: constraining the Pearson correlation coefficient to ensure the consistency of the overall waveform pattern; and Standard Deviation Loss (StdLoss): constraining the dispersion of the distribution to prevent predicted values from collapsing to a constant (degenerate).
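The four constraint terms can be sketched for a single output dimension as follows. The text names the quantity each term constrains but not its exact formula, so the specific forms used here (MSE of first differences, 1 − Pearson r, absolute gap between standard deviations) and the `eps` guard are assumptions for illustration.

```python
import math

def mse(y, p):
    """Point-to-point mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)

def diff(x):
    """First-order difference of a sequence."""
    return [b - a for a, b in zip(x, x[1:])]

def grad_loss(y, p):
    """MSE between first-order differences (assumed form)."""
    return mse(diff(y), diff(p))

def std(x):
    """Population standard deviation."""
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def corr_loss(y, p, eps=1e-8):
    """1 - Pearson r (assumed form); eps guards a constant sequence."""
    my, mp = sum(y) / len(y), sum(p) / len(p)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p)) / len(y)
    return 1.0 - cov / (std(y) * std(p) + eps)

def std_loss(y, p):
    """Absolute gap between standard deviations (assumed form)."""
    return abs(std(y) - std(p))

y = [math.sin(0.1 * t) for t in range(450)]
flat = [sum(y) / len(y)] * len(y)  # over-smoothed prediction: a constant
print(mse(y, flat), grad_loss(y, flat), corr_loss(y, flat), std_loss(y, flat))
```

A prediction that collapses to the mean of the target is only moderately penalized by MSE, but it also incurs the maximum correlation penalty of this form (about 1) and a standard deviation penalty equal to the full spread of the target, which is why the extra terms discourage over-smoothing.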
[0027] The fusion layer design of this invention uses a multi-head attention mechanism to perform global dependency modeling on the joint features of EMG, visual, and force signals, rather than simple splicing. The temporal layer design employs a bidirectional LSTM (BiLSTM) to extract deep temporal dependencies, followed by a multi-head attention mechanism to capture long-distance temporal correlations.
[0028] The output layer of this invention does not share parameters; instead, it uses six independent fully connected network heads. Specific features: Each head is dedicated to force/torque prediction in one dimension and has an independent parameter path, which allows differentiated gradient updates across dimensions without interference.
Description of the Drawings
[0029] Figure 1 is a schematic diagram of the structure of the present invention.
[0030] Figure 2 compares the predicted sequence of the present invention with the actual sequence.
Detailed Implementation
In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of this application with unnecessary detail.
[0031] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
[0032] It should also be understood that the terminology used in this application specification is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0033] The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0034] Many specific details are set forth in the following description in order to provide a full understanding of this application. However, this application may also be implemented in other ways different from those described herein. Those skilled in the art can make similar extensions without departing from the spirit of this application. Therefore, this application is not limited to the specific embodiments disclosed below.
[0035] The technical principle of this invention is based on a multimodal temporal deep learning framework. By fusing three types of heterogeneous sensory information (electromyography signals, visual key points, and contact force signals), it employs a hierarchical feature extraction strategy to process the spatiotemporal features of each data modality, mapping the multimodal information into a high-precision, highly consistent, continuous estimate of six-dimensional force and torque.
[0036] This invention uses dedicated feature extraction modules to mine complementary information from the different modalities, establishes multimodal associations through a feature fusion module, and captures temporal dependencies using a bidirectional LSTM and attention mechanisms. A dimension-weighted temporal prediction combined loss function is designed, adjusting the loss weight of each output dimension to focus the model on the dimensions that are difficult to learn. A multi-head output architecture allocates a dedicated parameter space to each of the six force/torque components, effectively resolving feature conflicts. The technical principle fully considers the physical characteristics and engineering constraints of six-dimensional force estimation, achieving high-precision and high-stability force estimation through multi-scale optimization and real-time inference mechanisms.
[0037] Implementation Method 1: This embodiment provides a method for accurately estimating the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning. As shown in Figure 1, the method includes the following steps: Step 1: Multimodal data input and preprocessing; Step 2: Extract features from the multimodal data from Step 1; Step 3: Fuse the features extracted in Step 2 and perform temporal modeling; Step 4: A multi-head output performs independent per-dimension output mapping on the temporal features of Step 3; Step 5: Based on the prediction output of Step 4, design a dimension-weighted time series prediction combined loss function. Specifically, the combined loss function $L_{total}$ is defined as:
$$L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$$
[0038] where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per output dimension by a dimension weight $w_d$; Step 6: Based on the dimension-weighted temporal prediction combined loss function in Step 5, achieve accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning.
[0039] Furthermore, step 1 specifically involves: Three parallel input branches are constructed to process multimodal temporal data. The EMG branch concatenates muscle activations and timestamps over 450 time steps, with a tensor structure of (450, 13). The visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450, 35, 4), to capture the 3D coordinate trajectories of 35 body keypoints over 450 time steps. The contact force branch concatenates 15-dimensional contact force data and timestamps, recording the force sensor readings at the same time steps in a (450, 16) tensor.
[0040] Furthermore, step 2 specifically involves: Three independent feature extraction modules are designed to process the different data modalities. The EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer. The vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it to 32-dimensional features. The force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability.
[0041] Furthermore, the feature fusion in step 3 specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling in step 3 specifically involves: Deep temporal dependencies of the feature sequence are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states per direction, and the second layer maintains the same structure, yielding 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
[0042] Furthermore, step 4 specifically involves: The multi-head output layer uses six independent fully connected network heads to process the shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128→64→32→1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of the heads are concatenated to form a prediction result of shape (450, 6). Step 5 is specifically as follows: The dimension-weighted time series prediction combined loss function includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. The dimension weights are assigned according to the coefficient of determination R²: dimensions with a lower fit receive greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.
[0043] Table 1. Prediction performance of the multimodal force estimation model on different experimental subjects.
[0044] Table 2 compares the proposed method with mainstream baseline models in force estimation tasks.
[0045] 1. Data Acquisition and Preprocessing: The Trigno surface electromyography (EMG) sensor was used to collect EMG signals from the main muscle groups of the upper limb, the Intel RealSense D435i depth camera was used to collect human motion visual data, the Anyskin tactile sensor was used to collect tactile data from the right thumb, and the SUNRISE six-dimensional force sensor was used to collect contact force data. Data time synchronization was achieved through a hardware trigger signal, and the data were uniformly resampled to 100 Hz.
[0046] 2. Multimodal data alignment and slicing: The collected raw data were time-aligned using the force sensor data as the reference, with cross-correlation analysis providing precise synchronization of the multimodal data. A sliding window method was employed to divide the continuous data into sample segments of 450 time steps with a 50% overlap rate, ensuring temporal continuity.
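The slicing step can be illustrated with a minimal sketch. At the stated 100 Hz rate a 450-step window spans 4.5 s, and a 50% overlap corresponds to a stride of 225 steps. The function name `sliding_windows`, the one-minute recording length, and the choice to discard an incomplete trailing window are assumptions.

```python
FS_HZ = 100  # resampled rate from the text

def sliding_windows(n_samples, window=450, overlap=0.5):
    """Return (start, end) index pairs for windows with fractional overlap.

    Assumption: a trailing segment shorter than `window` is discarded.
    """
    stride = int(window * (1.0 - overlap))
    return [(s, s + window) for s in range(0, n_samples - window + 1, stride)]

spans = sliding_windows(60 * FS_HZ)  # one minute of data (illustrative length)
print(len(spans), spans[:2], 450 / FS_HZ)
```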
[0047] 3. Feature Extraction Module Parameter Settings: The EMG module's fully connected layers are configured with a 13→64→32 layer structure, a Conv1D kernel size of 3, and a stride of 1. The vision module's fully connected layers are configured with a 140→64→32 layer structure, with the same Conv1D parameters as the EMG module. The force module's fully connected layer maps the 16-dimensional input to 32 dimensions, and the Conv1D output is a 16-dimensional feature. All modules have a Dropout rate of 0.25 and use the ReLU activation function.
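To see what a kernel size of 3 and a stride of 1 imply for the sequence length, the sketch below applies the standard Conv1D output-length formula and a plain stride-1 convolution over one channel. The padding value is not given in the text; padding of 1 is assumed here only because it preserves the 450-step length that the (450, 6) output requires.

```python
def conv1d_out_len(length, kernel=3, stride=1, padding=0):
    """Standard output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def conv1d(x, kernel, padding=0):
    """Stride-1 cross-correlation of one channel with zero padding."""
    pad = [0.0] * padding
    xp = pad + list(x) + pad
    k = len(kernel)
    return [sum(xp[i + j] * kernel[j] for j in range(k))
            for i in range(len(xp) - k + 1)]

signal = [1.0] * 450
smoothed = conv1d(signal, [1 / 3, 1 / 3, 1 / 3], padding=1)  # assumed padding
print(conv1d_out_len(450), conv1d_out_len(450, padding=1), len(smoothed))
```

Without padding the sequence would shrink to 448 steps, so some form of length-preserving padding is a reasonable guess for this architecture.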
[0048] 4. Feature fusion and temporal modeling implementation: The feature fusion module maps the 80-dimensional concatenated features to 128 dimensions through a fully connected layer and applies a 4-head attention mechanism in which each attention head is 32-dimensional. The temporal modeling module uses a two-layer bidirectional LSTM with a 64-dimensional hidden state and applies LayerNorm for normalization.
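The 4-head, 32-dimensions-per-head configuration can be sketched with plain scaled dot-product self-attention. This is a simplification: the learned query, key, value, and output projections of a real multi-head attention layer are omitted, and the short 8-step input is synthetic.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def split_heads(x, n_heads=4):
    """Split each 128-dim vector into n_heads chunks of 32 dims."""
    d = len(x[0]) // n_heads
    return [[row[h * d:(h + 1) * d] for row in x] for h in range(n_heads)]

def multi_head_self_attention(x, n_heads=4):
    """Self-attention without learned projections (simplification)."""
    heads = [attention(h, h, h) for h in split_heads(x, n_heads)]
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

# Synthetic 8-step sequence of 128-dim features (illustrative only).
x = [[math.sin(0.01 * (t + 1) * (c + 1)) for c in range(128)] for t in range(8)]
y = multi_head_self_attention(x)
print(len(y), len(y[0]))
```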
[0049] 5. Multi-head output layer configuration: Each output head employs a three-layer fully connected structure, 128→64→32→1, with an independent parameter space. The temporal features are shared among the output heads, but parameters are not shared, and differentiated learning rates can be configured for different dimensions.
[0050] Loss function and training parameters: The combined loss function weights are set as follows: λ1=1.0 (MSE), λ2=0.8 (gradient), λ3=0.5 (correlation), λ4=0.3 (standard deviation). The dimension weights are set as follows: w1=2.2, w2=2.0, w3=0.7, w4=2.0, w5=2.2, w6=3.5.
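A minimal sketch of how these two sets of weights might combine is given below. It assumes that w1 to w6 follow the order Fx, Fy, Fz, Tx, Ty, Tz, that each term is the dimension-weighted mean of its per-dimension values, and that the total is the λ-weighted sum of the four terms; none of these details is spelled out in the text, and `combined_loss` is a hypothetical name.

```python
LAMBDAS = {"mse": 1.0, "grad": 0.8, "corr": 0.5, "std": 0.3}  # from the text
DIM_WEIGHTS = [2.2, 2.0, 0.7, 2.0, 2.2, 3.5]  # assumed order: Fx, Fy, Fz, Tx, Ty, Tz

def combined_loss(per_dim_terms):
    """Combine per-dimension loss values into one scalar.

    per_dim_terms maps each term name to six per-dimension values.
    Assumption: each term is the dimension-weighted mean of its values,
    and the total is the lambda-weighted sum of the four terms.
    """
    total = 0.0
    for name, lam in LAMBDAS.items():
        vals = per_dim_terms[name]
        weighted = sum(w * v for w, v in zip(DIM_WEIGHTS, vals)) / len(vals)
        total += lam * weighted
    return total

ones = {name: [1.0] * 6 for name in LAMBDAS}
print(combined_loss(ones))
```

With every per-dimension value set to 1, the result is the mean dimension weight (2.1) times the sum of the λ values (2.6), which makes the relative influence of each setting easy to read off.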
[0051] 6. Model Training and Validation: The model is trained using the Adam optimizer, with an initial learning rate of 1e-4, a batch size of 32, and 100 training epochs.
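For reference, one Adam update for a single scalar parameter at the stated learning rate looks like the sketch below. The text gives only the learning rate, batch size, and epoch count; the β1, β2, and ε values here are the commonly used defaults and are an assumption.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step index).

    lr comes from the text; b1, b2, eps are assumed default values.
    """
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=5.0, m=m, v=v, t=1)
print(1.0 - theta)
```

On the first step the update size is approximately the learning rate regardless of the gradient's magnitude, which is one reason Adam is less sensitive to the scale differences between force and torque gradients than plain gradient descent, although it does not remove the need for the dimension weighting.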
[0052] 7. Performance Evaluation Results: On a test set containing data from eight experimental subjects, the method of this invention achieves an average R² of 0.9483 in the six-dimensional force estimation task. The R² values for each dimension are: Fx = 0.9477, Fy = 0.9782, Fz = 0.9847, Tx = 0.8928, Ty = 0.9363, and Tz = 0.9501. On the data from experimental subject 1, the method of this invention achieves an average R² of 0.9538, compared with 0.9067 for the FCN baseline and 0.9238 for the Transformer baseline, demonstrating the effectiveness and superiority of this invention.
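The reported average can be reproduced directly from the six per-dimension values, as in the short check below. The split into force and torque means is an added illustration, not a figure reported in the text.

```python
# Per-dimension R² values as reported in the text.
r2 = {"Fx": 0.9477, "Fy": 0.9782, "Fz": 0.9847,
      "Tx": 0.8928, "Ty": 0.9363, "Tz": 0.9501}

mean_r2 = sum(r2.values()) / len(r2)
force_mean = sum(r2[k] for k in ("Fx", "Fy", "Fz")) / 3
torque_mean = sum(r2[k] for k in ("Tx", "Ty", "Tz")) / 3
hardest = min(r2, key=r2.get)  # dimension with the lowest fit

print(round(mean_r2, 4), round(force_mean, 4), round(torque_mean, 4), hardest)
```

The torque dimensions remain harder to fit than the force dimensions, with Tx the lowest, which is consistent with the motivation for weighting dimensions individually.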
[0053] Through the verification of the above specific embodiments, the method of the present invention performs excellently in the six-dimensional interactive force estimation task, and provides an effective technical solution for applications such as robot compliant control and human-computer interaction safety monitoring.
[0054] Embodiment 2: This embodiment provides a system for accurately estimating the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning. The system uses the method described in Embodiment 1 and includes: a multimodal data preprocessing module, which performs multimodal data input and preprocessing; a multimodal data feature extraction module, which extracts features from the multimodal data provided by the preprocessing module; a multimodal data temporal modeling module, which fuses the features extracted by the feature extraction module and performs temporal modeling; and an output module, whose multi-head output maps the temporal features to each output dimension independently. Based on the predicted output, a dimension-weighted temporal prediction combined loss function is designed. Specifically, the combined loss function $L_{total}$ is defined as: $L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$
[0055] where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per dimension by the dimension weight $w_d$. Accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning is achieved by using this dimension-weighted temporal prediction combined loss function.
[0056] Furthermore, the multimodal data preprocessing module specifically comprises: Three parallel input branches are constructed to process the multimodal temporal data. The EMG branch concatenates muscle activation and timestamps over 450 time steps, with a tensor structure of (450, 13). The visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450, 35, 4), to capture the 3D coordinate trajectories of 35 body keypoints over 450 time steps. The contact force branch concatenates 15-dimensional contact force data and timestamps, recording the force sensor readings at the same time steps in a (450, 16) tensor.
[0057] Furthermore, the multimodal data feature extraction module specifically comprises: Three independent feature extraction modules are designed to process the different modalities. The EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer. The vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it to 32-dimensional features. The force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability.
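Two shape details in this paragraph can be checked with a short sketch: the 35×4 keypoint flattening, and the Conv1D sequence length. A padding of 1 is an assumption here; the text gives only kernel size 3 and stride 1, but some padding is needed if all 450 time steps are to reach the (450, 6) output. The function names are illustrative.

```python
def conv1d_out_len(length, kernel=3, stride=1, padding=1, dilation=1):
    """Standard output-length formula for a 1D convolution."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def flatten_keypoints(frame):
    """Flatten one frame of 35 keypoints x 4 values into a 140-element list."""
    return [value for keypoint in frame for value in keypoint]


frame = [[0.0, 0.0, 0.0, 0.0] for _ in range(35)]  # dummy (35, 4) frame
flat_len = len(flatten_keypoints(frame))

same_len = conv1d_out_len(450)                 # padding=1 keeps 450 steps
valid_len = conv1d_out_len(450, padding=0)     # no padding drops two steps
print(flat_len, same_len, valid_len)
```

Without padding, each Conv1D layer would shorten the sequence by two steps, which would not match the 450-step prediction described for the output module.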
[0058] Furthermore, the feature fusion of the multimodal data temporal modeling module specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling of the multimodal data temporal modeling module specifically involves: Deep temporal dependencies of the feature sequences are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states, and the second layer maintains the same structure to obtain 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
[0059] Furthermore, the output module specifically comprises: The multi-head output layer uses six independent fully connected network heads to process the shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128 to 64 to 32 to 1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of the heads are concatenated to form a prediction result of (450, 6). The loss function is designed as follows: The dimension-weighted temporal prediction combined loss function includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. Dimension weights are set according to the coefficient of determination R²: dimensions with a lower fit are assigned greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.
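The text does not give the rule that maps R² to dimension weights. One simple possibility, shown purely as an assumption, is to make each weight proportional to the fit deficit 1 − R² and normalize to a mean of 1. The function name and the `floor` parameter are hypothetical, and this rule is not claimed to reproduce the w1 to w6 values listed in Embodiment 1. The example input reuses the per-dimension R² values reported there.

```python
def r2_to_weights(r2_by_dim, floor=1e-3):
    """Weights proportional to (1 - R^2), normalized so their mean is 1.

    `floor` keeps a weight strictly positive when a dimension fits
    almost perfectly. Both the rule and the floor are illustrative.
    """
    deficits = {k: max(1.0 - v, floor) for k, v in r2_by_dim.items()}
    mean_deficit = sum(deficits.values()) / len(deficits)
    return {k: v / mean_deficit for k, v in deficits.items()}


example_r2 = {
    "Fx": 0.9477, "Fy": 0.9782, "Fz": 0.9847,
    "Tx": 0.8928, "Ty": 0.9363, "Tz": 0.9501,
}
weights = r2_to_weights(example_r2)
hardest = max(weights, key=weights.get)
print(hardest, {k: round(v, 2) for k, v in weights.items()})
```

With these inputs the dimension with the lowest R² (Tx) receives the largest weight, which is the behavior the paragraph describes.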
[0060] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A method for accurate estimation of six-dimensional interactive forces in human twisting motion based on multimodal temporal deep learning, characterized in that the method includes the following steps: Step 1: Multimodal data input and preprocessing; Step 2: Extract features from the multimodal data from Step 1; Step 3: Fuse the features extracted in Step 2 and perform temporal modeling; Step 4: Perform independent-dimensional output mapping on the temporal features from Step 3 to construct a multi-head output module; Step 5: Based on the prediction output of Step 4, design a dimension-weighted temporal prediction combined loss function; specifically, the combined loss function $L_{total}$ is defined as: $L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$, where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per dimension by the dimension weight $w_d$; Step 6: Based on the dimension-weighted temporal prediction combined loss function in Step 5, achieve accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning.
2. The method for accurately estimating the six-dimensional interactive force of human twisting motion according to claim 1, characterized in that, Specifically, step 1 is as follows: Three parallel input branches are constructed to process multimodal temporal data respectively; the EMG branch concatenates muscle activation and timestamps at 450 time steps, with a tensor structure of (450,13); the visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450,35,4), to capture the three-dimensional coordinate trajectory of 35 body keypoints at 450 time steps; the contact force data branch concatenates 15-dimensional contact force data and timestamps, and records the force sensor readings at the same time step using a tensor of (450,16).
3. The method for accurately estimating the six-dimensional interactive force of human twisting motion according to claim 1, characterized in that Step 2 specifically involves: Three independent feature extraction modules are designed to process the different modalities respectively; the EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer; the vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it into 32-dimensional features; the force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability.
4. The method for accurately estimating the six-dimensional interactive force of human twisting motion according to claim 1, characterized in that the feature fusion in Step 3 specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling in Step 3 specifically involves: Deep temporal dependencies of the feature sequences are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states, and the second layer maintains the same structure to obtain 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
5. The method for accurately estimating the six-dimensional interactive force of human body twisting motion according to claim 1, characterized in that, Specifically, step 4 is as follows: The multi-head output layer uses six independent fully connected network heads to process shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128 to 64 to 32 to 1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of each head are concatenated to form a prediction result of (450, 6).
6. The method for accurately estimating the six-dimensional interactive force of human twisting motion according to claim 1, characterized in that Step 5 specifically involves: The dimension-weighted temporal prediction combined loss function includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. Dimension weights are set according to the coefficient of determination R²: dimensions with a lower fit are assigned greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.
7. A precise estimation system for six-dimensional interactive forces in human twisting motion based on multimodal temporal deep learning, characterized in that the system uses the six-dimensional interactive force estimation method for human twisting motion based on multimodal temporal deep learning as described in any one of claims 1-6, and the system includes: a multimodal data preprocessing module, which performs multimodal data input and preprocessing; a multimodal data feature extraction module, which extracts features from the multimodal data provided by the multimodal data preprocessing module; a multimodal data temporal modeling module, which fuses the features extracted by the multimodal data feature extraction module and performs temporal modeling; and an output module, whose multi-head output maps the temporal features to each output dimension independently. Based on the predicted output, a dimension-weighted temporal prediction combined loss function is designed; specifically, the combined loss function $L_{total}$ is defined as: $L_{total} = \lambda_1 L_{MSE} + \lambda_2 L_{grad} + \lambda_3 L_{corr} + \lambda_4 L_{std}$, where $L_{MSE}$ is the dimension-weighted mean squared error loss, $L_{grad}$ is the dimension-weighted gradient consistency loss, $L_{corr}$ is the dimension-weighted correlation loss, and $L_{std}$ is the dimension-weighted standard deviation loss; each loss component is weighted per dimension by the dimension weight $w_d$. Accurate estimation of the six-dimensional interactive force of human twisting motion based on multimodal temporal deep learning is achieved by using this dimension-weighted temporal prediction combined loss function.
8. The six-dimensional interactive force estimation system for human body twisting motion according to claim 7, characterized in that, The multimodal data preprocessing module specifically comprises: Three parallel input branches are constructed to process multimodal temporal data respectively; the EMG branch concatenates muscle activation and timestamps at 450 time steps, with a tensor structure of (450,13); the visual keypoint branch concatenates keypoints and timestamps, with a tensor structure of (450,35,4), to capture the three-dimensional coordinate trajectory of 35 body keypoints at 450 time steps; the contact force data branch concatenates 15-dimensional contact force data and timestamps, and records the force sensor readings at the same time step using a tensor of (450,16).
9. The six-dimensional interactive force estimation system for human twisting motion according to claim 7, characterized in that the multimodal data feature extraction module is specifically as follows: Three independent feature extraction modules are designed to process the different modalities respectively; the EMG module expands the 13-dimensional EMG input to 64 dimensions through a fully connected layer, and then refines it to 32-dimensional features through a Conv1D convolutional layer; the vision module flattens the 35×4 visual keypoint data into a 140-dimensional vector, performs dimensionality reduction and feature encoding through a sequence of fully connected layers, and finally compresses it into 32-dimensional features; the force module expands the 16-dimensional contact force input to 32 dimensions through a fully connected layer, and then compresses it to a 16-dimensional feature output through a Conv1D convolution. Each module is equipped with ReLU activation, layer normalization, and Dropout regularization to ensure training stability. The feature fusion of the multimodal data temporal modeling module specifically involves: The 32-dimensional output of the EMG module, the 32-dimensional output of the vision module, and the 16-dimensional output of the force module are concatenated to form an 80-dimensional joint feature vector. Nonlinear transformation and feature enhancement are performed through two fully connected layers, and residual connections are introduced to promote gradient flow. A multi-head attention mechanism is used to model the global dependencies between features. Finally, the training process is stabilized through residual connections and layer normalization. The temporal modeling of the multimodal data temporal modeling module specifically involves: Deep temporal dependencies of the feature sequences are extracted by a two-layer bidirectional LSTM network. The first layer outputs 64-dimensional hidden states, and the second layer maintains the same structure to obtain 128-dimensional temporal features. A multi-head attention mechanism is introduced to capture the dynamic correlation between global time steps, and the vanishing gradient problem is alleviated by residual connections and layer normalization.
10. The six-dimensional interactive force estimation system for human twisting motion according to claim 7, characterized in that the output module is specifically as follows: The multi-head output layer uses six independent fully connected network heads to process the shared temporal features in parallel. Each head is responsible for predicting one output dimension through a three-layer structure of 128 to 64 to 32 to 1. LayerNorm, Dropout, and ReLU are used to ensure stable training. Finally, the outputs of the heads are concatenated to form a prediction result of (450, 6). The dimension-weighted temporal prediction combined loss function is specifically as follows: It includes four complementary constraint terms: mean squared error loss, gradient consistency loss, correlation loss, and standard deviation loss. Dimension weights are set according to the coefficient of determination R²: dimensions with a lower fit are assigned greater weights, forcing the network to focus more on the dimensions that are difficult to learn during training.