Hand pose reconstruction method based on millimeter wave signals

By extracting hand features using millimeter-wave radar and deep learning models, the problems of illumination dependence and privacy leakage in existing technologies are solved, achieving high-precision non-wearable hand pose reconstruction, which is applicable to multiple environmental conditions.

CN117496596B · Active · Publication Date: 2026-06-23 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2023-11-10
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing hand recognition technologies rely on lighting conditions, are costly, provide a poor user experience, and are prone to privacy leaks, and cannot effectively capture continuous hand movements.

Method used

A millimeter-wave radar senses the hand posture, and the spatial and temporal features of the hand are extracted by a two-stage attention hourglass network and a long short-term memory network, respectively. Combined with a general parametric hand model, a detailed 3D hand mesh is reconstructed, achieving robust hand posture reconstruction without requiring the user to wear any device.

Benefits of technology

It achieves high-precision hand posture reconstruction under different environmental conditions, with 95.1% of hand joints correctly predicted (3D-PCK), avoids privacy leaks, and broadens the scope of applications.

✦ Generated by Eureka AI based on patent content.


Abstract

A hand posture reconstruction method based on millimeter-wave signals. A millimeter-wave radar senses the posture of the user's hand and collects hand motion data. The collected signal is preprocessed, and then mmSpaceNet, an hourglass network based on two-stage attention, and a long short-term memory network (LSTM) extract the spatial and temporal features of the hand, respectively, from the preprocessed signal. Regression is then carried out in three-dimensional space to generate a three-dimensional hand skeleton in real time. Using a general parametric hand model (MANO), a 3D hand mesh with a more detailed surface is reconstructed, which completes the three-dimensional reconstruction of the user's hand posture. The millimeter-wave radar used is low-cost, need not be worn by the user, does not depend on environmental conditions, and does not cause privacy leakage; the method achieves a robust, detailed representation of the three-dimensional hand mesh and widens the range of applications.

Description

Technical Field

[0001] This invention relates to a technology in the field of millimeter-wave applications, specifically a hand posture reconstruction method based on millimeter-wave signals.

Background Technology

[0002] Current hand recognition technologies extract features and generate 3D models using network models, but they rely on lighting conditions and may lead to privacy leaks. Furthermore, they model the hand from single images and do not adequately capture continuous motion, so they often fail in complex situations.

Summary of the Invention

[0003] This invention addresses the shortcomings of existing technologies: reconstruction based on data gloves or wristbands is costly and gives a poor user experience, while reconstruction based on computer vision is sensitive to ambient lighting and may leak user privacy. It proposes a hand pose reconstruction method based on millimeter-wave signals that employs a low-cost millimeter-wave radar, requires nothing to be worn by the user, is independent of environmental conditions, and avoids privacy issues. The method achieves a robust and detailed representation of the hand's 3D mesh, thus broadening its application scope.

[0004] This invention is achieved through the following technical solution:

[0005] This invention relates to a hand posture reconstruction method based on millimeter-wave signals. The method uses millimeter-wave radar to sense the user's hand posture, collects hand motion data, preprocesses the acquired signals, and then uses a two-stage attention-based hourglass network (mmSpaceNet) and a long short-term memory network (LSTM) to extract the spatial and temporal features of the hand from the preprocessed signals for regression processing in three-dimensional space, generating a three-dimensional hand skeleton in real time. Finally, a 3D hand mesh with more detailed surfaces is reconstructed using a universal parametric hand model (MANO), achieving three-dimensional reconstruction of the user's hand posture.

[0006] This invention relates to a hand posture reconstruction system based on millimeter-wave signals to implement the above-mentioned method, comprising: a millimeter-wave signal preprocessing unit, a hand joint regression unit, and a hand mesh construction unit, wherein: the millimeter-wave signal preprocessing unit performs format reconstruction and filtering and noise reduction on the millimeter-wave signals acquired by radar to obtain a high-dimensional matrix containing hand information, called Radar Cube; the hand joint regression unit uses Radar Cube as input and uses a designed deep learning model to regress the three-dimensional coordinates of 21 hand joints from it; and the hand mesh construction unit reconstructs a 3D hand mesh with a more detailed surface based on the regressed hand joints using a universal hand parametric model (MANO).

[0007] Technical effect

[0008] This invention enhances the ability of deep learning models to extract hand features by employing a two-stage attention mechanism and focusing on the analysis of hand dynamics loss, combined with the characteristics of millimeter-wave signals. Under the supervision of hand dynamics loss, the three-dimensional coordinates of 21 hand joints are obtained through regression with an accuracy of 95.1%.

Attached Figure Description

[0009] Figure 1 is a flowchart of the present invention;

[0010] Figure 2 is a schematic diagram of the deep learning model designed for this invention;

[0011] Figure 3 is a schematic diagram of the two-stage attention mechanism designed for this invention;

[0012] Figure 4 is a schematic diagram illustrating two geometric relationships of finger joints;

[0013] Figure 5 shows schematic diagrams of hand skeletons and hand meshes for several different hand gestures generated by this invention;

[0014] Figure 6 is a schematic diagram illustrating the average per-joint error of the 10 users in the experiment;

[0015] Figure 7 is a diagram illustrating the accuracy of hand joint prediction for each user in the experiment.

Detailed Implementation

[0016] As shown in Figure 1, this embodiment relates to a hand pose reconstruction method based on millimeter-wave signals, including signal preprocessing, hand joint regression, and hand mesh construction stages, specifically:

[0017] Step 1) Use Frequency Modulated Continuous Wave (FMCW) technology to measure the distance, velocity, and angle of the target object using millimeter-wave signals, obtaining a high-dimensional matrix containing the range spectrum, Doppler spectrum, azimuth spectrum, and elevation spectrum, specifically including:

[0018] 1.1) A commercially available millimeter-wave radar transmits a signal whose frequency increases linearly, called a chirp, through its transmitting antenna; the signal is reflected from objects in the environment and captured by the radar's receiving antenna; the transmitted and received signals are mixed in the radar's mixer to generate an intermediate frequency (IF) signal. Here $f_0$ is the starting frequency of the linear frequency modulation, $B$ is the signal bandwidth, $T_c$ is the duration of the linear frequency modulation, $A_r$ is the amplitude coefficient for attenuation of the millimeter-wave signal, and $\tau(r,c)$ is the delay of the received signal relative to the transmitted signal, which is determined by the propagation speed $c$ of the millimeter-wave signal and the distance $r$ between the object and the radar; $f$ is the frequency of the IF signal.
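For reference, the standard textbook FMCW signal model that matches the symbols defined above can be written as follows. This is a sketch of the usual formulation and may differ in detail from the patent's own expressions; the names $s_t$, $s_r$, and $s_{IF}$ are assumed here for illustration, and the relation $\tau = 2r/c$ assumes a round trip to a single reflector.

```latex
\begin{aligned}
s_t(t) &= \exp\!\left[j 2\pi\left(f_0 t + \frac{B}{2T_c}\,t^2\right)\right], \\
s_r(t) &= A_r\, s_t(t-\tau), \qquad \tau = \frac{2r}{c}, \\
s_{IF}(t) &= s_t(t)\, s_r^{*}(t) \approx A_r \exp\!\left[j 2\pi\left(\frac{B\tau}{T_c}\,t + f_0\tau\right)\right], \\
f &= \frac{B\tau}{T_c} = \frac{2Br}{c\,T_c}.
\end{aligned}
```

The approximation drops the small term $B\tau^2/(2T_c)$; the last line shows why the IF frequency is proportional to the distance $r$.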

[0019] 1.2) Eliminating environmental interference from the received millimeter-wave signal: The original millimeter-wave data is filtered by an 8th-order bandpass Butterworth filter, and the hand-related signals are retained. Then, the range spectrum describing the target in the range dimension is obtained by performing a range FFT on the millimeter-wave signal.
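The range FFT in this step can be sketched with NumPy. The example below is an illustration under stated assumptions, not the patent's implementation: it simulates one noise-free chirp for a single reflector, assumes complex IF samples that span the whole chirp, and omits the Butterworth filtering stage. The function names `simulate_if_signal` and `range_spectrum` are hypothetical.

```python
import numpy as np

C = 299_792_458.0  # propagation speed of the millimeter-wave signal (m/s)

def simulate_if_signal(r, bandwidth, chirp_time, n_samples):
    """Simulate complex IF samples of one chirp for a single reflector at range r."""
    tau = 2.0 * r / C                      # round-trip delay
    f_if = bandwidth * tau / chirp_time    # IF (beat) frequency, f = B * tau / T_c
    t = np.arange(n_samples) * (chirp_time / n_samples)
    return np.exp(2j * np.pi * f_if * t)

def range_spectrum(if_samples, bandwidth):
    """Return (ranges, magnitude) obtained from a range FFT over one chirp."""
    n = len(if_samples)
    # A Hann window reduces spectral leakage before the FFT.
    magnitude = np.abs(np.fft.fft(if_samples * np.hanning(n)))
    # Assumption: samples span the full chirp, so bin k maps to range k * c / (2B).
    ranges = np.arange(n) * C / (2.0 * bandwidth)
    return ranges, magnitude

B, T_C, N = 4e9, 80e-6, 64   # 4 GHz sweep, 80 us chirp, 64 samples per chirp
ranges, mag = range_spectrum(simulate_if_signal(0.30, B, T_C, N), B)
estimated_range = ranges[np.argmax(mag)]
```

With these parameters each range bin is about 3.75 cm wide, so the peak of `mag` lands within one bin of the simulated 0.30 m target.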

[0020] 1.3) To calculate the velocity $v$ of an object, the FMCW radar emits two chirp signals separated by an interval $T_c$. After the range FFT, the two chirps peak at the same location in the range spectrum but differ in phase; the phase difference $\Delta\xi$, which corresponds to the object's motion during $T_c$, is used together with the signal wavelength $\lambda$ to calculate the object's velocity. After performing the Doppler FFT, the Doppler spectrum is obtained.
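As a worked illustration of this step, the sketch below uses the standard FMCW relation $v = \lambda\,\Delta\xi / (4\pi T_c)$, which is assumed here because the patent text does not spell out its formula. The 79 GHz carrier (the middle of the 77 to 81 GHz band mentioned in the experiments) and the function names are likewise assumptions.

```python
import math

def velocity_from_phase(delta_xi, wavelength, chirp_interval):
    """Radial velocity from the phase difference between two consecutive chirps.

    Assumes the standard relation v = lambda * delta_xi / (4 * pi * T_c).
    """
    return wavelength * delta_xi / (4.0 * math.pi * chirp_interval)

def max_unambiguous_velocity(wavelength, chirp_interval):
    """Largest |v| measurable before the phase difference wraps past pi."""
    return wavelength / (4.0 * chirp_interval)

wavelength = 299_792_458.0 / 79e9   # assumed carrier: 79 GHz
t_c = 80e-6                         # chirp interval of 80 us
v = velocity_from_phase(math.pi / 2, wavelength, t_c)
v_max = max_unambiguous_velocity(wavelength, t_c)
```

A phase difference of $\pi/2$ therefore corresponds to half of the maximum unambiguous velocity.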

[0021] 1.4) If two receiving antennas are set up, the distance difference Δd between the object and the two receiving antennas will produce a phase difference Δξ at the peak of the FFT. According to the geometric relationship, Δd is expressed as Δd=lsin(θ), where: l is the distance between the two receiving antennas and θ is the angle of arrival to be calculated.
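The geometric relation $\Delta d = l\sin(\theta)$ can be turned into an angle estimate as sketched below. The sketch assumes the usual phase relation $\Delta\xi = 2\pi\,\Delta d/\lambda$, which the text does not state explicitly; the function name and the half-wavelength spacing in the example are also assumptions.

```python
import math

def angle_of_arrival(delta_xi, wavelength, antenna_spacing):
    """Angle of arrival (radians) from the phase difference between two antennas.

    Assumes delta_xi = 2 * pi * delta_d / wavelength with delta_d = l * sin(theta).
    """
    sin_theta = wavelength * delta_xi / (2.0 * math.pi * antenna_spacing)
    if abs(sin_theta) > 1.0:
        raise ValueError("phase difference is inconsistent with the antenna spacing")
    return math.asin(sin_theta)

wavelength = 299_792_458.0 / 79e9        # assumed 79 GHz carrier
spacing = wavelength / 2.0               # assumed half-wavelength antenna spacing
theta = angle_of_arrival(math.pi / 2, wavelength, spacing)
theta_deg = math.degrees(theta)
```

With half-wavelength spacing, a phase difference of $\pi/2$ gives $\sin\theta = 0.5$, i.e. an arrival angle of 30 degrees.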

[0022] To locate objects in space, millimeter-wave radar uses TDM-MIMO technology, which can calculate two angles of arrival (AOA): azimuth and elevation.

[0023] The high-dimensional matrix (Radar Cube) contains information on the range, angle, and speed of the sensed hand.

[0024] Step 2) As shown in Figure 2, a deep learning model is constructed and trained to extract the spatial and temporal features of the hand; the model then takes the high-dimensional matrix as input and regresses 21 hand joints.

[0025] The deep learning model includes: a spatial feature extraction network (mmSpaceNet) and a temporal feature extraction network LSTM connected in sequence. mmSpaceNet is an hourglass network composed of attention residual blocks, which combines shallow and deep features to represent the hand at different granularities in space. The temporal feature extraction network LSTM flattens the global feature map obtained by the spatial feature extraction network into a feature vector. Then, through a hybrid loss function, under supervised learning conditions, regression is performed to obtain a three-dimensional hand skeleton containing 21 hand joints.

[0026] The attention residual block consists of two branches: one branch uses a 1×1 convolutional layer to adjust the number of channels without changing the feature map size, thus preserving the features at the current level. The other branch first uses a convolutional layer for downsampling to extract high-dimensional and fine-grained features, and then uses a deconvolutional layer for upsampling to obtain a high-resolution feature map.

[0027] All attention residual blocks employ a two-stage channel attention mechanism and a spatial attention mechanism, enhancing mmSpaceNet's ability to extract key features. The two-stage channel attention mechanism combines the traditional channel attention mechanism with the characteristics of millimeter-wave signals, as shown in Figure 3. Each Radar Cube can be regarded as a concatenation of multiple 3D segments of the form $X \in \mathbb{R}^{V \times D \times A}$, where $V$, $D$, and $A$ represent the velocity, distance, and angle channels, respectively. For each 3D segment $X$, a first-stage channel attention mechanism is applied, which can be represented as $a = \sigma(\mathrm{Conv1}(\mathrm{TGAP}(X) + \mathrm{TGMP}(X)))$, $Y = aX$, where $\sigma$ is the sigmoid activation function, Conv1 is a block with two convolutional layers, TGAP is 3D global average pooling, and TGMP is 3D global max pooling; this transforms each 3D segment into $Y$. Then a second-stage attention mechanism is applied, specifically $b = \sigma(\mathrm{FC}([\mathrm{GAP}(Y), \mathrm{GMP}(Y)]))$, $Z = bY$, where FC is a fully connected layer used to encode all channel features into a weight vector, GAP is 3D global average pooling, and GMP is 3D global max pooling; this transforms the original input Radar Cube into $Z$. Finally, a traditional 3D spatial attention mechanism is applied to $Z$.
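The two attention stages can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not settle: the Radar Cube is stored as an array of shape (S, V, D, A) with S segments, the first stage pools each segment over its distance and angle axes to weight the velocity channels, Conv1 is replaced by two small dense layers, the second stage produces one weight per segment, and all weights are random rather than trained. The function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_stage_channel_attention(cube, w1, w2, w_fc):
    """Apply two-stage channel attention to a radar cube of shape (S, V, D, A)."""
    # Stage 1: a = sigmoid(Conv1(TGAP(X) + TGMP(X))), Y = a * X, per segment X.
    pooled = cube.mean(axis=(2, 3)) + cube.max(axis=(2, 3))       # (S, V)
    hidden = np.maximum(pooled @ w1, 0.0)                         # dense stand-in for Conv1
    a = sigmoid(hidden @ w2)                                      # (S, V)
    y = cube * a[:, :, None, None]
    # Stage 2: b = sigmoid(FC([GAP(Y), GMP(Y)])), Z = b * Y, across segments.
    gap = y.mean(axis=(1, 2, 3))                                  # (S,)
    gmp = y.max(axis=(1, 2, 3))                                   # (S,)
    b = sigmoid(np.concatenate([gap, gmp]) @ w_fc)                # (S,)
    return y * b[:, None, None, None]

rng = np.random.default_rng(0)
S, V, D, A, H = 4, 8, 16, 6, 5
cube = rng.normal(size=(S, V, D, A))
z = two_stage_channel_attention(
    cube,
    w1=rng.normal(size=(V, H)),
    w2=rng.normal(size=(H, V)),
    w_fc=rng.normal(size=(2 * S, S)),
)
```

Because every attention weight lies between 0 and 1, the output keeps the input shape and each element of `z` is a scaled-down copy of the corresponding element of `cube`.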

[0028] For temporal feature extraction, a feature vector is generated for each input, and the sequence formed by all of these feature vectors serves as the input of the LSTM.

[0029] After spatial and temporal feature extraction, the 3D coordinates of the 21 hand joints are obtained through regression under the supervision of a hybrid loss function. The hybrid loss function consists of two parts: the first is a Euclidean distance constraint on the 3D coordinates, and the second is the hand dynamics loss $L_{kine}$. The design of $L_{kine}$ is based on the observation that the hand is a piecewise-rigid object, and the relationships between finger joints can be abstracted into two geometric categories: collinearity and coplanarity. Each phalanx is a rigid body, and the phalanges are hinged together by joints, enabling the hand to perform various movements. Let A, B, C, and D denote the three joints and the fingertip of one finger, where A is the root of the finger; each has its own three-dimensional coordinates. When the finger is extended, the four points are collinear. When the finger is bent, the four points are no longer collinear but remain coplanar. Figure 4 shows the two cases. Constraints are applied for collinearity and coplanarity respectively, i.e., $L_{kine} = \lambda L_{cop} + (1-\lambda) L_{col}$, where $L_{cop}$ is the coplanarity loss, $L_{col}$ is the collinearity loss, and $\lambda$ is 0 in the collinear case and 1 in the coplanar case. Based on the geometric relationships, $L_{cop}$ is expressed as $L_{cop} = \vec{AB}\cdot e_n + \vec{BC}\cdot e_n + \vec{CD}\cdot e_n$, where $e_n$ is the plane normal vector. $L_{col}$ is expressed in terms of $e_d$, the direction vector of the finger bone, and $p$, a value very close to 1 that is set to 0.99 in this invention.
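The finger constraints can be illustrated with a short NumPy sketch. Two parts are assumptions rather than the patent's formulas: the coplanarity term takes absolute values of the dot products and estimates $e_n$ from the first two bones, and the collinearity term, whose expression is not given in the text, is replaced by a hypothetical hinge penalty that requires the cosine between each bone and $e_d$ to reach $p = 0.99$.

```python
import numpy as np

def _unit(v, eps=1e-8):
    return v / (np.linalg.norm(v) + eps)

def coplanarity_loss(a, b, c, d):
    """Sum of |bone . e_n| for bones AB, BC, CD (absolute values are an assumption)."""
    ab, bc, cd = b - a, c - b, d - c
    normal = np.cross(ab, bc)
    if np.linalg.norm(normal) < 1e-8:
        return 0.0  # A, B, C collinear: the four points are always coplanar
    e_n = _unit(normal)
    return float(abs(ab @ e_n) + abs(bc @ e_n) + abs(cd @ e_n))

def collinearity_loss(a, b, c, d, p=0.99):
    """Hypothetical hinge penalty: each bone should align with e_d up to cosine p."""
    e_d = _unit(d - a)
    bones = (b - a, c - b, d - c)
    return float(sum(max(0.0, p - float(_unit(bone) @ e_d)) for bone in bones))

def kinematic_loss(a, b, c, d, lam):
    """L_kine = lam * L_cop + (1 - lam) * L_col, lam = 0 (collinear) or 1 (coplanar)."""
    return lam * coplanarity_loss(a, b, c, d) + (1 - lam) * collinearity_loss(a, b, c, d)

# Straight finger along x, bent finger in the z = 0 plane, and a non-planar finger.
straight = [np.array([float(i), 0.0, 0.0]) for i in range(4)]
bent = [np.array(p_) for p_ in ([0.0, 0, 0], [1.0, 0, 0], [2.0, 1, 0], [2.0, 2, 0])]
twisted = bent[:3] + [np.array([2.0, 2.0, 1.0])]

loss_straight = kinematic_loss(*straight, lam=0)
loss_bent = kinematic_loss(*bent, lam=1)
loss_twisted = kinematic_loss(*twisted, lam=1)
```

The straight and planar-bent fingers incur essentially no penalty, while the finger whose tip leaves the plane is penalized.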

[0030] Step 3) Generate the pose parameters $\theta$ and shape parameters $\beta$ of the hand required by the universal parametric hand model (MANO) from the 3D hand skeleton, and then generate a 3D hand mesh with a detailed hand representation through MANO. Specifically, using the 3D hand skeleton as input, three layer-normalized linear layers output the shape parameter $\beta$, and a layer-normalized linear layer is then used to infer the pose parameter $\theta$. Further, the finger bone orientation matrix $D_p \in \mathbb{R}^{20\times 3}$ is calculated from the three-dimensional coordinates of the 21 hand joints $J_{3D}$. Then $D_p$ and $J_{3D}$ are each flattened into vectors and concatenated to predict the pose parameters, yielding the pose parameter $\theta$.
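The construction of the pose-prediction input can be sketched as follows. The joint ordering is an assumption (index 0 is the wrist, followed by five fingers of four joints each), as are the use of unit-length bone directions for $D_p$ and the names `PARENTS`, `bone_orientation_matrix`, and `pose_regressor_input`; the random joints stand in for a regressed skeleton.

```python
import numpy as np

# Assumed 21-joint layout: 0 = wrist, then 5 fingers x 4 joints (root to tip).
CHILDREN = np.arange(1, 21)
PARENTS = np.array([0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11,
                    0, 13, 14, 15, 0, 17, 18, 19])

def bone_orientation_matrix(joints):
    """Return D_p with shape (20, 3): unit direction of each bone, parent to child."""
    bones = joints[CHILDREN] - joints[PARENTS]
    norms = np.linalg.norm(bones, axis=1, keepdims=True)
    return bones / np.maximum(norms, 1e-8)

def pose_regressor_input(joints):
    """Flatten D_p (20x3) and J_3D (21x3) and concatenate them into one vector."""
    d_p = bone_orientation_matrix(joints)
    return np.concatenate([d_p.ravel(), joints.ravel()])

rng = np.random.default_rng(0)
joints_3d = rng.normal(size=(21, 3))      # placeholder for a regressed skeleton
d_p = bone_orientation_matrix(joints_3d)
features = pose_regressor_input(joints_3d)
```

Under these assumptions the concatenated vector has 60 + 63 = 123 entries, which would be the input size of the pose-prediction layers.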

[0031] The aforementioned pose parameter prediction refers to: outputting the rotation quaternions $Q \in \mathbb{R}^{21\times 4}$ of all joints using a three-layer linear neural network, and then converting the rotation quaternions $Q$ into the corresponding axis-angle representation, which is the predicted pose parameter $\theta$.
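The quaternion to axis-angle conversion can be sketched in NumPy as below. The (w, x, y, z) component order and the function name are assumptions, since the text does not specify a convention; the conversion itself is the standard one.

```python
import numpy as np

def quaternion_to_axis_angle(q):
    """Convert quaternions (..., 4) in (w, x, y, z) order to axis-angle vectors (..., 3)."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)   # normalize to unit length
    q = np.where(q[..., :1] < 0, -q, q)                 # q and -q are the same rotation
    w, xyz = q[..., :1], q[..., 1:]
    sin_half = np.linalg.norm(xyz, axis=-1, keepdims=True)
    angle = 2.0 * np.arctan2(sin_half, w)               # rotation angle in [0, pi]
    safe = np.maximum(sin_half, 1e-12)                  # avoid division by zero
    return np.where(sin_half > 1e-12, xyz / safe * angle, 0.0)

# Example: 21 joints, all identity except joint 1 rotated 90 degrees about x.
Q = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (21, 1))
Q[1] = [np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0]
theta = quaternion_to_axis_angle(Q)
```

Each row of `theta` is the rotation axis scaled by the rotation angle, so identity quaternions map to zero vectors.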

[0032] In practical experiments, a commercial off-the-shelf (COTS) millimeter-wave radar manufactured by Texas Instruments was used and connected to a data acquisition card (TI DCA1000EVM). The radar uses three transmit antennas and four receive antennas, employing TDM-MIMO technology to transmit chirp signals. The signal frequency range is 77 GHz to 81 GHz, and the duration of each chirp signal is 80 µs. Each chirp is sampled 64 times. TI's mmWave Studio software was used to interact with the millimeter-wave radar and acquire data. The deep learning model was trained using an NVIDIA RTX 3090 Ti graphics card.
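A few derived quantities follow from these radar settings. The sketch below uses standard FMCW and MIMO relations and assumes that each chirp sweeps the full 77 to 81 GHz band; the variable names are illustrative only.

```python
C = 299_792_458.0            # speed of light (m/s)

F_START, F_STOP = 77e9, 81e9  # sweep limits from the experiment description
N_TX, N_RX = 3, 4             # transmit and receive antennas

# Assumption: every chirp sweeps the whole band, so B = 4 GHz.
bandwidth = F_STOP - F_START
range_resolution = C / (2.0 * bandwidth)           # standard FMCW range resolution
center_wavelength = C / ((F_START + F_STOP) / 2)   # wavelength at 79 GHz
virtual_channels = N_TX * N_RX                     # TDM-MIMO virtual array size
```

Under this assumption the range resolution is roughly 3.75 cm, the wavelength is about 3.8 mm, and the TDM-MIMO configuration yields 12 virtual receive channels.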

[0033] Ten users were recruited for the experiment. For each user, static and dynamic gestures were collected. Static gestures required the user's hand to remain in a specific pose and stationary. Static gestures included the "T-pose" (standard template gesture), "fist," "OK," "Good," "Gun," "One," "Two," "Three," "Four," and "Six." Dynamic gestures involved users continuously and naturally transitioning between different gestures. 10,000 frames of static gestures and 40,000 frames of dynamic gestures were collected for each user. The experiment was conducted in various environments, including a classroom, a hallway, and a playground.

[0034] Figure 5 shows examples of hand skeletons and hand meshes for different hand gestures. It can be seen that the 21 hand joints accurately depict the corresponding hand postures. Furthermore, the 3D hand mesh presents a realistic 3D animation consistent with the user's hand posture. A quantitative evaluation is then performed by calculating the MPJPE and 3D-PCK of the 21 hand joints, where MPJPE is the mean per-joint position error and 3D-PCK is the proportion of joints whose prediction error falls within a given threshold.
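The two metrics can be computed as sketched below. The 40 mm threshold and the synthetic data are assumptions for illustration; the text does not state which threshold was used for 3D-PCK.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over all joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck_3d(pred, gt, threshold):
    """Fraction of joints whose Euclidean error is within the threshold."""
    return float((np.linalg.norm(pred - gt, axis=-1) <= threshold).mean())

# Synthetic example in millimeters: 2 frames x 21 joints.
gt = np.zeros((2, 21, 3))
pred = gt.copy()
pred[..., 0] += 10.0          # every joint is off by 10 mm along x
pred[0, 5, 0] = 50.0          # one outlier joint is off by 50 mm

error_mm = mpjpe(pred, gt)
pck = pck_3d(pred, gt, threshold=40.0)   # hypothetical 40 mm threshold
```

In this example 41 of the 42 joints fall within the threshold, so the 3D-PCK is 41/42 and the MPJPE is (41 × 10 + 50) / 42 mm.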

[0035] Figure 6 and Figure 7 show the MPJPE and 3D-PCK values for the 10 users. Overall, the average MPJPE of this invention is 18.3 mm and the 3D-PCK is 95.1%, with mean standard deviations of 2.96 mm and 1.17%, respectively. The results indicate that this invention can accurately regress 21 hand joints with a low mean error. Looking at individual users, the differences in MPJPE and 3D-PCK between users are not significant. For example, the differences in MPJPE and 3D-PCK between user 2 (with the lowest MPJPE and the highest 3D-PCK) and user 6 (with the highest MPJPE and the lowest 3D-PCK) are only 2.9 mm and 3.3%, respectively. This demonstrates that the regression of hand joints by this invention is effective and robust across different individuals.

[0036] Compared with methods based on traditional wireless signals, which can only classify hand postures without further estimating hand joints or reconstructing more detailed 3D hand meshes, the hand joints obtained by this invention are comparable to those of traditional computer vision methods and wearable device methods, and the invention has greater applicability.

[0037] The above-described specific implementations can be partially adjusted by those skilled in the art in different ways without departing from the principles and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited to the above-described specific implementations. All implementation schemes within the scope of the claims are bound by the present invention.

Claims

1. A hand posture reconstruction method based on millimeter-wave signals, characterized in that the method uses a millimeter-wave radar to sense the user's hand posture and collect hand motion data; the collected signals are preprocessed, and then spatial and temporal features of the hand are extracted from the preprocessed signals using an hourglass network based on two-stage attention and a long short-term memory network, respectively; these features are used for regression processing in three-dimensional space to generate a three-dimensional hand skeleton in real time; finally, a 3D hand mesh with a more detailed surface is reconstructed using a general parametric hand model, thus achieving three-dimensional reconstruction of the user's hand posture. The deep learning model includes a spatial feature extraction network (mmSpaceNet) and a temporal feature extraction network (LSTM) connected in sequence, wherein mmSpaceNet is an hourglass network composed of attention residual blocks. A two-stage channel attention mechanism and a spatial attention mechanism are employed in all attention residual blocks to enhance mmSpaceNet's ability to extract key features, specifically:
a) each Radar Cube is regarded as a concatenation of multiple three-dimensional segments of the form $X \in \mathbb{R}^{V \times D \times A}$, where $V$, $D$, and $A$ are the velocity, distance, and angle channels, respectively; for each 3D segment $X$, a first-stage channel attention mechanism is applied: $a = \sigma(\mathrm{Conv1}(\mathrm{TGAP}(X) + \mathrm{TGMP}(X)))$, $Y = aX$, where $\sigma$ is the sigmoid activation function, Conv1 is a block with two convolutional layers, TGAP is three-dimensional global average pooling, and TGMP is three-dimensional global max pooling;
b) a second-stage attention mechanism is further applied, specifically: $b = \sigma(\mathrm{FC}([\mathrm{GAP}(Y), \mathrm{GMP}(Y)]))$, $Z = bY$, where FC is a fully connected layer used to encode all channel features into a weight vector, GAP is three-dimensional global average pooling, and GMP is three-dimensional global max pooling;
c) after the original input Radar Cube has been transformed into $Z$, a traditional 3D spatial attention mechanism is applied to $Z$.
The aforementioned hand posture reconstruction method specifically includes:
Step 1) using frequency-modulated continuous wave technology to measure the distance, velocity, and angle of the target object with millimeter-wave signals, obtaining a high-dimensional matrix containing the range spectrum, Doppler spectrum, azimuth spectrum, and elevation spectrum;
Step 2) after constructing and training a deep learning model to extract the spatial and temporal features of the hand, using the high-dimensional matrix as input to regress 21 hand joints;
Step 3) generating the hand pose parameters $\theta$ and shape parameters $\beta$ required by the general parametric hand model from the 3D hand skeleton, and then generating a 3D hand mesh with a detailed hand representation using MANO; specifically, using the 3D hand skeleton as input, three layer-normalized linear layers output the shape parameter $\beta$, and a layer-normalized linear layer is then used to infer the pose parameter $\theta$; further, the finger bone orientation matrix $D_p \in \mathbb{R}^{20\times 3}$ is calculated from the three-dimensional coordinates of the 21 hand joints $J_{3D}$; then $D_p$ and $J_{3D}$ are each flattened into vectors and concatenated to perform pose parameter prediction, thus obtaining the pose parameter $\theta$.

2. The hand posture reconstruction method based on millimeter-wave signals according to claim 1, characterized in that the hourglass network combines shallow and deep features to represent the hand at different granularities in space; the temporal feature extraction network LSTM flattens the global feature map obtained by the spatial feature extraction network into a feature vector; and then, through a hybrid loss function, under supervised learning, regression is performed to obtain a three-dimensional hand skeleton containing 21 hand joints.

3. The hand posture reconstruction method based on millimeter-wave signals according to claim 2, characterized in that the attention residual block consists of two branches: one branch uses a 1×1 convolutional layer to adjust the number of channels without changing the feature map size, so as to preserve the features at the current level; the other branch first uses a convolutional layer for downsampling to extract high-dimensional and fine-grained features, and then uses a deconvolutional layer for upsampling to obtain a high-resolution feature map.

4. The hand posture reconstruction method based on millimeter-wave signals according to claim 1, characterized in that Step 1 specifically includes:
1.1) a linear frequency modulated signal called a chirp, whose frequency increases linearly, is transmitted by the transmitting antenna of a millimeter-wave radar; the signal reflected from objects in the environment is captured by the radar's receiving antenna, and the transmitted and received signals are mixed in the radar's mixer to produce an intermediate frequency (IF) signal, where $f_0$ is the starting frequency of the linear frequency modulation, $B$ is the signal bandwidth, $T_c$ is the linear frequency modulation duration, $A_r$ is the amplitude coefficient of millimeter-wave signal attenuation, and $\tau$ is the delay of the received signal relative to the transmitted signal, determined by the propagation speed $c$ of the millimeter-wave signal and the distance $r$ between the object and the radar; $f$ is the frequency of the IF signal;
1.2) environmental interference is eliminated from the received millimeter-wave signal: the original millimeter-wave data is filtered by an 8th-order bandpass Butterworth filter in the millimeter-wave band, and the hand-related signals are retained; then the range spectrum describing the target in the range dimension is obtained by performing a range FFT on the millimeter-wave signal;
1.3) to calculate the velocity $v$ of an object, the FMCW radar emits two chirp signals separated by an interval $T_c$; after a range FFT is performed, the two chirp signals peak at the same location in the range spectrum, and the phase difference $\Delta\xi$ corresponding to the object's motion is used to calculate the velocity of the object, where $\lambda$ is the wavelength of the signal; after performing the Doppler FFT, the Doppler spectrum is obtained;
1.4) if two receiving antennas are set up, the distance difference $\Delta d$ between the object and the two receiving antennas produces a phase difference $\Delta\xi$ at the peak of the FFT; according to the geometric relationship, $\Delta d$ is expressed as $\Delta d = l\sin(\theta)$, where $l$ is the distance between the two receiving antennas and $\theta$ is the angle of arrival to be calculated.
The high-dimensional matrix contains information on the range, angle, and speed of the sensed hand.

5. The hand posture reconstruction method based on millimeter-wave signals according to claim 2, characterized in that the hybrid loss function consists of two parts: the first part is the Euclidean distance constraint on the three-dimensional coordinates, and the second part is the hand dynamics loss, specifically $L_{kine} = \lambda L_{cop} + (1-\lambda) L_{col}$, where $\lambda$ is 0 in the collinear case and 1 in the coplanar case; the coplanarity loss is $L_{cop} = \vec{AB}\cdot e_n + \vec{BC}\cdot e_n + \vec{CD}\cdot e_n$, where $e_n$ is the plane normal vector; the collinearity loss $L_{col}$ is expressed in terms of $e_d$, the direction vector of the finger bone, and $p$, a value very close to 1.

6. The hand posture reconstruction method based on millimeter-wave signals according to claim 5, characterized in that the aforementioned pose parameter prediction refers to: outputting the rotation quaternions $Q \in \mathbb{R}^{21\times 4}$ of all joints using a three-layer linear neural network, and then converting the rotation quaternions $Q$ into the corresponding axis-angle representation, which yields the predicted pose parameter $\theta$.

7. A system for implementing the hand posture reconstruction method according to any one of claims 1-6, characterized in that the system comprises a millimeter-wave signal preprocessing unit, a hand joint regression unit, and a hand mesh construction unit. The millimeter-wave signal preprocessing unit performs format reconstruction, filtering, and noise reduction on the millimeter-wave signals acquired by the radar to obtain a high-dimensional matrix containing hand information, called a Radar Cube. The hand joint regression unit uses the Radar Cube as input and regresses the three-dimensional coordinates of 21 hand joints from it using a designed deep learning model. The hand mesh construction unit reconstructs a 3D hand mesh with a more detailed surface based on the regressed hand joints and a general parametric hand model.