A three-dimensional gesture tracking method based on an RGB camera

By combining convolutional neural networks and particle swarm optimization algorithms, and using RGB cameras for 3D gesture tracking, the problems of low prediction accuracy and poor real-time performance in existing technologies are solved, and high-precision real-time 3D gesture tracking is achieved.

CN115810219B · Active · Publication Date: 2026-06-23 · NANJING UNIV OF POSTS & TELECOMM


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NANJING UNIV OF POSTS & TELECOMM
Filing Date: 2022-12-21
Publication Date: 2026-06-23


IPC: G06V40/20; G06V10/44; G06V10/774; G06V10/82; G06N3/006

CPC: Y02T10/40


AI Technical Summary

Technical Problem

Existing technologies for 3D gesture tracking from RGB images have low prediction accuracy and suffer from hand occlusion and poor real-time performance. They fail to fully utilize 2D and 3D image information, rely on overly complex network models, and struggle to achieve high-precision real-time tracking.

Method used

A 3D gesture tracking method based on an RGB camera is adopted, which combines convolutional neural networks and particle swarm optimization algorithm. Through 3D key point detection, iterative fitting of hand bone length and MANO model by particle swarm optimization algorithm, combined with inverse analytical kinematics to solve posture parameters, and using Open3D for mesh vertex reconstruction, high-precision 3D gesture tracking is achieved.

Benefits of technology

It improves the accuracy and real-time performance of gesture tracking, effectively handles hand self-occlusion and object occlusion, optimizes the algorithm structure, and achieves high-precision real-time 3D gesture tracking.


Abstract

The application provides a three-dimensional gesture tracking method based on an RGB camera, comprising the following steps: taking images from the RGB camera as input, standardizing each image, and sending it into a three-dimensional joint detection module to extract image features, predict preliminary three-dimensional joints, and calculate the hand bone lengths; using a particle swarm optimization algorithm to iteratively fit the hand bone lengths to a MANO hand model, finding an optimal hand shape and obtaining new three-dimensional joints; sending the new three-dimensional joints into an inverse kinematics module to solve the pose parameters θ required by the MANO model; using the MANO model to combine the optimal hand shape and the pose parameters θ to obtain the final joints and mesh vertices; and reconstructing and rendering the obtained mesh vertices and joints to output the final real-time three-dimensional hand tracking pose. The method provided by the application has high prediction accuracy and high-precision three-dimensional gesture tracking, and the tracking process and result are easy to implement.


Description

Technical Field

[0001] This invention belongs to the field of gesture recognition technology, and in particular relates to a three-dimensional gesture tracking method based on an RGB camera.

Background Technology

[0002] The hand is one of the most frequently used parts of the human body in real life, and also one of the most dynamic. The hand can display a variety of postures, thereby conveying a wealth of information. Capturing hand movements is crucial for various applications in virtual reality, augmented reality, and human-computer interaction. Therefore, many researchers have begun studying this area in recent years and have made some progress.

[0003] Due to the widespread use of depth cameras, many early researchers estimated hand poses by fitting generated models to depth images. Tompson et al. combined CNNs with random decision forests and inverse kinematics to estimate hand poses in real time from single depth images. Wan et al. used unlabeled depth images for self-supervised fine-tuning, while Mueller et al. constructed a photorealistic dataset for better robustness. Other researchers have isolated point clouds and 3D voxels from depth images for further study.

[0004] Due to the high cost and power consumption of depth sensors, as well as their stringent requirements on the experimental environment, an increasing number of researchers are focusing on 3D hand pose estimation from monocular RGB images. Zimmermann and Brox trained a CNN-based model that directly estimates 3D joint coordinates from RGB images. Iqbal et al. used a 2.5D heatmap formulation, which encodes 2D joint positions together with depth information, significantly improving accuracy. Many researchers use depth image datasets to expand the diversity seen during training. Mueller et al. proposed a large-scale rendered dataset post-processed with CycleGAN to bridge the domain gap. However, these methods only estimate joint positions and do not recover joint rotations. Ge et al. used a graph CNN to directly regress the hand mesh, but this requires a special dataset with ground-truth hand meshes. Such model-free approaches generally perform poorly on challenging scenes. Better results are obtained when existing datasets from different modalities, including both image and non-image data, are fully utilized. Zhou et al. proposed a 3D hand joint detection module and an inverse kinematics module trained on image data with 2D or 3D annotations and on 3D animation data without corresponding images. Their method not only regresses 3D joint positions but also recovers joint rotations, showing promise for computer vision and graphics applications. However, Zhou's method does not consider the optimal matching problem between the 3D joints and the MANO hand model, and its inverse kinematics module has relatively high algorithmic complexity, making it difficult to implement.

[0005] Some of the methods mentioned above utilize depth images, while others use RGB color images. However, on the one hand, they do not fully utilize the image features of both 2D and 3D image information; on the other hand, due to insufficiently advanced network models or overly complex methods, the prediction accuracy of hand 3D coordinates is not very high, or the 3D gesture tracking effect is mediocre. Furthermore, there are also problems such as hand occlusion and poor real-time performance of gesture tracking.

Summary of the Invention

[0006] The main objective of this invention is to design a three-dimensional gesture tracking method to improve prediction accuracy and achieve high-precision three-dimensional gesture tracking results, while ensuring that the tracking process and results are easy to implement.

[0007] To achieve the above objectives, this invention provides a three-dimensional gesture tracking method based on an RGB camera, comprising the following steps:

[0008] Step 1: Using an RGB camera as input, standardize the image, send the processed image to the 3D joint detection module, extract image features, use a convolutional neural network to predict preliminary 3D joints and calculate the length of the hand bones.

[0009] Step 2: Iteratively fit the hand bone length and MANO hand model using the particle swarm optimization algorithm to find the optimal hand shape and obtain a new three-dimensional joint point;

[0010] Step 3: Input the new 3D joints into the inverse analytical kinematics module to solve for the pose parameters θ required for the MANO model;

[0011] Step 4: Using the MANO model combined with the optimal hand shape and pose parameters θ, obtain the final joints and mesh vertices; and

[0012] Step 5: Reconstruct and render the obtained mesh vertices and joints to obtain the final 3D hand real-time tracking pose.

[0013] A further improvement of the present invention is that the three-dimensional joint detection module uses a ResNet50-based neural network model, including a feature extractor, a 2D detector and a 3D detector, and uses a ResNet50 with an attention mechanism as the feature extractor. The input is an image with a resolution of 128×128 and the output is a feature volume F with a size of 32×32×256.

[0014] A further improvement of the present invention is that the 2D detector is a two-layer CNN that acquires feature volume F and outputs a heatmap H of 21 joints, which is used for 2D pose estimation; the 3D detector first uses a two-layer CNN to estimate an incremental map D from the heatmap H and feature volume F, and the heatmap H, feature volume F and incremental map D are concatenated and fed into another two-layer CNN to obtain the final position map L, and the 3D hand joint position is estimated in the form of position map L.

[0015] A further improvement of the present invention is that the MANO model in step four is a 3D parametric model, which constitutes a complete hand chain based on 16 joints and 5 fingertip points obtained from the vertices.

[0016] A further improvement of the present invention is that, in step five, Open3D is used to reconstruct the vertices of the hand mesh.

[0017] A further improvement of the present invention lies in the fact that the MANO model in step four uses a template mesh $\bar{T}$ of a hand with the palm outstretched. From the template $\bar{T}$, the shape function $B_s(\beta)$, and the pose function $B_p(\theta)$, the deformed hand template T is obtained. Then, combined with the pose parameter θ, skinning weight ω, and joint position J(θ), a skinning operation is performed on it. The mathematical expression is as follows:

[0018]

[0019] M(θ,β)=W(T(θ,β),θ,ω,J(θ)).

[0020] A further improvement of this invention is that the particle swarm optimization algorithm first initializes a group of random particles, and then finds the optimal solution through multiple iterations. In each iteration, the particles update themselves by tracking two extreme values. After finding these two optimal values, the particles update their velocity and position using the following formulas:

[0021] $v_{id}^{k+1}=\omega v_{id}^{k}+c_1 r_1\left(p_{id}^{k}-x_{id}^{k}\right)+c_2 r_2\left(p_{gd}^{k}-x_{id}^{k}\right)$

[0022] $x_{id}^{k+1}=x_{id}^{k}+v_{id}^{k+1}$

[0023] Where i = 1, 2, ..., N, N is the particle swarm size, d is the particle dimension index, k is the number of iterations, ω is the inertia weight, $c_1$ is the individual learning factor, $c_2$ is the swarm learning factor, and $r_1$ and $r_2$ are random numbers in the interval [0, 1] that increase the randomness of the search. $v_{id}^{k}$ is the d-th component of the velocity vector of particle i during the k-th iteration. $x_{id}^{k}$ is the d-th component of the position vector of particle i during the k-th iteration. $p_{id}^{k}$ is the historical optimal position of particle i in the d-th dimension during the k-th iteration, that is, the optimal solution obtained by the i-th particle after the k-th iteration. $p_{gd}^{k}$ is the historical best position of the swarm in the d-th dimension during the k-th iteration, that is, the best solution in the entire particle swarm after the k-th iteration.

[0024] The beneficial effects of this invention are as follows: By combining particle swarm optimization (PSO) with the MANO model to reconstruct the hand shape, the accuracy of gesture tracking is improved. It exhibits good tracking performance for both self-occlusion and object occlusion of hand movements. The use of analytical inverse kinematics to solve for pose parameters optimizes the algorithm structure, controlling the overall complexity and achieving good real-time performance. Ultimately, high-precision real-time 3D gesture tracking is achieved.

Description of the Drawings

[0025] Figure 1 is a diagram illustrating the overall framework of the 3D gesture tracking method based on an RGB camera according to the present invention.

[0026] Figure 2 is a network model diagram of the three-dimensional joint detection module in this invention.

[0027] Figure 3 is an experimental result diagram of the present invention.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0029] It should be emphasized that, in describing this invention, various formulas and constraints are distinguished by consistent reference numerals, but it is not excluded that different reference numerals may be used to identify the same formulas and / or constraints. The purpose of this arrangement is to more clearly illustrate the features of this invention.

[0030] This invention proposes a gesture tracking method based on convolutional neural networks (CNNs), incorporating particle swarm optimization (PSO) and inverse analytical kinematics. The method utilizes both 2D and 3D image data, allowing for better use of image features to train the model. After preprocessing the raw data, it is fed into a CNN with an attention mechanism, resulting in high accuracy of the 3D hand joints. PSO is then used to iteratively fit the MANO model to find the optimal hand shape. Finally, analytical inverse kinematics is used to derive the hand pose parameters, obtaining the necessary hand parameters for the MANO model. This method, combined with inverse analytical kinematics, better utilizes hand movement characteristics. Simultaneously, by performing PSO iterative fitting on the bone length and MANO hand model parameter file, the optimal hand shape can be found. While controlling algorithm complexity, high-precision 3D gesture tracking is ultimately achieved.

[0031] The three-dimensional gesture tracking method based on an RGB camera of the present invention mainly includes the following steps:

[0032] Step 1: Acquire the video stream, process the images, and send the processed images into the 3D joint detection module to extract image features. Use a convolutional neural network to predict preliminary 3D joints and calculate the length of the hand bones.

[0033] Step 2: Iteratively fit the hand bone length and MANO hand model using the particle swarm optimization algorithm to find the optimal hand shape and obtain a new three-dimensional joint point;

[0034] Step 3: Input the new 3D joints into the inverse analytical kinematics module to solve for the pose parameters θ required for the MANO model;

[0035] Step 4: Using the MANO model combined with the optimal hand shape and pose parameters θ, obtain the final joints and mesh vertices; and

[0036] Step 5: Reconstruct and render the obtained mesh vertices and joints to obtain the final 3D hand real-time tracking pose.

[0037] The present invention will now be described in detail with reference to the accompanying drawings.

[0038] Step 1: Acquire the video stream, process the images, and send the processed images into the 3D joint detection module to extract image features. Use a convolutional neural network to predict preliminary 3D joints and calculate the length of the hand bones.

[0039] In step one, the hardware is a standard three-channel RGB camera, and its video stream is the input. The stream is processed frame by frame: each frame is first cropped to a uniform size of 128×128 pixels and then standardized using the mean and standard deviation of each channel. The processed image is fed into the 3D joint detection module to extract image features, and a convolutional neural network predicts the preliminary 3D hand joints, from which the bone lengths are calculated.
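The frame-by-frame preprocessing described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the center crop, the nearest-neighbor resize, and the per-channel statistics (ImageNet values) are all assumptions, since the text only states that frames are cropped to 128×128 and standardized per channel. The function names are hypothetical.

```python
import numpy as np

# Assumed per-channel statistics (ImageNet values); the patent does not state them.
CHANNEL_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
CHANNEL_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def center_crop_square(frame):
    """Cut the largest centered square out of an H x W x 3 frame."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return frame[top:top + side, left:left + side]


def resize_nearest(image, size=128):
    """Nearest-neighbor resize to size x size (assumed interpolation)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows[:, None], cols[None, :]]


def preprocess_frame(frame, size=128):
    """uint8 RGB frame -> float32 size x size x 3 array, standardized per channel."""
    cropped = center_crop_square(frame)
    resized = resize_nearest(cropped, size)
    scaled = resized.astype(np.float32) / 255.0
    return (scaled - CHANNEL_MEAN) / CHANNEL_STD


# Usage: a synthetic all-white 240 x 320 frame stands in for a camera frame.
frame = np.full((240, 320, 3), 255, dtype=np.uint8)
net_input = preprocess_frame(frame)
```

In practice the mean and standard deviation would be computed from the training data, and the crop would typically follow a hand bounding box rather than the image center.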

[0040] As shown in Figure 2, the 3D joint detection module consists of a feature extractor, a 2D detector, and a 3D detector. A ResNet50 with an attention mechanism is used as the feature extractor; the input is a 128×128 resolution image, and the output is a feature volume F of size 32×32×256. The 2D detector is a two-layer CNN that acquires the feature volume F and outputs a heatmap H for 21 joints. The heatmap H is used for 2D pose estimation. The 2D coordinates of the 21 hand joints are as follows:

[0041] $P_{2d}=[[x_1,y_1],[x_2,y_2],\ldots,[x_{21},y_{21}]]^{T}$.

[0042] A heatmap is generated for the 2D image coordinates of each joint using a two-dimensional Gaussian distribution. The formula for generating the heatmap is as follows:

[0043]

[0044] Here, σ determines the radius of the heatmap response, and f(x, y) represents the probability value of joint point i at image coordinates [4x, 4y], the factor of 4 being the ratio between the 128×128 input image and the 32×32 heatmap. The position with the largest response in the i-th heatmap corresponds to the 2D image coordinates of the i-th joint.
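As an illustration of the heatmap encoding and decoding described above, the sketch below builds a 32×32 Gaussian heatmap for one joint and recovers the joint's image coordinates from the peak. The exact formula in the patent is not reproduced in this text, so the unnormalized Gaussian (peak value 1), the default σ, and the function names are assumptions.

```python
import numpy as np


def gaussian_heatmap(joint_xy, sigma=1.0, map_size=32, stride=4):
    """Heatmap for one joint given its (x, y) position in 128 x 128 image pixels.

    Assumes an unnormalized 2D Gaussian centered on the joint in heatmap
    coordinates, where heatmap cell (x, y) corresponds to image pixel (4x, 4y).
    """
    cx, cy = joint_xy[0] / stride, joint_xy[1] / stride
    xs = np.arange(map_size, dtype=np.float32)            # column index = x
    ys = np.arange(map_size, dtype=np.float32)[:, None]   # row index = y
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_heatmap(heatmap, stride=4):
    """Return the image coordinates (x, y) of the largest heatmap response."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(col) * stride, int(row) * stride)


# Usage: encode a joint at image pixel (40, 80) and decode it again.
heatmap = gaussian_heatmap((40, 80))
recovered = decode_heatmap(heatmap)
```

A larger σ spreads the response over more cells, which is what the text means by σ controlling the radius.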

[0045] The 3D detector first uses a two-layer CNN to estimate the incremental map D from the heatmap H and feature volume F. The heatmap H, feature volume F, and incremental map D are concatenated and fed into another two-layer CNN to obtain the final position map L. The 3D hand joint positions are then estimated in the form of position map L. The 3D hand joint positions are the 3D coordinates of the 21 hand joints relative to the spatial coordinate system of the wrist node: $P_{3d}=[[x_1,y_1,z_1],[x_2,y_2,z_2],\ldots,[x_{21},y_{21},z_{21}]]^{T}$. The presence of hand bones restricts the size of the hand, its range of motion, and the distances between joints. Any bone in the skeleton can be represented as a vector $b_{ij}$ between the i-th and j-th joints of the hand. The length $|b_{ij}|$ of the bone vector corresponds to the length of the bone, and the direction $b_{ij}/|b_{ij}|$ of the bone vector indicates the direction of the bone. For the entire hand, there are 20 bone vectors, so the bone lengths of the entire hand can be represented by the matrix $B_L$, with $B_L \in \mathbb{R}^{(J-1)\times 1}$, where J = 21 is the number of joints.
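The bone-length computation can be sketched as below: each of the 20 bones is the vector between a joint and its parent, and its Euclidean norm is the bone length. The joint ordering (wrist first, then four joints per finger from base to tip) and the resulting parent table are assumptions made for illustration; the patent does not list the ordering.

```python
import math

# Hypothetical joint ordering: 0 = wrist, then four joints per finger, base to tip.
# The first joint of each finger attaches to the wrist; every other joint
# attaches to the previous joint of the same finger.
PARENTS = [-1] + [0 if (j - 1) % 4 == 0 else j - 1 for j in range(1, 21)]


def bone_lengths(joints):
    """Return the 20 bone lengths |b_ij| for 21 wrist-relative (x, y, z) joints."""
    if len(joints) != 21:
        raise ValueError("expected 21 joints")
    return [
        math.dist(joints[child], joints[parent])
        for child, parent in enumerate(PARENTS)
        if parent >= 0
    ]


# Usage: a toy hand whose fingers all lie on the x axis with unit spacing.
toy_joints = [(0.0, 0.0, 0.0)] + [
    (float(k), 0.0, 0.0) for _finger in range(5) for k in range(1, 5)
]
lengths = bone_lengths(toy_joints)
```

The resulting list plays the role of the (J-1)×1 length matrix in the text.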

[0046] Step 2: Iteratively fit the hand bone length and MANO hand model using the particle swarm optimization algorithm to find the optimal hand shape and obtain a new three-dimensional joint.

[0047] Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique. It first initializes a swarm of random particles (random solutions) and then finds the optimal solution through multiple iterations. In each iteration, the particles update themselves by tracking two extreme values. After finding these two optimal values, the particles update their velocity and position using the following formulas:

[0048] $v_{id}^{k+1}=\omega v_{id}^{k}+c_1 r_1\left(p_{id}^{k}-x_{id}^{k}\right)+c_2 r_2\left(p_{gd}^{k}-x_{id}^{k}\right)$

[0049] $x_{id}^{k+1}=x_{id}^{k}+v_{id}^{k+1}$

[0050] Where i = 1, 2, ..., N, N is the particle swarm size, d is the particle dimension index, k is the number of iterations, ω is the inertia weight, $c_1$ is the individual learning factor, $c_2$ is the swarm learning factor, and $r_1$ and $r_2$ are random numbers in the interval [0, 1] that increase the randomness of the search. $v_{id}^{k}$ is the d-th component of the velocity vector of particle i during the k-th iteration. $x_{id}^{k}$ is the d-th component of the position vector of particle i during the k-th iteration. $p_{id}^{k}$ is the historical optimal position of particle i in the d-th dimension during the k-th iteration, that is, the optimal solution obtained by the i-th particle after the k-th iteration. $p_{gd}^{k}$ is the historical best position of the swarm in the d-th dimension during the k-th iteration, that is, the best solution in the entire particle swarm after the k-th iteration.

[0051] The termination condition for particle swarm optimization varies with the specific problem; it is generally either a maximum number of iterations or a required accuracy. In this invention, after multiple comparative experiments, the maximum number of iterations was set to 150. In each iteration, the particles update themselves by tracking the two extreme values described above: the individual historical best and the swarm historical best.
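The PSO loop described above can be sketched as follows, using the standard velocity and position updates with an inertia weight, two learning factors, and a fixed budget of 150 iterations. The parameter values, the clamping of positions to the search bounds, and the toy cost function (one scale factor fitted so that hypothetical template bone lengths match measured ones) are assumptions for illustration; the patent's actual cost function over the MANO shape parameters is not given in the text.

```python
import random


def pso_minimize(cost, dim, bounds, n_particles=30, max_iter=150,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize cost(position) with a basic particle swarm.

    w is the inertia weight, c1 the individual learning factor and c2 the
    swarm learning factor (values assumed, not taken from the patent).
    """
    rng = random.Random(seed)
    lo, hi = bounds
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in x]                      # individual best positions
    pbest_cost = [cost(p) for p in x]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # swarm best position

    for _ in range(max_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                # Clamp to the search bounds (an added safeguard).
                x[i][d] = min(hi, max(lo, x[i][d] + v[i][d]))
            c = cost(x[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = x[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = x[i][:], c
    return gbest, gbest_cost


# Toy usage: find the scale that maps template bone lengths onto measured ones.
template = [1.0, 2.0, 3.0]   # hypothetical template bone lengths
measured = [1.2, 2.4, 3.6]   # hypothetical measured bone lengths


def cost(p):
    return sum((p[0] * t - m) ** 2 for t, m in zip(template, measured))


best, err = pso_minimize(cost, dim=1, bounds=(0.5, 2.0))
```

In the method described here, the particle position would instead hold the MANO shape parameters, and the cost would compare the model's bone lengths with those computed from the predicted joints.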

[0052] The MANO model is a mainstream parametric model for hand pose estimation. It consists of 16 joints and 5 fingertips obtained from the vertices, forming a complete hand chain. By combining pose parameters, the hand shape can be recovered from the MANO model. $\bar{T}$ is the initial MANO mesh template, which represents the initial mesh vertex positions of a standard MANO model surface in a static state. From the initial template $\bar{T}$, the shape function $B_s(\beta)$, and the pose function $B_p(\theta)$, the deformed hand template T is obtained; it is then combined with the pose parameter θ, skinning weight ω, and joint position J(θ) to perform the skinning operation. The mathematical expression is as follows:

[0053]

[0054] M(θ,β)=W(T(θ,β),θ,ω,J(θ))
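For reference, the publicly documented MANO formulation (which follows SMPL) builds the deformed template additively and implements the skinning function W as linear blend skinning. The patent text does not spell these steps out, so the expansion below is an assumption based on that public formulation. The symbols $t_i$, $t_i'$, $G_k$, and $K$ are notation introduced here: $t_i$ is the i-th vertex of T(θ,β) in homogeneous coordinates, $t_i'$ is the posed vertex, $G_k$ is the world transform of joint k relative to the rest pose, and K is the number of joints.

```latex
\begin{aligned}
T(\theta,\beta) &= \bar{T} + B_s(\beta) + B_p(\theta) \\
t_i' &= \sum_{k=1}^{K} \omega_{k,i}\, G_k(\theta, J)\, t_i
\end{aligned}
```

Under this reading, M(θ,β) is the set of all posed vertices $t_i'$, and the skinning weights $\omega_{k,i}$ for each vertex sum to one.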

[0055] Step 3: Input the new 3D joints into the inverse analytical kinematics module to solve for the pose parameters θ required for the MANO model.

[0056] 3D joint coordinates can describe hand posture to some extent, but are insufficient to represent a 3D hand model. Therefore, joint rotations must be derived from the joint coordinates. The analytical inverse kinematics module solves for the pose parameters θ by decomposing each rotation into a twist component and a swing component. These pose parameters are ultimately used in the MANO model to reconstruct the hand shape and pose.
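To illustrate what decomposing a rotation into twist and swing means, the sketch below applies the generic swing-twist decomposition to a unit quaternion: the twist is the part of the rotation about a chosen bone axis, and the swing is the remaining rotation about an axis perpendicular to it. This is a textbook decomposition and not the patent's solver; the quaternion convention (w, x, y, z) and the function names are assumptions.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def swing_twist(q, axis):
    """Split unit quaternion q into (swing, twist) with q = swing * twist.

    axis is a unit vector giving the twist (bone) direction.
    """
    w, x, y, z = q
    proj = x * axis[0] + y * axis[1] + z * axis[2]
    raw = (w, proj * axis[0], proj * axis[1], proj * axis[2])
    norm = math.sqrt(sum(c * c for c in raw))
    if norm < 1e-9:
        # 180-degree swing: the twist is undefined, so use the identity.
        twist = (1.0, 0.0, 0.0, 0.0)
    else:
        twist = tuple(c / norm for c in raw)
    twist_conj = (twist[0], -twist[1], -twist[2], -twist[3])
    swing = quat_mul(q, twist_conj)
    return swing, twist


# Usage: compose a 90-degree swing about x with a 60-degree twist about z,
# then recover both parts from the product.
s = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)
t = (math.cos(math.pi / 6), 0.0, 0.0, math.sin(math.pi / 6))
q = quat_mul(s, t)
swing, twist = swing_twist(q, (0.0, 0.0, 1.0))
```

Separating the two components is useful for hands because twist about a finger bone is anatomically limited, while swing corresponds to bending and spreading.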

[0057] Step 4: Using the MANO model combined with the optimal hand shape and pose parameters θ, the final joints and mesh vertices are obtained.

[0058] Based on the overall process of model generation described above, given the pose parameter θ and the optimal hand shape, the corresponding mesh shape can be generated using the MANO model.

[0059] Step 5: Reconstruct and render the obtained mesh vertices and joints to obtain the final 3D hand real-time tracking pose.

[0060] In step five, Open3D is used to reconstruct the vertices of the hand mesh.

[0061] Two experiments were carried out to evaluate the invention.

[0062] The first experiment aimed to test whether this method could improve the accuracy of 3D keypoint detection. Our model was trained on an Nvidia DGX GPU, Tesla P100-SXM2, using three datasets: CMU HandDB, Rendered Handpose Dataset, and GANerated Hands Dataset. The test set used four datasets: Rendered Handpose Dataset, EgoDexter Dataset, STB Dataset, and DexterObject Dataset.

[0063] The evaluation metrics are the proportion of correct keypoints (PCK) and the area under the curve (AUC). Calculating PCK requires manually setting a 3D keypoint error threshold c. When the 3D keypoint error is less than c, the keypoint is considered correctly detected. The PCK value is the proportion of correctly detected keypoints out of all keypoints. At the same threshold, a higher PCK value indicates better method performance. Different PCK values can be obtained by setting different thresholds c. Plotting the threshold c on the horizontal axis and the PCK value on the vertical axis yields a curve showing how PCK changes with the threshold. The AUC is the area under this curve. A higher AUC value indicates more accurate pose estimation.
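The two metrics can be sketched as below. PCK follows the definition in the text (a keypoint counts as correct when its 3D error is strictly less than the threshold c). For AUC, the trapezoidal integration and the normalization by the threshold range are assumptions, as are the function names and the toy data.

```python
import math


def pck(pred, gt, threshold):
    """Proportion of keypoints whose 3D error is below the threshold."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(e < threshold for e in errors) / len(errors)


def pck_auc(pred, gt, thresholds):
    """Area under the PCK-versus-threshold curve, normalized to [0, 1].

    Uses trapezoidal integration over the given sorted thresholds and divides
    by the threshold range (an assumed, commonly used normalization).
    """
    values = [pck(pred, gt, c) for c in thresholds]
    area = sum(
        (thresholds[i + 1] - thresholds[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(thresholds) - 1)
    )
    return area / (thresholds[-1] - thresholds[0])


# Toy usage: two keypoints with errors of 10 and 30 units.
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
pred = [(10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
pck_at_20 = pck(pred, gt, 20)
auc = pck_auc(pred, gt, [0, 20, 40])
```

With these toy errors, half of the keypoints fall below a threshold of 20, and all of them fall below a threshold of 40.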

[0064] The experimental results are shown in Table 1. It can be seen that the method described in this invention can accurately estimate the coordinates of hand joints.

[0065] Table 1 Experimental Results

[0066]

[0067]

[0068] The second experiment tested the real-time hand tracking performance of the method and its ability to recover the hand under occlusion. The experimental environment was an Intel(R) Xeon(R) CPU E5-2620. We performed actions such as opening the palm, holding a pen, and holding a cup. As shown in Figure 3, this method has good 3D gesture tracking capabilities: the reconstructed hand shows no obvious deformation under self-occlusion, and even when the hand is occluded by an object, the method can still recover the hand shape in real time.

[0069] In summary, 3D gesture tracking with this method has good real-time performance and handles both self-occlusion and object occlusion of the hand well.

[0070] The present invention also provides a three-dimensional gesture tracking device based on an RGB camera, the device including an image acquisition and preprocessing module, a three-dimensional joint detection module, a posture parameter calculation module, a joint acquisition module, and a rendering output module.

[0071] The image acquisition and preprocessing module and the 3D joint detection module are used to standardize the image with an RGB camera as input, send the processed image to the 3D joint detection module to extract image features, use a convolutional neural network to predict the preliminary 3D joints of the hand and calculate the bone length.

[0072] The pose parameter calculation module is used to iteratively fit the hand bone length and the MANO model using the particle swarm optimization algorithm to find the optimal hand shape and obtain a new set of three-dimensional joints; and then send the new three-dimensional joints into the inverse analytical kinematics module to solve for the pose parameters θ required by the MANO model.

[0073] The joint acquisition module is used to obtain the final joints and mesh vertices by combining the MANO model with the optimal hand shape and pose parameters θ; and

[0074] The rendering output module is used to reconstruct and render the obtained mesh vertices and joints to obtain the final 3D hand real-time tracking pose.

[0075] The method of this invention derives the hand bone length from the 3D coordinates of the hand predicted by a convolutional neural network model. It then uses a particle swarm optimization algorithm to iteratively fit the hand bone length and the MANO hand model parameter file 150 times to match the optimal hand shape. Next, inverse analytical kinematics is used to infer pose parameters from joint positions. Finally, the pose parameters are combined with the MANO model to obtain the final hand tracking pose. This algorithm exhibits good real-time performance in real-world applications and demonstrates good robustness against both self-occlusion and object occlusion of the hand.

[0076] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A three-dimensional gesture tracking method based on an RGB camera, characterized in that it includes the following steps:
Step 1: Using an RGB camera as input, standardize the image, send the processed image to the 3D joint detection module, extract image features, use a convolutional neural network to predict preliminary 3D joints and calculate the length of the hand bones;
Step 2: Iteratively fit the hand bone length and the MANO hand model using the particle swarm optimization algorithm to find the optimal hand shape and obtain a new three-dimensional joint point;
Step 3: Input the new 3D joints into the inverse analytical kinematics module, and solve for the pose parameters θ required by the MANO model by decomposing rotation into twist and swing;
Step 4: The MANO model is a 3D parametric model that constructs a complete hand chain based on 16 joints and 5 fingertip points obtained from the vertices. The MANO model is then used to combine the optimal hand shape and the pose parameters θ, yielding the final joints and mesh vertices; and
Step 5: Reconstruct and render the obtained mesh vertices and joints to obtain the final 3D hand real-time tracking pose;
The 3D joint detection module uses a ResNet50-based neural network model. The model includes a feature extractor, a 2D detector, and a 3D detector. A ResNet50 with an attention mechanism is used as the feature extractor. The input is a 128×128 resolution image, and the output is a feature volume F of 32×32×256. The 2D detector is a two-layer CNN that acquires the feature volume F and outputs a heatmap H for 21 joints, which is used for 2D pose estimation. The 3D detector first uses a two-layer CNN to estimate an incremental map D from the heatmap H and the feature volume F. The heatmap H, feature volume F, and incremental map D are concatenated and fed into another two-layer CNN to obtain the final position map L, and the 3D hand joint positions are estimated in the form of position map L;
The MANO model in step four uses a template $\bar{T}$ of a hand with the palm outstretched. From the shape function $B_s(\beta)$ and the pose function $B_p(\theta)$, a deformed hand template T is obtained, which is then combined with the pose parameters θ, skinning weight ω, and joint position J(θ) to perform the skinning operation. The mathematical expression is as follows: $M(\theta,\beta)=W(T(\theta,\beta),\theta,\omega,J(\theta))$.

2. The three-dimensional gesture tracking method based on an RGB camera according to claim 1, characterized in that: Step 5 uses Open3D to reconstruct the vertices of the hand mesh.

3. The three-dimensional gesture tracking method based on an RGB camera according to claim 1, characterized in that: The particle swarm optimization algorithm first initializes a swarm of random particles, then finds the optimal solution through multiple iterations, with the iteration termination condition being 150 iterations. In each iteration, the particles update themselves by tracking two extreme values. After finding these two optimal values, the particles update their velocity and position using the following formulas: $v_{id}^{k+1}=\omega v_{id}^{k}+c_1 r_1\left(p_{id}^{k}-x_{id}^{k}\right)+c_2 r_2\left(p_{gd}^{k}-x_{id}^{k}\right)$ and $x_{id}^{k+1}=x_{id}^{k}+v_{id}^{k+1}$, where i = 1, 2, ..., N, N is the particle swarm size, d is the particle dimension index, and k is the number of iterations. ω is the inertia weight. $c_1$ is the individual learning factor. $c_2$ is the swarm learning factor. $r_1$ and $r_2$ are random numbers within the interval [0,1], which increase the randomness of the search. $v_{id}^{k}$ is the d-th component of the velocity vector of particle i during the k-th iteration. $x_{id}^{k}$ is the d-th component of the position vector of particle i during the k-th iteration. $p_{id}^{k}$ is the historical optimal position of particle i in the d-th dimension during the k-th iteration, that is, the optimal solution obtained by the i-th particle after the k-th iteration. $p_{gd}^{k}$ is the historical best position of the swarm in the d-th dimension during the k-th iteration, that is, the best solution in the entire particle swarm after the k-th iteration.

4. A three-dimensional gesture tracking device based on an RGB camera, characterized in that: The device includes an image acquisition and preprocessing module, a 3D joint detection module, a pose parameter calculation module, a joint acquisition module, and a rendering output module; the device is used to perform the method described in any one of claims 1-3.

5. The three-dimensional gesture tracking device based on an RGB camera according to claim 4, characterized in that: The 3D joint detection module uses a ResNet50-based neural network model. The system includes a feature extractor, a 2D detector, and a 3D detector. A ResNet50 with an attention mechanism is used as the feature extractor. The input is a 128×128 resolution image, and the output is a feature volume F of size 32×32×256. The 2D detector is a two-layer CNN that acquires the feature volume F and outputs a heatmap H for 21 joints, which is used for 2D pose estimation. The 3D detector first uses a two-layer CNN to estimate an incremental map D from the heatmap H and the feature map F. The heatmap H, feature map F, and incremental map D are concatenated and fed into another two-layer CNN to obtain the final position map L, and the 3D hand joint positions are estimated in the form of the position map L.