Hand motion data processing method and system based on visual motion capture
By employing a global optimization method based on visual motion capture technology, the problem of precision and accuracy in restoring hand motion postures in complex environments has been solved. This method enables portable acquisition and robustness of high-precision hand motion data, making it suitable for professional fields such as film and animation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TUJIAN TECH (BEIJING) CO LTD
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
Figure CN122200752A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of visual motion capture data processing, and more specifically to a method and system for processing hand motion data based on visual motion capture.
Background Technology
[0002] In fields such as robotic teleoperation, embodied AI training, and rehabilitation medicine, the accurate acquisition of hand motion data is a core technological support. Visual motion capture technology, with its non-contact and non-invasive advantages, captures key hand feature points to analyze natural movements, making it a mainstream alternative to traditional contact sensors.
[0003] While current visual motion capture technology has made progress in accuracy, visual motion capture devices generally rely on fixed multi-camera arrays or dedicated depth sensors to build the acquisition system. Multi-camera arrays are complex to operate, and the captured data is easily affected by interference in the acquisition environment, such as changes in light intensity and complex background textures, which reduces both the accuracy and the precision of motion posture reconstruction.
Summary of the Invention
[0004] In view of this, the present invention provides a method and system for processing hand motion data based on visual motion capture to solve the problem of poor accuracy and precision in motion posture reconstruction.
[0005] In a first aspect, the present invention provides a method for processing hand motion data based on visual motion capture. The method includes: acquiring hand motion video data of a target object, wherein the hand motion video data is collected by a smart camera; decoding the hand motion video data frame by frame to obtain a complete sequence of video frames; extracting hand regions based on the complete sequence of video frames to obtain a complete sequence of hand region frames and a complete sequence of hand dynamic masks; enhancing and completing the hand features of each frame based on the temporal context information of two adjacent frames in the complete sequence of hand region frames to obtain a global spatiotemporal hand feature sequence; performing pose calculation and global temporal constraint joint optimization and correction based on the global spatiotemporal hand feature sequence to obtain full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera; calculating the camera trajectory and constructing a world coordinate system based on the complete sequence of hand dynamic masks to obtain the camera trajectory and the corresponding world coordinate system; and performing global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system to obtain full-degree-of-freedom hand motion data in the world coordinate system.
[0006] In one optional implementation, extracting hand regions based on the complete sequence of video frames to obtain a complete sequence of hand region frames and a complete sequence of hand dynamic masks includes: performing frame-by-frame hand region detection based on the complete sequence of video frames to obtain the complete sequence of hand region frames; and performing left and right hand recognition and identity association based on the complete sequence of hand region frames to obtain the complete sequence of hand dynamic masks.
[0007] In one optional implementation, completing and enhancing the hand features of each frame based on the temporal context information of two adjacent frames in the complete sequence of hand region frames to obtain a global spatiotemporal hand feature sequence includes: extracting features from the hand region of each frame in the complete sequence of hand region frames to obtain high-dimensional features of the hand region of each frame; establishing feature associations between adjacent frames based on the context information of the complete sequence of hand region frames to obtain temporal context feature associations of the complete sequence of hand region frames; and completing and enhancing the features of abnormal frames in the temporal context feature associations based on the high-dimensional features of the hand region of each frame to obtain the global spatiotemporal hand feature sequence.
[0008] In one optional implementation, performing pose calculation and global temporal constraint joint optimization and correction based on the global spatiotemporal hand feature sequence to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera includes: performing frame-by-frame pose calculation in the camera coordinate system of the smart camera based on the global spatiotemporal hand feature sequence to obtain initial full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera, wherein the initial full-degree-of-freedom hand pose parameters include at least one of hand posture, shape, global orientation, and translation parameters; associating the poses of two adjacent frames based on the initial full-degree-of-freedom hand pose parameters to obtain temporally associated pose parameters; and applying physiological structure constraints of hand movement to the temporally associated pose parameters to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera.
[0009] In one optional implementation, calculating the camera trajectory and constructing the world coordinate system based on the complete sequence of hand dynamic masks to obtain the camera trajectory and the corresponding world coordinate system includes: extracting background feature points from the global spatiotemporal hand feature sequence based on the complete sequence of hand dynamic masks; performing simultaneous localization and mapping (SLAM) based on the background feature points to obtain the full sequence of camera poses; connecting the full sequence of camera poses in chronological order to obtain an initial camera trajectory; performing global real-world scale calibration based on the initial camera trajectory to obtain the camera trajectory; and constructing a unified world coordinate system based on the camera trajectory to obtain the world coordinate system.
[0010] In one optional implementation, global real-world scale calibration is performed based on the initial camera trajectory to obtain the camera trajectory, including: performing absolute depth estimation based on the entire sequence of video frames to obtain an absolute depth map of the background scene; obtaining the depth values of background feature points in the absolute depth map to obtain the absolute depth of the background feature points; calculating a global scale factor based on the proportional relationship between the absolute depth and the relative displacement of the same background feature points in the initial camera trajectory; and performing global scaling adjustment on the initial camera trajectory based on the global scale factor to obtain the camera trajectory.
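The scale calibration described above can be illustrated with a minimal sketch. It assumes that the "proportional relationship" is the ratio between the absolute (metric) depth of a background feature point and its up-to-scale depth from the initial reconstruction, and it takes the median of those ratios as the global scale factor. The median, the function names, and the sample values are illustrative assumptions and are not taken from the source.

```python
from statistics import median

def global_scale_factor(absolute_depths, relative_depths):
    """Median ratio of metric depth to up-to-scale depth over shared background points.

    The median is an assumed robust choice; the source does not name an estimator.
    """
    ratios = [a / r for a, r in zip(absolute_depths, relative_depths) if a > 0 and r > 0]
    if not ratios:
        raise ValueError("no valid background feature points")
    return median(ratios)

def rescale_trajectory(camera_positions, scale):
    """Apply one global scale factor to every camera position of the initial trajectory."""
    return [tuple(scale * c for c in position) for position in camera_positions]

# Hypothetical depths of the same four background feature points.
absolute_depths = [2.0, 4.1, 3.0, 6.0]   # from the absolute depth map (metres)
relative_depths = [1.0, 2.0, 1.5, 3.0]   # from the up-to-scale initial reconstruction

scale = global_scale_factor(absolute_depths, relative_depths)
initial_trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]
trajectory = rescale_trajectory(initial_trajectory, scale)
```

Using a single factor for the whole sequence keeps the shape of the trajectory unchanged and only fixes its physical size.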
[0011] In one optional implementation, performing global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system to obtain full-degree-of-freedom hand motion data in the world coordinate system includes: transforming the full-degree-of-freedom hand pose parameters in the camera coordinate system into the world coordinate system based on the camera trajectory and the corresponding world coordinate system to obtain an initial hand pose sequence in the world coordinate system; projecting the hand feature points of the initial hand pose sequence in the world coordinate system based on the camera trajectory to obtain predicted projection positions; matching the predicted projection positions with the hand feature points in the global spatiotemporal hand feature sequence and calculating the matching error to obtain a global projection error; and adjusting the initial hand pose sequence in the world coordinate system and the camera trajectory based on the global projection error until the global projection error is less than a global projection error threshold, to obtain the full-degree-of-freedom hand motion data in the world coordinate system.
[0012] The method for processing hand motion data based on visual motion capture provided by the present invention further includes: removing, from the full-degree-of-freedom hand motion data in the world coordinate system, abnormal data that deviates from the physiological range of human hand motion; performing motion smoothing filtering on the full-degree-of-freedom hand motion data from which the abnormal data has been removed to obtain a smoothed motion data sequence; correcting the smoothed motion data sequence based on hand anatomical kinematic constraints to obtain a corrected motion data sequence; performing motion calibration on the corrected motion data sequence to obtain a calibrated motion data sequence; and optimizing the trajectory based on the hand world space trajectory in the calibrated motion data sequence to obtain target full-degree-of-freedom hand motion data in the world coordinate system, wherein the target full-degree-of-freedom hand motion data includes at least one of the hand global pose, joint angles, and world space trajectory for each frame.
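The first two post-processing operations above (removing abnormal data and smoothing) can be sketched for a single joint angle as follows. This is only an illustration under stated assumptions: the 0 to 110 degree limit is a hypothetical range, the centred moving average stands in for whatever smoothing filter is actually used, and the later correction, calibration, and trajectory optimization steps are not shown.

```python
def remove_out_of_range(angles, lower, upper):
    """Mark samples outside the assumed physiological range as missing (None)."""
    return [a if lower <= a <= upper else None for a in angles]

def moving_average(samples, half_window=1):
    """Centred moving average that skips missing samples inside each window."""
    smoothed = []
    for i in range(len(samples)):
        window = [s for s in samples[max(0, i - half_window): i + half_window + 1]
                  if s is not None]
        smoothed.append(sum(window) / len(window) if window else None)
    return smoothed

# Hypothetical flexion angle of one finger joint, in degrees, one value per frame.
raw_angles = [10.0, 12.0, 200.0, 16.0, 18.0]          # 200.0 is not physically plausible
filtered = remove_out_of_range(raw_angles, 0.0, 110.0)  # assumed joint limit
smoothed = moving_average(filtered, half_window=1)
```

Because the abnormal sample is excluded before averaging, it does not leak into the neighbouring frames.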
[0013] In one optional implementation, after decoding the hand motion video data to obtain a complete sequence of video frames, the method further includes: scaling each frame in the complete sequence of video frames to a preset resolution to obtain a full-sequence set of size-normalized video frames; performing geometric correction on the full-sequence set of size-normalized video frames based on camera calibration distortion parameters to obtain a full-sequence set of corrected video frames; and performing pixel value normalization on the full-sequence set of corrected video frames based on a preset pixel normalization range to obtain a full-sequence frame dataset.
[0014] In a second aspect, the present invention provides a hand motion data processing system based on visual motion capture, comprising a smart camera and a processor, wherein the smart camera is used to collect hand motion video data of a target object, and the processor is used to execute the hand motion data processing method based on visual motion capture of the first aspect or any of its corresponding implementations.
[0015] The present invention provides a hand motion data processing method based on visual motion capture. The method decodes the acquired hand motion video data of the target object frame by frame, providing a complete full-sequence image foundation for subsequent processing. It then extracts the hand region from the decoded complete sequence of video frames, accurately locating the hand target and generating a dynamic mask that shields against complex background interference and improves robustness in complex scenes. Next, the temporal context information of adjacent frames in the complete sequence of hand region frames is used to complete and enhance the hand features of each frame, yielding a global spatiotemporal hand feature sequence; this global temporal information repairs features degraded by occlusion or blurring and ensures the integrity of single-frame features in difficult scenes. Pose calculation is then performed on the global spatiotemporal hand feature sequence, with joint optimization and correction under global temporal constraints, to obtain smooth hand pose parameters in the camera coordinate system that conform to human physiology, eliminating single-frame jitter and anomalies. Furthermore, the method calculates the camera trajectory and constructs a world coordinate system based on the complete sequence of hand dynamic masks, mapping relative motion from the local viewpoint into a unified 3D world space and establishing a global spatial reference. Finally, it performs global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system, obtaining full-degree-of-freedom hand motion data in the world coordinate system. This eliminates the cumulative errors caused by solving hand pose and camera motion separately and outputs high-precision motion data with spatiotemporal consistency in the world coordinate system. Through a camera-only input pipeline and a global optimization architecture, the invention removes the strong dependence of traditional high-precision motion capture on dedicated hardware and fixed locations, making the acquisition end highly portable and usable out of the box. At the same time, the closed-loop global joint optimization of hand and camera motion significantly improves robustness in complex real-world scenarios and prevents error from accumulating over long sequences. The output can be applied directly, without manual correction, in professional fields such as film and animation and digital human driving, significantly reducing the barrier to entry and the implementation cost of high-precision motion capture. The closed-loop global optimization of hand and camera motion also improves the spatiotemporal consistency of the hand motion capture data and the accuracy of motion posture reconstruction.
Brief Description of the Drawings
[0016] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a flowchart of a hand motion data processing method based on visual motion capture according to an embodiment of the present invention; Figure 2 is a schematic structural diagram of a smart camera according to an embodiment of the present invention; Figure 3 is a schematic structural diagram of another smart camera according to an embodiment of the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] This embodiment provides a hand motion data processing system based on visual motion capture, including a smart camera and a processor. The smart camera is used to collect video data of the hand motion of a target object. In this embodiment, the smart camera can be a monocular camera or a binocular stereo camera, etc.
[0020] As shown in Figures 2 and 3, the smart camera comprises a head-mounted carrier module 1, a video acquisition module 2, and an integrated module 3. The integrated module 3 includes a communication module, a storage module, and a power supply module. The head-mounted carrier module 1 has an adjustable headband to fit different head shapes, with built-in cushioning and anti-slip structures for stability and comfort. A camera mounting bracket at the front of the carrier supports tilt adjustment of ±15 degrees and horizontal adjustment of ±30 degrees so that the camera can be aimed precisely at the hand's activity area. The video acquisition module 2 includes a camera unit using a 720P, 1080P, or 4K high-definition CMOS sensor with a selectable frame rate of 30 fps or 60 fps. It is equipped with lenses with a horizontal FOV of 120 degrees and a vertical FOV of 120 degrees to cover the core area of hand activity and avoid blind spots. The communication module integrates 4G and WiFi units for high-speed transmission of high-definition images and hand movement data. The storage module has built-in 128GB high-speed flash memory as local storage, supporting real-time caching of raw images and processed data to prevent data loss. The power supply module uses a 5V/2A rechargeable lithium battery to give the device a battery life of ≥4 hours.
[0021] The smart camera is used as follows. First, the target object puts on the head-mounted carrier module 1 and adjusts its tightness, and adjusts the angle of the camera mounting bracket so that the video acquisition module 2 is aimed at the hand's movement area. Then, the power button is pressed and held to start the device; the system enters standby mode after power-on. Subsequently, the video acquisition module 2 captures hand movement video data at a preset frame rate (30fps or 60fps). This data is transmitted in real time through the communication module to the processor for analysis. The video stream can also be stored in real time on the local TF card in the storage module, providing dual-channel backup.
[0022] The processor is used to execute the hand motion data processing method based on visual motion capture shown in Figure 1, so as to process the hand motion video data transmitted by the communication module.
[0023] According to an embodiment of the present invention, a method for processing hand motion data based on visual motion capture is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0024] This embodiment provides a hand motion data processing method based on visual motion capture, which can be used in mobile terminals such as mobile phones and tablets. Figure 1 is a flowchart of the hand motion data processing method based on visual motion capture according to an embodiment of the present invention. As shown in Figure 1, the process includes the following steps:
Step S101: Obtain hand motion video data of the target object, wherein the hand motion video data is collected by a smart camera.
[0025] This embodiment acquires hand movement video data of the target object for a preset duration (e.g., 5s to 60s) collected by a smart camera to ensure that the complete hand movement cycle is included. The smart camera can be a monocular smart camera or a binocular smart camera, etc.
[0026] Step S102: Decode the hand motion video data frame by frame to obtain a complete sequence of video frames.
[0027] In this embodiment, the hand movement video data is a complete RGB video file captured by a smart camera. This embodiment uses a standard video decoding process to parse the video file frame by frame in chronological order, restoring it into a series of continuous static images. This yields a complete sequence of video frames containing all time points, providing the most basic image data input for subsequent hand posture and motion analysis.
[0028] Step S103: Extract the hand region based on the full sequence video frame set to obtain the full sequence hand region frame set and the full sequence hand dynamic mask.
[0029] Based on the complete sequence of video frames, this embodiment detects the hand region in each frame and extracts continuous hand region images, forming the complete sequence of hand region frames. Simultaneously, a binary mask of the hand region in each frame is output (i.e., the full-sequence hand dynamic mask). The mask accurately isolates the hand region; for example, the hand region in the image is marked as white (value 1), and the background region is marked as black (value 0). This mask effectively eliminates interference from complex backgrounds, clothing, or other objects, ensuring that subsequent algorithms such as pose estimation and trajectory tracking focus only on hand pixels, thereby significantly improving the accuracy and robustness of the solution.
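As a small illustration of how such a binary mask can be used, the sketch below keeps only hand pixels in a frame and, conversely, selects feature points that fall on the background. Frames and masks are represented as nested lists for readability; the function names and sample values are hypothetical.

```python
def apply_hand_mask(frame, mask):
    """Keep pixels where the mask is 1 (hand) and zero out the background."""
    return [[pixel if m == 1 else 0 for pixel, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]

def background_points(points, mask):
    """Keep (x, y) feature points that lie on background pixels (mask value 0)."""
    return [(x, y) for x, y in points if mask[y][x] == 0]

# Hypothetical 3x3 grayscale frame and its hand mask (1 = hand, 0 = background).
frame = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 0, 0]]

hand_only = apply_hand_mask(frame, mask)
static_points = background_points([(0, 0), (1, 1), (2, 1), (2, 2)], mask)
```

The same mask therefore serves both branches of the pipeline: hand pixels feed pose estimation, while background points can feed camera trajectory estimation.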
[0030] Step S104: Based on the temporal context information of two adjacent frames in the full sequence hand region frame set, the hand features of each frame are completed and enhanced to obtain a global spatiotemporal hand feature sequence.
[0031] This embodiment utilizes temporal context information from the entire sequence to enhance the hand features of a single frame. For situations where the hand is occluded, moves out of the frame, or is blurred, resulting in incomplete features or degraded quality, this embodiment can use information from preceding and following clear frames to infer and complete the features of these problematic frames. In this embodiment, adjacent frames refer to the first and second frames, the second and third frames, the third and fourth frames, and so on.
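The idea of completing a problematic frame from its clear neighbours can be sketched with plain linear interpolation of feature vectors. This is a deliberately simple stand-in: the source describes a learned temporal model, not interpolation, and the validity flags, function name, and sample features are assumptions made for illustration.

```python
def complete_features(features, valid):
    """Fill invalid frames by interpolating between the nearest valid frames.

    Frames before the first or after the last valid frame copy that valid frame.
    """
    valid_indices = [i for i, ok in enumerate(valid) if ok]
    if not valid_indices:
        raise ValueError("at least one valid frame is required")
    completed = [list(f) for f in features]
    for i, ok in enumerate(valid):
        if ok:
            continue
        prev = max((j for j in valid_indices if j < i), default=None)
        nxt = min((j for j in valid_indices if j > i), default=None)
        if prev is None:
            completed[i] = list(features[nxt])
        elif nxt is None:
            completed[i] = list(features[prev])
        else:
            w = (i - prev) / (nxt - prev)  # 0 at the previous frame, 1 at the next
            completed[i] = [(1 - w) * a + w * b
                            for a, b in zip(features[prev], features[nxt])]
    return completed

# Hypothetical 2-D features; frames 1 and 2 are occluded and hold unreliable values.
features = [[0.0, 0.0], [9.0, 9.0], [9.0, 9.0], [3.0, 6.0]]
valid = [True, False, False, True]
completed = complete_features(features, valid)
```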
[0032] Step S105: Based on the global spatiotemporal hand feature sequence, perform pose calculation and global temporal constraint joint optimization and correction to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera.
[0033] Based on the global spatiotemporal hand features extracted earlier, this embodiment initially calculates the specific position and motion parameters of the hand from the camera's perspective. Subsequently, a global temporal constraint mechanism is introduced to optimize and correct the initial calculation results, effectively eliminating single-frame calculation errors or image jitter caused by occlusion or blurring, ensuring that the final output hand pose parameters are not only accurate but also have a smooth and natural motion transition.
[0034] Step S106: Calculate the camera trajectory and construct the world coordinate system based on the full sequence of hand dynamic masks to obtain the camera trajectory and the corresponding world coordinate system.
[0035] This embodiment uses a full-sequence dynamic hand mask as a visual reference to infer and calculate the motion trajectory of the smart camera itself, thereby constructing a fixed world coordinate system. By mapping the hand pose parameters in the camera coordinate system calculated in the previous steps to this unified global coordinate system, the perspective deviation caused by hand-held shooting shakiness or camera movement is effectively eliminated. The relative motion, which was originally only relative to the camera, is transformed into absolute motion in the real world, ensuring the positioning accuracy and spatial consistency of the hand motion trajectory in physical space.
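The mapping from the camera coordinate system to the world coordinate system is a rigid transform per frame. A minimal sketch, assuming the camera pose for a frame is given as a rotation matrix R_wc and a translation t_wc (camera-to-world), so that p_world = R_wc · p_cam + t_wc; the names and the sample pose are illustrative.

```python
def camera_to_world(point_cam, rotation_wc, translation_wc):
    """Map a 3-D point from camera coordinates to world coordinates."""
    return tuple(
        sum(r * p for r, p in zip(row, point_cam)) + t
        for row, t in zip(rotation_wc, translation_wc)
    )

# Hypothetical camera pose: rotated 90 degrees about the z axis, shifted 1 m along x.
rotation_wc = [[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]]
translation_wc = (1.0, 0.0, 0.0)

wrist_cam = (1.0, 0.0, 2.0)  # a hand keypoint in camera coordinates
wrist_world = camera_to_world(wrist_cam, rotation_wc, translation_wc)
```

Applying the pose of each frame to the hand keypoints of that frame removes the apparent hand motion that is actually caused by camera motion.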
[0036] Step S107: Based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system, perform global joint optimization of hand and camera motion to obtain full-degree-of-freedom hand motion data in the world coordinate system.
[0037] In this embodiment, the previously independently calculated hand pose parameters and camera motion trajectory are combined into a unified global optimization framework for adjustment. This minimizes the global reprojection error of the entire sequence, eliminates the cumulative error caused by separate calculations, and obtains the complete hand motion sequence in the world coordinate system, thereby improving the accuracy and precision of hand motion pose reconstruction.
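The reprojection error that drives this joint adjustment can be sketched as follows for a single frame. The sketch assumes an ideal pinhole camera with intrinsics (fx, fy, cx, cy) and a camera-to-world pose (R_wc, t_wc); it only evaluates the error and compares it with a threshold, and does not implement the optimizer that would adjust poses and trajectory. All names and numbers are illustrative assumptions.

```python
import math

def world_to_camera(point_world, rotation_wc, translation_wc):
    """Inverse of the camera-to-world transform: p_cam = R_wc^T (p_world - t_wc)."""
    d = [pw - t for pw, t in zip(point_world, translation_wc)]
    return tuple(sum(rotation_wc[r][c] * d[r] for r in range(3)) for c in range(3))

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

def mean_reprojection_error(points_world, observations, rotation_wc, translation_wc, intrinsics):
    """Average pixel distance between projected keypoints and observed keypoints."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points_world, observations):
        u, v = project(world_to_camera(point, rotation_wc, translation_wc), *intrinsics)
        total += math.hypot(u - u_obs, v - v_obs)
    return total / len(points_world)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)           # hypothetical fx, fy, cx, cy
keypoints_world = [(0.0, 0.0, 2.0), (0.2, 0.1, 2.0)]
observed_pixels = [(320.0, 240.0), (373.0, 269.0)]  # hypothetical 2-D detections

error = mean_reprojection_error(keypoints_world, observed_pixels,
                                identity, (0.0, 0.0, 0.0), intrinsics)
threshold = 1.0                                      # assumed error threshold in pixels
needs_more_adjustment = error >= threshold
```

In a full pipeline this error would be summed over all frames and minimized jointly over hand poses and camera poses, for example with a nonlinear least-squares solver.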
[0038] The hand motion data processing method based on visual motion capture provided in this embodiment decodes the acquired hand motion video data of the target object frame by frame, providing a complete full-sequence image foundation for subsequent processing. Hand regions are extracted from the decoded complete sequence of video frames to accurately locate the hand target and generate a dynamic mask, which shields against complex background interference and improves robustness in complex scenes. Next, the temporal context information of adjacent frames in the complete sequence of hand region frames is used to complete and enhance the hand features of each frame, yielding a global spatiotemporal hand feature sequence; this global temporal information repairs features degraded by occlusion or blurring and ensures the integrity of single-frame features in difficult scenes. Pose calculation is then performed on the global spatiotemporal hand feature sequence, with joint optimization and correction under global temporal constraints, to obtain smooth hand pose parameters in the camera coordinate system that conform to human physiology, eliminating single-frame jitter and anomalies. Camera trajectory calculation and world coordinate system construction are performed based on the complete sequence of hand dynamic masks to obtain the camera trajectory and the corresponding world coordinate system, mapping relative motion from the local viewpoint into a unified 3D world space and establishing a global spatial reference. Finally, global joint optimization of hand and camera motion is performed based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system, obtaining full-degree-of-freedom hand motion data in the world coordinate system. This eliminates the cumulative errors caused by solving hand pose and camera motion separately and outputs high-precision motion data with spatiotemporal consistency in the world coordinate system. Through a camera-only input pipeline and a global optimization architecture, the invention removes the strong dependence of traditional high-precision motion capture on dedicated hardware and fixed locations, making the acquisition end highly portable and usable out of the box. At the same time, the closed-loop global joint optimization of hand and camera motion significantly improves robustness in complex real-world scenarios and prevents error from accumulating over long sequences. The output can be applied directly, without manual correction, in professional fields such as film and animation and digital human driving, significantly reducing the barrier to entry and the implementation cost of high-precision motion capture. The closed-loop global optimization of hand and camera motion also improves the spatiotemporal consistency of the hand motion capture data and the accuracy of motion posture reconstruction.
[0039] This invention, with its professional-grade posture reconstruction accuracy and zero hardware dependency, can be applied to film and animation and game development, replacing traditional expensive motion capture equipment with just a regular camera to quickly generate high-precision hand animations; it can be used in medical rehabilitation and scientific research analysis, providing long-sequence, multi-dimensional quantitative motion data to assist doctors in functional assessments or support human kinematics research; it can be used for offline teaching of industrial robots, generating control commands by mimicking human operating movements, reducing the programming threshold for automated production lines; and it can also be used for sign language recognition and digital human driving, building a refined sign language corpus or providing natural body interaction data for virtual characters, comprehensively promoting the low-cost implementation of high-precision motion capture technology in consumer and industrial scenarios.
[0040] In some optional implementations, the process in step S103 of extracting the hand region based on the complete sequence of video frames to obtain the complete sequence of hand region frames and the complete sequence of hand dynamic masks mainly includes:
Step S1031: Perform frame-by-frame hand region detection based on the complete sequence of video frames to obtain the complete sequence of hand region frames.
[0041] This embodiment uses a pre-trained object detection model to perform bounding box detection and cropping of hand instances in each frame image based on the complete frame sequence. This accurately separates the local image region containing the hand from the original complex background, resulting in a complete sequence set of hand region frames containing only the hand target. This step effectively eliminates background noise and interference from non-hand dynamic objects by narrowing the image focus area, significantly reducing computational redundancy in subsequent feature extraction. It provides high signal-to-noise ratio local input data for refined analysis of hand movements, improving processing efficiency and robustness.
[0042] In terms of model training, the model is first pre-trained on a large-scale dataset containing general object categories to learn general visual feature extraction capabilities. It is then fine-tuned through transfer learning on a specially labeled hand dataset to optimize it for hand features, thereby improving the detection accuracy of the model in complex backgrounds.
[0043] Step S1032: Based on the full sequence of hand region frames, perform left and right hand recognition and identification association to obtain the full sequence of hand dynamic mask.
[0044] This embodiment manages the identity of hand instances in each frame based on the complete sequence of hand region frames. Hand targets in consecutive frames are matched based on appearance feature similarity and motion distance; the same hand target is assigned a unique identifier (ID) across frames and is classified as a left or right hand. Simultaneously, image segmentation techniques are used to generate a binary mask corresponding to the hand region in each frame (i.e., the full-sequence hand dynamic mask), accurately identifying the pixel-level contours of the hands. This step addresses the occlusion, interaction, and ID jumps that occur during hand movement, ensuring the continuity of hand identity throughout the entire time series. Furthermore, the generated dynamic mask provides the semantic segmentation basis that subsequent steps need to distinguish foreground (hands) from background and to construct interference-resistant camera trajectories.
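The cross-frame matching described above can be sketched as a greedy assignment on a cost that mixes appearance dissimilarity and motion distance. The equal weights, the normalization by the frame diagonal, the cost limit, and the greedy strategy (rather than an optimal assignment) are all illustrative assumptions.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def associate(tracks, detections, frame_diagonal, w_app=0.5, w_motion=0.5, max_cost=0.5):
    """Greedily match detections to existing track IDs, lowest cost first.

    tracks: {track_id: (centre, feature)}, detections: [(centre, feature), ...].
    Returns {detection_index: track_id}.
    """
    candidates = []
    for track_id, (t_centre, t_feature) in tracks.items():
        for index, (d_centre, d_feature) in enumerate(detections):
            cost = (w_app * cosine_distance(t_feature, d_feature)
                    + w_motion * math.dist(t_centre, d_centre) / frame_diagonal)
            candidates.append((cost, track_id, index))
    assignment, used_tracks = {}, set()
    for cost, track_id, index in sorted(candidates):
        if cost > max_cost:
            break
        if track_id in used_tracks or index in assignment:
            continue
        assignment[index] = track_id
        used_tracks.add(track_id)
    return assignment

# Hypothetical previous-frame tracks and current-frame detections.
tracks = {"left": ((100.0, 200.0), (1.0, 0.0)),
          "right": ((400.0, 200.0), (0.0, 1.0))}
detections = [((410.0, 205.0), (0.1, 0.9)),
              ((105.0, 195.0), (0.9, 0.1))]
assignment = associate(tracks, detections, frame_diagonal=500.0)
```

Detections left unmatched under the cost limit would start new tracks in a fuller implementation.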
[0045] In terms of model training, a large pedestrian re-identification dataset is used for pre-training to learn how to extract discriminative appearance features to distinguish different individuals; at the same time, the Kalman filter algorithm is used to predict the position of the current frame based on the state of the previous frame, thereby achieving stable and smooth trajectory association.
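The Kalman prediction mentioned above can be illustrated with a one-dimensional constant-velocity filter that tracks, for example, the x coordinate of a hand box centre. The constant-velocity model, the noise values, and the class name are assumptions chosen for illustration; a tracker for real boxes would typically use a higher-dimensional state.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, position, velocity=0.0, variance=100.0,
                 process_noise=0.01, measurement_noise=1.0):
        self.p, self.v = position, velocity
        self.P = [[variance, 0.0], [0.0, variance]]  # state covariance
        self.q, self.r = process_noise, measurement_noise

    def predict(self, dt=1.0):
        """Propagate state and covariance one step: P <- F P F^T + Q."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, measurement):
        """Correct the predicted state with a measured position."""
        (a, b), (c, d) = self.P
        innovation = measurement - self.p
        s = a + self.r                      # innovation variance
        k0, k1 = a / s, c / s               # Kalman gain
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

# Hypothetical centre positions of one hand moving about 10 px per frame.
measurements = [100.0, 110.0, 120.0, 130.0]
kf = ConstantVelocityKalman1D(position=measurements[0])
for z in measurements[1:]:
    kf.predict()
    kf.update(z)
predicted_next = kf.predict()  # expected position in the next frame
```

The predicted position can then serve as the motion term when matching detections in the next frame to existing hand IDs.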
[0046] This embodiment first performs precise frame-by-frame localization and cropping of the hand region across the entire video sequence, generating a high signal-to-noise ratio (SNR) set of hand region frames. Then, it performs cross-frame identity matching and left / right hand classification on this set, simultaneously generating pixel-level binary masks. This process effectively solves the problems of hand occlusion and ID jumps in dynamic scenes, ensuring the continuity of hand identity throughout the entire time series. Finally, by introducing a dynamic mask, the dynamic interference source of the hand is physically isolated from the static background, providing a data foundation for subsequent camera self-calibration based on background features and motion capture based on hand features, thereby ensuring the stability and accuracy of the entire visual motion capture system in complex dynamic scenes.
[0047] In some optional implementations, the process of completing and enhancing the hand features of each frame based on the temporal context information of two adjacent frames in the full sequence hand region frame set to obtain a global spatiotemporal hand feature sequence in step S104 mainly includes: Step S1041: Extract features from the hand region of each frame in the full sequence of hand region frames to obtain high-dimensional features of the hand region of each frame.
[0048] This embodiment uses a pre-trained deep backbone feature extraction network to perform convolutional operations and hierarchical feature abstraction on the hand image of each frame, based on a complete sequence of hand regions. This transforms the input two-dimensional image into a high-dimensional feature vector for each frame's hand region, containing information on texture, shape contour, structural relationships, and pose. This step, through the powerful representational capabilities of deep convolutional networks, effectively filters out interference from illumination variations and background noise, extracting stable and discriminative essential hand features, thus laying a solid visual foundation for subsequent temporal correlation and motion analysis.
[0049] In terms of model training, the model is first pre-trained on a large-scale general image dataset to learn general visual feature extraction capabilities; then it is transferred and fine-tuned on a specialized dataset containing rich hand poses to optimize it specifically for the anatomical features of the hand.
[0050] Step S1042: Establish feature association between two adjacent frames based on the context information of the full sequence of hand region frames to obtain the temporal context feature association of the full sequence of hand region frames.
[0051] This embodiment utilizes a global temporal attention network based on the high-dimensional features of the hand region in each frame to calculate the feature similarity and dependency between adjacent frames (or distant frames) throughout the entire sequence, establishing temporal context feature associations. By capturing long-distance temporal dependencies through a self-attention mechanism, the model can understand the continuity and movement trends of hand actions (such as waving trajectories and grasping preparatory movements), thereby capturing the global patterns of dynamic behavior.
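By way of a non-limiting illustration, the self-attention computation over per-frame features can be sketched as scaled dot-product attention in which every frame attends to every other frame of the sequence. The function name `temporal_self_attention` is hypothetical, and the learned query, key, and value projections of a trained network are replaced by the identity, which is an assumption made to keep the sketch short.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(features: list[list[float]]):
    """Scaled dot-product self-attention across frames.

    features: T per-frame feature vectors, each of length d.
    Returns (context, weights): context[t] is the attention-weighted mix of all
    frame features, and weights[t][s] is how strongly frame t attends to frame s.
    Assumption: queries, keys, and values are the raw features (no learned projections).
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    context, weights = [], []
    for query in features:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in features]
        w = softmax(scores)
        weights.append(w)
        context.append([sum(w_s * value[i] for w_s, value in zip(w, features))
                        for i in range(d)])
    return context, weights
```

Because the weights of each frame span the whole sequence, a frame can draw on distant frames with similar features, which is the long-distance temporal dependency referred to above.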
[0052] In terms of model training, taking the Transformer architecture as an example, a self-supervised learning approach is used for pre-training. By randomly masking some image patches and allowing the model to reconstruct them, the model is forced to learn global contextual information, thereby obtaining highly robust feature representations. Subsequently, supervised fine-tuning training is performed on a specific hand action dataset to optimize the model's understanding and prediction capabilities of the temporal action evolution patterns.
[0053] Step S1043: Based on the high-dimensional features of the hand region in each frame, the abnormal frame features in the temporal context feature association are completed and enhanced to obtain the global spatiotemporal hand feature sequence.
[0054] This embodiment intelligently completes and enhances anomalous frame features in feature association by associating high-dimensional features and temporal context features of the hand region in each frame, using a global temporal attention network or the temporal modeling unit of a Transformer. Specifically, by analyzing normal motion manifolds and spatiotemporal continuity, it infers and repairs feature loss or distortion caused by occlusion, truncation, or motion blur, outputting a global spatiotemporal hand feature sequence that contains both fine appearance information and coherent motion information. This step effectively solves the feature degradation problem caused by instantaneous interference in dynamic capture, significantly improving the robustness and stability of the feature sequence in complex real-world scenarios.
[0055] In terms of model training, taking the LSTM network as an example, training is performed on a large-scale time-series dataset containing continuous hand movement trajectories. The training process first allows the network to learn the dynamic laws of normal hand movements and master the ability to predict the features of subsequent frames based on the features of previous frames (i.e., sequence prediction pre-training). Subsequently, adversarial training or reconstruction optimization is performed on a dataset containing simulated occlusion or noise (abnormal frames) to enhance the model's tolerance and repair capabilities for incomplete inputs, thereby enabling reasonable feature reconstruction of abnormal frames during the inference stage.
[0056] This embodiment first utilizes a pre-trained deep backbone feature extraction network to extract deep visual features from the entire sequence of hand regions, transforming each 2D image into a high-dimensional vector containing texture and structure. Then, a global temporal attention network is used to establish cross-frame global temporal dependencies, capturing the dynamic trends of the action. Finally, the global temporal attention network repairs and enhances damaged or missing features. This process not only preserves the fine appearance details of the hand but also integrates dynamic motion information, generating a robust global spatiotemporal feature sequence. This feature representation, which combines appearance and motion information, provides reliable data support for high-precision 3D hand pose calculation in subsequent steps, so that stable and accurate motion parameters can still be output when the hand is occluded or blurred.
[0057] In some optional implementations, performing pose calculation based on the global spatiotemporal hand feature sequence, together with joint optimization and correction under global temporal constraints, to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera in step S105 mainly includes: Step S1051: Based on the global spatiotemporal hand feature sequence, perform frame-by-frame pose calculation in the camera coordinate system of the smart camera to obtain the initial full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera. The initial full-degree-of-freedom hand pose parameters include at least one of hand posture, shape, global orientation, and translation parameters.
[0058] This embodiment uses a pre-trained regression network to map and decode the features of each frame based on the global spatiotemporal hand feature sequence. The network treats the hand skeleton as graph-structured data, aggregating information from joints and their neighboring regions through convolutional operations to directly regress the initial full-degree-of-freedom hand pose parameters in the camera coordinate system, including hand pose (rotation angles of each joint), shape (palm width, finger length), global orientation, and translation parameters. Through this end-to-end mapping, the step skips the traditional complex geometric matching process, enabling rapid inference of 3D spatial pose from 2D visual features and providing high-precision initialization parameters for subsequent optimization.
[0059] In terms of model training, pre-training is performed on a large-scale synthetic hand dataset. Supervised learning minimizes the reprojection error and vertex error between the predicted pose and the real label, enabling the model to master the nonlinear mapping ability from the feature space to the physical space.
[0060] Step S1052: Based on the initial full-degree-of-freedom hand pose parameters, perform pose association between two adjacent frames to obtain temporally associated pose parameters.
[0061] This embodiment uses a pre-trained global temporal pose constraint network (GTCNN) to establish the motion continuity and rate of change between adjacent frames based on the initial full-degree-of-freedom hand pose parameters. This network focuses on the interaction of pose features between the current frame and the previous (and subsequent) frames through sliding window or local attention mask constraints, calculating motion consistency weights between adjacent frames to output smoothly transitioning temporally correlated pose parameters. This step effectively solves the inter-frame jump problem that may occur in single-frame calculation, concatenating discrete frame-level predictions into a continuous motion flow, enhancing the temporal coherence of the pose sequence.
[0062] In terms of model training, taking Transformer as an example, by introducing inter-frame jitter noise into the training data and training the model to predict jitter-free true poses through the context information of adjacent frames, robust anti-interference correlation ability is learned.
[0063] Step S1053: Hand movement physiological structure constraint processing is performed based on temporal correlation pose parameters to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera.
[0064] This embodiment utilizes a pre-trained global temporal pose constraint network based on temporally associated pose parameters, combined with prior knowledge of human anatomy, to jointly optimize the entire sequence. This network captures long-term dependencies throughout the action sequence through a self-attention mechanism, while introducing preset physiological priors, such as the constant length of human hand bones and the limits on joint range of motion, as constraints. This corrects abnormal postures that violate physiological structures and outputs the final, physically consistent full-degree-of-freedom hand pose parameters. This step overcomes the limitations of local temporal models, ensuring the physiological rationality and temporal consistency of the action from a global perspective.
[0065] In terms of model training, the Global Temporal Pose Constraint Network adopts a combination of self-supervised pre-training and supervised fine-tuning: first, it performs masked frame reconstruction pre-training on large-scale unlabeled action data to learn the manifold distribution of human motion; then, it performs fine-tuning on precisely labeled datasets, and further enhances its ability to identify and correct unreasonable poses by introducing an anatomical-based loss function.
[0066] This embodiment first utilizes a regression network to rapidly regress an initial full-degree-of-freedom pose, including posture, shape, and translation, from the enhanced features, achieving efficient mapping from visual feature space to three-dimensional physical space. Subsequently, a global temporal pose constraint network is used to model and correlate motion dependencies between adjacent frames, effectively filtering out random noise and jitter from single-frame estimation and ensuring smooth, temporally continuous motion transitions. Finally, the global temporal pose constraint network, combined with prior knowledge of human anatomy, performs joint optimization across the entire sequence, correcting abnormal poses that violate physiological structures and ensuring the physical rationality and global consistency of the output data. This process not only significantly improves the accuracy and robustness of hand motion capture in complex dynamic scenes but also ensures that the generated hand pose data agrees with the visual observations and follows the biomechanics of human movement.
[0067] In some optional implementations, the process of calculating the camera trajectory and constructing the world coordinate system based on the full sequence of hand dynamic masks in step S106, to obtain the camera trajectory and the corresponding world coordinate system, mainly includes: Step S1061: Extract background feature points from the global spatiotemporal hand feature sequence based on the full sequence hand dynamic mask to obtain background feature points.
[0068] This embodiment utilizes a pre-trained adaptive spatial localization module to perform keypoint detection and descriptor calculation on the non-masked regions (i.e., background regions) in the global spatiotemporal hand feature sequence based on the full-sequence hand dynamic mask, extracting stable static background feature points. This step effectively shields the hand as a high-frequency dynamic interference source through dynamic masking, ensuring that the extracted feature points all originate from the static background, thereby significantly improving the stability and accuracy of subsequent camera pose calculation.
[0069] In terms of model training, a self-supervised learning approach is used for pre-training. Pseudo-labels are generated by performing geometric transformations (such as perspective transformations) on a large number of unlabeled images, and the network is trained to learn the feature point detection capability that is robust to changes in viewpoint.
[0070] Step S1062: Perform simultaneous localization and mapping (SLAM) based on background feature points to obtain the full sequence of camera poses.
[0071] This embodiment uses extracted background feature points and a pre-trained spatial localization model to estimate the relative motion of the camera by matching background feature points between adjacent frames, and simultaneously constructs a sparse environment map, thereby outputting the full sequence camera pose for each frame. This step enables the function of mapping and localization simultaneously in an unknown environment, solving the problem of camera localization in three-dimensional space.
[0072] In terms of model training, parameter tuning and validation are performed on standard public datasets to ensure the accuracy and robustness of pose estimation under different motion modes.
[0073] Step S1063: Connect the camera poses of the entire sequence in chronological order to obtain the initial camera trajectory.
[0074] This embodiment concatenates discrete pose parameters according to the timestamp order of video frames based on the full sequence of camera poses to generate a continuous initial camera trajectory. This step integrates scattered single-frame position information into a complete motion path, intuitively reflecting the camera's motion throughout the entire shooting process. This embodiment can also combine Kalman filtering or bundle adjustment for local optimization to eliminate accumulated errors and ensure that the generated trajectory has geometric consistency in space.
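By way of a non-limiting illustration, and assuming the localization step yields the relative motion between adjacent frames, the time-ordered concatenation into an initial trajectory can be sketched as a chain of rigid-body compositions. The function name `chain_relative_poses`, the camera-to-world pose convention, and the choice of the first frame as origin are assumptions of this sketch; the optional Kalman filtering or bundle adjustment refinement is not shown.

```python
Matrix = list[list[float]]
Vector = list[float]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(a: Matrix, v: Vector) -> Vector:
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def chain_relative_poses(start_time: float, relative_motions):
    """Compose per-frame relative motions into camera-to-world poses.

    relative_motions: iterable of (timestamp, R_rel, t_rel), where (R_rel, t_rel)
    is the pose of frame k expressed in the coordinates of frame k-1.
    The first frame is taken as the origin (identity rotation, zero translation).
    Returns a list of (timestamp, R, t) sorted by timestamp.
    """
    R: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t: Vector = [0.0, 0.0, 0.0]
    trajectory = [(start_time, R, t)]
    for ts, R_rel, t_rel in sorted(relative_motions, key=lambda m: m[0]):
        # T_k = T_{k-1} * dT_k: translate first (using the old R), then rotate.
        t = [a + b for a, b in zip(t, matvec3(R, t_rel))]
        R = matmul3(R, R_rel)
        trajectory.append((ts, R, t))
    return trajectory
```

The translations produced this way are only defined up to an unknown scale in a monocular setup, which is why a separate scale calibration step follows.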
[0075] Step S1064: Perform global real-world scale calibration based on the initial camera trajectory to obtain the camera trajectory.
[0076] In this embodiment, based on the initial camera trajectory, the absolute depth of the entire video sequence is estimated using a pre-trained monocular metric depth estimation model.
[0077] In terms of model training, the depth estimation model is pre-trained on a large-scale mixed dataset containing rich scenes (including real images and synthetic rendering data), and is optimized through a scale-invariant loss function to learn to predict depth values with absolute significance.
[0078] Step S1065: Construct a unified world coordinate system for the entire sequence based on the camera trajectory to obtain the world coordinate system.
[0079] This embodiment selects a stationary reference point in the scene (usually the camera center of the initial frame or a stable background feature point) as the origin based on the camera trajectory with real physical scale. The coordinate axis orientation is determined by combining the gravity direction (inferred from the inertial measurement unit (IMU) or the vertical direction of the image), thus constructing a unified world coordinate system for the entire sequence. This step provides a fixed reference benchmark for all subsequent 3D reconstruction and motion analysis, ensuring that data from different times and perspectives can be aligned and compared within the same spatial framework.
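By way of a non-limiting illustration, the axis orientation of the world coordinate system can be sketched by taking the vertical axis opposite to the gravity direction and orthogonalizing a horizontal reference direction against it. The function name `world_axes` and the `forward_hint` argument (for example, the viewing direction of the initial frame) are assumptions of this sketch, as is the convention that the z axis points up.

```python
import math


def _normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / n for c in v]


def _cross(a: list[float], b: list[float]) -> list[float]:
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def world_axes(gravity: list[float], forward_hint: list[float]):
    """Build a right-handed orthonormal basis (x, y, z) with z pointing up.

    gravity: gravity direction, e.g. from an IMU (points down).
    forward_hint: any direction that is not parallel to gravity.
    """
    z = _normalize([-g for g in gravity])
    # Remove the vertical component of the hint (one Gram-Schmidt step).
    along_z = sum(f * c for f, c in zip(forward_hint, z))
    x = _normalize([f - along_z * c for f, c in zip(forward_hint, z)])
    y = _cross(z, x)  # completes the right-handed frame: x cross y equals z
    return x, y, z
```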
[0080] This embodiment first utilizes a pre-trained adaptive spatial localization module combined with dynamic masking to extract clean background feature points, thus shielding hand motion interference. Then, a simultaneous localization and mapping (SLAM) algorithm is used to perform high-precision localization and map construction based on these feature points, obtaining the full sequence of camera poses. Next, the poses are concatenated and global scale calibration is performed using a monocular depth estimation model, eliminating visual scale ambiguity. Finally, a unified world coordinate system is constructed based on the calibrated trajectory. This series of steps not only achieves high-precision camera motion recovery in a monocular camera configuration but also ensures that the output trajectory has a realistic physical scale and a stable spatial reference benchmark, providing the spatiotemporal anchors needed for the precise alignment of subsequent hand motion data with camera motion.
[0081] In some optional implementations, the process of performing global real-world scale calibration based on the initial camera trajectory in step S1064 to obtain the camera trajectory mainly includes: Step S10641: Perform absolute depth estimation based on the entire sequence of video frames to obtain the absolute depth map of the background scene.
[0082] This embodiment uses a pre-trained monocular metric depth estimation model to perform forward inference on each frame of the full video sequence, outputting an absolute depth map of the background scene corresponding to the pixel coordinates. This step provides a real-world physical reference for subsequent scale alignment, compensating for the lack of absolute scale information in pure geometric vision methods and ensuring consistency between the depth estimation results and the real physical world.
[0083] Step S10642: Obtain the depth values of background feature points in the absolute depth map to obtain the absolute depth of the background feature points.
[0084] In this embodiment, based on the absolute depth map and background feature points (from step S1061), the numerical values of the corresponding pixel positions of these feature points in the depth map are obtained through a query operation, and then converted into the absolute depth of the background feature points in physical units. This step establishes a direct mapping relationship between image feature points and physical depth values, providing an accurate data source for subsequent calculations of scale ratios.
[0085] Step S10643: Calculate the global scale factor based on the ratio of the absolute depth to the relative displacement of the same background feature points in the initial camera trajectory.
[0086] This embodiment calculates the ratio between the absolute depth of background feature points and the relative displacement (in pixels) of the same feature points in the initial camera trajectory, thereby solving for the global scale factor. This factor is the ratio of a real-world length to the corresponding length in the unscaled trajectory. Furthermore, the factor can be fitted using the least squares method, and mismatched points can be rejected using the random sample consensus (RANSAC) algorithm, ensuring the robustness of the scale estimation.
[0087] Step S10644: Globally scale and adjust the initial camera trajectory based on the global scale factor to obtain the camera trajectory.
[0088] This embodiment uses a global scale factor to uniformly scale all translation vectors in the initial camera trajectory, stretching or compressing the originally relatively proportional trajectory to the true physical size to obtain the camera trajectory. This scaling adjustment ensures that the distance the camera moves perfectly matches the actual distance moved in physical space.
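By way of a non-limiting illustration, steps S10643 and S10644 can be sketched as a robust ratio estimate followed by a uniform scaling of the translation vectors. This sketch assumes that the relative quantity for each background feature point is its up-to-scale depth in the reconstructed map, and it uses the median of per-point ratios as a simple robust estimator in place of the least squares or random sample consensus options mentioned above. The names `global_scale_factor` and `scale_trajectory` are hypothetical.

```python
from statistics import median


def global_scale_factor(metric_depths: list[float],
                        relative_depths: list[float],
                        min_depth: float = 1e-6) -> float:
    """Estimate the metric-per-unit scale from matched depth pairs.

    metric_depths: absolute depths (e.g. meters) read from the depth map.
    relative_depths: depths of the same points in the unscaled reconstruction.
    The median ratio tolerates a minority of mismatched points.
    """
    ratios = [m / r for m, r in zip(metric_depths, relative_depths)
              if m > min_depth and r > min_depth]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)


def scale_trajectory(trajectory, scale: float):
    """Scale every translation vector; rotations are unaffected by a global scale."""
    return [(R, [scale * c for c in t]) for R, t in trajectory]
```

For example, three consistent points with ratio 2 and one mismatched point with ratio 25 still give a scale of 2, because the median ignores the single outlier.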
[0089] This embodiment successfully solves the inherent scale ambiguity problem in monocular vision systems by utilizing the absolute depth prior provided by the monocular metric depth estimation model. By accurately calculating the global scale factor and uniformly scaling the trajectory, it ensures that the output camera trajectory has true physical metric meaning, providing a reliable scale benchmark for subsequent accurate alignment of hand motion data to the real-world coordinate system.
[0090] In some optional implementations, the process of performing global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system in step S107 to obtain full-degree-of-freedom hand motion data in the world coordinate system mainly includes: Step S1071: Based on the camera trajectory and the corresponding world coordinate system, the full-degree-of-freedom hand pose parameters in the camera coordinate system are transformed to the world coordinate system to obtain the initial hand pose sequence in the world coordinate system.
[0091] This embodiment performs a rigid body transformation on the full-degree-of-freedom hand pose parameters in the camera coordinate system based on the camera trajectory and the world coordinate system. Specifically, using the camera pose (rotation matrix and translation vector) corresponding to each frame, the coordinates of the hand joints are mapped from the camera coordinate system to the unified world coordinate system, resulting in an initial hand pose sequence in the world coordinate system. This step removes the local viewpoint limitation of the camera coordinate system, placing the hand motion in a global, fixed physical reference frame and laying the foundation for subsequent multimodal data alignment.
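By way of a non-limiting illustration, the per-frame rigid body transformation can be sketched as X_world = R · X_camera + t. The function name `camera_to_world` is hypothetical, and the sketch assumes the camera pose of each frame is stored as a camera-to-world rotation `R_wc` and translation `t_wc`; if the trajectory stores world-to-camera extrinsics instead, the inverse transform would be needed.

```python
Matrix = list[list[float]]
Vector = list[float]


def camera_to_world(points_cam: list[Vector], R_wc: Matrix, t_wc: Vector) -> list[Vector]:
    """Map 3D hand joints of one frame from camera coordinates to world coordinates.

    Assumption: (R_wc, t_wc) is the camera-to-world pose of that frame.
    """
    return [
        [sum(R_wc[i][k] * p[k] for k in range(3)) + t_wc[i] for i in range(3)]
        for p in points_cam
    ]
```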
[0092] Step S1072: Based on the camera trajectory, project the hand feature points in the initial hand pose sequence in the world coordinate system to obtain the predicted projection position.
[0093] This embodiment uses a perspective projection model to perform geometric mapping based on the initial hand pose sequence in the world coordinate system and the corresponding camera trajectory. Specifically, the 3D hand feature points in the world coordinate system are transformed to the camera coordinate system using the camera extrinsic parameters (rotation matrix and translation vector) of the current frame, resulting in 3D points in the camera coordinate system. Subsequently, using the camera intrinsic parameters (focal length and principal point), the 3D points in the camera coordinate system are projected onto the 2D image plane through perspective division to obtain normalized coordinates, which are then finally converted into pixel coordinates, i.e., the predicted projection position. This step establishes a strict geometric correspondence between the 3D physical space and the 2D observation space, and is a key step in verifying whether the solution accurately matches the original image.
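By way of a non-limiting illustration, the projection chain described above (world coordinates, camera coordinates, normalized coordinates, pixel coordinates) can be sketched with a pinhole model. The function name `project_point` is hypothetical; the sketch assumes a camera-to-world pose `(R_wc, t_wc)`, intrinsics `fx`, `fy`, `cx`, `cy` in pixels, and images that have already been corrected for lens distortion.

```python
from typing import Optional

Matrix = list[list[float]]
Vector = list[float]


def project_point(point_w: Vector, R_wc: Matrix, t_wc: Vector,
                  fx: float, fy: float, cx: float, cy: float
                  ) -> Optional[tuple[float, float]]:
    """Project a 3D world point to pixel coordinates.

    Returns None when the point lies on or behind the image plane.
    """
    # World -> camera: X_c = R_wc^T (X_w - t_wc).
    d = [p - t for p, t in zip(point_w, t_wc)]
    xc, yc, zc = (sum(R_wc[k][i] * d[k] for k in range(3)) for i in range(3))
    if zc <= 0.0:
        return None
    # Perspective division gives normalized image coordinates.
    xn, yn = xc / zc, yc / zc
    # Intrinsics map normalized coordinates to pixels.
    return fx * xn + cx, fy * yn + cy
```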
[0094] Step S1073: Match the predicted projection position with the hand feature points in the global spatiotemporal hand feature sequence, and calculate the matching error to obtain the global projection error.
[0095] This embodiment matches the predicted projection position with the actual 2D hand feature points detected in the global spatiotemporal hand feature sequence one by one. By calculating and accumulating the projection errors (e.g., Euclidean distance) between the two, the global projection error is obtained. This step quantifies the deviation between the current 3D reconstruction result and the actual visual observation. The smaller the error value, the higher the degree of consistency between the hand pose and the camera trajectory, which is a direct indicator of the optimization effect.
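By way of a non-limiting illustration, the global projection error can be sketched as the Euclidean distance between predicted and detected 2D positions, accumulated over all frames and feature points. The function name `global_projection_error` is hypothetical. As an assumption of this sketch, the accumulated distance is divided by the number of matched points, so that a fixed threshold does not depend on sequence length, and a point missing on either side is marked `None` and skipped.

```python
import math
from typing import Optional

Point2D = Optional[tuple[float, float]]


def global_projection_error(predicted: list[list[Point2D]],
                            observed: list[list[Point2D]]) -> float:
    """Mean pixel distance between predicted projections and detected keypoints.

    predicted[t][j] and observed[t][j] refer to the same keypoint j in frame t.
    """
    total, count = 0.0, 0
    for pred_frame, obs_frame in zip(predicted, observed):
        for p, o in zip(pred_frame, obs_frame):
            if p is None or o is None:
                continue  # occluded or undetected keypoint
            total += math.hypot(p[0] - o[0], p[1] - o[1])
            count += 1
    if count == 0:
        raise ValueError("no matched keypoints")
    return total / count
```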
[0096] Step S1074: Based on the global projection error, adjust the initial hand pose sequence and camera trajectory in the world coordinate system until the global projection error is less than the global projection error threshold, and obtain the full-degree-of-freedom hand motion data in the world coordinate system.
[0097] This embodiment uses minimizing the global projection error as the objective function and employs bundle adjustment to perform joint nonlinear optimization of the initial hand pose sequence and camera trajectory in the world coordinate system. Specifically, by iteratively adjusting these variables (such as camera extrinsic parameters), the projection error converges to a minimum until it is less than a preset global projection error threshold, ultimately outputting highly consistent full-degree-of-freedom hand motion data in the world coordinate system. This step eliminates the accumulated error caused by separate solutions through the global optimal solution, ensuring perfect spatiotemporal alignment of hand and camera movements.
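By way of a non-limiting illustration, the objective minimized in this step can be written in the form commonly used for bundle adjustment. All symbols are assumptions introduced here: θ_t denotes the hand pose parameters of frame t, (R_t, t_t) the camera-to-world pose of frame t, K the camera intrinsics, π the perspective projection, X_{t,j}(θ_t) the world-space position of hand feature point j, x_{t,j} its detected 2D position, and ε the preset global projection error threshold. The squared norm is the usual least-squares choice and is an assumption of this formulation.

```latex
E\bigl(\{\theta_t\}, \{R_t, \mathbf{t}_t\}\bigr)
  = \sum_{t=1}^{T} \sum_{j=1}^{J}
    \Bigl\| \pi\bigl(K,\; R_t^{\top}\,(\mathbf{X}_{t,j}(\theta_t) - \mathbf{t}_t)\bigr)
            - \mathbf{x}_{t,j} \Bigr\|_2^{2},
\qquad
\text{iterate until } E < \varepsilon .
```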
[0098] This embodiment first unifies hand movements to the world coordinate system through coordinate transformation. Then, it uses projection geometry to map the 3D data back to a 2D image for cross-validation. Based on the projection error, it jointly fine-tunes the 3D hand pose and camera trajectory. This process completely eliminates scale drift and spatiotemporal misalignment that occur during independent calculations, significantly improving the absolute accuracy and spatial stability of hand movement data. The final output motion data is geometrically highly consistent with the original video observations, providing a high-fidelity spatiotemporal reference for subsequent motion analysis and interactive applications.
[0099] The present invention provides a method for processing hand motion data based on visual motion capture, which further includes: Step S108: Remove abnormal data from the full-degree-of-freedom hand motion data in the world coordinate system that deviate from the physiological range of human hand motion.
[0100] This embodiment uses statistical analysis methods to identify and remove abnormal data that deviates from the physiological limits of human hand movement, based on the full-degree-of-freedom hand motion data in the world coordinate system. Specifically, it sets reasonable threshold ranges for hand joint angles, angular velocities, and accelerations, scans each frame of data in the entire sequence, and marks data points that exceed these ranges (such as reverse finger bending or instantaneous jumps) as abnormal; these points are then removed or replaced by interpolation between the preceding and following frames. This step cleans the data at the source, effectively eliminating extreme erroneous poses caused by severe occlusion or sudden changes in lighting and clearing the way for subsequent smoothing.
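By way of a non-limiting illustration, the threshold scan and interpolation can be sketched for a single joint angle. The function names `flag_outliers` and `fill_by_interpolation`, and the parameters `lo`, `hi`, and `max_step` (largest plausible change per frame), are assumptions of this sketch; the acceleration check is omitted, and the numeric limits used with it would have to come from anatomical data.

```python
def flag_outliers(angles: list[float], lo: float, hi: float,
                  max_step: float) -> list[bool]:
    """Mark frames whose angle is out of range or jumps too far from the last valid frame."""
    flags: list[bool] = []
    last_idx = None
    for i, a in enumerate(angles):
        bad = not (lo <= a <= hi)
        if not bad and last_idx is not None:
            # Allow max_step per elapsed frame since the last valid sample.
            bad = abs(a - angles[last_idx]) > max_step * (i - last_idx)
        flags.append(bad)
        if not bad:
            last_idx = i
    return flags


def fill_by_interpolation(values: list[float], flags: list[bool]) -> list[float]:
    """Replace flagged samples by linear interpolation between valid neighbors."""
    valid = [i for i, bad in enumerate(flags) if not bad]
    if not valid:
        raise ValueError("no valid samples to interpolate from")
    out = list(values)
    for i, bad in enumerate(flags):
        if not bad:
            continue
        prev = max((j for j in valid if j < i), default=None)
        nxt = min((j for j in valid if j > i), default=None)
        if prev is None:
            out[i] = values[nxt]       # leading gap: copy the first valid value
        elif nxt is None:
            out[i] = values[prev]      # trailing gap: copy the last valid value
        else:
            w = (i - prev) / (nxt - prev)
            out[i] = (1.0 - w) * values[prev] + w * values[nxt]
    return out
```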
[0101] Step S109: Perform motion smoothing filtering on the full-degree-of-freedom hand motion data after removing outlier data to obtain a smoothed motion data sequence.
[0102] This embodiment uses digital signal processing algorithms to perform low-pass filtering on the entire sequence of full-degree-of-freedom hand motion data after outlier removal. This step filters out high-frequency noise and micro-tremors through convolution operations, preserving the low-frequency main motion trend of the hand movements and outputting a smooth motion data sequence. This process suppresses the jitter inherent in visual algorithms, making the generated motion trajectory smoother and more natural to the human eye.
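By way of a non-limiting illustration, low-pass filtering by convolution can be sketched with a normalized Gaussian kernel applied to one motion channel. The function names `gaussian_kernel` and `smooth`, the kernel radius and `sigma`, and the edge handling (repeating the first and last samples) are assumptions of this sketch; other low-pass filters would serve equally well.

```python
import math


def gaussian_kernel(radius: int, sigma: float) -> list[float]:
    """Symmetric Gaussian weights of length 2*radius + 1 that sum to 1."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal: list[float], kernel: list[float]) -> list[float]:
    """Convolve a 1D signal with a symmetric kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat edge samples
            acc += w * signal[j]
        out.append(acc)
    return out
```

A wider kernel removes more jitter but also blurs fast, intentional finger motion, so the kernel width is a trade-off that would be tuned per application.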
[0103] Step S110: Correct the smooth motion data sequence based on hand anatomical kinematic constraints to obtain the corrected motion data sequence.
[0104] This embodiment uses a differentiable kinematic layer to apply anatomical constraints to the smooth motion data sequence for secondary correction. Specifically, it constrains the length of each hand bone to remain constant throughout the sequence and restricts the rotation angle of each joint to the range of human anatomical movement. Poses that involve interpenetration or exceed these limits are corrected, yielding a corrected motion data sequence. This step further enhances the physiological plausibility of the data and resolves artifacts or physical distortions, such as bone elongation, that may exist in purely visual data.
[0105] Step S111: Perform motion verification on the corrected motion data sequence to obtain the verified motion data sequence.
[0106] This embodiment verifies the dynamic plausibility of the motion using a physics engine simulation based on the corrected motion data sequence. Specifically, the motion data drives a virtual hand model to check for dynamic contradictions that cannot occur in reality (such as interpenetration, floating, or violation of momentum conservation). Frames with such conflicts are fine-tuned or marked, and a verified motion data sequence is output to improve the accuracy of the data.
[0107] Step S112: Based on the world space trajectory of the hand in the verified motion data sequence, trajectory optimization is performed to obtain the target full-degree-of-freedom hand motion data in the world coordinate system. The target full-degree-of-freedom hand motion data includes at least one of the global pose of the hand, joint angles, and world space trajectory in each frame.
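By way of a non-limiting illustration, the two anatomical constraints can be sketched as simple projections: joint angles are clamped to a permitted range, and each bone of a finger chain is rescaled to its reference length while keeping its direction. This is a plain projection, not the differentiable kinematic layer itself. The function names, the joint names, and any numeric limits used with them are hypothetical and are not anatomical reference values.

```python
import math


def clamp_joint_angles(angles: dict[str, float],
                       limits: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Clamp each joint angle to its (min, max) limit."""
    return {name: min(max(a, limits[name][0]), limits[name][1])
            for name, a in angles.items()}


def enforce_bone_lengths(joints: list[list[float]],
                         lengths: list[float]) -> list[list[float]]:
    """Rebuild a joint chain (root to tip) so that bone i has length lengths[i].

    Each bone keeps its original direction; only its length is corrected,
    and the correction is propagated to all joints further down the chain.
    """
    fixed = [list(joints[0])]
    for i in range(1, len(joints)):
        direction = [c - p for c, p in zip(joints[i], joints[i - 1])]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            raise ValueError("coincident joints: bone direction is undefined")
        scale = lengths[i - 1] / norm
        fixed.append([f + scale * d for f, d in zip(fixed[-1], direction)])
    return fixed
```

Applying the same `lengths` to every frame keeps the bone lengths constant across the sequence, which addresses the bone elongation artifact mentioned above.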
[0108] This embodiment uses a trajectory optimization algorithm to fine-tune the hand's world space trajectory across the entire sequence of verified motion data. The algorithm aims to minimize energy consumption by smoothing the trajectory curvature and eliminating trajectory spikes caused by data jumps, ultimately obtaining the target hand motion data with full degrees of freedom in the world coordinate system. This step further optimizes the comfort and aesthetics of the movement while ensuring data accuracy.
[0109] This embodiment first removes physiological outliers from a statistical perspective; then, signal processing techniques are used to smooth high-frequency jitter; next, kinematic constraints are used to correct physical distortions, and dynamic simulations are used to verify the feasibility of the movement; finally, trajectory optimization algorithms are used to enhance the smoothness and aesthetics of the movement. This series of refined post-processing steps completely eliminates residual noise and errors from the algorithm's solution, transforming the raw, coarse visual estimation data into high-precision, highly stable target full-degree-of-freedom hand motion data that fully conforms to the laws of human anatomy and physics.
[0110] In some optional embodiments, after decoding the hand motion video data to obtain a complete sequence of video frames in step S102, the process further includes preprocessing the complete sequence of video frames, specifically including: Step 1: Scale each frame in the complete sequence of video frames according to a preset resolution to obtain full-sequence size-normalized video frames.
[0111] Based on the complete sequence of decoded video frames, this embodiment scales each frame to a preset uniform resolution using bilinear or bicubic interpolation, converting the original video into standard-sized, full-sequence size-normalized video frames. This step eliminates size differences between input videos, ensuring a consistent input size for subsequent neural networks, while also helping to balance processing speed against accuracy and reducing unnecessary consumption of computing resources.
[0112] Step 2: Perform geometric correction on the full sequence of size-normalized video frames based on the camera calibration distortion parameters to obtain a set of corrected video frames.
[0113] This embodiment uses perspective transformation or distortion correction mapping algorithms to geometrically correct the entire sequence of size-normalized video frames based on distortion parameters (including radial and tangential distortion coefficients) obtained beforehand through camera calibration. This step corrects barrel or pincushion distortion caused by the lens through inverse mapping, restoring the image to linear perspective and obtaining a geometrically accurate set of corrected video frames, ensuring the geometric correctness of subsequent 3D reconstruction and spatial calculations.
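By way of a non-limiting illustration, the radial and tangential distortion coefficients can be used as in the widely used Brown-Conrady lens model, with the correction obtained by numerically inverting the distortion in normalized image coordinates. The function names `distort` and `undistort`, the restriction to two radial coefficients (`k1`, `k2`) and two tangential coefficients (`p1`, `p2`), and the fixed iteration count are assumptions of this sketch; the fixed-point inversion is only expected to converge for moderate distortion.

```python
def distort(x: float, y: float, k1: float, k2: float,
            p1: float, p2: float) -> tuple[float, float]:
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to normalized coordinates."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


def undistort(xd: float, yd: float, k1: float, k2: float,
              p1: float, p2: float, iterations: int = 10) -> tuple[float, float]:
    """Invert distort() by fixed-point iteration, starting from the distorted point."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y
```

To correct a whole frame, each pixel of the output image would be mapped through `distort` (after conversion to normalized coordinates with the intrinsics) to find where to sample the input image, which is the inverse mapping referred to above.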
[0114] Step 3: Perform pixel value standardization processing on the full sequence of corrected video frames based on the preset pixel standardization range to obtain the full sequence frame dataset.
[0115] This embodiment normalizes the full sequence of corrected video frames according to a preset pixel standardization range (e.g., mapping pixel values from [0,255] to [-1,1] or [0,1]). Specifically, mean subtraction and variance normalization can be applied, or the per-channel mean and standard deviation commonly used with ImageNet-pretrained models can be applied directly, to obtain the full-sequence frame dataset. This step reduces differences in brightness and contrast across lighting conditions and improves robustness to changes in lighting.
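The three options mentioned above can be sketched as follows, assuming 8-bit frames stored channels-last in RGB order. The function names are hypothetical; the mean and standard deviation constants are the values commonly used with ImageNet-pretrained models.

```python
import numpy as np

# Per-channel statistics commonly used with ImageNet-pretrained models (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_unit_range(frames):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return np.asarray(frames).astype(np.float32) / 255.0


def to_signed_range(frames):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return np.asarray(frames).astype(np.float32) / 127.5 - 1.0


def standardize(frames, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale to [0, 1], then subtract the channel mean and divide by the channel std."""
    return (to_unit_range(frames) - mean) / std
```

Which variant applies depends on what the downstream network was trained with; mixing conventions between training and inference typically degrades accuracy.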
[0116] This embodiment first unifies data specifications through size normalization, eliminating resolution differences; then, it eliminates lens distortion through geometric correction, restoring the true physical perspective; finally, it smooths out lighting and environmental interference through pixel value standardization. This series of preprocessing operations not only significantly improves the accuracy and stability of subsequent feature extraction and pose calculation, but also provides a solid data foundation for the generalization ability of the entire visual motion capture system under different hardware devices and complex environments.
[0117] Part of the present invention may be implemented as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and/or technical solutions according to the invention through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, and installation package files. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.
[0118] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims
1. A method for processing hand motion data based on visual motion capture, characterized in that the method comprises: acquiring hand motion video data of a target object, wherein the hand motion video data is collected by a smart camera; decoding the hand motion video data frame by frame to obtain a full-sequence video frame set; performing hand region extraction based on the full-sequence video frame set to obtain a full-sequence hand region frame set and a full-sequence hand dynamic mask; completing and enhancing the hand features of each frame based on the temporal context information of two adjacent frames in the full-sequence hand region frame set to obtain a global spatiotemporal hand feature sequence; performing pose calculation and joint optimization correction under global temporal constraints based on the global spatiotemporal hand feature sequence to obtain full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera; performing camera trajectory calculation and world coordinate system construction based on the full-sequence hand dynamic mask to obtain a camera trajectory and a corresponding world coordinate system; and performing global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system to obtain full-degree-of-freedom hand motion data in the world coordinate system.
2. The method according to claim 1, characterized in that the step of performing hand region extraction based on the full-sequence video frame set to obtain the full-sequence hand region frame set and the full-sequence hand dynamic mask comprises: performing frame-by-frame hand region detection based on the full-sequence video frame set to obtain the full-sequence hand region frame set; and performing left-hand and right-hand recognition and identity association based on the full-sequence hand region frame set to obtain the full-sequence hand dynamic mask.
3. The method according to claim 1, characterized in that the step of completing and enhancing the hand features of each frame based on the temporal context information of two adjacent frames in the full-sequence hand region frame set to obtain the global spatiotemporal hand feature sequence comprises: performing feature extraction on the hand region of each frame in the full-sequence hand region frame set to obtain high-dimensional hand region features for each frame; establishing feature associations between two adjacent frames based on the context information of the full-sequence hand region frame set to obtain temporal context feature associations of the full-sequence hand region frame set; and completing and enhancing the features of abnormal frames in the temporal context feature associations based on the high-dimensional hand region features of each frame to obtain the global spatiotemporal hand feature sequence.
4. The method according to claim 1, characterized in that the step of performing pose calculation and joint optimization correction under global temporal constraints based on the global spatiotemporal hand feature sequence to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera comprises: performing frame-by-frame pose calculation in the camera coordinate system of the smart camera based on the global spatiotemporal hand feature sequence to obtain initial full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera, wherein the initial full-degree-of-freedom hand pose parameters include at least one of hand pose, shape, global orientation, and translation parameters; associating the poses of two adjacent frames based on the initial full-degree-of-freedom hand pose parameters to obtain temporally associated pose parameters; and applying physiological structure constraints on hand motion to the temporally associated pose parameters to obtain the full-degree-of-freedom hand pose parameters in the camera coordinate system of the smart camera.
5. The method according to claim 1, characterized in that the step of performing camera trajectory calculation and world coordinate system construction based on the full-sequence hand dynamic mask to obtain the camera trajectory and the corresponding world coordinate system comprises: extracting background feature points from the global spatiotemporal hand feature sequence based on the full-sequence hand dynamic mask; performing simultaneous localization and mapping (SLAM) based on the background feature points to obtain full-sequence camera poses; connecting the full-sequence camera poses in chronological order to obtain an initial camera trajectory; performing global real-world scale calibration based on the initial camera trajectory to obtain the camera trajectory; and constructing a unified world coordinate system for the full sequence based on the camera trajectory to obtain the world coordinate system.
6. The method according to claim 5, characterized in that the step of performing global real-world scale calibration based on the initial camera trajectory to obtain the camera trajectory comprises: performing absolute depth estimation based on the full-sequence video frame set to obtain an absolute depth map of the background scene; obtaining the depth values of the background feature points in the absolute depth map to obtain the absolute depths of the background feature points; calculating a global scale factor based on the ratio of the absolute depth to the relative displacement of the same background feature points in the initial camera trajectory; and globally scaling the initial camera trajectory based on the global scale factor to obtain the camera trajectory.
7. The method according to claim 1, characterized in that the step of performing global joint optimization of hand and camera motion based on the full-degree-of-freedom hand pose parameters in the camera coordinate system, the camera trajectory, and the corresponding world coordinate system to obtain the full-degree-of-freedom hand motion data in the world coordinate system comprises: transforming the full-degree-of-freedom hand pose parameters in the camera coordinate system into the world coordinate system based on the camera trajectory and the corresponding world coordinate system to obtain an initial hand pose sequence in the world coordinate system; projecting the hand feature points of the initial hand pose sequence in the world coordinate system based on the camera trajectory to obtain predicted projection positions; matching the predicted projection positions with the hand feature points in the global spatiotemporal hand feature sequence and calculating the matching error to obtain a global projection error; and adjusting the initial hand pose sequence in the world coordinate system and the camera trajectory based on the global projection error until the global projection error is less than a global projection error threshold, thereby obtaining the full-degree-of-freedom hand motion data in the world coordinate system.
8. The method according to claim 1, characterized in that the method further comprises: removing abnormal data that deviates from the physiological range of human hand motion from the full-degree-of-freedom hand motion data in the world coordinate system; applying motion smoothing filtering to the full-degree-of-freedom hand motion data after outlier removal to obtain a smoothed motion data sequence; correcting the smoothed motion data sequence based on hand anatomical kinematic constraints to obtain a corrected motion data sequence; performing motion verification on the corrected motion data sequence to obtain a verified motion data sequence; and performing trajectory optimization on the hand world-space trajectory in the verified motion data sequence to obtain target full-degree-of-freedom hand motion data in the world coordinate system, wherein the target full-degree-of-freedom hand motion data includes at least one of the global hand pose, joint angles, and world-space trajectory for each frame.
9. The method according to claim 1, characterized in that, after the hand motion video data is decoded to obtain the full-sequence video frame set, the method further comprises: scaling each frame in the full-sequence decoded video frame set to a preset resolution to obtain full-sequence size-normalized video frames; performing geometric correction on the full-sequence size-normalized video frames based on camera calibration distortion parameters to obtain a set of corrected video frames; and standardizing the pixel values of the full sequence of corrected video frames based on a preset pixel standardization range to obtain a full-sequence frame dataset.
10. A hand motion data processing system based on visual motion capture, characterized in that it comprises: a smart camera and a processor; wherein the smart camera is configured to acquire hand motion video data of a target object, and the processor is configured to execute the method for processing hand motion data based on visual motion capture according to any one of claims 1-9.