A human-machine collaborative system

By combining augmented reality positioning modules and inertial measurement units (IMUs), the problems of high cost and insufficient accuracy in human posture and global positioning in dynamic environments are solved, achieving high-precision spatial positioning and dynamic interaction, which is suitable for human-machine collaborative systems.

CN119987557B · Active · Publication Date: 2026-06-30 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2025-01-24
Publication Date: 2026-06-30


Abstract

This application discloses a human-machine collaborative system that achieves motion and posture estimation and global spatial positioning of the human body in a large 3D scene using a small number of wearable sensors. The system further utilizes multimodal data fusion technology to ensure the precise position and posture of the head-mounted display in three-dimensional space, thereby achieving stable tracking by the positioning system. Furthermore, augmented reality technology is used to assist in dynamic interactive tracking between the human and the scene, establishing a two-way real-time system supporting high-precision data transmission and virtual-real interaction, truly achieving seamless integration and interaction between the physical and virtual worlds.

Description

Technical Field

[0001] This application relates to, but is not limited to, the fields of artificial intelligence and situational awareness, and in particular to a human-machine collaborative system based on egocentric global tracking, localization, and perception.

Background Technology

[0002] Augmented reality (AR) refers to overlaying information or images provided by a computer system onto real-world information and presenting the result to the user, thereby enhancing the user's perception of the real world. The key point is that the information or images are overlaid on the real world, creating a "virtual within real" effect that "enhances" the user's understanding and perception of the real world.

[0003] Capturing and modeling human-scene interactions in dynamic environments is a complex and important research topic, encompassing fields such as robotics, augmented reality, virtual reality (VR), and digital twins. Related technologies primarily focus on optical motion capture, vision-based methods, inertial measurement unit (IMU)-based techniques, and traditional positioning methods. These technologies each have their advantages and disadvantages in terms of accuracy, application scope, and cost; however, they generally suffer from problems such as complex deployment, high cost, or poor environmental adaptability, limiting their widespread application.

[0004] Overall, the relevant technologies suffer from high costs, insufficient accuracy, and complex deployment when capturing human posture, global positioning, and interactive behavior in dynamic environments. Furthermore, these technologies are often developed independently, making it difficult to form a unified system solution.

Summary of the Invention

[0005] This application provides a human-machine collaborative system that can form a unified system, improve positioning accuracy, and reduce costs.

[0006] This invention provides a human-machine collaborative system, comprising: an augmented reality positioning module, a human posture capture module, and a data fusion module; wherein,

[0007] The augmented reality positioning module is used to obtain real-time positioning information of augmented reality devices in physical space through augmented reality technology;

[0008] The human posture capture module is used to capture human posture data based on an inertial sensor (IMU) and constrain it through pre-set physical constraints to ensure the accuracy and naturalness of posture reconstruction.

[0009] The data fusion module combines the acquired real-time positioning information with the obtained posture data to achieve absolute position tracking and real-time correction of the human body, so as to ensure the consistency of the absolute position and movement posture of the human body in virtual and real scenarios.

[0010] In one exemplary instance, the system also includes a human-scene dynamic interaction tracking module, used to realize the interaction between the virtual human body and the real scene using augmented reality technology and to dynamically record scene changes. This module includes: an interaction behavior modeling submodule, a data recording submodule, and a virtual-real matching submodule; wherein,

[0011] The interactive behavior modeling submodule is used to add physical constraints and logical rules to simulate physical behavior in real interactions in order to capture the interaction behavior between the human body and the scene.

[0012] The data recording submodule is used to record the state changes of objects during scene interactions in real time through custom scripts.

[0013] The virtual-real matching submodule is used to synchronize and update interactive effects with the physical model in the virtual scene.

[0014] In one exemplary instance, the augmented reality positioning module is used to:

[0015] By combining Simultaneous Localization and Mapping (SLAM) with multi-marker registration technology provided by the augmented reality device, the position and orientation in three-dimensional space are identified to obtain the real-time positioning information of the augmented reality device in physical space.

[0016] In one exemplary instance, the augmented reality positioning module includes: a positioning data acquisition submodule, a 3D registration and alignment submodule, and a coordinate system alignment and correction submodule, wherein,

[0017] The positioning data acquisition submodule is used to collect environmental data in real time through the augmented reality device and generate the real-time positioning information through the positioning function in the SLAM.

[0018] The 3D registration and alignment submodule is used to perform spatial registration of multiple markers based on the Perspective-n-Point (PnP) algorithm and to establish the alignment between the device coordinate system and the world coordinate system.

[0019] The coordinate system alignment and correction submodule is used to combine the multi-marker correction and spatial coordinate adjustment technology to dynamically update the mapping relationship between the coordinate system of the augmented reality device and the world coordinate system, so as to ensure that the augmented reality device maintains the consistency of spatial position and orientation during dynamic interaction.

[0020] In one exemplary instance, the multiple markers include QR codes or marker points.

[0021] In one exemplary instance, the human posture capture module is used for:

[0022] The motion data collected by the IMU is transmitted to the real-time 3D engine and development platform through a high-level programming language to dynamically present the human body posture, so that the posture of the virtual human body in the augmented reality device is synchronized with the real human body in real time.

[0023] In one exemplary instance, the human posture capture module includes: an inertial sensor acquisition submodule, a posture estimation submodule, and a physical constraint optimization submodule, wherein,

[0024] The inertial sensor acquisition submodule is used to acquire acceleration and angular velocity data in real time through an IMU worn on key parts of the human body, thereby capturing the motion information of the human body.

[0025] The pose estimation submodule is used to calculate the pose data of the human body based on the parametric 3D human body model SMPL and deep learning algorithms, combined with real-time collected acceleration and angular velocity data; the pose data of the human body includes joint positions and motion states.

[0026] The physical constraint optimization submodule is used to introduce one or any combination of the following physical constraints: joint angle restrictions, motion continuity, and dynamic consistency, to constrain the captured posture data and ensure the accuracy and naturalness of the captured posture.

[0027] In one exemplary instance, the physical constraint optimization submodule can also be used to: set ground contact and slip detection constraints to optimize the stability of the human body's posture in the environment.

[0028] In one exemplary instance, the data fusion module includes a mapping submodule and a dynamic correction submodule; wherein,

[0029] The mapping submodule is used to establish an initial mapping relationship between the augmented reality device and the human skeleton through static calibration;

[0030] The dynamic correction submodule is used to dynamically update the position information of the skeleton root node using the data collected by the IMU and the nonlinear state estimation method.

[0031] In one exemplary instance, the nonlinear state estimation method is the Extended Kalman Filter (EKF) algorithm.

[0032] The human-machine collaborative system based on egocentric global tracking, positioning, and perception provided in this application embodiment achieves the estimation of the human body's position and posture in a large 3D scene through a small number of wearable sensors, realizing high-precision spatial positioning and ensuring the accurate position and posture of the head-mounted display in three-dimensional space, thereby achieving stable tracking of the positioning system.

[0033] Furthermore, the human-machine collaborative system provided in this application embodiment utilizes augmented reality technology to assist in the dynamic interaction and tracking of people and scenes, establishing a two-way real-time system that supports high-precision data transmission and virtual-real interaction, and realizing the integration and interaction between the physical world and the virtual world.

[0034] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures particularly pointed out in the description, claims and drawings.

Description of the Drawings

[0035] The accompanying drawings are used to provide a further understanding of the technical solutions of this application and constitute a part of the specification. They are used together with the embodiments of this application to explain the technical solutions of this application and do not constitute a limitation on the technical solutions of this application.

[0036] Figure 1 is a schematic diagram of the architecture of the human-machine collaborative system in the embodiments of this application;

[0037] Figure 2 is a schematic diagram illustrating the implementation process of the human-machine collaborative system in the embodiments of this application;

[0038] Figure 3 is a schematic diagram illustrating the implementation process of the multi-marker-based 3D registration method in the augmented reality positioning module of this application embodiment;

[0039] Figure 4 is a schematic diagram of the SMPL human body model and the sensor wearing positions in this system in the embodiments of this application;

[0040] Figure 5 is a schematic diagram illustrating the data fusion implementation process in an embodiment of this application.

Detailed Implementation

[0041] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in detail below with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be arbitrarily combined with each other.

[0042] In a typical configuration of this application, the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

[0043] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0044] Computer-readable media include both permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0045] The steps illustrated in the flowchart in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order than that presented here.

[0046] Optical motion capture technology utilizes a multi-camera system and external markers to achieve high-precision capture of human movements through 3D reconstruction algorithms, and is widely used in film production and professional animation. However, this method is sensitive to ambient lighting conditions and performs poorly in bright light, shadow, or low light environments. Furthermore, the deployment and calibration of multiple cameras are complex and costly, making it unsuitable for non-professional users or dynamic wide-area scenarios.

[0047] Vision-based methods do not rely on external markers and achieve human pose estimation using one or more cameras combined with deep learning algorithms. While this approach offers improvements in accuracy and flexibility, it is susceptible to occlusion in dynamic scenes and has high computational complexity, making it difficult to meet real-time requirements. Furthermore, vision-based methods depend heavily on feature points, and ambient lighting and dynamic changes significantly affect their performance.

[0048] Motion capture technology based on inertial measurement units (IMUs) has gained increasing attention due to its portability, low cost, and ease of deployment. IMUs can capture human motion data without relying on external devices, making them suitable for dynamic and diverse scenarios. However, due to inherent drift errors, IMUs have limited accuracy when used alone, making it difficult to provide global position information of the human body. Therefore, inertial sensors typically need to be combined with other technologies, such as visual positioning or static calibration, to improve overall performance.

[0049] In the field of positioning technology, traditional methods typically rely on external positioning systems such as GPS, LiDAR, or multiple cameras. While these methods can provide high-precision location information, they perform poorly indoors or in occluded environments, and are costly and have high deployment requirements. In recent years, Visual Inertial Odometry (VIO) and Simultaneous Localization and Mapping (SLAM) technologies have gradually become mainstream. They achieve autonomous device localization and scene modeling by fusing data from cameras and IMUs. Although VIO and SLAM have advantages in flexibility and convenience, they still face problems such as accumulated errors and insufficient adaptability to dynamic environments.

[0050] To address at least one of the aforementioned problems, this application provides a human posture capture and human-scene interaction modeling system based on egocentric sensors (such as cameras and IMUs). By combining visual and inertial data for multimodal fusion, it not only overcomes the limitations of related technologies but also features low cost, flexible deployment, and strong environmental adaptability, thus providing more practical solutions for fields such as robotics, AR/VR, and digital twins.

[0051] Figure 1 is a schematic diagram of the architecture of the human-machine collaborative system in the embodiments of this application. As shown in Figure 1, the system may include: an augmented reality positioning module, a human posture capture module, and a data fusion module; wherein,

[0052] The augmented reality positioning module is used to obtain real-time positioning information of augmented reality devices in physical space through augmented reality technology.

[0053] In one embodiment, the augmented reality positioning module can be used to: identify high-precision position and attitude in three-dimensional space by using Simultaneous Localization and Mapping (SLAM) combined with multi-marker registration technology provided by the augmented reality device to obtain real-time positioning information of the augmented reality device in physical space, and provide a reference benchmark for human posture capture and scene interaction.

[0054] The human posture capture module is used to capture human posture data based on inertial sensors and constrain it through pre-set physical constraints to ensure the accuracy and naturalness of posture reconstruction.

[0055] In one embodiment, the human posture capture module can be used to transmit motion data collected by an IMU, via a high-level programming language such as Python, to a real-time 3D engine and development platform such as Unity3D or another visualization environment, which dynamically presents the human posture. In augmented reality devices (such as HoloLens 2), the virtual human's posture is synchronized with the real human in real time: the virtual human's movements are consistent with the real human's movements in time (with almost no delay), and the posture reflects the real human's posture as accurately as possible. For example, if a user raises their right hand, the virtual human model's right hand immediately makes the same raising motion, enhancing the interactive experience. Here, Python is a high-level programming language widely used in data processing, algorithm development, and system integration; Unity is a popular real-time 3D engine and development platform widely used in game development, AR, VR, and 3D model visualization.

[0056] The data fusion module combines the acquired real-time positioning information with the obtained posture data to achieve absolute position tracking and real-time correction of the human body, so as to ensure the consistency of the absolute position and movement posture of the human body in virtual and real scenarios.

[0057] In one embodiment, the augmented reality device may include, but is not limited to, a head-mounted augmented reality device (or simply head-mounted display).

[0058] In one exemplary instance, the augmented reality positioning module may include a positioning data acquisition submodule, a 3D registration and alignment submodule, and a coordinate system alignment and correction submodule, wherein,

[0059] The positioning data acquisition submodule is used to collect environmental data in real time through the visual sensors of augmented reality devices such as head-mounted displays, and generate real-time positioning information of augmented reality devices in physical space through the positioning function in SLAM.

[0060] The 3D registration and alignment submodule is used to spatially register multiple markers based on the Perspective-n-Point (PnP) algorithm and to establish the alignment between the device coordinate system and the world coordinate system. In one embodiment, the multiple markers may include, but are not limited to, QR codes or marker points. Multiple markers are used to improve registration accuracy, thereby ensuring that virtual content is accurately overlaid on the real scene.

[0061] The coordinate system alignment and correction submodule is used to combine multi-marker correction and spatial coordinate adjustment technology to dynamically update the mapping relationship between the coordinate system of augmented reality devices such as head-mounted displays and the world coordinate system. This ensures that augmented reality devices such as head-mounted displays maintain the consistency of spatial position and orientation during dynamic interaction, thereby achieving the stability and accuracy of device positioning.

[0062] In this embodiment, the augmented reality positioning module collects environmental data in real time through the visual sensor of an augmented reality device, such as a head-mounted augmented reality device, and uses the device's SLAM technology to achieve spatial positioning and obtain the real-time positioning data of the head-mounted display. Combined with the multi-marker registration method based on the PnP algorithm, the device coordinate system is aligned and registered with the world coordinate system, thereby achieving high-precision spatial positioning and ensuring the accurate position and posture of the head-mounted display in three-dimensional space.
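The alignment between the device coordinate system and the world coordinate system described above can be illustrated with a minimal sketch. It assumes that a PnP solver has already produced the pose of one marker in the device frame and that the pose of the same marker in the world frame is known from a survey. The function names and the yaw-only test poses are hypothetical and are not taken from this application.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def yaw_pose(yaw_deg, x, y, z):
    """Build a 4x4 pose whose rotation is about the vertical (y) axis."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [[c, 0.0, s, x],
            [0.0, 1.0, 0.0, y],
            [-s, 0.0, c, z],
            [0.0, 0.0, 0.0, 1.0]]

def device_to_world(world_from_marker, device_from_marker):
    """Compose world_from_device = world_from_marker * inverse(device_from_marker)."""
    return mat_mul(world_from_marker, rigid_inverse(device_from_marker))

def transform_point(t, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return [sum(t[i][k] * p[k] for k in range(3)) + t[i][3] for i in range(3)]
```

With several markers, one such estimate per marker could be combined (for example by averaging), which is one plausible reading of how multiple markers improve registration accuracy.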

[0063] In one exemplary instance, the human posture capture module may include an inertial sensor acquisition submodule, a posture estimation submodule, and a physical constraint optimization submodule, wherein,

[0064] The inertial sensor acquisition submodule is used to acquire acceleration and angular velocity data in real time through IMUs worn on key parts of the human body, such as six IMUs worn on the main joints of the human body, to capture human motion information.

[0065] The pose estimation submodule is used to calculate the human body's pose data, including joint positions and motion states, based on the parametric 3D human body model (SMPL, Skinned Multi-Person Linear) and deep learning algorithms, combined with real-time collected acceleration and angular velocity data.

[0066] The physical constraint optimization submodule is used to introduce one or any combination of the following physical constraints: joint angle restrictions, motion continuity, dynamic consistency, etc., to constrain the captured posture data and ensure the accuracy and naturalness of the captured posture. Further, in one embodiment, the physical constraint optimization submodule can also be used to set ground contact and sliding detection constraints to optimize the stability of the human body's posture in the environment.

[0067] The range of motion of human joints is limited; for example, the knee joint cannot bend significantly in the opposite direction, and the rotation angle of the arm is also limited. In this embodiment, by setting limits on joint angles (such as upper and lower limits), it is possible to avoid generating posture models with incorrect postures that do not conform to human anatomy, prevent unnatural postures, and ensure that the virtual human body looks realistic and believable.
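A joint angle restriction of this kind can be sketched as a simple clamp. The joint names and limit values below are hypothetical placeholders, and the sketch assumes that each joint is described by a single flexion angle in degrees, which is a simplification of the per-joint rotations used by the SMPL model.

```python
# Hypothetical limits in degrees; real values depend on the skeleton model.
JOINT_LIMITS_DEG = {
    "knee_flexion": (0.0, 150.0),
    "elbow_flexion": (0.0, 145.0),
}

def clamp_joint_angles(angles_deg, limits=JOINT_LIMITS_DEG):
    """Clamp each joint angle to its (lower, upper) limit.

    Joints without an entry in `limits` fall back to (-180, 180).
    """
    clamped = {}
    for joint, angle in angles_deg.items():
        lo, hi = limits.get(joint, (-180.0, 180.0))
        clamped[joint] = min(max(angle, lo), hi)
    return clamped
```

For example, a knee angle of -20 degrees (bending backwards) would be pulled back to the lower limit of 0 degrees.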

[0068] Human movement is continuous and smooth, rather than sudden jumps or discontinuous changes. In this embodiment, by optimizing the algorithm, the joint movement trajectory of the model is constrained to conform to the temporal continuity characteristics of human movement, making the virtual human's movements smoother and avoiding jitter or sudden displacement.
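One simple way to express such a continuity constraint is to limit how far each joint angle may change per frame and then blend toward the limited target. The sketch below is an illustration only: the rate limit, the blending factor `alpha`, and the function name are assumptions, and angle wrap-around is ignored.

```python
def smooth_angles(prev, raw, dt, max_rate_deg_s=600.0, alpha=0.5):
    """Rate-limit and low-pass filter a list of joint angles (degrees).

    prev: angles output for the previous frame
    raw:  angles estimated for the current frame
    dt:   time since the previous frame in seconds
    """
    max_step = max_rate_deg_s * dt
    out = []
    for p, r in zip(prev, raw):
        # Limit the per-frame change so the pose cannot jump.
        step = max(-max_step, min(max_step, r - p))
        target = p + step
        # Blend toward the limited target to suppress jitter.
        out.append(p + alpha * (target - p))
    return out
```

A larger `alpha` follows the raw estimate more closely at the cost of more jitter; a smaller value is smoother but adds lag.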

[0069] Human movements need to conform to the laws of mechanics, such as gravity, inertia, and friction. In the embodiments of this application, by introducing these dynamic constraints, it can be ensured that the movements of the virtual human body conform to physical logic (e.g., the natural force exerted on the feet when running), enhancing the realism of the virtual human body, especially in fast-moving or complex action scenarios, and avoiding appearances that are distorted or contrary to common sense.

[0070] Precisely matching the posture of the virtual human (digital model) with the posture of the user in the real world can enhance the immersive and interactive experience of augmented reality or virtual reality. For example, when a user raises their arm, the virtual human's arm is displayed synchronously at the same position and angle.

[0071] When the human body interacts with the ground, such as when standing or walking, this embodiment of the application also detects whether the feet are in correct contact with the ground and prevents the model from "floating" or "sliding" (such as the feet leaving the ground or moving abnormally), so as to enhance the realism of the interaction between the virtual human body and the scene and make the movements look natural.
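Ground contact and sliding detection can be sketched with two thresholds: a foot is treated as planted when it is close to the ground plane and moving slowly, and a planted foot is held in place. The thresholds, the flat ground plane at a fixed height, and the function name are assumptions made for illustration.

```python
def stabilize_foot(prev_pos, pos, dt, ground_y=0.0,
                   height_tol=0.03, speed_tol=0.15):
    """Return (corrected_position, in_contact) for one foot sample.

    Positions are (x, y, z) tuples in metres with y pointing up.
    """
    dx, dz = pos[0] - prev_pos[0], pos[2] - prev_pos[2]
    horizontal_speed = (dx * dx + dz * dz) ** 0.5 / dt
    near_ground = pos[1] - ground_y <= height_tol
    in_contact = near_ground and horizontal_speed <= speed_tol
    if in_contact:
        # Pin the foot: keep the previous horizontal position, snap to ground.
        return (prev_pos[0], ground_y, prev_pos[2]), True
    # Never allow the foot to sink below the ground plane.
    return (pos[0], max(pos[1], ground_y), pos[2]), False
```

In a full system the correction applied to the foot would normally be propagated back to the root so that the whole body moves consistently; that step is omitted here.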

[0072] In this embodiment, the human posture capture module collects joint motion data in real time using a small number of IMUs (inertial measurement units), such as six, worn on key parts of the human body. Based on the SMPL human body model and deep learning algorithms, combined with acceleration and angular velocity data provided by the IMUs, it calculates the three-dimensional posture of the human body in real time, including motion state estimation and joint position reconstruction. Simultaneously, physical constraints such as joint angle limitations, motion continuity, and dynamic consistency are introduced into the generated posture model to ensure the accuracy and naturalness of posture reconstruction, thereby achieving precise synchronization between the virtual human body and the actual scene. Furthermore, visualization is performed on a PC, while constraints such as ground contact and sliding detection are set.

[0073] In one exemplary instance, the data fusion module may include a mapping submodule and a dynamic correction submodule, wherein,

[0074] The mapping submodule is used to establish an initial mapping relationship between the augmented reality device and the human skeleton through static calibration, such as the positional transformation between the head and the pelvis. In one embodiment, the relationship between the head and the root node of the skeleton (such as the pelvis) can be statically calibrated, and the pose changes between the head and the pelvis can be dynamically updated through human kinematics algorithms.

[0075] The dynamic correction submodule is used to dynamically update the position information of the skeletal root node (such as the pelvis) using data acquired by the IMU and a nonlinear state estimation method such as the Extended Kalman Filter (EKF) algorithm. In one embodiment, the IMU data (motion data from the IMU sensors) is used as the observation, and the real-time positioning data (absolute positioning data provided by the augmented reality headset) is used as the prediction, that is, the reference, to dynamically optimize the position information of the skeletal root node and ensure the continuity and accuracy of posture and position. The fusion of IMU data and real-time positioning information in this embodiment ensures real-time synchronization of human posture and spatial position.
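The general idea of correcting the root position by fusing inertial data with absolute positioning can be sketched with a one-axis filter. The sketch makes several assumptions that are not stated in this application: it uses a linear constant-velocity model, for which the EKF reduces to an ordinary Kalman filter; it follows the common arrangement in which IMU acceleration propagates the state and the headset-derived root position corrects it, which may differ from the assignment of observation and reference described above; and the class name and noise values are hypothetical.

```python
class AxisKalman:
    """Position/velocity filter for one axis of the skeleton root."""

    def __init__(self, pos=0.0, accel_var=1.0, meas_var=1e-4):
        self.p, self.v = pos, 0.0
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.accel_var = accel_var          # assumed IMU acceleration noise
        self.meas_var = meas_var            # assumed headset position noise

    def predict(self, accel, dt):
        """Propagate the state with an IMU acceleration sample."""
        self.p += self.v * dt + 0.5 * accel * dt * dt
        self.v += accel * dt
        (a, b), (c, d) = self.P
        q = self.accel_var
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + q * dt ** 4 / 4,
             b + dt * d + q * dt ** 3 / 2],
            [c + dt * d + q * dt ** 3 / 2,
             d + q * dt ** 2],
        ]

    def update(self, measured_pos):
        """Correct the state with an absolute position measurement."""
        s = self.P[0][0] + self.meas_var
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        residual = measured_pos - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        (a, b), (c, d) = self.P
        # P <- (I - K H) P with H = [1, 0]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

One filter instance per axis could be run for the x, y, and z coordinates of the root node, calling `predict` at the IMU rate and `update` whenever a headset-derived root position is available. A constant acceleration bias that would make pure integration drift by metres within seconds is then held close to the measured position.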

[0076] The human-machine collaborative system provided in this application is a human posture capture and human-scene interaction modeling system. Through a small number of wearable sensors (IMU sensors and an augmented reality head-mounted device), it estimates the position and posture of the human body in a large 3D scene, achieves high-precision spatial positioning, ensures the accurate position and posture of the head-mounted display in three-dimensional space, and thus realizes stable tracking of the positioning system.

[0077] In one exemplary instance, the human-machine collaborative system provided in this application embodiment may further include a human-scene dynamic interaction tracking module, used for:

[0078] Augmented reality technology is used to enable interaction between virtual human bodies and real-world scenes, and to dynamically record scene changes, thereby providing dynamic update functionality for the interaction between virtual human bodies and the environment, and achieving a high degree of synchronization between virtual and real scenes.

[0079] In one embodiment, the human-scene dynamic interaction tracking module may include:

[0080] The interactive behavior modeling submodule is used to add physical constraints and logical rules to simulate physical behaviors in real-world interactions, capturing human-scene interactions such as pushing objects and changing their states. In one embodiment, gravity, friction, and elasticity can be added as physical constraints using a physics engine to simulate realistic interaction effects. In another embodiment, custom logical rules can be used to record information such as the force, path, and velocity of objects.

[0081] The data recording submodule is used to record the state changes of objects (such as path, speed, force, etc.) in real time during scene interaction through custom scripts.
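A recording script of this kind can be sketched as a small class that appends a record whenever a tracked object has moved. The application does not specify the script language or data format; the class name, record fields, and movement threshold below are hypothetical, and the force value is assumed to be supplied by the physics simulation.

```python
import json
import math

class InteractionRecorder:
    """Append one record whenever a tracked object's state changes."""

    def __init__(self, min_move=1e-3):
        self.min_move = min_move   # metres; smaller moves are ignored
        self.last = {}             # object id -> (time, position)
        self.records = []

    def observe(self, obj_id, t, position, force=0.0):
        """Feed one sample; record it if the object moved since the last record."""
        prev = self.last.get(obj_id)
        if prev is None:
            self.last[obj_id] = (t, position)
            return
        t0, p0 = prev
        dist = math.dist(p0, position)
        if dist < self.min_move or t <= t0:
            return                 # no meaningful change
        self.records.append({
            "id": obj_id,
            "t": t,
            "position": list(position),
            "speed": dist / (t - t0),
            "force": force,
        })
        self.last[obj_id] = (t, position)

    def to_json(self):
        """Serialize the recorded path for later analysis."""
        return json.dumps(self.records)
```

Because unchanged samples do not update the stored state, the recorded speed is the average speed since the last recorded change rather than an instantaneous value.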

[0082] The virtual-real matching submodule is used to synchronize the interactive effects with the physical model in the virtual scene, providing data support for subsequent virtual scene analysis and modeling.

[0083] In this embodiment, augmented reality technology is used to provide dynamic updates for the interaction between the captured virtual human and the environment, recording in real time changes in the scene as the user interacts (such as the movement or state change of objects). The human-scene dynamic interaction tracking module simulates physical behaviors during the interaction process by adding physical constraints and logical rules, such as gravity and elasticity, and records information such as force, path, and speed through custom scripts. This ensures accurate matching of the interaction effects between the virtual scene and the real scene, and provides support for subsequent analysis and virtual environment modeling.

[0084] Furthermore, the human-machine collaborative system provided in this application embodiment utilizes augmented reality technology to assist in the dynamic interaction and tracking of people and scenes, establishing a two-way real-time system that supports high-precision data transmission and virtual-real interaction, and realizing the integration and interaction between the physical world and the virtual world.

[0085] The human-machine collaborative system provided in this application integrates augmented reality, inertial sensors, data fusion, and dynamic interaction technologies. It achieves high-precision alignment between virtual and real scenes by using augmented reality devices and multi-marker registration; it accurately reconstructs human posture by combining IMU sensors and deep learning algorithms; multi-sensor fusion ensures a high degree of consistency between the virtual human body and the real environment; and it enhances the interaction between virtual and real scenes through physical behavior simulation and logical rules. This system achieves high-precision synchronization and dynamic interaction between the virtual human body and the actual scene, and is applicable to multiple fields such as virtual reality, motion capture, and human-computer interaction.

[0086] In one exemplary instance, with reference to Figure 2, the HoloLens 2 augmented reality device can be used to achieve real-time positioning of the headset and precise registration with the virtual space, providing a foundation for overlaying virtual content on real-world scenes. In one embodiment, the augmented reality positioning module can be developed and implemented using the Microsoft HoloLens 2 augmented reality device, a PC (64-bit Windows 10), Unity3D, Visual Studio, the Mixed Reality Toolkit (MRTK), and the Vuforia engine. The goal of the augmented reality positioning module is to acquire the pose data of the HoloLens 2 device in real time and to receive, process, and send data through the PC, so as to achieve alignment and registration between the virtual space and the real-world scene. The construction process of the augmented reality positioning module generally includes:

[0087] An augmented reality application supporting HoloLens 2 is created in a development environment built with Unity3D and Visual Studio. The Mixed Reality Toolkit (MRTK) is used to simplify HoloLens 2 development, while the Vuforia engine is used for real-world marker recognition and registration. The application running on HoloLens 2 acquires the device's pose data (including position and orientation) in real time via the MRTK interface and transmits it to the PC over a network connection. In one embodiment, to ensure the stability and efficiency of data transmission, real-time communication between HoloLens 2 and the PC is established using the TCP/IP protocol.
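By way of non-limiting illustration, the transmission of pose data over a TCP connection can be sketched as follows in Python. The wire format (a float64 timestamp followed by float32 position and quaternion components) and the names `POSE_FORMAT`, `pack_pose`, `unpack_pose`, and `recv_exact` are assumptions made for this sketch; this application does not specify a message layout.

```python
import struct

# Hypothetical wire format (not specified in this application):
# little-endian float64 timestamp, float32 position (x, y, z),
# float32 orientation quaternion (x, y, z, w).
POSE_FORMAT = "<d3f4f"
POSE_SIZE = struct.calcsize(POSE_FORMAT)  # 8 + 12 + 16 = 36 bytes


def pack_pose(timestamp, position, quaternion):
    """Serialize one pose sample into a fixed-size binary frame."""
    return struct.pack(POSE_FORMAT, timestamp, *position, *quaternion)


def unpack_pose(payload):
    """Deserialize a frame into (timestamp, position, quaternion)."""
    values = struct.unpack(POSE_FORMAT, payload)
    return values[0], values[1:4], values[4:8]


def recv_exact(sock, size):
    """Read exactly `size` bytes from a connected socket.

    TCP is a byte stream, so a single recv() call may return only part
    of a frame; this loop keeps reading until the frame is complete.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before a full pose frame arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

On the receiving side, a loop of `unpack_pose(recv_exact(sock, POSE_SIZE))` would yield one pose per frame. A fixed-size frame is used here only because it keeps framing trivial on a stream protocol.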

[0088] In actual operation, the system coordinates are first initialized by setting the user's position at the time of HoloLens startup as the coordinate origin, which keeps the virtual space consistent with the real scene and facilitates subsequent 3D registration and alignment. After initialization, the application is launched and enters normal operation. Upon successful connection to the server, it enters 3D registration mode and provides interactive operations through the user interface (UI). In one embodiment, the UI can use classic virtual buttons and a control panel: the user starts the registration process by clicking a virtual button, and then selects and confirms the position of each marker one by one via a virtual button to ensure accurate positioning. During 3D registration, the system uses a multi-marker 3D registration algorithm, as shown in Figure 3: the Vuforia engine identifies multiple markers in the real world and registers their coordinates with those in the virtual space to achieve high-precision alignment between the virtual space and the real scene. The user can manually confirm the alignment position of each marker. Once all markers are confirmed to be correct, the system automatically calculates and completes the alignment between the virtual space and the real scene, and the registration process ends.

[0089] Figure 3 shows how virtual and real-world scenes are aligned using an augmented reality device (such as HoloLens 2) through the recognition and registration of markers. Specifically, this can include:

[0090] First, the core object is identified, that is, the first sub-object is selected as the core object, and its position and rotation information are recorded as a reference and basis for subsequent calculations. In one embodiment, the camera of an augmented reality device (such as HoloLens 2) can be used to scan markers (such as QR codes), and the first identified marker is selected as the core object. Its position and rotation information before transformation are recorded as a reference and basis for subsequent transformations. The information (position and pose) of the core object will be used to guide the registration of other sub-objects.

[0091] Next, sub-objects are selected, that is, all other markers besides the core object are traversed and selected, and processed one by one. In one embodiment, for each sub-object, its position and rotation information relative to the core object are determined. The markers can be identified by the Vuforia engine or similar tools and aligned with the core object.

[0092] Then, the rotation matrix and rotation angle are calculated, that is, the new position and pose of each sub-object are calculated, and a corresponding 4×4 matrix (Matrix4×4) is constructed. In one embodiment, the absolute position of the sub-object in virtual space can be derived based on the pose information (position and rotation) of the core object; the rotation angle and transformation matrix are calculated to describe the spatial transformation relationship from the core object to the sub-object.

[0093] Finally, the main object is adjusted, that is, the overall coordinate system is adjusted to match the position and orientation of the core object, ensuring alignment between the virtual scene and the actual scene. In one embodiment, rotation and translation matrices can be applied to adjust the positions of all sub-objects relative to the core object, so that the final virtual space layout remains consistent with the actual scene.

[0094] The process shown in Figure 3 efficiently achieves 3D registration between the virtual space and the real scene by confirming each marker, calculating the pose, and adjusting the position, providing a reliable spatial positioning foundation for augmented reality applications.
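By way of non-limiting illustration, the core-object and sub-object steps described above can be sketched with 4×4 homogeneous matrices in Python. The sketch assumes rigid transforms, uses a rotation about the z axis only to keep the example short, and introduces the hypothetical helper names `pose_from_yaw`, `relative_pose`, and `register_scene`; it is not the registration algorithm of the Vuforia engine or of this application.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    inv_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [inv_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_yaw(yaw_deg, position):
    """Build a 4x4 pose from a rotation about the z axis and a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    x, y, z = position
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def relative_pose(core, child):
    """Pose of a sub-object expressed in the core object's frame."""
    return mat_mul(rigid_inverse(core), child)


def register_scene(core_virtual, core_observed, children_virtual):
    """Move every virtual sub-object by the correction that maps the virtual
    core marker onto its observed real-world pose."""
    correction = mat_mul(core_observed, rigid_inverse(core_virtual))
    return [mat_mul(correction, child) for child in children_virtual]
```

Because the same rigid correction is applied to every sub-object, the pose of each sub-object relative to the core object is unchanged by the adjustment, which is the property the final step relies on.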

[0095] In one exemplary instance, with reference to Figure 2, the human pose capture module is constructed using the SMPL skeleton model and a sparse sensor-based motion capture algorithm to achieve real-time human pose reconstruction in Unity; by combining physical constraints and inverse kinematics (IK) techniques, it ensures that the generated pose conforms to natural motion and physical rules. The goal of the human pose capture module is to achieve real-time 3D human pose reconstruction based on IMU sensors and an SMPL-based motion capture algorithm, and to ensure the naturalness and physical consistency of the pose through physical constraints. The construction process of the human pose capture module roughly includes:

[0096] First, the human skeleton is extracted using the SMPL model and kinematic analysis is performed. The SMPL model is a parametric human model based on a skeletal structure. Skin rigging is performed using the SMPL model to generate a high-quality skeleton and pose, as shown in Figure 4. The SMPL skeleton model is migrated to Unity, and model development tools and plugins are used to ensure that it can be rendered and animated within the Unity environment. Commercial sensors, such as the Noitom PN3 sensor, are used to collect motion data from key human nodes (such as the joints where the IMUs are located). Sensor data is transmitted via a receiver and forwarded to the PC through intermediate data broadcasting. In one embodiment, to read the sensor data, a communication protocol interface can be written in C++ that decodes the raw data acquired from the sensors into usable joint position data. For subsequent pose estimation and reconstruction, the algorithm code is written in Python and integrated through Unity's communication interface. The data processing and visualization debugging process is shown in Figure 2.

[0097] Then, based on the SMPL model, a fast human pose estimation algorithm is applied to infer the positions of the joints throughout the body from the key human nodes (e.g., the 6 human nodes where the wearable inertial sensors are located), as shown in Figure 4. A continuously differentiable and trainable model is used to inversely solve the motion, enabling reasonable estimation of human posture based on joint motion data. The resulting whole-body joint posture data does not include movement data. In other words, the joint posture data derived by the human posture estimation algorithm based on the SMPL model and the IMUs represents only the relative motion and posture of human joints (such as bending angle, rotation angle, and direction), and does not include the global positional changes of the human body as a whole (i.e., the overall displacement or movement path in space). In the embodiments of this application, the posture estimation algorithm therefore does not need to process complex overall motion data (movement trajectory) and can focus efficiently on the local motion between joints.

[0098] When acquiring joint data and driving the SMPL model to move, physical constraints are applied to ensure that the generated posture conforms to real-world physics rules, preventing joints from penetrating the ground and maintaining reasonable postures and angles. Inverse kinematics (IK) algorithms are used to ensure that the feet remain in contact with the ground during movement, preventing them from penetrating the ground. Furthermore, constraints such as joint angle limitations, motion continuity, and dynamic consistency can be introduced to ensure natural postures and smooth movements. After adjusting all joint data, in Unity, the SMPL model updates its posture in real time based on the input joint data, ensuring that the virtual character's movement conforms to actual physical constraints.
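By way of non-limiting illustration, three of the constraints named above (joint angle limitation, motion continuity, and prevention of ground penetration) can be sketched as simple post-processing functions in Python. This is a simplified stand-in and not an inverse kinematics solver; the function names, the y-up axis convention, and the numeric limits in the usage note are assumptions made for the sketch.

```python
def clamp_joint_angle(angle_deg, lower_deg, upper_deg):
    """Joint angle limitation: keep an angle inside its allowed range."""
    return max(lower_deg, min(upper_deg, angle_deg))


def limit_angle_step(previous_deg, target_deg, max_step_deg):
    """Motion continuity: bound the change of an angle between two frames."""
    step = max(-max_step_deg, min(max_step_deg, target_deg - previous_deg))
    return previous_deg + step


def resolve_ground_penetration(joint_positions, ground_height=0.0):
    """Shift the whole skeleton up so that no joint lies below the ground.

    joint_positions maps a joint name to (x, y, z), with y pointing up
    (an assumed convention). The skeleton is left untouched when the
    lowest joint is already at or above the ground.
    """
    lowest = min(p[1] for p in joint_positions.values())
    offset = max(0.0, ground_height - lowest)
    return {name: (x, y + offset, z)
            for name, (x, y, z) in joint_positions.items()}
```

For example, `clamp_joint_angle(170.0, 0.0, 150.0)` limits a hypothetical knee flexion to 150 degrees, and `resolve_ground_penetration` lifts a skeleton whose foot joint has sunk below `ground_height` until that joint rests on the ground.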

[0099] In one exemplary instance, after the augmented reality positioning module and the human pose capture module are constructed, each is initialized and calibrated. With reference to Figure 5, this can include:

[0100] In the augmented reality positioning module, the initial position of the HoloLens at startup can be used as the origin of the camera coordinate system, so that preliminary positioning data of the head-mounted display in the world coordinate system is obtained directly. In the multi-marker 3D registration process, the marker positions are used as alignment references to align the camera coordinate system with the external world coordinate system. The human pose capture module is started at the same time as the HoloLens, with the user remaining stationary in a standard pose (such as a T-pose) at the origin of the augmented reality system's coordinate system for a period of time. The IMU takes its startup position as the initial position by default. The pelvis position in the SMPL model is typically used as the reference origin (root node) of the human skeleton, and the captured skeleton data is expressed relative to this origin of the human coordinate system. During system initialization, while the user remains stationary, the relative pose transformation matrix between the camera coordinate system and the human coordinate system is calibrated. Assuming the initial head-to-pelvis translation is $\mathbf{t}_0$ and the initial rotation matrix is $R_0 = R_h^{-1} R_p$, where $R_h$ and $R_p$ are the rotation matrices at the head and the pelvis, the initial pose matrix is $T_0 = \begin{bmatrix} R_0 & \mathbf{t}_0 \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$. Therefore, after calibration, the global pose matrix of the human root node in the world coordinate system is obtained as $T_{root}^{w} = T_{cam}^{w}\,T_0$, where $T_{cam}^{w}$ is the pose of the head-mounted display in the world coordinate system.
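By way of non-limiting illustration, the static calibration described above can be sketched in Python as the composition of homogeneous matrices. The sketch assumes that the head and pelvis rotation matrices are expressed in a common reference frame, that the head-to-pelvis translation is expressed in the head frame, and that the root pose is obtained by right-multiplying the head pose with the calibrated offset; the function names are hypothetical and the frame conventions are assumptions, since the original formulas are not legible in the source.

```python
def transpose3(r):
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[r[j][i] for j in range(3)] for i in range(3)]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def homogeneous(r, t):
    """Assemble a 4x4 pose matrix from a 3x3 rotation and a translation."""
    return [list(r[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def calibrate_head_to_root(r_head, r_pelvis, t_head_to_pelvis):
    """Initial pose matrix of the pelvis (root node) in the head frame,
    measured while the user holds the static calibration pose."""
    r_0 = matmul(transpose3(r_head), r_pelvis)
    return homogeneous(r_0, t_head_to_pelvis)


def root_pose_in_world(t_head_world, t_head_to_root):
    """Global root pose: head pose in the world composed with the offset."""
    return matmul(t_head_world, t_head_to_root)
```

With identity rotations, a head at (1.0, 1.5, 2.0) in the world, and a head-to-pelvis translation of (0, -0.5, 0), the root node is placed at (1.0, 1.0, 2.0). The same composition applies at run time when the static offset is replaced by the real-time relative pose between the head and the root node.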

[0101] In one exemplary instance, when the data fusion module performs the transfer of the global head pose to the root node (pelvic position), based on the real-time motion capture data of the human body and the acquired global positioning data, a low-pass filter is used to remove local motion interference of the head, and the global head pose is transferred to the root node using the constraint relationship of the skeleton model to ensure the stability and rationality of the root node position.

[0102] Because the human pose estimation algorithm does not reconstruct the rotation of the end joints of the human body, the local rotation of the augmented reality device worn on the head cannot be synchronized with the head movement estimated by the pose estimation algorithm. Therefore, in the embodiments of this application, as shown in Figure 5, the acquired global positioning data is low-pass filtered to eliminate the influence of high-frequency local rotations, such as rapid nodding and head shaking, on the calculation of the root node position, retaining only the lower-frequency overall head motion. In one embodiment, smoothing the head pose data using a low-pass filter may include: filtering the rotational part of the pose data using quaternion interpolation, which avoids the interpolation distortion caused by manipulating Euler angles directly; and applying a standard one-dimensional low-pass filter to the translational part of the pose data.

[0103] For the head position $\mathbf{p}(t)$ and the rotation quaternion $\mathbf{q}(t)$, the head positioning data is filtered to remove local high-frequency movements (such as nodding and shaking). The rotation quaternion $\mathbf{q}$ is a mathematical tool for representing rotation in three-dimensional space and is often used to describe the orientation of an object. The filtering is implemented as follows:

[0104] The translation part uses the position filtering formula $\mathbf{p}_f(t) = \alpha\,\mathbf{p}(t) + (1-\alpha)\,\mathbf{p}_f(t-1)$, where $\alpha = \Delta t / (\tau + \Delta t)$ is the smoothing factor, $\tau$ is the filter time constant, and $\Delta t$ is the sampling time interval.

[0105] The rotation part is smoothed using quaternion spherical linear interpolation (Slerp). The rotation filtering formula is $\mathbf{q}_f(t) = \mathrm{Slerp}\big(\mathbf{q}_f(t-1), \mathbf{q}(t), \beta\big)$, where $\mathrm{Slerp}$ is the quaternion spherical linear interpolation function and $\beta$ is the interpolation coefficient. Adjusting $\alpha$ and $\beta$ changes the cutoff frequency of the filter and thereby controls the degree of suppression of high-frequency motion.

[0106] The filtered global pose of the camera is $T_{cam}^{f}(t) = \begin{bmatrix} R(\mathbf{q}_f(t)) & \mathbf{p}_f(t) \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$, where $R(\mathbf{q}_f)$ is the rotation matrix corresponding to $\mathbf{q}_f$.
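By way of non-limiting illustration, the two filters described above can be sketched in Python: a first-order low-pass filter for the translation and a Slerp-based filter for the rotation. The smoothing-factor expression `dt / (tau + dt)`, the quaternion component order (x, y, z, w), and the function names are assumptions of this sketch rather than values taken from this application.

```python
import math


def smoothing_factor(dt, tau):
    """First-order low-pass coefficient from the sampling interval dt and
    the filter time constant tau (a common discretization, assumed here)."""
    return dt / (tau + dt)


def low_pass_position(previous, current, alpha):
    """Blend the new position sample into the previous filtered position."""
    return tuple(alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current))


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (x, y, z, w)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly parallel: linear interpolation avoids dividing by sin(theta) ~ 0.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


def filter_pose(prev_pos, prev_quat, pos, quat, alpha, beta):
    """One filter step: low-pass the translation, Slerp-filter the rotation."""
    return low_pass_position(prev_pos, pos, alpha), slerp(prev_quat, quat, beta)
```

Smaller values of `alpha` and `beta` give stronger smoothing (a lower cutoff frequency) at the cost of more lag, which is the trade-off to tune when suppressing nodding and head shaking.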

[0107] As shown in Figure 5, during the operation of the human pose capture module, the real-time pose of the root node relative to the head is derived from the estimated joint data. That is, the human pose capture module provides in real time the relative rotation matrix $R_{rel}(t)$ between the head and the root node and the time-varying relative translation $\mathbf{t}_{rel}(t)$ from the head to the root node, so the real-time relative pose relationship is obtained as $T_{rel}(t) = \begin{bmatrix} R_{rel}(t) & \mathbf{t}_{rel}(t) \\ \mathbf{0}^{\top} & 1 \end{bmatrix}$;

[0108] The relative pose relationship $T_{rel}(t)$ and the filtered global camera pose $T_{cam}^{f}(t)$ are used to calculate the global pose of the root node as $T_{root}(t) = T_{cam}^{f}(t)\,T_{rel}(t)$.

[0109] In one exemplary instance, as shown in Figure 5, the data fusion module uses visual positioning data as the primary source and the IMU as a loosely coupled auxiliary correction for short-term motion changes. The visual positioning data is the global pose obtained through data fusion after being transferred from the head to the root node. The IMU correction module generates its state estimate independently of the visual positioning data. It takes the raw accelerometer and gyroscope outputs of the sensor worn at the root node to calculate short-term pose changes: rotational attitude updates are obtained by integrating the angular velocity, and displacement updates are obtained by integrating the acceleration twice after removing the influence of gravity. The output is the pose change within a relatively short time window. Because the IMU correction for short-term motion changes is applied only at the final fusion level, deep integration of the raw data (such as feature points or acceleration and angular velocity) is not required.
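By way of non-limiting illustration, the short-window IMU computation described above (angular velocity integration for rotation, and double integration of gravity-compensated acceleration for displacement) can be sketched in Python. The sketch assumes a z-up world frame, an accelerometer that reports specific force in the body frame, zero initial velocity, and no sensor bias; the function names and the value of `GRAVITY` are assumptions, and drift handling is omitted.

```python
import math

GRAVITY = (0.0, 0.0, -9.81)  # world-frame gravity, z up (assumed convention)


def quat_mul(a, b):
    """Hamilton product of quaternions stored as (x, y, z, w)."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw,
            aw * bw - ax * bx - ay * by - az * bz)


def rotate(q, v):
    """Rotate a body-frame vector v into the world frame by unit quaternion q."""
    x, y, z, w = q
    rotated = quat_mul(quat_mul(q, (v[0], v[1], v[2], 0.0)), (-x, -y, -z, w))
    return rotated[:3]


def integrate_gyro(q, omega, dt):
    """Advance orientation q by body angular velocity omega (rad/s) over dt."""
    angle = math.sqrt(sum(w * w for w in omega)) * dt
    if angle < 1e-12:
        return q
    axis = tuple(w * dt / angle for w in omega)
    half = 0.5 * angle
    dq = tuple(a * math.sin(half) for a in axis) + (math.cos(half),)
    return quat_mul(q, dq)


def integrate_window(samples, dt):
    """Integrate (gyro, accel) samples over a short window.

    Returns (delta_q, delta_p): the rotation and displacement accumulated
    since the start of the window, starting from rest.
    """
    q = (0.0, 0.0, 0.0, 1.0)
    velocity = [0.0, 0.0, 0.0]
    position = [0.0, 0.0, 0.0]
    for gyro, accel in samples:
        force_world = rotate(q, accel)
        # Specific force f = a - g, so linear acceleration a = f + g.
        linear = [force_world[i] + GRAVITY[i] for i in range(3)]
        for i in range(3):
            position[i] += velocity[i] * dt + 0.5 * linear[i] * dt * dt
            velocity[i] += linear[i] * dt
        q = integrate_gyro(q, gyro, dt)
    return q, tuple(position)
```

Because integration errors grow quickly with window length, such an estimate is only meaningful over short windows, which is consistent with using the IMU as an auxiliary correction rather than as the primary positioning source.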

[0110] Using the visual positioning data as the prediction input, the state prediction equation is constructed as $\mathbf{x}_k = f(\mathbf{x}_{k-1}, \mathbf{u}_k) + \mathbf{w}_k$, where $\mathbf{x}_k$ is the current global state (position and attitude), $\mathbf{u}_k$ is the global pose data provided by vision, used as input to the prediction model, and $\mathbf{w}_k$ is the system noise. At each update, the current state is estimated from vision as $\hat{\mathbf{x}}_k^{-} = \mathbf{u}_k$.

[0111] For the IMU-corrected observation, the acceleration and angular velocity data provided by the IMU are used as input to the observation model to correct the bias of the visual prediction, as follows: $\mathbf{z}_k = h(\mathbf{x}_k) + \mathbf{v}_k$, where $\mathbf{z}_k$ is the IMU measurement, $h$ is the observation model of the system, and $\mathbf{v}_k$ is the measurement noise. The observation error $\mathbf{e}_k = \mathbf{z}_k - h(\hat{\mathbf{x}}_k^{-})$ is formed, and the Kalman gain $K_k$ is calculated.

[0112] The Kalman filter data fusion shown in Figure 5 fuses the visual predictions and the IMU measurements using the Extended Kalman Filter (EKF) framework. The global pose provided by vision is used as $\hat{\mathbf{x}}_k^{-}$ in the prediction phase, and the updated, fused state in the correction phase is $\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_k^{-} + K_k\big(\mathbf{z}_k - h(\hat{\mathbf{x}}_k^{-})\big)$.
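By way of non-limiting illustration, the predict and correct structure described above can be sketched in Python for a single position coordinate. This is a scalar, linear special case with an identity observation model, not the full EKF of this application: the visual pose is taken directly as the predicted state, and the observation is formed as the previous fused position plus the IMU displacement over the short window. The function names and the noise parameters `process_noise` and `measurement_noise` are assumptions.

```python
def predict(visual_position, prior_variance, process_noise):
    """Prediction phase: adopt the visual pose as the predicted state and
    inflate the uncertainty by the process noise."""
    return visual_position, prior_variance + process_noise


def correct(predicted, predicted_variance, observation, measurement_noise):
    """Correction phase: blend the prediction with the observation.

    With an identity observation model, the gain is the share of the total
    uncertainty that belongs to the prediction.
    """
    gain = predicted_variance / (predicted_variance + measurement_noise)
    fused = predicted + gain * (observation - predicted)
    fused_variance = (1.0 - gain) * predicted_variance
    return fused, fused_variance, gain


def fuse_step(prev_fused, prev_variance, visual_position, imu_delta,
              process_noise, measurement_noise):
    """One fusion cycle for a single coordinate of the root node position."""
    predicted, predicted_variance = predict(visual_position, prev_variance, process_noise)
    observation = prev_fused + imu_delta  # dead-reckoned position from the IMU
    return correct(predicted, predicted_variance, observation, measurement_noise)
```

For example, `fuse_step(2.0, 0.5, 2.0, 1.0, 0.5, 1.0)` yields a gain of 0.5 and a fused position of 2.5, halfway between the visual prediction (2.0) and the IMU-derived observation (3.0). A larger `measurement_noise` moves the result toward the visual prediction.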

[0113] In one exemplary instance, the process of tracking dynamic human interactions in a scene may include:

[0114] A digital 3D model is created based on the real-world scene and the interactive objects within it, and the interactive functionality is built, developed, and deployed on HoloLens 2. During runtime, the HoloLens camera captures environmental information in real time and identifies interactive objects, calibrating and tracking their positions, postures, and states. When a user interacts with objects in the scene, the object's state is dynamically updated, and interaction data such as object movement, rotation, and state changes (e.g., opening and closing a door, pushing and pulling a chair) is transmitted in real time to the Unity engine on the PC. When the user manipulates an object (e.g., moving, rotating, pushing, or pulling it), the object's physical properties (e.g., position, rotation angle, forces) are updated in the virtual scene based on the interaction data. Unity's physics engine synchronizes the state changes of the virtual object, ensuring that physical behaviors (e.g., gravity, friction) are reasonably simulated. In one embodiment, for objects with complex interactive operations (e.g., sliding doors), predefined physical constraints can ensure the naturalness and stability of the interaction process. During human-scene dynamic interaction tracking, the action data for each interaction is recorded, including object position, movement trajectory, and applied forces. At the same time, detailed interaction records are generated through scripts and algorithms to support later data analysis and playback. This data can also be used for user behavior analysis or to optimize subsequent virtual environment design.
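By way of non-limiting illustration, the recording of interaction data for later analysis and playback can be sketched in Python. The record fields, the class names `InteractionEvent` and `InteractionLog`, and the JSON Lines output are assumptions of this sketch; in the described embodiment the recording would be done by custom scripts in the Unity project.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InteractionEvent:
    """One recorded sample of an interactive object's state."""
    timestamp: float
    object_id: str
    position: tuple
    rotation_deg: tuple
    state: str


@dataclass
class InteractionLog:
    """Append-only record of interaction events."""
    events: list = field(default_factory=list)

    def record(self, event):
        self.events.append(event)

    def trajectory(self, object_id):
        """Positions of one object in recording order, e.g. for playback."""
        return [e.position for e in self.events if e.object_id == object_id]

    def to_json_lines(self):
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)
```

A caller would create one `InteractionLog`, call `record` whenever an object's state changes, and write `to_json_lines()` to a file at the end of a session. One line per event keeps the log easy to stream and to filter by `object_id`.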

[0115] Although the embodiments disclosed in this application are as described above, the content described is merely for the purpose of understanding this application and is not intended to limit this application. Any person skilled in the art to which this application pertains may make any modifications and changes in the form and details of the implementation without departing from the spirit and scope disclosed in this application; however, the scope of patent protection of this application shall still be determined by the scope defined in the appended claims.

Claims

1. A human-machine collaborative system, characterized in that it comprises an augmented reality positioning module, a human posture capture module, and a data fusion module, wherein:
the augmented reality positioning module includes a positioning data acquisition submodule, a 3D registration and alignment submodule, and a coordinate system alignment and correction submodule; the positioning data acquisition submodule collects environmental data from the visual sensors of the head-mounted augmented reality device (HMD) and generates the real-time pose of the HMD in physical space using Simultaneous Localization and Mapping (SLAM); the 3D registration and alignment submodule spatially registers multiple markers based on the recognition results using a Perspective-n-Point (PnP) algorithm to establish alignment between the device coordinate system and the world coordinate system; and the coordinate system alignment and correction submodule dynamically updates the mapping relationship between the device coordinate system and the world coordinate system by combining multi-marker correction and spatial coordinate adjustment techniques;
the human posture capture module includes an inertial sensor acquisition submodule, a posture estimation submodule, and a physical constraint optimization submodule; the inertial sensor acquisition submodule is used to acquire acceleration and angular velocity data through inertial sensors (IMUs) worn on key parts of the human body; the posture estimation submodule is used to calculate the human posture data based on the parametric 3D human body model SMPL and deep learning algorithms; and the physical constraint optimization submodule is used to introduce ground contact and sliding detection constraints to constrain the posture data; and
the data fusion module includes a mapping submodule and a dynamic correction submodule; the mapping submodule establishes an initial mapping relationship between the head-mounted augmented reality device and the root node of the human skeleton through static calibration; and the dynamic correction submodule, under the extended Kalman filter (EKF) framework, uses the global pose output by the augmented reality positioning module as the prediction input and the short-time-window pose change calculated from IMU data as the observation input to fuse and update the global pose of the skeleton root node, and outputs the real-time corrected global pose of the skeleton root node to achieve absolute human position tracking.

2. The human-machine collaborative system according to claim 1, further comprising a human-scene dynamic interaction tracking module, used to realize the interaction between the virtual human body and the real scene using augmented reality technology and to dynamically record scene changes, the module including an interaction behavior modeling submodule, a data recording submodule, and a virtual-real matching submodule, wherein: the interaction behavior modeling submodule is used to add physical constraints and logical rules that simulate physical behavior in real interactions, in order to capture the interaction behavior between the human body and the scene; the data recording submodule is used to record the state changes of objects during scene interactions in real time through custom scripts; and the virtual-real matching submodule is used to synchronize and update interaction effects with the physical model in the virtual scene.

3. The human-machine collaborative system according to claim 1, wherein the multiple markers include QR codes or other markers.

4. The human-machine collaborative system according to claim 1 or 2, wherein the human posture capture module is used to transmit the motion data collected by the IMU to the real-time 3D engine and development platform through a high-level programming language to dynamically present the human body posture, so that the posture of the virtual human body in the augmented reality device is synchronized with the real human body in real time.

5. The human-machine collaborative system according to claim 1, wherein the constraints include one or any combination of the following physical constraints: joint angle limitation, motion continuity, and dynamic consistency.

6. The human-machine collaborative system according to claim 5, wherein the human posture capture module can also be used to set ground contact and sliding detection constraints to optimize the stability of the human posture in the environment.