Human-machine collaborative embodied AI agent interaction and intent understanding method
By constructing a gravity-compensated torso reference frame and dynamic velocity threshold decoupling verification, combined with probability cone search space and multimodal feedback, the problem of misjudgment in limb intention recognition under non-steady-state scenarios is solved, achieving efficient intention understanding and target locking, and improving the accuracy and safety of human-machine collaborative operations.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN YUNCHUANG YOUYI TECH CO LTD
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-12
Figure CN122197940A_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This application relates to the technical field of artificial intelligence, and in particular to a method for human-machine collaborative embodied AI agent interaction and intent understanding.
Background Technology
[0002] With the rapid development of embodied artificial intelligence and special-purpose robotics, the application boundaries of human-machine collaborative operations have expanded significantly from structured factory workshops to unstructured and complex environments such as earthquake disaster relief, geological exploration, and battlefield reconnaissance. In these high-risk or complex scenarios, human commanders often need to work closely with embodied intelligent agents (such as quadrupedal robot dogs and bipedal robots), issuing spatial semantic commands such as target searching, path navigation, or object grasping to the agents through intuitive, non-contact interaction methods such as gestures and body pointing. This intention expression based on body movements has a natural intuitiveness and rapid response capability, directly determining the execution efficiency and success rate of human-machine collaboration in emergency tasks.
[0003] Currently, in the field of human-computer interaction and motion control, there are several technical solutions for human motion capture and skill reproduction. For example, Chinese Patent Publication No. CN114102600B discloses a multi-space fusion human-computer skill transfer and parameter compensation method and system. This technical solution mainly involves a skill transfer framework based on dynamic motion primitives. By constructing an upper arm stiffness model and identifying model parameters using perturbation methods, the stiffness multi-space matrix generated during the demonstration is decomposed into eigenvalue matrices and eigenvector matrices. Quaternion transformation and spatial decoupling techniques are then used to import the position and velocity parameters in Euclidean space and the attitude stiffness parameters in Riemannian space into the dynamic system for encoding. This solution establishes a model of the relationship between actual values and discrepancies, and uses a linear weighted regression update module to achieve skill parameter transfer and compensation, aiming to enable robots to reproduce the mechanical characteristics and motion trajectories of human demonstrators in specific tasks with high fidelity.
[0004] However, the existing technologies described above have significant limitations in non-steady-state, highly dynamic human-machine collaboration scenarios. They typically presuppose that the demonstrator is standing or sitting in a relatively stable posture, or that the torso base is stable, and the system directly interprets the trajectory of the limbs relative to the base or in absolute space as a valid expression of intent. In real-world rubble search and rescue or field exploration, however, commanders are often in a non-steady state: running, jumping, leaning sideways to look around, or struggling to keep their balance on rubble. In these situations the torso undergoes violent displacement, irregular tilting, and high-frequency vibration, so the limb data collected by the sensors is a coupled superposition of "large-amplitude nonlinear motion of the entire torso" and "fine pointing movements of individual limbs." On the one hand, when a commander moves rapidly or sways the torso to maintain balance, a traditional reference frame fixed to the torso tilts and rotates with it, which introduces a large geometric projection deviation when the limb pointing vector expressed in that frame is transformed into the absolute coordinate system of the environmental map. On the other hand, rapid body movement inevitably produces passive limb swaying. Existing technology has difficulty distinguishing this balance-maintaining passive swaying from a true pointing intention and is very likely to misjudge high-frequency follow-up noise as an interactive command, so the intelligent agent cannot accurately resolve the true pointing target when the commander's own movement state is extremely unstable.
Summary of the Invention
[0005] To address the challenge of accurate limb intention recognition in unsteady scenarios by decoupling trunk motion noise based on gravity compensation, this application provides a human-machine collaborative embodied AI agent interaction and intention understanding method.
[0006] The human-machine collaborative embodied AI agent interaction and intent understanding method provided in this application adopts the following technical solution. The method includes:
collecting inertial data and position data of the commander, and obtaining an environmental semantic map containing object position information;
using the gravity vector in the inertial data to construct a gravity-compensated torso reference frame with the torso as the origin and the vertical axis forcibly aligned with the gravity vector;
calculating the relative motion velocity of the limb end within the gravity-compensated torso reference frame and, when the relative motion velocity is lower than a dynamic velocity threshold generated from the global movement rate of the commander, extracting a potential intention data frame containing continuous limb pointing data;
calculating the dispersion of the limb pointing data within the potential intention data frame to obtain a dynamic jitter residual, determining an opening angle based on the dynamic jitter residual, and constructing a probability cone search space with the average limb pointing direction as the axis;
projecting the probability cone search space onto the environmental semantic map, and identifying objects located within the range of the probability cone search space as candidate objects within the space; and
selecting the interaction target from the candidate objects within the space based on their saliency weights.
[0007] Optionally, constructing the gravity-compensated torso reference frame includes: The gravitational acceleration component in the inertial sensing data is separated in real time to establish the gravity vector, and the Z-axis of the gravity-compensated torso reference frame is configured to be parallel to the reverse extension line of the gravity vector. The positive torso vector is extracted from the inertial sensing data and projected onto a geometric plane perpendicular to the Z-axis to construct the X-axis of the gravity-compensated torso reference system. An attitude correction matrix is generated based on the Z-axis and the X-axis. The attitude correction matrix is then used to perform an inverse rotation transformation on the inertial sensing data, thereby locking the XY plane of the gravity-compensated torso reference frame to be parallel to the physical ground plane.
[0008] Optionally, the logic for generating the dynamic speed threshold is configured as follows: Establish a positive correlation between the dynamic speed threshold and the global movement rate of the command personnel; In response to the increase in the global movement rate, the dynamic speed threshold is increased and made greater than the speed value of the passive follow-up component of the limbs in order to maintain body balance. In response to the decrease in global movement rate, the dynamic speed threshold is lowered so that it only covers the physiological micro-motion speed values of the human body in a static state.
[0009] Optionally, the logic for intercepting latent intent data frames containing continuous limb pointing data is configured to perform a dual motion decoupling check on the limb extremities: The relative angular velocity of the limb end is calculated in real time within the gravity-compensated torso reference frame, and an angular rest threshold is configured to characterize the active hovering stability of the limb. Construct intent-locking logic, which is configured to combine the dynamic velocity threshold and the angular stationary threshold into parallel constraints. The commander is determined to be in a physiological relative freeze state only when the relative motion speed is lower than the dynamic speed threshold and the relative angular velocity is lower than the angular rest threshold. In response to the confirmation of the physiological relative freeze state, the current limb movement is identified as an active intention expression stripped of the passive swaying component generated by maintaining trunk balance, and the interception of the potential intention data frame is triggered.
[0010] Optionally, the logic for constructing the probability cone search space is configured to perform residual-based spatial compensation of uncertainty: The continuous limb pointing data contained in the potential intent data frame is parsed to obtain multi-frame pointing vectors, and the dispersion of the pointing vectors is statistically analyzed to generate the dynamic jitter residual. An adaptive opening-angle adjustment logic is constructed to establish a positive correlation mapping between the opening angle of the probability cone search space and the dynamic jitter residual. By dynamically expanding the opening angle as the dynamic jitter residual increases, spatial redundancy is formed to cover the pointing deviation under the physiological relative freeze state. The average of the pointing vectors is calculated and used as the central axis, and the probability cone search space is generated by combining this axis with the opening angle determined by the adaptive opening-angle adjustment logic.
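The cone construction in this paragraph can be illustrated with a minimal Python sketch. The text only states that the opening angle grows with the dynamic jitter residual; the RMS angular deviation used as the residual, the linear mapping, and the `gain` and `base` parameters are all assumptions made for illustration, not values from this application.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)

def angle_between(a, b):
    """Angle in radians between two unit vectors (dot product clamped to [-1, 1])."""
    d = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, d)))

def mean_direction(pointing_vectors):
    """Average pointing direction, used as the central axis of the cone."""
    units = [normalize(v) for v in pointing_vectors]
    return normalize(tuple(sum(c) for c in zip(*units)))

def jitter_residual(pointing_vectors, axis):
    """Assumed dispersion measure: RMS angular deviation (rad) from the axis."""
    angles = [angle_between(normalize(v), axis) for v in pointing_vectors]
    return math.sqrt(sum(a * a for a in angles) / len(angles))

def cone_half_angle(residual, gain=2.0, base=math.radians(2.0)):
    """Hypothetical positive-correlation mapping from residual to half opening angle."""
    return base + gain * residual

def in_cone(point, apex, axis, half_angle):
    """True if point lies inside the cone with the given apex, unit axis and half angle."""
    d = tuple(p - a for p, a in zip(point, apex))
    if all(c == 0 for c in d):
        return False
    return angle_between(normalize(d), axis) <= half_angle
```

With these definitions, a shakier set of pointing vectors yields a larger residual and therefore a wider cone, which is the redundancy the paragraph describes.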
[0011] Optionally, the logic for selecting an interaction target from candidate objects within the space is configured to perform a saliency competition selection based on multidimensional spatial weights: The spatial envelope retrieval of the environmental semantic map is performed using the probability cone search space to extract entities located inside the probability cone search space as candidate objects within the space; A two-factor decay evaluation logic is constructed to calculate the angular offset of the geometric centroid of each candidate object in the space relative to the central axis of the probability cone search space, and the Euclidean distance relative to the commander. Generate saliency weights and configure the saliency weights to be negatively correlated with the angular offset and the Euclidean distance, so as to give higher priority to objects that are closer to the pointing axis in space and are physically closer. Perform confidence-gated verification to lock the object with the highest salience weight as the target to be confirmed. Only when the salience weight exceeds the preset confidence threshold is the target to be confirmed as the interaction target. Otherwise, suppress the generation of interaction commands to maintain the system standby state.
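The two-factor decay evaluation and the confidence gate can be sketched as follows. The application only requires the saliency weight to fall as the angular offset and the Euclidean distance grow; the exponential form, the two scale parameters, and the threshold value of 0.3 are hypothetical choices for illustration.

```python
import math

def saliency(angle_offset, distance, angle_scale, distance_scale):
    """Assumed decay model: the weight shrinks with both angular offset and distance."""
    return math.exp(-angle_offset / angle_scale) * math.exp(-distance / distance_scale)

def select_target(candidates,
                  angle_scale=math.radians(10.0),   # hypothetical
                  distance_scale=10.0,              # hypothetical, metres
                  confidence_threshold=0.3):        # hypothetical
    """candidates: list of (name, angle_offset_rad, distance_m) inside the cone.

    Returns the name of the most salient candidate, or None when the best
    weight does not exceed the confidence threshold (system stays on standby).
    """
    if not candidates:
        return None
    scored = [(saliency(a, d, angle_scale, distance_scale), name)
              for name, a, d in candidates]
    best_weight, best_name = max(scored, key=lambda s: s[0])
    return best_name if best_weight > confidence_threshold else None
```

For example, an object 2 degrees off the axis at 3 m outranks one 8 degrees off the axis at the same distance, while a single distant, off-axis object is rejected by the gate.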
[0012] Optionally, the method also includes multimodal time-domain backtracking and competition check steps: Potential intent data frames that have been determined to be in the physiological relative freeze state are continuously serialized and stored to form a sliding time-domain cache queue; The system analyzes the voice command stream of the commander in real time. When a semantic keyword containing spatial referential attributes is identified, the system uses the semantic keyword as a time anchor point and performs a reverse index on the sliding time-domain cache queue based on a preset backtracking time threshold. If multiple non-contiguous potential intent data frame sequences are retrieved within the backtracking time threshold, an intent persistence competition check is performed, the time span weight of each sequence is calculated, and the sequence with the highest time span weight is locked as the main intent frame. The main intent frame is used as the sole input source for constructing the probability cone search space.
[0013] Optionally, the method also includes a geometric ambiguity verification step based on semantic constraints: The semantic keywords are parsed into logical direction attributes containing abstract spatial referential information, and the geometric pointing features of the main intent frame are solved within the gravity-compensated torso reference frame. A semantic legal domain is defined in three-dimensional space based on the logical direction attribute, and it is verified whether the geometric pointing feature falls within the coverage of the semantic legal domain; When the geometric pointing feature falls into a mutually exclusive quadrant that contradicts the semantic legal domain, it is determined that there is a spatiotemporal cognitive conflict in the current interaction and the interaction command is intercepted.
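A minimal sketch of the backtracking and competition check is given below. It assumes each cached sequence is reduced to a `(start, end)` pair of timestamps in seconds and that the time span weight is simply the duration of the sequence inside the backtracking window; the application does not define the weight formula, so this is an illustrative assumption.

```python
def main_intent_sequence(sequences, keyword_time, backtrack_window):
    """Pick the cached sequence that persists longest inside the backtracking window.

    sequences: list of (start, end) timestamps of frozen-state frame sequences.
    keyword_time: timestamp of the spoken spatial keyword (the time anchor).
    backtrack_window: preset backtracking time threshold in seconds.
    Returns the winning (start, end) pair, or None if nothing overlaps the window.
    """
    window_start = keyword_time - backtrack_window
    best, best_span = None, 0.0
    for start, end in sequences:
        # Duration of the overlap between the sequence and the window.
        span = min(end, keyword_time) - max(start, window_start)
        if span > best_span:
            best, best_span = (start, end), span
    return best
```

With a keyword at t = 10.0 s and a 2.0 s window, a sequence lasting from 9.0 s to 9.9 s wins over a shorter one from 8.0 s to 8.4 s, and a sequence that ended at 7.0 s is ignored.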
[0014] Optionally, an adaptive correction step based on negative feedback may also be included: After locking the interactive target, open the feedback listening window to monitor the lateral relative motion velocity of the limb end within the gravity-compensated torso reference frame in real time. When the lateral relative motion speed exceeds a preset rejection speed threshold, it is determined that a negative feedback signal has been received and the currently stored potential intent data frame is cleared. The preset gain compensation coefficient is retrieved to correct the parameters of the calculation logic used to determine the angle when constructing the probability cone search space. The expansion ratio of the angle relative to the dynamic jitter residual is increased proportionally, thereby generating the probability cone search space with a larger envelope range in the next recognition cycle.
[0015] Optionally, the logic for determining if a negative feedback signal has been received is configured to perform multimodal asynchronous verification: Within the feedback monitoring window, the voice command stream of the commander is parsed in parallel to identify negative semantic keywords that represent the intention to refuse and generate a semantic-level negative signal. The lateral relative motion speed exceeding the rejection speed threshold is defined as a motion-level rejection signal, and the semantic-level rejection signal and the motion-level rejection signal are set as parallel triggering conditions that serve as backups for each other; In response to the satisfaction of any of the parallel triggering conditions, it is confirmed that the negative feedback signal has been received, and the operation of clearing the potential intent data frame and the parameter correction of the computation logic are immediately triggered.
[0016] In summary, this application includes the following beneficial technical effects: 1. This method constructs a gravity-compensated torso reference frame, forcibly aligning the vertical axis of the reference frame with the direction of gravity. This decouples the motion noise introduced by the severe tilting and rotation of the torso during unsteady scenarios such as running and jumping. It enables the representation of limb pointing data to be established on a stable spatial reference parallel to the physical ground plane, fundamentally solving the problem of huge geometric projection deviation caused by the torso swaying in traditional methods. This reduces the pointing projection deviation by about 70%, providing a reliable spatial basis for the accurate recognition of core intentions.
[0017] 2. This method innovatively correlates dynamic speed threshold with global movement rate and introduces a dual decoupled verification logic of relative motion speed and relative angular velocity of limb extremities. It can effectively distinguish between passive swaying caused by the human body to maintain balance and genuine active pointing intention. Thus, it can accurately determine the physiological relative freeze state and capture effective intention data frames in a high-dynamic environment, reducing the intention misjudgment rate in non-steady-state scenarios by about 40%, and significantly improving the accuracy and reliability of the system's intention recognition in complex motion states.
[0018] 3. This method fuses semantic keywords from voice commands with limb pointing actions for temporal backtracking and competition verification, and performs ambiguity verification of the geometric pointing against the semantic logical direction, achieving cross-modal collaborative intention understanding and conflict interception. It also introduces a multimodal asynchronous feedback mechanism, based on limb negation actions or negation speech, that drives the system to adaptively correct the probability cone search parameters, forming a closed-loop optimization of human-machine interaction. This enables the system to respond quickly and correct itself when the user makes adjustments or the initial recognition is inaccurate, increasing the target locking success rate of subsequent recognition cycles by about 25%. This greatly enhances the system's adaptability in complex real-world scenarios and the safety of the agent's command execution.
Attached Figure Description
[0019] Figure 1 is a logical flowchart of the interaction and intent understanding method; Figure 2 is a schematic diagram of the Z-axis of the gravity-compensated torso reference frame; Figure 3 is a schematic diagram of the X-axis of the gravity-compensated torso reference frame; Figure 4 is a schematic diagram of the Y-axis of the gravity-compensated torso reference frame.
Detailed Implementation
[0020] This application is described in further detail below with reference to Figures 1-4.
[0021] This application discloses a method for human-machine collaborative embodied AI agent interaction and intent understanding. As shown in Figure 1, the method aims to achieve accurate limb intent recognition in non-steady-state scenarios by decoupling torso motion noise based on gravity compensation. The steps are described in detail below.
S1 Construct a gravity-compensated torso reference frame
S11 Collect basic data
The system collects inertial data and position data from the commander, together with an environmental semantic map containing object location information. Inertial data is acquired through inertial measurement units (IMUs) worn on the commander's torso and limb ends, fixed at the sternum and wrist respectively. The collected parameters include acceleration and angular velocity, sampled at 100 Hz. This frequency is chosen based on the motion characteristics of human limbs: the highest frequency of rapid limb movement is approximately 20 Hz, so the Nyquist sampling theorem requires a sampling frequency above 40 Hz. Sampling at 100 Hz therefore captures limb movement details without aliasing while avoiding the data redundancy caused by excessively high sampling rates.
[0022] Regarding location data, the system employs an RGB-D depth camera in conjunction with a visual skeleton tracking algorithm. Specifically, the depth camera acquires RGB image streams and depth map streams of the command personnel in real time. A pre-trained human pose estimation model based on a convolutional neural network (CNN) (such as a keypoint detection algorithm based on a ResNet backbone network) is used to extract features from the RGB images, outputting a heatmap containing the location probabilities of human keypoints (including the left shoulder, right shoulder, left hip, and right hip). The system extracts the pixel coordinates of these four torso keypoints in the image plane based on the peak response coordinates of the heatmap, and combines this with the corresponding depth values indexed in the depth map. Using the camera intrinsic parameter matrix, the pixel coordinates are back-projected into three-dimensional space to obtain the three-dimensional spatial coordinates of the four keypoints. Subsequently, the system calculates the arithmetic mean of the three-dimensional coordinates of the four keypoints (left shoulder, right shoulder, left hip, and right hip) and defines this as the three-dimensional coordinate of the torso center.
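The back-projection and averaging described above can be sketched in a few lines of Python, assuming a standard pinhole camera model. The intrinsic parameter names `fx`, `fy`, `cx`, `cy` are the conventional ones and are not named in this application.

```python
from statistics import fmean

def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (metres) into 3D camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def torso_center(keypoints_3d):
    """Arithmetic mean of the 3D keypoints (left/right shoulder, left/right hip)."""
    xs, ys, zs = zip(*keypoints_3d)
    return (fmean(xs), fmean(ys), fmean(zs))
```

A pixel at the principal point maps onto the optical axis, and four keypoints placed symmetrically around a point average to that point.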
[0023] A confidence filtering mechanism runs in parallel. The confidence is calculated as follows: the peak probability value (ranging from 0 to 1) of each of the four torso keypoints is extracted from the heatmap output by the pose estimation model and taken as the local confidence of that keypoint. The average of the four local confidences is the overall confidence of the torso center position data in the current data frame. The system sets a confidence threshold of 0.8, and data frames with an overall confidence below 0.8 are identified as abnormal and removed. This threshold was determined through extensive experimental statistics: below it, the abnormality rate of the data exceeds 30%, so filtering significantly improves the reliability of the position data.
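As a minimal sketch of this filter, the following Python code averages the four peak probabilities and keeps a frame only when the mean reaches the 0.8 threshold stated above. The dictionary layout and key names are assumptions for illustration.

```python
from statistics import fmean

CONFIDENCE_THRESHOLD = 0.8  # value stated in the description

def overall_confidence(heatmap_peaks):
    """Mean of the peak heatmap probabilities of the four torso keypoints."""
    return fmean(heatmap_peaks.values())

def keep_frame(heatmap_peaks, threshold=CONFIDENCE_THRESHOLD):
    """True if the frame is retained; frames below the threshold are discarded."""
    return overall_confidence(heatmap_peaks) >= threshold
```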
[0024] The environmental semantic map is jointly generated by the visual sensor and LiDAR carried by the embodied AI agent. The visual sensor collects the appearance features of objects in the environment, and the LiDAR obtains the three-dimensional distance information of the objects. The two data are fused in real time to form an environmental semantic map containing the three-dimensional coordinates, category and contour information of the objects, providing basic data support for the subsequent spatial association between the reference frame and environmental objects.
[0025] S12 Establish the gravity vector
Based on the raw acceleration data acquired by the inertial measurement unit (IMU) in S11, the system separates the gravitational acceleration component to establish the gravity vector. The raw acceleration data contains both gravitational acceleration and the dynamic acceleration generated by limb movement; because the two components are superimposed, it is difficult to extract a pure gravity vector directly. The system therefore processes the raw data with a complementary filter, which combines the static stability of the accelerometer with the dynamic response of the gyroscope and fuses the two sensors through a filtering coefficient. The filtering coefficient is set to 0.01, a value determined through repeated experiments. For common dynamic scenarios such as torso swings and limb movements, a coefficient of 0.01 effectively suppresses interference from dynamic acceleration while preserving the stable characteristics of gravitational acceleration. In the specific processing, the raw acceleration data output from the IMU serves as the input to the filtering algorithm, and the separated gravitational acceleration component yields a stable gravity vector. Experimental verification shows that this processing method keeps the extraction error of gravitational acceleration within 0.05 m/s². Compared to traditional direct extraction methods, the error is reduced by about 60%, which provides a reliable basis for accurately calibrating the vertical axis of the reference frame and addresses the problem of inaccurate gravity vector extraction in non-steady-state scenarios.
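One common form of complementary filter is sketched below. The application does not give the filter equations, so several points are assumptions: the coefficient 0.01 is taken to be the weight of the accelerometer term, the gyroscope propagates the previous estimate between samples, and the accelerometer reading is assumed to be sign-adjusted so that it equals the gravity vector when the sensor is at rest.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def update_gravity(g_prev, accel, gyro, dt=0.01, alpha=0.01):
    """One complementary-filter step in the sensor frame.

    g_prev: previous gravity estimate (m/s^2)
    accel:  accelerometer sample (m/s^2), assumed equal to gravity at rest
    gyro:   angular velocity sample (rad/s)
    dt:     sample period (0.01 s at 100 Hz)
    alpha:  assumed accelerometer weight (0.01 in the description)
    """
    # Gyro prediction: a world-fixed vector seen from a rotating sensor
    # evolves as dg/dt = -omega x g.
    w_x_g = cross(gyro, g_prev)
    g_pred = tuple(g - wg * dt for g, wg in zip(g_prev, w_x_g))
    # Blend the prediction with the noisy but drift-free accelerometer sample.
    return tuple((1.0 - alpha) * p + alpha * a for p, a in zip(g_pred, accel))
```

Under these assumptions a 5 m/s² transient on one axis shifts the estimate by only 0.05 m/s² in a single step, which illustrates how a small coefficient suppresses dynamic acceleration.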
[0026] S13 Define the coordinate axes of the reference frame
As shown in Figures 2 to 4, with the torso center obtained in S11 as the origin, the system configures the Z-axis of the gravity-compensated torso reference frame. The Z-axis remains parallel to the reverse extension line of the gravity vector established in S12, meaning that it always points vertically upward. This configuration keeps the Z-axis unaffected by changes in torso posture; even when the commander is bending, tilting, or leaning sideways to look around, the Z-axis still points vertically. The system extracts the positive (forward-facing) torso vector from the inertial data collected in S11. This vector is obtained by tracking torso posture changes with the inertial measurement unit and reflects the orientation of the commander's torso. The system projects the positive torso vector onto the geometric plane perpendicular to the Z-axis, which is parallel to the physical ground plane, and the projected vector defines the X-axis of the gravity-compensated torso reference frame. Based on the defined Z-axis and X-axis, the system generates the Y-axis using the right-hand rule: with the right thumb pointing along the positive Z-axis and the index finger along the positive X-axis, the direction in which the middle finger naturally bends is the positive Y-axis, so that the three axes are mutually orthogonal. This definition overcomes the limitation that a traditional torso reference frame tilts with body posture and keeps the XY plane of the reference frame parallel to the physical ground plane, which significantly reduces the interference of torso tilt with subsequent limb motion analysis. Compared to traditional reference frames, the pointing projection deviation is reduced by approximately 70%, providing a spatial benchmark for accurately extracting limb intentions.
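The axis construction in S13 can be sketched as follows, assuming the gravity vector and the torso forward vector are expressed in the same sensor frame. The function and variable names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(dot(v, v))
    if n < 1e-9:
        raise ValueError("degenerate vector")
    return tuple(c / n for c in v)

def build_axes(gravity, torso_forward):
    """Return unit vectors (ux, uy, uz) of the gravity-compensated frame."""
    # Z-axis: opposite to gravity, so it always points vertically upward.
    uz = normalize(tuple(-g for g in gravity))
    # X-axis: torso forward vector projected onto the plane perpendicular to Z.
    along = dot(torso_forward, uz)
    ux = normalize(tuple(f - along * z for f, z in zip(torso_forward, uz)))
    # Y-axis: right-hand rule, y = z x x.
    uy = cross(uz, ux)
    return ux, uy, uz
```

If the torso forward vector is parallel to gravity, the projection vanishes and `normalize` raises an error; how the application handles that case is not stated.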
[0027] S14 Generate the attitude correction matrix and correct the data
Based on the Z-axis and X-axis defined in S13, the system generates an attitude correction matrix. First, the unit vectors of the Z-axis and X-axis are obtained, denoted uz and ux respectively. Since the three axes are orthogonal, the Y-axis unit vector uy is obtained as the cross product of uz and ux. The attitude correction matrix is the orthogonal matrix whose column vectors are ux, uy, and uz. It follows the standard rotation transformation between spatial coordinate systems and represents the attitude relationship between the original inertial data coordinate system and the gravity-compensated torso reference frame. The system takes the original inertial sensing data acquired in S11 as input and applies the inverse rotation transformation defined by the attitude correction matrix; that is, through matrix multiplication, the original data is transformed from the local coordinate system of the inertial measurement unit into the gravity-compensated torso reference frame. This transformation counteracts the attitude interference caused by torso tilt and rotation and forcibly locks the XY plane of the gravity-compensated torso reference frame parallel to the physical ground plane. The data correction follows directly from the coordinate axis definition above, further stabilizing the reference frame and decoupling torso motion from limb motion, so that subsequent limb motion analysis is conducted in a stable reference coordinate system.
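A minimal sketch of this correction is shown below. It assumes ux, uy, and uz are expressed in the IMU's local frame, so that the inverse of the orthogonal matrix is its transpose and the corrected vector is obtained by projecting the raw vector onto each axis.

```python
def correction_matrix(ux, uy, uz):
    """3x3 matrix (list of rows) whose columns are ux, uy, uz."""
    return [[ux[i], uy[i], uz[i]] for i in range(3)]

def inverse_rotate(R, v):
    """Apply the inverse rotation R^T to v (R is orthogonal, so R^-1 = R^T)."""
    return tuple(sum(R[i][j] * v[i] for i in range(3)) for j in range(3))
```

For instance, if the reference frame's X-axis coincides with the sensor's Y-axis, a raw vector along the sensor's Y-axis becomes a vector along X after correction, and a vector along the sensor's uz direction stays on the Z-axis.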
[0028] S2 Generate the dynamic speed threshold
S21 Establish the mapping relationship
The system calculates the global movement rate of the commander from the torso center position data acquired in S1. The torso center position data originates from the three-dimensional coordinates extracted by visual skeleton tracking in S11. The system takes two consecutive frames of coordinate data and calculates the magnitude of their difference to obtain the change in torso center position. Because the inertial measurement unit in S11 samples at 100 Hz, the system sets the time interval between the two frames to 0.01 s to ensure data synchronization and dimensional consistency. The global movement rate is obtained by dividing the change in position by the time interval. This calculation follows the basic kinematic definition of velocity and reflects the overall movement state of the commander.
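The rate calculation amounts to a finite difference over two frames, as in this short Python sketch (the function name is illustrative; the 0.01 s interval is the value given above).

```python
import math

def global_movement_rate(p_prev, p_curr, dt=0.01):
    """Magnitude of torso-centre displacement between two frames divided by dt (m/s)."""
    return math.dist(p_prev, p_curr) / dt
```

A displacement of 15 mm between consecutive frames corresponds to a global movement rate of 1.5 m/s.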
[0029] The system establishes a positive correlation between a dynamic speed threshold and the global movement rate. This mapping is designed to align with the laws of human movement: the greater the global movement rate of the commander, the greater the passive follow-up speed of the limbs to maintain balance. If the threshold remains fixed, such passive swaying is easily misinterpreted as active intentional actions. The positive correlation mapping allows the threshold to adjust in real time with the global movement rate, adapting to complex motion states in non-steady-state scenarios. Compared to traditional fixed thresholds, this significantly improves the scenario adaptability of intention recognition.
[0030] S22 Dynamically adjust the threshold
When the global movement rate exceeds 1.5 m/s, the system raises the dynamic speed threshold to its upper value. This value was determined by statistically analyzing the movement data of commanders of different ages and body types in non-steady-state scenarios such as rubble search and rescue and field exploration. In these scenarios, the passive follow-up speed of limbs maintaining balance generally does not exceed 0.3 m/s. To ensure that the threshold filters passive swaying without missing active intentional movements, the system sets the dynamic speed threshold to 0.35 m/s. When the global movement rate is below 0.3 m/s, the system lowers the dynamic speed threshold. This range corresponds to a static or slow-moving state, in which the limbs exhibit no obvious balance-maintaining movements, only physiological micro-movements. Experimental measurements show that the speed of these micro-movements in different populations typically does not exceed 0.05 m/s; the system therefore sets the dynamic speed threshold to 0.05 m/s, covering only the physiological micro-movement range. When the global movement rate is in the middle range of 0.3 m/s to 1.5 m/s, the system linearly interpolates between the two values: Dynamic speed threshold = 0.05 + (Global movement rate - 0.3) × 0.25, with all quantities in m/s. The formula gives 0.05 m/s at 0.3 m/s and 0.35 m/s at 1.5 m/s, so the threshold is continuous across the three ranges. This full-range adjustment matches the limb movement characteristics of the human body from standing still and slow walking through to running. It addresses the tendency of traditional fixed thresholds to misjudge in high-dynamic scenarios and to lack sensitivity in low-dynamic scenarios, making intent recognition in non-steady-state scenarios more reliable and adaptable.
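The three ranges described in S22 can be written as one piecewise function. The numeric values below are those given in the paragraph; only the function name is introduced for illustration.

```python
def dynamic_speed_threshold(global_rate):
    """Dynamic speed threshold (m/s) as a function of global movement rate (m/s)."""
    if global_rate <= 0.3:
        return 0.05          # static or slow: physiological micro-movements only
    if global_rate >= 1.5:
        return 0.35          # fast: above the ~0.3 m/s passive follow-up speed
    # Linear interpolation between (0.3, 0.05) and (1.5, 0.35); slope 0.25.
    return 0.05 + (global_rate - 0.3) * 0.25
```

At a global movement rate of 0.9 m/s, for example, the interpolated threshold is 0.05 + 0.6 × 0.25 = 0.2 m/s.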
[0031] S3 captures potential intent data frames. S31 calculates the relative angular velocity and sets the angular rest threshold. Taking the gravity-compensated torso reference frame constructed by S1 as the benchmark, the system analyzes the data from the limb-extremity inertial measurement unit (IMU) after correction by the S14 attitude correction matrix. This IMU is fixed at the wrist, and its raw angular velocity data has undergone an inverse rotation transformation to eliminate interference from torso tilt and rotation, so that the data represents only the limb's own rotational state. Following the principles of rigid body kinematics, the system calculates the relative angular velocity of the limb extremity with respect to the torso origin in real time through vector subtraction, isolating the influence of torso motion on the limb-extremity angular velocity so that the result more closely matches the limb's actual movement intention.
[0032] Based on the motion characteristics of active limb hovering, the system sets the angular rest threshold to 15 deg/s. This value was determined through statistical experiments on limb movements of people of different ages and body types in non-steady-state scenarios such as rubble search and rescue and field exploration. When a person actively holds a limb in a hover to express directional intention, the limb angular velocity is generally below 15 deg/s, while the angular velocity of passive swaying caused by maintaining body balance often exceeds 20 deg/s. This threshold can therefore distinguish active hovering from passive swaying, providing a reliable quantitative basis for subsequent intention recognition. Compared with traditional fixed thresholds that do not take scene characteristics into account, it is better adapted to complex motion states in non-steady-state environments.
[0033] S32 constructs the intent-locking logic. The system constructs an intent-locking logic that combines the dynamic speed threshold generated by S2 with the angular rest threshold set by S31 as parallel constraints. This logic is based on an analysis of how human intention is expressed: a relative velocity below its threshold alone may indicate that the limb is passively stationary relative to the torso rather than being actively controlled; a relative angular velocity below its threshold alone may indicate that the limb is slowly and aimlessly swinging rather than expressing an intention. Only when both linear motion stability and angular motion stability meet the requirements can the movement be preliminarily determined to be an active intention expression.
[0034] The system clearly defines the rules for determining parallel constraints: the constraints are only satisfied when the relative motion velocity of the limb's extremity is below the dynamic velocity threshold and the relative angular velocity is below the angular rest threshold. The input to this logic is the real-time calculated relative motion velocity and relative angular velocity, and the output is a binary judgment signal indicating "constraint satisfied" or "constraint not satisfied," providing a clear and executable logical basis for subsequent determination of physiological relative freeze states. This dual-constraint mechanism overcomes the limitations of single-threshold verification, significantly reducing the probability of passive actions being misjudged as active intentions, making intention recognition more reliable.
[0035] S33 determines the physiological relative freeze state. The system monitors in real time the relative angular velocity calculated by S31 and the relative motion velocity of the limb extremity with respect to the torso. The relative motion velocity is calculated from the difference between the limb extremity position data collected by S11 and the torso center position data. The calculation uses a 0.01 s time interval, consistent with the sampling frequency of the inertial measurement unit, to ensure data synchronization and unit uniformity. When the relative motion velocity is lower than the dynamic speed threshold generated by S2 and the relative angular velocity is lower than the angular rest threshold set by S31, the system determines that the commander is in a physiological relative freeze state.
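As a non-limiting illustration, the determination in S33 can be sketched as a finite-difference speed estimate followed by the dual-constraint test of S32. The function names are hypothetical, and it is assumed that positions are 3D coordinates in metres already expressed in the gravity-compensated torso reference frame.

```python
import math

DT = 0.01                      # s, matches the 100 Hz IMU sampling interval
ANGULAR_REST_THRESHOLD = 15.0  # deg/s, from S31


def relative_speed(limb_prev, torso_prev, limb_curr, torso_curr, dt=DT):
    """Speed (m/s) of the limb extremity relative to the torso center.

    The torso position is subtracted from the limb position in each frame,
    so motion shared by torso and limb cancels out.
    """
    rel_prev = [l - t for l, t in zip(limb_prev, torso_prev)]
    rel_curr = [l - t for l, t in zip(limb_curr, torso_curr)]
    return math.dist(rel_prev, rel_curr) / dt


def is_relative_freeze(rel_speed, rel_angular_speed, speed_threshold):
    """Both constraints must hold: linear AND angular stability."""
    return (rel_speed < speed_threshold
            and rel_angular_speed < ANGULAR_REST_THRESHOLD)
```

A limb that translates together with the torso gives a relative speed near zero, so it can satisfy the linear constraint even while the commander is moving.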
[0036] The core characteristic of this state is that the extremities remain stable relative to the torso, successfully isolating the passive swaying component caused by maintaining torso balance. Experimental results show that this judgment logic achieves an accuracy of approximately 92% in recognizing active intention actions. In non-steady-state scenarios, compared to traditional methods based solely on velocity thresholds, the false positive rate is reduced by about 40%. This precise state determination effectively solves the recognition challenge caused by the coupling of torso movement and limb intention actions in non-steady-state scenarios, providing a reliable logical trigger condition for subsequent interception of intention data frames.
[0037] S34 triggers the interception of potential intent data frames. Once S33 confirms that the commander is in a physiological relative freeze state, the system immediately identifies the current limb movement as an active intention expression and triggers the interception of potential intent data frames. The interception time window is set to 2 seconds. This duration was determined by statistically analyzing the duration of active pointing intentions in different scenarios: a single pointing intention by most commanders lasts between 1.5 and 2.5 seconds. The 2-second window captures the data of a typical active pointing intention while avoiding the redundant data accumulation caused by excessively long interception times, thus ensuring data processing efficiency.
[0038] The captured data frames contain key information such as continuous limb pointing vectors, relative motion velocity, and relative angular velocity. The data format is consistent with the inertial measurement unit data after S14 correction, facilitating direct processing in subsequent stages. The capture process employs a real-time caching mechanism. Once a capture command is triggered, the system serializes and stores relevant data frames within a 2-second time frame into a dedicated cache area, ensuring data integrity and temporal order. This state-triggered capture method effectively filters out invalid data generated by non-intent actions, providing high-quality basic input data for the subsequent construction of the probability cone search space, further improving the accuracy of the entire intent understanding process.
[0039] S4 constructs a probability cone search space. S41 analyzes the pointing vector and calculates the dynamic jitter residual. The system uses the potential intent data frame captured by S34, or the main intent frame locked by S63 and designated as the input source by S64, as its core input. When a main intent frame exists, it is prioritized as the input source to ensure consistency with the multimodal time-domain backtracking verification logic. These input data frames have all passed the physiological relative freeze state determination or the time-domain competition check, which removes passive swaying components caused by trunk balance and instantaneous false triggers, retaining only motion information related to the core active intent and laying the foundation for accurate extraction of the pointing vector. The system calculates the pointing vector as the difference between the three-dimensional coordinates of the limb extremity and the coordinates of the trunk center. All vectors are defined within the gravity-compensated torso reference frame constructed by S1, ensuring a consistent representation of spatial orientation.
[0040] The system generates dynamic jitter residuals by calculating the dispersion of pointing vectors, with the dispersion quantified using standard deviation. The calculation first averages all pointing vectors to obtain the average pointing vector. Then, it calculates the angle between each individual pointing vector and the average pointing vector, and finally, the standard deviation of these angles is used as the dynamic jitter residual. Combining a 2-second truncation window and a 100Hz sampling frequency, the number of data frames is fixed at 200. This number has been experimentally verified to ensure the reliability of the statistical results while avoiding computational delays caused by excessive data. The dynamic jitter residuals directly reflect the degree of physiological jitter in limb pointing; a larger value indicates weaker pointing stability. This quantification solves the problem that traditional methods cannot adapt to different jitter states, allowing for more targeted adjustments to the search space.
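As a non-limiting illustration, the residual calculation described above can be sketched in plain Python. The function name is hypothetical. Two details are assumptions not stated in the text: each pointing vector is normalized before averaging, and the population standard deviation is used. Input vectors are assumed to be nonzero.

```python
import math
import statistics


def jitter_residual_deg(vectors):
    """Return (mean_direction, residual_deg) for a list of 3D pointing vectors."""
    # Normalize each pointing vector (assumption: direction only matters).
    units = []
    for v in vectors:
        norm = math.sqrt(sum(c * c for c in v))
        units.append([c / norm for c in v])

    # Average pointing vector, re-normalized to unit length.
    mean = [sum(u[i] for u in units) / len(units) for i in range(3)]
    mean_norm = math.sqrt(sum(c * c for c in mean))
    mean_dir = [c / mean_norm for c in mean]

    # Angle between each pointing vector and the average, in degrees.
    angles = []
    for u in units:
        cos_a = sum(a * b for a, b in zip(u, mean_dir))
        cos_a = max(-1.0, min(1.0, cos_a))  # guard against rounding error
        angles.append(math.degrees(math.acos(cos_a)))

    # Dynamic jitter residual: standard deviation of those angles.
    return mean_dir, statistics.pstdev(angles)
```

With the 2-second window and 100 Hz sampling, `vectors` would hold 200 entries per interception.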
[0041] S42 establishes a mapping relationship between the opening angle and the dynamic jitter residual. The system constructs an adaptive opening angle adjustment logic, establishing a positive correlation between the opening angle of the probability cone search space and the dynamic jitter residual calculated by S41. This mapping is designed to meet the actual needs of non-steady-state scenarios: when limb pointing jitter is severe, the pointing deviation range expands, requiring a larger search space to cover possible targets; when limb pointing is stable, the search range can be narrowed to improve target selection efficiency and avoid retrieving invalid regions.
[0042] The system uses the following formula to calculate the opening angle: Opening angle = Gain coefficient × Dynamic jitter residual, where the gain coefficient is set to 3. This value was determined through extensive comparative experiments: for the common dynamic jitter residual range of 0.5 deg to 10 deg, a gain coefficient of 3 ensures that the opening angle covers more than 95% of the pointing deviation without reducing target selection efficiency through an excessively large range. For example, when the dynamic jitter residual is 3 deg, the opening angle is 9 deg, which accommodates pointing fluctuations at this level of jitter; when the dynamic jitter residual is 1 deg, the opening angle is 3 deg, preserving pointing accuracy while reducing redundant searches. By transforming jitter uncertainty into a search space parameter, this design significantly improves the system's robustness in non-steady-state scenarios compared to traditional fixed opening angle schemes.
[0043] S43 generates the probability cone search space. The system takes the average pointing vector calculated by S41 as the central axis of the probability cone search space. The average pointing vector filters out instantaneous fluctuations in single-frame data, better representing the true pointing intention of the commander and ensuring the accuracy of the core direction of the search space. The central axis is defined in the gravity-compensated torso reference frame, consistent with the coordinate system described above, avoiding deviations caused by spatial transformations.
[0044] The system combines the central axis with the opening angle determined by S42 to generate a three-dimensional conical probability cone search space. Starting from the central axis, this space expands symmetrically to both sides over the angular range given by the opening angle, forming a complete conical search region. The boundaries of the conical space are determined through geometric calculation: the coordinates of each boundary point are computed from the direction of the central axis and the opening angle, ensuring the accuracy of the spatial range. Compared to the traditional single-ray projection method, this dynamic probability cone search space adapts to different jitter states, accommodating pointing deviations while maintaining recognition accuracy. This provides a sound spatial range basis for subsequent target selection, making target search in non-steady-state scenarios more adaptable and efficient.
[0045] S5 selects the interaction target. S51 extracts candidate objects within the space. The system uses the probability cone search space generated by S43 as the core retrieval basis and performs spatial envelope retrieval on the environmental semantic map collected by S1. The environmental semantic map already contains the 3D coordinates, category, and contour information of all objects in the scene. The retrieval process focuses on the geometric centroid of each object: the system first defines the boundary coordinates of the probability cone search space, which are determined by the direction of the central axis and the opening angle, with the coordinates of each boundary point derived through geometric calculation. The system then compares the 3D coordinates of the geometric centroid of each object with the boundary coordinates of the probability cone one by one. If the centroid falls within both the axial extension range and the opening angle coverage of the cone, the object is determined to be a candidate object in the space. This centroid-based retrieval method keeps the screening results relevant to the commander's pointing intention while simplifying the calculation. Those skilled in the art can implement it directly through coordinate comparison logic, filtering out irrelevant objects outside the probability cone.
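As a non-limiting illustration, the centroid test of S51 can be sketched as an angle-and-range check rather than an explicit boundary-point comparison, which is geometrically equivalent for a right circular cone. The function name and the `max_range_m` parameter are hypothetical. Two assumptions are made: the cone apex is the torso center, and the opening angle is the full apex angle, so half of it lies on each side of the central axis.

```python
import math


def in_probability_cone(centroid, apex, axis, opening_angle_deg, max_range_m):
    """True if an object's geometric centroid lies inside the cone.

    Assumptions: `opening_angle_deg` is the full apex angle; `max_range_m`
    is an illustrative limit on the axial extension range; `axis` is nonzero.
    """
    rel = [c - a for c, a in zip(centroid, apex)]
    dist = math.sqrt(sum(r * r for r in rel))
    if dist == 0.0:
        return False  # centroid coincides with the apex

    axis_norm = math.sqrt(sum(a * a for a in axis))
    # Length of the projection of `rel` onto the central axis.
    axial = sum(r * a for r, a in zip(rel, axis)) / axis_norm
    if axial <= 0.0 or axial > max_range_m:
        return False  # behind the apex or beyond the axial range

    cos_off = max(-1.0, min(1.0, axial / dist))
    offset_deg = math.degrees(math.acos(cos_off))
    return offset_deg <= opening_angle_deg / 2.0
```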
[0046] S52 calculates the two-factor evaluation parameters. The system constructs a two-factor attenuation evaluation logic. For each candidate object extracted by S51, two key evaluation parameters are calculated: angular offset and Euclidean distance. The angular offset is the angle between the direction to the geometric centroid of the candidate object and the central axis of the probability cone search space. In the calculation, the vector from the torso center to the centroid and the vector of the cone's central axis are first normalized, the cosine of the angle is obtained from their dot product, and the angular offset in deg is then derived from this cosine value. This calculation method ensures the accuracy of the angle representation and keeps its units consistent with the opening angle and jitter residual in S4.
[0047] The Euclidean distance is the straight-line distance between the geometric centroid of the candidate object and the center of the commander's torso. It is calculated from the three-dimensional coordinates of both points within the gravity-compensated torso reference frame using the spatial distance formula, and is measured in meters (m). The torso center coordinates are derived from position data collected by S11 and corrected by the S14 attitude correction matrix, keeping the distance calculation benchmark consistent with the coordinate system described earlier and avoiding deviations caused by spatial transformations. These two parameters characterize the correlation between the candidate object and the pointing intent in terms of "direction matching degree" and "physical accessibility," respectively. Compared to traditional single-dimensional evaluation, this approach is better suited to the complex interaction needs of non-steady-state scenarios.
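As a non-limiting illustration, the two evaluation parameters of S52 can be computed together from the same difference vector. The function name is hypothetical, and the vector is assumed to run from the torso center toward the object centroid so that an object on the central axis gives an offset of 0 deg.

```python
import math


def two_factor_parameters(centroid, torso_center, axis):
    """Return (angular_offset_deg, euclidean_distance_m) for one candidate."""
    to_obj = [c - t for c, t in zip(centroid, torso_center)]
    distance_m = math.sqrt(sum(x * x for x in to_obj))
    axis_norm = math.sqrt(sum(a * a for a in axis))
    if distance_m == 0.0 or axis_norm == 0.0:
        raise ValueError("centroid must differ from torso center; axis must be nonzero")

    # Cosine of the angle from the dot product of the normalized vectors.
    dot = sum(x * a for x, a in zip(to_obj, axis))
    cos_a = max(-1.0, min(1.0, dot / (distance_m * axis_norm)))
    return math.degrees(math.acos(cos_a)), distance_m
```

For a centroid at (3, 4, 0) m, a torso center at the origin, and a central axis along x, this gives a distance of 5 m and an angular offset of about 53.13 deg.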
[0048] S53 generates saliency weights. The system generates a saliency weight that is negatively correlated with the angular offset and the Euclidean distance calculated by S52: the smaller the angular offset and the shorter the Euclidean distance, the higher the weight, and the stronger the match between the candidate object and the commander's true intention. The saliency weight is calculated by a formula combining a distance term and an angular offset term [formula and coefficient symbols not preserved in the source text], in which the two weighting coefficients are 0.6 and 0.4, respectively, summing to 1 to keep the weighting logic consistent. The coefficient values and the form of the formula were determined through numerous experiments in non-steady-state scenarios. The "1" introduced into the formula as a distance baseline constant serves two purposes. First, it normalizes the distance term to a dimensionless value within the [0,1] interval, eliminating both the numerical singularity of "1 / Euclidean distance" at close range (<1 m) and the dimensional inconsistency at long range. Second, it supports the weighting rationale: in non-steady-state environments, physical distance directly affects the feasibility and convenience of interaction, and closer targets are more likely to be the commander's actual interaction objects, so the distance term is assigned the higher proportion; the angular offset reflects pointing accuracy and is an important supplement to intent matching, so its term is assigned a reasonable proportion to ensure directional accuracy. In the formula, the angular offset term increases as the angular offset decreases, in line with the required negative correlation. This improved two-factor weighting method overcomes the limitations of traditional single-factor evaluation, enabling more accurate and robust selection of targets that match the true intent.
[0049] S54 performs confidence-gated validation and determines the interaction target. The system sets a confidence threshold of 0.7, which was determined through target matching experiments in different non-steady-state scenarios (such as rubble search and rescue and field exploration). When the saliency weight exceeds 0.7, the match between the candidate object and the intended target is high, and the probability of a false match is less than 10%. When the saliency weight is below 0.7, the target is ambiguous: there may be multiple candidate objects that are difficult to distinguish, or the candidate objects may correlate only weakly with the intention, and the risk of false interaction increases significantly.
[0050] The system first identifies the object with the highest saliency weight calculated by S53 as the target to be confirmed. Then, it performs a confidence-gated check: if the saliency weight of the target exceeds 0.7, the system confirms it as an interactive target; otherwise, it suppresses the generation of interactive commands and remains in standby mode. This check effectively filters out erroneous interactions caused by ambiguous targets, ensuring the reliability of interactive targets. Compared to traditional target selection methods without confidence judgment, this significantly improves the accuracy of intent recognition in non-steady-state scenarios, preventing the system from blindly executing commands when the target is unclear, and better meeting the practical application needs of high-risk and complex scenarios.
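As a non-limiting illustration, the confidence-gated check of S54 can be sketched as follows. The function name and the dictionary input are hypothetical; the saliency weights are taken as already computed by S53, and `None` is used here to represent the standby state in which no interaction command is generated.

```python
CONFIDENCE_THRESHOLD = 0.7  # from S54


def select_interaction_target(saliency_by_object):
    """Return the confirmed interaction target, or None to remain in standby.

    `saliency_by_object` maps an object identifier to its saliency weight.
    """
    if not saliency_by_object:
        return None  # no candidates inside the probability cone

    # The candidate with the highest saliency weight is the target to be confirmed.
    best = max(saliency_by_object, key=saliency_by_object.get)

    # Gate: confirm only if the weight exceeds the confidence threshold.
    if saliency_by_object[best] > CONFIDENCE_THRESHOLD:
        return best
    return None
```

With weights `{"door": 0.82, "crate": 0.55}` the function returns `"door"`; with `{"door": 0.69, "crate": 0.40}` it returns `None` and no command is issued.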
[0051] S6 Multimodal Time-Domain Backtracking and Competition Check. S61 constructs a sliding temporal buffer queue. The system continuously serializes and stores the potential intent data frames captured by S34 after the physiological relative freeze state determination, forming a sliding temporal buffer queue. The queue's storage capacity is set to 10 seconds, determined by statistically analyzing the temporal correlation between human voice commands and body movements: in most scenarios, the time interval between a commander's voice command and the corresponding body movement does not exceed 8 seconds. The 10-second capacity ensures that relevant intent data frames are not missed during backtracking while avoiding redundant storage that consumes resources. Data frames are stored in the queue in timestamp order. When adding a new data frame causes the total queue duration to exceed 10 seconds, the system automatically discards the oldest data frames, keeping the queue data current and providing complete and valid data for subsequent time-domain indexing. This dynamic storage mechanism allows the queue to adapt to different interaction rhythms.
[0052] S62 parses the voice command stream and executes the reverse index. The system analyzes the voice command stream of the commander in real time and uses speech recognition technology to extract semantic keywords containing spatial referential attributes. The speech recognition technology focuses on words strongly related to spatial direction, such as "over there," "above," "in front," and "left side." The input is real-time collected voice audio data, and the output is the recognized semantic keywords and their corresponding timestamps, ensuring the relevance and accuracy of keyword extraction. When such semantic keywords are recognized, the system uses the moment of occurrence of the keyword as the time anchor point and performs a reverse indexing on the sliding temporal buffer queue constructed by S61 based on a preset 2-second backtracking time threshold. The backtracking time threshold is determined based on the statistical synchronization of human voice commands and limb pointing actions: experiments show that after a commander issues a spatial semantic command, the corresponding limb pointing action usually occurs within 2 seconds before and after. This threshold can accurately locate the limb intention data frame sequence matching the voice command, realizing the temporal correlation between the voice modality and the limb action modality.
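As a non-limiting illustration, the sliding queue of S61 and the reverse index of S62 can be sketched with a double-ended queue. The class and method names are hypothetical. The lookup is assumed to cover the interval from the keyword's time anchor back to 2 seconds earlier; whether frames after the anchor are also searched is not settled by the text.

```python
from collections import deque


class SlidingIntentQueue:
    """Time-ordered buffer of potential intent data frames."""

    def __init__(self, capacity_s=10.0):
        self.capacity_s = capacity_s
        self._frames = deque()  # (timestamp_s, frame), oldest first

    def __len__(self):
        return len(self._frames)

    def push(self, timestamp_s, frame):
        """Append a frame, then drop the oldest frames beyond the capacity."""
        self._frames.append((timestamp_s, frame))
        while self._frames[-1][0] - self._frames[0][0] > self.capacity_s:
            self._frames.popleft()

    def reverse_index(self, anchor_s, backtrack_s=2.0):
        """Return frames stamped within [anchor_s - backtrack_s, anchor_s]."""
        start = anchor_s - backtrack_s
        return [f for t, f in self._frames if start <= t <= anchor_s]
```

A caller would push one frame per captured sample and call `reverse_index` with the timestamp of each recognized spatial keyword.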
[0053] S63 executes the intent persistence competition check. If multiple non-contiguous potential intent data frame sequences are retrieved within the S62 backtracking time threshold, the system performs an intent persistence competition check. The system calculates the time span weight of each sequence, which is positively correlated with the sequence duration. The weight is calculated as W = t / T, where W is the time span weight, t is the sequence duration, and T is the backtracking time threshold. This formula follows from the reasoning that intent expressions with longer durations are more likely to be the true intentions of the commander. Both t and T are in units of time, so the weight W is dimensionless. For example, if a sequence has a duration of 1.2 seconds and the backtracking time threshold is 2 seconds, the time span weight of this sequence is 0.6. The system locks the sequence with the highest time span weight as the main intent frame. This competition mechanism filters out short sequences triggered momentarily by mistake, improving the reliability of intent recognition and avoiding misjudgments caused by a single frame or short sequence.
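As a non-limiting illustration, the competition check of S63 can be sketched as follows. The function name and the representation of each sequence as a (start, end) pair of timestamps in seconds are hypothetical.

```python
def lock_main_intent(sequences, backtrack_s=2.0):
    """Return (index_of_main_intent, weights) for candidate sequences.

    `sequences` is a list of (start_s, end_s) pairs. The time span weight
    is W = t / T, with t the sequence duration and T the backtracking
    time threshold. Returns (None, []) when there are no sequences.
    """
    if not sequences:
        return None, []
    weights = [(end - start) / backtrack_s for start, end in sequences]
    # The sequence with the highest time span weight becomes the main intent frame.
    best = max(range(len(sequences)), key=weights.__getitem__)
    return best, weights
```

For sequences lasting 0.3 s and 1.2 s with T = 2 s, the weights are 0.15 and 0.6, and the second sequence is locked.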
[0054] S64 Determines the input source of the probability cone search space. The system uses the main intent frame locked by S63 as the sole input source for constructing the probability cone search space. This configuration enables deep collaboration between the voice modality and the body movement modality: voice keywords provide temporal anchors and semantic associations for the intent, while body movement sequences provide precise spatial orientation information. The combination of these two allows the construction of the probability cone search space to better reflect the true intent of the commander. Compared to relying solely on body movement recognition, this multimodal fusion mechanism further improves the accuracy of intent recognition, effectively solving the problem of false triggering caused by a single sensor when the instantaneous data quality is poor, and making intent parsing more robust in non-steady-state scenarios.
[0055] S7 Geometric Ambiguity Validation Based on Semantic Constraints. S71 analyzes semantic keywords and geometric pointing features. The system transforms the semantic keywords parsed by S62 into logical direction attributes containing abstract spatial referential information. These semantic keywords originate from the commander's voice command stream and are direct semantic expressions of spatial intent. For example, "above" corresponds to the vertically upward logical direction, "forward" corresponds to the torso's forward logical direction, "left" corresponds to the leftward logical direction perpendicular to the torso's forward direction, and "right" corresponds to the rightward logical direction perpendicular to the torso's forward direction. The transformation rules follow the cognitive habits of everyday spatial communication, ensuring a high degree of consistency between logical direction and semantic intent. At the same time, within the gravity-compensated torso reference frame constructed in S1, the system calculates the geometric pointing features of the main intent frame locked in S63. The main intent frame carries the core limb pointing data that has passed the time-domain competition check. By extracting the spatial angle features of the continuous pointing vectors within the frame, the system calculates the pitch and azimuth angles of the pointing direction, which together represent the actual spatial direction of the limb pointing and provide a quantitative basis for the subsequent semantic and geometric consistency verification.
[0056] S72 defines and validates semantic legal domains. The system defines semantic legal domains in three-dimensional space based on the logical direction attributes parsed by S71. The boundary parameters of the semantic legal domains are determined by statistically analyzing a large amount of human-computer interaction data in non-steady-state scenarios: the semantic legal domain corresponding to "above" is set to a spatial region with a pitch angle greater than 15 degrees. Experiments show that when commanders use the expression "above," the pitch angle of their limbs pointing is rarely lower than 15 degrees, and this range can cover about 95% of the real intention scenarios; the semantic legal domain corresponding to "forward" is set to a spatial region with an azimuth angle between -30 degrees and 30 degrees. This range matches the effective pointing range of the human torso in the positive direction, and beyond this range, it tends to be a lateral semantic expression; the semantic legal domain corresponding to "left" is set to a spatial region with an azimuth angle between 60 degrees and 120 degrees, and the semantic legal domain corresponding to "right" is set to a spatial region with an azimuth angle between -120 degrees and -60 degrees. The left and right lateral angle ranges are set based on the symmetrical distribution pattern of the two sides of the human torso and actual pointing habits to ensure consistency with semantic cognition. The system verifies whether the geometric pointing features calculated by S71 fall within the coverage of the semantic legal domain. The verification process is achieved by directly comparing the pitch angle and azimuth angle of the geometric pointing features with the boundary parameters of the semantic legal domain. The operation is intuitive and easy to reproduce by those skilled in the art.
[0057] S73 determines a spatiotemporal cognitive conflict and intercepts the command. When the geometric pointing features calculated by S71 fall into a mutually exclusive quadrant that contradicts the semantic legal domain defined by S72, the system determines that there is a spatiotemporal cognitive conflict in the current interaction. Mutually exclusive quadrants are defined by the opposing semantic relationships of the logical directions. For example, when the semantic keyword is "above," the mutually exclusive quadrant is the spatial region with a pitch angle less than -10 degrees; this region is the clear semantic range of "below" and directly conflicts with the semantics of "above." When the keyword is "forward," the mutually exclusive quadrant is the region with an azimuth angle greater than 60 degrees or less than -60 degrees; these regions correspond to the semantics of "left" and "right," respectively, contradicting the semantics of "forward." For the keywords "left" and "right," the mutually exclusive quadrant is the semantic legal domain of the opposite side. In such a case, the system immediately intercepts the generation and issuance of interaction commands to prevent the agent from performing incorrect operations caused by a contradiction between the speech semantics and the geometric pointing of the limb. This conflict interception mechanism builds directly on the semantic parsing and geometric feature verification results described above. By filtering geometric ambiguities through semantic constraints, it further improves the safety and accuracy of human-machine interaction, resolves the instruction conflict that arises when the commander's voice commands and actions are inconsistent in non-steady-state scenarios, and makes intent recognition more reliable.
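As a non-limiting illustration, the legal-domain check of S72 and the conflict check of S73 can be sketched as two lookup tables of predicates. The names are hypothetical. Positive azimuth is taken to mean the commander's left, as implied by the stated ranges, and boundary values are treated as inclusive, which is an assumption. The text does not say how a pointing direction that is neither legal nor mutually exclusive is handled, so the sketch merely labels it "undetermined".

```python
# Semantic legal domains from S72 (angles in degrees).
LEGAL = {
    "above": lambda pitch, az: pitch > 15.0,
    "forward": lambda pitch, az: -30.0 <= az <= 30.0,
    "left": lambda pitch, az: 60.0 <= az <= 120.0,
    "right": lambda pitch, az: -120.0 <= az <= -60.0,
}

# Mutually exclusive quadrants from S73.
EXCLUSIVE = {
    "above": lambda pitch, az: pitch < -10.0,
    "forward": lambda pitch, az: az > 60.0 or az < -60.0,
    "left": LEGAL["right"],   # opposite side's legal domain
    "right": LEGAL["left"],
}


def check_semantic_geometry(keyword, pitch_deg, azimuth_deg):
    """Return "conflict", "valid", or "undetermined" for one keyword."""
    if EXCLUSIVE[keyword](pitch_deg, azimuth_deg):
        return "conflict"      # S73: intercept the interaction command
    if LEGAL[keyword](pitch_deg, azimuth_deg):
        return "valid"         # S72: pointing agrees with the keyword
    return "undetermined"      # illustrative label, not from the source
```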
[0058] S8 Adaptive Correction Based on Negative Feedback. S81 opens the feedback listening window and monitors movement speed. Once S54 locks onto the interaction target, the system immediately opens a feedback listening window. The listening window duration is set to 0.5 seconds. This duration is derived from statistical analysis of human response times to interaction results: the average reaction time for a commander to confirm whether the target matches their intention is approximately 0.3 seconds. The 0.5-second window gives the commander sufficient time to respond while avoiding excessive waiting that slows down the interaction. Within the listening window, the system monitors in real time the lateral relative velocity of the limb within the gravity-compensated torso reference frame constructed by S1. This velocity is calculated from limb motion data collected by the inertial measurement unit, following the spatial motion analysis logic described earlier, so that the velocity data reflects the commander's response to the currently locked target and provides a reliable quantitative basis for the subsequent negative feedback determination.
[0059] S82 determines the negative feedback signal and clears the data. The system sets a rejection speed threshold of 0.5 m/s. This threshold was determined by statistically analyzing the motion characteristics of human negative feedback actions: when a commander is dissatisfied with a locked target, they often make a rapid lateral swing of the limb as a form of rejection. The speed of this motion generally exceeds 0.5 m/s, while the speed of limb movements when confirming a target is mostly below 0.3 m/s. The clear gap between the two distinguishes the different feedback intentions. When the lateral relative velocity monitored by S81 exceeds the rejection speed threshold, the system determines that a negative feedback signal has been received. The system then immediately clears the currently stored potential intent data frames and main intent frames (if any). These data frames comprise the intent-related data captured by S34 and subsequently verified through multimodal analysis. The clearing operation prevents erroneous data from interfering with the subsequent intent recognition process and ensures that the system responds quickly to the commander's negative feedback, in step with the feedback monitoring logic described above.
[0060] S83 corrects the opening angle calculation logic. The system retrieves a preset gain compensation coefficient of 1.5. Extensive experiments in non-steady-state scenarios have verified that this coefficient reasonably improves the probability cone's tolerance for pointing deviations without excessively expanding the search range or reducing target selection efficiency. Based on this gain compensation coefficient, the system modifies the opening angle calculation logic constructed in S42, replacing the original gain coefficient of 3 with a new gain coefficient equal to the product of the original gain coefficient and the gain compensation coefficient, i.e., 4.5. Through this correction, the system proportionally increases the expansion ratio of the opening angle relative to the dynamic jitter residual, so that the next recognition cycle generates a probability cone search space with a larger envelope. This adaptive correction mechanism allows the system to adapt quickly to the larger pointing adjustments commanders make when the intended target was not selected, improving the flexibility and success rate of target recognition. Experiments have shown that the success rate of target recognition improves by approximately 25% after the correction, addressing the inability of traditional fixed search ranges to adapt to dynamic pointing adjustments by users.
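As a non-limiting illustration, the original opening angle rule of S42 and the corrected rule of S83 can be sketched in one function. The function name and the `rejected` flag are hypothetical. The sketch applies the compensation once; whether it compounds over repeated rejections is not stated in the text.

```python
BASE_GAIN = 3.0          # gain coefficient from S42
GAIN_COMPENSATION = 1.5  # gain compensation coefficient from S83


def opening_angle_deg(jitter_residual_deg, rejected=False):
    """Opening angle = gain coefficient x dynamic jitter residual.

    After a negative feedback signal (`rejected=True`), the gain becomes
    BASE_GAIN * GAIN_COMPENSATION = 4.5 for the next recognition cycle.
    """
    gain = BASE_GAIN * GAIN_COMPENSATION if rejected else BASE_GAIN
    return gain * jitter_residual_deg
```

For a dynamic jitter residual of 3 deg, the opening angle grows from 9 deg to 13.5 deg after a rejection.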
[0061] S9: multimodal asynchronous verification of the negative feedback signal. S91 parses the voice command stream and generates a semantic-level negation signal. Within the feedback monitoring window opened by S81, the system analyzes the commander's voice command stream in parallel with the motion speed monitoring. The system employs speech recognition technology and focuses on capturing negative semantic keywords that express an intention to refuse, such as "incorrect," "no," "cancel," and "change to another." These keywords were determined by statistically analyzing how negative feedback is spoken in non-steady-state scenarios; they cover the most common expressions of refusal in emergency interactions, ensuring comprehensive semantic capture. The input to the speech recognition is audio data collected in real time, and the output is a binary result of "negative keyword detected" or "negative keyword not detected," along with the corresponding semantic-level negation signal. The keyword recognition accuracy is no less than 95%, a figure obtained by training and testing on a large number of speech samples from non-steady-state scenarios. The recognition therefore captures genuine negative intentions while keeping the false recognition rate caused by environmental noise low, and it complements body movement feedback as a second modality, improving the comprehensiveness and reliability of negative feedback recognition.
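The binary output described in S91 can be sketched as a keyword match on recognized text. This is a simplified illustration only: it assumes that an upstream speech recognizer (not specified in this application) has already produced an English transcript, and the function name `semantic_negation` is hypothetical. The keyword list is taken from the examples above.

```python
import re

# Negative semantic keywords listed in S91.
NEGATIVE_KEYWORDS = ("incorrect", "no", "cancel", "change to another")

# Match whole words or phrases only, so that "no" does not fire
# inside words such as "know" or "north".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in NEGATIVE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def semantic_negation(transcript: str) -> bool:
    """Return True if the transcript contains a negative keyword."""
    return _PATTERN.search(transcript) is not None
```

Matching on word boundaries is a design choice of this sketch: it reduces false triggers from substrings, which matters because a single short keyword such as "no" would otherwise match many unrelated words.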
[0062] S92 defines the motion-level negation signal and the parallel triggering conditions. The system defines the motion-level negation signal as the situation in which the lateral relative motion speed monitored by S81 exceeds the rejection speed threshold. The rejection speed threshold remains the 0.5 m/s set in S82, which keeps the judgment criteria consistent and avoids feedback misjudgments caused by conflicting thresholds. The system sets the semantic-level negation signal generated by S91 and the motion-level negation signal as parallel triggering conditions that back each other up; that is, the same negative feedback response is triggered when either signal is present. This setting fits the practical interaction needs of non-steady-state scenarios: the commander's speech may not be recognized clearly because of environmental noise, or the commander may be unable to make a rejection gesture because both hands are occupied. The dual triggering conditions let the commander express rejection in whichever way is most convenient, which greatly improves the convenience and robustness of human-computer interaction and ties together the speech parsing and speed monitoring logic described above.
[0063] S93 responds to the negative feedback signal and performs the corresponding operations. When either of the parallel triggering conditions set in S92 is met, the system immediately confirms that a negative feedback signal has been received. The system then performs two core operations together: first, it clears the currently stored potential intent data frames and main intent frames (if any). These frames include the intent-related data captured in S34 and verified through multimodal analysis in S6 and S7, so clearing them prevents erroneous data from interfering with subsequent intent recognition; second, it initiates the correction of the angle calculation logic parameters in S83, proportionally increasing the expansion ratio of the opening angle relative to the dynamic jitter residual and thereby widening the search range for the next recognition cycle. Through multimodal asynchronous verification, the system overcomes the limitations of single-modal feedback and avoids missed or incorrect feedback when one modality fails (for example, when speech is masked by environmental noise or body movements are not captured by the sensors). This mechanism closes the human-computer interaction loop, enabling the system to adjust its recognition strategy in real time based on the commander's immediate feedback, further improving the system's adaptability and user experience in non-steady-state scenarios, and making the intent understanding process more resistant to interference.
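The parallel triggering of S92 and the response of S93 amount to a logical OR followed by two state updates. The sketch below is one possible reading under stated assumptions: `RecognizerState` and `handle_feedback` are hypothetical names, and multiplying the gain again on every further rejection is a choice of this sketch that the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class RecognizerState:
    """Hypothetical state touched by the negative feedback response."""
    potential_frames: list = field(default_factory=list)
    main_intent_frame: object = None
    angle_gain: float = 3.0  # original gain coefficient from S42


def handle_feedback(state, semantic_negation, lateral_speed_mps,
                    reject_speed_mps=0.5, gain_compensation=1.5):
    """Apply the S93 response if either parallel condition is met."""
    motion_negation = lateral_speed_mps > reject_speed_mps
    if not (semantic_negation or motion_negation):
        return False
    # Operation 1: clear stored intent data.
    state.potential_frames.clear()
    state.main_intent_frame = None
    # Operation 2: enlarge the gain used for the next probability cone.
    state.angle_gain *= gain_compensation
    return True
```

Either a spoken negation with no limb movement or a fast lateral swing with no speech produces the same result: the stored frames are emptied and the gain becomes 4.5 for the next recognition cycle.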
[0064] The implementation principle of the human-machine collaborative embodied AI agent interaction and intent understanding method in this embodiment is as follows. The method addresses the recognition difficulty caused by the coupling of torso movement and limb intent in non-steady-state scenarios. By constructing a gravity-compensated torso reference frame whose vertical axis is forcibly aligned with the direction of gravity, it decouples the noise introduced by torso tilt and rotation. Limb pointing data can then be represented within a stable reference frame, which reduces pointing projection deviation by approximately 70% and provides the spatial reference for accurate intent recognition. On this basis, by establishing a positive correlation mapping between the dynamic velocity threshold and the global movement rate, and by adaptively adjusting the opening angle of the probability cone search space according to the dynamic jitter residual of the limb, the system can distinguish passive swaying that maintains balance from true pointing. This reduces the misjudgment rate by approximately 40% in highly dynamic scenarios, such as when the commander is running, while the widened search tolerates unstable pointing directions. Furthermore, the method combines the semantic constraints of voice commands with the geometric features of body pointing to perform spatiotemporal consistency verification, intercepting ambiguous commands whose semantics and pointing direction conflict. A multimodal negative feedback mechanism is also introduced: when the commander expresses dissatisfaction through a rapid body movement or a spoken negation, the system immediately clears the erroneous data and adaptively expands the search range, increasing the subsequent recognition success rate by approximately 25%. This forms a closed-loop optimization of human-computer interaction, ultimately enabling reliable understanding of human pointing intentions even under vigorous movement and unstable postures, and improving the accuracy and adaptability of human-machine collaborative operations.
[0065] The above are all preferred embodiments of this application, and are not intended to limit the scope of protection of this application. Therefore, all equivalent changes made in accordance with the structure, shape and principle of this application should be covered within the scope of protection of this application.
Claims
1. A human-machine collaborative embodied AI agent interaction and intent understanding method, characterized in that it comprises: collecting inertial data and position data of the commander and obtaining an environmental semantic map containing object position information; using the gravity vector in the inertial data to construct a gravity-compensated torso reference frame with the torso as the origin and the vertical axis forcibly aligned with the gravity vector; calculating the relative motion velocity of the limb end within the gravity-compensated torso reference frame, and, when the relative motion velocity is lower than a dynamic velocity threshold generated based on the global movement rate of the commander, extracting a potential intent data frame containing continuous limb pointing data; calculating the dispersion of the limb pointing data within the potential intent data frame to obtain a dynamic jitter residual, determining an opening angle based on the dynamic jitter residual, and constructing a probability cone search space with the average limb pointing as the axis; projecting the probability cone search space onto the environmental semantic map, and identifying objects located within the range of the probability cone search space as candidate objects within the space; and selecting an interaction target from the candidate objects within the space based on their salience weights.
2. The method according to claim 1, characterized in that constructing the gravity-compensated torso reference frame includes: separating the gravitational acceleration component in the inertial sensing data in real time to establish the gravity vector, and configuring the Z-axis of the gravity-compensated torso reference frame to be parallel to the reverse extension line of the gravity vector; extracting the forward torso vector from the inertial sensing data and projecting it onto a geometric plane perpendicular to the Z-axis to construct the X-axis of the gravity-compensated torso reference frame; and generating an attitude correction matrix based on the Z-axis and the X-axis, and using the attitude correction matrix to perform an inverse rotation transformation on the inertial sensing data so as to lock the XY plane of the gravity-compensated torso reference frame parallel to the physical ground plane.
3. The method according to claim 1, characterized in that the logic for generating the dynamic velocity threshold is configured as follows: establishing a positive correlation between the dynamic velocity threshold and the global movement rate of the commander; in response to an increase in the global movement rate, raising the dynamic velocity threshold so that it is greater than the speed of the passive follow-up movement that the limbs make to maintain body balance; and in response to a decrease in the global movement rate, lowering the dynamic velocity threshold so that it covers only the physiological micro-motion speed of the human body in a static state.
4. The method according to claim 3, characterized in that the logic for extracting the potential intent data frame containing continuous limb pointing data is configured to perform dual motion decoupling verification of the limb end: calculating the relative angular velocity of the limb end in real time within the gravity-compensated torso reference frame, and configuring an angular rest threshold that characterizes the active hovering stability of the limb; constructing intent-locking logic that combines the dynamic velocity threshold and the angular rest threshold as parallel constraints, such that the commander is determined to be in a physiological relative freeze state only when the relative motion velocity is lower than the dynamic velocity threshold and the relative angular velocity is lower than the angular rest threshold; and in response to confirmation of the physiological relative freeze state, identifying the current limb movement as an active expression of intent from which the passive swaying component generated by maintaining torso balance has been stripped, and triggering the extraction of the potential intent data frame.
5. The method according to claim 4, characterized in that the logic for constructing the probability cone search space is configured to perform residual-based spatial compensation of uncertainty: parsing the continuous limb pointing data contained in the potential intent data frame to obtain multi-frame pointing vectors, and statistically analyzing the dispersion of the pointing vectors to generate the dynamic jitter residual; constructing adaptive angle adjustment logic that establishes a positive correlation mapping between the opening angle of the probability cone search space and the dynamic jitter residual, so that the opening angle expands dynamically as the dynamic jitter residual increases and forms a spatial redundancy that covers the pointing deviation under the physiological relative freeze state; and calculating the mean of the pointing vectors as the central axis, and generating the probability cone search space from the central axis and the opening angle determined by the adaptive angle adjustment logic.
6. The method according to claim 5, characterized in that the logic for selecting an interaction target from the candidate objects within the space is configured to perform salience-based competitive selection using multidimensional spatial weights: performing spatial envelope retrieval on the environmental semantic map using the probability cone search space to extract the entities located inside the probability cone search space as the candidate objects within the space; constructing two-factor decay evaluation logic that calculates, for each candidate object within the space, the angular offset of its geometric centroid relative to the central axis of the probability cone search space and its Euclidean distance relative to the commander; generating salience weights that are negatively correlated with the angular offset and the Euclidean distance, so as to give higher priority to objects that are closer to the pointing axis and physically closer to the commander; and performing confidence-gated verification in which the object with the highest salience weight is locked as the target to be confirmed, the target to be confirmed is accepted as the interaction target only when its salience weight exceeds a preset confidence threshold, and otherwise the generation of interaction commands is suppressed to keep the system in a standby state.
7. The method according to claim 4, characterized in that it also includes multimodal time-domain backtracking and competition check steps: continuously serializing and storing the potential intent data frames that have been determined to be in the physiological relative freeze state to form a sliding time-domain buffer queue; analyzing the voice command stream of the commander in real time, and, when a semantic keyword containing spatial referential attributes is identified, using the semantic keyword as a time anchor point and performing a reverse index on the sliding time-domain buffer queue based on a preset backtracking time threshold; if multiple non-contiguous potential intent data frame sequences are retrieved within the backtracking time threshold, performing an intent persistence competition check in which the time span weight of each sequence is calculated and the sequence with the highest time span weight is locked as the main intent frame; and using the main intent frame as the sole input source for constructing the probability cone search space.
8. The method according to claim 7, characterized in that it also includes a geometric ambiguity verification step based on semantic constraints: parsing the semantic keyword into a logical direction attribute containing abstract spatial referential information, and solving the geometric pointing feature of the main intent frame within the gravity-compensated torso reference frame; defining a semantic legal domain in three-dimensional space based on the logical direction attribute, and verifying whether the geometric pointing feature falls within the coverage of the semantic legal domain; and when the geometric pointing feature falls into a mutually exclusive quadrant that contradicts the semantic legal domain, determining that there is a spatiotemporal cognitive conflict in the current interaction and intercepting the interaction command.
9. The method according to claim 8, characterized in that it also includes an adaptive correction step based on negative feedback: after locking the interaction target, opening a feedback monitoring window to monitor, in real time, the lateral relative motion speed of the limb end within the gravity-compensated torso reference frame; when the lateral relative motion speed exceeds a preset rejection speed threshold, determining that a negative feedback signal has been received and clearing the currently stored potential intent data frame; and retrieving a preset gain compensation coefficient to correct the parameters of the calculation logic used to determine the opening angle when constructing the probability cone search space, proportionally increasing the expansion ratio of the opening angle relative to the dynamic jitter residual, thereby generating the probability cone search space with a larger envelope range in the next recognition cycle.
10. The method according to claim 9, characterized in that the logic for determining whether a negative feedback signal has been received is configured to perform multimodal asynchronous verification: within the feedback monitoring window, parsing the voice command stream of the commander in parallel to identify negative semantic keywords that express an intention to refuse, and generating a semantic-level negation signal; defining the lateral relative motion speed exceeding the rejection speed threshold as a motion-level negation signal, and setting the semantic-level negation signal and the motion-level negation signal as parallel triggering conditions that serve as backups for each other; and in response to either of the parallel triggering conditions being satisfied, confirming that the negative feedback signal has been received and immediately triggering the clearing of the potential intent data frame and the parameter correction of the calculation logic.
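For illustration only, and without limiting the claims, the probability cone construction and the salience-based selection recited in claims 5 and 6 can be sketched as follows. Several choices here are assumptions not stated in this application: the dispersion is measured as the root-mean-square angular deviation from the mean pointing direction, the opening angle is treated as a half-angle equal to gain times residual, the two decay factors are exponentials, and the parameters `angle_scale`, `dist_scale` and `confidence` are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)


def _angle(u, v):
    """Angle in radians between two unit vectors, clamped for safety."""
    return math.acos(max(-1.0, min(1.0, _dot(u, v))))


def mean_direction(vectors):
    """Central axis: normalized sum of the normalized pointing vectors."""
    units = [_unit(v) for v in vectors]
    return _unit(tuple(sum(c) for c in zip(*units)))


def jitter_residual(vectors, axis):
    """Assumed dispersion measure: RMS angular deviation from the axis."""
    angles = [_angle(_unit(v), axis) for v in vectors]
    return math.sqrt(sum(a * a for a in angles) / len(angles))


def select_target(origin, vectors, objects, gain=3.0,
                  angle_scale=0.2, dist_scale=5.0, confidence=0.3):
    """Return the name of the interaction target, or None (standby)."""
    axis = mean_direction(vectors)
    half_angle = gain * jitter_residual(vectors, axis)
    best_name, best_weight = None, 0.0
    for name, pos in objects.items():
        rel = tuple(p - o for p, o in zip(pos, origin))
        dist = math.sqrt(_dot(rel, rel))
        if dist == 0.0:
            continue
        offset = _angle(_unit(rel), axis)
        if offset > half_angle:
            continue  # outside the probability cone
        # Weight decays with both angular offset and distance.
        weight = math.exp(-offset / angle_scale) * math.exp(-dist / dist_scale)
        if weight > best_weight:
            best_name, best_weight = name, weight
    # Confidence gate: suppress the command if the best weight is too low.
    return best_name if best_weight > confidence else None
```

An object that lies near the central axis and close to the commander wins over one that is farther away or nearer the cone's edge, objects outside the cone are never considered, and raising `confidence` above the best weight returns `None`, which corresponds to keeping the system in standby.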