An Autonomous Control Method and System for Embossed Intelligent Robots Based on AI Agents

Through AI Agent technology, robots autonomously acquire multimodal perception data for structured semantic representation and task planning, bridging the semantic gap between perception and task planning, and achieving adaptive and robust task execution in complex environments.

CN122299686APending Publication Date: 2026-06-30BEIJING BOYAN SHENGKE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING BOYAN SHENGKE TECH CO LTD
Filing Date
2026-06-03
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies in the control of embodied intelligent robots suffer from a semantic gap between perceptual information and task planning, and lack adaptive capabilities, resulting in low robustness and success rate of robots in completing tasks in complex or dynamic environments.

Method used

By adopting an AI Agent-based approach, feature extraction and cross-modal fusion are performed on multimodal perception data to generate a structured semantic representation of the environment. Progressive reasoning is then used to generate hierarchical task planning, and incremental corrections are made during execution to form an autonomous closed-loop control.

Benefits of technology

It improves the robot's comprehensiveness and accuracy in understanding the environment, enhances its adaptability and robustness in dynamic environments, reduces human intervention, and ensures stable task completion in complex scenarios.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122299686A_ABST
    Figure CN122299686A_ABST
Patent Text Reader

Abstract

This invention relates to the field of robot control technology, and more particularly to an autonomous control method and system for embodied intelligent robots based on an AI agent. The method involves acquiring multimodal perception data and generating a structured semantic representation of the environment. An autonomous decision-making agent then infers and generates a hierarchical task plan, which is decomposed into atomic action sequences. These sequences are mapped to kinematic control parameters to generate control commands for execution. Based on feedback deviations, the plan and action sequences are incrementally corrected, forming an autonomous closed-loop control system from perception to execution. This method enables the robot to make autonomous decisions and adaptively adjust its behavior.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of robot control technology, and in particular to an autonomous control method and system for embodied intelligent robots based on AI agents. Background Technology

[0002] In the field of embodied intelligent robot control, existing technologies typically employ a layered architecture that separates perception, planning, and execution. The conventional approach involves collecting environmental data through sensors and converting this data into a sequence of instructions that the robot can understand, using pre-defined rules or templates to drive the robot to complete specific tasks. For example, in industrial scenarios, robots rely on fixed workflows and precise environmental modeling to perform repetitive operations through offline programming or teach-and-playback methods. In the field of service robots, finite state control methods based on state machines are often used, combined with predefined scene and motion libraries to achieve interactive tasks. The core of these methods lies in obtaining control strategies through offline optimization based on static models and deterministic logic.

[0003] Existing conventional methods have significant drawbacks: First, there is a serious semantic gap between perceived information and task planning. Traditional methods simply treat environmental data as numerical values ​​or feature vectors, failing to effectively extract high-level semantic information such as entity relationships, spatial structure, and dynamic attributes. This results in the robot's understanding of the environment remaining at a shallow feature level. When environmental complexity increases or unexpected disturbances occur, this coarse-grained perceptual representation is insufficient to support the robot in making reasonable decision adjustments.

[0004] Secondly, the rigid nature of task planning makes it lack adaptive capabilities. Existing technologies typically generate a complete sequence of actions or a task decomposition tree all at once before execution, and lack an effective dynamic correction mechanism for deviations during execution. When the execution result does not meet expectations or the environment changes abruptly, the robot can often only handle the situation through abnormal termination or manual intervention, and cannot perform incremental optimization of the task hierarchy or action sequence based on real-time feedback. This static planning mode limits the applicability of robots in unstructured, dynamically changing scenarios, reducing the robustness and success rate of task completion. Summary of the Invention

[0005] This invention provides an autonomous control method and system for embodied intelligent robots based on AI agents, which can solve the problems in the prior art.

[0006] A first aspect of the present invention provides an autonomous control method for an embodied intelligent robot based on an AI agent, comprising:

[0007] Acquire multimodal perception data of the robot's environment, perform feature extraction and cross-modal fusion on the multimodal perception data to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment;

[0008] The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution time constraint and preconditions.

[0009] Each atomic action is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters. The control instructions are executed to drive the robot body, and the body state feedback data and environmental change feedback data are continuously collected during the execution process.

[0010] The autonomous decision-making agent incrementally corrects the hierarchical task planning or atomic action sequence based on the deviation between the ontology state feedback data, environmental change feedback data and the expected execution state.

[0011] The correction results are fed back to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

[0012] Based on fused sensing features, a structured semantic representation of the environment state is constructed, including:

[0013] Channel separation is performed on the fused perceptual features to obtain visual semantic feature components and spatial geometric feature components. Based on the visual semantic feature components, semantic category recognition and instance-level segmentation are performed on each physical entity in the environment. A unique entity identifier and semantic category label are assigned to each recognized physical entity.

[0014] Based on spatial geometric feature components, three-dimensional pose estimation and geometric shape reconstruction are performed on each physical entity to generate three-dimensional bounding box information and surface geometric description of each physical entity.

[0015] Based on the 3D bounding box information and surface geometry description of each physical entity, the spatial distance, relative azimuth angle and contact state between entity pairs are calculated, and a spatial topology graph with physical entities as nodes and spatial adjacency relationships as edges is constructed.

[0016] For entity pairs with spatial adjacency in the spatial topology graph, further determine the functional interaction type between the entity pairs, mark the functional interaction type on the corresponding edge, and combine the unique entity identifier, semantic category label, 3D bounding box information, surface geometry description and its adjacency edge information in the spatial topology graph of each physical entity into the semantic attribute record of the physical entity.

[0017] The semantic attribute records of all physical entities and the spatial topological relationship graph with functional interactive annotations are output as a structured semantic representation of the environment state.

[0018] The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and preset task objectives to generate a hierarchical task plan. This hierarchical task plan contains a sequence of sub-tasks with dependencies, including:

[0019] The autonomous decision-making agent analyzes the preset task objectives, decomposes the preset task objectives into several phased objectives with sequential constraints, extracts the physical entities and their semantic attribute records associated with the current phased objective from the structured semantic representation of the environmental state, and constructs a local environmental state view corresponding to the current phased objective.

[0020] Intent-level reasoning is performed based on the local environment state view, mapping the current stage goal to a set of candidate operation intents. The candidate operation intents are the state change directions applied to the target physical entity. The feasibility of each candidate operation intent in the set of candidate operation intents is evaluated, and candidate operation intents that conflict with the spatial constraints of the physical entity in the local environment state view are eliminated, while feasible operation intents are retained.

[0021] Each feasible operation intention is expanded into a corresponding subtask, and each subtask is labeled with the preconditions of the physical entities it depends on.

[0022] For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependency, parallel independence, and mutual exclusion relationships among the subtasks.

[0023] The set of subtasks corresponding to all phased goals, as well as the serial dependencies, parallel independence relationships, and mutual exclusion relationships between subtasks, are integrated into hierarchical task planning.

[0024] For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependencies, parallel independence relationships, and mutual exclusion relationships among the subtasks, including:

[0025] For each subtask, its preceding physical entity state conditions are represented as a set of condition tuples consisting of physical entity identifiers and desired state values;

[0026] For any two subtasks belonging to the same stage objective, their condition tuple sets are compared tuple by tuple. If the expected state value of the first subtask conflicts with the expected state value of the same physical entity in the precondition of the second subtask, the two subtasks are marked as mutually exclusive.

[0027] If the target physical entity of a certain tuple in the condition tuple set of the first subtask is the same as the target physical entity to which the state change is applied in the second subtask, and the expected state value of the first subtask is the same as the state value of the physical entity after the change is applied in the second subtask, then the first subtask is marked as a serial dependency on the second subtask.

[0028] If the condition tuple sets of the two subtasks do not contain any physical entities that are referenced in common, and the target physical entities of the two subtasks are disjoint, then the two subtasks are marked as parallel and independent.

[0029] For subtask pairs marked as mutually exclusive, conflict resolution is achieved by inserting intermediate transitional subtasks or adjusting the execution order constraints of subtasks. The serial dependency and parallel independence relationships determined after conflict resolution are output as the dependency relationships between subtasks.

[0030] Each atomic motion is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters, including:

[0031] For each atomic action in the atomic action sequence, parse the operation type identifier of the atomic action and the entity identifier of the target operation object;

[0032] Based on the operation type identifier, the preset kinematic parameter mapping rule library is queried to determine the end effector attitude template and motion constraints corresponding to the operation type. The motion constraints include the end effector's velocity upper limit constraint and acceleration upper limit constraint.

[0033] Based on the entity identifier of the target operation object, the three-dimensional bounding box information and surface geometric description of the physical entity are retrieved from the structured semantic representation of the environmental state, and the target pose sequence of the end effector during the operation execution process is calculated.

[0034] The target pose sequence of the end effector is input into the inverse kinematics model of the robot body. Under the premise of satisfying the motion constraints, the joint angle time series of each joint of the robot body is obtained. Smoothing interpolation and velocity planning are performed on the joint angle time series of each joint to generate the continuous motion trajectory curve of each joint.

[0035] The continuous motion trajectory curves of each joint are encoded as joint trajectory parameters, the time series of the operation state of the end effector is encoded as end effector operation parameters, and the joint trajectory parameters and end effector operation parameters are aligned in time sequence and then encapsulated into executable control instructions.

[0036] The autonomous decision-making agent incrementally corrects hierarchical task planning or atomic action sequences based on the deviation between ontology state feedback data, environmental change feedback data, and the expected execution state, including:

[0037] Extract the expected execution state description of the currently executing subtask from the hierarchical task planning. The expected execution state description includes the expected ontology state vector and the expected environment state vector.

[0038] The collected ontological state feedback data and the expected ontological state vector are interpolated one dimension at a time to generate an ontological state deviation vector. The collected environmental change feedback data and the expected environmental state vector are interpolated one dimension at a time to generate an environmental state deviation vector. The ontological state deviation vector and the environmental state deviation vector are then weighted and fused to obtain a comprehensive deviation scalar.

[0039] When the overall deviation scalar is lower than the first judgment threshold, the judgment deviation is at the compensable level, and the joint trajectory parameters of the current atomic action are locally compensated and adjusted to maintain the hierarchical task planning unchanged.

[0040] When the comprehensive deviation scalar is between the first and second judgment thresholds, the judgment deviation is at the correction level, and the current subtask and its subsequent unexecuted atomic action sequence are re-decomposed and re-mapped.

[0041] When the overall deviation scalar is higher than the second judgment threshold, the judgment deviation is at the level of needing replanning, triggering the autonomous decision agent to re-reason and re-plan all sub-task sequences that have not yet been executed based on the structured semantic representation of the updated environmental state.

[0042] A second aspect of this invention provides an autonomous control system for an embodied intelligent robot based on an AI Agent, comprising:

[0043] The perception fusion unit is used to acquire multimodal perception data of the robot's environment, extract features from the multimodal perception data and fuse them across modes to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment.

[0044] The task planning unit is used by the autonomous decision-making agent to perform progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution timing constraint and precondition.

[0045] The control execution unit is used to map each atomic action to the kinematic control parameters of the robot body, generate control instructions containing joint trajectory parameters and end effector operation parameters, execute control instructions to drive the robot body, and continuously collect body state feedback data and environmental change feedback data during the execution process.

[0046] The feedback correction unit is used by the autonomous decision-making agent to incrementally correct the hierarchical task planning or atomic action sequence based on the deviation between the ontology state feedback data, environmental change feedback data and the expected execution state.

[0047] The closed-loop control unit is used to feed back the correction results to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

[0048] A third aspect of the present invention provides an electronic device, comprising:

[0049] processor;

[0050] Memory used to store processor-executable instructions;

[0051] The processor is configured to invoke instructions stored in the memory to execute the aforementioned method.

[0052] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0053] The fusion of multimodal perception data and structured semantic representation enable robots to accurately recognize the semantic attributes and spatial topological relationships of environmental entities. Environmental representation based on entity attributes and spatial relationships eliminates the ambiguity of single modalities, significantly improving the comprehensiveness and accuracy of perception, providing a solid basis for subsequent decision-making, and reducing the probability of erroneous judgments due to perceptual biases. Simultaneously, structured semantic representation supports rapid updates to the dynamic environment, enabling robots to adapt to scene changes and maintain cognitive consistency.

[0054] The autonomous decision-making agent generates hierarchical task planning through progressive reasoning, decomposing complex goals into a sequence of dependent subtasks. These subtasks are further refined into atomic actions with clearly defined temporal constraints and preconditions. This multi-level decomposition ensures the logical rigor and safety of task execution, avoiding action conflicts or sequence errors. Atomic actions are directly mapped to control commands that map joint trajectories to end effector parameters, enabling the robot to execute operations efficiently and accurately, improving the consistency and reliability of action execution, and reducing the conversion losses from high-level planning to low-level control.

[0055] During execution, the robot continuously collects feedback on its own state and environmental changes. By calculating the deviation from the expected execution state, the autonomous decision-making agent incrementally corrects the hierarchical planning or atomic action sequence. The correction results are immediately fed back to the control command generation stage, forming a closed-loop control from perception to execution to feedback and adaptive adjustment. This mechanism endows the robot with the ability to dynamically compensate for execution errors and environmental disturbances, automatically adjusting its strategy without human intervention. This significantly enhances the system's robustness and adaptability, ensuring stable task completion in complex dynamic environments. Attached Figure Description

[0056] Figure 1 This is a flowchart illustrating the autonomous control method for embodied intelligent robots based on AI Agents. Detailed Implementation

[0057] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0058] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0059] Figure 1 This is a flowchart illustrating the autonomous control method for an embodied intelligent robot based on an AI Agent, as described in an embodiment of the present invention. The autonomous control method for an embodied intelligent robot based on an AI Agent includes:

[0060] Acquire multimodal perception data of the robot's environment, perform feature extraction and cross-modal fusion on the multimodal perception data to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment;

[0061] The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution time constraint and preconditions.

[0062] Each atomic action is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters. The control instructions are executed to drive the robot body, and the body state feedback data and environmental change feedback data are continuously collected during the execution process.

[0063] The autonomous decision-making agent incrementally corrects the hierarchical task planning or atomic action sequence based on the deviation between the ontology state feedback data, environmental change feedback data and the expected execution state.

[0064] The correction results are fed back to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

[0065] Based on fused sensing features, a structured semantic representation of the environment state is constructed, including:

[0066] Channel separation is performed on the fused perceptual features to obtain visual semantic feature components and spatial geometric feature components. Based on the visual semantic feature components, semantic category recognition and instance-level segmentation are performed on each physical entity in the environment. A unique entity identifier and semantic category label are assigned to each recognized physical entity.

[0067] Based on spatial geometric feature components, three-dimensional pose estimation and geometric shape reconstruction are performed on each physical entity to generate three-dimensional bounding box information and surface geometric description of each physical entity.

[0068] Based on the 3D bounding box information and surface geometry description of each physical entity, the spatial distance, relative azimuth angle and contact state between entity pairs are calculated, and a spatial topology graph with physical entities as nodes and spatial adjacency relationships as edges is constructed.

[0069] For entity pairs with spatial adjacency in the spatial topology graph, further determine the functional interaction type between the entity pairs, mark the functional interaction type on the corresponding edge, and combine the unique entity identifier, semantic category label, 3D bounding box information, surface geometry description and its adjacency edge information in the spatial topology graph of each physical entity into the semantic attribute record of the physical entity.

[0070] The semantic attribute records of all physical entities and the spatial topological relationship graph with functional interactive annotations are output as a structured semantic representation of the environment state.

[0071] After obtaining the fused perceptual features from multimodal perception data, it is necessary to extract structured semantic information so that the autonomous decision-making agent can understand the attributes of various entities in the environment and their interrelationships. Fusion perceptual features are typically high-dimensional tensor representations obtained by cross-modal fusion of visual image features, depth point cloud features, and other sensor features, where different channels carry information from different modalities. Channel separation is performed on the fused perceptual features, splitting them into two subsets according to the semantic attributes of the feature channels: visual semantic feature components and spatial geometric feature components. The visual semantic feature components mainly originate from the RGB image encoding path, preserving visual appearance information such as texture, color, and shape; the spatial geometric feature components mainly originate from the depth map or point cloud encoding path, preserving geometric structure information such as 3D coordinates and normal vectors. Channel separation can be achieved by setting channel masks in the feature fusion network, constraining the semantic attribution of each channel during the training phase, ensuring that the two types of components have clear functional distinctions after separation.

[0072] Based on visual semantic feature components, semantic category recognition and instance-level segmentation are performed on each physical entity appearing in the current frame's environmental image. Semantic category recognition uses a multi-class classification head to map each candidate region to a predefined set of semantic categories, outputting the label with the highest category confidence as the entity's semantic category label. Instance-level segmentation further distinguishes different individuals within the same category based on semantic recognition, generating a pixel-level instance mask for each independent physical entity. After instance segmentation, a unique entity identifier is assigned to each identified physical entity. This identifier remains unchanged throughout the current task execution cycle and is used for cross-frame tracking and attribute association of the same entity in subsequent processing. The unique entity identifier generation strategy can adopt a hash-based approach, combining and encoding the entity's category label, the first-appearance frame timestamp, and its centroid coordinates in the image to ensure the identifier's uniqueness and reproducibility.

[0073] Spatial geometric feature components are used for 3D pose estimation and geometric reconstruction of each physical entity. The goal of 3D pose estimation is to determine the position vector and rotation matrix of each physical entity in the robot coordinate system or world coordinate system, i.e., the six-degree-of-freedom pose. Pose estimation can be achieved by projecting an instance segmentation mask onto the corresponding depth point cloud region, fitting the object coordinate system using the spatial distribution of the point cloud, and then obtaining the pose parameters through PnP solving or point cloud registration methods. Geometric reconstruction, based on pose estimation, performs bounding box fitting on the point cloud subset corresponding to each entity, generating axis-aligned bounding boxes or directed bounding boxes, and recording the center coordinates, three-axis dimensions, and orientation angle of the bounding boxes. Surface geometric description further extracts the normal vector distribution, curvature features, and key geometric feature points of the entity surface for subsequent contact state judgment and operation point planning. The 3D bounding box information and surface geometric description together constitute the 3D geometric profile of each physical entity, providing basic data for spatial relationship calculations.

[0074] Based on the 3D bounding box information and surface geometry descriptions of all physical entities, calculate the spatial relationship attributes between any pair of entities, including spatial distance, relative azimuth, and contact state. Spatial distance is defined as the shortest Euclidean distance between the surfaces of the bounding boxes of two entities. If this distance is below a preset adjacency threshold... When two entities are adjacent, a spatial adjacency is determined. The relative azimuth angle describes the direction of one entity relative to another, and can be represented by the projection angles of the vector connecting the centroids of the two entities in the horizontal and vertical directions. This is used to generate directional semantic labels such as "above," "left," and "front." The contact state is determined by detecting the overlap of the bounding boxes of the two entities or the minimum distance between their surface point clouds. When the minimum distance is below the contact determination threshold... When the state is reached, it is marked as a contact state. Based on the above calculation results, a spatial topology graph is constructed with each physical entity as a node and the connections between pairs of entities with spatial adjacency as edges. Each edge in the graph carries attributes such as spatial distance value, relative azimuth label, and contact state flag, which fully records the geometric spatial relationship between two adjacent entities.

[0075] For entity pairs with spatial adjacency in the spatial topology graph, their functional interaction types are further analyzed. Functional interaction types reflect the potential association between two entities at the functional or operational level. For example, a "support" relationship indicates that one entity bears the weight of another; a "contains" relationship indicates that one entity is inside the internal space of another; and a "graspable" relationship indicates that a robot's end effector can manipulate an entity. The determination of functional interaction types comprehensively considers the semantic category label combination, contact state, relative orientation angle, and geometric size ratio of the two entities. For example, when an entity labeled "cup" is above an entity labeled "tabletop" and in contact, a "support" type functional interaction can be determined. The determination of functional interaction types can be achieved through a predefined rule base or through reasoning based on a semantic knowledge graph. The determination result is appended as annotation information to the corresponding edges of the spatial topology graph, so that each edge in the graph carries both geometric spatial relationship and semantic functional relationship information.

[0076] Each physical entity's unique entity identifier, semantic category label, 3D bounding box information, surface geometry description, and attribute information of all its adjacent edges in the spatial topology graph are combined to form a complete semantic attribute record for that physical entity. The semantic attribute record is stored in a structured data format, with each record corresponding to a physical entity. It includes static attribute fields (such as semantic category label and geometric dimensions) and dynamic attribute fields (such as current 3D pose and contact state), facilitating real-time updates of dynamic attributes during task execution without affecting the integrity of static attributes. The set of semantic attribute records for all physical entities is integrated with the spatial topology graph labeled with functional interactions into a unified data structure, which is the structured semantic representation of the current environmental state. This structured semantic representation includes the individual attributes of each entity in the environment and fully describes the spatial topology and functional interactions between entities through a graph structure, providing semantically rich and structurally clear environmental description input for subsequent reasoning by the autonomous decision-making agent.

[0077] The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and preset task objectives to generate a hierarchical task plan. This hierarchical task plan contains a sequence of sub-tasks with dependencies, including:

[0078] The autonomous decision-making agent analyzes the preset task objectives, decomposes the preset task objectives into several phased objectives with sequential constraints, extracts the physical entities and their semantic attribute records associated with the current phased objective from the structured semantic representation of the environmental state, and constructs a local environmental state view corresponding to the current phased objective.

[0079] Intent-level reasoning is performed based on the local environment state view, mapping the current stage goal to a set of candidate operation intents. The candidate operation intents are the state change directions applied to the target physical entity. The feasibility of each candidate operation intent in the set of candidate operation intents is evaluated, and candidate operation intents that conflict with the spatial constraints of the physical entity in the local environment state view are eliminated, while feasible operation intents are retained.

[0080] Each feasible operation intention is expanded into a corresponding subtask, and each subtask is labeled with the preconditions of the physical entities it depends on.

[0081] For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependency, parallel independence, and mutual exclusion relationships among the subtasks.

[0082] The set of subtasks corresponding to all phased goals, as well as the serial dependencies, parallel independence relationships, and mutual exclusion relationships between subtasks, are integrated into hierarchical task planning.

[0083] After receiving a structured semantic representation of the environmental state, the autonomous decision-making agent first performs semantic parsing on the preset task objective. The preset task objective is usually given in the form of natural language or structured instructions, such as "assemble part A on the workbench to the designated position on bracket B." During the parsing process, the autonomous decision-making agent identifies the core operational semantics, target physical entities, and expected final state descriptions involved in the task objective, and decomposes it into several phased objectives with sequential constraints based on the task's inherent logic. Taking the assembly task mentioned above as an example, it can be decomposed into phased objectives such as "locating part A," "grabbing part A," "moving to the vicinity of bracket B," and "performing the assembly operation." There is a strict dependency between each phased objective; that is, the completion of the previous phased objective is a necessary condition for the initiation of the next phased objective.

[0084] For each stage objective, physical entities and their semantic attribute records directly related to that stage objective are extracted from the structured semantic representation of the environmental state. The structured semantic representation includes semantic fields such as category labels, pose information, functional attributes, and operability status of each entity within the environment, as well as a description of the spatial topological relationships between entities. For the stage objective of "locating part A," only the pose information, visibility status, and spatial distribution of surrounding obstacles of part A need to be extracted; for the "performing assembly operations" stage, the current pose of part A, the interface position and orientation of bracket B, and information on any occluding entities that may exist on the assembly path need to be extracted simultaneously. Through this on-demand filtering mechanism, a corresponding local environmental state view is constructed for each stage objective, avoiding global information redundancy from interfering with the reasoning process and reducing the complexity of reasoning computation.

[0085] Based on the local environment state view, the autonomous decision-making agent enters the intent-level reasoning stage. The core of intent-level reasoning lies in mapping the difference between the expected final state of the stage objective and the current local environment state into a set of candidate operation intents. Candidate operation intents describe the direction of state change applied to the target physical entity, such as "changing the position of part A," "changing the orientation of part A," or "establishing contact constraints between part A and support B." After generating the set of candidate operation intents, a feasibility assessment is performed on each candidate operation intent. The feasibility assessment is based on the spatial constraint information of each physical entity in the local environment state view, including the entity's occupancy range, motion degree of freedom restrictions, and topological constraints such as support, containment, and obstruction between entities. If a candidate operation intent cannot be realized under the current spatial constraints, for example, if the target entity is completely occluded by other entities and cannot be bypassed, then that candidate operation intent is removed from the set. After feasibility screening, the retained candidate operation intents constitute the set of feasible operation intents, serving as the basis for subsequent subtask development.

[0086] When feasible operational intentions are broken down into specific subtasks, each feasible operational intention corresponds to one or more subtasks. The granularity of a subtask lies between a stage goal and an atomic action, describing the specific operational unit that the robot needs to complete, such as "extend the robotic arm above part A" or "close the end effector to grip part A." Simultaneously with subtask generation, the preconditions of the physical entities that each subtask depends on are labeled. Preconditions are the environmental state requirements that must be met before the subtask can be initiated. For example, the precondition for "closing the end effector to grip part A" is "the end effector has reached the gripping position of part A and the end effector is in the open state." Precise labeling of preconditions provides a structured basis for subsequent dependency analysis.

[0087] Performing a cross-analysis of the prerequisite physical entity state conditions for all subtasks belonging to the same stage objective is a crucial step in determining the relationships between subtasks. The logic of the cross-analysis is as follows: If the subtasks... The preconditions include requirements for the state of a certain physical entity, and this state happens to be a subtask. If the result is generated by the execution, then it is determined. and There is a serial dependency relationship between them, that is... Must Previously completed; if two subtasks and If the preconditions are that neither operation involves the other's output state, and the physical entities operated on by both do not interfere with each other in space, then the determination is made. and There is a parallel and independent relationship between them, allowing for concurrent scheduling when execution resources are sufficient; if two subtasks and If the execution of all actions requires exclusive access to the same physical resource (such as an end effector) or the application of contradictory state changes to the same entity, then the decision is made. and There are mutual exclusion relationships between them, meaning that only one subtask can be in execution at any given time. The results of the above three types of relationships are organized in the form of a directed graph. Directed edges represent serial dependencies, and node pairs without connecting edges are marked as parallel independent if they meet the parallel condition. Mutual exclusion relationships are recorded with special annotations in the edge attributes of the graph structure.

[0088] After analyzing the subtask relationships within a single stage objective, the set of subtasks corresponding to all stage objectives and their relationship graphs are integrated across stages. During the integration process, the connection dependencies between adjacent stage objectives are explicitly established; that is, the output state of the finally completed subtask in the previous stage objective serves as one of the prerequisites for the first subtask in the next stage objective. Through this nested dependency network, a hierarchical task plan covering all stage objectives is ultimately formed. Structurally, the hierarchical task plan presents a nested organization of three levels: a top-level stage objective sequence, a middle-level subtask dependency graph, and a bottom-level prerequisite annotation, providing a logically rigorous and directly executable task structure framework for subsequent atomic action decomposition.

[0089] In practical applications, when a robot faces a dynamically changing operating environment, the construction of the local environmental state view is dynamically updated as the perceived data is updated. The intent-level reasoning and feasibility assessment processes also locally update the set of subtasks in the hierarchical task planning accordingly, ensuring that the task planning always remains consistent with the current environmental state. This progressive reasoning mechanism based on local views effectively reduces the computational overhead of global replanning while ensuring the hierarchical task planning's rapid response to environmental changes.

[0090] For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependencies, parallel independence relationships, and mutual exclusion relationships among the subtasks, including:

[0091] For each subtask, its preceding physical entity state conditions are represented as a set of condition tuples consisting of physical entity identifiers and desired state values;

[0092] For any two subtasks belonging to the same stage objective, their condition tuple sets are compared tuple by tuple. If the expected state value of the first subtask conflicts with the expected state value of the same physical entity in the precondition of the second subtask, the two subtasks are marked as mutually exclusive.

[0093] If the target physical entity of a certain tuple in the condition tuple set of the first subtask is the same as the target physical entity to which the state change is applied in the second subtask, and the expected state value of the first subtask is the same as the state value of the physical entity after the change is applied in the second subtask, then the first subtask is marked as a serial dependency on the second subtask.

[0094] If the condition tuple sets of the two subtasks do not contain any physical entities that are referenced in common, and the target physical entities of the two subtasks are disjoint, then the two subtasks are marked as parallel and independent.

[0095] For subtask pairs marked as mutually exclusive, conflict resolution is achieved by inserting intermediate transitional subtasks or adjusting the execution order constraints of subtasks. The serial dependency and parallel independence relationships determined after conflict resolution are output as the dependency relationships between subtasks.

[0096] After completing hierarchical task planning, a systematic dependency analysis is needed for all subtasks belonging to the same stage goal to ensure the correctness and efficiency of subsequent execution scheduling. The core of this analysis lies in revealing the potential serial dependencies, parallel independence, and mutual exclusion relationships between subtasks through cross-comparison of the state conditions of the preceding physical entities, thereby providing the task scheduler with an accurate execution constraint graph.

[0097] For each subtask, its preceding physical entity state conditions are formally represented as a set of condition tuples. Each condition tuple consists of two components: a physical entity identifier and a desired state value, i.e. ,in Represents the physical entity identifier. This represents the expected state value that the entity must meet before the subtask begins. Taking the subtask of "placing object A on tray B" as an example, its set of precondition tuples might contain two tuples: "object A is in a grasping state" and "tray B is in an empty state." This formal representation transforms the task preconditions described in natural language into computable, comparable structured data, laying the foundation for subsequent cross-analysis.

[0098] For any two subtasks belonging to the same stage objective, traverse the condition tuple sets of both subtasks tuple by tuple and perform conflict detection. The conflict detection criterion is: if a tuple in the condition tuple set of the first subtask... The same physical entity identifier referenced in the condition tuple set of the second subtask tuple satisfy and and If two subtasks are physically incompatible, they are considered mutually exclusive. For example, if subtask X requires "valve V to be in the open state," while subtask Y requires "valve V to be in the closed state," their desired state values ​​for the same physical entity, valve V, conflict and cannot be satisfied simultaneously; therefore, they are marked as mutually exclusive. Mutual exclusion means that the two subtasks cannot be executed concurrently at the same time point and must be handled through subsequent conflict resolution mechanisms.

[0099] The logic for determining serial dependencies is based on causal chain analysis of state generation and state consumption. If a precondition tuple of the first subtask references a physical entity... This happens to be the target physical entity to which the second subtask will apply a state change after execution, and the first subtask expects this entity to be in a certain state. The second subtask, upon completion, happens to change the entity's state to... If the first subtask is logically dependent on the completion of the second subtask, then the first subtask is marked as having a serial dependency relationship with the second subtask. For example, the subtask "grab screw C" requires screw C to be in a "positioned" state, and the subtask "visually position screw C" changes the state of screw C to "positioned" after execution. Therefore, "grab screw C" is serially dependent on "visually position screw C" and can only start after the latter is completed. This causal chain analysis can automatically extract implicit execution order constraints from the set of condition tuples without requiring manual specification of dependencies.

[0100] The determination of parallel independence is based on the disjointness test of physical entity references. For two subtasks, if their condition tuple sets do not contain any commonly referenced physical entity identifiers, and there is no intersection between the target physical entity sets of the two subtasks, then the two subtasks are considered completely decoupled at the physical resource level and marked as parallel independent. Parallel independent subtasks can be scheduled and executed simultaneously if their respective preconditions are met, thereby significantly improving the robot's task execution efficiency. For example, if the subtasks "adjusting the left arm joint angle" and "detecting the right worktable status" do not involve any common physical entities, they can be executed in parallel, making full use of the robot's multi-actuator concurrency capabilities.

[0101] After marking the three types of relationships, conflict resolution is performed on subtask pairs marked as mutually exclusive. Two strategies are used for conflict resolution: inserting intermediate transition subtasks and adjusting execution order constraints. The strategy of inserting intermediate transition subtasks is suitable for situations where two subtasks have state conflicts regarding the same physical entity but neither can be omitted. This is achieved by inserting an intermediate transition subtask between the two to change the state of the physical entity. Switch to Transitional subtasks transform two originally mutually exclusive subtasks into a sequential chain of tasks with serial dependencies. For example, if subtask X requires "container D to be in a sealed state" and subtask Y requires "container D to be in an open state," transitional subtasks "open container D" or "seal container D" can be inserted, with the appropriate insertion position and transition direction determined based on the overall task logic. The strategy of adjusting execution order constraints is suitable for situations where state conflicts can be avoided by rearranging the execution sequence of subtasks. By introducing mandatory sequential constraints for mutually exclusive subtask pairs, one subtask is allowed to start only after the other has completed its state recovery or transition, thus resolving conflicts without adding additional subtasks.

[0102] After conflict resolution, the final determined serial dependencies and parallel independence relationships are output as a directed acyclic graph (DAG), serving as the dependency graph for subtask scheduling. Each node in the graph corresponds to a subtask, and directed edges represent serial dependencies (the direction of the edge is from the dependent subtask to the dependent subtask). Unconnected node pairs are assumed to be parallel independent. The scheduler uses this graph for topological sorting to determine the legal execution order of each subtask and performs concurrent scheduling optimization based on the parallel independence relationships during runtime. The entire cross-analysis and conflict resolution process is statically executed during the task planning phase, and the generated dependency graph is used as a scheduling constraint during the execution phase, thereby ensuring the robot's execution correctness and resource utilization efficiency in complex multi-step tasks.

[0103] Each atomic motion is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters, including:

[0104] For each atomic action in the atomic action sequence, parse the operation type identifier of the atomic action and the entity identifier of the target operation object;

[0105] Based on the operation type identifier, the preset kinematic parameter mapping rule library is queried to determine the end effector attitude template and motion constraints corresponding to the operation type. The motion constraints include the end effector's velocity upper limit constraint and acceleration upper limit constraint.

[0106] Based on the entity identifier of the target operation object, the three-dimensional bounding box information and surface geometric description of the physical entity are retrieved from the structured semantic representation of the environmental state, and the target pose sequence of the end effector during the operation execution process is calculated.

[0107] The target pose sequence of the end effector is input into the inverse kinematics model of the robot body. Under the premise of satisfying the motion constraints, the joint angle time series of each joint of the robot body is obtained. Smoothing interpolation and velocity planning are performed on the joint angle time series of each joint to generate the continuous motion trajectory curve of each joint.

[0108] The continuous motion trajectory curves of each joint are encoded as joint trajectory parameters, the time series of the operation state of the end effector is encoded as end effector operation parameters, and the joint trajectory parameters and end effector operation parameters are aligned in time sequence and then encapsulated into executable control instructions.

[0109] In the process of converting atomic action sequences into executable control instructions, for each atomic action in the sequence, two key attributes are first parsed: the operation type identifier and the entity identifier of the target operation object. The operation type identifier is a structured encoding of the semantic type of the action, such as "grab," "place," "push," and "twist," with each type corresponding to a different end effector motion mode and contact strategy. The entity identifier of the target operation object corresponds one-to-one with the physical entity nodes in the structured semantic representation of the environmental state, used to accurately retrieve the geometric and pose information of the target object in subsequent steps. This parsing process ensures that the subsequent kinematic parameter generation has clear semantic input, avoiding parameter mapping errors caused by ambiguous action descriptions.

[0110] Based on the parsed operation type identifier, a query is performed in the pre-defined kinematic parameter mapping rule base to obtain the end effector posture template and motion constraints corresponding to that operation type. The kinematic parameter mapping rule base is an offline-built structured knowledge base. Each record uses the operation type identifier as an index key and stores posture template information such as the end effector approach direction vector, gripper opening and closing state sequence, and wrist posture deflection range matching that operation type, as well as the allowed upper limits of velocity and acceleration for the end effector when performing the operation. The upper limits of velocity and acceleration are introduced to prevent damage to the mechanical structure due to excessive dynamic loads while ensuring operational accuracy. For example, for precision assembly operations, the values ​​of the upper limits of velocity and acceleration are relatively small, while for operations involving large-scale spatial movement, their values ​​can be relaxed accordingly. The content of the rule base can be calibrated and updated offline based on the specific mechanical parameters of the robot and the task scenario.

[0111] After obtaining the end effector posture template, the 3D bounding box information and surface geometry description of the physical entity are retrieved from the structured semantic representation of the environmental state based on the entity identifier of the target object. The 3D bounding box information provides the spatial occupancy range of the target object in the world coordinate system, including the center position coordinates and three-axis dimensions; the surface geometry description provides fine geometric features such as the local surface normal vector distribution and the annotation of the grasping candidate surface of the target object. Combining the above geometric information with the end effector posture template, the target pose sequence that the end effector needs to reach in sequence during the entire operation is obtained through contact point calculation and collision detection. Each pose in the target pose sequence is composed of the 3D position vector of the end effector in the world coordinate system and the quaternion rotation representation. The number of pose points is related to the division of the operation stage. For example, the "grasping" operation usually includes multiple key pose points such as the approach stage pose, the pre-grasping pose, the contact pose, and the lifting pose.

[0112] The target pose sequence of the end effector is input into the inverse kinematics model of the robot body. Under the premise of satisfying the upper limit constraints on velocity and acceleration, the joint angle values ​​of each joint of the robot body at each target pose time are obtained, forming the joint angle time series of each joint. During the inverse kinematics solution process, joint angle range constraints, singular configuration avoidance, and self-collision constraints must be considered simultaneously. For cases with multiple solutions, the minimum joint displacement principle is used to select the solution closest to the current joint state from the candidate solution set to reduce unnecessary joint motion amplitude. When the inverse kinematics has no feasible solution at a certain target pose, a fine-tuning mechanism for the target pose is triggered. In the neighborhood of the target pose, an reachable alternative pose is searched with a small step size, and the target pose sequence is updated before resolving until feasible joint angle solutions are obtained for the entire sequence.

[0113] After obtaining the joint angle time series of each joint, smoothing interpolation and velocity planning are performed to generate continuous motion trajectory curves for each joint. Smoothing interpolation uses cubic spline interpolation to construct curve segments between adjacent joint angle control points that satisfy the continuity of the first and second derivatives, eliminating velocity abrupt changes and shocks caused by the broken-line motion between discrete sampling points. Based on the interpolated curves, velocity planning uses upper limits for velocity and acceleration as boundary conditions, employing trapezoidal or S-shaped velocity profiles to perform temporal shaping of the motion velocity of each joint, ensuring that the velocity and acceleration of the joint do not exceed the upper limits of the constraints throughout the entire motion. For multi-joint coordinated motion scenarios, temporal synchronization processing between joints is also required, ensuring that each joint completes its motion segment on the same time axis, guaranteeing that the end effector moves smoothly along the expected trajectory without attitude jitter.

[0114] After the continuous motion trajectory curves of each joint are generated, they are encoded as joint trajectory parameters. The specific form of the joint trajectory parameters is a sequence of joint angle commands with a sampling interval of the control cycle. Each sampling moment corresponds to the angle command values ​​of all controlled joints of the robot body, and the encoding format is consistent with the underlying controller interface protocol of the robot body. Simultaneously, the opening and closing states, gripping force commands, and other operational states of the end effector during operation are time-series encoded according to the same control cycle to form end effector operation parameters. These end effector operation parameters describe the action commands of the gripper or other end tool at each control moment, such as triggering a gripper closing command at the contact pose moment or maintaining a constant gripping force during the lifting phase.

[0115] The joint trajectory parameters and end effector operation parameters are aligned along a unified time axis to ensure strict synchronization between joint movement and end effector operation. This avoids timing misalignment issues, such as the joint reaching the target position before the end effector responds or the end effector triggering an operation while the joint is still moving. The aligned joint trajectory parameters and end effector operation parameters are then combined according to a predetermined control instruction encapsulation format to generate a complete executable control instruction. This control instruction contains complete timing information and execution parameters and can be directly sent to the robot's underlying controller to drive the robot to physically execute the corresponding atomic actions according to the planned motion trajectory. The entire mapping process executes each atomic action in the atomic action sequence sequentially, ultimately forming a control instruction sequence covering the entire subtask execution cycle, providing a precise execution benchmark for subsequent execution monitoring and adaptive feedback adjustments.

[0116] The autonomous decision-making agent incrementally corrects hierarchical task planning or atomic action sequences based on the deviation between ontology state feedback data, environmental change feedback data, and the expected execution state, including:

[0117] Extract the expected execution state description of the currently executing subtask from the hierarchical task planning. The expected execution state description includes the expected ontology state vector and the expected environment state vector.

[0118] The collected ontological state feedback data and the expected ontological state vector are interpolated one dimension at a time to generate an ontological state deviation vector. The collected environmental change feedback data and the expected environmental state vector are interpolated one dimension at a time to generate an environmental state deviation vector. The ontological state deviation vector and the environmental state deviation vector are then weighted and fused to obtain a comprehensive deviation scalar.

[0119] When the overall deviation scalar is lower than the first judgment threshold, the judgment deviation is at the compensable level, and the joint trajectory parameters of the current atomic action are locally compensated and adjusted to maintain the hierarchical task planning unchanged.

[0120] When the comprehensive deviation scalar is between the first and second judgment thresholds, the judgment deviation is at the correction level, and the current subtask and its subsequent unexecuted atomic action sequence are re-decomposed and re-mapped.

[0121] When the overall deviation scalar is higher than the second judgment threshold, the judgment deviation is at the level of needing replanning, triggering the autonomous decision agent to re-reason and re-plan all sub-task sequences that have not yet been executed based on the structured semantic representation of the updated environmental state.

[0122] During execution, the expected execution state description corresponding to the currently executing subtask is continuously extracted from the hierarchical task planning. This expected execution state description is stored in the form of a structured vector, consisting of two components: the expected body state vector and the expected environment state vector. The expected body state vector covers dimensions such as the target angles of each joint of the robot, the target pose of the end effector, and the expected motion velocity; the expected environment state vector covers dimensions such as the position, orientation, contact state, and spatial relationship with surrounding entities of the target manipulated object. Both types of vectors are automatically generated by the autonomous decision agent based on the subtask objectives during the task planning phase and are bound to the corresponding subtask for storage, serving as a benchmark reference for subsequent deviation evaluation.

[0123] The collected body state feedback data originates from joint encoders, inertial measurement units, and torque sensors on the robot body, reflecting the robot's actual motion state during execution. The body state feedback data is then compared with the expected body state vector using a one-dimensional difference calculation to obtain the body state deviation vector. The difference in each dimension reflects the actual deviation of that state component. Similarly, environmental change feedback data comes from real-time perception of the environment by visual sensors, depth cameras, and tactile sensors. The environmental state deviation vector is obtained by calculating the dimension-by-dimensional difference between this data and the expected environmental state vector. In the dimension-by-dimensional difference calculation, the absolute value of the angle difference is used for angle components, the Euclidean distance difference is used for position components, and the 0-1 encoded difference is used for discrete state components to ensure that the differences of various physical quantities are comparable.

[0124] Ontology state deviation vector Deviation vector from environmental state By performing weighted fusion, a comprehensive deviation scalar is obtained. The weighted fusion process first calculates the weighted norm of each of the two bias vectors, and then assigns different fusion weights to the ontology bias and the environment bias according to the task type. Specifically, the comprehensive bias scalar... The calculation method is as follows:

[0125]

[0126] in, The fusion weights are the ontology state deviations. The fusion weights for environmental state deviations, , This represents a weighted norm operation, where the weight of each dimension is determined by its influence on the execution result of the current subtask. For precision manipulation subtasks, the end effector pose deviation has a higher weight; for navigation and movement subtasks, the body position deviation has a higher weight; and for object interaction subtasks, the environment state deviation has a higher weight. The weight configuration is determined along with the subtask type label during the task planning phase and does not need to be dynamically estimated during execution.

[0127] When the comprehensive deviation scalar Below the first judgment threshold At this point, the current deviation is determined to be at the compensable level. The deviation is relatively small, and the overall task progress is not substantially affected; no adjustments to the sub-task sequence or task planning are required. For compensable deviations, only the joint trajectory parameters of the current atomic action are locally compensated. Specifically, a compensation increment is superimposed on the original joint trajectory. This compensation increment is determined by the body state deviation vector. The components related to joint trajectory are calculated through inverse kinematic mapping. The compensated and adjusted joint trajectory parameters are directly sent to the underlying motion controller, maintaining the structure and content of the hierarchical task planning, and the atomic action sequences of subsequent subtasks remain unaffected. This lightweight compensation mechanism can quickly eliminate minor disturbances during execution without triggering recalculation of the planning layer, ensuring task continuity.

[0128] When the comprehensive deviation scalar Between the first judgment threshold With the second judgment threshold When the deviation is determined to be at the level requiring correction, the degree of deviation exceeds the capacity of local compensation but has not yet caused the overall task planning to fail. For deviations requiring correction, the incomplete parts of the current subtask and the subsequent unexecuted atomic action sequences are re-decomposed and remapped. The re-decomposition process uses the updated ontology state and environment state as new initial conditions, regenerates the atomic action sequence based on the target description of the current subtask, and redefines the execution timing constraints and preconditions for each atomic action. The remapping process then converts the newly generated atomic action sequence back into joint trajectory parameters and end effector operation parameters, generating corrected control commands. The entire correction process is completed at the subtask level. The results of completed subtasks and the target descriptions of subsequent subtasks in the hierarchical task planning remain unchanged; only the action-level implementation paths of the current and subsequent subtasks are updated. This medium-granularity correction strategy can flexibly respond to the need for action path changes caused by local execution deviations while maintaining the stability of the macro-planning.

[0129] When the comprehensive deviation scalar Higher than the second judgment threshold When the deviation is deemed to be at the level requiring replanning, the deviation is considered severe, indicating a fundamental difference between the current execution environment or ontology state and the original planning assumptions. Continuing with the original hierarchical task planning will fail to achieve the task objective. The autonomous decision-making agent is then triggered to re-reason and re-plan all unexecuted sub-task sequences based on the updated structured semantic representation of the environment state. During the replanning process, the autonomous decision-making agent uses the execution results of currently completed sub-tasks as known premises and the preset task objective as the final planning state to re-perform progressive reasoning on the remaining task space, generating new sub-task sequences and their dependency graphs. The newly generated hierarchical task plan replaces the unexecuted parts of the original plan; the results of completed sub-tasks are not rolled back, ensuring that existing execution results are fully utilized. After replanning is completed, new control instructions are generated and executed according to the normal sub-task decomposition and atomic action mapping process.

[0130] The core advantage of the three-level deviation judgment mechanism lies in matching the computational cost of deviation response with the actual severity of the deviation. This avoids triggering high-cost global replanning for minor deviations and also prevents task failure due to only local compensation for severe deviations. First judgment threshold. With the second judgment threshold The specific values ​​are pre-configured based on the task type, environmental complexity, and the robot's motion accuracy characteristics. In precision assembly tasks, both thresholds are set to smaller values ​​to improve correction sensitivity, while in coarse-grained handling tasks, both thresholds are set to larger values ​​to reduce unnecessary intervention at the planning layer. The entire incremental correction mechanism is tightly coupled with the perception, planning, and execution stages, enabling the autonomous decision-making agent to maintain effective control over the task execution process in a dynamically changing real-world environment, ensuring the robustness of the embodied intelligent robot's autonomous execution in complex task scenarios.

[0131] A second aspect of this invention provides an autonomous control system for an embodied intelligent robot based on an AI Agent, comprising:

[0132] The perception fusion unit is used to acquire multimodal perception data of the robot's environment, extract features from the multimodal perception data and fuse them across modes to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment.

[0133] The task planning unit is used by the autonomous decision-making agent to perform progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution timing constraint and precondition.

[0134] The control execution unit is used to map each atomic action to the kinematic control parameters of the robot body, generate control instructions containing joint trajectory parameters and end effector operation parameters, execute control instructions to drive the robot body, and continuously collect body state feedback data and environmental change feedback data during the execution process.

[0135] The feedback correction unit is used by the autonomous decision-making agent to incrementally correct the hierarchical task planning or atomic action sequence based on the deviation between the ontology state feedback data, environmental change feedback data and the expected execution state.

[0136] The closed-loop control unit is used to feed back the correction results to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

[0137] A third aspect of the present invention provides an electronic device, comprising:

[0138] processor;

[0139] Memory used to store processor-executable instructions;

[0140] The processor is configured to invoke instructions stored in the memory to execute the aforementioned method.

[0141] A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the aforementioned method.

[0142] This invention can be a method, apparatus, system, and / or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for performing various aspects of the invention.

[0143] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An autonomous control method for embodied intelligent robots based on AI agents, characterized in that, include: Acquire multimodal perception data of the robot's environment, perform feature extraction and cross-modal fusion on the multimodal perception data to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment; The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution time constraint and preconditions. Each atomic action is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters. The control instructions are executed to drive the robot body, and the body state feedback data and environmental change feedback data are continuously collected during the execution process. The autonomous decision-making agent incrementally corrects the hierarchical task planning or atomic action sequence based on the deviation between the ontological state feedback data, environmental change feedback data and the expected execution state, and feeds the correction results back to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

2. The method according to claim 1, characterized in that, Based on fused sensing features, a structured semantic representation of the environment state is constructed, including: Channel separation is performed on the fused perceptual features to obtain visual semantic feature components and spatial geometric feature components. Based on the visual semantic feature components, semantic category recognition and instance-level segmentation are performed on each physical entity in the environment. A unique entity identifier and semantic category label are assigned to each recognized physical entity. Based on spatial geometric feature components, three-dimensional pose estimation and geometric shape reconstruction are performed on each physical entity to generate three-dimensional bounding box information and surface geometric description of each physical entity. Based on the 3D bounding box information and surface geometry description of each physical entity, the spatial distance, relative azimuth angle and contact state between entity pairs are calculated, and a spatial topology graph with physical entities as nodes and spatial adjacency relationships as edges is constructed. For entity pairs with spatial adjacency in the spatial topology graph, further determine the functional interaction type between the entity pairs, mark the functional interaction type on the corresponding edge, and combine the unique entity identifier, semantic category label, 3D bounding box information, surface geometry description and its adjacency edge information in the spatial topology graph of each physical entity into the semantic attribute record of the physical entity. The semantic attribute records of all physical entities and the spatial topological relationship graph with functional interactive annotations are output as a structured semantic representation of the environment state.

3. The method according to claim 1, characterized in that, The autonomous decision-making agent performs progressive reasoning based on the structured semantic representation of the environmental state and preset task objectives to generate a hierarchical task plan. This hierarchical task plan contains a sequence of sub-tasks with dependencies, including: The autonomous decision-making agent analyzes the preset task objectives, decomposes the preset task objectives into several phased objectives with sequential constraints, extracts the physical entities and their semantic attribute records associated with the current phased objective from the structured semantic representation of the environmental state, and constructs a local environmental state view corresponding to the current phased objective. Intent-level reasoning is performed based on the local environment state view, mapping the current stage goal to a set of candidate operation intents. The candidate operation intents are the state change directions applied to the target physical entity. The feasibility of each candidate operation intent in the set of candidate operation intents is evaluated, and candidate operation intents that conflict with the spatial constraints of the physical entity in the local environment state view are eliminated, while feasible operation intents are retained. Each feasible operation intention is expanded into a corresponding subtask, and each subtask is labeled with the preconditions of the physical entities it depends on. For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependency, parallel independence, and mutual exclusion relationships among the subtasks. The set of subtasks corresponding to all phased goals, as well as the serial dependencies, parallel independence relationships, and mutual exclusion relationships between subtasks, are integrated into hierarchical task planning.

4. The method according to claim 3, characterized in that, For all subtasks belonging to the same stage objective, perform cross-analysis of the pre-existing physical entity state conditions to determine the serial dependencies, parallel independence relationships, and mutual exclusion relationships among the subtasks, including: For each subtask, its preceding physical entity state conditions are represented as a set of condition tuples consisting of physical entity identifiers and desired state values; For any two subtasks belonging to the same stage objective, their condition tuple sets are compared tuple by tuple. If the expected state value of the first subtask conflicts with the expected state value of the same physical entity in the precondition of the second subtask, the two subtasks are marked as mutually exclusive. If the target physical entity of a certain tuple in the condition tuple set of the first subtask is the same as the target physical entity to which the state change is applied in the second subtask, and the expected state value of the first subtask is the same as the state value of the physical entity after the change is applied in the second subtask, then the first subtask is marked as a serial dependency on the second subtask. If the condition tuple sets of the two subtasks do not contain any physical entities that are referenced in common, and the target physical entities of the two subtasks are disjoint, then the two subtasks are marked as parallel and independent. For subtask pairs marked as mutually exclusive, conflict resolution is achieved by inserting intermediate transitional subtasks or adjusting the execution order constraints of subtasks. The serial dependency and parallel independence relationships determined after conflict resolution are output as the dependency relationships between subtasks.

5. The method according to claim 1, characterized in that, Each atomic motion is mapped to the kinematic control parameters of the robot body, generating control instructions that include joint trajectory parameters and end effector operation parameters, including: For each atomic action in the atomic action sequence, parse the operation type identifier of the atomic action and the entity identifier of the target operation object; Based on the operation type identifier, the preset kinematic parameter mapping rule library is queried to determine the end effector attitude template and motion constraints corresponding to the operation type. The motion constraints include the end effector's velocity upper limit constraint and acceleration upper limit constraint. Based on the entity identifier of the target operation object, the three-dimensional bounding box information and surface geometric description of the physical entity are retrieved from the structured semantic representation of the environmental state, and the target pose sequence of the end effector during the operation execution process is calculated. The target pose sequence of the end effector is input into the inverse kinematics model of the robot body. Under the premise of satisfying the motion constraints, the joint angle time series of each joint of the robot body is obtained. Smoothing interpolation and velocity planning are performed on the joint angle time series of each joint to generate the continuous motion trajectory curve of each joint. The continuous motion trajectory curves of each joint are encoded as joint trajectory parameters, the time series of the operation state of the end effector is encoded as end effector operation parameters, and the joint trajectory parameters and end effector operation parameters are aligned in time sequence and then encapsulated into executable control instructions.

6. The method according to claim 1, characterized in that, The autonomous decision-making agent incrementally corrects hierarchical task planning or atomic action sequences based on the deviation between ontology state feedback data, environmental change feedback data, and the expected execution state, including: Extract the expected execution state description of the currently executing subtask from the hierarchical task planning. The expected execution state description includes the expected ontology state vector and the expected environment state vector. The collected ontological state feedback data and the expected ontological state vector are interpolated one dimension at a time to generate an ontological state deviation vector. The collected environmental change feedback data and the expected environmental state vector are interpolated one dimension at a time to generate an environmental state deviation vector. The ontological state deviation vector and the environmental state deviation vector are then weighted and fused to obtain a comprehensive deviation scalar. When the overall deviation scalar is lower than the first judgment threshold, the judgment deviation is at the compensable level, and the joint trajectory parameters of the current atomic action are locally compensated and adjusted to maintain the hierarchical task planning unchanged. When the comprehensive deviation scalar is between the first and second judgment thresholds, the judgment deviation is at the correction level, and the current subtask and its subsequent unexecuted atomic action sequence are re-decomposed and re-mapped. When the overall deviation scalar is higher than the second judgment threshold, the judgment deviation is at the level of needing replanning, triggering the autonomous decision agent to re-reason and re-plan all sub-task sequences that have not yet been executed based on the structured semantic representation of the updated environmental state.

7. An embodied intelligent robot autonomous control system based on an AI agent, used to implement the method as described in any one of claims 1-6, characterized in that, include: The perception fusion unit is used to acquire multimodal perception data of the robot's environment, extract features from the multimodal perception data and fuse them across modes to generate fused perception features, and construct a structured semantic representation of the environment state based on the fused perception features, which includes the semantic attributes and spatial topological relationships of each entity in the environment. The task planning unit is used by the autonomous decision-making agent to perform progressive reasoning based on the structured semantic representation of the environmental state and the preset task objectives to generate a hierarchical task plan. The hierarchical task plan contains a sequence of sub-tasks with dependencies. Each sub-task in the hierarchical task plan is decomposed into a sequence of atomic actions, and each atomic action has a definite execution timing constraint and precondition. The control execution unit is used to map each atomic action to the kinematic control parameters of the robot body, generate control instructions containing joint trajectory parameters and end effector operation parameters, execute control instructions to drive the robot body, and continuously collect body state feedback data and environmental change feedback data during the execution process. The feedback correction unit is used by the autonomous decision-making agent to incrementally correct the hierarchical task planning or atomic action sequence based on the deviation between the ontology state feedback data, environmental change feedback data and the expected execution state. The closed-loop control unit is used to feed back the correction results to the control command generation stage, forming an autonomous closed-loop control process from perception to execution to feedback and then to adaptive adjustment.

8. An electronic device, characterized in that, include: processor; Memory used to store processor-executable instructions; The processor is configured to invoke instructions stored in the memory to execute the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having computer program instructions stored thereon, characterized in that, When the computer program instructions are executed by the processor, they implement the method described in any one of claims 1 to 6.