A quadruped robot end-to-end navigation method based on reinforcement learning

By using an end-to-end navigation method based on reinforcement learning, environmental perception is directly mapped to joint torque output, which solves the problems of autonomy and robustness of legged robots in complex environments and achieves efficient autonomous navigation and motion optimization.

CN122195044APending Publication Date: 2026-06-12SHANGHAI INST OF MICROSYSTEM & INFORMATION TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI INST OF MICROSYSTEM & INFORMATION TECH CHINESE ACAD OF SCI
Filing Date
2026-02-14
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing legged robot navigation systems have low levels of autonomy and insufficient robustness in complex, unstructured environments. They also have limited adaptability to complex terrains and are susceptible to positioning drift due to external environmental interference.

Method used

An end-to-end navigation method based on reinforcement learning is adopted. By constructing a unified learning framework, environmental perception is directly mapped to joint torque output, omitting the sensor data conversion process. Decision-making and control are carried out using an Actor-Critic architecture and a deterministic hidden state space model.

🎯Benefits of technology

It enables robots to autonomously navigate complex, unstructured environments, improves the optimization of autonomous decision-making and action execution, reduces information loss and delay, and enhances the understanding and response capabilities to complex environments.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122195044A_ABST
    Figure CN122195044A_ABST
Patent Text Reader

Abstract

The application provides a kind of four-legged robot end-to-end navigation method based on reinforcement learning, comprising: taking robot as intelligent agent, the motion control of four-legged robot and navigation task are modeled as Markov decision process in limited time range, define the conversion formula of action to joint torque vector;Collect and preprocess the perception information to obtain the current observation information;Using world network model, the current observation information is encoded, and the deterministic hidden state is output;With deterministic hidden state as input, use execution network to generate the action of intelligent agent;According to the conversion formula of action to joint torque vector, the joint torque vector is solved in real time, and the corresponding torque instruction is issued to the joint driving module of robot to drive robot to execute corresponding action.The method of the application enables the robot to autonomously decide and execute the optimal composite action sequence according to the current environment and task requirements, and improves the autonomous traversal capability of the robot in complex unstructured environment.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention belongs to the field of robot navigation, specifically relating to an end-to-end navigation method for quadruped robots based on reinforcement learning. Background Technology

[0002] With the increasing demand for service robots and special-purpose robots in complex and unstructured environments, legged robots, with their superior terrain adaptability, have become an ideal alternative to wheeled and tracked platforms. Currently, mainstream legged robot navigation systems generally adopt a hierarchical architecture of "perception-localization-planning-control," with SLAM (Simultaneous Localization and Mapping) technology, as the core module for environmental perception and pose estimation, being widely integrated into commercial platforms such as Unitree Go2 and Boston Dynamics Spot. These systems typically rely on the fusion of LiDAR (such as MID-360) and IMU to construct a point cloud map, and achieve high-precision localization based on graph optimization algorithms (such as LIO-SAM), combined with traditional path planning algorithms (such as A* and DWA) to complete the navigation task.

[0003] Currently, mainstream SLAM applications mainly include three core modules: mapping, localization, and navigation. The mapping module collects LiDAR point cloud data, extracts geometric features (such as corners and faces) from the environment, and constructs a 3D point cloud map that can be used for relocalization. The localization module uses this prior map, combined with real-time sensor observations and robot motion data, to estimate the robot's precise pose in the environment, providing reliable position feedback for the navigation system. The navigation module autonomously plans a motion path based on the deviation between the current pose and the target point, and generates corresponding control commands to guide the robot to the target location.

[0004] In the system, the motion data extraction module is responsible for converting the robot's raw motion information into standard format odometry (Odom) and IMU (inertial measurement unit) data to support state estimation in the SLAM algorithm; the lidar data acquisition module acquires 3D point cloud information in real time as the main input for environmental perception; the topology point reading module loads preset navigation landmarks to support path planning tasks; finally, the motion control module executes navigation commands to coordinate the robot's gait and motion behavior to ensure that it safely and stably reaches the target destination.

[0005] This system employs mature laser SLAM technology, enabling a complete closed loop of mapping, localization, and navigation. It supports standardized data output (such as Odom, IMU, and PCD maps), facilitating integration and data exchange with other systems. However, it still has several limitations in practical applications: the initial pose requires manual intervention; if the robot deviates from its localization or loses its pose during operation, it often needs to be manually moved back to the origin for reinitialization, resulting in low automation and insufficient robustness. Simultaneously, the system has limited adaptability to complex terrain, currently only suitable for flat ground, lacking effective perception and response to unstructured environments. Furthermore, the SLAM algorithm itself is susceptible to external environmental interference (such as dynamic objects or changes in lighting), which may cause localization drift, affecting long-term operational reliability. The advantages and disadvantages of traditional SLAM algorithms are shown in Table 1.

[0006] Table 1. Advantages and disadvantages of traditional SLAM navigation algorithms

[0007]

[0008] Therefore, it is necessary to provide a reinforcement learning-based end-to-end navigation method for quadruped robots to improve the robot's ability to autonomously navigate complex unstructured environments. Summary of the Invention

[0009] The purpose of this invention is to provide a new navigation algorithm that enables robots to make autonomous decisions and execute optimal composite action sequences based on the current environment and task requirements, thereby improving the robot's ability to autonomously navigate complex unstructured environments.

[0010] To achieve the above objectives, this invention provides an end-to-end navigation method for a quadruped robot based on reinforcement learning, comprising:

[0011] Step S1: Using the robot as an intelligent agent, by defining the state space and action space of the intelligent agent, the motion control and navigation task of the quadruped robot is modeled as a Markov decision process with a finite time range, and the conversion formula from action to joint torque vector is defined.

[0012] Step S2: Collect and preprocess the sensing information to obtain the current observation information. ;

[0013] Step S3: Utilize the world network model to analyze the current observation information. Encode and output the deterministic hidden state. ;

[0014] Step S4: Deterministic latent state As input, the action of the agent is generated using an execution network based on the Actor-Critic architecture. ;

[0015] Step S5: Based on the conversion formula from motion to joint torque vector, calculate the joint torque vector in real time, and send the torque command corresponding to the joint torque vector to the robot's joint drive module to drive the robot to perform the corresponding motion.

[0016] The state space of an intelligent agent includes: ontological perception information and the relative position of the target point. and lidar point cloud data Proprioceptive information includes joint angles angular velocity Base linear velocity / base angular velocity ;

[0017] The agent's action space includes: the offset of the default joint position of the PD controller. .

[0018] In step S2, the robot collects body perception data and external perception data in real time through multiple sensors, performs standardized preprocessing, and integrates them into unified current observation information. Propriometry data includes: current joint angle. Current joint angular velocity angular velocity of the base The gravity vector of the IMU External perception data (Exteroception) This includes the relative position of the target point. and lidar point cloud data .

[0019] A deterministic hidden state space model is adopted as the architecture of the world network model; the deterministic hidden state space model includes an encoder, a dynamic predictor and a decoder connected in sequence, as well as a loop model connected to the dynamic predictor.

[0020] The loop model is configured to execute within a compressed latent variable space, utilizing historical states. Recent action sequence and the preceding random hidden state To infer the deterministic hidden state at the current moment ;

[0021] The encoder bases its input on the deterministic hidden state. and current observation vector and given deterministic hidden states and current observation vector Under the condition of random hidden state conditional probability distribution To generate the random hidden state at the current moment. ;

[0022] The dynamic predictor is based on the deterministic hidden state of the input. and given deterministic hidden states Predicting random hidden states under the given conditions prior distribution To obtain the predicted random hidden state ;

[0023] The decoder is based on the deterministic hidden state of the input. and random hidden states and given a deterministic hidden state and random hidden states Current observation estimation under the conditions Conditional distribution To output the current observation estimate Current observation estimates Used during the training of a deterministic hidden state-space model with the current observation vector. The reconstruction loss is calculated by comparison.

[0024] The input frequency of the world network model is 10Hz, which is consistent with the operating frequency of the robot's lidar; the operating frequency of the execution network is 50Hz.

[0025] The execution network includes a policy network and a value network, the policy network... It includes a history encoder, a path predictor, and a motion output head, which is used to output the motion. .

[0026] The historical encoder is a linear network. Will Current observation information A sequence of historical trajectory points Encoding as historical feature vectors ;

[0027] The path predictor is a GRU gated network. Based on deterministic hidden states To predict feasible paths ;

[0028] Action output head For MOE networks, action output head Integrating historical feature vectors and feasible paths Generate the final action . = , This is the offset of the default joint position of the PD controller.

[0029] The world network model and execution network are pre-trained;

[0030] During the pre-training of the execution network, the policy network first obtains the deterministic hidden states derived from the inference of the world network model. and current observation information The historical trajectory points are used to generate and execute actions, obtain the state at the next moment, and obtain a sample consisting of actions, states, and rewards, which is then placed into the experience replay pool.

[0031] Subsequently, samples in the experience replay pool are randomly sampled, and multiple value networks are used to calculate the advantage functions of the current state and the target state based on the samples and their respective reward functions. The total advantage function is obtained by weighted fusion, and the policy loss of the policy network is calculated accordingly to iteratively update the weight parameters of the policy network.

[0032] Subsequently, the value networks of multiple different groups calculate the value loss based on the reward function and the advantage function, thereby iteratively updating the weight parameters of the value networks.

[0033] The experience replay pool also includes privileged information, which includes at least an accurate terrain elevation map and ground contact force. When calculating the advantage function, by providing the value network with privileged information from the simulation environment, the value network can more accurately evaluate the merits of the policy network's actions and output a more reliable advantage function.

[0034] This invention discloses a reinforcement learning-based method for coordinated motion control and navigation of legged robots. By constructing a unified end-to-end learning framework, it achieves a direct mapping from environmental perception to joint torque output. This enables the robot to autonomously make decisions and execute optimal composite action sequences based on the current environment and task requirements, realizing end-to-end joint optimization of perception, decision-making, and control, and improving the robot's autonomous traversal capability in complex unstructured environments. In other words, by constructing a unified end-to-end learning framework, this invention achieves a direct mapping from environmental perception to joint torque output, eliminating the intermediate process of converting sensor data into odometry and inertial sensor data required in traditional methods, thereby reducing information loss and latency. Attached Figure Description

[0035] Figure 1 This is a system framework diagram of an end-to-end navigation method for quadruped robots based on reinforcement learning according to the present invention;

[0036] Figure 2This is an assembly drawing of the robot used in the reinforcement learning-based end-to-end navigation method for quadruped robots according to the present invention. Detailed Implementation

[0037] Preferred embodiments of the present invention are given in conjunction with the accompanying drawings and described in detail.

[0038] like Figure 1 As shown, the reinforcement learning-based end-to-end navigation method for quadruped robots of the present invention includes the following steps:

[0039] Step S1: Using the robot as an intelligent agent, by defining the state space and action space of the intelligent agent, the motion control and navigation task of the quadruped robot is modeled as a Markov decision process (MDP) with a finite time range, and the conversion formula from action to joint torque vector is defined.

[0040] The state space of an intelligent agent (i.e., a robot) includes: proprioceptive information (such as joint angles) angular velocity Base linear velocity / base angular velocity ), relative position of the target point and lidar point cloud data .

[0041] The relative position of the target point is the coordinate difference between the target point and the robot's current pose. The target point refers to the endpoint that the robot needs to reach. This coordinate difference includes the horizontal distance difference Δx and Δy. LiDAR point cloud data refers to the raw or pre-processed point cloud data from the LiDAR system. The target point coordinates are the coordinates in the robot's coordinate system, and there are typically two ways to obtain them:

[0042] Preset target point: a known coordinate point, sent via the communication module;

[0043] Dynamic target points: In complex tasks, target points may need to be dynamically generated using reference points located by RTK (e.g., following a moving target).

[0044] That is, the state of the state space .

[0045] The action space of an intelligent agent (i.e., a robot) includes the offset of the default joint position of the PD (proportional-derivative) controller. The unit of this offset is radians (rad), and the dimension matches the number of joints on the robot.

[0046] That is, the action space. .

[0047] Define the conversion formula from motion to joint torque vector, used to convert the output motion from the motion space. This is converted into torques at each joint that the robot can execute. The formula for converting motion into joint torque vectors is:

[0048] ,

[0049] in, This is the joint torque vector. , These are the proportional and differential gain matrices of the PD controller. This is the default joint position for the PD controller. This is the offset of the default joint position of the PD controller, i.e., the motion in the motion space. For the current joint angle, The default joint speed. This represents the current joint angular velocity.

[0050] Step S2: Collect and preprocess the sensing information to obtain the current observation information. ;

[0051] In step S2, the robot collects body perception data and external perception data in real time through multiple sensors, performs standardized preprocessing, and integrates them into unified current observation information. This provides input for the world network model.

[0052] Therefore, current observation information This includes the robot's proprioceptive information (i.e., processed proprioceptive data) and environmental information (i.e., processed external perception data, including the relative position of the target point) perceived at time t. and lidar point cloud data ).

[0053] Proprioception ( (This refers to current observation information) Part of it, which typically includes: the current joint angle Current joint angular velocity angular velocity of the base The gravity vector of the IMU (This can be considered as the robot's rotation angle relative to the direction of gravity). External perception data ( This includes the relative position of the target point. and lidar point cloud data That is, current observation information. =[Proprioception( ), Exteroception( )]= [ , , , , , ].in, This represents the current joint angle, in radians (rad). The current joint angular velocity is expressed in radians per second (rad / s). The gravity vector ∈ ^3; The base angular velocity is expressed in radians per second (rad / s). The relative position of the target point is expressed in meters (m). This is lidar point cloud data, in meters (m).

[0054] Step S3: Utilize the world network model to analyze the current observation information. Encode and output the deterministic hidden state. This enables robots to gain a deep understanding of their own sensory information and environmental information.

[0055] like Figure 1 As shown, the reinforcement learning-based end-to-end navigation method for quadruped robots in this invention uses a world network model to process current observation information. The code is encoded, and the encoding results are transformed into actions using an execution network based on the Actor-Critic architecture.

[0056] In this embodiment, a deterministic Recurrent State-Space Model (RSSM) is used as the architecture for the world network model. The input frequency of the world network model is 10Hz, which is consistent with the operating frequency of the robot's LiDAR. The operating frequency of the execution network (Actor-Critic architecture) (i.e., the frequency of the navigation and control strategies) is 50Hz, which is higher than the observation frequency of the world network model. This is because the quadruped robot's walking, climbing, and obstacle avoidance require precise and continuous joint torque adjustments. A high refresh rate ensures smooth robot movements and stable posture. If the frequency is too low, problems such as motion stuttering and gait imbalance will occur.

[0057] The deterministic hidden state space model consists of an encoder, a dynamic predictor, and a decoder connected in sequence, as well as a loop model connected to the dynamic predictor. Through the collaborative work of these four functional components, a complete state inference and prediction framework is formed.

[0058] The cyclic model is responsible for maintaining and updating the deterministic hidden state of the system. In the simulation, the deterministic hidden state of the world network model is updated every k time steps. Each time step corresponds to one input at a frequency of 10 Hz. The recurrent model is configured to execute within a compressed latent variable space, utilizing historical states. Recent action sequence and the preceding random hidden state To infer the deterministic hidden state at the current moment This design enables the cyclical model to effectively integrate long-term information and maintain continuous tracking of environmental states. Historical states It is the deterministic hidden state k time steps in advance; the preceding random hidden state It represents the random hidden state from k time steps ago, obtained through the encoder; the recent action sequence. It is the sequence of actions from time step tk to time step (t-1).

[0059] That is, the cyclic model is represented as:

[0060] (1)

[0061] in, Neural networks representing recurrent models, For historical state, This represents the preceding random hidden state. This is a recent sequence of actions.

[0062] The encoder bases its input on the deterministic hidden state. and current observation vector and given deterministic hidden states and current observation vector Under the condition of random hidden state conditional probability distribution To generate the random hidden state at the current moment. Therefore, the encoder will record the current observation. With deterministic hidden states By combining these, a posterior distribution containing the latest environmental information is generated, enabling the model to incorporate new observation data in a timely manner.

[0063] That is, the encoder is represented as:

[0064] (2)

[0065] The dynamic predictor is based on the deterministic hidden state of the input. and given deterministic hidden states Predicting random hidden states under the given conditions prior distribution To obtain the predicted random hidden state Therefore, the dynamic predictor does not depend on current observations, but is purely based on deterministic hidden states. This allows the model to predict the next hidden state, enabling it to make forward predictions even without new observations.

[0066] That is, the dynamic predictor is represented as:

[0067] (3)

[0068] The decoder is based on the deterministic hidden state of the input. and random hidden states and given a deterministic hidden state and random hidden states Current observation estimation under the conditions Conditional distribution To output the current observation estimate Current observation estimates Used during the training of a deterministic hidden state-space model with the current observation vector. The reconstruction loss is obtained through comparative calculations. Current observation estimates... It is the predicted value, while the current observation vector is... These are the actual observed data. Therefore, the decoder uses the deterministic hidden state... and random hidden states By combining these methods, an estimate of the original high-dimensional observations can be generated, realizing a mapping from the low-dimensional latent space to the original observation space.

[0069] That is, the decoder is represented as:

[0070] (4).

[0071] Therefore, the world network model is used to analyze the current observation information. The encoding process specifically includes: an encoder generating random hidden states; and a dynamic predictor obtaining predicted random hidden states. The decoder outputs the current observation estimate. Current observation estimates Used during the training of a deterministic hidden state-space model with the current observation vector. The reconstruction loss is obtained through comparative calculations; the recurrent model is based on the predicted stochastic hidden states. To update the deterministic hidden state.

[0072] Step S4: Deterministic latent state As input, the action of the agent is generated using an execution network based on the Actor-Critic architecture. Thus, the decision-making process of integrating historical trajectories, predicting feasible paths, and finally outputting actions was completed, generating the robot's actions. .

[0073] like Figure 1 As shown, the execution network includes a policy network and a value network, wherein the policy network It includes a history encoder, a path predictor, and a motion output head, which is used to output the motion. The action output head is the output layer of the execution network, responsible for converting intermediate features into the final action. .

[0074] The history encoder is a linear network. Will Current observation information A sequence of historical trajectory points Encoding as historical feature vectors Historical trajectory point sequence Obtained from the navigation system or route planning module. = [p_1, p_2, ..., p_n], where p_i are the coordinates of a path point, p_i ∈ 2 or 3 .

[0075] History Encoder Represented as:

[0076] (5)

[0077] Where θ is the network weight parameter, It is a sequence of historical trajectory points.

[0078] The path predictor is a GRU gated network. Based on deterministic hidden states To predict feasible paths .

[0079] Path predictor Represented as:

[0080] (6)

[0081] Where θ is the network weight parameter, This is a deterministic hidden state.

[0082] Action output head For MOE networks, action output head Integrating historical feature vectors and feasible paths Generate the final action . = , This is the offset of the default joint position of the PD controller.

[0083] Action output head Represented as:

[0084] (7)

[0085] Where θ is the network weight parameter, For historical feature vectors, This is a feasible path. For deterministic hidden states Perform a gradient cessation operation to prevent the execution network and the world network model from interfering with each other.

[0086] The world network model and the execution network are pre-trained. The world network model and the execution network perform inference in series, but collect replay data separately, and each network of the world network model and the execution network is trained independently.

[0087] To avoid the instability of training caused by the mutual influence of complex reward functions, this invention adopts a grouping approach for reward functions.

[0088] In other words, when pre-training the execution network, the policy network first obtains the deterministic hidden states obtained from the inference of the world network model. and current observation information The historical trajectory points are used to generate and execute actions, obtain the state at the next moment, and obtain a sample consisting of actions, states, and rewards, which is then placed into the experience replay pool.

[0089] Subsequently, samples from the experience replay pool are randomly sampled. Multiple value networks (Critic networks) are used to calculate the advantage functions of the current state and the target state based on the samples and their respective reward functions (as value losses). The overall advantage function is obtained by weighted fusion, and the policy loss of the policy network is calculated accordingly to iteratively update the weight parameters of the policy network. The experience replay pool may also include privileged information, which includes at least a precise terrain height map and ground contact force. When calculating the advantage function, by providing the value network with privileged information (precise terrain height map, ground contact force, etc.) in the simulation environment, the value network can more accurately evaluate the merits of the policy network's actions and output a more reliable advantage function.

[0090] Subsequently, multiple value networks with different groups calculate the value loss based on the reward function and the dominance function, thereby iteratively updating the weight parameters of the value networks. In this embodiment, the reward function is divided into four groups: navigation group, motion group, stabilization group, and exploration group, which are responsible for: 1) encouraging the robot to move to the target point; 2) constraining the robot to maintain a reasonable motion posture during movement; 3) keeping the robot as stable as possible after reaching the target point; and 4) urging the robot to leave the stagnation as soon as possible when it stops moving in a fixed position for a long time.

[0091] Preferably, to improve training efficiency and navigation path quality, path-assisted supervision loss can be introduced into the training of the policy network. Specifically, in a simulated training environment, a shortest path reference map covering the entire map is generated in advance using the wavefront propagation algorithm. During training, the ground truth value ψ* of the optimal forward direction is retrieved in real-time from the reference map and compared with the expected forward direction ψ_head predicted by the policy network. The difference between the two is calculated as the path-assisted supervision loss. This path-assisted supervision loss, together with the reinforcement learning policy gradient loss, constitutes the total training loss of the policy network.

[0092] Step S5: Using the PD controller, the robot's joint torque vector is calculated in real time according to the conversion formula from motion to joint torque vector. The torque command corresponding to the joint torque vector is then sent to the robot's joint drive module to drive the robot to perform the corresponding motion.

[0093] Deployment results:

[0094] The deployment of the reinforcement learning-based end-to-end navigation method for quadruped robots of the present invention is as follows: Figure 2 As shown, a reinforcement learning-based end-to-end navigation method for quadruped robots is deployed on the robot's onboard computer. The quadruped robot includes an RTK system, a LiDAR, a router, and the onboard computer. The RTK system includes a main antenna and a secondary antenna, and the router is used for ROS data communication between the LiDAR and the robot's onboard computer. High-precision positioning systems such as RTK-GPS are used to provide target points and robot pose, combined with an onboard LiDAR (such as the MID360) for real-time environmental perception, achieving closed-loop control. The system uses high-precision RTK-GPS (Real-time Dynamic Differential Global Positioning System) as the global pose reference.

[0095] The RTK system employs a dual-antenna configuration: the main antenna is rigidly mounted at the center of the robot's torso to acquire the robot's three-dimensional position (x, y, z) in the global coordinate system; the secondary antenna is mounted at the front of the robot's head or back, forming a fixed baseline with the main antenna, and is used to calculate the robot's yaw angle with high precision, thereby outputting complete six-degree-of-freedom pose information. This pose data is transmitted via serial port at a frequency of 10–20 Hz to the onboard main control unit (such as NVIDIA Jetson AGXOrin or an equivalent embedded computing platform).

[0096] In this embodiment, the target point coordinates are known fixed coordinates, transmitted to the robot via a low-power, long-range LoRa serial communication module. Specifically, one end of the LoRa module connects to the mission command terminal (which can be deployed on ground stations, handheld devices, or aerial vehicles such as drones), and the other end is integrated into the serial port interface of the robot's main control system. This communication link operates in the Sub-GHz band (e.g., 433 MHz or 868 / 915 MHz), achieving a transmission distance of 300–1000 meters in open environments. It possesses strong penetration and anti-interference capabilities, making it suitable for complex non-line-of-sight (NLOS) scenarios such as wilderness, ruins, and jungles. This method eliminates the need for local area networks or internet infrastructure, significantly improving the system's deployment flexibility and mission adaptability in environments without communication infrastructure. For example, in disaster search and rescue missions, operators can control drones equipped with LoRa transmitters to fly over obstacle areas and deliver target points to the ground-based legged robot in real time, guiding it to the location of trapped personnel.

[0097] This invention discloses a reinforcement learning-based method for coordinated motion control and navigation of legged robots. By constructing a unified end-to-end learning framework, it achieves a direct mapping from environmental perception to joint torque output. This enables the robot to autonomously make decisions and execute optimal composite action sequences based on the current environment and task requirements, realizing end-to-end joint optimization of perception, decision-making, and control, and improving the robot's autonomous traversal capability in complex unstructured environments. In other words, by constructing a unified end-to-end learning framework, this invention achieves a direct mapping from environmental perception to joint torque output, eliminating the intermediate process of converting sensor data into odometry and inertial sensor data required in traditional methods, thereby reducing information loss and latency.

[0098] This invention constructs an asymmetric Actor-Critic network architecture. The policy network (Actor) relies solely on measurable inputs such as robot body perception and LiDAR point clouds, while the value network (Critic) utilizes privileged information from the simulation environment (such as accurate terrain height maps and contact forces) for efficient training. This significantly improves learning efficiency and stability while ensuring policy deployability. A deterministic hidden state space model (RSSM) is introduced to model the hidden states of sparse and noisy point cloud data, enhancing the system's understanding of the 3D environment. By combining curriculum learning with a grouped reward function design, the robot can autonomously learn and flexibly combine complex behaviors such as "circling" and "climbing," and possesses strong robustness in actively recovering from failures such as falls and slips.

[0099] Compared to conventional SLAM (Simultaneous Localization and Mapping) solutions, this invention offers significant advantages. Traditional SLAM schemes typically follow a sequential paradigm of "perception-mapping-planning-control," a process inherently delayed, with errors accumulating between modules. More importantly, high-level path planners struggle to accurately grasp the dynamic motion limits of the underlying robot, potentially leading to physically infeasible planned paths. In contrast, this invention employs an end-to-end learning framework, directly mapping sensor inputs to joint control commands, internalizing the robot's dynamic characteristics into the strategy, ensuring the physical feasibility of all decisions. It does not rely on explicit, complete environmental mapping; instead, it uses neural networks to understand and react to the local environment in real time, resulting in faster and more agile responses to dynamic obstacles and sudden terrain changes. This makes it particularly suitable for highly dynamic, partially observable, and complex real-world scenarios.

[0100] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of the invention. Various variations can be made to the above embodiments of the present invention. That is, all simple and equivalent changes and modifications made based on the claims and description of this invention fall within the protection scope of the claims of this patent. All aspects not described in detail in this invention are conventional technical content.

Claims

1. An end-to-end navigation method for a quadruped robot based on reinforcement learning, characterized in that, include: Step S1: Using the robot as an intelligent agent, by defining the state space and action space of the intelligent agent, the motion control and navigation task of the quadruped robot is modeled as a Markov decision process with a finite time range, and the conversion formula from action to joint torque vector is defined. Step S2: Collect and preprocess the sensing information to obtain the current observation information. ; Step S3: Utilize the world network model to analyze the current observation information. Encode and output the deterministic hidden state. ; Step S4: Deterministic latent state As input, the action of the agent is generated using an execution network based on the Actor-Critic architecture. ; Step S5: Based on the conversion formula from motion to joint torque vector, calculate the joint torque vector in real time, and send the torque command corresponding to the joint torque vector to the robot's joint drive module to drive the robot to perform the corresponding motion.

2. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 1, characterized in that, The state space of an intelligent agent includes: ontological perception information and the relative position of the target point. and lidar point cloud data Proprioceptive information includes joint angles angular velocity Base linear velocity / base angular velocity ; The agent's action space includes: the offset of the default joint position of the PD controller. .

3. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 2, characterized in that, In step S2, the robot collects body perception data and external perception data in real time through multiple sensors, performs standardized preprocessing, and integrates them into unified current observation information. ; Proprioception data includes: current joint angle Current joint angular velocity angular velocity of the base Gravity vector of the IMU External perception data (Exteroception) This includes the relative position of the target point. and lidar point cloud data .

4. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 1, characterized in that, A deterministic hidden state space model is adopted as the architecture of the world network model; the deterministic hidden state space model includes an encoder, a dynamic predictor and a decoder connected in sequence, as well as a loop model connected to the dynamic predictor.

5. The reinforcement learning-based end-to-end navigation method for quadruped robots according to claim 4, characterized in that, The loop model is configured to execute within a compressed implicit variable space, utilizing historical states. Recent action sequence and the preceding random hidden state To infer the deterministic hidden state at the current moment ; The encoder bases its input on the deterministic hidden state. and current observation vector and given deterministic hidden states and current observation vector Under the condition of random hidden state conditional probability distribution To generate the random hidden state at the current moment. ; The dynamic predictor is based on the deterministic hidden state of the input. and given deterministic hidden states Predicting random hidden states under the given conditions prior distribution To obtain the predicted random hidden state ; The decoder is based on the deterministic hidden state of the input. and random hidden states and given a deterministic hidden state and random hidden states Current observation estimation under the conditions Conditional distribution To output the current observation estimate Current observation estimates Used during the training of a deterministic hidden state-space model with the current observation vector. The reconstruction loss is calculated by comparison.

6. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 1, characterized in that, The input frequency of the world network model is 10Hz, which is consistent with the operating frequency of the robot's lidar; the operating frequency of the execution network is 50Hz.

7. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 1, characterized in that, The execution network includes a policy network and a value network, the policy network... It includes a history encoder, a path predictor, and a motion output head, which is used to output the motion. .

8. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 7, characterized in that, The historical encoder is a linear network. Will Current observation information A sequence of historical trajectory points Encoding as historical feature vectors ; The path predictor is a GRU gated network. Based on deterministic hidden states To predict feasible paths ; Action output head For MOE networks, action output head Integrating historical feature vectors and feasible paths Generate the final action . = , This is the offset of the default joint position of the PD controller.

9. The end-to-end navigation method for quadruped robots based on reinforcement learning according to claim 1, characterized in that, The world network model and execution network are pre-trained; During the pre-training of the execution network, the policy network first obtains the deterministic hidden states derived from the inference of the world network model. and current observation information The historical trajectory points are used to generate and execute actions, obtain the state at the next moment, and obtain a sample consisting of actions, states, and rewards, which is then placed into the experience replay pool. Subsequently, samples in the experience replay pool are randomly sampled, and multiple value networks are used to calculate the advantage functions of the current state and the target state based on the samples and their respective reward functions. The total advantage function is obtained by weighted fusion, and the policy loss of the policy network is calculated accordingly to iteratively update the weight parameters of the policy network. Subsequently, the value networks of multiple different groups calculate the value loss based on the reward function and the advantage function, thereby iteratively updating the weight parameters of the value networks.

10. The reinforcement learning-based end-to-end navigation method for quadruped robots according to claim 9, characterized in that, The experience replay pool also includes privileged information, which includes at least an accurate terrain elevation map and ground contact force. When calculating the advantage function, by providing the value network with privileged information from the simulation environment, the value network can more accurately evaluate the merits of the policy network's actions and output a more reliable advantage function.