A hierarchical obstacle avoidance method and system based on PPO-TD3 algorithm for collaborative training
The collaborative training hierarchical obstacle avoidance method based on the PPO-TD3 algorithm solves the problems of robot response delay and resource waste in dynamic environments, achieving efficient and safe obstacle avoidance.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: TIANFU JIANGXI LAB
- Filing Date: 2026-04-16
- Publication Date: 2026-06-26
Figure CN122284652A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of dynamic obstacle avoidance technology, and in particular relates to a hierarchical obstacle avoidance method and system based on PPO-TD3 algorithm for collaborative training.
Background Technology
[0002] With the widespread application of mobile robots in complex and dynamic environments (such as hospitals, shopping malls, streets, and police patrols), traditional obstacle avoidance methods based on rules or static path planning can no longer meet the dual requirements of real-time performance and safety.
[0003] While existing reinforcement learning solutions have partially solved the dynamic obstacle avoidance problem, they still have key shortcomings: Existing technologies employ a heuristic, hierarchical architecture where upper-layer modules calculate precise sequences of intermediate points as temporary target points, while lower-layer modules handle local obstacle avoidance, enabling robot navigation in map-less environments. However, these solutions perform poorly in dynamic environments, exhibiting discontinuous control, difficulty in handling sudden obstacles due to fixed intermediate point allocation methods, inconsistent target point switching leading to inconsistent motion trajectories, and low utilization of computational resources.
[0004] In traditional layered architectures, the coordinate commands output by higher layers are disconnected from the physical constraints of lower layers, causing some commands to be discarded due to exceeding execution capabilities and frequent emergency braking. Possible causes of this problem include: the command protocol not including dynamic parameters (such as maximum torque and friction coefficient); simulation training not considering the dynamic characteristics of actuators (such as motor temperature drift); and a lack of effective strategy correction mechanisms when lower-layer execution fails.
[0005] Existing structural inspection and navigation methods based on multimodal learning construct virtual environments that integrate damage information. They employ a multi-task reinforcement learning framework, combining CNN visual feature extraction and LSTM navigation memory modules to achieve autonomous navigation of agents in complex structural environments. However, these techniques suffer from significant real-time limitations. The cascaded processing of multiple modules leads to high decision-making delays, resulting in weak capabilities in handling dynamic obstacles. Furthermore, model training requires substantial computational resources.
[0006] In existing technologies, end-to-end deep reinforcement learning obstacle avoidance schemes use a single deep network to directly achieve end-to-end mapping from sensors to control, achieving good results in simulation environments. However, in practical applications, this scheme reveals problems such as policy fragility and action oscillations. Even slight environmental changes can lead to a sharp decline in performance, and the lack of physical constraints means that control outputs may exceed the capabilities of the actuator.
[0007] Existing obstacle avoidance systems in navigation solutions exhibit significant response delays when dealing with sudden dynamic obstacles, hindering timely obstacle avoidance decisions. Possible causes include: the system employs a fixed computational workflow, failing to dynamically adjust computational resource allocation based on environmental risks; pre-trained feature extraction networks are unable to adapt to unseen environmental features; data transmission between layers is not optimized, resulting in redundant information transfer; and a lack of prioritization mechanisms for different obstacle types.
[0008] In summary, end-to-end methods (such as a single PPO algorithm) suffer from excessively high response delays to sudden obstacles due to the lack of hierarchical division of labor, while traditional hierarchical methods suffer from wasted computing power and poor scene adaptability due to rigid architecture. Especially in human-robot mixed scenarios, existing technologies struggle to balance real-time obstacle avoidance and global path optimization, and fixed safety thresholds lead to frequent sudden stops or collisions. Furthermore, the fragmented control semantics across levels (such as higher levels outputting coordinate points while lower levels directly control motors) further exacerbate system instability risks. These limitations severely restrict the reliable deployment of robots in real dynamic environments, necessitating a novel hierarchical obstacle avoidance method that can adapt to scene complexity and combines rapid response with safety assurance.
Summary of the Invention
[0009] The purpose of this application is to overcome the problems of the prior art by disclosing a hierarchical obstacle avoidance method and system based on the PPO-TD3 algorithm for collaborative training, which improves the dynamic environment adaptability of mobile robots and reduces the dynamic obstacle response delay.
[0010] In one aspect, the objective of this application is achieved through the following technical solution: a hierarchical obstacle avoidance method based on the PPO-TD3 algorithm for collaborative training, comprising:
S1: Multi-source environmental perception and feature extraction, including: obtaining an obstacle distribution heatmap based on LiDAR; acquiring RGB images based on visual sensors and identifying obstacle categories; and establishing a velocity-direction vector field for moving obstacles and updating motion state predictions in real time.
S2: Dynamic scene complexity assessment. LiDAR obstacle density, visual semantic classification results, and dynamic tracking data are integrated to calculate a real-time scene complexity index, which can trigger a working mode switching decision through a threshold comparator.
S3: Dynamic scheduling of the hierarchical decision engine. When the scene complexity is below the threshold, the underlying TD3 control network is activated to directly process the pre-processed sensor data and generate robot motor control signals. When a complex environment is detected, the high-level PPO decision network receives the global feature map and outputs semantic instructions with physical constraints, and the underlying control network parses the instructions into executable action parameters.
S4: Semantic-control instruction conversion. The structured instructions generated by the high-level network are fed into the instruction parser, and their feasibility is verified against the current robot dynamics state. The abstract instructions are converted into specific motor control quantities through a constrained optimization algorithm to ensure the physical realizability of the instructions.
S5: Hybrid training and knowledge update. In the pre-training stage in the simulation environment, a virtual training set containing typical obstacle scenarios is constructed to train the basic capabilities of the decision network and the control network respectively. After deployment, network parameters are optimized through human demonstration data, and an incremental learning mechanism is established to continuously absorb real-world scenario data.
S6: Cross-level feedback optimization. The execution layer feeds back the control effect, constraint violations, and actual obstacle avoidance results to the decision layer in real time, and the optimization objectives are coordinated through a cross-level loss function.
[0011] According to a preferred embodiment, the raw point cloud data obtained by the lidar in step S1 is used to generate an obstacle distribution heatmap through a polar coordinate transformation module.
[0012] According to a preferred embodiment, in step S1, the RGB image acquired by the visual sensor is sent to a lightweight network for semantic segmentation to complete obstacle category recognition.
[0013] According to a preferred embodiment, in step S2, the real-time scene complexity index integrates static obstacle distribution, dynamic obstacle speed, and path curvature factors.
[0014] According to a preferred embodiment, the structured instructions in step S4 include: action type, spatial parameters, and velocity constraints; the robot dynamics state includes: load and ground friction coefficient parameters.
[0015] According to a preferred embodiment, step S5 further includes automatically initiating the model fine-tuning process during low-load periods at night.
[0016] According to a preferred embodiment, step S6 further includes: automatically labeling the collected abnormal cases and sending them into the training data pool to form a closed-loop learning process from perception to execution.
[0017] In another aspect, this application also discloses a hierarchical obstacle avoidance system based on the PPO-TD3 algorithm for collaborative training. The system is mounted on a robot and uses the aforementioned method for obstacle avoidance.
[0018] The aforementioned main solution and its various further alternative solutions can be freely combined to form multiple solutions, all of which are solutions that can be adopted and are claimed in this application. Those skilled in the art, after understanding the solution of this application, will realize that there are many combinations based on the prior art and common general knowledge, all of which are technical solutions to be protected in this application, and will not be exhaustively listed here.
[0019] The beneficial effects of this application are: (1) enhanced adaptability to dynamic environments: in densely populated environments, the obstacle avoidance rate of the robot dog can be improved compared with traditional solutions, and the response latency to dynamic obstacles can be reduced; (2) optimized computational efficiency: by dynamically switching working modes, computational costs can be saved.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of the system architecture of this application.
Detailed Implementation
[0021] The following specific examples illustrate the implementation of this application. Those skilled in the art can easily understand other advantages and effects of this application from the content disclosed in this specification. This application can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this application. It should be noted that, unless otherwise specified, the following embodiments and features in the embodiments can be combined with each other.
[0022] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0023] In the description of this application, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, or the orientation or positional relationship commonly used when the product of this application is in use. They are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation on this application. In addition, the terms "first," "second," and "third," etc., are only used to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0024] Furthermore, terms such as "horizontal," "vertical," and "overhanging" do not imply that components must be absolutely horizontal or overhanging, but rather that they can be slightly tilted. For example, "horizontal" simply means that the direction is more horizontal relative to "vertical"; it does not mean that the structure must be completely horizontal.
[0025] In the description of this application, it should also be noted that, unless otherwise expressly specified and limited, the terms "set up," "install," "connect," and "link" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this application based on the specific circumstances.
[0026] Furthermore, it should be noted that unless otherwise specified in this application, the specific structures, connections, positions, power sources, etc. involved are all things that a person skilled in the art can know without creative effort based on the prior art.
[0027] Example 1: A hierarchical obstacle avoidance method based on the PPO-TD3 algorithm for collaborative training includes the following steps.
[0028] Step S1: Multi-source environmental perception and feature extraction, including: obtaining an obstacle distribution heat map based on lidar, completing RGB image acquisition based on visual sensors, completing obstacle category identification; and establishing a velocity-direction vector field for moving obstacles to update motion state prediction in real time.
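The velocity-direction vector field in step S1 can be illustrated with a minimal sketch. It assumes that each moving obstacle is already tracked as a 2D position in the robot frame at two consecutive time steps, and it uses a constant-velocity model for the prediction; the function names and the constant-velocity assumption are illustrative and are not specified by this application.

```python
import numpy as np

def estimate_velocities(prev_xy, curr_xy, dt):
    """Finite-difference velocity (m/s) of each tracked obstacle."""
    prev_xy = np.asarray(prev_xy, dtype=float)
    curr_xy = np.asarray(curr_xy, dtype=float)
    return (curr_xy - prev_xy) / dt

def speed_and_heading(velocities):
    """Split each velocity vector into a speed and a direction angle (rad)."""
    speed = np.hypot(velocities[:, 0], velocities[:, 1])
    heading = np.arctan2(velocities[:, 1], velocities[:, 0])
    return speed, heading

def predict_positions(curr_xy, velocities, horizon):
    """Constant-velocity prediction of obstacle positions after `horizon` seconds."""
    return np.asarray(curr_xy, dtype=float) + velocities * horizon
```

Calling these three functions once per sensor frame would keep the motion state prediction up to date; a real tracker would typically add filtering, which is omitted here.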
[0029] Preferably, the raw point cloud data obtained by the LiDAR in step S1 is used to generate an obstacle distribution heatmap through a polar coordinate transformation module. The RGB image acquired by the visual sensor in step S1 is fed into a lightweight network for semantic segmentation to complete obstacle category recognition.
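As a rough illustration of the polar coordinate transformation, the sketch below bins 2D LiDAR points into an angle-by-range grid and normalizes the counts. The bin counts, the maximum range, and the function name polar_heatmap are assumptions made for this example; the application does not specify them.

```python
import numpy as np

def polar_heatmap(points_xy, n_angle_bins=36, n_range_bins=10, max_range=10.0):
    """Bin 2D LiDAR points (robot frame, metres) into an angle-by-range grid.

    Returns an array of shape (n_angle_bins, n_range_bins) whose entries sum
    to 1 when at least one point lies within max_range.
    """
    points_xy = np.asarray(points_xy, dtype=float).reshape(-1, 2)
    ranges = np.hypot(points_xy[:, 0], points_xy[:, 1])
    angles = np.arctan2(points_xy[:, 1], points_xy[:, 0])  # in [-pi, pi]
    keep = ranges < max_range  # drop returns beyond the area of interest
    heatmap, _, _ = np.histogram2d(
        angles[keep],
        ranges[keep],
        bins=[n_angle_bins, n_range_bins],
        range=[[-np.pi, np.pi], [0.0, max_range]],
    )
    total = heatmap.sum()
    return heatmap / total if total > 0 else heatmap
```

The fraction of non-empty cells in this grid is one possible measure of the LiDAR obstacle density used in step S2.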
[0030] Step S2: Dynamic scene complexity assessment. By combining the obstacle density of LiDAR, visual semantic classification results and dynamic tracking data, the real-time scene complexity index is calculated, and the working mode switching decision can be triggered through the threshold comparator.
[0031] Preferably, in step S2, the real-time scene complexity index integrates static obstacle distribution, dynamic obstacle speed, and path curvature factors.
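One way to read the scene complexity index is as a weighted sum of normalized factors followed by a threshold comparison. The sketch below follows that reading; the weights, reference values, threshold, and all names are hypothetical, since the application does not give a formula.

```python
from dataclasses import dataclass

@dataclass
class SceneFeatures:
    obstacle_density: float   # static obstacle distribution, as a fraction in 0..1
    max_dynamic_speed: float  # speed of the fastest moving obstacle, m/s
    path_curvature: float     # curvature of the planned path, 1/m

def _clip01(x):
    return max(0.0, min(1.0, x))

def complexity_index(f, weights=(0.4, 0.4, 0.2), speed_ref=2.0, curvature_ref=1.0):
    """Weighted sum of the three factors, each normalized to 0..1 (assumed form)."""
    terms = (
        _clip01(f.obstacle_density),
        _clip01(f.max_dynamic_speed / speed_ref),
        _clip01(f.path_curvature / curvature_ref),
    )
    return sum(w * t for w, t in zip(weights, terms))

def is_complex(index, threshold=0.5):
    """Threshold comparator: below the threshold the scene counts as simple."""
    return index >= threshold
```

With these assumed weights, an empty corridor scores close to 0 and a crowded scene with fast pedestrians scores close to 1, so a single threshold separates the two working modes.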
[0032] Step S3: The hierarchical decision engine dynamically schedules the process. When the scene complexity is below the threshold, the bottom-level TD3 control network is activated to directly process the pre-processed sensor data and generate robot motor control signals. When a complex environment is detected, the high-level PPO decision network receives the global feature map and outputs semantic instructions with physical constraints. The bottom-level control network then parses the instructions into action parameters that can be executed.
[0033] In other words, by fusing LiDAR obstacle dynamics, visual semantic information, and path curvature, the scene complexity index is calculated in real time, and a threshold comparison triggers switching of the hierarchical control mode (simple mode: only the underlying TD3 operates; complex mode: PPO and TD3 work collaboratively). This dynamic scene evaluation and hierarchical switching control architecture addresses the problems of excessive response latency and wasted computing resources.
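The mode switching described above can be sketched as a small dispatcher. The two policies are passed in as plain callables, and the stub functions stand in for trained TD3 and PPO networks; their names, the instruction dictionary, and the threshold value are placeholders for illustration only.

```python
def make_scheduler(low_level, high_level, threshold=0.5):
    """Return a step function that picks the working mode from the complexity index."""
    def step(obs, complexity):
        if complexity < threshold:
            # Simple mode: the low-level controller acts on the observation alone.
            return "simple", low_level(obs, None)
        # Complex mode: the high-level policy issues an instruction first.
        instruction = high_level(obs)
        return "complex", low_level(obs, instruction)
    return step

def stub_td3(obs, instruction):
    """Placeholder for the TD3 control network: returns [left, right] wheel speeds."""
    speed = 0.5 if instruction is None else instruction["max_speed"]
    return [speed, speed]

def stub_ppo(obs):
    """Placeholder for the PPO decision network: returns a semantic instruction."""
    return {"action": "go_straight", "max_speed": 0.2}
```

Because the high-level policy is only called in complex mode, its inference cost is not paid in simple scenes, which is the source of the computational saving claimed here.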
[0034] Step S4: Semantic-control instruction conversion. The structured instructions generated by the high-level network are fed into the instruction parser. Feasibility is verified by combining the current robot dynamics state. The abstract instructions are converted into specific motor control quantities through a constrained optimization algorithm to ensure the physical realizability of the instructions.
[0035] Preferably, the structured instructions in step S4 include: action type, spatial parameters, and velocity constraints; the robot dynamics state includes: load and ground friction coefficient parameters.
[0036] This application's method designs a structured semantic instruction protocol. The high-level PPO outputs standardized instructions including action type, ideal parameters, and dynamic constraint boundaries. The low-level TD3 dynamically verifies and corrects the feasibility of the instructions by monitoring physical parameters such as motor status and ground friction in real time, and simultaneously feeds execution constraints back to the high-level strategy optimization, forming a closed-loop constraint negotiation. Thus, by utilizing the semantic-control instruction physical constraint embedding mechanism, it solves the problem of instruction infeasibility and efficiency loss caused by semantic gaps in the interaction between the high-level and low-level layers in traditional solutions.
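A minimal sketch of the structured instruction and its feasibility check follows. It assumes that the only constraints are a friction-limited cornering speed (v² / r ≤ μ·g) and a motor speed limit, in which case the constrained optimization reduces to clamping the requested speed. The field names and limits are hypothetical; the load field is carried in the dynamics state as the text describes but is not used by this friction-only check.

```python
import math
from dataclasses import dataclass, replace

G = 9.81  # gravitational acceleration, m/s^2

@dataclass(frozen=True)
class Instruction:
    action_type: str      # e.g. "turn_right"
    distance_m: float     # spatial parameter
    turn_radius_m: float  # spatial parameter
    max_speed_mps: float  # velocity constraint proposed by the high level

@dataclass(frozen=True)
class DynamicsState:
    load_kg: float                # current payload (unused in this sketch)
    friction_coeff: float         # estimated ground friction coefficient
    motor_speed_limit_mps: float  # speed the motors can currently deliver

def verify_and_correct(instr, state):
    """Clamp the requested speed to the feasible set; return (instruction, violation)."""
    # Cornering without slipping requires v^2 / r <= mu * g.
    friction_limit = math.sqrt(state.friction_coeff * G * instr.turn_radius_m)
    feasible = min(instr.max_speed_mps, friction_limit, state.motor_speed_limit_mps)
    violation = max(0.0, instr.max_speed_mps - feasible)
    return replace(instr, max_speed_mps=feasible), violation
```

The returned violation is the quantity that could be fed back to the high-level policy as part of the closed-loop constraint negotiation.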
[0037] Step S5: Hybrid training and knowledge update. In the simulation environment pre-training stage, a virtual training set containing typical obstacle scenarios is constructed to train the basic capabilities of the decision network and control network respectively. After deployment, network parameters are optimized through human demonstration data, and an incremental learning mechanism is established to continuously absorb real-world scenario data.
[0038] Preferably, step S5 further includes: automatically starting the model fine-tuning process during low-load periods at night.
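The hybrid training data flow might be sketched as follows: mini-batches are drawn from simulation, human demonstration, and real-world pools in fixed proportions, and fine-tuning is only started inside a low-load time window. The pool names, the mixing ratios, and the 01:00 to 05:00 window are assumptions chosen for illustration.

```python
import random
from datetime import time

def in_low_load_window(now, start=time(1, 0), end=time(5, 0)):
    """True when `now` (a datetime.time) falls inside the assumed night window."""
    return start <= now < end

def sample_mixed_batch(pools, ratios, batch_size, rng):
    """Draw a batch whose composition follows `ratios` across the data pools."""
    batch = []
    for name, ratio in ratios.items():
        # Never request more samples than a pool currently holds.
        k = min(len(pools[name]), round(batch_size * ratio))
        batch.extend(rng.sample(pools[name], k))
    return batch
```

A fine-tuning job would call in_low_load_window first and, if it returns True, repeatedly call sample_mixed_batch to feed the update step of each network.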
[0039] Step S6: Cross-level feedback optimization. The execution layer feeds back the control effect, constraint violation, and actual obstacle avoidance results to the decision layer in real time, and coordinates the optimization objective through cross-level loss functions.
[0040] Preferably, step S6 further includes: automatically labeling the collected abnormal cases and sending them into the training data pool to form a closed-loop learning process from perception to execution.
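The feedback in step S6 could take the form of a scalar penalty that combines the three reported quantities, together with a rule that labels abnormal episodes for the training data pool. The sketch below assumes a simple weighted sum; the weights, the tolerance, and the label strings are hypothetical and are not given in this application.

```python
def cross_level_loss(tracking_error, constraint_violation, collided,
                     w_track=1.0, w_violation=2.0, w_collision=10.0):
    """Weighted penalty shared by both levels (assumed form of the cross-level loss)."""
    return (w_track * tracking_error
            + w_violation * constraint_violation
            + w_collision * float(collided))

def label_abnormal(episode, violation_tol=0.1):
    """Return the labels that mark an episode as an abnormal case, if any."""
    labels = []
    if episode["collided"]:
        labels.append("collision")
    if episode["constraint_violation"] > violation_tol:
        labels.append("infeasible_instruction")
    return labels

def update_pool(pool, episode):
    """Append labelled abnormal episodes to the training data pool."""
    labels = label_abnormal(episode)
    if labels:
        pool.append({**episode, "labels": labels})
    return pool
```

In this reading, the penalty would be subtracted from the high-level reward or added to the low-level critic loss, so that both networks are pushed toward instructions the execution layer can actually carry out.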
[0041] This application is based on a dynamic adaptive hierarchical decision-making system, namely the collaborative architecture of PPO and TD3 algorithms, and realizes intelligent obstacle avoidance in dynamic environments through a hierarchical reinforcement learning framework.
[0042] Specifically: in simple environments, only the high-frequency control network of the underlying TD3 algorithm is run to improve response speed, while in complex scenarios the complete hierarchical architecture (the high-level PPO algorithm combined with the underlying TD3 algorithm) is activated to ensure decision quality. At the same time, a semantic command interface is designed so that the abstract commands output by the policy network of the high-level PPO decision layer (such as "turn right 1.5 meters") are parsed into executable actions by the control network of the underlying TD3 execution layer in combination with real-time physical constraints (such as ground friction and motor status). Together with a hybrid training framework that combines simulation training, human demonstration, and real-world learning, and a flexible safety mechanism based on dynamic risk assessment, this achieves joint optimization of safety and operational efficiency in complex dynamic environments.
[0043] Example 2: Referring to Figure 1 and building on Example 1, this application also discloses a hierarchical obstacle avoidance system based on the PPO-TD3 algorithm for collaborative training. The system includes a collaborative architecture of the PPO and TD3 algorithms, is mounted on a robot, and performs obstacle avoidance using the method described in Example 1.
[0044] This application achieves the following beneficial technical effects: improved adaptability to dynamic environments, in that the obstacle avoidance rate of the robot dog in densely populated scenarios can be improved compared with traditional solutions and the response latency to dynamic obstacles can be reduced; and optimized computational efficiency, in that computational costs can be saved by dynamically switching working modes.
[0045] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A hierarchical obstacle avoidance method based on PPO-TD3 algorithm cooperative training, characterized in that the method comprises:
S1: multi-source environment perception and feature extraction, including: obtaining an obstacle distribution heat map based on laser radar, completing RGB image acquisition based on a visual sensor, and completing obstacle class identification; and establishing a velocity-direction vector field for moving obstacles, and updating the motion state prediction in real time;
S2: dynamic scene complexity evaluation, integrating laser radar obstacle density, visual semantic classification results and dynamic tracking data to calculate a real-time scene complexity index, and being able to trigger a working mode switching decision through a threshold comparator;
S3: dynamic scheduling of the hierarchical decision engine, wherein when the scene complexity is lower than the threshold, the bottom layer TD3 control network is activated to directly process the preprocessed sensor data to generate a robot motor control signal; and when a complex environment is detected, the high-level PPO decision network receives the global feature map and outputs a semantic instruction with physical constraints, and the bottom layer control network parses the instruction into executable action parameters;
S4: semantic-control instruction conversion, wherein the structured instruction generated by the high-level network is transmitted to the instruction parser, the feasibility is checked in combination with the current robot dynamics state, and the abstract instruction is converted into a specific motor control quantity through the constrained optimization algorithm, so as to ensure the physical realizability of the instruction;
S5: hybrid training and knowledge updating, wherein a virtual training set containing typical obstacle scenes is constructed in the simulation environment pre-training stage, and the basic ability of the decision network and the control network is trained respectively; after deployment, the network parameter optimization is guided by human demonstration data, and an incremental learning mechanism is established to continuously absorb real scene data;
S6: cross-level feedback optimization, wherein the control effect, constraint violation and actual obstacle avoidance result are fed back to the decision layer in real time, and the optimization target is coordinated through the cross-level loss function.
2. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein the original point cloud data obtained by the laser radar in step S1 is converted into an obstacle distribution heat map by a polar coordinate conversion module.
3. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein the RGB image collected by the visual sensor in step S1 is sent to a lightweight network for semantic segmentation to complete obstacle class identification.
4. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein in step S2, the real-time scene complexity index integrates static obstacle distribution, dynamic obstacle speed and path curvature factors.
5. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein in step S4, the structured instruction includes: action type, spatial parameter and speed constraint; and the robot dynamics state includes: load and ground friction coefficient parameters.
6. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein step S5 further comprises automatically starting the model fine-tuning process during the night low-load period.
7. The hierarchical obstacle avoidance method based on PPO-TD3 algorithm for cooperative training of claim 1, wherein step S6 further comprises: automatically labeling the collected abnormal cases and sending them to the training data pool to form a closed-loop learning process from perception to execution.
8. A hierarchical obstacle avoidance system based on PPO-TD3 algorithm for cooperative training, characterized in that the system is mounted on a robot and uses the method of any one of claims 1 to 7 for obstacle avoidance.