A trajectory planning method and system based on adaptive curriculum residual hierarchical reinforcement learning
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2023-09-11
- Publication Date
- 2026-06-30
AI Technical Summary
Existing autonomous vehicles struggle to balance safety and comfort in complex and uncertain urban traffic environments. Rule-based planning systems lack comprehensive coverage, while single deep learning or reinforcement learning methods are unstable in multi-task driving, making it difficult to guarantee both safety and comfort.
An adaptive curriculum residual hierarchical reinforcement learning trajectory planning method is adopted, which combines fuzzy logic and deep reinforcement learning to decompose the driving task into high-level behavioral decision-making and low-level speed planning. A progressive curriculum learning training strategy is adopted, a safety planning algorithm is used to accelerate model convergence, and a safe and comfortable planned trajectory is generated by the Frenet optimal trajectory generator.
It improves the safety and efficiency of autonomous vehicles in uncertain and dynamic urban environments, enhances the interpretability of the system and the reusability of sub-policies, and enables rapid deployment in the real world with superior performance.
Figure CN117192986B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of autonomous driving behavior planning, specifically involving a trajectory planning method and system based on adaptive curriculum residual hierarchical reinforcement learning.
Background Technology
[0002] In recent years, there has been increasing interest in autonomous vehicles capable of efficient planning in uncertain and dynamic urban traffic environments. While rule-based planning systems are already quite mature in specific driving scenarios such as the DARPA competition, the diversity and uncertainty of traffic scenarios make it difficult to create a rule base that comprehensively covers all dynamic traffic scenarios, leading to criticism that autonomous vehicles are too conservative. In contrast, learning-based methods, such as deep learning or reinforcement learning, mine complex or even novel knowledge from data or the environment, reducing the need for manual rule-making and improving model performance, showing great promise for applications in the field of autonomous driving.
[0003] End-to-end deep reinforcement learning (E2E-DRL) based on Markov decision processes (MDPs) enables autonomous vehicles to maximize rewards or achieve specific goals by optimizing policies during interactions with the environment, exhibiting better generalization ability in uncertain and unknown dynamic scenarios. However, for complex driving tasks with multiple sub-goals, E2E-DRL uses a single network as the entire planning policy of the autonomous driving system, mapping the output of the perception module directly to continuous low-dimensional control commands such as steering wheel angle and accelerator pedal position. Generating control actions directly in this way is often unreliable, making it difficult to guarantee safety and comfort. Furthermore, such a network does not expose the intermediate behavioral decisions of a planner, so it may learn only limited tactical behaviors, such as path following, and lacks clear interpretability.
[0004] Hierarchical reinforcement learning (HRL) methods typically decompose complex autonomous driving tasks into several easily solvable sub-problems. With HRL, the driving task is defined as behavior planning and motion planning: the autonomous vehicle first plans a behavior for the current traffic scenario, such as changing lanes (left, right, or keeping straight), and then uses motion planning to plan the optimal path based on the decision result. This coordinated response is applicable to multiple traffic scenarios. While HRL improves system interpretability and the reusability of sub-policies across multiple tasks, its large state space and diverse action space mean that many high-quality action sequences are needed to train the model, which significantly slows learning. Therefore, this invention develops a novel framework based on Deep Residual Reinforcement Learning (DRRL), which combines the advantages of safe planning algorithms and reinforcement learning. Using safe planning methods in this framework not only speeds up and stabilizes training and guides the agent to a safe region with high average rewards, but also limits the policy search space and avoids undesirable behaviors.
Summary of the Invention
[0005] To address the problems existing in the prior art, this invention proposes an adaptive curriculum residual hierarchical reinforcement learning trajectory planning method and system. It provides an adaptive curriculum residual hierarchical reinforcement learning strategy framework for dynamic urban traffic scenarios, defining the driving task as high-level behavioral decision-making and low-level speed planning. Then, the Frenet optimal trajectory generator generates a safe and comfortable planned trajectory based on the decision results and target speed.
[0006] To achieve the above objectives, the technical solution adopted by this invention is: a method and system for trajectory planning based on adaptive curriculum residual hierarchical reinforcement learning, comprising the following steps:
[0007] Integrating the spatial location and dynamic characteristics of autonomous vehicles using fuzzy logic;
[0008] The CR-HRL decision framework combines rule-based methods with deep reinforcement learning. The spatial position and dynamic characteristics of the autonomous vehicle are input into the CR-HRL decision framework. The output ratio of rule-based and deep reinforcement learning is adaptively adjusted according to the training process to output high-level behavioral decision results and target speed.
[0009] A safe and comfortable planning trajectory is generated based on the results of high-level behavioral decisions and the target speed.
[0010] Furthermore, by using fuzzy logic to analyze the spatial location and dynamic characteristics of autonomous vehicles, traffic rules, surrounding traffic participants, and road condition information from the environment are extracted, and a grid map incorporating the fuzzy speed and deformed pose of surrounding vehicles is constructed, and traffic rules and road condition information are represented uniformly using vectors.
[0011] Furthermore, the PPO algorithm based on the Double Actor-Critic framework is used to train the high-level behavioral decision-making and low-level target velocity planning strategies. The output of the high-level actor network is the lane-changing decision, the output of the low-level actor network is the target velocity, and the output of the critic network is the state value. The high-level and low-level actor networks share the following structure: the Mt branch consists of 3 convolutional layers and 3 fully connected layers, and the Rt branch has only one fully connected layer. The outputs of the Mt branch and the Rt branch are concatenated and passed to the final fully connected layer. The last output layer uses Softmax. The critic network uses the same architecture as the high-level and low-level actor networks with the last output layer removed.
[0012] Furthermore, a progressive learning approach is adopted, with the CR-HRL decision framework trained in three stages: the first stage is ACC adaptive cruise control, the second stage is lane changing, and the third stage is overtaking.
[0013] Furthermore, in the first stage, the residual policy of the agent is randomly initialized. In a single-lane scenario, the residual policy is trained by maximizing the safety reward function. Training stops when cruise control and maintaining a safe distance are achieved. If a lane-changing error occurs in the lane-changing policy, a penalty is imposed.
[0014] In the second stage, in a multi-lane scenario, the ACC pre-trained strategy completed in the first stage is loaded, and the traffic environment is reconfigured by adding additional surrounding traffic participants. The lane-changing strategy in this stage is trained by maximizing the lane-changing reward function.
[0015] In the third stage, an overtaking reward function is added to ensure that the final training strategy can avoid collisions, complete overtaking when necessary, and return to the original lane.
[0016] Furthermore, in the third stage of training, passenger comfort is considered in the low-level target speed policy. When the acceleration changes abruptly (jitter), a non-smoothness penalty is applied.
[0017]
[0018] The final reward function is:
[0019]
[0020]
[0021] Here the three reward terms are the safety reward function, the lane-change reward function, and the overtaking reward function, respectively. The weight parameters $w_{change\_lane}$, $w_{safe}$, $w_{overtake}$, $w_{eff}$, and $w_{acc}$ change as the curriculum progresses through its stages.
[0022] Furthermore, the input state $s_t$ of the CR-HRL decision framework includes two parts: observations for the residual reinforcement learning policy and observations for the rule-based policy; the rule-based observations...
[0023]
[0024] where $v_f$ is the speed of the vehicle in front, $v_r$ is the speed of the following vehicle, $v_{max}$ is the maximum vehicle speed, $a_{max}$ is the maximum acceleration, $d_{max}$ is the maximum deceleration, $L$ is the vehicle length, $g$ is the distance between the preceding and following vehicles, $\Gamma$ is the driver's reaction time, and $\epsilon$ is the driver's imperfection in holding the desired speed, corresponding to the level of driving proficiency: the larger the value of $\epsilon$, the less proficient the driver. $\Gamma \in (0,1]$;
[0025] The observations of the residual policy consist of two parts: the highly dynamic autonomous vehicle and surrounding traffic participants, and the structured road conditions and fixed traffic rules.
[0026] The present invention also provides a trajectory planning system based on adaptive curriculum residual hierarchical reinforcement learning, including a state acquisition module, a behavior and velocity planning module, and a trajectory generation module;
[0027] The state acquisition module is used to integrate the spatial position and dynamic characteristics of autonomous vehicles using fuzzy logic;
[0028] The behavior and speed planning module is used to combine rule-based methods with deep reinforcement learning to form the CR-HRL decision framework. The spatial position and dynamic characteristics of the autonomous vehicle are input into the CR-HRL decision framework, and the output ratio of rule-based and deep reinforcement learning is adaptively adjusted according to the training process to output high-level behavior decision results and target speed.
[0029] The trajectory generation module is used to generate a safe and comfortable planned trajectory based on the results of high-level behavioral decisions and the target speed.
[0030] A computer device is also provided, including a processor and a memory. The memory stores a computer-executable program. The processor reads the computer-executable program from the memory and executes it; when executed, the program implements the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method described in this invention.
[0031] The present invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can implement the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method described in the present invention.
[0032] Compared with the prior art, the present invention has at least the following beneficial effects:
[0033] This invention proposes an adaptive residual hierarchical policy framework that balances policy safety and efficiency. It uses fuzzy logic to integrate the spatial location and dynamic characteristics of the autonomous vehicle as input. Simultaneously, the model uses residual learning to integrate rule-based methods and DRL, adaptively adjusting the output ratio of rule-based and DRL during training to improve system interpretability. The rule-based policy serves as soft guidance, using residuals to guide the model to converge quickly to a safe region. Using fuzzy logic ensures the consistency and coupling of the spatial location and physical characteristics of surrounding traffic participants, contributing to more refined decision-making in policy learning.
[0034] Furthermore, this invention proposes a progressive three-stage curriculum learning algorithm that transfers knowledge gained from single driving tasks to the more complex overtaking task, improving learning speed and the reusability of sub-policies. The proposed CR-HRL framework was validated on the SUMO and ROS simulation platforms, demonstrating superior performance in both training and testing scenarios. Moreover, the method of this invention can be deployed to real-world vehicles without fine-tuning the trained network.
[0035] Furthermore, this invention reconstructs three-stage simulation scenarios based on high-definition real-world maps and accelerates the training of the proposed method using progressive curriculum training, thereby improving the interpretability of the strategy and the reusability of sub-policies. Experiments demonstrate that in uncertain and dynamic urban environments, the method of this invention outperforms rule-based methods and RL baselines in terms of driving efficiency while maintaining safe driving.
Attached Figure Description
[0036] Figure 1 is a framework diagram of the adaptive curriculum residual hierarchical reinforcement learning lane-changing strategy.
[0037] Figure 2 is a diagram of the CR-HRL network structure.
[0038] Figure 3 is a structured feature extraction map of a traffic scene.
[0039] Figure 4 is the grid map extracted for surrounding traffic participants and road features.
[0040] Figure 5 is a simulation diagram of multiple traffic scenarios.
[0041] Figure 6 is a comparison diagram of the integrated state experiments for spatial position and dynamic characteristics.
[0042] Figure 7 is a comparison diagram of the overtaking strategy training experiment.
[0043] Figure 8 is a comparison diagram of the residual decision-making experiments.
[0044] Figure 9 is a comparison diagram of the baseline experiments.
Detailed Implementation
[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0046] As shown in Figure 1, this invention presents a CR-HRL (Curriculum Residual Hierarchical Reinforcement Learning) framework for high-quality decision-making and trajectory generation in various autonomous driving scenarios. The agent analyzes traffic rules, surrounding traffic participants, and road condition information from the environment, and outputs a high-level behavioral decision and a low-level longitudinal target velocity for the future time step t. Then, a Frenet optimal trajectory generator generates a safe and comfortable planned trajectory based on the decision result and target velocity. Fuzzy logic is used to integrate the spatial position and dynamic characteristics of the autonomous vehicle as input to the CR-HRL framework, making the decision results more refined. This invention also proposes a progressive curriculum learning scheme to train the policy framework effectively. The framework consists of two residual policies, as shown in Figure 2:
[0047] • SL2015 lane-changing model and discrete residual RL agent for high-level behavioral decision-making.
[0048] • Rule-based improved Krauss car-following model and continuous residual RL agent for low-level velocity decisions.
[0049] Within the hierarchical residual strategy framework, the SL2015 lane-changing model and the Krauss safe distance car-following model are used as shortcut connections in the residual network, serving as soft guidance for the framework. A dual Actor-Critic (DAC) architecture is employed to train the framework. Then, the Frenet optimal trajectory generator generates a safe and comfortable planned trajectory based on the decision results and target speed. The Krauss safe distance car-following model, compared to the commonly used IDM model, avoids premature deceleration, improving vehicle efficiency at intersections and making it more suitable for urban scenarios. The SL2015 lane-changing model is a built-in lane-changing model in the SUMO software, which can be used for sub-lane simulation. Its additional behavior layer is responsible for maintaining safe lateral clearance, effectively simulating vehicle lane-changing behavior.
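The text does not spell out the internals of the Frenet optimal trajectory generator. As a minimal sketch, assuming the widely used formulation in which the lateral offset in the Frenet frame follows a quintic polynomial and candidate durations are ranked by a jerk-plus-time cost, a lane-change profile could be generated as follows. The lane width of 3.5 m, the candidate horizons, and the cost weights `k_jerk` and `k_time` are illustrative assumptions, not values from this invention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quintic:
    """Lateral offset d(t) = a0 + a1*t + ... + a5*t^5 on the interval [0, T]."""
    a: tuple

    @classmethod
    def fit(cls, d0, v0, acc0, d1, v1, acc1, T):
        # Closed-form coefficients that match position, velocity and
        # acceleration at both t = 0 and t = T.
        a3 = (20 * (d1 - d0) - (8 * v1 + 12 * v0) * T
              - (3 * acc0 - acc1) * T ** 2) / (2 * T ** 3)
        a4 = (30 * (d0 - d1) + (14 * v1 + 16 * v0) * T
              + (3 * acc0 - 2 * acc1) * T ** 2) / (2 * T ** 4)
        a5 = (12 * (d1 - d0) - (6 * v1 + 6 * v0) * T
              - (acc0 - acc1) * T ** 2) / (2 * T ** 5)
        return cls((d0, v0, acc0 / 2.0, a3, a4, a5))

    def pos(self, t):
        return sum(c * t ** i for i, c in enumerate(self.a))

    def jerk(self, t):
        _, _, _, a3, a4, a5 = self.a
        return 6 * a3 + 24 * a4 * t + 60 * a5 * t ** 2


def best_lateral_plan(d_target, horizons=(2.0, 3.0, 4.0, 5.0),
                      k_jerk=0.1, k_time=1.0, steps=50):
    """Pick the duration T whose quintic has the lowest jerk-plus-time cost.

    The cost form and weights are assumptions made for this sketch.
    """
    best = None
    for T in horizons:
        q = Quintic.fit(0.0, 0.0, 0.0, d_target, 0.0, 0.0, T)
        dt = T / steps
        # Rectangle-rule approximation of the integral of squared jerk.
        jerk_cost = sum(q.jerk(i * dt) ** 2 for i in range(steps + 1)) * dt
        cost = k_jerk * jerk_cost + k_time * T
        if best is None or cost < best[0]:
            best = (cost, T, q)
    return best


# Example: plan a lane change to the adjacent lane (assumed 3.5 m wide).
cost, T, plan = best_lateral_plan(3.5)
```

Short horizons give high jerk and long horizons cost time, so the selected duration is a compromise between comfort and efficiency; the high-level decision only has to supply the target lane offset.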
[0050] A grid map incorporating the fuzzy velocity and deformed pose of surrounding vehicles is constructed, and traffic rules and road condition information are uniformly represented using vectors. To describe surrounding traffic participants, this invention considers all vehicles within an 80-meter sensor range in front and behind. This is described by the following features:
[0051] 1) State: At time step t, the input state $s_t$ of the CR-HRL strategy framework consists of two parts: the observations of the residual reinforcement learning policy and the observations of the rule-based policy. $s_t$ is defined as:
[0052]
[0053]
[0054]
[0055] In the rule-based observations, $v_f$ is the speed of the vehicle in front, $v_r$ is the speed of the following vehicle, $v_{max}$ is the maximum vehicle speed, $a_{max}$ is the maximum acceleration, $d_{max}$ is the maximum deceleration, $L$ is the vehicle length, $g$ is the distance between the vehicles in front and behind, $\Gamma$ is the driver's reaction time, and $\epsilon$ is the driver's imperfection in holding the desired speed, which can be understood as the level of driving proficiency: the larger the value, the less proficient the driver. $\Gamma \in (0,1)$.
[0056] The observations of the residual policy consist of two parts: the highly dynamic autonomous vehicle and surrounding traffic participants, and the structured road conditions and fixed traffic rules.
[0057] For fixed and clearly defined traffic rules and road conditions, as shown in Figure 3, this invention uses a one-hot vector $R_t$ as the representation:
[0058] The existence and direction of the left lane, the current lane, and the right lane are defined as $[l_e, l_d, c_e, c_d, r_e, r_d]$, where $l_e$, $c_e$, and $r_e$ take the values 0, 0.5, or 1: 0 means the lane does not exist, 1 means it exists, and 0.5 means vehicles can use the lane only for a certain period of time. The state of the traffic light is defined as $[l_r, l_y, l_g]$.
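As an illustration of the vector described above, the following sketch assembles $[l_e, l_d, c_e, c_d, r_e, r_d]$ followed by $[l_r, l_y, l_g]$ into one flat list. The function name `encode_road_state`, the direction encoding (1 for the same direction as the ego vehicle, 0 for the opposite direction), and the ordering of the concatenation are assumptions made for this example; the source specifies only the existence values and the traffic light triple.

```python
VALID_EXISTENCE = (0.0, 0.5, 1.0)   # absent, temporarily usable, present
LIGHTS = ("red", "yellow", "green")


def encode_road_state(left, current, right, light):
    """Build R_t from three (existence, same_direction) pairs and a light color.

    `light` may be None when no traffic light is in view.
    """
    vec = []
    for exists, same_direction in (left, current, right):
        if exists not in VALID_EXISTENCE:
            raise ValueError(f"existence must be one of {VALID_EXISTENCE}")
        vec.append(float(exists))
        # Direction encoding is an assumption of this sketch.
        vec.append(1.0 if same_direction else 0.0)
    # One-hot traffic light state [l_r, l_y, l_g].
    vec.extend(1.0 if light == name else 0.0 for name in LIGHTS)
    return vec


# Left lane is an opposite-direction lane usable for a limited time,
# the current lane exists, there is no right lane, and the light is green.
r_t = encode_road_state((0.5, False), (1.0, True), (0.0, True), "green")
```

In this example `r_t` has nine entries, with the 0.5 in the first position marking the left lane as a temporary transition lane.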
[0059] To describe the autonomous vehicle and surrounding traffic participants, this invention considers all vehicles within a 40-meter sensor range in front and behind. This invention uses the following characteristics to describe the autonomous vehicle and surrounding traffic participants.
[0060] • Spatial Location: This invention uses an occupancy grid map Mt centered on the autonomous vehicle to represent the sensor data at each time step t. Mt is built on the curvilinear coordinates defined by the lane the vehicle follows. Because the curved grid map is uniformly transformed into a straight road, training can be performed using straight-road data. In addition to actual traffic participants, certain virtual obstacles, such as stop lines at intersections, are also projected onto the grid map.
[0061] • Dynamic Characteristics: In real-world scenarios, drivers cannot perceive the exact speed of surrounding vehicles; they know only vague information such as "fast" or "slow," yet they can still make reasonable lane-changing decisions. With reference to the real-time speed of the autonomous vehicle and the road speed limit, the speeds of surrounding vehicles are divided into seven fuzzy sets {NB, NM, NS, Z, PS, PM, PB}, and these seven fuzzy sets are projected onto the grid map Mt using gradient colors. Oncoming lanes also appear in the training scenario, so lanes in the two driving directions are distinguished: as an example, this invention uses a red gradient for vehicle speeds in one direction and a green gradient for vehicle speeds in the other. Finally, the state space is described by the deformable map Mt and the road contour Rt, as shown in Figure 4.
[0062] 2) Actions: Referring to Figure 2, at time step t the output action of the CR-HRL strategy framework is composed of the high-level behavioral decision and the low-level target velocity plan. Both levels of actions are generated using residuals, as follows:
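The two features above can be sketched together: surrounding vehicles are placed into a lane-aligned grid, and each occupied cell stores a fuzzy speed level instead of a binary flag. This is a minimal sketch under several assumptions not stated in the source: triangular membership functions evenly spaced over the normalized relative speed, a 2 m cell size, three lanes, and a single intensity channel standing in for the red and green gradients.

```python
LABELS = ("NB", "NM", "NS", "Z", "PS", "PM", "PB")
CENTERS = [-1 + i / 3 for i in range(7)]   # evenly spaced peaks on [-1, 1]


def fuzzify(v_other, v_ego, v_limit):
    """Map a surrounding vehicle's speed to one of the seven fuzzy sets."""
    # Relative speed normalized by the speed limit and clipped to [-1, 1].
    x = max(-1.0, min(1.0, (v_other - v_ego) / v_limit))
    # Triangular memberships with half-width 1/3 (an assumed shape).
    memberships = [max(0.0, 1.0 - abs(x - c) * 3.0) for c in CENTERS]
    return LABELS[memberships.index(max(memberships))]


def build_grid(v_ego, v_limit, vehicles, n_lanes=3, sensor_range=40.0, cell=2.0):
    """Return an n_lanes x n_cols grid in straightened lane coordinates.

    `vehicles` holds (lane_index, longitudinal_offset_m, speed_mps) tuples
    relative to the ego vehicle, which sits at the grid center.
    """
    n_cols = int(2 * sensor_range / cell)
    grid = [[0.0] * n_cols for _ in range(n_lanes)]
    for lane, s_rel, speed in vehicles:
        if not (0 <= lane < n_lanes) or abs(s_rel) >= sensor_range:
            continue  # outside the sensed area
        col = int((s_rel + sensor_range) / cell)
        level = LABELS.index(fuzzify(speed, v_ego, v_limit))
        grid[lane][col] = (level + 1) / len(LABELS)  # gradient intensity
    return grid


# One vehicle 10 m ahead in the middle lane, driving at the ego speed.
m_t = build_grid(10.0, 15.0, [(1, 10.0, 10.0)])
```

Because position and speed share one cell, two vehicles at the same location but with different speeds produce different inputs, which is the coupling the integrated state is meant to provide.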
[0063]
[0064]
[0065]
[0066] The high-level behavioral decision is composed of the high-level residual policy and the SL2015-based lane-changing model, as follows:
[0067]
[0068] ε will adaptively adjust based on training performance. In the early stages of training, the agent is guided by a lane-changing model based on SL2015, leading it to regions with high average rewards. In the later stages of training, residual reinforcement learning optimizes safe and conservative tactical strategies.
[0069] The low-level target velocity plan is composed of the low-level residual policy and the low-level rule-based improved Krauss car-following model, as follows:
[0070]
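The exact combination formulas are not reproduced in this text, so the following is only one plausible reading of the residual scheme: a mixing weight `eps` shifts authority from the rule-based output to the learned residual as training performance improves. The convex mixing of action probabilities, the additive speed residual, and the threshold-based update rule in `update_eps` are all assumptions of this sketch, as are the function names.

```python
def mix_lane_decision(rule_action, policy_probs, eps):
    """Blend the rule-based lane choice with the residual policy's probabilities.

    rule_action: index chosen by the rule-based model (e.g. 0=left, 1=keep, 2=right).
    policy_probs: action probabilities from the residual policy.
    eps: weight on the learned policy in [0, 1]; 0 means pure rule-based.
    """
    mixed = [eps * p for p in policy_probs]
    mixed[rule_action] += 1.0 - eps
    action = max(range(len(mixed)), key=mixed.__getitem__)
    return action, mixed


def mix_target_speed(v_rule, residual, eps, v_max):
    """Add a scaled learned correction to the car-following speed."""
    return max(0.0, min(v_max, v_rule + eps * residual))


def update_eps(eps, recent_reward, reward_target, step=0.05):
    """Grow eps when recent performance meets the target, shrink it otherwise."""
    if recent_reward >= reward_target:
        return min(1.0, eps + step)
    return max(0.0, eps - step)


# Early in training (eps = 0) the rule-based choice wins outright.
early_action, _ = mix_lane_decision(0, [0.2, 0.7, 0.1], eps=0.0)
# Late in training (eps = 1) the residual policy decides.
late_action, _ = mix_lane_decision(0, [0.2, 0.7, 0.1], eps=1.0)
```

With this reading, the rule-based models act as the shortcut connection: the agent starts from safe, interpretable behavior and only deviates from it as far as `eps` allows.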
[0071] To improve the interpretability and reusability of the system, this invention employs progressive learning, allowing the agent to apply knowledge gained from single driving tasks to solve more complex driving tasks. Referring to Figure 1, the overtaking task is broken down into three stages: ACC adaptive cruise control, lane changing, and overtaking. The corresponding training process is as follows:
[0072] In the first stage, the residual policy of the agent is randomly initialized. In a single-lane scenario, the residual policy is trained by maximizing the safety reward function to achieve constant-speed cruising and safe distance maintenance. Training stops once adaptive cruise control is achieved.
[0073]
[0074] In the first stage, the lane-changing policy only needs to comply with the global plan, and a penalty is imposed if a lane-changing error occurs.
[0075] For the low-level speed residual policy in the first stage, a speed reward that encourages both efficiency and safety is set up, as follows:
[0076]
[0077]
[0078] $v_{ref} = \min[\,v(t) + a\Delta t,\ v_{limit},\ v_{safe}(t)\,]$,
[0079]
[0080] where $v_{limit}$ is the road speed limit, $v_{safe}(t)$ is the safe speed, $v_f$ is the speed of the vehicle in front, $g$ is the distance between vehicles, $v(t)$ is the current vehicle speed, $b$ is the maximum deceleration of the vehicle, and $\Gamma$ is the driver's reaction time. The term $v(t) + a\Delta t$ is included to make the agent's velocity changes smoother.
[0081] In the second stage, the strategy is further optimized to improve speed and lane-changing efficiency. In multi-lane scenarios, the ACC pre-trained strategy completed in the first stage is loaded, and the traffic environment is reconfigured by adding additional surrounding traffic participants. These traffic participants are controlled by the strategy built into the SUMO platform. The lane-changing strategy in this stage is trained to achieve safe lane changing and high-speed driving by maximizing the lane-changing reward function. The vehicle gains speed by changing lanes while ensuring safety, thereby reaching its destination more efficiently.
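The expression for the safe speed is not reproduced in this text. As a hedged sketch, the classical Krauss safe-speed formula uses exactly the quantities listed above, so it is assumed here in place of the improved variant used by the invention; the clamp at zero in `reference_speed` is an added guard, not part of the stated formula, and both function names are illustrative.

```python
def krauss_safe_speed(v_leader, v_ego, gap, b, tau):
    """Classical Krauss safe speed (assumed form).

    v_safe = v_f + (g - v_f * tau) / ((v_f + v) / (2 * b) + tau)
    v_leader: speed of the vehicle in front (m/s)
    v_ego: current speed of the ego vehicle (m/s)
    gap: distance to the vehicle in front (m)
    b: maximum deceleration (m/s^2), tau: driver reaction time (s)
    """
    return v_leader + (gap - v_leader * tau) / ((v_leader + v_ego) / (2.0 * b) + tau)


def reference_speed(v_ego, a, dt, v_limit, v_safe):
    """v_ref = min[v(t) + a*dt, v_limit, v_safe(t)], clamped at zero."""
    return max(0.0, min(v_ego + a * dt, v_limit, v_safe))


# Following a leader at equal speed with a gap of exactly one reaction distance.
v_safe = krauss_safe_speed(v_leader=10.0, v_ego=10.0, gap=10.0, b=4.5, tau=1.0)
v_ref = reference_speed(v_ego=10.0, a=2.0, dt=0.1, v_limit=13.89, v_safe=v_safe)
```

When the gap equals the distance covered during the reaction time, the safe speed equals the leader's speed, so the reference speed holds at 10 m/s even though the acceleration term and the speed limit would permit more.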
[0082]
[0083] where $\max \mathrm{dist}(v_1, v_2, \ldots, v_n)$ is the furthest distance between a participating vehicle and the agent within a detection range of 80 meters ahead on a multi-lane road, and the subscript $f$ refers to the vehicle ahead in the lane currently occupied by the agent. Note that the lane existence state input in the second stage takes only the values 0 or 1.
[0084] In the third stage, a more sophisticated overtaking strategy is trained. The model in the second stage can only perform lane-changing tasks, while many real-world traffic scenarios require overtaking, where vehicles need to use transition lanes to accelerate past vehicles in their original lanes and then return to their original lanes appropriately. In this stage, an overtaking reward function is added to ensure that the final trained strategy can avoid collisions, achieve efficient driving, and, when necessary, complete overtaking and return to the original lane.
[0085]
[0086] where $t_{allow\_driving}$ is the permitted driving time for the agent in the current lane, and $\Delta t$ is the actual time the agent has driven in the current lane. If the current lane has $c_e = 1$, the value is 0, indicating that overtaking is not necessary. If the current lane has $c_e = 0.5$, the current lane is an intermediate transition lane for the overtaking task, and the agent needs to switch back to the original lane within the permitted time.
[0087] Finally, passenger comfort is considered in the low-level target speed policy of the final stage. When the acceleration changes abruptly (jitter), a non-smoothness penalty is imposed:
[0088]
[0089] Therefore, the final reward function is:
[0090]
[0091]
[0092] The weight parameters, such as $w_{eff}$ and $w_{acc}$, change as the curriculum progresses. For example, the second stage focuses on the speed benefit of lane changes, so the weights emphasized in the first stage, such as $w_{safe}$, are reduced; the same applies to the other stages.
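To make the stage-dependent weighting concrete, the sketch below assumes the final reward is a weighted sum of the individual reward terms and that each curriculum stage supplies its own weight table. The weighted-sum form, the stage names, every numeric weight, and the threshold in `smoothness_penalty` are illustrative assumptions; the source states only that the weights change between stages and that abrupt acceleration changes are penalized.

```python
# Hypothetical per-stage weights; only the trend (safety weight shrinking as
# lane-change and overtaking terms are introduced) follows the description.
STAGE_WEIGHTS = {
    "cruise":      {"safe": 1.0, "eff": 0.5, "change_lane": 0.0, "overtake": 0.0, "acc": 0.0},
    "lane_change": {"safe": 0.6, "eff": 0.8, "change_lane": 1.0, "overtake": 0.0, "acc": 0.0},
    "overtake":    {"safe": 0.5, "eff": 0.8, "change_lane": 0.8, "overtake": 1.0, "acc": 0.3},
}


def smoothness_penalty(prev_acc, acc, threshold=1.0):
    """Return a negative value when acceleration jumps by more than `threshold`."""
    jump = abs(acc - prev_acc)
    return -jump if jump > threshold else 0.0


def total_reward(stage, terms):
    """Weighted sum of the reward terms that are active in the given stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(w * terms.get(name, 0.0) for name, w in weights.items())


# In the cruise stage the lane-change term is ignored (its weight is zero).
r = total_reward("cruise", {"safe": 1.0, "eff": 1.0, "change_lane": 5.0})
```

Keeping the weights in a table per stage means a policy trained in one stage can be reloaded in the next with only the reward configuration changed.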
[0093] Finally, the network structure and details of the method described in this invention are introduced in detail. This invention uses the Proximal Policy Optimization (PPO) algorithm based on the Double Actor-Critic framework to train the high-level behavioral decision-making and low-level target velocity planning strategies. The network inputs are the deformable map Mt and the road contour Rt. The output of the high-level actor network is the lane-changing decision, the output of the low-level actor network is the target velocity, and the output of the critic network is the state value.
[0094] The actor network architecture consists of three convolutional layers and three fully connected layers in the Mt branch, and only one fully connected layer in the Rt branch due to its simpler information. The outputs of the Mt and Rt branches are concatenated and fed into the final fully connected layer, which uses Softmax for the last output layer. The critic network has the same architecture, except for the last layer. The network training hyperparameters are shown in Table 1.
[0095]
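For reference, the clipped surrogate objective that PPO maximizes for an actor can be sketched in plain Python as follows. This shows the standard PPO-clip objective only; how it is applied to the two actors of the Double Actor-Critic framework, and the clip range of 0.2, are assumptions, since the hyperparameter table is not reproduced in this text.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip=0.2):
    """Mean clipped surrogate objective over a batch of samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: estimated advantages.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # probability ratio
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside the clip range in the direction favored by the advantage.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The same routine could be evaluated once per actor, e.g. with the lane-change
# log-probabilities for the high-level actor and the target-velocity
# log-probabilities for the low-level actor.
objective = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
```

In the usage line the probability ratio is 2, so with a positive advantage the contribution is capped at the upper clip bound rather than growing with the ratio.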
[0096] This invention demonstrates the application results of CR-HRL in dynamic urban scenarios. Simultaneously, the effectiveness of the proposed method is proven through comparative experiments and ablation studies using a heuristic rule enumeration strategy and some baseline RL. The algorithm of this invention was tested in a joint simulator of SUMO and ROS and deployed in real-world scenarios.
[0097] This invention constructs a training environment based on the traffic flow autonomous driving simulator SUMO and the physical simulation environment ROS+Gazebo. High-level behavioral decision-making and low-level target velocity generation are first performed in the SUMO traffic flow simulator without any motion planning, which accelerates the training process of CR-HRL. Once the trained strategy performs well in SUMO, the SUMO and ROS platforms are connected. In ROS, the Frenet optimal trajectory generator generates a safe and comfortable planned trajectory based on the decision results and target velocity to verify the cross-platform robustness of the algorithm. This invention designs three traffic scenarios to progressively train the method, as shown in Figure 5.
[0098] • One-way multi-lane: This training scenario corresponds to the first stage of curriculum training. Traffic lights and a small number of vehicles are placed to ensure that the autonomous vehicle can abide by traffic rules, correctly follow the global navigation, and achieve constant-speed cruising and safe distance maintenance, i.e., the adaptive cruise function.
[0099] • Crossroads scenario: This scenario sets up multiple traffic participants, allowing autonomous vehicles to navigate through dense traffic flow, accelerate by changing lanes, and ensure traffic flow in the dynamic and complex crossroads scenario, while also ensuring the safety of the autonomous vehicles themselves.
[0100] • Dynamic city scenarios: A nine-grid map collected from the Changshu Autonomous Driving Test Base in China is imported into SUMO. This map includes various driving tasks such as single-lane driving, multi-lane driving, driving at intersections with traffic lights, and overtaking.
[0101] The parameter settings for the SUMO simulation platform are shown in the table below.
[0102]
[0103] Spatial Location and Dynamic Characteristics Integrated State: This invention compares the performance of three inputs to CR-HRL: the proposed integrated autonomous vehicle state map (IOGM-V), the original grid map (OGM), and the original grid map combined with a vector of surrounding vehicle speed information (OGM-V). When the input is OGM, the average training reward per episode is lower than that of OGM-V and IOGM-V. This is because, for OGM, vehicles appearing at the same spatial location have the same weight despite different dynamic characteristics. This leads to more lane changes during training but lower performance. Compared to OGM, OGM-V shows a significant performance improvement, but because spatial location and dynamic characteristics are separated, the policy needs longer to train and to relate the two, and its performance is slightly lower than that of IOGM-V. Therefore, as shown in Figure 6, the proposed integrated state of spatial location and dynamic characteristics significantly improves the performance of the policy compared to the other two inputs.
[0104] Overtaking Curriculum Strategy Training: To demonstrate the impact of progressive curriculum learning on strategy training, this invention compares the training curves of CR-HRL and R-HRL. This invention uses a one-way multi-lane training scenario and trains CR-HRL in three stages; for R-HRL, this invention directly trains the neural network strategy by maximizing the overtaking reward (Equation 1). The average reward of CR-HRL shows a significant jump in the first and second stages, is consistently higher than that of R-HRL, and continues to increase in the third stage. As shown in Figure 7, the proposed CR-HRL outperforms R-HRL in both sample efficiency and policy performance.
[0105] To demonstrate the impact of residual learning on policy training, this invention compares the training curves of C-HRL and CR-HRL. This invention uses a one-way multi-lane training scenario, with both methods employing curriculum learning. The results show that CR-HRL converges faster than C-HRL in the first and second stages, and its average policy performance is better. Furthermore, this invention compares against the currently popular IL-HRL, which first initializes the high-level lane-changing model and the low-level speed planning policy from SL2015 and the Krauss model through imitation learning. This is equivalent to the first stage and a less effective second stage of curriculum training. Based on this initialization, the final policy is trained using (Equation 1). As shown in Figure 8, the method proposed in this invention demonstrates superior sample efficiency and policy performance compared to IL-HRL.
[0106] This invention designs three advanced rule-based planning methods and two popular RL baseline methods and compares them with the proposed CR-HRL: 1) the SL2015 lane-changing model combined with the Krauss safe car-following model; SL2015 and Krauss are currently the best-performing built-in lane-change behavior planning and cruise control methods in SUMO; 2) the LC2013 lane-changing model combined with the IDM intelligent car-following model; 3) adaptive trajectory planning based on multiple models; 4) HRL-Residual, an end-to-end residual reinforcement learning method that trains the agent using Equation 1; 5) HRLfd, hierarchical reinforcement learning from expert demonstrations. These models were trained in different training environments (Figure 1), and fifty tests were conducted at the test site (Figure 2). To ensure fairness, the starting and ending points were randomly initialized while guaranteeing the same traffic flow. As shown in Figure 9, the method of this invention surpasses the baselines after approximately X to Y environmental steps. Starting from environmental step x, the policy gradually stabilizes. The method of this invention outperforms the baselines in all environmental settings.
[0107] This invention provides a trajectory planning method based on adaptive curriculum residual hierarchical reinforcement learning (CR-HRL) for dynamic urban traffic scenarios. As shown in Figure 1, the driving task is decomposed into high-level behavioral decision-making and low-level speed planning. A Frenet optimal trajectory generator then generates a safe and comfortable planned trajectory based on the decision result and the target speed. A rule-based safe planning model is used to initialize the residual reinforcement policy, guiding the agent toward regions with high average reward. The residual reinforcement policy is then used to optimize and improve the tactical policy. This invention uses fuzzy logic to integrate the spatial position and dynamic characteristics of the autonomous vehicle as the input to CR-HRL, so that a more refined autonomous driving policy can be learned. To accelerate policy learning, a progressive curriculum is designed that applies knowledge gained from single driving tasks to more complex driving tasks. A progressive reward formula is designed for complex driving tasks to improve training efficiency and to enhance the reusability and generalization of the policy. Furthermore, the proposed method was trained on the SUMO simulation platform and achieved superior performance. Without fine-tuning, it can be deployed on a ROS-based simulation platform and tested in real-world scenarios, which verifies the cross-platform robustness of the proposed method.
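As a non-limiting illustration, the progressive reward design can be sketched as a weighted sum whose weights depend on the curriculum stage. The sketch assumes the three stages named in the claims (adaptive cruise control, lane changing, overtaking) and a comfort penalty that becomes active in the final stage. The numeric weights, the names, and the simple additive form are hypothetical placeholders and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageWeights:
    """Weights of the reward terms in one curriculum stage (values are assumed)."""
    safety: float
    lane_change: float
    overtake: float
    comfort: float


# Hypothetical weights per curriculum stage; the actual values are not given here.
CURRICULUM = {
    1: StageWeights(safety=1.0, lane_change=0.0, overtake=0.0, comfort=0.0),  # ACC
    2: StageWeights(safety=1.0, lane_change=1.0, overtake=0.0, comfort=0.0),  # lane changing
    3: StageWeights(safety=1.0, lane_change=1.0, overtake=1.0, comfort=0.5),  # overtaking
}


def total_reward(stage, r_safety, r_lane_change, r_overtake, jerk_penalty):
    """Weighted sum of the reward terms for the given stage.

    jerk_penalty is a non-negative unsmoothness cost and is subtracted.
    """
    w = CURRICULUM[stage]
    return (w.safety * r_safety
            + w.lane_change * r_lane_change
            + w.overtake * r_overtake
            - w.comfort * jerk_penalty)
```

With such a table, a policy pre-trained in one stage can be reloaded in the next stage while only the reward weights and the traffic configuration change.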
[0108] Based on the same concept, this invention provides a trajectory planning system based on adaptive curriculum residual hierarchical reinforcement learning, including a state acquisition module, a behavior and speed planning module, and a trajectory generation module.
[0109] The state acquisition module is used to integrate the spatial position and dynamic characteristics of autonomous vehicles using fuzzy logic;
[0110] The behavior and speed planning module is used to combine rule-based methods with deep reinforcement learning to form the CR-HRL decision framework. The spatial position and dynamic characteristics of the autonomous vehicle are input into the CR-HRL decision framework, and the output ratio of rule-based and deep reinforcement learning is adaptively adjusted according to the training process to output high-level behavior decision results and target speed.
[0111] The trajectory generation module is used to generate a safe and comfortable planned trajectory based on the results of high-level behavioral decisions and the target speed.
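As a non-limiting illustration, the adaptive output ratio can be read as a weight that shifts from the rule-based planner toward the learned residual as training progresses. The sketch below assumes a linear schedule and an additive residual on the target speed. The schedule, the clipping bounds, and all names are hypothetical, since the adjustment rule is not restated in this section.

```python
def residual_weight(step, warmup_steps):
    """Assumed linear schedule: 0 at the start of training, 1 after warmup_steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def blended_target_speed(rule_speed, residual, step, warmup_steps, v_max):
    """Rule-based target speed plus a progressively weighted learned correction."""
    alpha = residual_weight(step, warmup_steps)
    speed = rule_speed + alpha * residual
    # Keep the command within the admissible speed range [0, v_max].
    return min(v_max, max(0.0, speed))
```

At step 0 the output equals the rule-based speed, so early exploration stays close to the safe planner; once the warm-up is over, the learned correction is applied in full.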
[0112] In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method described in the present invention.
[0113] The computer equipment may be a laptop, desktop computer, workstation, or vehicle-mounted computer.
[0114] The processor described in this invention may be a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA).
[0115] The memory described in this invention can be an internal storage unit of a laptop, desktop computer, workstation, or vehicle-mounted computer, such as memory or hard disk; or it can be an external storage unit, such as a portable hard disk or flash memory card.
[0116] The present invention can also provide a computer device, including a processor and a memory, wherein the memory is used to store a computer executable program, the processor reads the computer executable program from the memory and executes it, and the processor can implement the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method described in the present invention when executing the computer executable program.
[0117] Computer-readable storage media can include computer storage media and communication media. Computer storage media includes volatile and non-volatile, removable and non-removable media implemented using any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media can include: read-only memory (ROM), random access memory (RAM), solid-state drives (SSDs), or optical discs, etc. Random access memory can include resistive random access memory (ReRAM) and dynamic random access memory (DRAM).
[0118] The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.
Claims
1. A trajectory planning method based on adaptive curriculum residual hierarchical reinforcement learning, characterized in that it includes the following steps: The spatial position and dynamic characteristics of the autonomous vehicle are integrated using fuzzy logic, which specifically includes: parsing traffic rules, surrounding traffic participants, and road condition information from the environment; constructing a grid map that incorporates the fuzzy speed and deformed pose of surrounding vehicles; and using vectors to uniformly represent traffic rules and road condition information. Rule-based methods are combined with deep reinforcement learning to form the CR-HRL decision framework. The spatial position and dynamic characteristics of the autonomous vehicle are input into the CR-HRL decision framework, and the output ratio of the rule-based method and deep reinforcement learning is adaptively adjusted according to the training process, so as to output the high-level behavioral decision result and the target speed. The input state of the CR-HRL decision framework includes two parts: the observation for the residual reinforcement learning policy and the observation for the rule-based policy. The rule-based observation comprises the speed of the leading vehicle, the speed of the following vehicle, the maximum vehicle speed, the maximum acceleration, the maximum deceleration, the vehicle length, the distance between the leading vehicle and the following vehicle, the driver reaction time, and the driver's imperfection in holding the desired speed, which corresponds to driving proficiency, a higher value indicating less proficiency. The observation of the residual policy consists of two parts: the highly dynamic autonomous vehicle and surrounding traffic participants; and the structured road conditions and fixed traffic rules.
A safe and comfortable planned trajectory is generated based on the high-level behavioral decision result and the target speed. A PPO algorithm based on the Double Actor-Critic framework is used to train the high-level behavioral decision policy and the target speed planning policy. The output of the high-level actor network is the lane-changing decision, the output of the low-level actor network is the target speed, and the output of the critic network is the state value. The high-level and low-level actor networks are structured as follows: the Mt branch consists of 3 convolutional layers and 3 fully connected layers, and the Rt branch has a single fully connected layer; the outputs of the Mt and Rt branches are concatenated and passed to a final fully connected layer, and the output layer uses Softmax. The critic network has the architecture of the high-level and low-level actor networks with the final output layer removed. The occupancy grid map Mt, centered on the autonomous vehicle, represents the perception data at each time step t. Mt is built in the curvilinear coordinates defined by the lane the vehicle follows, so that a curved grid map is uniformly transformed into a straight road. With reference to the real-time speed of the autonomous vehicle and the road speed limit, the speeds of surrounding traffic vehicles are divided into 7 fuzzy sets {NB, NM, NS, Z, PS, PM, PB}, and these 7 fuzzy sets are projected onto the grid map Mt as gradient colors.
2. The adaptive curriculum residual hierarchical reinforcement learning trajectory planning method according to claim 1, characterized in that a progressive curriculum learning approach is adopted and the CR-HRL decision framework is trained in three stages: the first stage is adaptive cruise control (ACC), the second stage is lane changing, and the third stage is overtaking.
3. The adaptive curriculum residual hierarchical reinforcement learning trajectory planning method according to claim 2, characterized in that, in the first stage, the residual policy of the agent is randomly initialized and, in a single-lane scenario, is trained by maximizing the safety reward function; training stops when cruise control and safe distance keeping are achieved, and a penalty is imposed if the lane-changing policy produces a lane-changing error. In the second stage, in a multi-lane scenario, the ACC policy pre-trained in the first stage is loaded, and the traffic environment is reconfigured by adding further surrounding traffic participants; the lane-changing policy in this stage is trained by maximizing the lane-changing reward function. In the third stage, an overtaking reward function is added so that the final trained policy can avoid collisions, complete overtaking when necessary, and return to the original lane.
4. The adaptive curriculum residual hierarchical reinforcement learning trajectory planning method according to claim 3, characterized in that, during the third stage of training, passenger comfort is considered in the low-level target speed policy, and an unsmoothness penalty is applied when jitter occurs in the acceleration changes. The final reward function is a weighted combination of the safety reward function, the lane-changing reward function, the overtaking reward function, and the unsmoothness penalty, wherein the weight parameters change as the curriculum progresses through its stages.
5. A trajectory planning system based on adaptive curriculum residual hierarchical reinforcement learning, characterized in that the system is used to implement the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method as described in any one of claims 1-4 and includes a state acquisition module, a behavior and speed planning module, and a trajectory generation module. The state acquisition module is used to integrate the spatial position and dynamic characteristics of the autonomous vehicle using fuzzy logic. The behavior and speed planning module is used to combine rule-based methods with deep reinforcement learning to form the CR-HRL decision framework; the spatial position and dynamic characteristics of the autonomous vehicle are input into the CR-HRL decision framework, and the output ratio of the rule-based method and deep reinforcement learning is adaptively adjusted according to the training process to output the high-level behavioral decision result and the target speed. The trajectory generation module is used to generate a safe and comfortable planned trajectory based on the high-level behavioral decision result and the target speed.
6. A computer device, characterized in that it includes a processor and a memory, the memory being used to store a computer-executable program, the processor reading the computer-executable program from the memory and executing it, and the processor, when executing the program, being able to implement the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method as described in any one of claims 1-4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, can implement the adaptive curriculum residual hierarchical reinforcement learning trajectory planning method according to any one of claims 1-4.