Automatic driving integrated decision method and device, vehicle and storage medium
By generating safe trajectories that satisfy dynamic constraints and applying state-by-state constraints in autonomous driving systems, the problem of independent behavior selection and path planning modules in traditional decision-making systems is solved, thereby improving the intelligence and safety of autonomous driving systems and achieving a high level of automated control.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2022-10-17
- Publication Date
- 2026-06-23
AI Technical Summary
In traditional autonomous driving decision-making systems, the behavior selection and path planning modules are independent and lack interaction, resulting in inaccurate outputs in uncertain and interactive scenarios, making it difficult to achieve a high level of automation and intelligence. Reinforcement learning methods have limitations in terms of safety, making it difficult to guarantee absolute safety in every state.
By acquiring the state information of the target point in the world coordinate system and transforming it to the target coordinate system, a safe trajectory that satisfies the basic dynamic constraints is generated. Then, state-by-state constraints are applied to generate the optimal safe trajectory of the vehicle. Iterative updates are performed using the alternative state value function and the Lagrange multiplier function to improve the intelligence of the decision-making system.
It enhances the intelligence and safety of autonomous driving decision-making systems in interactive and uncertain scenarios, generates high-level automated control strategies, and ensures stable vehicle operation in complex environments.
Figure CN115534998B_ABST
Description
Technical Field
[0001] This application relates to the field of vehicle technology, and in particular to an integrated decision-making method, device, vehicle, and storage medium for autonomous driving.
Background Technology
[0002] Advanced autonomous driving technology is of great significance for improving road traffic safety, reducing accident rates, and increasing traffic efficiency. However, the decision-making systems of advanced autonomous driving technology still face the following two major challenges.
[0003] Challenge 1: Traditional autonomous driving decision-making systems typically consist of two modules: behavior selection and path planning. These two modules are usually relatively independent and executed sequentially, with the path planning module completely dependent on the behavior selection module's results. They are mutually constraining, lacking effective interaction and feedback between them. In scenarios with uncertainty and interactivity, inaccurate or suboptimal results from the decision-making module can limit the advantages of the path planning module, making it difficult for rule-based autonomous driving decision-making systems to achieve high levels of automation and intelligence.
[0004] Behavior selection and path planning, as different system components, often have different needs. The former leans more towards strategic-level behavioral choices, while the latter requires that indicators such as path curvature and lateral acceleration meet certain requirements. For example, in the scenario shown in Figure 1, the behavior selection module typically outputs discrete selection results based on the safety of the feasible interval, while the path planning module comprehensively considers indicators such as curvature and lateral acceleration to output more refined trajectory planning results. The different optimization objectives of the two modules lead to different outputs. If the path planning module completely follows the output of the behavior selection module, it may produce trajectories whose dynamic stability is difficult to guarantee, affecting the tracking of the lower-level controller. Removing the behavior selection module enlarges the solution range of the path planning module, reducing solution accuracy and efficiency. Specifically, current planning schemes typically separate time and space, performing path planning first, then velocity planning, and finally synthesizing the path. The problem is that while a search-based approach can obtain a globally optimal solution, it may sample many trajectories, and selecting a reasonable trajectory requires designing a reasonable cost function, which is a very time-consuming process. Furthermore, while optimization-based approaches can reduce solution time to some extent, they require transforming the problem into a convex problem, which is difficult to solve when numerous constraints are considered. In the process of avoiding dynamic obstacles, path selection is closely related to velocity, and velocity planning under a given path also affects the human-likeness of driving behavior to some extent.
[0005] Challenge 2: Reinforcement learning-based methods have demonstrated significant advantages in sequential decision-making for autonomous driving. These methods can iteratively update policies through self-evolution based on interactions with the environment, without relying on labeled data, and are capable of integrated decision-making. However, the exploratory, trial-and-error nature of reinforcement learning also makes safety a major bottleneck limiting its application. Constrained Markov Decision Processes (CMDPs) are often used to describe reinforcement learning problems that consider safety constraints. However, common problem settings and solution methods typically focus on optimizing safety in the expected sense. For autonomous vehicles, ensuring absolute safety in every state is more crucial.
[0006] Furthermore, solving for a safe policy requires two steps: (1) designing a state-constrained problem; and (2) solving the state-constrained problem. The importance of the first step lies in the fact that the objective function of reinforcement learning optimization is usually defined in the infinite time domain. However, due to the characteristics of the dynamic system and the constraints of the input state, the state constraints that guarantee feasibility in the infinite time domain are often different from ordinary safety objective functions and are generally more conservative. Furthermore, if the form of the state-constrained problem is not designed correctly, a feasible safe policy can never be obtained regardless of the optimization algorithm chosen.
Summary of the Invention
[0007] This application provides an integrated decision-making method, device, vehicle, and storage medium for autonomous driving, which addresses the problem that rule-based autonomous driving decision-making systems in interactive and uncertain scenarios struggle to achieve high levels of automation and intelligence, thereby improving the intelligence of the decision-making system.
[0008] The first aspect of this application provides an integrated decision-making method for autonomous driving, including the following steps:
[0009] Obtain the first state information of the target point in the world coordinate system, and transform the first state information to the target coordinate system to obtain the second state information of the target point in the target coordinate system;
[0010] Based on the second state information, the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time are obtained. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and initial longitudinal velocity, the state information of the target point generating the trajectory in the world coordinate system is obtained. A safe trajectory satisfying basic dynamic constraints is then generated based on the state information of the target point.
[0011] The safe trajectory that satisfies the basic dynamic constraints is subjected to state-by-state constraints to obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time. The optimal safe trajectory of the vehicle is generated based on the final lateral displacement and the final longitudinal velocity. The optimal safe trajectory is then used as the output of the vehicle's integrated decision system and input to the lower-level controller to control the vehicle according to the optimal safe trajectory.
[0012] According to one embodiment of this application, the step of performing state-by-state constraints on the safe trajectory that satisfies the basic dynamic constraints includes:
[0013] Determine the threshold for the alternative state value function within the state space of the safe and feasible region;
[0014] Based on the threshold of the alternative state value function in the state space of the safe and feasible region and the preset update strategy, the state value function, state-action value function, Lagrange multiplier function, policy function and feasible state-action value function are iteratively updated until the preset iteration conditions are met.
[0015] According to one embodiment of this application, the iterative update of the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and a preset update strategy includes:
[0016] The objective function and gradient of the state value function are updated by minimizing the mean squared error as follows:
[0017]
[0018]
[0019] where the two expressions above are, in order, the objective function of the state value function and its gradient; υ is the parameter of the state value function; V_υ(s) is the state value function and s is the state; Q_ω(s′, a′) is the state-action value function, where s′ is the state at the next moment and a′ is the corresponding action; α is the temperature coefficient; and log π_μ(·) represents the entropy of the policy function π_μ;
[0020] The objective function and gradient of the state-action value function are updated by minimizing the Bellman residual as follows:
[0021]
[0022]
[0023] where the two expressions above are, in order, the objective function of the state-action value function and its gradient; Q_ω(s, a) is the state-action value function and a is the action; the state distribution is that under the policy function π_μ; r(s, a) is the reward function; γ∈(0, 1) is the discount factor; and the remaining symbol denotes the target state value function;
[0024] The objective function and gradient of the Lagrange multiplier function are updated as follows:
[0025]
[0026]
[0027] where the two expressions above are, in order, the objective function of the Lagrange multiplier function and its gradient; λ_ξ(s) is the Lagrange multiplier function; and the remaining symbol denotes the feasible state-action value function;
[0028] The objective function and gradient of the feasible state-action value function are updated as follows:
[0029]
[0030]
[0031]
[0032] where the expressions above give, in order, the update target of the feasible state-action value function, the objective function of the feasible state-action value function, and its gradient; and d is the constraint threshold.
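As a rough orientation only: if the updates follow the standard soft actor-critic pattern that the symbol definitions suggest (the source does not confirm this), the first three objectives could take a form such as the one below. The names J_V, J_Q, J_λ, the sampling distribution 𝒟, the target parameter ῡ, and the feasible state-action value Q^c_φ are illustrative assumptions rather than symbols taken from the original. Under this reading, J_V and J_Q are minimized, while J_λ is maximized over ξ with λ_ξ(s) ≥ 0.

```latex
\begin{aligned}
J_V(\upsilon) &= \mathbb{E}_{s\sim\mathcal{D}}\Big[\tfrac{1}{2}\Big(V_\upsilon(s)
  - \mathbb{E}_{a\sim\pi_\mu}\big[Q_\omega(s,a) - \alpha\log\pi_\mu(a\mid s)\big]\Big)^{2}\Big],\\
J_Q(\omega) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\tfrac{1}{2}\Big(Q_\omega(s,a)
  - r(s,a) - \gamma\,\mathbb{E}_{s'}\big[V_{\bar{\upsilon}}(s')\big]\Big)^{2}\Big],\\
J_\lambda(\xi) &= \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\mu}\Big[\lambda_\xi(s)\,\big(Q^{c}_{\phi}(s,a) - d\big)\Big].
\end{aligned}
```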
[0033] According to one embodiment of this application, the step of converting the first state information to the target coordinate system includes:
[0034] Based on a preset coordinate system transformation function, the first state information is transformed to the target coordinate system, wherein the preset coordinate system transformation function is:
[0035]
[0036] where the six Frenet-frame quantities are, respectively, the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system; (x_t, y_t, v_t, acc_t, θ_t, κ_t) are, respectively, the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and F_coor(·) is the transformation function between the coordinate systems.
[0037] According to one embodiment of this application, the functional relationship between the initial lateral displacement and the initial longitudinal velocity is as follows:
[0038]
[0039]
[0040] Where the lateral displacement l is a function of the longitudinal displacement s, and the longitudinal displacement s is a function of time t. p and q are the degrees of the polynomial.
[0041] According to the autonomous driving integrated decision-making method provided in this application, the method acquires the first state information of the target point in the world coordinate system and transforms it to the target coordinate system to obtain the second state information. Simultaneously, it acquires the initial lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and longitudinal velocity, it obtains the state information of the target point generating the trajectory in the world coordinate system, generates a safe trajectory that satisfies basic dynamic constraints, and applies state-by-state constraints to obtain the final lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. This generates the vehicle's optimal safe trajectory, which serves as the output of the vehicle's integrated decision-making system and is input to the lower-level controller for vehicle control. This solves the problem that rule-based autonomous driving decision-making systems struggle to achieve high levels of automation and intelligence in interactive and uncertain scenarios, thus improving the intelligence of the decision-making system.
[0042] A second aspect of this application provides an integrated decision-making device for autonomous driving, comprising:
[0043] The acquisition module is used to acquire the first state information of the target point in the world coordinate system and convert the first state information to the target coordinate system to obtain the second state information of the target point in the target coordinate system.
[0044] The generation module is used to obtain the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time based on the second state information, and to obtain the state information of the target point in the world coordinate system for generating the trajectory based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, so as to generate a safe trajectory that satisfies basic dynamic constraints according to the state information of the target point; and
[0045] The control module is used to perform state-by-state constraints on the safe trajectory that satisfies the basic dynamic constraints, obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time, generate the optimal safe trajectory of the vehicle based on the final lateral displacement and the final longitudinal velocity, and input the optimal safe trajectory as the output of the vehicle's integrated decision system to the lower-level controller for controlling the vehicle according to the optimal safe trajectory.
[0046] According to one embodiment of this application, the control module, which performs state-by-state constraints on the safe trajectory satisfying the basic dynamic constraints, is specifically used for:
[0047] Determine the threshold for the alternative state value function within the state space of the safe and feasible region;
[0048] Based on the threshold of the alternative state value function in the state space of the safe and feasible region and the preset update strategy, the state value function, state-action value function, Lagrange multiplier function, policy function and feasible state-action value function are iteratively updated until the preset iteration conditions are met.
[0049] According to one embodiment of this application, the control module iteratively updates the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and a preset update strategy. Specifically, the control module is used to:
[0050] The objective function and gradient of the state value function are updated by minimizing the mean squared error as follows:
[0051]
[0052]
[0053] where the two expressions above are, in order, the objective function of the state value function and its gradient; υ is the parameter of the state value function; V_υ(s) is the state value function and s is the state; Q_ω(s′, a′) is the state-action value function, where s′ is the state at the next moment and a′ is the corresponding action; α is the temperature coefficient; and log π_μ(·) represents the entropy of the policy function π_μ;
[0054] The objective function and gradient of the state-action value function are updated by minimizing the Bellman residual as follows:
[0055]
[0056]
[0057] where the two expressions above are, in order, the objective function of the state-action value function and its gradient; Q_ω(s, a) is the state-action value function and a is the action; the state distribution is that under the policy function π_μ; r(s, a) is the reward function; γ∈(0, 1) is the discount factor; and the remaining symbol denotes the target state value function;
[0058] The objective function and gradient of the Lagrange multiplier function are updated as follows:
[0059]
[0060]
[0061] where the two expressions above are, in order, the objective function of the Lagrange multiplier function and its gradient; λ_ξ(s) is the Lagrange multiplier function; and the remaining symbol denotes the feasible state-action value function;
[0062] The objective function and gradient of the feasible state-action value function are updated as follows:
[0063]
[0064]
[0065]
[0066] where the expressions above give, in order, the update target of the feasible state-action value function, the objective function of the feasible state-action value function, and its gradient; and d is the constraint threshold.
[0067] According to one embodiment of this application, the acquisition module, when converting the first state information to the target coordinate system, is specifically used for:
[0068] Based on a preset coordinate system transformation function, the first state information is transformed to the target coordinate system, wherein the preset coordinate system transformation function is:
[0069]
[0070] where the six Frenet-frame quantities are, respectively, the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system; (x_t, y_t, v_t, acc_t, θ_t, κ_t) are, respectively, the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and F_coor(·) is the transformation function between the coordinate systems.
[0071] According to one embodiment of this application, the functional relationship between the initial lateral displacement and the initial longitudinal velocity is as follows:
[0072]
[0073]
[0074] Where the lateral displacement l is a function of the longitudinal displacement s, and the longitudinal displacement s is a function of time t. p and q are the degrees of the polynomial.
[0075] According to the autonomous driving integrated decision-making device provided in this application embodiment, the device acquires the first state information of the target point in the world coordinate system and transforms it to the target coordinate system to obtain the second state information. Simultaneously, it acquires the initial lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, it obtains the state information of the target point generating the trajectory in the world coordinate system, generates a safe trajectory that satisfies basic dynamic constraints, and applies state-by-state constraints to obtain the final lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. This generates the vehicle's optimal safe trajectory, which is then input to the lower-level controller as the output of the vehicle's integrated decision-making system to control the vehicle. This solves the problem that rule-based autonomous driving decision-making systems struggle to achieve high levels of automation and intelligence in scenarios with interactivity and uncertainty, thus improving the intelligence of the decision-making system.
[0076] A third aspect of this application provides a vehicle including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the autonomous driving integrated decision-making method as described in the above embodiments.
[0077] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which is executed by a processor to implement the autonomous driving integrated decision-making method as described in the above embodiments.
[0078] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0079] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
[0080] Figure 1 is a schematic diagram illustrating the output differences between different modules of a decision system according to an embodiment of this application;
[0081] Figure 2 is a flowchart of an integrated decision-making method for autonomous driving provided according to an embodiment of this application;
[0082] Figure 3 is a flowchart of an integrated decision-making method for autonomous driving according to an embodiment of this application;
[0083] Figure 4 is a schematic diagram of a safety state projection process according to an embodiment of this application;
[0084] Figure 5 is a schematic diagram illustrating the physical meaning of a module according to an embodiment of this application;
[0085] Figure 6 is a schematic diagram of different state spaces according to an embodiment of this application;
[0086] Figure 7 is a block diagram of an integrated decision-making device for autonomous driving according to an embodiment of this application;
[0087] Figure 8 is a structural schematic diagram of a vehicle provided according to an embodiment of this application.
Detailed Implementation
[0088] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0089] The following description, with reference to the accompanying drawings, outlines an autonomous driving integrated decision-making method, apparatus, vehicle, and storage medium according to embodiments of this application. Addressing the challenges mentioned in the background art regarding the difficulty of achieving high levels of automation and intelligence in rule-based autonomous driving decision-making systems within interactive and uncertain scenarios, this application provides an autonomous driving integrated decision-making method. This method acquires first state information of a target point in the world coordinate system and transforms it to the target coordinate system to obtain second state information. Simultaneously, it acquires the initial lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and initial longitudinal velocity, it obtains the state information of the target point generating a trajectory in the world coordinate system, generates a safe trajectory satisfying basic dynamic constraints, and applies state-by-state constraints to obtain the final lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. This generates the vehicle's optimal safe trajectory, which serves as the output of the vehicle's integrated decision-making system and is input to the lower-level controller for vehicle control. This solves the problem of achieving high levels of automation and intelligence in rule-based autonomous driving decision-making systems within interactive and uncertain scenarios, thereby improving the intelligence of the decision-making system.
[0090] Specifically, Figure 2 is a flowchart illustrating an integrated decision-making method for autonomous driving provided in an embodiment of this application.
[0091] Before the autonomous driving integrated decision-making method of the embodiments of this application is introduced, the autonomous driving integrated decision-making system involved in the method is briefly described.
[0092] In this embodiment, the autonomous driving integrated decision-making system mainly comprises three major modules: a safety projection module, a policy evaluation module, and a policy enhancement module. The policy evaluation module includes two sub-modules: a Lagrange multiplier module and a feasible value function module. The policy enhancement module includes a continuous lattice module, as shown in Figure 3. The embodiments of this application primarily target scenarios with interactivity and uncertainty, avoiding the problem that the optimization objectives of different modules in a decision-making system constrain one another and lack interaction and feedback. At the same time, an alternative Lagrange objective equation is used to construct a constrained Markov decision process with zero constraint violations to describe the problem, and the Lagrange multiplier method considering the state distribution is used to solve for the optimal policy, outputting a safe, continuous, and reasonable path. This makes the decision-making system's output highly flexible and intelligent.
[0093] As shown in Figure 2, the autonomous driving integrated decision-making method includes the following steps:
[0094] In step S201, the first state information of the target point in the world coordinate system is obtained, and the first state information is transformed to the target coordinate system to obtain the second state information of the target point in the target coordinate system.
[0095] Specifically, in the embodiments of this application, a state-by-state constrained Markov decision process (SCMDP) can first be defined as a quintuple whose elements are, in order, the state space (with states s), the action space (with actions a), the Markov state transition probability, the reward function r, and the discount factor γ∈(0,1). The SCMDP problem can be defined as follows:
[0096]
[0097]
[0098] where ρ_π(s) is the distribution of states under policy π; the constraint set is the state space of the safe and feasible region; the constrained quantity is the state value function within the state space of the safe and feasible region; and d is the constraint threshold.
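For orientation, one common way to write a state-wise constrained problem that matches these definitions is sketched below. This is a reading under stated assumptions, not the original equation (1): the symbols V^π (reward value function), V_c^π (the constrained state value function), and 𝒮_F (the state space of the safe and feasible region) are illustrative names introduced here.

```latex
\max_{\pi}\ \mathbb{E}_{s\sim\rho_\pi}\big[V^{\pi}(s)\big]
\quad\text{s.t.}\quad V_c^{\pi}(s)\le d,\qquad \forall\, s\in\mathcal{S}_F
```

The point of the state-wise form is that the constraint must hold at every state in 𝒮_F, rather than only on average over ρ_π(s).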
[0099] Furthermore, the problem defined in (1) is difficult to solve directly. Therefore, the embodiments of this application can transform the original optimization problem form (1) into its dual problem form:
[0100]
[0101] where the alternative state value function within the state space of the safe and feasible region corresponds to the feasible value function module, and λ(·) is the Lagrange multiplier, corresponding to the Lagrange multiplier module. To meet the requirements of state-by-state constraints, λ(·) is defined as a function of the state s rather than a constant. In constrained reinforcement learning problems, the duality gap does not exist, and when ρ_π(s) covers all states, i.e., includes the safe state space, the optimal solution of problem (2) is equivalent to that of problem (1).
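The role of a state-dependent multiplier can be illustrated with a minimal sketch, assuming a tabular setting and a plain projected dual-ascent step. The function name `update_multipliers`, the learning rate, and the example values are hypothetical and are not taken from the source; the embodiment uses a parameterized multiplier function rather than a table.

```python
def update_multipliers(lam, cost_values, d, lr):
    """One projected dual-ascent step on a per-state multiplier table.

    lam[i] grows while the constraint value of state i exceeds the
    threshold d, and is clipped at zero otherwise.
    """
    return [max(0.0, lam_s + lr * (c_s - d)) for lam_s, c_s in zip(lam, cost_values)]


# Hypothetical constraint values for three states; only the last violates d = 1.0.
lam = [0.0, 0.0, 0.0]
cost_values = [0.0, 0.5, 2.0]
for _ in range(10):
    lam = update_multipliers(lam, cost_values, d=1.0, lr=0.1)
```

Only the multiplier of the violating state becomes positive, which is what a single scalar multiplier cannot express: a constant λ would penalize all states equally.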
[0102] Furthermore, a basic safe state space is defined, which physically corresponds to the drivable area within the road network and is used to ensure that the output target point lies within this drivable area. First, the state space in problem (2) is defined as the projection of the state in the world coordinate system onto the Frenet coordinate system, as shown in Figure 4.
[0103] Furthermore, the safety projection module aims to ensure that all output target points fall within the drivable area, i.e., the road area, while simultaneously helping the vehicle understand more information relative to the road and surrounding vehicles at the input information level. Specifically, in this embodiment, the first state information can be transformed to the target coordinate system based on a preset coordinate system transformation function, wherein the preset coordinate system transformation function is:
[0104]
[0105] where the six Frenet-frame quantities are, respectively, the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system; (x_t, y_t, v_t, acc_t, θ_t, κ_t) are, respectively, the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and F_coor(·) is the transformation function between the coordinate systems.
[0106] Therefore, the state space is projected into, and constrained within, the basic safe state space using the position information relative to the reference line.
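The positional part of such a projection can be sketched as follows, assuming the reference line is given as a polyline of world-frame points with no zero-length segments. The function `world_to_frenet` is a hypothetical stand-in for the position components of F_coor(·); it does not compute the velocity and acceleration components, and the source does not specify how F_coor(·) is implemented.

```python
import math


def world_to_frenet(point, ref_xy):
    """Project a world-frame point onto a polyline reference line.

    Returns (s, l): arc length along the reference line to the closest
    point, and the signed lateral offset (positive to the left of travel).
    """
    px, py = point
    best = None
    s_start = 0.0  # arc length at the start of the current segment
    for (x0, y0), (x1, y1) in zip(ref_xy[:-1], ref_xy[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        # Fraction along this segment of the closest point, clamped to its ends.
        t = max(0.0, min(1.0, ((px - x0) * dx + (py - y0) * dy) / seg_len ** 2))
        fx, fy = x0 + t * dx, y0 + t * dy
        dist = math.hypot(px - fx, py - fy)
        if best is None or dist < best[0]:
            # The 2D cross product tells which side of the segment the point is on.
            cross = dx * (py - fy) - dy * (px - fx)
            sign = 1.0 if cross >= 0 else -1.0
            best = (dist, s_start + t * seg_len, sign * dist)
        s_start += seg_len
    return best[1], best[2]


reference = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
s, l = world_to_frenet((3.0, 2.0), reference)
```

Once a point is expressed as (s, l), checking that it lies in the drivable area reduces to comparing l against the lateral road boundaries at s, which is the kind of relative-position constraint the paragraph above describes.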
[0107] In step S202, based on the second state information, the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time are obtained. Based on the preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, the state information of the target point in the world coordinate system for generating the trajectory is obtained, so as to generate a safe trajectory that satisfies the basic dynamic constraints according to the state information of the target point.
[0108] Specifically, to ensure the integrated output of the decision-making system, embodiments of this application may define a continuous action space consisting of the lateral displacement of the target point from the reference path at time T and the longitudinal velocity. Trajectories are then generated using a continuous lattice module, and the policy output is post-processed, as shown in Figure 5.
[0109] In some embodiments, the functional relationship between the initial lateral displacement and the initial longitudinal velocity is as follows:
[0110]
[0111]
[0112] Where the lateral displacement *l* is a function of the longitudinal displacement *s*, and the longitudinal displacement *s* is a function of time *t*. *p* and *q* are the degrees of the polynomials. In this embodiment, *p* = 6 and *q* = 5. The state information of the trajectory starting point can be obtained from (3).
[0113] Furthermore, in the continuous lattice module, to ensure the stability of the driving process, it is assumed that the lateral and longitudinal accelerations and the lateral velocity are 0 at time T, and the longitudinal distance $s_f$ is sampled to obtain the state information of the target point. Combining (4), a spatiotemporal trajectory jointly described by $s_f(t)$ and $l_f(s_f)$ can be generated. Furthermore, the state information in the world coordinate system of all points of the generated trajectory is obtained:
[0114]
[0115]
[0116] Where $v_{uss}$, $acc_{uss}$, $\theta_{uss}$, and $\kappa_{uss}$ are the sets of unsafe states.
[0117] Thus, a trajectory that satisfies basic dynamic constraints can be generated. Essentially, this embodiment uses part of the target point information output by the network to control the target's distance from the reference line, so that high-level behavior selection and trajectory generation are optimized jointly. It also adjusts the safety of the generated trajectory by controlling the output longitudinal reference velocity, and it avoids a complex trajectory selection process, ensuring the continuity and reachability of the target point at different times.
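The way a target lateral displacement and a target longitudinal velocity determine a whole trajectory can be sketched as follows. This is an illustration under stated assumptions, not the formulas (4) and (5) of this application: it assumes a quartic polynomial for s(t) and a quintic polynomial for l(s), chosen so that the coefficient counts match the boundary conditions named in the text (zero acceleration and zero lateral velocity at the end). How these orders correspond to *p* = 6 and *q* = 5 is not confirmed by the source. All function names are hypothetical.

```python
def longitudinal_quartic(s0, v0, a0, v_f, T):
    """Coefficients of s(t) with s(0)=s0, s'(0)=v0, s''(0)=a0, s'(T)=v_f, s''(T)=0."""
    c4 = (v0 + 0.5 * a0 * T - v_f) / (2.0 * T ** 3)
    c3 = -(a0 + 12.0 * c4 * T ** 2) / (6.0 * T)
    return [s0, v0, 0.5 * a0, c3, c4]


def lateral_quintic(l0, dl0, ddl0, l_f, S):
    """Coefficients of l(s) with l(0)=l0, l'(0)=dl0, l''(0)=ddl0, l(S)=l_f, l'(S)=l''(S)=0."""
    d = l_f - l0
    c3 = (20.0 * d - 12.0 * dl0 * S - 3.0 * ddl0 * S ** 2) / (2.0 * S ** 3)
    c4 = (-30.0 * d + 16.0 * dl0 * S + 3.0 * ddl0 * S ** 2) / (2.0 * S ** 4)
    c5 = (12.0 * d - 6.0 * dl0 * S - ddl0 * S ** 2) / (2.0 * S ** 5)
    return [l0, dl0, 0.5 * ddl0, c3, c4, c5]


def poly_eval(coeffs, x, order=0):
    """Evaluate the order-th derivative of sum(coeffs[k] * x**k) at x."""
    total = 0.0
    for k, c in enumerate(coeffs):
        if k < order:
            continue
        factor = 1.0
        for j in range(order):
            factor *= k - j
        total += c * factor * x ** (k - order)
    return total


# Example: reach 10 m/s after 4 s while moving from l = 0.5 m to l = -1.0 m.
T = 4.0
s_coeffs = longitudinal_quartic(0.0, 6.0, 0.0, 10.0, T)
S = poly_eval(s_coeffs, T)  # longitudinal distance covered at time T
l_coeffs = lateral_quintic(0.5, 0.0, 0.0, -1.0, S)
samples = []
for i in range(5):
    t = i * T / 4
    s_t = poly_eval(s_coeffs, t)
    samples.append((t, s_t, poly_eval(l_coeffs, s_t)))
```

Since the two scalars (final lateral offset and final longitudinal speed) fix every coefficient once the start state is known, a continuous policy output maps to exactly one smooth trajectory, which is why no separate trajectory selection step is needed.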
[0118] In step S203, state-by-state constraints are applied to the safe trajectory that satisfies the basic dynamic constraints to obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time. The optimal safe trajectory of the vehicle is generated based on the final lateral displacement and final longitudinal velocity, and the optimal safe trajectory is used as the output of the vehicle's integrated decision system and input to the lower-level controller to control the vehicle according to the optimal safe trajectory.
[0119] Furthermore, in some embodiments, state-by-state constraints are applied to the safe trajectory that satisfies the basic dynamic constraints, including: determining a threshold for the alternative state value function within the state space of the safe and feasible region; and iteratively updating the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function within the state space of the safe and feasible region and a preset update strategy, until a preset iteration condition is met.
[0120] Specifically, the safety strategy solving module in this application embodiment applies state-by-state constraints to the trajectory that satisfies the basic dynamic constraints. To ensure the feasibility of the trajectory safety state, the safety of the trajectory is constrained from an algorithmic perspective. The different trajectory safety states are shown in Figure 6. At the problem-formulation level, the embodiments of this application first adopt the state-by-state constrained SCMDP form so that the requirements for solving state-constrained problems can be met.
[0121] Furthermore, the objective function of problem (2) is solved by updating the policy and the Lagrange multipliers:
[0122]
[0123]
[0124] Where $\beta_\lambda$ and $\beta_\pi$ are the update step sizes, and $\lambda$ and $\pi$ are the parameters being updated in the multiplier network and the policy network, respectively.
[0125] Furthermore, the threshold $d$ of the alternative state value function within the safe and feasible region state space is the distance between the generated trajectory and the surrounding vehicles in the dynamic traffic flow scenario, i.e.:
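The alternating update of a policy and a Lagrange multiplier with separate step sizes can be illustrated on a toy problem. The sketch below is not the update rule of this application: it assumes a scalar decision variable `x` in place of the policy parameters and a single scalar multiplier, and it minimizes `x**2` subject to `x >= 1`. The step sizes play the roles of $\beta_\pi$ and $\beta_\lambda$; all names are hypothetical.

```python
def primal_dual(beta_pi=0.1, beta_lambda=0.05, steps=2000):
    """Gradient descent-ascent on L(x, lam) = x**2 + lam * (1 - x)."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend the Lagrangian in x (stand-in for the policy).
        x -= beta_pi * (2.0 * x - lam)
        # Dual step: ascend in lam, projected so the multiplier stays non-negative.
        lam = max(0.0, lam + beta_lambda * (1.0 - x))
    return x, lam


x_opt, lam_opt = primal_dual()
```

The multiplier grows while the constraint is violated and stops growing once it is met, so the primal variable is pushed to the boundary of the feasible set. In the state-by-state setting of the text, the multiplier is a function of the state rather than a single scalar.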
[0126]
[0127] Where $\eta$ is the attenuation coefficient, $d_{t,i}$ is the distance between the vehicle and vehicle $i$ at time $t$, and $n$ is the number of surrounding vehicles.
[0128] Furthermore, neural networks are used to fit the approximate functions of each module, including the state value function $V_\upsilon$, the state-action value function $Q_\omega$, the Lagrange multiplier function $\lambda_\xi$, the policy function $\pi_\mu$, and the feasible state-action value function. This embodiment makes the training process more stable by explicitly modeling the state value function. Furthermore, the problem in (2) can be transformed into the following form:
[0129]
[0130]
[0131]
[0132] Where $a \sim \pi_\mu(s)$; the state distribution is the one under the policy $\pi_\mu$; $\log \pi_\mu(\cdot)$ represents the entropy of the policy $\pi_\mu$; and $\alpha$ is the temperature coefficient.
[0133] Furthermore, the objective function and gradient for updating the policy function $\pi_\mu$ are:
[0134]
[0135]
[0136] In some embodiments, based on a threshold of the alternative state value function within the safe and feasible region state space and a preset update strategy, the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function are iteratively updated, including:
[0137] The state value function is updated by minimizing the mean squared error, with the following objective function and gradient:
[0138]
[0139]
[0140] Where the first expression is the objective function of the state value function; $\upsilon$ denotes the parameters of the state value function; $V_\upsilon(s)$ is the state value function; $s$ is the state; $Q_\omega(s', a')$ is the state-action value function; $s'$ is the state at the next moment; $a'$ is the corresponding action; $\alpha$ is the temperature coefficient; $\log \pi_\mu(\cdot)$ represents the entropy of the policy function $\pi_\mu$; and the second expression is the gradient of the state value function;
[0141] The state-action value function is updated by minimizing the Bellman residual, with the following objective function and gradient:
[0142]
[0143]
[0144] Where the first expression is the objective function of the state-action value function; $Q_\omega(s, a)$ is the state-action value function; $a$ is the action; the state distribution is the one under the policy function $\pi_\mu$; $r(s, a)$ is the reward function; $\gamma \in (0, 1)$ is the discount factor; the target state value function is evaluated with the target network parameters; and the second expression is the gradient of the state-action value function;
[0145] The objective function and gradient for updating the Lagrange multiplier function are:
[0146]
[0147]
[0148] Where the first expression is the objective function of the Lagrange multiplier function; $\lambda_\xi(s)$ is the Lagrange multiplier function; the remaining function symbol denotes the feasible state-action value function; and the second expression is the gradient of the Lagrange multiplier function;
[0149] The objective function and gradient for updating the feasible state-action value function are:
[0150]
[0151]
[0152]
[0153] Where the first expression is the objective function for updating the feasible state-action value function, $d$ is the constraint threshold, and the last expression is the gradient of the feasible state-action value function.
[0154] Furthermore, the parameters of each network are updated as follows:
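The two critic updates described above (mean squared error for the state value function, Bellman residual for the state-action value function) can be sketched for a single transition. The equations of this application are not reproduced in the text, so the sketch assumes the widely used soft actor-critic form: an entropy-regularized value target and a one-step Bellman target built from a target value network. Function names and argument names are hypothetical.

```python
def soft_value_loss(v_pred, q_sampled, log_prob, alpha):
    """Squared-error loss for V(s) against the target Q(s, a) - alpha * log pi(a|s).

    Returns (loss, d_loss / d_v_pred) for one sample.
    """
    target = q_sampled - alpha * log_prob
    residual = v_pred - target
    return 0.5 * residual ** 2, residual


def bellman_q_loss(q_pred, reward, v_target_next, gamma):
    """Bellman-residual loss for Q(s, a) against r(s, a) + gamma * V_target(s').

    Returns (loss, d_loss / d_q_pred) for one sample.
    """
    target = reward + gamma * v_target_next
    residual = q_pred - target
    return 0.5 * residual ** 2, residual


def batch_mean(values):
    """Average per-sample losses or gradients over a mini-batch."""
    return sum(values) / len(values)


# Example with hand-picked numbers (illustrative only).
v_loss, v_grad = soft_value_loss(v_pred=1.0, q_sampled=2.0, log_prob=-1.0, alpha=0.5)
q_loss, q_grad = bellman_q_loss(q_pred=1.0, reward=1.0, v_target_next=2.0, gamma=0.5)
```

In both cases the gradient with respect to the network output is simply the residual, which is then propagated into the parameters $\upsilon$ or $\omega$ by backpropagation in a real implementation.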
[0155]
[0156]
[0157]
[0158]
[0159]
[0160]
[0161] Where $\beta_{(\cdot)}$ is the learning rate of the corresponding network and $\tau$ is the target network parameter update ratio.
[0162] Furthermore, through iterative updates among the above network modules, the optimal policy is output, yielding the lateral displacement of the target point from the reference path at time T and the longitudinal velocity. Combining (4) and (5), the optimal safe trajectory is output as the output of the integrated decision system and is finally input into the lower-level controller.
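The role of the update ratio $\tau$ can be shown with a minimal sketch of a soft target-network update. It assumes the common convention in which the target parameters move a fraction $\tau$ of the way toward the online parameters at each step; the source equations are not reproduced, so this convention and the name `soft_update` are assumptions.

```python
def soft_update(target_params, online_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Example: with tau = 0.25 the target moves a quarter of the way per update.
target = [0.0, 4.0]
online = [8.0, 0.0]
target = soft_update(target, online, tau=0.25)  # -> [2.0, 3.0]
```

A small $\tau$ keeps the target value function slowly varying, which stabilizes the Bellman target used when updating the state-action value function.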
[0163] The autonomous driving integrated decision-making method proposed in this application obtains the first state information of the target point in the world coordinate system and transforms it to the target coordinate system to obtain the second state information. Simultaneously, it obtains the initial lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, it obtains the state information of the target point generating the trajectory in the world coordinate system, generates a safe trajectory that satisfies basic dynamic constraints, and applies state-by-state constraints to obtain the final lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. This generates the vehicle's optimal safe trajectory, which serves as the output of the vehicle's integrated decision-making system and is input to the lower-level controller for vehicle control. This solves the problem that rule-based autonomous driving decision-making systems struggle to achieve high levels of automation and intelligence in interactive and uncertain scenarios, thus improving the intelligence of the decision-making system.
[0164] Next, referring to the accompanying drawings, an integrated decision-making device for autonomous driving proposed according to an embodiment of this application is described.
[0165] Figure 7 is a block diagram of an autonomous driving integrated decision-making device according to an embodiment of this application.
[0166] As shown in Figure 7, the autonomous driving integrated decision-making device 10 includes: an acquisition module 100, a generation module 200, and a control module 300.
[0167] The acquisition module 100 is used to acquire the first state information of the target point in the world coordinate system and convert the first state information to the target coordinate system to obtain the second state information of the target point in the target coordinate system.
[0168] The generation module 200 is used to obtain the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time based on the second state information, and to obtain the state information of the target point in the world coordinate system based on the preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, so as to generate a safe trajectory that satisfies the basic dynamic constraints according to the state information of the target point; and
[0169] The control module 300 is used to perform state-by-state constraints on the safe trajectory that satisfies the basic dynamic constraints, obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time, generate the optimal safe trajectory of the vehicle based on the final lateral displacement and final longitudinal velocity, and input the optimal safe trajectory as the output of the vehicle's integrated decision system to the lower-level controller so as to control the vehicle according to the optimal safe trajectory.
[0170] Furthermore, in some embodiments, when applying state-by-state constraints to the safe trajectory that satisfies the basic dynamic constraints, the control module 300 is specifically used to:
[0171] Determine the threshold for the alternative state value function within the state space of the safe and feasible region;
[0172] Based on the threshold of the alternative state value function in the state space of the safe and feasible region and the preset update strategy, the state value function, state-action value function, Lagrange multiplier function, policy function and feasible state-action value function are iteratively updated until the preset iteration conditions are met.
[0173] Furthermore, in some embodiments, based on the threshold of the alternative state value function within the safe and feasible region state space and a preset update strategy, the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function are iteratively updated. The control module 300 is specifically used for:
[0174] The state value function is updated by minimizing the mean squared error, with the following objective function and gradient:
[0175]
[0176]
[0177] Where the first expression is the objective function of the state value function; $\upsilon$ denotes the parameters of the state value function; $V_\upsilon(s)$ is the state value function; $s$ is the state; $Q_\omega(s', a')$ is the state-action value function; $s'$ is the state at the next moment; $a'$ is the corresponding action; $\alpha$ is the temperature coefficient; $\log \pi_\mu(\cdot)$ represents the entropy of the policy function $\pi_\mu$; and the second expression is the gradient of the state value function;
[0178] The state-action value function is updated by minimizing the Bellman residual, with the following objective function and gradient:
[0179]
[0180]
[0181] Where the first expression is the objective function of the state-action value function; $Q_\omega(s, a)$ is the state-action value function; $a$ is the action; the state distribution is the one under the policy function $\pi_\mu$; $r(s, a)$ is the reward function; $\gamma \in (0, 1)$ is the discount factor; the target state value function is evaluated with the target network parameters; and the second expression is the gradient of the state-action value function;
[0182] The objective function and gradient for updating the Lagrange multiplier function are:
[0183]
[0184]
[0185] Where the first expression is the objective function of the Lagrange multiplier function; $\lambda_\xi(s)$ is the Lagrange multiplier function; the remaining function symbol denotes the feasible state-action value function; and the second expression is the gradient of the Lagrange multiplier function;
[0186] The objective function and gradient for updating the feasible state-action value function are:
[0187]
[0188]
[0189]
[0190] Where the first expression is the objective function for updating the feasible state-action value function, $d$ is the constraint threshold, and the last expression is the gradient of the feasible state-action value function.
[0191] Furthermore, in some embodiments, the first state information is transformed to the target coordinate system, and the acquisition module 100 is specifically used for:
[0192] Based on a preset coordinate system transformation function, the first state information is transformed to the target coordinate system. The preset coordinate system transformation function is as follows:
[0193]
[0194] Where the transformed quantities are the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system, respectively; $(x_t, y_t, v_t, acc_t, \theta_t, \kappa_t)$ are the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and $coor(\cdot)$ is the transformation function between coordinate systems.
[0195] Furthermore, in some embodiments, the functional relationship between the initial lateral displacement and the initial longitudinal velocity is as follows:
[0196]
[0197]
[0198] Where the lateral displacement l is a function of the longitudinal displacement s, and the longitudinal displacement s is a function of time t. p and q are the degrees of the polynomial.
[0199] It should be noted that the foregoing explanation of the autonomous driving integrated decision-making method embodiment also applies to the autonomous driving integrated decision-making device of this embodiment, and will not be repeated here.
[0200] The autonomous driving integrated decision-making device proposed in this application obtains the first state information of the target point in the world coordinate system and transforms it to the target coordinate system to obtain the second state information. Simultaneously, it obtains the initial lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. Based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, it obtains the state information of the target point generating the trajectory in the world coordinate system, generates a safe trajectory that satisfies basic dynamic constraints, and applies state-by-state constraints to obtain the final lateral displacement and longitudinal velocity of the target point deviating from the reference path at the target time. This generates the vehicle's optimal safe trajectory, which serves as the output of the vehicle's integrated decision-making system and is input to the lower-level controller for vehicle control. This solves the problem that rule-based autonomous driving decision-making systems struggle to achieve high levels of automation and intelligence in scenarios with interactivity and uncertainty, thus improving the intelligence of the decision-making system.
[0201] Figure 8 is a schematic diagram of the structure of a vehicle provided in an embodiment of this application. The vehicle may include:
[0202] The memory 801, the processor 802, and the computer program stored on the memory 801 and capable of running on the processor 802.
[0203] When the processor 802 executes the program, it implements the autonomous driving integrated decision-making method provided in the above embodiments.
[0204] Furthermore, the vehicle also includes:
[0205] Communication interface 803 is used for communication between memory 801 and processor 802.
[0206] The memory 801 is used to store computer programs that can run on the processor 802.
[0207] The memory 801 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.
[0208] If the memory 801, processor 802, and communication interface 803 are implemented independently, then the communication interface 803, memory 801, and processor 802 can be interconnected via a bus to complete communication between them. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized into address buses, data buses, control buses, etc. For ease of representation, the bus is represented by a single thick line in Figure 8, but this does not mean that there is only one bus or one type of bus.
[0209] Optionally, in a specific implementation, if the memory 801, processor 802, and communication interface 803 are integrated on a single chip, then the memory 801, processor 802, and communication interface 803 can communicate with each other through an internal interface.
[0210] The processor 802 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
[0211] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the above-described autonomous driving integrated decision-making method.
[0212] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0213] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0214] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or steps of a process. The scope of the preferred embodiments of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.
[0215] The logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable and programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM). Furthermore, the computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in computer memory.
[0216] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0217] Those skilled in the art will understand that all or part of the steps of the methods described in the above embodiments can be implemented by a program instructing related hardware, and the program can be stored in a computer-readable storage medium. When executed, the program includes one or a combination of the steps of the method embodiments.
[0218] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0219] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
Claims
1. An integrated decision-making method for autonomous driving, characterized in that it comprises the following steps: obtaining first state information of a target point in the world coordinate system, and transforming the first state information to a target coordinate system to obtain second state information of the target point in the target coordinate system; based on the second state information, obtaining the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time, and, based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, obtaining the state information in the world coordinate system of the target point for generating the trajectory, so as to generate a safe trajectory that satisfies the basic dynamic constraints according to the state information of the target point; and applying state-by-state constraints to the safe trajectory that satisfies the basic dynamic constraints to obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time, generating the optimal safe trajectory of the vehicle based on the final lateral displacement and the final longitudinal velocity, and using the optimal safe trajectory as the output of the vehicle's integrated decision system and inputting it to the lower-level controller to control the vehicle according to the optimal safe trajectory;
wherein applying state-by-state constraints to the safe trajectory that satisfies the basic dynamic constraints includes: determining the threshold of the alternative state value function in the state space of the safe and feasible region; and iteratively updating the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and a preset update strategy until the preset iteration conditions are met; the threshold of the alternative state value function within the state space of the safe and feasible region is the distance between the generated trajectory and the surrounding vehicles in a dynamic traffic flow scenario, i.e.: [equation]; where $\eta$ is the attenuation coefficient, $d_{t,i}$ is the distance between the vehicle and vehicle $i$ at time $t$, and $n$ is the number of surrounding vehicles.
2. The method according to claim 1, characterized in that iteratively updating the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and a preset update strategy includes: updating the state value function by minimizing the mean squared error, with the following objective function and gradient: [equation]; [equation]; where the first expression is the objective function of the state value function, $\upsilon$ denotes the parameters of the state value function, $V_\upsilon(s)$ is the state value function, $s$ is the state, $Q_\omega(s', a')$ is the state-action value function, $s'$ is the state at the next moment, $a'$ is the corresponding action, $\alpha$ is the temperature coefficient, $\log \pi_\mu(\cdot)$ represents the entropy of the policy function $\pi_\mu$, and the second expression is the gradient of the state value function; updating the state-action value function by minimizing the Bellman residual, with the following objective function and gradient: [equation]; [equation]; where the first expression is the objective function of the state-action value function, $Q_\omega(s, a)$ is the state-action value function, $a$ is the action, the state distribution is the one under the policy function $\pi_\mu$, $r(s, a)$ is the reward function, $\gamma$ is the discount factor, the target state value function is evaluated with the target network parameters, and the second expression is the gradient of the state-action value function; updating the Lagrange multiplier function, with the following objective function and gradient: [equation]; [equation]; where the first expression is the objective function of the Lagrange multiplier function, $\lambda_\xi(s)$ is the Lagrange multiplier function, the remaining function symbol denotes the feasible state-action value function, and
the second expression is the gradient of the Lagrange multiplier function; and updating the feasible state-action value function, with the following objective function and gradient: [equation]; [equation]; [equation]; where the first expression is the objective function for updating the feasible state-action value function, $d$ is the constraint threshold, and the last expression is the gradient of the feasible state-action value function.
3. The method according to claim 1, characterized in that transforming the first state information to the target coordinate system includes: transforming the first state information to the target coordinate system based on a preset coordinate system transformation function, wherein the preset coordinate system transformation function is: [equation]; where the transformed quantities are the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system; $(x_t, y_t, v_t, acc_t, \theta_t, \kappa_t)$ are the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and $coor(\cdot)$ is the transformation function between coordinate systems.
4. The method according to claim 1, characterized in that the functional relationship between the initial lateral displacement and the initial longitudinal velocity is as follows: [equation]; [equation]; where the lateral displacement $l$ is a function of the longitudinal displacement $s$, the longitudinal displacement $s$ is a function of time $t$, and $p$ and $q$ are the degrees of the polynomials.
5. An integrated decision-making device for autonomous driving, characterized in that it comprises: an acquisition module, used to acquire first state information of a target point in the world coordinate system and transform the first state information to a target coordinate system to obtain second state information of the target point in the target coordinate system; a generation module, used to obtain, based on the second state information, the initial lateral displacement and initial longitudinal velocity of the target point deviating from the reference path at the target time, and to obtain, based on a preset stabilization strategy and the functional relationship between the initial lateral displacement and the initial longitudinal velocity, the state information in the world coordinate system of the target point for generating the trajectory, so as to generate a safe trajectory that satisfies the basic dynamic constraints according to the state information of the target point; and a control module, used to apply state-by-state constraints to the safe trajectory that satisfies the basic dynamic constraints, obtain the final lateral displacement and final longitudinal velocity of the target point deviating from the reference path at the target time, generate the optimal safe trajectory of the vehicle based on the final lateral displacement and the final longitudinal velocity, and input the optimal safe trajectory, as the output of the vehicle's integrated decision system, to the lower-level controller so as to control the vehicle according to the optimal safe trajectory;
wherein the control module, when applying state-by-state constraints to the safe trajectory that satisfies the basic dynamic constraints, is specifically used to: determine the threshold of the alternative state value function in the state space of the safe and feasible region; and iteratively update the state value function, state-action value function, Lagrange multiplier function, policy function, and feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and a preset update strategy, until the preset iteration conditions are met; the threshold of the alternative state value function within the state space of the safe and feasible region is the distance between the generated trajectory and the surrounding vehicles in a dynamic traffic flow scenario, i.e.: [equation]; where $\eta$ is the attenuation coefficient, $d_{t,i}$ is the distance between the vehicle and vehicle $i$ at time $t$, and $n$ is the number of surrounding vehicles.
6. The apparatus according to claim 5, characterized in that, in iteratively updating the state value function, the state-action value function, the Lagrange multiplier function, the policy function, and the feasible state-action value function based on the threshold of the alternative state value function in the state space of the safe and feasible region and the preset update strategy, the control module is specifically configured to:
update the state value function by minimizing the mean squared error, with the following objective function and gradient: [formula]; [formula]; where the symbols denote, in order, the objective function of the state value function, the parameters of the state value function, the state value function, the state, the state-action value function, the state at the next moment, the corresponding action, the temperature coefficient, the entropy of the policy function, and the gradient of the state value function;
update the state-action value function by minimizing the Bellman residual, with the following objective function and gradient: [formula]; [formula]; where the symbols denote, in order, the objective function of the state-action value function, the state-action value function, the action, the state distribution under the policy function, the reward function, the discount factor, the target state value function, and the gradient of the state-action value function;
update the Lagrange multiplier function with the following objective function and gradient: [formula]; [formula]; where the symbols denote, in order, the objective function of the Lagrange multiplier function, the Lagrange multiplier function, the feasible state-action value function, and the gradient of the Lagrange multiplier function; and
update the feasible state-action value function with the following objective function and gradient: [formula]; [formula]; [formula]; where the symbols denote, in order, the update target of the feasible state-action value function, the objective function of the feasible state-action value function, the constraint threshold, and the gradient of the feasible state-action value function.
7. The apparatus according to claim 5, characterized in that, in converting the first state information to the target coordinate system, the acquisition module is specifically configured to: transform the first state information to the target coordinate system based on a preset coordinate system transformation function, wherein the preset coordinate system transformation function is: [formula]; where the Frenet-side quantities are the longitudinal displacement, longitudinal velocity, longitudinal acceleration, lateral displacement, lateral velocity, and lateral acceleration in the Frenet coordinate system; the world-side quantities are the horizontal position, vertical position, velocity, acceleration, orientation angle, and curvature in the world coordinate system; and the remaining symbol is the transformation function between the two coordinate systems.
8. The apparatus according to claim 5, characterized in that the functional relationship between the initial lateral displacement and the initial longitudinal velocity is: [formula]; [formula]; where the lateral displacement is a function of the longitudinal displacement, the longitudinal displacement is a function of time, and the remaining symbols denote the degrees of the polynomials.
9. A vehicle, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the autonomous driving integrated decision-making method according to any one of claims 1-4.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the autonomous driving integrated decision-making method according to any one of claims 1-4.
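For illustration only, the world-to-Frenet transformation recited in claims 3 and 7 can be sketched as follows. The claimed transformation function itself is not reproduced in this text, so this is a minimal sketch under a strong simplifying assumption: the reference path is a straight line (zero curvature). The names `FrenetState` and `world_to_frenet_straight` and the `ref_*` parameters are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class FrenetState:
    s: float       # longitudinal displacement
    s_dot: float   # longitudinal velocity
    s_ddot: float  # longitudinal acceleration
    d: float       # lateral displacement
    d_dot: float   # lateral velocity
    d_ddot: float  # lateral acceleration


def world_to_frenet_straight(x, y, v, a, theta, kappa,
                             ref_x=0.0, ref_y=0.0, ref_theta=0.0):
    """Convert a world-frame state (x, y, v, a, theta, kappa) to a Frenet state.

    Assumption (not from the patent): the reference path is a straight line
    through (ref_x, ref_y) with heading ref_theta, so its curvature is zero.
    """
    # Unit tangent and left-pointing unit normal of the reference line.
    tx, ty = math.cos(ref_theta), math.sin(ref_theta)
    nx, ny = -ty, tx

    # Position: project the offset onto the tangent and the normal.
    dx, dy = x - ref_x, y - ref_y
    s = dx * tx + dy * ty
    d = dx * nx + dy * ny

    # Velocity: split the speed along the tangent and the normal.
    dtheta = theta - ref_theta
    s_dot = v * math.cos(dtheta)
    d_dot = v * math.sin(dtheta)

    # Acceleration: tangential part a plus centripetal part v^2 * kappa,
    # both rotated into the reference frame.
    centripetal = v * v * kappa
    s_ddot = a * math.cos(dtheta) - centripetal * math.sin(dtheta)
    d_ddot = a * math.sin(dtheta) + centripetal * math.cos(dtheta)

    return FrenetState(s, s_dot, s_ddot, d, d_dot, d_ddot)
```

A curved reference path would additionally require the nearest reference point and its curvature, which this sketch deliberately leaves out.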
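The polynomial relationship recited in claims 4 and 8, where the lateral displacement is a function of the longitudinal displacement and the longitudinal displacement is a function of time, might be evaluated along the following lines. The coefficient lists, their ordering (lowest power first), and the function names are assumptions made for this sketch; the patent's actual coefficients and polynomial degrees are not given in this text.

```python
def poly_eval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the derivative, lowest power first."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def sample_trajectory(s_coeffs, d_coeffs, times):
    """Sample a trajectory defined by s(t) and d(s).

    s_coeffs: hypothetical coefficients of longitudinal displacement s(t).
    d_coeffs: hypothetical coefficients of lateral displacement d(s).
    Returns a list of (t, s, d, s_dot) tuples, where s_dot is ds/dt.
    """
    s_dot_coeffs = poly_derivative(s_coeffs)
    samples = []
    for t in times:
        s = poly_eval(s_coeffs, t)
        d = poly_eval(d_coeffs, s)
        s_dot = poly_eval(s_dot_coeffs, t)
        samples.append((t, s, d, s_dot))
    return samples
```

For example, `sample_trajectory([0.0, 10.0, 0.5], [1.0, 0.0, -0.01], [0.0, 1.0, 2.0])` samples a vehicle that starts at 10 m/s with a constant 1 m/s² acceleration while its lateral offset varies quadratically with distance travelled.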
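The update objectives described in claim 6 read like those of an entropy-regularized actor-critic with a state-dependent Lagrange multiplier. Because the claimed formulas are not reproduced in this text, the sketch below assumes the common soft actor-critic forms: a squared error against the soft target Q(s,a) − α·log π(a|s) for the state value function, a squared Bellman residual against r + γ·V_target(s') for the state-action value function, and a projected dual-ascent step for the multiplier. All function names and the tabular, per-sample representation are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def value_objective(v, q, log_pi, alpha):
    """Assumed form: mean of 0.5 * (V(s) - (Q(s,a) - alpha * log pi(a|s)))^2."""
    return mean(0.5 * (vi - (qi - alpha * lp)) ** 2
                for vi, qi, lp in zip(v, q, log_pi))


def q_objective(q, reward, gamma, v_target_next):
    """Assumed form: mean of 0.5 * (Q(s,a) - (r + gamma * V_target(s')))^2."""
    return mean(0.5 * (qi - (ri + gamma * vn)) ** 2
                for qi, ri, vn in zip(q, reward, v_target_next))


def multiplier_objective(lam, q_feasible, threshold):
    """Assumed form: mean of lambda(s) * (Q_f(s,a) - threshold)."""
    return mean(li * (qf - threshold) for li, qf in zip(lam, q_feasible))


def multiplier_step(lam, q_feasible, threshold, lr):
    """One projected gradient-ascent step on each multiplier.

    The gradient of lambda * (Q_f - threshold) with respect to lambda is
    (Q_f - threshold); the result is clipped at zero so multipliers stay
    non-negative.
    """
    return [max(0.0, li + lr * (qf - threshold))
            for li, qf in zip(lam, q_feasible)]
```

In this sketch a multiplier grows where the feasible state-action value exceeds the constraint threshold and shrinks toward zero where the constraint is satisfied, which is the usual behaviour of dual ascent.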