Autonomous driving behavior decision-making and model training method, system, device, and medium
By constructing a discrete set of scene spaces and a pre-trained decision model, and combining it with reinforcement learning to update parameters, the problems of data volume and labeling accuracy in autonomous driving decision models are solved, improving the model's generalization ability and decision performance, and enabling it to adapt to complex environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 苏州畅行智驾汽车科技有限公司
- Filing Date
- 2023-08-09
- Publication Date
- 2026-06-19
Description
Technical Field
[0001] This invention relates to the fields of vehicles and data processing technology, and in particular to a training method for an autonomous driving behavior decision-making model, an autonomous driving behavior decision-making method, a training system for an autonomous driving behavior decision-making model, an autonomous driving behavior decision-making system, an electronic device, and a computer-readable storage medium.
Background Technology
[0002] Current autonomous driving decision-making systems increasingly employ deep neural networks to train end-to-end behavioral decision-making models to process massive amounts of driving data. With the continuous advancement of autonomous driving technology, deep neural networks have achieved certain results in behavioral decision-making, enabling them to complete autonomous driving tasks in certain scenarios. However, due to the complexity of the autonomous driving environment, existing decision-making models often still have some problems.
[0003] 1. Data Volume and Labeling Accuracy: Deep neural networks require a large amount of high-quality training data to achieve good performance, but acquiring large-scale real-world driving data is an expensive and time-consuming process. At the same time, the accuracy of data labeling is also a challenge, as correct labeling of driving behavior requires professional drivers or experts.
[0004] 2. Generalization ability to complex environments: Current deep neural networks still need improvement in their generalization ability when dealing with complex traffic scenarios, extreme weather conditions, and unusual events. Because real-world driving scenarios are highly diverse, models need to be able to make reliable decisions under various circumstances.
[0005] Overall, while deep neural networks have made some progress in autonomous driving decision-making, many technical problems still need to be solved.
Summary of the Invention
[0006] In view of the above problems, embodiments of the present invention are proposed to provide a training method for an autonomous driving behavior decision model, an autonomous driving behavior decision method, and a corresponding training system for an autonomous driving behavior decision model and an autonomous driving behavior decision system to overcome or at least partially solve the above problems.
[0007] To address the aforementioned issues, this invention discloses a training method for an autonomous driving behavior decision-making model, applied to a vehicle. The method includes: acquiring environmental perception data of the vehicle in a target scenario; constructing a discrete scene space set based on the environmental perception data; pre-training the decision-making model using the scene space set and outputting decision results; calculating the cumulative reward value of the decision-making model based on driver behavior data and / or the vehicle's driving environment data, as well as the decision results; and iteratively updating the parameters of the decision-making model based on the scene space set and the cumulative reward value.
[0008] Optionally, constructing a discrete scene space set based on the environmental perception data includes: converting continuous variables in the environmental perception data into one-hot codes; and combining the one-hot codes into the scene space set.
[0009] Optionally, the step of pre-training the decision model using the scene space set and outputting the decision result includes: inputting the scene space set into the decision model, outputting multiple decision behaviors and the probability of each decision behavior; and taking the decision behavior with the highest probability as the decision result.
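The selection rule in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_decision`, the behavior labels, and the probability values are all hypothetical.

```python
from typing import Dict, Tuple


def select_decision(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the decision behavior with the highest probability."""
    if not probabilities:
        raise ValueError("the model must output at least one decision behavior")
    behavior = max(probabilities, key=probabilities.get)
    return behavior, probabilities[behavior]


# Hypothetical model output for a single scene (values are made up).
example_output = {
    "follow": 0.55,
    "left_lane_change": 0.30,
    "right_lane_change": 0.15,
}
decision, confidence = select_decision(example_output)
```

Ties are resolved here by dictionary order, which is an arbitrary choice; the text does not say how equal probabilities should be handled.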
[0010] Optionally, calculating the cumulative reward value of the decision model based on the driver's behavioral operation data and / or the vehicle's driving environment data, as well as the decision result, includes: assessing the degree of approval of the decision result based on the behavioral operation data, and / or determining the evaluation result of the decision result based on the driving environment data; and calculating the cumulative reward value based on the degree of approval and / or the evaluation result.
[0011] Optionally, assessing the degree of approval of the decision result based on the behavioral operation data includes: monitoring at least one of the driver's facial expressions, postures, and intervention actions; classifying the facial expressions and / or postures to obtain the degree of approval; and / or determining the degree of approval based on the intervention actions and the decision result; wherein the degree of approval is approval, disapproval, or neutral.
[0012] Optionally, determining the degree of approval based on the intervention action and the decision result includes: comparing the intervention action with the decision result; determining the degree of approval as approval if the intervention action and the decision result are the same or related; determining the degree of approval as disapproval if the intervention action and the decision result are contradictory; and determining the degree of approval as neutral if the intervention action and the decision result are unrelated.
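The three-way comparison described above can be sketched as a small lookup. The text does not define how "related" or "contradictory" pairs are identified, so the two relation tables below, along with every action name, are hypothetical placeholders.

```python
from enum import Enum


class Approval(Enum):
    APPROVED = "approved"
    DISAPPROVED = "disapproved"
    NEUTRAL = "neutral"


# Hypothetical relation tables; a real system would define these per behavior.
RELATED = {frozenset({"left_lane_change", "overtake_left"})}
CONTRADICTORY = {
    frozenset({"left_lane_change", "right_lane_change"}),
    frozenset({"follow", "overtake_left"}),
}


def approval_from_intervention(intervention: str, decision: str) -> Approval:
    """Compare a driver intervention with the model's decision result."""
    pair = frozenset({intervention, decision})
    if intervention == decision or pair in RELATED:
        return Approval.APPROVED      # same or related
    if pair in CONTRADICTORY:
        return Approval.DISAPPROVED   # contradictory
    return Approval.NEUTRAL           # unrelated
```

Using unordered pairs keeps the relation symmetric, which is an assumption made for brevity.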
[0013] Optionally, determining the evaluation result of the decision based on the driving environment data includes: determining the evaluation result based on the driving environment data of the vehicle after executing the decision and at least one evaluation parameter; wherein the evaluation parameter includes: dynamic constraints, kinematic constraints, traffic regulations, collision risk, and fuel consumption.
[0014] Optionally, calculating the cumulative reward value based on the degree of approval and / or the evaluation result includes: separately counting the number of items whose degree of approval is "approved" and the number of items whose degree of approval is "disapproved"; adding a positive unit of reward value for each "approved" item, adding a negative unit of reward value for each "disapproved" item, and adding no reward value for "neutral" items; and / or, adding a negative unit of reward value if the evaluation result indicates that the driving environment data does not comply with the dynamic constraints or the kinematic constraints, and adding no reward value if the evaluation result indicates that the driving environment data complies with the dynamic constraints or the kinematic constraints; and / or, adding a negative unit of reward value if the evaluation result indicates that the driving environment data violates traffic regulations, and adding no reward value if the evaluation result indicates that the driving environment data does not violate traffic regulations; and / or, adding a negative unit of reward value if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold; adding a positive unit of reward value if the evaluation result indicates that the reduction in collision risk shown by the driving environment data is greater than or equal to a preset second risk threshold; adding no reward value if the evaluation result indicates that the increase in collision risk is less than the first risk threshold, or that the reduction in collision risk is less than the second risk threshold; and / or, adding a negative unit of reward value if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold; adding a positive unit of reward value if the evaluation result indicates that the reduction in fuel consumption shown by the driving environment data is greater than or equal to a preset second fuel consumption threshold; adding no reward value if the evaluation result indicates that the increase in fuel consumption is less than the first fuel consumption threshold, or that the reduction in fuel consumption is less than the second fuel consumption threshold; and summing the added positive unit reward values and negative unit reward values to obtain the cumulative reward value.
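The reward rules in this paragraph can be sketched as a single function. This is an illustration under assumptions: the `Evaluation` container, the signed "change" fields, the unit value of 1.0, and all threshold defaults are hypothetical, and thresholds are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    approved_count: int = 0
    disapproved_count: int = 0
    violates_dynamics_or_kinematics: bool = False
    violates_traffic_rules: bool = False
    collision_risk_change: float = 0.0    # positive means the risk increased
    fuel_consumption_change: float = 0.0  # positive means consumption increased


def threshold_reward(change: float, first_threshold: float,
                     second_threshold: float, unit: float) -> float:
    """Penalize a large increase, reward a large reduction, else add nothing."""
    if change >= first_threshold:
        return -unit
    if -change >= second_threshold:
        return unit
    return 0.0


def cumulative_reward(ev: Evaluation, unit: float = 1.0,
                      risk_first: float = 0.1, risk_second: float = 0.1,
                      fuel_first: float = 0.5, fuel_second: float = 0.5) -> float:
    # One positive unit per approved item, one negative unit per disapproved item.
    reward = unit * ev.approved_count - unit * ev.disapproved_count
    if ev.violates_dynamics_or_kinematics:
        reward -= unit
    if ev.violates_traffic_rules:
        reward -= unit
    reward += threshold_reward(ev.collision_risk_change, risk_first, risk_second, unit)
    reward += threshold_reward(ev.fuel_consumption_change, fuel_first, fuel_second, unit)
    return reward
```

For instance, two approvals, one disapproval, one traffic violation, and a collision-risk reduction of 0.2 would sum to 2 - 1 - 1 + 1 = 1 unit under these defaults.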
[0015] Optionally, the step of iteratively updating the parameters of the decision model based on the scene space set and the cumulative reward value includes: uploading the scene space set and the cumulative reward value to a cloud-based online learning library, so that the cloud-based online learning library uses a reinforcement learning method to iteratively update the parameters of the decision model based on the principle of maximizing the cumulative reward value.
[0016] Optionally, the method further includes: if the change value of the parameters of the decision model after iterative updating by the cloud-based online learning library is greater than or equal to a preset change threshold, receiving the iteratively updated parameters of the decision model from the cloud-based online learning library and updating the parameters of the decision model locally on the vehicle.
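A minimal sketch of the threshold-gated update follows. The text does not specify how the "change value" of the parameters is measured, so the maximum absolute per-parameter difference used here is an assumption, as are the function names and the flat-list parameter layout.

```python
from typing import List, Sequence


def parameter_change(old: Sequence[float], new: Sequence[float]) -> float:
    """Assumed change measure: largest absolute difference between parameters."""
    if len(old) != len(new):
        raise ValueError("parameter vectors must have the same length")
    return max((abs(n - o) for o, n in zip(old, new)), default=0.0)


def sync_from_cloud(local_params: Sequence[float],
                    cloud_params: Sequence[float],
                    change_threshold: float) -> List[float]:
    """Adopt the cloud parameters only when the change reaches the threshold."""
    if parameter_change(local_params, cloud_params) >= change_threshold:
        return list(cloud_params)
    return list(local_params)
```

Gating the download this way avoids pushing negligible updates to the vehicle; a norm over all parameters would be an equally plausible change measure.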
[0017] This invention also discloses an autonomous driving behavior decision-making method, the method comprising: acquiring vehicle data and environmental data in real time; inputting the vehicle data and environmental data into an autonomous driving behavior decision-making model trained according to the training method of the autonomous driving behavior decision-making model described above, and outputting a behavior decision result; and executing according to the behavior decision result.
[0018] This invention also discloses a training system for an autonomous driving behavior decision-making model, applied to a vehicle. The system includes: a data acquisition module for acquiring environmental perception data of the vehicle in a target scenario; a set construction module for constructing a discrete scene space set based on the environmental perception data; a model training module for pre-training the decision-making model using the scene space set and outputting decision results; a reward calculation module for calculating the cumulative reward value of the decision-making model based on the driver's behavior operation data and / or the vehicle's driving environment data, as well as the decision results; and a parameter update module for iteratively updating the parameters of the decision-making model based on the scene space set and the cumulative reward value.
[0019] Optionally, the set construction module includes: an encoding conversion module for converting continuous variables in the environmental perception data into one-hot codes; and an encoding combination module for combining the one-hot codes into the scene space set.
[0020] Optionally, the model training module includes: a behavior and probability output module, used to input the scene space set into the decision model and output multiple decision behaviors and the probability of each decision behavior; and a decision result determination module, used to take the decision behavior with the highest probability as the decision result.
[0021] Optionally, the reward calculation module includes: an evaluation and determination module, used to evaluate the degree of acceptance of the decision result based on the behavioral operation data, and / or determine the evaluation result of the decision result based on the driving environment data; and a cumulative calculation module, used to calculate the cumulative reward value based on the degree of acceptance and / or the evaluation result.
[0022] Optionally, the evaluation and determination module includes: a driver monitoring module, used to monitor at least one of the driver's facial expressions, postures, and intervention actions; and a data classification module, used to classify the facial expressions and / or postures to obtain the degree of approval, and / or to determine the degree of approval based on the intervention actions and the decision result; wherein the degree of approval is approval, disapproval, or neutral.
[0023] Optionally, the data classification module includes: an action result comparison module, used to compare the intervention action with the decision result; and an approval level determination module, used to determine the degree of approval as "approved" when the intervention action and the decision result are the same or related; to determine the degree of approval as "disapproved" when the intervention action and the decision result are contradictory; and to determine the degree of approval as "neutral" when the intervention action and the decision result are unrelated.
[0024] Optionally, the evaluation and determination module includes: an environmental feedback module, used to determine the evaluation result based on the driving environment data of the vehicle after executing the decision result and at least one evaluation parameter; wherein the evaluation parameter includes: dynamic constraints, kinematic constraints, traffic regulations, collision risk, and fuel consumption.
[0025] Optionally, the cumulative calculation module includes: a quantity statistics module, used to separately count the number of items whose degree of approval is "approved" and the number of items whose degree of approval is "disapproved"; a reward value increase module, used to add a positive unit of reward value for each "approved" item, add a negative unit of reward value for each "disapproved" item, and add no reward value for "neutral" items; and / or, to add a negative unit of reward value if the evaluation result indicates that the driving environment data does not comply with the dynamic constraints or the kinematic constraints, and add no reward value if the evaluation result indicates that the driving environment data complies with the dynamic constraints or the kinematic constraints; and / or, to add a negative unit of reward value if the evaluation result indicates that the driving environment data violates traffic regulations, and add no reward value if the evaluation result indicates that the driving environment data does not violate traffic regulations; and / or, to add a negative unit of reward value if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold; to add a positive unit of reward value if the evaluation result indicates that the reduction in collision risk shown by the driving environment data is greater than or equal to a preset second risk threshold; to add no reward value if the evaluation result indicates that the increase in collision risk is less than the first risk threshold, or that the reduction in collision risk is less than the second risk threshold; and / or, to add a negative unit of reward value if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold; to add a positive unit of reward value if the evaluation result indicates that the reduction in fuel consumption shown by the driving environment data is greater than or equal to a preset second fuel consumption threshold; to add no reward value if the evaluation result indicates that the increase in fuel consumption is less than the first fuel consumption threshold, or that the reduction in fuel consumption is less than the second fuel consumption threshold; and a reward value accumulation module, used to sum the added positive unit reward values and negative unit reward values to obtain the cumulative reward value.
[0026] Optionally, the parameter update module is used to upload the scene space set and the cumulative reward value to a cloud-based online learning library, so that the cloud-based online learning library uses reinforcement learning to iteratively update the parameters of the decision model based on the principle of maximizing the cumulative reward value.
[0027] Optionally, the system further includes a local update module, configured to receive the iteratively updated parameters of the decision model from the cloud-based online learning library and update the parameters of the decision model locally on the vehicle side when the change value after iteratively updating the parameters of the decision model in the cloud-based online learning library is greater than or equal to a preset change threshold.
[0028] This invention also discloses an autonomous driving behavior decision-making system, the system comprising: a real-time data acquisition module for acquiring vehicle data and environmental data in real time; a data input / output module for inputting the vehicle data and environmental data into an autonomous driving behavior decision-making model trained according to the training method of the autonomous driving behavior decision-making model described above, and outputting behavior decision results; and a decision result execution module for executing according to the behavior decision results.
[0029] This invention also discloses an electronic device, comprising: one or more processors; and one or more machine-readable media storing instructions thereon, which, when executed by the one or more processors, cause the electronic device to perform the training method for an autonomous driving behavior decision model as described above, and / or the autonomous driving behavior decision method.
[0030] This invention also discloses a computer-readable storage medium storing a computer program that causes a processor to execute the training method for the autonomous driving behavior decision model and / or the autonomous driving behavior decision method as described above.
[0031] The embodiments of the present invention have the following advantages:
[0032] The training scheme for the autonomous driving behavior decision-making model provided in this embodiment of the invention is applied to the vehicle. It acquires environmental perception data of the vehicle in the target scenario, constructs a discrete scene space set based on the environmental perception data, pre-trains the decision-making model using the scene space set, outputs decision results, calculates the cumulative reward value of the decision-making model based on the driver's behavior operation data and / or the vehicle's driving environment data, and the decision results, and iteratively updates the parameters of the decision-making model based on the scene space set and the cumulative reward value.
[0033] This invention constructs a discrete scene space set, transforming environmental perception data into a discretized form, thereby reducing reliance on large-scale, high-quality training data. Compared to directly using environmental perception data, the discretized scene space set is smaller in scale. Simultaneously, since discretized data is easier to label, the requirements for labeling accuracy are reduced, decreasing the human, material, and financial resources required for data collection and labeling. The decision-making model is pre-trained using the scene space set, allowing it to learn and generalize within this discretized set. After pre-training, the decision-making model can make decisions based on different scenarios, rather than just learning in specific environments. This allows the model to better adapt to complex driving scenarios and improve its generalization ability. The parameters of the decision-making model are iteratively updated using the scene space set and accumulated reward values, continuously optimizing the model and further improving decision-making performance and safety.
[0034] In summary, the training scheme for autonomous driving behavior decision-making models solves the problems of data volume and annotation accuracy by discretizing the scene space, pre-training, and iteratively updating parameters. It improves the generalization ability of the decision-making model and brings beneficial effects such as reducing data dependence and optimizing the decision-making model, providing an effective solution for the further development of autonomous driving technology.
Attached Figure Description
[0035] Figure 1 is a flowchart illustrating the steps of a training method for an autonomous driving behavior decision model according to an embodiment of the present invention;
[0036] Figure 2 is a schematic diagram of a scene space set according to an embodiment of the present invention;
[0037] Figure 3 is a schematic diagram illustrating the principle of a training method for an autonomous driving behavior decision-making model according to an embodiment of the present invention;
[0038] Figure 4 is a flowchart illustrating the steps of an autonomous driving behavior decision-making method according to an embodiment of the present invention;
[0039] Figure 5 is a structural block diagram of a training system for an autonomous driving behavior decision model according to an embodiment of the present invention;
[0040] Figure 6 is a structural block diagram of an autonomous driving behavior decision system according to an embodiment of the present invention.
Detailed Implementation
[0041] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0042] The training scheme for the autonomous driving behavior decision-making model proposed in this invention employs a discretized scene space set and a reinforcement learning-based method to optimize the decision-making model, improve generalization ability, and reduce reliance on large amounts of high-quality data. The training scheme for the autonomous driving behavior decision-making model constructs a scene space set by discretizing continuous environmental perception data, reducing the need for large-scale, high-quality data. The decision-making model is pre-trained using this scene space set, allowing it to learn and generalize within the discretized scene space, thereby improving its adaptability to complex environments. The cumulative reward value of the decision-making model is calculated based on driver behavior data and / or vehicle driving environment data, as well as the decision results. The parameters of the decision-making model are iteratively updated using reinforcement learning to optimize its decision-making performance.
[0043] Referring to Figure 1, a flowchart of the steps of a training method for an autonomous driving behavior decision-making model according to an embodiment of the present invention is shown. The training method can be applied to the vehicle side and may specifically include the following steps:
[0044] Step 101: Obtain environmental perception data of the vehicle in the target scenario.
[0045] In embodiments of the present invention, the target scenario can be understood as a typical driving scenario, i.e., a common driving scenario, not a special or extreme driving scenario. Environmental perception data mainly includes target object information, road information, and navigation path topology information. Target object information includes type, size, speed, acceleration, relative distance, etc. Road information includes lane line type, road curvature, slope, etc. Navigation path topology information includes the topological connection relationships of the reference path, such as a road segment, distance, etc.
[0046] Step 102: Construct a discrete set of scene spaces based on environmental perception data.
[0047] In embodiments of this invention, measurements such as vehicle size and speed are generally continuous variables, which differ significantly from the way drivers represent information when making behavioral decisions. Therefore, embodiments of this invention convert environmental perception data into a one-hot encoding format according to certain rules, thereby constructing a discrete scene space set. One-hot encoding is a data representation method typically used to convert discrete categorical variables into vector form, ensuring that each category has a unique representation in the vector. In one-hot encoding, if a sample belongs to a certain category, the vector is set to 1 at the position of that category and 0 at all other positions. Thus, each category is represented by a unique binary vector, and the dimension of the vector is equal to the number of categories.
[0048] For example, suppose we have a categorical variable for color, containing three colors: red, blue, and green. Using one-hot encoding, red can be represented as [1,0,0], blue as [0,1,0], and green as [0,0,1]. If a sample's color is red, its corresponding vector is [1,0,0].
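The color example can be reproduced in a few lines. The helper name `one_hot_category` is illustrative only; the category order red, blue, green follows the example above.

```python
from typing import List, Sequence

COLORS = ["red", "blue", "green"]


def one_hot_category(value: str, categories: Sequence[str]) -> List[int]:
    """Return a vector with 1 at the category's position and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[list(categories).index(value)] = 1  # raises ValueError if unknown
    return vector


red_vector = one_hot_category("red", COLORS)
```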
[0049] One-hot encoding is commonly used in machine learning and deep learning, especially in classification tasks. It transforms categorical variables into numerical vectors, facilitating computation and processing, while ensuring that partial order relationships are not introduced to avoid the model misinterpreting the magnitude relationships between categories. In classification problems, class labels are typically one-hot encoded so that the model can better learn the differences between different categories.
[0050] Taking speed as an example, the feasible region of the vehicle speed V is defined as 0 to 150 kph, which is segmented at certain intervals and one-hot encoded. For example, a vehicle speed in the range 0 ≤ V < 5 kph is assigned one-hot position 1, a vehicle speed in the range 5 ≤ V < 20 kph is assigned one-hot position 2, and so on. All continuous variables are discretized over their feasible regions in this way, and the discretization intervals are adjusted according to the actual situation.
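The speed discretization can be sketched with a sorted list of bin edges. Only the edges 0, 5, 20 and the upper limit 150 kph come from the text; the intermediate edges below are hypothetical, as are the names `SPEED_EDGES`, `bin_index`, and `one_hot`.

```python
from bisect import bisect_right
from typing import List, Sequence

# Bin edges in kph. 0, 5, 20 and 150 are from the text; the rest are assumed.
SPEED_EDGES = [0, 5, 20, 40, 60, 80, 100, 120, 150]


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Return the 1-based bin p with edges[p-1] <= value < edges[p]."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the feasible region")
    # The upper boundary itself is clamped into the last bin.
    return min(bisect_right(edges, value), len(edges) - 1)


def one_hot(index: int, size: int) -> List[int]:
    """Return a vector of the given size with 1 at the 1-based index."""
    vector = [0] * size
    vector[index - 1] = 1
    return vector


speed_bin = bin_index(12.0, SPEED_EDGES)              # falls in 5 <= V < 20
speed_code = one_hot(speed_bin, len(SPEED_EDGES) - 1)
```

Uneven bin widths, as in the 0 to 5 and 5 to 20 kph intervals, let the encoding stay fine-grained at low speeds without growing the vector.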
[0051] After discretization, the environmental perception data can be abstracted into a combination of a finite number of attributes. Let m represent the total number of attributes, let the i-th attribute contain $n_i$ dimensions, and let $p_i$ represent the actual measurement result of the i-th attribute. Let number be the column number whose value is 1 in the one-hot encoding. The actual measured environmental perception data is converted into one-hot encoding according to the following formula:
[0052] $$\text{number} = \sum_{i=1}^{m-1}\left[(p_i - 1)\prod_{j=i+1}^{m} n_j\right] + p_m$$
[0053] Where m represents the total number of attributes, i represents the index of each of the m attributes, $p_i$ represents the actual measurement result of the i-th attribute, $n_j$ represents the dimension contained in the j-th attribute, and $p_m$ represents the actual measurement result of the m-th attribute.
[0054] Referring to Figure 2, a schematic diagram of a scene space set according to an embodiment of the present invention is shown.
[0055] In Figure 2, the scene space set contains three attributes, i.e., m = 3. The feasible region of attribute 1 after discretization has three values (represented by the three circles in the column containing attribute 1; similarly, four circles in the column containing attribute 2 and two circles in the column containing attribute 3), i.e., n1 is 3. The actual result of the current measurement (the circle marked 2 in the column containing attribute 1) is 2, i.e., p1 is 2. Similarly, n2 is 4, p2 (the circle marked 4 in the column containing attribute 2) is 4, n3 is 2, and p3 (the circle marked 2 in the column containing attribute 3) is 2. Substituting these values into the formula gives the one-hot encoding number (2-1)*4*2 + (4-1)*2 + 2 = 1*4*2 + 3*2 + 2 = 16. The dimension of the entire scene space set is 3*4*2 = 24, so a row vector of size 1*24 can be constructed in which the value of the 16th column is 1 and the values of the remaining columns are 0. The arrows between the circles in Figure 2 indicate that the feasible regions can be combined to form the scene space set.
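The Figure 2 calculation can be checked with a short sketch. It assumes the column number is formed as in the worked arithmetic above, that is, each attribute except the last contributes (p_i - 1) times the product of the dimensions of the attributes after it, and the last attribute contributes p_m. The function names are illustrative.

```python
from math import prod
from typing import List, Sequence


def scene_index(p: Sequence[int], n: Sequence[int]) -> int:
    """Return the 1-based column number of the combined one-hot encoding."""
    if not p or len(p) != len(n):
        raise ValueError("p and n must be non-empty and of equal length")
    for p_i, n_i in zip(p, n):
        if not 1 <= p_i <= n_i:
            raise ValueError("each p_i must satisfy 1 <= p_i <= n_i")
    number = p[-1]
    for i in range(len(p) - 1):
        # (p_i - 1) times the product of the dimensions of the later attributes
        number += (p[i] - 1) * prod(n[i + 1:])
    return number


def scene_vector(p: Sequence[int], n: Sequence[int]) -> List[int]:
    """Return the 1 x prod(n) row vector with a single 1 at scene_index."""
    vector = [0] * prod(n)
    vector[scene_index(p, n) - 1] = 1
    return vector


# Values from the Figure 2 example: n = (3, 4, 2), p = (2, 4, 2).
number = scene_index([2, 4, 2], [3, 4, 2])
vector = scene_vector([2, 4, 2], [3, 4, 2])
```

This is a mixed-radix index, so every combination of attribute values maps to exactly one of the 24 columns.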
[0056] Step 103: Use the scene space set to pre-train the decision model and output the decision results.
[0057] In embodiments of this invention, a discrete scene space set is used as input to train a decision model to predict the decision outcome in each scene. This decision model can be a deep neural network, which, through supervised learning, outputs the probability of the corresponding decision behavior in each scene. The training objective is to make the decision model as close as possible to the decision outcomes in real driving data, thereby improving the model's performance and accuracy. The actual decision behavior of the driver is used as the ground truth to design a cost function to iterate the decision model. The parameters of the pre-trained decision model are obtained upon convergence of the iteration. These decision behaviors include, but are not limited to: following, cruise control, left lane change, right lane change, overtaking on the left, overtaking on the right, yielding on the left, and yielding on the right. After training, the decision model and its parameters can be uploaded to a cloud-based online learning library. This library can acquire environmental perception data from all vehicles during operation for iterative updates. Simultaneously, the pre-trained decision model and its parameters need to be deployed to the vehicle for data collection and real-time feedback.
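The text does not name the cost function used in pre-training. One common choice for supervised classification, assumed here purely for illustration, is cross-entropy between the model's output probabilities and the driver's actual behavior. The behavior identifiers and their order are hypothetical labels for the behaviors listed above.

```python
import math
from typing import Sequence, Tuple

# Hypothetical identifiers for the decision behaviors listed in the text.
BEHAVIORS = [
    "follow", "cruise", "left_lane_change", "right_lane_change",
    "overtake_left", "overtake_right", "yield_left", "yield_right",
]


def cross_entropy(probabilities: Sequence[float], driver_action: int) -> float:
    """Cost for one sample: negative log-probability of the driver's action."""
    return -math.log(max(probabilities[driver_action], 1e-12))


def mean_cost(batch: Sequence[Tuple[Sequence[float], int]]) -> float:
    """Average cost over (model probabilities, driver action index) pairs."""
    if not batch:
        raise ValueError("batch must not be empty")
    return sum(cross_entropy(p, a) for p, a in batch) / len(batch)
```

The cost shrinks toward zero as the model assigns more probability to the behavior the driver actually chose, which matches the stated training objective.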
[0058] Step 104: Calculate the cumulative reward value of the decision model based on the driver's behavior data and / or the vehicle's driving environment data, as well as the decision results.
[0059] In embodiments of the present invention, driver behavior data and / or vehicle driving environment data are used as inputs, combined with the output of the decision model, to calculate the cumulative reward value of the decision model in a real-world scenario. The cumulative reward value is an evaluation of the decision model's behavioral decision-making performance throughout the driving process. For example, the decision model outputs a decision result, and a reward or penalty is given based on the actual situation. The cumulative reward value can be determined based on driver behavior data and / or vehicle driving environment data, such as whether the vehicle's movement conforms to dynamic constraints, whether it complies with traffic rules, and whether it increases the risk of collision.
[0060] Step 105: Iteratively update the parameters of the decision model based on the scene space set and the cumulative reward value.
[0061] In embodiments of the present invention, the parameters of the decision-making model are iteratively updated using the calculated scene space set and cumulative reward value. A common method is reinforcement learning, which optimizes the decision-making model by maximizing the cumulative reward value. By continuously collecting data during actual driving, calculating the cumulative reward value, and updating the parameters, the decision-making model can be gradually optimized, enabling it to make more accurate behavioral decisions.
[0062] The training scheme for the autonomous driving behavior decision-making model provided in this embodiment of the invention is applied to the vehicle. It acquires environmental perception data of the vehicle in the target scenario, constructs a discrete scene space set based on the environmental perception data, pre-trains the decision-making model using the scene space set, outputs decision results, calculates the cumulative reward value of the decision-making model based on the driver's behavior operation data and / or the vehicle's driving environment data, and the decision results, and iteratively updates the parameters of the decision-making model based on the scene space set and the cumulative reward value.
[0063] This invention constructs a discrete scene space set, transforming environmental perception data into a discretized form, thereby reducing reliance on large-scale, high-quality training data. Compared to directly using environmental perception data, the discretized scene space set is smaller in scale. Simultaneously, since discretized data is easier to label, the requirements for labeling accuracy are reduced, decreasing the human, material, and financial resources required for data collection and labeling. The decision-making model is pre-trained using the scene space set, allowing it to learn and generalize within this discretized set. After pre-training, the decision-making model can make decisions based on different scenarios, rather than just learning in specific environments. This allows the model to better adapt to complex driving scenarios and improve its generalization ability. The parameters of the decision-making model are iteratively updated using the scene space set and accumulated reward values, continuously optimizing the model and further improving decision-making performance and safety.
[0064] In summary, the training scheme for autonomous driving behavior decision-making models solves the problems of data volume and annotation accuracy by discretizing the scene space, pre-training, and iteratively updating parameters. It improves the generalization ability of the decision-making model and brings beneficial effects such as reducing data dependence and optimizing the decision-making model, providing an effective solution for the further development of autonomous driving technology.
[0065] In an exemplary embodiment of the present invention, one implementation of constructing a discrete scene space set based on environmental perception data involves converting continuous variables in the environmental perception data into one-hot codes, and combining the one-hot codes into a scene space set. Constructing a discrete scene space set based on environmental perception data is a crucial step in the training scheme of an autonomous driving behavior decision model. Its purpose is to convert continuous environmental perception data into a discrete form, thereby constructing a finite scene space set for training and optimizing the decision model. In practical applications, vehicles acquire environmental perception data through various sensing devices (such as cameras, LiDAR, millimeter-wave radar, etc.), including the attributes of objects (such as type, size, speed, etc.), road information (such as lane lines, road curvature, etc.), and navigation path topology information. Continuous variables: some attributes in the environmental perception data (such as vehicle speed, distance, etc.) may be continuous variables, that is, their values vary over a continuous range rather than being drawn from a finite set of discrete values. Scene space set: after continuous variables are converted into a discrete form through one-hot encoding, a series of scene representations is obtained, each scene consisting of the one-hot codes of its discrete attributes. The scene space set contains all possible environmental scenes, each represented by a different combination of attributes.
[0066] For example, suppose an autonomous vehicle is driving, and its environmental perception data includes vehicle speed and distance to the object ahead, which are continuous variables, and lane type, which is a categorical attribute. The feasible range for vehicle speed is 0–150 km / h, the feasible range for distance to the object ahead is 0–200 meters, and there are two lane types: straight and curved. First, the continuous variables are discretized. Suppose the speed range is divided into three intervals: 0–50 km / h, 50–100 km / h, and 100–150 km / h, and the distance range is divided into three intervals: 0–50 meters, 50–100 meters, and 100–200 meters. Lane type is one-hot encoded as a binary vector of length 2: [1,0] represents a straight lane, and [0,1] represents a curved lane. Then, according to these discretization rules, vehicle speed, distance to the object ahead, and lane type are converted into the corresponding one-hot vectors. For example, a vehicle speed of 60 km / h is represented as [0,1,0]; a distance of 80 meters is represented as [0,1,0]; and a straight lane is represented as [1,0]. Finally, the discretized attributes are concatenated to obtain a one-hot encoded scene representation, here [0,1,0,0,1,0,1,0]. All possible scene combinations (3 × 3 × 2 = 18 in this example) constitute the scene space set, which is used to train and optimize the decision model.
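The discretization in this example can be sketched as follows. The interval boundaries come from the example; how a value exactly on a boundary (for instance 50 km / h) is assigned is not stated, so the half-open convention below, the clamping of out-of-range values into the last interval, and all function names are assumptions.

```python
from itertools import product

SPEED_UPPER_EDGES = [50, 100, 150]     # km/h, upper bound of each interval
DISTANCE_UPPER_EDGES = [50, 100, 200]  # meters, upper bound of each interval
LANE_TYPES = ["straight", "curved"]

def one_hot_interval(value, upper_edges):
    """One-hot code of the interval containing value.

    Assumption: intervals are [lower, upper); values at or above the final
    edge fall into the last interval.
    """
    index = len(upper_edges) - 1
    for i, edge in enumerate(upper_edges):
        if value < edge:
            index = i
            break
    return [1 if i == index else 0 for i in range(len(upper_edges))]

def one_hot_category(value, categories):
    """One-hot code of a categorical attribute such as lane type."""
    return [1 if value == c else 0 for c in categories]

def encode_scene(speed_kmh, distance_m, lane_type):
    """Concatenate the per-attribute one-hot codes into one scene vector."""
    return (one_hot_interval(speed_kmh, SPEED_UPPER_EDGES)
            + one_hot_interval(distance_m, DISTANCE_UPPER_EDGES)
            + one_hot_category(lane_type, LANE_TYPES))

def unit_vectors(n):
    """All one-hot vectors of length n."""
    return [[1 if i == k else 0 for i in range(n)] for k in range(n)]

def build_scene_space():
    """Enumerate every combination of attribute codes (the scene space set)."""
    return [s + d + l for s, d, l in product(
        unit_vectors(len(SPEED_UPPER_EDGES)),
        unit_vectors(len(DISTANCE_UPPER_EDGES)),
        unit_vectors(len(LANE_TYPES)))]

scene = encode_scene(60, 80, "straight")
scene_space = build_scene_space()
```

With these assumptions, `encode_scene(60, 80, "straight")` reproduces the vector [0,1,0,0,1,0,1,0] from the example, and the enumerated scene space contains 18 entries.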
[0067] This invention, by converting continuous variables into discrete forms with one-hot encoding, reduces data dimensionality, simplifies data processing and representation, and makes the scene space set easier to understand and process. Simultaneously, the scene space set can cover a variety of different environmental scenarios, providing more comprehensive training data, helping to optimize the generalization ability of the decision model, enabling the autonomous driving system to better cope with complex and unseen driving situations, and improving driving safety and performance.
[0068] In one exemplary embodiment of the present invention, one implementation method for pre-training a decision model using a scene space set to output decision results involves inputting the scene space set into the decision model, outputting multiple decision behaviors and the probability of each decision behavior, and selecting the decision behavior with the highest probability as the final decision result. Pre-training the decision model using a scene space set to output decision results is a key step in the training scheme of an autonomous driving behavior decision model. Its purpose is to use a pre-constructed scene space set to train the decision model, enabling the decision model to predict multiple decision behaviors based on the input scene information, calculate the probability of each decision behavior, and finally select the decision behavior with the highest probability as the final decision result.
[0069] For example, suppose the scene space contains three discrete attributes: vehicle speed, distance to the target ahead, and lane type. The decision model is a deep neural network that receives one-hot encodings of these three attributes as input and outputs probability values for four decision actions (following the car in front, changing lanes to the left, changing lanes to the right, and cruising). When the vehicle is in a specific scene, such as a speed of 60 km / h, a target distance of 80 meters, and a straight lane type, these attributes are converted into one-hot encodings [0, 1, 0, 0, 1, 0, 1, 0], which are then input into the decision model for prediction. Assume the following probability output is obtained after prediction:
[0070] - Following: 0.2
[0071] - Left lane change: 0.6
[0072] - Right lane change: 0.15
[0073] - Cruising: 0.05
[0074] Based on the output probability value, the decision behavior with the highest probability is selected as the final decision result, that is, choosing to change lanes to the left is the decision result in the current scenario.
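Selecting the highest-probability behavior can be written in a few lines. This is a sketch using the probabilities from the example; the dictionary keys and the function name `select_decision` are illustrative choices, not names from the source.

```python
# Probabilities output by the decision model in the example above.
behavior_probabilities = {
    "follow": 0.2,
    "left_lane_change": 0.6,
    "right_lane_change": 0.15,
    "cruise": 0.05,
}

def select_decision(probabilities):
    """Return the decision behavior with the highest probability."""
    return max(probabilities, key=probabilities.get)

decision = select_decision(behavior_probabilities)
```

Here `decision` is `"left_lane_change"`, matching the result stated in the example.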
[0075] This invention utilizes a scene space set to pre-train a decision model and output decision results, allowing the model to predict multiple decision behaviors based on different scenarios, rather than just a single optimal behavior. This improves the robustness and generalization ability of the decision model, enabling autonomous vehicles to better cope with complex and changing traffic environments. By outputting the probabilities of each decision behavior, the system can also consider the feasibility of different behaviors when making decisions, thereby improving driving safety and stability. Furthermore, the pre-trained decision model can serve as a foundation for subsequent online learning and optimization, further enhancing the model's performance.
[0076] In one exemplary embodiment of the present invention, one implementation of calculating the cumulative reward value of the decision model based on driver behavior data and / or vehicle driving environment data, and the decision result, involves assessing the degree of acceptance of the decision result based on the behavior data, and / or determining the evaluation result of the decision result based on the driving environment data, and calculating the cumulative reward value based on the degree of acceptance and / or the evaluation result. In practical applications, a driver feedback model can be constructed to assess the degree of acceptance of the decision result. Specifically, when the driver feedback model assesses the degree of acceptance based on the behavior data, an in-vehicle camera can monitor the driver's facial expressions and / or posture, which are then classified to obtain the degree of acceptance. Alternatively, in-vehicle sensors can monitor the driver's intervention actions, and the degree of acceptance is then determined from the intervention actions and the decision result. The degree of acceptance is one of acceptance, disapproval, or neutral. To determine it from the intervention actions, the intervention action is compared with the decision result. If the intervention action is the same as or related to the decision result, the degree of acceptance is acceptance; for example, the decision result is to change lanes to the left, and the driver's intervention action is to turn on the left turn signal. If the intervention action contradicts the decision result, the degree of acceptance is disapproval; for example, the decision result is to pull over, and the driver's intervention action is to press the accelerator. If the intervention action is unrelated to the decision result, the degree of acceptance is neutral; for example, the decision result is to change lanes to the left, and the driver's intervention action is to press the accelerator.
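The comparison of intervention action and decision result can be sketched as a lookup. The source does not say how "related" and "contradictory" pairs are determined, so the two hand-written tables below are hypothetical and contain only the pairs given in the examples; all identifiers are illustrative.

```python
# Hypothetical relation tables, filled only with the pairs from the examples.
RELATED = {("left_lane_change", "left_turn_signal")}
CONTRADICTORY = {("pull_over", "press_accelerator")}

def assess_acceptance(decision, intervention):
    """Classify the driver's intervention relative to the decision result."""
    if decision == intervention or (decision, intervention) in RELATED:
        return "acceptance"
    if (decision, intervention) in CONTRADICTORY:
        return "disapproval"
    return "neutral"  # unrelated intervention

results = [
    assess_acceptance("left_lane_change", "left_turn_signal"),
    assess_acceptance("pull_over", "press_accelerator"),
    assess_acceptance("left_lane_change", "press_accelerator"),
]
```

A deployed system would presumably derive these relations from a richer model of driver intent; the tables only make the three-way classification concrete.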
[0077] In practical applications, environmental feedback models can be constructed to determine the evaluation results of decision-making outcomes. When using environmental feedback models to determine the evaluation results of decision-making outcomes, i.e., when determining the evaluation results based on driving environment data, the evaluation results can be determined based on the driving environment data after the vehicle has executed the decision and at least one evaluation parameter. These evaluation parameters include, but are not limited to: dynamic constraints, kinematic constraints, traffic regulations, collision risk, and fuel consumption. For example, to determine whether vehicle motion conforms to kinematic or dynamic constraints, such as changing lanes in a curve, causing the vehicle's lateral acceleration to exceed a safety threshold, it can be considered that it does not conform to dynamic constraints. To determine whether vehicle motion conforms to traffic regulations, such as changing lanes across a solid line, it can be considered that it does not conform to traffic regulations. To determine whether vehicle motion increases collision risk, an artificial potential field method or a cost function can be used for evaluation. To determine whether the vehicle increases fuel consumption, an evaluation can be performed by recording fuel consumption for a period of time before and after the execution of the decision.
[0078] In practical applications, one implementation of calculating the cumulative reward value based on the degree of approval and / or the evaluation result is as follows. The numbers of items with an approval level of "approved" and of "disapproved" are counted separately. For each item with an approval level of "approved," a positive unit reward value is added; for each item with an approval level of "disapproved," a negative unit reward value is added; for items with a neutral approval level, no reward value is added. And / or, if the evaluation result indicates that the driving environment data does not comply with dynamic or kinematic constraints, a negative unit reward value is added; if the evaluation result indicates that the driving environment data complies with dynamic or kinematic constraints, no reward value is added. And / or, if the evaluation result indicates that the driving environment data violates traffic regulations, a negative unit reward value is added; if the evaluation result indicates that the driving environment data does not violate traffic regulations, no reward value is added. And / or, if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold, a negative unit reward value is added; if the evaluation result indicates that the decrease in collision risk is greater than or equal to a preset second risk threshold, a positive unit reward value is added; if the increase in collision risk is less than the first risk threshold, or the decrease in collision risk is less than the second risk threshold, no reward value is added. And / or, if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold, a negative unit reward value is added; if the evaluation result indicates that the decrease in fuel consumption is greater than or equal to a preset second fuel consumption threshold, a positive unit reward value is added; if the increase in fuel consumption is less than the first fuel consumption threshold, or the decrease in fuel consumption is less than the second fuel consumption threshold, no reward value is added. The cumulative reward value is obtained by summing all of the positive and negative unit reward values added in this way.
[0079] It should be noted that the above unit values can be determined according to the actual situation.
[0080] The aforementioned calculation of the cumulative reward value of the decision-making model based on driver behavior data and / or vehicle driving environment data, along with the decision results, is another crucial step in the training scheme for autonomous driving behavior decision-making models. Its purpose is to optimize the performance of the decision-making model by calculating the cumulative reward value through evaluating the degree of acceptance of the decision results and environmental feedback. Specifically, the driver feedback model assesses the degree of acceptance of the decision results. It determines the driver's attitude towards the decision results, including acceptance, disapproval, and neutrality, based on driver behavior data (such as facial expressions and gestures) and intervention actions. The environmental feedback model evaluates the decision results. It judges the quality of the decision results based on driving environment data and evaluation parameters (such as dynamic constraints, kinematic constraints, traffic regulations, collision risk, fuel consumption, etc.) after the vehicle executes the decision. The cumulative reward value is a comprehensive evaluation of the decision results, including the combined impact of driver and environmental feedback. It is calculated by assessing the degree of acceptance and / or evaluation results. A positive cumulative reward value indicates that the decision results are accepted and supported by environmental rules, while a negative cumulative reward value indicates that the decision results are not accepted or violate environmental rules.
[0081] For example, suppose the decision is to change lanes to the left. During the actual execution, the driver feedback model detects that the driver has activated the left turn signal, and the environmental feedback model indicates that the vehicle's lateral acceleration exceeds the safety threshold during the left lane change. Based on this feedback information, a cumulative reward value can be calculated. If the level of approval is "approved," a positive reward value is added; if the level of approval is "disapproved," a negative reward value is added; if the level of approval is "neutral," no reward value is added. Simultaneously, based on the evaluation results of the environmental feedback model, if the vehicle violates dynamic constraints or traffic regulations, a negative reward value is added; if the vehicle complies with dynamic constraints and traffic regulations, no reward value is added. Finally, all added positive and negative reward values are summed to obtain the cumulative reward value.
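The reward rules and the example above can be sketched in code. The unit reward value and the four thresholds are left to "the actual situation" in the description, so the defaults below are placeholders, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EnvironmentEvaluation:
    violates_motion_constraints: bool = False  # dynamic or kinematic
    violates_traffic_rules: bool = False
    collision_risk_change: float = 0.0    # > 0: risk increased, < 0: decreased
    fuel_consumption_change: float = 0.0  # > 0: consumption increased

def threshold_term(change, increase_threshold, decrease_threshold, unit):
    """-unit for a large increase, +unit for a large decrease, else 0."""
    if change >= increase_threshold:
        return -unit
    if -change >= decrease_threshold:
        return unit
    return 0.0

def cumulative_reward(approval_labels, evaluation, unit=1.0,
                      risk_thresholds=(0.1, 0.1),   # placeholder values
                      fuel_thresholds=(0.5, 0.5)):  # placeholder values
    """Sum the positive and negative unit rewards from both feedback models."""
    reward = unit * approval_labels.count("approved")
    reward -= unit * approval_labels.count("disapproved")
    # "neutral" labels add nothing.
    if evaluation.violates_motion_constraints:
        reward -= unit
    if evaluation.violates_traffic_rules:
        reward -= unit
    reward += threshold_term(evaluation.collision_risk_change,
                             *risk_thresholds, unit)
    reward += threshold_term(evaluation.fuel_consumption_change,
                             *fuel_thresholds, unit)
    return reward

# Left lane change: the driver signals left (approved), but lateral
# acceleration exceeds the safety threshold (dynamic constraint violated).
example_reward = cumulative_reward(
    ["approved"], EnvironmentEvaluation(violates_motion_constraints=True))
```

With equal unit values, the approval and the constraint violation in this example cancel, so `example_reward` is 0.0; different unit values per term would shift that balance.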
[0082] This invention enables online optimization of the decision-making model by assessing the acceptance and evaluation results of decision-making outcomes based on driver and environmental feedback. The optimization goal is to maximize the cumulative reward value, thereby gradually guiding the decision-making model towards producing decisions that better align with driver expectations and environmental rules. This optimization process helps improve the behavioral decision-making capabilities of autonomous vehicles, enabling them to drive more safely, stably, and intelligently in complex traffic environments. Simultaneously, the calculation of the cumulative reward value helps monitor the performance of the decision-making model, promptly identifying and correcting potential problems, and improving the reliability and availability of the autonomous driving system.
[0083] In one exemplary embodiment of the present invention, one implementation method for iteratively updating the parameters of the decision model based on the scene space set and the cumulative reward value is as follows: the scene space set and the cumulative reward value are uploaded to a cloud-based online learning library, so that the cloud-based online learning library uses a reinforcement learning method to iteratively update the parameters of the decision model based on the principle of maximizing the cumulative reward value. Here, the cloud-based online learning library is an online learning platform located in the cloud, used to store and manage large amounts of data, and employs reinforcement learning algorithms for iterative model updates. Reinforcement learning is a machine learning method that learns optimal behavioral strategies through interaction with the environment. By observing the state of the environment, selecting actions, and obtaining rewards or penalties from the environment, the strategy is gradually optimized to maximize the cumulative reward.
[0084] Uploading the scene space set and cumulative reward values to the cloud-based online learning library is to leverage the capabilities of cloud-based reinforcement learning to optimize the decision-making model. After uploading this data, the cloud-based online learning library uses reinforcement learning methods to iteratively update the parameters of the decision-making model based on the principle of maximizing the cumulative reward value. This iterative update process is based on the reward feedback from different scenes in the scene space set and the evaluation results of the cumulative reward value, enabling the decision-making model to gradually learn better decision-making strategies.
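One way to picture a reward-maximizing update is the sketch below. The description does not name a specific reinforcement learning algorithm, so a REINFORCE-style policy-gradient step on a linear softmax policy is substituted here purely for illustration; the policy form, learning rate, and all names are assumptions.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(weights, scene):
    """Behavior probabilities of a linear softmax policy (one row per behavior)."""
    logits = [sum(w * x for w, x in zip(row, scene)) for row in weights]
    return softmax(logits)

def reinforce_update(weights, scene, action, reward, learning_rate=0.1):
    """One gradient-ascent step on reward * log pi(action | scene).

    For a linear softmax policy, the gradient with respect to row k is
    (1[k == action] - pi_k) * scene.
    """
    probs = policy(weights, scene)
    updated = []
    for k, row in enumerate(weights):
        coeff = (1.0 if k == action else 0.0) - probs[k]
        updated.append([w + learning_rate * reward * coeff * x
                        for w, x in zip(row, scene)])
    return updated

scene = [0, 1, 0, 0, 1, 0, 1, 0]           # one-hot scene code
weights = [[0.0] * len(scene) for _ in range(4)]  # 4 behaviors, untrained
LEFT_LANE_CHANGE = 1                        # illustrative behavior index

rewarded = reinforce_update(weights, scene, LEFT_LANE_CHANGE, reward=1.0)
penalized = reinforce_update(weights, scene, LEFT_LANE_CHANGE, reward=-1.0)
```

A positive cumulative reward raises the probability of the chosen behavior in that scene and a negative one lowers it, which is the qualitative effect the paragraph describes.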
[0085] For example, suppose the scene space set includes scenes such as left lane changing, right lane changing, and going straight, each with its own environmental perception data and cumulative reward value. This data is uploaded to the cloud-based online learning library, which uses a reinforcement learning algorithm to iteratively update the parameters of the decision-making model. For instance, in a real-world application, when a vehicle is in a left lane changing scene, the cumulative reward value derived from driver and environmental feedback may be high, indicating that the decision is accepted and complies with traffic rules. In the cloud-based online learning library, the parameters of the decision-making model are then updated according to the reinforcement learning algorithm, further optimizing the model's decision-making ability for left lane changing scenes. The same update process is performed for the other scenes.
[0086] This invention employs reinforcement learning through a cloud-based online learning library. This allows for the global aggregation and management of large sets of scene spaces and accumulated reward values, thereby optimizing the decision-making model globally. This enables the decision-making model to gradually approach its optimal performance across different scenarios, thus improving the decision-making capabilities and safety of the autonomous driving system. Furthermore, because data is stored in the cloud, different vehicles can share the learned knowledge, promoting knowledge transfer and sharing, and accelerating the advancement of autonomous driving technology.
[0087] In an exemplary embodiment of the present invention, if the change value of the decision model parameters after iterative updates in the cloud-based online learning library is greater than or equal to a preset change threshold, the updated decision model parameters are received from the cloud-based online learning library, and the parameters of the decision model locally on the vehicle are updated. After iteratively updating the decision model parameters in the cloud-based online learning library, new model parameters are obtained. If the updated parameters change significantly compared to the previous parameters (i.e., the change value is greater than or equal to the preset change threshold), these parameters are sent to the vehicle, and the parameters of the decision model locally on the vehicle are updated. The purpose of this is to promptly apply the optimized decision model parameters to the vehicle to improve real-time decision-making performance.
[0088] For example, suppose the cloud-based online learning library obtains new decision model parameters after one iterative update. The new parameters are compared with the original parameters; if the change is greater than or equal to a preset change threshold, the decision model is considered to have undergone significant optimization. For instance, the original decision model parameters are [0.5, -0.2, 0.8], and after one iterative update the new parameters are [0.7, -0.1, 0.9]. Euclidean distance or another suitable distance metric can be used to calculate the difference between the parameters. Assuming Euclidean distance is used, the calculated change is approximately 0.245 (the square root of 0.2² + 0.1² + 0.1² = 0.06), while the preset change threshold is 0.2. Because 0.245 is greater than the preset change threshold of 0.2, the parameters are considered to have undergone a significant update. Therefore, the new parameters [0.7, -0.1, 0.9] are sent to the vehicle, and the decision model parameters are updated locally on the vehicle.
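The threshold check can be sketched with the standard library. Euclidean distance is one of the metrics the text allows; the function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def parameter_change(old_params, new_params):
    """Euclidean distance between two flat parameter vectors."""
    return math.dist(old_params, new_params)

def should_push_to_vehicle(old_params, new_params, change_threshold):
    """Send updated parameters only when the change reaches the threshold."""
    return parameter_change(old_params, new_params) >= change_threshold

old = [0.5, -0.2, 0.8]
new = [0.7, -0.1, 0.9]
change = parameter_change(old, new)           # sqrt(0.06), about 0.245
push = should_push_to_vehicle(old, new, 0.2)  # True: change exceeds 0.2
```

For the example vectors the per-parameter differences are 0.2, 0.1, and 0.1, so the distance is the square root of 0.06, which is above the 0.2 threshold and triggers the push.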
[0089] This invention utilizes a cloud-based method for iterative parameter updates, determining whether to send the updated parameters to the vehicle based on a change threshold. This approach enables efficient model updates and deployment. Parameter transmission and update operations are only triggered when there are significant parameter changes, reducing unnecessary communication and computational overhead. Simultaneously, timely application of optimized parameters to the vehicle maintains the real-time performance and accuracy of the decision-making model, thereby improving the performance of the autonomous driving system in different scenarios. This real-time update method also allows the vehicle to quickly adapt to different driving scenarios and environments, enhancing the robustness and reliability of the autonomous driving system.
[0090] Figure 3 illustrates the principle of a training method for an autonomous driving behavior decision model according to an embodiment of the present invention. As shown in Figure 3, the system acquires environmental perception data, such as object information, road information, navigation information, and vehicle information. It performs one-hot encoding on the environmental perception data to obtain a scene space set composed of one-hot encodings. This scene space set is input into the vehicle's decision-making model, which outputs a decision result. A driver feedback model is used to evaluate the degree of acceptance of the decision result, and an environmental feedback model is used to determine the evaluation result. The degree of acceptance and the evaluation result are combined to obtain a cumulative reward value. The one-hot encodings and the cumulative reward value are uploaded to a cloud-based online learning library for iterative updates to the parameters of the cloud-based decision-making model. The cloud-based online learning library can push the iteratively updated decision-making model parameters back to the vehicle to update the parameters of the vehicle's decision-making model.
[0091] Figure 4 is a flowchart of an autonomous driving behavior decision-making method according to an embodiment of the present invention. This autonomous driving behavior decision-making method can be applied to the vehicle. Specifically, the autonomous driving behavior decision-making method may include the following steps:
[0092] Step 401: Acquire vehicle data and environmental data in real time.
[0093] In embodiments of the present invention, in autonomous driving or assisted driving scenarios, vehicle data and environmental data can be acquired in real time. Vehicle data may include, but is not limited to, vehicle speed, vehicle acceleration, relative distances to other vehicles, buildings, people, animals, plants, etc., and the topological connectivity of the reference path. Environmental data may include, but is not limited to, road information, lighting information, temperature and humidity information, etc.
[0094] Step 402: Input vehicle data and environmental data into the autonomous driving behavior decision model and output the behavior decision results.
[0095] In embodiments of the present invention, the autonomous driving behavior decision model can be an autonomous driving behavior decision model trained according to the steps described in the embodiments above. The autonomous driving behavior decision model outputs the vehicle's behavior decision results. These behavior decision results can include actions such as following, cruise control, left lane change, right lane change, overtaking on the left, overtaking on the right, yielding on the left, and yielding on the right.
[0096] Step 403: Execute the corresponding action according to the behavior decision result.
[0097] In embodiments of the present invention, the behavioral decision result can be any of the above-mentioned actions, and the vehicle can perform the corresponding action according to the behavioral decision result.
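Steps 401 to 403 form a simple acquire, decide, execute loop, sketched below. The three callables stand in for the vehicle's sensing stack, the trained decision model, and the actuation layer; they and every name here are hypothetical stubs, not interfaces defined by the source.

```python
def decision_step(acquire_data, decide, execute):
    """Run one pass of Steps 401-403 and return the chosen behavior."""
    vehicle_data, environment_data = acquire_data()    # Step 401
    behavior = decide(vehicle_data, environment_data)  # Step 402
    execute(behavior)                                  # Step 403
    return behavior

# Stubs for illustration only.
executed_actions = []

def fake_acquire():
    return {"speed_kmh": 60, "gap_m": 80}, {"lane_type": "straight"}

def fake_decide(vehicle_data, environment_data):
    # Placeholder rule standing in for the trained decision model.
    return "follow" if vehicle_data["gap_m"] < 100 else "cruise"

chosen = decision_step(fake_acquire, fake_decide, executed_actions.append)
```

Passing the three stages in as callables keeps the loop independent of any particular sensor or model implementation.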
[0098] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of the present invention are not limited to the described order of actions, because according to the embodiments of the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily essential to the embodiments of the present invention.
[0099] Figure 5 is a structural block diagram of a training system for an autonomous driving behavior decision-making model according to an embodiment of the present invention. This training system for an autonomous driving behavior decision-making model is applied to a vehicle. Specifically, this training system for an autonomous driving behavior decision-making model may include the following modules.
[0100] Data acquisition module 51 is used to acquire environmental perception data of the vehicle in the target scenario;
[0101] The set construction module 52 is used to construct a discrete set of scene spaces based on the environmental perception data;
[0102] Model training module 53 is used to pre-train the decision model using the scene space set and output the decision result;
[0103] The reward calculation module 54 is used to calculate the cumulative reward value of the decision model based on the driver's behavior operation data and / or the vehicle's driving environment data, as well as the decision result;
[0104] The parameter update module 55 is used to iteratively update the parameters of the decision model based on the scene space set and the cumulative reward value.
[0105] In one exemplary embodiment of the present invention, the set construction module 52 includes:
[0106] The encoding conversion module is used to convert continuous variables in the environmental perception data into one-hot codes.
[0107] The encoding combination module is used to combine the one-hot encodings into the scene space set.
[0108] In an exemplary embodiment of the present invention, the model training module 53 includes:
[0109] The behavior and probability output module is used to input the scene space set into the decision model and output multiple decision behaviors and the probability of each decision behavior;
[0110] The decision result determination module is used to select the decision behavior with the highest probability as the decision result.
[0111] In an exemplary embodiment of the present invention, the reward calculation module 54 includes:
[0112] The evaluation and determination module is used to evaluate the degree of acceptance of the decision result based on the behavioral operation data, and / or determine the evaluation result of the decision result based on the driving environment data;
[0113] The cumulative calculation module is used to calculate the cumulative reward value based on the degree of acceptance and / or the evaluation result.
[0114] In one exemplary embodiment of the present invention, the evaluation and determination module includes:
[0115] The driver monitoring module is used to monitor at least one of the driver's facial expressions, postures, and intervention actions;
[0116] A data classification module is used to classify the facial expressions and / or the postures to obtain the degree of approval, and / or to determine the degree of approval based on the intervention actions and the decision result;
[0117] The degree of approval is one of approval, disapproval, and neutral.
[0118] In one exemplary embodiment of the present invention, the data classification module includes:
[0119] An action result comparison module is used to compare the intervention action with the decision result;
[0120] The approval determination module is used to determine the degree of approval as approval when the intervention action is the same as or related to the decision result; to determine the degree of approval as disapproval when the intervention action is contrary to the decision result; and to determine the degree of approval as neutral when the intervention action is unrelated to the decision result.
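The comparison rule above can be illustrated with a small lookup-based sketch. The relation tables and action names are hypothetical: the specification does not define which intervention actions count as "related" or "contrary" to a given decision result.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Approval(Enum):
    APPROVAL = "approval"
    DISAPPROVAL = "disapproval"
    NEUTRAL = "neutral"


# Hypothetical relation tables, keyed by decision result.
RELATED: Dict[str, FrozenSet[str]] = {
    "brake": frozenset({"release_throttle"}),
    "accelerate": frozenset({"press_throttle"}),
    "change_left": frozenset({"steer_left"}),
}
CONTRARY: Dict[str, FrozenSet[str]] = {
    "brake": frozenset({"press_throttle"}),
    "accelerate": frozenset({"brake", "release_throttle"}),
    "change_left": frozenset({"steer_right"}),
}


def classify_intervention(decision: str, intervention: str) -> Approval:
    """Map a driver intervention to approval, disapproval, or neutral."""
    if intervention == decision or intervention in RELATED.get(decision, frozenset()):
        return Approval.APPROVAL      # same as or related to the decision result
    if intervention in CONTRARY.get(decision, frozenset()):
        return Approval.DISAPPROVAL   # contrary to the decision result
    return Approval.NEUTRAL           # unrelated to the decision result
```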
[0121] In one exemplary embodiment of the present invention, the evaluation and determination module includes:
[0122] An environmental feedback module is used to determine the evaluation result based on the vehicle's driving environment data after executing the decision and at least one evaluation parameter;
[0123] The evaluation parameters include: dynamic constraints, kinematic constraints, traffic regulations, collision risk, and fuel consumption.
[0124] In an exemplary embodiment of the present invention, the cumulative calculation module includes:
[0125] The quantity statistics module is used to count, respectively, the number of results whose degree of approval is approval and the number of results whose degree of approval is disapproval.
[0126] The reward value increment module is used to add a positive unit reward value for each result whose degree of approval is approval, and a negative unit reward value for each result whose degree of approval is disapproval; if the degree of approval is neutral, no reward value is added. And / or, if the evaluation result indicates that the driving environment data does not comply with the dynamic constraints or the kinematic constraints, a negative unit reward value is added; if the evaluation result indicates that the driving environment data complies with the dynamic constraints or the kinematic constraints, no reward value is added. And / or, if the evaluation result indicates that the driving environment data violates traffic regulations, a negative unit reward value is added; if the evaluation result indicates that the driving environment data does not violate traffic regulations, no reward value is added. And / or, if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold, a negative unit reward value is added; if the evaluation result indicates that the decrease in collision risk shown by the driving environment data is greater than or equal to a preset second risk threshold, a positive reward value is added; if the increase in collision risk is less than the first risk threshold, or the decrease in collision risk is less than the second risk threshold, no reward value is added. And / or, if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold, a negative reward value is added; if the evaluation result indicates that the decrease in fuel consumption shown by the driving environment data is greater than or equal to a preset second fuel consumption threshold, a positive reward value is added; if the increase in fuel consumption is less than the first fuel consumption threshold, or the decrease in fuel consumption is less than the second fuel consumption threshold, no reward value is added.
[0127] The reward value accumulation module is used to sum the added positive unit reward values and the added negative unit reward values to obtain the cumulative reward value.
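The counting, increment, and accumulation rules of the cumulative calculation module can be sketched as follows. The unit reward of 1.0, the threshold values, the field names, and the convention that a positive change means an increase are all assumptions; the sketch also applies both the driver feedback and the environment feedback, whereas the specification allows either or both.

```python
from dataclasses import dataclass
from typing import Iterable

UNIT = 1.0  # assumed magnitude of one unit reward


@dataclass(frozen=True)
class EnvEvaluation:
    violates_constraints: bool    # dynamic or kinematic constraints not met
    violates_traffic_rules: bool
    collision_risk_change: float  # > 0 means risk increased, < 0 means it decreased
    fuel_change: float            # > 0 means consumption increased


@dataclass(frozen=True)
class Thresholds:  # hypothetical values, assumed positive
    first_risk: float = 0.2
    second_risk: float = 0.2
    first_fuel: float = 0.5
    second_fuel: float = 0.5


def threshold_reward(change: float, first: float, second: float) -> float:
    """Negative unit if the increase reaches `first`, positive if the decrease reaches `second`."""
    if change >= first:
        return -UNIT
    if -change >= second:
        return UNIT
    return 0.0


def cumulative_reward(approvals: Iterable[str], env: EnvEvaluation,
                      th: Thresholds = Thresholds()) -> float:
    labels = list(approvals)
    positive = UNIT * labels.count("approval")      # neutral labels add nothing
    negative = -UNIT * labels.count("disapproval")
    if env.violates_constraints:
        negative -= UNIT
    if env.violates_traffic_rules:
        negative -= UNIT
    for reward in (
        threshold_reward(env.collision_risk_change, th.first_risk, th.second_risk),
        threshold_reward(env.fuel_change, th.first_fuel, th.second_fuel),
    ):
        if reward > 0:
            positive += reward
        else:
            negative += reward
    return positive + negative  # sum of positive and negative increments
```

Keeping the positive and negative totals separate until the final sum mirrors the split between the increment module and the accumulation module.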
[0128] In an exemplary embodiment of the present invention, the parameter update module 55 is used to upload the scene space set and the cumulative reward value to a cloud-based online learning library, so that the cloud-based online learning library adopts a reinforcement learning method to iteratively update the parameters of the decision model based on the principle of maximizing the cumulative reward value.
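As an illustration of a reward-maximizing update on the cloud side, the sketch below uses a single-step REINFORCE (policy gradient) rule on a tabular softmax policy. This is a stand-in chosen for brevity: the specification does not name the reinforcement learning algorithm, and the class name, learning rate, and table layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Scene = Tuple[int, ...]  # one encoded scene from the scene space set


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TabularPolicy:
    """Softmax policy with one row of action preferences per discrete scene."""

    def __init__(self, n_actions: int) -> None:
        self.n_actions = n_actions
        self.preferences: Dict[Scene, List[float]] = {}

    def probs(self, scene: Scene) -> List[float]:
        row = self.preferences.setdefault(scene, [0.0] * self.n_actions)
        return softmax(row)

    def update(self, scene: Scene, action: int, reward: float, lr: float = 0.1) -> None:
        """One REINFORCE step: move preferences along reward * grad log pi(action | scene)."""
        p = self.probs(scene)
        row = self.preferences[scene]
        for a in range(self.n_actions):
            grad_log = (1.0 if a == action else 0.0) - p[a]
            row[a] += lr * reward * grad_log
```

A positive cumulative reward raises the probability of the uploaded decision in that scene, and a negative one lowers it.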
[0129] In one exemplary embodiment of the present invention, the system further includes:
[0130] The local update module is used to, when the change value of the parameters of the decision model after an iterative update by the cloud-based online learning library is greater than or equal to a preset change threshold, receive the iteratively updated parameters of the decision model from the cloud-based online learning library and update the parameters of the decision model locally on the vehicle side.
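The threshold-gated local update can be sketched as below. How the "change value" is measured is not stated in the specification; the Euclidean distance between the vehicle-side parameters and the cloud-side parameters is assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def change_value(old: Sequence[float], new: Sequence[float]) -> float:
    """Assumed metric: Euclidean norm of the parameter difference."""
    if len(old) != len(new):
        raise ValueError("parameter vectors must have the same length")
    return math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))


def maybe_update_local(local: Sequence[float], cloud: Sequence[float],
                       change_threshold: float) -> List[float]:
    """Adopt the cloud parameters only when the change reaches the preset threshold."""
    if change_value(local, cloud) >= change_threshold:
        return list(cloud)
    return list(local)
```

Gating on a threshold avoids pushing negligible parameter changes to the vehicle.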
[0131] Referring to Figure 6, a structural block diagram of an autonomous driving behavior decision-making system according to an embodiment of the present invention is shown. This autonomous driving behavior decision-making system is applied to a vehicle. Specifically, it may include the following modules.
[0132] Real-time data acquisition module 61 is used to acquire vehicle data and environmental data in real time;
[0133] Data input / output module 62 is used to input the vehicle data and the environmental data into the autonomous driving behavior decision model and output the behavior decision results;
[0134] The decision result execution module 63 is used to execute the decision according to the behavior decision result.
[0135] The aforementioned autonomous driving behavior decision model can be an autonomous driving behavior decision model trained according to the steps in the embodiments described above.
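The three modules of the decision-making system form a simple acquire, decide, execute cycle, sketched below. The callables stand in for the real-time data acquisition module, the trained decision model, and the execution module; their signatures and the dictionary-based observation type are assumptions.

```python
from typing import Callable, Dict

Observation = Dict[str, float]  # hypothetical container for vehicle and environmental data


def decision_step(acquire: Callable[[], Observation],
                  model: Callable[[Observation], str],
                  execute: Callable[[str], None]) -> str:
    """Run one cycle: acquire data, query the decision model, execute the result."""
    observation = acquire()        # real-time data acquisition module
    decision = model(observation)  # data input / output module
    execute(decision)              # decision result execution module
    return decision
```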
[0136] As the system embodiments are basically similar to the method embodiments, they are described relatively briefly. For relevant details, please refer to the description of the method embodiments.
[0137] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0138] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, or computer program products. Therefore, embodiments of the present invention can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of the present invention can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0139] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0140] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0141] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0142] Although preferred embodiments of the present invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present invention.
[0143] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0144] The above provides a detailed description of the training method for an autonomous driving behavior decision-making model, an autonomous driving behavior decision-making method, a training system for an autonomous driving behavior decision-making model, and an autonomous driving behavior decision-making system provided by the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for training an autonomous driving behavior decision model, characterized in that the method is applied to the vehicle end and includes: acquiring environmental perception data of the vehicle in the target scenario; constructing a discrete set of scene spaces based on the environmental perception data; pre-training the decision model using the scene space set and outputting a decision result; calculating the cumulative reward value of the decision model based on the driver's behavioral operation data and / or the vehicle's driving environment data, as well as the decision result; and iteratively updating the parameters of the decision model based on the scene space set and the cumulative reward value; wherein the step of calculating the cumulative reward value of the decision model based on the driver's behavioral operation data and / or the vehicle's driving environment data, as well as the decision result, includes: evaluating the degree of approval of the decision result based on the behavioral operation data, and / or determining the evaluation result of the decision result based on the driving environment data; counting, respectively, the number of results whose degree of approval is approval and the number of results whose degree of approval is disapproval; adding a positive unit reward value for each result whose degree of approval is approval, and a negative unit reward value for each result whose degree of approval is disapproval, wherein no reward value is added if the degree of approval is neutral; and / or, if the evaluation result indicates that the driving environment data does not meet the dynamic or kinematic constraints, adding a negative unit reward value, and if the evaluation result indicates that the driving environment data meets the dynamic or kinematic constraints, adding no reward value; and / or, if the evaluation result indicates that the driving environment data violates traffic regulations, adding a negative unit reward value, and if the evaluation result indicates that the driving environment data does not violate traffic regulations, adding no reward value; and / or, if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold, adding a negative reward value; if the evaluation result indicates that the decrease in collision risk shown by the driving environment data is greater than or equal to a preset second risk threshold, adding a positive reward value; and if the increase in collision risk is less than the first risk threshold, or the decrease in collision risk is less than the second risk threshold, adding no reward value; and / or, if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold, adding a negative unit reward value; if the evaluation result indicates that the decrease in fuel consumption shown by the driving environment data is greater than or equal to a preset second fuel consumption threshold, adding a positive unit reward value; and if the increase in fuel consumption is less than the first fuel consumption threshold, or the decrease in fuel consumption is less than the second fuel consumption threshold, adding no reward value; and summing the added positive unit reward values and the added negative unit reward values to obtain the cumulative reward value.
2. The method of claim 1, wherein the step of constructing a discrete set of scene spaces based on the environmental perception data includes: converting the continuous variables in the environmental perception data into one-hot codes; and combining the one-hot codes into the scene space set.
3. The method of claim 1, wherein the step of pre-training the decision model using the scene space set and outputting the decision result includes: inputting the scene space set into the decision model and outputting multiple decision behaviors and the probability of each decision behavior; and taking the decision behavior with the highest probability as the decision result.
4. The method of claim 1, wherein the step of evaluating the degree of approval of the decision result based on the behavioral operation data includes: monitoring at least one of the driver's facial expressions, postures, and intervention actions; and classifying the facial expressions and / or the postures to obtain the degree of approval, and / or determining the degree of approval based on the intervention actions and the decision result; wherein the degree of approval is one of approval, disapproval, and neutral.
5. The method of claim 4, wherein determining the degree of approval based on the intervention action and the decision result includes: comparing the intervention action with the decision result; if the intervention action is the same as or related to the decision result, determining the degree of approval to be approval; if the intervention action is contrary to the decision result, determining the degree of approval to be disapproval; and if the intervention action is unrelated to the decision result, determining the degree of approval to be neutral.
6. The method of claim 4, wherein determining the evaluation result of the decision result based on the driving environment data includes: determining the evaluation result based on the vehicle's driving environment data after the decision is executed and at least one evaluation parameter; wherein the evaluation parameters include: dynamic constraints, kinematic constraints, traffic regulations, collision risk, and fuel consumption.
7. The method of claim 1, wherein iteratively updating the parameters of the decision model based on the scene space set and the cumulative reward value includes: uploading the scene space set and the cumulative reward value to a cloud-based online learning library, so that the cloud-based online learning library uses a reinforcement learning method to iteratively update the parameters of the decision model based on the principle of maximizing the cumulative reward value.
8. The method of claim 7, wherein the method further includes: if the change value of the parameters of the decision model after an iterative update by the cloud-based online learning library is greater than or equal to a preset change threshold, receiving the iteratively updated parameters of the decision model from the cloud-based online learning library, and updating the parameters of the decision model locally on the vehicle side.
9. An autonomous driving behavior decision-making method, characterized in that the method includes: acquiring vehicle data and environmental data in real time; inputting the vehicle data and the environmental data into the autonomous driving behavior decision model trained by the training method of the autonomous driving behavior decision model according to any one of claims 1 to 7, and outputting the behavior decision result; and executing the decision according to the behavior decision result.
10. A training system for an autonomous driving behavior decision-making model, characterized in that the system is applied to the vehicle end and includes: a data acquisition module, used to acquire environmental perception data of the vehicle in the target scenario; a set construction module, used to construct a discrete set of scene spaces based on the environmental perception data; a model training module, used to pre-train the decision model using the scene space set and output the decision result; a reward calculation module, used to calculate the cumulative reward value of the decision model based on the driver's behavioral operation data and / or the vehicle's driving environment data, as well as the decision result; and a parameter update module, used to iteratively update the parameters of the decision model based on the scene space set and the cumulative reward value; wherein the reward calculation module includes: an evaluation and determination module, used to evaluate the degree of approval of the decision result based on the behavioral operation data, and / or determine the evaluation result of the decision result based on the driving environment data; a quantity statistics module, used to count, respectively, the number of results whose degree of approval is approval and the number of results whose degree of approval is disapproval; a reward value increment module, used to add a positive unit reward value for each result whose degree of approval is approval, and a negative unit reward value for each result whose degree of approval is disapproval, wherein no reward value is added if the degree of approval is neutral; and / or, if the evaluation result indicates that the driving environment data does not meet the dynamic or kinematic constraints, to add a negative unit reward value, and if the evaluation result indicates that the driving environment data meets the dynamic or kinematic constraints, to add no reward value; and / or, if the evaluation result indicates that the driving environment data violates traffic regulations, to add a negative unit reward value, and if the evaluation result indicates that the driving environment data does not violate traffic regulations, to add no reward value; and / or, if the evaluation result indicates that the increase in collision risk shown by the driving environment data is greater than or equal to a preset first risk threshold, to add a negative unit reward value; if the evaluation result indicates that the decrease in collision risk shown by the driving environment data is greater than or equal to a preset second risk threshold, to add a positive reward value; and if the increase in collision risk is less than the first risk threshold, or the decrease in collision risk is less than the second risk threshold, to add no reward value; and / or, if the evaluation result indicates that the increase in fuel consumption shown by the driving environment data is greater than or equal to a preset first fuel consumption threshold, to add a negative reward value; if the evaluation result indicates that the decrease in fuel consumption shown by the driving environment data is greater than or equal to a preset second fuel consumption threshold, to add a positive reward value; and if the increase in fuel consumption is less than the first fuel consumption threshold, or the decrease in fuel consumption is less than the second fuel consumption threshold, to add no reward value; and a reward value accumulation module, used to sum the added positive unit reward values and the added negative unit reward values to obtain the cumulative reward value.
11. An autonomous driving behavior decision-making system, characterized in that the system includes: a real-time data acquisition module, used to acquire vehicle data and environmental data in real time; a data input / output module, used to input the vehicle data and the environmental data into the autonomous driving behavior decision model trained by the training method of the autonomous driving behavior decision model according to any one of claims 1 to 8, and output the behavior decision result; and a decision result execution module, used to execute the decision according to the behavior decision result.
12. An electronic device, comprising: one or more processors; and one or more machine-readable media storing instructions thereon which, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1 to 8, and / or the method of claim 9.
13. A computer-readable storage medium, characterized in that a computer program stored thereon, when executed by a processor, causes the processor to perform the method of any one of claims 1 to 8, and / or the method of claim 9.