Generator training method and apparatus, vehicle control method and apparatus, and device and medium
By training the generator with the help of a target decision-maker and using reward and punishment mechanisms to adjust the generator parameters, the generator learns decision-making behavior consistent with the target decision-maker. This solves the problem of poor generator training performance and improves the decision-making accuracy in autonomous driving.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ZHEJIANG GEELY HLDG GRP CO LTD
- Filing Date
- 2025-12-02
- Publication Date
- 2026-07-02
Abstract
Description
Generator training methods, vehicle control methods, devices, equipment and media
[0001] This application claims priority to Chinese Patent Application No. 202411930537.8, filed on December 26, 2024, entitled “Generator Training Method, Vehicle Control Method, Apparatus, Device and Medium”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to, but is not limited to, the field of autonomous driving technology, and in particular to a generator training method, a vehicle control method, an apparatus, a device, and a medium.
Background Technology
[0003] Generative Adversarial Imitation Learning (GAIL) is an effective imitation learning method that learns from data on human-environment interactions. As an application and extension of Generative Adversarial Networks (GANs) to imitation learning, GAIL adopts the core GAN architecture of two networks, a generator and a discriminator, which continuously compete against each other and thereby optimize their own performance. The generator uses existing human demonstration samples to generate imitation samples intended to deceive the discriminator, so that the discriminator cannot distinguish the imitation samples from the actual demonstration samples.
[0004] With the rapid development of autonomous driving technology, GAIL has also been applied to autonomous driving. However, because the generator does not make full use of human knowledge during imitation, it trains poorly and cannot be applied well to autonomous driving.
Summary of the Invention
[0005] The following is an overview of the subject matter described in detail herein, and this overview is not intended to limit the scope of the claims.
[0006] This application provides a generator training method, a vehicle control method, an apparatus, a device, and a medium.
[0007] In a first aspect, embodiments of this application provide a generator training method, the method comprising:
[0008] Multiple first training samples are acquired. The first training samples include first state information of the vehicle and first behavior sequence of the vehicle corresponding to the first state information. The first behavior sequence is obtained according to the output of a pre-acquired target decision-maker. The target decision-maker is used to simulate the decision-making behavior of human experts.
[0009] The initial generator is trained using multiple first training samples to obtain the first target generator.
[0010] In one example, obtaining multiple first training samples includes:
[0011] The process of obtaining each of the first training samples includes:
[0012] Obtain the first state information of the vehicle;
[0013] The first state information of the vehicle is input into the target decision-maker to obtain the first behavior sequence corresponding to the first state information. The target decision-maker is trained based on multiple second training samples. Each second training sample includes the second state information of the vehicle and the second behavior sequence corresponding to the vehicle. The second behavior sequence is obtained by simulating driving based on the second state information of the vehicle in a simulation environment using preset rules.
[0014] The first training sample is constructed based on the first state information and the first behavior sequence corresponding to the first state information.
[0015] In one example, constructing the first training sample based on the first state information and the first behavior sequence corresponding to the first state information includes:
[0016] The first state information is input into the initial generator to obtain the third behavior sequence output by the initial generator;
[0017] If the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is less than a preset threshold, the penalty value of the initial generator is increased, the parameters of the initial generator are adjusted according to the penalty value, and the first training sample is constructed according to the first state information and the first behavior sequence corresponding to the first state information;
[0018] If the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is greater than or equal to the preset threshold, the reward value of the initial generator is increased, and the parameters of the initial generator are adjusted according to the reward value;
[0019] The process of training the initial generator using multiple first training samples to obtain the first target generator includes:
[0020] The first target generator is obtained by training the initial generator with adjusted parameters using multiple first training samples.
[0021] In one example, after training the initial generator with multiple first training samples to obtain a first target generator, the method further includes:
[0022] Obtain third state information for multiple vehicles;
[0023] The third state information of each vehicle is input into the first target generator to obtain the behavior sequence corresponding to each third state information.
[0024] Based on a feedback indication input by the user, samples are constructed from the multiple third state information and the behavior sequence corresponding to each third state information, to obtain multiple third training samples;
[0025] The first target generator is trained using multiple third training samples to obtain the second target generator.
[0026] In one example, constructing samples from the multiple third state information and the behavior sequence corresponding to each third state information based on the feedback indication input by the user, to obtain multiple third training samples, includes:
[0027] Multiple information pairs are displayed, each of which includes a piece of third state information and the behavior sequence corresponding to that third state information;
[0028] The feedback indication input by the user is received, and a plurality of third training samples are determined. Each third training sample includes a piece of third state information from the plurality of information pairs and a fourth behavior sequence corresponding to that third state information, the fourth behavior sequence being determined according to the feedback indication.
[0029] In one example, the fourth behavior sequence is the behavior sequence corresponding to the third state information, or the fourth behavior sequence is obtained by adjusting the behavior sequence corresponding to the third state information through the feedback indication.
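As a rough illustration of how the fourth behavior sequence could be derived, the sketch below treats the feedback indication as a mapping from the index of a displayed information pair to an adjusted behavior sequence, with a missing entry meaning the user accepted the generator's output unchanged. The function name `build_third_samples`, the string state labels, and this encoding of the feedback are assumptions made for illustration only; the application does not specify a data format.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

BehaviorSeq = Tuple[str, ...]


def build_third_samples(
    pairs: Sequence[Tuple[Hashable, Sequence[str]]],
    feedback: Dict[int, Sequence[str]],
) -> List[Tuple[Hashable, BehaviorSeq]]:
    """Pair each third state with its fourth behavior sequence.

    feedback[i] (hypothetical encoding) holds the user's adjusted sequence
    for pair i; if i is absent, the generated sequence is kept as-is.
    """
    samples = []
    for index, (state, generated) in enumerate(pairs):
        if index in feedback:
            fourth = tuple(feedback[index])   # adjusted by the user
        else:
            fourth = tuple(generated)         # accepted unchanged
        samples.append((state, fourth))
    return samples


# Toy information pairs: (third state information, generated behavior sequence).
pairs = [
    ("truck_close_ahead", ("accelerate",)),
    ("free_road", ("accelerate",)),
]
# The user corrects only the first pair.
feedback = {0: ("decelerate", "left_lane_change")}
third_samples = build_third_samples(pairs, feedback)
```

Both branches of paragraph [0029] appear here: the second sample keeps the generated sequence, while the first uses the sequence adjusted through the feedback.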
[0030] In a second aspect, embodiments of this application provide a vehicle control method, including:
[0031] Obtain the vehicle's current status information;
[0032] The current state information is input into the generator to obtain the target behavior sequence;
[0033] The vehicle is controlled according to the target behavior sequence, and the generator is obtained based on the generator training method described in the first aspect.
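The three steps of the vehicle control method can be pictured as a single control step. The following is a minimal sketch under stated assumptions: `read_state`, `generator`, and `apply_behavior` are hypothetical callables standing in for the vehicle's sensing stack, the trained generator, and the actuation layer, none of which are specified in this form by the application.

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, float]


def control_step(
    read_state: Callable[[], State],
    generator: Callable[[State], Sequence[str]],
    apply_behavior: Callable[[str], None],
) -> Tuple[str, ...]:
    """Run one control step: sense, query the generator, act."""
    state = read_state()                       # current state information
    target_sequence = tuple(generator(state))  # target behavior sequence
    for behavior in target_sequence:
        apply_behavior(behavior)               # control the vehicle
    return target_sequence


def toy_generator(state: State) -> List[str]:
    # Hypothetical stand-in for a trained generator.
    return ["decelerate"] if state["gap_m"] < 20.0 else ["accelerate"]


issued: List[str] = []
result = control_step(lambda: {"gap_m": 12.0}, toy_generator, issued.append)
```

Passing the generator in as a callable keeps the control step independent of how the generator was trained, which matches the way the second aspect refers back to the first.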
[0034] In a third aspect, embodiments of this application provide a generator training apparatus, the apparatus comprising:
[0035] The first acquisition module is configured to acquire multiple first training samples. The first training samples include first state information of the vehicle and a first behavior sequence of the vehicle corresponding to the first state information. The first behavior sequence is obtained according to the output of a pre-acquired target decision-maker. The target decision-maker is configured to simulate the decision-making behavior of human experts.
[0036] The training module is configured to train the initial generator using multiple first training samples to obtain the first target generator.
[0037] In a fourth aspect, embodiments of this application provide a vehicle control device, the device comprising:
[0038] The second acquisition module is configured to acquire the vehicle's current status information;
[0039] The processing module is configured to input the current state information into the generator to obtain the target behavior sequence;
[0040] The control module is configured to control the vehicle according to the target behavior sequence, and the generator is obtained based on the generator training method described in the first aspect.
[0041] In a fifth aspect, embodiments of this application provide an electronic device, the device including: a processor and a memory storing computer program instructions;
[0042] When the processor executes computer program instructions, it implements either the generator training method as described in the first aspect or the vehicle control method as described in the second aspect.
[0043] In a sixth aspect, embodiments of this application provide a computer storage medium storing computer program instructions, which, when executed by a processor, implement the generator training method as described in the first aspect or the vehicle control method as described in the second aspect.
[0044] In a seventh aspect, embodiments of this application provide a computer program product in which instructions, when executed by a processor of an electronic device, cause the electronic device to perform a generator training method as described in the first aspect or a vehicle control method as described in the second aspect.
[0045] In an eighth aspect, embodiments of this application provide a vehicle, including: a processor and a memory storing computer program instructions;
[0046] When the processor executes the computer program instructions, it implements the vehicle control method as described in the second aspect.
[0047] In a ninth aspect, embodiments of this application provide a chip, the chip including a memory and a processor, the memory storing code and data, the memory being coupled to the processor, the processor running a program in the memory causing the chip to be configured to perform a generator training method as described in the first aspect or a vehicle control method as described in the second aspect.
[0048] In a tenth aspect, embodiments of this application provide a computer program that, when executed by a processor, is configured to perform either the generator training method described in the first aspect or the vehicle control method described in the second aspect.
[0049] The generator training method, vehicle control method, apparatus, device, and medium provided in this application embodiment employ a target decision-maker to assist in generator training. By using the output of the target decision-maker as a training sample, the generator can learn decision-making behaviors that are essentially the same as those of the target decision-maker, thus training the generator effectively, accelerating the learning process, and improving the accuracy of decision-making.
[0050] The above is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
Attached Figure Description
[0051] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0052] Figure 1 is a flowchart illustrating an embodiment of the generator training method provided in this application;
[0053] Figure 2 is another flowchart illustrating an embodiment of the generator training method provided in this application;
[0054] Figure 3 is a schematic diagram of the generator, decision-maker and discriminator provided in this application;
[0055] Figure 4 is a flowchart illustrating an embodiment of the vehicle control method provided in this application;
[0056] Figure 5 is a schematic diagram of an embodiment of the generator training device provided in this application;
[0057] Figure 6 is a structural schematic diagram of an embodiment of the vehicle control device provided in this application;
[0058] Figure 7 is a schematic diagram of an embodiment of the electronic device provided in this application.
Detailed Implementation
[0059] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.
[0060] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.
[0061] The vehicles can be private cars, such as sedans, sport utility vehicles (SUVs), multi-purpose vehicles (MPVs), or pickup trucks. Vehicles can also be commercial vehicles, such as vans, buses, small trucks, or large semi-trailers. Vehicles can be either gasoline-powered or new energy vehicles. When a vehicle is a new energy vehicle, it can be a hybrid or a pure electric vehicle.
[0062] In one alternative approach, GAIL is an imitation learning method that can achieve good results. It learns from data on human-environment interactions. Its core architecture is the same as that of generative adversarial networks, consisting of two networks: a generator and a discriminator. These two networks continuously compete against each other and optimize themselves.
[0063] With the rapid development of autonomous driving technology, GAIL has also been applied to autonomous driving. However, because the generator does not make full use of human knowledge during imitation, it trains poorly and cannot be applied well to autonomous driving.
[0064] Based on this, embodiments of this application provide a generator training method, a vehicle control method, an apparatus, a device, and a medium. By employing a target decision-maker to assist in the training of the generator, and using the output of the target decision-maker as a training sample, the generator can learn decision-making behaviors that are essentially the same as those of the target decision-maker. This can effectively train the generator, while also accelerating the learning process and improving the accuracy of decision-making.
[0065] This application provides a generator training method, apparatus, device, computer storage medium, and computer program product. The generator training method provided in this application will be described first.
[0066] Figure 1 shows a schematic flowchart of an embodiment of the generator training method provided in this application. The generator training method of this application can be applied to electronic devices.
[0067] As shown in Figure 1, the generator training method provided in this embodiment includes the following steps S101 to S102, wherein:
[0068] S101. Obtain multiple first training samples. The first training samples include the first state information of the vehicle and the first behavior sequence of the vehicle corresponding to the first state information. The first behavior sequence is obtained according to the output of the pre-acquired target decision-maker. The target decision-maker is used to simulate the decision-making behavior of human experts.
[0069] In this embodiment, multiple first training samples are acquired. The first training samples include the vehicle's first state information, which includes the vehicle's own speed, position, and direction, the speed, position, and direction of surrounding vehicles, and the position and direction of obstacles. The first training samples also include the vehicle's first behavior sequence corresponding to the first state information, which includes acceleration, deceleration, left lane change, right lane change, and overtaking. The first behavior sequence is obtained based on the output of the pre-acquired target decision-maker, and the output of the target decision-maker is used as a sample for training.
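To make the contents of a first training sample concrete, the following sketch models the state information and behavior sequence listed above as plain data classes. The class and field names, the units, and the representation of a behavior sequence as a tuple of enum members are all assumptions for illustration; the application lists the quantities but does not prescribe a data layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Behavior(Enum):
    ACCELERATE = "accelerate"
    DECELERATE = "decelerate"
    LEFT_LANE_CHANGE = "left_lane_change"
    RIGHT_LANE_CHANGE = "right_lane_change"
    OVERTAKE = "overtake"


@dataclass(frozen=True)
class Kinematics:
    speed: float                   # m/s (unit is an assumption)
    position: Tuple[float, float]  # (x, y) in metres (assumption)
    direction: float               # heading in radians (assumption)


@dataclass(frozen=True)
class Obstacle:
    position: Tuple[float, float]
    direction: float


@dataclass(frozen=True)
class StateInfo:
    ego: Kinematics                       # the vehicle itself
    surrounding: Tuple[Kinematics, ...]   # surrounding vehicles
    obstacles: Tuple[Obstacle, ...]


@dataclass(frozen=True)
class TrainingSample:
    state: StateInfo
    behaviors: Tuple[Behavior, ...]       # behavior sequence for this state


sample = TrainingSample(
    state=StateInfo(
        ego=Kinematics(20.0, (0.0, 0.0), 0.0),
        surrounding=(Kinematics(18.0, (30.0, 0.0), 0.0),),
        obstacles=(),
    ),
    behaviors=(Behavior.DECELERATE, Behavior.LEFT_LANE_CHANGE),
)
```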
[0070] The first state information in the first training sample can be obtained from the Next Generation Simulation (NGSIM) dataset, which contains naturalistic vehicle trajectories recorded from high-rise buildings, giving a bird's-eye view of vehicles on highways and urban roads.
[0071] The pre-acquired target decision-maker is a human-experience decision-maker, that is, a tool that uses human knowledge and experience to assist the generator in learning and making decisions. The target decision-maker is used to simulate the decision-making behavior of human experts: when the first state information is input into it, it analyzes and judges this data, makes a decision, and outputs the corresponding behavior sequence.
[0072] S102. Train the initial generator using multiple first training samples to obtain the first target generator.
[0073] In this embodiment, multiple first training samples are used to train the initial generator. During the training process, if the first state information input to the generator and the target decision-maker is consistent, and the outputs of the generator and the target decision-maker are the same, indicating that the decisions are consistent, the generator is given a positive reward; if the outputs of the generator and the target decision-maker are different, indicating that the decisions are inconsistent, the generator is given a negative reward, until the initial generator converges and the first target generator is obtained.
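The reward scheme described above can be illustrated with a deliberately small example. The sketch below is not the generator of this application: it replaces the neural generator with a hypothetical score table (`TabularGenerator`), uses a single behavior per state instead of a sequence, and fixes the reward at +1 or -1. It only shows how a positive reward on agreement and a negative reward on disagreement drive the generator toward the target decision-maker's output until it stops changing.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

ACTIONS = ["accelerate", "decelerate", "left_lane_change",
           "right_lane_change", "overtake"]


class TabularGenerator:
    """Toy generator: one score per (state, action); acts greedily."""

    def __init__(self, actions: Sequence[str]) -> None:
        self.actions = list(actions)
        self.scores: Dict[Hashable, List[float]] = {}

    def act(self, state: Hashable) -> str:
        row = self.scores.setdefault(state, [0.0] * len(self.actions))
        return self.actions[row.index(max(row))]  # first maximum wins ties

    def update(self, state: Hashable, action: str, reward: float) -> None:
        row = self.scores.setdefault(state, [0.0] * len(self.actions))
        row[self.actions.index(action)] += reward


def train(generator: TabularGenerator,
          samples: Sequence[Tuple[Hashable, str]],
          max_epochs: int = 50) -> int:
    """Return the epoch at which the generator matched every sample."""
    for epoch in range(1, max_epochs + 1):
        mismatches = 0
        for state, expert_action in samples:
            action = generator.act(state)
            if action == expert_action:
                generator.update(state, action, +1.0)  # positive reward
            else:
                generator.update(state, action, -1.0)  # negative reward
                mismatches += 1
        if mismatches == 0:                            # treated as converged
            return epoch
    return max_epochs


first_samples = [("gap_small", "decelerate"), ("gap_large", "accelerate")]
generator = TabularGenerator(ACTIONS)
epochs_used = train(generator, first_samples)
```

In a full GAIL setup the agreement check would come from a discriminator and the update would be a policy-gradient step; the table is used here only to keep the example runnable.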
[0074] In this embodiment, multiple first training samples are obtained. These training samples include the first state information of the vehicle and the first behavior sequence corresponding to the first state information. The first behavior sequence is obtained based on the output of the target decision-maker. The target decision-maker is used to simulate the decision-making behavior of human experts. The training of the generator is assisted by the target decision-maker. By using the output of the target decision-maker as the training sample, the generator can learn to make decisions that are basically the same as those of the target decision-maker. This can train the generator better, accelerate the learning process, and improve the accuracy of decision-making.
[0075] Figure 2 illustrates another flowchart of an embodiment of the generator training method provided in this application. The generator training method of this application can be applied to electronic devices.
[0076] As shown in Figure 2, the generator training method provided in this application embodiment includes the following steps S201 to S205, wherein:
[0077] S201. Obtain multiple first training samples. The first training samples include the first state information of the vehicle and the first behavior sequence of the vehicle corresponding to the first state information. The first behavior sequence is obtained according to the output of the pre-acquired target decision-maker. The target decision-maker is used to simulate the decision-making behavior of human experts.
[0078] In this embodiment, multiple first training samples are acquired. The first training samples include the vehicle's first state information, which includes the vehicle's own speed, position, and direction, the speed, position, and direction of surrounding vehicles, and the position and direction of obstacles. The first training samples also include the vehicle's first behavior sequence corresponding to the first state information, which includes acceleration, deceleration, left lane change, right lane change, and overtaking. The first behavior sequence is obtained based on the output of the pre-acquired target decision-maker, and the output of the target decision-maker is used as a sample for training.
[0079] The first state information in the first training sample can be obtained from the NGSIM dataset, which contains naturalistic vehicle trajectories recorded from high-rise buildings, giving a bird's-eye view of vehicles on highways and urban roads.
[0080] The pre-acquired target decision-maker is a human-experience decision-maker, that is, a tool that uses human knowledge and experience to assist the generator in learning and making decisions. The target decision-maker is used to simulate the decision-making behavior of human experts: when the first state information is input into it, it analyzes and judges this data, makes a decision, and outputs the corresponding behavior sequence.
[0081] In one example, step S201 includes:
[0082] The process of obtaining each first training sample includes:
[0083] Obtain the first state information of the vehicle; input the first state information of the vehicle into the target decision-maker to obtain the first behavior sequence corresponding to the first state information. The target decision-maker is trained based on multiple second training samples. Each second training sample includes the second state information of the vehicle and the second behavior sequence corresponding to the vehicle. The second behavior sequence is obtained by simulating driving based on the second state information of the vehicle in a simulation environment using preset rules; construct the first training sample based on the first state information and the first behavior sequence corresponding to the first state information.
[0084] In this embodiment, the acquisition process of each first training sample is specifically as follows: acquiring the first state information of the vehicle, inputting the first state information into the target decision-maker, and the target decision-maker outputting the first behavior sequence corresponding to the first state information.
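The acquisition process above amounts to labelling each state with the target decision-maker's output. A minimal sketch follows, assuming the decision-maker can be called like a function; `build_first_samples`, the dictionary-based state, and the two-second gap rule inside `toy_decision_maker` are hypothetical stand-ins, not details from this application.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

State = Dict[str, float]
Sample = Tuple[State, Tuple[str, ...]]


def build_first_samples(
    states: Iterable[State],
    decision_maker: Callable[[State], Sequence[str]],
) -> List[Sample]:
    """Label every first state with the decision-maker's behavior sequence."""
    samples: List[Sample] = []
    for state in states:
        first_sequence = tuple(decision_maker(state))
        samples.append((state, first_sequence))
    return samples


def toy_decision_maker(state: State) -> List[str]:
    # Hypothetical rule: keep at least a two-second gap to the lead vehicle.
    if state["gap_m"] < 2.0 * state["speed_mps"]:
        return ["decelerate"]
    return ["accelerate"]


first_samples = build_first_samples(
    [{"speed_mps": 20.0, "gap_m": 15.0}, {"speed_mps": 20.0, "gap_m": 80.0}],
    toy_decision_maker,
)
```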
[0085] Specifically, the target decision-maker is obtained based on multiple second training samples. Each second training sample includes the vehicle's second state information and its corresponding second behavior sequence. The second state information is obtained from a pre-acquired dataset, specifically the NGSIM dataset, which includes multiple second state information items: the vehicle's own speed, position, and direction; the speed, position, and direction of surrounding vehicles; and the position and direction of obstacles. In the simulation environment, simulated driving is performed based on the vehicle's second state information using preset rules.
[0086] The preset rules are expert rules, which are pre-selected driving rules set by experts, covering multiple aspects such as basic driving principles, driving skills and habits, safe driving awareness, and emergency response. Alternatively, in a simulation environment, a preset driving model, such as a car-following model, is used. The vehicle's second state information is input into the preset driving model to obtain a second behavior sequence. Existing driving rules set by human experts or corresponding driving models are used to simulate the actions to be taken in each scenario in the dataset, thereby obtaining the corresponding second behavior sequence. The vehicle's second state information and the second behavior sequence constitute the second training sample. The decision-maker is trained using the second training sample obtained in the above manner, enabling the decision-maker to output human experience strategies, i.e., behavior sequences, based on the vehicle's state information.
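As one possible reading of "a preset driving model, such as a car-following model", the sketch below labels states with the Intelligent Driver Model (IDM), a widely used car-following model. The choice of IDM, its parameter values, the reduction of the state to (speed, lead speed, gap), and the mapping from the sign of the acceleration to a one-element behavior sequence are all assumptions for illustration; the application does not name a specific model.

```python
import math
from typing import List, Tuple


def idm_acceleration(
    speed: float,          # ego speed, m/s
    lead_speed: float,     # speed of the vehicle ahead, m/s
    gap: float,            # bumper-to-bumper gap, m (must be > 0)
    desired_speed: float = 30.0,
    time_headway: float = 1.5,
    max_accel: float = 1.0,
    comfort_decel: float = 2.0,
    min_gap: float = 2.0,
    exponent: float = 4.0,
) -> float:
    """IDM acceleration; default parameter values are illustrative only."""
    closing_speed = speed - lead_speed
    desired_gap = min_gap + max(
        0.0,
        speed * time_headway
        + speed * closing_speed / (2.0 * math.sqrt(max_accel * comfort_decel)),
    )
    return max_accel * (
        1.0 - (speed / desired_speed) ** exponent - (desired_gap / gap) ** 2
    )


def label_behavior(speed: float, lead_speed: float, gap: float) -> str:
    """Map the model's acceleration to a coarse behavior label."""
    accel = idm_acceleration(speed, lead_speed, gap)
    return "accelerate" if accel > 0.0 else "decelerate"


# Second training samples: (second state information, second behavior sequence).
second_states: List[Tuple[float, float, float]] = [
    (20.0, 20.0, 100.0),   # same speed as the leader, large gap
    (20.0, 10.0, 20.0),    # closing fast on a slower leader, small gap
]
second_samples = [
    (state, (label_behavior(*state),)) for state in second_states
]
```

A decision-maker trained on samples of this kind reproduces the rule or model's behavior, which is the sense in which it encodes human experience in the paragraph above.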
[0087] In one example, constructing the first training sample based on the first state information and the first behavior sequence corresponding to the first state information includes:
[0088] The first state information is input into the initial generator to obtain the third behavior sequence output by the initial generator. If the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is less than a preset threshold, the penalty value of the initial generator is increased, and the parameters of the initial generator are adjusted according to the penalty value. The first training sample is constructed based on the first state information and the first behavior sequence corresponding to the first state information. If the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is greater than or equal to the preset threshold, the reward value of the initial generator is increased, and the parameters of the initial generator are adjusted according to the reward value.
[0089] The initial generator is trained using multiple first training samples to obtain the first target generator, including:
[0090] The initial generator, after parameter adjustment, is trained using multiple first training samples to obtain the first target generator.
[0091] In the above embodiments, the generator is given a positive or negative reward based on similarity, and the similarity is determined by a discriminator. For example, referring to Figure 3, the left side of Figure 3 represents the NGSIM dataset, in which each piece of state information s_t has its own corresponding timestamp. The first state information of the vehicle is obtained from the dataset, and each piece of first state information s_t has its corresponding timestamp; whether two pieces of first state information are identical is determined by their respective timestamps. The first state information s_e is input into the target decision-maker, which outputs the first behavior sequence a_e. The same first state information s_e is input into the initial generator, which outputs the third behavior sequence a_g. The target decision-maker sends (s_e, a_e) to the discriminator, the initial generator sends (s_e, a_g) to the discriminator, and the discriminator calculates the similarity based on (s_e, a_e) and (s_e, a_g).
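The application does not state how the discriminator computes the similarity. In GAIL the discriminator is a learned classifier over state-behavior pairs; the sketch below swaps in a fixed, hand-written measure (the fraction of positions at which the two behavior sequences agree) purely so that the later threshold comparison has something concrete to operate on. The function name and the measure itself are assumptions, not the discriminator of this application.

```python
from typing import Sequence


def sequence_similarity(expert_seq: Sequence[str],
                        generated_seq: Sequence[str]) -> float:
    """Position-wise agreement in [0, 1] between two behavior sequences.

    Stand-in for the discriminator's similarity; both sequences are assumed
    to have been produced from the same state information.
    """
    longest = max(len(expert_seq), len(generated_seq))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    matches = sum(1 for e, g in zip(expert_seq, generated_seq) if e == g)
    return matches / longest


a_e = ("decelerate", "left_lane_change", "accelerate")   # decision-maker
a_g = ("decelerate", "right_lane_change", "accelerate")  # initial generator
score = sequence_similarity(a_e, a_g)
```

Dividing by the longer length means a generated sequence that is too short or too long is also scored as less similar.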
[0092] Specifically, if the similarity between the third behavior sequence and the first behavior sequence is less than the preset threshold, a_g and a_e are different decision-making behaviors, which indicates that the initial generator has not learned well. A negative reward is therefore given to the generator, that is, the penalty value r1(πθ) of the initial generator is increased, and the parameters of the initial generator are adjusted according to the penalty value. Specifically, a first value r(πθ) is obtained from the penalty value r1(πθ) and the loss value loss, and the parameters of the initial generator are adjusted according to the first value. The first training sample is then constructed from the first state information and the first behavior sequence corresponding to the first state information.
[0093] If the similarity between the third behavior sequence and the first behavior sequence is greater than or equal to the preset threshold, a_g and a_e are the same decision-making behavior. To encourage the initial generator to keep learning, a positive reward is given to the generator, that is, the reward value r1(πθ) of the initial generator is increased, and the parameters of the initial generator are adjusted according to the reward value r1(πθ) and the loss value loss. Specifically, a first value r(πθ) is obtained from the reward value r1(πθ) and the loss value loss, and the parameters of the initial generator are adjusted according to the first value. The parameter-adjusted initial generator is then trained with the multiple first training samples until it converges, that is, until the discriminator has difficulty distinguishing the behavior sequences of the initial generator from those of the target decision-maker, which yields the first target generator.
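The two threshold branches can be summarized in a few lines. In this sketch r1 is modelled as a single signed number that goes down on a penalty and up on a reward; that representation, the step size, and the name `update_reward` are assumptions. How r1(πθ) and the loss value are combined into the first value r(πθ) is not specified in the application and is deliberately left out.

```python
def update_reward(r1: float, similarity: float, threshold: float,
                  step: float = 1.0) -> float:
    """Return the updated signed reward r1 after one comparison.

    similarity < threshold  -> penalty (r1 decreases by `step`)
    similarity >= threshold -> reward  (r1 increases by `step`)
    """
    if similarity < threshold:
        return r1 - step
    return r1 + step


r1 = 0.0
for similarity in (0.4, 0.9, 0.8):          # three comparisons in a row
    r1 = update_reward(r1, similarity, threshold=0.8)
```

Note that a similarity exactly equal to the threshold falls in the reward branch, matching the "greater than or equal to" wording above.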
[0094] Furthermore, referring to Figure 3, the target decision-maker and the initial generator can interact. If the similarity between the third behavior sequence and the first behavior sequence is less than the preset threshold, the target decision-maker takes the first behavior sequence as the target sequence and sends the first behavior sequence πH to the initial generator, so that the third behavior sequence is replaced with the first behavior sequence and the initial generator learns better decision-making behavior. Conversely, if the initial generator has learned better decision-making behavior, the behavior πθ of the initial generator can be sent to the target decision-maker, enabling the target decision-maker to learn better decision-making behavior, so that behavior sequences closer to the target sequence of the target decision-maker are generated more quickly. Specifically, experienced users evaluate the third behavior sequence of the initial generator. If they consider the third behavior sequence to be good, they mark it and provide it to the target decision-maker for learning. This enables interaction and sharing between the target decision-maker and the initial generator, allowing them to learn better decision-making behavior from each other.
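The two-way exchange in this paragraph can be sketched as a function that decides what each side receives. The function name `exchange`, the `marked_good` flag standing for the experienced user's mark, and the use of `None` for "nothing is shared" are assumptions for illustration; the application describes the exchange only in prose.

```python
from typing import Optional, Sequence, Tuple

BehaviorSeq = Tuple[str, ...]


def exchange(
    first_seq: Sequence[str],    # from the target decision-maker
    third_seq: Sequence[str],    # from the initial generator
    similarity: float,
    threshold: float,
    marked_good: bool = False,   # experienced user marked third_seq as good
) -> Tuple[BehaviorSeq, Optional[BehaviorSeq]]:
    """Return (sequence the generator learns from, sequence shared back)."""
    if similarity < threshold:
        to_generator = tuple(first_seq)   # replace third with first
    else:
        to_generator = tuple(third_seq)   # generator's own output stands
    to_decision_maker = tuple(third_seq) if marked_good else None
    return to_generator, to_decision_maker


corrected = exchange(("decelerate",), ("accelerate",), 0.0, 0.8)
shared = exchange(("decelerate",), ("overtake",), 0.9, 0.8, marked_good=True)
```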
[0095] S202. Train the initial generator using multiple first training samples to obtain the first target generator.
[0096] In this embodiment, multiple first training samples are used to train the initial generator. During the training process, if the first state information input to the generator and the target decision-maker is consistent, and the outputs of the generator and the target decision-maker are the same, indicating that the decisions are consistent, the generator is given a positive reward; if the outputs of the generator and the target decision-maker are different, indicating that the decisions are inconsistent, the generator is given a negative reward, until the initial generator converges and the first target generator is obtained.
[0097] S203. Obtain the third state information of multiple vehicles.
[0098] In this embodiment, a first target generator is obtained through the first stage of learning. The first target generator can output better decision-making behavior. In order to make the first target generator more applicable to actual traffic scenarios, learn better driving strategies, output better decision-making behavior, and at the same time conform to human values, a second stage of learning can be carried out.
[0099] In real-world traffic scenarios, humans exhibit different behaviors when driving different vehicles. For example, under the same traffic conditions and rules, on highways or urban roads, humans driving small cars tend to travel at relatively higher speeds and maintain shorter following distances. However, when driving large vehicles, they tend to travel at relatively lower speeds and maintain longer following distances. Decision-making behaviors in the same environment will differ due to their varying values, requiring alignment based on human values. Therefore, a second phase of learning can be implemented.
[0100] In this embodiment, the third state information of multiple vehicles is obtained for the first target generator to learn. The third state information includes the vehicle's own speed, position, and direction, the speed, position, and direction of surrounding vehicles, and the position and direction of obstacles. The third state information can also be obtained from the above dataset.
[0101] S204. Input the third state information of each vehicle into the first target generator to obtain the behavior sequence corresponding to each third state information.
[0102] In this embodiment, the third state information of each vehicle is input to the first target generator, which outputs the behavior sequence corresponding to each third state information, displays the third state information and corresponding behavior sequence of each vehicle, and allows the user to provide feedback based on the displayed third state information and corresponding behavior sequence of the vehicle.
[0103] S205. Based on the feedback instructions from user input, construct samples from multiple third-state information and the corresponding behavior sequences of each third-state information to obtain multiple third training samples.
[0104] In this embodiment, based on the feedback instructions from user input, samples are constructed from multiple third state information and the corresponding behavior sequences of each third state information to obtain multiple third training samples.
[0105] In one example, step S205 includes:
[0106] Display multiple information pairs, each information pair including a third state information and a corresponding behavior sequence; receive feedback instructions from user input, determine multiple third training samples, each third training sample including a third state information from one of the multiple information pairs and a corresponding fourth behavior sequence, the fourth behavior sequence being determined according to the feedback instructions.
[0107] In this embodiment, multiple information pairs are displayed. Each information pair includes a third state information and a corresponding behavior sequence. The user can view each third state information and its corresponding behavior sequence to determine whether to accept or reject the behavior sequence. If the user agrees with the behavior corresponding to the behavior sequence in the traffic scenario corresponding to the third state information, the user can click "Accept Behavior Sequence." If the user disagrees with the behavior corresponding to the behavior sequence in the traffic scenario corresponding to the third state information, the user can click "Reject Behavior Sequence" and provide the corresponding behavior, i.e., the behavior sequence. The system receives user feedback, such as acceptance or rejection, and determines multiple third training samples. Each third training sample includes the third state information from multiple information pairs and a fourth behavior sequence corresponding to the third state information. The fourth behavior sequence is determined based on the user's feedback.
[0108] Optionally, the fourth behavior sequence is the behavior sequence corresponding to the third state information, or the fourth behavior sequence is obtained by adjusting the behavior sequence corresponding to the third state information through feedback instructions.
[0109] In this embodiment, if the user agrees with the behavior sequence output by the first target generator, the user accepts the behavior sequence, and the fourth behavior sequence is consistent with the behavior sequence corresponding to the third state information. If the user does not agree with the behavior sequence output by the first target generator, the user rejects the behavior sequence and provides a new behavior. In that case, the fourth behavior sequence is a new behavior sequence obtained by adjusting the behavior sequence corresponding to the third state information through feedback instructions. Through user feedback, multiple third state information and fourth behavior sequences are collected to form multiple third training samples for the first target generator to perform the second stage of learning.
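The construction of third training samples from accept/reject feedback can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `Feedback` and `build_third_samples`, the tuple encoding of state information, and the choice to skip a rejection that comes without a replacement sequence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

State = Tuple[float, ...]   # hypothetical flat encoding of third state information
Behaviors = List[str]       # e.g. ["accelerate", "left_lane_change"]


@dataclass
class Feedback:
    """Hypothetical container for one user feedback instruction."""
    accepted: bool
    corrected: Optional[Behaviors] = None  # supplied by the user on rejection


def build_third_samples(
    pairs: Sequence[Tuple[State, Behaviors]],
    feedbacks: Sequence[Feedback],
) -> List[Tuple[State, Behaviors]]:
    """Pair each third state information with its fourth behavior sequence."""
    samples: List[Tuple[State, Behaviors]] = []
    for (state, proposed), fb in zip(pairs, feedbacks):
        if fb.accepted:
            # Accepted: the fourth sequence equals the generator's output.
            fourth = list(proposed)
        elif fb.corrected is not None:
            # Rejected: the fourth sequence is the user-provided adjustment.
            fourth = list(fb.corrected)
        else:
            # Assumption: a rejection without a replacement yields no sample.
            continue
        samples.append((state, fourth))
    return samples
```

Under these assumptions, an accepted pair is kept unchanged, while a rejected pair contributes the user's corrected sequence in place of the generator's output.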
[0110] S206. The first target generator is trained using multiple third training samples to obtain the second target generator.
[0111] In this embodiment, the first target generator is trained using third training sample data to learn human behavior strategies, thereby obtaining the second target generator. This completes the learning of driving strategies based on human values and is applicable to real-world traffic scenarios.
[0112] In this embodiment of the application, the generator is enabled to provide human-like strategies and achieve the same behavior as humans through aligned reinforcement learning, which can be better used in real traffic scenarios.
[0113] Figure 4 shows a flowchart of an embodiment of the vehicle control method provided in this application. The vehicle control method of this application can be applied to vehicles. As shown in Figure 4, the vehicle control method provided in this embodiment includes the following steps S401 to S403, wherein:
[0114] S401. Obtain the current status information of the vehicle.
[0115] In this embodiment, the current status information of the vehicle is acquired. This current status information includes the vehicle's current speed, current position, and current direction; the current speed, current position, and current direction of surrounding vehicles; and the current position and current direction of obstacles. This status information can be obtained through sensors installed on the vehicle body.
[0116] S402. Input the current state information into the generator to obtain the target behavior sequence.
[0117] In this embodiment, the current state information of the vehicle is input into the generator to obtain the target behavior sequence. The generator is obtained through the generator training method described above and can be used for vehicle driving to achieve automatic control of the vehicle.
[0118] S403. Control the vehicle according to the target behavior sequence. The generator is obtained based on the generator training method described above.
[0119] In this embodiment, the vehicle is controlled according to the target behavior sequence, and the generator obtained through the above learning is applied to autonomous driving, which can be applied to real traffic scenarios and better control the vehicle's driving.
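Steps S401 to S403 can be sketched as a single control cycle. The following Python fragment is only an illustration under stated assumptions: `control_step`, `toy_generator`, the dictionary keys `speed` and `gap`, and the 20-unit gap rule are hypothetical stand-ins, and the trained generator is replaced by a trivial rule so that the example is self-contained.

```python
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical encoding of the current state information


def toy_generator(state: State) -> List[str]:
    """Hypothetical stand-in for the trained generator (not a learned model)."""
    return ["decelerate"] if state["gap"] < 20.0 else ["accelerate"]


def control_step(
    read_state: Callable[[], State],
    generator: Callable[[State], List[str]],
    apply_behavior: Callable[[str], None],
) -> List[str]:
    """Run one control cycle and return the target behavior sequence."""
    state = read_state()              # S401: obtain current state information
    behaviors = generator(state)      # S402: generator outputs the target behavior sequence
    for behavior in behaviors:        # S403: control the vehicle according to the sequence
        apply_behavior(behavior)
    return behaviors


if __name__ == "__main__":
    executed: List[str] = []
    control_step(lambda: {"speed": 30.0, "gap": 10.0}, toy_generator, executed.append)
    print(executed)
```

Passing the sensor reader and the actuator as callables keeps the sketch independent of any particular vehicle interface, which the source does not specify.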
[0120] The generator training method provided in the embodiments of this application will be illustrated below.
[0121] GAIL, a reinforcement learning method that combines Generative Adversarial Networks (GANs) and Imitation Learning, consists of a discriminator and a generator. The core idea of GAIL is to learn a policy that can mimic expert behavior through an adversarial training process involving a generator and a discriminator. In GAIL, the generator aims to learn a policy whose generated behavior is indistinguishable from expert behavior; the discriminator aims to distinguish the generator's policy behavior from the expert's real behavior. Through this adversarial training, the generator gradually learns a policy that can produce behavior similar to that of an expert.
[0122] This application proposes a novel generative adversarial imitation learning framework. Based on GAIL, it adds a human experience-based decision-maker (i.e., the target decision-maker mentioned above) and employs a two-stage behavioral policy learning method for the generator. The first stage is an imitation learning stage based on human experience, and the second stage is a reinforcement learning stage. The first stage utilizes methods such as human experience, imitation learning, and control sharing to complete the human policy learning. The second stage utilizes a value alignment method for intelligent agents based on feedback reinforcement learning.
[0123] Specifically, the learning content for the first phase is as follows:
[0124] 1) Initialize the human experience decision-maker (i.e., the decision-maker mentioned above) and the discriminator (also called the judge). The input of the human experience decision-maker is the vehicle's state information, including: the vehicle's own speed, position, and direction, the speed, position, and direction of surrounding vehicles, and the position and direction of obstacles. The output of the human experience decision-maker is a sequence of behaviors, including acceleration, deceleration, left lane change, right lane change, and overtaking.
[0125] The generator is initialized. In this embodiment, the generator is an intelligent agent that controls the intelligent vehicle through self-learning. The input is the vehicle's state information, and the output is a sequence of behaviors. The generator adopts a deep neural network-like structure.
[0126] The discriminator employs a deep neural network-like structure to determine the similarity between the human experience decision-maker and the generator. The discrimination method involves comparing the similarity between the behavioral sequences output by the human experience decision-maker and the behavioral sequences output by the generator. Alternatively, it can be determined by comparing the action probabilities output by the human experience decision-maker and the action probabilities output by the generator.
[0127] Referring to Figure 3, the human expert policy corresponding to the human experience decision-maker is πH, and the policy learned by the generator is πθ. The human experience decision-maker guides the generator's learning by judging the quality of the samples generated by the generator. For example, when the similarity is low, the human expert policy πH is used to replace the generator's policy for the generator to learn. Conversely, when the generator's learned policy πθ performs better, the generator's policies are collected as learning samples for the human experience decision-maker. In this way, the interaction between πH and πθ is completed. Whether the generator's learned policy πθ is effective can be determined by having users with driving experience evaluate the generator's policy.
[0128] 2) Train the human experience decision-maker using target behavior samples to obtain the target human experience decision-maker (i.e., the target decision-maker mentioned above). Specifically, the target behavior samples are obtained in the following way: In the simulation environment, based on the NGSIM dataset, which includes the vehicle's state information (i.e., the second state information mentioned above), simulate driving in the simulation environment using expert rules to obtain the vehicle's behavior sequence (i.e., the second behavior sequence mentioned above). Collect the vehicle's state information (i.e., the second state information mentioned above) and the corresponding behavior sequence (i.e., the second behavior sequence mentioned above) to obtain the target behavior sample (i.e., the second training sample mentioned above).
[0129] Alternatively, kinematic formulas (i.e., the driving model mentioned above) can be used to re-simulate what actions should be taken in each scenario of the dataset. Specifically, in the simulation environment, based on the NGSIM dataset, the driving model is used to obtain the vehicle's behavior sequence (i.e., the second behavior sequence mentioned above), collect the vehicle's state information (i.e., the second state information mentioned above) and the corresponding behavior sequence (i.e., the second behavior sequence mentioned above), and obtain the target behavior sample (i.e., the second training sample mentioned above).
[0130] The human experience decision-maker is trained on the target behavior samples using behavior cloning. The finally learned human experience decision-maker can output a human experience strategy (i.e., the first behavior sequence mentioned above) based on environmental state information (i.e., the first state information mentioned above, which is input into the target decision-maker), thereby obtaining the target human experience decision-maker.
[0131] 3) The generator learns policies by interacting with the environment and combining them with the target human experience decision-maker. The generator interacts with the environment in the autonomous driving simulator, and uses the environmental state perceived by the intelligent vehicle and its own state during the interaction to provide behavioral decision actions as optimization samples, which are stored in the experience pool to obtain the policy training experience pool.
[0132] 4) During training, the generator outputs a behavior sequence based on environmental state information. This behavior sequence is then input into the discriminator. The discriminator, based on the similarity between the generator's and the target human experience decision-maker's strategies under the same conditions, and the principle of control sharing, uses the target human experience decision-maker's strategy as the guiding strategy to replace the generator's strategy (i.e., replacing the third behavior sequence with the first behavior sequence mentioned above, enabling the initial generator to learn better decision-making behavior). Simultaneously, if the generator and the target human experience decision-maker make consistent decisions under the same conditions, the generator's learning reward is increased (i.e., if the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is greater than or equal to a preset threshold, the initial generator's reward value is increased). The generator's parameters are adjusted using the replaced strategy experience pool and the discriminator's similarity (i.e., adjusting the initial generator's parameters based on the reward value / adjusting the initial generator's parameters based on the penalty value mentioned above), allowing human experience to guide the generator in generating behavior sequences that more closely resemble the target behavior more quickly during the intelligent vehicle's learning process.
[0133] 5) Repeat steps 3 and 4 until the generator and discriminator converge. This means that the behavior generated by the generator becomes increasingly similar to the target behavior, while the discriminator becomes increasingly unable to distinguish between the generator's behavior and the target behavior. Through the above steps, the generative adversarial imitation learning training process is applied in an autonomous driving simulation environment, enabling the generator to learn driving skills by observing and imitating human behavior.
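The reward, penalty, and control-sharing logic of step 4 can be sketched in Python as follows. This is a minimal illustration, not the method as claimed: the position-wise similarity measure, the threshold of 0.8, and the signal values +1.0 and -1.0 are assumptions, since the source fixes neither the similarity metric nor the magnitudes.

```python
from typing import List, Sequence, Tuple


def sequence_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions at which two behavior sequences agree (assumed metric)."""
    if not a and not b:
        return 1.0
    longest = max(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def guided_step(
    generated: Sequence[str],
    expert: Sequence[str],
    threshold: float = 0.8,  # assumed value of the preset threshold
) -> Tuple[float, float, List[str]]:
    """Return (similarity, learning signal, sequence stored in the experience pool)."""
    sim = sequence_similarity(generated, expert)
    if sim >= threshold:
        # Consistent decision: positive reward, keep the generator's own sequence.
        return sim, 1.0, list(generated)
    # Inconsistent decision: penalty, and control sharing replaces the
    # generator's sequence with the target decision-maker's sequence.
    return sim, -1.0, list(expert)
```

In this sketch the returned sequence is what would be written to the experience pool, so a low-similarity output is never stored as a positive example for the generator.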
[0134] In the above process, through the first stage of learning, a first target generator is obtained. The first target generator can output better decision-making behavior. In order to make the first target generator more applicable to actual traffic scenarios, learn better driving strategies, output better decision-making behavior, and at the same time conform to human values, a second stage of learning can be carried out.
[0135] Specifically, the learning content for the second phase is as follows:
[0136] The goal is to learn driving strategies superior to humans while adhering to human values. Therefore, a second stage of learning can be implemented: behavior learning based on human values. In traffic scenarios, humans exhibit different behaviors when driving different types of vehicles. For example, under the same traffic conditions and rules, in the same highway or urban road environment, humans driving small cars tend to drive faster and maintain shorter following distances. However, when driving large vehicles, humans tend to drive slower and maintain longer following distances. Similarly, when driving special vehicles such as police cars and ambulances, human values such as driving speed and maintaining following distance are more prominent. Generally speaking, agents in the same environment (i.e., the first target generator mentioned above) will exhibit different behaviors due to their different values, requiring alignment based on human values. To address this common problem, the implementation example proposes using reinforcement learning methods to learn human values.
[0137] The intelligent agent model $D_R$ (i.e., the first target generator mentioned above) is represented as $D_R = (\pi_R, A_R)$, indicating that the intelligent agent generates the strategy $\pi_R$ under the value $A_R$. The human model $D_H$ is defined as $D_H = (\pi_H, A_H)$, representing that humans generate the strategy $\pi_H$ under the value $A_H$. Humans may have different perspectives on agent models, reflecting underlying differences in value between humans and agents. A value alignment problem arises when the agent's policy differs from the human's. Currently, resolving the value alignment problem primarily involves humans ranking the data generated by the agent to create a reward model to guide learning. While humans may not be able to accurately define their goals, they can identify states that meet human expectations. In this case, this information is included in the human policy $\pi_H$. In other words, $\pi_H$ is generated when humans do something based on their own values. Therefore, intelligent agents should learn how to determine their own value function, formulate human-like strategies, and achieve the same behaviors as humans. Different value drivers will produce different strategies, and conversely, the differences in strategies reflect the degree of value mismatch.
[0138] The agent, based on its own model $D_R$, learns through interaction with the environment and, under the value $A_R$, generates the policy trajectory $\tau_R = \{(s_1, a_1), (s_2, a_2), \ldots, (s_n, a_n)\}$. The policy trajectory generated under human values is denoted $\tau_H$. Based on the two trajectories, $A_R$ is adjusted. The strategy is the state behavior generated under the influence of the value function $A_R$, and its overall value is a combination of various value dimensions. For example, the overall value is a complete value system composed of time value, security value, and other values, namely: $A_R = \theta_1 m_1 + \theta_2 m_2 + \cdots + \theta_n m_n$. The aim is to find reasonable $\theta_i$ such that the strategies generated under $A_R$ are similar to those generated under human values. The difference lies in whether humans consider the current strategy to align with their own values, i.e., the difference between $\tau_R$ and $\tau_H$, which is specifically manifested in whether the agent's policy trajectory $\tau_R$ is accepted by humans. Human samples act as a supervisory signal, monitoring and adjusting the agent's value function. The degree to which humans accept machine value is reflected in their acceptance of actions under the same conditions. Let $P$ denote the probability that the machine value parameter $\theta$ is accepted by human values; the agent adjusts $\theta$ so that $A_R$ is accepted by $A_H$ as much as possible.
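As a worked illustration of the weighted combination of value dimensions, the following Python sketch computes the overall value as a sum of weights times value dimensions and estimates acceptance as the fraction of agent actions that humans accepted. The function names are hypothetical, and using the empirical acceptance rate as an estimate of the acceptance probability is an assumption, since the source does not give a formula for it.

```python
from typing import Sequence


def overall_value(theta: Sequence[float], m: Sequence[float]) -> float:
    """Overall value: theta_1*m_1 + theta_2*m_2 + ... + theta_n*m_n."""
    if len(theta) != len(m):
        raise ValueError("theta and m must have the same length")
    return sum(t * x for t, x in zip(theta, m))


def acceptance_rate(accepted_flags: Sequence[bool]) -> float:
    """Assumed estimate of acceptance: share of agent actions accepted by humans."""
    if not accepted_flags:
        return 0.0
    return sum(1 for flag in accepted_flags if flag) / len(accepted_flags)


if __name__ == "__main__":
    # Two hypothetical value dimensions, e.g. time value and security value.
    print(overall_value([0.5, 0.5], [2.0, 4.0]))          # 3.0
    print(acceptance_rate([True, False, True, True]))     # 0.75
```

A search over the weights could then prefer settings with a higher acceptance rate, although how the weights are actually updated is not specified in the source.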
[0139] The second phase of the value alignment process is as follows:
[0140] 1) The driving strategy learned in the first stage is used to drive in the simulation environment. The strategy will make a series of driving decisions based on the environment and act on the environment. The environment (i.e. the third state information mentioned above) is represented by s. Each environment will have an action a (i.e., inputting the third state information into the first target generator to obtain the behavior sequence corresponding to each third state information).
[0141] 2) The intelligent agent (i.e., the first target generator mentioned above) shows the human the strategy it is about to take, that is, the action a to be taken in state s, allowing the human to accept or reject the action. Alternatively, the agent directly executes the action a and the human takes over and interrupts the action. In either case, human feedback is obtained (i.e., the feedback instructions from user input mentioned above).
[0142] 3) By iterating through 1) and 2), many (s,a) state-action pairs can be collected, which are called trajectory sequences in reinforcement learning (i.e., as mentioned above, samples are constructed from multiple third state information and the corresponding behavior sequences based on the feedback instructions from user input, to obtain multiple third training samples).
[0143] 4) The strategy is updated again using the collected (s,a) state-action pairs to complete the learning of driving strategies aligned with human values (i.e., the second target generator is obtained by training the first target generator with multiple third training samples as mentioned above).
[0144] This application provides GAIL with rich prior human data knowledge, embedding human behavioral data knowledge into the learning process, aiming to solve the bottleneck problem of GAIL generators' random imitation behavior. It discovers that agents in different environments will exhibit different value-driven behaviors due to their different attributes. Through aligned reinforcement learning, the agent system learns how to determine its own value function, provides human-like strategies, and achieves behaviors similar to humans.
[0145] The generator training method provided in this application can be executed by a generator training device. This application uses the example of a generator training device executing the generator training method to illustrate the generator training device provided in this application.
[0146] Figure 5 shows a schematic diagram of the generator training device provided in an embodiment of this application. As shown in Figure 5, the generator training device 50 of this application includes: a first acquisition module 501 and a training module 502.
[0147] The first acquisition module 501 is used to acquire multiple first training samples. The first training samples include the first state information of the vehicle and the first behavior sequence of the vehicle corresponding to the first state information. The first behavior sequence is obtained according to the output of the pre-acquired target decision-maker. The target decision-maker is used to simulate the decision-making behavior of human experts.
[0148] The training module 502 is used to train the initial generator using multiple first training samples to obtain the first target generator.
[0149] In one example, the first acquisition module 501 is further configured to acquire the first state information of the vehicle. The training module 502 is further configured to input the first state information of the vehicle into the target decision-maker to obtain the first behavior sequence corresponding to the first state information. The target decision-maker is trained based on multiple second training samples. Each second training sample includes the second state information of the vehicle and the second behavior sequence corresponding to the vehicle. The second behavior sequence is obtained by simulating driving based on the second state information of the vehicle in a simulation environment using preset rules. The first training sample is constructed based on the first state information and the first behavior sequence corresponding to the first state information.
[0150] In one example, the training module 502 is further configured to input the first state information into the initial generator to obtain the third behavior sequence output by the initial generator; if the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is less than a preset threshold, the penalty value of the initial generator is increased, the parameters of the initial generator are adjusted according to the penalty value, and a first training sample is constructed based on the first state information and the first behavior sequence corresponding to the first state information; if the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is greater than or equal to the preset threshold, the reward value of the initial generator is increased, and the parameters of the initial generator are adjusted according to the reward value; the initial generator with adjusted parameters is trained using multiple first training samples to obtain a first target generator.
[0151] In one example, the first acquisition module 501 is further configured to acquire third state information of multiple vehicles. The training module 502 is further configured to input the third state information of each vehicle into the first target generator to obtain the behavior sequence corresponding to each third state information; based on the feedback indication of user input, to construct samples from the multiple third state information and the behavior sequence corresponding to each third state information to obtain multiple third training samples; and to train the first target generator using the multiple third training samples to obtain a second target generator.
[0152] In one example, the generator training apparatus also includes an output module and an input module.
[0153] The output module displays multiple information pairs, each pair including a third state information and a corresponding behavior sequence. The input module receives user input feedback. The training module 502 further determines multiple third training samples, each including a third state information from one of the multiple information pairs and a corresponding fourth behavior sequence, the fourth behavior sequence being determined based on the feedback instruction.
[0154] The generator training device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0155] The vehicle control method provided in this application can be executed by a vehicle control device. This application uses the example of a vehicle control device executing the vehicle control method to illustrate the vehicle control device provided in this application.
[0156] Figure 6 shows a schematic diagram of the vehicle control device provided in an embodiment of this application. As shown in Figure 6, the vehicle control device 60 of this application includes: a second acquisition module 601, a processing module 602, and a control module 603.
[0157] The second acquisition module 601 is used to acquire the current status information of the vehicle.
[0158] The processing module 602 is used to input the current state information into the generator to obtain the target behavior sequence.
[0159] The control module 603 is used to control the vehicle according to the target behavior sequence, and the generator is obtained based on the generator training method described above.
[0160] The vehicle control device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0161] Figure 7 shows a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.
[0162] The electronic device may include a processor 701 and a memory 702 storing computer program instructions.
[0163] Specifically, the processor 701 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.
[0164] Memory 702 may include mass storage for data or instructions. By way of example and not limitation, memory 702 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 702 may include removable or non-removable (or fixed) media. Where appropriate, memory 702 may be internal or external to the electronic device. In a particular embodiment, memory 702 is non-volatile solid-state memory.
[0165] In some embodiments, memory 702 may include read-only memory (ROM), random access memory (RAM), disk storage media device, optical storage media device, flash memory device, electrical, optical, or other physical / tangible memory storage device. Thus, generally, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to one aspect of this disclosure.
[0166] The processor 701 implements any of the generator training methods described in the above embodiments by reading and executing computer program instructions stored in the memory 702.
[0167] In one example, the electronic device may also include a communication interface 707 and a bus 710. As shown in Figure 7, the processor 701, memory 702, and communication interface 707 are connected via the bus 710 and communicate with each other.
[0168] The communication interface 707 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.
[0169] Bus 710 includes hardware, software, or both, that couples components of the electronic device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or a combination of two or more of these. Where appropriate, bus 710 may include one or more buses. Although specific buses are described and illustrated in the embodiments of this application, this application contemplates any suitable bus or interconnect.
[0170] The electronic device can execute the generator training method in the embodiments of this application, thereby implementing the generator training method and apparatus described in conjunction with Figures 1 and 5. Alternatively, it can execute the vehicle control method in the embodiments of this application, thereby implementing the vehicle control method and apparatus described in conjunction with Figures 4 and 6.
[0171] Furthermore, in conjunction with the generator training method or vehicle control method in the above embodiments, the embodiments of this application may provide a computer storage medium. The computer storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement the generator training method or vehicle control method in the above embodiments.
[0172] In conjunction with the generator training method or vehicle control method in the above embodiments, the embodiments of this application may provide a computer program product. The instructions of the computer program product, when executed by the processor of an electronic device, cause the electronic device to implement the generator training method or vehicle control method in the above embodiments.
[0173] This application provides a chip that includes a memory and a processor. The memory stores code and data and is coupled to the processor. The processor runs a program in the memory so that the chip executes the generator training method or the vehicle control method described in the above embodiments.
[0174] This application provides a computer program that, when executed by a processor, is configured to perform the generator training method or the vehicle control method described in the above embodiments.
[0175] This application provides a vehicle, including: a processor and a memory storing computer program instructions;
[0176] When the processor executes the computer program instructions, it implements the vehicle control method described above.
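By way of non-limiting illustration only, the vehicle control flow that the processor implements (obtain the current state information, input it into the generator to obtain a target behavior sequence, then control the vehicle according to that sequence) might be sketched as follows. The names `VehicleState`, `toy_generator`, `control_vehicle`, and `actuate`, the two state fields, and the behavior labels are hypothetical and are not taken from this application; the rule-based `toy_generator` merely stands in for a trained generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VehicleState:
    """Hypothetical state information; the fields are assumptions."""
    speed_mps: float
    gap_to_lead_m: float


def toy_generator(state: VehicleState) -> List[str]:
    """Hypothetical stand-in for a trained generator.

    Returns a behavior sequence for the given state using a simple
    two-second-gap rule chosen only for illustration.
    """
    if state.gap_to_lead_m < 2.0 * state.speed_mps:
        return ["decelerate", "keep_lane"]
    return ["keep_speed", "keep_lane"]


def control_vehicle(
    state: VehicleState,
    generator: Callable[[VehicleState], List[str]],
    actuate: Callable[[str], None],
) -> List[str]:
    """Input the current state into the generator, then issue each
    behavior of the resulting target behavior sequence in order."""
    target_sequence = generator(state)
    for behavior in target_sequence:
        actuate(behavior)
    return target_sequence


# Example: record the issued behaviors instead of driving real actuators.
issued: List[str] = []
result = control_vehicle(VehicleState(20.0, 15.0), toy_generator, issued.append)
```

Passing the actuator as a callable keeps the sketch independent of any particular vehicle interface, which this application does not specify.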
[0177] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0178] The functional blocks shown in the above-described block diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, compact disc read-only memories (CD-ROMs), optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet or intranets.
[0179] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.
[0180] The aspects of this application have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that these instructions, executable via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can also be implemented by dedicated hardware performing the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.
[0181] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, modules, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed in this application, and all such modifications or substitutions shall fall within the protection scope of this application.
Claims
1. A generator training method, comprising: acquiring multiple first training samples, wherein each first training sample includes first state information of a vehicle and a first behavior sequence of the vehicle corresponding to the first state information, the first behavior sequence is obtained according to the output of a pre-acquired target decision-maker, and the target decision-maker is set to simulate the decision-making behavior of human experts; and training an initial generator using the multiple first training samples to obtain a first target generator.
2. The method according to claim 1, wherein acquiring the multiple first training samples includes, for each of the first training samples: obtaining the first state information of the vehicle; inputting the first state information of the vehicle into the target decision-maker to obtain the first behavior sequence corresponding to the first state information, wherein the target decision-maker is trained based on multiple second training samples, each second training sample includes second state information of the vehicle and a corresponding second behavior sequence of the vehicle, and the second behavior sequence is obtained by simulating driving in a simulation environment, using preset rules, based on the second state information of the vehicle; and constructing the first training sample based on the first state information and the first behavior sequence corresponding to the first state information.
3. The method according to claim 2, wherein constructing the first training sample based on the first state information and the first behavior sequence corresponding to the first state information includes: inputting the first state information into the initial generator to obtain a third behavior sequence output by the initial generator; if the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is less than a preset threshold, increasing a penalty value of the initial generator, adjusting the parameters of the initial generator according to the penalty value, and constructing the first training sample according to the first state information and the first behavior sequence corresponding to the first state information; and if the similarity between the third behavior sequence and the first behavior sequence corresponding to the first state information is greater than or equal to the preset threshold, increasing a reward value of the initial generator and adjusting the parameters of the initial generator according to the reward value; and wherein training the initial generator using the multiple first training samples to obtain the first target generator includes: training the initial generator with adjusted parameters using the multiple first training samples to obtain the first target generator.
4. The method according to any one of claims 1 to 3, wherein, after training the initial generator using the multiple first training samples to obtain the first target generator, the method further includes: obtaining third state information of multiple vehicles; inputting the third state information of each vehicle into the first target generator to obtain a behavior sequence corresponding to each piece of third state information; constructing samples, based on a feedback indication input by a user, from the multiple pieces of third state information and the behavior sequence corresponding to each piece of third state information, to obtain multiple third training samples; and training the first target generator using the multiple third training samples to obtain a second target generator.
5. The method according to claim 4, wherein constructing samples, based on the feedback indication input by the user, from the multiple pieces of third state information and the behavior sequence corresponding to each piece of third state information, to obtain the multiple third training samples includes: displaying multiple information pairs, each of which includes one piece of third state information and the behavior sequence corresponding to that third state information; and receiving the feedback indication input by the user and determining the multiple third training samples, wherein each third training sample includes one piece of third state information from the multiple information pairs and a fourth behavior sequence corresponding to that third state information, the fourth behavior sequence being determined according to the feedback indication.
6. The method according to claim 5, wherein the fourth behavior sequence is the behavior sequence corresponding to the third state information, or the fourth behavior sequence is obtained by adjusting the behavior sequence corresponding to the third state information according to the feedback indication.
7. A vehicle control method, comprising: obtaining current state information of a vehicle; inputting the current state information into a generator to obtain a target behavior sequence, wherein the generator is obtained based on the generator training method described in any one of claims 1 to 6; and controlling the vehicle according to the target behavior sequence.
8. A generator training apparatus, comprising: a first acquisition module configured to acquire multiple first training samples, wherein each first training sample includes first state information of a vehicle and a first behavior sequence of the vehicle corresponding to the first state information, the first behavior sequence is obtained according to the output of a pre-acquired target decision-maker, and the target decision-maker is configured to simulate the decision-making behavior of human experts; and a training module configured to train an initial generator using the multiple first training samples to obtain a first target generator.
9. A vehicle control device, comprising: a second acquisition module configured to acquire current state information of a vehicle; a processing module configured to input the current state information into a generator to obtain a target behavior sequence, wherein the generator is obtained based on the generator training method described in any one of claims 1 to 6; and a control module configured to control the vehicle according to the target behavior sequence.
10. An electronic device, comprising: a processor and a memory storing computer program instructions; wherein, when the processor executes the computer program instructions, it implements the generator training method as described in any one of claims 1 to 6, or the vehicle control method as described in claim 7.
11. A computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the generator training method as described in any one of claims 1 to 6, or the vehicle control method as described in claim 7.
12. A computer program product, wherein instructions in the computer program product, when executed by a processor of an electronic device, cause the electronic device to perform a generator training method as described in any one of claims 1 to 6, or a vehicle control method as described in claim 7.
13. A vehicle, comprising: a processor and a memory storing computer program instructions; wherein, when the processor executes the computer program instructions, it implements the vehicle control method as described in claim 7.
14. A chip comprising a memory and a processor, the memory storing code and data, the memory being coupled to the processor, the processor executing a program in the memory such that the chip is configured to perform the generator training method according to any one of claims 1 to 6, or the vehicle control method according to claim 7.
15. A computer program, when executed by a processor, configured to perform the generator training method according to any one of claims 1 to 6, or the vehicle control method according to claim 7.
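By way of non-limiting illustration only, the reward and penalty logic recited in claim 3 might be sketched as follows. The similarity measure (fraction of matching positions), the `Feedback` accumulator, the fixed `step`, and all function and variable names are assumptions made for this sketch and are not defined in this application. The parameter update itself is omitted because the application does not specify an update rule; the sketch only accumulates the reward and penalty values and, following the literal wording of claim 3, constructs a first training sample in the below-threshold branch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]
Behaviors = List[str]


def similarity(a: Behaviors, b: Behaviors) -> float:
    """Assumed metric: fraction of positions at which two sequences agree."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


@dataclass
class Feedback:
    """Accumulated reward and penalty values for the initial generator."""
    reward: float = 0.0
    penalty: float = 0.0


def build_first_samples(
    states: List[State],
    decision_maker: Callable[[State], Behaviors],
    generator: Callable[[State], Behaviors],
    threshold: float,
    feedback: Feedback,
    step: float = 1.0,
) -> List[Tuple[State, Behaviors]]:
    """Compare generator output with the target decision-maker's output
    and collect (state, first behavior sequence) training samples."""
    samples: List[Tuple[State, Behaviors]] = []
    for state in states:
        first_seq = decision_maker(state)  # first behavior sequence
        third_seq = generator(state)       # third behavior sequence
        if similarity(third_seq, first_seq) < threshold:
            feedback.penalty += step       # a parameter update would use this
            samples.append((state, first_seq))
        else:
            feedback.reward += step        # a parameter update would use this
    return samples


# Hypothetical stand-ins for the target decision-maker and initial generator.
def toy_decision_maker(state: State) -> Behaviors:
    if state[0] > 0.5:
        return ["keep_lane", "decelerate"]
    return ["keep_lane", "keep_speed"]


def toy_initial_generator(state: State) -> Behaviors:
    return ["keep_lane", "keep_speed"]


fb = Feedback()
first_samples = build_first_samples(
    [(0.0,), (1.0,)], toy_decision_maker, toy_initial_generator, 0.8, fb
)
```

In this toy run the generator agrees with the decision-maker on the first state and on only half of the sequence for the second, so one reward and one penalty are recorded and a single training sample is collected.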