An end-to-end autonomous driving method and system based on meta-reinforcement learning
By using a meta-reinforcement learning approach and employing the Reptile algorithm to train a VWG feature extraction model and an MPPO decision control model, the problems of gradient vanishing, long training time, and poor generalization performance in end-to-end autonomous driving systems are solved. This approach achieves rapid adaptation and high-quality feature extraction, thereby improving the stability and efficiency of autonomous driving systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2023-04-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing end-to-end autonomous driving systems suffer from vanishing and exploding gradient problems in their feature extraction models, resulting in poor feature extraction quality and slow convergence speed. They also experience excessively long training times when facing new environments, and the agent cannot effectively utilize previous knowledge to learn quickly when encountering new driving tasks, leading to poor generalization performance.
A meta-reinforcement learning-based approach is adopted, using the Reptile algorithm to train the VWG feature extraction model and the MPPO decision control model. Meta-learning is performed by constructing multiple datasets, and the feature extraction model is optimized by combining the variational autoencoder (VAE) and the Wasserstein generative adversarial network (WGAN-GP). When faced with new tasks, the driving strategy can be quickly learned by leveraging previous experience.
This system enables autonomous driving systems with fast model training speed, high generalization performance, and good feature extraction quality. It can quickly adapt to and optimize driving strategies when faced with new driving scenarios, reducing training time and improving the model's convergence speed and generalization ability.
Figure CN116469080B_ABST
Description
Technical Field
[0001] This invention relates to the field of autonomous driving decision control and image feature extraction technology, and in particular to an end-to-end autonomous driving method and system based on meta-reinforcement learning.
Background Technology
[0002] End-to-end autonomous driving systems based on deep reinforcement learning mainly consist of two parts: feature extraction and decision control. The feature information extracted by the feature extraction model serves as the input to the decision control model and is crucial for the agent to understand environmental information.
[0003] Traditional feature extraction models suffer from problems such as vanishing and exploding gradients, low feature extraction quality, and slow convergence when facing new tasks. Furthermore, reinforcement learning-based decision-making models require long training times when facing new tasks. Therefore, developing autonomous driving systems that can handle new tasks, based on emerging technologies, has enormous research potential and application value. In an end-to-end autonomous driving system, onboard cameras first acquire video images of the driving scene. The acquired RGB images are then used as input to a feature extraction model, which performs feature dimensionality reduction to extract high-quality features that help the agent understand the environment. Finally, the agent outputs the corresponding decision and control actions according to its own policy. Autonomous driving systems integrate many cutting-edge technologies such as deep learning, reinforcement learning, and meta-learning, making them a research hotspot in the field of artificial intelligence. They drive innovation and transformation in travel services, easing traffic congestion and enhancing traffic safety and convenience.
[0004] Currently, research on end-to-end autonomous driving systems has yielded some results. However, this research task requires simultaneous achievement of requirements such as speed, accuracy, and generalization, leading to room for improvement in most research methods. The main areas for improvement include:
[0005] (1) The feature extraction model requires manual parameter tuning. If the parameter tuning is not done properly, gradient vanishing and gradient explosion will occur, resulting in poor feature extraction quality and slow convergence speed.
[0006] (2) When faced with a new environment, the feature extraction model needs to be trained from scratch, which takes too long;
[0007] (3) When faced with new driving tasks, the agent cannot effectively use previously learned knowledge to learn quickly, and it exhibits poor generalization performance. These problems pose challenges to end-to-end autonomous driving systems. Therefore, in-depth research is needed on feature extraction models and decision control models to improve accuracy while also achieving a certain degree of speed and generalization.
Summary of the Invention
[0008] The purpose of this invention is to overcome the shortcomings of the existing technology and provide an end-to-end autonomous driving method and system based on meta-reinforcement learning.
[0009] The objective of this invention can be achieved through the following technical solutions:
[0010] An end-to-end autonomous driving method based on meta-reinforcement learning includes the following steps:
[0011] S1. Construct a dataset and use the meta-learning algorithm Reptile to train the VWG feature extraction model to obtain the MVWG feature extraction model.
[0012] S2. Construct a decision control model and train the decision control model using the meta-learning algorithm Reptile to obtain the trained MPPO decision model;
[0013] S3. Initialize the autonomous driving system using the trained MVWG feature extraction model and MPPO decision model. While the vehicle agent carries out the driving task in a new driving scenario, collect RGB images of the driving environment in real time.
[0014] S4. Input the RGB image of the driving environment into the encoder in the MVWG model and encode it to extract the feature information of the RGB image of the driving environment.
[0015] S5. After obtaining the image feature information output by the encoder in step S4, the vehicle intelligent agent combines its current operating information and outputs corresponding decision control actions according to the initialized MPPO strategy. The decision control actions are then fed back to the driving environment to further optimize the driving strategy and obtain a stable autonomous driving system.
[0016] Further, in step S1, constructing the dataset includes the following steps:
[0017] The car is driven manually in a simulated environment on the simulation platform using the keyboard;
[0018] Different driving scenarios are constructed by randomly setting weather values; the randomly set weather values include randomly setting different solar altitude, solar angle, cloud cover, rainfall, or wind speed, etc.
[0019] Multiple datasets were constructed by collecting environmental images from different driving scenarios; each dataset includes multiple RGB images.
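The scenario construction described above can be sketched as follows. This is a minimal illustration only: the `WeatherSetting` container, the field names, the value ranges, and the reading of "solar angle" as an azimuth are all assumptions, and no simulator API is called.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WeatherSetting:
    """Hypothetical container for one randomly drawn driving scenario."""
    sun_altitude_deg: float
    sun_azimuth_deg: float
    cloudiness_pct: float
    precipitation_pct: float
    wind_intensity_pct: float


def sample_weather(rng: random.Random) -> WeatherSetting:
    # All ranges below are assumptions chosen for illustration.
    return WeatherSetting(
        sun_altitude_deg=rng.uniform(-90.0, 90.0),
        sun_azimuth_deg=rng.uniform(0.0, 360.0),
        cloudiness_pct=rng.uniform(0.0, 100.0),
        precipitation_pct=rng.uniform(0.0, 100.0),
        wind_intensity_pct=rng.uniform(0.0, 100.0),
    )


def build_scenarios(num_datasets: int, seed: int = 0) -> list:
    """Draw one weather setting per dataset, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [sample_weather(rng) for _ in range(num_datasets)]


scenarios = build_scenarios(20)
```

Each returned setting would correspond to one dataset of RGB images collected under that weather; fixing the seed makes the set of scenarios reproducible.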
[0020] Furthermore, in step S1, the VWG feature extraction model includes an encoder, a decoder, and a discriminator;
[0021] The encoder is used to encode the input real image, that is, to reduce the dimensionality of the features, and output n-dimensional feature information.
[0022] The decoder is used to decode the input feature information, that is, to reconstruct the image;
[0023] The discriminator is used to distinguish between the authenticity of real images and reconstructed images, and outputs the probability that the discriminator classifies a reconstructed image or a real image as a real image.
[0024] Further, in step S1, the VWG feature extraction model is trained using the meta-learning algorithm Reptile. The meta-learning includes a meta-training phase and a meta-testing phase. The meta-training phase includes the following steps:
[0025] Multiple datasets are randomly selected from all datasets as training sets;
[0026] The RGB images in each training set are input into the encoder in the VWG feature extraction model. The encoder encodes the real images, which is called feature dimensionality reduction, to obtain n-dimensional feature information.
[0027] The extracted n-dimensional feature information is input into the decoder of the VWG feature extraction model for decoding, i.e., image reconstruction;
[0028] Both real and reconstructed images are input into the discriminator of the VWG feature extraction model. The discriminator distinguishes between real and reconstructed images and outputs the probability that the discriminator classifies a reconstructed image or a real image as a real image.
[0029] Minimize the loss function values of the encoder, decoder and discriminator to optimize the feature extraction model. After training on each training set is completed, a set of model parameters is obtained and the model enters the meta-testing stage.
[0030] The meta-testing phase includes the following steps:
[0031] One or more datasets are randomly selected from all datasets as the test set;
[0032] By continuing to train the model obtained after the meta-training phase using the test set, a stable meta-learning-based MVWG feature extraction model can be obtained with only a small number of fine-tuning samples.
[0033] Furthermore, the VWG feature extraction model uses Wasserstein distance to measure the distance between the real data distribution and the generated data distribution. The Wasserstein distance formula is:
[0034] $$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$
[0035] In the formula, $\Pi(P_r, P_g)$ represents the set of all possible joint distributions formed by combining $P_r$ and $P_g$; $\lVert x - y \rVert$ represents the distance between a real sample $x$ and a generated sample $y$ obtained by sampling $(x, y) \sim \gamma$; $\mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]$ represents the expected distance between sample pairs under the joint distribution $\gamma$; and $W(P_r, P_g)$ is the lower bound (infimum) of this expected value over all possible joint distributions.
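As a rough illustration of the Wasserstein distance described above: for two one-dimensional empirical samples of equal size, the optimal coupling pairs the sorted values, so the distance reduces to the mean absolute difference between the sorted samples. The sketch below covers only this special case; the function name is hypothetical and this is not the implementation used by the invention.

```python
def wasserstein_1d(real, generated):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(real) != len(generated) or len(real) == 0:
        raise ValueError("need two non-empty samples of equal size")
    # Sorting both samples gives the optimal pairing in one dimension.
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(x - y) for x, y in pairs) / len(real)


real_samples = [0.0, 1.0, 2.0]
shifted_samples = [0.5, 1.5, 2.5]
distance = wasserstein_1d(real_samples, shifted_samples)
```

Shifting every sample by the same amount moves the distribution by exactly that amount, which is the behavior that makes this distance a smooth training signal.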
[0036] Furthermore, the VWG feature extraction model optimizes the feature extraction model by minimizing the loss function value using gradient descent. The loss function is defined as:
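[0037] $$L = L_{VAE} + L_{dis}$$
[0038] where
[0039] $$L_{VAE} = \beta \cdot L_{ws} + L_{rec}(\mathrm{mse}) + L_{gen}$$
[0040] The term $L_{ws}$ in $L_{VAE}$ is:
[0041] $$L_{ws} = W_2^2(Z_i, Z_p)$$
[0042] In the formula, $Z_i$ represents the distribution of the generated vectors and $Z_p$ represents a standard normal distribution;
[0043] When $Z_p$ is a normal distribution, $$W_2^2(Z_i, Z_p) = \lVert \mu_i \rVert_2^2 + \lVert \Sigma_i^{1/2} - I \rVert_F^2$$
[0044] In the formula, $\mu_i$ is the mean of the generated vector distribution, $\Sigma_i$ is its covariance, $I$ is the identity matrix, and the subscript $F$ denotes the Frobenius norm;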
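[0045] The term $L_{rec}$ in $L_{VAE}$ is:
[0046] $$L_{rec} = \mathrm{mse}(x, \tilde{x})$$
[0047] The term $L_{gen}$ in $L_{VAE}$ is:
[0048] $$L_{gen} = -\mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]$$
[0049] The term $L_{dis}$ in $L$ is:
[0050] $$L_{dis} = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$
[0051] In the formulas, $x$ represents a real sample, $\tilde{x}$ represents a generated sample, $N$ is the dimension of the latent vector, and $\beta$ is the parameter that adjusts the strength of $L_{ws}$; $P_g$ represents the distribution of the generated samples and $P_r$ represents the distribution of the real samples;
[0052] In $L_{dis}$, $\hat{x}$ is the point-by-point interpolation between the real image and the generated image, expressed as:
[0053] $$\hat{x} = \epsilon x_r + (1 - \epsilon) x_g$$
[0054] where $\epsilon$ represents the image mixing ratio, $x_r$ is a real image, and $x_g$ is a generated image;
[0055] In $L_{dis}$, the term $\nabla_{\hat{x}} D(\hat{x})$ is the gradient of the discriminator output with respect to the interpolation, $\lVert \cdot \rVert_2$ denotes the L2 norm, and $\lambda$ is the weight of the gradient penalty relative to the other discriminator loss terms.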
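The gradient penalty term can be illustrated with a toy critic. The sketch below assumes a linear discriminator $D(x) = w \cdot x + b$ (hypothetical, chosen because its gradient with respect to the input is simply $w$), interpolates between a real and a generated sample, and penalizes the deviation of the gradient's L2 norm from 1. The default `lam=10.0` follows the value given in the embodiment; everything else is an assumption for illustration.

```python
import math
import random


def interpolate(x_real, x_gen, eps):
    """Point-by-point interpolation: eps * x_real + (1 - eps) * x_gen."""
    return [eps * r + (1.0 - eps) * g for r, g in zip(x_real, x_gen)]


def linear_critic_grad(weights, x_hat):
    # For D(x) = w . x + b the gradient w.r.t. x equals w for every x_hat.
    return list(weights)


def gradient_penalty(weights, x_real, x_gen, eps, lam=10.0):
    """lam * (||grad D(x_hat)||_2 - 1)^2 for the toy linear critic."""
    x_hat = interpolate(x_real, x_gen, eps)
    grad = linear_critic_grad(weights, x_hat)
    grad_norm = math.sqrt(sum(g * g for g in grad))
    return lam * (grad_norm - 1.0) ** 2


eps = random.Random(0).uniform(0.0, 1.0)  # mixing ratio drawn from [0, 1]
penalty = gradient_penalty([3.0, 4.0], [1.0, 1.0], [0.0, 0.0], eps)
```

With a real neural-network discriminator the gradient depends on the interpolated input and would be obtained by automatic differentiation; the linear case only shows how the penalty grows as the gradient norm moves away from 1.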
[0056] Furthermore, the meta-training phase of the Reptile meta-learning algorithm includes an inner loop and an outer loop. For any extracted task $T_i$, the algorithm is trained using gradient descent in the inner loop, and the parameters $\theta_i^{k}$ are obtained after $k$ updates. It then enters the outer loop phase;
[0057] In the outer loop phase, the Reptile algorithm uses the difference between the parameters after and before the update, $\theta_i^{k} - \theta$, as the gradient direction to update the parameters, and the formula is:
[0058] $$\theta' = \theta + \varepsilon\left(\theta_i^{k} - \theta\right)$$
[0059] In the formula, $\varepsilon$ represents the learning rate of the outer loop, $\theta$ are the initial parameters of the model, $\theta_i^{k}$ are the parameters obtained for task $T_i$ when the inner loop ends, and $\theta'$ are the parameters obtained for task $T_i$ when the outer loop ends;
[0060] After all the extracted tasks have gone through the meta-training phase, the model parameters are obtained.
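The inner and outer loops can be sketched on a toy problem. In the example below, each "task" is the hypothetical scalar objective $(\varphi - c)^2$ with a task-specific target $c$; the task definition, learning rates, and function names are assumptions made only so the sketch runs, and they stand in for training the actual models.

```python
def inner_loop(theta, target, k=5, inner_lr=0.1):
    """Run k gradient-descent steps on the toy loss (phi - target)^2."""
    phi = theta
    for _ in range(k):
        grad = 2.0 * (phi - target)  # derivative of (phi - target)^2
        phi -= inner_lr * grad
    return phi


def reptile(theta, task_targets, outer_lr=0.5, k=5, inner_lr=0.1):
    """Outer loop: move theta toward the parameters found by each inner loop."""
    for target in task_targets:
        phi = inner_loop(theta, target, k, inner_lr)
        theta = theta + outer_lr * (phi - theta)
    return theta


# Alternate between two toy tasks whose optima are 1.0 and 3.0.
meta_theta = reptile(0.0, [1.0, 3.0] * 50)
```

The resulting initialization sits between the optima of the two tasks, so a few inner-loop steps suffice to adapt to either one, which is the effect meta-training aims for.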
[0061] Further, in step S2, training the decision control model using the meta-learning algorithm Reptile includes the following steps:
[0062] In the meta-training phase, multiple different driving tasks are randomly selected by setting different starting points to optimize the PPO algorithm. The PPO algorithm includes an Actor_new network, an Actor_old network, and a Critic network. The Actor_new network updates its parameters by minimizing a loss function composed of the ratio of new to old policies and an advantage function. The Critic network updates its parameters using the advantage function. The Actor_old network updates its parameters by periodically replicating the parameters of the Actor_new network. After the meta-training phase, optimal model parameters are obtained.
[0063] When faced with a new driving task, i.e. entering the meta-testing phase, the model is initialized with the model parameters obtained in the meta-training phase and fine-tuned so that the vehicle agent learns the MPPO decision model.
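The Actor_new loss built from the policy ratio and the advantage function can be sketched as below. The text does not state whether the ratio is clipped; the sketch assumes the commonly used clipped surrogate with a hypothetical `clip_eps=0.2`, and takes log-probabilities and advantages as plain lists rather than network outputs.

```python
import math


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio) * A) over a batch."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # new policy / old policy
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped * adv))
    # Minimizing this loss maximizes the surrogate objective.
    return -sum(terms) / len(terms)


# Identical policies give ratio 1, so the loss is minus the mean advantage.
loss_same_policy = clipped_surrogate_loss([0.0, 0.0], [0.0, 0.0], [1.0, 3.0])
```

Copying the Actor_new parameters into Actor_old at fixed intervals resets the ratio to 1, after which the next batch of updates starts from the situation shown in the usage line.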
[0064] Furthermore, the reward function in the PPO algorithm is set as follows:
[0065]
[0066]
[0067]
[0068] In the formulas, $v_{min}$ is the minimum speed, $v_{max}$ is the maximum speed, $v_{target}$ is the target speed, and $[v_{min}, v_{target}]$ is the speed buffer zone; the intelligent vehicle receives the maximum reward when its speed remains within the speed buffer zone; $d_{max}$ is the maximum permissible road offset; $R_p$ is the penalty function for violations; and $\alpha_{max}$ limits the angle between the vehicle's forward vector and the forward vector of the current waypoint.
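The reward formulas themselves are not reproduced in this text, so the following is only a sketch under stated assumptions. It follows the embodiment's description of the reward as a product of speed, centering, and angle terms with a penalty on violations; the piecewise-linear shapes, the penalty value of -10, and the example speeds are all hypothetical.

```python
def speed_term(v, v_min, v_target, v_max):
    """1.0 inside [v_min, v_target], falling off linearly outside (assumed shape)."""
    if v < v_min:
        return max(v, 0.0) / v_min
    if v <= v_target:
        return 1.0
    return max(0.0, 1.0 - (v - v_target) / (v_max - v_target))


def centering_term(d, d_max):
    """Decreases linearly with the road offset d (assumed shape)."""
    return max(0.0, 1.0 - abs(d) / d_max)


def angle_term(alpha_deg, alpha_max_deg=20.0):
    """Decreases linearly with the heading angle, zero beyond alpha_max (assumed)."""
    return max(0.0, 1.0 - abs(alpha_deg) / alpha_max_deg)


def reward(v, d, alpha_deg, violated, *, v_min, v_target, v_max, d_max,
           alpha_max_deg=20.0, penalty=-10.0):
    if violated:
        return penalty  # stand-in for the penalty function R_p
    return (speed_term(v, v_min, v_target, v_max)
            * centering_term(d, d_max)
            * angle_term(alpha_deg, alpha_max_deg))


limits = dict(v_min=15.0, v_target=25.0, v_max=35.0, d_max=1.75)
best_case = reward(20.0, 0.0, 0.0, False, **limits)
```

Because the three terms are multiplied, any single term reaching zero (for example, leaving the lane) removes the whole reward, which discourages trading one objective against another.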
[0069] An end-to-end autonomous driving system based on meta-reinforcement learning is implemented using the end-to-end autonomous driving method based on meta-reinforcement learning as described above, including an image acquisition module, a feature extraction module, and a decision control module.
[0070] The image acquisition module is used to acquire RGB images of driving scenarios in real time;
[0071] The feature extraction module includes an MVWG feature extraction model, used to extract feature information from the RGB image;
[0072] The decision control module includes an MPPO decision model, which combines the vehicle's current operating information to output corresponding decision control actions and feeds these actions back to the driving environment to further optimize the driving strategy and obtain a stable autonomous driving system.
[0073] Compared with the prior art, the present invention has the following beneficial effects:
[0074] 1. This invention first trains an MVWG (Meta-VAE-WGAN-GP) feature extraction model by collecting multiple datasets, and then trains an MPPO (Meta-Proximal Policy Optimization) decision control model on different driving tasks. When faced with a new driving task, the trained MVWG feature extraction model and MPPO decision control model are used to initialize the autonomous driving system. When the intelligent vehicle encounters a new driving scenario, the camera captures environmental images in real time, and the images are input into the encoder module of the feature extraction model for encoding and feature extraction. The extracted feature information is then input to the vehicle agent. After obtaining 64-dimensional high-quality image feature information, the vehicle agent combines its current speed, acceleration, and steering angle information to output corresponding decision control actions, such as braking, accelerator, and steering, according to the initialized MPPO policy. Simultaneously, the actions are fed back to the driving environment to further optimize the driving strategy. Finally, based on previous experience, the agent quickly trains a high-performance and stable autonomous driving system using a small number of samples. This invention has the advantages of fast model training speed, high generalization performance, and high-quality feature extraction.
[0075] 2. The feature extraction model in this invention combines the variational autoencoder (VAE) with the Wasserstein generative adversarial network with gradient penalty (WGAN-GP) to form the VWG (VAE-WGAN-GP) model, which can effectively improve the feature extraction quality of the feature extraction model.
[0076] 3. This invention uses the meta-learning algorithm Reptile to train the VWG feature extraction model, thereby improving the training speed of the model when facing new driving scenarios.
[0077] 4. This invention uses the meta-learning algorithm Reptile to train the proximal policy optimization algorithm PPO decision control model, enabling the agent to quickly learn a better driving strategy based on previously learned experience when facing new driving tasks, reducing the model's training time and improving the model's generalization performance.
[0078] 5. This invention further improves the reward function in the PPO decision control algorithm, thereby increasing the convergence speed of the model.
Description of the Drawings
[0079] Figure 1 is a flowchart of the end-to-end autonomous driving method based on meta-reinforcement learning of the present invention;
[0080] Figure 2 is a block diagram of the VWG feature extraction model in an embodiment of the present invention;
[0081] Figure 3 is a flowchart of training a VWG feature extraction model using the Reptile meta-learning algorithm in an embodiment of the present invention;
[0082] Figure 4 is a block diagram of the PPO decision control algorithm in an embodiment of the present invention;
[0083] Figure 5 is a flowchart illustrating the training of the PPO decision control model using the Reptile meta-learning algorithm in an embodiment of the present invention.
Detailed Implementation
[0084] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0085] Example 1
[0086] As shown in Figure 1, the end-to-end autonomous driving method based on meta-reinforcement learning disclosed in this invention includes the following steps:
[0087] (1) First, the autonomous driving system is initialized using the trained MVWG (Meta-VAE-WGAN-GP) feature extraction model and MPPO (Meta-Proximal Policy Optimization) decision model.
[0088] (2) While the vehicle intelligent agent carries out the driving task in a new driving scenario, the vehicle-mounted wide-angle camera collects RGB images in real time.
[0089] (3) Input the image into the encoder module of the MVWG model and encode it to extract 64-dimensional high-quality feature information.
[0090] (4) After the vehicle intelligent agent obtains the 64-dimensional high-quality image feature information output by the encoder, it combines its current speed, acceleration and steering angle information, and outputs corresponding decision control actions such as braking, throttle, and steering according to the initialized MPPO strategy, and feeds the actions back to the driving environment to continue to optimize the driving strategy.
[0091] (5) The vehicle intelligent agent can quickly train a high-performance and stable autonomous driving system based on previous experience and a small number of samples.
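Steps (3) and (4) above can be sketched as a single control step: the encoder's 64-dimensional features are concatenated with the vehicle's speed, acceleration, and steering angle, and the policy maps the resulting state to an action. The encoder and policy below are hypothetical stand-ins so that the sketch runs without a simulator or trained models; the plain concatenation is also an assumption about how the information is combined.

```python
from typing import Callable, Dict, List, Sequence, Tuple

FEATURE_DIM = 64  # feature size stated in the embodiment


def build_state(features: Sequence[float], speed: float,
                acceleration: float, steering_angle: float) -> List[float]:
    """Concatenate encoder features with the vehicle's operating information."""
    if len(features) != FEATURE_DIM:
        raise ValueError(f"expected {FEATURE_DIM} features, got {len(features)}")
    return list(features) + [speed, acceleration, steering_angle]


def control_step(image: object,
                 vehicle_info: Tuple[float, float, float],
                 encoder: Callable[[object], Sequence[float]],
                 policy: Callable[[List[float]], Dict[str, float]]) -> Dict[str, float]:
    features = encoder(image)
    state = build_state(features, *vehicle_info)
    return policy(state)


def dummy_encoder(image: object) -> List[float]:
    # Hypothetical stand-in for the MVWG encoder.
    return [0.0] * FEATURE_DIM


def dummy_policy(state: List[float]) -> Dict[str, float]:
    # Hypothetical stand-in for the MPPO policy.
    return {"throttle": 0.5, "brake": 0.0, "steer": 0.0}


action = control_step(None, (8.0, 0.2, 0.0), dummy_encoder, dummy_policy)
```

In the described system the returned action would be applied to the driving environment, and the resulting experience would be used to keep fine-tuning the policy.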
[0092] In this embodiment, the MVWG feature extraction model in step (1) above is obtained as shown in Figures 2 and 3. The specific steps are as follows:
[0093] (6) The car is driven manually with the keyboard in the simulated environment of the Carla simulation platform, and images are collected by the on-board wide-angle camera. In this embodiment, 20 datasets are collected by randomly setting the weather values. Each dataset has 10,000 RGB images, of which 9,000 are used for training and 1,000 are used for testing.
[0094] (7) Enter the meta-training stage of meta-learning. Randomly select 5 datasets from 20 datasets collected by the vehicle-mounted wide-angle camera as the datasets for the meta-training stage of the feature extraction model based on meta-learning.
[0095] (8) Meta-training is divided into an inner loop phase and an outer loop phase. In the inner loop phase, the RGB images in the dataset of task $T_i$ are input into the encoder module, which encodes the real images, i.e., performs feature dimensionality reduction, to obtain high-quality 64-dimensional feature information.
[0096] (9) Input the extracted 64-dimensional high-quality feature information into the decoder for decoding, i.e., reconstruct the image.
[0097] (10) Input both the real image and the reconstructed image into the discriminator, and let the discriminator identify the authenticity of the real image and the reconstructed image. Output the probability that the discriminator will identify the reconstructed image or the real image as the real image.
[0098] (11) Optimize each module by continuously minimizing the loss function values of the encoder, decoder, and discriminator. The loss function of the VWG model is:
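[0099] $$L = L_{VAE} + L_{dis}$$
[0100] where
[0101] $$L_{VAE} = \beta \cdot L_{ws} + L_{rec}(\mathrm{mse}) + L_{gen}$$
[0102] The term $L_{ws}$ in $L_{VAE}$ is:
[0103] $$L_{ws} = W_2^2(Z_i, Z_p)$$
[0104] In the formula, $Z_i$ represents the distribution of the generated vectors and $Z_p$ represents a standard normal distribution. When $Z_p$ is a normal distribution,
[0105] $$W_2^2(Z_i, Z_p) = \lVert \mu_i \rVert_2^2 + \lVert \Sigma_i^{1/2} - I \rVert_F^2$$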
[0106] The term $L_{rec}$ in $L_{VAE}$ is:
[0107] $$L_{rec} = \mathrm{mse}(x, \tilde{x})$$
[0108] The term $L_{gen}$ in $L_{VAE}$ is:
[0109] $$L_{gen} = -\mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]$$
[0110] The term $L_{dis}$ in $L$ is:
[0111] $$L_{dis} = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$
[0112] In the formulas, $x$ represents a real sample, $\tilde{x}$ represents a generated sample, $N$ is the dimension of the latent vector, and $\beta$ is the parameter that adjusts the strength of $L_{ws}$. $P_g$ represents the distribution of the generated samples and $P_r$ represents the distribution of the real samples. To satisfy the Lipschitz constraint, the norm of the discriminator's gradient must not exceed a constant $K$. Therefore, a regularization term, the gradient penalty, is added to the loss function to keep the norm of the discriminator's gradient close to 1. Furthermore, in $L_{dis}$, $\hat{x}$ is the point-by-point interpolation between the real image and the generated image, expressed as:
[0113] $$\hat{x} = \epsilon x_r + (1 - \epsilon) x_g$$
[0114] where $\epsilon$ represents the image mixing ratio, sampled from a uniform distribution on [0, 1], $x_r$ is a real image, and $x_g$ is a generated image.
[0115] In $L_{dis}$, the term $\nabla_{\hat{x}} D(\hat{x})$ is the gradient of the discriminator output with respect to the interpolation, and $\lVert \cdot \rVert_2$ denotes the L2 norm. $\lambda$ is the weight of the gradient penalty relative to the other discriminator loss terms, set to 10.
[0116] (12) After training on the dataset of task $T_i$ is completed, i.e., after the inner loop phase ends, the parameters $\theta_i^{k}$ are obtained, and the outer loop phase begins.
[0117] (13) In the outer loop phase, the Reptile algorithm uses the difference between the parameters after and before the update, $\theta_i^{k} - \theta$, as the gradient direction to update the parameters, and the formula is:
[0118] $$\theta' = \theta + \varepsilon\left(\theta_i^{k} - \theta\right)$$
[0119] In the formula, $\varepsilon$ represents the learning rate of the outer loop, $\theta$ are the initial parameters of the model, $\theta_i^{k}$ are the parameters obtained for task $T_i$ when the inner loop ends, and $\theta'$ are the parameters obtained for task $T_i$ when the outer loop ends.
[0120] (14) After all the extracted tasks have gone through the meta-training phase, a better set of model parameters will be obtained.
[0121] (15) In the meta-testing phase, when the model faces a new driving scenario, that is, when the autonomous driving system of the present invention faces a new driving task, the parameters obtained in the meta-training phase are used to initialize the VWG feature extraction model in the system, so that the model can converge quickly with only a small number of samples, thus obtaining a good meta-learning-based MVWG feature extraction model.
[0122] In this embodiment, the MPPO decision model in step (1) above is obtained as shown in Figure 5, specifically:
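The data handling in steps (6) and (7) above can be sketched as follows: each of the 20 datasets of 10,000 images is split 9,000/1,000, and 5 datasets are drawn at random for meta-training. The text does not say how the split is chosen, so the shuffled split, the seeds, and the function names below are assumptions.

```python
import random


def split_dataset(num_images=10_000, num_train=9_000, seed=0):
    """Return shuffled (train, test) image indices for one dataset."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    return indices[:num_train], indices[num_train:]


def sample_meta_training_tasks(num_datasets=20, num_tasks=5, seed=0):
    """Randomly pick which datasets act as meta-training tasks."""
    return random.Random(seed).sample(range(num_datasets), num_tasks)


train_idx, test_idx = split_dataset()
meta_train_tasks = sample_meta_training_tasks()
```

The datasets not drawn for meta-training remain available as candidates for the meta-testing phase, where only a small number of samples is used for fine-tuning.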
[0123] (16) During the meta-training phase, 10 different driving tasks are randomly selected by setting different starting points.
[0124] (17) Enter the inner loop stage of meta-training. For any extracted task $T_i$, the parameters $\theta_i^{k}$ are obtained after the inner loop phase is completed, and the outer loop phase then begins.
[0125] (18) In the outer loop phase, the Reptile algorithm uses the difference between the parameters after and before the update, $\theta_i^{k} - \theta$, as the gradient direction to update the parameters, and the formula is:
[0126] $$\theta' = \theta + \varepsilon\left(\theta_i^{k} - \theta\right)$$
[0127] In the formula, $\varepsilon$ represents the learning rate of the outer loop, $\theta$ are the initial parameters of the model, $\theta_i^{k}$ are the parameters obtained for task $T_i$ when the inner loop ends, and $\theta'$ are the parameters obtained for task $T_i$ when the outer loop ends.
[0128] (19) After all the extracted tasks have gone through the meta-training phase, a better set of model parameters will be obtained.
[0129] Meanwhile, the reward function in the PPO algorithm is set as follows:
[0130]
[0131]
[0132]
[0133] Taking full account of factors such as driving speed, road deviation, steering wheel angle, and penalties for violations, the reward function is composed of the product of three terms: speed, centering, and angle. In the formulas, $v_{min}$ is the minimum speed, $v_{max}$ is the maximum speed, $v_{target}$ is the target speed, and $[v_{min}, v_{target}]$ is the speed buffer zone; the intelligent vehicle obtains the maximum reward of 1 when its speed is maintained within this range. $d_{max}$ is the maximum permissible road offset, typically half the lane width. $R_p$ is the penalty function applied when a violation occurs. $\alpha_{max}$ limits the angle between the vehicle's forward vector and the forward vector of the current waypoint to a maximum of 20 degrees.
[0134] (20) During the meta-testing phase, when the agent faces a new driving task, that is, when the autonomous driving system in this invention faces a new driving task, the parameters obtained in the meta-training phase are used to initialize the PPO decision control model, so that the agent can converge quickly with only a small number of samples, thus obtaining a good MPPO decision control model based on meta-reinforcement learning.
[0135] Example 2
[0136] This embodiment provides an end-to-end autonomous driving system based on meta-reinforcement learning, which is implemented using an end-to-end autonomous driving method based on meta-reinforcement learning as described in the above embodiment, including an image acquisition module, a feature extraction module, and a decision control module;
[0137] The image acquisition module is used to acquire RGB images of the driving environment in real time when the vehicle's intelligent agent faces new driving scenarios.
[0138] The feature extraction module includes the MVWG feature extraction model. The encoder module within the trained MVWG model encodes RGB images to extract 64-dimensional high-quality features. The MVWG feature extraction model is trained on multiple datasets using the meta-learning algorithm Reptile.
[0139] The decision control module includes the MPPO decision model. The decision control strategy is initialized using the MPPO model trained in the meta-training stage. After the vehicle agent obtains 64-dimensional high-quality image feature information, it combines its current speed, acceleration, and steering angle information, and outputs corresponding decision control actions according to the strategy, such as braking, accelerator, and steering. At the same time, it also feeds the actions back to the driving environment to continue to optimize the driving strategy. Finally, after a small number of samples, a high-performance and stable autonomous driving system is quickly trained.
[0140] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. An end-to-end autonomous driving method based on meta-reinforcement learning, characterized in that, Includes the following steps: S1. Construct a dataset and use the meta-learning algorithm Reptile to train the VWG feature extraction model to obtain the MVWG feature extraction model. S2. Construct a decision control model and train the decision control model using the meta-learning algorithm Reptile to obtain the trained MPPO decision model; S3. Initialize the autonomous driving system using the trained MVWG feature extraction model and MPPO decision model. When the vehicle agent completes the driving task in a new driving scenario, collect RGB images of the driving environment in real time. S4. Input the RGB image of the driving environment into the encoder in the MVWG model and encode it to extract the feature information of the RGB image of the driving environment. S5. After the vehicle intelligent agent obtains the image feature information output by the encoder in step S4, it combines its current operating information and outputs corresponding decision control actions according to the initialized MPPO strategy. The decision control actions are then fed back to the driving environment to continue optimizing the driving strategy in order to obtain a stable autonomous driving system. In step S1, constructing the dataset includes the following steps: The car is driven manually in a simulated environment on the simulation platform using the keyboard; Different driving scenarios are constructed by randomly setting weather values; the randomly set weather values include randomly setting different solar altitude, solar angle, cloud cover, rainfall, or wind speed; Multiple datasets were constructed by collecting environmental images from different driving scenarios; each dataset includes multiple RGB images. 
Step S2, training the decision control model using the meta-learning algorithm Reptile, includes the following steps: In the meta-training phase, the PPO algorithm is optimized by randomly selecting multiple different driving tasks at different starting points. The PPO algorithm includes an Actor_new network, an Actor_old network, and a Critic network. The Actor_new network updates its parameters by minimizing a loss function composed of the ratio of new to old policies and an advantage function. The Critic network updates its parameters using the advantage function. The Actor_old network updates its parameters by periodically replicating the parameters of the Actor_new network. After the meta-training phase, optimal model parameters are obtained; When faced with a new driving task, i.e. entering the meta-testing phase, the model is initialized with the model parameters obtained in the meta-training phase and fine-tuned so that the vehicle agent learns the MPPO decision model.
2. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 1, characterized in that, In step S1, the VWG feature extraction model includes an encoder, a decoder, and a discriminator; The encoder is used to encode the input real image, i.e., feature dimensionality reduction, and output the result. n 3D feature information; The decoder is used to decode the input feature information, that is, to reconstruct the image; The discriminator is used to distinguish between the authenticity of real images and reconstructed images, and outputs the probability that the discriminator classifies a reconstructed image or a real image as a real image.
3. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 1, characterized in that, in step S1, the VWG feature extraction model is trained using the meta-learning algorithm Reptile, the meta-learning including a meta-training phase and a meta-testing phase.
The meta-training phase includes the following steps:
multiple datasets are randomly selected from all datasets as training sets;
the RGB images from each training set are input into the encoder of the VWG feature extraction model, and the encoder encodes the real images, i.e., performs feature dimensionality reduction, to obtain n-dimensional feature information;
the extracted n-dimensional feature information is input into the decoder of the VWG feature extraction model for decoding, i.e., image reconstruction;
both the real and the reconstructed images are input into the discriminator of the VWG feature extraction model, which distinguishes real images from reconstructed images and outputs the probability that it classifies an input image, whether reconstructed or real, as a real image;
the loss function values of the encoder, decoder, and discriminator are minimized to optimize the feature extraction model;
after training on each training set is completed, a set of model parameters is obtained and the model enters the meta-testing phase.
The meta-testing phase includes the following steps:
one or more datasets are randomly selected from all datasets as the test set;
the model obtained after the meta-training phase is further trained using the test set and fine-tuned with a small number of samples to obtain the meta-learning-based MVWG feature extraction model.
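4. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 3, characterized in that the VWG feature extraction model uses the Wasserstein distance to measure the distance between the real data distribution and the generated data distribution. The Wasserstein distance formula is:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$
In the formula, $\Pi(P_r, P_g)$ is the set of all possible joint distributions formed by combining $P_r$ and $P_g$, i.e., whose marginals are the real distribution $P_r$ and the generated distribution $P_g$; $(x, y) \sim \gamma$ indicates that a real sample $x$ and a generated sample $y$ are sampled from the joint distribution $\gamma$; $\lVert x - y \rVert$ is the distance between them; $\mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]$ is the expected value of the distance between sample pairs under the joint distribution $\gamma$; and $\inf$ denotes the lower bound (infimum) of this expected value over all possible joint distributions.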
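As an intuition aid for the infimum over joint distributions, the sketch below computes the Wasserstein distance in the simplest setting: two one-dimensional empirical distributions with the same number of samples, where the optimal coupling pairs the sorted samples. This is only a toy illustration under that assumption; the function name and sample values are hypothetical, and the claimed model operates on images rather than scalars.

```python
def wasserstein_1d(real, generated):
    """W1 distance between two equal-size 1-D empirical distributions.

    For equal-size 1-D samples the optimal coupling matches the i-th
    smallest real sample with the i-th smallest generated sample.
    """
    if len(real) != len(generated) or not real:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(x - y) for x, y in pairs) / len(real)

# Hypothetical samples: the generated values are the real ones shifted by 0.5.
real = [0.0, 1.0, 2.0]
generated = [2.5, 0.5, 1.5]
distance = wasserstein_1d(real, generated)
```

Because every generated value sits 0.5 away from its sorted counterpart, the distance equals the size of the shift.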
5. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 3, characterized in that the VWG feature extraction model is optimized by minimizing the loss function value using the gradient descent method. The loss function is defined as follows (the formulas and their symbols are not reproduced in this text; the definitions that accompany them are):
for one loss term: the distribution of the generated vectors and the standard normal distribution; when the distribution of the generated vectors is a normal distribution, the term is expressed through the mean of the generated vector distribution, the identity matrix, and a norm;
for the remaining loss terms: a real sample; a generated sample; the dimension of the latent (implicit) vector; a parameter regulating the strength of a loss term; the distribution of the generated samples; and the distribution of the real samples;
for the gradient penalty term: the point-by-point interpolation between the real image and the generated (pseudo) image, formed from an image ratio, the real image, and the generated image; the gradient of the discriminator output with respect to the interpolation; the L2 norm of that gradient; and the ratio of the gradient penalty to the other discriminator loss terms.
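For orientation, the standard textbook forms of three quantities named in this claim are sketched below. They are assumptions drawn from the usual VAE and WGAN-GP formulations, not a reconstruction of the claim's own formulas, and every symbol ($\mu$, $\alpha$, $x$, $\tilde{x}$, $\hat{x}$, $D$, $\lambda$) is chosen here purely for illustration.

```latex
% KL divergence between N(mu, I) and the standard normal N(0, I)
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, I)\,\|\,\mathcal{N}(0, I)\big)
  = \tfrac{1}{2}\,\lVert \mu \rVert_2^2

% Point-by-point interpolation between a real image x and a generated image \tilde{x}
\hat{x} = \alpha\, x + (1 - \alpha)\,\tilde{x}, \qquad \alpha \sim U[0, 1]

% Gradient penalty added to the discriminator loss, weighted by \lambda
\lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]
```

The first identity follows from the general Gaussian KL formula once both covariances are the identity, which is why only the mean and a norm appear in it.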
6. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 3, characterized in that the meta-training phase of the Reptile meta-learning algorithm includes an inner loop and an outer loop. For any extracted task $\tau$, the algorithm is trained using gradient descent in the inner loop, and the parameters $\theta_\tau^{k}$ are obtained after k updates; it then enters the outer loop phase. In the outer loop phase, the Reptile algorithm uses the difference between the parameters before and after the update, $\theta_\tau^{k} - \theta$, as the gradient direction to update the parameters, and the formula is:
$$\theta' = \theta + \epsilon\,(\theta_\tau^{k} - \theta)$$
In the formula, $\epsilon$ is the learning rate of the outer loop; $\theta$ denotes the initial parameters of the model; $\theta_\tau^{k}$ denotes the parameters of task $\tau$ obtained when the inner loop ends; and $\theta'$ denotes the parameters of task $\tau$ obtained when the outer loop ends. After all the extracted tasks have gone through the meta-training phase, the model parameters are obtained.
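The inner-loop and outer-loop structure in this claim can be sketched on a toy problem. The example below assumes scalar parameters and hypothetical tasks whose loss is a simple quadratic around a per-task target; the function names, learning rates, and iteration counts are illustrative choices, not values from the claim.

```python
import random

def inner_loop(theta, target, k=5, inner_lr=0.1):
    """k gradient-descent steps on the toy task loss 0.5 * (theta - target)**2."""
    for _ in range(k):
        grad = theta - target
        theta = theta - inner_lr * grad
    return theta

def reptile(task_targets, theta=0.0, outer_lr=0.5, meta_iters=200, seed=0):
    """Reptile outer loop: move theta toward each task's adapted parameters."""
    rng = random.Random(seed)
    for _ in range(meta_iters):
        target = rng.choice(task_targets)    # extract a task
        theta_k = inner_loop(theta, target)  # parameters after k inner updates
        # Use the parameter difference as the update direction.
        theta = theta + outer_lr * (theta_k - theta)
    return theta

# Hypothetical tasks: each task's optimum is a scalar target.
task_targets = [1.0, 2.0, 3.0]
meta_theta = reptile(task_targets)
```

The resulting `meta_theta` is an initialization lying between the task optima, so a few inner-loop steps on any single task move it close to that task's solution.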
7. The end-to-end autonomous driving method based on meta-reinforcement learning according to claim 1, characterized in that the reward function in the PPO algorithm is set as follows (the formula and its symbols are not reproduced in this text). The quantities in the formula are: the minimum speed; the maximum speed; the target speed; the speed buffer zone, where the smart car receives the maximum reward when its speed remains within the buffer zone; the maximum permissible road offset; the penalty applied when a violation occurs; and a limit on the included angle between the vehicle's forward vector and the forward vector of the current waypoint.
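To show how the quantities listed in this claim could interact, here is a purely hypothetical reward sketch. The claim's actual formula is not available in this text, so the piecewise-linear shape, the interpretation of the buffer zone as a band around the target speed, and all default values are assumptions made only for illustration.

```python
def driving_reward(speed, offset, angle_deg, *, v_min=15.0, v_target=25.0,
                   v_max=35.0, buffer=2.0, max_offset=2.0,
                   max_angle_deg=20.0, penalty=-10.0):
    """Hypothetical shaped reward; not the formula from the claim."""
    # Violation: too far from the lane centre or heading away from the waypoint.
    if abs(offset) > max_offset or abs(angle_deg) > max_angle_deg:
        return penalty
    # Maximum reward inside the assumed buffer zone around the target speed.
    if abs(speed - v_target) <= buffer:
        return 1.0
    if speed < v_target:
        # Linear ramp from 0 at v_min up to 1 at the lower edge of the buffer.
        low_edge = v_target - buffer
        return max(0.0, (speed - v_min) / (low_edge - v_min))
    # Linear decay from 1 at the upper edge of the buffer down to 0 at v_max.
    high_edge = v_target + buffer
    return max(0.0, (v_max - speed) / (v_max - high_edge))

# Hypothetical readings: speed in km/h, lateral offset in metres, angle in degrees.
in_buffer = driving_reward(25.0, 0.5, 3.0)
too_slow = driving_reward(19.0, 0.5, 3.0)
off_road = driving_reward(25.0, 3.0, 3.0)
```

Under these assumptions the agent earns the full reward only while it holds a speed near the target, stays within the permitted offset, and keeps its heading aligned with the waypoint direction.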
8. An end-to-end autonomous driving system based on meta-reinforcement learning, characterized in that the system is implemented using the end-to-end autonomous driving method based on meta-reinforcement learning according to any one of claims 1-7, and includes an image acquisition module, a feature extraction module, and a decision control module;
the image acquisition module is used to acquire RGB images of driving scenarios in real time;
the feature extraction module includes the MVWG feature extraction model and is used to extract feature information from the RGB images;
the decision control module includes the MPPO decision model, which combines the feature information with the vehicle's current operating information to output the corresponding decision control actions and feeds these actions back to the driving environment to further optimize the driving strategy and obtain a stable autonomous driving system.
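The three-module structure of the system claim can be sketched as a simple composition of callables. The class name, type aliases, and stub functions below are hypothetical placeholders; in a real system the stubs would be replaced by a camera interface, the MVWG encoder, and the MPPO policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[List[float]]  # stand-in for an RGB frame
Features = List[float]
Action = dict

@dataclass
class DrivingSystem:
    acquire_image: Callable[[], Image]                     # image acquisition module
    extract_features: Callable[[Image], Features]          # feature extraction module
    decide: Callable[[Features, Sequence[float]], Action]  # decision control module

    def step(self, vehicle_state):
        """One control step: image -> features -> decision control action."""
        image = self.acquire_image()
        features = self.extract_features(image)
        return self.decide(features, vehicle_state)

# Stub modules used only to exercise the data flow.
system = DrivingSystem(
    acquire_image=lambda: [[0.2, 0.4], [0.6, 0.8]],
    extract_features=lambda img: [sum(row) / len(row) for row in img],
    decide=lambda feats, state: {"steer": feats[0] - feats[1], "throttle": 0.5},
)
action = system.step(vehicle_state=[10.0])
```

Keeping the modules behind plain callables makes it straightforward to swap a trained model in for a stub without changing the control loop.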