Multi-level vehicle-cloud collaborative control method for signalized intersections

By employing a multi-level vehicle-cloud collaborative control method at signalized intersections, combined with deep reinforcement learning and model predictive control, the problem of vehicle collaborative control in complex traffic scenarios has been solved, achieving optimization of traffic efficiency and energy consumption, and improving the adaptability and robustness of intelligent transportation systems.

CN120472690B · Active · Publication Date: 2026-06-23 · CHONGQING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV
Filing Date
2025-05-21
Publication Date
2026-06-23

Abstract

The present application relates to a multi-level vehicle-cloud collaborative control method for signalized intersections and belongs to the field of intelligent traffic control. The method achieves collaborative optimization control of vehicles in a signalized intersection environment through a multi-level control strategy that combines deep reinforcement learning (DRL) and model predictive control (MPC). The application constructs a DRL-MPC scheme on the MPC framework: the MPC module runs at low frequency to provide the basic control input, and the DRL module runs at high frequency to adjust and optimize the MPC output, so that the two cooperate to optimize the overall objective. Data is collected by interacting with the environment, and the proximal policy optimization (PPO) algorithm is used to train the DRL module, updating its policy network and value network parameters to continuously improve the control strategy. The method can reduce waiting time at traffic lights while saving energy.

Description

Technical Field

[0001] This invention belongs to the field of intelligent traffic control technology, specifically relating to a multi-level collaborative control method for vehicle-cloud systems at signalized intersections.

Background Technology

[0002] In recent years, the development of intelligent connected vehicle technology has promoted the implementation of related applications such as autonomous driving. However, most current research focuses on autonomous driving technology for single vehicles, lacking simulation testing and collaborative control strategies for intelligent vehicles in complex traffic scenarios, especially in continuous signalized intersections with mixed traffic of other vehicles. Existing traffic control methods are mostly based on traditional signal timing or simple sensor control, which are difficult to adapt to complex traffic demands. Vehicle queuing, waiting, and frequent starts and stops are common at signalized intersections, leading to increased energy consumption and reduced traffic efficiency. Furthermore, existing simulation testing methods often fail to fully consider the dynamic interaction between vehicles and traffic signals, and the impact of queue dissipation on subsequent traffic flow.

[0003] Based on this, this invention proposes a multi-level vehicle-cloud collaborative control method for signalized intersections. This method constructs a vehicle-cloud hierarchical control platform, combining deep reinforcement learning (DRL) and model predictive control (MPC) technologies to optimize intelligent vehicle operation. The intelligent vehicle cyber-physical system is divided into multiple operational scales, employing a hierarchical decomposition to achieve separation of concerns. Through the vehicle-cloud hierarchical architecture, the advantages of cloud computing and real-time vehicle control are leveraged to optimize intelligent vehicle speed and path planning. DRL is used to adjust MPC outputs, improving policy adaptability and robustness, reducing waiting time, lowering energy consumption, and increasing traffic efficiency, providing a new path for the optimization of intelligent transportation systems.

Summary of the Invention

[0004] In view of this, the present invention aims to provide a multi-level collaborative control method for vehicles at signalized intersections, which achieves collaborative optimization control of vehicles in signalized intersection environments through a multi-level control strategy that combines deep reinforcement learning and model predictive control.

[0005] To achieve the above objectives, the present invention provides the following technical solution:

[0006] A multi-level vehicle-to-cloud collaborative control method for signalized intersections includes the following steps:

[0007] S1. Based on the traffic demand and spatiotemporal characteristics of signalized intersections, construct the operational framework of the intelligent vehicle cyber-physical system;

[0008] S2. Construct a vehicle-cloud hierarchical control platform;

[0009] The physical layer uses SUMO to simulate road scenarios and loads vehicle dynamic models, while vehicle-side computing and control are implemented using Python. The information layer also uses Python to integrate SUMO and platform data, inputs it into the cloud control application platform, solves the optimal speed sequence through a predictive cruise algorithm, and outputs the results to the vehicle.

[0010] S3. A multi-level strategy of Deep Reinforcement Learning-Model Predictive Control (DRL-MPC) is adopted to optimize traffic control at signalized intersections. DRL is used to adjust the output of MPC to collaboratively optimize the control effect.

[0011] S4. Collect data by interacting with the environment, train the DRL module using the Proximal Policy Optimization (PPO) algorithm, and update the parameters of the policy network and value network of the DRL module to continuously improve the control policy.

[0012] Furthermore, the specific content of step S1, which involves constructing the operational framework of the intelligent vehicle cyber-physical system, is as follows:

[0013] The intelligent vehicle cyber-physical system is divided into multiple levels from the perspectives of time and space to match the actual needs of traffic control at signalized intersections;

[0014] Specifically, the system's hierarchical structure is determined based on the layout of signalized intersections, traffic flow distribution, and vehicle dynamics. Each layer corresponds to a specific functional module. The bottom layer is responsible for real-time vehicle data acquisition and preliminary processing, the middle layer performs local traffic flow optimization, and the top layer implements overall collaborative control strategies. Information is exchanged between the layers through standardized interfaces to ensure the modularity and scalability of the entire system, allowing for flexible adjustments to the system's configuration and functions according to different traffic scenarios and control requirements.

[0015] Furthermore, the specific content of step S2, constructing the vehicle-cloud hierarchical control platform, is as follows:

[0016] At the physical layer, SUMO software is used to create a realistic road network scenario, while defining vehicle type parameters.

[0017] The dynamics and kinematics model of the vehicle is integrated into the SUMO simulation environment. The model parameters include the vehicle's acceleration, deceleration, maximum speed and turning radius, so as to accurately simulate the vehicle's driving state under different traffic conditions.

[0018] The Python programming language is used to implement the vehicle-side computing and control logic, including data acquisition, processing of vehicle sensor data, and execution of control commands received from the cloud control application platform to control the vehicle's acceleration, deceleration, and lane changing.

[0019] In the information layer, data fusion code written in Python integrates vehicle status data and road condition data obtained from the SUMO simulator with traffic flow data and traffic light status data obtained from the support platform. The data is cleaned, aligned, and fused to eliminate noise and inconsistencies. The fused data is then input into the cloud control application platform, which runs a predictive cruise algorithm solver to calculate the optimal speed sequence based on the current traffic conditions and vehicle status, serving as the vehicle control command. The optimal speed sequence is then converted into specific vehicle control commands and sent back to the vehicle-side platform in the physical layer via a communication interface, enabling real-time collaborative control of the vehicle.
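The cleaning, alignment, and fusion step can be pictured with a small sketch. It is only an illustration under assumptions: the record fields (`t`, `veh`, `speed`, `phase`), the function name `fuse_by_time`, and the rule of attaching the most recent signal sample to each vehicle sample are hypothetical and are not specified in the source.

```python
from bisect import bisect_right


def fuse_by_time(vehicle_records, signal_records):
    """Attach the most recent signal state to each vehicle record.

    Both inputs are lists of dicts with a numeric time stamp under "t"
    (hypothetical field names). Records with a missing speed are dropped
    as a stand-in for the cleaning step.
    """
    signals = sorted(signal_records, key=lambda r: r["t"])
    signal_times = [r["t"] for r in signals]
    fused = []
    for rec in sorted(vehicle_records, key=lambda r: r["t"]):
        if rec.get("speed") is None:      # cleaning: skip incomplete samples
            continue
        idx = bisect_right(signal_times, rec["t"]) - 1
        if idx < 0:                       # no signal sample yet: cannot align
            continue
        fused.append({**rec, "phase": signals[idx]["phase"]})
    return fused


vehicles = [
    {"t": 0.0, "veh": "v0", "speed": 8.0},
    {"t": 1.0, "veh": "v0", "speed": None},   # dropped by the cleaning rule
    {"t": 2.0, "veh": "v0", "speed": 6.5},
]
signals = [{"t": 0.0, "phase": "green"}, {"t": 1.5, "phase": "red"}]
fused = fuse_by_time(vehicles, signals)
```

In this toy input, the two complete vehicle samples are kept and paired with the signal phase that was active at their time stamps.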

[0020] Furthermore, in step S2, the road network scenario includes road layout, number of lanes, location of signalized intersections, and phase timing.

[0021] Furthermore, in step S2, the vehicle type parameters include vehicle length, maximum speed, and acceleration and deceleration limits.

[0022] Furthermore, the specific content of step S3 is as follows:

[0023] First, the state space is defined. The state space of the DRL covers the current traffic conditions of the signalized intersection and the initial control signals output by the MPC module. After normalization, the signals are input into the deep neural network.

[0024] Then, the action space is defined. The action space of DRL is consistent with the output dimension of MPC. The initial value of each action element is limited to the range [-1, 1] and subsequently scaled to the appropriate range according to the actual control requirements.

[0025] Next, the reward function is designed, taking into account traffic flow, vehicle waiting time and energy consumption, while introducing a penalty term to avoid state constraint violations.

[0026] Finally, a collaborative optimization mechanism is designed. The DRL module updates the policy network parameters based on the reward signal and adjusts the action output, enabling the MPC and DRL to work together and improve the adaptability and robustness of the control system.

[0027] Furthermore, the specific content of step S4 is as follows:

[0028] Data is collected through interaction with the environment, and the DRL module is trained using the Proximal Policy Optimization (PPO) algorithm to update its policy network and value network parameters in order to continuously improve the control policy.

[0029] First, the mean and standard deviation of continuous actions are output through the policy network to generate actions that follow a Gaussian distribution; the value network estimates the state value to provide a benchmark for the calculation of the advantage function.

[0030] Then, the experience replay buffer is managed: the data generated from interaction with the environment is stored in the buffer, a first-in-first-out (FIFO) strategy keeps the amount of stored data constant, and the data is updated periodically or in batches.

[0031] Then, generalized advantage estimation (GAE) is calculated. Combining the n-step reward and temporal difference methods, the advantage value is calculated using the GAE formula to smooth the advantage estimate and improve training stability.
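As a rough illustration of the GAE step, the sketch below computes advantages backward over one trajectory segment. The function name `gae_advantages` and the default values of `gamma` and `lam` are assumptions chosen for illustration; the source does not state its settings.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] and values[t] are per-step lists; last_value bootstraps the
    value of the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors (weight gamma * lam).
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# lam = 1 recovers "discounted return minus value"; lam = 0 recovers the TD error.
adv_full = gae_advantages([-1.0, -2.0], [0.5, 0.25], 0.0, gamma=1.0, lam=1.0)
adv_td = gae_advantages([-1.0, -2.0], [0.5, 0.25], 0.0, gamma=1.0, lam=0.0)
```

The two calls show the trade-off the paragraph describes: `lam` interpolates between the multi-step return and the one-step temporal-difference estimate.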

[0032] Furthermore, in step S4, the data generated by interacting with the environment includes state, action, reward, and next state.
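A fixed-capacity FIFO buffer for such transitions can be sketched with the standard library. The class name `FifoBuffer`, the `Transition` tuple, and the capacity used in the example are hypothetical; the source does not give a buffer size.

```python
from collections import deque, namedtuple
import random

# One stored interaction: state, action, reward, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class FifoBuffer:
    """Fixed-capacity buffer; the oldest transition is evicted first."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._data.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        return rng.sample(list(self._data), min(batch_size, len(self._data)))

    def __len__(self):
        return len(self._data)


buffer = FifoBuffer(capacity=3)
for i in range(5):
    buffer.add(state=i, action=0.0, reward=-1.0, next_state=i + 1)
```

After five insertions into a buffer of capacity three, only the three most recent transitions remain.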

[0033] Beneficial effects:

[0034] 1. By adopting the DRL-MPC multi-level control strategy, coordinated optimization control of vehicles in signalized intersection environments is realized, reducing waiting time at traffic lights and saving energy consumption.

[0035] 2. By leveraging the collaborative control architecture of vehicle-cloud layering, the advantages of cloud computing capabilities and real-time vehicle control can be fully utilized to achieve efficient optimization of the transportation system.

[0036] 3. Utilize deep reinforcement learning (DRL) to adjust the output of model predictive control (MPC) to improve the adaptability and robustness of the control strategy, enabling it to better cope with complex traffic environments.

[0037] Other advantages, objectives, and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination, or may be learned from practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.

Attached Figure Description

[0038] Figure 1 shows the framework of the simulation platform;

[0039] Figure 2 shows the vehicle-cloud layered control framework;

[0040] Figure 3 is the DRL-MPC framework diagram;

[0041] Figure 4 shows the timescales used for DRL-MPC control.

Detailed Implementation

[0042] To make the technical solutions, advantages, and objectives of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the described embodiments of the present invention without creative effort are within the protection scope of this application.

[0043] This invention provides a multi-level vehicle-to-cloud cooperative control method for signalized intersections, comprising the following steps:

[0044] S1. Based on the traffic demand and spatiotemporal characteristics of signalized intersections, construct the operational framework of the intelligent vehicle cyber-physical system;

[0045] The constructed intelligent vehicle cyber-physical system (HPS) operational framework is as follows: The HPS is divided into multiple layers from a temporal and spatial perspective to match the actual needs of signalized intersection traffic control. Specifically, the layered structure of the system is determined based on the layout of the signalized intersection, traffic flow distribution, and vehicle dynamic characteristics. Each layer corresponds to a specific functional module; for example, the bottom layer is responsible for real-time vehicle data acquisition and preliminary processing, the middle layer performs local traffic flow optimization, and the top layer implements overall collaborative control strategies. Through this layered approach, the system can process complex traffic information more efficiently and achieve precise control of traffic at signalized intersections. Simultaneously, information exchange between layers is achieved through standardized interfaces, ensuring the modularity and scalability of the entire system, allowing for flexible adjustments to the system's configuration and functions according to different traffic scenarios and control requirements.

[0046] S2. Construct a vehicle-cloud hierarchical control platform. The physical layer uses SUMO to simulate road scenarios and loads vehicle dynamic models. Vehicle-side computing and control are implemented using Python. The information layer also uses Python to integrate SUMO and platform data. After inputting the data into the cloud control application platform, the optimal speed sequence is solved by the predictive cruise algorithm, and the results are output to the vehicle.

[0047] The road scenario and the cloud control system designed here were implemented using SUMO and Python. The simulation platform framework is shown in Figure 1. At the physical layer, SUMO software is used to create a realistic road network scenario, including detailed information such as road layout, number of lanes, location and phase timing of signalized intersections, and vehicle type parameters such as vehicle length, maximum speed, and acceleration and deceleration limits. The vehicle's dynamics and kinematics model is integrated into the SUMO simulation environment, including parameters such as vehicle acceleration, deceleration, maximum speed, and turning radius, to accurately simulate the vehicle's driving state under different traffic conditions. Python is used to implement the vehicle-side calculation and control logic, including acquisition and processing of vehicle sensor data (such as speed, position, and acceleration) and execution of control commands received from the cloud control application platform to control the vehicle's acceleration, deceleration, lane changing, and other behaviors. In the information layer, data fusion code written in Python integrates vehicle status data and road condition data obtained from the SUMO simulator with traffic flow data and traffic light status data obtained from the support platform. This data is cleaned, aligned, and fused to eliminate noise and inconsistencies. The fused data is then input into the cloud control application platform, which runs a predictive cruise algorithm solver to calculate the optimal speed sequence based on current traffic and vehicle status, serving as the vehicle's control commands. The calculated optimal speed sequence is converted into specific vehicle control commands (such as acceleration and speed setpoints), and these commands are sent back to the vehicle-side platform in the physical layer via a communication interface, achieving real-time collaborative vehicle control. The vehicle-cloud layered control framework is shown in Figure 2.

[0048] S3. A multi-level strategy of Deep Reinforcement Learning-Model Predictive Control (DRL-MPC) is adopted to optimize traffic control at signalized intersections. DRL is used to adjust the MPC output, collaboratively optimizing the control effect. The DRL-MPC framework is shown in Figure 3.

[0049] The MPC module operates at the lower layer, providing basic control inputs based on the MPC objective function and traffic demand predicted by the associated nominal model, optimized over a prediction window. The objective function is given according to the control objective, and state and input constraints are explicitly considered during optimization. To improve the optimality of the MPC output and avoid severe constraint violations, the DRL module operates at the upper layer, modifying the MPC output through a learning process involving interaction with the real system. The state space of the DRL agent includes the signalized intersection traffic state and the MPC output, while the reward function is designed to complement the MPC objective function, enabling the two modules to collaboratively optimize the overall objective. Furthermore, traffic demand is input into the DRL agent, and penalties for constraint violations are added to the reward function. The DRL actions have the same dimension as the MPC output, but their elements are smaller in scale. The proposed MPC-DRL framework has a hierarchical structure, as shown in Figure 2.

[0050] Assume that the dynamic model of the signalized intersection is a discrete-time model with simulation sampling time $T_s$. The DRL module operates with control sampling time $T_d$, and the MPC module operates with control sampling time $T_c$. The relationship among $T_s$, $T_d$, and $T_c$ is:

[0051] $T_c = m_1 T_d = m_1 m_2 T_s, \quad m_1, m_2 \in \mathbb{N}^+, \; m_1 > 1$ (1)

[0052] Note that, for simplicity, the simulation sampling steps, DRL control steps, and MPC control steps are assumed to be aligned. The overall control input of the combined framework is therefore the combination of the MPC output and the DRL output, updated once every $T_d$. For the DRL control step $k_d$ that falls within MPC control step $k_c$ (i.e., $k_d T_d \in [k_c T_c, (k_c+1) T_c)$), the overall control input is:

[0053] $u_c(k_d) = \mathrm{sat}\big(u_{rl}(k_d) + u_b(k_c)\big)$ (2)

[0054] where $u_{rl}$ is the output of DRL and $u_b$ is the output of MPC. The saturation function ensures that the summed control input $u_c(k_d)$ satisfies the input constraints; it is defined element-wise as follows:

[0055]

[0056] where $u_{\min}$ and $u_{\max}$ are the minimum and maximum allowable values of the corresponding element of the control input. Figure 3 illustrates the different timescales of the MPC and DRL control sampling times and how $u_{rl}$ modifies $u_b$. The timescales used for DRL-MPC control are shown in Figure 4.
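The combined input of equation (2) can be illustrated with a few lines of Python. This is a minimal sketch assuming the saturation is the usual element-wise clamp between the minimum and maximum allowable values; the function names and the numbers in the example are hypothetical.

```python
def saturate(value, lower, upper):
    """Clamp a scalar to the interval [lower, upper]."""
    return max(lower, min(upper, value))


def combined_control(u_rl, u_b, u_min, u_max):
    """Element-wise saturated sum of the DRL correction and the MPC output."""
    return [
        saturate(rl + base, lo, hi)
        for rl, base, lo, hi in zip(u_rl, u_b, u_min, u_max)
    ]


u_b = [0.8, 0.2]     # MPC output, held constant over one MPC control step
u_rl = [0.3, -0.1]   # DRL correction at the current DRL control step
u_c = combined_control(u_rl, u_b, u_min=[0.0, 0.0], u_max=[1.0, 1.0])
```

In this example the first element would exceed its upper bound after the correction and is clamped, while the second element passes through unchanged.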

[0057] A standard MPC procedure is executed within the MPC module. The system state $x$ is updated at every simulation sampling step $k_s$, and the simulation sampling steps corresponding to MPC control step $k_c$ are:

[0058] $\{k_c m,\; k_c m + 1,\; \dots,\; k_c m + m - 1\}$ (4)

[0059] where $m = m_1 m_2$. Thus, at simulation sampling step $k_s = k_c m$, the actual state of the signalized intersection is measured and input into the MPC module. At each control step $k_c$, the following optimization problem is solved:

[0060]

[0061] Here, the decision variable is the sequence of control inputs optimized over a prediction window of length $N_{P,C}$, and $u_b(k_c)$ is the MPC output at control step $k_c$. $u_s(k_s)$ denotes the MPC output applied at simulation sampling step $k_s$, and the predicted state denotes the predicted future state at simulation sampling step $k_s$. In addition, $d(k_s)$ contains the traffic demand at simulation sampling step $k_s$. Within the optimization problem, the first constraint represents the system state evolution driven by the signalized intersection dynamics $f$, and the second and third constraints restrict the state to the MPC state set $X$ and the output to the MPC output set $U$, respectively. Since the system sampling frequency is higher than the frequency at which MPC outputs are generated, the fourth constraint maps the MPC output to each simulation sampling step within the prediction window so that the output can be applied in the system dynamics.

[0062] In the formula, $J(k_s)$ is the predicted objective function value over the time interval $[k_s T_s, (k_s+1) T_s)$. $N_{P,S}$ and $N_{P,C}$ are the prediction horizon lengths expressed in simulation sampling steps and MPC control steps, respectively, with $N_{P,S} = N_{P,C}\, m$.
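For orientation, one way to write an optimization problem that matches this description is sketched below. It is an assumed form, not the formula of the source: the hat notation for predicted states, the summation limits, and the index $j$ over MPC steps in the window are illustrative choices.

```latex
\begin{aligned}
\min_{u_b(k_c),\,\dots,\,u_b(k_c+N_{P,C}-1)} \quad
  & \sum_{k_s = k_c m}^{k_c m + N_{P,S} - 1} J(k_s) \\
\text{s.t.} \quad
  & \hat{x}(k_s+1) = f\big(\hat{x}(k_s),\, u_s(k_s),\, d(k_s)\big), \\
  & \hat{x}(k_s) \in X, \\
  & u_b(k_c+j) \in U, \qquad j = 0, \dots, N_{P,C}-1, \\
  & u_s(k_s) = u_b(k_c+j) \quad \text{for } k_s \in \{(k_c+j)m, \dots, (k_c+j)m + m - 1\}.
\end{aligned}
```

The last line expresses the hold behaviour described in the text: each MPC output is applied unchanged at all $m$ simulation sampling steps of its control step.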

[0063] The signalized intersection network is treated as a Markov decision process (MDP), which can be described by the quintuple $\langle S, A, P, R, \gamma \rangle$. This invention defines a state space $S$, an action space $A$, and a reward distribution $R$. Here, $P$ represents the transition probability between states, implicitly defined by the signalized intersection network model, and $\gamma$ is the discount factor applied to future rewards. The RL module operates at a lower level, with a higher frequency than the MPC module. To avoid excessively frequent changes in the control input of the signalized intersection network, the control sampling time $T_d$ of the RL module is greater than the simulation sampling time $T_s$. Therefore, the simulation sampling steps corresponding to RL control step $k_d$ are:

[0064] $\{k_d m_2,\; k_d m_2 + 1,\; \dots,\; k_d m_2 + m_2 - 1\}$ (6)

[0065] The state, action, and reward of the RL agent are updated at each control step, as defined below:

[0066] State $x_{rl}(k_d) \in S$: The state space of RL should contain all the necessary information about the framework. The RL agent uses a deep neural network, and to ease learning, the states fed to the input layer are normalized to the same order of magnitude. The normalized state is therefore:
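The index sets in equations (4) and (6) follow directly from the timescale relation in equation (1). The helper names below are hypothetical, and the values m1 = 3 and m2 = 2 are chosen only for the example.

```python
def sim_steps_for_mpc_step(k_c, m1, m2):
    """Simulation sampling steps covered by MPC control step k_c (eq. 4)."""
    m = m1 * m2
    return list(range(k_c * m, k_c * m + m))


def sim_steps_for_rl_step(k_d, m2):
    """Simulation sampling steps covered by RL control step k_d (eq. 6)."""
    return list(range(k_d * m2, k_d * m2 + m2))


def mpc_step_for_rl_step(k_d, m1):
    """MPC control step k_c whose interval contains RL control step k_d."""
    return k_d // m1


# With m1 = 3 and m2 = 2, one MPC step spans 3 RL steps and 6 simulation steps.
mpc_window = sim_steps_for_mpc_step(1, m1=3, m2=2)
rl_window = sim_steps_for_rl_step(4, m2=2)
owner = mpc_step_for_rl_step(4, m1=3)
```

Here RL control step 4 lies inside MPC control step 1, and its two simulation steps are a subset of the six simulation steps of that MPC step.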

[0067]

[0068] where the four elements are, respectively, the normalized state of the signalized intersection network, the normalized MPC output, the normalized actual traffic demand at simulation sampling step $k_d m_2$, and the overall control input of the framework at the previous control step $k_d - 1$. The fourth element is included to provide additional knowledge about the overall control input of the combined framework, which helps to avoid drastic fluctuations in the control input.
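A possible way to assemble such a four-part state vector is sketched below. The normalization rule (division by a fixed scale per element), the function names, and all numbers are assumptions for illustration; the source only states that the inputs are brought to the same order of magnitude.

```python
def normalize(values, scales):
    """Divide each element by its scale so that entries are of order one."""
    return [v / s for v, s in zip(values, scales)]


def build_rl_state(traffic_state, mpc_output, demand, prev_control,
                   state_scale, control_scale, demand_scale):
    """Concatenate the four normalized parts of the RL state."""
    return (
        normalize(traffic_state, state_scale)
        + normalize(mpc_output, control_scale)
        + normalize(demand, demand_scale)
        + normalize(prev_control, control_scale)
    )


x_rl = build_rl_state(
    traffic_state=[12.0, 4.0],   # hypothetical: queue length, mean speed
    mpc_output=[10.0],           # hypothetical: MPC speed setpoint
    demand=[900.0],              # hypothetical: vehicles per hour
    prev_control=[15.0],         # overall control input of the previous step
    state_scale=[40.0, 16.0],
    control_scale=[20.0],
    demand_scale=[1800.0],
)
```

The resulting list has one entry per raw quantity, and each entry lies in a comparable numeric range before it is passed to the network.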

[0069] Action $u_{rl}(k_d) \in A$: The action $u_{rl}$ is used to modify the MPC output $u_b$; therefore, they have the same dimension, namely $\dim u_{rl} = \dim u_b$. For simplicity, the action space is assumed to be continuous. Note that the action $u_{rl}$ is generated by the output layer of the DNN, and its elements initially take values in [-1, 1]. These values are therefore scaled to the range of the actual control input before being added. Furthermore, the elements of $u_{rl}$ are smaller in order of magnitude than those of $u_b$. In this framework, $u_b$ is thus the dominant control input, providing basic performance, while $u_{rl}$ is an auxiliary control input whose purpose is to improve performance by being updated at a high frequency. $u_{rl}$ satisfies the following inequality, which defines the action space $A$:

[0070] $-w_u \Delta U \le u_{rl} \le w_u \Delta U$ (8)

[0071] In the formula, $\Delta U = u_{\max} - u_{\min}$, where $u_{\max}$ and $u_{\min}$ are the upper and lower bounds of the MPC output, and $w_u \in [0, 1]$ is the scaling parameter that determines the degree to which $u_{rl}$ influences $u_b$.
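The scaling from the raw network output in [-1, 1] to the bound of inequality (8) can be sketched as follows. The linear mapping, the function name `scale_action`, and the example bounds are assumptions; the source states only the admissible range.

```python
def scale_action(raw_action, u_min, u_max, w_u):
    """Map raw outputs in [-1, 1] to the interval [-w_u * dU, w_u * dU]."""
    if not 0.0 <= w_u <= 1.0:
        raise ValueError("w_u must lie in [0, 1]")
    scaled = []
    for a, lo, hi in zip(raw_action, u_min, u_max):
        a = max(-1.0, min(1.0, a))           # guard: keep the raw output in [-1, 1]
        scaled.append(a * w_u * (hi - lo))   # dU = u_max - u_min per element
    return scaled


u_rl = scale_action([1.0, -0.5], u_min=[0.0, 0.0], u_max=[20.0, 10.0], w_u=0.1)
```

With `w_u = 0.1`, the correction can move each element by at most ten percent of its allowable range, which keeps the MPC output dominant.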

[0072] Reward $r(x_{rl}(k_d), u_{rl}(k_d)) \in R$: To strengthen the interaction with MPC, MPC and RL must be coordinated to achieve optimal control performance. Therefore, the reward function should include the objective function $J$ of the MPC module:

[0073]

[0074] In the formula, $P_s$ denotes a violation of the state constraints, and $w_p > 0$ is the penalty weight parameter that penalizes states violating the constraints. Here, $r_t(k_d)$ is used as an equivalent notation for $r(x_{rl}(k_d), u_{rl}(k_d))$ and denotes the reward at RL control step $k_d$ based on observations of real-time traffic conditions. To evaluate the MPC objective function $J$, the real-time traffic state $x(k_s)$ can be measured at each simulation sampling step, and the MPC output $u_s(k_s)$ can be obtained from the corresponding formula. Furthermore, $P_s(k_s)$ can be calculated directly from the traffic state $x(k_s)$, where $k_s = k_d m_2 + 1, \dots, k_d m_2 + m_2$. The reward is negative, so $R$ is a set of negative numbers.
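The reward formula itself is not recoverable from the text, so the sketch below uses an assumed form that matches the description: the negative sum, over the simulation steps of one RL control step, of the MPC stage objective plus a weighted constraint-violation term. The function name and the numbers are hypothetical.

```python
def rl_reward(objective_values, violations, w_p):
    """Negative sum of stage objectives plus weighted constraint violations.

    objective_values[i] and violations[i] refer to the i-th simulation
    sampling step inside the current RL control step (assumed form).
    """
    if w_p <= 0:
        raise ValueError("w_p must be positive")
    return -sum(j + w_p * p for j, p in zip(objective_values, violations))


# Two simulation steps (m2 = 2); the second one violates a state constraint.
reward = rl_reward(objective_values=[2.0, 3.0], violations=[0.0, 0.5], w_p=10.0)
```

Because the objective and the violation terms are non-negative in this sketch, the reward is never positive, in line with the statement that the reward is negative.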

[0075] A deep actor-critic training framework is adopted because lane changing and green wave traffic involve both discrete and continuous actions, and the Proximal Policy Optimization (PPO) algorithm is selected for the RL agent.

[0076] In reinforcement learning, the goal of a policy gradient algorithm is to optimize a parameterized policy $\pi_\theta(a \mid s)$ by updating the policy parameter $\theta$ to maximize the expected return. Following the policy gradient theorem, this objective can be expressed as:

[0077]

[0078] where the expectation is taken over samples drawn under the current policy, $\gamma$ is the discount factor, and $R_t$ is the cumulative reward starting from time step $t$. To optimize this objective, the gradient is calculated and used to update the policy:

[0079]

[0080] where $\log \pi_\theta(a_t \mid s_t)$ is the log-probability that the current policy selects action $a_t$ in state $s_t$, and $A_t$ is the advantage function, representing the advantage of the current action $a_t$ relative to a baseline. The baseline is usually estimated by a value function (critic). The advantage function $A_t$ measures the quality of the action taken relative to a reference value (such as the state value function $V(s_t)$) and can be calculated using the following formula:

[0081] $A_t = R_t - V(s_t)$ (12)

[0082] where $V(s_t)$ is the estimated value of state $s_t$, usually produced by a value network (critic).
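Equation (12) can be illustrated by computing discounted returns backward over a short trajectory and subtracting the critic's estimates. The function names, the discount factor, and the reward and value numbers are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Cumulative discounted reward R_t from each time step t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def simple_advantages(rewards, values, gamma):
    """A_t = R_t - V(s_t) for every step of the trajectory."""
    return [r - v for r, v in zip(discounted_returns(rewards, gamma), values)]


returns = discounted_returns([-1.0, -1.0, -1.0], gamma=0.5)
adv = simple_advantages([-1.0, -1.0, -1.0], values=[-2.0, -1.0, -1.0], gamma=0.5)
```

A positive advantage at a step means the sampled return was better than the critic expected, so the corresponding action is reinforced.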

[0083] S4. Collect data by interacting with the environment, train the DRL module using the Proximal Policy Optimization (PPO) algorithm, and update the parameters of the policy network and value network of the DRL module to continuously improve the control policy;

[0084] The PPO algorithm adds a clipped objective function to the traditional policy gradient algorithm to limit the magnitude of each policy update, thereby preventing instability caused by excessively large updates. The PPO objective combines two parts, the original policy gradient objective and a clipped term, and can be expressed as:

[0085]

[0086] where $\pi_\theta(a_t \mid s_t)$ is the probability that the new (current) policy selects action $a_t$ in state $s_t$, and $\pi_{\theta_{old}}(a_t \mid s_t)$ is the corresponding probability under the old policy (the policy before the update). $\varepsilon$ is a small hyperparameter, typically in the range [0.1, 0.3], that limits the magnitude of each policy update. The clip function restricts the probability ratio to the range $[1-\varepsilon, 1+\varepsilon]$ so that a policy update cannot become too large.

[0087] The ultimate goal of PPO is to maximize the clipped objective function $L^{CLIP}(\theta)$, that is:

[0088]

[0089] This optimization process involves multiple updates. In each update, the advantage function $A_t$ is calculated from data sampled under the current policy, and the objective function described above is then used to optimize the policy. This implementation is suitable for both offline training and embedding into online control systems, allowing the agent to gradually learn better policies by continuously updating the network parameters.
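The clipped surrogate described above has a well-known standard form, which the sketch below evaluates on a batch of samples in plain Python. The function name, the batch averaging, and the default `epsilon = 0.2` are assumptions for illustration; a real implementation would compute this with an automatic differentiation library so that gradients can be taken.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Batch mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)    # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage is capped at 1.2;
# ratio 0.5 with a negative advantage is evaluated at 0.8.
objective = ppo_clip_objective(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In both samples the clipped branch is the smaller one, so the objective gives no extra credit for moving the policy beyond the trust interval.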

[0090] The training process of the vehicle-cloud hierarchical PPO-MPC algorithm is shown in Algorithm 1. Both the policy network and the value network adopt a multi-layer feedforward neural network structure. The policy network outputs the mean and standard deviation of continuous actions (for reference trajectory parameters and queue grouping parameters) and the class probability of discrete actions (for lane-changing decisions). The main hyperparameter settings are as follows:

[0091] Table 1

[0092]

[0093] Algorithm 1

[0094] 1. Input: the policy network parameters $\theta$, the value network parameters $\phi$, and the training hyperparameters.

[0095] 2. Initialization: initialize the policy network $\pi_\theta$, the value network $V_\phi$, and the experience replay buffer $D$.

[0096] 3. Repeat training for episodes $e = 1, 2, \dots, M$:

[0097] a. Initialize the traffic environment, setting the initial traffic demand and signal status.

[0098] b. Loop over MPC control steps $k_c = 0, 1, \dots, T/T_c - 1$, solving the MPC optimization problem at each step.

[0099] c. Loop over PPO control steps $k_d = k_c m_1$ to $(k_c + 1) m_1 - 1$.

[0100] 4. Randomly sample mini-batch data from buffer D.

[0101] 5. Calculate the target value and generalized advantage estimate (GAE).

[0102] 6. Perform multiple PPO updates;

[0103] 7. Update the target network parameters;

[0104] 8. Output: Optimized policy parameters and value parameters

[0105] End
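The nested timing of the loops in Algorithm 1 can be sketched as a skeleton. Everything here is a simplified assumption: scalar states and inputs, the callables `mpc_solve`, `rl_act`, and `env_step`, and the toy stand-ins below are hypothetical, and the saturation of the combined input and the PPO update itself are omitted.

```python
def run_episode(initial_state, horizon_s, m1, m2, mpc_solve, rl_act, env_step):
    """Collect transitions with MPC, DRL, and simulation loops nested by timescale.

    horizon_s is the episode length in simulation sampling steps and is
    assumed to be a multiple of m1 * m2.
    """
    m = m1 * m2
    if horizon_s % m != 0:
        raise ValueError("horizon_s must be a multiple of m1 * m2")
    state = initial_state
    transitions = []
    for k_c in range(horizon_s // m):                  # MPC control steps
        u_b = mpc_solve(state)                         # low-frequency baseline input
        for k_d in range(k_c * m1, (k_c + 1) * m1):    # DRL control steps
            start_state = state
            u_rl = rl_act(start_state, u_b)            # high-frequency correction
            reward = 0.0
            for _ in range(m2):                        # simulation sampling steps
                state, step_reward = env_step(state, u_b + u_rl)
                reward += step_reward
            transitions.append((start_state, u_rl, reward, state))
    return transitions


calls = {"mpc": 0, "rl": 0, "sim": 0}


def toy_mpc(state):
    calls["mpc"] += 1
    return 1.0


def toy_policy(state, u_b):
    calls["rl"] += 1
    return 0.0


def toy_env(state, u):
    calls["sim"] += 1
    return state + u, -abs(u)


transitions = run_episode(0.0, horizon_s=12, m1=3, m2=2,
                          mpc_solve=toy_mpc, rl_act=toy_policy, env_step=toy_env)
```

With 12 simulation steps, m1 = 3, and m2 = 2, the MPC stand-in is called twice, the policy six times, and the environment twelve times, which mirrors the low-frequency and high-frequency roles described for the two modules.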

[0106] Algorithm 1 details the training process of the vehicle-cloud hierarchical PPO-MPC intersection control algorithm. This algorithm employs a multi-timescale control framework, where the MPC module runs at a lower frequency, providing the basic control strategy; the PPO module runs at a higher frequency, adjusting and optimizing the MPC output. The training process mainly includes four core steps:

[0107] The first step is environmental interaction and data collection. The algorithm interacts with the traffic environment through MPC and PPO, collecting tuples of state, action, reward, and next state transitions, and storing them in the experience replay buffer. This step ensures the diversity and representativeness of the training data.

[0108] The second step is the calculation of the advantage function. The algorithm uses the generalized advantage estimation (GAE) method to calculate the advantage value of each state-action pair. This method combines the low bias of n-step rewards with the low variance of temporal difference methods, which helps to improve training stability.

[0109] The third step involves updating the policy and value networks. The algorithm uses the clipped objective function of Proximal Policy Optimization (PPO) to update the policy network. This objective function avoids excessive policy changes by limiting the magnitude of policy updates, while maximizing policy performance. The value network is updated by minimizing the mean squared error between the predicted and target values.

[0110] The fourth step is target network update. The algorithm periodically updates the target network parameters, a mechanism that further improves the stability of training.
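
As a non-limiting illustration, a periodic update of the target network might be sketched in Python as follows. The description states only that the update is periodic, so the hard copy, the update period, and the representation of parameters as a dictionary of lists are assumptions made for this sketch.

```python
def maybe_update_target(online_params, target_params, step, period=10):
    """Copy the online parameters into the target every `period` steps.

    Both arguments map parameter names to lists of floats. Returns True
    when a copy was performed.
    """
    if step % period != 0:
        return False
    for name, value in online_params.items():
        target_params[name] = list(value)  # copy, so later online updates do not leak
    return True
```

Between copies the target parameters stay fixed, so the value targets do not shift with every gradient step.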

[0111] Through multiple rounds of iterative training, the algorithm gradually optimizes the parameters of the policy network and value network, ultimately obtaining a control strategy that can effectively coordinate traffic at intersections. This training process fully utilizes the model prediction capability of MPC and the adaptive learning capability of PPO, realizing multi-scale control optimization through vehicle-cloud collaboration.

[0112] Simulation experiments demonstrate that the proposed vehicle-cloud hierarchical DRL-MPC framework exhibits excellent generalization ability and optimization performance. It can significantly improve intersection traffic efficiency and reduce vehicle energy consumption while ensuring vehicle safety. This hierarchical framework, which integrates deep reinforcement learning and model predictive control, provides a new technical approach for traffic management in intelligent connected environments and lays a theoretical foundation for multi-vehicle cooperative control in subsequent research.

[0113] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the present invention, and all such modifications or substitutions should be covered within the protection scope of the present invention.

Claims

1. A multi-level collaborative control method for vehicle-cloud systems at signalized intersections, characterized in that it includes the following steps:

S1. Based on the traffic demand and spatiotemporal characteristics of signalized intersections, construct the operational framework of the intelligent vehicle cyber-physical system;

S2. Construct a vehicle-cloud hierarchical control platform: the physical layer uses SUMO to simulate road scenarios and loads vehicle dynamics models, while vehicle-side computing and control are implemented in Python; the information layer also uses Python to integrate SUMO and platform data, inputs the data into the cloud control application platform, solves for the optimal speed sequence through a predictive cruise algorithm, and outputs the results to the vehicle;

S3. Adopt a multi-level Deep Reinforcement Learning-Model Predictive Control (DRL-MPC) strategy to optimize traffic control at signalized intersections, in which the DRL adjusts the MPC output so that the two collaboratively optimize the control effect. The specific content of step S3 is as follows: First, the state space is defined; the state space of the DRL covers the current traffic conditions of the signalized intersection and the initial control signals output by the MPC module, which are normalized and then input into the deep neural network. Then, the action space is defined; the action space of the DRL has the same dimension as the MPC output, and each action element is initially limited to the range [-1, 1] and subsequently scaled to the appropriate range according to the actual control requirements. Next, the reward function is designed, taking into account traffic flow, vehicle waiting time, and energy consumption, while introducing a penalty term to avoid state constraint violations. Finally, a collaborative optimization mechanism is designed: the DRL module updates the policy network parameters based on the reward signal and adjusts the action output, so that MPC and DRL work together to improve the adaptability and robustness of the control system;

S4. Collect data by interacting with the environment, train the DRL module using the Proximal Policy Optimization (PPO) algorithm, and update the parameters of the policy network and value network of the DRL module to continuously improve the control policy.

2. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 1, characterized in that the specific content of step S1, constructing the operational framework of the intelligent vehicle cyber-physical system, is as follows: The intelligent vehicle cyber-physical system is divided into multiple levels from the perspectives of time and space to match the actual needs of traffic control at signalized intersections. Specifically, the system's hierarchical structure is determined based on the layout of signalized intersections, traffic flow distribution, and vehicle dynamics. Each layer corresponds to a specific functional module: the bottom layer is responsible for real-time vehicle data acquisition and preliminary processing, the middle layer performs local traffic flow optimization, and the top layer implements overall collaborative control strategies. Information is exchanged between the layers through standardized interfaces to ensure the modularity and scalability of the entire system, allowing the system's configuration and functions to be adjusted flexibly for different traffic scenarios and control requirements.

3. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 2, characterized in that the specific content of step S2, constructing the vehicle-cloud hierarchical control platform, is as follows: At the physical layer, SUMO software is used to create a realistic road network scenario and to define vehicle type parameters. The dynamics and kinematics model of the vehicle is integrated into the SUMO simulation environment; the model parameters include the vehicle's acceleration, deceleration, maximum speed, and turning radius, so as to accurately simulate the vehicle's driving state under different traffic conditions. The vehicle-side computing and control logic is implemented in Python, including acquisition and processing of vehicle sensor data and execution of control commands received from the cloud control application platform to control the vehicle's acceleration, deceleration, and lane changing. At the information layer, data fusion code written in Python integrates the vehicle status data and road condition data obtained from the SUMO simulator with the traffic flow data and traffic light status data obtained from the support platform. The data are cleaned, aligned, and fused to eliminate noise and inconsistencies. The fused data are then input into the cloud control application platform, which runs a predictive cruise algorithm solver to calculate the optimal speed sequence based on the current traffic conditions and vehicle status. The optimal speed sequence is then converted into specific vehicle control commands and sent back to the vehicle-side platform in the physical layer via a communication interface, enabling real-time collaborative control of the vehicle.

4. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 3, characterized in that: In step S2, the road network scenario includes road layout, number of lanes, location of signalized intersections, and phase timing.

5. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 4, characterized in that: In step S2, the vehicle type parameters include vehicle length, maximum speed, and acceleration and deceleration limits.

6. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 5, characterized in that the specific content of step S4 is as follows: First, the policy network outputs the mean and standard deviation of the continuous actions, from which actions following a Gaussian distribution are generated; the value network estimates the state value to provide a baseline for the calculation of the advantage function. Next, the experience replay buffer is managed: data generated from interactions with the environment are stored in the buffer, a first-in-first-out (FIFO) strategy is used to maintain a constant amount of data, and the data are updated periodically or in batches. Finally, the generalized advantage estimate (GAE) is calculated: combining n-step returns and temporal-difference methods, the advantage value is calculated using the GAE formula to smooth the advantage estimate and improve training stability.

7. The multi-level collaborative control method for vehicle-cloud systems at signalized intersections according to claim 6, characterized in that: In step S4, the data generated by interacting with the environment includes states, actions, and rewards.