[0021] As shown in Figure 1, a traffic signal control method based on deep reinforcement learning combined with state prediction includes the following steps:
[0022] Step 1: Model the intersection with SUMO. The intersection has two-way 6 lanes, and each lane is 500 m long; along the driving direction, the left lane is a left-turn lane, the middle lane is a through lane, and the right lane is a through and right-turn lane. The traffic data includes the vehicle generation times, the number of vehicles, and the travel trajectories. In the present invention, vehicle generation simulates real-life traffic conditions and therefore has engineering application value; the departure times follow a distribution whose probability density function is:
[0023] f(x; λ, k) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k), x ≥ 0 (1)
[0024] Among them, λ is the scale parameter, set to 1, and k is the shape parameter, set to 2. The simulation time of one round is 2 hours, and the number of vehicles is set to 1000, 2000, and 3000, corresponding to low, medium, and high traffic flow conditions, respectively. The vehicle length is 5 m, the maximum speed is 25 m/s, the maximum acceleration is 2 m/s², the maximum deceleration is 5 m/s², and the minimum gap between vehicles is 2.5 m. When driving, a vehicle goes straight with 70% probability, turns left with 15% probability, and turns right with 15% probability.
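As an illustration of the vehicle generation in Step 1, the following Python sketch samples departure times from the distribution of equation (1), assumed here to be a Weibull density, and draws turning movements with the stated probabilities; the rescaling to the 2-hour horizon and all function names are illustrative, not part of the invention.

```python
import numpy as np

# Vehicle type parameters from Step 1 (length 5 m, max speed 25 m/s, accel 2 m/s^2,
# decel 5 m/s^2, minGap 2.5 m) would go into the <vType> element of the SUMO .rou.xml file.

def generate_departure_times(n_vehicles, sim_duration=7200, lam=1.0, k=2.0, seed=0):
    """Sample departure times from the scale/shape distribution of eq. (1)
    (assumed Weibull) and rescale them to the 2-hour episode length."""
    rng = np.random.default_rng(seed)
    samples = np.sort(lam * rng.weibull(k, size=n_vehicles))
    return (samples / samples.max() * sim_duration).astype(int)

def sample_turning_movement(rng):
    """70% through, 15% left turn, 15% right turn, as specified above."""
    return rng.choice(["through", "left", "right"], p=[0.70, 0.15, 0.15])
```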
[0025] Step 2: In the present invention, the DRL model uses D3QN to train the signal control policy. Two networks are used: the current network selects the action corresponding to the maximum Q value, and the target network then evaluates the Q value of that action, so that the Q value of the selected action is not always the maximum. This alleviates overestimation of the Q value and the resulting over-optimistic model. The optimization target of the current network is expressed as:
[0026] Q_target = R + γ Q(s′, argmax_a′ Q(s′, a′; W); W⁻) (2)
[0027] Where R is the reward, γ is the discount factor, W denotes the parameters of the current network, and W⁻ denotes the parameters of the target value network. D3QN also optimizes the network structure by dividing the Q value of a state-action pair into two parts: one part represents the value function V(s) of the environment state itself, and the other part represents the additional value of selecting an action, called the advantage function A(s, a). The Q value can then be rewritten as:
[0028] Q(s, a) = V(s) + A(s, a) (3)
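To make the dueling structure of equation (3) and the Double-DQN target of equation (2) concrete, a minimal PyTorch sketch is given below; the class and function names are illustrative, and the width and depth arguments merely echo the hyperparameters listed in Step 4.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared trunk followed by separate V(s) and A(s, a) streams (eq. 3)."""
    def __init__(self, state_dim, n_actions, width=400, depth=5):
        super().__init__()
        layers, in_dim = [], state_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        self.trunk = nn.Sequential(*layers)
        self.value = nn.Linear(width, 1)              # V(s)
        self.advantage = nn.Linear(width, n_actions)  # A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        # Eq. (3): Q = V + A (many implementations additionally subtract A.mean()
        # for identifiability; that variant is not stated in the text).
        return self.value(h) + self.advantage(h)

def double_dqn_target(reward, next_state, gamma, q_net, target_net):
    """Eq. (2): select the action with the current network W, evaluate it with W-."""
    with torch.no_grad():
        best_a = q_net(next_state).argmax(dim=1, keepdim=True)
        q_next = target_net(next_state).gather(1, best_a)
        return reward + gamma * q_next
```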
[0029] Next, the three elements of reinforcement learning are defined: the state, the action, and the reward.
[0030] The state includes the number, speed, and acceleration of the vehicles in the road network. First, each lane is divided into several cells according to a certain distance ratio. Figure 2 shows the cell division of the intersection approach used for the state design, which contains the length of each cell. The two right-hand lanes (through and right-turn) are divided together as one group, and the left-turn lane is divided separately. The portion nearest the traffic light is divided into 5 cells of 7 m each, and the remainder is divided into cells of 10 m, 25 m, 40 m, 160 m, and 230 m. In this way, one approach direction of the intersection is divided into 20 cells, and the whole intersection into 80 cells. The number of vehicles, the average speed, and the average acceleration of each cell are calculated to form a count vector, a speed vector, and an acceleration vector; these three vectors constitute the state of the environment.
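The cell division described above can be turned into the count, speed, and acceleration vectors with a sketch like the following. It assumes SUMO's TraCI interface and that the vehicle IDs on one lane group have already been collected; the cell boundaries follow the 5 × 7 m / 10 / 25 / 40 / 160 / 230 m division (500 m in total).

```python
import numpy as np
import traci

# Cell lengths from the stop line outward: 5 x 7 m, then 10, 25, 40, 160, 230 m.
CELL_EDGES = np.cumsum([7, 7, 7, 7, 7, 10, 25, 40, 160, 230])  # 10 cells per lane group

def lane_group_state(veh_ids, lane_length=500.0):
    """Count, mean speed, and mean acceleration per cell for one lane group."""
    counts = np.zeros(len(CELL_EDGES))
    speeds = [[] for _ in CELL_EDGES]
    accels = [[] for _ in CELL_EDGES]
    for vid in veh_ids:
        dist_to_stop = lane_length - traci.vehicle.getLanePosition(vid)
        cell = int(np.searchsorted(CELL_EDGES, dist_to_stop))
        if cell < len(CELL_EDGES):
            counts[cell] += 1
            speeds[cell].append(traci.vehicle.getSpeed(vid))
            accels[cell].append(traci.vehicle.getAcceleration(vid))
    mean = lambda xs: float(np.mean(xs)) if xs else 0.0
    return counts, np.array([mean(s) for s in speeds]), np.array([mean(a) for a in accels])

# The full state concatenates these three vectors over all lane groups of the
# intersection (80 cells in total), as described in paragraph [0030].
```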
[0031] The action is to switch the state of the traffic light so that more vehicles pass through the intersection quickly. The action set is A = {NSG, NSLG, EWG, EWLG}, containing four actions, and each action is executed for 3 seconds. Among them, NSG indicates a green light for north-south through and right-turn traffic, NSLG indicates a green light for north-south left turns, EWG indicates a green light for east-west through and right-turn traffic, and EWLG indicates a green light for east-west left turns. For through and right-turn movements, the minimum green duration is set to 12 s and the maximum to 60 s; for left turns, the minimum green duration is 12 s and the maximum is 24 s. When switching between a green light and a red light, the Agent executes a 3 s yellow light, as sketched below.
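The green-time bounds in this paragraph can be enforced with a simple action mask. The sketch below assumes the elapsed green time of the current phase is tracked outside the agent; the dictionary keys and function name are illustrative.

```python
MIN_GREEN = {"NSG": 12, "NSLG": 12, "EWG": 12, "EWLG": 12}   # minimum green, seconds
MAX_GREEN = {"NSG": 60, "NSLG": 24, "EWG": 60, "EWLG": 24}   # maximum green, seconds
YELLOW = 3        # 3 s yellow phase inserted when switching between green and red
ACTION_STEP = 3   # each selected action lasts 3 s

def allowed_actions(current_phase, elapsed_green):
    """Mask the action set so green durations respect the 12 s / 60 s (24 s left-turn) bounds."""
    if elapsed_green < MIN_GREEN[current_phase]:
        return [current_phase]                                  # must keep the current green
    if elapsed_green >= MAX_GREEN[current_phase]:
        return [a for a in MIN_GREEN if a != current_phase]     # must switch (via yellow)
    return list(MIN_GREEN)                                      # any phase allowed
```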
[0032] The reward is the feedback the environment gives the Agent after it executes an action. The present invention defines the reward in terms of the queue length, the waiting time, the total delay of the vehicles, the number of vehicles passing through the intersection, and the total travel time through the intersection, expressed as:
[0033] R_{n+1} = α_1 L_n + α_2 W_n + α_3 D_n + α_4 N_n + α_5 T_n (4)
[0034] Where R_{n+1} represents the reward the Agent receives after executing the n-th action, L_n represents the queue length during the n-th action, W_n represents the waiting time of all vehicles, D_n represents the delay of all vehicles, N_n represents the number of vehicles passing through the intersection, T_n represents the total travel time of the vehicles passing through the intersection, and α_1, α_2, α_3, α_4, α_5 are weighting coefficients, set to -0.5, -0.25, -0.5, 1, and 0.5, respectively.
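Equation (4) amounts to a weighted sum of five per-step measurements; a minimal sketch (variable names are illustrative):

```python
ALPHA = (-0.5, -0.25, -0.5, 1.0, 0.5)   # weights for L_n, W_n, D_n, N_n, T_n (eq. 4)

def step_reward(queue_len, wait_time, delay, n_passed, travel_time):
    """R_{n+1} = a1*L_n + a2*W_n + a3*D_n + a4*N_n + a5*T_n."""
    terms = (queue_len, wait_time, delay, n_passed, travel_time)
    return sum(a * x for a, x in zip(ALPHA, terms))
```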
[0035] For multiple intersections, MARL is used to control the traffic signals: the traffic signal at each intersection is controlled by its own Agent, and cooperation among the agents is achieved through state information exchange and spatial discount factors. Taking a 2 × 2 grid road network as an example, each intersection is equivalent. For the top-left intersection, the input state of its Agent includes, in addition to the traffic information of the local intersection, the traffic information of the top-right and lower-left intersections, and its reward is expressed as a weighted sum of the rewards of all intersections:
[0036] R = β_1 R_tl + β_2 R_tr + β_3 R_ll + β_4 R_lr (5)
[0037] Where R represents the reward of the top-left Agent, R_tl, R_tr, R_ll, R_lr represent the rewards of the top-left, top-right, lower-left, and lower-right intersections, respectively, and β_1, β_2, β_3, β_4 are weighting coefficients, which the present invention sets to 0.5, 0.2, 0.2, and 0.1, respectively.
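The state sharing and the weighted reward of equation (5) can be sketched as follows, assuming (as described above) that the top-left agent's neighbours are the top-right and lower-left intersections; the dictionary keys are illustrative.

```python
import numpy as np

BETA = {"tl": 0.5, "tr": 0.2, "ll": 0.2, "lr": 0.1}   # weights from eq. (5)

def topleft_agent_input(states):
    """The top-left agent observes its own state plus its two neighbours' states."""
    return np.concatenate([states["tl"], states["tr"], states["ll"]])

def cooperative_reward(rewards):
    """R = b1*R_tl + b2*R_tr + b3*R_ll + b4*R_lr."""
    return sum(BETA[k] * rewards[k] for k in BETA)
```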
[0038] Step 3: An LSTM is used to predict the future microscopic state: the count vector, speed vector, and acceleration vector k time steps ahead are obtained from the prediction network. The current state is denoted S and the predicted state is denoted S_p. The optimization target of the optimal action-value function of the D3QN algorithm combined with the predicted state is:
[0039] Q_target = R + γ Q((s′, s′_p), argmax_a′ Q((s′, s′_p), a′; W); W⁻) (6)
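A minimal sketch of the LSTM state predictor of Step 3 is given below. It assumes the predicted state S_p is concatenated with the current state before being passed to the Q network, consistent with equation (6); the hidden size and layer count simply echo the hyperparameters listed in Step 4, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class StatePredictor(nn.Module):
    """LSTM that maps a short sequence of observed states to the state k steps ahead."""
    def __init__(self, state_dim, hidden=160, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, state_seq):              # state_seq: (batch, seq_len, state_dim)
        out, _ = self.lstm(state_seq)
        return self.head(out[:, -1])           # predicted state S_p

# Assumed usage: the D3QN input is the concatenation of current and predicted states,
#   s_aug = torch.cat([s, predictor(state_seq)], dim=-1)
```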
[0040] Step 4: D3QN uses experience replay to update the target value. The samples (s, a, r, s′) obtained from the interaction between the Agent and the environment are stored in the experience pool, small batches are sampled from the pool, and the deep neural network is trained with stochastic gradient descent so that it approximates the Q value. Random sampling breaks the strong correlation between samples so that training converges; the flow of experience replay is shown in Figure 3, and a sketch is given below. The DRL-related hyperparameters are set as follows: the number of training rounds is 400, the minimum size of the experience pool is 2000, the maximum size is 100,000, and the discount factor is 0.85. The Q network is a fully connected neural network using a mean squared error loss function and the Adam optimizer; its hyperparameters are set as follows: the depth is 5, the width is 400, the learning rate is 0.001, the batch size is 128, and the number of training iterations is 800. The LSTM prediction network uses binary cross entropy as the loss function and the Adam optimizer; its hyperparameters are set as follows: the number of cells is 6, the number of layers is 3, the number of neurons is 160, the batch size is 128, and the number of training iterations is 1.
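A sketch of the experience pool and one training step follows; it reuses the double_dqn_target sketch from Step 2 and is only an assumed implementation of the procedure described here, not the invention's exact code.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Experience pool storing (s, a, r, s') tuples up to the maximum size."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=128):
        # Random sampling breaks the strong correlation between consecutive samples.
        s, a, r, s2 = zip(*random.sample(self.buf, batch_size))
        to = lambda x, dt: torch.as_tensor(np.array(x), dtype=dt)
        return to(s, torch.float32), to(a, torch.int64), to(r, torch.float32), to(s2, torch.float32)

def train_step(q_net, target_net, buffer, optimizer, gamma=0.85):
    s, a, r, s2 = buffer.sample()
    target = double_dqn_target(r.unsqueeze(1), s2, gamma, q_net, target_net)  # eq. (2)
    q = q_net(s).gather(1, a.unsqueeze(1))
    loss = F.mse_loss(q, target)        # mean squared error loss; Adam optimizer assumed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```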
[0041] Step 5: The traffic data generated by SUMO is used to train the model, which is then tested and compared with the Webster method. The evaluation indices include the average waiting time T_wt, the average queue length L, the average travel time T_at, the average CO emissions D_CO, and the average CO2 emissions D_CO2, expressed as:
[0042]
[0043] Where n represents the total number of vehicles, T represents the simulation duration, WN_t represents the total number of waiting vehicles in the road network at time t, L_t represents the total queue length at time t, N_t represents the total number of running vehicles at time t, CO_t represents the total CO emissions of the road network at time t, and CO_{2,t} represents the total CO2 emissions of the road network at time t.
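The evaluation indices can be computed from per-time-step measurements. Since the exact formulas are not reproduced here, the sketch below shows one plausible reading of the definitions in this paragraph (waiting time, travel time, and emissions averaged over the n vehicles, queue length averaged over the simulation duration T) and should be read as an assumption, not the invention's stated formulas.

```python
import numpy as np

def evaluation_metrics(wn, queue, running, co, co2, n_vehicles):
    """Per-step series (one value per simulation second) -> the five averages of Step 5."""
    T = len(wn)
    T_wt  = np.sum(wn) / n_vehicles       # average waiting time per vehicle
    L_avg = np.sum(queue) / T             # average queue length
    T_at  = np.sum(running) / n_vehicles  # average travel time per vehicle
    D_co  = np.sum(co) / n_vehicles       # average CO emissions per vehicle
    D_co2 = np.sum(co2) / n_vehicles      # average CO2 emissions per vehicle
    return T_wt, L_avg, T_at, D_co, D_co2
```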
[0044] The present invention utilizes a simple and efficient discrete traffic state encoding (DTSE) and uses methods such as dynamic assignment, Kalman filtering, or neural networks to predict future traffic conditions, so that decisions can be made in advance, thereby shortening the waiting time of vehicles and improving the traffic efficiency of the road network. The invention has positive theoretical significance and application value for promoting the application of short-term traffic prediction and reinforcement learning techniques in intelligent traffic signal control.