Deep reinforcement learning traffic signal control method combined with state prediction

A reinforcement learning and traffic signal technology, applied to road vehicle traffic control systems and traffic signal control, that addresses the limited control effect of existing methods. It makes prediction easier, improves traffic efficiency, and reduces the amount of data.

Pending Publication Date: 2022-01-21
5 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, for the complex and changeable traffic flow in the actual scene, the optimal control strategy can only be obtained by integrating the current, historical and future states
[0004] Real traffic flow data has the characteris...

Method used

Step 2: The DRL model in the present invention adopts D3QN, which uses two DQNs to train the signal control strategy. The current network selects the action with the maximum Q value, and the Q value of that action is then read from the target network, so that the Q value used for the selected action is not always the largest one. This reduces overestimation of the Q value and alleviates overfitting of the model. The optimization goal of the current network is expressed as:
The present invention utilizes Discrete Traffic State Encoding (Discrete Traffic State Encodin...


The invention discloses a deep reinforcement learning traffic signal control method combined with state prediction. The method comprises the following steps: (1) modeling the road network environment and traffic flow data; (2) selecting a deep reinforcement learning algorithm and designing its three elements (state, action, and reward); (3) predicting the future traffic state; (4) training the model; and (5) testing the model experimentally. The method shortens vehicle waiting time and improves the traffic efficiency of the road network.

Application Domain

  • Controlling traffic signals
  • Machine learning

Technology Topic

  • Reinforcement learning algorithm
  • Engineering


  • Deep reinforcement learning traffic signal control method combined with state prediction



Example Embodiment

[0021] As shown in Figure 1, a deep reinforcement learning traffic signal control method combined with state prediction includes the following steps:
[0022] Step 1: Model the intersection with SUMO. The intersection has 6 lanes in both directions, and each lane is 500 m long. Along the direction of travel, the left lane is a left-turn lane, the middle lane is a straight lane, and the right lane is a straight and right-turn lane. The traffic data includes vehicle generation, simulation duration, number of vehicles, and travel trajectories. Vehicle generation in the present invention imitates real-life conditions, which gives the method engineering application value. The probability density function is:
[0024] Here, λ is the scale parameter, set to 1, and a is the shape parameter, set to 2. With a simulation time of 2 hours per episode, the number of vehicles is set to 1000, 2000, and 3000, corresponding to low, medium, and high flow conditions, respectively. The vehicle length is 5 m, the maximum speed is 25 m/s, the maximum acceleration is 2 m/s², the maximum deceleration is 5 m/s², and the minimum gap between vehicles is 2.5 m. Each vehicle goes straight with a probability of 70%, turns left with a probability of 15%, and turns right with a probability of 15%.
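The generation procedure can be sketched as follows. The density itself is missing from this text, so the sketch assumes it is a Weibull distribution, which is the usual distribution described by a scale parameter λ and a shape parameter a. The function name, the rescaling of the samples onto the 2-hour episode, and the seed are illustrative assumptions, not part of the patent.

```python
import random

def generate_departures(n_vehicles, episode_s=7200, scale=1.0, shape=2.0, seed=0):
    """Return sorted departure times (seconds) and a route choice per vehicle.

    Assumption: arrival timings follow a Weibull distribution (scale=1, shape=2)
    and are min-max rescaled onto the episode length.
    """
    rng = random.Random(seed)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    samples = sorted(rng.weibullvariate(scale, shape) for _ in range(n_vehicles))
    lo, hi = samples[0], samples[-1]
    span = (hi - lo) or 1.0
    departures = [round((x - lo) / span * episode_s) for x in samples]
    # 70% straight, 15% left turn, 15% right turn, as specified in the text.
    routes = rng.choices(["straight", "left", "right"], weights=[70, 15, 15], k=n_vehicles)
    return departures, routes

departures, routes = generate_departures(1000)  # low-flow condition
```

Calling the function with 2000 or 3000 vehicles would give the medium and high flow conditions.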
[0025] Step 2: In the present invention, the DRL model adopts D3QN, which uses two DQNs to train the signal control policy. The current network selects the action with the maximum Q value, and the Q value of that action is then read from the target network, so that the Q value used for the selected action is not always the largest one. This reduces overestimation of the Q value and alleviates overfitting of the model. The optimization target of the current network is expressed as:
[0027] Where r is the reward, γ is the discount factor, w denotes the parameters of the current network, and w^- denotes the parameters of the target network. D3QN also optimizes the network structure by splitting the Q value of a state-action pair into two parts. One part is the value function V(s) of the environment state itself; the other is the additional value of selecting an action, called the advantage function A(s, a). The Q value can then be rewritten as:
[0028] Q(s, a) = V(s) + A(s, a)    (3)
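A minimal sketch of the two D3QN ingredients described here. The target equation (2) is missing from this text, so the sketch assumes the standard double-DQN target y = r + γ Q(s', argmax_a Q(s', a; w); w^-), which matches the verbal description. The function names and the `done` flag are illustrative assumptions. Many dueling implementations also subtract the mean advantage; the sketch follows Eq. (3) as written and does not.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) as in Eq. (3): Q = V + A."""
    return [value + adv for adv in advantages]

def double_dqn_target(reward, gamma, q_next_current, q_next_target, done=False):
    """Double-DQN target: the current network picks the action,
    the target network supplies the Q value of that action."""
    if done:
        return reward
    best_action = max(range(len(q_next_current)), key=q_next_current.__getitem__)
    return reward + gamma * q_next_target[best_action]

# Q values of the four signal actions in the next state s' (made-up numbers).
q_current = dueling_q(1.0, [0.2, 0.9, -0.1, 0.0])   # current network, parameters w
q_target = dueling_q(0.8, [0.5, 0.1, 0.0, -0.2])    # target network, parameters w^-
y = double_dqn_target(reward=-1.0, gamma=0.85,
                      q_next_current=q_current, q_next_target=q_target)
```

In this example the current network prefers the second action, while the target network's largest Q value belongs to the first action, so the target uses a value that is not the target network's maximum. This is the mechanism that reduces overestimation.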
[0029] Next, the three elements of reinforcement learning are designed: state, action, and reward.
[0030] The state includes the number, speed, and acceleration of the vehicles in the road network. Each lane is first divided into cells according to a certain distance ratio. Figure 2 shows the state design for one approach of the intersection, including the length of each cell. The two right-hand lanes are divided together, and the left-turn lane is divided separately. Starting from the traffic light, the first section is divided into 5 cells of 7 m each, followed by cells of 10 m, 25 m, 40 m, 160 m, and 230 m, which covers the full 500 m lane with 10 cells. Each approach direction is thus divided into 20 cells, and the whole intersection into 80 cells. The number of vehicles, the average speed, and the average acceleration in each cell form a count vector, a speed vector, and an acceleration vector; these three vectors constitute the state of the environment.
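The cell layout can be sketched for a single lane group as follows. The cell lengths come from the text; the function names and the input format (one tuple of distance to the stop line, speed, and acceleration per vehicle) are illustrative assumptions.

```python
from bisect import bisect_right
from itertools import accumulate

# Cell lengths in metres, starting at the stop line: 5 x 7 m, then 10, 25, 40, 160, 230.
CELL_LENGTHS = [7, 7, 7, 7, 7, 10, 25, 40, 160, 230]
CELL_EDGES = list(accumulate(CELL_LENGTHS))  # far edge of each cell; last edge is 500

def cell_index(distance_to_stop_line):
    """Map a distance from the stop line (0 <= d < 500 m) to a cell index 0..9."""
    if not 0 <= distance_to_stop_line < CELL_EDGES[-1]:
        raise ValueError("vehicle is outside the 500 m approach")
    return bisect_right(CELL_EDGES, distance_to_stop_line)

def encode_lane_group(vehicles):
    """vehicles: iterable of (distance_m, speed_mps, accel_mps2) tuples.
    Returns count, mean-speed, and mean-acceleration vectors, one entry per cell."""
    n = len(CELL_LENGTHS)
    count, speed, accel = [0] * n, [0.0] * n, [0.0] * n
    for d, v, a in vehicles:
        i = cell_index(d)
        count[i] += 1
        speed[i] += v
        accel[i] += a
    for i in range(n):
        if count[i]:
            speed[i] /= count[i]
            accel[i] /= count[i]
    return count, speed, accel

count, speed, accel = encode_lane_group(
    [(3.0, 0.0, 0.0), (6.5, 2.0, 1.0), (120.0, 20.0, -1.0)]
)
```

Repeating this for the 2 lane groups of each of the 4 approaches and concatenating the results would give the 80-cell vectors described above. The short cells near the stop line give finer resolution where queues form.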
[0031] The action switches the state of the traffic light so that more vehicles pass through the intersection quickly. The action set A = {NSG, NSLG, EWG, EWLG} contains four actions, each executed for 3 seconds. NSG means green for north-south straight and right-turn traffic, NSLG means green for north-south left-turn traffic, EWG means green for east-west straight and right-turn traffic, and EWLG means green for east-west left-turn traffic. For the straight and right-turn phases, the green light lasts at least 12 s and at most 60 s; for the left-turn phases, it lasts at least 12 s and at most 24 s. When switching between green and red, the agent executes a 3 s yellow light.
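One way to enforce these timing rules is to mask the actions the agent may choose. The text gives the limits but not how they are enforced, so the masking logic and all function names below are assumptions made for illustration.

```python
ACTIONS = ("NSG", "NSLG", "EWG", "EWLG")
STEP_S = 3      # duration of one action
YELLOW_S = 3    # yellow light inserted when the green phase changes
MIN_GREEN_S = {"NSG": 12, "NSLG": 12, "EWG": 12, "EWLG": 12}
MAX_GREEN_S = {"NSG": 60, "NSLG": 24, "EWG": 60, "EWLG": 24}

def allowed_actions(current_phase, green_elapsed_s):
    """Actions that respect the minimum and maximum green durations."""
    if green_elapsed_s < MIN_GREEN_S[current_phase]:
        return [current_phase]                              # must keep the green
    if green_elapsed_s + STEP_S > MAX_GREEN_S[current_phase]:
        return [a for a in ACTIONS if a != current_phase]   # must switch
    return list(ACTIONS)

def apply_action(current_phase, green_elapsed_s, action):
    """Return (new_phase, new_green_elapsed_s, seconds_consumed)."""
    if action not in allowed_actions(current_phase, green_elapsed_s):
        raise ValueError(f"{action} violates the green-time limits")
    if action == current_phase:
        return current_phase, green_elapsed_s + STEP_S, STEP_S
    # A phase change costs one yellow interval before the new green step.
    return action, STEP_S, YELLOW_S + STEP_S
```

For example, `allowed_actions("NSLG", 6)` returns only `["NSLG"]`, because the left-turn green has not yet reached its 12 s minimum.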
[0032] The reward is the feedback the agent receives from the environment after executing an action. The present invention defines the reward as a weighted sum of the queue length, the waiting time, the total vehicle delay, the number of vehicles passing through the intersection, and the total passing time of vehicles through the intersection:
[0033] r_{n+1} = α_1 L_n + α_2 W_n + α_3 D_n + α_4 N_n + α_5 T_n    (4)
[0034] Where r_{n+1} is the reward the agent receives after executing the n-th action, L_n is the queue length during the n-th action, W_n is the waiting time of all vehicles, D_n is the delay of all vehicles, N_n is the number of vehicles passing through the intersection, and T_n is the sum of the passing times of vehicles through the intersection. α_1, α_2, α_3, α_4, α_5 are the weighting coefficients, set to -0.5, -0.25, -0.5, 1, and 0.5, respectively.
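Eq. (4) can be written directly as a small function. The weights are those given in the text; the function name and the input values are made up for illustration, and the text does not specify units or scaling of the individual terms.

```python
ALPHA = (-0.5, -0.25, -0.5, 1.0, 0.5)  # weights for L_n, W_n, D_n, N_n, T_n

def step_reward(queue_len, waiting_time, delay, n_passed, passing_time, alpha=ALPHA):
    """Weighted sum of the five traffic measurements, as in Eq. (4)."""
    terms = (queue_len, waiting_time, delay, n_passed, passing_time)
    return sum(a * x for a, x in zip(alpha, terms))

# -0.5*10 - 0.25*40 - 0.5*20 + 1*6 + 0.5*12
r = step_reward(queue_len=10, waiting_time=40, delay=20, n_passed=6, passing_time=12)
```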
[0035] For multiple intersections, MARL is used to control the traffic signals. The traffic signal at each intersection is controlled by one agent, and multi-agent collaboration is achieved through state information exchange and a spatial discount factor. Take a 2 × 2 grid road network as an example, in which all intersections are equivalent. For the upper-left intersection, the input state of its agent includes, in addition to the traffic information of the local intersection, the traffic information of the upper-right and lower-left intersections. The reward is the weighted sum of the rewards of all intersections:
[0036] R = β_1 r_tl + β_2 r_tr + β_3 r_ll + β_4 r_lr    (5)
[0037] Where R is the reward of the upper-left agent, and r_tl, r_tr, r_ll, r_lr are the rewards of the upper-left, upper-right, lower-left, and lower-right intersections, respectively. β_1, β_2, β_3, β_4 are the weighting coefficients, which the present invention sets to 0.5, 0.2, 0.2, and 0.1, respectively.
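A minimal sketch of Eq. (5) for the upper-left agent. The weights are from the text; the dictionary keys, the function name, and the local reward values are illustrative assumptions.

```python
# Spatial weights seen from the upper-left agent: own intersection 0.5,
# the two adjacent intersections 0.2 each, the diagonal one 0.1.
BETA = {"tl": 0.5, "tr": 0.2, "ll": 0.2, "lr": 0.1}

def cooperative_reward(local_rewards, beta=BETA):
    """Weighted sum of the local rewards of all four intersections, as in Eq. (5)."""
    return sum(beta[k] * local_rewards[k] for k in beta)

R = cooperative_reward({"tl": -13.0, "tr": -4.0, "ll": -6.0, "lr": 2.0})
```

Because the weights shrink with distance from the agent's own intersection, each agent still optimizes mainly for local traffic while being penalized for pushing congestion onto its neighbors.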
[0038] Step 3: An LSTM is used to predict the future microscopic state. The count vector, speed vector, and acceleration vector are predicted k time steps ahead, where the number of prediction steps k is obtained by the network. The current state is denoted s and the predicted state is denoted s_p. Combined with state prediction, the optimization target of the optimal action-value function under the D3QN algorithm is:
[0040] Step 4: D3QN uses experience replay to update the target value. The samples (s, a, r, s') obtained from the interaction between the agent and the environment are stored in the experience pool. Mini-batches are randomly sampled from the pool, and the deep neural network is trained with stochastic gradient descent so that it approximates the Q value. Random sampling breaks the strong correlation between samples so that training converges. The flow chart of experience replay is shown in Figure 3. The DRL hyperparameters are set as follows: the number of training episodes is 400, the minimum size of the experience pool is 2000, the maximum size is 100,000, and the discount factor is 0.85. The Q network is a fully connected neural network that uses the mean squared error loss function and the Adam optimizer, with the following hyperparameters: depth 5, width 400, learning rate 0.001, batch size 128, and 800 training iterations. The LSTM prediction network uses binary cross-entropy as the loss function and the Adam optimizer, with the following hyperparameters: number of cells 6, number of layers 3, number of neurons 160, batch size 128, and 1 training iteration.
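The experience pool can be sketched with the sizes given above (minimum 2000, maximum 100,000, batch size 128). The class and method names are illustrative assumptions; the patent does not describe an implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random sampling."""

    def __init__(self, min_size=2000, max_size=100_000, seed=None):
        self.min_size = min_size
        self.buffer = deque(maxlen=max_size)   # oldest samples are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def ready(self):
        return len(self.buffer) >= self.min_size

    def sample(self, batch_size=128):
        """Uniform random mini-batch; empty until the pool reaches min_size."""
        if not self.ready():
            return []
        return self.rng.sample(list(self.buffer), batch_size)

pool = ReplayBuffer(seed=0)
for i in range(2500):                 # placeholder transitions (s, a, r, s')
    pool.add(i, i % 4, -1.0, i + 1)
batch = pool.sample()
```

Sampling uniformly from the whole pool, rather than using consecutive transitions, is what breaks the correlation between samples mentioned in the text.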
[0041] Step 5: Use the SUMO-generated traffic data to train and test the model. The evaluation indices include the average waiting time t_wt with Webster, the average queue length L, the average travel time t_at, the average CO emissions D_co, and the average CO2 emissions, expressed as:
[0043] Where N is the total number of vehicles, T is the total duration, WN_t is the total number of stopped vehicles in the network at time t, L_t is the total queue length at time t, N_t is the total number of running vehicles at time t, CO_t is the total CO emissions in the network at time t, and CO_2t is the total CO2 emissions in the network at time t.
[0044] The present invention uses the simple and efficient Discrete Traffic State Encoding (DTSE) and predicts future traffic states with methods such as dynamic allocation, Kalman filtering, or neural networks to support decision-making, thereby shortening vehicle waiting time and improving the traffic efficiency of the road network. The invention has positive theoretical significance and application value for promoting short-term traffic prediction and reinforcement learning techniques in intelligent traffic signal control.

