Traffic signal timing method, device and equipment based on deep reinforcement learning
By pre-allocating and adjusting the green light time of traffic lights through deep reinforcement learning, the problem of traditional algorithms being unable to adapt to changes in traffic flow is solved, thereby reducing vehicle waiting time at intersections and enhancing training stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIHUA UNIV
- Filing Date
- 2023-04-18
- Publication Date
- 2026-06-23
AI Technical Summary
Traditional traffic light timing algorithms cannot reasonably allocate red and green light times according to actual traffic flow changes, leading to intersection congestion. Existing deep reinforcement learning methods suffer from difficult training convergence, local optima, and algorithm instability, and their reward function is not set reasonably.
The green light time for each phase is pre-allocated using deep reinforcement learning methods, and the green light duration is adjusted by interactive training with traffic conditions before the start of the green light time for each phase. A phase mechanism is added to adapt to changes in traffic flow, and the DDPG algorithm is used to optimize the green light time.
It effectively reduces vehicle waiting time at intersections, adapts to complex traffic flow changes, improves training stability and the reasonable allocation of green light duration, and reduces intersection congestion.
Figure CN116597670B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of traffic control technology, and more specifically to a traffic signal timing method, apparatus, and device based on deep reinforcement learning.
Background Technology
[0002] With the continuous development of the economy, vehicles are becoming increasingly common in people's daily lives, which brings more challenges to Traffic Signal Control (TSC). Traditional TSCs mostly use fixed timing algorithms, but fixed timing algorithms often cannot reasonably allocate traffic light times according to the actual traffic flow. Especially when traffic flow changes irregularly, fixed timing will not be able to meet the traffic demand, often leading to severe congestion at intersections.
[0003] Reinforcement learning algorithms can be categorized into policy-based (probability-based) algorithms, value-based algorithms, and actor-critic algorithms that combine policy and value learning. Existing technologies propose a policy-gradient-based reinforcement learning method to adjust traffic light timing schemes, using DQN to select the red and green light states. Other existing technologies employ an algorithm based on the deep deterministic policy gradient for traffic light timing. However, these solutions are difficult to converge during training, are prone to getting stuck in local optima, and lack stability. Deep reinforcement learning is widely used in traffic light control, but the design of the reward function remains controversial because vehicle delay time is a long-term quantity that cannot be used directly as a reinforcement learning reward. For example, existing technologies propose setting the reward as a weighted sum of various traffic performance indicators, but the weights cannot be quantified. Furthermore, most current traffic light control schemes either use phase switching mechanisms or allocate green light times directly, which reduces the effectiveness of vehicle delay control.
Summary of the Invention
[0004] The purpose of this invention is to provide a traffic signal timing method, apparatus, and device based on deep reinforcement learning. This invention pre-allocates the green light time for each phase based on the current lane queue length at the intersection, thereby effectively reducing the overall waiting time for vehicles at the intersection. However, the pre-allocated green light times for each phase can only be adjusted within a fixed period and cannot adapt to real-time changes in traffic flow. Therefore, this invention uses deep reinforcement learning to further adjust the phase green light duration before the start of each phase's green light time, based on the feedback from interactive training with traffic conditions, to achieve real-time prediction of the optimal green light time for the current phase. Furthermore, to better adapt to the uncertainty and complexity of traffic flow, this invention incorporates a phase mechanism in the state design to avoid situations where different intersections are in different phases with the same number of vehicles, thus making the training more stable. Therefore, this invention, through the above methods, can effectively adapt to the reasonable allocation of traffic signal green light durations under complex and changing traffic flow conditions, thereby reducing vehicle waiting time at intersections.
[0005] The above-mentioned technical objective of the present invention is achieved through the following technical solution:
[0006] A first aspect of this application provides a traffic signal timing method based on deep reinforcement learning, the method comprising:
[0007] Get the current queue length of vehicles at the intersection and the current phase of the lane at the intersection. An intersection includes four phases, and the phase represents the traffic status of the lane.
[0008] An optimization model for traffic lights at a single intersection is established with the objective function of minimizing vehicle delay time and queue length.
[0009] The state, action, and reward of the traffic light agent are preset. Based on the traffic light agent and the DDPG algorithm, the Markov decision model is initialized and trained.
[0010] Based on the current queue length of vehicles in each phase of the intersection, the green light time for each phase of the lane is pre-allocated proportionally.
[0011] The current state of the intersection is input into the trained Markov decision model. Based on the actions output by the Markov decision model, the green light time of each phase of the lane is adjusted. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection.
[0012] After the selected optimal green light time ends, obtain the current intersection status and reward value, and store them as experience in the experience pool. Extract a certain amount of experience from the experience pool and update the neural network of the Markov decision model based on the DDPG algorithm.
[0013] The green light time for each phase is adjusted based on the output of the updated Markov decision model.
[0014] In one implementation, the optimization model is expressed in terms of the following quantities: the delay time at the intersection; the queue length of vehicles at the intersection; n, the number of lanes at a single intersection; X, the overall control process of the traffic light agent; λ, the green ratio; T, the duration of the cycle in which the traffic light colors are displayed in turn; the green light switching of each phase; g, the effective green time of each phase; and the vehicle delay time.
[0015] In one implementation, the intersection delay time is expressed in terms of the following quantities: the index of the approach lane, starting from 1 for the first approach lane; the total number of approaches at the intersection; the traffic flow of each approach lane; the delay function; and the traffic flow.
[0016] In one implementation, the queue length of vehicles at the intersection is calculated from the following quantities: the index of the signal cycle; the index of the phase; the maximum number of phases; the number of lanes at a single intersection; the number of vehicles queued at time t in a given lane of the current phase; and the average distance between vehicles.
[0017] In one implementation, the preset state of the traffic light agent is expressed in terms of the following quantities: the current moment; the number of lanes at the current intersection; the current phase; and a normalization factor.
[0018] In one implementation, the preset action of the traffic light agent is expressed in terms of the following quantities: x, the green light time pre-allocated to each phase according to a set ratio; a time adjustment within a reasonable range; and α, the magnification factor.
[0019] In one implementation, the preset reward of the traffic light agent is expressed in terms of the following quantities: the queue length of the i-th lane after the green light period ends; and the number of lanes at a single intersection.
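As a rough illustration of the action and reward described above, the following Python sketch assumes that the agent outputs a bounded value in [-1, 1], that this value is scaled by the magnification factor α and added to the pre-allocated green time x, and that the reward is the negative of the total queue length over all lanes once the green period ends. The function names, the clamping bounds g_min and g_max, and the sign convention are assumptions made for illustration and are not taken from the patent text.

```python
def adjusted_green_time(x, raw_action, alpha, g_min=5.0, g_max=60.0):
    """Return the green time after adding the scaled adjustment to x.

    x           -- green time pre-allocated to the phase (seconds)
    raw_action  -- agent output, assumed to lie in [-1, 1]
    alpha       -- magnification factor applied to the raw action
    g_min/g_max -- hypothetical safety bounds on the final green time
    """
    raw_action = max(-1.0, min(1.0, raw_action))   # keep the action in range
    green = x + alpha * raw_action                 # shift the pre-allocated time
    return max(g_min, min(g_max, green))           # clamp to the allowed window


def queue_reward(queue_lengths):
    """Reward assumed to be the negative total queue length over all lanes."""
    return -float(sum(queue_lengths))
```

With these assumptions, a pre-allocated time of 20 s, a raw action of 0.5 and α = 10 yields 25 s of green, and shorter queues after the green period give a reward closer to zero.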
[0020] In one implementation, the neural network of the Markov decision model includes an action network and a critic network, wherein the network parameters of the action network are updated using a policy gradient algorithm, and the network parameters of the critic network are updated using a temporal difference algorithm.
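The critic update by temporal difference can be sketched as follows. This is a minimal, framework-free illustration: the discount factor gamma, the soft-update rate tau, and the use of slowly updated target parameters are standard DDPG practice assumed here, not values or details given in the text, and all function names are hypothetical. The policy-gradient step of the action network is not shown because it requires an automatic differentiation library.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step temporal-difference target for the critic."""
    return reward if done else reward + gamma * next_q


def critic_loss(q_values, targets):
    """Mean squared TD error over a mini-batch."""
    errors = [(q - y) ** 2 for q, y in zip(q_values, targets)]
    return sum(errors) / len(errors)


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

In a full implementation, the critic parameters would be adjusted to reduce `critic_loss`, and the action network would be adjusted in the direction that increases the critic's value estimate for the actions it outputs.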
[0021] A second aspect of this application provides a traffic signal timing device based on deep reinforcement learning, the device comprising:
[0022] The data acquisition module is used to obtain the queue length of vehicles at the intersection at the current time and the phase of the lane at the intersection at the current time. One intersection includes four phases, and the phase represents the traffic status of the lane.
[0023] The optimization model building module is used to build an optimization model for the traffic lights at a single intersection, with the objective function being to minimize the delay time of vehicles at the intersection and the length of the vehicle queue.
[0024] The preset module is used to preset the state, actions and rewards of the traffic light agent. Based on the traffic light agent and the DDPG algorithm, the Markov decision model is initialized and trained.
[0025] The green light time pre-allocation module is used to pre-allocate the green light time of each lane phase according to the queue length of vehicles in each phase of the intersection at the current time.
[0026] The green light time adjustment module is used to input the current state of the intersection into the trained Markov decision model, and adjust the green light time of each phase of the lane according to the action output by the Markov decision model. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection.
[0027] The network update module is used to obtain the current intersection status and reward value after the selected optimal green light time has ended, and store it as experience in the experience pool. It extracts a certain amount of experience from the experience pool and updates the neural network of the Markov decision model based on the DDPG algorithm.
[0028] The time adjustment module is used to adjust the green light time for each phase based on the output of the updated Markov decision model.
[0029] A third aspect of this application provides an electronic device including a memory and a processor, the memory storing a computer program, the processor invoking the computer program to perform the steps of the method described in the first aspect of this application.
[0030] Compared with the prior art, the present invention has the following beneficial effects:
[0031] The traffic signal timing method based on deep reinforcement learning provided by this invention effectively reduces the overall waiting time of vehicles at intersections by pre-allocating the green light time for each phase according to the queue length of the lanes at the current intersection. However, the pre-allocated green light times for each phase can only be adjusted within a fixed period and cannot adapt to real-time changes in traffic flow. Therefore, this invention uses deep reinforcement learning to further adjust the phase green light duration before the start of each phase's green light time by interacting with traffic conditions and receiving feedback, thereby achieving real-time prediction of the optimal green light time for the current phase. Furthermore, to better adapt to the uncertainty and complexity of traffic flow, this invention incorporates a phase mechanism in the state design to avoid situations where different intersections are in different phases with the same number of vehicles, thus making the training more stable. Therefore, this invention, through the above methods, can effectively adapt to the reasonable allocation of green light durations for traffic signals under complex and changing traffic flow conditions, thereby reducing vehicle waiting time at intersections.
[0032] Furthermore, the traffic signal timing device and equipment based on deep reinforcement learning provided in the second and third aspects of this application have the same technical effects as the aforementioned traffic signal timing method based on deep reinforcement learning, which will not be elaborated here.
Attached Figure Description
[0033] The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and form part of this application, do not constitute a limitation thereof. In the drawings:
[0034] Figure 1 is a flowchart of a traffic signal timing method based on deep reinforcement learning provided in an embodiment of this application;
[0035] Figure 2 is a phase diagram of the traffic lights at an intersection provided in an embodiment of this application;
[0036] Figure 3 is a flowchart of the traffic signal timing method based on deep reinforcement learning provided in an embodiment of this application;
[0037] Figure 4 is a schematic diagram of the action network structure provided in an embodiment of this application;
[0038] Figure 5 is a schematic diagram of the critic network structure provided in an embodiment of this application;
[0039] Figure 6 is a framework diagram of the update of the neural network of the Markov decision model provided in an embodiment of this application;
[0040] Figure 7 is a schematic diagram of the vehicle queue length under low traffic conditions provided in an embodiment of this application;
[0041] Figure 8 is a schematic diagram of the vehicle queue length under normal traffic conditions provided in an embodiment of this application;
[0042] Figure 9 is a schematic diagram of the vehicle queue length under congested traffic conditions provided in an embodiment of this application.
Detailed Implementation
[0043] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the embodiments and accompanying drawings. The illustrative embodiments and descriptions of the present invention are only used to explain the present invention and are not intended to limit the present invention.
[0044] Please refer to Figure 1, which is a flowchart of a traffic signal timing method based on deep reinforcement learning provided in an embodiment of this application. As shown in Figure 1, the method includes the following steps:
[0045] S110: Obtain the queue length of vehicles at the intersection at the current time and the phase of the lane at the intersection at the current time. One intersection includes four phases, and the phase represents the traffic status of the lane.
[0046] Specifically, TSC (Traffic Signal Control) mainly consists of three traditional color signals: red, yellow, and green. The cycle length of traffic signal control is the time required for each color of the traffic light to be displayed once. The cycle length can also be understood as the total duration for each phase to appear once. Phase settings aim to provide right-of-way at the intersection for traffic flows in different directions, thus solving traffic congestion problems. A traffic light phase includes one main control step and several transitional control steps. Each step represents the current state combination of traffic lights in each direction at the intersection. The main control step is the color combination that primarily controls vehicle passage, while the transitional steps are mainly to ensure the safety of vehicles that have already left the intersection during the transition from one phase to another. The phase control adopted at an intersection is determined based on the actual traffic flow conditions while ensuring safety. If the number of phases is too small, the right-of-way at the intersection cannot be effectively allocated, leading to traffic chaos and reduced traffic safety. If too many phases are designed, the overall capacity of the intersection decreases, and the waiting time for vehicles at the intersection increases. Therefore, in this embodiment of traffic signal control, four-phase control is adopted.
[0047] Specifically, as shown in Figure 2, the intersection consists of 12 lanes running east-west and north-south, each 300 meters long. Within a 150-meter radius of the traffic lights, the queue length, number of vehicles, and waiting time are detected. The innermost lane (L1) is for left turns and straight-ahead traffic, the middle lane (L2) is for straight-ahead traffic, and the outermost lane (L3) is for right turns and straight-ahead traffic. Traffic moves under traffic light control: a red light prohibits passage, a yellow light is a transitional signal, and a green light allows passage. Right-turning vehicles do not conflict with other lanes and are not controlled by the traffic lights, so they are not considered. The phase design employs the commonly used four-phase scheme shown in Figure 2, in which green circles indicate that the current lane is open to traffic and red circles indicate that the current lane is closed to traffic.
[0048] S120: Establish an optimization model for traffic lights at a single intersection, with the objective function being to minimize the delay time of vehicles at the intersection and the length of the vehicle queue.
[0049] In this embodiment, traffic signal control parameters refer to parameters that describe the movement process of an intersection, and these parameters can better describe the traffic signals. This embodiment first explains the time parameters of traffic, traffic flow parameters, and some commonly used traffic performance indicators.
[0050] Time parameters include cycle duration and green ratio. Cycle duration is the total time for all traffic light colors at a given intersection to be displayed once in turn, or equivalently the time between the start of a phase's green light and the start of its next green light; it is denoted by T. The green ratio is the proportion of each phase's effective green time to the cycle duration and is denoted by λ, so that λ = g/T, where g is the effective green duration, i.e., the time during which vehicles can proceed while the green light is on. The effective green time of each phase is determined based on the traffic volume.
[0051] Traffic flow parameters include the following. Traffic flow is the number of vehicles passing per unit time. Saturation flow rate is the maximum flow of vehicles that can cross the stop line of an approach lane during one continuous green signal period, denoted by S. Lane flow ratio is the ratio of actual traffic flow to saturation flow per unit time, denoted by y. Lane saturation ratio is the ratio of actual traffic flow to lane capacity.
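The two ratios defined above are simple quotients, which the following Python sketch makes concrete. The function names and the numeric values in the usage note are illustrative assumptions; only the definitions (effective green time over cycle duration T, and actual flow over saturation flow S) come from the text.

```python
def green_ratio(effective_green, cycle):
    """Green ratio: effective green time of a phase divided by cycle duration T."""
    if cycle <= 0:
        raise ValueError("cycle duration must be positive")
    return effective_green / cycle


def flow_ratio(flow, saturation_flow):
    """Lane flow ratio y: actual traffic flow divided by saturation flow S."""
    if saturation_flow <= 0:
        raise ValueError("saturation flow must be positive")
    return flow / saturation_flow
```

For example, a phase with 30 s of effective green in a 120 s cycle has a green ratio of 0.25, and a lane carrying 450 vehicles per hour against a saturation flow of 1800 vehicles per hour has a flow ratio of 0.25.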
[0052] Performance metrics include: Delay time, which is the difference between the time it takes for a vehicle to smoothly travel from one intersection to another and the time it takes in congested conditions; Stops, which refers to the number of times a vehicle stops in congested traffic; and Queue length, which is the distance from the stop line at the intersection to the end of the queue.
[0053] In the actual process of controlling traffic lights, the allocated green light time is linear, but the selection of which phase to execute is discrete. Therefore, the optimization model of the traffic lights at a single intersection is a mixed integer programming problem. Thus, this embodiment uses the minimum delay time of vehicles at the intersection and the minimum queue length of vehicles as the objective function to establish an optimization model of the traffic lights at a single intersection, thereby optimizing the timing of the traffic lights at the intersection.
[0054] S130: Preset the state, action, and reward of the traffic light agent; initialize and train the Markov decision model based on the traffic light agent and the DDPG algorithm.
[0055] This embodiment employs the DDPG algorithm, a deep reinforcement learning algorithm extended here with a pre-allocation mechanism, to solve the real-time timing problem of the optimization model in step S120. Because traffic flow changes constantly, designing an adaptive controller agent that can cope with the dynamic traffic environment is a key challenge in intersection signal control. Selecting a timing action in the current state of the intersection affects the next state, and the next state depends only on the current state and the action taken. Therefore, traffic signal control can be abstracted as a multi-stage sequential decision problem, namely a Markov decision process (MDP), which can be solved using reinforcement learning algorithms. An MDP can be represented by a state set S (the finite set of system states), an action set A (the actions that can be performed in a given state), and a reward R (the reward given by the environment after the system performs an action in a given state). The reward can be positive or negative and is weighted by a decay factor. Because traffic flow is complex, reinforcement learning is well suited as a heuristic for finding a near-optimal solution. The process is as follows: the controller agent decides on an action based on the current traffic state; the action changes the environment, and the environment returns a reward to the controller; the controller adjusts its network parameters based on the reward. After training, the controller agent can adjust the phase sequence or the green light time according to the traffic state, thereby minimizing the average delay time of vehicles.
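The interaction process described above (observe state, choose action, receive reward) can be sketched with a toy environment. `ToyIntersection` is a hypothetical stand-in for a traffic simulator: its service rate of one vehicle per 2 s of green, its random arrivals, and its fixed phase order are assumptions chosen only to make the loop runnable, and they do not model the intersection of this embodiment.

```python
import random


class ToyIntersection:
    """Hypothetical stand-in for a traffic simulator with four phases."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.queues = [self.rng.randint(0, 10) for _ in range(4)]
        self.phase = 0

    def state(self):
        # State: queue of each phase plus the index of the current phase.
        return tuple(self.queues) + (self.phase,)

    def step(self, green_time):
        # Vehicles served in the active phase: assumed one per 2 s of green.
        served = int(green_time // 2)
        self.queues[self.phase] = max(0, self.queues[self.phase] - served)
        # Random arrivals on every phase while the green runs.
        self.queues = [q + self.rng.randint(0, 2) for q in self.queues]
        reward = -sum(self.queues)          # shorter queues give a higher reward
        self.phase = (self.phase + 1) % 4   # fixed phase order
        return self.state(), reward


def run_episode(env, policy, steps=8):
    """The agent observes the state, acts, and receives next state and reward."""
    total = 0.0
    state = env.state()
    for _ in range(steps):
        action = policy(state)              # green time chosen by the agent
        state, reward = env.step(action)
        total += reward
    return total
```

A fixed policy such as `lambda state: 20.0` can be passed to `run_episode` as a baseline; a learning agent would replace it with the output of its trained network.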
[0056] S140: Pre-allocate the green light time for each phase of the lane according to the queue length of vehicles in each phase of the intersection at the current time.
[0057] A two-way, three-lane intersection is taken as the environment, and the traffic lights are controlled phase by phase. As shown in Figure 3, north-south traffic proceeds first: the north-south light is green, while the lights for the other lanes are red (vehicles are prohibited from passing). To simplify the environment, the light switches directly from red to green without a yellow transition. The phase execution process is shown in Figure 3. This embodiment uses a DRL-based framework to solve the traffic signal timing problem, abstracting the traffic signal control process at intersections into a Markov decision model and employing a suitable algorithm to achieve adaptive control. Although traditional DRL methods can also be applied to traffic signal timing, their training may fail to converge because they set the traffic light timing directly, which can make training unstable and easily leads to local optima or to congestion in some lanes during training. To solve this problem, this embodiment proposes a DRL framework with pre-allocated phase green light time. During the control process, the current state (including the queue length of each lane and the current phase) is fed back to the controller agent. Based on this information, the controller agent adjusts the green light time with the objective of maximizing the predicted total future reward (i.e., minimizing the average future queue length of vehicles) to achieve the optimal control effect.
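The proportional pre-allocation of step S140 can be sketched as follows. This is one possible reading under stated assumptions: the green time available in a cycle is split across phases in proportion to their queue lengths, and a minimum green time per phase (`min_green`, a hypothetical parameter not mentioned in the text) prevents a phase with an empty queue from receiving no green at all.

```python
def preallocate_green(queue_by_phase, total_green, min_green=5.0):
    """Split total_green across phases in proportion to their queue lengths.

    queue_by_phase -- queue length (or vehicle count) for each phase
    total_green    -- green time available in one cycle (seconds)
    min_green      -- hypothetical floor so that no phase is starved
    """
    n = len(queue_by_phase)
    if total_green < n * min_green:
        raise ValueError("total_green too small for the minimum green times")
    total_queue = sum(queue_by_phase)
    if total_queue == 0:
        return [total_green / n] * n        # no demand: share the time equally
    spare = total_green - n * min_green     # time left after the floors
    return [min_green + spare * q / total_queue for q in queue_by_phase]
```

With queues of 10, 30, 20 and 40 vehicles and 120 s of green to distribute, this sketch returns 15, 35, 25 and 45 s. The reinforcement learning agent would then adjust each of these values before the corresponding phase starts.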
[0058] S150: Input the current state of the intersection into the trained Markov decision model, and adjust the green light time of each phase of the lane according to the action output by the Markov decision model. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection.
[0059] S160: After the selected optimal green light time ends, obtain the current intersection status and reward value, and store them as experience in the experience pool. Extract a certain amount of experience from the experience pool and update the neural network of the Markov decision model based on the DDPG algorithm.
[0060] For steps S150-S160, as shown in Figure 3, before the start of each cycle the green light duration of each phase is pre-allocated in proportion to the number of vehicles in the current intersection lanes. The controller agent takes the traffic environment (the queue length of each lane) at the moment just before each phase starts as input and selects an action (an adjustment to the time allocation) through its policy. The environment then transitions to the state at the next moment according to the executed action and returns a reward to the agent. The controller agent uses this new state and reward to choose its next action, and stores the resulting experience in an experience pool for policy training. Once the pool holds a certain amount of experience, the agent begins sampling from it to learn and update the policy. Through this continuous learning, an optimal timing scheme and the corresponding optimal policy can be found for the current state. In actual operation, the lane sensors feed the current state (the current queue length of each lane and the current phase) back to the controller, which pre-allocates the green light time based on that state and then takes an action (increasing or decreasing the green light time) to maximize the total future reward (i.e., to minimize the queue length).
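The experience pool described above behaves like a fixed-capacity replay buffer. The sketch below is a minimal version using only the Python standard library; the class name, the capacity, and the tuple layout (state, action, reward, next state) are assumptions for illustration.

```python
import random
from collections import deque


class ExperiencePool:
    """Fixed-capacity buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10000, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest entries are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def ready(self, batch_size):
        # Learning starts only once enough experience has been collected.
        return len(self.buffer) >= batch_size

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored experience.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

In a training loop, `store` would be called after each green period ends, and `sample` would be called once `ready` returns true to obtain a mini-batch for the network update.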
[0061] S170: Adjust the green light time for each phase based on the output of the updated Markov decision model.
[0062] In this embodiment, it is common knowledge for those skilled in the art to adjust the optimal green light time for each phase based on the output of the updated Markov decision model, and no further explanation is required here.
[0063] In summary, the traffic signal timing method based on deep reinforcement learning provided in this embodiment effectively reduces the overall waiting time of vehicles at intersections by pre-allocating the green light time of each phase according to the queue length of the lanes at the current intersection. However, the pre-allocated green light time of each phase can only be adjusted within a fixed period and cannot adapt to real-time changes in traffic flow. Therefore, this invention uses deep reinforcement learning to further adjust the phase green light duration before the start of each phase's green light time by interacting with traffic conditions and receiving feedback, thereby achieving real-time prediction of the optimal green light time for the current phase. Furthermore, to better adapt to the uncertainty and complexity of traffic flow, this invention incorporates a phase mechanism in the state design to avoid situations where different intersections are in different phases with the same number of vehicles, thus making the training more stable. Therefore, this invention, through the above methods, can effectively adapt to the reasonable allocation of green light durations in complex and changing traffic flow conditions, thereby reducing vehicle waiting time at intersections.
[0064] In one embodiment, the optimization model is expressed in terms of the following quantities: the delay time at the intersection; the queue length of vehicles at the intersection; n, the number of lanes at a single intersection; X, the overall control process of the traffic light agent; λ, the green ratio; T, the duration of the cycle in which the traffic light colors are displayed in turn; the green light switching of each phase; g, the effective green time of each phase; and the vehicle delay time.
[0065] This embodiment takes the single intersection shown in Figure 2 as an example. An excessively long vehicle queue at an intersection is likely to cause traffic congestion, so a reasonable timing scheme can effectively prevent intersection congestion. This embodiment defines an average vehicle queue length model in terms of the following quantities: the index of the signal cycle; the index of the phase; the maximum number of phases; the number of lanes at a single intersection; the number of vehicles queued at time t in a given lane of the current phase; and the average distance between vehicles. The sum of the average queue lengths after each phase has been served is taken as the objective function.
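The queue length model above combines the number of queued vehicles per lane with the average distance between vehicles. As a hedged sketch, the code below assumes that a lane's queue length is the product of those two quantities and that the average is taken over the lanes of the intersection; the exact expression of this embodiment is not reproduced here, so this is an illustrative reading only, and the function names are hypothetical.

```python
def lane_queue_length(queued_vehicles, avg_spacing):
    """Queue length of one lane in meters (vehicles times average spacing)."""
    return queued_vehicles * avg_spacing


def average_queue_length(queued_by_lane, avg_spacing):
    """Mean queue length over all lanes of a single intersection."""
    if not queued_by_lane:
        raise ValueError("at least one lane is required")
    total = sum(lane_queue_length(v, avg_spacing) for v in queued_by_lane)
    return total / len(queued_by_lane)
```

For instance, three lanes with 4, 6 and 2 queued vehicles and an assumed spacing of 7.5 m give an average queue length of 30 m.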
[0066] The number of vehicles queued at time t in a given lane of the current phase is expressed in terms of the following quantities: an indicator of whether the approach lane of a phase has a green light, equal to 1 if the light is green and 0 otherwise; the traffic flow of an approach lane in a given signal cycle; the effective green time of each phase in a given cycle; the number of delayed vehicles at each phase's approach in a given signal cycle; and the number of vehicles entering the approach at time t. This embodiment does not consider the yellow light duration of the traffic lights at the intersection. Accordingly, for a single intersection with a given number of phases, the decision variables are collected into an optimization vector.
[0067] In a further embodiment of this invention, the intersection delay time is expressed in terms of the following quantities: the total delay time over all roads; the index of the approach lane, starting from 1 for the first approach lane; the total number of approaches at the intersection; the traffic flow of each approach lane; the delay function; the traffic flow; and the green ratio.
[0068] The average delay of vehicles at each approach lane of an intersection is given by a classic formula.
[0069] Its terms are: the saturation ratio; k, the index of the approach lane of the intersection; the vehicle delay time; the duration of the signal cycle; the green ratio; and the saturation traffic flow. From this formula, an optimization objective for a single intersection, denoted P0, is obtained.
[0070] Problem P0 is subject to (s.t.) its own constraint conditions.
[0071] By optimizing the green ratio and cycle duration parameters so as to minimize the total vehicle delay time, an optimization model for traffic signals at a single intersection is established, with the average vehicle delay time and the average queue length as the optimization objectives. The corresponding problem, denoted P1, is likewise subject to a set of constraint conditions.
[0072] In P1, one term denotes the delay at the intersection and the other denotes the average queue length of vehicles at the intersection. The objective is minimized by optimizing the green ratio, the cycle duration, the green light switching of each phase, and the green light duration of each phase; that is, the average vehicle delay time and the vehicle queue length are minimized to achieve the control objective. In the actual process of controlling traffic lights, the allocated green light time is linear, but the selection of which phase to execute is discrete. Therefore, problem P1 is an NP-hard mixed integer programming problem, and solving the optimization model is formulated as an MDP. Environment variables such as state, action, and reward are set, and the DDPG algorithm is used for training and learning to solve the objective function of the optimization model.
[0073] In one embodiment, the expression for the state of the preset traffic light agent is: , where the terms denote, respectively, the current time, the number of lanes at the current intersection, the current phase, and the normalization factor.
[0074] In traffic signal control, existing technologies use an image of the current intersection as the state. However, because cameras are installed at different heights and angles, training an effective model from images is difficult. This embodiment therefore builds the state from data that detectors can measure directly, such as the queue length of vehicles in each lane, which reflects the actual situation at the intersection more effectively. If the state contained queue lengths alone, two situations with identical lane queue lengths but different phases would be indistinguishable to the agent, even though the same action could return different reward values in each. To avoid this, this embodiment adds a phase design (that is, the state specifies the current phase). The state of the traffic light agent is thus defined as: , where the terms denote, respectively, the current time, the number of lanes at a single intersection, the current phase, and the normalization factor.
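The state design described above can be illustrated with a minimal Python sketch. The function name `build_state`, the normalization factor of 50, and the encoding of the phase as an index scaled to [0, 1] are assumptions made only for illustration; the source gives the state expression as a formula that is not reproduced here. The clipping range [0, 2] follows the normalization described for the network inputs in this embodiment.

```python
def build_state(queue_lengths, phase, norm=50.0, num_phases=4):
    """Build a state vector from per-lane queue lengths and the current phase.

    norm and the phase encoding are illustrative assumptions, not values
    taken from the patent text.
    """
    # Normalize each lane's queue length and clip it to [0, 2].
    scaled = [min(max(q / norm, 0.0), 2.0) for q in queue_lengths]
    # Append the current phase so that identical queues in different
    # phases map to different states.
    return scaled + [phase / (num_phases - 1)]


same_queues = [25, 100, 0, 200]
print(build_state(same_queues, phase=0))
print(build_state(same_queues, phase=3))
```

The two printed states differ only in their last entry, which is what lets the agent distinguish otherwise identical queue situations.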
[0075] In one embodiment, the expression for the action of the preset traffic light agent is: $t = x + \alpha a$, where $x$ denotes the green light time pre-allocated to each phase in proportion, $a$ denotes the normalized output value that is to be scaled to a reasonable range, and $\alpha$ denotes the amplification factor.
[0076] In this embodiment, the action is the duration of each phase. Because traffic flow changes in real time, allocating time with a fixed cycle cannot adapt to changing demand, so a variable-cycle method is used. Starting from the phase time already allocated before the start of the cycle, the phase time is adjusted according to the current state, which provides a more stable way to adjust the green light time. The agent outputs the adjustment to be applied to the duration of the next phase. Because the output action is normalized, the raw output is a value in the range [-1, 1]. This embodiment multiplies it by a coefficient $\alpha$ to scale it to a reasonable range, and then adds the time allocated to the phase before the cycle began to obtain $t$, the effective green light duration. The action of the traffic light agent is therefore defined as $t = x + \alpha a$, where $x$ denotes the green light time pre-allocated to each phase in proportion, $a$ denotes the normalized output value to be scaled, and $\alpha$ denotes the amplification factor. For example, since the action output lies in [-1, 1], multiplying it by 20 gives a green light time adjustment in [-20, 20].
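As a rough illustration of this action mapping, the following Python sketch scales a normalized output in [-1, 1] by a factor of 20, as in the example above, and adds it to the pre-allocated green time. The function name `effective_green` and the lower bound `min_green` are hypothetical additions for the sketch; the source does not state a minimum green time.

```python
def effective_green(pre_allocated, action, alpha=20.0, min_green=5.0):
    """Map a normalized action in [-1, 1] to an effective green duration.

    pre_allocated: green time assigned to the phase before the cycle starts.
    alpha: amplification factor (20 in the example given in the text).
    min_green: hypothetical safety floor, not part of the source.
    """
    # Guard against outputs outside the normalized range.
    a = max(-1.0, min(1.0, action))
    # Pre-allocated time plus the scaled adjustment.
    return max(min_green, pre_allocated + alpha * a)


print(effective_green(30.0, 0.5))
print(effective_green(30.0, -1.0))
```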
[0077] In one embodiment, the expression for the reward of the preset traffic light agent is: , where the first term denotes the queue length of the i-th lane after the green light period ends, and n denotes the number of lanes at a single intersection.
[0078] In this embodiment, to better fit real traffic scenarios, the reward is defined in terms of the average queue length over the lanes rather than the waiting time. In existing technologies, control methods based on the DQN algorithm set the reward to the waiting time. In real traffic, however, vehicle waiting times are difficult to calculate. For example, if a vehicle stops while approaching an intersection, moves forward, and then stops again, its actual waiting time is the sum of the two stops. Queue length is easier to measure and better reflects both the current intersection status and the quality of an action. The average queue length is measured at the end of the current green light time, which follows the end of the previous phase, and it intuitively reflects the quality of each action. The reward of the traffic light agent is therefore defined as: , where the first term denotes the queue length of the i-th lane after the green light period ends, and n denotes the number of lanes at a single intersection.
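A minimal Python sketch of a queue-based reward follows. The negative sign (so that shorter queues give a higher reward), the normalization factor of 50, and the function name `queue_reward` are assumptions for illustration; the source states only that the reward is built from the average queue length and that rewards are normalized and clipped to [-1, 1].

```python
def queue_reward(queue_lengths, norm=50.0):
    """Reward from the average per-lane queue length after a green period.

    The sign convention and norm are illustrative assumptions.
    """
    avg_queue = sum(queue_lengths) / len(queue_lengths)
    # Negate so that minimizing queues maximizes reward, then clip.
    return max(-1.0, min(1.0, -avg_queue / norm))


print(queue_reward([10, 20, 30, 40]))
```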
[0079] In one embodiment, the neural network of the Markov decision model includes an action network and a critic network, wherein the network parameters of the action network are updated using a policy gradient algorithm, and the network parameters of the critic network are updated using a temporal difference algorithm.
[0080] Specifically, since the goal of the Markov decision model is to minimize the vehicle queue length, it must output a continuously adjustable green light duration. The DDPG algorithm in reinforcement learning can output continuous actions. DDPG extends the DQN approach to continuous action spaces: DQN's actions are discrete and it cannot handle high-dimensional action problems, whereas DDPG outputs a continuous time t rather than a discrete action. The DDPG algorithm is based on the Actor-Critic (AC) framework, in which the algorithm is divided into two parts: the Actor (policy network) and the Critic (value network).
[0081] For the value function of the value network, the goal of traffic signal control is to find a policy that maximizes the expected long-term return. The essence of reinforcement learning is to find the policy that maximizes the expected return; by doing so, reasonable traffic light timings can be produced, thereby alleviating future traffic congestion. The return is given by: ; the discount factor $\gamma$ in it is an important parameter in training the Critic network, because it determines how much future rewards count. If $\gamma$ is 0, training focuses only on the most recent reward. Most current reinforcement learning methods set $\gamma$ close to 1, but in the experiments of this embodiment, setting $\gamma$ to 0.75 was found to suit the current lane model best.
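The effect of the discount parameter can be seen in a small Python sketch that computes a discounted return for a short reward sequence. The function name `discounted_return` and the sample rewards are illustrative; the value 0.75 is the setting reported above.

```python
def discounted_return(rewards, gamma=0.75):
    """Sum of rewards weighted by increasing powers of gamma."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.75))  # future rewards count, but less
print(discounted_return(rewards, gamma=0.0))   # only the first reward counts
```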
[0082] In practice, this embodiment employs an adaptive control strategy that adjusts the signal timing in real time. First, the parameters are randomly initialized and an initial state s is obtained. Over a period of time, the algorithm then adjusts the green light duration of each phase on the basis of the pre-allocated phase times. To account for vehicles that may still arrive, reinforcement learning is used to adjust the green light time of each phase, which adapts effectively to changes in traffic flow and achieves efficient control. The algorithm flow is shown in Figure 3. Before each cycle begins, the execution time of each phase is pre-allocated according to the queue length of each lane: $t_p = \frac{q_p}{\sum_{i=1}^{n} q_i} \cdot T$, where $t_p$ denotes the pre-allocated green light duration of the current phase $p$, $q_p$ denotes the queue length of vehicles in the lanes corresponding to the current phase $p$, $n$ denotes the number of lanes, $q_i$ denotes the queue length of vehicles in lane $i$, and $T$ is a predefined cycle time.
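The proportional pre-allocation step can be sketched in Python as follows. This is a sketch under stated assumptions: each phase is represented by the total queue length of its lanes, the function name `preallocate_green` is hypothetical, and the equal split used when all queues are empty is an added fallback that the source does not describe.

```python
def preallocate_green(phase_queues, cycle_time):
    """Split a predefined cycle time across phases in proportion to queues.

    phase_queues: total queue length of the lanes served by each phase.
    cycle_time: the predefined cycle time T.
    """
    total = sum(phase_queues)
    if total == 0:
        # Assumed fallback: no queued vehicles, so share the cycle equally.
        return [cycle_time / len(phase_queues)] * len(phase_queues)
    return [cycle_time * q / total for q in phase_queues]


print(preallocate_green([10, 20, 30, 40], cycle_time=100))
```

The pre-allocated times always sum to the cycle time; the learned adjustment is then applied to each phase just before it starts.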
[0083] Furthermore, because the state space is very large, a neural network is needed for function approximation. The structures of the Actor network and the Critic network included in the neural network are shown in Figure 4 and Figure 5. This embodiment uses the phase and the queue lengths as state information and normalizes them appropriately before input. To prevent gradient explosion, all states are normalized and clipped to [0, 2]. Similarly, this embodiment normalizes and clips the reward and its d to [-1, 1] to stabilize minibatch updates. The neural network structure of the Actor network is shown in Figure 4, using ... as the activation function of the output layer. The neural network structure of the Critic network is shown in Figure 5.
[0084] The DDPG algorithm includes an Actor network and a Critic network. Each of these has a corresponding target network, so the DDPG algorithm comprises four networks: the Actor network, the Critic network, the Target Actor network, and the Target Critic network. The algorithm update is shown in Figure 6; the main updates are to the parameters of the Actor and Critic networks. The Actor network is updated by maximizing the cumulative expected reward, while the Critic network is updated by minimizing the error between the evaluated value and the target value. To eliminate sample correlation, the P-DDPG model employs an experience pool (Replay Buffer). Each transition $(s, a, r, s')$ sampled during the learning process is stored in the Replay Buffer. During the training phase, a batch of data is sampled from the Replay Buffer. Let a single sampled data point be $(s, a, r, s')$. The Actor and Critic network update process is shown in Figure 6.
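A minimal experience pool of the kind described here can be sketched in Python with the standard library. The class name `ReplayBuffer`, its capacity, and the tuple layout (state, action, reward, next state) are illustrative assumptions rather than details confirmed by the source.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience pool with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal
        # correlation between consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buf.push(step, 0.0, -0.1, step + 1)
print(len(buf), buf.sample(2))
```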
[0085] The Critic network is updated using the information fed back from the interaction between the Actor and the environment. For a sampled transition $(s, a, r, s')$, the Target Actor network $\mu'$ first computes the action for the next state $s'$: $a' = \mu'(s' \mid \theta^{\mu'})$;
[0086] No noise needs to be added to the action computed at this point. Then the Target Critic network $Q'$ is used to compute the target value of the state-action pair $(s, a)$: $y = r + \gamma Q'(s', a' \mid \theta^{Q'})$;
[0087] Next, the Critic network $Q$ is used to compute the evaluated value of the state-action pair $(s, a)$: $q = Q(s, a \mid \theta^{Q})$;
[0088] Finally, the gradient descent algorithm is used to minimize the difference between the evaluated value and the target value, which updates the parameters of the Critic network: $L = (y - q)^2$; $\theta^{Q} \leftarrow \theta^{Q} - \eta_{Q} \nabla_{\theta^{Q}} L$, where $\eta_{Q}$ is the Critic learning rate;
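The target value and the error minimized by the Critic can be illustrated with plain Python on a small batch. This sketch assumes the standard one-step temporal-difference target (reward plus discounted target-network value) and a mean squared error loss; the function names and the sample numbers are hypothetical, and the network outputs are passed in as plain lists rather than computed by real networks.

```python
def td_targets(rewards, next_q_values, gamma=0.75):
    """Target value per sample: reward plus discounted target-critic value."""
    return [r + gamma * q_next for r, q_next in zip(rewards, next_q_values)]


def critic_loss(q_values, targets):
    """Mean squared error between evaluated values and target values."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(targets)


rewards = [-0.5, -0.25]          # rewards from two sampled transitions
next_q = [-1.0, -2.0]            # Target Critic values for the next states
current_q = [-1.0, -2.0]         # Critic values for the sampled pairs
targets = td_targets(rewards, next_q)
print(targets, critic_loss(current_q, targets))
```

In a full implementation, the gradient of this loss with respect to the Critic parameters would drive the gradient descent step.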
[0089] The Actor's role is to interact with the environment and, guided by the Critic's value function, learn a better policy using policy gradients. To update the Actor network, the Actor network $\mu$ first computes the action for the state $s$: $\hat{a} = \mu(s \mid \theta^{\mu})$;
[0090] No noise needs to be added to the computed action. Then the Critic network is used to compute the evaluated value (that is, the cumulative expected return) of the state-action pair $(s, \hat{a})$: $\hat{q} = Q(s, \hat{a} \mid \theta^{Q})$;
[0091] Finally, the gradient ascent algorithm is used to maximize the cumulative expected return $\hat{q}$ (in practice, gradient descent is applied to $-\hat{q}$, which is equivalent), thereby updating the parameters of the Actor network: $\theta^{\mu} \leftarrow \theta^{\mu} + \eta_{\mu} \nabla_{\theta^{\mu}} \hat{q}$, where $\eta_{\mu}$ is the Actor learning rate. This completes the update of both the Actor and Critic networks.
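The gradient ascent step for the Actor can be illustrated with a toy Python example. Everything here is a simplifying assumption made for illustration: the Actor is a single linear layer followed by tanh, and the Critic is replaced by a hypothetical fixed function $Q(s, a) = -(a - 0.5)^2$ whose gradient is known in closed form. A real DDPG implementation would backpropagate through a learned Critic network instead.

```python
import math


def actor_action(theta, state):
    """Toy actor: linear layer followed by tanh, output in [-1, 1]."""
    return math.tanh(sum(w * s for w, s in zip(theta, state)))


def toy_critic_grad(action, best_action=0.5):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - best_action)^2."""
    return -2.0 * (action - best_action)


def actor_step(theta, state, lr=0.1):
    """One gradient ascent step on Q(s, actor(s)) via the chain rule."""
    a = actor_action(theta, state)
    dq_da = toy_critic_grad(a)
    da_dz = 1.0 - a * a  # derivative of tanh at the pre-activation
    return [w + lr * dq_da * da_dz * s for w, s in zip(theta, state)]


theta, state = [0.0, 0.0], [1.0, 0.5]
for _ in range(200):
    theta = actor_step(theta, state)
print(actor_action(theta, state))  # approaches the critic's preferred action
```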
[0092] Furthermore, the DDPG algorithm uses a soft update for the target networks, which comprise the Target Actor network and the Target Critic network. A soft-update rate (momentum) $\tau$ is introduced: the old target network parameters and the corresponding new network parameters are combined in a weighted average, and the result is assigned to the target network, which completes the target update. This is existing technology and is not explained further. Through network training, the parameters are continuously learned and updated, eventually yielding a relatively stable traffic signal control strategy for controlling the traffic signals.
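The soft update described here is a weighted average of old target parameters and current network parameters, which a short Python sketch makes concrete. The function name `soft_update`, the flat-list parameter representation, and the default rate of 0.01 are illustrative assumptions; the source does not state the rate it uses.

```python
def soft_update(target_params, online_params, tau=0.01):
    """Blend target parameters toward the online network's parameters.

    tau is the soft-update rate: 0 leaves the target unchanged,
    1 copies the online parameters outright.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
print(soft_update(target, online, tau=0.5))
```

A small rate makes the target networks change slowly, which keeps the target values used in the Critic update from shifting abruptly.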
[0093] Furthermore, this embodiment also compares the timing method provided in this application with existing technologies. To assess how effectively traffic congestion is relieved, this embodiment uses vehicle delay time as the final performance metric. Because delay time accumulates over a long period, it is not well suited as a direct training signal for a learning algorithm. This embodiment therefore selects the average queue length as the training metric. Minimizing the queue length effectively reduces vehicle delay time and is easier to calculate in practice. To verify the reliability of the timing method provided in this embodiment, the DDPG, P-DDPG, Fix-time, MP (Max-pressure), and SOTL timing algorithms are compared, and the changes in average queue length during training under the same environmental conditions are analyzed. Figure 7, Figure 8, and Figure 9 show how the reward varies with the number of completed phases. The horizontal axis represents the number of completed phases, while the vertical axis represents the average queue length of all lanes when a phase is completed, thus reflecting the reward obtained at different times.
[0094] As can be clearly seen from Figure 7, Figure 8, and Figure 9, the average vehicle queue length fluctuates within a certain range. For this reason, the Exponential Moving Average (EMA) of the reward was used to determine whether the queue length had converged. The EMA indicates that the algorithms shown in the figures tend to stabilize and eventually converge. With larger traffic volumes, because fixed-time control cannot anticipate future traffic flow, its average delay increases and its fluctuation range widens. This means that the intersection operates at low efficiency most of the time, which reflects the drawbacks of fixed-time control. When traffic flow is low, the traditional DDPG algorithm adjusts the timing relative to a fixed duration benchmark. Once the replay buffer is full, the controller begins learning, and the average delay decreases over time. After a certain amount of training, the average delay no longer decreases significantly but fluctuates within a small range, indicating that the controller has learned a stable and effective phase control strategy. At this point, the final queue length and vehicle delay time of the DDPG and P-DDPG algorithms are the same, indicating that both have learned a stable and effective strategy. However, the P-DDPG algorithm, which pre-allocates the green light duration of each phase, converges faster than the traditional DDPG algorithm, demonstrating that pre-allocation allows an effective strategy to be learned sooner. When traffic flow reaches a moderate level, DDPG may cause complete congestion on the road through a poor timing decision, which prevents the learning algorithm from learning an effective way to relieve congestion. P-DDPG, by contrast, can learn an effective traffic light timing scheme under both moderate and heavy traffic conditions. P-DDPG is therefore more suitable than DDPG for heavy traffic, with stronger stability and faster convergence. Traditional control algorithms such as MP and SOTL achieve lower vehicle delay times than fixed-time control, but they allocate time based only on the current state and do not consider vehicles that will arrive in the future. Their total vehicle delay time is therefore higher than that of the P-DDPG algorithm provided in this embodiment.
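The EMA smoothing used to judge convergence can be sketched in Python as follows. The smoothing coefficient of 0.9, the initialization with the first sample, and the function name `ema` are assumptions for illustration; the source does not specify how its EMA was parameterized.

```python
def ema(values, beta=0.9):
    """Exponential moving average of a sequence of per-phase measurements.

    beta close to 1 gives heavier smoothing; the average is initialized
    with the first value (an illustrative choice).
    """
    smoothed = []
    avg = values[0]
    for v in values:
        avg = beta * avg + (1.0 - beta) * v
        smoothed.append(avg)
    return smoothed


queue_trace = [12.0, 9.0, 11.0, 8.0, 7.5, 8.0, 7.0]
print(ema(queue_trace))
```

A smoothed curve that flattens out while the raw values keep fluctuating is the pattern the text interprets as convergence.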
[0095] Furthermore, we compared the simulation results of the SOTL algorithm, MP algorithm, DDPG algorithm and P-DDPG algorithm under three different vehicle densities. The comparison results are shown in Table 1.
[0096] Table 1 Simulation Results
[0097]
[0098] The results show that the average delay of DDPG is 60%, 21%, and 28% lower than Fix-time, SOTL, and MP under low traffic density, respectively; 28%, 26%, and 12% lower under medium traffic density, respectively; and 15%, 13%, and 8% lower under high traffic density, respectively. Furthermore, as shown in the figure, P-DDPG outperforms DDPG in both convergence and stability. The reason why P-DDPG outperforms DDPG in the experimental results is that the P-DDPG provided in this embodiment uses a method of pre-timing the green light for each phase.
[0099] In summary, the traffic signal timing method based on deep reinforcement learning provided in this application can reduce traffic congestion at intersections, achieve faster convergence in complex traffic flows, and increase the timing stability of traffic light control. Even if the DDPG algorithm makes an erroneous move, it will not cause complete traffic congestion. Moreover, it is superior to traditional traffic signal control algorithms. In addition, the P-DDPG algorithm outputs continuous actions, which are more suitable for obtaining better stage execution time, thereby effectively controlling traffic.
[0100] Based on the same inventive concept, corresponding to the above-described embodiments of the traffic signal timing method based on deep reinforcement learning, this invention also provides a traffic signal timing device based on deep reinforcement learning, the device comprising:
[0101] The data acquisition module is used to obtain the queue length of vehicles at the intersection at the current time and the phase of the lane at the intersection at the current time. One intersection includes four phases, and the phase represents the traffic status of the lane.
[0102] The optimization model building module is used to build an optimization model for the traffic lights at a single intersection, with the objective function being to minimize the delay time of vehicles at the intersection and the length of the vehicle queue.
[0103] The preset module is used to preset the state, actions and rewards of the traffic light agent. Based on the traffic light agent and the DDPG algorithm, the Markov decision model is initialized and trained.
[0104] The green light time pre-allocation module is used to pre-allocate the green light time of each lane phase according to the queue length of vehicles in each phase of the intersection at the current time.
[0105] The green light time adjustment module is used to input the current state of the intersection into the trained Markov decision model, and adjust the green light time of each phase of the lane according to the action output by the Markov decision model. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection.
[0106] The network update module is used to obtain the current intersection status and reward value after the selected optimal green light time has ended, and store it as experience in the experience pool. It extracts a certain amount of experience from the experience pool and updates the neural network of the Markov decision model based on the DDPG algorithm.
[0107] The time adjustment module is used to adjust the green light time for each phase based on the output of the updated Markov decision model.
[0108] The traffic signal timing device based on deep reinforcement learning provided in this embodiment has the following beneficial effects: By pre-allocating the green light time for each phase according to the current lane queue length at the intersection, the overall waiting time of vehicles at the intersection is effectively reduced. However, the pre-allocated green light time for each phase can only be adjusted within a fixed period and cannot adapt to real-time changes in traffic flow. Therefore, before the start of each phase's green light time, this invention uses deep reinforcement learning to further adjust the phase's green light duration based on the feedback from interactive training with traffic conditions, thereby achieving real-time prediction of the optimal green light time for the current phase. Furthermore, to better adapt to the uncertainty and complexity of traffic flow, this invention incorporates a phase mechanism in the state design to avoid situations where different intersections are in different phases with the same number of vehicles, thus making the training more stable. Therefore, this invention, through the above methods, can effectively adapt to the reasonable allocation of traffic signal green light duration under complex and changing traffic flow conditions, thereby reducing vehicle waiting time at intersections.
[0109] In another embodiment of the present invention, an electronic device is provided, comprising one or more processors and a memory coupled to the processors for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the steps of the traffic signal timing method based on deep reinforcement learning described in the above embodiment. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, and specifically suitable for loading and executing one or more instructions in a computer storage medium to achieve the corresponding method flow or corresponding function; the processor described in the embodiments of the present invention can be used to execute the operations of the traffic signal timing method based on deep reinforcement learning.
[0110] In another embodiment of the present invention, a computer-readable storage medium is provided, which is a memory device in a computer device for storing programs and data. It is understood that the computer-readable storage medium here may include both built-in storage media in the computer device and extended storage media supported by the computer device. The computer-readable storage medium provides storage space that stores the operating system of the terminal. Furthermore, the storage space also stores one or more instructions suitable for loading and execution by a processor, which may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here may be high-speed RAM or non-volatile memory, such as at least one disk storage device. The processor can load and execute one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the traffic light timing method based on deep reinforcement learning in the above embodiments. Those skilled in the art should understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0111] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A traffic signal timing method based on deep reinforcement learning, characterized in that, The methods include: Get the current queue length of vehicles at the intersection and the current phase of the lane at the intersection. An intersection includes four phases, and the phase represents the traffic status of the lane. An optimization model for traffic lights at a single intersection is established, with the objective function being to minimize vehicle delay time and queue length. The expression for this optimization model is: ,in, Indicates the delay time at the intersection. Indicates the length of the queue of vehicles at the intersection. n This indicates the number of lanes at a single intersection. X This represents the overall control process of the traffic light agent. λ Indicates the green credit ratio. T This indicates the duration of the cycle in which the traffic light colors are displayed in turn. Indicates the switching of the green light for each phase. g Indicates the effective time of the green light for each phase. This represents the vehicle delay time; the expression for the intersection delay time is: ,in, =1 indicates the first entrance lane at the intersection. Indicates all entrances to the intersection. Indicates the entrance to the intersection Road traffic flow, Represents the delay function, This represents traffic flow; the formula for calculating the queue length of vehicles at an intersection is: ,in, Indicates the first signal cycle. Indicates the first phase, The maximum number of phases, The number of lanes at a single intersection. For the current phase the way exist t The number of vehicles in the queue at any given time. p This indicates the average distance between vehicles; The state, action, and reward of the traffic light agent are preset. Based on the traffic light agent and the DDPG algorithm, a Markov decision model is initialized and trained. 
The expression for the preset state of the traffic light agent is: ,in, Indicates the current moment. The number of lanes at a single intersection. Indicates the current state Phase, The normalization factor is used; the expression for the action of the traffic light agent is preset to be... ,in, x This indicates the green light time pre-allocated to each phase according to a set ratio. To amplify the time value within a reasonable range, α This represents the amplification factor; the expression for the reward of the preset traffic light agent is: ,in, Indicates the first time after the green light period ends. i Queue length for each lane The number of lanes at a single intersection; Based on the current queue length of vehicles in each phase of the intersection, the green light time for each phase of the lane is pre-allocated proportionally. The current state of the intersection is input into the trained Markov decision model. Based on the actions output by the Markov decision model, the green light time of each phase of the lane is adjusted. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection. After the selected optimal green light time ends, obtain the current intersection status and reward value, and store them as experience in the experience pool. Extract a certain amount of experience from the experience pool and update the neural network of the Markov decision model based on the DDPG algorithm. The green light time for each phase is adjusted based on the output of the updated Markov decision model.
2. The method according to claim 1, characterized in that, The neural network of the Markov decision model includes an action network and a critic network. The network parameters of the action network are updated using a policy gradient algorithm, and the network parameters of the critic network are updated using a temporal difference algorithm.
3. A traffic signal timing device based on deep reinforcement learning, characterized in that, The device includes: The data acquisition module is used to obtain the queue length of vehicles at the intersection at the current time and the phase of the lane at the intersection at the current time. One intersection includes four phases, and the phase represents the traffic status of the lane. The optimization model building module is used to establish an optimization model for the traffic lights at a single intersection, with the objective function being to minimize the delay time of vehicles at the intersection and the queue length of vehicles; wherein, the expression of the optimization model is: ,in, Indicates the delay time at the intersection. Indicates the length of the queue of vehicles at the intersection. n This indicates the number of lanes at a single intersection. X This represents the overall control process of the traffic light agent. λ Indicates the green credit ratio. T This indicates the duration of the cycle in which the traffic light colors are displayed in turn. Indicates the switching of the green light for each phase. g Indicates the effective time of the green light for each phase. This represents the vehicle delay time; the expression for the intersection delay time is: ,in, =1 indicates the first entrance lane at the intersection. Indicates all entrances to the intersection. Indicates intersection Road traffic flow, Represents the delay function, Indicates traffic flow; The formula for calculating the queue length of vehicles at an intersection is: ,in, Indicates the first signal cycle. Indicates the first phase, The maximum number of phases, The number of lanes at a single intersection. For the current phase the way exist t The number of vehicles in the queue at any given time. p This indicates the average distance between vehicles; The preset module is used to preset the state, action, and reward of the traffic light agent. 
Based on the traffic light agent and the DDPG algorithm, it initializes and trains the Markov decision model. The expression for the preset state of the traffic light agent is: ,in, Indicates the current moment. This indicates the number of lanes at a single intersection. Indicates the current state Phase, The normalization factor is used; the expression for the action of the traffic light agent is preset to be... ,in, x This indicates the green light time pre-allocated to each phase according to a set ratio. Increase the time value within a reasonable range. α This represents the amplification factor; the expression for the reward of the preset traffic light agent is: ,in, Indicates the first time after the green light period ends. i Queue length for each lane n Indicates the number of lanes at a single intersection; The green light time pre-allocation module is used to pre-allocate the green light time of each lane phase according to the queue length of vehicles in each phase of the intersection at the current time. The green light time adjustment module is used to input the current state of the intersection into the trained Markov decision model, and adjust the green light time of each phase of the lane according to the action output by the Markov decision model. The action is to select the optimal green light time of each phase with the goal of reducing the queue length of vehicles at the intersection. The network update module is used to obtain the current intersection status and reward value after the selected optimal green light time has ended, and store it as experience in the experience pool. It extracts a certain amount of experience from the experience pool and updates the neural network of the Markov decision model based on the DDPG algorithm. The time adjustment module is used to adjust the green light time for each phase based on the output of the updated Markov decision model.
4. An electronic device, characterized in that, The electronic device includes a memory and a processor, the memory storing a computer program, the processor invoking the computer program to perform the steps of the method as described in any one of claims 1 to 2.