Dynamic traffic signal control method and system based on deep reinforcement learning
By constructing a state space vector and a dynamic time step, and combining them with elastic dilemma zone determination and communication monitoring, the decision lag and safety issues of deep reinforcement learning in traffic signal control are addressed, thereby improving the reliability and safety of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TAIYUAN INST OF TECH
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing dynamic traffic signal control methods based on deep reinforcement learning suffer from problems such as decision lag, lack of vehicle motion state constraints leading to rear-end collisions, and lack of degradation protection mechanisms under communication interruption conditions, resulting in intersection congestion deadlocks.
The shock wave velocity and shock wave front position are extracted to construct a state space vector. The estimated encounter time is calculated and set as a dynamic time step. The method determines whether a vehicle is in an elastic dilemma zone, generates an interception flag, and performs a state overwrite operation. Physical safety constraints and a hardware degradation mechanism are introduced together with communication anomaly monitoring to trigger degradation protection.
It enables dynamic adjustment and physical synchronization of traffic signal control, reduces decision lag and safety risks, and improves the system's operational reliability and vehicle traffic safety.
Figure CN122245109A_ABST
Description
Technical Field
[0001] This invention relates to the field of intelligent transportation technology, specifically to a dynamic traffic signal control method and system based on deep reinforcement learning.
Background Technology
[0002] Dynamic traffic signal control refers to the process by which traffic control equipment dynamically adjusts the phase and timing of traffic lights based on real-time traffic flow data. Dynamic traffic signal control technology based on deep reinforcement learning combines the perceptual feature extraction capability of deep neural networks with the sequential decision-making capability of reinforcement learning. It treats the intersection traffic controller as an intelligent agent, allowing the agent to interact with the intersection traffic environment and continuously update the neural network parameters based on the current intersection traffic status data, thereby outputting traffic signal control actions.
[0003] Existing dynamic traffic signal control technologies based on deep reinforcement learning generally use a fixed time period as the decision step size of the deep reinforcement learning model. Intersection sensing devices collect traffic flow and lane occupancy data at fixed time intervals. The computing device inputs the collected traffic flow and lane occupancy data into the deep reinforcement learning model. At the end of each fixed time period, the deep reinforcement learning model outputs either a phase switching action or a current phase maintenance action. Existing deep reinforcement learning models optimize parameters with the goal of maximizing the number of vehicles passing through the intersection. The control system issues control commands directly from the calculation results of the neural network.
[0004] Because stopping and starting shock waves alternate dynamically within traffic flow, and because deep reinforcement learning models lack physical safety constraints at the action output, existing traffic signal control methods suffer from decision lag and low safety. A fixed decision cycle cannot be synchronized with the encounter cycle of stopping and starting shock waves, so the traffic signal control actions output by the deep reinforcement learning model lag behind the traffic state, reducing the traffic efficiency of intersections. Deep reinforcement learning models issue control commands directly, ignoring the instantaneous motion state of vehicles approaching the intersection. When a model issues a phase switching action while vehicles are in the dilemma zone, drivers are forced to brake hard, which can cause rear-end collisions at intersections. Control systems that rely solely on data-driven approaches lose control capability when the underlying communication is interrupted. The system lacks a hardware degradation protection mechanism, which can easily lead to prolonged congestion and deadlock at intersections.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a dynamic traffic signal control method and system based on deep reinforcement learning. It aims to solve the problems of control action lag caused by the fixed decision cycle of existing control methods, rear-end collisions caused by the lack of vehicle motion state constraints in deep reinforcement learning models, and intersection congestion deadlock caused by the lack of degradation protection mechanism in the control system under communication interruption conditions.
[0006] To achieve the above objectives, the present invention provides the following technical solution. A first aspect of this invention provides a dynamic traffic signal control method based on deep reinforcement learning, comprising the following steps:
Acquire microscopic trajectory data, extract the shock wave velocity and shock wave front position from the microscopic trajectory data, and construct a state space vector by combining them with the phase index of the currently executed signal;
Calculate the estimated encounter time based on the shock wave velocity and the shock wave front position, and set the estimated encounter time as the dynamic time step of the deep reinforcement learning model;
Obtain the motion state of vehicles approaching the intersection and, based on the motion state and the shock wave velocity, determine whether a vehicle is in an elastic dilemma zone to generate an interception flag;
Input the state space vector into the policy network of the deep reinforcement learning model to output the original expected action, perform a state overwrite operation on the original expected action based on the interception flag to generate the final execution action for traffic signal control, and record the cumulative number of mask interceptions;
Generate a comprehensive reward value based on the shock wave velocity and the cumulative number of mask interceptions, and update the network parameters of the deep reinforcement learning model using the comprehensive reward value;
Calculate the global delay time based on the microscopic trajectory data and push it to the management platform, and trigger degradation protection when a communication anomaly is detected.
[0007] In the above method, the process of extracting shock wave features and constructing the state space is as follows: Microscopic trajectory data is processed using a traffic flow theory model to calculate the traffic flow and traffic density of each lane at the intersection. The stopping shock wave velocity and the starting shock wave velocity are calculated based on the rate of change of traffic flow and density. The instantaneous velocity and instantaneous acceleration of all vehicles within the lane are traversed, and the vehicle boundary coordinates are extracted as the stopping shock wave front position coordinates and the starting shock wave front position coordinates. The shock wave velocity and shock wave front position are normalized and then concatenated with the one-hot encoded signal phase index to generate a state space vector. This step combines the macroscopic features of traffic flow with the microscopic spatiotemporal distribution of vehicles, providing the decision network with traffic flow evolution features as input.
[0008] The process of determining the dynamic time step is as follows: The system reads the maximum and minimum permissible time steps, sets the estimated encounter time as the dynamic time step, and performs boundary constraint processing based on the maximum and minimum permissible time steps to determine the trigger time for the next decision. This step synchronizes the decision frequency of deep reinforcement learning with the encounter cycles of stopping and starting shock waves within the traffic flow, avoiding the action lag or invalid calculations caused by fixed decision cycles.
[0009] The process of generating an interception flag and performing a state overwrite operation is as follows: The relative collision speed is calculated from the vehicle's instantaneous speed and the stopping shock wave speed, and the safe critical distance at the start of the elastic dilemma zone is calculated by combining the preset driver reaction time and maximum safe deceleration. It is then determined whether the relative distance between the vehicle and the stopping shock wave front position lies between the preset critical distance at the end of the elastic dilemma zone and the safe critical distance at its start. If it does, the vehicle is determined to be in the elastic dilemma zone and the interception flag is set to 1; otherwise, it is set to 0. When the interception flag is 1 and the original expected action is a phase switching action, a state overwrite operation is performed: the original expected action is replaced by the maintain-current-phase action as the final action, and the mask interception count is updated. When the interception flag is 0, the original expected action is used directly as the final action. This mechanism intervenes in the actions output by the policy network by introducing a car-following state and a dilemma zone determination model, masking traffic signal switching actions that pose potential safety hazards without disrupting the model's exploration mechanism.
[0010] The process of generating the comprehensive reward value and updating the model parameters is as follows: The difference between the starting shock wave velocity and the stopping shock wave velocity is calculated to generate a reward term. This reward term, along with the cumulative number of mask interceptions, is used to calculate a comprehensive reward value. The state space vector, the original expected action, the comprehensive reward value, and the state space vector for the next time step are encapsulated into an experience data tuple and written into the experience replay pool. The network parameters are then updated using the backpropagation algorithm. With the shock wave velocity difference as the basic reward criterion, a penalty term is constructed from the cumulative number of mask interceptions to guide the deep reinforcement learning model to update its parameters in a direction that accelerates queue dissipation at intersections and reduces safety interventions.
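The reward described above can be sketched as a weighted sum. The source only states that a reward term (the velocity difference) and a penalty based on the mask interception count are combined; the linear form, the function name `comprehensive_reward`, and the default weights below are assumptions made for illustration.

```python
def comprehensive_reward(v_start: float, v_stop: float, mask_count: int,
                         w_dissipation: float = 1.0,
                         w_penalty: float = 0.1) -> float:
    """Combine the shock wave reward term with a mask interception penalty.

    The linear weighting is an assumed form; the source does not give
    the exact formula or the weight values.
    """
    # Reward term from the source: starting minus stopping shock wave velocity.
    reward_term = v_start - v_stop
    # Penalty grows with the cumulative number of mask interceptions.
    return w_dissipation * reward_term - w_penalty * mask_count
```

Under this sketch, a faster starting shock wave relative to the stopping shock wave raises the reward, while each masked phase switch lowers it, which matches the stated goal of faster queue dissipation with fewer safety interventions.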
[0011] The specific steps for degradation protection under abnormal operating conditions are as follows: The lane space saturation is calculated from the current residual queue length and the total physical length of the lane. When the saturation exceeds the overflow threshold for several consecutive dynamic time steps, a safety fallback plan is triggered to forcibly clear the queue and relieve the congestion. Vehicle identification codes are used to track the complete passage segment of each vehicle to obtain actual travel times, and the global delay time is calculated and pushed to the management platform to generate logs. A heartbeat monitoring mechanism is built by sending timestamped handshake probe packets from the edge computing device. When the watchdog timer fails to receive probe packets more than a preset number of consecutive times, a communication lockout state is determined, and the underlying hardware relay takes over control and switches to a single-point fixed-time, multi-period timing scheme for degradation protection. This design compensates for the data-driven model's lack of a response mechanism when communication is interrupted or traffic flow inputs are abnormal, improving the system's operational reliability.
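The heartbeat logic above can be illustrated with a small software model of the watchdog. This is only a sketch: the class name `HeartbeatWatchdog`, the periodic `tick` call, and the latching lockout are assumptions, and in the described system the takeover is performed by a hardware relay rather than by code like this.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    timeout_s: float           # longest tolerated gap between probe packets (assumed)
    max_missed: int            # preset number of consecutive misses
    last_seen_s: float = 0.0   # timestamp carried by the latest probe packet
    missed: int = 0
    locked_out: bool = False

    def on_probe(self, timestamp_s: float) -> None:
        """A timestamped handshake probe arrived: reset the miss counter."""
        self.last_seen_s = timestamp_s
        self.missed = 0

    def tick(self, now_s: float) -> bool:
        """Run once per watchdog period; returns True once locked out."""
        if now_s - self.last_seen_s > self.timeout_s:
            self.missed += 1
        if self.missed > self.max_missed:
            # Communication lockout: the hardware relay would take over here
            # and switch to the pre-stored fixed-time scheme.
            self.locked_out = True
        return self.locked_out
```

In this sketch the lockout latches once triggered; whether and how the real system returns control to the learning model after communication recovers is not stated in the source.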
[0012] A second aspect of the present invention provides a dynamic traffic signal control system based on deep reinforcement learning, comprising:
The shock wave feature extraction and state construction module, used to acquire microscopic trajectory data, extract the shock wave velocity and shock wave front position, and construct a state space vector by combining them with the current signal phase index;
The dynamic decision step size generation module, used to calculate the estimated encounter time based on the shock wave velocity and the shock wave front position, and set it as the dynamic time step of the deep reinforcement learning model;
The elastic dilemma zone determination module, used to acquire the vehicle's motion state and, in conjunction with the shock wave velocity, determine whether the vehicle is in an elastic dilemma zone to generate an interception flag;
The action mask execution module, used to input the state space vector into the policy network to output the original expected action, and to perform a state overwrite operation based on the interception flag to generate the final execution action, while recording the cumulative number of mask interceptions;
The model training and update module, used to generate a comprehensive reward value based on the shock wave velocity and the cumulative number of mask interceptions, and to update the network parameters of the deep reinforcement learning model;
The operation assessment and safety degradation module, used to calculate the global delay time and push it to the management platform, and to trigger degradation protection when a communication anomaly is detected.
[0013] In the above system, the shock wave feature extraction and state construction module is specifically used for: Traffic flow theory models are used to process micro-trajectory data to calculate traffic flow and traffic density for each lane at the intersection. The stopping shock wave velocity and starting shock wave velocity are calculated based on the rate of change of traffic flow and traffic density. The instantaneous velocity and instantaneous acceleration of all vehicles in the lane are traversed, and the vehicle boundary coordinates are extracted as the stopping shock wave front position coordinates and starting shock wave front position coordinates. The shock wave velocity is normalized using a preset reference shock wave velocity, and the shock wave front position is normalized using the total physical length of the lane. The current signal phase index is converted into a one-hot encoded format and then concatenated with the normalized parameters to generate a state space vector.
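The normalization and concatenation described above might look like the following for a single lane. The function name `build_state_vector`, the argument names, and the ordering of the elements are assumptions; the source only states which quantities are normalized by which reference and that the one-hot phase index is concatenated with them.

```python
from typing import List


def build_state_vector(v_stop: float, v_start: float,
                       x_stop: float, x_start: float,
                       phase_index: int, num_phases: int,
                       v_ref: float, lane_length: float) -> List[float]:
    """Build a single-lane state vector (element order is an assumption)."""
    if not 0 <= phase_index < num_phases:
        raise ValueError("phase_index out of range")
    # One-hot encode the currently executed signal phase.
    one_hot = [1.0 if i == phase_index else 0.0 for i in range(num_phases)]
    # Velocities are scaled by the preset reference shock wave velocity,
    # front positions by the total physical length of the lane.
    features = [v_stop / v_ref, v_start / v_ref,
                x_stop / lane_length, x_start / lane_length]
    return features + one_hot
```

For an intersection with several lanes, one plausible extension is to concatenate the per-lane feature blocks before appending the one-hot phase encoding, but the source does not specify the layout.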
[0014] The dynamic decision step size generation module is specifically used for: The system reads the maximum and minimum allowed time steps, and performs boundary constraint processing based on the set dynamic time step and the maximum and minimum allowed time steps to determine the trigger time for the next decision of the deep reinforcement learning model.
[0015] The elastic dilemma zone determination module is specifically used for: The module calculates the relative collision speed from the vehicle's instantaneous speed and the stopping shock wave speed, and calculates the safe critical distance at the start of the elastic dilemma zone by combining the preset driver reaction time and maximum safe deceleration. It then determines whether the relative distance between the vehicle and the stopping shock wave front position lies between the preset critical distance at the end of the elastic dilemma zone and the safe critical distance at the start of the elastic dilemma zone. If the distance is within this range, the vehicle is determined to be in the elastic dilemma zone, and an interception flag is generated and assigned a value of 1. If the distance is outside this range, an interception flag is generated and assigned a value of 0.
[0016] The action mask execution module is specifically used for: When the interception flag is 1 and the original expected action is a phase switching action, a state overwrite operation is performed, replacing the original expected action with the maintain-current-phase action as the final action to be executed, and the cumulative number of mask interceptions is updated; when the interception flag is 0, the state overwrite operation is skipped, and the original expected action is used directly as the final action to be executed.
[0017] The model training and update module is specifically used for: The difference between the starting shock wave velocity and the stopping shock wave velocity is calculated to generate a reward term. The reward term and the cumulative number of mask interceptions are used to calculate a comprehensive reward value. The state space vector, the original expected action, the comprehensive reward value, and the state space vector of the next time step are encapsulated into an experience data tuple and written into the experience replay pool. The network parameters are updated through the backpropagation algorithm.
[0018] The operation assessment and safety degradation module is specifically used for: The lane space saturation is calculated from the remaining physical length of the queue that has not yet dissipated in the current lane and the total physical length of the lane. When the lane space saturation exceeds the preset overflow threshold for several consecutive dynamic time steps, a safety fallback plan is triggered to forcibly clear the queue and relieve the congestion. The total number of valid vehicles leaving the intersection is obtained, and the vehicle identification code is used to track the complete passage segment of each vehicle to obtain its actual passage time. The differences between the actual passage time and the theoretical free-flow passage time are accumulated and divided by the total number of valid vehicles to obtain the global delay time. When the total number of valid vehicles is 0, the global delay time is set to the upper limit of the system's maximum waiting time. The global delay time and traffic operation parameters are serialized and pushed to the management platform, and a delay alarm log is generated when the severe congestion alarm threshold is exceeded. A heartbeat monitoring mechanism is built by sending timestamped handshake probe packets. When the watchdog timer fails to receive handshake probe packets more than a preset number of consecutive times, the system is determined to have entered the communication lockout state. The underlying hardware relay takes over control and switches to a single-point fixed-time, multi-period timing scheme to perform degradation protection.
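The overflow check in this paragraph can be sketched as a streak counter over dynamic time steps. The class name `OverflowMonitor` and the parameter `required_steps` (standing in for "several consecutive dynamic time steps") are assumptions; the source does not give the threshold or the step count.

```python
class OverflowMonitor:
    """Tracks how many consecutive time steps a lane has been over threshold."""

    def __init__(self, threshold: float, required_steps: int) -> None:
        self.threshold = threshold            # preset overflow threshold
        self.required_steps = required_steps  # assumed value of "several" steps
        self.streak = 0

    def update(self, residual_queue_m: float, lane_length_m: float) -> bool:
        """Call once per dynamic time step; True means trigger the fallback plan."""
        # Lane space saturation: undissipated queue length over lane length.
        saturation = residual_queue_m / lane_length_m
        # Any step at or below the threshold breaks the consecutive run.
        self.streak = self.streak + 1 if saturation > self.threshold else 0
        return self.streak >= self.required_steps
```

Resetting the streak on a single below-threshold step keeps short spikes from triggering forced clearing, which is one reading of the "consecutive" requirement.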
[0019] This invention provides a dynamic traffic signal control method and system based on deep reinforcement learning. It has the following beneficial effects: 1. This invention calculates the estimated encounter time from the shock wave velocity and the shock wave front position, and directly sets it as the dynamic time step of the deep reinforcement learning model. This physically synchronizes the decision frequency of the signal control with the encounter period of the starting shock wave and the stopping shock wave within the traffic flow, enabling the model to dynamically adjust the next decision time according to the actual traffic flow state. This helps to avoid the decision lag and invalid calculation caused by the traditional fixed time step, and improves the adaptability of traffic signal control to changes in traffic flow at intersections.
[0020] 2. This invention uses the vehicle's motion state and shock wave velocity to determine whether the vehicle is in an elastic dilemma zone. When it is determined to be in a dilemma zone and the original expected action output by the network is a phase switching, a state overwrite operation is performed to maintain the current phase. A physical safety constraint mechanism is introduced at the output of the deep reinforcement learning model. Without disrupting the model's trial and error exploration logic, it actively intercepts and blocks signal switching actions with potential collision risks, reducing the probability of drivers being caught in a dilemma at intersections and helping to ensure vehicle traffic safety.
[0021] 3. This invention monitors communication handshake probe packets and calculates lane space saturation. When a communication anomaly or traffic overflow at the intersection is detected, a low-level degradation protection mechanism is triggered. When the control device fails to receive probe packets for consecutive periods, a hardware relay takes over control and switches to a pre-stored timing scheme. Furthermore, when the lane space saturation stays above the overflow threshold for consecutive time steps, forced clearing is applied. This provides a safety backstop for reinforcement learning models that rely solely on data-driven approaches, and helps improve the overall operational reliability of the traffic signal control system.
Attached Figure Description
[0022] Figure 1 is a system framework diagram of an embodiment of the present invention;
Figure 2 is a schematic diagram of the method flow according to an embodiment of the present invention;
Figure 3 is a dynamic decision step size response curve of a specific application embodiment of the present invention;
Figure 4 is a comparison and verification graph of the global average delay time at the intersection in a specific application embodiment of the present invention.
Detailed Implementation
[0023] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] Referring to Figure 1, this invention provides a dynamic traffic signal control system based on deep reinforcement learning, comprising: sensing devices, edge computing devices, and signal control devices.
[0025] The signal output terminal of the sensing device is communicatively connected to the data input terminal of the edge computing device. The control command output terminal of the edge computing device is communicatively connected to the signal input terminal of the signal control device.
[0026] Sensing devices are deployed at the intersection via mechanical supports. These devices collect microscopic trajectory data from traffic participants. The sensing devices then transmit this data in real-time to edge computing devices. The edge computing devices receive the microscopic trajectory data and execute dynamic traffic signal control logic. The edge computing devices generate phase control commands and send them to the signal control equipment. The signal control equipment receives the phase control commands and controls the traffic light hardware at the intersection to perform the desired actions.
[0027] The edge computing devices in the system are internally configured with computing modules that execute the control logic, which may include: a shock wave feature extraction and state construction module, a dynamic decision step size generation module, an elastic dilemma zone determination module, an action mask execution module, and a model training and update module.
[0028] The data output of the shock wave feature extraction and state construction module is communicatively connected to the data inputs of both the dynamic decision step size generation module and the action mask execution module. The clock synchronization output of the dynamic decision step size generation module is communicatively connected to the trigger input of the action mask execution module. The data stream output of the sensing device is also communicatively connected to the data input of the elastic dilemma zone determination module. The state output of the elastic dilemma zone determination module is communicatively connected to the control intervention end of the action mask execution module. The data bus of the action mask execution module is connected to the data acquisition end of the model training and update module.
[0029] The shock wave feature extraction and state construction module acquires microscopic trajectory data sent by sensing devices. This data includes the absolute coordinates and instantaneous velocity of the vehicles. The module then processes this microscopic trajectory data using a continuous traffic flow dynamics model. Finally, the module calculates the traffic flow and traffic density for each lane at the intersection.
[0030] The shock wave feature extraction and state construction module calculates the stopping shock wave velocity and the starting shock wave velocity based on the rate of change of traffic flow and traffic density. This module also extracts the physical coordinates of the stopping and starting shock wave fronts. Furthermore, it obtains the phase index of the currently executing signal. Finally, the module combines the stopping shock wave velocity, the starting shock wave velocity, the physical coordinates of the stopping and starting shock wave fronts, and the signal phase index to form the state space vector for the current time step.
[0031] The dynamic decision step size generation module acquires the physical feature data output by the shock wave feature extraction and state construction module. It reads the physical coordinates of the starting shock wave front and the stopping shock wave front. The module calculates the absolute physical distance between the starting and stopping shock wave fronts. Finally, it extracts the starting and stopping shock wave velocities.
[0032] The dynamic decision step size generation module calculates the estimated encounter time between the starting shock wave and the stopping shock wave. This module sets the estimated encounter time as the dynamic time step. It also reads the system's maximum and minimum permissible time steps. Based on the dynamic time step and the system's maximum and minimum permissible time steps, the module performs boundary constraint processing to determine the trigger time for the deep reinforcement learning model's next decision.
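A minimal sketch of the step size computation follows. The source does not give the encounter time formula, so this example assumes both shock waves travel in the same direction and that the starting wave closes the gap at the difference of the two speed magnitudes; the function name `dynamic_time_step` and the fallback to the maximum step when the waves do not converge are likewise assumptions.

```python
def dynamic_time_step(x_start: float, x_stop: float,
                      v_start: float, v_stop: float,
                      t_min: float, t_max: float) -> float:
    """Estimate the shock wave encounter time and clamp it to [t_min, t_max]."""
    # Absolute physical distance between the two wave fronts.
    gap = abs(x_start - x_stop)
    # Assumed closing speed: starting wave catching up with the stopping wave.
    closing_speed = v_start - v_stop
    if closing_speed <= 0.0:
        # Waves never meet under this assumption; use the longest allowed step.
        return t_max
    encounter_time = gap / closing_speed
    # Boundary constraint processing from the source.
    return min(t_max, max(t_min, encounter_time))
```

For example, a 60 m gap closing at 4 m/s yields a 15 s step, which is kept as-is when the permitted range is 5 s to 30 s.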
[0033] The elastic dilemma zone determination module acquires the instantaneous speed of vehicles approaching the intersection and the stopping shock wave speed. It adds the instantaneous speed and the stopping shock wave speed to calculate the relative collision speed. The module also reads the preset driver reaction time constant and maximum safe deceleration. Based on the relative collision speed, driver reaction time constant, and maximum safe deceleration, the module calculates the safe critical distance at the start of the elastic dilemma zone.
[0034] The elastic dilemma zone determination module calculates the actual spatial distance between the vehicle and the stopping shock wave front. It then determines whether this actual spatial distance falls between the preset critical distance at the end of the elastic dilemma zone and the safe critical distance at the start of the elastic dilemma zone. When the actual spatial distance is within this range, the module generates an interception flag and assigns it a value of 1. When the actual spatial distance is outside this range, the module generates an interception flag and assigns it a value of 0.
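The zone test can be sketched as follows. The source names the inputs to the safe critical distance but not the formula; this example assumes the common stopping-distance form (reaction distance plus braking distance) applied to the relative collision speed. The function and parameter names are illustrative.

```python
def interception_flag(v_vehicle: float, v_stop_wave: float,
                      distance_to_front: float,
                      reaction_time: float, max_decel: float,
                      zone_end_distance: float) -> int:
    """Return 1 if the vehicle lies inside the elastic dilemma zone, else 0."""
    # Relative collision speed: the source adds the two speeds.
    v_rel = v_vehicle + v_stop_wave
    # Assumed formula: reaction distance plus braking distance.
    zone_start_distance = v_rel * reaction_time + v_rel ** 2 / (2.0 * max_decel)
    in_zone = zone_end_distance <= distance_to_front <= zone_start_distance
    return 1 if in_zone else 0
```

With a vehicle at 10 m/s, a stopping wave at 2 m/s, a 1 s reaction time, and 3 m/s² deceleration, the assumed start distance is 36 m, so a vehicle 20 m from the wave front is flagged when the zone end is preset to 10 m.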
[0035] The action mask execution module receives the state space vector transmitted by the shock wave feature extraction and state construction module. The action mask execution module inputs the state space vector into the policy network of the deep reinforcement learning model. The policy network performs forward inference and outputs the original expected action, which is either the phase switching action or the action to maintain the current phase. The action mask execution module reads the interception flag generated by the elastic dilemma zone determination module.
[0036] When the interception flag is 1 and the original expected action is a phase switching action, the action mask execution module performs a state overwrite operation. The action mask execution module modifies the final action to maintain the current phase. The action mask execution module triggers an internal counter to increment the cumulative number of mask interceptions. When the interception flag is 0, the action mask execution module skips the state overwrite operation and uses the original expected action as the final action. The action mask execution module converts the final action into a phase control command and sends it to the signal control device.
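The overwrite rule in this paragraph is small enough to express directly. The integer action encoding (`KEEP_PHASE`, `SWITCH_PHASE`) and the function name are assumptions for illustration; the policy network itself is outside the scope of this sketch.

```python
from typing import Tuple

KEEP_PHASE = 0     # assumed encoding: maintain the current phase
SWITCH_PHASE = 1   # assumed encoding: switch to the next phase


def apply_action_mask(raw_action: int, flag: int,
                      mask_count: int) -> Tuple[int, int]:
    """Return (final_action, updated_mask_count) after the state overwrite."""
    if flag == 1 and raw_action == SWITCH_PHASE:
        # Overwrite the unsafe switch and count the interception.
        return KEEP_PHASE, mask_count + 1
    # Flag is 0, or the network already chose to keep the phase.
    return raw_action, mask_count
```

Note that only the final action sent to the signal control device is changed; the original expected action is preserved separately, since the source stores that original action in the experience tuple.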
[0037] The model training and update module acquires the control data for the current time step at the trigger time set by the dynamic decision step size generation module. The module extracts the starting shock wave velocity and the stopping shock wave velocity. It calculates the difference between the starting shock wave velocity and the stopping shock wave velocity to generate a feedforward dynamic reward baseline value. Finally, the module reads the cumulative number of mask interceptions recorded by the internal counter of the action mask execution module.
[0038] The model training and update module extracts the preset shock wave dissipation efficiency weight coefficient and safety mask penalty weight coefficient. It then performs a weighted calculation on the feedforward dynamic reward baseline value and the cumulative number of mask interceptions to generate a comprehensive reward value. The module further encapsulates the state space vector, the original expected action, the comprehensive reward value, and the state space vector for the next time step into an experience data tuple. This tuple is then written into the experience replay pool. The module periodically reads data from the experience replay pool, calculates gradients using the backpropagation algorithm, and updates the network parameters of the deep reinforcement learning model.
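The experience replay pool can be sketched with a bounded queue. The class name `ReplayPool`, the fixed capacity, and uniform random sampling are assumptions; the source only states that tuples are written to the pool and read back periodically for gradient updates.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, original expected action, comprehensive reward, next state)
Transition = Tuple[List[float], int, float, List[float]]


class ReplayPool:
    def __init__(self, capacity: int) -> None:
        # Oldest tuples are discarded automatically once capacity is reached.
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def push(self, state: List[float], raw_action: int,
             reward: float, next_state: List[float]) -> None:
        """Store one experience tuple; the action is the unmasked one."""
        self.buffer.append((state, raw_action, reward, next_state))

    def sample(self, batch_size: int) -> List[Transition]:
        """Uniformly sample a minibatch (assumed sampling strategy)."""
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self) -> int:
        return len(self.buffer)
```

A training loop would draw a minibatch with `sample` and pass it to whichever deep reinforcement learning update rule is in use; the source does not name a specific algorithm.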
[0039] The operation assessment and safety degradation module obtains the total number of valid vehicles leaving the intersection. It then uses vehicle identification codes extracted by the sensing devices to track the complete passage segment of each vehicle and obtain its actual travel time. Finally, it sums the differences between the actual travel times and the theoretical free-flow travel times and divides this sum by the total number of valid vehicles to calculate the global delay time. When the total number of valid vehicles is 0, the module sets the global delay time to the system's maximum waiting time limit.
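The delay calculation, including the zero-vehicle case, can be written as a short function. The function name `global_delay` and the use of parallel per-vehicle lists are assumptions about how the tracked data is organized.

```python
from typing import Sequence


def global_delay(actual_times: Sequence[float],
                 free_flow_times: Sequence[float],
                 max_wait_s: float) -> float:
    """Average per-vehicle delay, or the maximum wait limit with no vehicles."""
    if len(actual_times) != len(free_flow_times):
        raise ValueError("one free-flow time is required per vehicle")
    n = len(actual_times)
    if n == 0:
        # No valid vehicle left the intersection: report the upper limit.
        return max_wait_s
    # Sum of (actual - free-flow) travel times divided by the vehicle count.
    total = sum(a - f for a, f in zip(actual_times, free_flow_times))
    return total / n
```

Returning the maximum waiting time when no vehicle has cleared the intersection avoids a division by zero and makes a fully blocked intersection register as the worst case on the management platform.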
[0040] The operation assessment and safety degradation module serializes the global delay time and the traffic operation parameters into JSON format and pushes them to the management platform. It also generates a delay alarm log when the severe congestion alarm threshold is exceeded. The module establishes a heartbeat monitoring mechanism by sending timestamped handshake probe packets. When the timer inside the signal control equipment fails to receive handshake probe packets for more than a preset number of consecutive times, the system is deemed to have entered a communication lockout state. At this point, the underlying hardware relay takes over control and switches to a single-point, multi-period fixed-timing scheme to execute degradation protection.
Referring to Figure 2, this invention provides a dynamic traffic signal control method based on deep reinforcement learning, comprising the following steps:
S1. Acquire the micro-trajectory data collected by the sensing devices. The micro-trajectory data includes the absolute coordinates and instantaneous speed of vehicles. Process the micro-trajectory data using a continuous traffic flow dynamics model to calculate the traffic flow and traffic density of each lane at the intersection. Calculate the stopping shock wave velocity and the starting shock wave velocity based on the rate of change of traffic flow and traffic density. Extract the physical coordinates of the stopping shock wave front and the starting shock wave front. Obtain the index of the currently executed signal phase. Combine the stopping shock wave velocity, the starting shock wave velocity, the physical coordinates of the stopping shock wave front, the physical coordinates of the starting shock wave front, and the signal phase index to form the state space vector of the current time step.
S2. Extract the physical coordinates of the stopping shock wave front and the starting shock wave front, and calculate the absolute physical distance between them. Combine the starting shock wave velocity and the stopping shock wave velocity to calculate the predicted encounter time between the starting shock wave and the stopping shock wave, and set the predicted encounter time as the dynamic time step. Read the system's maximum allowable time step and minimum allowable time step, and perform boundary constraint processing on the dynamic time step using these two limits, thereby determining the trigger time of the next decision of the deep reinforcement learning model.
S3. Obtain the instantaneous speed of the vehicle approaching the intersection and the stopping shock wave velocity. Add the instantaneous speed and the stopping shock wave velocity to calculate the relative collision speed of the vehicle. Read the preset driver reaction time constant and the maximum safe deceleration. Calculate the safety critical distance at the start of the elastic dilemma zone based on the relative collision speed, the driver reaction time constant, and the maximum safe deceleration. Calculate the actual spatial distance between the vehicle and the stopping shock wave front. Determine whether the actual spatial distance lies between the preset critical distance at the end of the elastic dilemma zone and the safety critical distance at the start of the elastic dilemma zone. When the actual spatial distance is within this interval, generate an interception flag and assign it a value of 1. When the actual spatial distance is outside this interval, generate an interception flag and assign it a value of 0.
S4. Receive the state space vector and input it into the policy network of the deep reinforcement learning model. The policy network performs forward inference and outputs the original expected action, which is either the phase switching action or the current phase holding action. Read the interception flag. When the interception flag is 1 and the original expected action is the phase switching action, perform the state overwrite operation: modify the final execution action to the current phase holding action and increment the cumulative number of mask interceptions recorded by the internal counter. When the interception flag is 0, skip the state overwrite operation and take the original expected action as the final execution action. Convert the final execution action into a phase control command and send it to the signal control device.
S5. When the trigger time of the next decision is reached, obtain the control data of the current time step, extract the starting shock wave velocity and the stopping shock wave velocity, and calculate the difference between them to generate the feedforward dynamic reward benchmark value. Read the recorded cumulative number of mask interceptions, and extract the preset shock wave dissipation efficiency weight coefficient and the safety mask penalty weight coefficient. Perform a weighted operation on the feedforward dynamic reward benchmark value and the cumulative number of mask interceptions to generate a comprehensive reward value. Encapsulate the state space vector, the original expected action, the comprehensive reward value, and the state space vector of the next time step into an experience data tuple, and write the experience data tuple into the experience replay pool. Read the data in the experience replay pool and use the backpropagation algorithm to update the network parameters of the deep reinforcement learning model.
[0041] S6. Extract the microscopic vehicle trajectories within the observation period, calculate the global average delay time at the intersection, encapsulate the traffic operation parameters and the state of the deep reinforcement learning model in a structured format, and push them to the visual traffic management platform. At the same time, construct a hardware-level heartbeat monitoring mechanism to trigger physical degradation protection.
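The interception logic of steps S3 and S4 can be illustrated with the following Python sketch. It is an illustration under stated assumptions rather than the claimed implementation: the text does not give the formula for the safety critical distance, so the sketch assumes the common "reaction distance plus braking distance" form; the magnitude of the stopping shock wave velocity is added to the vehicle speed; and all function and constant names (`dilemma_start_distance`, `interception_flag`, `apply_mask`, `HOLD`, `SWITCH`) are hypothetical.

```python
HOLD, SWITCH = 0, 1  # hypothetical encoding of the two actions


def dilemma_start_distance(v_vehicle: float, w_stop: float,
                           reaction_time: float, max_decel: float) -> float:
    """Safety critical distance at the start of the elastic dilemma zone.

    Assumed form: reaction distance + braking distance, evaluated at the
    relative collision speed (vehicle speed plus shock wave speed magnitude).
    """
    v_rel = v_vehicle + abs(w_stop)
    return v_rel * reaction_time + v_rel ** 2 / (2.0 * max_decel)


def interception_flag(distance_to_front: float,
                      zone_end: float, zone_start: float) -> int:
    """1 if the vehicle is inside the elastic dilemma zone, else 0."""
    return 1 if zone_end <= distance_to_front <= zone_start else 0


def apply_mask(raw_action: int, flag: int, mask_count: int):
    """Overwrite a SWITCH with HOLD when the flag is set; count interceptions."""
    if flag == 1 and raw_action == SWITCH:
        return HOLD, mask_count + 1
    return raw_action, mask_count
```

With a vehicle speed of 15 m/s, a stopping shock wave of 5 m/s, a reaction time of 1 s, and a maximum deceleration of 4 m/s², the assumed formula gives a start distance of 70 m; a vehicle 50 m from the wave front (with a zone end of 10 m) would then set the flag and block a phase switch.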
[0042] Step S1 specifically includes the following sub-steps: S101, acquire the micro-trajectory data collected by the sensing device. In this embodiment, considering the time synchronization problem of multi-source heterogeneous data, the sensing device is deployed at the intersection approach lane, and outputs the running status of traffic participants in the detection area in real time at a fixed sampling frequency. During this acquisition process, a unified network time protocol is used to timestamp each frame of data to ensure strict time alignment in subsequent dynamic calculations. The micro-trajectory data includes the absolute coordinates and instantaneous speed of each vehicle in the detection area.
[0043] To standardize spatial physical benchmarks, absolute coordinates are typically defined as the lateral and longitudinal coordinates of a vehicle in a local Cartesian coordinate system with the intersection's geometric center or the midpoint of the stop line as the origin. After receiving the micro-trajectory data, the edge computing device maps the micro-trajectory data, including absolute coordinates, to the corresponding intersection entrance lanes based on the pre-defined lane boundary geometric vector data of the intersection. For radar point cloud filtering, target tracking, and coordinate system mapping during the acquisition of micro-trajectory data by the sensing device, those skilled in the art can employ extended Kalman filtering algorithms and affine transformation matrices. The data preprocessing process is well-known in this field and will not be elaborated upon here.
[0044] After completing the lane-level mapping of the micro-trajectory data, the process proceeds to step S102, where the micro-trajectory data is processed using a continuous traffic flow dynamics model to calculate the traffic flow and traffic density for each lane at the intersection. In macro-traffic flow theory, the dynamic evolution of vehicle clusters can be analogized to the continuous movement of fluid in a pipe. Based on this general physical property, the edge computing device establishes a mapping relationship between micro-physical quantities and macro-fluid parameters based on the continuous traffic flow dynamics model. To achieve discretized numerical calculations, the system needs to set discrete spatial statistical intervals along the longitudinal direction of the lanes. The length of the spatial statistical interval is typically set to 10 to 20 meters to ensure the validity of the statistical samples. The instantaneous speeds of all vehicles and the total number of vehicles within the spatial statistical interval are extracted. The local traffic density is obtained by dividing the total number of vehicles by the length of the spatial statistical interval.
[0045] Furthermore, the spatial average speed is obtained by calculating the arithmetic mean of the instantaneous speeds of all vehicles within the spatial statistical interval. Multiplying the local traffic density by the spatial average speed yields the local traffic flow. As a preferred method, when the total number of vehicles within the spatial statistical interval is 0, to avoid physically meaningless numerical jumps in subsequent matrix or division operations, the system forcibly sets both the local traffic density and the spatial average speed to 0 and adds corresponding anomaly indicators.
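The per-interval calculation of local density, spatial average speed, and local flow, including the empty-interval guard, can be sketched as follows. The function name `cell_state`, the dictionary keys, and the choice of units (speeds in m/s, interval length in metres, so density is in vehicles per metre and flow in vehicles per second) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def cell_state(speeds_mps: Sequence[float], cell_length_m: float) -> Dict[str, object]:
    """Local density, mean speed and flow for one spatial statistical interval."""
    n = len(speeds_mps)
    if n == 0:
        # Empty interval: force density and speed to 0 and raise an anomaly flag.
        return {"density": 0.0, "mean_speed": 0.0, "flow": 0.0, "empty": True}
    density = n / cell_length_m            # vehicles per metre
    mean_speed = sum(speeds_mps) / n       # arithmetic mean of instantaneous speeds
    return {"density": density, "mean_speed": mean_speed,
            "flow": density * mean_speed,  # vehicles per second
            "empty": False}
```

For instance, two vehicles travelling at 10 m/s and 14 m/s inside a 20 m interval give a density of 0.1 veh/m, a mean speed of 12 m/s, and a flow of about 1.2 veh/s.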
[0046] S103 calculates the stopping shock wave velocity and the starting shock wave velocity based on the rate of change of traffic flow and traffic density. From the perspective of traffic flow dynamics, when traffic flow abruptly changes from an initial state to a target state within the lane space, the speed of the interface between the two states is the shock wave velocity. The core technical purpose of extracting and calculating the shock wave velocity is to enable the system to overcome the randomness of individual vehicles and accurately quantify the real-time spread and dissipation trend of the queue length from a macroscopic perspective. Edge computing devices use the shock wave velocity derivation formula for calculation, which is as follows:
$$ w = \frac{q_2 - q_1}{k_2 - k_1} = \frac{\Delta q}{\Delta k} $$
In the formula, $w$ is the shock wave velocity, i.e., the physical velocity of the interface between two different operating states of traffic flow; $q_1$ is the initial traffic flow before the state change; $q_2$ is the target traffic flow after the state change; $k_1$ is the initial traffic density before the state change; $k_2$ is the target traffic density after the state change; $\Delta q$ is the difference in traffic flow before and after the state change; and $\Delta k$ is the difference in traffic density before and after the state change. In the actual physical scenario, to ensure the completeness of the computational logic, when the absolute value of $\Delta k$ in the denominator is less than the preset minimum constant (for example, 0.001), it indicates that there has been no substantial change in the traffic flow state within the spatial area. In this case, the system directly assigns the shock wave velocity $w$ a value of 0 to prevent system crashes caused by division by zero.
[0047] When the traffic light at an intersection is red, vehicles gather behind the stop line, forming a queue. This queuing process creates a stopping shock wave that propagates upstream from the stop line, against the direction of the arriving traffic. Edge computing devices extract the traffic flow and traffic density upstream of the queuing area as the parameters before the state change. They also extract the congestion density within the queuing area as the traffic density after the state change. Since the queued vehicles are stationary, the traffic flow after the state change is 0.
[0048] The stopping shock wave velocity is calculated by substituting the above parameters into the shock wave velocity derivation formula. To determine the specific location of the wavefront, the system does not rely solely on the coordinates of the single vehicle at the extreme position. Instead, it first filters out isolated noise points using a spatial density clustering algorithm, and then iterates through the instantaneous speeds of all vehicles in the lane. Among the valid clusters of vehicles whose instantaneous speeds are lower than the preset stopping speed threshold, the absolute coordinates of the cluster boundary farthest from the stop line are extracted as the physical coordinates of the stopping shock wave front.
[0049] Conversely, when the traffic light at the intersection is green, queued vehicles proceed sequentially across the stop line. The process of these vehicles starting up and crossing the stop line creates a starting shock wave that propagates from the stop line towards the back of the queue. Edge computing devices extract the congestion density within the queue area as the traffic density before the state change. Because the queued vehicles are stationary, the traffic flow before the state change is likewise 0. The edge computing device extracts the saturated traffic flow and the critical traffic density at the location where vehicles accelerate past the stop line as the parameters after the state change. Substituting these parameters into the shock wave velocity derivation formula yields the starting shock wave velocity. Similarly, after excluding outlier data caused by abnormally aggressive or sluggish driving behavior, the system iterates through the instantaneous acceleration of all vehicles in the lane. Among the valid clusters of vehicles whose instantaneous acceleration exceeds the preset starting threshold, the absolute coordinates of the cluster boundary farthest from the stop line are extracted as the physical coordinates of the starting shock wave front.
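The shock wave velocity calculation for both the stopping and the starting wave can be sketched as below. The function name `shock_wave_speed` and the sample values are hypothetical; flow is assumed to be in vehicles per second and density in vehicles per metre, so that a negative result means the wave travels upstream.

```python
def shock_wave_speed(q1: float, k1: float, q2: float, k2: float,
                     eps: float = 0.001) -> float:
    """Velocity of the interface between state (q1, k1) and state (q2, k2).

    Returns 0.0 when the density difference is below eps, which avoids
    division by zero when the traffic state has not really changed.
    """
    dk = k2 - k1
    if abs(dk) < eps:
        return 0.0
    return (q2 - q1) / dk


# Stopping wave: upstream arrivals (0.4 veh/s, 0.03 veh/m) meet a standing queue
# (flow 0, assumed jam density 0.15 veh/m).
w_stop = shock_wave_speed(q1=0.4, k1=0.03, q2=0.0, k2=0.15)

# Starting wave: standing queue discharges at an assumed saturation flow of
# 0.5 veh/s and critical density of 0.05 veh/m.
w_start = shock_wave_speed(q1=0.0, k1=0.15, q2=0.5, k2=0.05)
```

With these illustrative numbers the stopping wave moves upstream at roughly 3.3 m/s and the starting wave at roughly 5 m/s, so the starting wave is the faster of the two, which is the condition under which the queue can dissipate.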
[0050] Regarding the aforementioned thresholds, based on the dynamic characteristics of conventional passenger vehicles, the preset stopping speed threshold is typically between 0 and 2 km/h, and the preset starting (acceleration) threshold is typically between 0.5 and 1.5 m/s². For the parameter calibration of saturated traffic flow and congestion density, those skilled in the art can obtain them based on historical intersection data and fundamental traffic engineering theories. The parameter calibration method is a well-known technique in this field and will not be elaborated upon here.
[0051] Based on the extracted feature parameters, step S104 is executed to obtain the index of the currently executed signal phase. The stopping shock wave velocity, the starting shock wave velocity, the physical coordinates of the stopping shock wave front, the physical coordinates of the starting shock wave front, and the signal phase index are combined to form the state space vector of the current time step. The edge computing device reads the green light release direction number currently output by the signal control device as the signal phase index. Since these physical quantities will be input into the neural network of the deep reinforcement learning model, and their dimensions and value ranges differ significantly, direct input can easily lead to exploding or vanishing gradients during the network weight update process. Therefore, the edge computing device needs to perform dimensionless processing on the extracted physical quantities. The edge computing device normalizes the stopping shock wave velocity and the starting shock wave velocity by dividing them by a preset reference shock wave velocity. The reference shock wave velocity is set to the maximum absolute value of the shock wave velocity that appears in the historical statistical data.
[0052] The edge computing device normalizes the physical coordinates of the stopping shock wave front and the starting shock wave front by dividing them by the total physical length of the lane. The total physical length of the lane is determined based on the actual channelization length parameter of the intersection. The signal phase index is converted to a one-hot encoded format. The edge computing device concatenates the normalized shock wave velocities, the normalized wavefront physical coordinates, and the one-hot encoded signal phase index into a one-dimensional vector, which serves as the state space vector input to the deep reinforcement learning model. In this embodiment, the policy network of the deep reinforcement learning model adopts a multilayer perceptron (MLP) structure, and the dimension of the input layer strictly matches the total number of elements in the state space vector. The state space vector enters the model through the input layer, deep features are extracted through the nonlinear activation mappings of several hidden layers, and the output layer finally outputs the action execution probability distribution of each candidate phase under the current environmental state, thereby realizing the end-to-end mapping from the physical state representation of the environment to traffic control commands.
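The normalization and concatenation described in steps S104 can be sketched as follows. The function name `build_state`, the argument order, and the element order inside the vector are assumptions for illustration; the text only specifies which quantities are combined and how each is normalized.

```python
from typing import List


def build_state(w_stop: float, w_start: float,
                x_stop: float, x_start: float,
                phase_index: int, num_phases: int,
                w_ref: float, lane_length: float) -> List[float]:
    """Dimensionless state vector: 2 velocities, 2 positions, one-hot phase."""
    if not 0 <= phase_index < num_phases:
        raise ValueError("phase_index out of range")
    # One-hot encoding of the currently executed signal phase.
    one_hot = [1.0 if i == phase_index else 0.0 for i in range(num_phases)]
    return [
        w_stop / w_ref,          # normalized stopping shock wave velocity
        w_start / w_ref,         # normalized starting shock wave velocity
        x_stop / lane_length,    # normalized stopping wave front position
        x_start / lane_length,   # normalized starting wave front position
    ] + one_hot
```

With four phases, the resulting vector has eight elements, which would be the input dimension of the network's first layer in this sketch.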
[0053] Step S2 specifically includes the following sub-steps: S201: Extract the physical coordinates of the stopping shock wave front and the starting shock wave front, and calculate the absolute physical distance between them. The stopping shock wave front physical coordinates represent the farthest spatial boundary of the queued vehicles spreading backward, while the starting shock wave front physical coordinates represent the starting spatial boundary of the queued vehicles dissipating forward. The edge computing device obtains the remaining physical length of the queue that has not yet dissipated within the current lane by calculating the absolute value of the difference between the two absolute coordinates in the longitudinal extension direction of the lane. This calculation of absolute physical distance is not only a spatial quantitative indicator for evaluating the congestion dissipation state of the intersection, but also provides an indispensable physical geometric basis for subsequent deduction of traffic shock wave encounter times.
[0054] As a preferred approach, in continuous monitoring of actual physical scenarios, when the physical coordinates of the starting shock wave front have surpassed the physical coordinates of the stopping shock wave front in spatial position, it indicates that the current queue has completely dissipated. In order to avoid negative distance in subsequent calculations, the system forces the absolute physical distance to 0 and sends a queue clearing event flag to the internal bus of the system.
[0055] Based on the calculated residual queue physical length, to further dynamically predict the moment of queue dissipation, step S202 is entered. Combining the starting shock wave velocity and the stopping shock wave velocity, the predicted encounter time between the starting shock wave and the stopping shock wave is calculated. Analysis of shock wave evolution patterns shows that during the green light period, the starting shock wave travels upstream at a higher speed and catches up with the stopping shock wave. The moment at which the two wavefronts meet in space is the moment when the queue completely dissipates. The technical purpose of accurately obtaining this encounter time is to enable the system to predict in advance the critical turning point of the current phase's release efficiency, thereby providing a forward-looking time benchmark for dynamic decision-making. The edge computing device uses the shock wave encounter time derivation formula for calculation, as follows:
$$ T_e = \frac{L_q}{v_r} $$
In the formula, $T_e$ is the predicted encounter time, which represents the theoretical time span required for the current residual queue to completely dissipate under the existing traffic conditions; $L_q$ is the absolute physical distance calculated in step S201; and $v_r$ is the relative approach velocity between the starting shock wave and the stopping shock wave.
[0056] In this embodiment, since both the starting shock wave and the stopping shock wave propagate upstream, the relative approach velocity $v_r$ is equal to the arithmetic difference between the absolute value of the starting shock wave velocity and the absolute value of the stopping shock wave velocity. It should be noted that during dynamic simulations, considering nonlinear fluctuations in traffic flow and extreme conditions such as downstream overflow, if the starting shock wave velocity is less than or equal to the stopping shock wave velocity, the relative approach velocity in the denominator becomes negative or 0, indicating that the queue is worsening and cannot dissipate on its own in the short term. To ensure the integrity of the computational logic and prevent system crashes caused by division by zero, the system assigns a preset upper-limit value (e.g., 999 s) to the predicted encounter time. This value serves as a feature signal warning the deep reinforcement learning model that the current phase has reached its physical bottleneck in terms of operational efficiency.
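Steps S201 and S202 can be combined into one small function, sketched below. The name `predicted_encounter_time` is hypothetical, and the sketch assumes that wavefront coordinates are expressed as distances upstream from the stop line, so that the starting front has "surpassed" the stopping front when its coordinate is the larger of the two. The order of the two guards is also an assumption, since the text does not say which takes precedence.

```python
def predicted_encounter_time(x_stop_front: float, x_start_front: float,
                             w_start: float, w_stop: float,
                             t_limit: float = 999.0) -> float:
    """Time in seconds until the starting wave catches the stopping wave."""
    # Residual queue length; forced to 0 once the starting front has passed
    # the stopping front (queue fully dissipated).
    distance = max(x_stop_front - x_start_front, 0.0)
    # Both waves travel upstream, so the closing speed is the difference
    # of their magnitudes.
    closing_speed = abs(w_start) - abs(w_stop)
    if closing_speed <= 0.0:
        # Queue is not shrinking: return the preset upper-limit value.
        return t_limit
    return distance / closing_speed
```

For example, a stopping front 80 m upstream, a starting front 20 m upstream, and wave speeds of 5 m/s and 3 m/s give a predicted encounter time of 30 s.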
[0057] After obtaining the predicted encounter time, the key to achieving dynamic adaptive control lies in how to reasonably convert it into the triggering cycle of the model's control commands. Therefore, step S203 is executed: the predicted encounter time is set as the dynamic time step, the system's maximum and minimum allowable time steps are read, and boundary constraint processing is performed on the dynamic time step using these two limits, thereby determining the trigger time of the deep reinforcement learning model's next decision. If the deep reinforcement learning model uses a traditional fixed time step for action decision-making, the control command cycle often becomes seriously disconnected from the actual physical evolution of the traffic flow. Setting the dynamic time step based on the shock wave dissipation mechanism aims to drive the model to perform action evaluation at the critical point where the queue has just dissipated, thereby maximizing the overall traffic efficiency of the intersection. The edge computing device reads the system's maximum and minimum allowable time steps pre-stored in local memory and uses upper and lower limit functions to perform boundary constraint processing on the dynamic time step.
The specific limit logic is as follows: when the calculated dynamic time step is lower than the minimum allowable time step, the system uses the minimum allowable time step as the trigger interval, to avoid mechanical wear of the signal control hardware and driver visual fatigue caused by high-frequency invalid decisions; when the dynamic time step exceeds the system's maximum allowable time step, the system truncates it to the maximum allowable time step, to prevent a single phase from being released continuously for too long, which could lead to deadlock in other conflicting directions at the intersection or induce illegal behaviors such as drivers running red lights; if the dynamic time step lies between the two limits, it is retained unchanged.
[0058] Regarding the calibration of the aforementioned control boundary parameters, based on the general signal timing patterns at urban road intersections and drivers' psychological expectations, the minimum permissible time step is typically set between 5 and 10 seconds, while the maximum permissible time step is typically set between 45 and 60 seconds. The edge computing device adds the current system's absolute timestamp to the calculated, limited dynamic time step, ultimately determining the precise time node for the deep reinforcement learning model to wake up next, perform state observation, and execute action inference.
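The boundary constraint and wake-up time calculation amount to a clamp followed by an addition, as the following sketch shows. The function name `next_decision_time` is hypothetical, and the default limits of 5 s and 60 s are simply picked from the ranges quoted in the text.

```python
def next_decision_time(now_s: float, dynamic_step_s: float,
                       min_step_s: float = 5.0,
                       max_step_s: float = 60.0) -> float:
    """Absolute timestamp (seconds) of the model's next wake-up."""
    # Clamp the dynamic time step into [min_step_s, max_step_s].
    step = min(max(dynamic_step_s, min_step_s), max_step_s)
    return now_s + step
```

A predicted encounter time of 30 s is kept as is, a value of 2 s is raised to 5 s, and the sentinel value of 999 s is truncated to 60 s.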
[0059] Edge computing devices read the system's maximum safe observation period and default time step pre-stored in local memory, and use upper and lower limit functions to perform boundary constraint processing on the dynamic time step. The specific limit logic is as follows: when the calculated dynamic time step is lower than the default time step, the system uses the default time step as the trigger interval, to avoid mechanical wear of the signal control hardware and driver fatigue caused by high-frequency invalid decisions. When the dynamic time step exceeds the system's maximum safe observation period, the system truncates it to the maximum safe observation period, to prevent an excessively long continuous release time for a single phase, which could lead to deadlock in other conflicting directions at the intersection or induce illegal behaviors such as running red lights. If the dynamic time step lies between the two limits, the system retains and uses it unchanged.
[0060] Regarding the calibration of the aforementioned control boundary parameters, based on the general signal timing patterns at urban road intersections and drivers' psychological expectations, the default time step is typically set between 5 and 10 seconds, and the system's maximum safe observation period is usually set between 45 and 60 seconds. The edge computing device adds the current system's absolute timestamp to the calculated, limited dynamic time step, thereby determining the precise time node at which the deep reinforcement learning model will next wake up, perform state observation, and execute action inference.
Step S3 specifically includes the following sub-steps: In step S301, the state space vector generated in step S1 for the current time step is input into a pre-deployed deep reinforcement learning model. The model outputs action commands for the traffic signals through forward propagation inference. To establish an effective mapping from micro-level traffic conditions to macro-level control strategies, the deep reinforcement learning model in this embodiment specifically adopts a deep Q-network architecture. The state space vector, as input data, undergoes nonlinear feature extraction through the model's input layer, multiple fully connected hidden layers, and activation functions (such as the ReLU function), ultimately generating an action value function value for each candidate action at the output layer. Considering the physical constraints of intersection phase sequence control, the system's action space is defined as a discrete binary variable comprising two actions: maintaining the current signal phase and switching to the next signal phase. During the action selection phase, to balance the model's exploration of unknown traffic conditions with the exploitation of known optimal strategies, the edge computing device does not simply select the maximum value but employs a dynamic probability-based balancing strategy to select actions.
[0061] Specifically, the system maintains a dynamically updated random exploration probability threshold. At each decision execution, the system first generates a random floating-point number between 0 and 1. If the generated random floating-point number is less than or equal to the current random exploration probability threshold, the system ignores the current action value assessment result and randomly selects an action dimension to execute, in order to fully explore the unknown traffic environment. Conversely, if the generated random floating-point number is greater than the current random exploration probability threshold, the system strictly selects the action dimension with the highest current action value function value as the action instruction to be executed at the current time step. Furthermore, to ensure the stability and convergence of the model's strategy in the later stages of training, the aforementioned random exploration probability threshold gradually decays from an initial high probability state to a very small constant approaching 0 as the number of training rounds increases, according to a preset decay function.
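The probability-based selection described above is the familiar epsilon-greedy rule, which can be sketched as follows. The function names are hypothetical, and because the text only mentions "a preset decay function", the exponential decay used here (with its start value, floor, and decay rate) is an assumption chosen for illustration.

```python
import random
from typing import Sequence


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: random.Random) -> int:
    """Explore with probability epsilon, otherwise pick the highest-value action."""
    if rng.random() <= epsilon:
        # Exploration: ignore the value estimates and choose at random.
        return rng.randrange(len(q_values))
    # Exploitation: index of the largest action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(episode: int, eps_start: float = 1.0,
                    eps_min: float = 0.01, decay: float = 0.995) -> float:
    """Assumed exponential decay of the exploration threshold, floored at eps_min."""
    return max(eps_min, eps_start * decay ** episode)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; in a two-action setting, index 0 could stand for "maintain" and index 1 for "switch".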
[0062] S302 parses the action command and sends the corresponding phase drive signal to the traffic signal controller, while introducing a safety clearing mechanism for traffic flow conflict areas. After obtaining the action command output by the model, it must be physically converted and issued in accordance with actual traffic safety regulations. Considering the physical inertia and braking limits of vehicles passing through intersections, when the action command output by the model is to maintain the current signal phase, the edge computing device directly maintains the current green light state without changing the underlying hardware level.
[0063] However, when the action command is to switch to the next signal phase, the system strictly prohibits abruptly changing the signal light color. Instead, it mandates inserting a yellow light transition time and a full-red safety time between the two conflicting phases. As a preferred method, the yellow light transition time is typically set to 3 to 5 seconds based on the intersection's speed limit, and the full-red safety time is set to 1 to 3 seconds. Only after the safety clearance phase has ended can the traffic signal controller activate the green light drive relay for the next phase, thus achieving a smooth transfer of control while ensuring the absolute safety of traffic participants.
[0064] S303 quantitatively evaluates the optimization effect of the selected action command on the intersection's traffic efficiency and calculates the immediate reward value for the current decision cycle. After the control command is executed and the dynamic time step determined in step S2 has elapsed, the system environment will have evolved to a new traffic state. The optimization objective of the reinforcement learning model is to maximize the long-term cumulative reward; therefore, the construction of the immediate reward function must accurately reflect the physical causal relationship between the control action and the degree of traffic congestion relief. The core technical purpose of extracting and calculating the immediate reward value is to provide accurate gradient descent guidance for the subsequent online fine-tuning of the model parameters. Edge computing devices use the reward calculation formula to evaluate the merits of the current action. The reward calculation formula is as follows:
$$ R_t = -\left(\alpha L_{\Sigma} + \beta P_{sw}\right) $$
In the formula, $R_t$ is the immediate reward value at the current time step; the smaller its absolute value, the better the overall operating status of the intersection. $L_{\Sigma}$ is the sum of the physical lengths of the remaining queues of all approach lanes at the intersection, i.e., the sum of the absolute physical distances of each lane calculated in step S201, which represents the overall congestion scale of the system. $P_{sw}$ is the phase-switching penalty: its value is 0 when the action command is to maintain the current signal phase, and 1 when the action command is to switch to the next signal phase. This term is used to suppress the non-productive loss of green time caused by high-frequency phase switching by the model. $\alpha$ is the weighting coefficient for queue length, and $\beta$ is the weighting coefficient for the switching penalty.
[0065] In practical engineering applications, the specific values of the weighting coefficients directly determine the control tendency of the model. If the weighting coefficient for queue length is too high, the model may lock up other lanes while clearing one lane; if the weighting coefficient for the switching penalty is too high, the model tends not to switch phases for extended periods. Therefore, those skilled in the art typically calibrate the parameters based on the actual saturation level of the intersection. The weighting coefficient for queue length is set, for example, between 0.5 and 1.0, and the weighting coefficient for the switching penalty is set, for example, between 10 and 20.
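The immediate reward can be sketched as a weighted cost with a negative sign. The sign is an inference: the text states that a smaller absolute value indicates a better operating status while the model maximizes reward, which is consistent with a non-positive reward. The function name `immediate_reward` and the default weights (taken from the quoted ranges) are assumptions.

```python
from typing import Sequence


def immediate_reward(queue_lengths_m: Sequence[float], switched: bool,
                     w_queue: float = 0.5, w_switch: float = 10.0) -> float:
    """Negative weighted sum of total residual queue length and switch penalty."""
    total_queue = sum(queue_lengths_m)         # sum over all approach lanes
    switch_penalty = 1.0 if switched else 0.0  # 1 only when the phase was switched
    return -(w_queue * total_queue + w_switch * switch_penalty)
```

With residual queues of 40 m and 20 m, holding the phase yields a reward of -30, while switching yields -40; under these illustrative weights, a switch is therefore worthwhile only if it is expected to shorten the total queue by more than 20 m.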
[0066] In step S304, experience data containing the state space vector, the action command, the immediate reward value, and the state space vector of the next time step is stored in the experience replay pool, and the network weights of the deep reinforcement learning model are periodically updated. To ensure the model's ability to self-evolve and adapt to non-stationary traffic flow, the edge computing device combines these four variables into a complete Markov experience tuple. When the amount of data in the experience replay pool reaches a preset batch sample threshold, the system randomly samples several experience tuples to form a training mini-batch. During this training process, to break the temporal correlation between data and stabilize the training objective, the deep Q-network is divided into a current evaluation network and a target network with identical structures. The deviation between the target action value output by the target network and the predicted action value output by the current evaluation network is calculated using the mean squared error loss function. Then, the backpropagation algorithm is executed through a preset optimizer (e.g., the adaptive moment estimation (Adam) optimizer) to fine-tune the weights and bias parameters of each hidden layer of the current evaluation network along the direction of gradient descent of the loss function.
[0067] Furthermore, to avoid training oscillations and divergences, the system directly copies or proportionally updates the weight parameters of the current evaluation network to the target network every preset update cycle (e.g., 500 to 1000 time steps). This training step ensures that the system can continuously correct its control strategy during long-term online operation, thereby achieving end-to-end closed-loop evolution from single physical state feedback to globally optimal traffic signal timing.
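The numerical core of this training step (target computation, loss, and target network synchronization) can be sketched without a deep learning framework. Everything here is illustrative: the function names are hypothetical, the discount factor `gamma` is a standard deep Q-network ingredient that the text does not state explicitly, terminal states are not handled, and network weights are represented as flat lists of floats.

```python
from typing import List, Sequence


def td_targets(rewards: Sequence[float],
               next_q_target: Sequence[Sequence[float]],
               gamma: float = 0.99) -> List[float]:
    """Target values: reward plus discounted best next value from the target network."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_target)]


def mse_loss(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between the evaluation network's values and the targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


def sync_target(online: Sequence[float], target: Sequence[float],
                tau: float = 1.0) -> List[float]:
    """tau = 1.0 copies the weights directly; 0 < tau < 1 is a proportional update."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]
```

In a full implementation, the gradient of the loss with respect to the evaluation network's weights would be computed by backpropagation and applied by the optimizer; the sketch only shows the quantities that feed that step.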
[0068] Step S4 specifically includes the following sub-steps: S401, the system performs a sliding update of the state space and synchronizes the system clock to drive the continuous closed-loop operation of the traffic signal control. After completing the action value assessment and network weight fine-tuning in step S3, the system needs to reset the initial environmental parameters for the next intelligent decision.
[0069] In practice, the edge computing device directly assigns and overwrites the state space vector of the next time step obtained in step S304 as the state space vector of the current time step, thereby completing the step transition of the Markov chain in the logical storage space. Simultaneously, the internal logic control module suspends the current inference process and, based on the truncated dynamic time step size in step S203, advances the system master clock by the corresponding safe time interval. During this clock suspension period, the underlying traffic signal controller continuously maintains the physical drive level issued in step S302 until the passage of real physical time completely coincides with the preset wake-up node of the system master clock. At this point, the system will jump back to step S1, triggering a new round of microscopic trajectory data perception and state vector construction.
[0070] S402: during continuous closed-loop iteration, the system monitors the spatial saturation of each approach lane at the intersection in real time and triggers a safety degradation and round reset mechanism under extreme conditions. Because deep reinforcement learning models are prone to deviating from expected control strategies during the exploration phase or when facing sudden traffic flow distortions (such as lane closures caused by traffic accidents), deterministic safety fallback logic must be introduced in actual physical deployments. To quantitatively define the boundary safety state of the physical environment, the edge computing device uses a spatial saturation calculation formula to assess real-time congestion risk, as follows: $\eta = L_{res} / L_{lane}$; in the formula, $\eta$ is the lane space saturation, which characterizes the degree of spatial redundancy in a lane for accommodating queuing vehicles; $L_{res}$ is the physical length of the remaining queue that has not yet dissipated within the current lane, calculated in step S201; and $L_{lane}$ is the total physical length of the lane, determined from the intersection channelization parameters.
[0071] During real-time evaluation, the edge computing device compares the calculated lane space saturation with a preset overflow threshold. In this embodiment, based on traffic engineering overflow control theory, the preset overflow threshold is typically set between 0.85 and 0.95. When the lane space saturation of any approach lane exceeds this overflow threshold for several consecutive dynamic time steps, the intersection is on the verge of deadlock. At this time, the system forcibly interrupts the online control of the deep reinforcement learning model and calls the safety fallback scheme pre-stored in the controller to forcibly clear and disperse the overflowing queue. After the space saturation of all lanes at the intersection falls back to the safe baseline (e.g., below 0.3), the system determines that the current reinforcement learning round has terminated, clears the residual abnormal trajectories in the experience replay pool, and reinitializes the traffic environment state. Subsequently, the system clears the temporary state space vector at the current moment and requires the sensing device to re-collect microscopic trajectory data for an entire default time step to reconstruct the initial state space vector. Only after the physical environment and the model input have been realigned in this way is control by the deep reinforcement learning model restored.
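The overflow check and round reset of step S402 amount to a small state machine with hysteresis. The sketch below is illustrative only: the class name `OverflowGuard`, the return labels, and `CONSECUTIVE_STEPS = 3` (the text says only "several") are assumptions, while the threshold and baseline are taken from the ranges given above.

```python
from dataclasses import dataclass, field

OVERFLOW_THRESHOLD = 0.90  # within the 0.85-0.95 range given in the text
SAFE_BASELINE = 0.30       # recovery baseline from the text
CONSECUTIVE_STEPS = 3      # assumed; the text only says "several"


def lane_saturation(residual_queue_m, lane_length_m):
    """Lane space saturation: residual queue length over total lane length."""
    return residual_queue_m / lane_length_m


@dataclass
class OverflowGuard:
    fallback_active: bool = False
    streaks: dict = field(default_factory=dict)

    def update(self, saturations):
        """saturations maps lane id to saturation.

        Returns 'drl' (model in control), 'fallback' (pre-stored scheme in
        control) or 'reset' (round terminated, state must be rebuilt).
        """
        if self.fallback_active:
            # Stay in fallback until every lane is back under the baseline.
            if all(s < SAFE_BASELINE for s in saturations.values()):
                self.fallback_active = False
                self.streaks.clear()
                return "reset"
            return "fallback"
        for lane, s in saturations.items():
            prev = self.streaks.get(lane, 0)
            self.streaks[lane] = prev + 1 if s > OVERFLOW_THRESHOLD else 0
        if any(n >= CONSECUTIVE_STEPS for n in self.streaks.values()):
            self.fallback_active = True
            return "fallback"
        return "drl"
```

On a `'reset'` result, the caller would clear the abnormal trajectories from the experience replay pool and re-collect one default time step of trajectory data before handing control back to the model.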
[0072] S403 periodically performs persistent storage of model parameters and policy freezing operations to ensure the adaptability and recoverability of the online learning system. During long-term online fine-tuning, deep reinforcement learning models are prone to policy forgetting or performance collapse due to the continuous influence of abnormal sensor noise or biased data samples. To prevent network weight divergence caused by long-term operation, the edge computing device has a dedicated non-volatile storage module and is equipped with a background model performance monitoring program. The system accumulates and records the sum of the instantaneous reward values obtained in each complete reinforcement learning round, forming a round-cumulative reward metric.
[0073] Whenever the system has run online for a preset storage period (e.g., every 24 hours or 100 decision rounds), it applies moving average smoothing to the cumulative reward metrics of recent rounds. To filter out disturbances caused by single random anomalies, the statistical window for the moving average typically covers the most recent 10 to 20 reinforcement learning rounds. If the current moving average reward value is higher than the system's historical best record, the edge computing device serializes and packages all weight parameters, bias parameters, and optimizer states of the current deep Q-network and saves them to the local non-volatile storage module, overwriting the previous copy and forming an optimal policy checkpoint. If a significant degradation in the moving average reward value is detected during subsequent operation (e.g., the reward value is lower than 50% of the historical best record for several consecutive rounds), the system proactively triggers an anomaly recovery mechanism, retrieving the most recently saved optimal policy checkpoint from the local non-volatile storage module to roll back the parameters. This mechanism prevents the system from falling into an irreversible, inefficient timing state and helps ensure the continuous and reliable operation of city-level traffic control infrastructure.
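The checkpoint and rollback rule of step S403 can be sketched as follows. This is a simplified illustration under stated assumptions: the class name `CheckpointManager` is hypothetical, the check runs at the end of every round rather than once per storage period, `patience` stands in for "several consecutive rounds", parameters are kept in memory instead of being serialized to storage, and rewards are assumed positive so that "50% of the best record" is meaningful.

```python
import copy
from collections import deque


class CheckpointManager:
    def __init__(self, window=10, degrade_ratio=0.5, patience=3):
        self.rewards = deque(maxlen=window)  # moving-average window (10-20 rounds)
        self.degrade_ratio = degrade_ratio   # 50% of the historical best
        self.patience = patience             # assumed value for "several" rounds
        self.best_avg = None
        self.best_params = None
        self.bad_rounds = 0

    def end_of_round(self, cumulative_reward, params):
        """Return (decision, params_to_use); decision is save, rollback or keep."""
        self.rewards.append(cumulative_reward)
        avg = sum(self.rewards) / len(self.rewards)
        if self.best_avg is None or avg > self.best_avg:
            # New historical best: overwrite the optimal policy checkpoint.
            self.best_avg = avg
            self.best_params = copy.deepcopy(params)
            self.bad_rounds = 0
            return "save", params
        if avg < self.degrade_ratio * self.best_avg:
            self.bad_rounds += 1
        else:
            self.bad_rounds = 0
        if self.bad_rounds >= self.patience:
            # Sustained degradation: roll back to the saved checkpoint.
            self.bad_rounds = 0
            return "rollback", copy.deepcopy(self.best_params)
        return "keep", params
```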
[0074] Step S5 specifically includes the following sub-steps: S501 extracts the optimal policy checkpoints stored locally on the edge computing device and synchronously uploads the model weight parameters and intersection operation feature vectors to the cloud management platform. Reinforcement learning models for single-point intersections are prone to overfitting to specific time periods or local traffic flow patterns over long-term evolution, often resulting in lag in response to large-scale traffic shifts across the global road network. Therefore, the system introduces a cloud-edge collaborative architecture for unified scheduling of global policies.
[0075] Specifically, after the edge computing device generates and saves a new optimal strategy checkpoint in step S403, it transmits the weight matrix of the current deep Q-network and the feature vector representing the intersection profile to the cloud via a secure and encrypted communication link. This feature vector encompasses the intersection's average traffic flow, average lane space saturation, and effective green light utilization rate during the observation period. Uploading these statistical parameters not only avoids the privacy and compliance risks associated with the outgoing transmission of underlying microscopic vehicle trajectory data, but also provides quantitative data support for the cloud to evaluate the model generalization capabilities of each node.
[0076] S502: the cloud-based management platform receives local model weight parameters from multiple adjacent intersections and constructs a globally shared model suitable for regional arterial roads based on a federated averaging algorithm. In the cloud server's computational logic, the system does not mechanically take an arithmetic average of the intersection parameters, but performs a weighted fusion based on the traffic load importance of each intersection in the road network. The core technical objective of this operation is to enable adjacent intersections to share underlying control experience in dealing with sudden congestion. The cloud-based management platform calculates the global weight using a weighted fusion formula, as follows: $W_{global} = \sum_{i=1}^{K} \alpha_i W_i$; in the formula, $W_{global}$ is the weight matrix of the fused globally shared model; $W_i$ is the local model weight matrix uploaded by the edge computing device at the $i$-th adjacent intersection; $\alpha_i$ is the fusion weight coefficient assigned by the system to the $i$-th adjacent intersection; and $K$ is the number of participating intersections. The fusion weight coefficients are calibrated as the proportion of each intersection's average traffic flow, taken from its feature vector, to the total regional traffic flow. This allocation means that key intersections with higher traffic volumes have a stronger influence on the global model. As numerical constraints, each $\alpha_i$ is strictly limited to between 0 and 1, and the fusion weight coefficients of all participating intersections always sum to 1.
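The flow-proportional fusion of step S502 can be illustrated with a short NumPy sketch. The function names are hypothetical, and each local model is reduced to a single weight matrix for brevity.

```python
import numpy as np


def fusion_coefficients(avg_flows):
    """Each intersection's share of the total regional traffic flow."""
    flows = np.asarray(avg_flows, dtype=float)
    return flows / flows.sum()


def fuse_weights(local_weights, coeffs):
    """Weighted sum of the local weight matrices uploaded by each intersection."""
    fused = np.zeros_like(local_weights[0], dtype=float)
    for alpha, w in zip(coeffs, local_weights):
        fused += alpha * w
    return fused
```

For example, three intersections with average flows of 600, 300 and 100 vehicles per hour receive coefficients 0.6, 0.3 and 0.1, which lie in [0, 1] and sum to 1 as the text requires.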
[0077] S503 distributes the globally shared model to each edge computing device and uses a smoothing factor together with the local optimal strategy to perform a soft update and fusion of the parameters. If the cloud-based global model directly overwrote the local model, the control strategy of a single intersection could fluctuate drastically and deviate from actual traffic conditions. To avoid this, each edge computing device, after receiving the globally shared model from the cloud, selectively retains parameters based on the spatiotemporal characteristics of local traffic flow. The system calculates a new round of local deployment weights to balance global collaboration and local optimization, using the following update formula: $W_{new} = \beta W_{global} + (1 - \beta) W_{local}$; in the formula, $W_{new}$ is the fused and updated weight matrix of the deep Q-network actually deployed on the edge computing device; $W_{global}$ is the globally shared model weight matrix distributed from the cloud; $W_{local}$ is the optimal strategy model weight matrix stored locally on the edge computing device; and $\beta$ is a preset global smoothing factor. The value of this smoothing factor directly determines the degree to which the intersection accepts regional coordination strategies. Considering the actual road network topology, when the intersection lies on a coordinated phase path of an urban arterial road, the model needs to conform more closely to arterial control, and $\beta$ is typically between 0.6 and 0.8; conversely, when the intersection lies on a marginal branch road or is a strongly independent, irregular intersection, the system relies more on adaptive adjustment to the local environment, and $\beta$ is usually between 0.1 and 0.3.
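The soft update of step S503 is a convex combination of the global and local weights. A minimal sketch, with the hypothetical function name `soft_update` and the smoothing factor passed as `beta`:

```python
import numpy as np


def soft_update(global_w, local_best_w, beta):
    """Blend the cloud model with the local optimal checkpoint.

    beta near 1 follows the regional model (arterial intersections);
    beta near 0 keeps the local strategy (independent intersections).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * np.asarray(global_w) + (1.0 - beta) * np.asarray(local_best_w)
```

With `beta = 0.7`, for instance, a deployed weight is 70% global and 30% local, which matches the arterial range given in the text.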
[0078] S504, combining the predicted meeting time and dynamic time step output from step S2, constructs a dynamic phase difference coordination mechanism based on shock wave dissipation between adjacent intersections. Traditional green wave timing often relies on fixed vehicle cruising speeds, ignoring the dynamic spread and dissipation process of queuing shock waves at downstream intersections, which can easily lead to ineffective green light operation or secondary stops of queuing vehicles.
[0079] In this embodiment, the edge computing device at the upstream intersection transmits the physical coordinates of the starting shock wave front and the corresponding shock wave velocity of the current green phase to the downstream intersection in real time via the roadside communication unit. Upon receiving these state parameters, the downstream intersection calculates the theoretical time at which the upstream platoon will reach the downstream stop line, based on the physical distance between the starting shock wave front and the downstream stop line and on the calibrated average cruising speed. It then combines this arrival time with its own predicted meeting time for the complete dissipation of its remaining queue, and uses the absolute system clock time of that meeting time as the green start reference point for its coordinated phase. Through this vehicle-road cooperative communication mechanism based on fluid physical boundaries, the system can ensure that when the dense platoon released upstream reaches the downstream stop line, the remaining downstream queue has just been cleared. This process compensates for the difficulty deep reinforcement learning models have in converging when the multi-agent action space dimension is too large, thus achieving the transition from single-point adaptive control to arterial shock wave cooperative control over the continuous evolution of multiple Markov decision cycles.
[0080] Step S6 specifically includes the following sub-steps: S601 extracts microscopic vehicle trajectories within the observation period and calculates the global average delay time at the intersection to quantitatively evaluate the actual operational efficiency of the adaptive signal timing strategy. Once regional collaborative control has been achieved, visually demonstrating the system's optimization results to traffic management departments and forming a regulatory closed loop are essential steps for project implementation. A queue length indicator alone is significantly affected by vehicle length and the mix of vehicle types, and fails to reflect the time cost of vehicle stoppage comprehensively. Therefore, the edge computing device uses the vehicle identification codes recorded by the sensing devices to track the complete passage record of each vehicle within the intersection's sensing area. The system uses a global average delay formula to evaluate traffic efficiency, as follows: $D = \frac{1}{N} \sum_{i=1}^{N} (t_i^{act} - t_i^{free})$; in the formula, $D$ is the global average delay time at the intersection, and the lower this value, the better the current signal timing scheme matches the actual traffic demand; $\sum_{i=1}^{N}$ denotes summation over vehicles 1 to $N$; $N$ is the total number of valid vehicles that successfully left the intersection stop line within the current observation period; $t_i^{act}$ is the actual travel time of the $i$-th vehicle, that is, the time difference between the vehicle's front bumper crossing the upstream physical boundary of the sensing area and the vehicle leaving the stop line; and $t_i^{free}$ is the theoretical free-flow time of the $i$-th vehicle. The theoretical free-flow time $t_i^{free}$ is obtained by taking the actual physical distance between the upstream physical boundary of the sensing area and the stop line and dividing it by the legal speed limit for that road segment.
Furthermore, to ensure stable operation, if no vehicles enter during the observation period, or all vehicles are trapped in the congestion queue, so that the total number of valid vehicles $N$ is 0, the system truncates the output directly and sets the global average delay time $D$ to the preset maximum system waiting time limit.
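The delay statistic of step S601, including the zero-vehicle fallback, can be written in a few lines. The function and parameter names are illustrative; a single sensing-area length and speed limit are assumed for all vehicles, so every vehicle shares the same free-flow time.

```python
def global_average_delay(travel_times_s, sensing_length_m, speed_limit_mps, max_wait_s):
    """Mean of (actual travel time - free-flow time) over valid vehicles.

    Falls back to the preset maximum waiting time when no valid vehicle
    left the stop line during the observation period.
    """
    n = len(travel_times_s)
    if n == 0:
        return max_wait_s
    free_flow_s = sensing_length_m / speed_limit_mps
    return sum(t - free_flow_s for t in travel_times_s) / n
```

For a 150 m sensing area and a 15 m/s speed limit the free-flow time is 10 s, so travel times of 30 s, 40 s and 50 s give delays of 20 s, 30 s and 40 s and a global average delay of 30 s.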
[0081] In step S602, the calculated traffic operation parameters and the state of the deep reinforcement learning model are structurally encapsulated and pushed to the visual traffic management platform via a network communication bus. After obtaining the above-mentioned accurate quantitative evaluation indicators, the system needs to convert them into human-readable control charts. A lightweight data publishing component is deployed in the edge computing device, which serializes the absolute physical distance calculated in step S2, the real-time action commands and instant reward values output in step S3, and the global average delay time output in step S601 according to a unified JSON data format.
[0082] Subsequently, the system pushes the serialized data to the remote server at a fixed refresh frequency using a publish-subscribe communication protocol. Upon receiving the data stream, the traffic management platform's graphical user interface renders, in real time, queue length lines at the intersection, green light countdown timers, and delay time heatmaps. When the monitoring backend finds that the global average delay time exceeds the preset severe congestion alarm threshold for multiple consecutive observation periods, it automatically generates a delay alarm log in a user interface pop-up window, prompting on-duty personnel to intervene manually or to provide on-site traffic control.
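The serialization and alarm logic of step S602 can be sketched as below. The JSON field names, the function names, and the idea of keeping a list of recent delay values are assumptions made for illustration; the text specifies only that the data are serialized in a unified JSON format and that an alarm follows several consecutive threshold violations. The publish-subscribe transport itself is left out.

```python
import json


def build_payload(front_gap_m, action, reward, avg_delay_s, timestamp_s):
    """Serialize one decision cycle; the field names are hypothetical."""
    return json.dumps(
        {
            "timestamp_s": timestamp_s,
            "shock_front_gap_m": front_gap_m,  # absolute distance from step S2
            "action": action,                  # action command from step S3
            "reward": reward,                  # instant reward from step S3
            "avg_delay_s": avg_delay_s,        # global average delay from S601
        },
        sort_keys=True,
    )


def delay_alarm(delays_s, threshold_s, consecutive):
    """True when the last `consecutive` observation periods all exceed the threshold."""
    recent = delays_s[-consecutive:]
    return len(recent) == consecutive and all(d > threshold_s for d in recent)
```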
[0083] S603 establishes a hardware-level heartbeat monitoring mechanism between the edge computing device and the traffic signal controller, triggering physical degradation protection in the event of communication interruption. Beyond software-level indicator monitoring and data display, the continuous availability of the underlying physical hardware is the cornerstone of the entire control system's operation. Because of harsh roadside electromagnetic interference or aging network cables, the edge computing device is prone to command delays or hardware crashes. To prevent the traffic lights from remaining off for extended periods or being stuck in an all-red state due to loss of control, the edge computing device is configured to send a timestamped handshake probe packet to the underlying traffic signal controller every preset heartbeat cycle. Upon receiving the probe packet, the traffic signal controller's microcontroller must return a response signal within an acceptable fault tolerance window. If the traffic signal controller's internal timer detects that more than a preset number of consecutive heartbeat probe packets from the edge computing device have been missed, the controller forcibly disconnects its external control interface to the edge device. At this point, the underlying hardware relay automatically trips and takes over control, switching seamlessly to a single-point, multi-period fixed-time scheme pre-stored in the controller. Only after the physical communication link is re-established and the heartbeat packets resume continuous and stable transmission does the system allow the edge computing device to apply for and regain signal control rights at the intersection, thereby ensuring the operational safety of the urban traffic control infrastructure at the lowest level.
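The controller-side watchdog of step S603 can be modelled as a two-state machine driven once per heartbeat cycle. This is a software sketch of logic the text places in hardware; the class name, the mode labels, and the `stable_needed` count used to judge that heartbeats have "resumed continuous and stable transmission" are assumptions.

```python
class HeartbeatWatchdog:
    def __init__(self, max_missed=3, stable_needed=5):
        self.max_missed = max_missed        # preset number of tolerated misses
        self.stable_needed = stable_needed  # assumed recovery criterion
        self.missed = 0
        self.stable = 0
        self.mode = "edge"                  # "edge" or "fallback"

    def tick(self, heartbeat_received):
        """Call once per heartbeat cycle; returns the controlling mode."""
        if heartbeat_received:
            self.missed = 0
            if self.mode == "fallback":
                self.stable += 1
                if self.stable >= self.stable_needed:
                    # Link is stable again: return control to the edge device.
                    self.mode = "edge"
                    self.stable = 0
        else:
            self.stable = 0
            self.missed += 1
            if self.missed > self.max_missed:
                # Too many consecutive misses: pre-stored fixed-time plan takes over.
                self.mode = "fallback"
        return self.mode
```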
[0084] Specific application example: referring to Figure 3, the dynamic traffic signal control system and method based on deep reinforcement learning provided by this invention may include the following feature parameter extraction and control calculation process in a specific application embodiment. In Figure 3, the horizontal axis represents time in seconds, ranging from 0 to 100, and the vertical axis represents the dynamic time step in seconds, ranging from 0 to 60. The system continuously acquires physical signals while implementing target phase release control at the intersection. The sensing devices acquire the absolute coordinate sequences and instantaneous speed sequences of traffic participants within the detection area at a set sampling frequency. The edge computing device sets the length of the longitudinal discrete spatial statistical interval for lanes to 15 m. It reads the instantaneous speed sequences of vehicles within the spatial statistical interval and calculates the spatial average speed. It calculates the local traffic density within the current spatial statistical interval by dividing the total number of vehicles by the interval length. At the same time, the edge computing device extracts the traffic flow difference and the traffic density difference across adjacent abrupt-change boundaries, divides the flow difference by the density difference, and obtains a stopping shock wave speed of 2 km/h (approximately 0.56 m/s) and a starting shock wave speed of 18 km/h (approximately 5.00 m/s).
[0085] The dynamic decision step size generation module of the edge computing device reads the actual physical state through the sensors. The module records the physical coordinate of the starting shock wave front as 15 m and that of the stopping shock wave front as 65 m, and calculates the absolute physical distance between the two fronts as 50 m. It then takes the arithmetic difference between the absolute values of the starting and stopping shock wave speeds, yielding a relative approach speed of 16 km/h (approximately 4.44 m/s). Finally, the module substitutes the absolute physical distance and the relative approach speed into the shock wave encounter time formula and divides the former by the latter, giving a predicted encounter time of approximately 11.2 s. This 11.2 s is then directly set as the dynamic time step.
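The arithmetic of this example can be checked with a short sketch. The function names are hypothetical. The 5 s lower bound in `clamp_step` is the minimum allowable time step quoted for Figure 3, while the 60 s upper bound is an assumption, since the maximum allowable time step is not stated in this passage. The exact quotient is 50 m / (16 km/h) = 11.25 s, which the text reports as 11.2 s.

```python
KMH_TO_MPS = 1000.0 / 3600.0


def encounter_time_s(start_front_m, stop_front_m, start_speed_kmh, stop_speed_kmh):
    """Time for the starting shock wave to catch up with the stopping shock wave."""
    gap_m = abs(stop_front_m - start_front_m)
    closing_mps = (abs(start_speed_kmh) - abs(stop_speed_kmh)) * KMH_TO_MPS
    if closing_mps <= 0:
        return float("inf")  # the starting wave never catches up
    return gap_m / closing_mps


def clamp_step(t_s, t_min_s=5.0, t_max_s=60.0):
    """Boundary constraint on the dynamic time step (upper bound is assumed)."""
    return max(t_min_s, min(t_s, t_max_s))
```

Calling `encounter_time_s(15.0, 65.0, 18.0, 2.0)` reproduces the worked values: a 50 m gap closed at 16 km/h.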
[0086] The elastic dilemma zone determination module reads in real time the system's preset driver reaction time constant of 1.2 s and maximum safe deceleration of 4.5 m/s². The module adds the instantaneous speed of the vehicle approaching the intersection (60 km/h, approximately 16.67 m/s) to the aforementioned stopping shock wave speed, giving a relative collision speed of 17.23 m/s. It then substitutes the relative collision speed, the driver reaction time constant, and the maximum safe deceleration into its calculation logic and outputs a safety critical distance for the starting point of the elastic dilemma zone of 53.6 m, dynamically modulated by vehicle kinematics. From the external environment, the module obtains the actual spatial distance between the vehicle and the stopping shock wave front, 35 m, and determines that this distance falls between the preset critical distance for the end point of the elastic dilemma zone and the safety critical distance for its starting point. The module therefore generates an interception flag and assigns it a value of 1. The action mask execution module substitutes the extracted state space vector, the interception flag, and the phase-switching action output by the policy network into the state overlay logic, and outputs a final execution action that maintains the current phase.
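The numbers in this example can be reproduced with the sketch below. Two points are assumptions rather than statements from the text: the critical distance is computed with the standard reaction-plus-braking expression $v t_r + v^2 / (2a)$, which yields about 53.66 m and so agrees with the quoted 53.6 m to within rounding, and the 10 m end-point critical distance used in the usage note is a placeholder, since the preset value is not given. Function and constant names are hypothetical.

```python
KEEP, SWITCH = 0, 1


def start_critical_distance_m(vehicle_speed_mps, stop_wave_speed_mps, reaction_s, decel_mps2):
    """Reaction distance plus braking distance at the relative collision speed.

    Assumed formula; it matches the worked figure in the text to rounding.
    """
    v_rel = vehicle_speed_mps + stop_wave_speed_mps
    return v_rel * reaction_s + v_rel ** 2 / (2.0 * decel_mps2)


def interception_flag(distance_m, end_critical_m, start_critical_m):
    """1 when the vehicle lies inside the dilemma zone, else 0."""
    return 1 if end_critical_m <= distance_m <= start_critical_m else 0


def mask_action(raw_action, flag):
    """State overlay: a phase switch is replaced by KEEP while the flag is set."""
    if flag == 1 and raw_action == SWITCH:
        return KEEP
    return raw_action
```

With a vehicle speed of 16.67 m/s, a stopping shock wave speed of 0.56 m/s, a 1.2 s reaction time and 4.5 m/s² deceleration, a vehicle 35 m from the front falls inside the zone (for any end-point distance below 35 m, such as the 10 m placeholder), so a `SWITCH` output from the policy network is overridden to `KEEP`.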
[0087] As shown by the dynamic decision step-size response curve in Figure 3, within the time interval of 0 to 50 seconds, when no severe queue dissipation occurs, the dynamic time step fluctuates steadily around the minimum allowable time step baseline of 5 seconds. At 50 seconds, the starting shock wave rapidly catches up with the stopping shock wave and enters the deep dissipation zone. The system performs dynamic calculations based on the extracted shock wave characteristic parameters, and the dynamic time step surges to a maximum of 11.2 seconds. Subsequently, the curve decays as the residual queue shrinks, gradually converging back to the minimum allowable time step baseline of 5 seconds after 61.2 seconds. This dynamic time step is ultimately converted into the wake-up signal of the system's internal master clock and, combined with the trigger moment, drives the model training and update module to perform closed-loop fine-tuning of the network parameters.
[0088] Experimental verification and effect comparison: referring to Figure 4, the dynamic traffic signal control method based on deep reinforcement learning provided by this invention may include the following equipment operation and data acquisition steps during experimental verification and effect comparison. In Figure 4, the horizontal axis represents time in seconds, ranging from 0 to 3600, and the vertical axis represents the global average delay time in seconds, ranging from 0 to 80. The experiment uses the same physical intersection environment and the same batch of microscopic trajectory datasets containing known morning-peak, high-saturation congestion characteristics. The experiment is divided into two independent control periods for data recording. In the first control period, the underlying traffic signal controller module is controlled by a traditional multi-period fixed-time control law, and its test data are plotted in Figure 4 as the gray dashed line (traditional timing control method). In the second control period, the same underlying traffic signal controller execution module is controlled by the deep reinforcement learning-based policy network control law of this invention, and its test data are plotted in Figure 4 as the solid black line (self-correction control method of this invention).
[0089] The system's visual traffic management platform connects to the output side of the edge computing device to calculate and output the global average delay time of the intersection in real time. The data acquisition system synchronously records, at a sampling frequency of 1 Hz, the absolute deviation between the actual travel time and the theoretical free-flow time of vehicles after they leave the sensing area during the two control periods. As shown by the comparative verification curves of the global average delay time at the intersection in Figure 4, within the time interval of 0 to 1000 seconds, both curves exhibit a basic periodic ripple caused by the superposition of random arrivals in the underlying traffic flow and periodic signal interruptions. At 1000 seconds, a large-scale, high-density platoon of vehicles enters the intersection approach lane during the morning rush hour. Under the traditional timing control method (gray dashed line), because sudden changes in spatial shock wave dissipation demand cannot be detected, the global average delay time continues to rise, reaching a peak of 65 seconds, and after the traffic peak ends it is accompanied by low-frequency secondary queuing oscillations. This curve gradually recovers to within the steady-state range of 30 seconds about 1500 seconds after the traffic flow disturbance. Under the self-correction control method of this invention (solid black line), the system relies on radar point cloud perception to extract absolute physical coordinates and to reconstruct the dynamic time step and the action mask execution logic. The peak of the global average delay time is limited to 42 s.
This curve converges to within the steady-state range of 20 s within 800 s after the traffic flow disturbance occurs, thereby suppressing the spread of large-scale congestion at the intersection.
Claims
1. A dynamic traffic signal control method based on deep reinforcement learning, characterized in that, Includes the following steps: Acquire microscopic trajectory data, extract shock wave velocity and shock wavefront position based on the microscopic trajectory data, and construct a state space vector by combining the phase index of the currently executed signal; The estimated encounter time is calculated based on the shock wave velocity and the shock wavefront position, and the estimated encounter time is set as the dynamic time step of the deep reinforcement learning model. The motion state of vehicles approaching the intersection is obtained, and based on the motion state and the shock wave velocity, it is determined whether the vehicle is in an elastic dilemma zone to generate an interception marker. The state space vector is input into the policy network of the deep reinforcement learning model to output the original expected action, and a state overlay operation is performed on the original expected action based on the interception identifier to generate the final execution action for traffic signal control. At the same time, the cumulative number of masked interceptions is recorded. A comprehensive reward value is generated based on the shock wave velocity and the cumulative number of mask interceptions, and the network parameters of the deep reinforcement learning model are updated using the comprehensive reward value. The global delay time is calculated based on the micro-trajectory data and pushed to the management platform, and degradation protection is triggered when communication anomalies are detected.
2. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that, The steps for extracting the shock wave velocity and the shock wave front position based on the micro-trajectory data specifically include: The micro-trajectory data is processed using a traffic flow theory model to calculate the traffic flow and traffic density of each lane at the intersection. The shock wave velocity is calculated based on the rate of change of the traffic flow and the traffic density, and the shock wave velocity includes the stopping shock wave velocity and the starting shock wave velocity. The instantaneous velocity and instantaneous acceleration of all vehicles in the lane are traversed, and the vehicle boundary coordinate values in the micro-trajectory data are extracted as the shock wave front position. The shock wave front position includes the parking shock wave front position coordinate and the starting shock wave front position coordinate.
3. The deep reinforcement learning based dynamic traffic signal control method of claim 1, wherein, The specific steps for constructing the state space vector include: The shock wave velocity is normalized using a preset reference shock wave velocity. The shock wavefront position is normalized using the total physical length of the lane. After converting the currently executed signal phase index into a one-hot encoded format, it is concatenated with the normalized parameters to generate the state space vector.
4. The deep reinforcement learning based dynamic traffic signal control method of claim 1, wherein, Following the step of setting the expected meeting time to the dynamic time step, the method further includes: Read the system's maximum and minimum allowed time steps; Boundary constraints are applied based on the dynamic time step, the maximum allowed time step, and the minimum allowed time step to determine the trigger time for the next decision of the deep reinforcement learning model.
5. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that, The steps for determining whether a vehicle is in an elastic distress zone based on the motion state and the shock wave velocity to generate the interception marker specifically include: Calculate the relative collision speed between the vehicle's instantaneous speed and the stopping shock wave speed in the shock wave speed, and combine the preset driver reaction time and maximum safe deceleration to calculate the safe critical distance at the starting point of the elastic predicament zone; Determine whether the relative distance between the vehicle and the parking shock front position coordinates in the shock front position is between the preset critical distance to the end of the elastic predicament zone and the safe critical distance to the start of the elastic predicament zone. When the relative distance is between the preset critical distance to the end of the elastic predicament zone and the safe critical distance to the start of the elastic predicament zone, the vehicle is determined to be in the elastic predicament zone, the interception flag is generated and assigned a value of 1; When the relative distance is not between the preset critical distance to the end of the elastic predicament zone and the safe critical distance to the start of the elastic predicament zone, the interception flag is generated and assigned a value of 0.
6. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that the step of performing a state overwrite operation on the original expected action based on the interception flag specifically includes: when the interception flag is 1 and the original expected action is a phase switching action, performing the state overwrite operation to replace the original expected action with the action of keeping the current phase, using that action as the final execution action, and updating the cumulative number of mask interceptions; and when the interception flag is 0, skipping the state overwrite operation and using the original expected action directly as the final execution action.
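The overwrite rule of claim 6 can be sketched as a small pure function. The action constants `KEEP_PHASE` and `SWITCH_PHASE` and the two-action encoding are assumptions for illustration; the claim does not specify how actions are encoded. The case of a flag of 1 with a keep-phase action is not addressed by the claim and is assumed here to pass through unchanged.

```python
from typing import Tuple

# Hypothetical action encoding; the claim does not define one.
KEEP_PHASE = 0
SWITCH_PHASE = 1


def apply_action_mask(original_action: int, flag: int,
                      mask_count: int) -> Tuple[int, int]:
    """Return (final_action, updated cumulative number of mask interceptions)."""
    if flag == 1 and original_action == SWITCH_PHASE:
        # A vehicle is in the elastic distress zone: suppress the phase switch.
        return KEEP_PHASE, mask_count + 1
    # Flag is 0, or the policy already chose to keep the phase.
    return original_action, mask_count
```

Returning the updated counter instead of mutating shared state keeps the function easy to test and makes the interception count available to the reward calculation.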
7. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that the step of generating the comprehensive reward value based on the shock wave velocity and the cumulative number of mask interceptions specifically includes: calculating the difference between the starting shock wave velocity and the stopping shock wave velocity included in the shock wave velocity to generate a feedforward dynamic reward benchmark value; generating the comprehensive reward value as a weighted combination of the feedforward dynamic reward benchmark value and the cumulative number of mask interceptions; combining the state space vector, the original expected action, the comprehensive reward value, and the state space vector of the next time step into an experience data tuple and writing it into the experience replay pool; and updating the network parameters using the backpropagation algorithm.
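Claim 7 does not give the weights or the sign with which the interception count enters the reward. The sketch below assumes a linear form in which interceptions are penalized, and uses a bounded `collections.deque` as the experience replay pool. The names `Experience`, `comprehensive_reward`, `flow_weight`, and `mask_penalty` are illustrative assumptions, and the network update itself is omitted.

```python
from collections import deque
from typing import Deque, List, NamedTuple


class Experience(NamedTuple):
    state: List[float]
    action: int            # the original expected action, as stated in claim 7
    reward: float
    next_state: List[float]


def comprehensive_reward(start_wave_speed: float, stop_wave_speed: float,
                         mask_count: int, flow_weight: float = 1.0,
                         mask_penalty: float = 0.1) -> float:
    """Assumed linear form: weighted benchmark minus a penalty per interception."""
    benchmark = start_wave_speed - stop_wave_speed  # feedforward reward benchmark
    return flow_weight * benchmark - mask_penalty * mask_count


def store(pool: Deque[Experience], state: List[float], action: int,
          reward: float, next_state: List[float]) -> None:
    """Write one experience tuple; the oldest entry is dropped when the pool is full."""
    pool.append(Experience(state, action, reward, next_state))
```

Storing the original expected action, not the overwritten one, means the penalty term is attributed to the choice the policy actually made, so the agent can learn to avoid requesting switches that would be intercepted.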
8. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that the method further includes a safety fallback mechanism for extreme operating conditions: calculating the lane space saturation from the physical length of the residual queue that has not yet dissipated in the current lane and the total physical length of the lane; and, when the lane space saturation exceeds a preset overflow threshold for several consecutive dynamic time steps, triggering a safety fallback plan to forcibly clear and disperse the overflow.
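The overflow check of claim 8 can be sketched as a small stateful monitor. It assumes that the lane space saturation is the ratio of residual queue length to lane length, and that "several consecutive" means a configurable count. The class name `SpilloverMonitor` and its parameters are hypothetical.

```python
class SpilloverMonitor:
    """Tracks how many consecutive time steps the lane has been over-saturated."""

    def __init__(self, overflow_threshold: float, required_steps: int) -> None:
        self.overflow_threshold = overflow_threshold
        self.required_steps = required_steps
        self.consecutive = 0

    def update(self, residual_queue_length: float, lane_length: float) -> bool:
        """Return True when the safety fallback plan should be triggered."""
        if lane_length <= 0:
            raise ValueError("lane_length must be positive")
        saturation = residual_queue_length / lane_length
        if saturation > self.overflow_threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any step below the threshold resets the streak
        return self.consecutive >= self.required_steps
```

Requiring a streak of over-threshold steps instead of a single reading avoids triggering the fallback plan on a momentary queue surge.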
9. The dynamic traffic signal control method based on deep reinforcement learning according to claim 1, characterized in that the steps of calculating the global delay time and pushing it to the management platform, and triggering the degradation protection when a communication anomaly is detected, specifically include: obtaining the total number of valid vehicles leaving the intersection, and using the vehicle identification code to track the complete passage of each individual vehicle to obtain its actual travel time; obtaining the global delay time by summing the differences between the actual travel time and the theoretical free-flow travel time and dividing the sum by the total number of valid vehicles, wherein, when the total number of valid vehicles is 0, the global delay time is assigned the maximum system waiting time limit; serializing the global delay time and the traffic operation parameters in JSON data format and pushing them to the management platform, and generating a delay alarm log when the severe congestion alarm threshold is exceeded; and constructing a heartbeat monitoring mechanism by sending handshake probe packets with timestamps, wherein, when the timer fails to receive the handshake probe packets for more than a preset number of consecutive times, the communication is determined to be interrupted, and the underlying hardware relay takes over control and switches to a single-point fixed-time multi-period timing scheme to perform the degradation protection.
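The delay calculation, JSON report, and heartbeat counter of claim 9 can be sketched as follows. The JSON field names, the function names, and the `HeartbeatMonitor` class are hypothetical. Network transport and the hardware relay takeover are outside the scope of this sketch and are represented only by a boolean state.

```python
import json
from typing import Sequence


def global_delay(actual_times: Sequence[float], free_flow_times: Sequence[float],
                 max_wait_limit: float) -> float:
    """Mean of (actual - free-flow) travel time over all valid vehicles."""
    if len(actual_times) != len(free_flow_times):
        raise ValueError("one free-flow time is required per vehicle")
    n = len(actual_times)
    if n == 0:
        return max_wait_limit  # no valid vehicles: use the system waiting limit
    return sum(a - f for a, f in zip(actual_times, free_flow_times)) / n


def build_report(delay_s: float, alarm_threshold_s: float) -> str:
    """Serialize the delay and an alarm indicator as JSON (field names assumed)."""
    payload = {"global_delay_s": delay_s,
               "severe_congestion": delay_s > alarm_threshold_s}
    return json.dumps(payload, sort_keys=True)


class HeartbeatMonitor:
    """Counts consecutive probe timeouts and reports when degradation is due."""

    def __init__(self, max_missed: int) -> None:
        self.max_missed = max_missed
        self.missed = 0

    def on_probe_received(self) -> None:
        self.missed = 0

    def on_timeout(self) -> bool:
        """Return True once the consecutive misses exceed the preset count."""
        self.missed += 1
        return self.missed > self.max_missed
```

When `on_timeout` returns True, a real controller would hand control to the hardware relay and load the fixed-time plan; resetting the counter on every received probe ensures that only consecutive misses count toward the interruption decision.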
10. A dynamic traffic signal control system based on deep reinforcement learning, used to implement the dynamic traffic signal control method based on deep reinforcement learning according to any one of claims 1-9, characterized in that the system includes: a shock wave feature extraction and state construction module, used to acquire microscopic trajectory data, extract the shock wave velocity and the shock wavefront position, and construct a state space vector by combining the current signal phase index; a dynamic decision step size generation module, used to calculate the expected encounter time based on the shock wave velocity and the shock wavefront position, and set it as the dynamic time step of the deep reinforcement learning model; an elastic distress zone determination module, used to acquire the motion state of the vehicle and, in conjunction with the shock wave velocity, determine whether the vehicle is in an elastic distress zone to generate an interception flag; an action mask execution module, used to input the state space vector into the policy network to output the original expected action, and to perform a state overwrite operation based on the interception flag to generate the final execution action, while recording the cumulative number of mask interceptions; a model training and update module, used to generate a comprehensive reward value based on the shock wave velocity and the cumulative number of mask interceptions, and to update the network parameters of the deep reinforcement learning model; and an operation assessment and safety degradation module, used to calculate the global delay time and push it to the management platform, and to trigger degradation protection when a communication anomaly is detected.