Reinforcement learning based vehicle system distributed control method, system, and apparatus
A reinforcement-learning-based distributed control method for vehicle systems uses a signed directed graph and a second-order feedback model, combined with a dynamic compensation network and further neural networks, to achieve stable convergence of the vehicle system within a preset time. This addresses the uncontrollable convergence time of existing techniques and improves the robustness and transient performance of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- QUFU NORMAL UNIV
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
Figure CN122018558B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of distributed system control, and specifically relates to a distributed control method, system, and device for vehicle systems based on reinforcement learning.
Background Technology
[0002] Through perception, communication, and collaborative control among vehicles, intelligent vehicles have shown great application potential in autonomous driving transportation systems, smart port logistics, material handling in smart manufacturing workshops, and military patrol and strike operations. Realizing these applications depends on effective underlying cooperative control strategies, among which consensus control and platooning control are the most basic and primary goals of cooperative vehicle motion.
[0003] Existing vehicle control systems face several challenges. First, vehicle dynamics are highly nonlinear and are often affected by unmodeled dynamics and external environmental disturbances. Second, communication resources are usually limited, and the network may contain adversarial interactions, with cooperative and competitive relationships coexisting among vehicles. Most existing results guarantee only asymptotic stability or finite-time stability. Under asymptotic stability, the time for the system state to converge to the equilibrium point is theoretically infinite. Under finite-time stability, the convergence time is finite but depends on the initial error: a large initial error can make the convergence time very long, so a precise stabilization time cannot be preset in actual task planning.
Summary of the Invention
[0004] The purpose of this invention is to provide a distributed control method, system, and apparatus for vehicle systems based on reinforcement learning.
[0005] A distributed control method for vehicle systems based on reinforcement learning includes the following steps:
[0006] S1. Based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle, a signed directed graph is constructed. A positive connection weight in the signed directed graph indicates a cooperative relationship between the two corresponding vehicles, and a negative connection weight indicates a competitive relationship between them. For each vehicle, a second-order feedback nonlinear model is constructed through coordinate transformation.
[0007] S2. The tracking error is taken as the sum of the cooperative or competitive interaction terms between each vehicle and its neighbors and the deviation of each vehicle's trajectory from that of the leader vehicle.
[0008] A time performance function is constructed based on the expected convergence time, the initial error boundary, and the steady-state error bound.
[0009] The tracking error is converted into a transformation error through the time performance function.
[0010] S3. Based on the transformation error, combined with the dynamic compensation network, the control generation network, and the upper bound of the disturbance, a virtual control law is obtained;
[0011] The vehicle velocity is obtained from the second-order feedback nonlinear model, and the velocity error is obtained as the difference between this velocity and the virtual control law. The dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network are then updated accordingly.
[0012] Based on the velocity error, as well as the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the actual control input is obtained.
[0013] S4. Convert the actual control input into linear acceleration and angular velocity commands that the vehicle can directly execute, and drive the vehicle to track the leader's state or its opposite within the expected convergence time, based on the cooperative or competitive relationship, to complete the distributed control of the vehicle system.
[0014] S2 constructs a time performance function based on the expected convergence time, initial error boundary, and steady-state error bound, specifically as follows:
[0015] ,
[0016] where the terms denote, in order: the time performance function; the initial boundary; the steady-state error bound; the shape parameter; the expected convergence time; and the time variable t.
[0017] In S3, based on the transformation error, and combining the dynamic compensation network, the control generation network, and the upper bound of the disturbance, a virtual control law is obtained, specifically:
[0018] ,
[0019] where the terms denote, in order: the virtual control law; a time-varying scaling factor associated with the time performance function; the entry term; the pinning gain of the leader on follower j; sgn, the sign function; a design parameter with a value in the range (0, 1); tanh(), the hyperbolic tangent function; an adjustable positive parameter; the unconstrained transformation error; a parameter related to the expected convergence time; the disturbance upper bound; the transpose of the dynamic compensation network weights; the first basis function vector; the transpose of the control generation network weights; and the second basis function vector.
[0020] In S3, based on the velocity error and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the actual control input is obtained, specifically:
[0021] ,
[0022] where the terms denote, in order: the actual control input; the first time-control gain parameter; tanh(), the hyperbolic tangent function; an adjustable positive parameter; the unconstrained error; the second time-control gain parameter; sgn, the sign function; design parameters with values in the range (0, 1); the updated disturbance upper bound; the transpose of the updated dynamic compensation network weights; the third basis function vector; the transpose of the updated control generation network weights; and the fourth basis function vector.
[0023] S3 updates the dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network as follows:
[0024] ,
[0025] ,
[0026] ,
[0027] ,
[0028] where the terms denote, in order: the derivative of the updated disturbance upper bound; the unconstrained error; sgn, the sign function; constant parameters; design parameters with values in the range (0, 1); the expected convergence time; a constant parameter; the (1-ξ)-power term of the disturbance upper-bound estimate; a scaling factor; the (1+ξ)-power term of the disturbance upper-bound estimate; the update rate of the dynamic compensation network weights; the basis function vector; the (1-ξ)-power term of the dynamic compensation network weights; the (1+ξ)-power term of the dynamic compensation network weights; the weight vector of the performance evaluation network at the next sampling time, where k is the discretized sampling step index and Proj is the projection operator; the weight vector of the performance evaluation network at the current sampling time; the sampling period; the weight vector of the control generation network at the next sampling time; the weight vector of the control generation network at the current sampling time; the gradient terms based on the temporal-difference error; and the learning rate.
[0029] In S4, the actual control input is converted into linear acceleration and angular velocity commands that the vehicle can directly execute. Specifically:
[0030] ,
[0031] ,
[0032] where the terms denote, in order: the actual control input; the linear acceleration in the x-direction of the transformed Cartesian coordinate system; the linear acceleration in the y-direction of the transformed Cartesian coordinate system; the linear acceleration; the angular velocity; the linear velocity; and the orientation angle of the j-th vehicle.
[0033] In S1, a symbolic directed graph is constructed as follows:
[0034] The communication topology of the vehicle system is described by a structurally balanced signed directed graph. Each adjacency matrix element associated with a connection is a non-zero real number: a positive value indicates cooperation between the two corresponding vehicles, and a negative value indicates competition between them.
[0035] In S2, the tracking error is converted into a transformation error through a time performance function, specifically:
[0036] ,
[0037] where the terms denote, in order: the tracking error; the time performance function; and the unconstrained transformation error.
[0038] A distributed control system for vehicle systems based on reinforcement learning, used to implement a distributed control method for vehicle systems based on reinforcement learning, including:
[0039] The model construction module constructs a signed directed graph based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle. A positive connection weight in the signed directed graph indicates a cooperative relationship between the two corresponding vehicles, and a negative connection weight indicates a competitive relationship between them. For each vehicle, a second-order feedback nonlinear model is constructed through coordinate transformation.
[0040] The error transformation module takes as the tracking error the sum of the cooperative or competitive interaction terms between each vehicle and its neighbors and the deviation of each vehicle's trajectory from that of the leader vehicle. It constructs a time performance function based on the expected convergence time, the initial error boundary, and the steady-state error bound, and converts the tracking error into a transformation error through the time performance function.
[0041] The control design module derives a virtual control law from the transformation error, combined with the dynamic compensation network, the control generation network, and the disturbance upper bound. The vehicle velocity is obtained from the second-order feedback nonlinear model, and the difference between this velocity and the virtual control law gives the velocity error. The dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network are then updated.
[0042] Based on the speed error, and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the actual control input is obtained. The instruction execution module converts the actual control input into linear acceleration and angular velocity commands that the vehicle can directly execute. This drives the vehicle to track the leader's state or its opposite within the expected convergence time, based on cooperation or competition, thereby completing the distributed control of the vehicle system.
[0043] A distributed control device for a vehicle system based on reinforcement learning includes a processor and a memory, wherein the processor executes a computer program stored in the memory to implement a distributed control method for a vehicle system based on reinforcement learning.
[0044] Through the implementation steps described above, the advantages of the present invention are as follows:
[0045] By using a preset time performance function, the convergence time can be explicitly set, completely decoupling the relationship between the convergence time and the initial state, enabling the system to complete the task on time even when faced with large initial errors.
[0046] It is suitable for signed graph topologies and can model more complex group behaviors, such as red-team and blue-team confrontations in adversarial exercises.
[0047] The S3 step avoids higher-order derivatives of the virtual control law and, combined with a physically meaningful second-order model, significantly reduces the computational complexity of the algorithm, making it suitable for running on vehicle-mounted embedded processors with limited computing power.
[0048] The three-network architecture clearly separates the functions of steady-state compensation and performance optimization. It can effectively handle unknown dynamics and disturbances while learning the optimal strategy online, and it is strongly robust.
Attached Figure Description
[0049] Figure 1 shows the motion trajectories of the vehicle system under the reinforcement-learning-based control, where (1) is the trajectory of the cooperative camp, (2) is the trajectory of the competitive camp, and (3) is the trajectory of all intelligent vehicles;
[0050] Figure 2 shows the error convergence of the vehicle system under the reinforcement-learning-based control, where (1) is the tracking error of intelligent vehicle 1, (2) is the tracking error of intelligent vehicle 2, (3) is the tracking error of intelligent vehicle 3, and (4) is the tracking error of intelligent vehicle 4.
Detailed Implementation
[0051] Example 1
[0052] To further understand the content of this invention, the invention will be described in detail with reference to the embodiments.
[0053] The main technical problem addressed by this invention is the following: for intelligent vehicle systems operating over signed graph networks, how to establish a dynamic model of the system and design a distributed control strategy, given unknown system dynamics, external disturbances, and access only to neighbor output information, such that all following vehicles converge to the leader's reference state or its opposite, according to their camp (cooperative or competitive), within a time limit set by the user, while all signals in the closed-loop system remain bounded in the practical preset-time sense.
[0054] This invention relates to a distributed control method for vehicle systems based on reinforcement learning, which includes the following steps:
[0055] S1. Based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle, a signed directed graph is constructed. A positive connection weight in the signed directed graph indicates a cooperative relationship between the two corresponding vehicles, and a negative connection weight indicates a competitive relationship between them.
[0056] For each vehicle, a second-order feedback nonlinear model is constructed through coordinate transformation.
[0057] Specifically, the technical solution adopted by the present invention to solve the above problems is as follows:
[0058] Consider a networked multi-vehicle system consisting of several following vehicles and one leader (numbered 0), whose communication topology is described by a signed directed graph. Each adjacency matrix element associated with a connection is a non-zero real number: a positive value indicates cooperation between the two corresponding vehicles, and a negative value indicates competition between them. The graph is assumed to be structurally balanced, i.e., its node set can be divided into two mutually exclusive subsets such that the connection weight between any two nodes in the same subset is non-negative, and the connection weight between any two nodes in different subsets is non-positive. This structure leads to bipartite consensus: members of one subset converge to the leader state, and members of the other subset converge to its opposite.
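The structural balance assumption can be checked mechanically. The following is a minimal Python sketch, not the patent's own procedure: it assumes the signed adjacency matrix is given as a nested list, and the names `find_camps` and `adj` are hypothetical. Each connection is treated as undirected for the purpose of the balance test.

```python
from collections import deque


def find_camps(adj):
    """Return a list of camp labels (+1 or -1) if the signed graph is
    structurally balanced, or None if no valid two-camp split exists.

    adj[i][k] > 0 means cooperation, adj[i][k] < 0 means competition,
    and 0 means no connection (assumed encoding, for illustration only).
    """
    n = len(adj)
    camp = [0] * n  # 0 marks a vehicle that has not been labeled yet
    for start in range(n):
        if camp[start] != 0:
            continue
        camp[start] = 1
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for k in range(n):
                if k == i:
                    continue
                # Use whichever direction of the connection is present.
                w = adj[i][k] if adj[i][k] != 0 else adj[k][i]
                if w == 0:
                    continue
                # Cooperative neighbors share a camp, competitors do not.
                expected = camp[i] if w > 0 else -camp[i]
                if camp[k] == 0:
                    camp[k] = expected
                    queue.append(k)
                elif camp[k] != expected:
                    return None  # conflicting labels: not balanced
    return camp


# Vehicles 0 and 1 cooperate, 2 and 3 cooperate, the two pairs compete.
example = [
    [0, 1, -1, 0],
    [1, 0, 0, -1],
    [-1, 0, 0, 1],
    [0, -1, 1, 0],
]
camps = find_camps(example)
```

A returned list such as `camps` gives each vehicle's camp; a `None` result indicates that the assumed topology violates structural balance, in which case bipartite consensus as described above would not apply.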
[0059] Establish the kinematic model of the j-th nonholonomic wheeled vehicle:
[0060] ,
[0061] where the terms denote, in order: the vehicle's position coordinates; the heading angle; and the linear velocity and angular velocity, respectively. Note that throughout this application, the differentiation notation denotes the rate of change of a quantity, the estimate notation denotes an estimated value, and the error notation denotes the difference between the estimated value and the actual value.
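The kinematic equations themselves are not reproduced in this text. As an illustration only, the sketch below assumes the standard unicycle model commonly used for nonholonomic wheeled vehicles (the position rates are the linear velocity projected along the heading, and the heading rate is the angular velocity) and advances it with a forward Euler step. The function name `unicycle_step` and the step size are hypothetical.

```python
import math


def unicycle_step(x, y, theta, v, omega, dt):
    """Advance an assumed unicycle model by one forward Euler step.

    x, y   : position coordinates
    theta  : heading angle in radians
    v      : linear velocity
    omega  : angular velocity
    dt     : integration step (assumption, not a value from the source)
    """
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + omega * dt
    return x_next, y_next, theta_next


# Drive straight along the x-axis for one step.
state = unicycle_step(0.0, 0.0, 0.0, v=1.0, omega=0.0, dt=0.1)
```

The nonholonomic constraint shows up in the fact that the vehicle cannot move sideways: with zero heading angle, no choice of `v` and `omega` changes `y` within a single step.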
[0062] Furthermore, state variables are defined through coordinate transformation, and the system dynamics are transformed into the following second-order feedback nonlinear model:
[0063] ,
[0064] where the terms denote, in order: the state vector; the transformed virtual control input; the system output; an unknown smooth nonlinear function; and an unknown bounded external disturbance.
[0065] A neural-network-based distributed neighbor-state observer is designed. Each vehicle can obtain only the output of a neighbor through the network; the neighbor's velocity state cannot be measured directly. The following distributed observer is therefore constructed to estimate the neighbor states:
[0066] ,
[0067] where the terms denote, in order: the estimates of the neighbor's position and velocity states; the observer gains; the basis function vector; and the observer neural network weights, whose purpose is to approximate the neighbor's unknown nonlinear dynamics. The adaptive update law for the observer weights uses gradient descent based on the estimation error and is designed as follows:
[0068] ,
[0069] where the terms denote the learning rate and the robust damping parameter. Selecting appropriate observer gains ensures that the observation error converges exponentially.
[0070] S2. The tracking error is taken as the sum of the cooperative or competitive interaction terms between each vehicle and its neighbors and the deviation of each vehicle's trajectory from that of the leader vehicle.
[0071] A time performance function is constructed based on the expected convergence time, the initial error boundary, and the steady-state error bound.
[0072] The tracking error is converted into a transformation error through the time performance function.
[0073] Specifically, a preset-time performance function and an unconstrained error transformation mechanism are constructed. The bipartite consensus tracking error is defined by taking into account the cooperative/competitive interactions among neighbors and the deviation from the leader:
[0074] ,
[0075] where the terms denote the pinning gain from the leader to the vehicle and the leader's reference trajectory.
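The tracking error formula is not reproduced in this text. The sketch below is therefore only an illustration under stated assumptions: it uses a form of the bipartite consensus tracking error that is common in the signed-graph literature, in which each neighbor's output is compared after multiplication by the sign of the connection weight, and the leader's trajectory is compared after multiplication by the vehicle's camp label. The names `tracking_error`, `pin`, and `camp` are hypothetical.

```python
def sign(w):
    """Return +1, -1, or 0 according to the sign of w."""
    return (w > 0) - (w < 0)


def tracking_error(j, y, adj, pin, camp, y_leader):
    """Assumed bipartite consensus tracking error of vehicle j.

    y        : list of scalar vehicle outputs
    adj      : signed adjacency matrix (positive = cooperation,
               negative = competition, 0 = no connection)
    pin      : pinning gains from the leader to each vehicle
    camp     : +1 for the leader's camp, -1 for the opposing camp
    y_leader : leader reference output
    """
    # Neighbor term: cooperative neighbors pull toward agreement,
    # competitive neighbors pull toward the opposite value.
    neighbor_term = sum(
        abs(adj[j][i]) * (y[j] - sign(adj[j][i]) * y[i])
        for i in range(len(y))
        if i != j
    )
    # Leader term: deviation from the leader state or its opposite.
    leader_term = pin[j] * (y[j] - camp[j] * y_leader)
    return neighbor_term + leader_term


adj = [
    [0, 1, -1, 0],
    [1, 0, 0, -1],
    [-1, 0, 0, 1],
    [0, -1, 1, 0],
]
camp = [1, 1, -1, -1]
pin = [1.0, 0.0, 0.0, 0.0]
# At bipartite consensus every error vanishes.
errors = [tracking_error(j, [2.0, 2.0, -2.0, -2.0], adj, pin, camp, 2.0) for j in range(4)]
```

Under this assumed form, the error of every vehicle is zero exactly when the cooperative camp sits on the leader state and the competitive camp sits on its opposite, which matches the bipartite behavior described in the text.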
[0076] A time performance function is designed to constrain the error convergence envelope, specifically as follows:
[0077] ,
[0078] where the terms denote, in order: the time performance function; the initial boundary; the steady-state error bound; the shape parameter; the expected convergence time; and the time variable t.
[0079] The tangent function is used to map the tracking error to the unconstrained transformation error:
[0080] ,
[0081] Derivation of the dynamic equation for the transformation error:
[0082] ,
[0083] in, , .
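[0084] Analysis of this transformation shows how the transformation error behaves in two cases of the tracking error relative to the performance boundary.
[0085] Therefore, as long as the controller design ensures that the transformation error is bounded, the tracking error never reaches the boundary set by the time performance function, which achieves strict preset-time performance constraints.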
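The formulas for the time performance function and the tangent transformation are not reproduced in this text. The sketch below is a hedged illustration, not the patent's definition: it assumes a polynomial-decay performance bound that reaches the steady-state bound exactly at the expected convergence time, and a transformation of the form tan(πe / (2ρ)). All function names and numeric values are hypothetical.

```python
import math


def performance_bound(t, rho0, rho_inf, t_conv, kappa):
    """Assumed preset-time performance bound.

    Starts at rho0, decays with shape parameter kappa, and equals the
    steady-state bound rho_inf for every t >= t_conv.
    """
    if t >= t_conv:
        return rho_inf
    return (rho0 - rho_inf) * ((t_conv - t) / t_conv) ** kappa + rho_inf


def to_unconstrained(e, rho):
    """Map a tracking error inside (-rho, rho) to an unconstrained value."""
    if abs(e) >= rho:
        raise ValueError("tracking error must lie strictly inside the bound")
    return math.tan(math.pi * e / (2.0 * rho))


def to_constrained(z, rho):
    """Inverse map: any finite z lands strictly inside (-rho, rho)."""
    return 2.0 * rho / math.pi * math.atan(z)


rho_now = performance_bound(1.0, rho0=2.0, rho_inf=0.1, t_conv=5.0, kappa=2.0)
z = to_unconstrained(0.5, rho_now)
e_back = to_constrained(z, rho_now)
```

The inverse map makes the argument of the preceding paragraph concrete: because the arctangent of any finite value is strictly less than π/2 in magnitude, a bounded transformation error always corresponds to a tracking error strictly inside the bound.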
[0086] S3. Based on the transformation error, combined with the dynamic compensation network, the control generation network, and the upper bound of the disturbance, a virtual control law is obtained;
[0087] The vehicle velocity is obtained from the second-order feedback nonlinear model, and the velocity error is obtained as the difference between this velocity and the virtual control law. The dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network are then updated.
[0088] Based on the velocity error, as well as the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the actual control input is obtained.
[0089] Differentiating the transformation error gives:
[0090] ,
[0091] where one term contains the known nonlinear terms and the other includes the unknown disturbances and the neighbor coupling terms.
[0092] The goal is to design a virtual control law that stabilizes the transformation error.
[0093] Furthermore, a three-network architecture is introduced, and an adaptive reinforcement learning controller is designed on it, comprising a dynamic compensation network, a performance evaluation network, and a control generation network.
[0094] These three network components can be implemented using the same underlying network structure (such as radial basis function neural networks or multilayer perceptrons), differing only in their inputs, outputs, and update laws. More specifically, the dynamic compensation network can be implemented using radial basis function neural networks, multilayer perceptrons, fuzzy neural networks, etc. The performance evaluation network can be implemented using Critic networks and temporal difference learning networks from reinforcement learning, while the control generation network can be implemented using Actor networks or direct adaptive neural network controllers from reinforcement learning.
[0095] The first step is the virtual control design of the position loop. A Hamilton-Jacobi-Bellman (HJB) equation is defined to optimize the performance index, in which one term represents the cost function and another is a cost factor.
[0096] According to the HJB equation, optimal control depends on the gradient of the cost function. We use a performance evaluation network to approximate this gradient term.
[0097] The virtual control law is designed as follows:
[0098] ,
[0099] where the terms denote, in order: the virtual control law; a time-varying scaling factor associated with the time performance function; the entry term; the pinning gain of the leader on follower j; sgn, the sign function; a design parameter with a value in the range (0, 1); tanh(), the hyperbolic tangent function; an adjustable positive parameter; the unconstrained transformation error; a parameter related to the expected convergence time; the disturbance upper bound; the transpose of the dynamic compensation network weights; the first basis function vector; the transpose of the control generation network weights; and the second basis function vector.
[0100] The second step is the actual control design of the speed loop. The velocity error is defined, and the actual control input is designed as follows:
[0101] ,
[0102] where the terms denote, in order: the actual control input; the first time-control gain parameter; tanh(), the hyperbolic tangent function; an adjustable positive parameter; the unconstrained error; the second time-control gain parameter; sgn, the sign function; design parameters with values in the range (0, 1); the updated disturbance upper bound; the transpose of the updated dynamic compensation network weights; the third basis function vector; the transpose of the updated control generation network weights; and the fourth basis function vector.
[0103] The specific formula for calculating the time control gain parameter is as follows:
[0104] ,
[0105] This parameter design directly embeds the user-defined preset time into the control law gain, ensuring that the upper bound on the system convergence time is the user-defined preset time.
[0106] For the adaptive updating of parameters and weights, the update laws for the disturbance estimation parameter and the dynamic compensation network are designed as follows:
[0107] ,
[0108] ,
[0109] Furthermore, a network weight update law based on the projection operator is designed to minimize the Bellman residual:
[0110] ,
[0111] ,
[0112] where the terms denote, in order: the derivative of the updated disturbance upper bound; the unconstrained error; sgn, the sign function; constant parameters; design parameters with values in the range (0, 1); the expected convergence time; a constant parameter; the (1-ξ)-power term of the disturbance upper-bound estimate; a scaling factor; the (1+ξ)-power term of the disturbance upper-bound estimate; the update rate of the dynamic compensation network weights; the basis function vector; the (1-ξ)-power term of the dynamic compensation network weights; the (1+ξ)-power term of the dynamic compensation network weights; and the weight vector of the performance evaluation network at the next sampling time, where k is the discretized sampling step index and Proj is the projection operator. The update applies gradient descent based on the Bellman residual and then the projection operator, whose role is to project the weights back into a compact set whenever their magnitude exceeds a preset boundary. This is crucial for preventing the closed-loop system from diverging under strong disturbances.
[0113] The remaining terms denote, in order: the weight vector of the performance evaluation network at the current sampling time; the sampling period; the weight vector of the control generation network at the next sampling time; the weight vector of the control generation network at the current sampling time; the gradient terms based on the temporal-difference error; and the learning rate.
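The projected update described above can be sketched as follows. This is an illustration under assumptions, not the patent's update law: the performance evaluation network is taken to be linear in its basis functions, the compact set is taken to be a Euclidean ball, the residual-gradient form of the temporal-difference update is used, and the discount factor `gamma` is introduced purely for the example. All names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def project_to_ball(w, radius):
    """Project w onto the ball of the given radius (assumed compact set)."""
    norm = math.sqrt(dot(w, w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def critic_step(w, phi, phi_next, cost, gamma, lr, dt, radius):
    """One projected gradient step on the squared Bellman residual.

    w        : current weight vector
    phi      : basis function vector at the current sample
    phi_next : basis function vector at the next sample
    cost     : stage cost observed over the sampling period
    gamma    : discount factor (assumption for this sketch)
    lr, dt   : learning rate and sampling period
    """
    # Temporal-difference (Bellman) residual for a linear critic.
    td_error = cost + gamma * dot(w, phi_next) - dot(w, phi)
    # Gradient of 0.5 * td_error**2 with respect to w.
    grad = [td_error * (gamma * pn - p) for p, pn in zip(phi, phi_next)]
    w_new = [wi - lr * dt * g for wi, g in zip(w, grad)]
    return project_to_ball(w_new, radius)


w_next = critic_step([0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     cost=1.0, gamma=0.9, lr=1.0, dt=0.1, radius=10.0)
```

Projecting after every step keeps the weight norm within the chosen radius regardless of how large an individual gradient is, which is the safeguard the text attributes to the projection operator.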
[0114] To eliminate the dependence of the convergence time on initial conditions and to optimize transient performance, a preset-time performance function is introduced, and a three-network architecture integrating dynamic compensation, performance evaluation, and control generation is constructed. Furthermore, an optimized backstepping control law coordinated with the preset-time parameters is designed to ensure that the multi-vehicle system is bounded in the practical preset-time sense under unknown dynamics and external disturbances.
[0115] S4. Convert the actual control input into linear acceleration and angular velocity commands that the vehicle can directly execute, and drive the vehicle to track the leader's state or its opposite within the expected convergence time, based on the cooperative or competitive relationship, to complete the distributed control of the vehicle system.
[0116] Specifically, for the mapping and execution of control quantities, the inverse feedback-linearization transform converts the computed virtual control input into the linear acceleration and angular velocity commands of the vehicle's actual actuators, as follows:
[0117] ,
[0118] ,
[0119] where the terms denote, in order: the actual control input; the linear acceleration in the x-direction of the transformed Cartesian coordinate system; the linear acceleration in the y-direction of the transformed Cartesian coordinate system; the linear acceleration; the angular velocity; the linear velocity; and the orientation angle of the j-th vehicle.
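The conversion formulas are not reproduced in this text. The sketch below assumes the standard dynamic feedback linearization of a unicycle, in which the Cartesian accelerations are the rotation of the linear acceleration and the product of linear and angular velocity by the orientation angle; inverting that rotation gives the commands. The function names and the low-speed guard `v_min` are assumptions added for the example.

```python
import math


def to_actuator_commands(u_x, u_y, theta, v, v_min=1e-3):
    """Convert Cartesian acceleration inputs into vehicle commands.

    u_x, u_y : control inputs along the x and y directions
    theta    : orientation angle of the vehicle
    v        : current linear velocity
    v_min    : guard against division by zero near standstill (assumption)
    Returns (linear_acceleration, angular_velocity).
    """
    accel = u_x * math.cos(theta) + u_y * math.sin(theta)
    v_safe = v if abs(v) >= v_min else math.copysign(v_min, v)
    omega = (-u_x * math.sin(theta) + u_y * math.cos(theta)) / v_safe
    return accel, omega


def to_cartesian_inputs(accel, omega, theta, v):
    """Forward map of the assumed model, used here to check the inverse."""
    u_x = accel * math.cos(theta) - v * omega * math.sin(theta)
    u_y = accel * math.sin(theta) + v * omega * math.cos(theta)
    return u_x, u_y


u = to_cartesian_inputs(0.3, -0.4, theta=0.7, v=1.5)
commands = to_actuator_commands(u[0], u[1], theta=0.7, v=1.5)
```

The division by the linear velocity means the mapping is singular at standstill, so any practical use of this form needs some guard such as the hypothetical `v_min` above.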
[0120] Stability proof: Construct a Lyapunov function V containing the sum of squares of all error terms:
[0121]
[0122] where one term is the damping term.
[0123] By substituting the control law and the update law, and using inequality scaling, we can finally obtain the following differential inequality:
[0124]
[0125] According to the preset-time stability lemma, this inequality implies that the function V converges, within the preset time, to a small neighborhood determined by a constant.
[0126] Because the transformation error is bounded, the properties of the tangent transform imply that the tracking error is necessarily confined to the funnel-shaped region defined by the time performance function, so preset-time bipartite consensus is achieved.
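The differential inequality and the lemma are not reproduced in this text. As a related illustration only, the sketch below simulates a scalar system driven by the two signed power terms with exponents 1-ξ and 1+ξ that appear in the update laws, and compares the settling time with the well-known fixed-time bound 1/(k1·ξ) + 1/(k2·ξ) for this scalar system. It is not the patent's proof; the gains, step size, and tolerance are hypothetical.

```python
def settling_time(z0, k1, k2, xi, dt=1e-4, tol=1e-3, t_max=10.0):
    """Forward Euler simulation of
        dz/dt = -k1*|z|^(1-xi)*sgn(z) - k2*|z|^(1+xi)*sgn(z)
    Returns the first time at which |z| <= tol (or t_max if never)."""
    z, t = float(z0), 0.0
    while abs(z) > tol and t < t_max:
        s = (z > 0) - (z < 0)
        z -= dt * s * (k1 * abs(z) ** (1.0 - xi) + k2 * abs(z) ** (1.0 + xi))
        t += dt
    return t


def time_bound(k1, k2, xi):
    """Initial-condition-independent bound for the scalar system above."""
    return 1.0 / (k1 * xi) + 1.0 / (k2 * xi)


bound = time_bound(k1=2.0, k2=2.0, xi=0.5)
times = [settling_time(z0, 2.0, 2.0, 0.5) for z0 in (0.5, 100.0, 10000.0)]
```

The term with exponent 1+ξ dominates far from the origin and the term with exponent 1-ξ dominates near it, which is why the settling time stays below a bound that does not depend on the initial error.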
[0127] Furthermore, the experimental results are analyzed as follows:
[0128] See Figure 1, which shows the motion trajectories of the vehicles under the control method of this application. Figure 1 (1) shows the cooperative camp: the black line is the leader's trajectory, and the blue and red lines are the trajectories of the other vehicles. Figure 1 (2) shows the competitive camp: the black dotted line is the trajectory of the anti-leader, and the green and brown lines are the trajectories of the other vehicles. Figure 1 (3) combines the trajectories of all vehicles from the cooperative camp in Figure 1 (1) and the competitive camp in Figure 1 (2), and shows that the vehicles accurately identified and reached their respective camps, forming a clear bipartite confrontation formation.
[0129] See Figure 2. Panels (1), (2), (3), and (4) all show that, under the control method of this application, the blue tracking error curves of all vehicles enter the steady-state region before the preset time. They also show that, regardless of the initial position, the tracking error never exceeds the preset time performance function, drawn as the red dashed line, at any point in the process. This verifies the effectiveness of the preset-time control.
[0130] In summary, this invention can precisely set the convergence time of the system, significantly reduce the computational complexity of the control algorithm, effectively solve the bipartite consensus problem of multi-vehicle systems under signed directed graphs, and improve the transient performance and robustness of the vehicle system.
[0131] A distributed control system for vehicle systems based on reinforcement learning, used to implement a distributed control method for vehicle systems based on reinforcement learning, including:
[0132] The model construction module constructs a signed directed graph based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle. A positive connection weight in the signed directed graph indicates a cooperative relationship between the two corresponding vehicles, and a negative connection weight indicates a competitive relationship between them. For each vehicle, a second-order feedback nonlinear model is constructed through coordinate transformation.
[0133] The error transformation module takes as the tracking error the sum of the cooperative or competitive interaction terms between each vehicle and its neighbors and the deviation of each vehicle's trajectory from that of the leader vehicle. It constructs a time performance function based on the expected convergence time, the initial error boundary, and the steady-state error bound, and converts the tracking error into a transformation error through the time performance function.
[0134] The control design module derives a virtual control law from the transformation error, combined with the dynamic compensation network, the control generation network, and the disturbance upper bound. The vehicle velocity is obtained from the second-order feedback nonlinear model, and the difference between this velocity and the virtual control law gives the velocity error. The dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network are then updated.
[0135] Based on the speed error and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the control design module obtains the actual control input. The instruction execution module converts the actual control input into linear acceleration and angular velocity commands that the vehicle can execute directly. These commands drive each vehicle to track the leader's state or its opposite within the expected convergence time, according to whether the relationship is cooperative or competitive, thereby completing the distributed control of the vehicle system.
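The conversion performed by the instruction execution module can be sketched as follows. The sketch assumes a unicycle-type vehicle with kinematics x' = v cos(θ), y' = v sin(θ), for which the accelerations in the transformed Cartesian coordinates satisfy x'' = a cos(θ) − v ω sin(θ) and y'' = a sin(θ) + v ω cos(θ); inverting these relations gives the linear acceleration a and the angular velocity ω. The vehicle model, the function name `to_vehicle_commands`, and the threshold `v_min` are assumptions made for illustration and are not taken from the patent.

```python
import math

def to_vehicle_commands(u_x, u_y, v, theta, v_min=1e-3):
    """Convert Cartesian acceleration inputs into (linear acceleration, angular velocity).

    u_x, u_y : control input components along the x and y axes
    v        : current linear velocity of the vehicle
    theta    : current orientation angle of the vehicle, in radians
    v_min    : hypothetical guard, because the inversion is singular at v = 0
    """
    if abs(v) < v_min:
        raise ValueError("linear velocity too small: conversion is singular at v = 0")
    # project the commanded acceleration onto the heading direction
    a = u_x * math.cos(theta) + u_y * math.sin(theta)
    # the component normal to the heading, divided by the speed, gives the turn rate
    omega = (-u_x * math.sin(theta) + u_y * math.cos(theta)) / v
    return a, omega

# Hypothetical example: vehicle heading along the x axis at 2 m/s.
a_cmd, omega_cmd = to_vehicle_commands(u_x=1.0, u_y=4.0, v=2.0, theta=0.0)
print(a_cmd, omega_cmd)   # -> 1.0 2.0
```

Because the angular velocity is obtained by dividing by the linear velocity, a practical implementation would need some strategy for near-zero speeds; the guard above simply refuses to convert in that case.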
[0136] A distributed control device for a vehicle system based on reinforcement learning includes a processor and a memory, wherein the processor executes a computer program stored in the memory to implement a distributed control method for a vehicle system based on reinforcement learning.
Claims
1. A distributed control method for vehicle systems based on reinforcement learning, characterized in that it includes the following steps:
S1. Based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle, a symbolic directed graph is constructed; a positive connection weight in the symbolic directed graph indicates a cooperative relationship between the two corresponding vehicles, and a negative connection weight indicates a competitive relationship between them; for each vehicle, a second-order feedback nonlinear model is constructed through coordinate transformation;
S2. The cooperative or competitive relationships between each vehicle and its neighbors are combined with the deviation between each vehicle's trajectory and that of the leader vehicle to form the tracking error; a time performance function is constructed based on the expected convergence time, the initial error bound, and the steady-state error bound; the tracking error is converted into a transformation error through the time performance function;
S3. Based on the transformation error, in combination with the dynamic compensation network, the control generation network, and the disturbance upper bound, a virtual control law is obtained; the vehicle speed is obtained from the second-order feedback nonlinear model, and the speed error is obtained as the difference between this speed and the virtual control law; the dynamic compensation network, the disturbance upper bound, the performance evaluation network, and the control generation network are updated using the speed error; based on the speed error and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, the actual control input is obtained;
S4. The actual control input is converted into linear acceleration and angular velocity commands that the vehicle executes directly, driving the vehicle to track the leader's state or its opposite within the expected convergence time according to the cooperative or competitive relationship, thus completing the distributed control of the vehicle system.
In S1, the symbolic directed graph is constructed as follows: the communication topology of the vehicle system is described by a structurally balanced symbolic directed graph in which each adjacency matrix element is a non-zero real number; a positive value indicates cooperation between the j-th vehicle and the corresponding neighboring vehicle, and a negative value indicates competition between them.
In S2, the time performance function is constructed from the initial error bound, the steady-state error bound, the shape parameter, and the expected convergence time, as a function of the time variable t.
In S2, the tracking error is converted into the transformation error through the time performance function: the conversion takes the tracking error and the time performance function as its arguments and yields the unconstrained transformation error.
2. The distributed control method for vehicle systems based on reinforcement learning according to claim 1, characterized in that, in S3, the virtual control law is obtained based on the transformation error in combination with the dynamic compensation network, the control generation network, and the disturbance upper bound, and is expressed in terms of the following quantities: a time-varying scaling factor associated with the time performance function; the adjacency matrix entry; the traction gain of the leader on follower j; the sign function sgn; design parameters with values in the range (0, 1); the hyperbolic tangent function tanh(); an adjustable positive parameter; the unconstrained tracking error; parameters related to the expected convergence time; the disturbance upper bound; the transpose of the dynamic compensation network weights together with the first basis function vector; and the transpose of the control generation neural network weights together with the second basis function vector.
3. The distributed control method for vehicle systems based on reinforcement learning according to claim 1, characterized in that, in S3, the actual control input is obtained based on the speed error and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network, and is expressed in terms of the following quantities: the first time-controlled gain parameter; the hyperbolic tangent function tanh(); an adjustable positive parameter; the unconstrained tracking error; the second time-controlled gain parameter; the sign function sgn; design parameters with values in the range (0, 1); the updated disturbance upper bound; the transpose of the updated dynamic compensation network weights together with the third basis function vector; and the transpose of the updated control generation neural network weights together with the fourth basis function vector.
4. The distributed control method for vehicle systems based on reinforcement learning according to claim 1, characterized in that, in S3, the dynamic compensation network, the disturbance upper bound, the performance evaluation network, and the control generation network are each updated by their own update law, four update laws in total.
The update law for the disturbance upper bound gives the derivative of the updated disturbance upper bound in terms of the unconstrained tracking error, the sign function sgn, constant parameters, design parameters with values in the range (0, 1), the expected convergence time, a scaling factor, and the (1 − ξ)-power and (1 + ξ)-power terms of the disturbance upper bound estimate.
The update law for the dynamic compensation network gives the update rate of the dynamic compensation network weights in terms of the basis function vector and the (1 − ξ)-power and (1 + ξ)-power terms of the dynamic compensation neural network weights.
The update law for the performance evaluation network gives the weight vector of the performance evaluation network at the next sampling time step in terms of its weight vector at the current sampling time step, the sampling period, the gradient term based on the temporal-difference error, and the learning rate, where k is the index of the discretized sampling time step and Proj is the projection operator.
The update law for the control generation network gives the weight vector of the control generation neural network at the next sampling time step in terms of its weight vector at the current sampling time step, the sampling period, the gradient term based on the temporal-difference error, and the learning rate.
5. The distributed control method for vehicle systems based on reinforcement learning according to claim 1, characterized in that, in S4, the actual control input is converted into linear acceleration and angular velocity commands that the vehicle can execute directly, the conversion being expressed in terms of the following quantities: the actual control input, whose components are the linear accelerations in the x-direction and the y-direction of the transformed Cartesian coordinate system; the linear acceleration; the angular velocity; the linear velocity; and the orientation angle of the j-th vehicle.
6. A distributed control system for vehicle systems based on reinforcement learning, used to implement the distributed control method for vehicle systems based on reinforcement learning as described in any one of claims 1-5, characterized in that it includes:
a model construction module, which constructs a symbolic directed graph based on a multi-vehicle tracking model consisting of several vehicles and a leader vehicle, wherein a positive connection weight in the symbolic directed graph indicates a cooperative relationship between the two corresponding vehicles and a negative connection weight indicates a competitive relationship between them, and which constructs a second-order feedback nonlinear model for each vehicle through coordinate transformation;
an error transformation module, which forms the tracking error by combining the cooperative or competitive relationships between each vehicle and its neighbors with the deviation between each vehicle's trajectory and that of the leader vehicle, constructs a time performance function based on the expected convergence time, the initial error bound, and the steady-state error bound, and converts the tracking error into a transformation error using the time performance function;
a control design module, which obtains a virtual control law based on the transformation error in combination with the dynamic compensation network, the control generation network, and the disturbance upper bound; obtains the vehicle speed from the second-order feedback nonlinear model and the speed error as the difference between this speed and the virtual control law; updates the dynamic compensation network, the disturbance upper bound, the performance evaluation network, and the control generation network using the speed error; and obtains the actual control input based on the speed error and the updated dynamic compensation network, disturbance upper bound, performance evaluation network, and control generation network;
an instruction execution module, which converts the actual control input into linear acceleration and angular velocity commands that the vehicle can execute directly, and drives the vehicle to track the leader's state or its opposite within the expected convergence time according to the cooperative or competitive relationship, thereby completing the distributed control of the vehicle system.
7. A distributed control device for vehicle systems based on reinforcement learning, characterized in that it includes a processor and a memory, wherein the processor executes a computer program stored in the memory to implement the distributed control method for vehicle systems based on reinforcement learning as described in any one of claims 1-5.