Autonomous Decision-Making Optimization System Based on Reinforcement Learning and Multi-Agent Collaboration
By using an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration, the system dynamically adjusts control parameters and risk exclusion domain, solving the problem of balancing safety and efficiency of mobile vehicles in complex environments, and achieving robust motion and safe obstacle avoidance under high uncertainty.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XINXIANG MEDICAL UNIV
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-30
AI Technical Summary
In multi-agent collaborative operations, existing mobile vehicles suffer from fixed control parameters and static safety boundaries that are difficult to adapt to complex dynamic environments, making it difficult to balance safety and traffic efficiency. Traditional collaborative mechanisms lack the quantification and sharing of agent decision-making uncertainty, leading to decision deadlock and increased collision risks.
An autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration is adopted. The system constructs a local occupancy grid map through an environment perception module, broadcasts policy entropy through a data communication module, quantifies decision uncertainty through a policy reasoning module, dynamically adjusts control parameters through a parameter adjustment module, optimizes the path through a trajectory planning module, processes low-confidence data through a data buffering module, and generates driving commands through a tracking control module, thereby achieving safe avoidance in dynamic environments.
It improves the adaptive safety and collaborative efficiency of mobile vehicles in unstructured environments. By dynamically adjusting control parameters and risk rejection domain, it reduces the probability of collisions, improves the robustness and control accuracy of the system, and ensures robust movement under extreme conditions.
Smart Images

Figure CN122308083A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of autonomous navigation and multi-agent control technology, specifically to an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration. Background Technology
[0002] As the application of mobile robots and autonomous driving technologies continues to expand, mobile vehicles need to perform collaborative tasks in increasingly complex dynamic environments. Existing navigation and control systems typically employ optimization-based methods such as model predictive control to generate motion trajectories. These methods often have fixed prediction time domains, reference speeds, and safety constraints set during the design phase. However, in unstructured scenarios or environments with high sensor noise, fixed control parameters struggle to adapt to real-time changing conditions. For example, when environmental perception information is ambiguous or the decision-making model is uncertain about the current scenario, maintaining a long-period prediction time domain can lead to erroneous inferences based on biased states, while fixed speed settings can cause safety accidents due to the inability to respond promptly to potential risks.
[0003] Furthermore, in multi-agent collaborative operation scenarios, existing communication and interaction mechanisms are mainly limited to sharing physical state information such as the position, speed, and heading of each agent. This interaction mode lacks the quantification and transmission of the agent's own decision confidence or cognitive state. Neighboring nodes cannot perceive the uncertainty of surrounding agents regarding their current decisions, causing them to only be able to avoid obstacles based on a uniform standard when planning their own paths. This collaborative approach, lacking cognitive dimension interaction, makes it difficult for the system to implement proactive defensive avoidance strategies for agents in a highly uncertain state. When encountering sudden situations or complex interactive games, decision deadlocks or even collisions can easily occur, limiting the overall safety and operational efficiency of multi-agent systems in complex environments. Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration. This system solves the problem that existing mobile vehicles typically use fixed control parameters and static safety boundary settings in multi-agent collaborative operations. In complex, dynamic, or high-density scenarios, fixed parameters cannot adapt to environmental changes, making it difficult to balance safety and traffic efficiency. Furthermore, traditional collaborative mechanisms only exchange motion state information such as position and velocity, lacking the quantification and sharing of the agent's own perception and decision-making uncertainties. This makes it impossible for neighboring nodes to identify and perform prior avoidance planning for high-risk targets.
[0005] To achieve the above objectives, this invention provides the following technical solution: an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration, deployed in a mobile vehicle. This system mainly consists of an environment perception module, a data communication module, a policy reasoning module, a parameter adjustment module, a trajectory planning module, a data buffering module, and a tracking control module.
[0006] The environmental perception module constructs a local occupancy grid map of the mobile vehicle's surroundings and estimates its own motion state. This module establishes a unified time reference through a hardware triggering mechanism, aligning the timestamps of LiDAR, inertial measurement unit, and odometry data. It uses an extended Kalman filter to estimate the mobile vehicle's pose state and performs motion distortion compensation on the LiDAR point cloud data based on this pose state. A Bayesian filtering rolling window mechanism maps the point cloud data into a local occupancy grid map containing logarithmic probabilities.
[0007] The data communication module constructs its own state data packets using its own motion state and policy entropy, and broadcasts them to nearby mobile vehicles, while simultaneously receiving state data packets from nearby mobile vehicles. This module broadcasts data packets containing global position coordinates, velocity vectors, and policy entropy values at preset intervals, and performs timestamp-based temporal consistency checks and Euclidean distance-based spatial correlation gating on the received data packets to construct a local interactive topology.
[0008] The strategy reasoning module performs inference based on the local occupancy grid map and the proximity information contained in the state data packet. This module deploys a convolutional neural network model that receives the local occupancy grid map, the global directional potential field map, and the normalized self-motion state map as input tensors. The model outputs the action probability distribution and residual potential field parameters in discrete motion directions. The module uses the Shannon entropy formula to calculate the entropy value of the action probability distribution, generating the policy entropy to quantify the decision uncertainty in the current scenario.
[0009] The parameter adjustment module adjusts control parameters based on policy entropy and parses neighboring policy entropy in the status data packet to establish a dynamic environmental constraint model. This module calculates the predicted time-domain parameters and maximum reference speed required for trajectory planning based on a nonlinear mapping function, establishing a negative correlation mapping relationship between policy entropy and the predicted time-domain parameters and maximum reference speed. Simultaneously, it calculates the dynamic expansion radius of neighboring moving vehicles based on neighboring policy entropy. This radius is formed by superimposing the physical geometric boundary with an uncertainty expansion boundary vector based on policy entropy, and is used to construct the risk rejection domain.
[0010] The trajectory planning module solves the Model Predictive Control (MPC) problem based on a dynamic environment constraint model and control parameters. This module constructs a cost function comprising a trajectory tracking error term, a control constraint term, and a residual potential energy term, where the weight of the residual potential energy term is determined by the residual potential field parameters. The module uses the dynamic environment constraint model as a soft constraint cost and solves the nonlinear optimization problem within a time limit defined by the set prediction time domain parameters. The prediction time domain is dynamically truncated by combining a basic safety time domain with an uncertainty reduction factor based on policy entropy. The module also constructs a dynamic constraint set, dynamically shrinking the allowable range of the control quantity according to the increase of policy entropy.
[0011] The data buffer module employs a first-in, first-out (FIFO) storage structure and a double-buffered atomic pointer exchange mechanism, connecting the trajectory planning module and the tracking control module and storing the optimal trajectory sequence. The module calculates the confidence weight of the trajectory data based on the proximity policy entropy at the data source and the data transmission lag time. When the policy entropy is high, the confidence weight is reduced more rapidly by increasing the time decay rate.
[0012] The tracking control module reads the optimal trajectory sequence, calculates the reference state and feedforward control quantity at the current moment using an interpolation algorithm, and generates drive commands. When the module detects that the valid trajectory sequence in the data buffer module is exhausted, it switches to reactive obstacle avoidance mode and executes emergency braking logic using the latest environmental perception data.
[0013] This invention provides an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration. It has the following beneficial effects: 1. This invention effectively improves the adaptive safety of mobile vehicles in unstructured environments by establishing a dynamic mapping mechanism between strategy entropy and model predictive control parameters. The system uses a nonlinear function to negatively correlate the strategy entropy, which quantifies decision uncertainty, with the prediction time domain and speed limit. When perception ambiguity or environmental complexity leads to an increase in strategy entropy, the system can proactively shorten the prediction step size and limit the movement speed, avoiding the planner from making long-term ineffective deductions based on highly uncertain state information. This reduces control divergence caused by accumulated errors and ensures that the vehicle can maintain a robust motion state even under extreme conditions.
[0014] 2. This invention utilizes a state broadcasting mechanism that includes policy entropy to construct a dynamic interactive topology, optimizes the cooperative avoidance strategy among multiple agents, and by analyzing the policy entropy of neighboring vehicles, the system can superimpose dynamic uncertainty vectors on the basis of physical geometric boundaries, construct an expanded risk rejection domain for high uncertainty targets, enabling agents to identify potential decision risks of neighboring nodes and reserve a larger safety buffer space in advance, reducing the collision probability in the dynamic obstacle interaction process without the need for complex explicit negotiation, and improving the overall passage efficiency in dense operation scenarios.
[0015] 3. This invention adopts a hybrid decision architecture that combines deep learning inference and nonlinear model predictive control, which enhances the robustness and control accuracy of the system. By incorporating the residual potential field parameters output by the neural network into the cost function of MPC, the system can use the complex environmental features extracted by the data-driven model to guide the trajectory optimization direction. At the same time, the model-driven algorithm ensures that the trajectory meets the vehicle dynamics constraints. With the data buffer and confidence decay mechanism based on entropy and time coupling, low confidence data can be effectively filtered out and communication latency can be handled, ensuring the smoothness and reliability of the underlying drive commands. Attached Figure Description
[0016] Figure 1 This is a schematic diagram of the system architecture of the present invention; Figure 2 This is a flowchart of the method of the present invention; Figure 3 This is a diagram showing the dynamic mapping relationship between the strategy entropy and control parameters of the present invention.
[0017] Among them, 10 is the mobile vehicle; 110 is the environmental perception module; 120 is the data communication module; 130 is the strategy reasoning module; 140 is the parameter adjustment module; 150 is the trajectory planning module; 160 is the data buffer module; and 170 is the tracking control module. Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Please see the appendix Figure 1 , Figure 1 This is a schematic diagram of the system architecture of the present invention. The present invention provides an autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration, which is deployed on a distributed mobile vehicle 10. The mobile vehicle 10 is equipped with independent computing units, sensing devices, and communication devices, and can independently perform navigation tasks and interact with external nodes. The navigation system includes an environment perception module 110, a data communication module 120, a policy reasoning module 130, a parameter adjustment module 140, a trajectory planning module 150, a data buffer module 160, and a tracking control module 170.
[0020] The environmental perception module 110 is configured to collect physical environment information around the mobile vehicle 10. This module uses point cloud data acquired by airborne LiDAR or depth camera, combined with odometry positioning data, to construct a local occupancy grid map centered on the mobile vehicle 10. This grid map represents the spatial distribution of static obstacles.
[0021] The data communication module 120 is configured to perform data interaction between mobile vehicles 10. This module broadcasts its own status data packets to other mobile vehicles 10 within its communication range according to a preset communication cycle. The status data packets contain global position coordinates, velocity vectors, and policy entropy values. The data communication module 120 receives status data packets broadcast by neighboring mobile vehicles 10, parses them to obtain the neighboring positions, velocities, and policy entropy values, and uses this information to construct a local interaction topology.
[0022] The strategy reasoning module 130 is configured to perform deep learning reasoning based on a local occupancy grid map and neighboring state data. This module deploys a pre-trained convolutional neural network model, which takes the grid map as input and outputs the action probability distribution and residual potential field parameters in the current state. The strategy reasoning module 130 calculates Shannon entropy based on the action probability distribution, defining the result as strategy entropy, which is used to quantify the uncertainty of the mobile vehicle 10 regarding the current decision-making environment. The residual potential field parameters include the coordinates of the virtual gravity point and the virtual repulsive field strength coefficient.
[0023] The parameter adjustment module 140 is configured to construct a mapping relationship between policy entropy and control parameters. This module receives the policy entropy and calculates the prediction time-domain parameters and replanning frequency parameters required by the trajectory planning module 150 according to a preset nonlinear mapping function. When the policy entropy increases, it outputs smaller prediction time-domain parameters and lower replanning frequency parameters; when the policy entropy decreases, it outputs smaller prediction time-domain parameters and higher replanning frequency parameters. The parameter adjustment module 140 uses the neighboring policy entropy values received by the data communication module 120 to calculate the expansion radius of neighboring obstacles over time, establishing a dynamic environmental constraint model.
[0024] The trajectory planning module 150 is configured to solve nonlinear model predictive control problems based on the parameters and constraint models output by the parameter adjustment module 140. This module constructs a cost function including trajectory tracking error terms, control constraint terms, and residual potential energy terms, and performs optimization within the prediction time domain set by the parameter adjustment module 140 to generate an optimal trajectory sequence containing state points and control quantities at multiple future time points. The triggering frequency of the trajectory planning module 150 is controlled by the parameter adjustment module 140.
[0025] The data buffer module 160 is configured as an asynchronous data interface connecting the trajectory planning module 150 and the tracking control module 170. This module employs a first-in, first-out (FIFO) storage structure, receiving and storing the optimal trajectory sequence output by the trajectory planning module 150. Whenever a new trajectory sequence is generated, the data buffer module 160 updates its internally stored data and aligns the timestamps.
[0026] The tracking control module 170 is configured to operate at a constant high-frequency cycle to perform low-level motion control. This module reads trajectory data points corresponding to the current time window from the data buffer module 160 based on the current system time, and calculates the reference state and feedforward control quantity at the current moment using an interpolation algorithm. The tracking control module 170, combined with the actual state feedback of the moving vehicle 10, calculates the final motor drive command and sends it to the low-level driver.
[0027] See attached document Figure 2 , Figure 2 This is a flowchart of the method of the present invention. The present invention provides an autonomous decision-making optimization method based on reinforcement learning and multi-agent collaboration, comprising the following steps: S100: Use the environment perception module 110 to collect environmental data and construct a spatiotemporal state tensor, and use the data communication module 120 to obtain the position, speed and strategy entropy information of the nearby mobile vehicle 10. S200, input the spatiotemporal state tensor into the policy reasoning module 130, obtain the action probability distribution through neural network reasoning and calculate its own policy entropy, and generate residual potential field parameters for correcting the cost function; S300 uses parameter adjustment module 140 to dynamically adjust the prediction time domain length and planning frequency according to its own strategy entropy, calculates its spatial expansion envelope at future prediction time according to neighboring strategy entropy and updates obstacle avoidance constraints. S400, In the trajectory planning module 150, a nonlinear model predictive control problem is constructed, the residual potential field parameters are transformed into potential energy weight terms in the objective function, and the spatial expansion envelope is introduced into the constraint conditions. The optimal control sequence is solved in the prediction time domain determined in step S300. S500, the optimal control sequence obtained by the solution is asynchronously written into the data buffer module 160, and the writing frequency changes dynamically with the planning frequency in step S300. S600 uses the tracking control module 170 to read data from the data buffer module 160 at a fixed high frequency period, and generates the final control command to drive the mobile vehicle 10 through interpolation calculation and feedback compensation.
[0028] In a distributed navigation system, the environmental perception accuracy of the mobile vehicle 10 directly determines the reliability of subsequent decisions. Different sensors differ in sampling frequency, data timeliness, and noise characteristics, so it is necessary to construct a unified spatiotemporal reference.
[0029] In this embodiment, the onboard computing unit establishes a unified time reference through a hardware triggering mechanism. The system selects the scanning cycle of the LiDAR as the master clock reference. To address the asynchronous problem of multi-source data, the computing unit maintains a circular data buffer. When a trigger signal (i.e., frame header timestamp) of a frame of LiDAR data is received, the time sequence within the buffer is traced back to extract the IMU data and odometer data within the corresponding time interval. For discrete low-frequency odometer data, the system uses linear interpolation for position and spherical linear interpolation or minimum angle difference interpolation for attitude angles to calculate the trajectory estimation value at the corresponding time. For high-frequency IMU data, interpolation is also used to ensure a smooth transition of rotation.
[0030] After acquiring aligned raw sensor data, the system performs pose state estimation based on an extended Kalman filter (EKF) architecture. The system defines a state vector containing global position, heading angle, linear velocity, and angular velocity. The system uses wheel odometer observations as measurement update inputs and IMU acceleration and angular velocity integrals as state prediction inputs. During filtering, to prevent the covariance matrix from losing positive definiteness due to numerical calculation issues, square root filtering is used, or a conventional matrix symmetry operation (i.e., the sum of the matrix and its transpose) is enforced in the update step. After filtering, the optimal estimated pose of the mobile vehicle 10 in the global coordinate system is obtained.
[0031] Based on this, the system needs to address the motion distortion problem of the laser point cloud. This embodiment employs a motion compensation algorithm based on microsecond-level timestamps. First, the system calculates the relative pose transformation matrix of the vehicle body relative to the frame start time at the instant the laser point is physically acquired, using interpolation, based on the pose trajectory output by the EKF. Then, using a rigid body transformation model, the original measurement points are left-multiplied by this relative pose transformation matrix and a pre-calibrated lidar extrinsic parameter matrix, thereby projecting all points onto a unified vehicle coordinate system at the frame start time. If rotation matrix interpolation is involved, it is performed in Lie algebra space to ensure orthogonality.
[0032] Finally, for outliers caused by sensor noise and airborne particles, the system performs statistical outlier removal. This process iterates through each point in the point cloud, searching for its nearest neighbors in 3D space (e.g., 20 to 50), and calculates the average Euclidean distance from the point to these neighbors. After iterating through all points, the global mean and standard deviation of the distances are calculated. A point is considered valid if its average neighborhood distance is less than the global mean plus a certain multiple of the global standard deviation. Points deemed invalid are directly removed from the point cloud index.
[0033] This embodiment employs a Bayesian filtering-based rolling window mapping mechanism to transform discrete point clouds into continuous, dense probabilistic raster representations containing uncertain information.
[0034] The system first initializes a fixed-dimensional two-dimensional matrix in the vehicle coordinate system. Each element in the matrix does not directly store the probability of an obstacle's existence, but rather its log-odds. The physical purpose of introducing log-odds is to transform the multiplication operations in the Bayesian update process into addition operations. The conversion between log-odds and probabilities follows the standard logistic transformation relation.
[0035] When processing each frame of input point cloud data, the system performs geometric filtering based on a height threshold to remove invalid information. Only when the height coordinates of a point are between the ground height threshold and the maximum airworthiness altitude is the point identified as a valid obstacle feature point and projected onto the two-dimensional grid plane.
[0036] Based on the projected obstacle points, the system updates the grid state using a reverse sensor model combined with a ray projection algorithm. For each valid obstacle point, a virtual ray is constructed from the sensor's optical center to that point. The area traversed by this ray is considered an empty area, while the grid cell where the ray ends is considered an occupied area. A binary Bayesian filter is used to update the logic: the log-probability at the current time step equals the log-probability at the previous time step plus the log-probability increment from the observation model (i.e., obstacles receive a positive gain, and empty areas receive a negative gain), minus the initial prior log-probability. The system employs an asymmetric gain, where the absolute value of the obstacle confirmation gain is greater than the absolute value of the empty area clearing gain, to achieve a safe strategy of fast obstacle confirmation and slow obstacle clearing. Simultaneously, a numerical saturation truncation strategy is implemented to prevent the log-probability value from approaching infinity.
[0037] As the mobile vehicle 10 moves, the system maintains the map center aligned with the vehicle center through cyclic shift operations. Old data that moves out of the field of view is discarded, and new areas that move into the field of view are reset to prior values. Finally, the system uses the standard sigmoid function to map the maintained log-odds matrix back to a probability distribution map in the 0-1 interval, which serves as the input to the subsequent neural network.
[0038] The data communication module 120 executes the data packet structure and transmission logic of the low-bandwidth state broadcast protocol. In this embodiment, a broadcast protocol based on simplified state vectors is designed.
[0039] During the data assembly phase, the system extracts the core state information of the mobile vehicle 10 at the current moment and constructs a fixed-length binary data packet. The data packet contains fields defined as an ordered tuple, including the mobile vehicle 10's unique identifier, the system's absolute timestamp when the data was generated, the mobile vehicle 10's position coordinate vector in the global coordinate system, its velocity vector, and its policy entropy value. The policy entropy value originates from the Shannon entropy calculation result of the policy inference module 130 on the probability distribution of actions in the current environment, and is used to characterize the decision uncertainty of the mobile vehicle 10 in the current decision-making environment.
[0040] During the serialization phase, the system employs strict little-endian encoding and performs byte alignment. A 32-bit Cyclic Redundancy Check (CRC-32) code is appended to the end of the data packet to ensure data integrity.
[0041] During the data transmission phase, this embodiment uses the User Datagram Protocol (UDP) for high-frequency broadcasting (e.g., 10Hz to 20Hz) to ensure that all neighboring nodes can perceive the changes in the vehicle's state and policy entropy with millisecond-level latency.
[0042] During the data frame reception and integrity verification phase, the system performs CRC-32 checksum and strict timing consistency checks. A data packet is considered valid and fresh only if the timestamp of the received data packet is later than the timestamp of the nearest update in the local cache, and the time deviation is within the allowable timeliness threshold.
[0043] For valid data that passes verification, the system performs spatial correlation gating based on Euclidean distance. The system uses the standard Euclidean distance formula to calculate the straight-line distance between the vehicle's current position and its neighboring positions. Only when this distance is less than a preset interactive sensing radius is the node marked as an active neighboring node.
[0044] Based on the above filtering results, the system dynamically updates the internally maintained neighbor state hash table. To remove invalid constraints, the system periodically performs topology pruning based on heartbeat timeout. If the time difference between the current system time and the last update time of a neighbor exceeds a preset lifetime threshold, the neighbor is determined to be offline or moved out of communication range, and is removed from the state table.
[0045] The system first constructs a standardized multi-channel input tensor, including: a local obstacle probability map directly from the environment perception module 110; a global steering potential field map representing the distance relative to the global sub-target point, calculated based on inverse proportional or Gaussian decay; and a normalized self-motion state (linear velocity and angular velocity) map filled by a broadcast mechanism.
[0046] The network feature extraction backbone uses a lightweight convolutional residual structure. After multiple convolutions and downsampling, it is transformed into a one-dimensional feature vector, which then branches into two independent task branches: Branch 1: Action Probability Distribution Output. This branch outputs the recommended confidence probability vector for each discrete motion direction in the current state through a fully connected layer and a Softmax activation function.
[0047] Branch Two: Residual Parameter Correction Output. This branch outputs dynamic corrections (normalized values) for key parameters of the artificial potential field method through a fully connected layer and the Tanh activation function. The system transforms this normalized output into physically meaningful parameter values through linear mapping, including gravity gain corrections, repulsion gain corrections, and virtual subtarget offsets.
[0048] In the training data construction phase, to improve the model's generalization ability to extreme environments, this embodiment employs a hybrid data augmentation strategy. The system not only collects teaching trajectories from human experts in standard environments but also introduces Gaussian noise interference through a simulation engine to simulate sensor distortion and communication packet loss scenarios. Specifically, for training policy entropy, the system artificially injects conflict scenarios into the expert data and labels these scenarios with high uncertainty tags, forcing the neural network to learn to identify high-risk features. Furthermore, a curriculum learning strategy is adopted, with the difficulty of training samples gradually increasing with each training round, transitioning from single-obstacle environments to high-density dynamic crowd environments.
[0049] This embodiment employs a supervised learning process based on offline expert data for training. The loss function consists of a weighted sum of multi-class cross-entropy loss (used to supervise action probability distribution) and mean squared error loss (used to supervise parameter regression).
[0050] For the probability vector generated at the output of the action probability distribution, the system performs information entropy calculation to quantify the current decision risk. The system adopts the standard Shannon entropy calculation method: multiply the probability value of each component in the probability distribution by its base-2 logarithm, sum them, and take the negative value. To ensure computational stability, a small bias term is added to the logarithm calculation. The calculated policy entropy value directly reflects the uncertainty of the mobile vehicle 10 to the current local environment.
[0051] For the original vector generated at the residual parameter output, the system performs parameter reconstruction based on physical constraints. For each control parameter, the normalized value of the network output is first multiplied by a preset maximum adjustment coefficient, then superimposed on the nominal reference value, and finally the result is forcibly constrained between preset physical safety boundaries (maximum and minimum values) using a truncation function, thereby obtaining the final physical control parameters (such as dynamic gravity / repulsion gain, sub-target offset).
[0052] For the maximum reference speed of the mobile vehicle 10, the system performs dynamic decay calculations based on uncertainties. This embodiment employs the following two-parameter inverse proportional mapping model to ensure the speed limit. It decreases smoothly as uncertainty increases: ; in, The nominal cruise speed (i.e., the maximum design speed under fully confident conditions). This is the velocity sensitivity coefficient (used to adjust the rate at which velocity decreases with entropy). The policy entropy value output by the policy reasoning module 130.
[0053] To adjust the perception sensitivity of the mobile vehicle 10 to its surroundings, the system performs an adaptive expansion of the potential field's effective range. The system calculates the dynamic influence radius using the following linear saturation model. : ; in, Based on the basic sensing radius, This is the spatial expansion coefficient. (Function) Let be the interval cutoff function, which is defined as: when Time to take ,when Time to take Otherwise take . and These are the preset lower and upper physical limits of the dynamic influence radius, respectively.
[0054] The system further dynamically reweights the obstacle avoidance weights and defines a repulsion enhancement factor. : ; in, This represents the maximum gain ratio. It is a hyperbolic tangent function. This factor utilizes the saturation characteristics of the hyperbolic tangent function to automatically amplify the repulsive effect of obstacles in complex environments.
[0055] This embodiment constructs a dynamically changing risk exclusion domain by vector superimposing the physical geometric boundaries of neighboring nodes with the uncertainty expansion boundary.
[0056] In the parsing and preprocessing steps of neighboring state data, the system extracts the physical state and policy entropy of neighboring nodes. If the policy entropy field is missing, a preset saturation entropy value is assigned.
[0057] Based on the analyzed data, the system executes the dynamic effective repulsion radius. The calculation steps are as follows. This embodiment constructs the following linear weighted expansion model: ; in, For neighbors The physical circumcircle radius; For neighbors Policy entropy value in broadcast data packets; For neighbors The Euclidean norm of the velocity vector (i.e., the magnitude of the linear velocity); The uncertainty inflation coefficient; For speed look-ahead coefficient; This represents the maximum allowed additional expansion increment.
[0058] In addition, the system also performs dynamic right-of-way arbitration based on policy entropy. During the interaction, the vehicle compares its own policy entropy. Entropy of neighboring policies .like This means that the vehicle has a high degree of confidence in its decision-making, and the system will automatically reduce the repulsive force gain towards that neighboring vehicle. When the vehicle is in a state of high uncertainty, it exhibits a proactive passage strategy with low avoidance weights; conversely, if the vehicle is in a state of high uncertainty, the system actively increases the repulsive force gain and expansion radius, exhibiting a conservative yielding behavior. This implicit negotiation mechanism based on cognitive state can effectively alleviate deadlock in multi-agent systems.
[0059] After determining the dynamic repulsion radius, the system establishes an improved artificial potential field model. This vehicle, relative to its neighbors... repulsive force vector Based on the Euclidean distance between this vehicle and its neighbors. The definition is as follows: ,but ; like ,but ; like ,but ; in, This is a preset saturation repulsion amplitude, used to prevent the calculated value from overflowing when the distance is too close; To be from the neighboring The unit direction vector pointing to this vehicle; It is a very small positive number (physical safety buffer distance); is the repulsive force gain coefficient. This model explicitly transforms the uncertainty of the neighboring state into a spatial repulsive force on the vehicle.
[0060] The motion control module maps the parameters output by the neural network to dynamic weights in the optimization problem and uses the neighboring potential field model as a soft constraint cost.
[0061] The system first performs state deduction in the prediction time domain. Based on the discretized vehicle kinematics model, in the prediction time domain... The sequence of future states is deduced from the inside forward.
[0062] Furthermore, the system constructs a total cost function that includes dynamic weights. The total cost function is defined as follows: ; in, To predict the length of the time domain; , They are respectively State variables and control variables at any given time; Reference trajectory state; To control the increment. Represented by the weight matrix Weighted quadratic norm (e.g.) ). It is a positive definite diagonal matrix, and its main diagonal elements are positively correlated with the gravitational gain; and These are the control input weight matrix and the control increment weight matrix, respectively. This is the penalty weight matrix for the terminal state.
[0063] The cost of obstacle potential field based on cognition Defined as: ; in, The repulsive gain from the neural network, for This car and the first The distance between the obstacles; The dynamic expansion radius; Basic obstacle avoidance weights; To prevent regularization terms with a denominator of zero; For indicator functions, when the condition The value is 1 if the condition is met, and 0 otherwise. The activation distance threshold for the potential field to take effect.
[0064] Finally, during the solver configuration phase, the system transforms the optimization problem into a sequence of quadratic programming subproblems, which are then solved using a real-time iterative algorithm.
[0065] To ensure real-time convergence of the NMPC algorithm within millisecond cycles, this embodiment employs a warm-start mechanism. The solver utilizes the optimal solution sequence from the previous control cycle, shifts it forward one step on the time axis, and fills the newly generated end moments with linear extrapolation of the end states, thereby constructing the initial value guess sequence for the current moment. If the policy entropy at the current moment... Mutations occur, leading to changes in the prediction time domain. If the time domain is shortened, the system will truncate the solution sequence of the previous cycle to adapt to the new dimension; if the time domain is extended, the initial values will be completed using a kinematic model. This mechanism reduces the number of iterations for the quadratic programming subproblem.
[0066] For real-time calculation of the predicted time domain length, the system employs a dual-constraint model based on braking safety and information reliability. The optimal number of prediction steps is dynamically calculated. : ; in, This indicates the floor function; For braking safety redundancy coefficient, The linear velocity of the vehicle at the current moment; It is the absolute value of the maximum braking deceleration; The discrete sampling period of the control system; To prevent the value from being divided by zero and reaching a minimum; Based on the time domain offset This is the uncertainty reduction factor. The formula reflects the change in policy entropy. When the value is high, the system will proactively shorten the prediction time domain to reduce reliance on uncertain long-term states.
[0067] After determining the current prediction time domain, the system further performs dynamic shrinkage of the control constraint set. The system constructs the following dynamic constraint set. Limit control quantity Scope: ; in, and This represents the limit boundary of the physical actuator. Constraint contraction function. Defined as: ; in, The constraint is a contraction strength coefficient. This causes the system to actively compress the available dynamic envelope under conditions of high decision uncertainty.
[0068] Based on the standardized definition of trajectory data units, the system encapsulates kinematic information and source policy entropy in memory. The data structure.
[0069] To address the temporal misalignment issue of multi-source data, the system performs linear interpolation resampling based on a sliding window. The nearest states at the requested time are calculated using standard linear interpolation formulas. If the requested time exceeds the range of the latest trajectory, a finite-duration extrapolation is performed based on kinematic formulas.
[0070] Furthermore, the system performs dynamic confidence level maintenance based on entropy and time coupling. Confidence level weights. The calculation formula is as follows: ; in, is the base of the natural logarithm; Based on the base time decay rate, It is the entropy sensitivity coefficient; The policy entropy at the data source end; This is the current system absolute time; This is the timestamp of the received data packet. This formula indicates that if a neighboring transmission trajectory carries a high policy entropy ( (For large numbers of data, the confidence level of the data decays at a faster rate over time.)
[0071] In terms of concurrent read and write memory management, the system uses double buffering combined with an atomic pointer swapping mechanism to avoid lock contention.
[0072] For data interaction between asynchronous threads, the system adopts a read-write separation strategy based on multi-version atomic snapshots. Even-number checks and version consistency comparisons are performed using atomically incrementing version numbers to ensure that no write operations occur during the read process, thus avoiding read-write fragmentation.
[0073] After acquiring a valid data snapshot, the system performs a smoothing estimate of the clock skew. A standard exponentially weighted moving average filter is used to track the difference between the transmit and receive timestamps to filter out network jitter and extract the physical clock skew.
[0074] Furthermore, the system performs kinematic-based transmission delay compensation. This is achieved using conventional kinematic formulas. The stale state is projected onto the current control moment, where the acceleration term is obtained through the first-order difference of velocity and its amplitude is truncated. If the calculated total delay exceeds a preset threshold, extrapolation stops and data loss is marked.
[0075] Based on the data loss status of the markers, the system triggers a tiered safety failure protection logic. If the valid trajectory sequence in the data buffer module 160 is exhausted due to communication interruption or planning timeout (i.e., the buffer is empty), the tracking control module 170 immediately switches to a pure reactive obstacle avoidance mode. In this mode, the system bypasses the high-level planning results of NMPC and directly uses the most recent frame of lidar data to execute low-level control based on simple virtual force fields or emergency braking logic to stop the moving vehicle 10 at maximum deceleration until the upper-level planning data returns to normal. This asynchronous safety mechanism, which decouples software and hardware, ensures the physical safety of the vehicle when the computing unit is under high load or crashes.
[0076] Specific application example: Collaborative operation of multiple AGVs in warehousing and logistics To more clearly illustrate the technical solution of this invention, the following describes a typical intelligent warehousing and logistics scenario, and refers to the attached diagram. Figure 3 The dynamic mapping relationship between strategy entropy and control parameters is described in detail.
[0077] Scene setting: Within a 50m x 50m automated logistics transfer area, mobile vehicles 10 (hereinafter referred to as AGV-A) of the system described in this invention are deployed. Dynamic obstacles (human workers) and other nearby vehicles working in coordination exist within this area. The task of the AGV-A is to transport goods to designated outbound points.
[0078] Implementation process: Environmental perception and status broadcasting: The AGV-A detects a moving object (dynamic obstacle) ahead using its onboard LiDAR. Furthermore, the system acquires the status of nearby vehicles via its communication module, constructing a local interactive topology.
[0079] Strategic Reasoning and Uncertainty Quantification: The AGV-A's policy reasoning module 130 inputs a local grid map into a convolutional neural network. Due to a sudden change in the human worker's direction of travel, the network's output action probability distribution flattens. The system calculates that the current policy entropy has increased, reaching a high uncertainty state of approximately 0.85. This instantaneous change in environmental uncertainty corresponds to the attached... Figure 3 The peak portion of the policy entropy versus time curve in the above subgraph.
[0080] Parameter dynamic adjustment: Parameter adjustment module 140 detected a sharp increase in policy entropy and immediately triggered a safety and conservative policy. The internal system execution is as follows: Figure 3 The parameter mapping response is shown below: Speed limit: as attached Figure 3 As shown in the neutron diagram, the maximum reference velocity decreases as the policy entropy increases, and the system dynamically lowers the maximum reference velocity from the nominal 1.5 m / s to a minimum of about 0.6 m / s.
[0081] Expansion radius: The system calculates the dynamic expansion radius (safety envelope) around the obstacle as the policy entropy changes. For this highly uncertain obstacle, a risk rejection domain much larger than the basic physical boundary is constructed.
[0082] Prediction time domain: as attached Figure 3 As shown in the sub-figure below, the prediction time-domain step size decreases as the policy entropy increases. The system reduces the prediction step size of NMPC from the basic 25 steps to 15 steps to avoid making ineffective extrapolations based on the current high uncertainty information.
[0083] Trajectory Planning and Control: Within the shortened prediction time domain, the trajectory planning module 150 considers the increased dynamic expansion envelope constraint and solves for an avoidance trajectory that is pre-adjusted and smoothly decelerates. This avoidance trajectory maintains a sufficient safety distance from the edge of the dynamically expanding risk exclusion domain.
[0084] In contrast, traditional methods, due to their use of fixed basic physical boundaries, fail to perceive high uncertainty risks, resulting in planned paths that are too close to obstacles, posing a risk of collision.
[0085] Experimental verification and effect comparison: To verify the effectiveness of the system of the present invention, comparative experiments were conducted in a simulation environment. The experimental subjects included: The present invention (adaptive NMPC) enables a dynamic parameter adjustment mechanism driven by policy entropy.
[0086] Comparative scheme (traditional NMPC): adopts a fixed expansion radius (0.6m), a fixed prediction time domain (30 steps), and a fixed velocity upper limit (1.2m / s).
[0087] Test conditions: 100 random navigation tests were conducted in a high-density mixed traffic flow scenario containing 5 dynamic obstacles and 3 adjacent vehicles.
[0088] Experimental results data: Performance indicators Comparative Scheme (Traditional NMPC) This invention's solution (adaptive NMPC) Performance Improvement Analysis Collision rate 12% 1% Collision rate reduced. Traditional solutions cannot cope with uncertainties under high dynamic conditions; this invention actively expands the safety envelope at high entropy values, effectively avoiding collisions. Average travel time 45.2 seconds 42.8 seconds Efficiency Improvement. Although this invention slows down the vehicle in hazardous areas, it employs a more aggressive parameter strategy in open areas (low entropy values), resulting in an overall improvement in efficiency. Number of deadlocks 18 times 2 times Interaction improvements. An implicit negotiation mechanism based on policy entropy solves the deadlock problem in multi-vehicle game theory at narrow intersections. urgency <![CDATA[2.8m / s 3 ]]> <![CDATA[1.2m / s 3 ]]> Improved comfort. Adaptive predictive time domain (attached) Figure 3 Subgraph logic and filtering algorithms prevent drastic changes in control commands. in conclusion: Experiments show that this invention, by introducing policy entropy as a measure of cognitive uncertainty and dynamically adjusting the control parameters and environmental constraints of the multi-agent system (MPC) accordingly, successfully resolves the contradiction between conservatism and efficiency in traditional methods. It improves the overall collaborative efficiency of the multi-agent system while ensuring safety under extreme conditions.
Claims
1. An autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration, deployed in a mobile vehicle (10), characterized in that, include: The environment perception module (110) is used to construct a local occupancy grid map characterizing the physical environment around the mobile vehicle (10) and to estimate the motion state of the mobile vehicle (10). The data communication module (120) is used to construct its own state data packet using its own motion state and strategy entropy and broadcast it to the neighboring mobile vehicle (10), while receiving the state data packet from the neighboring mobile vehicle (10); The strategy reasoning module (130) is used to reason based on the local occupancy grid map and the proximity information contained in the state data packet, output the action probability distribution and calculate the strategy entropy to quantify the decision uncertainty; The parameter adjustment module (140) is used to adjust the control parameters according to the policy entropy and parse the neighboring policy entropy in the state data packet to establish a dynamic environment constraint model. The trajectory planning module (150) is used to solve the model predictive control problem based on the dynamic environment constraint model and the control parameters, and generate the optimal trajectory sequence. A data buffer module (160) is used to connect the trajectory planning module (150) and the tracking control module (170) and store the optimal trajectory sequence; The tracking control module (170) is used to read the optimal trajectory sequence and generate driving instructions.
2. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 1, characterized in that, The environmental perception module (110) is used to establish a unified time reference through a hardware triggering mechanism to align the timestamps of lidar data, inertial measurement unit data and odometer data. The pose state of the mobile vehicle (10) is estimated using extended Kalman filtering as its own motion state. Motion distortion compensation is performed on the point cloud data of the lidar data based on the pose state. The point cloud data is mapped to the local occupancy grid map containing log probability through a Bayesian filtering rolling window mechanism.
3. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 1, characterized in that, The data communication module (120) is used to broadcast the self-state data packet containing the global position coordinates, velocity vector and policy entropy value of the mobile vehicle (10) according to a preset period; and to perform time-stamp-based temporal consistency check and Euclidean distance-based spatial correlation gating on the received state data packets of the neighboring mobile vehicles (10), thereby constructing a local interactive topology.
4. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 1, characterized in that, The strategy reasoning module (130) is equipped with a convolutional neural network model, which is used to receive the local occupancy grid map, the global directional potential field map and the normalized self-motion state map as input tensors; the branch output of the convolutional neural network model is used to output the action probability distribution and residual potential field parameters of the discrete motion direction; the strategy reasoning module (130) is also used to calculate the entropy value of the action probability distribution as the strategy entropy using the Shannon entropy formula.
5. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 4, characterized in that, The parameter adjustment module (140) is used to receive the strategy entropy and calculate the predicted time-domain parameters and maximum reference speed required by the trajectory planning module (150) based on the nonlinear mapping function; the nonlinear mapping function is used to establish a negative correlation mapping relationship between the strategy entropy and the predicted time-domain parameters and the maximum reference speed, resulting in the output of the predicted time-domain parameters and the maximum reference speed decreasing in value when the strategy entropy increases.
6. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 5, characterized in that, The parameter adjustment module (140) is also used to calculate the dynamic expansion radius of the neighboring mobile vehicle (10) based on the neighboring policy entropy in the status data packet received by the data communication module (120); the dynamic expansion radius is formed by superimposing the physical geometric boundary of the neighboring mobile vehicle (10) with the uncertainty expansion boundary vector calculated based on the neighboring policy entropy, and the dynamic expansion radius is used to construct the risk exclusion domain in the dynamic environment constraint model.
7. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 6, characterized in that, The trajectory planning module (150) is used to construct a cost function that includes a trajectory tracking error term, a control constraint term, and a residual potential energy term; the weight of the residual potential energy term is determined by the residual potential field parameters output by the strategy reasoning module (130); the trajectory planning module (150) is also used to use the dynamic environment constraint model as a soft constraint cost to solve the nonlinear optimization problem within the time range limited by the prediction time domain parameters set by the parameter adjustment module (140).
8. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 7, characterized in that, The trajectory planning module (150) is used to calculate the basic safety time domain based on the current linear velocity and maximum braking deceleration of the mobile vehicle (10), and dynamically truncate the basic safety time domain in combination with the uncertainty reduction factor based on the strategy entropy to obtain the final prediction time domain; and is used to construct a dynamic constraint set, and dynamically shrink the allowable range of the control quantity according to the increase of the strategy entropy.
9. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 1, characterized in that, The data buffer module (160) is used to employ a first-in-first-out storage structure and a double-buffered atomic pointer exchange mechanism; the data buffer module (160) is also used to calculate the confidence weight of the trajectory data based on the proximity policy entropy and data transmission lag time at the data source end. When the proximity policy entropy at the data source end is high, the confidence weight of the trajectory data is reduced more quickly by increasing the time decay rate.
10. The autonomous decision-making optimization system based on reinforcement learning and multi-agent collaboration according to claim 1, characterized in that, The tracking control module (170) is used to calculate the reference state and feedforward control quantity at the current moment from the data buffer module (160) through an interpolation algorithm; the tracking control module (170) is also used to detect the state of the effective trajectory sequence in the data buffer module (160), and when the effective trajectory sequence is exhausted, switch to reactive obstacle avoidance mode and execute emergency braking logic using the latest environmental perception data.