A reinforcement learning obstacle avoidance method and system based on dynamic temporal discount factors

By introducing a dynamic temporal discount factor mechanism into the reinforcement learning obstacle avoidance method, and optimizing the obstacle avoidance strategy based on real-time risk assessment indicators, the problem that fixed discount factors in traditional methods cannot balance obstacle avoidance response and trajectory optimization is solved, and more accurate and flexible obstacle avoidance decisions are achieved.

CN122308428A · Pending · Publication Date: 2026-06-30 · STATE GRID JIANGSU ELECTRIC POWER CO LTD SUZHOU BRANCH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
STATE GRID JIANGSU ELECTRIC POWER CO LTD SUZHOU BRANCH
Filing Date
2026-03-25
Publication Date
2026-06-30

Figure CN122308428A_ABST

Abstract

A reinforcement learning obstacle avoidance method and system based on a dynamic temporal discount factor is disclosed. The method acquires state observation data of the agent and obstacles; calculates the minimum safe distance deviation, maximum relative approach speed, and obstacle density index; and generates a risk urgency scalar by applying a risk urgency mapping function to these three parameters. The temporal discount factor is dynamically adjusted according to this scalar, and the generalized advantage estimation function in the proximal policy optimization (PPO) algorithm is reconstructed accordingly to update the policy network. Through this environment-adaptive discount mechanism, the application improves obstacle avoidance response speed in high-risk scenarios and global trajectory optimality in low-risk scenarios, reducing collision rate and energy consumption while keeping computational overhead low and compatibility strong.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence, and specifically relates to a reinforcement learning obstacle avoidance method and system based on dynamic temporal discount factors.

Background Technology

[0002] In the field of autonomous navigation and dynamic obstacle avoidance for intelligent robots, reinforcement learning has become a core technology for safe and efficient obstacle avoidance in complex scenarios, because it does not require explicit modeling of the environment and can adapt online. Among reinforcement learning methods, the Proximal Policy Optimization (PPO) algorithm, with its high training stability, good sample efficiency, and low sensitivity to hyperparameters, is widely used in the decision control modules of mobile platforms such as service robots, autonomous vehicles, and warehousing and logistics AGVs. This type of method typically constructs a Markov decision process and aims to maximize the discounted cumulative reward, guiding the agent to learn obstacle avoidance strategies in unknown dynamic environments.

[0003] When calculating the advantage function, the PPO algorithm relies on a preset discount factor to weigh immediate rewards against long-term future returns. In real-world dynamic environments, obstacles often exhibit non-stationary behavior such as high-speed movement, sudden changes in direction, or dense interactions. A fixed discount factor struggles to balance the timeliness of the obstacle avoidance response with the global optimality of trajectory planning. Fundamentally, the traditional PPO framework treats the discount factor as a state-independent constant, neglecting how key state variables such as robot speed, relative obstacle motion vectors, and minimum safe distance jointly determine the urgency of the risk.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, this invention provides a reinforcement learning obstacle avoidance method and system based on dynamic temporal discount factors. Through an environment-adaptive discount mechanism, it improves obstacle avoidance response speed in high-risk scenarios and optimizes global trajectory performance in low-risk scenarios, thereby reducing collision rate and energy consumption.

[0005] The first aspect of this application discloses a reinforcement learning obstacle avoidance method based on a dynamic temporal discount factor, employing the following technical solution:
While the agent performs an obstacle avoidance task, state observation data of the agent is acquired, synchronized by timestamp, and converted into a Cartesian coordinate system with the agent's current position as the origin, forming a structured state observation vector.
Risk assessment indicators are calculated from the state observation vector to evaluate the agent's current environmental risk; the risk assessment indicators include the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index.
Based on the risk assessment indicators, the risk urgency scalar of the agent at the current moment is calculated through a risk urgency mapping function; in the risk urgency mapping function, the weight of each risk assessment indicator is obtained through offline pre-training.
A dynamic temporal discount factor is generated from the risk urgency scalar through linear mapping, and the dynamic temporal discount factor replaces the fixed discount factor of proximal policy optimization to reconstruct the generalized advantage estimation function.
Proximal policy optimization is performed based on the reconstructed generalized advantage estimation function; the parameters of the policy network and value network are updated, and after the reinforcement learning policy iteration is complete, action decisions are output to drive the agent to perform obstacle avoidance actions.

[0006] Furthermore, the state observation data includes the agent's own linear velocity and angular velocity, the position vector and velocity vector of the obstacles relative to the agent, and the Euclidean distance between the agent and each obstacle.

[0007] Furthermore, the calculation process for the minimum safe distance deviation is as follows: Traverse all Euclidean distances between obstacles and the agent, and take the minimum distance as the minimum safe distance; calculate the difference between the minimum safe distance and the preset safe distance threshold; the difference is used as the minimum safe distance deviation.

[0008] Furthermore, the calculation process for the maximum relative approach speed is as follows: For each obstacle, calculate the projection of its velocity vector onto the unit line-of-sight direction from the agent to the obstacle, and take the positive part of the projection component as the approach speed of the obstacle. The maximum approach rate among all obstacles is selected as the maximum relative approach speed.

[0009] Furthermore, the calculation process for the obstacle density index is as follows: With the agent's current position as the center, a circular sensing area of a preset radius is constructed; within the circular sensing area, the number of obstacle clusters after clustering is counted, and this number is divided by a preset maximum obstacle density upper limit to obtain the obstacle density index.

[0010] Furthermore, the process of calculating the risk urgency scalar based on the risk assessment indicators includes: the risk urgency mapping function adopts a hyperbolic tangent structure; the risk urgency scalar is the hyperbolic tangent result plus 1, multiplied by a scaling factor; the input to the hyperbolic tangent function is the weighted sum of the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index, wherein the minimum safe distance deviation enters the weighted sum as its reciprocal.

[0011] Furthermore, the weights of each risk assessment indicator are obtained through offline pre-training, including: Construct a test set of multiple collision scenarios; construct an optimization function, with the optimization objective defined as minimizing the weighted sum of collision rate and path length, and define a three-dimensional parameter space containing weight coefficients; The parameter search space is traversed using a grid search method, and simulation is performed on the test set. The optimization objective function value corresponding to each set of weight coefficients is calculated, and the combination of weight coefficients that minimizes the optimization objective function value is selected as the weight of the risk assessment index in the risk mapping function.

[0012] Furthermore, the dynamic temporal discount factor generated from the risk urgency scalar through linear mapping is expressed as the lower threshold of the dynamic temporal discount factor plus a dynamic mapping term, where the dynamic mapping term is the difference between 1 and the risk urgency scalar, multiplied by the upper limit constraint of the threshold.

[0013] Furthermore, based on the reconstructed generalized advantage estimation function, proximal policy optimization is performed, including: During each policy evaluation phase, the dynamic discount factor at each moment of the entire trajectory is cached to form a sequence; When calculating the advantage value at each time step, the dynamic time series discount factor is first retrieved for each future step, and the discount weighted value is calculated. The discount weighted value is then multiplied by the corresponding time series difference residual, and then accumulated over time.

[0014] The second aspect of this application discloses a reinforcement learning obstacle avoidance system based on a dynamic temporal discount factor, which implements the reinforcement learning obstacle avoidance method described in the first aspect of this application. The system includes:
An observation and perception module, used to acquire the state observation data of the agent while the agent performs an obstacle avoidance task; after timestamp synchronization, the data is uniformly converted to a Cartesian coordinate system with the agent's current position as the origin, forming a structured state observation vector.
An environmental risk assessment module, used to calculate risk assessment indicators from the state observation vector to assess the agent's current environmental risk; the risk assessment indicators include the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index.
A risk value mapping module, used to calculate the risk urgency scalar of the agent at the current moment from the risk assessment indicators through a risk urgency mapping function; in the risk urgency mapping function, the weight of each risk assessment indicator is obtained through offline pre-training.
An estimation function optimization module, used to generate a dynamic temporal discount factor from the risk urgency scalar through linear mapping, and to replace the fixed discount factor of proximal policy optimization with the dynamic temporal discount factor to reconstruct the generalized advantage estimation function.
A policy update module, used to perform proximal policy optimization based on the reconstructed generalized advantage estimation function, update the parameters of the policy network and value network, and output action decisions after the reinforcement learning policy iteration is complete to drive the agent to perform obstacle avoidance actions.

[0015] The beneficial effects of this invention are that, compared with the prior art, 1. This invention improves the accuracy and safety of obstacle avoidance decisions. By introducing a dynamic temporal discount factor, this invention dynamically adjusts the agent's decision-making process based on real-time risk assessment indicators (minimum safe distance deviation, maximum relative approach speed, and obstacle density). This dynamic adjustment not only better addresses environmental changes (such as obstacle position and speed) but also optimizes the decision-making process according to the urgency of the risk, enabling the agent to make more precise obstacle avoidance actions in complex and dangerous environments, thereby reducing the risk of collisions and ensuring the safety and reliability of task execution.

[0016] 2. This invention enhances the flexibility and adaptability of the obstacle avoidance process. By reconstructing the Generalized Advantage Estimation (GAE) function and incorporating a dynamic temporal discount factor, this invention improves the decision stability and efficiency during policy optimization. In traditional reinforcement learning methods, the GAE function typically uses a fixed discount factor for reward evaluation, which may lead to unstable or irrational decisions by the agent in rapidly changing environments. However, by integrating a dynamic discount factor into the GAE, the agent can update its policy based on risk urgency at different time steps. This allows the agent to quickly adjust its action plan and respond more accurately to different environmental feedback when facing sudden risks or complex environments.

Attached Figure Description

[0017] Figure 1 is a schematic diagram of the overall technical architecture of the reinforcement learning obstacle avoidance method based on dynamic temporal discount factors proposed in this invention.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of this invention. The embodiments described in this application are merely some embodiments of this invention, and not all embodiments. Based on the spirit of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of this invention.

[0019] In conventional PPO algorithms, the calculation of the advantage function relies on a preset fixed temporal discount factor to weigh the importance of immediate rewards against long-term future returns. The fixed temporal discount factor (typically 0.99) implicitly assumes a stable environment and slow risk evolution, making it suitable for static or low-speed obstacle scenarios. However, in real-world applications, obstacles often move at high speed, change direction suddenly, or interact densely. In such cases, the fixed discount factor cannot effectively balance immediate obstacle avoidance response and long-term trajectory optimization, leading to sluggish obstacle avoidance strategies or redundant paths that fail to cope with rapidly changing environments and collision risks. Specifically: First, in high-speed approach scenarios (such as when the relative speed between the robot and an obstacle exceeds 1.5 m/s), a fixed γ overemphasizes long-term trajectory smoothness, weakening the policy network's sensitivity to immediate collision risks and causing significant delays in action triggering. Second, in space-constrained areas such as narrow passages, a fixed γ causes the policy to avoid potential risk areas excessively, generating redundant detour paths and severely compromising energy efficiency. Third, although some solutions attempt to adapt to different operating conditions by switching among multiple preset γ values, they rely on manually dividing discrete state intervals and cannot achieve online adaptive adjustment in a continuous state space.

[0020] Example 1. Referring to Figure 1, this invention provides a reinforcement learning obstacle avoidance method based on dynamic temporal discount factors. Its core lies in constructing a dynamic temporal discount factor generation mechanism that is coupled with the environmental state in real time, and embedding this mechanism into the proximal policy optimization (PPO) algorithm framework to achieve adaptive adjustment of the obstacle avoidance strategy under different risk scenarios.

[0021] S1: In one embodiment of this example, during the process of the intelligent agent performing an obstacle avoidance task, the state observation data of the intelligent agent is acquired.

[0022] The state observation data includes the agent's own linear velocity and angular velocity, the position vectors and velocity vectors of surrounding obstacles relative to the agent, and the Euclidean distance between the agent and each obstacle.

[0023] In one specific embodiment, the linear velocity and angular velocity signals of the agent itself can be acquired using a wheel encoder. The wheel encoder is mounted on the drive wheel axle of the agent and outputs the linear velocity and angular velocity signals of the agent at a sampling frequency of 100 Hz.

[0024] In one specific embodiment, the position vectors of surrounding obstacles relative to the agent can be obtained using a two-dimensional lidar. The two-dimensional lidar is mounted at the top center of the agent, with a scanning frequency of 20 Hz and an angular resolution of 1 degree, and is used to acquire the position information of the obstacles relative to the agent (in polar coordinates).

[0025] In one specific embodiment, the velocity vectors of surrounding obstacles relative to the agent are obtained through laser point cloud data and a point cloud processing unit. The point cloud processing unit receives two consecutive frames of laser point cloud data and identifies obstacle clusters using a clustering algorithm. The velocity vector of each obstacle is calculated using optical flow matching or cluster center tracking methods.

[0026] The collected state observation data is synchronized with timestamps and then uniformly converted to a Cartesian coordinate system with the agent's current position as the origin, forming a structured state observation vector.
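The coordinate conversion part of this step can be sketched as follows. This is a minimal illustration, assuming the lidar returns one range per beam at a fixed angular step; the function name `polar_to_cartesian` and its parameters are hypothetical and not taken from the source, and timestamp synchronization is not shown.

```python
import math

def polar_to_cartesian(ranges_m, start_angle_deg=0.0, step_deg=1.0):
    """Convert agent-centered lidar ranges (polar) to (x, y) points in metres.

    Assumption: beam i points at start_angle_deg + i * step_deg, measured
    counter-clockwise from the agent's forward (x) axis.
    """
    points = []
    for i, r in enumerate(ranges_m):
        if r is None or not math.isfinite(r):
            continue  # skip missing or out-of-range returns
        theta = math.radians(start_angle_deg + i * step_deg)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# Three beams 90 degrees apart; the middle one has no valid return.
pts = polar_to_cartesian([1.0, float("inf"), 2.0], step_deg=90.0)
```

Because the agent's current position is the origin, no translation is needed; only the polar-to-Cartesian rotation is applied.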

[0027] S2: In one embodiment of this example, the current environmental risk of the agent is assessed based on the state observation vector obtained in S1.

[0028] The current environmental risk of an intelligent agent is represented by the following three key indicators: minimum safe distance deviation, maximum relative approach speed, and obstacle density index.

[0029] The minimum safe distance deviation is defined as the difference between the actual Euclidean distance between the agent and the nearest obstacle and the preset safe distance threshold (e.g., 0.6 meters); its calculation process is as follows: Iterate through all Euclidean distances between obstacles and the agent, and take the minimum distance as the minimum safe distance; calculate the difference between this minimum safe distance and the preset safe distance threshold; if the difference is negative, it means that the agent has invaded the safe area.

[0030] The maximum relative approach speed is defined as the maximum speed at which an obstacle approaches the agent; its calculation process is as follows: For each obstacle, calculate the projection of its velocity vector onto the unit line-of-sight direction from the agent towards the obstacle, and take the positive part of the projection component as the approach rate of the obstacle. Select the maximum approach rate among all obstacles as the maximum relative approach speed.

[0031] The obstacle density index measures the density of obstacles around the agent; its calculation process is as follows: With the agent's current position as the center, a circular sensing area of a preset radius is constructed; within this area, the number of obstacle clusters after clustering is counted, and this number is divided by the preset maximum obstacle density limit (e.g., 8) to obtain a normalized value; if the result is greater than 1, it is truncated to 1 so that the density index always lies within the closed interval [0,1].
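The three indicators described in S2 can be sketched together as below. This is an illustrative sketch under several assumptions: each obstacle cluster is given as a dict with hypothetical keys `pos` and `vel` (relative position and velocity in the agent-centered frame), the sensing radius is passed in explicitly because the source does not state its value, and the approach speed is taken as positive when the distance is shrinking.

```python
import math

def risk_indicators(obstacles, sense_radius, safe_dist=0.6, max_clusters=8):
    """Return (min safe distance deviation, max approach speed, density index).

    obstacles: list of {"pos": (x, y), "vel": (vx, vy)}, one per cluster,
    expressed relative to the agent (assumed layout, not from the source).
    """
    dists = [math.hypot(*o["pos"]) for o in obstacles]
    if not dists:
        return float("inf"), 0.0, 0.0  # nothing observed: no risk signal

    # Negative deviation means the agent has intruded into the safe area.
    deviation = min(dists) - safe_dist

    closing = []
    for o, d in zip(obstacles, dists):
        if d == 0.0:
            continue  # line of sight undefined at zero distance
        ux, uy = o["pos"][0] / d, o["pos"][1] / d       # unit line of sight
        radial = o["vel"][0] * ux + o["vel"][1] * uy    # > 0 means receding
        closing.append(max(0.0, -radial))               # keep approach part only
    max_approach = max(closing, default=0.0)

    in_range = sum(1 for d in dists if d <= sense_radius)
    density = min(1.0, in_range / max_clusters)         # truncate to [0, 1]
    return deviation, max_approach, density

obs = [{"pos": (1.0, 0.0), "vel": (-2.0, 0.0)},   # 1 m ahead, closing at 2 m/s
       {"pos": (0.0, 5.0), "vel": (0.0, 1.0)}]    # 5 m to the side, receding
deviation, approach, density = risk_indicators(obs, sense_radius=3.0)
```

The `sense_radius=3.0` in the usage lines is an arbitrary illustrative value. The defaults 0.6 and 8 are the example values quoted in the text.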

[0032] S3: In one embodiment of this example, based on the three risk assessment indicators output in step S2, the risk urgency scalar at the current moment is calculated through the pre-trained risk urgency mapping function.

[0033] The risk urgency mapping function is a pre-trained risk mapping function whose purpose is to convert the indicators calculated in the previous steps into a single risk urgency scalar to represent the risk level at the current moment.

[0034] The mapping function uses a hyperbolic tangent structure, expressed as:
$$U(t) = 0.5\left[\tanh\left(w_1\,\tilde{d}(t) + w_2\,v(t) + w_3\,\rho(t)\right) + 1\right]$$
where $U(t)$ is the risk urgency scalar, $\tilde{d}(t)$ is the reciprocal of the minimum safe distance deviation, $v(t)$ is the maximum relative approach speed, $\rho(t)$ is the obstacle density index, $w_1$, $w_2$, and $w_3$ are the weighting coefficients of the respective terms, and 0.5 is the range scaling factor.
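The mapping described here (a weighted sum passed through a hyperbolic tangent, shifted by 1, and scaled by 0.5) can be sketched as follows. The function name, argument names, and the default weights of 1.0 are placeholders; the calibrated weight values are not given in the text.

```python
import math

def risk_urgency(inv_dev, approach, density, w=(1.0, 1.0, 1.0), scale=0.5):
    """Map three normalized indicators to a risk urgency scalar in (0, 1).

    inv_dev:  reciprocal of the minimum safe distance deviation (normalized)
    approach: maximum relative approach speed (normalized)
    density:  obstacle density index in [0, 1]
    w:        placeholder weights, NOT the calibrated values from the source
    """
    z = w[0] * inv_dev + w[1] * approach + w[2] * density
    return scale * (math.tanh(z) + 1.0)

low = risk_urgency(0.0, 0.0, 0.0)                     # net input of zero
high = risk_urgency(1.0, 1.0, 1.0, w=(2.0, 2.0, 2.0))  # large net input
```

Note that with non-negative inputs and non-negative weights this sketch returns values in [0.5, 1); values near 0 require a negative net input, and how that arises is not specified in the text.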

[0035] Furthermore, the values of the weight coefficients were determined through offline simulation calibration, with the optimization objective being to minimize the weighted sum of collision rate and path length. A grid search method was used to traverse the three-dimensional parameter space, ultimately selecting the weight combination used in the mapping function. The calibration was conducted on a test set of three typical scenarios that simulate the agent's performance under different environmental conditions:
(1) High-speed collision scenario: the agent and the obstacle have a high relative speed, so the risk of collision is higher.
(2) Narrow passage scenario: the agent travels in a confined space and faces the obstacle avoidance challenge posed by high obstacle density and a narrow path.
(3) Multi-obstacle crossing scenario: multiple obstacles move dynamically in the same area, and the agent must handle complex intersecting paths.

[0036] Define the parameter space for the grid search method, consisting of the weighting coefficients for the reciprocal of the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index, denoted $w_1$, $w_2$, and $w_3$ respectively.

[0037] A grid search method is used to traverse the three-dimensional parameter space, progressively calculating the weighted sum of collision rate and path length under different weight combinations. Simulations are performed on a test set containing the three scenarios mentioned above, with multiple cycles of experiments required for each scenario to ensure the reliability of the results.

[0038] For each combination of weighting coefficients, a weighted sum of the collision rate and path length is calculated. The parameter combination with the smallest weighted sum is selected as the optimal solution and used as the parameter combination in the risk mapping function. Here, the collision rate represents the proportion of collisions that occur to the agent during the simulation, and the path length represents the total path length of the agent during obstacle avoidance.
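The calibration loop in paragraphs [0036] to [0038] can be sketched as a plain grid search. Everything named here is an assumption for illustration: `evaluate` stands in for the simulation over the test set, and `alpha` and `beta` are hypothetical weights of the objective, since the text does not give their values.

```python
import itertools

def grid_search_weights(evaluate, grid, alpha=1.0, beta=0.1):
    """Search the 3-D weight space for the lowest weighted cost.

    evaluate(w) -> (collision_rate, path_length), averaged over the test
    scenarios; it is a caller-supplied stand-in for the simulator.
    """
    best_w, best_cost = None, float("inf")
    for w in itertools.product(grid, repeat=3):   # every (w1, w2, w3) on the grid
        collision_rate, path_length = evaluate(w)
        cost = alpha * collision_rate + beta * path_length
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost

# Toy evaluator used only to exercise the loop; it is not a simulation.
def toy_evaluate(w):
    return abs(w[0] - 1.0), abs(w[1] - 0.5) + abs(w[2])

best_w, best_cost = grid_search_weights(toy_evaluate, [0.0, 0.5, 1.0])
```

An exhaustive grid grows cubically with the number of candidate values per weight, which is manageable here because the search runs offline and the space has only three dimensions.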

[0039] It should be understood that when the risk urgency scalar is calculated, the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index are all normalized scalars in the range [0,1].

[0040] S4: In one embodiment of this example, a dynamic temporal discount factor is generated by linearly mapping the risk urgency scalar. The mapping is as follows:
$$\gamma(t) = 0.85 + 0.145\,\bigl(1 - U(t)\bigr)$$
where $\gamma(t)$ is the dynamic temporal discount factor, $U(t)$ is the risk urgency scalar, 0.85 is the lower threshold of the dynamic temporal discount factor, and 0.145 is the upper threshold constraint. This mapping ensures that when the risk urgency scalar is zero (i.e., no risk), γ(t) reaches its maximum value of 0.995, emphasizing long-term returns; when the risk urgency scalar is one (i.e., extremely high risk), γ(t) reaches its minimum value of 0.85, focusing on immediate rewards. To ensure numerical stability, the system applies a hard limit: if the calculated γ(t) is less than 0.85, it is set to 0.85; if it is greater than 0.995, it is set to 0.995. This constraint prevents extreme inputs from pushing the discount factor outside the range in which the PPO algorithm converges stably.
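The linear mapping and hard limit of S4 fit in a few lines. The sketch below uses the constants 0.85 and 0.145 from the text; the function and parameter names are illustrative.

```python
def dynamic_gamma(urgency, gamma_min=0.85, span=0.145):
    """Map a risk urgency scalar to a discount factor in [0.85, 0.995]."""
    gamma = gamma_min + span * (1.0 - urgency)
    # Hard limit: guards against urgency values outside [0, 1].
    return min(max(gamma, gamma_min), gamma_min + span)

no_risk = dynamic_gamma(0.0)     # upper end of the range, about 0.995
extreme = dynamic_gamma(1.0)     # lower end of the range, 0.85
clamped = dynamic_gamma(2.0)     # out-of-range input is limited to 0.85
```

For an in-range urgency the clamp is a no-op; it only matters when the upstream scalar is malformed.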

[0041] As further explanation, the threshold values of the dynamic discount factor, 0.85 (minimum) and 0.995 (maximum), are fixed values calibrated through engineering practice. They were determined through extensive simulation testing, taking into account the convergence characteristics of the PPO algorithm and the practical needs of obstacle avoidance scenarios. Specifically: The minimum value of 0.85 is both the lower bound of the discount factor for stable convergence of the PPO algorithm and the optimal value for immediate risk response in high-risk scenarios. Proximal Policy Optimization (PPO), as a policy gradient reinforcement learning algorithm, has strict requirements on the discount factor: a value that is too small (e.g., <0.85) causes oscillations in policy updates, distortion in the calculation of cumulative rewards, and ultimately non-convergence during training. The value 0.85 is a stable convergence lower bound determined through extensive PPO training simulations. Below this value, problems such as chaotic policy network parameter updates and unstable obstacle avoidance decisions occur.

[0042] In high-risk scenarios (such as high-speed approaching obstacles or safety distances approaching zero), the discount factor needs to be sufficiently low to allow the policy network to significantly reduce its focus on long-term future returns and concentrate on immediate rewards / penalties (i.e., prioritizing collision avoidance over trajectory smoothness). This invention has undergone simulation testing in high-risk scenarios such as high-speed collisions and sudden obstacles. 0.85 is the optimal value balancing immediate response speed and algorithm convergence: it maximizes the policy network's sensitivity to collision penalties (action trigger delay ≤ 0.15s) while ensuring the training stability of the PPO algorithm.

[0043] Maximum value 0.995: This value stays close to the fixed discount factor of traditional PPO while serving as the adapted value for global trajectory optimization in low-risk scenarios. The fixed discount factor of traditional PPO algorithms is usually 0.99. This invention sets the discount factor to 0.995 (slightly higher than 0.99) for low-risk scenarios. This remains compatible with the standard PPO training process while allowing the algorithm to focus more on long-term rewards in low-risk scenarios, avoiding locally optimal trajectories caused by a slightly lower discount factor.

[0044] In low-risk scenarios (such as when the agent is in a safe area without obstacles or in a narrow passage with low relative speed), a discount factor close to 1 is needed to allow the policy network to focus on the optimality of the global trajectory (reducing redundant detours and lowering energy consumption). A value of 0.995, tested in scenarios such as traversing narrow passages and crossing multiple obstacles at low speeds, achieves the technical effect of "reducing trajectory length by 32% and energy consumption by 27%". If the value is higher (e.g., >0.995), the discount factor will be too close to 1, reducing the policy network's sensitivity to minor environmental risks and potentially leading to a "risk response lag".

[0045] S5: In one embodiment of this example, during the training process of the Proximal Policy Optimization (PPO) algorithm, a dynamic temporal discount factor γ(t) is used to replace the traditional fixed discount factor, thereby reconstructing the Generalized Advantage Estimation (GAE) function to more accurately consider the risk state at each time step.

[0046] The standard generalized advantage estimation function is defined as follows:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}$$
where $\lambda$ is the smoothing coefficient of GAE, $\delta_{t+l}$ is the temporal difference residual at time step $t+l$, and $\gamma$ is the fixed discount factor.

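[0047] In this application, the fixed $\gamma$ is replaced with the time-varying $\gamma(t)$. The reconstructed generalized advantage estimation function is then expressed as:
$$\hat{A}_t = \sum_{l=0}^{T-t-1} \left(\prod_{k=0}^{l-1} \gamma(t+k)\right) \lambda^{l}\,\delta_{t+l}$$
where $\lambda$ is the GAE smoothing coefficient, $\delta_{t+l}$ is the temporal difference residual at time step $t+l$, and $T$ is the trajectory length. During each policy evaluation phase, the agent system caches the dynamic discount factor at every moment along the entire trajectory, forming a sequence $\{\gamma(0), \gamma(1), \dots, \gamma(T-1)\}$.

[0048] When calculating the advantage value $\hat{A}_t$ at each time step, the cached factors $\gamma(t+k)$ are first retrieved for each future step $l$, and the discount weighted value $\lambda^{l}\prod_{k=0}^{l-1}\gamma(t+k)$ is calculated; the discount weighted value is then multiplied by the corresponding temporal difference residual $\delta_{t+l}$, and the results are accumulated over time.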

[0049] The above process is completed in the trajectory playback buffer, ensuring that the advantage value of each empirical sample is weighted based on the actual risk state at the time of its occurrence.
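One way to realize the accumulation described in S5 is the usual backward recursion of GAE, with the constant discount replaced by the cached per-step value. The sketch below makes three assumptions that the text does not confirm: the discount weight for a future step is the product of the cached factors along the way, the temporal difference residual also uses the per-step factor, and `lam=0.95` is a commonly used default rather than a value from the source.

```python
def dynamic_gae(rewards, values, gammas, lam=0.95, last_value=0.0):
    """Advantage estimates with a per-step discount factor.

    rewards[t], values[t], gammas[t] are aligned over one trajectory;
    last_value is the bootstrap value after the final step.
    Recursion: A_t = delta_t + gammas[t] * lam * A_{t+1}.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):
        # TD residual with the discount cached for this time step (assumed form).
        delta = rewards[t] + gammas[t] * next_value - values[t]
        next_adv = delta + gammas[t] * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Two-step toy trajectory: a risky first step (low gamma), then a calm one.
adv = dynamic_gae([1.0, 1.0], [0.0, 0.0], [0.5, 0.9], lam=1.0)
```

The backward pass costs O(T) per trajectory, the same as fixed-discount GAE, so the only extra storage is the cached sequence of discount factors. With all `gammas` equal, the recursion reduces to the standard fixed-discount estimator.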

[0050] By using a dynamic discount factor, the agent system can weight the risk level based on the risk state at each moment. In high-risk moments (such as when the agent faces a rapidly approaching obstacle), the dynamic discount factor is smaller, emphasizing immediate penalties and thus encouraging the agent to make quick obstacle avoidance decisions. In low-risk moments (such as when the agent is in a safe zone), the dynamic discount factor is larger, emphasizing long-term gains and prompting the agent to focus on global trajectory optimization.

[0051] This application reconstructs the traditional Generalized Advantage Estimation (GAE) function into a version based on dynamic risk states by replacing the fixed discount factor with a dynamic temporal discount factor. During each policy evaluation, the system adjusts the discount factor at each time step based on the real-time risk state, thereby guiding the policy network to focus more on immediate penalties during high-risk periods and more on long-term rewards during low-risk periods. This approach helps agents adjust their obstacle avoidance strategies more flexibly in dynamic environments.

[0052] S6: In one implementation of this embodiment, the parameters of the policy network and the value network are updated based on the reconstructed generalized advantage estimation (GAE) function, thereby completing one policy iteration.

[0053] The policy network and value network share a common low-level feature extractor and employ a dual-head structure to output policy and value estimates respectively. The low-level feature extractor consists of three fully connected layers, each containing 256 neurons, using the ReLU activation function. The feature extractor is responsible for extracting effective features from the raw input for subsequent policy and value estimation.

[0054] The policy head consists of two fully connected layers, outputting the mean vector and log-standard deviation vector of the action distribution. The mean vector represents the expected value of the action selection, and the log-standard deviation vector represents the uncertainty of the action selection. The value head is a single fully connected layer, outputting a scalar estimate of the state value, used to evaluate the value of the current state. The Adam optimizer is used to update the network parameters, with a learning rate of 3×10⁻⁴, a batch size of 2048, and a PPO clipping range of 0.2.

[0055] In each update, the agent system randomly selects multiple trajectory batches from the replay buffer, calculates the policy ratio for each state-action pair, and performs gradient-based optimization of the clipped objective function so that policy updates are both efficient and stable.
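The objective used in this update is the standard PPO clipped surrogate. The sketch below evaluates it on plain Python lists to show how the policy ratio and the 0.2 range interact; it only computes the loss value, whereas a real implementation would compute it in an automatic differentiation framework to obtain gradients. The function and argument names are illustrative.

```python
import math

def clipped_surrogate_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean negative PPO clipped surrogate over a batch.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: matching estimates.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                           # policy ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)            # pessimistic bound
    return -total / len(advantages)

# A ratio of 1.5 with a positive advantage is capped at 1.2.
loss = clipped_surrogate_loss([math.log(1.5)], [0.0], [1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the ratio beyond the clipping range in the direction the advantage favors.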

[0056] S7: In one embodiment of this example, based on a trained neural network model, a process of continuous state observation, risk urgency calculation, dynamic discount factor generation, and action decision-making is executed to drive the agent to perform obstacle avoidance actions.

[0057] During the inference phase, the agent system no longer calculates the advantage function or updates parameters; instead, it reuses the neural network model built during the training phase for inference. The purpose of inference is to generate optimal control commands in real time to drive the agent to perform obstacle avoidance actions. Specifically, the current state observation data of the agent is acquired, including the agent's own linear velocity and angular velocity and information about surrounding obstacles (such as position, velocity, and Euclidean distance). Based on the state observation data, the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index are calculated. These quantities are then input into the risk urgency mapping function to generate the risk urgency scalar for the current moment.
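As a non-limiting illustration, the three indicators and the hyperbolic-tangent mapping can be sketched as follows. The function names, the weight values, the scaling factor of 0.5, the small constant `eps` that guards the reciprocal, and the sign convention for the closing speed are all assumptions of this sketch; in the method the weights are obtained by offline pre-training, and the cluster count is assumed to come from a separate clustering step not shown here.

```python
import math


def risk_indicators(obstacles, safe_distance, cluster_count, max_clusters):
    """obstacles: non-empty list of (px, py, vx, vy) relative to the agent."""
    distances = [math.hypot(px, py) for px, py, _, _ in obstacles]
    # Minimum distance minus the preset safe distance threshold.
    deviation = min(distances) - safe_distance
    approach = 0.0
    for (px, py, vx, vy), dist in zip(obstacles, distances):
        if dist == 0.0:
            continue
        # Assumed convention: positive when the obstacle closes on the agent.
        closing = -(vx * px + vy * py) / dist
        approach = max(approach, closing)
    # Cluster count normalized by the preset upper limit.
    density = cluster_count / max_clusters
    return deviation, approach, density


def risk_urgency(deviation, approach, density,
                 weights=(0.1, 0.5, 0.5), scale=0.5, eps=1e-3):
    """Hypothetical weights; scale=0.5 keeps the result within [0, 1]."""
    w_dev, w_speed, w_density = weights
    # The distance deviation enters through its reciprocal (guarded by eps).
    net = w_dev / max(deviation, eps) + w_speed * approach + w_density * density
    return scale * (math.tanh(net) + 1.0)
```

For example, a single obstacle 5 m away that closes at 1 m/s with a 1 m threshold yields a deviation of 4 m and an approach speed of 1 m/s under these conventions.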

[0058] Based on the risk urgency scalar, a dynamic temporal discount factor for the current moment is generated and input into the trained policy network.
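As a non-limiting illustration, the linear mapping from the risk urgency scalar to the discount factor can be sketched as below. The bounds 0.85 and 0.995 are those stated for the dynamic discount factor generation unit in this description; treating the multiplier of the mapping term as the span between the two bounds, and clamping the scalar to [0, 1], are assumptions of this sketch.

```python
def dynamic_discount(urgency, gamma_min=0.85, gamma_max=0.995):
    """Higher urgency gives a smaller discount factor (shorter horizon)."""
    # Clamp defensively so the result always stays within the stated bounds.
    u = min(1.0, max(0.0, urgency))
    # Lower limit plus (1 - urgency) times the assumed span.
    return gamma_min + (1.0 - u) * (gamma_max - gamma_min)
```

Under these assumptions an urgency of 1 maps to the lower bound and an urgency of 0 maps to the upper bound.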

[0059] At each time step (i.e., every 1/20 of a second), the agent system calculates the action distribution parameters through the policy network based on the current state observation vector. These parameters define the probability distribution from which the agent's action is drawn. Using the reparameterization technique, specific linear and angular velocity control commands are sampled from this distribution. The generated control commands are then sent to the agent's underlying motion controller to execute obstacle avoidance actions.
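As a non-limiting illustration, reparameterized sampling from a diagonal Gaussian can be sketched as below. The function name `sample_action` and the assumption that the action distribution is a diagonal Gaussian parameterized by a mean and a log-standard deviation per dimension are choices of this sketch; any velocity limits applied by the motion controller are not shown.

```python
import math
import random


def sample_action(mean, log_std, rng=random):
    """Draw an action as mean + std * noise, with noise from N(0, 1)."""
    action = []
    for mu, ls in zip(mean, log_std):
        noise = rng.gauss(0.0, 1.0)
        # The randomness is isolated in `noise`, so the sample is a
        # deterministic function of the network outputs.
        action.append(mu + math.exp(ls) * noise)
    return action
```

With two output dimensions, the first element can be read as the linear velocity command and the second as the angular velocity command.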

[0060] In one embodiment, the entire inference process is deployed within the agent's embedded computing unit, which employs a quad-core ARM Cortex-A72 processor clocked at 1.8 GHz and is equipped with 8 GB of LPDDR4 memory. The agent system runs on a real-time operating system (RTOS), ensuring that the generation latency of control instructions does not exceed 50 milliseconds within each cycle (at a frequency of 20 Hz). This hardware configuration and real-time operating system ensure the real-time performance of the inference process, guaranteeing that the agent can quickly respond to environmental changes and execute obstacle avoidance actions.

[0061] In one embodiment, the optimized performance in a high-speed approach scenario is as follows. When the minimum safe distance deviation approaches zero and the maximum relative approach speed exceeds 1.5 m/s, the agent is about to collide with the obstacle. At this point, the risk urgency scalar rapidly rises above 0.9, indicating that the current obstacle avoidance risk is very high. Correspondingly, the dynamic temporal discount factor drops below 0.86, so that under such high-risk conditions the agent focuses more on immediate penalties (avoiding collisions) than on long-term rewards.

[0062] Due to the adjustment of the discount factor, the policy network's sensitivity to immediate collision penalties is significantly enhanced, thereby reducing action trigger latency and lowering the response time to less than 0.15 seconds. Under these conditions, the measured collision rate is reduced from 38% in the traditional fixed discount factor scheme to below 5%, indicating that this method can effectively reduce collisions in high-speed approach scenarios.

[0063] In one embodiment, the optimized performance in a narrow passage scenario is as follows. In a narrow passage, obstacles are densely packed, resulting in a higher obstacle density index. Despite the dense obstacles, the relative speed is low, so the threat that the obstacles pose to the agent is relatively small. In this scenario, the risk urgency scalar remains between 0.4 and 0.6, indicating moderate environmental risk. Correspondingly, the dynamic discount factor remains between 0.92 and 0.95, which avoids an excessive preference for immediate rewards by the policy network and prevents overly conservative detour behavior. Because over-avoidance strategies are no longer employed, the average trajectory length is shortened by 32% compared to the fixed discount factor scheme, while energy consumption is reduced by 27%.

[0064] In one embodiment, since the risk urgency mapping function is continuously differentiable, the system can generate an adaptive dynamic temporal discount factor at any state point. In a test with abrupt changes in obstacle motion patterns, the parameter mismatch rate is less than 8%, which is significantly better than existing techniques that rely on manual division of state intervals.

[0065] In one embodiment, the hardware of the agent system includes the following main units. State observation unit: responsible for collecting and preprocessing the agent's raw state data. It integrates a wheel encoder, a 2D LiDAR, and a point cloud processing unit. The wheel encoder provides linear and angular velocity information, the 2D LiDAR acquires obstacle position information, and the point cloud processing unit identifies obstacles and calculates their velocity vectors.

[0066] Risk characteristic calculation unit: Built-in safe distance threshold register, maximum obstacle density upper limit register, and relative speed projection calculator. This unit efficiently calculates three risk assessment indicators (minimum safe distance deviation, maximum relative approach speed, and obstacle density index).

[0067] Risk urgency generation unit: This unit receives and processes three risk assessment indicators. It calculates and outputs a risk urgency scalar, which represents the current level of environmental risk.

[0068] Dynamic discount factor generation unit: calculates the dynamic temporal discount factor from the risk urgency scalar using the linear mapping formula, with its value strictly limited to between 0.85 and 0.995. This unit generates a dynamic discount factor that matches the current risk state, so that the agent can make appropriate decision adjustments under different risk conditions.

[0069] Advantage function reconstruction unit: During the training phase, it manages the trajectory cache and calculates the advantage function at each time step, providing an accurate advantage value for policy updates at each time step, enabling the agent to make more refined decision adjustments in dynamic risk environments.

[0070] Policy Update Unit: Used to update the policy during each training phase, utilizing the core algorithm of PPO for optimization. It updates network parameters through optimization algorithms (such as gradient descent) to gradually improve the agent's decision-making ability.

[0071] Action Decision Unit: Responsible for generating action decisions based on the neural network model parameters obtained during the training phase and the current state. It samples from the distribution of the policy network output using reparameterization techniques, calculates specific control commands (such as linear velocity and angular velocity), and sends them to the motion controller.

[0072] The agent system as a whole achieves fine-grained control of obstacle avoidance strategies by introducing the dynamic discount factor generation and advantage function reconstruction mechanisms, without changing the core update logic of the proximal policy optimization algorithm. This mechanism has very low computational overhead (less than 3% additional computation), is fully compatible with the standard PPO training process, and requires no additional labeled data or modification of the loss function structure, which makes it practical to engineer and easy to deploy.

[0073] As an embodiment of this application, a reinforcement learning obstacle avoidance system based on a dynamic temporal discount factor is disclosed. The system employs the specific implementation of the reinforcement learning obstacle avoidance method described above and includes:

an observation and perception module, used to acquire the state observation data of the agent while the agent performs obstacle avoidance tasks; after timestamp synchronization, the data is uniformly converted to a Cartesian coordinate system with the current position of the agent as the origin, forming a structured state observation vector;

an environmental risk assessment module, used to calculate risk assessment indicators based on the state observation vector to assess the current environmental risk of the agent; the risk assessment indicators include the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index;

a risk value mapping module, used to calculate, based on the risk assessment indicators, the risk urgency scalar of the agent at the current moment through a risk urgency mapping function; in the risk urgency mapping function, the weight of each risk assessment indicator is obtained through offline pre-training;

an estimation function optimization module, used to generate a dynamic temporal discount factor from the risk urgency scalar through linear mapping, and to replace the fixed discount factor of proximal policy optimization with the dynamic temporal discount factor to reconstruct the generalized advantage estimation function;

a policy update module, used to perform proximal policy optimization based on the reconstructed generalized advantage estimation function, update the parameters of the policy network and the value network, and, after completing the policy iteration of reinforcement learning, output action decisions to drive the agent to perform obstacle avoidance actions.

[0074] As an embodiment of this application, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the computer program is loaded onto the processor, it employs the specific implementation described above for the reinforcement learning obstacle avoidance method.

[0075] As an embodiment of this application, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, employs the specific implementation described above for the reinforcement learning obstacle avoidance method.

[0076] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. The computer-readable storage media used herein are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through fiber optic cables), or electrical signals transmitted through wires.

[0077] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing/processing devices, or downloaded via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in the computer-readable storage media in the respective computing/processing device.

[0078] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the state information of the computer-readable program instructions to implement various aspects of this disclosure.

[0079] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the protection scope of the claims of the present invention.

Claims

1. A reinforcement learning obstacle avoidance method based on a dynamic temporal discount factor, characterized in that the method includes: while the agent performs obstacle avoidance tasks, acquiring the state observation data of the agent and, after timestamp synchronization, uniformly converting the data to a Cartesian coordinate system with the current position of the agent as the origin, forming a structured state observation vector; calculating risk assessment indicators based on the state observation vector to evaluate the current environmental risk of the agent, the risk assessment indicators including a minimum safe distance deviation, a maximum relative approach speed, and an obstacle density index; calculating, based on the risk assessment indicators, the risk urgency scalar of the agent at the current moment through a risk urgency mapping function, wherein the weight of each risk assessment indicator in the risk urgency mapping function is obtained through offline pre-training; generating a dynamic temporal discount factor from the risk urgency scalar through linear mapping, and replacing the fixed discount factor of proximal policy optimization with the dynamic temporal discount factor to reconstruct the generalized advantage estimation function; and performing proximal policy optimization based on the reconstructed generalized advantage estimation function, updating the parameters of the policy network and the value network, and, after completing the policy iteration of reinforcement learning, outputting action decisions to drive the agent to perform obstacle avoidance actions.

2. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the state observation data includes the agent's own linear velocity and angular velocity, the position vector and velocity vector of each obstacle relative to the agent, and the Euclidean distance between the agent and each obstacle.

3. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the calculation process for the minimum safe distance deviation is as follows: traverse the Euclidean distances between all obstacles and the agent, and take the minimum distance as the minimum safe distance; calculate the difference between the minimum safe distance and the preset safe distance threshold; the difference is used as the minimum safe distance deviation.

4. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the calculation process for the maximum relative approach speed is as follows: for each obstacle, calculate the projection of its velocity vector onto the unit line-of-sight direction from the agent to the obstacle, and take the positive part of the projection component as the approach speed of the obstacle; the maximum approach speed among all obstacles is selected as the maximum relative approach speed.

5. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the calculation process for the obstacle density index is as follows: with the agent's current position as the center, construct a circular sensing area with a radius of ...; within the circular sensing area, count the number of obstacle clusters after clustering, and divide this number by the preset maximum obstacle density upper limit to obtain the obstacle density index.

6. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the process of calculating the risk urgency scalar based on the risk assessment indicators includes: the risk urgency mapping function adopts a hyperbolic tangent transform structure; the risk urgency scalar is the sum of the hyperbolic tangent transform result and 1, multiplied by a scaling factor; the net input of the hyperbolic tangent function is the weighted sum of the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index, wherein the reciprocal of the minimum safe distance deviation is used in the weighted sum.

7. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that obtaining the weight of each risk assessment indicator through offline pre-training includes: constructing a test set of multiple collision scenarios; constructing an optimization function whose objective is to minimize the weighted sum of collision rate and path length, and defining a three-dimensional parameter space containing the weight coefficients; traversing the parameter search space using a grid search method and performing simulation on the test set; calculating the optimization objective function value corresponding to each set of weight coefficients; and selecting the combination of weight coefficients that minimizes the optimization objective function value as the weights of the risk assessment indicators in the risk urgency mapping function.

8. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that the dynamic temporal discount factor generated from the risk urgency scalar through linear mapping is expressed as the lower threshold limit of the dynamic temporal discount factor plus a dynamic mapping term, wherein the dynamic mapping term is the difference between 1 and the risk urgency scalar, multiplied by the upper limit constraint of the threshold.

9. The reinforcement learning obstacle avoidance method based on dynamic temporal discount factor according to claim 1, characterized in that performing proximal policy optimization based on the reconstructed generalized advantage estimation function includes: during each policy evaluation phase, caching the dynamic discount factor at each moment of the entire trajectory to form a sequence; and, when calculating the advantage value at each time step, first retrieving the dynamic temporal discount factor for each future step and calculating the discount weight, then multiplying the discount weight by the corresponding temporal-difference residual and accumulating the products over time.

10. A reinforcement learning obstacle avoidance system based on a dynamic temporal discount factor, executing the reinforcement learning obstacle avoidance method as described in any one of claims 1-9, characterized in that the system includes: an observation and perception module, used to acquire the state observation data of the agent while the agent performs obstacle avoidance tasks, wherein, after timestamp synchronization, the data is uniformly converted to a Cartesian coordinate system with the current position of the agent as the origin, forming a structured state observation vector; an environmental risk assessment module, used to calculate risk assessment indicators based on the state observation vector to assess the current environmental risk of the agent, the risk assessment indicators including the minimum safe distance deviation, the maximum relative approach speed, and the obstacle density index; a risk value mapping module, used to calculate, based on the risk assessment indicators, the risk urgency scalar of the agent at the current moment through a risk urgency mapping function, wherein the weight of each risk assessment indicator in the risk urgency mapping function is obtained through offline pre-training; an estimation function optimization module, used to generate a dynamic temporal discount factor from the risk urgency scalar through linear mapping, and to replace the fixed discount factor of proximal policy optimization with the dynamic temporal discount factor to reconstruct the generalized advantage estimation function; and a policy update module, used to perform proximal policy optimization based on the reconstructed generalized advantage estimation function, update the parameters of the policy network and the value network, and, after completing the policy iteration of reinforcement learning, output action decisions to drive the agent to perform obstacle avoidance actions.

11. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is loaded into the processor, it implements the reinforcement learning obstacle avoidance method according to any one of claims 1-9.

12. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the reinforcement learning obstacle avoidance method according to any one of claims 1-9.