AUV path planning method based on reward adaptive priority experience replay
By introducing a reward-adaptive priority experience replay mechanism and a deep reinforcement learning algorithm, the problems of low sample utilization and instability in AUV path planning are solved, and efficient and stable path planning in complex marine environments is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2026-03-04
- Publication Date
- 2026-06-12
AI Technical Summary
Existing deep reinforcement learning methods suffer from low sample utilization and instability in AUV path planning, especially in complex marine environments where it is difficult to obtain effective reward signals, resulting in low learning efficiency and slow convergence speed. Furthermore, traditional priority experience replay mechanisms cannot fully utilize reward information.
A reward-adaptive priority experience replay mechanism (RAPER) is introduced. By dynamically adjusting the priority of experience samples, combined with a multidimensional composite reward function and a deep reinforcement learning algorithm, a distributional soft actor-critic network architecture is designed to optimize policy convergence and improve the stability of path planning.
The convergence speed and stability of path planning for AUVs in complex marine environments have been improved, and shorter and smoother trajectory planning has been achieved. Simulation results show that it has a faster convergence speed and better adaptability in multi-obstacle environments.
Figure CN122197941A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and autonomous control technology, and relates to a three-dimensional path planning method for autonomous underwater vehicles, specifically an AUV path planning method based on reward-adaptive priority experience replay.
Background Technology
[0002] An Autonomous Underwater Vehicle (AUV) is an intelligent underwater robot capable of independently completing tasks without tethering or human intervention. Unlike Remotely Operated Vehicles (ROVs), which require real-time human control, AUVs possess a high degree of autonomy and environmental adaptability, automatically performing path planning, obstacle avoidance, and navigation decisions based on mission objectives and perceived information. AUVs have wide-ranging applications in marine surveying, environmental monitoring, seabed resource exploration, military reconnaissance, and disaster assessment.
[0003] When AUVs operate in complex marine environments, they are affected by various uncertainties, such as dynamic ocean currents, complex topographic relief, changing obstacle distribution, and sensor noise. These factors make the path planning problem highly nonlinear and uncertain. Traditional path planning methods, such as artificial potential field methods, A* algorithms, particle swarm optimization, and ant colony optimization, can achieve relatively optimal paths in static or regular environments but have limitations in dynamic and unstructured marine environments. First, some algorithms rely on accurate environmental modeling, making it difficult to adapt to real-time changes in ocean currents and obstacles. Second, these algorithms are prone to getting trapped in local optima and lack the ability to search for the globally shortest and safest paths. Third, the computational cost increases dramatically with environmental complexity, making it difficult to meet the real-time and stability requirements of AUV missions. Therefore, traditional methods are insufficient for real-time path planning in scenarios with multiple obstacles and strong disturbances.
[0004] With the development of artificial intelligence technology, deep reinforcement learning (DRL) offers a new solution for AUV path planning. This method, through continuous interaction between the agent and the environment, automatically learns the optimal strategy through trial and error and feedback, without requiring a precise environmental model. It can achieve complex control in a high-dimensional, continuous state space. Particularly when facing unknown environments, complex dynamics, and dynamic disturbances, it exhibits superior self-learning capabilities and adaptive characteristics, enabling AUVs to autonomously generate safe and efficient paths in changing marine environments.
[0005] Despite this, existing AUV path planning methods based on deep reinforcement learning still have certain shortcomings. On the one hand, in complex environments, AUVs face sparse goals and delayed feedback, making it difficult for algorithms to obtain effective reward signals, which reduces learning efficiency and convergence speed. On the other hand, traditional prioritized experience replay mechanisms adjust sample weights based only on temporal difference errors and fail to fully utilize reward information, resulting in low experience utilization and training instability. Furthermore, in underwater environments with high noise and strong disturbances, the algorithms suffer from convergence instability and overestimation problems.
Summary of the Invention
[0006] This invention aims to overcome the problems of low sample utilization and instability in existing deep reinforcement learning methods for AUV path planning. It proposes an AUV path planning method based on Reward Adaptive Prioritized Experience Replay (RAPER). Within the framework of the distributional soft actor-critic algorithm with three refinements (DSAC-T), the RAPER mechanism is introduced. By dynamically adjusting the priority of experience samples, a balance between exploration and exploitation is achieved, thereby improving the convergence speed, path performance indicators, and stability of AUV path planning in complex marine environments.
[0007] The AUV path planning method based on reward-adaptive-priority experience replay specifically includes the following steps:
[0008] Step 1: Model the seabed topography, underwater currents, and underwater obstacles in a complex marine environment.
[0009] Step 2: To achieve target guidance and safe obstacle avoidance for AUVs in complex marine environments, a multidimensional composite reward function, state space, and action space are designed.
[0010] The state space includes the AUV's pose information, linear velocity, angular velocity, target relative distance, relative distance and direction information with obstacles, terrain height, and ocean current velocity and direction.
[0011] The action space is a continuous six-dimensional vector, corresponding to the thrust and torque inputs in the AUV body coordinate system, among which forward thrust, pitching moment and yaw moment are the main control quantities.
[0012] To guide the AUV to reach the target location via the shortest path while ensuring safety, a composite reward function was designed that integrates success rewards, distance changes, collision penalties, ocean current utilization, and attitude balance factors.
[0013] Step 3: An architecture for AUV 3D path planning was constructed based on deep reinforcement learning to realize 3D path planning for AUVs in complex marine environments, accelerate policy convergence, and improve the performance indicators of the planned trajectory.
[0014] s3.1 Deep Reinforcement Learning Algorithm Network Architecture Design
[0015] The deep reinforcement learning algorithm network includes an actor network and a critic network.
[0016] The input to the actor network is the current state, and the output is the policy distribution over actions. The AUV randomly samples the action to be executed from this policy distribution. Control is then executed based on the action instructions output by the current actor network, and path planning tasks are implemented within this framework.
[0017] The input to the critic network is the current state-action pair, and the output is a continuous Gaussian distribution.
[0018] Preferably, the critic network comprises two independent value distribution networks, each of which independently evaluates the value distribution of the input state-action pair. The distribution with the smaller mean of the two is selected as the objective for policy optimization, in order to alleviate the overestimation problem present in traditional Q-learning.
[0019] s3.2 Design of the Reward Adaptive Priority Experience Replay Mechanism (RAPER)
[0020] Reward information is integrated by introducing an event set into the quadruple experience samples. The composite event weight of each experience sample is defined as:
[0021]
[0022] where the symbols denote, in order: the weight of the target event defined for each training period; the weighting factor for non-target events; the weight of a non-target event; the empty set; and the default weight.
[0023] Based on the success rate during the training phase, the priority weights of experience samples are dynamically adjusted to increase the sampling probability of key experiences and enhance the stability of learning.
[0024]
[0025] where the symbols denote, in order: the adjusted target event weight; the training success rate; the set success rate threshold; the adaptive adjustment factor; the weight of the success event; the weight of the target-approach event; and the operation of taking the minimum of two values.
[0026] Based on the composite event weights of the experience samples, calculate the importance sampling weight, the sampling probability P(m), and the overall priority value p_m:
[0027]
[0028]
[0029]
[0030] where the symbols denote, in order: the temporal difference error; the absolute value operation; a very small positive constant; the priority intensity coefficient, with value range [0,1]; the current capacity of the experience replay buffer; and the importance sampling correction coefficient, with value range [0,1].
[0031] s3.3 Design of AUV 3D Path Planning Method
[0032] First, the environment model established in step 1 is used to obtain the current state information. The state is then fed into the actor network of the deep reinforcement learning architecture, which outputs continuous control action instructions for the scope AUV model. This allows the AUV to interact with the marine environment and obtain its state information for the next moment. After the action is performed, the composite reward function is calculated based on the task objective to obtain an external reward signal.
[0033] The current state, the executed action, the reward received, and the state information at the next moment are recorded to construct an experience sample. The adaptive priority experience replay mechanism dynamically adjusts the sample priority based on the task stage and event type, and experience replay is performed according to the sampling probability of the samples, thereby highlighting the impact of key experiences in the training process.
[0034] The present invention has the following beneficial effects:
[0035] 1. This method integrates deep reinforcement learning algorithms with a reward-adaptive priority experience replay mechanism, adjusting the priority weights of experience samples based on training phase and reward information, thereby improving experience utilization and policy convergence speed.
[0036] 2. The reward function design comprehensively considers multiple constraints, including target distance, obstacle avoidance safety, attitude stability, and ocean current utilization, constructing a composite reward model that enables AUVs to obtain shorter and smoother trajectories during missions. Furthermore, a three-dimensional simulation environment is built based on real ocean currents and simulated terrain data, so that algorithm training and validation more closely resemble real ocean conditions. Simulation results show that the algorithm exhibits fast convergence, stable value estimation, and good adaptability across multiple environments, providing a reliable and effective approach for AUV path planning in complex ocean environments.
Attached Figure Description
[0037] Figure 1 is a schematic diagram of the AUV path planning method based on reward-adaptive priority experience replay.
[0038] Figure 2 is a diagram illustrating the reward-adaptive priority experience replay mechanism.
[0039] Figure 3 is a comparison chart of the trajectories of different algorithms in a marine environment with two obstacles, as shown in the example.
[0040] Figure 4 is a comparison chart of the trajectories of different algorithms in a marine environment with four obstacles, as shown in the example.
[0041] Figure 5 is a comparison chart of the trajectories of different algorithms in a marine environment with eight obstacles, as shown in the example.
[0042] Figure 6 is a comparison chart of the trajectory indicators of different algorithms in a marine environment with two obstacles, as shown in the example.
[0043] Figure 7 is a comparison chart of the trajectory indicators of different algorithms in a marine environment with four obstacles, as shown in the example.
[0044] Figure 8 is a comparison chart of the trajectory indicators of different algorithms in a marine environment with eight obstacles, as shown in the example.
[0045] Figure 9 is a comparison chart of the rewards of different algorithms in a marine environment with two obstacles, as shown in the example.
[0046] Figure 10 is a comparison chart of the rewards of different algorithms in a marine environment with four obstacles, as shown in the example.
[0047] Figure 11 is a comparison chart of the rewards of different algorithms in a marine environment with eight obstacles, as shown in the example.
Detailed Implementation
[0048] The present invention will be further explained below with reference to the accompanying drawings:
[0049] The AUV path planning method based on reward-adaptive-priority experience replay specifically includes the following steps:
[0050] Step 1: To achieve path planning and decision-making control of AUVs in complex marine environments, it is necessary to construct a realistic underwater environment model, including three parts: terrain modeling, ocean current modeling, and obstacle modeling, to simulate the AUV's operating scenarios.
[0051] s1.1 Terrain Modeling
[0052] When modeling the seabed topography, a uniformly spaced rectangular grid is first established on a pre-defined two-dimensional horizontal region as discrete sampling points. The initial elevation is set at the shallow sea baseline depth. Then, four types of parametric topographic feature functions are sequentially superimposed. The algebraic sum of the calculated values is truncated according to the depth range to obtain the final elevation values of each node. A seabed topography model is generated by superimposing multiple functions to ensure the continuity and controllability of the topography.
[0053] ①Base topography:
[0054]
[0055] where the symbols denote, in order: the elevation of the base topography; the two slope control parameters; and the two horizontal coordinates.
[0056] ② Sand waves:
[0057]
[0058] where the symbols denote, in order: the exponential function; the elevation of the sand wave field; the height of the sand waves; the two wavelengths; and the two rotated coordinates.
[0059] ③ Rock protrusions:
[0060]
[0061] where the symbols denote, in order: the elevation of the rock protrusion; the height of the reef; the two center coordinates; and the parameter controlling the extent of spread.
[0062] ④Erosion channels:
[0063]
[0064] where the symbols denote, in order: the depth of the erosion channel; the maximum depth of the depression; the channel width; and the distance from the coordinate point to the center of the channel.
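The superposition described in s1.1 can be sketched as follows. This is only an illustration: the source equations are not reproduced here, so every functional form (planar slope, enveloped sinusoid, Gaussian bump, Gaussian trench), every numeric parameter, and the name `synth_terrain` are assumptions chosen to match the verbal descriptions, not the patented formulas.

```python
import math

def synth_terrain(nx=32, ny=32, spacing=20.0, base_depth=-50.0,
                  depth_range=(-200.0, -10.0)):
    """Return a grid z[i][j] built by summing four illustrative features."""
    theta = math.radians(30.0)  # assumed rotation of the sand-wave coordinates
    z_min, z_max = depth_range
    grid = []
    for i in range(nx):
        row = []
        for j in range(ny):
            x, y = i * spacing, j * spacing
            # 1) Base topography: planar slope with two slope parameters.
            z_base = -0.02 * x - 0.01 * y
            # 2) Sand waves: sinusoid in rotated coordinates, exponential envelope.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            z_sand = (3.0 * math.sin(2 * math.pi * xr / 80.0)
                      * math.exp(-(yr / 400.0) ** 2))
            # 3) Rock protrusion: Gaussian bump around a center point.
            r2 = (x - 300.0) ** 2 + (y - 300.0) ** 2
            z_rock = 15.0 * math.exp(-r2 / (2 * 40.0 ** 2))
            # 4) Erosion channel: Gaussian-profile trench along the line y = 150.
            z_chan = -12.0 * math.exp(-((y - 150.0) / 30.0) ** 2)
            z = base_depth + z_base + z_sand + z_rock + z_chan
            # Truncate the algebraic sum to the allowed depth range.
            row.append(min(max(z, z_min), z_max))
        grid.append(row)
    return grid
```

Because each feature is a smooth function of the horizontal coordinates, the summed surface stays continuous, and the final truncation keeps every node inside the configured depth range.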
[0065] In AUV path planning simulations, the AUV's position is an arbitrary real coordinate in continuous space, and may not necessarily fall exactly on a grid point. Therefore, an interpolation method is used to convert the discrete DEM into an approximately continuous terrain function.
[0066]
[0067] where the symbols denote, in order: the interpolated terrain elevation at the query point; the four grid nodes surrounding the query point; the grid spacings; and the offsets of the query point relative to the bottom-left node.
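The description (four surrounding nodes, grid spacing, offsets from the bottom-left node) matches standard bilinear interpolation, so a minimal sketch is given on that assumption. The function name `bilinear_elevation`, the `dem[i][j]` layout (node at x = i·spacing, y = j·spacing), and the clamping at the grid border are illustrative choices.

```python
import math

def bilinear_elevation(dem, spacing, qx, qy):
    """Bilinearly interpolate elevation at (qx, qy) from a uniform grid."""
    nx, ny = len(dem), len(dem[0])
    # Index of the bottom-left node, clamped so i+1 and j+1 stay in range.
    i = min(max(int(math.floor(qx / spacing)), 0), nx - 2)
    j = min(max(int(math.floor(qy / spacing)), 0), ny - 2)
    # Normalized offsets of the query point relative to the bottom-left node.
    tx = min(max(qx / spacing - i, 0.0), 1.0)
    ty = min(max(qy / spacing - j, 0.0), 1.0)
    z00, z10 = dem[i][j], dem[i + 1][j]
    z01, z11 = dem[i][j + 1], dem[i + 1][j + 1]
    return ((1 - tx) * (1 - ty) * z00 + tx * (1 - ty) * z10
            + (1 - tx) * ty * z01 + tx * ty * z11)
```

At a grid node the result equals the stored elevation, and at the center of a cell it equals the mean of the four corner values.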
[0068] s1.2, Ocean Current Model
[0069] A dynamic ocean current field based on real observation data from the Copernicus Marine Environment Monitoring Service (CMEMS) is used as the ocean current model to improve the realism and environmental adaptability of AUV path planning.
[0070] s1.3 Obstacle Model
[0071] The geometry of obstacles is abstracted as a sphere model, with each obstacle uniquely determined by the three-dimensional coordinates of its center point and its radius.
[0072] Step 2: The AUV uses its onboard sensor systems and underwater sonar equipment to collect real-time dynamic information about its environment and its own state. To achieve efficient and safe path planning for the AUV in complex marine environments, a reward function with clear physical meaning and guidance was designed, and a state space and action space comprehensively describing the AUV's motion state and environmental information were constructed:
[0073] s2.1 The state space, used to characterize the AUV's environmental perception and attitude feedback in the path planning task, is constructed as a high-dimensional vector, encompassing the AUV's current state, target relationships, environmental situation, and ocean dynamics information:
[0074]
[0075] where the symbols denote, in order: the AUV's own state, comprising its current position, velocity, attitude angles, and angular velocity; the position of the AUV relative to the target point; the position of the target point; the Euclidean distance from the AUV to the target point; the ocean current velocity at the AUV's current location; and the position of the AUV relative to the obstacles, including the distance between each obstacle and the AUV. The superscript T denotes the transpose of a vector.
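As a rough illustration of how such a state vector might be assembled, the sketch below concatenates the listed components. The ordering of the components, the use of center-to-center obstacle distance, and the name `build_state` are assumptions; the source does not fix them in the surviving text.

```python
import math

def build_state(pos, vel, att, ang_vel, goal, current, obstacles):
    """Flatten AUV, goal, current, and obstacle information into one list.

    obstacles: iterable of (cx, cy, cz, radius) tuples (sphere model).
    """
    rel_goal = [g - p for g, p in zip(goal, pos)]
    dist_goal = math.sqrt(sum(c * c for c in rel_goal))
    state = [*pos, *vel, *att, *ang_vel, *rel_goal, dist_goal, *current]
    for cx, cy, cz, _radius in obstacles:
        rel = [cx - pos[0], cy - pos[1], cz - pos[2]]
        state.extend(rel)                                   # relative position
        state.append(math.sqrt(sum(c * c for c in rel)))    # center distance
    return state
```

With this layout the state has 19 + 4k entries for k obstacles, so the network input size changes with the number of obstacles in the scenario.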
[0076] s2.2 The action space is defined as the controllable execution variables of the AUV, specifically a 6-dimensional continuous vector. The AUV has propeller control in the forward direction, while pitch and yaw are controlled by the tail rudder and vertical rudder, respectively. The zero elements correspond to zero control input in the sway, heave, and roll directions of the AUV, which conforms to the dynamic constraints of an underactuated six-DOF AUV.
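A minimal sketch of this underactuated mapping is shown below. It assumes the common body-frame ordering [surge, sway, heave, roll, pitch, yaw] and a policy output normalized to [-1, 1]; the function name `to_body_wrench` and the limits `max_thrust` and `max_moment` are hypothetical.

```python
def to_body_wrench(policy_out, max_thrust=100.0, max_moment=20.0):
    """Map a 3-D policy output to a 6-D body-frame control vector."""
    # Clamp each command to the normalized range before scaling.
    surge, pitch, yaw = (min(max(float(u), -1.0), 1.0) for u in policy_out)
    # Sway, heave, and roll receive zero control input (underactuated AUV).
    return [surge * max_thrust, 0.0, 0.0, 0.0,
            pitch * max_moment, yaw * max_moment]
```

For example, `to_body_wrench([0.5, -1.0, 2.0])` yields forward thrust 50, pitching moment -20, and a yaw moment saturated at 20, with the three uncontrolled axes left at zero.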
[0077] s2.3. Combining the above AUV parameters and control objectives, the following comprehensive reward function R is constructed:
[0078]
[0079] ① A success reward is given when the AUV successfully reaches the target point as required.
[0080]
[0081] where the two symbols denote the success reward value and the success range used to determine whether an AUV mission is successful; that is, the mission is considered successful when the distance between the AUV and the target point is less than the success range.
[0082] ② The progress reward is determined by the change in distance between the AUV and the target point, reflecting the progress of the AUV mission.
[0083]
[0084] where the distance change is the distance between the AUV and the target point at the previous moment minus the distance at the current moment. When this difference is greater than 0, the AUV is closer to the target point than at the previous time step and a positive reward is given; when it is less than 0, a penalty is imposed. The remaining symbol is the reward coefficient.
[0085] ③ The obstacle collision penalty penalizes the AUV for approaching and colliding with obstacles:
[0086]
[0087] where the symbols denote, in order: the obstacle collision penalty value; the distance between the AUV and the obstacle; the collision penalty coefficient; a very small positive number that prevents the denominator from being 0; and the safe distance between the AUV and obstacles. When the distance between the AUV and an obstacle is less than or equal to the radius of the obstacle, a collision is considered to have occurred and the penalty is imposed.
[0088] ④ The terrain collision penalty ensures that the AUV remains above the seabed topography during movement:
[0089]
[0090] where the symbols denote the terrain collision penalty value and the difference between the current altitude of the AUV and the terrain altitude. When this difference indicates that the AUV has collided with the terrain, the penalty is imposed.
[0091] ⑤ The ocean current utilization reward guides the policy toward better control in the downstream direction:
[0092]
[0093] where the symbols denote, in order: the ocean current utilization reward coefficient; the angle between the direction of the ocean current at the AUV's current location and the direction of the AUV's own velocity; the magnitude of the ocean current speed; and the modulus operation.
[0094] ⑥ The attitude penalty prevents the AUV from failing due to excessive attitude angles.
[0095]
[0096] where the coefficient is the attitude penalty coefficient, designed to reduce excessive attitude tilting and thereby ensure hydrodynamic stability during navigation.
[0097] ⑦ The step penalty limits the episode length and encourages quick task completion.
[0098]
[0099] where the symbol is the penalty value for each step.
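The seven reward terms above can be sketched as a single function. Since the source formulas are not available here, the specific forms are assumptions: an inverse-clearance penalty inside the safe distance, a current reward proportional to current speed times the cosine of the angle, and an attitude penalty on the absolute roll and pitch. All coefficient values, event names, and the name `composite_reward` are hypothetical.

```python
import math

def composite_reward(d_prev, d_now, obstacles, clearance_above_terrain,
                     current_speed, current_angle, roll, pitch,
                     r_success=100.0, success_range=5.0, k_progress=1.0,
                     r_collision=-100.0, k_obs=1.0, safe_dist=10.0, eps=1e-6,
                     r_terrain=-100.0, k_current=0.1, k_att=0.05, r_step=-0.01):
    """Return (reward, events); obstacles holds (center_distance, radius) pairs."""
    events = set()
    reward = r_step                                   # (7) step penalty
    if d_now < success_range:                         # (1) success reward
        reward += r_success
        events.add("success")
    reward += k_progress * (d_prev - d_now)           # (2) progress reward
    if d_prev > d_now:
        events.add("target_approach")
    for dist, radius in obstacles:                    # (3) obstacle penalty
        if dist <= radius:
            reward += r_collision
            events.add("collision")
        elif dist - radius < safe_dist:
            reward -= k_obs / (dist - radius + eps)
    if clearance_above_terrain <= 0.0:                # (4) terrain penalty
        reward += r_terrain
        events.add("terrain_collision")
    gain = k_current * current_speed * math.cos(current_angle)  # (5) current use
    reward += gain
    if gain > 0.0:
        events.add("current_use")
    reward -= k_att * (abs(roll) + abs(pitch))        # (6) attitude penalty
    return reward, events
```

Returning the triggered events alongside the scalar reward is a design choice made for this sketch: it lets the same step feed an event set into the experience tuple used by the replay mechanism.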
[0100] Step 3: The distributional soft actor-critic algorithm shown in Figure 1 is used to construct a system capable of efficient and safe path planning in complex marine environments:
[0101] s3.1 Deep Reinforcement Learning Algorithm Network Architecture Design
[0102] A distributional soft actor-critic algorithm network is built on the maximum entropy reinforcement learning framework. By jointly optimizing the expected cumulative reward and the policy entropy, the exploration ability and policy robustness of the agent are improved. The optimization objective can be expressed as:
[0103]
[0104]
[0105] where the symbols denote, in order: the state of the agent at time t; the action performed by the agent at time t; the optimal policy; the policy function of the agent given the state; the state distribution induced by the policy; the operator that finds, among all feasible policies, the policy that maximizes the objective function; the expectation over actions taken in a state according to the policy distribution; the immediate reward received for taking an action in a state; the discount factor and its value range; the temperature coefficient; the policy entropy; and the natural logarithm (base e) of the policy probability of taking an action in a state.
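For reference, the standard maximum-entropy objective used by soft actor-critic methods matches the quantities listed above. The notation below is the conventional one and is an assumption about the form of the source equations, which are not reproduced in this text.

```latex
\pi^{*} = \arg\max_{\pi} \;
  \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}
  \left[ \sum_{t=0}^{\infty} \gamma^{t}
    \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)
  \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}
    \big[ -\ln \pi(a_t \mid s_t) \big]
```

Here $\gamma$ is the discount factor, $\alpha$ the temperature coefficient, and $\rho_{\pi}$ the state-action distribution induced by the policy $\pi$.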
[0106] In each round of policy optimization for the actor network, a batch of states is first sampled from the experience replay pool, and a Gaussian distribution is output for each state. Actions are then sampled from this distribution using the reparameterization method. The policy optimization objective is constructed by combining the Q-value output by the critic network; the gradient of the loss function with respect to the actor network parameters is calculated, and the parameters are iteratively updated using gradient descent. The objective maximizes the expected value while ensuring that the policy retains sufficient entropy to encourage exploration:
[0107]
[0108] where the symbols denote, in order: the parameters of the actor network; the loss function of the actor network; the sampling of states from the experience replay buffer; the sampling of actions from the current policy using the reparameterization technique; and the Q-value output by the critic network.
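The two ingredients of this update, a reparameterized Gaussian sample and an entropy-regularized loss, can be illustrated numerically. The sketch assumes the common SAC form of the actor loss (mean of temperature times log-probability minus Q-value) and omits any action squashing; it computes values only and does no automatic differentiation. The names `sample_action` and `actor_loss` are hypothetical.

```python
import math
import random

def sample_action(mu, log_std, rng):
    """Reparameterized sample a = mu + std * noise and its Gaussian log-density."""
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    action = [m + math.exp(ls) * n for m, ls, n in zip(mu, log_std, noise)]
    # Sum of per-dimension log N(a; mu, std), written in terms of the noise.
    log_prob = sum(-0.5 * n * n - ls - 0.5 * math.log(2 * math.pi)
                   for ls, n in zip(log_std, noise))
    return action, log_prob

def actor_loss(log_probs, q_values, alpha):
    """Batch mean of alpha * log pi(a|s) - Q(s, a) (assumed SAC-style loss)."""
    terms = [alpha * lp - q for lp, q in zip(log_probs, q_values)]
    return sum(terms) / len(terms)
```

Writing the sample as a deterministic function of the network outputs and an independent noise draw is what allows gradients to flow through the sampling step in a real training framework.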
[0109] Following the idea of double Q-learning, two independent value distribution networks are constructed in the critic network, each learning the state-action value separately. Each network outputs a Gaussian distribution whose expected value gives the Q-value. Of the two value distribution networks, the one with the lower mean output is selected to provide the Q-value output by the critic network:
[0110]
[0111] where the symbols denote, in order: the index of the selected Q network; the operator that chooses, between the two networks, the index that minimizes the expression; the expected cumulative return output by the Q-value network with the given parameters for a state and action; and the sampling of the action from the policy distribution with the given parameters.
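The selection rule can be sketched in a few lines, assuming each value distribution network returns a (mean, standard deviation) pair for one state-action input. The function name `select_min_mean` and the tie-breaking toward the first network are illustrative choices.

```python
def select_min_mean(dist1, dist2):
    """Pick the value distribution with the lower mean.

    Each argument is a (mean, std) pair; returns (index, mean, std).
    """
    if dist1[0] <= dist2[0]:
        return 0, dist1[0], dist1[1]
    return 1, dist2[0], dist2[1]
```

Taking the whole distribution of the lower-mean network, rather than the minimum of the two means alone, keeps the selected mean and standard deviation consistent with each other.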
[0112] The critic network randomly samples a batch of state-action-reward-next-state quadruple experience samples from the experience replay pool. Each value distribution network outputs the mean and standard deviation of the current value distribution for the input state-action pair. During the critic network update, the originally random target return is replaced with its expected value:
[0113]
[0114] where the symbols denote, in order: the target Q-value; the Q-value output by the target critic network; the updated state reached after the agent takes an action in its current state; and the sampling of the action in the state at time t+1.
[0115] An adaptive gradient adjustment strategy is introduced to address the algorithm's sensitivity to reward scale:
[0116]
[0117]
[0118] ,
[0119] where the symbols denote, in order: the clipping function, which limits the input random target value to a fixed range; the expected value of the value distribution, used as the center point of the clipping interval; the clipping boundary, which is related to the standard deviation of the value distribution and is set according to the three-sigma principle; the loss function of the value distribution after scaling; the gradient scaling weight, which is related to the variance of the current value distribution and automatically adjusts the scale of the entire loss function; the expectation over experience samples drawn from the experience replay buffer; the KL divergence (Kullback-Leibler divergence); the target distribution; and the value distribution output by the current value network.
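A numerical sketch of the expected target and the clipped random target is given below. It assumes the usual soft target form (reward plus discounted next value minus the temperature-weighted log-probability) and a clipping half-width of three times the batch-mean standard deviation; both are assumptions, as is the name `clipped_targets`.

```python
def clipped_targets(rewards, q_next, z_next, logp_next, q_now, std_now,
                    gamma=0.99, alpha=0.2, k_sigma=3.0):
    """Return (expected targets, clipped random targets) for one batch.

    q_next: mean of the target value distribution at the next state-action.
    z_next: a random return sampled from that distribution.
    """
    # Clipping half-width from the three-sigma rule (batch mean is assumed).
    bound = k_sigma * sum(std_now) / len(std_now)
    y_q, y_z = [], []
    for r, qn, zn, lp, q in zip(rewards, q_next, z_next, logp_next, q_now):
        y_q.append(r + gamma * (qn - alpha * lp))       # expected target
        raw = r + gamma * (zn - alpha * lp)             # random target
        y_z.append(min(max(raw, q - bound), q + bound)) # clip around Q(s, a)
    return y_q, y_z
```

Centering the clipping interval on the current expected value keeps outlier returns from dominating the variance update while leaving the mean target untouched.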
[0120] The target actor network and the target critic network are used to calculate the improved target Q-value and the clipped random target return. The scaled KL divergence loss function measures the difference between the current value distribution and the target distribution. The gradient of the loss function with respect to the network parameters is calculated, and the loss is minimized using gradient descent:
[0121]
[0122] s3.2 Design of the Reward Adaptive Priority Experience Replay Mechanism (RAPER)
[0123] To achieve efficient utilization of experience samples and stable policy improvement, a reward-adaptive priority experience replay mechanism is designed. This mechanism dynamically adjusts the priority of experience samples by comprehensively considering temporal difference errors and reward information, and samples are taken according to the importance sampling weight of each experience sample, as shown in Figure 2:
[0124] A set of triggered events is added to the experience tuple of the traditional experience replay mechanism, forming a new experience tuple structure. The event set is the set of events triggered by the experience tuple; corresponding to the reward function, it includes success events, collision events, terrain approach events, target approach events, and ocean current utilization events. Different importance weights are assigned to different events based on their relative importance, and a composite event weight is defined for each experience sample as:
[0125]
[0126] where the symbols denote, in order: the weight of the target event defined for each training period; the weighting factor for non-target events; the weight of a non-target event; the sum of the weights of all events in the event set except the target event; the empty set; and the default weight.
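One reading of these definitions is that the composite weight equals the target event weight plus a scaled sum of the remaining event weights, with a default value for an empty event set. The sketch below implements that reading as an assumption; the function name `event_weight`, the factor `lam`, and the example event names and weights are all hypothetical.

```python
def event_weight(events, target_event, weights, lam=0.5, default=1.0):
    """Composite event weight of one experience sample (assumed form)."""
    if not events:
        return default                       # empty event set
    # Sum of weights of every triggered event other than the target event.
    others = sum(weights[e] for e in events if e != target_event)
    w_target = weights[target_event] if target_event in events else 0.0
    return w_target + lam * others
```

For instance, with weights `{"success": 5.0, "current_use": 1.0}` and target event `"success"`, a sample that triggered both events receives 5.0 + 0.5 × 1.0 = 5.5, while a sample with no events falls back to the default weight.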
[0127] The weights of target events are dynamically adjusted based on the success rate during the training phase.
[0128]
[0129] where the symbols denote, in order: the adjusted target event weight; the training success rate, expressed as the ratio of the number of successful arrivals to the total number of tests; the set success rate threshold; the adaptive adjustment factor; the weight of the success event; the weight of the target-approach event; and the operation of taking the minimum of two values. The remaining case indicates that there is no target event at that time.
[0130] Based on the composite event weights of the experience samples, calculate the overall priority, sampling probability, and importance sampling weight of each sample:
[0131] ,
[0132] ,
[0133] .
[0134] where the symbols denote, in order: the temporal difference error; the absolute value operation; a very small positive constant; the priority intensity coefficient, with value range [0,1], where a value of 0 gives uniform sampling and a value of 1 gives sampling entirely according to priority; the current capacity of the experience replay buffer; and the importance sampling correction coefficient, with value range [0,1].
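The three quantities can be sketched in the style of standard prioritized experience replay. How the composite event weight enters the priority is not recoverable from this text, so the multiplicative combination below is an assumption, as are the normalization of the importance weights by their maximum and the names `raper_probabilities` and `sample_batch`.

```python
import random

def raper_probabilities(td_errors, event_weights, alpha=0.6, beta=0.4, eps=1e-6):
    """Return (priorities, sampling probabilities, normalized IS weights)."""
    # Assumed combination: event weight times (|TD error| + small constant).
    priorities = [w * (abs(d) + eps) for d, w in zip(td_errors, event_weights)]
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    n = len(priorities)                       # current buffer size
    is_w = [(n * pr) ** (-beta) for pr in probs]
    w_max = max(is_w)
    return priorities, probs, [w / w_max for w in is_w]

def sample_batch(probs, batch_size, rng):
    """Draw buffer indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

With `alpha=0` every sample has the same probability regardless of its priority, and with `alpha=1` the probability is directly proportional to the priority, which matches the two limiting cases described for the priority intensity coefficient.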
[0135] The number of obstacles was set to 2, 4, and 8 respectively, and the performance of this method in AUV path planning was compared with other classic reinforcement learning algorithms in environments of different complexity. As shown in Figures 3-5, the path planning performance of the DSAC-T-RAPER algorithm proposed in this invention is significantly better than that of the other methods. Figures 6-8 compare the trajectory metrics of the different methods in environments of varying complexity. Figures 9-11 compare the rewards of the different methods in environments of different complexity. It can be seen that the method proposed in this invention exhibits a faster reward increase, the highest success rate, and the fastest and most stable convergence. The simulation results therefore demonstrate that the deep reinforcement learning method for AUV path planning based on a reward-adaptive priority experience replay mechanism proposed in this invention is effective. It can solve the problems of slow convergence, low exploration efficiency, and low experience utilization faced by traditional reinforcement learning algorithms in complex marine environments with sparse reward signals. It is suitable for complex marine environments and has significant practical engineering value and promising prospects for widespread application.
Claims
1. An AUV path planning method based on reward-adaptive priority experience replay, in which the complex marine environment is first modeled, the state space, action space, and multidimensional composite reward function of the AUV are then designed, and a three-dimensional AUV path planning architecture is finally constructed based on a deep reinforcement learning method, characterized in that:
s1.1 A deep reinforcement learning network comprising an actor network and a critic network is designed. The actor network takes the current state as input and outputs a policy distribution over actions; the AUV randomly samples the action actually executed from this policy distribution, and control is executed according to the action command output by the current actor network to achieve path planning. The input of the critic network is the current state-action pair, and its output is a continuous Gaussian distribution.
s1.2 An event set is introduced into the quadruple experience sample of traditional deep reinforcement learning, and an event composite weight is defined for each experience sample. The terms of this definition are: the weights of the target events, which are defined for different training periods; the non-target events and the weight of the non-target events; the empty set; and the default weight. The importance sampling weight, the sampling probability P(m), and the overall priority value p_m are then calculated. The terms of this calculation are: the temporal-difference error, taken in absolute value; a very small positive constant; the priority intensity coefficient, which lies in [0,1]; the current capacity of the experience replay buffer; and the importance sampling correction coefficient, which lies in [0,1].
s1.3 The established environment model is used to obtain the current state information, which is fed into the actor network; the output is the continuous control action command for the AUV model. Executing this command lets the AUV interact with the marine environment and obtain its state information at the next moment. After the action is performed, the composite reward function is evaluated according to the task objective to obtain an external reward signal. The current state, the executed action, the reward received, and the state information at the next moment are recorded to construct an experience sample; the adaptive priority experience replay mechanism dynamically adjusts the sample priority according to the task stage and event type, and experience is replayed according to the sampling probability of the samples.
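The priority computation of step s1.2 can be illustrated with a minimal sketch. The formulas of the claim are not reproduced in this text, so the sketch assumes the standard prioritized experience replay forms (sampling probability proportional to the priority raised to the intensity coefficient, and importance sampling weights normalized by their maximum). It further assumes that the event composite weight multiplies the absolute temporal-difference error, and that the largest weight applies when several events occur. All function names, event names, and numeric values are hypothetical.

```python
def event_weight(events, target_w, non_target_w, default_w=1.0):
    """Composite weight of one experience sample from its event set."""
    if not events:                       # empty event set: default weight
        return default_w
    # Assumption: when several events fire, the largest weight wins.
    return max(target_w.get(e, non_target_w.get(e, default_w)) for e in events)


def priorities(td_errors, weights, eps=1e-6):
    # Assumed combination: p_m = w_m * (|delta_m| + eps)
    return [w * (abs(d) + eps) for d, w in zip(td_errors, weights)]


def sampling_probs(p, alpha=0.6):
    # Standard prioritized replay: P(m) = p_m^alpha / sum_k p_k^alpha
    scaled = [x ** alpha for x in p]
    total = sum(scaled)
    return [x / total for x in scaled]


def importance_weights(probs, beta=0.4):
    # w_m = (N * P(m))^(-beta), normalized by the largest weight
    n = len(probs)                       # current number of stored samples
    raw = [(n * q) ** (-beta) for q in probs]
    top = max(raw)
    return [r / top for r in raw]


target_w = {"goal_reached": 3.0}         # hypothetical event names and weights
non_target_w = {"collision": 1.5}
td = [0.5, -2.0, 0.1]                    # temporal-difference errors
w = [event_weight(set(), target_w, non_target_w),
     event_weight({"goal_reached"}, target_w, non_target_w),
     event_weight({"collision"}, target_w, non_target_w)]
probs = sampling_probs(priorities(td, w))
is_w = importance_weights(probs)
```

In this example the sample tagged with the target event has both the largest temporal-difference error and the largest event weight, so it is replayed most often and receives the smallest importance sampling correction.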
2. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: the modeling of the complex marine environment includes terrain modeling, ocean current modeling, and obstacle modeling.
3. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 2, characterized in that: terrain modeling establishes a uniformly spaced rectangular grid of discrete sampling points over a predefined two-dimensional horizontal region; the initial elevation of each node is set to the shallow-sea baseline depth; four types of parametric terrain feature functions (base topography, sand waves, rock protrusions, and erosion channels) are then superimposed in sequence; and the algebraic sum of the calculated values is clipped to the depth range to obtain the final elevation of each node.
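The superposition and clipping described in claim 3 can be sketched as follows. The claim does not give the parametric form of the four terrain features, so the slope, sinusoid, and Gaussian shapes below, together with every coefficient, the baseline depth, and the depth range, are illustrative assumptions only.

```python
import math


def terrain_elevation(x, y, base_depth=-50.0, depth_range=(-120.0, -10.0)):
    """Elevation in meters (negative below sea level) at horizontal point (x, y)."""
    slope = -0.01 * x                                        # base topography (assumed linear)
    sand_waves = 2.0 * math.sin(0.05 * x) * math.cos(0.03 * y)
    rock = 15.0 * math.exp(-((x - 300.0) ** 2 + (y - 300.0) ** 2) / (2 * 40.0 ** 2))
    channel = -10.0 * math.exp(-((y - 600.0) ** 2) / (2 * 30.0 ** 2))
    z = base_depth + slope + sand_waves + rock + channel     # algebraic sum
    lo, hi = depth_range
    return min(max(z, lo), hi)                               # clip to the depth range


def build_grid(x_max, y_max, step):
    """Uniformly spaced rectangular grid of elevations, indexed as grid[row_y][col_x]."""
    xs = range(0, x_max + 1, step)
    ys = range(0, y_max + 1, step)
    return [[terrain_elevation(x, y) for x in xs] for y in ys]


grid = build_grid(1000, 1000, 100)   # 11 x 11 nodes at 100 m spacing
```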
4. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 2, characterized in that: ocean current modeling uses a dynamic ocean current field based on real observation data from the Copernicus Marine Environment Monitoring Service as the ocean current model; obstacle modeling abstracts the geometry of each obstacle as a sphere, so that each obstacle is uniquely determined by the three-dimensional coordinates of its center point and its radius.
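Because each obstacle is a sphere fixed by its center and radius, the distance from the AUV to the obstacle surface reduces to a single Euclidean distance minus the radius. The sketch below shows this under the assumption that the AUV is treated as a point; the function names and the `safety_margin` parameter are hypothetical.

```python
import math


def clearance(auv_pos, center, radius):
    """Distance from the AUV position to the surface of a spherical obstacle."""
    return math.dist(auv_pos, center) - radius


def collides(auv_pos, obstacles, safety_margin=0.0):
    """True if the AUV is within safety_margin of any (center, radius) obstacle."""
    return any(clearance(auv_pos, c, r) <= safety_margin for c, r in obstacles)


obstacles = [((0.0, 0.0, -20.0), 5.0)]   # one sphere: center (x, y, z) and radius
```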
5. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: the state space is a high-dimensional vector that comprises the AUV's own state, its relationship to the target, the environmental situation, and ocean dynamics information, wherein: the AUV's own state includes its current position, velocity, attitude angles, and angular velocity; the target component gives the position of the AUV relative to the target point, together with the Euclidean distance from the AUV to the target point; the ocean current component gives the ocean current velocity at the current location of the AUV; the obstacle component gives the position of the AUV relative to the obstacles, including the distance between each obstacle and the AUV; and the superscript T denotes the transpose of a vector.
6. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: the action space is defined as a 6-dimensional continuous vector in which the AUV has propulsion control in the forward direction, pitch and yaw are controlled by the tail rudder and the vertical rudder, respectively, and the zero elements correspond to zero control input of the AUV in the sway, heave, and roll directions.
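A minimal sketch of how three controllable inputs could be embedded in the 6-dimensional action vector of claim 6 is given below. The ordering of the elements is not stated in this text; the sketch assumes the common six-degree-of-freedom order (surge, sway, heave, roll, pitch, yaw). The normalization limit is also an assumption.

```python
def to_action_vector(thrust, pitch_rudder, yaw_rudder, limit=1.0):
    """Embed three control inputs in a 6-D action with zero sway, heave, and roll."""
    def clamp(v):
        return max(-limit, min(limit, v))
    # Assumed order: [surge, sway, heave, roll, pitch, yaw]
    return [clamp(thrust), 0.0, 0.0, 0.0, clamp(pitch_rudder), clamp(yaw_rudder)]


action = to_action_vector(0.8, -0.2, 1.7)   # the yaw command is clamped to the limit
```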
7. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: the multidimensional composite reward function R is composed of the terms r1 to r7, which represent the success reward, progress reward, collision penalty, terrain collision penalty, ocean current utilization reward, attitude penalty, and step penalty, respectively.
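The structure of the seven-term reward can be sketched as follows. The formula for R is not reproduced in this text, so the sketch assumes an unweighted sum of r1 to r7, and the form and magnitude of every individual term are hypothetical placeholders rather than the values used in the invention.

```python
def composite_reward(reached, progress, hit_obstacle, hit_terrain,
                     current_alignment, attitude_excess):
    """Return the seven reward terms and their (assumed unweighted) sum."""
    terms = {
        "r1_success": 100.0 if reached else 0.0,
        "r2_progress": 1.0 * progress,               # reduction in distance to target
        "r3_collision": -50.0 if hit_obstacle else 0.0,
        "r4_terrain": -50.0 if hit_terrain else 0.0,
        "r5_current": 0.1 * current_alignment,       # motion aligned with the current
        "r6_attitude": -0.5 * attitude_excess,       # attitude beyond a comfortable limit
        "r7_step": -0.01,                            # constant per-step cost
    }
    return terms, sum(terms.values())


terms, total = composite_reward(False, 2.0, False, False, 0.5, 0.0)
```

Returning the individual terms alongside the total makes it straightforward to detect events such as success or collision when tagging experience samples.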
8. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: in each round of policy optimization, the actor network first samples a batch of state data from the experience replay pool and outputs a Gaussian distribution for each state; an action is then sampled from this Gaussian distribution using the reparameterization method; the policy optimization objective is constructed by combining the Q-value output of the critic network; and the gradient of the loss function with respect to the network parameters is calculated, and the parameters are iteratively updated by gradient descent.
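The reparameterized sampling of claim 8 can be sketched without a deep learning framework. The sketch assumes the usual soft actor-critic conventions, which the claim does not spell out: a tanh squashing of the Gaussian sample into a bounded action range, and a per-sample loss of the form temperature times log-probability minus Q-value. The names `sample_action`, `actor_loss`, and `alpha` are hypothetical.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Reparameterized sample a = tanh(mean + std * eps) with its log-probability."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        eps = rng.gauss(0.0, 1.0)            # noise independent of the network output
        u = mu + math.exp(ls) * eps          # differentiable in mu and ls
        a = math.tanh(u)                     # assumed squashing to (-1, 1)
        # Gaussian log-density of u, plus the tanh change-of-variables correction
        log_prob += -0.5 * eps ** 2 - ls - 0.5 * math.log(2 * math.pi)
        log_prob -= math.log(1.0 - a * a + 1e-6)
        action.append(a)
    return action, log_prob


def actor_loss(log_prob, q_value, alpha=0.2):
    # Assumed soft actor-critic objective for one sample: alpha * log pi - Q
    return alpha * log_prob - q_value


rng = random.Random(0)
action, log_prob = sample_action([0.0, 0.5], [-1.0, -1.0], rng)
```

Because the noise is drawn separately from the network outputs, the sampled action remains a differentiable function of the mean and log standard deviation, which is what allows the gradient of the loss to reach the actor parameters.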
9. The AUV path planning method based on reward-adaptive priority experience replay as described in claim 1, characterized in that: following the idea of double Q-learning, two independent value distribution networks are constructed within the critic network; when the target Q-value is calculated, the one of the two value distribution networks with the lower mean output is selected to provide the Q-value output by the critic network; the critic network randomly samples a batch of quadruple experience samples from the experience replay pool, and each value distribution network outputs the mean and standard deviation of the current value distribution for the input state-action pair; during the critic network update, the random target return is replaced with its expected value; an adaptive gradient adjustment strategy is introduced, in which the expected value of the value distribution serves as the center of the clipping interval and a clipping function limits the input random target return to a fixed range; the expected value is calculated using a target actor network and a target critic network; a scaled KL divergence loss function is then applied to the clipped random target return to measure the difference between the current value distribution and the target distribution; and the gradient of the loss function with respect to the network parameters is calculated, and the loss is minimized by gradient descent.
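Two of the operations in claim 9, selecting the value distribution with the lower mean and clipping the random target around the expected value, can be sketched as follows. The half-width of the clipping interval is not given in this text, so `bound` is a hypothetical parameter, and each value distribution is represented simply as a (mean, standard deviation) pair.

```python
def select_min_mean(dist_a, dist_b):
    """Pick the (mean, std) value distribution with the lower mean."""
    return dist_a if dist_a[0] <= dist_b[0] else dist_b


def clip_target(random_target, expected_value, bound):
    """Limit the random target to [expected_value - bound, expected_value + bound]."""
    lo, hi = expected_value - bound, expected_value + bound
    return min(max(random_target, lo), hi)


chosen = select_min_mean((1.0, 0.5), (0.8, 2.0))   # second network has the lower mean
clipped = clip_target(10.0, 2.0, 3.0)              # an outlier target is pulled back
```

Centering the interval on the expected value keeps occasional extreme samples of the target from dominating the critic gradient, which is consistent with the stability goal stated in the description.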
10. The AUV path planning method based on reward-adaptive priority experience replay as described in any one of claims 1, 8, or 9, characterized in that: the weights of the target events are dynamically adjusted according to the success rate during the training phase. The terms of this adjustment are: the adjusted target event weight; the training success rate, expressed as the ratio of the number of successful arrivals to the total number of tests; the preset success rate threshold; the adaptive adjustment factor; the weight of a success event; the weight of a near-target event; and the minimum operation, which takes the smaller of its two arguments. Under the corresponding condition of the definition, no target event is present at that time.