An unmanned surface ship path following method based on DQN and backstepping control

By combining deep reinforcement learning and backstepping control, the problems of communication delay and data packet loss in path following of unmanned surface vessels are solved, achieving efficient path following in complex marine environments, improving robustness and control accuracy, and making it suitable for collaborative operations of multiple unmanned vessels.

CN119781481B · Active · Publication Date: 2026-06-19 · JIANGSU UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: JIANGSU UNIV OF SCI & TECH
Filing Date: 2024-12-30
Publication Date: 2026-06-19

Abstract

This invention discloses a path-following method for unmanned surface vessels based on DQN and backstepping control, comprising the following steps: designing a communication system based on the MAVLink protocol to achieve real-time transmission of state information between the pilot vessel and the controlled vessel; constructing dynamic models of the pilot and controlled vessels and using a rendering window to observe the vessel's following performance in real time during model training; optimizing the path-following strategy using the constructed DQN model through a reward function; employing backstepping control for real-time feedback compensation under unstable communication conditions; selecting the next following action using a state prediction method when communication is severely interfered with; and achieving path-following in complex marine environments by dynamically adjusting the weights of DQN and backstepping control. This invention achieves adaptive optimization of the path-following strategy and effectively addresses communication delays and data packet loss in complex marine environments, significantly improving the robustness and control accuracy of unmanned vessel path-following.

Description

Technical Field

[0001] This invention belongs to the field of ship control and relates to unmanned surface vessel path following technology, specifically to an unmanned surface vessel path following method based on DQN and backstepping control.

Background Technology

[0002] Path following for unmanned surface vessels (USVs) is a challenging task in the marine environment. With increasing marine environmental complexity and diversified operational requirements, the capabilities of a single USV are limited, while multiple USVs can cover larger areas and perform a wider range of missions. Therefore, path following and formation control have become research hotspots. Pilot-following strategies, as a common path following method, enable controlled vessels to closely follow a pilot vessel in performing complex tasks.

[0003] In existing research, Kapitanyuk et al. implemented path-following control for nonholonomic mobile robots using the steering vector field method. Fossen and Lekkas designed a line-of-sight tracking controller based on the relative velocity of a USV, established a kinematic model, generated relative velocity in amplitude and phase form, and used underwater acoustic measurements to convert the relative velocity into absolute velocity to realize the control law. Khamseh and Janabi-Sharifi established a USV model and used a linear quadratic regulator to synchronously control a quadcopter and a robotic arm. Dong et al. solved the USV tracking problem for curved and straight paths using state feedback-based backstepping control.

[0004] With the breakthroughs in deep reinforcement learning (DRL) by DeepMind projects, reinforcement learning (RL) has been widely researched and applied in fields such as robotics, industrial automation, and machine learning. RL is considered one of the core technologies for designing intelligent systems. Combined with deep learning, deep reinforcement learning can characterize and control extremely complex systems in dynamic environments. Therefore, it is reasonable to establish an effective path-following control decision model based on DRL. To overcome the usability problem caused by the complexity of control laws in traditional analysis methods, a concise DRL was developed using Deep Q-Network (DQN). Li et al. used RL to reduce travel time at highway bottlenecks and proposed a Q-learning-based reinforcement learning model to control vehicle speed limits under various traffic conditions. Using Q-learning-based reinforcement learning can significantly reduce training time. However, the training efficiency of Q-learning is an issue, and the improvement of USV path tracking by Q-learning-based reinforcement learning is rarely discussed in the literature.

[0005] To enable communication between the pilot vessel and the controlled vessel, the lightweight MAVLink protocol is employed. However, complex communication environments (such as electromagnetic interference) often lead to data packet loss and delays, affecting communication and control performance.

Summary of the Invention

[0006] Purpose of the invention: To overcome the shortcomings of existing technologies, this invention provides a path following method for unmanned surface vessels based on DQN and backstepping control. By combining deep reinforcement learning and control theory, it not only achieves adaptive optimization of the path following strategy, but also effectively addresses communication delays and data packet loss in complex marine environments, significantly improving the robustness and control accuracy of unmanned vessel path following. It is suitable for path planning tasks in multi-unmanned vessel collaborative operations.

[0007] Technical Solution: To achieve the above objectives, this invention provides a path following method for unmanned surface vessels based on DQN and backstepping control, comprising the following steps:

[0008] S1: Design a communication system based on the MAVLink protocol to realize real-time transmission of status information between the pilot vessel and the controlled vessel;

[0009] S2: In the constructed model training environment, build dynamic models of the pilot vessel and the controlled vessel, and use a rendering window to observe the vessel's following performance in real time during the model training process;

[0010] S3: Utilize the constructed deep reinforcement learning (DQN) model to optimize the path following strategy through the reward function;

[0011] S4: Under unstable communication conditions, backstepping control is used for real-time feedback compensation to improve control accuracy;

[0012] In the event of severe communication interference, packet loss, or delays, the state prediction method is used to enable the controlled vessel to select the next following action based on state prediction.

[0013] S5: By dynamically adjusting the weights of DQN and backstepping control, path following in complex marine environments can be achieved.

[0014] Furthermore, in step S1, the communication system uses the MAVLink protocol to realize the transmission of status data between the pilot vessel and the controlled vessel via WIFI, including position information and attitude information (roll angle, pitch angle, yaw angle), and the communication robustness is evaluated by introducing an electromagnetic interference model through simulation.

[0015] Furthermore, the operation of the communication system in step S1 includes:

[0016] A1: Pilot vessel transmits information: Collects and processes the status information data of the pilot vessel, uses the MAVLink protocol to generate a payload data part containing key control information, and the pilot vessel, as the transmitting module, periodically transmits data packets containing position and attitude information (roll angle, pitch angle, yaw angle) to ensure the accuracy and validity of the data and adapt to complex marine environments;

[0017] A2: WIFI Protocol Simulation and Transmission: MAVLink data is transmitted via the WIFI protocol; the simulation considers channel attenuation, noise interference and multipath effects, and uses a dual-path model to reflect signal reflection and refraction, evaluates packet loss rate and delay, and simulates the characteristics of real wireless communication.

[0018] A3: Introduction and Control of Electromagnetic Interference: Design an adjustable electromagnetic interference model, adjust interference parameters, and test the robustness of communication under different noise levels; use WIFI protocol interference simulation to simulate packet loss rate and delay, and verify the stability of data transmission and decoding in complex environments.

[0019] The introduction of electromagnetic interference not only affects data transmission, but also makes the reception and decoding of MAVLink data packets more challenging by simulating interference with the WIFI protocol.

[0020] A4: Controlled vessel receives information: The controlled vessel extracts the position information and attitude data of the pilot vessel through the MAVLink unpacking process. Abnormal data (such as interference or packet loss) is recorded and fed back to the motion control algorithm to analyze the impact of interference on motion control.
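The A1 to A4 flow can be sketched as a small simulation. This is a minimal sketch under stated assumptions: the payload layout `PAYLOAD_FMT`, the `LossyChannel` class, and its `loss_prob` and `max_delay` parameters are hypothetical stand-ins for the adjustable interference model. It does not reproduce real MAVLink framing or the Wi-Fi channel model described above.

```python
import random
import struct

# Hypothetical payload layout (NOT the real MAVLink frame format):
# sequence number, then x, y, roll, pitch, yaw as little-endian floats.
PAYLOAD_FMT = "<I5f"


def pack_state(seq, x, y, roll, pitch, yaw):
    """A1: the pilot vessel packs position and attitude into a payload."""
    return struct.pack(PAYLOAD_FMT, seq, x, y, roll, pitch, yaw)


def unpack_state(payload):
    """A4: the controlled vessel unpacks the payload."""
    seq, x, y, roll, pitch, yaw = struct.unpack(PAYLOAD_FMT, payload)
    return {"seq": seq, "x": x, "y": y,
            "roll": roll, "pitch": pitch, "yaw": yaw}


class LossyChannel:
    """A2/A3: drops packets with probability loss_prob, delays the rest."""

    def __init__(self, loss_prob, max_delay, seed=0):
        self.loss_prob = loss_prob
        self.max_delay = max_delay
        self.rng = random.Random(seed)

    def transmit(self, payload):
        """Return (payload, delay_seconds), or None when the packet is lost."""
        if self.rng.random() < self.loss_prob:
            return None
        return payload, self.rng.uniform(0.0, self.max_delay)


def run_link(n_packets, loss_prob, max_delay=0.2, seed=0):
    """Send n_packets through the channel; return received states and loss rate."""
    channel = LossyChannel(loss_prob, max_delay, seed)
    received, lost = [], 0
    for seq in range(n_packets):
        result = channel.transmit(pack_state(seq, float(seq), 2.0, 0.0, 0.0, 0.1))
        if result is None:
            lost += 1  # abnormal data: recorded for the motion controller
            continue
        payload, delay = result
        state = unpack_state(payload)
        state["delay"] = delay
        received.append(state)
    return received, lost / n_packets
```

Raising `loss_prob` or `max_delay` plays the role of adjusting the interference parameters in A3; the returned loss rate and per-packet delays are the quantities that A2 evaluates.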

[0021] Furthermore, in step S2, the dynamic model is divided into a pilot vessel dynamic model and a controlled vessel dynamic model, wherein the pilot vessel dynamic model includes:

[0022] The dynamic equations of a pilot vessel are determined by the combined effects of inertia, Coriolis force, centrifugal force, and damping force, and are in the following form:

[0023] $$M\dot{v} + C(v)v + D(v)v = \tau$$

[0024] Where M is the inertia matrix (including the added mass effect), C(v) is the Coriolis and centrifugal force matrix, D(v) is the damping matrix, and τ is the control input (force and torque).

[0025] The expression for the inertia matrix M is:

[0026]

[0027] Where m is the mass of the ship; terms such as $X_{\dot{u}}$ represent the hydrodynamic added mass coefficients; $x_g$ is the position of the ship's center of gravity in the hull coordinate system; and $I_z$ is the moment of inertia of the ship about the z-axis;

[0028] The expressions for the Coriolis force and centrifugal force matrix C(v) are:

[0029]

[0030] Where u, v, and r represent the ship's longitudinal speed, lateral speed, and yaw rate, respectively;

[0031] The damping matrix D(v) includes linear and nonlinear damping terms, and its expression is:

[0032]

[0033] Linear terms (such as $X_u$) represent the linear damping coefficients, describing the linear relationship between speed and drag; nonlinear terms (such as $X_{uu}$) represent the nonlinear damping coefficients, describing the effect of the square or cube of velocity on resistance; |u|, |v|, and |r| are the absolute values of the ship's longitudinal velocity, lateral velocity, and yaw rate, respectively, reflecting that nonlinear resistance depends on the direction of motion;

[0034] The rotation matrix R(ψ) is used to transform the velocity in the ship's coordinate system to the global coordinate system, and its expression is:

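[0035] $$R(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$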

[0036] The thrust and torque acting on the unmanned surface vessel by the control input τ are in the following forms:

[0037]

[0038] The state update equation consists of two parts: position update and velocity update;

[0039] Position update (global coordinate system):

[0040] $$\dot{p} = R(\psi) V$$

[0041] Where p = [x, y, ψ]^T is the position vector and V = [u, v, r]^T is the velocity vector (the argument written as v in C(v) and D(v));

[0042] Speed update (hull coordinate system):

[0043] $$\dot{V} = M^{-1}\left(\tau - C(V)V - D(V)V\right)$$

[0044] This dynamic model describes the motion behavior of an unmanned vessel in three degrees of freedom, including inertial effects, Coriolis and centrifugal effects, damping effects, and the influence of external control inputs on the motion state.
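The state update of the pilot vessel can be illustrated with one explicit Euler step. This is a minimal sketch under simplifying assumptions that are not taken from the patent: the inertia matrix is treated as diagonal (center of gravity at the origin, added mass folded into `m`), only linear damping is kept, and the numeric values of `m` and `d` are illustrative placeholders.

```python
import math


def rotation(psi):
    """R(psi): maps hull-frame velocity to the global frame."""
    c, s = math.cos(psi), math.sin(psi)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def step(p, v, tau, dt, m=(20.0, 30.0, 5.0), d=(2.0, 3.0, 1.0)):
    """One Euler step of a simplified 3-DOF model.

    p = (x, y, psi) in the global frame, v = (u, v, r) in the hull frame,
    tau = (surge force, sway force, yaw torque).
    m and d are placeholder diagonal inertia and linear damping values.
    """
    u, vs, r = v
    m11, m22, m33 = m
    d11, d22, d33 = d

    # Velocity update (hull frame): v_dot = M^-1 (tau - C(v) v - D v),
    # written out for a diagonal M.
    du = (tau[0] + m22 * vs * r - d11 * u) / m11
    dv = (tau[1] - m11 * u * r - d22 * vs) / m22
    dr = (tau[2] + (m11 - m22) * u * vs - d33 * r) / m33
    v_new = (u + dt * du, vs + dt * dv, r + dt * dr)

    # Position update (global frame): p_dot = R(psi) v
    R = rotation(p[2])
    p_new = tuple(
        p[i] + dt * sum(R[i][j] * v[j] for j in range(3)) for i in range(3)
    )
    return p_new, v_new
```

With a constant surge force and these placeholder values, the surge speed settles where thrust balances linear damping, which gives a quick sanity check on the signs of the terms.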

[0045] The dynamics model of the controlled vessel employs a relatively simplified approach when implementing the reference trajectory simulation environment. It primarily focuses on the unmanned vessel's position updates and does not delve into complex dynamic equations. The main components of this model involve attitude rotation and displacement updates. The following are the main formulas and their explanations:

[0046] The dynamic model of a controlled vessel includes:

[0047] Position updates are obtained by integrating the velocity, using the Euler method for numerical integration:

[0048] p_new = p_old + Δt·R(ψ)·V

[0049] Where p = [x, y, ψ]^T is the position vector, V = [u, v, r]^T is the velocity vector, and Δt is the time step;

[0050] The attitude rotation matrix R(ψ) is used to map the velocity [u,v,r] to the global coordinate system, thereby updating the position [x,y,ψ] of the unmanned vessel;

[0051] Speed update formula:

[0052] $$V = R(\psi)^{T} \cdot \frac{p_{new} - p_{old}}{\Delta t}$$

[0053] The velocity is determined from the change in displacement within the global coordinate system and is obtained by mapping that change back to the ship's coordinate system.
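The simplified controlled-vessel model can be sketched as a pair of functions: an Euler position update and the inverse mapping that recovers the hull-frame velocity from a displacement. This is an illustrative sketch; the function names are hypothetical, and using the transpose of R(ψ) for the inverse mapping is an assumption based on R(ψ) being a rotation matrix.

```python
import math


def rot(psi):
    """Attitude rotation matrix R(psi) about the vertical axis."""
    c, s = math.cos(psi), math.sin(psi)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def euler_position_update(p_old, vel, dt):
    """p_new = p_old + dt * R(psi) * V, with psi taken from p_old."""
    R = rot(p_old[2])
    return [
        p_old[i] + dt * sum(R[i][j] * vel[j] for j in range(3))
        for i in range(3)
    ]


def recover_body_velocity(p_old, p_new, dt):
    """Map the global displacement rate back to the hull frame: R^T * dp/dt."""
    R = rot(p_old[2])
    dp = [(p_new[i] - p_old[i]) / dt for i in range(3)]
    # Multiplying by the transpose undoes the rotation.
    return [sum(R[j][i] * dp[j] for j in range(3)) for i in range(3)]
```

Applying `recover_body_velocity` to the output of `euler_position_update` returns the original velocity vector, which is a convenient consistency check between the two formulas.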

[0054] Furthermore, in step S2, the rendering window, based on the `gym.envs.classic_control.rendering` module, sets a display area with a blue background to simulate the color of the ocean environment. Different colors (e.g., red for the pilot vessel and blue for the controlled vessel) are used to distinguish the vessels. The position information of the vessels is updated in real time through the `Transform` object, allowing the vessels to move dynamically within the rendering window. In the early stages of training, because the model has not yet converged, the controlled vessels may deviate from their trajectory. This deviation can be observed through the rendering window, allowing for real-time adjustments to the model training parameters or reward function design. As training progresses, the rendering window will gradually display the process of the controlled vessels approaching the trajectory of the pilot vessel.

[0055] Furthermore, in step S3, the DQN model is trained using an experience replay and epsilon-greedy strategy, including an evaluation network and a target network, wherein the parameters of the target network are periodically synchronized to improve training stability.

[0056] The goal of the DQN model is to find an optimal policy that maximizes the state-action value function Q(s, a); the optimal state-action value function satisfies the Bellman equation:

[0057] $$Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\right]$$

[0058] Where s is the current state, a is the current action, s' is the next state (determined by the state transition probability), and a' is the action in the next state; $Q^*(s,a)$ is the optimal state-action value function, representing the maximum expected cumulative reward that the agent can obtain after taking action a in state s; E[·] is the mathematical expectation, taken over the uncertainty in the next state s'; r is the immediate reward obtained after performing action a; γ is the discount factor, in the range [0,1], used to weigh the importance of current and future rewards; $\max_{a'} Q^*(s',a')$ is the maximum Q value that can be obtained by choosing action a' in the next state;

[0059] In existing Q-learning, values are stored using a lookup table. However, storing the state-action value function in a Q-table becomes impractical when the state and action spaces are large or continuous. DQN uses a deep neural network to approximate the Q-value, enabling it to learn efficiently in high-dimensional state spaces. The goal of the DQN model is to minimize the following loss function, allowing the Q-network to gradually approach the optimal state-action value function:

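[0060] $$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q_{target}(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$$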

[0061] Where θ denotes the parameters of the current Q-network and $\theta^-$ denotes the parameters of the target network $Q_{target}$; the target network parameters are kept constant for a period of time to improve training stability.

[0062] Furthermore, in step S3, the DQN model is instantiated using the DQN class, which contains two deep neural networks: one is the current Q network eval_net, used to estimate the Q value of the current state; the other is the target Q network target_net, used to estimate the target Q value of the next state. The initial parameters of these two networks are the same, and as training progresses, the parameters of the target network will be synchronized from the Q network at certain intervals.

[0063] At each time step, an epsilon-greedy strategy will be used to select actions. During the exploration phase (initially with a larger epsilon), the agent will select actions more randomly to explore more possibilities; as training progresses, the epsilon value gradually decreases, and the unmanned surface vessel will be more inclined to select the optimal action evaluated by the current Q-network.

[0064] Specifically, in the `select_action` method, DQN randomly selects an action based on the value of `epsilon` or selects an action based on the maximum Q-value of the Q-network. This strategy allows the unmanned surface vessel to balance exploration and exploitation.
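The action selection rule can be sketched as follows. This is a minimal illustration: the patent names a `select_action` method but does not give its signature, so the arguments here are assumptions, and the linear decay in `decayed_epsilon` (with its `eps_start`, `eps_end`, and `decay_steps` values) is a hypothetical schedule.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random exploratory action
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linear decay from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Early in training `decayed_epsilon` is close to 1 and most actions are random; once it reaches `eps_end`, the action with the largest Q-value is chosen most of the time.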

[0065] After each action is executed, the selected action, the reward obtained, and the next state are stored in the experience replay cache. When the cache is full, new experiences overwrite old ones. Through the experience replay mechanism, small batches of data are randomly drawn from the cache for training, reducing the correlation between samples and making the model's learning more stable.
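The experience replay cache can be sketched with a fixed-size buffer. This is an illustrative sketch rather than the patent's implementation: the `ReplayBuffer` class name and the decision to store the current state alongside the action, reward, and next state are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity cache; the oldest experience is overwritten when full."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        """Store one transition after an action has been executed."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch to reduce correlation between samples."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

A `deque` with `maxlen` discards the oldest entry automatically on `append`, which matches the described behaviour of new experiences overwriting old ones.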

[0066] In the DQN update method, a small batch of experience samples is first drawn from the experience replay cache. For each sample, the target Q-value is calculated according to the Bellman equation:

[0067] $$q_{target} = r + \gamma \max_{a'} Q_{target}(s', a'; \theta^-)$$

[0068] Where $\theta^-$ are the parameters of the target network;

[0069] The mean square error between the estimated Q value q_eval of the current Q-network and the target Q value q_target is calculated, and this error is minimized through backpropagation. This process makes the estimated value of the Q-network gradually approach the true Q value.

[0070] To improve stability, the parameters of the target network are not updated at every step, but are periodically synchronized with the parameters of the current Q-network at certain intervals. This reduces the fluctuation of Q-values during the update process, making the model more stable.
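The update and synchronization logic can be sketched with a small tabular stand-in. This is not the deep network of the invention: to stay within the standard library, `eval_net` and `target_net` are plain dictionaries mapping a state to a list of Q-values, and the learning rate `lr`, `GAMMA`, and `SYNC_EVERY` are hypothetical values.

```python
import copy

GAMMA = 0.9       # discount factor (assumed value)
SYNC_EVERY = 4    # target-network synchronization interval (assumed value)


def q_target(reward, next_q_values, gamma=GAMMA):
    """Target Q-value: r + gamma * max over a' of Q_target(s', a')."""
    return reward + gamma * max(next_q_values)


def mse_loss(q_eval, q_targets):
    """Mean square error between estimated and target Q-values."""
    return sum((e - t) ** 2 for e, t in zip(q_eval, q_targets)) / len(q_eval)


def update(eval_net, target_net, batch, lr=0.5, gamma=GAMMA):
    """One update over a mini-batch of (s, a, r, s_next); returns the loss."""
    evals, targets = [], []
    for s, a, r, s_next in batch:
        evals.append(eval_net[s][a])
        targets.append(q_target(r, target_net[s_next], gamma))
    loss = mse_loss(evals, targets)
    for (s, a, _, _), t in zip(batch, targets):
        # Tabular stand-in for a gradient step: move the estimate toward the target.
        eval_net[s][a] += lr * (t - eval_net[s][a])
    return loss


def train(eval_net, batches, sync_every=SYNC_EVERY):
    """Run updates, copying eval_net into target_net every sync_every steps."""
    target_net = copy.deepcopy(eval_net)  # both networks start identical
    for step_index, batch in enumerate(batches, start=1):
        update(eval_net, target_net, batch)
        if step_index % sync_every == 0:
            target_net = copy.deepcopy(eval_net)  # periodic synchronization
    return target_net
```

Between synchronizations the targets are computed from a frozen copy, so the regression target does not move while `eval_net` is being adjusted.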

[0071] Furthermore, in step S3, the reward function is dynamically adjusted based on the differences between the controlled vessel and the pilot vessel's states (such as position deviation, speed deviation, and path deviation angle). The closer the reward value is to zero, the better the path following effect.

[0072] The `calculate_reward` function requires defining a reward function based on state differences to measure the effectiveness of the current action. Actions with larger rewards are considered better control policies. The DQN network continuously optimizes the reward value and adjusts the control policy, enabling the agent to gradually learn how to track paths and minimize errors.

[0073] The reward function evaluates the effectiveness of the control strategy by calculating the difference between the state of the controlled vessel and the state of the pilot vessel, and assesses the overall control effect within a round.

[0074]

[0075] The reward function includes:

[0076] Total round reward: The total reward output at the end of each round is the sum of the rewards of all time steps in that round; the higher the total reward (that is, the closer the negative value is to zero), the smaller the difference between the state of the controlled vessel and the state of the pilot vessel in that round, and the better the control effect;

[0077] Instant reward: Instant feedback calculated at each time step to evaluate the quality of the current action; calculated based on the difference between the current state and the target state;

[0078] Moving average reward: The moving average reward is obtained by calculating the average of the total rewards of a certain number of recent rounds; it reflects the average performance of the model over a recent period.

[0079] $$\text{moving average reward} = \frac{1}{N} \sum_{i=1}^{N} \text{total reward}_i$$

[0080] Where N is the size of the sliding window and $\text{total reward}_i$ is the total reward for the i-th round.
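The three reward quantities can be sketched together. This is a minimal sketch with stated assumptions: the exact reward formula is not reproduced in this text, so `instant_reward` uses a hypothetical negative weighted sum of absolute deviations (weights `w` are placeholders), chosen only because it is zero for perfect following and negative otherwise.

```python
from collections import deque


def instant_reward(pos_err, speed_err, angle_err, w=(1.0, 0.5, 0.5)):
    """Hypothetical instant reward: negative weighted sum of deviations."""
    return -(w[0] * abs(pos_err) + w[1] * abs(speed_err) + w[2] * abs(angle_err))


class RewardTracker:
    """Tracks total round rewards and their moving average over N rounds."""

    def __init__(self, window):
        self.totals = deque(maxlen=window)  # keeps only the last N totals

    def end_round(self, step_rewards):
        """Total round reward: sum of the instant rewards of all time steps."""
        total = sum(step_rewards)
        self.totals.append(total)
        return total

    def moving_average(self):
        """Average of the most recent total round rewards (0.0 if none yet)."""
        if not self.totals:
            return 0.0
        return sum(self.totals) / len(self.totals)
```

A moving average that rises toward zero over successive rounds indicates that the controlled vessel is tracking the pilot vessel more closely.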

[0081] Furthermore, in step S4, the backstepping control achieves feedback compensation by calculating errors and recursively designing a controller. Under unstable communication conditions, it generates control inputs based on path deviations and real-time feedback to correct the state of the controlled vessel; specifically, it includes:

[0082] Error z1 is calculated based on environmental and reference conditions:

[0083] z1 = env - env_r

[0084] This leads to the intermediate variables ω and

[0085] ω=k1·z1

[0086]

[0087] Recalculate the error z2:

[0088] z2 = env - env_r - ω

[0089] Generate control input τ:

[0090] τ = -k2·z2 + dynamic model term;

[0091] The backstepping control strategy calculates the error and uses the system's dynamic model to generate the control input τ so that the state of the controlled vessel (env) approaches the state of the reference vessel (env_r).
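The error and control-input calculation can be written out for a single scalar state. This is a sketch that follows the formulas exactly as they are stated in the text; the function name is hypothetical, the gains `k1` and `k2` are free parameters, and the "dynamic model term" is not specified here, so it is passed in as `dynamic_term` with a default of zero.

```python
def backstepping_control(env, env_r, k1, k2, dynamic_term=0.0):
    """Compute the backstepping control input for one scalar state.

    env: state of the controlled vessel, env_r: state of the reference vessel.
    Returns (tau, z1, z2).
    """
    z1 = env - env_r                   # first error
    omega = k1 * z1                    # intermediate variable
    z2 = env - env_r - omega           # recalculated error
    tau = -k2 * z2 + dynamic_term      # control input
    return tau, z1, z2
```

For a vector state such as position and heading, the same function can be applied to each component in turn.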

[0092] Hybrid strategy: By recursively designing the controller, the control objective is gradually approached; under unstable communication conditions, control inputs are generated based on path deviation and real-time feedback to correct the state of the controlled vessel.

[0093] In the early stages of training, DQN suffers from slow convergence and high path-following error due to the long exploration time for random actions. Furthermore, DQN is limited by the discrete action space; high update frequencies easily cause Q-value oscillations, while low frequencies reduce control accuracy, and it struggles to converge effectively when state transition uncertainties are high.

[0094] Backstepping control (BC) is a recursive controller design method that decomposes the control objective to progressively approach the final goal. By introducing BC, the DQN output is combined with the BC control result. BC smooths the DQN action selection, improving system stability and shortening the exploration time, especially effective in the early stages of training. BC provides strong dynamic stability, quickly responds to tracking errors, and corrects path deviations; DQN, on the other hand, captures the long-term trend of path changes.

[0095] The combination of DQN and BC forms a balance of "fast response - slow learning": DQN optimizes the path in the long term, BC corrects the action in the short term, and the target synchronization mechanism avoids the instability of DQN, ultimately achieving faster convergence and excellent control performance in complex paths.

[0096] The state prediction method uses the most recent valid state information of the pilot vessel, including position, velocity, and acceleration, combined with its dynamic model (such as a uniform linear motion or uniform acceleration model), to predict state changes over a short period of time; the prediction formula is as follows:

[0097] $$s_{predicted} = s_{last\,valid} + \dot{s} \cdot \Delta t$$

[0098] Where $s_{predicted}$ is the predicted state of the pilot vessel, $s_{last\,valid}$ is the most recently received valid state, $\dot{s}$ is the speed (or estimated rate of change of state) of the pilot vessel, and Δt is the estimated time difference (delay time).
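The prediction step can be sketched as a short extrapolation from the last valid state. This is an illustrative sketch: the function and argument names are hypothetical, and the optional acceleration term is the standard uniform-acceleration extension, included as an assumption because that model is mentioned only by name.

```python
def predict_state(s_last_valid, s_rate, delay, s_accel=None):
    """Extrapolate the pilot vessel state over the communication delay.

    s_last_valid: last valid state components (e.g. x, y, heading).
    s_rate: rate of change of each component.
    delay: estimated time since the last valid packet, in seconds.
    s_accel: optional accelerations for a uniform-acceleration model.
    """
    predicted = []
    for i, s in enumerate(s_last_valid):
        value = s + s_rate[i] * delay              # uniform linear motion
        if s_accel is not None:
            value += 0.5 * s_accel[i] * delay ** 2  # uniform acceleration term
        predicted.append(value)
    return predicted
```

When a packet is lost or late, the controlled vessel can feed `predict_state(...)` to its controller in place of the missing measurement until a valid state arrives.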

[0099] Furthermore, the method for dynamically adjusting the weights of DQN and backstepping control in step S5 is as follows: weight allocation ratio: tau = alpha * tau_dqn + (1-alpha) * tau_bc;

[0100] Experimental analysis was conducted to determine the weight allocation between DQN and backstepping control (e.g., alpha = 0.5). Backstepping control was dominant in the early stages of training, and DQN was gradually adopted in the later stages to improve control performance and convergence speed.
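The weighted combination can be sketched as follows. The `blend` function implements the stated formula directly; the linear `alpha_schedule`, with its `alpha_start` and `alpha_end` values, is a hypothetical way of making backstepping control dominant early and DQN dominant later, since the actual adjustment rule is determined experimentally.

```python
def alpha_schedule(episode, total_episodes, alpha_start=0.1, alpha_end=0.9):
    """Assumed linear schedule: the DQN weight grows as training progresses."""
    frac = min(max(episode / total_episodes, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def blend(tau_dqn, tau_bc, alpha):
    """tau = alpha * tau_dqn + (1 - alpha) * tau_bc, component by component."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(tau_dqn, tau_bc)]
```

With `alpha = 0.5`, as in the example value above, both controllers contribute equally to the final control input.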

[0101] This invention addresses the issues of packet loss and delay in communication under electromagnetic interference by proposing a hybrid control method based on deep reinforcement learning (DQN) and backstepping control (BC). Specifically, DQN optimizes path following under stable communication conditions. Backstepping control provides real-time compensation under delay and packet loss conditions. State prediction techniques are used to improve anti-interference performance, and the weights between DQN and BC are adjusted to achieve adaptive control, ensuring system stability under high-level interference environments. The use of backstepping control in this invention enhances the training efficiency of Q-learning.

[0102] Beneficial effects: Compared with the prior art, the present invention has the following advantages:

[0103] (1) Improved robustness and control accuracy: This invention combines deep reinforcement learning (DQN) and backstepping control to effectively address communication delay and data packet loss in complex marine environments, significantly improving the robustness and control accuracy of unmanned surface vessels' path following.

[0104] (2) Adaptive control in dynamic environments: DQN is used to optimize the path following strategy, the reward function is dynamically adjusted to adapt to different disturbance conditions, and backstepping control provides real-time feedback compensation to ensure that the controlled vessel operates smoothly in a variable environment.

[0105] (3) Combining the “fast response-slow learning” strategy: DQN is responsible for capturing the long-term trend of path changes, while backstepping control provides short-term real-time correction. The dynamic balance between the two is achieved through weight allocation, which shortens the exploration time in the early stage of training and improves the control accuracy in the later stage.

[0106] (4) Applicable to multi-unmanned vessel collaborative operations: Through the design of a communication system based on the MAVLink protocol, real-time status transmission and path planning between multiple unmanned vessels are supported, enabling the invention to be applied in multi-unmanned vessel collaborative operations, especially maintaining high performance under high interference conditions. The state prediction method adopted ensures that the control strategy can enable the controlled vessel to select the optimal action for navigation and following even under high interference conditions.

Attached Figure Description

[0107] Figure 1 is a diagram illustrating the pilot-follower method.

[0108] Figure 2 is a flowchart illustrating the operation of a communication system based on MAVLink.

[0109] Figure 3 is a schematic diagram of a ship dynamics model.

[0110] Figure 4 is a visual representation of the rendering window.

[0111] Figure 5 is a comparison chart of reward value curves under different strategies.

[0112] Figure 6 is a comparison chart of reward value curves and loss value curves under different weights.

[0113] Figure 7 is a comparison chart of reward value curves under different levels of interference.

Detailed Implementation

[0114] The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading this invention, any modifications of the invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.

[0115] This invention provides a path following method for unmanned surface vessels based on DQN and backstepping control, employing a pilot-following method. As shown in Figure 1, the controlled vessel follows the trajectory of the pilot vessel by continuously adjusting its state (position, speed, and heading angle). The pilot vessel, as the guide, provides a reference for the controlled vessel with its preset path, while the controlled vessel obtains the real-time state information of the pilot vessel through the communication system and makes adjustments accordingly.

[0116] Based on the above-mentioned pilot-following method, this invention provides an unmanned surface vessel path following method based on DQN and backstepping control, comprising the following steps:

[0117] S1: Design a communication system based on the MAVLink protocol to realize real-time transmission of status information between the pilot vessel and the controlled vessel;

[0118] The communication system uses the MAVLink protocol to realize the transmission of status data between the pilot vessel and the controlled vessel via WIFI, including position information and attitude information (roll angle, pitch angle, yaw angle), and the communication robustness is evaluated by introducing an electromagnetic interference model through simulation.

[0119] As shown in Figure 2, the operation of the communication system includes the following steps:

[0120] A1: Pilot vessel transmits information: Collects and processes the status information data of the pilot vessel, uses the MAVLink protocol to generate a payload data part containing key control information, and the pilot vessel, as the transmitting module, periodically transmits data packets containing position and attitude information (roll angle, pitch angle, yaw angle) to ensure the accuracy and validity of the data and adapt to complex marine environments;

[0121] A2: WIFI Protocol Simulation and Transmission: MAVLink data is transmitted via the WIFI protocol; the simulation considers channel attenuation, noise interference and multipath effects, and uses a dual-path model to reflect signal reflection and refraction, evaluates packet loss rate and delay, and simulates the characteristics of real wireless communication.

[0122] A3: Introduction and Control of Electromagnetic Interference: Design an adjustable electromagnetic interference model, adjust interference parameters, and test the robustness of communication under different noise levels; use WIFI protocol interference simulation to simulate packet loss rate and delay, and verify the stability of data transmission and decoding in complex environments.

[0123] The introduction of electromagnetic interference not only affects data transmission, but also makes the reception and decoding of MAVLink data packets more challenging by simulating interference with the WIFI protocol.

[0124] A4: Controlled Vessel Receives Information: The controlled vessel extracts the pilot vessel's position and attitude data through the MAVLink unpacking process. Abnormal data (such as communication delays or packet loss) is recorded and fed back to the motion control algorithm to analyze the impact of interference on motion control. It should be noted that abnormal data is reflected in the training process and is part of the algorithm's training process.
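The recording of abnormal data on the receiving side can be sketched as a check on sequence numbers and timestamps. This is a minimal sketch under assumptions not stated in the text: packets are taken to carry a sequence number and a send time, `max_delay` is a hypothetical threshold, and the record format is illustrative.

```python
def classify_packets(packets, max_delay):
    """Scan received packets and record anomalies for the motion controller.

    packets: list of (seq, sent_time, recv_time) tuples in arrival order.
    Returns a list of anomaly records (packet loss and excessive delay).
    """
    anomalies = []
    expected = None  # next sequence number we expect to see
    for seq, sent, recv in packets:
        if expected is not None and seq > expected:
            # A gap in sequence numbers means packets were lost.
            anomalies.append(
                {"type": "packet_loss", "missing": list(range(expected, seq))}
            )
        if recv - sent > max_delay:
            anomalies.append({"type": "delay", "seq": seq, "delay": recv - sent})
        expected = seq + 1
    return anomalies
```

The resulting records are the kind of information that can be fed back to the motion control algorithm and included in training, as noted above.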

[0125] S2: In the constructed model training environment (e.g., the training environment shown in Figure 1, with a fixed reference trajectory for the pilot vessel), dynamic models of the pilot vessel and the controlled vessel are constructed, and a rendering window is used to observe the vessel's following performance in real time during the model training process.

[0126] The dynamics model is divided into the pilot vessel dynamics model and the controlled vessel dynamics model. The pilot vessel dynamics model includes:

[0127] The dynamic equations of a pilot vessel are determined by the combined effects of inertia, Coriolis force, centrifugal force, and damping force, and are in the following form:

[0128] $$M\dot{v} + C(v)v + D(v)v = \tau$$

[0129] Where M is the inertia matrix (including the added mass effect), C(v) is the Coriolis and centrifugal force matrix, D(v) is the damping matrix, and τ is the control input (force and torque).

[0130] The expression for the inertia matrix M is:

[0131]

[0132] Where m is the mass of the ship; terms such as $X_{\dot{u}}$ represent the hydrodynamic added mass coefficients; $x_g$ is the position of the ship's center of gravity in the hull coordinate system; and $I_z$ is the moment of inertia of the ship about the z-axis;

[0133] The expressions for the Coriolis force and centrifugal force matrix C(v) are:

[0134]

[0135] Where u, v, and r represent the ship's longitudinal speed, lateral speed, and yaw rate, respectively;

[0136] The damping matrix D(v) includes linear and nonlinear damping terms, and its expression is:

[0137]

[0138] Linear terms (such as $X_u$) represent the linear damping coefficients, describing the linear relationship between speed and drag; nonlinear terms (such as $X_{uu}$) represent the nonlinear damping coefficients, describing the effect of the square or cube of velocity on resistance; |u|, |v|, and |r| represent the absolute values of the ship's longitudinal velocity, lateral velocity, and yaw rate, respectively, reflecting that nonlinear resistance depends on the direction of motion;

[0139] The rotation matrix R(ψ) is used to transform the velocity in the ship's coordinate system to the global coordinate system, and its expression is:

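[0140] $R(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$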

[0141] The thrust and torque acting on the unmanned surface vessel by the control input τ are in the following forms:

[0142]

[0143] The state update equation consists of two parts: position update and velocity update;

[0144] Position update (global coordinate system):

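[0145] $p_{new} = p_{old} + \Delta t \cdot R(\psi) \cdot v$

[0146] Where $p = [x, y, \psi]^T$ is the position vector and $v = [u, v, r]^T$ is the velocity vector;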

[0147] Velocity update (hull coordinate system):

[0148] $v_{new} = v_{old} + \Delta t \cdot M^{-1}\left(\tau - C(v)v - D(v)v\right)$

[0149] This dynamic model describes the motion behavior of an unmanned vessel in three degrees of freedom, including inertial effects, Coriolis and centrifugal effects, damping effects, and the influence of external control inputs on the motion state.
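One Euler step of such a 3-DoF model might look like the sketch below. It rests on several assumptions that are not in the patent: the inertia matrix is taken as diagonal (so the Coriolis term has a simple closed form), damping is linear plus quadratic per axis, and all coefficient values and the names `rotate_to_global` and `step` are hypothetical.

```python
import math

def rotate_to_global(psi, vel):
    """Apply R(psi) to a hull-frame velocity [u, v, r]."""
    u, v, r = vel
    c, s = math.cos(psi), math.sin(psi)
    return [c * u - s * v, s * u + c * v, r]

def step(p, vel, tau, dt,
         m=(25.0, 30.0, 3.0),       # assumed diagonal inertia terms
         d_lin=(2.0, 7.0, 0.5),     # assumed linear damping coefficients
         d_quad=(1.0, 3.0, 0.1)):   # assumed quadratic damping coefficients
    """Advance position p = [x, y, psi] and velocity vel = [u, v, r] by dt."""
    u, v, r = vel
    m11, m22, _ = m
    # Coriolis/centripetal term C(v)v for a diagonal inertia matrix.
    cv = [-m22 * v * r, m11 * u * r, (m22 - m11) * u * v]
    # Damping term D(v)v: linear part plus a part scaling with |velocity|.
    dv = [(dl + dq * abs(x)) * x for dl, dq, x in zip(d_lin, d_quad, vel)]
    # Acceleration from M * dv/dt = tau - C(v)v - D(v)v.
    acc = [(t - c - d) / mi for t, c, d, mi in zip(tau, cv, dv, m)]
    new_vel = [x + dt * a for x, a in zip(vel, acc)]
    # Position update in the global frame uses the current velocity.
    new_p = [a + dt * b for a, b in zip(p, rotate_to_global(p[2], vel))]
    return new_p, new_vel

p, vel = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(50):
    p, vel = step(p, vel, tau=[10.0, 0.0, 0.5], dt=0.1)
print(p, vel)
```

A full implementation would use the complete inertia, Coriolis, and damping matrices rather than this decoupled simplification.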

[0150] The dynamics model of the controlled vessel employs a relatively simplified approach when implementing the reference trajectory simulation environment. It primarily focuses on the unmanned vessel's position updates and does not delve into complex dynamic equations. The main components of this model involve attitude rotation and displacement updates. The following are the main formulas and their explanations:

[0151] The dynamic model of a controlled vessel includes:

[0152] Position updates are obtained by integrating the velocity, using the Euler method for numerical integration:

[0153] $p_{new} = p_{old} + \Delta t \cdot R(\psi) \cdot v$

[0154] Where $p = [x, y, \psi]^T$ is the position vector, $v = [u, v, r]^T$ is the velocity vector, and Δt is the time step;

[0155] The attitude rotation matrix R(ψ) is used to map the velocity [u,v,r] to the global coordinate system, thereby updating the position [x,y,ψ] of the unmanned vessel;

[0156] Velocity update formula:

[0157]

[0158] The velocity is determined from the change in displacement within the global coordinate system and is then mapped back to the ship's coordinate system.
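The simplified controlled-vessel update can be sketched as a pair of functions: an Euler position step, and a velocity estimate recovered from the displacement. The inverse mapping through the transpose of R(ψ) is an assumption consistent with the description, not a formula quoted from the patent, and the function names are hypothetical.

```python
import math

def euler_position_update(p_old, vel, dt):
    """p_new = p_old + dt * R(psi) * v, with p = (x, y, psi), v = (u, v, r)."""
    x, y, psi = p_old
    u, v, r = vel
    c, s = math.cos(psi), math.sin(psi)
    return (x + dt * (c * u - s * v), y + dt * (s * u + c * v), psi + dt * r)

def body_velocity_from_displacement(p_old, p_new, dt):
    """Estimate hull-frame velocity from the global displacement over dt."""
    psi = p_old[2]
    dx, dy, dpsi = ((b - a) / dt for a, b in zip(p_old, p_new))
    c, s = math.cos(psi), math.sin(psi)
    # The transpose of R(psi) maps global rates back to the hull frame.
    return (c * dx + s * dy, -s * dx + c * dy, dpsi)

p_old = (1.0, 2.0, 0.3)
p_new = euler_position_update(p_old, (1.5, -0.2, 0.1), dt=0.1)
print(body_velocity_from_displacement(p_old, p_new, dt=0.1))
```

The sketch does not wrap the heading angle, which a real environment would need when ψ crosses ±π.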

[0159] The ship dynamics model established in this embodiment is shown in Figure 3. The dynamic model of the USV is the foundation for designing the control strategy. To simplify calculations and reduce the complexity of the control system, both the pilot and controlled vessels are described using a three-degree-of-freedom (3-DoF) dynamic model. Surge is linear motion along the X-axis with velocity u; sway is linear motion along the Y-axis with velocity v; and yaw is rotational motion about the Z-axis with angular velocity r. This model includes the longitudinal position (x), lateral position (y), and heading angle (ψ) motion states.

[0160] The rendering window, based on the `gym.envs.classic_control.rendering` module, sets the display area, with a blue background to simulate the colors of the ocean environment. Different colors (e.g., red for the pilot vessel, blue for the controlled vessel) are used to distinguish the vessels. The Transform object updates the vessel's position information in real time, allowing the vessel to move dynamically within the rendering window. In the early stages of training, because the model has not yet converged, the controlled vessel may deviate from its trajectory. The rendering window allows observation of this deviation and real-time adjustments to model training parameters or reward function design. As training progresses, the rendering window gradually displays the process of the controlled vessel approaching the pilot vessel's trajectory.

[0161] In this embodiment, the rendering window is shown in Figure 4. The learning process is visualized using the rendering window, allowing for intuitive observation of the state differences between the pilot and controlled vessels, as well as the execution of the control strategy and reinforcement learning. In the initial rounds, during the exploration phase, the states of the controlled and pilot vessels differ significantly, resulting in path following deviations. As the number of training rounds increases, the state of the controlled vessel gradually approaches that of the pilot vessel, achieving path following.

[0162] S3: Utilize the constructed deep reinforcement learning (DQN) model to optimize the path following strategy through the reward function;

[0163] The DQN model is trained using an experience replay and an epsilon-greedy strategy, including an evaluation network and a target network, where the parameters of the target network are periodically synchronized to improve training stability.

[0164] The goal of the DQN model is to find an optimal policy that maximizes the state-action value function Q(s, a); the optimal state-action value function satisfies the Bellman equation:

[0165] $Q^*(s,a) = E_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \,\right]$

[0166] Where s is the current state, a is the current action, s' is the next state (determined by the state transition probability), and a' is the action in the next state; $Q^*(s,a)$ is the optimal state-action value function, representing the maximum expected cumulative reward that the agent can obtain after taking action a in state s; E[·] is the mathematical expectation, taken over the uncertainty of the next state s'; r is the immediate reward, obtained immediately after performing action a; γ is the discount factor, ranging over [0,1], used to weigh the importance of current and future rewards; $\max_{a'} Q^*(s',a')$ is the maximum Q value that can be obtained by choosing action a' in the next state;

[0167] In conventional Q-learning, values are stored using a lookup table. However, storing the state-action value function in a Q-table becomes impractical when the state and action spaces are large or continuous. DQN uses a deep neural network to approximate the Q-value, enabling it to learn efficiently in high-dimensional state spaces. The goal of the DQN model is to minimize the following loss function, allowing the Q-network to gradually approach the optimal state-action value function:

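[0168] $L(\theta) = E\left[\left(r + \gamma \max_{a'} Q_{target}(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$

[0169] Where θ are the parameters of the current Q-network and $\theta^-$ are the parameters of the target network; the parameters of the target network $Q_{target}$ are kept constant for a period of time to improve training stability.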

[0170] The DQN model is instantiated using the DQN class and contains two deep neural networks: an eval_net (the current Q-network used to estimate the Q-value of the current state) and a target_net (the target Q-network used to estimate the target Q-value of the next state). The initial parameters of these two networks are the same, and as training progresses, the parameters of the target network are synchronized from the Q-network at regular intervals.

[0171] At each time step, an epsilon-greedy strategy will be used to select actions. During the exploration phase (initially with a larger epsilon), the agent will select actions more randomly to explore more possibilities; as training progresses, the epsilon value gradually decreases, and the unmanned surface vessel will be more inclined to select the optimal action evaluated by the current Q-network.

[0172] Specifically, in the `select_action` method, DQN randomly selects an action based on the value of `epsilon` or selects an action based on the maximum Q-value of the Q-network. This strategy allows the unmanned surface vessel to balance exploration and exploitation.
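A minimal sketch of epsilon-greedy selection with a decaying epsilon follows. The function names mirror the `select_action` method mentioned above, but the signature, the linear decay schedule, and its constants are assumptions for illustration only.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Assumed linear decay from start to end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

q_values = [0.1, 0.9, 0.3]  # hypothetical Q-network output for one state
for step in (0, 500, 1000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), select_action(q_values, eps))
```

With epsilon at zero the function always returns the greedy action, which is how a trained policy would typically be evaluated.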

[0173] After each action is executed, the selected action, the reward obtained, and the next state are stored in the experience replay cache. When the cache is full, new experiences overwrite old ones. Through the experience replay mechanism, small batches of data are randomly drawn from the cache for training, reducing the correlation between samples and making the model's learning more stable.
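The overwrite-when-full behaviour of the replay cache can be sketched as a ring buffer. The class name `ReplayBuffer` and its methods are hypothetical; the stored tuple includes the current state as well, which is the usual layout and is assumed here.

```python
import random

class ReplayBuffer:
    """Fixed-capacity experience store; new entries overwrite the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.write_index = 0

    def push(self, state, action, reward, next_state):
        item = (state, action, reward, next_state)
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            self.data[self.write_index] = item  # overwrite the oldest entry
        self.write_index = (self.write_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch without replacement."""
        return rng.sample(self.data, batch_size)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=-float(t), next_state=t + 1)
print(sorted(item[0] for item in buffer.data))
print(buffer.sample(2))
```

Sampling at random rather than in arrival order is what breaks the correlation between consecutive transitions.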

[0174] In the DQN update method, a small batch of experience samples is first drawn from the experience replay cache. For each sample, the target Q-value is calculated according to the Bellman equation:

[0175] $q_{target} = r + \gamma \max_{a'} Q_{target}(s',a';\theta^-)$

[0176] Where $\theta^-$ are the parameters of the target network;

[0177] The mean square error between the estimated Q value q_eval of the current Q-network and the target Q value q_target is calculated, and this error is minimized through backpropagation. This process makes the estimated value of the Q-network gradually approach the true Q value.
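The target computation and the mean square error can be illustrated without a neural network by treating the network outputs as plain lists. The function names and the sample numbers are hypothetical, and terminal-state handling is omitted because the text does not describe it.

```python
def td_targets(rewards, next_q_values, gamma):
    """q_target = r + gamma * max over a' of the target network's Q(s', a')."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]

def mse_loss(q_eval, q_target):
    """Mean square error between estimated and target Q-values."""
    return sum((e - t) ** 2 for e, t in zip(q_eval, q_target)) / len(q_eval)

# Hypothetical mini-batch of two transitions.
rewards = [-1.0, -0.5]
next_q = [[-2.0, -1.0], [-0.5, -3.0]]  # assumed target_net outputs for s'
q_eval = [-1.5, -1.0]                  # assumed eval_net outputs for (s, a)

q_target = td_targets(rewards, next_q, gamma=0.9)
print(q_target, mse_loss(q_eval, q_target))
```

In a real training loop the loss would be backpropagated through eval_net only, while the target values are treated as constants.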

[0178] To improve stability, the parameters of the target network are not updated at every step, but are periodically synchronized with the parameters of the current Q-network at certain intervals. This reduces the fluctuation of Q-values during the update process, making the model more stable.

[0179] Reward function: It is dynamically adjusted based on the differences between the states of the controlled vessel and the pilot vessel (such as position deviation, speed deviation, and path deviation angle). The closer the reward value is to zero, the better the path following effect.

[0180] The `calculate_reward` function defines the reward based on state differences to measure the effectiveness of the current action. Actions with larger rewards are considered better control actions. The DQN network continuously improves the reward value by adjusting the control policy, enabling the agent to gradually learn how to track paths and minimize errors.

[0181] The reward function evaluates the effectiveness of the control strategy by calculating the difference between the state of the controlled vessel and the state of the pilot vessel, and assesses the overall control effect within a round.

[0182]

[0183] The reward function includes:

[0184] Total round reward: The total reward output at the end of each round is the sum of the rewards of all time steps in that round; the higher the total reward value (that is, the closer the negative value is to zero), the smaller the difference between the state of the controlled vessel and the state of the pilot vessel in that round, and the better the control effect;

[0185] Instant reward: Instant feedback calculated at each time step to evaluate the quality of the current action; calculated based on the difference between the current state and the target state;

[0186] Moving average reward: The moving average reward is obtained by calculating the average of the total rewards of a certain number of recent rounds; it reflects the average performance of the model over a recent period.

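[0187] $\text{moving average reward} = \frac{1}{N} \sum_{i=1}^{N} \text{total reward}_i$

[0188] Where N is the size of the sliding window and $\text{total reward}_i$ is the total reward of the i-th round within the window.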
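The moving average over the most recent rounds can be sketched with a bounded queue. The class name is hypothetical, and one assumption goes beyond the text: before N rounds have been completed, the average is taken over the rounds available so far.

```python
from collections import deque

class MovingAverageReward:
    """Average of the total rewards of the most recent `window` rounds."""

    def __init__(self, window):
        self.totals = deque(maxlen=window)  # old rounds drop out automatically

    def update(self, episode_total):
        self.totals.append(episode_total)
        return sum(self.totals) / len(self.totals)

tracker = MovingAverageReward(window=3)
for total in (-9.0, -6.0, -3.0, 0.0):  # hypothetical round totals
    print(tracker.update(total))
```

A rising curve that flattens near zero would indicate that the controlled vessel's state is converging to the pilot vessel's state.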

[0189] S4: Under unstable communication conditions, backstepping control is used for real-time feedback compensation to improve control accuracy;

[0190] Backstepping control achieves feedback compensation by calculating errors and recursively designing a controller. Under unstable communication conditions, it generates control inputs based on path deviations and real-time feedback to correct the state of the controlled vessel; specifically including:

[0191] The error z1 is calculated from the controlled vessel state env and the reference state env_r:

[0192] z1 = env - env_r

[0193] This leads to the intermediate variables ω and

[0194] ω=k1·z1

[0195]

[0196] Recalculate the error z2:

[0197] z2 = env - env_r - ω

[0198] Generate control input τ:

[0199] τ = -k2·z2 + dynamic model term;

[0200] The backstepping control strategy calculates the error and uses the system's dynamic model to generate the control input τ so that the state of the controlled vessel (env) approaches the state of the reference vessel (env_r).
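For a single scalar state, the error and control-input calculation can be sketched as follows. The function name, the gain values, and the treatment of the dynamic model term as a plain argument are assumptions; the patent does not specify that term here.

```python
def backstepping_input(env, env_r, k1, k2, model_term=0.0):
    """Scalar sketch of the two-stage error and control input calculation."""
    z1 = env - env_r          # tracking error
    omega = k1 * z1           # intermediate variable
    z2 = env - env_r - omega  # second error
    return -k2 * z2 + model_term

# Hypothetical values: vessel state 2.0, reference 1.5.
print(backstepping_input(env=2.0, env_r=1.5, k1=0.5, k2=2.0))
```

With these definitions z2 equals (1 - k1)·z1, so for 0 < k1 < 1 and k2 > 0 the resulting input opposes the tracking error when the model term is zero.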

[0201] Hybrid strategy: By recursively designing the controller, the control objective is gradually approached; under unstable communication conditions, control inputs are generated based on path deviation and real-time feedback to correct the state of the controlled vessel.

[0202] In the early stages of training, DQN suffers from slow convergence and high path-following error due to the long exploration time for random actions. Furthermore, DQN is limited by the discrete action space; high update frequencies easily cause Q-value oscillations, while low frequencies reduce control accuracy, and it struggles to converge effectively when state transition uncertainties are high.

[0203] Backstepping control (BC) is a recursive controller design method that decomposes the control objective to progressively approach the final goal. By introducing BC, the DQN output is combined with the BC control result. BC smooths the DQN action selection, improving system stability and shortening the exploration time, especially effective in the early stages of training. BC provides strong dynamic stability, quickly responds to tracking errors, and corrects path deviations; DQN, on the other hand, captures the long-term trend of path changes.

[0204] The combination of DQN and BC forms a balance of "fast response - slow learning": DQN optimizes the path in the long term, BC corrects the action in the short term, and the target synchronization mechanism avoids the instability of DQN, ultimately achieving faster convergence and excellent control performance in complex paths.

[0205] In the event of severe communication interference, packet loss, or delays, the state prediction method is used to enable the controlled vessel to select the next following action based on state prediction.

[0206] The state prediction method utilizes the latest available state information of the pilot vessel, including position, velocity, and acceleration, combined with its dynamic model (such as a uniform linear motion or uniform acceleration model), to predict state changes over a short period of time; the prediction formula is as follows:

[0207] $s_{predicted} = s_{last\_valid} + \dot{s} \cdot \Delta t$

[0208] Where $s_{predicted}$ is the predicted state of the pilot vessel, $s_{last\_valid}$ is the most recently received valid state, $\dot{s}$ is the speed (or estimated rate of change of state) of the pilot vessel, and Δt is the estimated time difference (delay time).
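A constant-rate version of this prediction can be sketched in a few lines. The function name and the sample values are hypothetical, and only the uniform linear motion case is shown; a uniform acceleration model would add a second-order term.

```python
def predict_state(last_valid_state, rate, dt):
    """Extrapolate each state component at a constant rate over the delay dt."""
    return tuple(s + v * dt for s, v in zip(last_valid_state, rate))

# Hypothetical last valid pilot state (x, y, psi) and its rates of change.
last_valid = (10.0, 5.0, 0.0)
rate = (2.0, 0.0, 0.1)
print(predict_state(last_valid, rate, dt=0.5))
```

The controlled vessel would use the predicted state in place of the missing measurement until a valid packet arrives again.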

[0209] S5: By dynamically adjusting the weights of DQN and backstepping control, path following in complex marine environments is achieved;

[0210] Weight allocation of the control inputs:

[0211] tau=alpha*tau_dqn+(1-alpha)*tau_bc

[0212] The weight allocation between DQN and backstepping control was determined through experimental analysis (e.g., alpha = 0.5). Backstepping control was dominant in the early stages of training, and DQN was gradually adopted in the later stages to improve control performance and convergence speed.
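The weighted combination, together with one possible way of shifting weight toward DQN over training, can be sketched as below. The blending formula is the one given above; the linear ramp in `alpha_schedule`, its name, and its constants are purely hypothetical, since the text does not state how the weight is adjusted.

```python
def blend(tau_dqn, tau_bc, alpha):
    """tau = alpha * tau_dqn + (1 - alpha) * tau_bc."""
    return alpha * tau_dqn + (1.0 - alpha) * tau_bc

def alpha_schedule(episode, ramp_episodes=200, alpha_final=0.5):
    """Assumed linear ramp: backstepping dominates early, DQN gains weight later."""
    return alpha_final * min(episode / ramp_episodes, 1.0)

for episode in (0, 100, 200, 400):
    alpha = alpha_schedule(episode)
    print(episode, alpha, blend(tau_dqn=2.0, tau_bc=4.0, alpha=alpha))
```

Capping the ramp at 0.5 reflects the reported best-performing weight; other caps would need their own experiments.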

[0213] Experimental environment: The simulation environment includes a 1000×1000 two-dimensional static map to simulate different interference conditions (such as packet loss rate and delay) and verify the robustness and adaptability of the method.

[0214] To verify the effectiveness and effect of the method of the present invention, this embodiment compares the method of the present invention with existing control methods through examples, as follows:

[0215] Comparison of the training performance of the hybrid control strategy and the DQN control strategy (see Figure 5): in Figure 5, the left panel shows a communication system without additional interference, while the right panel incorporates uncertain electromagnetic interference factors, such as a random delay of 10-20 seconds and a packet loss rate of approximately 20%, to simulate a complex environment. The left panel shows that in the interference-free environment, the hybrid strategy (blue curve) has a higher average reward value, less fluctuation, and a more stable control process. The DQN control strategy (red curve) exhibits greater reward value fluctuation and requires a longer time to reach a similar reward level. The hybrid strategy enables the unmanned vessel to achieve high-precision following early on, reducing training time. During training, the hybrid strategy's reward gradually approaches zero, the controlled vessel's state approaches the pilot vessel's state, and the path following error is smaller.

[0216] A comparison of the input weights for the actions selected by DQN and the backstepping control calculations is shown in Figure 6. The experiment shows that the training effect is optimal when the weight alpha = 0.5. In the early stages of training, DQN requires extensive exploration; introducing 50% BC control at this point effectively reduces fluctuations, smooths the reward curve, and accelerates convergence. Compared to DQN-dominated control with alpha = 0.7, the weight balance with alpha = 0.5 significantly reduces the uncertainty of DQN output in the early stages. In later training, DQN has learned effective policies; by fusing with BC, the action output becomes more stable, the risk of deviating from the target state is reduced, and the fluctuations in reward and loss are decreased, enhancing the model's convergence and stability. Low weights (alpha < 0.5) weaken DQN's learning ability, leading to a lack of flexibility in complex environments, while high weights (alpha = 0.7) cause DQN to become overly dominant, exacerbating control fluctuations and making it less robust in disturbed environments.

[0217] Comparison of communication and control coordination performance under different interference intensities (see Figure 7): Figure 7 illustrates the changes in the reward function under high interference (red curve) and low interference (blue curve) conditions. Under high interference conditions, the reward curve initially exhibits significant oscillations but gradually stabilizes as training progresses, approaching the reward level under low interference conditions and achieving a higher average reward value. This demonstrates the strong robustness of the control strategy in interference environments; even with high data loss rates and significant latency, the system can effectively adjust control parameters and maintain path-following stability.

Claims

1. An unmanned surface vessel path following method based on DQN and backstepping control, characterized in that it includes the following steps: S1: Design a communication system based on the MAVLink protocol to realize real-time transmission of state information between the pilot vessel and the controlled vessel; S2: In the constructed model training environment, build dynamic models of the pilot vessel and the controlled vessel, and use a rendering window to observe the vessel's following performance in real time during the model training process; S3: Utilize the constructed DQN model to optimize the path following strategy through the reward function; S4: Under unstable communication conditions, backstepping control is used for real-time feedback compensation; when packet loss or delay occurs in communication, the state prediction method is used to enable the controlled vessel to select the next following action based on the state prediction; S5: By dynamically adjusting the weights of DQN and backstepping control, path following in complex marine environments is achieved; In step S4, backstepping control achieves feedback compensation by calculating errors and recursively designing a controller. Under unstable communication conditions, control inputs are generated based on path deviation and real-time feedback to correct the state of the controlled vessel. Specifically, it includes: calculating the error z1 from the controlled vessel state env and the reference state env_r: z1 = env - env_r; obtaining the intermediate variables, including ω: ω = k1·z1; recalculating the error z2: z2 = env - env_r - ω; generating the control input τ: τ = -k2·z2 + dynamic model term; The state prediction method utilizes the latest available state information of the pilot vessel, including position, velocity, and acceleration, combined with its dynamic model, to predict state changes over a short period of time; the prediction formula is as follows: $s_{predicted} = s_{last\_valid} + \dot{s} \cdot \Delta t$; wherein $s_{predicted}$ is the predicted state of the pilot vessel, $s_{last\_valid}$ is the last received valid state, $\dot{s}$ is the speed of the pilot vessel, and Δt is the estimated time difference; In step S5, the weight allocation ratio between DQN and backstepping control is dynamically adjusted: tau = alpha*tau_dqn + (1-alpha)*tau_bc.

2. The path following method for unmanned surface vessels based on DQN and backstepping control according to claim 1, wherein, in step S1, the communication system uses the MAVLink protocol to realize the state data transmission between the pilot vessel and the controlled vessel via WIFI, including position information and attitude information, and the communication robustness is evaluated by introducing an electromagnetic interference model through simulation.

3. The path following method for unmanned surface vessels based on DQN and backstepping control according to claim 2, wherein the operation of the communication system in step S1 includes: A1: Pilot vessel sends information: the state information of the pilot vessel is collected and processed, and the MAVLink protocol is used to generate a payload data portion containing key control information; A2: WIFI protocol simulation and transmission: MAVLink data is transmitted via the WIFI protocol; the simulation considers channel attenuation, noise interference and multipath effects, uses a dual-path model to reflect signal reflection and refraction, evaluates packet loss rate and delay, and simulates the characteristics of real wireless communication; A3: Introduction and control of electromagnetic interference: an adjustable electromagnetic interference model is designed, interference parameters are adjusted, and interference is simulated using the WIFI protocol; A4: Controlled vessel receives information: the controlled vessel extracts the position information and attitude data of the pilot vessel through the MAVLink unpacking process. Abnormal data is recorded and fed back to the motion control algorithm to analyze the impact of interference on motion control.

4. The path following method for unmanned surface vessels based on DQN and backstepping control according to claim 1, wherein, in step S2, the rendering window is based on the gym.envs.classic_control.rendering module, which sets the display area, sets the background to blue to simulate the color of the ocean environment, and uses different colors to distinguish the vessels. The vessel's position information is updated in real time through the Transform object, so that the vessel moves dynamically in the rendering window.

5. The unmanned surface vessel path following method based on DQN and backstepping control according to claim 1, characterized in that, in step S3, the DQN model is trained using experience replay and an epsilon-greedy strategy, and includes an evaluation network and a target network, wherein the parameters of the target network are periodically synchronized to improve training stability. The goal of the DQN model is to find an optimal policy that maximizes the state-action value function Q(s, a); the optimal state-action value function satisfies the Bellman equation: $Q^*(s,a) = E_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \,\right]$; wherein s is the current state, a is the current action, s' is the next state, and a' is the action in the next state; $Q^*(s,a)$ is the optimal state-action value function, representing the maximum expected cumulative reward that the agent can obtain after taking action a in state s; E[·] is the mathematical expectation, taken over the uncertainty of the next state s'; r is the immediate reward, obtained immediately after performing action a; γ is the discount factor, ranging over [0,1], used to weigh the importance of current rewards and future rewards; $\max_{a'} Q^*(s',a')$ is the maximum Q value that can be obtained by choosing action a' in the next state; The goal of the DQN model is to minimize the following loss function, so that the Q-network gradually approaches the optimal state-action value function: $L(\theta) = E\left[\left(r + \gamma \max_{a'} Q_{target}(s',a';\theta^-) - Q(s,a;\theta)\right)^2\right]$; wherein θ are the parameters of the current Q-network and $\theta^-$ are the parameters of the target network; the parameters of the target network $Q_{target}$ are kept constant for a period of time to improve training stability.

6. The unmanned surface vessel path following method based on DQN and backstepping control according to claim 5, characterized in that, in step S3, the DQN model is instantiated using the DQN class, which contains two deep neural networks: one is the current Q-network eval_net, used to estimate the Q value of the current state; the other is the target Q-network target_net, used to estimate the target Q value of the next state. In the DQN update method, a small batch of experience samples is first drawn from the experience replay cache. For each sample, the target Q-value is calculated according to the Bellman equation: $q_{target} = r + \gamma \max_{a'} Q_{target}(s',a';\theta^-)$; wherein $\theta^-$ are the parameters of the target network; The mean square error between the estimated Q value q_eval of the current Q-network and the target Q value q_target is calculated, and this error is minimized through backpropagation. This process makes the estimated value of the Q-network gradually approach the true Q value.