Unmanned ship multi-ship collision avoidance decision method and system based on reinforcement learning
The method combines RVO- and VO-based collision risk assessment with reinforcement learning and trains a BiGRU network model to output the ship's rudder angle decision. This addresses the insufficient collision avoidance capability of traditional algorithms in complex marine environments and makes autonomous collision avoidance effective and practical for engineering use.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- DALIAN MARITIME UNIVERSITY
- Filing Date
- 2023-08-16
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional deterministic model-based ship obstacle avoidance algorithms lack generalization ability in complex marine environments, resulting in high ship collision risk and difficulty in achieving autonomous collision avoidance.
A reinforcement learning-based approach is adopted, combining the RVO and VO algorithms for collision risk assessment. A BiGRU network model is trained to output the ship's rudder angle decision, and collision avoidance decisions are made based on historical data and lidar information in compliance with the international collision avoidance rules.
The method effectively avoids ship collisions in complex marine environments, reduces economic losses, and improves the engineering practicality of autonomous collision avoidance.
Figure CN116954232B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of autonomous obstacle avoidance technology for ships, and in particular to a multi-ship collision avoidance decision-making method and system for unmanned ships based on reinforcement learning.
Background Technology
[0002] According to official maritime accident reports, ship collisions are the most common type of maritime accident, potentially leading to serious casualties, large-scale property damage, and environmental pollution. Developing a method that enables ships to autonomously avoid collisions in various navigation scenarios has significant practical importance and application value.
[0003] Traditional obstacle avoidance algorithms based on deterministic models include Dijkstra's algorithm, the A* algorithm, the D* algorithm, the RRT algorithm, the artificial potential field method, the vector field histogram method, the velocity obstacle method, the dynamic window method, and the Bug algorithm. However, with the increasing complexity of modern marine environments, these algorithms, which rely heavily on mathematical models, lack generalization and learning capabilities. Furthermore, some marine environments are difficult to model comprehensively, leading to significant uncertainties in practical applications.
Summary of the Invention
[0004] The purpose of this invention is to provide a multi-ship collision avoidance decision-making method and system based on reinforcement learning for unmanned vessels, which can solve behavioral decision-making problems in different complex environments, prevent huge economic losses caused by ship collisions, and has high engineering applicability.
[0005] To achieve the above objectives, the present invention provides the following solution:
[0006] A multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning includes:
[0007] Establish a mathematical model of ship motion, and determine the current state vector of the ship based on the ship kinematic model;
[0008] A navigation decision-making method is constructed for multi-ship encounter situations; the navigation decision-making method is used to determine the current encounter situation of the ships.
[0009] Collision risk assessment is performed based on RVO and VO algorithms to determine collision risk areas;
[0010] The BiGRU network model is trained using historical data, which includes historical vessel state vectors, historical collision risk assessment vectors, historical lidar lines, historical estimated collision times, and historical vessel rudder angles. The BiGRU network model incorporates the navigation decision-making method and reward function.
[0011] The current ship's own state vector, collision risk assessment vector, ship's expected collision time, and the lidar lines used for surrounding environment state perception are input into the trained BiGRU network model to obtain the current ship's rudder angle; the collision risk assessment vector is used to represent the collision risk area.
[0012] Optionally, the expression for the ship motion model is as follows:
[0013] $$\dot{v} = a_{11} v + a_{12} r + b_{11}\delta, \qquad \dot{r} = a_{21} v + a_{22} r + b_{22}\delta$$
[0014] where $\dot{v}$ is the ship's sway (lateral drift) acceleration, $\dot{r}$ is the yaw (bow turning) angular acceleration, $v$ is the sway velocity and $r$ the yaw angular velocity of the ship's two degrees of freedom, $\delta$ is the rudder angle input, and $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $b_{11}$, $b_{22}$ are determined by the ship's basic parameters.
[0015] Optionally, the navigation decision determination method is as follows:
[0016] Head-on situation: When two vessels in sight of each other meet on opposite courses and there is a risk of collision, the situation is judged to be a head-on encounter. Both the target vessel and the own vessel are give-way vessels, and each should alter course to starboard and pass on the port side of the other to avoid a collision.
[0017] Crossing situation: When the courses of two vessels cross and there is a risk of collision, there are two cases. In the first, the relative bearing between the own vessel and the other vessel lies in [247.5°, 355°]; the own vessel is the stand-on vessel and should keep its course and speed, while the other vessel should give way. In the second, the relative bearing lies in [5°, 112.5°]; the own vessel is the give-way vessel and should alter course to starboard and pass astern of the other vessel to avoid a collision.
[0018] Overtaking situation: The speed of the own vessel is greater than that of the target vessel, the bearing is 112.5° to 247.5°, the own vessel is astern of the target vessel, and the own vessel is the give-way vessel. The own vessel should pass on the port or starboard side of the target vessel to avoid a collision.
[0019] Optionally, the formula for calculating the collision risk zone is as follows:
[0020] $$VO_{OS|TS}(v_{TS}) = \{\, v_{OS} \mid \lambda(p_{OS},\, v_{OS} - v_{TS}) \cap (TS \oplus -OS) \neq \emptyset \,\}$$
[0021] $$RVO_{OS|TS}(v_{TS}, v_{OS}) = \{\, v'_{OS} \mid 2v'_{OS} - v_{OS} \in VO_{OS|TS}(v_{TS}) \,\}$$
[0022] where $VO_{OS|TS}(v_{TS})$ is the velocity obstacle region of the own ship OS induced by the target ship TS, $v_{TS}$ is the velocity of the target ship, $v_{OS}$ is the velocity of the own ship, $\lambda$ is the collision coefficient, with $\lambda(p_{OS}, v_{OS} - v_{TS})$ denoting the ray that starts at $p_{OS}$ in the direction $v_{OS} - v_{TS}$, $p_{OS}$ is the current position of the own ship, TS denotes the target ship, OS denotes the own ship, and $\oplus$ is the Minkowski sum; $RVO_{OS|TS}(v_{TS}, v_{OS})$ is the velocity obstacle region calculated using the RVO algorithm, and $v'_{OS}$ is the new velocity selected by the own ship OS with respect to the reciprocal velocity obstacle.
[0023] Optionally, the reward function includes: a target point guidance reward function, a collision and arrival reward function, an optimized steering function, an optimal route function, a rule-based reward function, and a ship safety range reward function.
[0024] Optionally, the calculation formula for the target point-guided reward function is as follows:
[0025]
[0026] where $(G^i_\alpha)_t$ is the target point guidance reward function of the i-th ship at time t, expressed in terms of the x- and y-coordinates of the position of the i-th ship at time t-1 and the x- and y-coordinates of the target point of the i-th ship, and $\tau_g$ is the coefficient of the target point guidance reward function;
[0027] The calculation formulas for the collision and arrival reward functions are as follows:
[0028]
[0029] where $(G^i_\beta)_t$ is the collision and arrival reward function of the i-th ship at time t, $(p^i)_t$ is the coordinates of the i-th ship at time t, $g^i$ is the coordinates of the target point of the i-th ship, $P_{obs}$ is the coordinates of the obstacle, $V_t$ is the velocity of the ship at time t, $VO_{OS}$ is the velocity obstacle region of the own ship identified by the velocity obstacle risk assessment, and $r_{collision}$ and $r_{arrival}$ are constants;
[0030] The formula for calculating the optimized steering function is as follows:
[0031] $(G^i_\omega)_t = \tau_\omega\,|(\psi^i)_t|$
[0032] where $(G^i_\omega)_t$ is the optimized steering function of the i-th ship at time t, $(\psi^i)_t$ is the rudder angular velocity of the i-th ship at time t, and $\tau_\omega$ is the coefficient of the optimized steering function;
[0033] The formula for calculating the optimal route function is as follows:
[0034] $(G^i_\gamma)_t = -\gamma_t$
[0035] where $(G^i_\gamma)_t$ is the optimal route function of the i-th ship at time t, and $\gamma_t$ is the coefficient of the optimal route function at time t;
[0036] The formula for calculating the rule reward function is as follows:
[0037]
[0038] where $(G^i_\delta)_t$ is the rule reward function of the i-th ship at time t, weighted by the rule reward coefficient, and COLREGs denotes the International Regulations for Preventing Collisions at Sea.
[0039] The formula for calculating the ship safety range reward function is as follows:
[0040] $(G^i_\varepsilon)_t = -\mu$ if $D_{TS} < R_{radar}$
[0041] where $(G^i_\varepsilon)_t$ is the safety range reward function of the i-th ship at time t, $D_{TS}$ is the distance between the target ship or obstacle and the own ship, $R_{radar}$ is the safety range set by the ship's radar lines, and $\mu$ is the penalty coefficient.
[0042] This invention also provides a multi-ship collision avoidance decision-making system for unmanned vessels based on reinforcement learning, including:
[0043] The model building module is used to build a mathematical model of ship motion and determine the current state vector of the ship based on the ship kinematic model.
[0044] The navigation decision-making method construction module is used to construct a navigation decision-making method for multi-ship encounter situations; the navigation decision-making method is used to determine the encounter situation that the ship is in at the current moment.
[0045] The collision risk area determination module is used to assess collision risk based on the RVO and VO algorithms and determine the collision risk area.
[0046] The model training module is used to train a BiGRU network model using historical data; the historical data includes historical ship state vectors, historical collision risk assessment vectors, and historical lidar lines; the BiGRU network model is configured with the navigation decision-making method and reward function.
[0047] The rudder angle determination module is used to input the current ship's own state vector, collision risk assessment vector, ship's expected collision time, and the lidar lines used for surrounding environment state perception into the trained BiGRU network model to obtain the current ship's rudder angle; the collision risk assessment vector is used to represent the collision risk area.
[0048] Optionally, the navigation decision determination method is as follows:
[0049] Head-on situation: When two vessels in sight of each other meet on opposite courses and there is a risk of collision, the situation is judged to be a head-on encounter. Both the target vessel and the own vessel are give-way vessels, and each should alter course to starboard and pass on the port side of the other to avoid a collision.
[0050] Crossing situation: When the courses of two vessels cross and there is a risk of collision, there are two cases. In the first, the relative bearing between the own vessel and the other vessel lies in [247.5°, 355°]; the own vessel is the stand-on vessel and should keep its course and speed, while the other vessel should give way. In the second, the relative bearing lies in [5°, 112.5°]; the own vessel is the give-way vessel and should alter course to starboard and pass astern of the other vessel to avoid a collision.
[0051] Overtaking situation: The speed of the own vessel is greater than that of the target vessel, the bearing is 112.5° to 247.5°, the own vessel is astern of the target vessel, and the own vessel is the give-way vessel. The own vessel should pass on the port or starboard side of the target vessel to avoid a collision.
[0052] Optionally, the reward function includes: a target point guidance reward function, a collision and arrival reward function, an optimized steering function, an optimal route function, a rule-based reward function, and a ship safety range reward function.
[0053] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects:
[0054] This invention assesses collision risk based on the RVO and VO algorithms, identifies collision risk zones, and trains a BiGRU network model using vectors representing these zones to output the current rudder angle of the vessel. The trained BiGRU network model can make collision avoidance decisions for multiple unmanned vessels in complex encounter situations, solving behavioral decision-making problems in various complex environments. The method provided by this invention can prevent the huge economic losses caused by ship collisions and has high engineering practicality.
Attached Figure Description
[0055] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0056] Figure 1 is a flowchart of the multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning;
[0057] Figure 2 is a flowchart of the overall process of the multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning;
[0058] Figure 3 is a schematic diagram of collision risk assessment based on the RVO and VO algorithms;
[0059] Figure 4 is a schematic diagram of the interaction between the learning algorithm and the environment;
[0060] Figure 5 is a flowchart of the proximal policy optimization method;
[0061] Figure 6 is a schematic diagram of the constructed neural network.
Detailed Implementation
[0062] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] The purpose of this invention is to provide a navigation decision-making method and system for encounters between multiple unmanned surface vessels, so as to realize obstacle avoidance decision-making for multiple unmanned surface vessels in complex waters.
[0064] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0065] Example 1
[0066] As shown in Figures 1-2, the multi-ship collision avoidance decision-making method based on reinforcement learning for unmanned vessels provided in this embodiment specifically includes the following steps:
[0067] S1: Establish a mathematical model of ship motion, and determine the current state vector of the ship based on the ship kinematic model.
[0068] S2: Construct a navigation decision-making method for multi-ship encounter situations; the navigation decision-making method is used to determine the current encounter situation of the ships.
[0069] S3: Collision risk assessment is performed based on the RVO and VO algorithms to determine the collision risk area.
[0070] S4: Train a BiGRU network model using historical data; the historical data includes the historical ship's own state vector, historical collision risk assessment vector, historical lidar lines, historical ship's predicted collision time, and historical ship's rudder angle; the BiGRU network model is equipped with the navigation decision-making method and reward function.
[0071] S5: Input the current ship's own state vector, collision risk assessment vector, ship's expected collision time, and the lidar lines used for surrounding environment state perception into the trained BiGRU network model to obtain the current ship's rudder angle; the collision risk assessment vector is used to represent the collision risk area.
[0072] Furthermore, step S1 specifically includes:
[0073] Mathematical modeling describes the dynamic characteristics of an object's actual motion in mathematical language, typically as differential equations. Considering only two degrees of freedom of the ship, the sway (lateral drift) velocity $v$ and the yaw (bow turning) angular velocity $r$, the linear equations of the ship's maneuvering motion are as follows:
[0074] $$\dot{v} = a_{11} v + a_{12} r + b_{11}\delta, \qquad \dot{r} = a_{21} v + a_{22} r + b_{22}\delta$$ (1)
[0075] In the formula, $\dot{v}$ is the ship's sway acceleration, $\dot{r}$ is the yaw angular acceleration, $\delta$ is the rudder angle input, and $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $b_{11}$, $b_{22}$ are determined by the ship's basic parameters.
[0076] Equation (1) can be transformed into a single input-output equation as follows:
[0077] $$T_1 T_2\,\ddot{r} + (T_1 + T_2)\,\dot{r} + r = K\delta + K T_3\,\dot{\delta}$$ (2)
[0078] In equation (2), $T_1$, $T_2$, $T_3$, and $K$ are all maneuverability indices. Taking the Laplace transform of this formula yields:
[0079] $$G(s) = \frac{r(s)}{\delta(s)} = \frac{K(1 + T_3 s)}{(1 + T_1 s)(1 + T_2 s)}$$ (3)
[0080] In equation (3), $s$ is the Laplace variable. The second-order transfer function is then simplified to a first-order one. Because a ship is a large-inertia object, its dynamic characteristics matter most at low frequencies. Let $s = j\omega \to 0$ and expand equation (3) as a power series in $s$; expand the first-order inertial transfer function $K/(1 + Ts)$ in the same way. Matching the first-order terms gives $T = T_1 + T_2 - T_3$, from which the following formulas are obtained:
[0081] $$G(s) = \frac{K}{1 + Ts}$$ (4)
[0082] $$T\dot{r} + r = K\delta$$ (5)
[0083] This is the first-order motion model of ship maneuvering, known as the Nomoto model. It is simpler than the second-order motion model while capturing the essence of the response characteristics.
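The first-order response can be illustrated numerically. The following is a minimal sketch, assuming forward Euler integration of $T\dot{r} + r = K\delta$; the function name `simulate_nomoto` and the index values K = 0.2 and T = 10 are hypothetical and are not taken from the patent.

```python
import math


def simulate_nomoto(K, T, rudder, dt, steps, r0=0.0, psi0=0.0):
    """Integrate T * dr/dt + r = K * delta with forward Euler.

    rudder is a function of time returning the rudder angle delta (rad).
    Returns the yaw-rate history (rad/s) and the heading history (rad).
    """
    r, psi = r0, psi0
    rates, headings = [r], [psi]
    for k in range(steps):
        delta = rudder(k * dt)
        r_dot = (K * delta - r) / T   # first-order Nomoto dynamics
        psi += r * dt                 # heading is the integral of the yaw rate
        r += r_dot * dt
        rates.append(r)
        headings.append(psi)
    return rates, headings


# Hypothetical indices and a constant 10 degree rudder held for 200 s.
delta_const = math.radians(10.0)
rates, headings = simulate_nomoto(K=0.2, T=10.0, rudder=lambda t: delta_const,
                                  dt=0.1, steps=2000)
```

With a constant rudder angle the yaw rate rises monotonically and settles at $K\delta$, with time constant $T$, which is the behavior the first-order model is meant to capture.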
[0084] Furthermore, step S2 specifically includes:
[0085] Based on the collision avoidance rules in Chapter 2, Rules 13-17 of the COLREGs (International Regulations for Preventing Collisions at Sea), this invention classifies vessels as stand-on vessels and give-way vessels, and summarizes the following situations:
[0086] Head-on situation: When two vessels in sight of each other meet on opposite or nearly opposite courses and there is a risk of collision, the situation is treated as a head-on encounter. The bearing angle between the target vessel (TS) and the own vessel (OS) on opposite courses is defined as (0°, 5°) or (355°, 360°). In this situation, both vessels are give-way vessels and should alter course to starboard, each passing on the other's port side to avoid a collision.
[0087] Crossing situation: When two power-driven vessels are crossing and there is a risk of collision, there are two cases. In the first, the relative bearing between the own vessel (OS) and the other vessel lies in [247.5°, 355°]; the own vessel is the stand-on vessel and should keep its course and speed, and the other vessel should give way. In the second, the relative bearing lies in [5°, 112.5°]; the own vessel is the give-way vessel and should alter course to starboard, passing astern of the other vessel to avoid a collision.
[0088] Overtaking situation: In this situation, the speed of the overtaking vessel (OS) is usually greater than that of the vessel ahead (target vessel, TS), with a bearing of 112.5° to 247.5°. The own vessel is astern of the target vessel and is the give-way vessel. It may pass on the target vessel's port or starboard side to avoid a collision.
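The bearing sectors above can be turned into a small lookup. The following is a minimal sketch, assuming a single bearing angle in degrees is used for all three situations, as the text does; the function name `classify_encounter`, the handling of the sector boundaries, and the "unclassified" fallback are assumptions for illustration and are not specified in the patent.

```python
def classify_encounter(bearing_deg, os_faster=False):
    """Map a bearing (degrees) to (situation, role of the own vessel).

    Sector limits follow the text: head-on (0, 5) or (355, 360),
    crossing [5, 112.5] and [247.5, 355], overtaking 112.5 to 247.5.
    Boundary handling is an assumption: closed crossing sectors win ties,
    and a bearing of exactly 0 is treated as head-on.
    """
    b = bearing_deg % 360.0
    if b < 5.0 or b > 355.0:
        return "head-on", "give-way"        # both vessels alter course to starboard
    if 5.0 <= b <= 112.5:
        return "crossing", "give-way"       # alter to starboard, pass astern
    if 247.5 <= b <= 355.0:
        return "crossing", "stand-on"       # keep course and speed
    if os_faster:
        return "overtaking", "give-way"     # pass on either side of the target
    return "unclassified", None             # stern sector, own vessel not faster


print(classify_encounter(60.0))
print(classify_encounter(180.0, os_faster=True))
```

Returning the role together with the situation keeps the rule reward and the action constraints downstream from having to repeat the sector logic.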
[0089] Furthermore, step S3 specifically includes:
[0090] To better describe ship collision risk, this invention uses Reciprocal Velocity Obstacles (RVO) and Velocity Obstacles (VO) to assess the collision risk between ships. The RVO algorithm treats the velocities of the ship and of surrounding objects as points in velocity space, then uses geometric relationships to calculate the range of velocities at which the ship can move, as well as the range of velocities at which it will collide with other objects. The VO algorithm restricts the ship's velocity to a range that avoids collisions with other objects. In this invention, the velocity obstacle region is represented by a vector, which serves as the collision risk region and is input into the neural network as part of the observation state. For the own ship OS and the target ship TS, with radii $R_{OS}$ and $R_{TS}$, velocities $v_{OS}$ and $v_{TS}$, and positions $p_{OS}$ and $p_{TS}$, the velocity obstacle region of OS induced by TS is given by the following formula:
[0091] $$VO_{OS|TS}(v_{TS}) = \{\, v_{OS} \mid \lambda(p_{OS},\, v_{OS} - v_{TS}) \cap (TS \oplus -OS) \neq \emptyset \,\}$$ (6)
[0092] In the above formula, $VO_{OS|TS}(v_{TS})$ is the velocity obstacle region of the own ship OS induced by the target ship TS, which is also the collision risk region of the ship. $RVO_{OS|TS}(v_{TS}, v_{OS})$ is the corresponding collision risk region under the RVO algorithm, and $v'_{OS}$ is the new velocity chosen by OS with respect to the reciprocal velocity obstacle. $\lambda$ is the collision coefficient, with $\lambda(p_{OS}, v_{OS} - v_{TS})$ denoting the ray that starts at $p_{OS}$ in the direction $v_{OS} - v_{TS}$, and $\oplus$ is the Minkowski sum. As shown in Figure 3(a), the velocity obstacle region can be represented by a vertex coordinate χ3 and direction vectors χ1 and χ2. This region is the set of velocities that may lead to a future collision. To avoid a collision, the velocity that OS selects during collision risk assessment should lie outside the velocity obstacle region. A velocity obstacle region can be defined in the same way for static obstacles, as shown in Figure 3(c). However, for the edges of narrow waterways, which are long static obstacles, the VO and RVO algorithms are not very effective. This invention therefore uses the vectors generated by the RVO and VO algorithms together with the lidar lines as a mixed input. This handles collision avoidance decisions that the RVO and VO algorithms alone cannot solve, and is introduced in the state space section of this invention.
[0093] In the collision risk assessment process of the RVO algorithm, the velocity range of the vessel is calculated by predicting the motion of other vessels. Because it predicts the reactions of other vessels, the RVO algorithm is better suited to multi-vessel collision avoidance and adapts better to complex environments. Instead of choosing, for each vessel, a new velocity outside the other agent's velocity obstacle, the RVO algorithm chooses a new velocity that is the average of the vessel's current velocity and a velocity outside the other agent's velocity obstacle.
[0094] $$RVO_{OS|TS}(v_{TS}, v_{OS}) = \{\, v'_{OS} \mid 2v'_{OS} - v_{OS} \in VO_{OS|TS}(v_{TS}) \,\}$$ (7)
[0095] Geometrically, this can be interpreted as the velocity obstacle $VO_{OS|TS}(v_{TS})$ translated so that its vertex lies at χ3 = $(v_{OS} + v_{TS})/2$, as shown in Figure 3(b), which depicts the RVO velocity region.
[0096] In summary, a risk assessment vector can be used to represent the velocity obstacle regions with collision risk generated by the VO and RVO algorithms. This invention uses the vector α = [χ1, χ2, χ3] for this purpose. Here, χ1 and χ2 are the directions of the left and right rays originating from χ3, and χ3 is the vertex coordinate of the collision risk region. During collision avoidance, the ship must not select a velocity inside the VO and RVO regions.
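The risk assessment vector can be computed from positions, velocities, and radii. The following is a minimal sketch, assuming circular ships in the plane and an untruncated (infinite-horizon) cone; the function names `risk_vector` and `in_region` are hypothetical. The vertex is $v_{TS}$ for VO and $(v_{OS} + v_{TS})/2$ for RVO, as stated above.

```python
import math


def risk_vector(p_os, v_os, p_ts, v_ts, r_os, r_ts, reciprocal=False):
    """Return alpha = (chi1, chi2, chi3): left ray, right ray, cone vertex."""
    dx, dy = p_ts[0] - p_os[0], p_ts[1] - p_os[1]
    dist = math.hypot(dx, dy)
    radius = r_os + r_ts                    # Minkowski sum of two discs
    if dist <= radius:
        raise ValueError("ships already overlap; the cone is undefined")
    centre = math.atan2(dy, dx)             # direction from OS to TS
    half = math.asin(radius / dist)         # half-angle of the tangent cone
    chi1 = (math.cos(centre + half), math.sin(centre + half))   # left ray
    chi2 = (math.cos(centre - half), math.sin(centre - half))   # right ray
    if reciprocal:
        chi3 = ((v_os[0] + v_ts[0]) / 2.0, (v_os[1] + v_ts[1]) / 2.0)  # RVO vertex
    else:
        chi3 = (v_ts[0], v_ts[1])                                       # VO vertex
    return chi1, chi2, chi3


def in_region(v, alpha):
    """True if velocity v lies strictly inside the cone described by alpha."""
    chi1, chi2, chi3 = alpha
    wx, wy = v[0] - chi3[0], v[1] - chi3[1]
    right_of_left_ray = chi1[0] * wy - chi1[1] * wx < 0.0
    left_of_right_ray = chi2[0] * wy - chi2[1] * wx > 0.0
    return right_of_left_ray and left_of_right_ray


# Head-on example: OS sails east, TS sails west, 10 units apart.
vo = risk_vector((0, 0), (1, 0), (10, 0), (-1, 0), 1.0, 1.0)
rvo = risk_vector((0, 0), (1, 0), (10, 0), (-1, 0), 1.0, 1.0, reciprocal=True)
starboard = (math.cos(math.radians(-30)), math.sin(math.radians(-30)))
print(in_region((1, 0), vo), in_region(starboard, rvo))
```

In this example the current velocity of OS lies inside both regions, while a 30 degree turn to starboard at the same speed lies outside both, which matches the head-on rule of altering course to starboard.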
[0097] Furthermore, step S4 specifically includes:
[0098] (1) Design an improved proximal policy optimization method
[0099] Figure 4 illustrates the interaction between the learning algorithm and the environment. As shown in Figure 4, the mathematical model of ship motion links the successive motion states of the ship, and the problem is modeled as a Markov decision process. In state $s_0$, the ship follows the policy $\pi_\theta(a|s)$ and takes the action $a_0$ that maximizes future reward, which brings the ship to state $s_1$; $a_0$ is chosen from the set of actions available in state $s_0$, and the reward for this step is $r_1$. To maximize the total reward, after reaching state $s_1$ the reinforcement learning algorithm updates the parameters θ of the policy $\pi_\theta(a|s)$ based on the reward value. The process then continues until the final state $s_n$ is reached.
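[0100] The loss function of the proximal policy optimization method is:
[0101] $$L^{PPO}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$ (8)
[0102] where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the ratio of the new policy to the old policy, $\hat{A}_t$ is the estimated advantage, and clip(a,b,c) := min(max(a,b),c) restricts a to the range [b,c]. In the above formula, ε is a hyperparameter that sets the clipping range.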
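The clipped term can be checked on a small batch. The following is a minimal sketch in plain Python, assuming the widely used PPO clipped surrogate with clip(a, b, c) restricting a to [b, c] as described above; the function names and the sample numbers are hypothetical, and a real implementation would compute this on tensors inside an automatic differentiation framework.

```python
import math


def clip(a, b, c):
    """Restrict a to the interval [b, c]."""
    return min(max(a, b), c)


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)    # ratio of new to old policy probability
        total += min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    return total / len(advantages)


# Two samples: ratio 1.5 with advantage +1, ratio 0.5 with advantage -1.
value = ppo_clip_objective([math.log(1.5), math.log(0.5)], [0.0, 0.0], [1.0, -1.0])
print(value)
```

In both samples the clipped branch is selected (1.2 instead of 1.5, and -0.8 instead of -0.5), so a single update cannot benefit from moving the policy ratio outside [1 - ε, 1 + ε].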
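[0103] This invention uses generalized advantage estimation to estimate the advantage function. First, let $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ denote the temporal-difference error, where $V$ is a learned state-value function. Then, following the idea of multi-step temporal differences, we have:
[0104] $$A_t^{(k)} = \sum_{l=0}^{k-1} \gamma^{l}\,\delta_{t+l}, \qquad k = 1, 2, \ldots$$ (9)
[0105] Then, generalized advantage estimation (GAE) takes an exponentially weighted average of these advantage estimates over the different step counts:
[0106] $$A_t^{GAE(\gamma,\lambda)} = (1-\lambda)\left(A_t^{(1)} + \lambda A_t^{(2)} + \lambda^{2} A_t^{(3)} + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}$$ (10)
[0107] Here, λ∈[0,1] is an additional hyperparameter introduced in GAE. When λ=0, only the advantage obtained from a one-step difference is considered, $A_t = \delta_t$. When λ=1, the temporal-difference errors of every subsequent step are accumulated, $A_t = \sum_{l} \gamma^{l}\delta_{t+l}$.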
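The exponentially weighted sum is usually evaluated backwards over a finite trajectory. The following is a minimal sketch, assuming the standard recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$ with the sum truncated at the end of the trajectory; the function name `gae` and the sample values are hypothetical.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards[t] is the reward received after acting in state t.
    values has len(rewards) + 1 entries; the last one is the bootstrap value.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
one_step = gae(rewards, values, gamma=0.5, lam=0.0)
multi_step = gae(rewards, values, gamma=0.5, lam=1.0)
print(one_step, multi_step)
```

With λ = 0 each advantage equals its own temporal-difference error, while with λ = 1 the later errors are discounted by γ and added in, which shows how λ trades bias against variance.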
[0108] As shown in Figure 5, a sample consisting of the five elements $(s_t, a_t, r_{t+1}, s_{t+1}, \pi_{\theta_{old}}(a_t|s_t))$ is first drawn from the experience replay pool. In the action network, the policy $\pi_\theta(a_t|s_t)$ is computed with the new action network, and its ratio $r_t(\theta)$ to the old policy $\pi_{\theta_{old}}(a_t|s_t)$ is then calculated. The temporal-difference error is $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, where $V$ is a learned state-value function. GAE(γ,λ) is calculated from these errors according to the generalized advantage estimation formula; the loss value $L^{PPO}(\theta)$ is then calculated according to formula (8) to update the action network parameters θ. In the value network, data are drawn from the experience replay pool, the value network is used to compute $V(s_t)$ and $V(s_{t+1})$, and the parameters of the value network are updated using a multi-step temporal-difference method.
[0109] In addition to the proximal policy optimization method, a BiGRU network is used to process the algorithm's input data, in place of the fully connected network used in traditional methods. This can effectively improve the generalization ability and convergence speed of the algorithm.
[0110] Gradients in traditional fully connected networks and recurrent neural networks are prone to vanishing or exploding. While gradient clipping can address exploding gradients, it cannot solve the problem of vanishing gradients. Therefore, this invention uses a BiGRU network. BiGRU is an extension of the GRU model. Its basic idea is to add a backward layer to the GRU model so that the forward and backward information of the input sequence is considered simultaneously. Specifically, in BiGRU, the input sequence passes through a forward GRU layer and a backward GRU layer, and the outputs of the two layers are concatenated to obtain a more informative representation. This provides the output layer with complete past and future context for each time point in the input sequence. Combining past, present, and future information improves the model's performance. In this invention, the action policy uses BiGRU modules to discover more of the potential relationships between surrounding ships and obstacles and to predict future situations given limited information.
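The forward and backward passes and the concatenation can be shown in a few lines. The following is a minimal sketch in dependency-free Python, assuming the gate update $h' = (1 - z)h + z\tilde{h}$ (one of two common conventions) and random untrained weights; the names `GRUCell` and `bigru` are hypothetical, and in practice the GRU layer of a deep learning library would be used instead.

```python
import math
import random


def _matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


class GRUCell:
    def __init__(self, n_in, n_hidden, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        # One (W, U, b) triple per gate: update z, reset r, candidate h.
        self.params = {g: (mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden)
                       for g in ("z", "r", "h")}
        self.n_hidden = n_hidden

    def _gate(self, name, x, h, act):
        W, U, b = self.params[name]
        return [act(a + c + d) for a, c, d in zip(_matvec(W, x), _matvec(U, h), b)]

    def step(self, x, h):
        z = self._gate("z", x, h, _sigmoid)
        r = self._gate("r", x, h, _sigmoid)
        cand = self._gate("h", x, [ri * hi for ri, hi in zip(r, h)], math.tanh)
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def bigru(sequence, fwd, bwd):
    """Run a forward and a backward GRU and concatenate their states per step."""
    def run(cell, xs):
        h, out = [0.0] * cell.n_hidden, []
        for x in xs:
            h = cell.step(x, h)
            out.append(h)
        return out
    forward = run(fwd, sequence)
    backward = run(bwd, sequence[::-1])[::-1]    # re-align to the original time order
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd, bwd = GRUCell(3, 4, rng), GRUCell(3, 4, rng)
seq = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [1.0, 0.0, -1.0]]
out = bigru(seq, fwd, bwd)
print(len(out), len(out[0]))
```

Each output step has twice the hidden size: the first half summarizes the inputs up to that step, and the second half summarizes the inputs from that step to the end of the sequence.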
[0111] As shown in Figure 6, this invention uses a neural network as a function approximator to estimate the value function or policy function. The state observation vector is layer-normalized before being input into the fully connected neural network, which speeds up training and makes the network easier to train. Specifically, layer normalization normalizes the features of each sample: for each input sample, the mean and standard deviation of its features are calculated, and the features are rescaled to a mean of 0 and a variance of 1. This reduces internal covariate shift, making the model more stable during training.
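The normalization step can be written out for a single sample. The following is a minimal sketch, assuming no learnable gain or bias and a small constant `eps` for numerical stability; both choices and the function name `layer_norm` are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one sample across its features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


observation = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(observation)
print(normalized)
```

Because the statistics are computed per sample rather than per batch, the result does not depend on how many ships or trajectories are processed together.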
[0112] (2) Design the state and action spaces
[0113] The state space of this invention is divided into the ship's own state vector $O_{self}$ and the external observation state vector $O_{out}$. The own state vector consists of the ship's current velocity $v_t$, the rudder angular velocity reading $\delta_t$, the ship's heading, the ship's radius $r$ (half its length), the coordinates $P_t$ of the ship's current position, the coordinates $G$ of the target point, and the distance $D_t$ to the target point. The external observation $O_{out}$ is divided into three parts. The first part is the collision risk assessment vector $\alpha_i = [\chi_1, \chi_2, \chi_3]$ based on the RVO and VO algorithms, where $i$ indexes the static and dynamic obstacles, $i \in (0 \ldots n)$. The second part consists of the 32 lidar lines $[L_1, L_2, \ldots, L_{32}]$. The third part is the estimated collision time $t_c$, obtained by dividing the distance to the obstacle detected by the ship's lidar by the straight-line speed, where the straight-line speed is the maximum speed at which the ship can travel directly from its current position to the target point without obstacles. Thus $O_{out} = [\alpha_1, \alpha_2, \ldots, \alpha_i, L_1, L_2, \ldots, L_{32}, t_c]$.
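The layout of the external observation can be sketched as a flat list. The following is a minimal sketch, assuming each χ is a 2-D vector and that the distance used for the estimated collision time is the shortest of the 32 lidar returns; both assumptions and the function name `build_external_observation` are hypothetical and are not stated in the patent.

```python
def build_external_observation(risk_vectors, lidar_ranges, straight_line_speed):
    """Concatenate [alpha_1 ... alpha_i, L_1 ... L_32, t_c] into one flat list."""
    if len(lidar_ranges) != 32:
        raise ValueError("expected 32 lidar lines")
    if straight_line_speed <= 0:
        raise ValueError("straight-line speed must be positive")
    # Assumption: the obstacle distance is the shortest lidar return.
    t_c = min(lidar_ranges) / straight_line_speed
    o_out = []
    for chi1, chi2, chi3 in risk_vectors:
        o_out.extend(chi1)
        o_out.extend(chi2)
        o_out.extend(chi3)
    o_out.extend(lidar_ranges)
    o_out.append(t_c)
    return o_out


alphas = [((0.98, 0.2), (0.98, -0.2), (-1.0, 0.0))]   # one obstacle
lidar = [50.0] * 31 + [20.0]
o_out = build_external_observation(alphas, lidar, straight_line_speed=4.0)
print(len(o_out), o_out[-1])
```

Since the number of obstacles varies, the length of this vector varies too, which is one reason a sequence model is a natural fit for the input.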
[0114] In this invention, one component of the action space is the change in ship speed $\Delta v$. Because the key maneuver for a ship in decision-making and collision avoidance is a change in rudder angle, the other component is the change in rudder angle $\Delta\psi$, which simulates the steering process of a real ship. The ship speed must also be limited to the range between its minimum and maximum values, $v_t \in [v_{min}, v_{max}]$. The action space is therefore $a = [\Delta v, \Delta\psi]$. In this decision-making method, all ships share the same collision avoidance policy $\pi_\theta$ and each independently finds its best action.
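Applying an action with the speed limit can be sketched as follows. This is a minimal sketch, assuming the speed is simply clamped to the allowed range after the change is added; the function name `apply_action` and the numbers are hypothetical, and no rudder limit is applied because the text does not give one.

```python
def apply_action(speed, rudder, action, v_min, v_max):
    """Apply a = [delta_v, delta_psi] and keep the speed within [v_min, v_max]."""
    delta_v, delta_psi = action
    new_speed = min(max(speed + delta_v, v_min), v_max)   # clamp to the speed limits
    new_rudder = rudder + delta_psi
    return new_speed, new_rudder


new_speed, new_rudder = apply_action(5.0, 0.0, (2.0, 0.1), v_min=0.0, v_max=6.0)
print(new_speed, new_rudder)
```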
[0115] (3) Design reward function
[0116] The reinforcement learning task of this invention is qualitative goal achievement, namely, that the ship intelligently avoids collisions while adhering to rules, and minimizes the time it takes to reach the target point. This invention divides the reward function into the following five parts:
[0117] 1) Target point guided reward function
[0118] To allow the ship to approach the target point, and considering the influence of the ship's heading angle, this function is designed as follows:
[0119]
[0120] where $(G^i_\alpha)_t$ is the target point guidance reward function of the i-th ship at time t, expressed in terms of the x- and y-coordinates of the position of the i-th ship at time t-1 and the x- and y-coordinates of the target point of the i-th ship, and $\tau_g$ is the coefficient of the target point guidance reward function.
[0121] 2) Collision and arrival reward function
[0122] To encourage ships in the intelligent navigation decision-making system to better avoid collisions and reach the target point, a collision and arrival reward function is used to reward reaching the target. Collision behaviors, the ship's current speed falling into the VO region generated by a static obstacle, and the ship's current speed falling into the RVO region generated by the target ship are penalized. This trains the agent ship to better learn obstacle avoidance. The formula is set as follows:
[0123]
[0124] Here, $(G_\beta^i)_t$ is the collision and arrival reward function of the i-th ship at time t; $(p^i)_t$ is the coordinates of the i-th ship at time t; $g^i$ is the coordinates of the target point of the i-th ship; $P_{obs}$ is the coordinates of the obstacle; $V_t$ is the velocity of the ship at time t; $VO_{OS}$ is the velocity obstacle region identified by the velocity obstacle risk assessment; and $r_{collision}$ and $r_{arrival}$ are constants.
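The test of whether a velocity falls into a VO or RVO region can be illustrated with the standard geometric definitions. The sketch below is an assumption-laden illustration, not the formula of this invention: it models the obstacle as a disc with a hypothetical combined safety `radius`, and the function names are introduced here. The RVO test uses the usual definition that a candidate velocity v' lies in the RVO when 2v' minus the current own-ship velocity lies in the VO.

```python
import math


def in_velocity_obstacle(p_os, v_os, p_obs, v_obs, radius):
    """Return True if the own-ship velocity v_os lies in the velocity
    obstacle induced by an obstacle at p_obs moving with v_obs.

    `radius` is an assumed combined safety radius (hypothetical parameter).
    A static obstacle is the special case v_obs = (0, 0).
    """
    dx, dy = p_obs[0] - p_os[0], p_obs[1] - p_os[1]
    wx, wy = v_os[0] - v_obs[0], v_os[1] - v_obs[1]   # relative velocity
    dist = math.hypot(dx, dy)
    if dist <= radius:
        return True                   # already inside the safety disc
    rel_speed = math.hypot(wx, wy)
    if rel_speed == 0.0:
        return False                  # no relative motion, no closing
    # Angle between the relative velocity and the line of sight.
    cos_angle = (wx * dx + wy * dy) / (rel_speed * dist)
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    half_cone = math.asin(radius / dist)   # half-angle of the collision cone
    return angle < half_cone


def in_reciprocal_velocity_obstacle(p_os, v_os, v_candidate, p_ts, v_ts, radius):
    """RVO test: v_candidate is in the RVO if 2*v_candidate - v_os is in the VO."""
    shifted = (2.0 * v_candidate[0] - v_os[0], 2.0 * v_candidate[1] - v_os[1])
    return in_velocity_obstacle(p_os, shifted, p_ts, v_ts, radius)
```

A reward function of the kind described above could call these tests at each time step and apply a penalty whenever either returns True.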
[0125] 3) Optimized steering function
[0126] When a ship in the intelligent navigation decision-making system makes obstacle avoidance decisions, it must be prevented from spinning in place, which could lead to unsafe navigation. In practice, a single large rudder angle adjustment can cause accidents or dangerous situations, so multiple small-angle rudder adjustments should be encouraged during steering. This invention therefore introduces an optimized rudder angle function that discourages spinning in place and single large rudder angle adjustments, both of which can also lead to getting trapped in local optima. The formula is set as follows:
[0127] $(G_\omega^i)_t = \tau_\omega \left|(\psi^i)_t\right|$ (13)
[0128] Here, $(G_\omega^i)_t$ is the optimized steering function of the i-th ship at time t, $(\psi^i)_t$ is the rudder angular velocity of the i-th ship at time t, and $\tau_\omega$ is the optimized rudder angle coefficient, which this invention sets to a negative value to penalize large rudder angles.
[0129] 4) Optimal route function
[0130] To shorten the collision avoidance navigation path, an optimal route function is designed to reduce the travel time along the route. The formula is set as follows:
[0131] $(G_\gamma^i)_t = -\gamma_t$ (14)
[0132] Here, $\gamma_t$ is the coefficient of the optimal route function at time t. The principle is to give each ship a small negative reward at every time step, guiding the ship to reach the target point as quickly as possible.
[0133] 5) Rule-based reward function
[0134] To ensure that the designed intelligent navigation decision-making system complies with the International Regulations for Preventing Collisions at Sea (COLREGs), a rule-based reward function is introduced. Its design principle is to give negative rewards for actions that do not comply with the rules. The formula, with rule coefficient $r_\delta$, is as follows:
[0135]
[0136] Here, $r_\delta$ is the rule reward coefficient, which is a negative value.
[0137] In this invention, a vessel that does not comply with the rules is penalized. When the vessel's radar detects an approaching vessel at a bearing between 5° and 112.5°, the vessel is expected to turn to starboard, and failing to do so is penalized. Similarly, when the approaching vessel is at a bearing between 247.5° and 355°, changes to the vessel's rudder angle are penalized.
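The bearing sectors can be turned into a simple rule check. The sketch below follows the roles stated in claim 1 (a give-way vessel turns to starboard; a stand-on vessel keeps its course) and is only an interpretation: the function names, the sign convention (positive rudder change means starboard), the default coefficient value, and the treatment of the remaining bearings as "head_on" and "astern" sectors are assumptions of this example.

```python
def encounter_sector(bearing_deg):
    """Classify the bearing of an approaching vessel as seen from the own ship.

    Bearings are in degrees, measured clockwise from dead ahead.
    Sector limits 5, 112.5, 247.5 and 355 come from the text; the labels
    for the remaining two sectors are assumptions of this sketch.
    """
    b = bearing_deg % 360.0
    if 5.0 <= b <= 112.5:
        return "give_way"    # target on the starboard side
    if 247.5 <= b <= 355.0:
        return "stand_on"    # target on the port side
    if 112.5 < b < 247.5:
        return "astern"
    return "head_on"


def rule_penalty(bearing_deg, rudder_change_deg, r_delta=-1.0):
    """Return r_delta (negative) for a rule violation, otherwise 0.

    Assumed convention: rudder_change_deg > 0 is a turn to starboard.
    """
    sector = encounter_sector(bearing_deg)
    if sector == "give_way" and rudder_change_deg <= 0.0:
        return r_delta       # give-way vessel did not turn to starboard
    if sector == "stand_on" and rudder_change_deg != 0.0:
        return r_delta       # stand-on vessel changed its rudder angle
    return 0.0


# Example: target at 45 degrees, own ship turns to port, so a penalty applies.
penalty = rule_penalty(45.0, -5.0)
```

In a real system the exact comparison on the stand-on branch would likely use a small tolerance rather than strict inequality with zero.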
[0138] 6) Ship safety range reward function
[0139] $(G_\varepsilon^i)_t = -\mu$ if $D_{TS} < R_{radar}$ (16)
[0140] Here, $D_{TS}$ is the distance between the target vessel or obstacle and the own ship, $R_{radar}$ is the safe range set for the ship's radar, and $\mu$ is a penalty coefficient that gives a negative reward when an obstacle or target vessel is within the set safe range.
[0141] Therefore, the total reward is:
[0142] $(R^i)_t = (G_\alpha^i)_t + (G_\beta^i)_t + (G_\omega^i)_t + (G_\gamma^i)_t + (G_\delta^i)_t + (G_\varepsilon^i)_t$ (17)
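As a rough illustration of how formula (17) combines the terms, the sketch below implements formulas (13), (14), and (16) directly and takes the remaining three terms as precomputed inputs, since their formulas are not reproduced in this text. All function names and default coefficient values are hypothetical, and returning 0 when the safety range is not violated is an assumption.

```python
def steering_term(rudder_rate, tau_omega=-0.1):
    """Formula (13): tau_omega * |psi|, with tau_omega negative (value assumed)."""
    return tau_omega * abs(rudder_rate)


def route_term(gamma_t=0.01):
    """Formula (14): a small negative reward per time step (value assumed)."""
    return -gamma_t


def safety_term(d_ts, r_radar, mu=0.5):
    """Formula (16): -mu if the target or obstacle is inside the safe range.

    The zero reward outside the safe range is an assumption of this sketch.
    """
    return -mu if d_ts < r_radar else 0.0


def total_reward(g_alpha, g_beta, g_delta, rudder_rate, d_ts, r_radar):
    """Formula (17): sum of the six reward terms.

    g_alpha, g_beta and g_delta are passed in as already computed values.
    """
    return (g_alpha + g_beta + steering_term(rudder_rate)
            + route_term() + g_delta + safety_term(d_ts, r_radar))


# Example: guidance reward 0.2, no collision or rule terms,
# rudder rate 2.0, target at 50 m with a 100 m safe range.
reward = total_reward(0.2, 0.0, 0.0, 2.0, 50.0, 100.0)
```

Keeping each term in its own function makes it easy to tune one coefficient at a time during training.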
[0143] (4) Deployment
[0144] Deploying the trained model onto a real intelligent unmanned vessel enables it to navigate autonomously and make obstacle avoidance decisions in complex waters. The specific deployment method is as follows: the pre-trained model is loaded by the program, transmitted remotely to the mainboard of the actual autonomous surface vessel, and invoked directly within the pre-programmed logic.
[0145] The present invention has the following beneficial effects:
[0146] (1) This invention combines deep reinforcement learning with the traditional VO and RVO algorithms and proposes a collision risk assessment model based on them. The method partitions the velocity obstacle region generated by the RVO and VO algorithms and uses a neural network to calculate the collision risk between the own ship and each target ship or obstacle.
[0147] (2) This invention introduces a mixed input of lidar lines and the velocity obstacle regions generated by the RVO and VO algorithms, which addresses the poor collision avoidance capability of the traditional RVO and VO algorithms when facing long-distance obstacles, such as the edge of a waterway. The lidar line detections are also used to design a reward function that conforms to COLREGs.
[0148] (3) A neural network built on a bidirectional gated recurrent unit (BiGRU) is established, which can directly map a variable number of surrounding obstacles and ships to be avoided onto the rudder angle output by the algorithm. Normalizing the input state improves the efficiency of the algorithm and speeds up training.
[0149] (4) The reward function designed in this invention meets the requirements of the collision avoidance rules and matches the human steering habit of making multiple small-angle rudder adjustments rather than a single large-angle adjustment. It can take over decision-making for unmanned ships in complex encounter situations.
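To illustrate effect (3), the following sketch shows how a bidirectional GRU can reduce a variable number of obstacle feature vectors to a single bounded rudder angle. It is a toy with random, untrained weights written in pure Python; the class names, layer sizes, feature layout, and the tanh-scaled output with a hypothetical `max_rudder` limit are all assumptions and do not describe the trained network of this invention.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell with update gate z, reset gate r, candidate n."""

    def __init__(self, n_in, n_hid, rng):
        self.params = {g: (rand_matrix(n_hid, n_in, rng),
                           rand_matrix(n_hid, n_hid, rng),
                           [0.0] * n_hid) for g in "zrn"}

    def step(self, x, h):
        wz, uz, bz = self.params["z"]
        wr, ur, br = self.params["r"]
        wn, un, bn = self.params["n"]
        z = [sigmoid(a + b + c) for a, b, c in zip(matvec(wz, x), matvec(uz, h), bz)]
        r = [sigmoid(a + b + c) for a, b, c in zip(matvec(wr, x), matvec(ur, h), br)]
        n = [math.tanh(a + ri * b + c)
             for a, ri, b, c in zip(matvec(wn, x), r, matvec(un, h), bn)]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


class BiGRUPolicy:
    """Maps a list of obstacle feature vectors (any length) to a rudder angle."""

    def __init__(self, n_in, n_hid, max_rudder=math.radians(35.0), seed=0):
        rng = random.Random(seed)
        self.n_hid = n_hid
        self.fwd = GRUCell(n_in, n_hid, rng)
        self.bwd = GRUCell(n_in, n_hid, rng)
        self.head = [rng.uniform(-0.5, 0.5) for _ in range(2 * n_hid)]
        self.max_rudder = max_rudder   # hypothetical limit, not from the source

    def _run(self, cell, seq):
        h = [0.0] * self.n_hid
        for x in seq:
            h = cell.step(x, h)
        return h

    def rudder_angle(self, obstacle_features):
        h_f = self._run(self.fwd, obstacle_features)            # forward pass
        h_b = self._run(self.bwd, reversed(obstacle_features))  # backward pass
        score = sum(w * s for w, s in zip(self.head, h_f + h_b))
        return self.max_rudder * math.tanh(score)


# The same policy accepts two obstacles or five without any change.
policy = BiGRUPolicy(n_in=3, n_hid=4)
angle_two = policy.rudder_angle([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
angle_five = policy.rudder_angle([[0.1, 0.2, 0.3]] * 5)
```

Concatenating the final forward and backward hidden states gives a fixed-size summary regardless of how many obstacles are present, which is what allows a single output layer to produce the rudder angle.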
[0150] Example 2
[0151] In order to implement the method corresponding to Embodiment 1 above and achieve the corresponding functions and technical effects, a multi-ship collision avoidance decision system for unmanned vessels based on reinforcement learning is provided below.
[0152] The system includes:
[0153] The model building module is used to build a mathematical model of ship motion and determine the current state vector of the ship based on the ship kinematic model.
[0154] The navigation decision-making method construction module is used to construct a navigation decision-making method for multi-ship encounter situations; the navigation decision-making method is used to determine the current encounter situation of the ships.
[0155] The collision risk area determination module is used to assess collision risk and determine collision risk areas based on the RVO and VO algorithms.
[0156] The model training module is used to train a BiGRU network model using historical data; the historical data includes historical ship state vectors, historical collision risk assessment vectors, and historical lidar lines; the BiGRU network model is configured with the navigation decision-making method and reward function.
[0157] The rudder angle determination module is used to input the current ship's own state vector, collision risk assessment vector, ship's expected collision time, and lidar lines into the trained BiGRU network model to obtain the current ship's rudder angle; the collision risk assessment vector is used to represent the collision risk area.
[0158] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For similar or identical parts, the embodiments can be cross-referenced. Since the systems disclosed in the embodiments correspond to the methods disclosed in the embodiments, their descriptions are relatively brief; for relevant details, refer to the method section.
[0159] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning, characterized in that it comprises the following.
A mathematical model of ship motion is established, and the current state vector of the own ship is determined based on the ship kinematic model.
A navigation decision-making method is constructed for multi-ship encounter situations. This method is used to determine the current encounter situation of a vessel, as follows.
Head-on encounter: when two vessels meet on opposite courses and a collision risk exists, the situation is judged to be a head-on encounter. Both the target vessel and this vessel are give-way vessels and should change course to starboard, each passing on the other vessel's port side to avoid collision.
Crossing encounter: when the courses of two vessels cross and a collision risk exists, there are two situations. First, when the relative position of this vessel to the other vessel is [247.5°, 355°], this vessel is the stand-on vessel and should maintain its course and speed, and the target vessel should give way. Second, when the relative position of this vessel to the other vessel is [5°, 112.5°], this vessel is the give-way vessel and should change course to starboard and pass behind the target vessel to avoid a collision.
Overtaking: if the speed of this vessel is greater than the speed of the target vessel, the bearing of this vessel is 112.5° to 247.5°, and this vessel is directly behind the target vessel, this vessel is the give-way vessel. In this case, this vessel should pass on the port or starboard side of the target vessel to avoid a collision.
Collision risk assessment is performed based on the RVO and VO algorithms to determine collision risk areas.
The BiGRU network model is trained using historical data, which includes historical vessel state vectors, historical collision risk assessment vectors, historical lidar lines, historical expected collision times, and historical vessel rudder angles. The BiGRU network model incorporates the navigation decision-making method and the reward function.
The current own-ship state vector, collision risk assessment vector, expected collision time, and the lidar lines used to perceive the surrounding environment are input into the trained BiGRU network model to obtain the current rudder angle of the ship; the collision risk assessment vector represents the collision risk area.
The reward function includes: a target point guidance reward function, a collision and arrival reward function, an optimized steering function, an optimal route function, a rule reward function, and a ship safety range reward function.
The calculation formula for the target point guidance reward function is as follows:
where $(G_\alpha^i)_t$ is the target point guidance reward function of the i-th ship at time t; the position terms are the y-coordinate and the x-coordinate of the i-th ship's position at time t-1; the target terms are the x-coordinate and the y-coordinate of the i-th ship's target point; and $\tau_g$ is the coefficient of the target point guidance reward function.
The calculation formula for the collision and arrival reward function is as follows:
where $(G_\beta^i)_t$ is the collision and arrival reward function of the i-th ship at time t; $(p^i)_t$ is the coordinates of the i-th ship at time t; $g^i$ is the coordinates of the target point of the i-th ship; $P_{obs}$ is the coordinates of the obstacle; $V_t$ is the velocity of the ship at time t; $VO_{OS}$ is the velocity obstacle region of this vessel determined by the velocity obstacle risk assessment; and $r_{collision}$ and $r_{arrival}$ are constants.
The formula for calculating the optimized steering function is as follows: $(G_\omega^i)_t = \tau_\omega \left|(\psi^i)_t\right|$, where $(G_\omega^i)_t$ is the optimized steering function of the i-th ship at time t, $(\psi^i)_t$ is the rudder angular velocity of the i-th ship at time t, and $\tau_\omega$ is the coefficient of the optimized rudder angle function.
The formula for calculating the optimal route function is as follows: $(G_\gamma^i)_t = -\gamma_t$, where $(G_\gamma^i)_t$ is the optimal route function of the i-th ship at time t and $\gamma_t$ is the coefficient of the optimal route function at time t.
The formula for calculating the rule reward function is as follows:
where $(G_\delta^i)_t$ is the rule reward function of the i-th ship at time t, $r_\delta$ is the rule reward coefficient, and COLREGs denotes the International Regulations for Preventing Collisions at Sea.
The formula for calculating the ship safety range reward function is as follows: $(G_\varepsilon^i)_t = -\mu$ if $D_{TS} < R_{radar}$, where $(G_\varepsilon^i)_t$ is the ship safety range reward function of the i-th ship at time t, $D_{TS}$ is the distance between the target vessel or obstacle and this vessel, $R_{radar}$ is the safety range set for the ship's radar line, and $\mu$ is the penalty coefficient.
2. The multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning according to claim 1, characterized in that the expression for the ship motion model is as follows:
where the symbols denote, in order: the ship's drift acceleration; the angular acceleration of the bow; the drift velocity, within the ship's two degrees of freedom; the angular velocity of the bow; and the input rudder angle. The remaining coefficients are determined by the ship's basic parameters.
3. The multi-ship collision avoidance decision-making method for unmanned vessels based on reinforcement learning according to claim 1, characterized in that the formula for calculating the collision risk area is as follows:
where the symbols denote, in order: the velocity obstacle region generated by the target ship TS for the own ship OS; the velocity of the target ship; the velocity of the own ship; the collision coefficient; the position of the own ship; the target ship; the own ship; the Minkowski sum; the velocity obstacle region calculated using the RVO algorithm; and the new velocity selected by the own ship OS in addition to the reciprocal velocity.
4. A multi-ship collision avoidance decision-making system for unmanned vessels based on reinforcement learning, characterized in that it comprises the following.
The model building module is used to build a mathematical model of ship motion and determine the current state vector of the own ship based on the ship kinematic model.
The navigation decision-making module is used to construct a navigation decision-making method for multi-ship encounter situations. This method determines the current encounter situation of a vessel, as follows.
Head-on encounter: when two vessels meet on opposite courses and a collision risk exists, the situation is judged to be a head-on encounter. Both the target vessel and this vessel are give-way vessels and should change course to starboard, each passing on the other vessel's port side to avoid collision.
Crossing encounter: when the courses of two vessels cross and a collision risk exists, there are two situations. First, when the relative position of this vessel to the other vessel is [247.5°, 355°], this vessel is the stand-on vessel and should maintain its course and speed, and the target vessel should give way. Second, when the relative position of this vessel to the other vessel is [5°, 112.5°], this vessel is the give-way vessel and should change course to starboard and pass behind the target vessel to avoid a collision.
Overtaking: if the speed of this vessel is greater than the speed of the target vessel, the bearing of this vessel is 112.5° to 247.5°, and this vessel is directly behind the target vessel, this vessel is the give-way vessel. In this case, this vessel should pass on the port or starboard side of the target vessel to avoid a collision.
The collision risk area determination module is used to assess collision risk based on the RVO and VO algorithms and determine the collision risk area.
The model training module is used to train a BiGRU network model using historical data; the historical data includes historical ship state vectors, historical collision risk assessment vectors, and historical lidar lines; the BiGRU network model is configured with the navigation decision-making method and the reward function.
The rudder angle determination module is used to input the current own-ship state vector, collision risk assessment vector, expected collision time, and the lidar lines used to perceive the surrounding environment into the trained BiGRU network model to obtain the current rudder angle of the ship; the collision risk assessment vector represents the collision risk area.
The reward function includes: a target point guidance reward function, a collision and arrival reward function, an optimized steering function, an optimal route function, a rule reward function, and a ship safety range reward function.
The calculation formula for the target point guidance reward function is as follows:
where $(G_\alpha^i)_t$ is the target point guidance reward function of the i-th ship at time t; the position terms are the y-coordinate and the x-coordinate of the i-th ship's position at time t-1; the target terms are the x-coordinate and the y-coordinate of the i-th ship's target point; and $\tau_g$ is the coefficient of the target point guidance reward function.
The calculation formula for the collision and arrival reward function is as follows:
where $(G_\beta^i)_t$ is the collision and arrival reward function of the i-th ship at time t; $(p^i)_t$ is the coordinates of the i-th ship at time t; $g^i$ is the coordinates of the target point of the i-th ship; $P_{obs}$ is the coordinates of the obstacle; $V_t$ is the velocity of the ship at time t; $VO_{OS}$ is the velocity obstacle region of this vessel determined by the velocity obstacle risk assessment; and $r_{collision}$ and $r_{arrival}$ are constants.
The formula for calculating the optimized steering function is as follows: $(G_\omega^i)_t = \tau_\omega \left|(\psi^i)_t\right|$, where $(G_\omega^i)_t$ is the optimized steering function of the i-th ship at time t, $(\psi^i)_t$ is the rudder angular velocity of the i-th ship at time t, and $\tau_\omega$ is the coefficient of the optimized rudder angle function.
The formula for calculating the optimal route function is as follows: $(G_\gamma^i)_t = -\gamma_t$, where $(G_\gamma^i)_t$ is the optimal route function of the i-th ship at time t and $\gamma_t$ is the coefficient of the optimal route function at time t.
The formula for calculating the rule reward function is as follows:
where $(G_\delta^i)_t$ is the rule reward function of the i-th ship at time t, $r_\delta$ is the rule reward coefficient, and COLREGs denotes the International Regulations for Preventing Collisions at Sea.
The formula for calculating the ship safety range reward function is as follows: $(G_\varepsilon^i)_t = -\mu$ if $D_{TS} < R_{radar}$, where $(G_\varepsilon^i)_t$ is the ship safety range reward function of the i-th ship at time t, $D_{TS}$ is the distance between the target vessel or obstacle and this vessel, $R_{radar}$ is the safety range set for the ship's radar line, and $\mu$ is the penalty coefficient.