A reinforcement learning-based autonomous carrier landing control method and related equipment for unmanned aerial vehicles (UAVs)

By using a reinforcement learning-based approach and leveraging a policy neural network and a lower-level controller to generate rotor thrust control commands, autonomous landing of UAVs on ships is achieved. This solves the problems of poor accuracy, signal loss, and high computing power consumption in existing technologies, and improves the robustness and flexibility of UAV landing.

CN122308452A · Pending · Publication Date: 2026-06-30 · 烟台哈尔滨工程大学研究院 (Yantai Research Institute of Harbin Engineering University)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 烟台哈尔滨工程大学研究院
Filing Date: 2026-06-01
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing autonomous ship landing control methods for unmanned aerial vehicles (UAVs) suffer from problems such as poor accuracy, signal loss, difficulty in decoupling, high computing power consumption, high data requirements, and insufficient robustness under complex sea conditions, which limit the efficient docking of UAVs and ships.

Method used

By employing a reinforcement learning-based approach, the system acquires the state information of the UAV and the ship. It then outputs desired acceleration and yaw torque commands through a policy neural network, which, combined with the lower-level controller, generate rotor thrust control commands to enable the UAV to autonomously land on the ship.

Benefits of technology

It improves the robustness, flexibility and scalability of UAV autonomous ship landing control in complex sea conditions, and solves the problems of jitter and decoupling difficulties, high computing power consumption and high data requirements in traditional methods.

✦ Generated by Eureka AI based on patent content.

Abstract figure: CN122308452A_ABST

Patent Text Reader

Abstract

This invention relates to the interdisciplinary field of unmanned aerial vehicle (UAV) autonomous control and artificial intelligence, and specifically discloses a reinforcement learning-based UAV autonomous ship landing control method and related equipment. The invention cascades a policy neural network, trained with the proximal policy optimization (PPO) algorithm, with a lower-level controller. The UAV and ship state information, the relative position and velocity vectors, and the included angle parameter between their normal vectors are input into the policy network, which outputs a desired acceleration command and a desired yaw moment command. The lower-level controller then converts these commands into rotor thrust control commands that drive the UAV to land. The invention addresses the chattering and decoupling difficulties of traditional sliding mode control, the high computational cost and insufficient real-time performance of model predictive control, and the reliance on accurate models and high data requirements of backstepping methods. It improves the robustness, flexibility, and scalability of UAV autonomous ship landing control in complex sea conditions.

Description

Technical Field

[0001] This invention relates to the interdisciplinary field of unmanned aerial vehicle (UAV) autonomous control and artificial intelligence, and in particular to a reinforcement learning-based UAV autonomous carrier landing control method and related equipment.

Background Technology

[0002] Unmanned surface vehicles (USVs) are playing an increasingly important role in maritime operations. However, because formations of USVs move only on the two-dimensional sea surface, their detection range and operational area are significantly limited. Equipping USVs with unmanned aerial vehicles (UAVs) can significantly extend that range, and quadrotor UAVs are widely used for this purpose because of their vertical takeoff and landing capabilities. Multirotor UAVs offer excellent maneuverability and can provide aerial information, but have shorter battery life; USVs have longer ranges, but are limited by factors such as field of view and speed. Combining the two can enhance maritime applications, including search and rescue, surveillance, and remote environmental monitoring, and supports diverse missions such as maritime patrol, reconnaissance, marine exploration, and multi-domain collaborative operations.

[0003] However, building an efficient USV-UAV collaborative system still faces many challenges, with the key bottleneck being the autonomous landing and recovery of UAVs on ships. In complex sea conditions, accurately landing UAVs on the narrow and undulating deck of a ship is an extremely challenging task. While quadcopter UAVs possess vertical takeoff and landing capabilities, they cannot eliminate the fundamental problem of landing on swaying platforms.

[0004] During drone recovery, traditional methods use GPS to obtain relative position, IMU to estimate its own state, and communication to acquire ship attitude, but these methods suffer from poor accuracy and signal loss. While visual servo control frameworks can achieve relative state estimation through Kalman filter-assisted marker tracking and PnP pose calculation, they suffer from low sampling rates.

[0005] In autonomous recovery control methods based on relative pose, sliding mode control faces decoupling difficulties for underactuated systems like quadrotor UAVs and still suffers from inherent chattering issues. Model predictive control is highly dependent on model accuracy, resulting in high computational costs and latency in high-dynamic tasks. Backstepping also requires an accurate model; some studies have introduced long short-term memory networks to predict ship motion, but this requires a large amount of raw data and suffers from overfitting and generalization difficulties. Therefore, existing methods still have shortcomings in terms of robustness, flexibility, and scalability.

[0006] Reinforcement learning achieves significant results in complex robot control problems by enabling continuous policy optimization through simulated interaction. However, many existing simulation environments suffer from unrealistic visual effects, inaccurate physical simulations, and insufficient task complexity, which limits the effective application of reinforcement learning in the autonomous landing control of unmanned aerial vehicles (UAVs).

[0007] Therefore, there is an urgent need for a technical solution that addresses the above problems.

Summary of the Invention

[0008] To address the aforementioned technical problems, this invention provides a method and related equipment for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning.

[0009] In a first aspect, the present invention provides a reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles, the technical solution of which is as follows:

First state information of the UAV and second state information of the ship are obtained. The first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system, and the attitude angle and angular velocity vector in the body coordinate system. The second state information includes the position vector, velocity vector, attitude angle and normal vector of the ship in the inertial coordinate system.

Based on the first state information and the second state information, the relative position vector and relative velocity vector of the UAV relative to the ship are determined, and a first included angle parameter between the normal vector of the UAV and the normal vector of the ship is determined.

The first state information, the second state information, the relative position vector, the relative velocity vector, and the first included angle parameter are input into a pre-trained policy neural network, which outputs a desired acceleration command and a desired yaw moment command in the body coordinate system. The policy neural network is a neural network trained using the proximal policy optimization (PPO) algorithm.

The desired acceleration command and the desired yaw moment command are input into the lower-level controller, which generates thrust control commands that act on each rotor of the UAV, driving the UAV to move and complete the landing on the ship.

The training process of the policy neural network includes calculating the reward value at each time step based on a reward function, which includes a projection distance reward, a pose reward, and a state alignment reward. The projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and it takes effect when the magnitude of the relative position vector is less than a first preset threshold. The pose reward is determined based on a second included angle parameter between the normal vector of the UAV and the normal vector of the ship, and it takes effect when the magnitude of the relative position vector is less than a second preset threshold. The state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.

[0010] The beneficial effects of the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles (UAVs) of the present invention are as follows: The method cascades a policy neural network, trained with the proximal policy optimization (PPO) algorithm, with a lower-level controller. It inputs the state information of the UAV and the ship, the relative position and velocity vectors, and the included angle parameter between the normal vectors into the policy network, which outputs the desired acceleration and yaw moment commands. The lower-level controller then converts these commands into rotor thrust control commands to drive the UAV to land on the ship. This method addresses the chattering and decoupling difficulties of traditional sliding mode control, the high computational cost and insufficient real-time performance of model predictive control, and the dependence on accurate models and high data requirements of backstepping. It improves the robustness, flexibility, and scalability of UAV autonomous landing control in complex sea conditions.

[0011] Based on the above scheme, the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles of the present invention can be further improved as follows.

[0012] In one alternative approach, the position vector and velocity vector of the UAV in the inertial coordinate system are obtained by fusing the inertial measurement unit and the global positioning system, and the attitude angle and angular velocity vector in the body coordinate system are obtained by the inertial measurement unit.

[0013] The advantages of adopting the above-mentioned optional method are as follows: by further integrating the inertial measurement unit with the global positioning system, the position vector and velocity vector of the UAV in the inertial coordinate system can be obtained. At the same time, the attitude angle and angular velocity vector in the body coordinate system can be calculated by using the inertial measurement unit, realizing multi-source fusion perception and accurate acquisition of UAV state information, and laying a reliable data foundation for subsequent relative pose calculation and policy network input.

[0014] In one alternative approach, the ship's position vector, velocity vector, attitude angle, and normal vector in the inertial coordinate system are obtained by fusing the inertial navigation system and the satellite positioning system on board the ship.

[0015] The advantages of adopting the above-mentioned optional method are as follows: by further integrating the ship's inertial navigation system and satellite positioning system, the ship's position vector, velocity vector, attitude angle and normal vector in the inertial coordinate system can be obtained, realizing real-time monitoring and accurate characterization of the ship's motion state, and ensuring the accuracy and synchronization of the relative state calculation between the UAV and the ship.

[0016] In one alternative embodiment, the policy neural network includes a policy network and a value network, wherein the policy network is used to output the desired acceleration command and the desired yaw moment command in the body coordinate system, and the value network is used to estimate state values to assist in updating the policy network.

[0017] The advantages of adopting the above optional approach are as follows: the policy neural network is further divided into a policy network and a value network. The policy network is responsible for outputting the desired acceleration command and the desired yaw moment command, while the value network estimates the state value to assist in updating the policy network. Policy optimization and value evaluation thus work together, which improves the stability and convergence efficiency of the training process.

[0018] In one alternative approach, the projection distance reward is the product of a base projection reward value, a power term, and a first switching function, where the power term is the ratio of the length of the projection of the relative position vector onto the ship's normal vector to the magnitude of the relative position vector, raised to a first power exponent.

[0019] The beneficial effects of adopting the above optional method are as follows: the projection distance reward term is formed by multiplying the base projection reward value, the ratio of the projection length to the relative position vector magnitude raised to the first power exponent, and the first switching function. This gives a refined design of the reward signal in the distance approach phase, guiding the UAV to maintain the optimal descent trajectory while approaching the ship.

[0020] In one alternative approach, the first switching function is an S-shaped function with the magnitude of the relative position vector and the first preset threshold as variables; when the magnitude is greater than the first preset threshold, the output of the first switching function approaches zero, and when the magnitude is less than the first preset threshold, the output of the first switching function approaches one.

[0021] The beneficial effects of adopting the above optional method are as follows: further adopting an S-shaped function with the relative position vector magnitude and the first preset threshold as variables as the first switching function, the output approaches zero when the magnitude is greater than the first preset threshold and approaches one when it is less than the first preset threshold, thereby realizing smooth start and stop control of the projection distance reward item and avoiding sudden changes and jitter of the reward signal.
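The switching function and projection distance reward described in [0018] to [0021] can be sketched as follows. This is a minimal illustration, not the patent's own formula: the logistic form of the S-shaped function, the function and parameter names, and the handling of a zero-length relative position vector are all assumptions, and the default values (base 1.0, exponent 1.5, threshold 3.0 m, slope 5.0) simply mirror the example figures quoted in the detailed embodiment.

```python
import math


def sigmoid_switch(distance, threshold, slope=5.0):
    """Assumed logistic switch: near 1 below the threshold, near 0 above it."""
    z = slope * (threshold - distance)
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def projection_distance_reward(rel_pos, ship_normal, base=1.0, exponent=1.5,
                               threshold=3.0, slope=5.0):
    """base * (|projection| / |rel_pos|) ** exponent * switch(|rel_pos|)."""
    distance = math.sqrt(sum(c * c for c in rel_pos))
    # Length of the projection of rel_pos onto the (unit) ship normal.
    proj_len = abs(sum(p * n for p, n in zip(rel_pos, ship_normal)))
    # Assumption: a tiny floor avoids division by zero when the UAV is at the origin.
    ratio = proj_len / max(distance, 1e-9)
    return base * ratio ** exponent * sigmoid_switch(distance, threshold, slope)
```

With these assumed defaults, a UAV directly along the deck normal and inside the threshold receives close to the full base reward, while a UAV approaching from the side (zero projection) receives none.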

[0022] In one alternative approach, the second included angle parameter is the dot product of the normal vector of the UAV and the normal vector of the ship, and the pose reward term is the product of a base pose reward value, the second included angle parameter raised to a second power exponent, and a second switching function.

[0023] The beneficial effects of adopting the above optional method are as follows: the dot product of the normal vector of the UAV and the normal vector of the ship is used as the second included angle parameter, and the pose reward term is formed by multiplying the base pose reward value, the second included angle parameter raised to the second power exponent, and the second switching function. This gives a quantitative design of the reward mechanism in the attitude alignment stage and promotes precise alignment of the UAV's landing attitude with the deck plane.
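A minimal sketch of the pose reward described in [0022] and [0023] is given below. Several points are assumptions rather than statements from the source: the second switching function is taken to have the same logistic form as the first, its slope of 5.0 is hypothetical, and the dot product is clamped to the range [0, 1] so that a non-integer exponent never receives a negative base. The defaults (base 1.0, exponent 2.0, threshold 2.0 m) mirror the example figures in the detailed embodiment.

```python
import math


def logistic_switch(distance, threshold, slope=5.0):
    """Assumed logistic switch: near 1 below the threshold, near 0 above it."""
    z = slope * (threshold - distance)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pose_reward(uav_normal, ship_normal, distance, base=1.0, exponent=2.0,
                threshold=2.0, slope=5.0):
    """base * (n_uav . n_ship) ** exponent * switch(distance)."""
    angle_param = sum(a * b for a, b in zip(uav_normal, ship_normal))
    # Assumption: clamp so that opposed normals earn no reward.
    angle_param = min(max(angle_param, 0.0), 1.0)
    return base * angle_param ** exponent * logistic_switch(distance, threshold, slope)
```

Under these assumptions the term is largest when the two unit normals coincide and the UAV is well inside the second threshold, and it fades smoothly to zero as the UAV moves away or tilts relative to the deck.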

[0024] In a second aspect, this invention provides a reinforcement learning-based autonomous carrier landing control system for unmanned aerial vehicles (UAVs). The technical solution of this system is as follows:

The acquisition module is used to acquire first state information of the UAV and second state information of the ship. The first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system, and the attitude angle and angular velocity vector in the body coordinate system. The second state information includes the position vector, velocity vector, attitude angle and normal vector of the ship in the inertial coordinate system.

The determination module is used to determine the relative position vector and relative velocity vector of the UAV relative to the ship based on the first state information and the second state information, and to determine the first included angle parameter between the normal vector of the UAV and the normal vector of the ship.

The generation module is used to input the first state information, the second state information, the relative position vector, the relative velocity vector and the first included angle parameter into a pre-trained policy neural network, which outputs the desired acceleration command and desired yaw moment command in the body coordinate system. The policy neural network is a neural network trained using the proximal policy optimization (PPO) algorithm.

The control module is used to input the desired acceleration command and the desired yaw moment command into the lower-level controller, which generates thrust control commands acting on each rotor of the UAV, so as to drive the UAV to move and complete the landing on the ship.

The training process of the policy neural network includes calculating the reward value at each time step based on a reward function, which includes a projection distance reward, a pose reward, and a state alignment reward. The projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and it takes effect when the magnitude of the relative position vector is less than a first preset threshold. The pose reward is determined based on a second included angle parameter between the normal vector of the UAV and the normal vector of the ship, and it takes effect when the magnitude of the relative position vector is less than a second preset threshold. The state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.

[0025] The beneficial effects of the reinforcement learning-based autonomous carrier landing control system for unmanned aerial vehicles (UAVs) of the present invention are as follows: The system cascades a policy neural network, trained with the proximal policy optimization (PPO) algorithm, with a lower-level controller. It inputs the state information of the UAV and the ship, the relative position and velocity vectors, and the included angle parameter between the normal vectors into the policy network, which outputs the desired acceleration and yaw moment commands. The lower-level controller then converts these commands into rotor thrust control commands to drive the UAV to land on the ship. This addresses the chattering and decoupling difficulties of traditional sliding mode control, the high computational cost and insufficient real-time performance of model predictive control, and the dependence on accurate models and high data requirements of backstepping. It improves the robustness, flexibility, and scalability of UAV autonomous landing control in complex sea conditions.

[0026] In a third aspect, the technical solution of an electronic device according to the present invention is as follows: It includes a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles described in this invention.

[0027] In a fourth aspect, the technical solution of a computer-readable storage medium provided by the present invention is as follows: The computer-readable storage medium stores instructions that, when read by a computer, cause the computer to perform the steps of the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles of the present invention.

[0028] The above description is merely an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are described below.

Description of the Drawings

[0029] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. The same reference numerals denote the same parts throughout the drawings. In the drawings:

Figure 1 is a flowchart of an embodiment of the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles (UAVs) according to the present invention.
Figure 2 is a schematic diagram of an embodiment of the reinforcement learning-based autonomous carrier landing control system for unmanned aerial vehicles according to the present invention.
Figure 3 is a schematic diagram of an embodiment of an electronic device according to the present invention.

Detailed Implementation

[0030] Exemplary embodiments of the invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be implemented in various forms and should not be limited to the embodiments set forth herein.

[0031] Figure 1 shows a flowchart of an embodiment of the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles (UAVs) provided by the present invention. The method can be executed by electronic devices such as terminal devices or servers. The terminal device can be any fixed or mobile terminal, such as user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. The server can be a single server or a server cluster consisting of multiple servers. Any such electronic device can implement the method by having its processor call computer-readable instructions stored in its memory. As shown in Figure 1, the method includes the following steps:

S1. Obtain the first state information of the UAV and the second state information of the ship. The first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system, and the attitude angle and angular velocity vector in the body coordinate system. The second state information includes the position vector, velocity vector, attitude angle and normal vector of the ship in the inertial coordinate system.

[0032] In this context, the position vector of the UAV in the inertial coordinate system is the vector of coordinates of the UAV's center of mass in three-dimensional space, expressed in a reference coordinate system fixed to the Earth, and the velocity vector is the rate of change of that position per unit time. For example, in an autonomous carrier landing mission on January 1, 2026, after the UAV takes off and flies towards the ship, with the origin of the inertial coordinate system set at a fixed point on the sea level, the UAV's position vector is expressed as […] m, whose components represent the eastward distance, northward distance, and vertically downward distance, respectively. The velocity vector is expressed as […] m/s, indicating that the drone is approaching the ship at 3.2 m/s to the east, 1.5 m/s to the south, and 0.8 m/s upwards.

[0033] The body coordinate system is constructed with the UAV's center of gravity as the origin and the body axes as the coordinate axes. The attitude angles are the Euler angles of the UAV relative to the inertial coordinate system, and the angular velocity vector is the instantaneous angular velocity about each body axis. For example, in the mission described above, when the UAV approaches the ship, its nose rises slightly and it tilts to the right. The attitude angles are a roll angle of 5.2°, a pitch angle of 8.7°, and a yaw angle of 32.1°, and the angular velocity vector is expressed as […] rad/s, whose components correspond to the angular velocities about the transverse, longitudinal, and vertical axes of the aircraft, respectively.

[0034] In this context, the ship's position vector, velocity vector, attitude angle, and normal vector in the inertial coordinate system refer, respectively, to the three-dimensional position coordinate vector of the ship's center of mass, the rate of change of that position per unit time, the Euler angles of the ship's deck coordinate system relative to the inertial coordinate system, and the unit direction vector perpendicular to the ship's deck plane and pointing upwards, all expressed in the same Earth-fixed inertial coordinate system. For example, when the drone begins landing, the ship is traveling at low speed: its position vector in the inertial coordinate system is expressed as […] m and its velocity vector as […] m/s. The ship rolls due to the waves, with attitude angles of 2.1° roll, 1.8° pitch, and 30.0° yaw. The ship's normal vector […] is calculated from the attitude angles.
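The step of computing the ship's normal vector from its attitude angles can be sketched as follows. The source does not state its rotation convention, so this sketch assumes a Z-Y-X (yaw, pitch, roll) Euler sequence and a deck frame whose z-axis points downward, so that the upward deck normal is the negated third column of the rotation matrix. The function name is hypothetical.

```python
import math


def deck_normal_from_euler(roll_deg, pitch_deg, yaw_deg):
    """Upward unit normal of the deck in the inertial frame (assumed conventions)."""
    phi, theta, psi = (math.radians(a) for a in (roll_deg, pitch_deg, yaw_deg))
    c, s = math.cos, math.sin
    # Third column of R = Rz(psi) * Ry(theta) * Rx(phi): the deck's downward
    # z-axis expressed in the inertial frame.
    down = (
        c(psi) * s(theta) * c(phi) + s(psi) * s(phi),
        s(psi) * s(theta) * c(phi) - c(psi) * s(phi),
        c(theta) * c(phi),
    )
    # The deck normal points upward, i.e. opposite to the downward axis.
    return tuple(-d for d in down)


# Attitude angles from the example above: 2.1 deg roll, 1.8 deg pitch, 30.0 deg yaw.
normal = deck_normal_from_euler(2.1, 1.8, 30.0)
```

Because the result is a column of a rotation matrix (up to sign), it is already a unit vector and needs no further normalization.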

[0035] S2. Based on the first state information and the second state information, determine the relative position vector and relative velocity vector of the UAV relative to the ship, and determine the first included angle parameter between the normal vector of the UAV and the normal vector of the ship.

[0036] The relative position vector is the three-dimensional vector pointing from the ship's center of mass to the UAV's center of mass in the same inertial coordinate system, representing the UAV's position relative to the ship. For example, from the aforementioned UAV position vector […] and ship position vector […], the relative position vector is calculated as […] m, indicating that the UAV is located slightly above and to the left rear of the ship. The relative velocity vector is the vector difference between the UAV's velocity vector and the ship's velocity vector in the same inertial coordinate system, representing the instantaneous velocity of the UAV relative to the ship. For example, from the above velocity vectors, the relative velocity vector is calculated as […] m/s, indicating that the drone is approaching the ship in a direction slightly to the right and downward relative to it.

[0037] The first included angle parameter is a numerical value representing the directional difference between the UAV's body normal vector and the ship's deck normal vector, obtained as the dot product of the two unit vectors. For example, suppose the body normal vector corresponding to the UAV's current attitude angles is […] and the ship's normal vector is […]; the first included angle parameter is then the dot product of the two vectors, […].
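Step S2 reduces to two vector differences and one dot product, which can be sketched as below. The function name and the numbers in the usage lines are hypothetical and are not the values from the patent's example; vectors are plain 3-tuples in a common inertial frame, and both normals are assumed to be unit vectors.

```python
def relative_state(uav_pos, uav_vel, ship_pos, ship_vel, uav_normal, ship_normal):
    """Return (relative position, relative velocity, included angle parameter)."""
    # Relative position points from the ship's center of mass to the UAV's.
    rel_pos = tuple(u - s for u, s in zip(uav_pos, ship_pos))
    # Relative velocity is the UAV velocity minus the ship velocity.
    rel_vel = tuple(u - s for u, s in zip(uav_vel, ship_vel))
    # Dot product of two unit normals equals the cosine of the angle between them.
    angle_param = sum(a * b for a, b in zip(uav_normal, ship_normal))
    return rel_pos, rel_vel, angle_param


# Hypothetical values chosen only to exercise the function.
rel_pos, rel_vel, angle_param = relative_state(
    uav_pos=(10.0, 5.0, -20.0), uav_vel=(3.0, -1.5, 0.5),
    ship_pos=(12.0, 4.0, 0.0), ship_vel=(1.0, 0.5, 0.0),
    uav_normal=(0.0, 0.0, -1.0), ship_normal=(0.0, 0.0, -1.0),
)
```

An included angle parameter of 1 means the two normals are parallel, 0 means they are perpendicular, and negative values mean the UAV is tilted past 90° relative to the deck.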

[0038] S3. Input the first state information, the second state information, the relative position vector, the relative velocity vector and the first included angle parameter into a pre-trained policy neural network, and output the desired acceleration command and desired yaw moment command in the body coordinate system through the policy neural network. The policy neural network is a neural network trained using the proximal policy optimization (PPO) algorithm.

[0039] Here, the policy neural network is an artificial neural network trained with the proximal policy optimization algorithm. It outputs high-level control commands based on the input state information and includes two parts: a policy network and a value network. For example, in the above-mentioned landing process, the policy neural network receives dozens of dimensions of state data, including the UAV's position vector, velocity vector, attitude angle, and angular velocity vector; the ship's position vector, velocity vector, attitude angle, and normal vector; the relative position vector; the relative velocity vector; and the first included angle parameter. After a forward pass, it outputs the desired acceleration command […] m/s² in the body coordinate system and the desired yaw moment command […] N·m.
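As a rough illustration of the "dozens of dimensions" of input, the listed quantities can be concatenated into a flat observation vector. The ordering, the function name, and the assumption that every vector quantity has exactly three components are choices made for this sketch only; under them the observation has 10 × 3 + 1 = 31 entries.

```python
def build_observation(uav_pos, uav_vel, uav_att, uav_rate,
                      ship_pos, ship_vel, ship_att, ship_normal,
                      rel_pos, rel_vel, angle_param):
    """Concatenate the state quantities into one flat list (assumed ordering)."""
    obs = []
    for vec in (uav_pos, uav_vel, uav_att, uav_rate,
                ship_pos, ship_vel, ship_att, ship_normal,
                rel_pos, rel_vel):
        if len(vec) != 3:
            raise ValueError("each vector quantity must have three components")
        obs.extend(float(c) for c in vec)
    # The included angle parameter is a scalar appended at the end.
    obs.append(float(angle_param))
    return obs


# Hypothetical all-zero state, used only to show the resulting dimension.
zero = (0.0, 0.0, 0.0)
observation = build_observation(zero, zero, zero, zero, zero, zero,
                                zero, zero, zero, zero, 1.0)
```

A fixed ordering matters more than the particular order chosen, since the trained network's input layer depends on each quantity always arriving at the same index.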

[0040] The desired acceleration command is the three-dimensional linear acceleration vector, output by the policy neural network, that the UAV is expected to achieve in the body coordinate system. This vector guides the UAV to accelerate or decelerate in a specified direction. For example, in the aforementioned desired acceleration command, the X-axis component of 1.2 m/s² indicates forward acceleration to reduce horizontal distance, while the Z-axis component of -2.8 m/s² indicates increased downward acceleration to reduce altitude. The desired yaw moment command is the torque value, output by the policy neural network, that is expected to be applied about the vertical axis of the UAV body. This torque adjusts the UAV's yaw angle to align it with the ship's deck. For example, the aforementioned desired yaw moment command of 0.15 N·m drives the body to generate a positive yaw rate, bringing the UAV's nose direction closer to the ship's bow direction.

[0041] Here, the proximal policy optimization algorithm is a policy-gradient reinforcement learning algorithm that uses a clipping function to limit the change in the probability ratio between the new and old policies, which stabilizes the training process. For example, when training the policy neural network, the state, action, and reward sequences of the drone's interaction with the environment are collected, the advantage function is calculated, and the objective function of the proximal policy optimization algorithm […] is then used to update the network parameters. The clipping parameter […] is set to 0.2; when the probability ratio of the new and old policies […] exceeds […], the gradient is clipped.
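The clipping idea can be illustrated with the standard PPO clipped surrogate objective, in which the probability ratio is clipped to the interval [1 − ε, 1 + ε] and the smaller of the clipped and unclipped terms is kept. The sketch below evaluates that objective for a batch; it is the textbook form rather than a formula recovered from the patent, and the function and argument names are hypothetical.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    if not advantages:
        raise ValueError("batch must not be empty")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and old policies for this sample.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # The minimum makes the objective a pessimistic bound on the update.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With ε = 0.2, a sample whose ratio has already moved above 1.2 with a positive advantage (or below 0.8 with a negative one) contributes a constant to the objective, so it no longer pushes the policy further in that direction.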

[0042] S4. Input the desired acceleration command and the desired yaw moment command into the lower-level controller, and generate thrust control commands acting on each rotor of the UAV through the lower-level controller to drive the UAV to move and complete the landing on the ship.

[0043] In this context, the lower-level controller is the feedback control loop located downstream of the policy neural network, responsible for converting the desired commands from the higher-level controller into thrust control commands for each rotor. For example, in this method, the lower-level controller employs a dual-loop proportional-derivative controller. The outer loop calculates the desired roll angle, desired pitch angle, and total thrust from the desired acceleration command. The inner loop calculates the motor thrust distribution from the deviation between the desired attitude angle and the current attitude angle. Thrust control commands are the target values of speed or thrust generated by the lower-level controller and sent to each rotor motor of the UAV, directly controlling the UAV's motion state. For example, if the lower-level controller calculates the thrusts of the four rotors to be 5.2 N, 5.6 N, 5.1 N, and 5.4 N respectively, the UAV adjusts the speed of each motor accordingly to generate the required resultant force and torque.
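The inner-loop idea, a proportional-derivative attitude term followed by thrust distribution, can be sketched for a quadrotor in an X configuration. Everything specific here is an assumption made for illustration: the rotor ordering (front-right, rear-right, rear-left, front-left), the spin directions, the gains, the lever arm `arm`, and the yaw coefficient `k_yaw` are hypothetical and not taken from the source.

```python
# Per-rotor sign patterns for an assumed X layout in a forward-right-down body frame.
SIGNS_ROLL = (-1.0, -1.0, 1.0, 1.0)
SIGNS_PITCH = (1.0, -1.0, -1.0, 1.0)
SIGNS_YAW = (1.0, -1.0, 1.0, -1.0)


def pd_torque(angle_des, angle, rate, kp=4.0, kd=0.8):
    """PD law on one attitude axis: proportional on error, damping on rate."""
    return kp * (angle_des - angle) - kd * rate


def wrench_from_thrusts(thrusts, arm=0.15, k_yaw=0.02):
    """Forward map: rotor thrusts -> (total thrust, roll, pitch, yaw torque)."""
    total = sum(thrusts)
    tau_roll = arm * sum(a * f for a, f in zip(SIGNS_ROLL, thrusts))
    tau_pitch = arm * sum(b * f for b, f in zip(SIGNS_PITCH, thrusts))
    tau_yaw = k_yaw * sum(s * f for s, f in zip(SIGNS_YAW, thrusts))
    return total, tau_roll, tau_pitch, tau_yaw


def allocate_thrust(total, tau_roll, tau_pitch, tau_yaw, arm=0.15, k_yaw=0.02):
    """Inverse map: the sign patterns are mutually orthogonal, so the inverse
    of the 4x4 mixing matrix is its scaled transpose."""
    return tuple(
        0.25 * (total + a * tau_roll / arm + b * tau_pitch / arm + s * tau_yaw / k_yaw)
        for a, b, s in zip(SIGNS_ROLL, SIGNS_PITCH, SIGNS_YAW)
    )


# Round trip using the four thrusts quoted in the example above.
wrench = wrench_from_thrusts((5.2, 5.6, 5.1, 5.4))
thrusts = allocate_thrust(*wrench)
```

A real controller would also saturate each thrust to the motor's limits and map thrust to motor speed; both steps are omitted here.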

[0044] The training process of the policy neural network includes calculating the reward value at each time step based on a reward function, which includes a projection distance reward, a pose reward, and a state alignment reward. The projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and it takes effect when the magnitude of the relative position vector is less than a first preset threshold. The pose reward is determined based on a second angle parameter between the normal vector of the UAV and the normal vector of the ship, and it takes effect when the magnitude is less than a second preset threshold. The state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.

[0045] Here, the reward function refers to a mathematical expression used to evaluate the quality of an agent's actions during reinforcement learning training, mapping states and actions to scalar reward values. For example, the reward function of this invention consists of a projection distance reward term, a pose reward term, an approach state alignment reward term, and a successful landing reward term, and the reward value for each time step is calculated as their weighted sum. The reward value for each time step refers to the single-step scalar reward calculated by the reward function from the state and action of the current time step within the simulation or actual control cycle. For example, if, within a certain control cycle, the UAV is close to the ship and its attitude is well aligned, so that the projection distance reward contributes 0.8 points, the pose reward contributes 0.6 points, and the approach state alignment reward contributes 0.5 points, then the reward value for that time step is 1.9 points.
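The weighted summation in the example above can be sketched as follows. The dictionary keys and the default of unit weights are illustrative assumptions.

```python
def step_reward(terms, weights=None):
    """Weighted sum of named reward terms for one time step.

    terms:   mapping from term name to its value at this time step
    weights: optional mapping from term name to weight (defaults to 1.0 each)
    """
    if weights is None:
        weights = {name: 1.0 for name in terms}
    return sum(weights[name] * value for name, value in terms.items())
```

For instance, `step_reward({"projection": 0.8, "pose": 0.6, "alignment": 0.5})` reproduces the 1.9 points of the example when every term is taken with unit weight.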

[0046] The projection distance reward term refers to the component of the reward function that encourages the UAV to approach the ship along the direction of the ship's normal vector, and it is activated only when the relative distance is less than the first preset threshold. For example, with the first preset threshold set to 3.0 m, the projection distance reward begins to take effect as the relative position vector magnitude $\|\mathbf{p}_r\|$ decreases from 3.5 m to 2.8 m. It is calculated as the product $R_{proj} = R_{p0}\,\left(|\mathbf{p}_r\cdot\mathbf{n}_s| / \|\mathbf{p}_r\|\right)^{\alpha_1}\,\sigma_1(\|\mathbf{p}_r\|)$, where $\mathbf{n}_s$ is the ship's unit normal vector, the base projection reward value $R_{p0}$ is 1.0, the first power exponent $\alpha_1$ is 1.5, and $\sigma_1$ is the first switching function with slope 5.0 and threshold $d_1 = 3.0$ m. The pose reward term refers to the component of the reward function used to align the UAV's body normal vector with the ship's deck normal vector, and it is activated only when the relative distance is less than the second preset threshold. For example, with the second preset threshold set to 2.0 m, the pose reward begins to take effect once the relative position vector magnitude is less than 2.0 m. It is calculated as the product $R_{pose} = R_{a0}\,(\mathbf{n}_u\cdot\mathbf{n}_s)^{\alpha_2}\,\sigma_2(\|\mathbf{p}_r\|)$, where $\mathbf{n}_u$ is the UAV's unit normal vector, the base pose reward value $R_{a0}$ is 1.0, $\sigma_2$ is the second switching function with threshold $d_2 = 2.0$ m, and the second power exponent $\alpha_2$ is 2.0. The approach state alignment reward term refers to the component of the reward function that encourages the relative velocity vector and the relative position vector to maintain consistent directions. For example, the alignment weight coefficient is set to 0.5, and the contribution of this term is evaluated for a dot product of 0.92 between the relative velocity vector and the relative position vector.

[0047] Here, the projection component refers to the scalar projection length of the relative position vector onto the direction of the ship's normal vector; for example, for a relative position vector $\mathbf{p}_r$ (in m) and a unit ship normal vector $\mathbf{n}_s$, the projection component is $\mathbf{p}_r\cdot\mathbf{n}_s$ (in m). The first preset threshold refers to a pre-set spatial distance threshold used to control when the projection distance reward term is activated; for example, the first preset threshold can be set to 3.0 m. When the relative position vector magnitude is greater than 3.0 m, the output of the S-shaped function in the projection distance reward term is close to 0, and this term contributes almost nothing to the reward.
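A minimal sketch of the scalar projection described above is given below. The function name and the numeric values in the usage note are hypothetical, chosen only for illustration.

```python
import math

def scalar_projection(p_rel, n_ship):
    """Signed length of p_rel along the direction of n_ship.

    n_ship does not need to be normalized; it is divided by its own length.
    """
    n_len = math.sqrt(sum(c * c for c in n_ship))
    return sum(p * n for p, n in zip(p_rel, n_ship)) / n_len
```

For a hypothetical relative position of (1.0, 2.0, 3.0) m and a deck normal pointing along the third axis, the projection component is 3.0 m regardless of the length of the normal vector supplied.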

[0048] The second included angle parameter refers to a numerical value that characterizes the directional difference between the UAV's body normal vector and the ship's deck normal vector, and is used in the calculation of the pose reward term in the reward function; for example, in the reward calculation, the second included angle parameter is taken as the dot product $\mathbf{n}_u\cdot\mathbf{n}_s$ of the UAV normal vector $\mathbf{n}_u$ and the ship normal vector $\mathbf{n}_s$. The second preset threshold refers to a pre-set spatial distance threshold used to control when the pose reward term is activated. For example, the second preset threshold can be set to 2.0 m. When the relative position vector magnitude is greater than 2.0 m, the output of the S-shaped function in the pose reward term approaches 0, and this term contributes little to the total reward.

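[0049] Directional consistency refers to the degree to which the relative velocity vector and the relative position vector align in direction, measured by the dot product of the normalized versions of the two vectors. For example, if the unit vector of the relative velocity vector is $\hat{\mathbf{v}}_r = \mathbf{v}_r / \|\mathbf{v}_r\|$ and the unit vector of the relative position vector is $\hat{\mathbf{p}}_r = \mathbf{p}_r / \|\mathbf{p}_r\|$, then the directional consistency is the dot product $\hat{\mathbf{v}}_r\cdot\hat{\mathbf{p}}_r$.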
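The normalized dot product described above can be sketched as follows. Returning 0.0 for a near-zero vector is an assumption added to keep the function well defined; the source does not say how that case is handled.

```python
import math

def direction_consistency(v_rel, p_rel, tiny=1e-9):
    """Dot product of the unit vectors of v_rel and p_rel, in [-1, 1]."""
    v_len = math.sqrt(sum(c * c for c in v_rel))
    p_len = math.sqrt(sum(c * c for c in p_rel))
    if v_len < tiny or p_len < tiny:
        return 0.0  # assumed fallback: direction undefined for a zero vector
    return sum(v * p for v, p in zip(v_rel, p_rel)) / (v_len * p_len)
```

A value of 1 means the two vectors point the same way, 0 means they are perpendicular, and -1 means they are opposed.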

[0050] The technical solution of this embodiment cascades a policy neural network trained by the proximal policy optimization algorithm with a lower-level controller. The state information of the UAV and the ship, the relative position and velocity vectors, and the angle parameter between the normal vectors are input into the policy network, which outputs the desired acceleration and yaw moment commands. The lower-level controller then converts these commands into rotor thrust control commands to drive the UAV to land on the ship. This addresses the chattering and decoupling difficulties of traditional sliding mode control, the high computational cost and insufficient real-time performance of model predictive control, and the dependence of backstepping methods on accurate models and large amounts of data. It improves the robustness, flexibility, and scalability of UAV autonomous landing control in complex sea conditions.

[0051] In one alternative approach, the position vector and velocity vector of the UAV in the inertial coordinate system are obtained by fusing the inertial measurement unit and the global positioning system, and the attitude angle and angular velocity vector in the body coordinate system are obtained by the inertial measurement unit.

[0052] The inertial measurement unit (IMU) refers to a sensor assembly installed inside the UAV's fuselage, used to measure three-axis acceleration and three-axis angular velocity in the body coordinate system; for example, the six-axis IMU on the UAV outputs the current angular velocity (in rad/s) and acceleration (in m/s²), and attitude calculation then yields a roll angle of 5.2°, a pitch angle of 8.7°, and a yaw angle of 32.1°. The Global Positioning System (GPS) refers to a satellite navigation system that calculates the receiver's three-dimensional position on the Earth's surface by receiving navigation satellite signals; for example, the GPS module on the UAV outputs latitude, longitude, and altitude data, which, after coordinate transformation, yield the position vector (in m) and the velocity vector (in m/s) in the inertial coordinate system.

[0053] In the above-mentioned optional methods, the position vector and velocity vector of the UAV in the inertial coordinate system are obtained by further fusing the inertial measurement unit and the global positioning system. At the same time, the attitude angle and angular velocity vector in the body coordinate system are calculated by using the inertial measurement unit. This realizes the multi-source fusion perception and accurate acquisition of UAV state information, and lays a reliable data foundation for subsequent relative pose calculation and policy network input.

[0054] In one alternative approach, the ship's position vector, velocity vector, attitude angle, and normal vector in the inertial coordinate system are obtained by fusing the inertial navigation system and the satellite positioning system on board the ship.

[0055] The inertial navigation system refers to autonomous navigation equipment installed on the ship that uses gyroscopes and accelerometers to measure the ship's motion parameters and continuously provides position, velocity, and attitude information through integration. For example, the ship's inertial navigation system outputs the ship's current position vector (in m) and velocity vector (in m/s) in the inertial coordinate system, together with the attitude angles: roll 2.1°, pitch 1.8°, yaw 30.0°. The satellite positioning system refers to the equipment on the ship used to receive signals from global navigation satellite systems to determine the ship's absolute position; for example, the ship's satellite positioning system receives GPS and BeiDou signals, calculates the raw position data, and then fuses it with the inertial navigation system data using Kalman filtering to obtain smooth and stable ship position and velocity vectors.
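The Kalman-filter fusion mentioned above can be illustrated with a one-dimensional sketch: the inertial data drive the prediction, and each satellite position fix corrects it. A real implementation would use a multi-dimensional state; the scalar model, the function names, and the noise parameters `q` and `r` are assumptions made for illustration.

```python
def kalman_predict(x, p, velocity, dt, q):
    """Propagate position estimate x and variance p using an inertial velocity."""
    return x + velocity * dt, p + q  # q: assumed process noise variance

def kalman_update(x_pred, p_pred, z, r):
    """Correct the prediction with a satellite position measurement z."""
    gain = p_pred / (p_pred + r)          # Kalman gain; r: measurement noise variance
    x = x_pred + gain * (z - x_pred)      # blend prediction and measurement
    p = (1.0 - gain) * p_pred             # reduced uncertainty after the update
    return x, p
```

When the predicted variance equals the measurement variance, the gain is 0.5 and the fused position lies midway between the prediction and the satellite fix.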

[0056] In the above-mentioned optional methods, the ship's position vector, velocity vector, attitude angle and normal vector in the inertial coordinate system are obtained by further integrating the ship's inertial navigation system and satellite positioning system. This enables real-time monitoring and accurate representation of the ship's motion state, ensuring the accuracy and synchronization of the relative state calculation between the UAV and the ship.

[0057] In one alternative embodiment, the policy neural network includes a policy network and a value network, wherein the policy network is used to output the desired acceleration command and the desired yaw moment command in the body coordinate system, and the value network is used to estimate state values to assist in updating the policy network.

[0058] The policy network refers to the part of the policy neural network responsible for outputting actions, directly generating control commands based on the input state. For example, the policy network receives a normalized, multi-dimensional state vector, propagates it forward through a three-layer fully connected network, and outputs a four-dimensional action vector corresponding to the three components of the desired acceleration and the desired yaw moment. The value network refers to the part of the policy neural network responsible for evaluating the quality of a state, outputting an estimate of the expected cumulative reward from the current state. For example, the value network shares some of the underlying feature extraction layers with the policy network and ultimately outputs a scalar value. If the output is 15.6 in a certain state, it means that, starting from that state and following the current policy, the expected total reward is approximately 15.6 points.
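The forward pass of a three-layer fully connected policy network with a four-dimensional output can be sketched in plain Python as below. The hidden width, tanh activations, weight initialization, and state dimension are assumptions; the source only states the number of layers and the output dimension.

```python
import math
import random

def dense(x, weights, biases, act=None):
    """One fully connected layer: act(W x + b)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [act(o) for o in out] if act else out

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_policy(state_dim, hidden=32, seed=0):
    rng = random.Random(seed)
    return [init_layer(state_dim, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, 4, rng)]

def policy_forward(layers, state):
    """Return (desired acceleration [3 components], desired yaw moment)."""
    h = dense(state, *layers[0], act=math.tanh)
    h = dense(h, *layers[1], act=math.tanh)
    out = dense(h, *layers[2])
    return out[:3], out[3]
```

A call such as `policy_forward(make_policy(10), state)` with a 10-element normalized state returns the three acceleration components and the yaw moment that would be passed to the lower-level controller.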

[0059] In the above-mentioned optional approach, the policy neural network is further divided into a policy network and a value network. The policy network is responsible for outputting the desired acceleration command and the desired yaw moment command, while the value network is used to estimate the state value to assist in updating the policy network. This lets policy optimization and value evaluation work together, improving the stability and convergence efficiency of the training process.

[0060] In one alternative approach, the projection distance reward is composed of a base projection reward value, a power of the ratio of the length of the projection component of the relative position vector in the direction of the ship's normal vector to the magnitude of the relative position vector, and a first switching function, wherein the power is a first exponent.

[0061] Here, the first power exponent refers to the exponent applied to the ratio of the projection component length to the magnitude in the calculation of the projection distance reward; for example, when the first power exponent is 1.5 and the ratio increases from 0.6 to 0.8, the value after exponentiation increases from approximately 0.46 to approximately 0.72.

[0062] In the above-mentioned optional approach, the projection distance reward term is further constructed by multiplying the base projection reward value, the ratio of the projection component length to the relative position vector magnitude raised to the first power exponent, and the first switching function. This provides a refined reward signal for the approach phase, guiding the UAV to maintain the optimal descent trajectory as it approaches the ship.

[0063] In one alternative approach, the first switching function is an S-shaped function with the magnitude of the relative position vector and the first preset threshold as variables; when the magnitude is greater than the first preset threshold, the output of the first switching function approaches zero, and when the magnitude is less than the first preset threshold, the output of the first switching function approaches one.

[0064] In the above-mentioned optional methods, an S-shaped function with the relative position vector magnitude and a first preset threshold as variables is further adopted as the first switching function. When the magnitude is greater than the first preset threshold, the output approaches zero, and when it is less than the first preset threshold, the output approaches one, thereby realizing smooth start and stop control of the projection distance reward item and avoiding sudden changes and jitter in the reward signal.

[0065] Here, the S-shaped function refers to a mathematical function with smooth step characteristics, whose output value changes continuously between 0 and 1, and which is used as a switch for a reward term; for example, taking the relative position vector magnitude $d$ as the variable, with the first preset threshold $d_1 = 3.0$ m and the slope parameter $k = 5.0$, the S-shaped function is expressed as $\sigma_1(d) = 1 / (1 + e^{k(d - d_1)})$. When the magnitude is 3.5 m, the output is approximately 0.08, and when the magnitude is 2.8 m, the output is approximately 0.73.
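The switching behaviour and its use in the projection distance reward can be sketched as follows. The logistic form of the switch is an assumption consistent with the stated threshold of 3.0 m and slope of 5.0; the function names and the handling of a zero-length relative position vector are also assumptions.

```python
import math

def smooth_switch(distance, threshold, slope=5.0):
    """Logistic switch: near 1 below the threshold, near 0 above it."""
    z = slope * (threshold - distance)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)            # numerically stable branch for large distances
    return e / (1.0 + e)

def projection_reward(p_rel, n_ship, base=1.0, exponent=1.5, threshold=3.0, slope=5.0):
    """base * (|p . n| / (|p| |n|)) ** exponent * switch(|p|)."""
    d = math.sqrt(sum(c * c for c in p_rel))
    n_len = math.sqrt(sum(c * c for c in n_ship))
    if d == 0.0:
        ratio = 1.0            # assumed: treat touchdown as perfectly aligned
    else:
        ratio = abs(sum(p * n for p, n in zip(p_rel, n_ship))) / (d * n_len)
    return base * ratio ** exponent * smooth_switch(d, threshold, slope)
```

An approach straight down the deck normal keeps the ratio at 1 and earns nearly the full base reward once inside the threshold, while a purely lateral approach earns nothing from this term.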

[0066] In one alternative approach, the second included angle parameter is the dot product of the normal vector of the UAV and the normal vector of the ship, and the pose reward term is composed of the product of the base pose reward value, the power of the second included angle parameter, and the second switching function, wherein the power is the second power exponent.

[0067] Here, the second power exponent refers to the exponent applied to the second included angle parameter in the pose reward calculation; for example, with the second power exponent set to 2.0, if the second included angle parameter is 0.98, its square is 0.9604 and the reward value is only slightly reduced; if the second included angle parameter decreases to 0.90, its square is 0.81 and the reward value is significantly reduced.

[0068] In the above-mentioned optional approach, the dot product of the UAV's normal vector and the ship's normal vector is further used as the second included angle parameter. The pose reward term is formed by multiplying the base pose reward value, the second included angle parameter raised to the second power exponent, and the second switching function. This gives a quantitative reward mechanism for the attitude alignment stage and promotes precise alignment of the UAV's landing attitude with the deck plane.
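The pose reward described above can be sketched as below. Two choices are assumptions not stated in the source: the second switching function reuses a logistic form with slope 5.0, and a negative dot product is clamped to zero so that an inverted attitude is not rewarded when the exponent is even.

```python
import math

def pose_reward(n_uav, n_ship, distance, base=1.0, exponent=2.0,
                threshold=2.0, slope=5.0):
    """base * (n_uav . n_ship) ** exponent * switch(distance)."""
    dot = sum(a * b for a, b in zip(n_uav, n_ship))
    norm = math.sqrt(sum(a * a for a in n_uav)) * math.sqrt(sum(b * b for b in n_ship))
    cos_angle = max(0.0, dot / norm)       # assumed clamp for inverted attitudes
    z = slope * (threshold - distance)     # logistic switch, numerically stable
    if z >= 0.0:
        gate = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        gate = e / (1.0 + e)
    return base * cos_angle ** exponent * gate
```

Close to the deck, a dot product of 0.98 yields roughly 0.96 of the base reward, while 0.90 yields roughly 0.81, so small attitude errors are penalized progressively.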

[0069] In the specific implementation of this invention, constructing a high-fidelity simulation environment and conducting reinforcement learning training are the core prerequisites for realizing autonomous ship landing control of unmanned aerial vehicles (UAVs). The following provides a detailed explanation covering quadrotor UAV dynamics modeling, ship sway modeling, and a reinforcement learning planning method based on geometric-alignment proximal policy optimization.

[0070] For ease of description, $\|\cdot\|$ denotes the Euclidean norm of a vector and the induced norm of a matrix. Vectors and matrices are both written in bold to distinguish them from scalars.

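[0071] First, dynamic modeling of the quadrotor UAV: Two reference coordinate systems are introduced: an inertial coordinate system fixed to the Earth, with coordinate axes denoted as $\{\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3\}$, and a body coordinate system fixed at the center of gravity of the quadrotor, with coordinate axes denoted as $\{\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3\}$. The position vector of the quadrotor's center of mass in the inertial coordinate system is expressed as $\mathbf{p} \in \mathbb{R}^3$, and the velocity vector as $\mathbf{v} \in \mathbb{R}^3$. The rotation matrix $\mathbf{R} \in SO(3)$ maps the body coordinate system to the inertial coordinate system. The angular velocity vector in the body coordinate system is expressed as $\boldsymbol{\omega} \in \mathbb{R}^3$. The quadrotor's dynamic equations consist of the translational kinematics $\dot{\mathbf{p}} = \mathbf{v}$; the translational dynamics, in which $m\dot{\mathbf{v}}$ is determined by gravity $mg$ acting along $\mathbf{e}_3$, the thrust $f$ acting along $\mathbf{b}_3$, and the disturbance $\mathbf{d}$; the attitude kinematics $\dot{\mathbf{R}} = \mathbf{R}[\boldsymbol{\omega}]_\times$; and the rotational dynamics $\mathbf{J}\dot{\boldsymbol{\omega}} = -\boldsymbol{\omega} \times \mathbf{J}\boldsymbol{\omega} + \boldsymbol{\tau}$. Here, $m$ and $g$ represent the mass of the quadrotor UAV and the gravitational acceleration, respectively; $\mathbf{e}_3$ and $\mathbf{b}_3$ are the third unit vectors of the inertial coordinate system and the body coordinate system, respectively; $f$ denotes the thrust along the $\mathbf{b}_3$ direction of the body coordinate system; $\mathbf{d}$ represents the disturbance acting in the inertial coordinate system; $[\cdot]_\times$ is the skew-symmetric matrix satisfying $[\mathbf{a}]_\times\mathbf{b} = \mathbf{a} \times \mathbf{b}$; $\mathbf{J}$ is the inertia matrix; and $\boldsymbol{\tau}$ is the torque in the body coordinate system.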
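The skew-symmetric matrix used in the dynamics above is the one whose product with a vector equals a cross product. A minimal sketch, with hypothetical helper names, is:

```python
def hat(a):
    """Skew-symmetric matrix [a]x such that hat(a) @ b equals a x b."""
    ax, ay, az = a
    return [[0.0, -az, ay],
            [az, 0.0, -ax],
            [-ay, ax, 0.0]]

def matvec(m, v):
    """Product of a 3x3 matrix (list of rows) and a 3-vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]
```

Writing the cross product as a matrix product is what allows the attitude kinematics to be expressed as a single matrix equation in the rotation matrix.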

[0072] Second, ship sway modeling: The swaying motion of a ship under complex sea conditions is simulated in two stages: the first stage constructs a wave model, and the second stage solves for the ship's motion response based on the wave model. The wave simulation employs a spectral method based on the fast Fourier transform, combined with a nonlinear displacement model to simulate wave crest breaking and foam generation. The sea surface height field is modeled by linear wave superposition, and the initial complex amplitude spectrum is built from the JONSWAP spectrum and a directional spreading function. The quantities entering this spectrum are the wavenumber vector, the dispersion-relation frequency for the given water depth, complex Gaussian random numbers generated using the Box-Muller transform, and a group velocity term. At any given time, the height field is calculated using the inverse fast Fourier transform. A horizontal displacement component, derived from the gradient of the vertical displacement, is introduced to construct a parametric surface. Wave crest breaking is detected via the Jacobian determinant of this surface: when the determinant meets the folding condition, the waveform folds. The foam intensity is then generated and decays according to the breaking threshold, the generation attenuation rate, and the dissipation rate.
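Two ingredients of the spectral wave model above can be sketched independently of the full FFT pipeline: the dispersion-relation frequency for a given water depth and a Box-Muller draw of a complex Gaussian amplitude. The finite-depth linear dispersion relation $\omega^2 = gk\tanh(kh)$ is assumed here as the standard form; the function names are hypothetical.

```python
import math

def dispersion_frequency(k, depth, g=9.81):
    """Angular frequency [rad/s] for wavenumber k [rad/m] at water depth [m]."""
    return math.sqrt(g * k * math.tanh(k * depth))

def complex_gaussian(u1, u2):
    """Box-Muller transform: two uniforms in (0, 1] x [0, 1) -> complex normal sample.

    The real and imaginary parts are independent standard normal variables.
    """
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return complex(radius * math.cos(angle), radius * math.sin(angle))
```

In deep water the hyperbolic tangent approaches 1 and the frequency reduces to the square root of $gk$, which is a convenient sanity check for the depth-dependent form.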

[0073] When solving for the ship's motion, the bounding box of the ship's collider is discretized into a grid of voxels, each with a fixed volume. For each voxel center point, the real-time sea surface height is obtained, and the local immersion depth and immersion factor are defined from it. The buoyancy acting on each voxel is computed from its volume and immersion factor, and the total buoyancy and the restoring torque are obtained by summing the voxel contributions, with the torque taken about the location of the ship's center of gravity.
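The voxel summation above can be sketched as follows. The specific choices here are assumptions rather than the disclosed formulas: cubic voxels of edge length `edge`, an immersion factor equal to the submerged fraction of the voxel height clamped to [0, 1], Archimedes buoyancy acting vertically upward, and sea water density 1025 kg/m³.

```python
def buoyancy(voxel_centers, sea_height, edge, center_of_gravity, rho=1025.0, g=9.81):
    """Total upward buoyancy [N] and restoring torque [N*m] about the center of gravity.

    voxel_centers: iterable of (x, y, z) voxel center points
    sea_height:    callable (x, y) -> current sea surface height at that point
    """
    volume = edge ** 3
    total = 0.0
    torque = [0.0, 0.0, 0.0]
    for x, y, z in voxel_centers:
        depth = sea_height(x, y) - (z - edge / 2.0)    # water level above voxel bottom
        factor = min(1.0, max(0.0, depth / edge))      # immersion factor in [0, 1]
        force = rho * g * volume * factor              # acts along +z
        total += force
        rx, ry = x - center_of_gravity[0], y - center_of_gravity[1]
        torque[0] += ry * force                        # r x (0, 0, F), x component
        torque[1] -= rx * force                        # r x (0, 0, F), y component
    return total, torque
```

Voxels on the side of the hull that is deeper in the water contribute more force, so the summed torque tends to rotate the ship back toward level, which is the restoring effect described above.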

[0074] Based on an empirical damping model, the linear damping and the angular damping are defined using the immersion volume fraction. The corresponding damping force and damping torque are then computed from these damping terms, the motion of the ship's center of gravity, and the angular velocity of the ship's rotation about its center of mass.

[0075] The equations of motion of the ship relate its translational and rotational accelerations to the forces and torques described above, and are parameterized by the mass of the ship and the ship's inertia matrix.

[0076] Third, a reinforcement learning planning method based on geometric-alignment proximal policy optimization: The reinforcement learning problem is formulated as a Markov decision process $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the state transition probability, $R$ is the reward function, and $\gamma$ is the discount factor.

[0077] The state space input contains the state information of the UAV and the ship. The action space output is set to $\mathbf{a}_t = [\mathbf{a}_d^\top,\ \tau_{z,d}]^\top \in \mathbb{R}^4$, where $\mathbf{a}_d \in \mathbb{R}^3$ represents the desired acceleration command in the body coordinate system and $\tau_{z,d}$ represents the desired yaw moment command.

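[0078] The proximal policy optimization algorithm belongs to the Actor-Critic framework; it uses a policy network to represent the policy function $\pi_\theta(a \mid s)$ and a value network to represent the value function $V_\phi(s)$. Under policy $\pi$, the cumulative expected reward is expressed as $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. The state-value function is defined as $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$. The action-value function is defined as $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$. The advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. Let $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ denote the ratio of the probabilities of the new and old policies. The objective function of the proximal policy optimization algorithm is $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$, where $\epsilon$ is the truncation hyperparameter. The policy gradient is $\nabla_\theta L^{CLIP}(\theta)$, and the policy network parameters are updated as $\theta \leftarrow \theta + \eta \nabla_\theta L^{CLIP}(\theta)$, where $\eta$ is the learning rate. The objective function of the value network is $L^{V}(\phi) = \mathbb{E}_t\left[(V_\phi(s_t) - y_t)^2\right]$, where the temporal-difference target value is expressed as $y_t = r_t + \gamma V_\phi(s_{t+1})$.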
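The value-network side of the update can be sketched as follows: one-step temporal-difference targets, a one-step advantage estimate, and the mean squared value loss. The discount factor of 0.99, the one-step advantage estimator, and the handling of terminal steps are assumptions; the source does not state which advantage estimator is used.

```python
def td_targets(rewards, next_values, dones, gamma=0.99):
    """y_t = r_t + gamma * V(s_{t+1}), with no bootstrap after a terminal step."""
    return [r + (0.0 if done else gamma * nv)
            for r, nv, done in zip(rewards, next_values, dones)]

def one_step_advantages(targets, values):
    """A_t = y_t - V(s_t), a one-step estimate of the advantage function."""
    return [y - v for y, v in zip(targets, values)]

def value_loss(values, targets):
    """Mean squared error between value predictions and TD targets."""
    return sum((v - y) ** 2 for v, y in zip(values, targets)) / len(values)
```

The advantages computed this way would feed the clipped policy objective, while the value loss is minimized separately to keep the critic's estimates close to the targets.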

[0079] To prevent the policy from becoming trapped in local optima, a reward structure based on physical priors is designed. A projection distance reward term is introduced to construct an anisotropic reward field, in which a larger reward is given only when the UAV approaches along the ship's normal vector. A switching function $\sigma_1(d) = 1 / (1 + e^{k(d - d_1)})$ is designed, where $d = \|\mathbf{p}_r\|$ denotes the distance between the UAV and the ship, $k$ is the slope parameter, and $d_1$ represents the first preset threshold. The projection distance reward is described as $R_{proj} = R_{p0}\,\left(|\mathbf{p}_r\cdot\mathbf{n}_s| / \|\mathbf{p}_r\|\right)^{\alpha_1}\,\sigma_1(d)$, where $\mathbf{p}_r$ represents the relative position vector of the UAV with respect to the ship, $\mathbf{n}_s$ represents the ship's normal vector, $R_{p0}$ is the base projection reward value, and $\alpha_1$ is the first power exponent.

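[0080] The pose reward is used to achieve phased control, with rapid approach at long range and fine alignment at close range, and is described as $R_{pose} = R_{a0}\,(\mathbf{n}_u\cdot\mathbf{n}_s)^{\alpha_2}\,\sigma_2(d)$, where $\mathbf{n}_u$ represents the normal vector of the UAV, $\mathbf{n}_s$ represents the ship's normal vector, $R_{a0}$ is the base pose reward value, $\sigma_2$ is the second switching function of the UAV-to-ship distance $d$ with the second preset threshold $d_2$, and $\alpha_2$ is the second power exponent.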

[0081] To maintain control stability while retaining maneuverability, a runtime penalty term is introduced. It is defined in terms of the UAV's velocity vector $\mathbf{v}$, the UAV's angular velocity vector $\boldsymbol{\omega}$, the upper limits for velocity and angular velocity, and two penalty coefficients.

[0082] The approach state alignment reward encourages the relative velocity vector and the relative position vector to be collinear and to point in the same direction. It is expressed in terms of the relative position vector, the relative velocity vector, and the alignment weight coefficient.

[0083] A sparse reward is given upon successful landing. It is defined in terms of the relative angular velocity vector, the base value of the optimal landing reward, the weighting coefficients of the individual components, and a smoothing parameter.

[0084] Through the above combination of reward functions, the policy neural network learns a safe and efficient landing strategy during training, thereby achieving highly robust autonomous landing control in complex sea conditions.

[0085] Figure 2 shows a schematic diagram of an embodiment of a reinforcement learning-based autonomous carrier landing control system 200 for unmanned aerial vehicles provided by the present invention. As shown in Figure 2, the reinforcement learning-based autonomous carrier landing control system 200 for unmanned aerial vehicles includes the following modules. The acquisition module 201 is used to acquire first state information of the UAV and second state information of the ship. The first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system, and the attitude angle and angular velocity vector in the body coordinate system. The second state information includes the position vector, velocity vector, attitude angle, and normal vector of the ship in the inertial coordinate system. The determining module 202 is used to determine the relative position vector and relative velocity vector of the UAV relative to the ship based on the first state information and the second state information, and to determine the first included angle parameter between the normal vector of the UAV and the normal vector of the ship. The generation module 203 is used to input the first state information, the second state information, the relative position vector, the relative velocity vector, and the first included angle parameter into a pre-trained policy neural network, and to output the desired acceleration command and desired yaw moment command in the body coordinate system through the policy neural network. The policy neural network is a neural network trained using the proximal policy optimization algorithm. The control module 204 is used to input the desired acceleration command and the desired yaw moment command into the lower-level controller, and to generate thrust control commands acting on each rotor of the UAV through the lower-level controller, so as to drive the UAV to move and complete the landing on the ship.
The training process of the policy neural network includes calculating the reward value at each time step based on a reward function, which includes a projection distance reward, a pose reward, and a state alignment reward. The projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and it takes effect when the magnitude of the relative position vector is less than a first preset threshold. The pose reward is determined based on a second angle parameter between the normal vector of the UAV and the normal vector of the ship, and it takes effect when the magnitude is less than a second preset threshold. The state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.

[0086] In one alternative approach, the position vector and velocity vector of the UAV in the inertial coordinate system are obtained by fusing the inertial measurement unit and the global positioning system, and the attitude angle and angular velocity vector in the body coordinate system are obtained by the inertial measurement unit.

[0087] In one alternative approach, the ship's position vector, velocity vector, attitude angle, and normal vector in the inertial coordinate system are obtained by fusing the inertial navigation system and the satellite positioning system on board the ship.

[0088] In one alternative embodiment, the policy neural network includes a policy network and a value network, wherein the policy network is used to output the desired acceleration command and the desired yaw moment command in the body coordinate system, and the value network is used to estimate state values to assist in updating the policy network.

[0089] In one alternative approach, the projection distance reward is composed of a base projection reward value, a power of the ratio of the length of the projection component of the relative position vector in the direction of the ship's normal vector to the magnitude of the relative position vector, and a first switching function, wherein the power is a first exponent.

[0090] In one alternative approach, the first switching function is an S-shaped function with the magnitude of the relative position vector and the first preset threshold as variables; when the magnitude is greater than the first preset threshold, the output of the first switching function approaches zero, and when the magnitude is less than the first preset threshold, the output of the first switching function approaches one.

[0091] In one alternative approach, the second included angle parameter is the dot product of the normal vector of the UAV and the normal vector of the ship, and the pose reward term is composed of the product of the base pose reward value, the power of the second included angle parameter, and the second switching function, wherein the power is the second power exponent.

[0092] It should be noted that the beneficial effects of the reinforcement learning-based UAV autonomous carrier landing control system 200 provided in the above embodiments are the same as those of the reinforcement learning-based UAV autonomous carrier landing control method described above, and will not be repeated here. Furthermore, the system provided in the above embodiments is only illustrated by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the system can be divided into different functional modules according to the actual situation to complete all or part of the functions described above. In addition, the system and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process is detailed in the method embodiments, and will not be repeated here.

[0093] The reinforcement learning-based autonomous carrier landing control system 200 of the present invention can be a computer program (including program code) running on a computer device. For example, the reinforcement learning-based autonomous carrier landing control system 200 of the present invention is an application software that can be used to execute the corresponding steps in the reinforcement learning-based autonomous carrier landing control method of the present invention.

[0094] In some embodiments, the reinforcement learning-based autonomous carrier landing control system 200 of the present invention can be implemented in a combination of hardware and software. As an example, the reinforcement learning-based autonomous carrier landing control system 200 of the present invention can be a processor in the form of a hardware decoding processor, which is programmed to execute the reinforcement learning-based autonomous carrier landing control method of the present invention. For example, the processor in the form of a hardware decoding processor can be one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.

[0095] The modules described in the embodiments of this invention can be implemented in software or hardware. In some cases, the names of the modules do not limit the modules themselves.

[0096] An electronic device according to an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements any of the above-mentioned reinforcement learning-based autonomous carrier landing control methods for unmanned aerial vehicles. That is, an electronic device according to an embodiment of the present invention may include, but is not limited to: a processor and a memory; the memory is used to store the computer program; the processor is used to execute the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles shown in any embodiment of the present invention by calling the computer program.

[0097] In one alternative embodiment, an electronic device is provided, as shown in Figure 3. The electronic device 4000 shown in Figure 3 includes a processor 4001 and a memory 4003. The processor 4001 and the memory 4003 are connected, for example, via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which can be used for data interaction between the electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present invention.

[0098] Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. It can implement or execute the various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this invention. Processor 4001 may also be a combination that implements computational functions, such as including one or more microprocessor combinations, a combination of a DSP and a microprocessor, etc.

[0099] Bus 4002 may include a path for transmitting information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. Bus 4002 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus 4002 is represented in Figure 3 by only one thick line, but this does not mean that there is only one bus or one type of bus.

[0100] The memory 4003 may be ROM (Read Only Memory) or another type of static storage device capable of storing static information and instructions, RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory), CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.

[0101] The memory 4003 stores application code (computer program) for executing the present invention, and its execution is controlled by the processor 4001. The processor 4001 executes the application code stored in the memory 4003 to implement the content shown in the foregoing method embodiments.

[0102] The electronic device may also be a terminal device. A terminal device can be any terminal device on which applications can be installed and which can access web pages through those applications, including at least one of a smartphone, a tablet, a laptop, a desktop computer, a smart speaker, a smartwatch, a smart TV, and a smart in-vehicle device.

[0103] It should be noted that the electronic device shown in Figure 3 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.

[0104] An embodiment of the present invention provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements any of the above-mentioned reinforcement learning-based autonomous carrier landing control methods for unmanned aerial vehicles.

[0105] Optionally, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), magnetic tape, a floppy disk, an optical data storage device, etc.

[0106] In an exemplary embodiment, a computer program product or computer program is also provided, which includes computer instructions stored in a computer-readable storage medium. The processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the electronic device to perform the aforementioned reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles.

[0107] Computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0108] It should be understood that the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0109] The computer-readable storage medium provided in this invention can be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0110] The aforementioned computer-readable storage medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.

[0111] The above description is merely a preferred embodiment of the present invention and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure in this invention is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-disclosed concept. One example is a technical solution formed by mutually substituting the above features with technical features of similar function disclosed in (but not limited to) this invention.

[0112] It should be noted that the terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and do not represent a limitation on a specific order or sequence. Where appropriate, the order of use for similar objects can be interchanged so that the embodiments of this application described herein can be implemented in an order other than that shown or described.

[0113] Those skilled in the art will recognize that this invention can be implemented as a system, method, or computer program product. Therefore, this invention can be specifically implemented in the following forms: it can be entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software, generally referred to herein as a "circuit," "module," or "system." Furthermore, in some embodiments, this invention can also be implemented as a computer program product contained in one or more computer-readable media, which includes computer-readable program code.

[0114] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning, characterized in that the method comprises:
acquiring first state information of the UAV and second state information of the ship, wherein the first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system and the attitude angle and angular velocity vector of the UAV in the body coordinate system, and the second state information includes the position vector, velocity vector, attitude angle, and normal vector of the ship in the inertial coordinate system;
determining, based on the first state information and the second state information, the relative position vector and relative velocity vector of the UAV relative to the ship, and determining a first included-angle parameter between the normal vector of the UAV and the normal vector of the ship;
inputting the first state information, the second state information, the relative position vector, the relative velocity vector, and the first included-angle parameter into a pre-trained policy neural network, and outputting, through the policy neural network, a desired acceleration command and a desired yaw moment command in the body coordinate system, wherein the policy neural network is a neural network trained using a proximal policy optimization (PPO) algorithm;
inputting the desired acceleration command and the desired yaw moment command into a lower-level controller, and generating, through the lower-level controller, thrust control commands that act on each rotor of the UAV, so as to drive the UAV to move and complete the landing on the ship;
wherein the training process of the policy neural network includes calculating a reward value at each time step based on a reward function, the reward function including a projection distance reward, a pose reward, and a state alignment reward;
the projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and takes effect when the magnitude of the relative position vector is less than a first preset threshold;
the pose reward is determined based on a second included-angle parameter between the normal vector of the UAV and the normal vector of the ship, and takes effect when that magnitude is less than a second preset threshold; and
the state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.
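The determination step and the state alignment reward of claim 1 can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the claimed implementation: the function names are hypothetical, the relative vectors are taken as UAV minus ship, the first included-angle parameter is taken as the cosine between the two normals, and "consistency of directions" is read as the cosine between the relative velocity and the direction from the UAV toward the ship.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3-vectors."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def relative_features(p_uav, v_uav, n_uav, p_ship, v_ship, n_ship):
    """Return (relative position, relative velocity, included-angle parameter).

    Assumption: relative quantities are UAV minus ship, and the angle
    parameter is the cosine of the angle between the two normal vectors.
    """
    rel_pos = sub(p_uav, p_ship)
    rel_vel = sub(v_uav, v_ship)
    cos_angle = dot(n_uav, n_ship) / (norm(n_uav) * norm(n_ship))
    return rel_pos, rel_vel, cos_angle

def state_alignment(rel_pos, rel_vel, eps=1e-9):
    """Assumed alignment measure in [-1, 1].

    +1 when the relative velocity points straight at the ship
    (i.e. opposite to rel_pos), -1 when it points straight away.
    """
    denom = norm(rel_pos) * norm(rel_vel)
    if denom < eps:
        return 0.0  # undefined direction: treat as neutral
    toward_ship = tuple(-x for x in rel_pos)
    return dot(rel_vel, toward_ship) / denom
```

For a UAV hovering 10 m above the deck origin and descending at 1 m/s with both normals vertical, this sketch gives an included-angle parameter of 1 and an alignment value of 1.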

2. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 1, characterized in that the position vector and velocity vector of the UAV in the inertial coordinate system are obtained by fusing data from the inertial measurement unit and the global positioning system, and the attitude angle and angular velocity vector in the body coordinate system are obtained from the inertial measurement unit.

3. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 2, characterized in that the ship's position vector, velocity vector, attitude angle, and normal vector in the inertial coordinate system are obtained by fusing data from the inertial navigation system and the satellite positioning system on board the ship.

4. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 1, characterized in that the policy neural network includes a policy network and a value network; the policy network is used to output the desired acceleration command and the desired yaw moment command in the body coordinate system, and the value network is used to estimate the state value to assist the updating of the policy network.
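The policy/value split of claim 4 can be sketched as two small multilayer perceptrons. This is an illustrative sketch only: the class and function names, the single hidden layer of 64 tanh units, and the initialization are hypothetical, and the observation size of 31 assumes 3-component vectors and three attitude angles (12 UAV values, 12 ship values, 3 + 3 relative values, and 1 angle parameter). The four outputs correspond to a 3-axis desired acceleration plus one yaw moment. Training with proximal policy optimization is not shown.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Dense layer with small uniform random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layer, x, activate):
    """Apply one dense layer, optionally followed by tanh."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [math.tanh(v) for v in out] if activate else out

class ActorCritic:
    OBS_DIM = 31  # assumed observation size (see lead-in)
    ACT_DIM = 4   # desired acceleration (3) + desired yaw moment (1)

    def __init__(self, hidden=64, seed=0):
        rng = random.Random(seed)
        # Policy network: observation -> mean of the action distribution.
        self.actor = [make_layer(self.OBS_DIM, hidden, rng),
                      make_layer(hidden, self.ACT_DIM, rng)]
        # Value network: observation -> scalar state-value estimate.
        self.critic = [make_layer(self.OBS_DIM, hidden, rng),
                       make_layer(hidden, 1, rng)]

    def act_mean(self, obs):
        h = forward(self.actor[0], obs, True)
        return forward(self.actor[1], h, False)

    def value(self, obs):
        h = forward(self.critic[0], obs, True)
        return forward(self.critic[1], h, False)[0]
```

In an actor-critic setup of this kind, only the policy network is needed on board at landing time; the value network serves training.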

5. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 1, characterized in that the projection distance reward is composed of a base projection reward value, a power of the ratio of the length of the projection component of the relative position vector along the ship's normal vector to the magnitude of the relative position vector, and a first switching function, wherein the exponent of the power is a first exponent.

6. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 5, characterized in that the first switching function is a sigmoid (S-shaped) function with the magnitude of the relative position vector and the first preset threshold as variables; when the magnitude is greater than the first preset threshold, the output of the first switching function approaches zero, and when the magnitude is less than the first preset threshold, the output of the first switching function approaches one.
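The projection distance reward of claims 5 and 6 can be sketched as follows. The sketch rests on several assumptions: the three factors are multiplied together (claim 5 lists them without naming the operator), the S-shaped function is a logistic function, and the `sharpness` parameter, the default values, and the function names are hypothetical.

```python
import math

def sigmoid_switch(distance, threshold, sharpness=10.0):
    """Smooth switch: near 1 when distance < threshold, near 0 when above.

    Uses a numerically stable logistic form; `sharpness` is a hypothetical
    parameter controlling how abrupt the transition is.
    """
    z = sharpness * (threshold - distance)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def projection_distance_reward(rel_pos, ship_normal, base=1.0, exponent=2.0,
                               threshold=5.0):
    """Assumed form: base * (|projection| / |rel_pos|) ** exponent * switch."""
    dist = math.sqrt(sum(x * x for x in rel_pos))
    if dist == 0.0:
        ratio = 1.0  # assumption: zero offset counts as fully aligned
    else:
        n_len = math.sqrt(sum(x * x for x in ship_normal))
        proj = abs(sum(p * n for p, n in zip(rel_pos, ship_normal))) / n_len
        ratio = proj / dist
    return base * ratio ** exponent * sigmoid_switch(dist, threshold)
```

Under these assumptions the ratio equals 1 when the UAV sits directly along the deck normal and 0 when it is level with the deck off to one side, so the term favors approaching from directly above once the UAV is inside the threshold distance.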

7. The method for autonomous carrier landing control of unmanned aerial vehicles based on reinforcement learning according to claim 1, characterized in that the second included-angle parameter is the dot product of the normal vector of the UAV and the normal vector of the ship; the pose reward is the product of a base pose reward value, a power of the second included-angle parameter, and a second switching function, wherein the exponent of the power is a second exponent.
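The pose reward of claim 7 can be sketched in the same spirit. The claim does not give the form of the second switching function, so a logistic switch on the relative distance is assumed here; clamping the dot product to [0, 1] (so that a fractional exponent stays real-valued), the unit-length normals, the default values, and the function name are likewise assumptions.

```python
import math

def pose_reward(n_uav, n_ship, distance, base=1.0, exponent=2.0,
                threshold=2.0, sharpness=10.0):
    """Assumed form: base * (n_uav . n_ship) ** exponent * switch(distance).

    Both normals are assumed to be unit vectors.
    """
    cos_angle = sum(a * b for a, b in zip(n_uav, n_ship))
    # Assumption: clamp so an inverted attitude earns no reward and the
    # power is always defined for non-integer exponents.
    cos_angle = max(0.0, min(1.0, cos_angle))
    # Assumed logistic switch; the argument is clipped to avoid overflow.
    z = max(-60.0, min(60.0, sharpness * (threshold - distance)))
    switch = 1.0 / (1.0 + math.exp(-z))
    return base * cos_angle ** exponent * switch
```

With this form, the term only matters close to the deck, where matching the UAV attitude to the deck attitude is what determines a clean touchdown.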

8. A reinforcement learning-based autonomous carrier landing control system for unmanned aerial vehicles, characterized in that the system comprises:
an acquisition module, used to acquire first state information of the UAV and second state information of the ship, wherein the first state information includes the position vector and velocity vector of the UAV in the inertial coordinate system and the attitude angle and angular velocity vector of the UAV in the body coordinate system, and the second state information includes the position vector, velocity vector, attitude angle, and normal vector of the ship in the inertial coordinate system;
a determination module, used to determine the relative position vector and relative velocity vector of the UAV relative to the ship based on the first state information and the second state information, and to determine a first included-angle parameter between the normal vector of the UAV and the normal vector of the ship;
a generation module, used to input the first state information, the second state information, the relative position vector, the relative velocity vector, and the first included-angle parameter into a pre-trained policy neural network, and to output a desired acceleration command and a desired yaw moment command in the body coordinate system through the policy neural network, wherein the policy neural network is a neural network trained using a proximal policy optimization (PPO) algorithm;
a control module, used to input the desired acceleration command and the desired yaw moment command into a lower-level controller, and to generate, through the lower-level controller, thrust control commands acting on each rotor of the UAV, so as to drive the UAV to move and complete the landing on the ship;
wherein the training process of the policy neural network includes calculating a reward value at each time step based on a reward function, the reward function including a projection distance reward, a pose reward, and a state alignment reward;
the projection distance reward is determined based on the projection component of the relative position vector onto the ship's normal vector, and takes effect when the magnitude of the relative position vector is less than a first preset threshold;
the pose reward is determined based on a second included-angle parameter between the normal vector of the UAV and the normal vector of the ship, and takes effect when that magnitude is less than a second preset threshold; and
the state alignment reward is determined based on the consistency of the directions of the relative velocity vector and the relative position vector.

9. An electronic device, characterized in that the electronic device includes a processor coupled to a memory, the memory storing at least one computer program, which is loaded and executed by the processor to enable the electronic device to implement the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one computer program, which, when executed by a processor, implements the reinforcement learning-based autonomous carrier landing control method for unmanned aerial vehicles as described in any one of claims 1 to 7.