Methods, equipment, and media for constructing deep reinforcement learning network models for UAV control
By splitting the value network into short-sighted and long-sighted networks and combining pre-training and temporal difference, the problem of low sample utilization in UAV flight control models is solved, achieving more efficient training and lower costs, while improving the model's decision-making ability and interpretability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TONGJI UNIV
- Filing Date
- 2023-11-20
- Publication Date
- 2026-06-30
Figure CN117519239B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method, device, and medium for constructing a deep reinforcement learning network model for UAV flight control actions.
Background Technology
[0002] Deep reinforcement learning (DRL) is a product of the combination of deep learning and reinforcement learning. It integrates the powerful understanding capabilities of deep learning in perceptual problems with the decision-making capabilities of reinforcement learning, achieving end-to-end learning. The emergence of deep reinforcement learning has made reinforcement learning technology truly practical, enabling the solution of complex problems in real-world scenarios.
[0003] While deep reinforcement learning performs exceptionally well in planning and decision-making problems such as games and drone flight control, it still faces significant challenges. Reinforcement learning requires samples whose distribution closely matches the actions of the policy being trained. During training, samples are often discarded after use, necessitating repeated recollection. Therefore, to achieve the same performance level, the required sample size (number of interactions) far exceeds that of humans. The difficulty in reusing samples leads to low sample utilization during a single training session, and the inability to reuse samples across multiple training sessions significantly increases costs. This is particularly true for drone flight control, which requires sampling from real-world scenarios, resulting in substantial costs.
[0004] The actor-critic framework is significant for the construction of drone flight control actions because it provides a systematic approach to designing and implementing drone behavior.
[0005] The actor-critic framework is an important method in reinforcement learning, combining the advantages of value function methods and policy gradient methods. In the actor-critic framework, there are two main components: the actor, responsible for decision-making (selecting an action based on the current state); and the critic, responsible for evaluation (providing feedback based on the current state and the actor's chosen action). The critic typically uses a value function to evaluate the quality of the state or state-action pair. During the learning process, the actor and critic alternate roles. First, the actor selects an action based on the current policy; then, the critic provides feedback based on that action; finally, the actor updates the policy based on the feedback. This process is repeated until the policy converges.
[0006] The actor-critic framework has the advantage of learning the policy and the value function simultaneously, and it can handle continuous state and action spaces. Furthermore, introducing a critic reduces the variance that sampling introduces in policy gradient methods, improving the stability of the learning process.
[0007] The actor-critic framework for constructing UAV flight control actions introduces a value network to assist policy iteration, significantly improving the performance of deep reinforcement learning models. In traditional actor-critic frameworks for UAV flight control action construction, the value network and the policy network iterate alternately and depend on each other. In the early stages of model training, the value network cannot guide the policy network to iterate in a beneficial direction, which wastes a significant amount of time and training samples. The policy network can only iterate effectively after the value network has roughly converged. However, the optimization objective of the value network is influenced by the policy network, so the value network cannot be pre-trained directly. Therefore, traditional actor-critic frameworks for UAV flight control action construction cannot escape the problems of low sample utilization and the inability to reuse samples across training sessions.
Summary of the Invention
[0008] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a method and medium for constructing a deep reinforcement learning network model for drone flight control actions. The value network for drone flight control actions is further divided into a short-sighted network and a long-sighted network. On the one hand, this allows training samples to be reused in multiple training sessions, improving the utilization rate of training samples and reducing the number of interactions between the agent and the environment. On the other hand, pre-training accelerates the network convergence speed and further reduces training costs.
[0009] The applicant first observes that, in the early stages of training an actor-critic model for UAV flight control actions, the value network on which policy iteration relies has not yet converged, so it cannot efficiently guide the iteration of the policy network. This wastes a significant amount of time and samples in the initial stages of model training.
[0010] The objective of this invention can be achieved through the following technical solutions:
[0011] The first aspect of this invention provides a method for constructing a deep reinforcement learning network model for unmanned aerial vehicle (UAV) flight control, comprising the following steps:
[0012] S1: Construct the value network for drone flight control actions into a short-sighted network and a long-sighted network, and then construct a policy network;
[0013] S2: Construct training samples by sampling UAV flight control action scenarios or by reusing previously sampled datasets; initialize the network parameters; train the short-sighted network; and pre-train the policy network by performing gradient ascent on it based on the pre-trained short-sighted network, training until it converges or reaches a preset number of rounds;
[0014] S3: Fine-tune the pre-trained network model to serve as the drone's flight motion control model.
[0015] Furthermore, in S1, the specific process includes:
[0016] The action value function for UAV flight control actions is determined as follows:
[0017]
[0018] where r(s,a) is the immediate reward for taking action a in state s, s' is the next state, determined jointly by s and a, and a' is the next action, determined by π_θ. The short-sighted network predicts r(s,a), and the long-sighted network predicts the remaining future-reward term.
[0019]
[0020] The short-sighted network and the long-sighted network, which share a backbone network, are defined as:
[0021]
[0022] Where R(s,a,ξ',ξ1) is the short-sighted network, F(s,a,ξ',ξ2) is the long-sighted network, ξ=(ξ',ξ1,ξ2) are the network parameters, and ξ' is the shared network parameter.
[0023] Furthermore, in S2, the process of constructing training samples includes:
[0024] The UAV interacts with the scene using random or pre-designed actions, and the current state s, the interaction action a, the reward r, and the next state s' are collected; each element of the training set has the format (s,a,r,s').
[0025] Furthermore, in S2, the process of initializing the network parameters is as follows:
[0026] The parameter ξ2, which belongs only to the long-sighted network F(s,a,ξ',ξ2), is initialized so that the network outputs 0 for any input. The other network parameters are randomly initialized with a fixed mean and variance.
[0027] Furthermore, in S2, the process of training the short-sighted network is as follows:
[0028] The short-sighted network is trained using training sample pairs, and the loss function is defined as follows:
[0029]
[0030] The loss function minimizes the error between the output of the short-sighted network R(s,a,ξ',ξ1) and the real reward r;
[0031] The parameter update method is as follows:
[0032]
[0033]
[0034] The network parameters are updated using stochastic gradient descent, with α' and α1 being the learning rates. The training process continues until convergence or a predetermined number of rounds is reached.
[0035] Furthermore, in S2, the process of pre-training the policy network includes:
[0036] The decision action output by the policy network serves as part of the input to the value network, so the value of the policy's decision action is expressed as:
[0037] Q(s, π_θ(s), ξ)
[0038] The performance of the policy can be expressed as:
[0039] J(π) = E_s[Q(s, π_θ(s), ξ)];
[0040] Gradient ascent is performed on the policy network based on the pre-trained value network. The gradient expression for the policy network parameters is as follows:
[0041]
[0042] The policy network parameters are then updated using stochastic gradient descent, and training continues until convergence or until a predetermined number of iterations is reached.
[0043]
[0044] Where β is the learning rate.
[0045] Furthermore, in S3, the process of fine-tuning the trained network model includes:
[0046] S31: Fix the parameters of the short-sighted network and the backbone network:
[0047] The parameters ξ' and ξ1 of the value network are fixed;
[0048] S32: Construct the target network:
[0049] Create a target policy network a' = π'_θ(s) and a target value network Q'(s,a), with the same structures as the policy network and the value network respectively, whose parameters are held fixed;
[0050] S33: Use the policy network to output an action and interact with the environment to obtain a reward and the next state:
[0051] The policy network takes the environmental state information s as input and outputs the action a = π_θ(s); the action a is then applied to the environment to obtain the reward r and the next state s', and this interaction is added to the training samples;
[0052] S34: Input the next state into the target policy network to obtain the action estimate:
[0053] Input the next state s' into the target policy network, which outputs a' = π'_θ(s').
[0054] Furthermore, in S3, the process of fine-tuning the trained network model also includes:
[0055] S35: Update value network parameters:
[0056] Predict the action state value Q'(s',a') of the next state using the target value network, and define the expected action state value using temporal differencing:
[0057] y=r+γQ'(s',a')
[0058] The loss function is then defined as follows:
[0059]
[0060] The parameter update expression is:
[0061]
[0062] where the loss is computed over the training samples extracted for the update. The loss function reflects the difference between the value network output and the expected value. Stochastic gradient descent is used to update the value network, where only the parameter ξ2 belonging to the long-sighted network is updated.
[0063] S36: Update policy network parameters:
[0064] The policy network performs gradient ascent based on the target value network, where the gradient expression is as follows:
[0065]
[0066] The parameter update expression is as follows:
[0067]
[0068] Where β' is the learning rate;
[0069] S37: Overwrite the target network parameters;
[0070] After the model converges or after a specified number of rounds, the parameters of the policy network and the value network are copied to their corresponding target networks;
[0071] Repeat steps S33 through S37 until the model converges or reaches the predetermined number of rounds.
[0072] A second aspect of the present invention provides an electronic device, including a memory and a processor, wherein the processor is used to execute a program in the memory to implement the deep reinforcement learning network model construction method for UAV flight control as described above.
[0073] A third aspect of the present invention provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, is used to execute a deep reinforcement learning network model construction method for UAV flight control as described above.
[0074] Compared with the prior art, the present invention has the following technical advantages:
[0075] 1) This method uses a deterministic policy gradient scheme and uses temporal difference for value iteration. By further splitting the value network into a short-sighted network and a long-sighted network, training samples can be reused across multiple training sessions, which improves the utilization rate of training samples and reduces the number of interactions between the agent and the environment. In addition, pre-training accelerates the convergence of the network and further reduces the training cost.
[0076] 2) After pre-training, the model already possesses a certain degree of decision-making ability and can be further fine-tuned to improve decision-making performance. Furthermore, the short-sighted network and the long-sighted network in this invention are interpretable and can be applied to downstream tasks, such as evaluating the actions of UAVs in specific environments. Given the current environmental state, the short-sighted network provides a score for immediate gains, the long-sighted network provides a score for future gains, and the policy network outputs actions. In this way, the UAV can select appropriate actions to execute based on the needs of the current task. The two networks also help explain the UAV's decisions: by analyzing the network outputs, the reasons and basis for the UAV's selection of different actions in different environments can be understood. This is crucial for understanding the behavior of UAVs and judging the rationality of their decisions.
Brief Description of the Drawings
[0077] Figure 1 is a schematic diagram of the deep reinforcement learning network model structure for UAV flight control in this invention;
[0078] Figure 2 is a flowchart of the network model pre-training process in this invention;
[0079] Figure 3 is a flowchart of the network model fine-tuning process in this invention.
Detailed Implementation
[0080] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. Component models, material names, connection structures, control methods, algorithms, and other features not explicitly described in this technical solution are considered common technical features disclosed in the prior art.
[0081] S1: Divide the value network into a short-sighted network and a long-sighted network:
[0082] In the actor-critic framework, the prediction target of the value network for a deterministic policy is the action value function,
[0083] that is, the expected cumulative future reward of taking action a in state s under the policy π_θ, where θ denotes the policy network parameters. According to the Bellman equation, this can be further expressed as:
[0084]
[0085] where r(s,a) is the direct reward for taking action a in state s, s' is the next state, determined jointly by s and a, and a' is the next action, determined by π_θ.
[0086] As shown in Figure 1, this invention splits the prediction target of the value network between a short-sighted network and a long-sighted network: the short-sighted network predicts r(s,a), and the long-sighted network predicts the remaining future-reward term. Since both networks take (s, a) as input, they can share a backbone network. The short-sighted network and the long-sighted network in the model are collectively referred to as the value network.
[0087]
[0088]
[0089] Where R(s,a,ξ',ξ1) is the short-sighted network, F(s,a,ξ',ξ2) is the long-sighted network, ξ=(ξ',ξ1,ξ2) are the network parameters, and ξ' is the shared network parameter. The final state value is composed of the sum of the short-sighted network and the long-sighted network. Thus, the short-sighted network is freed from the constraints of the policy network and can be trained independently.
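Reading the definitions above together with the temporal-difference target y = r + γQ'(s',a') used in S35, the decomposition can be sketched as follows. The notation Q^{π_θ}, the expectation over s', and the use of "≈" for the prediction targets are assumptions made for illustration; only the final sum is stated directly in the text.

```latex
\begin{aligned}
Q^{\pi_\theta}(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'}\!\left[Q^{\pi_\theta}\big(s',\pi_\theta(s')\big)\right],\\
R(s,a,\xi',\xi_1) &\approx r(s,a),\\
F(s,a,\xi',\xi_2) &\approx \gamma\,\mathbb{E}_{s'}\!\left[Q^{\pi_\theta}\big(s',\pi_\theta(s')\big)\right],\\
Q(s,a,\xi) &= R(s,a,\xi',\xi_1) + F(s,a,\xi',\xi_2).
\end{aligned}
```

Written this way, the target of R does not involve θ, which is why the short-sighted network can be trained without the policy network.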
[0090] S2: The network model pre-training method is shown in Figure 2, with the following specific steps:
[0091] S21: Construction of training samples;
[0092] Sample the scene to build the training samples: interact with the scene using random or pre-designed actions, and collect the current state s, the interaction action a, the reward r, and the next state s'. Each element of the training set has the format (s, a, r, s'). The training set is reusable and can be used directly for the next pre-training.
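As a minimal sketch of this sampling step, the following collects (s, a, r, s') tuples with random actions. The one-dimensional environment `toy_step`, its reward, and the action range are hypothetical stand-ins chosen for illustration and are not part of the described method.

```python
import random

def toy_step(state, action):
    """Hypothetical 1-D environment: move along a line; reward is closeness to the origin."""
    next_state = state + action
    reward = -abs(next_state)
    return reward, next_state

def collect_samples(num_steps, seed=0):
    """Interact with random actions and record (s, a, r, s') tuples."""
    rng = random.Random(seed)
    samples = []
    state = rng.uniform(-1.0, 1.0)
    for _ in range(num_steps):
        action = rng.uniform(-0.5, 0.5)  # random exploration action
        reward, next_state = toy_step(state, action)
        samples.append((state, action, reward, next_state))
        state = next_state
    return samples

dataset = collect_samples(100)
```

Because each tuple records only what the environment returned, the same `dataset` can be kept and reused for a later pre-training run regardless of which policy is being trained.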
[0093] S22: Initialize network parameters;
[0094] The parameter ξ2, which belongs only to the long-sighted network F(s,a,ξ',ξ2), is initialized so that the network outputs 0 for any input. The other network parameters are randomly initialized with a fixed mean and variance.
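One way to picture this initialization is a small model with a shared backbone and two linear heads, where the long-sighted head starts at zero. The class name `SplitValueModel`, the layer sizes, the ReLU features, and the absence of bias terms are all assumptions for illustration; a head with a bias would need that bias zeroed as well.

```python
import random

class SplitValueModel:
    """Linear stand-in for Q(s,a,xi) = R(s,a,xi',xi1) + F(s,a,xi',xi2)."""

    def __init__(self, hidden=4, seed=0):
        rng = random.Random(seed)
        # Shared backbone xi': maps (s, a) to `hidden` features.
        self.backbone = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(hidden)]
        # Short-sighted head xi1: random initialisation with fixed mean and variance.
        self.head_r = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        # Long-sighted head xi2: all zeros, so F(s, a) = 0 for every input.
        self.head_f = [0.0] * hidden

    def features(self, s, a):
        # ReLU features computed by the shared backbone.
        return [max(0.0, w[0] * s + w[1] * a) for w in self.backbone]

    def short_sighted(self, s, a):
        return sum(w * h for w, h in zip(self.head_r, self.features(s, a)))

    def long_sighted(self, s, a):
        return sum(w * h for w, h in zip(self.head_f, self.features(s, a)))

    def q_value(self, s, a):
        return self.short_sighted(s, a) + self.long_sighted(s, a)

model = SplitValueModel()
```

With the long-sighted head at zero, the combined value equals the short-sighted output at the start of training, so pre-training the policy against Q uses only the reward prediction.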
[0095] S23: Training a short-sighted network;
[0096] The short-sighted network is trained using training sample pairs, and the loss function is defined as follows:
[0097]
[0098] That is, the loss minimizes the error between the output of the short-sighted network R(s,a,ξ',ξ1) and the real reward r. The parameters are updated as follows:
[0099]
[0100]
[0101] The network parameters are updated using stochastic gradient descent, where α' and α1 are the learning rates, and training continues until convergence or until a predetermined number of iterations is reached.
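A plausible form of the loss and the two updates in S23, assuming a mean squared error over a batch of N samples (s_i, a_i, r_i, s'_i), is sketched below. The text states only that the loss measures the error between R and the real reward, so the squared form, the 1/N averaging, and the name L_R are assumptions.

```latex
\begin{aligned}
L_R(\xi',\xi_1) &= \frac{1}{N}\sum_{i=1}^{N}\big(R(s_i,a_i,\xi',\xi_1) - r_i\big)^2,\\
\xi' &\leftarrow \xi' - \alpha'\,\nabla_{\xi'} L_R,\\
\xi_1 &\leftarrow \xi_1 - \alpha_1\,\nabla_{\xi_1} L_R.
\end{aligned}
```

Neither the loss nor the updates refer to θ or to the next state s', so this step needs only stored (s, a, r) data.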
[0102] S24: Pre-train the policy network;
[0103] The decision action output by the policy network serves as part of the input to the value network. The expression is:
[0104] Q(s, π_θ(s), ξ)
[0105] The performance of the policy can be expressed as:
[0106] J(π) = E_s[Q(s, π_θ(s), ξ)]
[0107] Gradient ascent is performed on the policy network based on the pre-trained value network. The gradient expression for the policy network parameters is as follows:
[0108]
[0109] The policy network parameters are updated using stochastic gradient descent, and training continues until convergence or until a predetermined number of iterations is reached.
[0110]
[0111] Where β is the learning rate.
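The gradient and update in S24 can be sketched with the standard deterministic policy gradient chain rule, applied to J(π) = E_s[Q(s, π_θ(s), ξ)]. The exact typography of the original formulas is an assumption; the plus sign in the update reflects ascent on J.

```latex
\begin{aligned}
\nabla_\theta J(\pi) &= \mathbb{E}_{s}\!\left[\nabla_\theta \pi_\theta(s)\,\nabla_a Q(s,a,\xi)\big|_{a=\pi_\theta(s)}\right],\\
\theta &\leftarrow \theta + \beta\,\nabla_\theta J(\pi).
\end{aligned}
```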
[0112] S3: Fine-tuning of the pre-trained network model:
[0113] The method for fine-tuning the pre-trained network model in step S3 is shown in Figure 3, with the following specific steps:
[0114] S31: Fix the parameters of the short-sighted network and the backbone network;
[0115] The parameters ξ' and ξ1 of the value network are fixed and do not change thereafter.
[0116] S32: Construct the target network;
[0117] Create a target policy network a' = π'_θ(s) and a target value network Q'(s,a), with the same structures as the policy network and the value network respectively, whose parameters are held fixed.
[0118] S33: Use the policy network to output actions and interact with the environment to obtain rewards and the next state;
[0119] The policy network takes the environmental state information s as input and outputs the action a = π_θ(s). The action a is applied to the environment to obtain the reward r and the next state s', and this interaction is added to the training samples.
[0120] S34: Input the next state into the target policy network to obtain the action estimate;
[0121] The target policy network takes the next state s' as input and outputs a' = π'_θ(s').
[0122] S35: Update value network parameters;
[0123] Predict the action state value Q'(s',a') of the next state using the target value network, and define the expected action state value using temporal differencing:
[0124] y=r+γQ'(s',a')
[0125] The loss function is then defined as follows:
[0126]
[0127] The parameter update expression is as follows:
[0128]
[0129] where the loss is computed over the training samples extracted for the update, and the loss function reflects the difference between the value network output and the expected value. Stochastic gradient descent is used to update the value network. Only the parameter ξ2, which belongs exclusively to the long-sighted network, is updated.
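A plausible form of the S35 loss and update, assuming a mean squared temporal-difference error over N extracted samples, is sketched below. The squared form, the averaging, and the learning-rate symbol α_2 are assumptions; the target y and the restriction of the update to ξ2 come from the text.

```latex
\begin{aligned}
y_i &= r_i + \gamma\,Q'(s'_i, a'_i),\\
L(\xi_2) &= \frac{1}{N}\sum_{i=1}^{N}\big(Q(s_i,a_i,\xi) - y_i\big)^2,\\
\xi_2 &\leftarrow \xi_2 - \alpha_2\,\nabla_{\xi_2} L(\xi_2).
\end{aligned}
```

Since ξ' and ξ1 are frozen, the gradient reaches only the long-sighted part F(s,a,ξ',ξ2) of Q.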
[0130] S36: Update policy network parameters;
[0131] The policy network performs gradient ascent based on the target value network, where the gradient expression is as follows:
[0132]
[0133] The parameter update expression is as follows:
[0134]
[0135] Where β' is the learning rate.
[0136] S37: Overwrite the target network parameters;
[0137] After the model converges or after a specified number of rounds, the parameters of the policy network and the value network are copied to their corresponding target networks.
[0138] S38: Repeat S33 to S37 until the model converges or reaches the predetermined number of rounds.
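The loop S33 to S37 can be illustrated on a toy scalar problem. Everything concrete here is a hypothetical stand-in: the environment s' = s + a with reward -(s')^2, the linear policy a = θ·s, the frozen `reward_model` playing the role of the short-sighted network, the single-weight long-sighted term w·(s + a)^2, the fresh start state drawn at every step, and all hyperparameters.

```python
import random

def reward_model(s, a):
    """Frozen short-sighted network R (hypothetical stand-in): the exact toy reward."""
    return -(s + a) ** 2

def fine_tune(steps=2000, sync_every=20, gamma=0.9, alpha2=0.1, beta=0.1, seed=0):
    rng = random.Random(seed)
    theta, w = -0.5, 0.0      # policy a = theta * s; long-sighted F = w * (s + a)^2
    theta_t, w_t = theta, w   # S32: target copies whose parameters are held fixed
    for step in range(1, steps + 1):
        # S33: act with the current policy, observe the reward and the next state.
        s = rng.uniform(-1.0, 1.0)
        a = theta * s
        s_next = s + a
        r = -s_next ** 2
        # S34: the target policy proposes the next action.
        a_next = theta_t * s_next
        # S35: TD target from the target value network; only w (xi2) is updated.
        y = r + gamma * (reward_model(s_next, a_next) + w_t * (s_next + a_next) ** 2)
        q = reward_model(s, a) + w * (s + a) ** 2
        w -= alpha2 * 2.0 * (q - y) * (s + a) ** 2
        # S36: gradient ascent on Q' w.r.t. theta (chain rule: dQ'/da * da/dtheta).
        dq_da = -2.0 * (s + a) + 2.0 * w_t * (s + a)
        theta += beta * dq_da * s
        # S37: periodically copy the trained parameters into the target networks.
        if step % sync_every == 0:
            theta_t, w_t = theta, w
    return theta, w

theta, w = fine_tune()
```

In this toy setting the best policy drives the state to the origin in one step, so θ moves toward -1 while the reward model itself is never modified.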
[0139] Finally, the trained deep reinforcement learning network model is applied to the UAV's controller. The process is as follows:
[0140] Model Saving: Saves the trained deep reinforcement learning network model to a file. Typically, deep learning frameworks (such as PyTorch and TensorFlow) provide functions or methods for saving and loading models, which can be used to save the model as a file for later loading and use.
[0141] Controller Integration: This integrates the model loading functionality into the drone's control platform. This allows for customized development based on the drone's hardware and software platform. For example, if the drone is based on ROS (Robot Operating System), custom ROS nodes or services can be written to control the drone's behavior by loading model files and using their output.
[0142] Input processing: The current state or environmental input of the drone is passed to the loaded deep reinforcement learning network model. This may involve processing and preprocessing the sensor data to convert it into an input format acceptable to the model. For example, image data may be preprocessed, such as resizing, normalizing, or converting it to an appropriate tensor format.
[0143] Output Interpretation: Interpret the model's output and convert it into actions or commands that the drone can understand and execute. This may require appropriate mapping and transformation based on the drone's motion space and execution mechanisms.
[0144] Control execution: Based on the model's output, corresponding actions or commands are transmitted to the drone's actuators to control the drone's behavior. This may involve underlying motion control, path planning, or actuator control methods, depending on the drone's hardware and capabilities.
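The input processing, output interpretation, and control execution steps above can be sketched as a single control step. All names and numbers are hypothetical: `preprocess` assumes a scalar sensor reading with a fixed scale, `policy` is a linear stand-in for the loaded policy network, and `to_command` assumes a symmetric actuator limit.

```python
def preprocess(raw_reading, scale=100.0):
    """Normalise a raw sensor reading (hypothetical units) into a model input."""
    return raw_reading / scale

def policy(state, theta=-1.0):
    """Stand-in for the loaded policy network: a = theta * s."""
    return theta * state

def to_command(action, max_cmd=0.5):
    """Clip the model output to the actuator's allowed range."""
    return max(-max_cmd, min(max_cmd, action))

def control_step(raw_reading):
    """One pass: sensor reading -> model input -> action -> actuator command."""
    state = preprocess(raw_reading)
    action = policy(state)
    return to_command(action)

command = control_step(20.0)
```

On a real platform, `policy` would wrap the model loaded from the saved file, and `to_command` would map into whatever command format the flight controller accepts.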
[0145] In this invention, the short-sighted network and the long-sighted network can be used to evaluate the actions of UAVs in specific environments. Given the current environmental state, the short-sighted network provides a score for immediate gains, the long-sighted network provides a score for future gains, and the policy network provides recommended actions. In this way, the UAV can select appropriate actions to execute based on the requirements of the current mission. The two networks also help explain the UAV's action decisions: by analyzing the network outputs, the reasons and basis for the UAV's selection of different actions in different environments can be understood. This is crucial for understanding the UAV's behavior and judging the rationality of its decisions.
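As a small illustration of this kind of evaluation, the helper below reports the immediate and future scores of a candidate action side by side. The function `explain_action` and the two toy scoring functions are hypothetical stand-ins for the trained short-sighted and long-sighted networks.

```python
def explain_action(state, action, short_sighted, long_sighted):
    """Split an action's estimated value into its immediate and future parts."""
    immediate = short_sighted(state, action)
    future = long_sighted(state, action)
    return {"immediate": immediate, "future": future, "total": immediate + future}

# Hypothetical stand-ins for the trained networks R and F.
def toy_short_sighted(s, a):
    return -(s + a) ** 2

def toy_long_sighted(s, a):
    return -0.5 * abs(s + a)

report = explain_action(0.4, -0.1, toy_short_sighted, toy_long_sighted)
```

Comparing such reports across several candidate actions shows whether an action is preferred for its immediate reward or for its expected future return.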
[0146] This embodiment also proposes a device for constructing a deep reinforcement learning network model for UAV flight control. The device includes a processor and a memory, which are coupled. The memory stores program instructions, and when these instructions are executed by the processor, the aforementioned method for constructing a deep reinforcement learning network model for UAV flight control is implemented. The processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The memory may include Random Access Memory (RAM), which can be internal memory, and may also include non-volatile memory, such as at least one disk storage device. The processor and memory can be integrated into one or more independent circuits or hardware, such as an ASIC. It should be noted that when the computer program in the aforementioned memory is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, electronic device, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention.
[0147] This embodiment also proposes a computer-readable storage medium storing computer instructions for instructing a computer to execute the aforementioned method for constructing a deep reinforcement learning network model for UAV flight control. The storage medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium. The storage medium may also include semiconductor or solid-state memory, magnetic tape, a removable computer disk, random access memory (RAM), read-only memory (ROM), a hard disk, and an optical disc. Optical discs may include compact disc read-only memory (CD-ROM), compact disc read/write (CD-RW), and DVD.
[0148] The above description of the embodiments is provided to enable those skilled in the art to understand and use the invention. It will be apparent to those skilled in the art that various modifications can be made to these embodiments, and the general principles described herein can be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments, and any improvements and modifications made by those skilled in the art based on the disclosure of the present invention without departing from the scope of the invention should be within the protection scope of the present invention.
Claims
1. A method for constructing a deep reinforcement learning network model for UAV flight control, characterized in that it includes the following steps: S1: Construct the value network for UAV flight control actions as a short-sighted network and a long-sighted network, and then construct a policy network; S2: Construct training samples by sampling UAV flight control action scenarios, or reuse previously sampled data; initialize the network parameters, where the parameter ξ2 belonging only to the long-sighted network is initialized so that the network outputs 0 for any input, and the other network parameters are randomly initialized with a fixed mean and variance; train the short-sighted network; and pre-train the policy network by performing gradient ascent on it based on the pre-trained short-sighted network, training until it converges or reaches the preset number of rounds; S3: Fine-tune the pre-trained network model, with the short-sighted network and the backbone network fixed, and use it as the UAV flight control action model. In S1, the specific process includes: the action value function for UAV flight control actions is determined as follows: where r(s,a) is the direct reward for taking action a in state s, s' is the state at the next moment, determined jointly by s and a, and a' is the next action, determined by π_θ; the short-sighted network predicts r(s,a), and the long-sighted network predicts the remaining future-reward term; the short-sighted network and the long-sighted network sharing the backbone network are defined as: where R(s,a,ξ',ξ1) is the short-sighted network, F(s,a,ξ',ξ2) is the long-sighted network, ξ=(ξ',ξ1,ξ2) are the network parameters, and ξ' is the shared network parameter.
2. The method for constructing a deep reinforcement learning network model for UAV flight control according to claim 1, characterized in that, in S2, the process of constructing training samples includes: interacting with the scene using random or designed actions based on UAV flight control, and collecting the current state s, the interaction action a, the reward r, and the next state s', where the training set elements are in the format (s,a,r,s').
3. The method for constructing a deep reinforcement learning network model for UAV flight control according to claim 1, characterized in that, in S2, the process of training the short-sighted network is as follows: the short-sighted network is trained using training sample pairs, and the loss function is defined as follows: the loss function minimizes the error between the short-sighted network R(s,a,ξ',ξ1) and the real reward r; the parameter update method is as follows: where stochastic gradient descent is used to update the network parameters, α' and α1 are the learning rates, and training continues until convergence or until a predetermined number of rounds is reached.
4. The method for constructing a deep reinforcement learning network model for UAV flight control according to claim 1, characterized in that, in S2, the process of pre-training the policy network includes: the decision action output by the policy network is used as part of the input to the value network, where the value of the decision action output by the policy network is expressed as: Q(s, π_θ(s), ξ); the performance of the policy can be expressed as: J(π) = E_s[Q(s, π_θ(s), ξ)]; gradient ascent is performed on the policy network based on the pre-trained value network, and the gradient expression for the policy network parameters is as follows: ; the policy network parameters are then updated using stochastic gradient descent, and training continues until convergence or until a predetermined number of iterations is reached, where β is the learning rate.
5. The method for constructing a deep reinforcement learning network model for UAV flight control according to claim 1, characterized in that, in S3, the process of fine-tuning the trained network model includes: S31: Fix the parameters of the short-sighted network and the backbone network: the parameters ξ' and ξ1 of the value network are fixed; S32: Construct the target networks: create a target policy network a' = π'_θ(s) and a target value network Q'(s,a), with the same structures as the policy network and the value network respectively, whose parameters are held fixed; S33: Use the policy network to output an action and interact with the environment to obtain a reward and the next state: the policy network takes the environmental state information s as input and outputs the action a = π_θ(s); the action a is then applied to the environment to obtain the reward r and the next state s', and this interaction is added to the training samples; S34: Input the next state into the target policy network to obtain the action estimate: input the next state s' into the target policy network, which outputs a' = π'_θ(s').
6. The method for constructing a deep reinforcement learning network model for UAV flight control according to claim 5, characterized in that, in S3, the process of fine-tuning the trained network model also includes: S35: Update the value network parameters: predict the action-state value Q'(s',a') of the next state using the target value network, and use temporal difference to define the expected action-state value: y = r + γQ'(s',a'); the loss function is then defined as follows: the parameter update expression is: where the loss is computed over the extracted training samples; the loss function reflects the difference between the value network output and the expected value; stochastic gradient descent is used to update the value network, where only the parameter ξ2 belonging to the long-sighted network is updated; S36: Update the policy network parameters: the policy network performs gradient ascent based on the target value network, where the gradient expression is as follows: the parameter update expression is as follows: where β' is the learning rate; S37: Overwrite the target network parameters: after the model converges or after a specified number of rounds, the parameters of the policy network and the value network are copied to their corresponding target networks; repeat steps S33 to S37 until the model converges or reaches the predetermined number of rounds.
7. An electronic device, comprising a memory and a processor, characterized in that, The processor is used to execute the program in the memory to implement the deep reinforcement learning network model construction method for UAV flight control as described in any one of claims 1 to 6.
8. A storage medium containing computer-executable instructions, characterized in that, when executed by a computer processor, the computer-executable instructions are used to perform the deep reinforcement learning network model construction method for UAV flight control as described in any one of claims 1 to 6.