A shared autonomy method based on deep reinforcement learning
By combining deep reinforcement learning with long short-term memory networks, the method addresses the problem of identifying invalid human behavior and changes of intent in shared autonomy systems, enabling the system to adjust autonomously and improving the task success rate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF SCI & TECH OF CHINA
- Filing Date
- 2022-04-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing shared autonomy systems struggle to distinguish accurately between invalid human actions and changes of intent when dealing with invalid human behavior, which leads to system failures or requires additional information or complex algorithms.
Using a deep reinforcement learning-based approach, a deep Q-network and a long short-term memory network calculate behavioral reward values and intent confidence, which makes it possible to judge and arbitrate human behavior effectively and to adjust the human-machine shared control strategy dynamically.
Even when external control commands are invalid for an extended period, the system can still achieve the correct objective, and it can determine the validity of control commands without additional information. It can distinguish between invalid behavior and changes of intent, thus improving the system's robustness and task success rate.
Figure CN115345273B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of human-machine hybrid intelligent systems, and more specifically to a shared autonomy method based on deep reinforcement learning.
Background Technology
[0002] In shared autonomy, humans and intelligent machines collaborate with complementary capabilities to complete real-time control tasks, achieving performance that neither can achieve individually. Take drone landing as an example: for humans, it's difficult to simultaneously control altitude, speed, attitude, and other dimensions; for automated landing systems, it's difficult to help them understand what constitutes a good and safe landing, and how to achieve landing in various complex environments. Shared autonomy has been widely applied in many fields, such as teleoperation of machines, semi-autonomous driving, and rehabilitation exoskeletons. In recent years, due to the rapid development of artificial intelligence, this field has received increasing attention.
[0003] Many shared autonomous systems involve intelligent machines assisting humans in completing tasks decided by humans. In this context, successful shared autonomous strategies typically rely on two fundamental components: inference of human intent (which machines often cannot directly ascertain), and arbitration between machine decision-making and human behavior. Human intent reasoning is the first step in arbitration, and its quality directly impacts its success. Reasoning is usually accomplished by analyzing observed human behavior, with a key assumption being that this observed behavior is "effective"—that is, it facilitates task completion and effectively reflects the true human intent. However, in reality, due to fatigue, distraction, and other factors, human behavior is often "ineffective" to some extent. As human behavior remains ineffective for a period, the inference of human intent also becomes ineffective, leading to the failure of the robot-assisted system and, consequently, the failure of the entire system.
[0004] To address invalid human behavior, the first and crucial challenge is determining whether a range of human actions are invalid. This is no simple task, as it must be accomplished by a machine with limited information. One possible approach is to utilize additional information such as human physiological states, assuming that behavior is invalid when a person is physiologically abnormal, which can be measured through facial recognition, neural signals, heart rate, etc. However, this type of approach cannot handle invalid behavior under normal physiological conditions due to cognitive and environmental limitations (such as time constraints and incomplete information). For example, in emergency situations, due to limited processing time and psychological stress, a person's controlled behavior may be detrimental to the system. The second challenge is distinguishing between invalid human behavior and changes in human intent. Inference components infer human intent based on a range of historical data, making them robust to occasional invalid human behavior, but this also means that changes in intent may not be immediately detected. For systems where the task objective is a previous intent, input from a person with a changed intent may be invalid. The distinction between these two directly relates to the success or failure of the task.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention proposes a shared autonomy method based on deep reinforcement learning.
[0006] The objective of this invention can be achieved through the following technical solutions:
[0007] A shared autonomy method based on deep reinforcement learning, wherein the device includes a control module, comprising the following steps:
[0008] Step 1: Train an end-to-end mapping from environmental state and intervention behavior to behavior reward value through deep reinforcement learning, and calculate the reward value of each behavior, representing the benefit of the intervention behavior to the current task of the device;
[0009] Step 2: Once the reward value drops below the preset value, the intervention behavior is deemed invalid, and the device will be controlled by the control module to complete the goal inferred from the effective behavior;
[0010] Step 3: If several pre-defined behaviors are valid, then the arbitration function is used to perform shared control based on the human behavior and the calculation results of the control module.
[0011] Optionally, prior to step 1, the intent of the intervention is inferred through a long short-term memory network.
[0012] Optionally, the deep reinforcement learning in step 1 is a deep Q-network.
[0013] Optionally, in step 1, the input of the deep Q-network is the environmental state and the intervention behavior, and the output is the reward value of all behaviors under the current environmental state.
[0014] Optionally, in step 1, when the difference between the highest reward value of the action calculated by the deep Q-network and the reward value of the intervention behavior is greater than a preset value, the device will be controlled separately by the control module.
[0015] Optionally, the distance between actions can be calculated using the reward value of the action:
[0016] $a_s = \arg\max_{a \in A,\; Q'(s,b,a) \ge c\,Q'(s,b,a_{\max})} f(a, a_h)$
[0017] Where $A$ is the set of all possible actions; $Q'(s,b,a) = Q(s,b,a) - \min_{a' \in A} Q(s,b,a')$ is the reward value of the action minus the minimum reward value among all actions; $a_{\max}$ is the highest-value action calculated by the deep Q-network in the current environment state; $c$ is the confidence level of the intent inference; and $f(a, a_h)$ is the similarity between behavior $a$ and intervention behavior $a_h$.
[0018] Optionally, the confidence level of the inference result from the Long Short-Term Memory Network is the maximum probability minus the minimum probability in the target set. When the confidence level is lower than a preset low value, the device is subject to intervention control; when the confidence level is higher than a preset high value, the device is controlled by a control module; when the confidence level is between the preset low value and the preset high value, it is jointly controlled by the intervention and control modules.
[0019] Intervention can be a human-controlled action, or other interference actions or control signals / commands from external control programs, control systems, machines, or equipment. For example, in autonomous driving, another algorithm can be used as an external control command for the driver assistance system, achieving more robust control based on the discrepancies between the commands of the two control systems.
[0020] A computer-readable storage medium storing instructions that, when executed, enable the control method described above.
[0021] The aforementioned control module can be a control system, a controller, or a storage medium storing control commands and algorithms.
[0022] The beneficial effects of this invention are:
[0023] 1. Even when external control commands are invalid for an extended period, the system can still ensure that it achieves the correct objectives.
[0024] 2. The validity of external control commands can be determined without additional information;
[0025] 3. Able to distinguish between ineffective human behavior and intentional change;
[0026] 4. It takes into account ineffective human behavior and the uncertainty of intelligent machines.
Attached Figure Description
[0027] The invention will now be further described with reference to the accompanying drawings.
[0028] Figure 1 is a schematic diagram of the control method of this application;
[0029] Figure 2 is a schematic diagram of a specific experimental scenario for the present invention;
[0030] Figure 3 shows the experimental results of this invention, where 3(a) is the success rate of 10 players completing the task using three control methods with valid human input; 3(b) is the success rate of 10 players completing the task using three control methods with partially invalid input; 3(c) is the average number of steps taken by 10 players to complete the task using three control methods with valid human input; and 3(d) is the average number of steps taken by 10 players to complete the task using three control methods with partially invalid input.
Detailed Implementation
[0031] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0032] As shown in Figure 1, an example of the present invention discloses a device control method whose key components include four parts: an intent inference module that infers human intent and calculates the confidence level of the intent inference; an action selection module that calculates the adaptive arbitration ratio of the human-machine combination to determine the action when performing shared control; a human input validity judgment module that judges whether human input is invalid and whether it is necessary to switch from human-machine shared control to machine-only control; and an arbitration module that decides the final action to be performed by the controlled system based on all the above information.
[0033] In a shared control system, user input consists of highly correlated, chronologically ordered sequential decisions. The system presents a certain state based on the user's previous control action, and the user makes new decisions based on the system's current state. Recurrent Neural Networks (RNNs) are neural networks used to process sequential data. Compared to general neural networks that require independent and identically distributed data, they can handle data with varying sequences; for example, the meaning of a word can change depending on the context. Long Short-Term Memory (LSTM) networks are a special type of RNN that can solve the vanishing and exploding gradient problems during long sequence training, thus performing better than ordinary RNNs on long sequences. Therefore, we use an LSTM network for supervised prediction of user control behavior: predicting the target g based on a series of trajectories and actions, normalizing the distance of the predicted result g from all targets in the target set G, and using the result as the probability of each target, resulting in the overall probability distribution b on the target set. The machine selects the target with the highest probability, i.e., the target closest to the predicted result, as the user's intended target.
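The step from a predicted target to a probability distribution over the target set can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only says that the distances are normalized, so the inverse-distance weighting, the function name `target_belief`, and the example flag coordinates are all assumptions made for the example.

```python
import math


def target_belief(predicted, targets, eps=1e-8):
    """Map distances from the predicted target to each candidate onto probabilities.

    Assumption: inverse-distance weighting. The source only states that the
    distances are normalized, not which normalization is used.
    """
    distances = [math.dist(predicted, t) for t in targets]
    weights = [1.0 / (d + eps) for d in distances]  # closer target -> larger weight
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical landing targets (x, y) and a hypothetical LSTM prediction.
targets = [(2.0, 0.0), (5.0, 0.0), (8.0, 0.0)]
b = target_belief((4.6, 0.0), targets)

# The machine takes the most probable target as the user's intended target.
intended = max(range(len(targets)), key=lambda i: b[i])
```

With this weighting, the target closest to the prediction always receives the highest probability, which matches the selection rule described in [0033].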
[0034] In some shared control systems, the machine selects the user's target based on the one with the highest probability in the target set and then chooses its action accordingly. A drawback of deterministic targets is the unknown uncertainty in intent reasoning, which may lead to incorrect inferences and erroneous actions in complex situations. This example selects actions based on the probability distribution b of the target set, preserving the possibilities of all targets. The more accurate the target reasoning, the more effective the machine's chosen action; the greater the uncertainty in the target reasoning, the greater the probability of the machine's erroneous action selection. Therefore, we calculate the uncertainty of the target reasoning. If the uncertainty reaches a certain level, the user's control command is executed directly; otherwise, arbitration is conducted between the user's control action and the machine's chosen action.
[0035] The confidence level of target inference is the difference between the maximum probability and the minimum probability in the target set, as shown in the following formula, where $G$ is the target set, $g'$ is a target in the target set, $a_h$ is the human control behavior, and $p(g' \mid a_h)$ is the inferred probability that the control target is $g'$. There are two extreme cases: (1) one target has a probability of 1 and the other targets have a probability of 0, in which case the confidence is 1, that is, the machine is completely certain which target the user intends; (2) all targets have equal probabilities, in which case the confidence is 0, that is, the machine is completely uncertain which target the user intends.
[0036] $\text{confidence} = c = \max_{g' \in G} p(g' \mid a_h) - \min_{g' \in G} p(g' \mid a_h), \quad c \in [0, 1]$ (1)
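As a small worked illustration of formula (1), the confidence can be computed directly from the probability distribution over the target set. The function name `intent_confidence` is chosen for this sketch only.

```python
def intent_confidence(belief):
    """Confidence of intent inference: largest minus smallest target probability."""
    return max(belief) - min(belief)


certain = intent_confidence([1.0, 0.0, 0.0])        # one target has all the mass
uncertain = intent_confidence([1 / 3, 1 / 3, 1 / 3])  # all targets equally likely
partial = intent_confidence([0.6, 0.3, 0.1])
```

The first two calls reproduce the two extreme cases described in [0035]; the third shows an intermediate distribution.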
[0037] The action selection module calculates the action $a_s$ to be taken when executing shared control. We use a Deep Q-Network (DQN) approach. The input to a DQN is the environmental state and the human input; the output is the reward value for all actions in the current environmental state. The reward value Q(s,a) is the expected sum of discounted rewards obtainable in a finite number of steps after performing action a in state s, used to evaluate the value of the action. The machine selects actions based on these reward values. In some shared control systems, the machine selects the action with the highest reward value as the optimal action. However, we believe that when human input is effective, intelligent machines should modify human input as little as possible to increase the human's acceptance of assistance. If the system consistently performs actions deviating from human input, the human may lose trust in the system because their commands are not being executed accurately. Therefore, actions that are sufficiently good and most similar to human input are chosen as the actions executed under shared control.
[0038] The range of behaviors selected is determined by the reasoning confidence. Higher reasoning confidence makes the machine more certain, thus allowing for a smaller selection range; conversely, lower reasoning confidence increases the likelihood of machine errors, requiring the selection of behaviors closely resembling the user's command within a larger range. We use the reward value of each behavior to calculate the distance between them, as shown in the following formula.
[0039] $a_s = \arg\max_{a \in A,\; Q'(s,b,a) \ge c\,Q'(s,b,a_{\max})} f(a, a_h)$ (2)
[0040] Where $a_s$ is the action calculated by the action selection module to be taken when executing shared control; $A$ is the set of all possible actions; $Q'(s,b,a) = Q(s,b,a) - \min_{a' \in A} Q(s,b,a')$ is the reward value of the action minus the minimum reward value among all actions, which prevents errors caused by negative Q values; $c$ is the confidence level of intention reasoning calculated by formula (1); $a_{\max}$ is the action with the highest value in the current environment state, calculated by the DQN; and $f(a, a_h)$ is the similarity between behavior $a$ and user-controlled behavior $a_h$. For example, writing $Q'_{\max} = Q'(s,b,a_{\max})$, when the confidence is 0.8 the behavior most similar to $a_h$ is chosen from those satisfying $Q' \ge 0.8\,Q'_{\max}$; when the confidence is 0.4, it is chosen from those satisfying $Q' \ge 0.4\,Q'_{\max}$.
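The selection rule described in [0040] can be sketched as below, under stated assumptions. The Q values, the action names, the identity-based similarity `same_action`, and the tie-break by shifted Q value are hypothetical choices for illustration; the fallback to the highest-value action when there is no human input follows [0041].

```python
def select_shared_action(q_values, human_action, confidence, similarity):
    """Pick the action most similar to the human input among sufficiently good actions.

    q_values: dict mapping action -> Q(s, b, a) for the current state and belief.
    """
    q_min = min(q_values.values())
    # Q'(s, b, a): shift by the minimum so that all values are non-negative.
    shifted = {a: q - q_min for a, q in q_values.items()}
    best = max(shifted, key=shifted.get)  # a_max
    if human_action is None:
        # No human input: execute the highest-value action.
        return best
    threshold = confidence * shifted[best]
    candidates = [a for a in shifted if shifted[a] >= threshold]
    # Assumption: ties in similarity are broken in favor of the higher-value action.
    return max(candidates, key=lambda a: (similarity(a, human_action), shifted[a]))


def same_action(a, b):
    """Hypothetical similarity: 1 for identical actions, 0 otherwise."""
    return 1.0 if a == b else 0.0


q = {"left": 1.0, "main": 5.0, "right": 4.5, "noop": -3.0}
high_conf_agree = select_shared_action(q, "right", 0.8, same_action)
high_conf_override = select_shared_action(q, "left", 0.8, same_action)
low_conf_follow = select_shared_action(q, "left", 0.4, same_action)
no_input = select_shared_action(q, None, 0.8, same_action)
```

In this example the human's "left" command is overridden at confidence 0.8 because it falls outside the set of sufficiently good actions, but it is followed at confidence 0.4, where the admissible set is larger.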
[0041] In particular, the absence of human input will cause the system to directly execute the action that the machine calculates to be of the highest value. When humans intervene to guide the task by inputting actions, the machine becomes compliant and obedient. The lack of input implies that humans are satisfied with the current situation, and the machine will attempt to lead the task.
[0042] The reward value represents the worth of the action in the current task. We assume that both humans and robots strive for higher values, so the system becomes cautious when humans issue low-value commands. Therefore, inputs are deemed invalid when the reward penalty is sufficiently large.
[0043] The difference between the reward values of actions is used as the distance between actions, as shown in the following formula. When the distance $d(a_h, a_{\max})$ between the human input and the action with the highest value calculated by the DQN is sufficiently large multiple times, it is assumed that the human cannot effectively control the system, and therefore the system is controlled solely by the machine. When the machine is in sole control, it uses the probability distribution calculated at the last moment of effective human control as the task objective, regardless of the current human input, and transmits the action with the highest value, $a_{\max}$, to the controlled system:
[0044]
[0045]
[0046] Meanwhile, the human continues to provide input, and the network continues to calculate the distance between actions. When several consecutive distances $d(a_h, a_{\max})$ are sufficiently small, it is determined that the human has returned to a rational state and is able to provide effective input, and the system returns to shared control.
[0047] It is particularly important to note that machines may mistakenly interpret changes in a human's target as invalid behavior. LSTM networks calculate a probability distribution over a target set based on a series of trajectories and actions. Therefore, changes in the target can be progressively identified by the network and completed through shared control between the human and the machine. However, when the system is close to the inferred target and the inference confidence is high, a sudden change in the target may result in a large action distance, causing the human to lose control and the machine to guide the system to complete the previous target, i.e., task failure. Therefore, when the machine controls the system alone, we re-collect trajectories and human input to re-infer the human's intention. If the robot repeatedly infers the same target multiple times, then the machine and human jointly control the system to complete the new target.
[0048] In summary, the controlled system can execute three possible actions, as shown in the following formula. When the confidence level of the intention inference is sufficiently low, the human controls the system alone, and the system executes the human's control action $a_h$. When the action distance is sufficiently high over consecutive steps, the machine controls the system alone, and the system executes the machine's control action $a_r$. In other cases, the system is jointly controlled by the human and the machine, and the system executes the action $a_s$ calculated by the action selection module for shared control. The overall algorithm is shown in Algorithm 1.
[0049] $a_t = \begin{cases} a_h, & \text{if the confidence } c \text{ is sufficiently low} \\ a_r, & \text{if } d(a_h, a_{\max}) \text{ is sufficiently high over consecutive steps} \\ a_s, & \text{otherwise} \end{cases}$ (5)
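The three-way arbitration summarized in [0048] might be sketched as follows. The function name `arbitrate`, the boolean `machine_only` flag, and the precedence of machine-only control over the low-confidence case are assumptions for this example; the default threshold of 0.3 is the value reported for the experiments in [0073].

```python
def arbitrate(confidence, machine_only, a_h, a_r, a_s, low_conf=0.3):
    """Return the action sent to the controlled system.

    machine_only: True once the human input has been judged invalid.
    Assumption: machine-only control takes precedence over the low-confidence
    case; the source does not state the order of the two checks.
    """
    if machine_only:
        return a_r  # machine controls alone
    if confidence < low_conf:
        return a_h  # inference too uncertain: execute the human command
    return a_s      # shared control


takeover = arbitrate(0.9, True, "h", "r", "s")
human_only = arbitrate(0.2, False, "h", "r", "s")
shared = arbitrate(0.6, False, "h", "r", "s")
```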
[0050] Algorithm 1: Shared autonomy based on DQN under invalid human input.
[0051] Initialize the experience pool $D$ with capacity $N$, the update step size $\eta$, the discount factor $\gamma$, and the weight update interval $K$;
[0052] Initialize the evaluation network $Q$ with random weights $\theta$;
[0053] Initialize the target network with weights $\theta^- = \theta$;
[0054] repeat (for each episode)
[0055] repeat (for each step)
[0056] Get the current environment state $s_t$ and human input $a_h$;
[0057] Infer the intent and obtain the action $a_s$ according to formula (2);
[0058] Arbitrate according to formula (5) to obtain the action $a_t$;
[0059] Perform action $a_t$, and obtain the new state $s_{t+1}$ and reward value $r_t$;
[0060] Store the quadruple $(s_t, a_t, r_t, s_{t+1})$ obtained through online interaction in the experience pool $D$;
[0061] If state $s_{t+1}$ is the final state, then:
[0062] repeat
[0063] Sample a batch from the experience pool; to distinguish them from samples obtained through online interaction, the sampled transitions are indexed by $j$, giving quadruples $(s_j, a_j, r_j, s_{j+1})$;
[0064] $a'_{j+1} = \arg\max_{a'} Q(s_{j+1}, a'; \theta)$;
[0065]
[0066] until the set number of cycles is reached
[0067] Copy the weights to the target network every $K$ steps;
[0068] until the end of the episode
[0069] until the set number of episodes is reached
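The update inside the inner loop of Algorithm 1 selects the next action with the evaluation network (step [0064]) while a separate target network with weights θ⁻ is maintained. That pattern matches the standard Double DQN target, so the sketch below assumes that target; the step that actually computes it ([0065]) is not legible in the source. The `done` flag and the callables `q_online` and `q_target` are hypothetical stand-ins for the networks.

```python
def double_dqn_targets(batch, q_online, q_target, gamma):
    """Compute regression targets for a batch of transitions.

    batch: iterable of (s, a, r, s_next, done). The source stores quadruples;
    the done flag is an assumption added here to handle terminal states.
    q_online, q_target: callables mapping a state to a list of Q values.
    """
    targets = []
    for s, a, r, s_next, done in batch:
        if done:
            targets.append(r)  # no bootstrap from a terminal state
            continue
        online = q_online(s_next)
        # Select the next action with the evaluation network (weights theta)...
        a_star = max(range(len(online)), key=lambda i: online[i])
        # ...and evaluate it with the target network (weights theta minus).
        targets.append(r + gamma * q_target(s_next)[a_star])
    return targets


# Toy stand-ins for the two networks.
q_online = lambda s: [0.0, 2.0, 1.0]
q_target = lambda s: [5.0, 3.0, 9.0]
batch = [(0, 0, 1.0, 1, False), (0, 1, -1.0, 1, True)]
ys = double_dqn_targets(batch, q_online, q_target, gamma=0.9)
```

In the toy example the evaluation network prefers action 1, so the target network's value for action 1 is used even though the target network itself rates action 2 higher.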
[0070] In some examples of this invention, experiments were conducted using the control methods described above in the LunarLander scenario of OpenAI Gym. LunarLander simulates a real landing environment, where the closer the lander is to the ground, the faster its descent speed and the more difficult it becomes to control. Therefore, the machine requires a large amount of data to learn the control of the three engines, and these training processes are an unbearable burden for human participants; thus, the intelligent machine is trained independently without user intervention. As shown in Figure 2, there are three pairs of flags on the landing surface, their coordinates randomly generated at the start of each mission. The area between each pair of flags is flat, while the heights of the other areas are randomly generated. The lunar lander has a thruster on each side and a main engine in the center. The user and the machine jointly control these three engines to land the lunar lander between a pair of flags without a collision; if they do so, the mission is successful. If the lunar lander crashes into the ground, flies out of bounds, remains stationary outside the flag pair, or fails to land within 1000 steps (i.e., time runs out), the mission fails. The agent knows its current position and the positions of the three pairs of flags, but does not know which one the user's target is. While playing the game, the user selects a landing point by flag color and takes control actions. The agent infers the user's target based on these actions and controls the lunar lander to approach the landing point.
[0071] The action space for this task consists of six discrete actions: turning on or off three engines. The state space is an 11-dimensional vector, including the lunar lander's current position, angle, velocity, angular velocity, whether it has touched the ground, and the coordinates of the three flags. The reward function penalizes the velocity, angle, and distance to the inferred target; that is, the faster the velocity or the greater the tilt, the greater the penalty, aiming to train the agent to move steadily and slowly towards the target point. A smooth landing at the target point is rewarded with a large amount, while crashing into the ground or flying off the boundary is penalized with a large amount.
[0072] The experiment used an LSTM network with two hidden layers and 32 neurons per layer for intent reasoning. A multilayer perceptron with two hidden layers and 64 nodes per layer was used to implement the DQN algorithm. The similarity function between actions, $f(a, a_h)$, is used to determine whether two actions control the same engine or whether they control the lander to move in the same direction. For example, turning the left thruster on and turning the right thruster off move the lander in the same direction, so the similarity is 1, i.e., f((left,on),(right,off))=1; turning the left thruster on and turning the left thruster off are opposite operations on the same engine, so the similarity is -1, i.e., f((left,on),(left,off))=-1.
[0073] Random operations are used to simulate abnormal human states. If seven out of ten consecutive human actions satisfy $d(a_h, a_{\max}) \ge 0.7$, meaning the reward value of the human input is less than 30% of the maximum reward value, the human input is deemed invalid, and the machine takes over the task and re-infers the intent. Conversely, if the reward value exceeds 70% of the maximum value in 7 out of 10 consecutive human inputs, the human is considered to have returned to normal, and the lander is jointly controlled by the human and the machine. When re-inferring the intent, if the target is the same in 5 out of 7 consecutive re-inferences, it is considered that the user has actually switched to that target, and the human and the machine will jointly control the lander to approach the new target. If the inference confidence level is less than 0.3, the lander will directly execute the human input, as the machine's actions are too uncertain to be considered.
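The 7-out-of-10 validity rule above can be sketched with a sliding window. This is an illustration under assumptions: the normalized distance $d = 1 - Q'(a_h)/Q'(a_{\max})$ is inferred from the statement that $d \ge 0.7$ corresponds to less than 30% of the maximum reward value, and clearing the window after each mode switch is a design choice made for this example, not something the text specifies. The names `action_distance` and `ValidityMonitor` are hypothetical.

```python
from collections import deque


def action_distance(q_values, a_h):
    """Normalized gap between the human action and the best action (assumed form)."""
    q_min = min(q_values.values())
    q_max_shifted = max(q_values.values()) - q_min
    if q_max_shifted == 0:
        return 0.0  # all actions are equally good
    return 1.0 - (q_values[a_h] - q_min) / q_max_shifted


class ValidityMonitor:
    """Switch to machine-only control after 7 of the last 10 inputs look invalid."""

    def __init__(self, window=10, votes=7, invalid_at=0.7, valid_at=0.3):
        self.recent = deque(maxlen=window)
        self.votes = votes
        self.invalid_at = invalid_at
        self.valid_at = valid_at
        self.machine_only = False

    def update(self, distance):
        self.recent.append(distance)
        if not self.machine_only:
            if sum(d >= self.invalid_at for d in self.recent) >= self.votes:
                self.machine_only = True
                self.recent.clear()  # assumption: start a fresh window after switching
        elif sum(d < self.valid_at for d in self.recent) >= self.votes:
            self.machine_only = False
            self.recent.clear()
        return self.machine_only


monitor = ValidityMonitor()
states_bad = [monitor.update(0.9) for _ in range(7)]   # seven poor inputs in a row
states_good = [monitor.update(0.1) for _ in range(7)]  # seven good inputs in a row
d_example = action_distance({"a": 0.0, "b": 10.0, "c": 2.0}, "c")
```

Re-inference of the target during machine-only control (the 5-out-of-7 rule) could be tracked with a second window of the same kind.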
[0074] Each player was informed of the game rules beforehand and practiced independently 20 times to familiarize themselves with the controls and environment. They then collaborated with a trained agent for 20 more trials to adapt and optimize their gameplay. Each player completed two experiments during the descent: one where they changed their target and another where they didn't. To facilitate data collection and analysis, target flag colors were assigned to each player. In the first experiment, the user could not change their target and always guided the lunar lander to land in the middle of the yellow flags. In the second experiment, the user's target changed from the blue flag to the yellow flag; the timing of this target change was determined by the user.
[0075] This experiment was designed to verify the effectiveness of the method and analyze its differences in control performance compared to other shared autonomy methods that do not consider invalid human input. Therefore, DQN was used to implement a commonly used shared autonomy method, consistently executing the action with the highest reward value, as a control experiment. Each player completed six tasks: Human Individual Control (HIC), Highest Value Shared Autonomy (HVSA), and our proposed shared autonomy method (SAIHI), using both fully valid and partially invalid input modes. Each task had 20,000 steps, with a maximum of 1,000 steps per scene, resulting in at least 20 scenes per task. The specific number of steps per scene depended on the player's ability, typically ranging from 300 to 700 steps. To ensure continuous invalid human input in each scene, players were given randomized steps 100 and 200. The end time of these randomized actions was determined by the player. The order in which the control tasks were presented to the players was balanced to avoid bias in the results due to increasing player proficiency.
[0076] Figure 3 displays the success rate and average path length of 10 players across 6 tasks. Figure 3(a) demonstrates the quantitative and qualitative advantages of combining player and machine control. When dealing with the sudden descent of the lander, players struggle to simultaneously control the engines in three dimensions to maintain stability, often causing the lander to crash into the ground or fly off the boundary within 150 steps, making it difficult to land accurately in the required position without a collision. This is primarily due to humans' limited ability to accurately manipulate object movement in multiple dimensions simultaneously. Shared control between player and machine significantly increases the likelihood of a successful landing, as the machine can precisely control the lander's dynamics. When all human input is valid, our method has a slightly higher success rate than the highest-value shared autonomy (HVSA) method. The ANOVA (analysis of variance, statistically testing differences between all variables used in the experiment) result is F = 7.1130, p = 0.0157, indicating that our method is more likely to succeed than the other method. We attribute this improvement to the arbitration function of shared control. As shown in Figure 3(b), our method significantly outperforms the conventional method when there is persistent invalid human input: the ANOVA results are F = 30.48, p = 3.04902e-5. A larger F value indicates greater differences between groups, and p < 0.05 indicates statistically significant differences between groups. Our method can determine whether input is invalid, and the machine will promptly take over the system for individual control, preventing invalid player behavior from affecting the task process, thereby effectively improving the success rate. As can be seen from Figure 3(c) and Figure 3(d), when all human input is valid, our method can complete the task in a shorter time, thanks to our more efficient arbitration method. However, when some human input is invalid, our method can continue for a longer period, ensuring that the task does not fail immediately after losing valid external control commands. The machine provides the player with some buffer time to attempt to return to shared control, which is considered the optimal control mode.
[0077] The machines, equipment, and intelligent agents in the above examples can refer to instruments, controllers, control systems, etc., with automatic or semi-automatic control capabilities. The automatic or semi-automatic control capabilities originate from built-in control modules, storage media carrying control programs, instructions, or algorithms, etc. The control method described in this invention can be built into a computer-readable storage medium in the form of instructions, and the control method described in the above examples is implemented when the instructions are executed. More specifically, the instructions can be a computer-readable language. The computer mentioned above can be a general-purpose computer device or a special-purpose computer device. In specific implementations, the computer can be a desktop computer, a portable computer, a network server, a PDA (Personal Digital Assistant), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. The storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. For example, the storage medium may be, but is not limited to, magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Versatile Discs (DVDs)), or semiconductor media (e.g., Solid State Disks (SSDs)).
[0078] In the description of this specification, references to terms such as "an embodiment," "example," "specific example," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0079] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the claimed invention.
Claims
1. A shared autonomy method based on deep reinforcement learning, characterized in that it comprises:
a data acquisition module, used to collect the environmental state of the device and the user's intervention behavior instructions, wherein the environmental state includes the device's current position, angle, speed, and angular velocity;
an intent reasoning module, which infers the intent of the intervention behavior instruction through a long short-term memory network and calculates the confidence level of the intent reasoning;
an action selection module, which trains, through deep reinforcement learning, an end-to-end mapping from the environmental state and the intervention behavior instruction to behavior reward values, and calculates the reward value of each behavior, wherein the reward value represents the benefit that the intervention behavior instruction brings to the current task of the device;
a validity determination module, used to monitor the validity of the intervention behavior instructions, wherein the intervention behavior instruction is determined to be invalid when its reward value drops below a preset value, or when the difference between the highest reward value calculated by the action selection module and the reward value of the intervention behavior instruction is greater than a preset value;
an arbitration and execution module, used to execute the following control strategy based on the result of the validity determination module: when the intervention behavior instruction is determined to be valid, an arbitration function generates a shared control instruction from the intervention behavior instruction and the highest-reward behavior calculated by the action selection module, thereby driving the device to move; when the intervention behavior instruction is determined to be invalid, its control over the device is cut off, and a control instruction generated from the highest-reward behavior calculated by the action selection module drives the device to move; when the intervention behavior instruction is determined to be invalid multiple times in a row, control is switched so that the device is controlled autonomously, without the intervention behavior instructions.
The following arbitration function is used for shared control:
[formula not reproduced in the source text]
In the formula, $a_s$ is the action calculated by the action selection module for execution under shared control; $A$ is the set of all possible actions; $Q'(s,a) = Q(s,a) - \min_{a' \in A} Q(s,a')$ is the reward value of action $a$ minus the minimum reward value among all actions; $a_{max}$ is the action with the highest value in the current environmental state, as calculated by the deep Q-network; $f(a, a_h)$ calculates the similarity between behavior $a$ and intervention behavior $a_h$; $s$ is the environmental state; and $c$ is the confidence level of the intent reasoning.
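The validity check and control switching recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `shifted_q`, `is_invalid`, and `ValidityMonitor`, the mode labels, the threshold values, and the choice to reset the counter after a valid instruction are all assumptions made for illustration. The arbitration function itself is not sketched, because its formula is not reproduced in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


def shifted_q(q_values: Sequence[float]) -> List[float]:
    """Q'(s, a) = Q(s, a) - min over a' of Q(s, a'), so all values are >= 0."""
    q_min = min(q_values)
    return [q - q_min for q in q_values]


def is_invalid(q_values: Sequence[float], human_action: int,
               min_reward: float, max_gap: float) -> bool:
    """Invalid if the human action's reward value is below a preset value,
    or if it trails the highest reward value by more than a preset gap."""
    q_h = q_values[human_action]
    return q_h < min_reward or (max(q_values) - q_h) > max_gap


@dataclass
class ValidityMonitor:
    min_reward: float             # hypothetical preset value
    max_gap: float                # hypothetical preset value
    max_consecutive_invalid: int  # hypothetical "multiple times in a row"
    consecutive_invalid: int = 0

    def mode(self, q_values: Sequence[float], human_action: int) -> str:
        """Return 'shared', 'override', or 'autonomous' for this step."""
        if is_invalid(q_values, human_action, self.min_reward, self.max_gap):
            self.consecutive_invalid += 1
        else:
            # Assumption: a valid instruction resets the counter.
            self.consecutive_invalid = 0
        if self.consecutive_invalid >= self.max_consecutive_invalid:
            return "autonomous"  # device controlled alone
        if self.consecutive_invalid > 0:
            return "override"    # human command cut off for this step
        return "shared"          # arbitration between human and agent
```

In both the `override` and `autonomous` modes the device would execute the highest-reward action; the sketch keeps them apart only to mirror the two cases in the claim.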
2. The shared autonomy method based on deep reinforcement learning according to claim 1, characterized in that the action selection module uses deep reinforcement learning to train a deep Q-network as the mapping.
3. The shared autonomy method based on deep reinforcement learning according to claim 2, characterized in that the deep Q-network takes the environmental state and the intervention behavior instruction as input, and outputs the reward values of all behaviors under the current environmental state.
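The input/output shape of the deep Q-network in claim 3 can be sketched as follows. This is only an illustration with untrained random weights, written in plain Python: the class name `TinyQNetwork`, the hidden-layer size, the one-hot encoding of the intervention behavior instruction, and the discrete action set are assumptions that the text does not specify.

```python
import random
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Encode a discrete intervention action as a one-hot vector (assumption)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


class TinyQNetwork:
    """Two-layer perceptron: (state, human action) -> one value per action."""

    def __init__(self, state_dim: int, n_actions: int,
                 hidden: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = state_dim + n_actions
        self.n_actions = n_actions
        # Random, untrained weights; training by reinforcement learning
        # is outside the scope of this sketch.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                   for _ in range(n_actions)]
        self.b2 = [0.0] * n_actions

    def q_values(self, state: Sequence[float], human_action: int) -> List[float]:
        # Concatenate the environmental state with the encoded human action.
        x = list(state) + one_hot(human_action, self.n_actions)
        # Hidden layer with ReLU activation.
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        # Linear output layer: one reward value per possible action.
        return [sum(w * v for w, v in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]


# State: position, angle, speed, angular velocity (scalars here for brevity).
net = TinyQNetwork(state_dim=4, n_actions=3)
q = net.q_values([0.0, 0.1, -0.2, 0.05], human_action=1)
best_action = max(range(len(q)), key=q.__getitem__)
```

A single forward pass yields the reward values of every action, so the highest-reward action and the reward value of the human's action can both be read from the same output.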
4. The shared autonomy method based on deep reinforcement learning according to claim 3, characterized in that the intent reasoning module calculates the confidence level as follows: the difference between the maximum probability and the minimum probability over the target set output by the long short-term memory network is taken as the confidence level; when the confidence level is lower than a preset low value, the device is controlled by the intervention behavior instruction; when the confidence level is higher than a preset high value, the device is controlled solely by the highest-reward behavior calculated by the action selection module; when the confidence level is between the preset low value and the preset high value, the device is jointly controlled by the intervention behavior instruction and the highest-reward behavior calculated by the action selection module.
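The confidence rule in claim 4 can be sketched in a few lines of Python. The function names, the mode labels, and the threshold values 0.3 and 0.7 used in the example are hypothetical; the claim only states that preset low and high values exist.

```python
from typing import Sequence


def intent_confidence(goal_probs: Sequence[float]) -> float:
    """Confidence = maximum probability minus minimum probability over the
    goal distribution produced by the intent reasoning network."""
    return max(goal_probs) - min(goal_probs)


def control_mode(confidence: float, low: float, high: float) -> str:
    """Map the confidence level to one of the three regimes in the claim."""
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if confidence < low:
        return "human"       # intervention instruction controls the device
    if confidence > high:
        return "autonomous"  # highest-reward behavior controls the device
    return "shared"          # joint control


# Example with hypothetical thresholds.
probs = [0.5, 0.25, 0.25]
mode = control_mode(intent_confidence(probs), low=0.3, high=0.7)
```

With a nearly uniform goal distribution the confidence is close to zero and the human keeps control; as one goal comes to dominate, control shifts toward the agent.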