A robot-oriented partial observation reinforcement learning method, device and equipment
By training the decision model through feature extraction and memory modules, and optimizing the memory modules and the actor network using the MSE loss function, the problem of inaccurate decision-making caused by robot sensor noise is solved, and accurate action decision-making is achieved in partially observed Markov environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHERN MARINE SCI & ENG GUANGDONG LAB (ZHUHAI)
- Filing Date
- 2024-09-20
- Publication Date
- 2026-06-12
AI Technical Summary
In real-world scenarios, the state information acquired by robot sensors is noisy and incomplete, which prevents reinforcement learning agents from making accurate decisions. Existing technologies struggle to achieve effective action decision-making in partially observed Markov environments.
We employ a partial observation reinforcement learning approach for robots, training a decision model through a feature extraction network and a memory module. We then optimize the memory module and the actor network using the MSE loss function to improve the robustness and denoising performance of the decision model. Finally, we utilize a contrastive auxiliary task to learn confidence states and generate more accurate decisions.
In environments with noise and incomplete information, the accuracy of robot actions and the robustness of decision-making models are improved, enabling more accurate environmental state perception and decision-making.
Figure CN119274035B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of reinforcement learning technology, and in particular to a partial observation reinforcement learning method, apparatus and device for robots.
Background Technology
[0002] Deep reinforcement learning (DRL) combines deep learning with reinforcement learning. It has the perception and representation capabilities of deep learning as well as the decision-making capabilities of reinforcement learning, and thus overcomes the shortcomings of classical reinforcement learning in estimating value functions and fitting policy functions. Currently, the main achievements of reinforcement learning are still concentrated in virtual environments such as games. This is because the observation space of a virtual environment built for training reinforcement learning agents offers accurate observations and inexpensive dataset generation.
[0003] However, when reinforcement learning is applied to real-world scenarios, the environment is often stochastic, complex, and partially observable. Specifically, in real-world scenarios, the acquired state information often contains noise due to various reasons (e.g., sensor measurements of velocity and displacement often have errors), resulting in a partially observed Markov environment. Such inaccurate state information often leads the agent to make incorrect decisions, ultimately causing the model to fail. Therefore, it is necessary to develop a partially observed reinforcement learning method, device, and apparatus for robots to achieve accurate decision-making for actions even when the acquired information is noisy and incomplete.
Summary of the Invention
[0004] In view of the above problems, embodiments of this application provide a partial observation reinforcement learning method, apparatus, and device for robots, in order to overcome the above problems or at least partially solve the above problems.
[0005] A first aspect of this application provides a partial observation reinforcement learning method for robots, applied to robots, the method comprising:
[0006] Obtain a first training dataset, wherein each piece of first training data in the first training dataset comprises the observation information at a given time and the corresponding historical observation information sequence, both obtained in a partially observed Markov environment; the historical observation information sequence is the sequence formed by the observation information of the previous N times, arranged in chronological order.
[0007] The first training data at time t is input into the feature extraction network to extract features, thereby obtaining the current observation features and the historical observation feature sequence at time t.
[0008] The current observation features and historical observation feature sequences at time t are input into the memory module to obtain the confidence state at time t;
[0009] Based on the confidence state at time t, the first candidate confidence state at time t+1 is predicted;
[0010] The first training data at time t+1 is input into the feature extraction network to extract features, thereby obtaining the current observation features and the historical observation feature sequence at time t+1.
[0011] The current observation features and historical observation feature sequences at time t+1 are input into the target memory module to obtain the second candidate confidence state at time t+1.
[0012] Calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1;
[0013] Based on the MSE loss function value, the parameters of the memory module and the actor network are updated;
[0014] The first training data is reselected, and iterative training is performed until the MSE loss function value converges or the preset number of training iterations is reached, so as to obtain the trained memory module and the trained actor network.
[0015] Using the trained memory module and the actor network, the robot's action at the current moment is determined based on the observation information at the current moment.
[0016] A second aspect of this application also provides a robot-oriented partial observation reinforcement learning apparatus for executing the robot-oriented partial observation reinforcement learning method described in the first aspect, the apparatus comprising:
[0017] The first training dataset acquisition module is used to acquire the first training dataset. Each piece of first training data in the first training dataset comprises the observation information at a given time and the corresponding historical observation information sequence, both obtained in a partially observed Markov environment. The historical observation information sequence is the sequence formed by the observation information of the previous N times, arranged in chronological order.
[0018] The first feature extraction module is used to input the first training data at time t into the feature extraction network, perform feature extraction, and obtain the current observation features and the historical observation feature sequence at time t.
[0019] The first confidence state generation module is used to input the current observation features and historical observation feature sequences at time t into the memory module to obtain the confidence state at time t.
[0020] The first confidence state prediction module is used to predict the first candidate confidence state at time t+1 based on the confidence state at time t.
[0021] The second feature extraction module is used to input the first training data at time t+1 into the feature extraction network, perform feature extraction, and obtain the current observation features and the historical observation feature sequence at time t+1.
[0022] The second confidence state generation module is used to input the current observation features and historical observation feature sequences at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1.
[0023] The loss function calculation module is used to calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1.
[0024] The parameter update module is used to update the parameters of the memory module and the actor network based on the MSE loss function value.
[0025] The training end module is used to reselect the first training data and perform iterative training until the MSE loss function value converges or the preset number of training iterations is reached, so as to obtain the trained memory module and the trained actor network.
[0026] An action prediction module is used to determine the robot's action at the current moment based on the observation information at the current moment, using the trained memory module and the actor network.
[0027] A third aspect of this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the partial observation reinforcement learning method for robots described in the first aspect of this application.
[0028] The fourth aspect of this application also provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the steps in the robot-oriented partial observation reinforcement learning method described in the first aspect of this application.
[0029] The fifth aspect of this application also provides a computer program product that, when run on an electronic device, causes a processor to execute the steps in the partial observation reinforcement learning method for robots as described in the first aspect of this application.
[0030] This application provides a partial observation reinforcement learning method for robots, applied to robots. The method includes: acquiring a first training dataset, where each piece of first training data comprises the observation information at a given time and the corresponding historical observation information sequence, both acquired in a partially observed Markov environment, the historical observation information sequence being the sequence formed by the observation information of the previous N times arranged in chronological order; inputting the first training data at time t into a feature extraction network to obtain the current observation features and the historical observation feature sequence at time t; inputting the current observation features and the historical observation feature sequence at time t into a memory module to obtain the confidence state at time t; predicting the first candidate confidence state at time t+1 based on the confidence state at time t; inputting the first training data at time t+1 into the feature extraction network to obtain the current observation features and the historical observation feature sequence at time t+1; inputting the current observation features and the historical observation feature sequence at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1; calculating the MSE loss function value based on the first and second candidate confidence states at time t+1; updating the parameters of the memory module and the actor network based on the MSE loss function value; reselecting the first training data and iterating until the MSE loss function value converges or a preset number of training iterations is reached, yielding the trained memory module and the trained actor network; and using the trained memory module and actor network to determine the robot's action at the current time based on the current observation information.
[0031] The specific beneficial effects are as follows. This application trains the decision model (the memory module and the actor network) with a contrastive auxiliary task, which encourages the confidence state predicted from the current confidence state (the first candidate confidence state) to be similar to the confidence state inferred by the target memory module (the second candidate confidence state). Following the idea of contrastive learning, the similarity between the two candidate confidence states is measured by the mean squared error (MSE) loss between them, which serves as the objective of the auxiliary task and improves the robustness and denoising performance of the decision model. The method is therefore applicable to partially observed Markov environments (i.e., environments in which the collected information is noisy and not fully accurate) and improves the performance of offline reinforcement learning, producing a more accurate estimate of the environmental state (the confidence state) and hence a more accurate decision (the action executed at the current moment).
Attached Figure Description
[0032] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0033] Figure 1 is a flowchart illustrating the steps of a partial observation reinforcement learning method for robots provided in an embodiment of this application.
[0034] Figure 2 is a schematic diagram of the structure of a decision model provided in an embodiment of this application.
[0035] Figure 3 is a schematic diagram of a process for training a decision model using a contrastive auxiliary task, provided in an embodiment of this application.
[0036] Figure 4 is a schematic diagram of a partial observation reinforcement learning apparatus for robots provided in an embodiment of this application.
[0037] Figure 5 is a schematic diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0038] Exemplary embodiments of this application will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of this application and to fully convey the scope of this application to those skilled in the art.
[0039] Deep learning has rapidly propelled the development of artificial intelligence. Algorithms based on deep learning, such as image recognition, object detection, text generation, and language translation, have gradually matured and are increasingly becoming part of people's daily lives. As an important branch of artificial intelligence, reinforcement learning is a machine learning paradigm distinct from supervised and unsupervised learning. It learns optimal decision-making strategies by interacting with the environment and receiving rewards, and is suitable for general sequential decision-making processes.
[0040] Deep reinforcement learning (DRL) combines deep learning with reinforcement learning. It has the perception and representation capabilities of deep learning and the decision-making capabilities of reinforcement learning, and thus overcomes the shortcomings of classical reinforcement learning in estimating value functions and fitting policy functions. Deep reinforcement learning has many well-known applications. For example, the recently released large language model ChatGPT uses reinforcement learning from human feedback, enabling the model to better understand human language. Deep learning is also applied in bioinformatics; the AlphaFold algorithm can accurately predict the three-dimensional structure of a protein from its amino acid sequence. Furthermore, deep reinforcement learning has achieved significant results in fields such as Go, e-sports, and autonomous driving.
[0041] Offline Reinforcement Learning (Offline RL), also known as Batch Reinforcement Learning (BRL), is a reinforcement learning paradigm that differs from online reinforcement learning. It requires the agent to learn from a fixed dataset rather than by exploring the environment. In other words, Offline RL studies how to make maximal use of static, offline datasets to train reinforcement learning agents.
[0042] Despite its remarkable achievements in many fields, reinforcement learning's main breakthroughs are currently concentrated in virtual environments such as games. This is because the observation space of a virtual environment built for training reinforcement learning agents offers accurate observations and inexpensive dataset generation. However, when reinforcement learning is applied to the real world, such as in medical robotics, the challenge of deploying agents developed in virtual environments becomes unavoidable. In real-world scenarios (e.g., medical scenarios), the environment is typically stochastic, complex, and partially observable. For example, in a cardiac ultrasound robot, ultrasound images are the input to the reinforcement learning algorithm, and ideally they would be accurate and noise-free. In reality, however, ultrasound images often contain significant noise, and their quality is limited by the characteristics of ultrasound waves and of the imaging equipment. Furthermore, images are often blurred or incomplete due to the patient's breathing or slight body movements. In practical scenarios it is therefore difficult for agents to obtain accurate state information (e.g., sensor measurements of velocity and displacement often contain errors), and this inaccurate state information often leads the agent to make incorrect decisions, ultimately causing the model to fail.
[0043] In view of the above problems, this application proposes a partial observation reinforcement learning method, apparatus, and device for robots, thereby achieving accurate decision-making on robot actions even when the collected information is noisy or incomplete. The partial observation reinforcement learning method for robots provided in this application will be described in detail below with reference to the accompanying drawings and through some embodiments and application scenarios.
[0044] The first aspect of this application provides a partial observation reinforcement learning method for robots, applied to robots. Figure 1 is a flowchart of the steps of the partial observation reinforcement learning method for robots provided in this embodiment. As shown in Figure 1, the method includes:
[0045] Step S101: Obtain the first training dataset. Each piece of first training data in the first training dataset comprises the observation information at a given time and the corresponding historical observation information sequence, both obtained in a partially observed Markov environment. The historical observation information sequence is the sequence formed by the observation information of the previous N times, arranged in chronological order.
[0046] Step S102: Input the first training data at time t into the feature extraction network to extract features and obtain the current observation features and historical observation feature sequence at time t.
[0047] Step S103: Input the current observation features and historical observation feature sequences at time t into the memory module to obtain the confidence state at time t.
[0048] Step S104: Based on the confidence state at time t, predict the first candidate confidence state at time t+1.
[0049] Step S105: Input the first training data at time t+1 into the feature extraction network to perform feature extraction, and obtain the current observation features and the historical observation feature sequence at time t+1.
[0050] Step S106: Input the current observation features and historical observation feature sequences at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1.
[0051] Step S107: Calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1.
[0052] Step S108: Update the parameters of the memory module and the actor network according to the MSE loss function value.
[0053] Step S109: Select the first training data again and perform iterative training until the MSE loss function value converges or the preset number of training iterations is reached, and obtain the trained memory module and the trained actor network.
[0054] Step S110: Using the trained memory module and the actor network, determine the robot's action at the current moment based on the observation information at the current moment.
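As an illustration of steps S107 and S109, the following minimal Python sketch computes the MSE between two candidate confidence states and checks a stopping condition. The function names `mse_loss` and `should_stop`, the example vectors, and the tolerance-based convergence test are assumptions made for illustration; the source states only that training stops when the MSE loss converges or a preset number of iterations is reached.

```python
def mse_loss(first_candidate, second_candidate):
    """Mean squared error between two equal-length candidate confidence states."""
    if len(first_candidate) != len(second_candidate):
        raise ValueError("candidate confidence states must have the same length")
    squared = [(p - q) ** 2 for p, q in zip(first_candidate, second_candidate)]
    return sum(squared) / len(squared)


def should_stop(loss_history, max_iters, tol=1e-6):
    """Stop when the iteration budget is spent or the loss has stopped changing.

    The change-below-tolerance test is an assumed convergence criterion.
    """
    if len(loss_history) >= max_iters:
        return True
    if len(loss_history) < 2:
        return False
    return abs(loss_history[-1] - loss_history[-2]) < tol


# Hypothetical candidate confidence states at time t+1.
first_candidate = [0.2, 0.4, 0.6]    # predicted from the confidence state at time t
second_candidate = [0.2, 0.0, 1.0]   # produced by the target memory module
loss = mse_loss(first_candidate, second_candidate)
```

In a full implementation the loss would be backpropagated through the memory module and the actor network (step S108); that part is omitted here because it depends on the chosen deep learning framework.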
[0055] For environments in which sensor information is noisy or incomplete, a mathematical model called the Partially Observable Markov Decision Process (POMDP) has been proposed. Under this model, because observations are noisy or incomplete, it is unreliable for an agent to select an action based solely on the current observation. A common approach in POMDPs is to encode past observations and actions into a confidence state (commonly called a belief state in the POMDP literature) and to select the robot's action based on that confidence state. This embodiment uses offline reinforcement learning to train a decision model (comprising a feature extraction network, a memory module, and an actor network) that generates accurate decisions (the robot's action at the current moment) in POMDP environments (i.e., environments with noisy and incomplete sensor information).
[0056] A partially observed Markov decision process can be defined by a 7-tuple (S, A, O, P, R, Z, γ), where S represents the state space, A the action space, O the observation space, P the state transition probability, R the reward function, Z the observation function, and γ∈(0,1] the discount factor. Let s∈S denote the state of the environment, a∈A the action of the agent (i.e., the decision model), and o∈O the agent's observation. The difference between a partially observed Markov decision process and a Markov decision process is that, at the same time t, o_t = s_t in a Markov decision process, whereas in general o_t ≠ s_t in a partially observed Markov decision process. Therefore, the Markov decision process can be regarded as a special case of the partially observed Markov decision process.
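The distinction between the two processes can be illustrated with a toy observation function. This is a minimal sketch under assumptions: the function name `observe`, the additive Gaussian noise model, and the two-dimensional example state are hypothetical and are not taken from the source, which does not specify the form of the observation function Z.

```python
import random


def observe(state, noise_std, rng):
    """Hypothetical observation function Z: the state plus Gaussian sensor noise."""
    return [s + rng.gauss(0.0, noise_std) for s in state]


rng = random.Random(0)
state = [1.0, -2.0]                    # true state s_t (e.g., displacement, velocity)
mdp_obs = observe(state, 0.0, rng)     # noise-free sensor: the observation equals the state
pomdp_obs = observe(state, 0.5, rng)   # noisy sensor: the observation differs from the state
```

With zero noise the observation coincides with the state, which is the Markov decision process special case described above.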
[0057] Specifically, in step S101, a first training dataset is first established. It includes multiple pieces of first training data at different times. Each piece includes the observation information acquired in a partially observed Markov environment at that time and the corresponding historical observation information sequence. The observation information is the information collected by the sensor in the POMDP at that time, so the observation information in the first training data may contain noise. The historical observation information sequence is the sequence of observation information from the N times prior to that time, arranged in chronological order, where N is a positive integer greater than 1. For example, in a medical scenario, a cardiac ultrasound robot needs to perform the next surgical operation based on the cardiac ultrasound images acquired by its image sensor. In this scenario, the observation information o_t is the echocardiogram image acquired by the image sensor at time t. Assuming one echocardiogram image is acquired every second, the corresponding historical observation information sequence h_t is the sequence of 60 echocardiogram images acquired within the previous minute, arranged chronologically: o_{t-N}, o_{t-N+1}, ..., o_{t-1}. If the current time t satisfies t≥N, the historical observation information sequence h_t at time t is defined as the set of the past N observations; otherwise, the observation vectors at times less than or equal to 0 are filled with zero vectors of the same dimension as the observations.
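The construction of the historical observation sequence with zero padding can be sketched as follows. The function name `build_history` and the toy one-dimensional observations are assumptions for illustration. The sketch uses 0-based list indices, so the padded positions are those with a negative index; under the 1-based time convention of the text these correspond to times less than or equal to 0.

```python
def build_history(observations, t, n):
    """Return the n observations preceding index t, oldest first.

    Positions that fall before the first recorded observation are filled
    with zero vectors of the same dimension as the observations.
    """
    obs_dim = len(observations[0])
    history = []
    for k in range(t - n, t):
        if k >= 0:
            history.append(list(observations[k]))
        else:
            history.append([0.0] * obs_dim)
    return history


# Toy observations with dimension 1, recorded at indices 0..3.
observations = [[1.0], [2.0], [3.0], [4.0]]
early = build_history(observations, t=1, n=3)   # needs two padded entries
later = build_history(observations, t=3, n=3)   # fully covered by real data
```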
[0058] In step S102, the first training data at time t is input into the feature extraction network to obtain the current observation features and the historical observation feature sequence at time t. Specifically, a piece of first training data is randomly selected from the first training dataset, and t denotes the time corresponding to it. This first training data includes the observation information at time t and the historical observation information sequence composed of the observation information of the N times prior to time t. The observation information at time t is input into the feature extraction network to obtain the current observation features at time t; the historical observation information sequence at time t is input into the feature extraction network to obtain the historical observation feature sequence at time t. In this embodiment, the feature extraction network can be a feature extractor based on a convolutional neural network, such as ResNet34. For example, in the cardiac ultrasound robot scenario, the cardiac ultrasound image (observation information) at time t is input into the feature extraction network to extract the image features at time t.
[0059] Before performing step S102, in order to accurately extract the desired features, the feature extraction network needs to be trained first. In one possible implementation, before training the memory module, the method further includes: training the feature extraction network; the training process of the feature extraction network is as follows:
[0060] Step S201: Obtain the second training dataset. Each second training data in the second training dataset includes: observation information at the corresponding time and a sequence of historical observation information.
[0061] Step S202: Input the second training data into the feature extraction network to extract features, and obtain the current observation features and historical observation feature sequence of the second training data.
[0062] Step S203: Input the current observation features and historical observation feature sequences of the second training data into the classifier for classification prediction to obtain the classification result. The classification result represents the probability distribution of the execution action corresponding to the second training data; specifically, the classifier can be a linear classifier, and its specific structure is not limited in this embodiment.
[0063] Step S204: Calculate the loss function value based on the classification result and the label of the second training data. The label of the second training data represents the action that should be taken at the corresponding time point.
[0064] Step S205: Update the parameters of the feature extraction network according to the loss function value.
[0065] Step S206: Reselect the second training data to train the feature extraction network until the loss function value converges or the preset number of training iterations are reached, then end the training and obtain the trained feature extraction network.
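Steps S203 and S204 can be sketched with a linear classifier followed by a softmax. The source does not name the loss function; the cross-entropy loss used here is an assumption, as are the function names and the toy weights.

```python
import math


def linear_classifier(features, weights, bias):
    """Compute one logit per action: weights[k] . features + bias[k]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into a probability distribution over actions."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability of the labelled action (assumed loss)."""
    return -math.log(probs[label])


# Toy example: two feature dimensions, two candidate actions.
features = [2.0, 0.0]                     # stands in for extracted observation features
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
probs = softmax(linear_classifier(features, weights, bias))
loss = cross_entropy(probs, label=0)      # label: the action that should be taken
```

In the two-stage scheme described in the text, only the feature extraction network is kept after this stage; the classifier exists solely to provide a training signal.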
[0066] In related technologies, the agent (i.e., the decision model used to predict actions) is often tasked with extracting either observation features or temporal features. To enable the agent to fully learn both types of features, this application proposes a two-stage deep reinforcement learning framework in which the agent learns observation features and temporal features in separate stages. Specifically, in the first stage, through steps S201-S206, the feature extraction network is trained to learn observation features. The data format of the observation information varies with the application scenario: it can be measured numerical data, captured image data, recorded audio data, and so on. In the case of a cardiac ultrasound robot, the captured cardiac ultrasound images constitute the observation information, and the feature extraction network needs to learn image features. The second training data is fed into the feature extraction network and then into a linear classifier to obtain the probability distribution over the actions to be executed at that moment. The purpose of this stage is to train the feature extraction network on its own: training the feature extraction network and the memory module together would increase the model's learning difficulty and degrade its ability to extract both observation features and historical information.
[0067] After the feature extraction network has been trained in the first stage, the linear classifier is discarded and the parameters of the feature extraction network are frozen. Then, through steps S101-S109, the memory module and the actor network are trained. That is, the main target of training in the second stage (steps S101-S109) is the memory module, and the auxiliary task is used to improve the memory module's ability to extract temporal features.
[0068] In step S103, the current observation features and the historical observation feature sequence at time t extracted in step S102 are input into the memory module to obtain the confidence state b_t at time t.
[0069] Specifically, in POMDP environments, decisions based solely on the current observation often perform poorly, because the current observation does not provide sufficient information to accurately describe the true state. Related research indicates that the distribution of the true state can be inferred from prior experience. This inference is called a confidence state: it is the decision model's belief over the possible true states and is formally defined by the Bayesian posterior distribution p(s_t | o_{≤t}, a_{<t}), which combines prior experience with the uncertainty of the unobserved world. In this embodiment, h_t := (o_{<t}, a_{<t}) is defined as the observation-action history (i.e., the historical observation information sequence composed of the observation information of the previous N time steps, together with the historical action sequence composed of the actions executed at the previous N time steps), and the confidence state b_t := φ(h_t) is defined as a function of h_t. If b_t can be learned and used as a sufficient statistic of the state, i.e., p(s_t | o_{≤t}, a_{<t}) ≈ p(s_t | b_t), then b_t can be used as the representation of confidence when training the decision model in the POMDP. The POMDP problem can therefore be addressed by introducing the concept of confidence states.
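To make the posterior over true states concrete, the following sketch performs one exact Bayesian update of a confidence state in a small discrete POMDP. It is an illustration only: the source learns the confidence state with a memory module rather than computing it in closed form, and the function name `belief_update` and the two-state example are assumptions.

```python
def belief_update(belief, transition, obs_likelihood):
    """One exact Bayesian update of a discrete confidence (belief) state.

    belief[s]          : prior probability of state s
    transition[s][s2]  : P(s2 | s, a) for the action that was taken
    obs_likelihood[s2] : probability of the received observation in state s2
    """
    n = len(belief)
    # Prediction: push the prior through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction: weight each state by how well it explains the observation.
    unnormalised = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalised]


prior = [0.5, 0.5]                          # no idea which of two states holds
stay = [[1.0, 0.0], [0.0, 1.0]]             # the action leaves the state unchanged
likelihood = [0.9, 0.1]                     # the observation favours state 0
posterior = belief_update(prior, stay, likelihood)
```

An exact update of this kind requires known transition and observation models, which are typically unavailable for a real robot; this is one reason a learned encoding of the history is attractive.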
[0070] In one possible implementation, step S103, which involves inputting the current observation features and the historical observation feature sequence into the trained memory module to extract temporal features and obtain the confidence state at the current moment, includes:
[0071] Step S1031: Encode the current observed features to obtain the current encoding state.
[0072] Step S1032: Encode the historical observation feature sequence to obtain the historical encoding state.
[0073] Step S1033: Concatenate the current encoding state with the historical encoding state to obtain the confidence state at the current moment.
[0074] Specifically, refer to Figure 2, which shows a schematic diagram of the structure of the decision model. As shown in Figure 2, the decision model consists of a memory module and an actor network (the executor network). The memory module includes a memory neural network for storing a set of historical observations (in Figure 2, a fully connected (FC) layer, an LSTM network, and another FC layer connected in series) and a fully connected layer (the FC module shown in Figure 2). The memory module has two inputs: the current observation feature and the historical observation feature sequence. It uses the memory neural network to encode the historical observation feature sequence, obtaining the historical encoding state, and uses the fully connected layer to encode the current observation feature, obtaining the current encoding state. These two outputs are concatenated (the concatenation symbol shown in Figure 2), and the concatenated vector is used as the confidence state at the current time step. The confidence state is then passed to the actor network, which, as shown in Figure 2, consists of multiple FC layers with the tanh activation function between them and produces the action output. This design (a memory module comprising a memory neural network and a fully connected layer) aims to balance the weights of historical and current observations, so that the decision model can extract important information from both. Since the current observation contains the latest information about the environment, its importance is correspondingly increased. By increasing the weight of the current observation as an input to the neural network, the algorithm can focus more on the features of the current state, making the agent's decisions more accurate and timely.
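The two-branch encoding and concatenation described above can be illustrated with a minimal sketch. It is written under stated assumptions and is not the embodiment's implementation: the historical branch uses a plain tanh recurrence as a stand-in for the FC-LSTM-FC stack, and all names, dimensions, and initial values (`linear`, `encode_history`, `confidence_state`, `feat_dim`, and so on) are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: returns W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def encode_history(feature_seq, w_in, w_rec, bias):
    """Stand-in recurrent encoder, NOT an LSTM: h <- tanh(W_in x + W_rec h + b)."""
    hidden = [0.0] * len(bias)
    zero = [0.0] * len(bias)
    for x in feature_seq:  # oldest feature first
        pre = [a + r for a, r in zip(linear(x, w_in, bias),
                                     linear(hidden, w_rec, zero))]
        hidden = [math.tanh(p) for p in pre]
    return hidden


def confidence_state(current_feat, feature_seq, current_fc, history_enc):
    """Concatenate the current encoding state with the historical encoding state."""
    current_code = linear(current_feat, *current_fc)
    history_code = encode_history(feature_seq, *history_enc)
    return current_code + history_code  # list concatenation


# Hypothetical sizes: 4-dim features, 3-dim current code, 5-dim history code.
rng = random.Random(0)
feat_dim, cur_dim, hist_dim = 4, 3, 5
current_fc = init_linear(feat_dim, cur_dim, rng)
w_in, b_h = init_linear(feat_dim, hist_dim, rng)
w_rec, _ = init_linear(hist_dim, hist_dim, rng)
history = [[rng.random() for _ in range(feat_dim)] for _ in range(6)]
current = [rng.random() for _ in range(feat_dim)]
b_t = confidence_state(current, history, current_fc, (w_in, w_rec, b_h))
```

In this sketch the relative sizes of the two codes (`cur_dim` versus `hist_dim`) are one way to adjust how much of the confidence state comes from the current observation; the source does not state how the weighting is actually realised.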
[0075] In one possible implementation, the memory module and the target memory module include a long short-term memory network, which consists of a fully connected layer-LSTM layer-fully connected layer, with the activation function between each layer being the ReLU function.
[0076] Specifically, so that the confidence state b_t can fully represent historical information, embodiments of this application construct the memory module from models with memory capability, such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). These models can remember previous state information and incorporate it into the current state representation, thereby improving the performance and stability of the decision model. In this embodiment, as shown in Figure 2, a long short-term memory (LSTM) network is used as the memory neural network in the memory module (corresponding to step S1032 above, which encodes the historical observation feature sequence to obtain the historical encoding state) to extract information from historical observations. LSTM is a memory network improved from the recurrent neural network; it alleviates the vanishing and exploding gradient problems common in ordinary recurrent neural networks and has better short-term and long-term memory capabilities. However, because the environment rewards used to train reinforcement learning agents are usually sparse signals, the LSTM layer's ability to extract temporal features tends to be weak, so it cannot provide sufficient statistics for the confidence state b_t. This embodiment therefore improves the estimation of the confidence state by using a contrastive auxiliary task in steps S104-S108.
[0077] In step S104, based on the confidence state at time t obtained in step S103, the first candidate confidence state at time t+1 is predicted.
[0078] In one possible implementation, step S104, predicting the first candidate confidence state at time t+1 based on the confidence state at time t, includes:
[0079] Obtain the historical action information sequence at time t. The historical action information sequence represents the sequence of action information from the previous N times, including time t, arranged in chronological order.
[0080] The historical action information sequence at time t and the confidence state at time t are input into the state prediction module to predict the first candidate confidence state at time t+1.
[0081] The state prediction module consists of a linear layer to generate continuous confidence states.
[0082] Specifically, N is a positive integer greater than 1. For example, in a medical scenario, a cardiac ultrasound robot needs to perform the next surgical operation based on cardiac ultrasound images acquired by an image sensor. In this scenario, the observation information o_t represents the echocardiogram image acquired by the image sensor at time t. Assuming one echocardiogram image is acquired every second, the corresponding historical observation information sequence h_t is the sequence of 60 echocardiogram images acquired within the previous minute, arranged chronologically (o_{t-N}, o_{t-N+1}, ..., o_{t-1}). Correspondingly, the action information represents the actions performed by the cardiac ultrasound robot during surgery. Assuming one action is performed every second, the historical action information sequence is the 60 actions performed by the cardiac ultrasound robot in the previous minute, arranged chronologically (α_{t-N}, α_{t-N+1}, ..., α_{t-1}). In the auxiliary task, a new neural network (the state prediction module) is created. It is a linear neural network that takes the historical action information sequence and the confidence state at time t (b_t) as input and predicts the confidence state at the next time step, i.e., the first candidate confidence state at time t+1 (b_{t+1}).
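As an illustration of the state prediction module, the following minimal sketch applies a single linear layer to the confidence state at time t together with the historical action sequence. How the two inputs are combined is not specified in the text; concatenating the confidence state with the flattened actions is an assumption made here, and the function name and arguments are hypothetical.

```python
def predict_next_confidence(b_t, action_history, weights, bias):
    """Linear forward model: maps [b_t ; flattened actions] to a predicted b_{t+1}.

    b_t            -- confidence state at time t (list of floats)
    action_history -- chronological list of action vectors (list of lists)
    weights, bias  -- parameters of the single linear layer
    """
    # Assumption: the inputs are joined by simple concatenation.
    flat_actions = [a for action in action_history for a in action]
    x = list(b_t) + flat_actions
    if any(len(row) != len(x) for row in weights):
        raise ValueError("each weight row must match the input length")
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

Because the layer is linear, its output is an unbounded real-valued vector, which matches the statement that the module generates continuous confidence states.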
[0083] In step S105, the first training data at time t+1 is input into the feature extraction network to perform feature extraction, thereby obtaining the current observation features and the historical observation feature sequence at time t+1.
[0084] Specifically, in step S102, the first training data at time t is input. Correspondingly, in step S105, the first training data for the next time step, i.e., the first training data at time t+1, needs to be input. Feature extraction is then performed through a feature extraction network. Specifically, the first training data at time t+1 includes: the observation information at time t+1, and a sequence of historical observation information composed of observation information from the N time steps prior to time t+1. The observation information at time t+1 is input into the feature extraction network to obtain the current observation features at time t+1; the sequence of historical observation information at time t+1 is input into the feature extraction network to obtain the historical observation feature sequence at time t+1. For example, in the application scenario of a cardiac ultrasound robot, the cardiac ultrasound image (observation information) at time t+1 is input into the feature extraction network to extract the cardiac ultrasound image features at time t+1 (in this case, the feature extraction network extracts image features).
[0085] In step S106, the current observation features and historical observation feature sequences at time t+1 are input into the target memory module to obtain the second candidate confidence state at time t+1.
[0086] Refer to Figure 3, which shows a flowchart of training the decision model with a contrastive auxiliary task. As shown in Figure 3, the current observation features and historical observation feature sequence at time t+1 extracted in step S105 are input into the target memory module to predict the second candidate confidence state at time t+1 (b_{t+1} shown in Figure 3). This embodiment includes a target memory module, whose initialization parameters are completely identical to those of the memory module before training; its function is likewise to store experience data from past interactions between the decision model and the environment. With a target memory module, an independent target memory neural network can be used during training to output the current action and estimate the confidence state for the next moment. This prevents the parameters of the memory module from gradually converging to zero during training, reduces instability during training, and improves the convergence speed and performance of the algorithm. This target memory module can be represented as...
[0087] In step S107, the MSE loss function value is calculated based on the first candidate confidence state and the second candidate confidence state at time t+1.
[0088] This embodiment creates a linear connection layer and treats it as a forward model (the state prediction module shown in Figure 3). It takes a confidence state (b_t) and the actions (α_{t-N}, α_{t-N+1}, ..., α_{t-1}, α_t) as input, and outputs a prediction of the confidence state at the next time step (the first candidate confidence state shown in Figure 3). This embodiment minimizes the mean squared error (MSE) between the confidence state inferred by the forward model (the first candidate confidence state) and the confidence state output by the target memory module (the second candidate confidence state) in order to optimize the memory module (the MSE Loss shown in Figure 3), thereby improving the memory module's ability to extract information.
[0089] In one possible implementation, calculating the MSE loss function based on the first candidate confidence state and the second candidate confidence state at time t+1 includes:
[0090] The MSE loss function is calculated according to the following formula, based on the first candidate confidence state and the second candidate confidence state at time t+1:
[0091]
[0092] Here, the two arguments of the loss are the first candidate confidence state at time t+1 and the second candidate confidence state at time t+1. The loss function is the mean squared error (MSE) between these two states; minimizing it optimizes the memory module so that it learns to better extract historical information (temporal features) and to ignore incomplete observation features.
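A minimal sketch of this loss is given below. It assumes that the error is averaged over the dimensions of the confidence state (whether the embodiment averages or sums is not stated) and that the second candidate is treated as a fixed target, so only the gradient with respect to the first candidate is computed. The function names are hypothetical.

```python
def mse_loss(first_candidate, second_candidate):
    """Mean squared error between two confidence-state vectors of equal length."""
    if len(first_candidate) != len(second_candidate):
        raise ValueError("confidence states must have the same dimension")
    n = len(first_candidate)
    return sum((p - q) ** 2
               for p, q in zip(first_candidate, second_candidate)) / n


def mse_grad_wrt_first(first_candidate, second_candidate):
    """Gradient of mse_loss with respect to the first candidate only.

    The second candidate is treated as a constant target, so no gradient
    flows back to the network that produced it.
    """
    n = len(first_candidate)
    return [2.0 * (p - q) / n
            for p, q in zip(first_candidate, second_candidate)]
```

For example, `mse_loss([1.0, 2.0], [1.0, 4.0])` evaluates to 2.0, and the loss is 0.0 when the two candidates coincide.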
[0093] Step S108: Update the parameters of the memory module and the executor network according to the MSE loss function value.
[0094] As shown in Figure 3, in the first stage the feature extraction network is trained to provide better feature representations for the reinforcement learning in the second stage. The second stage focuses on extracting historical features, using historical information to correct incomplete observation information such as artifacts. Specifically, the memory module is trained through steps S101-S108 to learn how to extract temporal features. In this embodiment, the idea of contrastive learning is introduced into the auxiliary task, replacing the noisy observations in the target dataset with confidence states. The confidence state at time t+1 is represented by two neural networks (yielding the first candidate confidence state and the second candidate confidence state). Bringing these two representations closer together allows the model to extract more general historical features and avoid noise interference. To prevent the decision model from collapsing, gradient backpropagation to the target memory module is stopped, and its parameters are updated using a soft update method (the EMA shown in Figure 3).
[0095] In one possible implementation, the initialization parameters of the target memory module are the same as those of the memory module, and the training process further includes:
[0096] After updating the parameters of the memory module, the parameters of the target memory module are updated according to the following formula:
[0097] φ′←(1-τ)φ′+τφ;
[0098] Wherein, φ′ represents the parameters of the target memory module, φ represents the parameters of the memory module, and τ represents the pre-set hyperparameter (generally set to a constant close to 1, such as 0.99).
[0099] Specifically, if the target memory module were updated together with the memory module while the mean squared error between the first candidate confidence state and the second candidate confidence state is minimized, the network could easily learn that the loss function approaches its minimum when all network parameters approach 0, leading to network collapse and training failure. Therefore, this embodiment employs a soft update method, making the update speed of the target memory module slower than that of the memory module, thereby improving the stability of the model.
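The soft update φ′←(1-τ)φ′+τφ can be sketched as follows, with the parameters represented as flat lists of floats for illustration (the function name and this representation are assumptions). Under the formula as written, τ is the weight placed on the memory module's parameters: τ = 0 leaves the target memory module unchanged, τ = 1 copies the memory module outright, and a small τ makes the target memory module track the memory module slowly.

```python
def soft_update(target_params, online_params, tau):
    """Return updated target parameters: (1 - tau) * target + tau * online."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

For example, `soft_update([0.0, 10.0], [1.0, 0.0], 0.25)` returns `[0.25, 7.5]`. The target parameters are only ever changed by this rule, never by the gradient of the MSE loss.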
[0100] Step S109: Select the first training data again and perform iterative training until the MSE loss function value reaches a convergent state or the preset number of training iterations are reached, and obtain the trained memory module and the executor network.
[0101] Step S110: Using the trained memory module and the executor network, determine the robot's action at the current moment based on the observation information at the current moment.
[0102] After training the memory module and executor network through steps S101-S109, the model is used to predict the action to be performed at the current moment. For example, in the application scenario of cardiac ultrasound robots, the robot can determine the action to be performed during surgery based on the cardiac ultrasound image captured at the current moment and perform the corresponding surgical operation. In the partially observed Markov environment (where the observed information contains noise), this embodiment uses an offline reinforcement learning method to obtain higher decision-making performance, so as to generate more accurate decisions (robot actions) using the observation information at the current moment. To achieve this goal, this embodiment uses a memory-based reinforcement learning method and also uses auxiliary tasks, contrastive learning, and other methods to improve the decision model (memory module and executor network). This significantly improves the performance of the final trained decision model, which can be applied to various decision-making scenarios (such as cardiac ultrasound robots).
[0103] In one possible implementation, step S110, utilizing the trained memory module and the executor network, determines the robot's action at the current moment based on the observation information at the current moment, including:
[0104] Step S1101: Obtain the observation information at the current moment and the corresponding historical observation information sequence. Specifically, in practical applications, the obtained observation information (including the historical observation information sequence) may contain some noise.
[0105] Step S1102: Input the current observation information and the corresponding historical observation information sequence into the feature extraction network to extract features, and obtain the current observation features and the historical observation feature sequence.
[0106] Step S1103: Input the current observation features and the historical observation feature sequence into the trained memory module to extract temporal features and obtain the confidence state at the current moment; the confidence state represents the predicted true state at the current moment.
[0107] Step S1104: Input the confidence state at the current moment into the trained executor network to determine the robot's execution action at the current moment.
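Steps S1101 to S1104 can be summarised in a short sketch. The three networks are passed in as plain callables; their names and signatures (`feature_net`, `memory_module`, `actor`) are hypothetical placeholders for the trained components and are not taken from the source.

```python
def decide_action(observation, history, feature_net, memory_module, actor):
    """Map the current (possibly noisy) observation and its history to an action.

    observation   -- observation information at the current moment (S1101)
    history       -- chronological list of the previous N observations (S1101)
    feature_net   -- callable: observation -> feature vector
    memory_module -- callable: (current feature, feature sequence) -> confidence state
    actor         -- callable: confidence state -> action
    """
    # S1102: extract features from the current observation and from each
    # historical observation, using the same feature extraction network.
    current_feat = feature_net(observation)
    history_feats = [feature_net(o) for o in history]
    # S1103: the trained memory module yields the confidence state.
    belief = memory_module(current_feat, history_feats)
    # S1104: the trained executor network maps the confidence state to an action.
    return actor(belief)
```

At inference time only the memory module is used; the target memory module and the state prediction module serve the auxiliary task during training.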
[0108] This application employs a memory-based reinforcement learning method, introducing a memory module and using auxiliary tasks, contrastive learning, and other methods to improve the reinforcement learning algorithm. This allows it to be applied to partially observed Markov environments (i.e., environments where the collected information is inaccurate or noisy), improving the performance of offline reinforcement learning and generating a more accurate understanding of the environmental state (confidence state), thereby obtaining a more accurate decision (the robot's action at the current moment).
[0109] In recent years, reinforcement learning has demonstrated enormous application potential in many fields. For example, it has achieved remarkable success in tasks such as robot control, autonomous driving, and esports. Furthermore, reinforcement learning has shown broad application prospects in practical problems such as resource allocation, network optimization, and medical decision-making. Although current research on reinforcement learning mainly focuses on problems requiring rich and accurate spatial information, this assumption is not practical in real-world robotic applications, especially in the field of medical robotics. Limited by sensor sensitivity, sensor noise, the complexity of human physiology, and certain characteristics of medical examinations, the environmental information received by medical robots often contains incomplete or inaccurate information. Partially observable Markov decision processes (POMDPs) provide a way to model such incomplete observations. In POMDPs, the agent can only observe partial information about the environment and needs to infer and estimate to gain a more accurate understanding of the environmental state. Reinforcement learning algorithms in POMDPs can make decisions and learn by learning a model of the environment and inferring states, thus adapting to the incomplete observations of the real world.
[0110] This application proposes using a contrastive auxiliary task to train the decision model (memory module and executor network), so that it learns the similarity between the confidence state predicted from the current confidence state (the first candidate confidence state) and the confidence state inferred by the target memory module (the second candidate confidence state). Following the idea of contrastive learning, the similarity between the two is measured by their mean squared error, which serves as the objective of the auxiliary task. This improves the robustness and denoising performance of the decision model, enabling it to maintain good decision performance in a POMDP environment.
[0111] A second aspect of this application also provides a partial observation reinforcement learning apparatus for robots. Refer to Figure 4, which shows the partial observation reinforcement learning apparatus for robots. As shown in Figure 4, the apparatus is used to perform the partial observation reinforcement learning method for robots described in the first aspect, and includes:
[0112] The first training dataset acquisition module is used to acquire the first training dataset. Each item of first training data in the first training dataset comprises observation information and the corresponding historical observation information sequence, obtained in a partially observed Markov environment at different times. The historical observation information sequence is a sequence composed of the observation information of the previous N times, arranged in chronological order.
[0113] The first feature extraction module is used to input the first training data at time t into the feature extraction network, perform feature extraction, and obtain the current observation features and the historical observation feature sequence at time t.
[0114] The first confidence state generation module is used to input the current observation features and historical observation feature sequences at time t into the memory module to obtain the confidence state at time t.
[0115] The first confidence state prediction module is used to predict the first candidate confidence state at time t+1 based on the confidence state at time t.
[0116] The second feature extraction module is used to input the first training data at time t+1 into the feature extraction network, perform feature extraction, and obtain the current observation features and the historical observation feature sequence at time t+1.
[0117] The second confidence state generation module is used to input the current observation features and historical observation feature sequences at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1.
[0118] The loss function calculation module is used to calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1.
[0119] The parameter update module is used to update the parameters of the memory module and the executor network based on the MSE loss function value.
[0120] The training end module is used to reselect the first training data and perform iterative training until the MSE loss function value reaches a convergent state or the preset number of training times is reached, so as to obtain the training completed memory module and the executor network.
[0121] An action prediction module is used to determine the robot's action at the current moment based on the observation information at the current moment, using the trained memory module and the executor network.
[0122] In one possible implementation, the action prediction module includes:
[0123] The observation information acquisition submodule is used to acquire the observation information at the current moment, as well as the corresponding historical observation information sequence;
[0124] The feature extraction submodule is used to input the current observation information and the corresponding historical observation information sequence into the feature extraction network for feature extraction to obtain the current observation features and the historical observation feature sequence.
[0125] The confidence state generation submodule is used to input the current observation features and the historical observation feature sequence into the trained memory module to extract temporal features and obtain the confidence state at the current moment; the confidence state represents the predicted true state at the current moment.
[0126] The action determination submodule is used to input the confidence state at the current moment into the trained executor network to determine the robot's action at the current moment.
[0127] In one possible implementation, the initialization parameters of the target memory module are the same as those of the memory module, and the device further includes:
[0128] The target memory module parameter update module is used to update the parameters of the target memory module according to the following formula after updating the parameters of the memory module:
[0129] φ′←(1-τ)φ′+τφ;
[0130] Wherein, φ′ represents the parameters of the target memory module, φ represents the parameters of the memory module, and τ represents the preset hyperparameters.
[0131] In one possible implementation, the first confidence state prediction module includes:
[0132] The historical action information acquisition submodule is used to acquire the historical action information sequence at time t. The historical action information sequence represents a sequence composed of action information from the previous N times, including time t, arranged in chronological order.
[0133] The first candidate confidence state prediction submodule is used to input the historical action information sequence at time t and the confidence state at time t into the state prediction module to predict the first candidate confidence state at time t+1.
[0134] In one possible implementation, the memory module and the target memory module include a long short-term memory network, which consists of a fully connected layer-LSTM layer-fully connected layer, with the activation function between each layer being the ReLU function.
[0135] In one possible implementation, the loss function calculation module includes:
[0136] The calculation submodule is used to calculate the MSE loss function according to the first candidate confidence state and the second candidate confidence state at time t+1, using the following formula:
[0137]
[0138] Here, the two arguments of the loss are the first candidate confidence state at time t+1 and the second candidate confidence state at time t+1.
[0139] In one possible implementation, the confidence state generation submodule includes:
[0140] The first encoding unit is used to encode based on the current observed features to obtain the current encoding state;
[0141] The second encoding unit is used to encode based on the historical observation feature sequence to obtain the historical encoding state;
[0142] The splicing unit is used to splice the current encoding state with the historical encoding state to obtain the confidence state at the current moment.
[0143] In one possible implementation, the apparatus further includes: a feature extraction network training module, used to train the feature extraction network before training the memory module; the training process of the feature extraction network is as follows:
[0144] Obtain a second training dataset, wherein each second training data in the second training dataset includes: observation information at the corresponding time and a sequence of historical observation information;
[0145] The second training data is input into the feature extraction network for feature extraction to obtain the current observation features and historical observation feature sequences of the second training data.
[0146] The current observation features and historical observation feature sequences of the second training data are input into the classifier for classification prediction to obtain the classification result, which represents the probability distribution of the execution action corresponding to the second training data.
[0147] Calculate the loss function value based on the classification result and the labels of the second training data;
[0148] The parameters of the feature extraction network are updated based on the loss function value;
[0149] The second training data is selected again to train the feature extraction network until the loss function value converges or the preset number of training iterations are reached. The training ends, and the trained feature extraction network is obtained.
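The classification loss used to pre-train the feature extraction network is not named in the text. As one plausible instance, the sketch below assumes a softmax over the classifier's raw scores and a cross-entropy loss against the labelled action; both choices and all function names are assumptions.

```python
import math


def softmax(logits):
    """Turn raw classifier scores into a probability distribution over actions."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the labelled action index."""
    return -math.log(probs[label])
```

With two equally scored actions, `softmax([0.0, 0.0])` gives `[0.5, 0.5]` and the loss for either label is ln 2, which falls toward 0 as the classifier assigns more probability to the labelled action.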
[0150] This application also provides an electronic device. Refer to Figure 5, which is a schematic diagram of the electronic device proposed in an embodiment of this application. As shown in Figure 5, the electronic device 100 includes a memory 110 and a processor 120. The memory 110 and the processor 120 are connected via a bus for communication. The memory 110 stores a computer program that can run on the processor 120 to implement the steps in the robot-oriented partial observation reinforcement learning method disclosed in the embodiments of this application.
[0151] This application also provides a computer-readable storage medium storing a computer program / instructions thereon, which, when executed by a processor, implements the steps in the robot-oriented partial observation reinforcement learning method disclosed in this application.
[0152] This application also provides a computer program product that, when run on an electronic device, causes a processor to execute the steps of the partial observation reinforcement learning method for robots disclosed in this application.
[0153] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0154] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, electronic devices, and computer program products according to embodiments of this application. It should be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, produce an apparatus for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0155] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0156] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0157] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.
[0158] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0159] The foregoing has provided a detailed description of a partial observation reinforcement learning method, apparatus, and device for robots provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and its core ideas. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A partial observation reinforcement learning method for robots, characterized in that the method is applied to a robot and includes: obtaining a first training dataset, wherein each item of first training data in the first training dataset comprises observation information and a corresponding historical observation information sequence obtained in a partially observed Markov environment at different times; the historical observation information sequence represents a sequence of observation information from the previous N times arranged in chronological order; the observation information represents the cardiac ultrasound image at the corresponding time obtained by the image sensor; inputting the first training data at time t into the feature extraction network to extract features, thereby obtaining the current observation features and the historical observation feature sequence at time t; inputting the current observation features and the historical observation feature sequence at time t into the memory module to obtain the confidence state at time t; the confidence state is expressed by the following formula: ; wherein the terms of the formula denote the confidence state at time t and the observation-action history at time t, and the observation-action history at time t includes: a sequence of historical observation information composed of observation information from the previous N time steps, and a sequence of historical execution actions composed of execution actions from the previous N time steps; predicting the first candidate confidence state at time t+1 based on the confidence state at time t; inputting the first training data at time t+1 into the feature extraction network to extract features, thereby obtaining the current observation features and the historical observation feature sequence at time t+1; inputting the current observation features and the historical observation feature sequence at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1;
Calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1; Based on the MSE loss function value, the parameters of the memory module and the executor network are updated; The first training data is reselected, and iterative training is performed until the MSE loss function value reaches a convergent state or the preset number of training iterations are reached, so as to obtain the trained memory module and the executor network. Using the trained memory module and the executor network, the robot's action at the current moment is determined based on the observation information at the current moment; The step of determining the robot's action at the current moment based on the observation information at the current moment, using the trained memory module and the executor network, includes: Obtain the observation information at the current moment, as well as the corresponding historical observation information sequence; The current observation information and the corresponding historical observation information sequence are respectively input into the feature extraction network for feature extraction to obtain the current observation features and the historical observation feature sequence; The current observation features and the historical observation feature sequence are input into the trained memory module to extract temporal features and obtain the confidence state at the current moment; the confidence state represents the predicted true state at the current moment. The confidence state at the current moment is input into the trained executor network to determine the robot's action at the current moment.
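The historical observation information sequence of claim 1 behaves like a sliding window over the previous N observations. The following is a minimal sketch of such a window, assuming a fixed-length buffer; the class name `ObservationHistory`, the method name `sample`, and the string placeholders standing in for sensor images are illustrative and do not appear in the claims.

```python
from collections import deque


class ObservationHistory:
    """Keeps the previous N observations in chronological order (hypothetical helper)."""

    def __init__(self, n):
        # A deque with maxlen=n silently drops the oldest entry once it is full.
        self._buffer = deque(maxlen=n)

    def sample(self, current):
        """Return (current observation, previous observations, oldest first)."""
        history = list(self._buffer)   # snapshot taken before adding the current one
        self._buffer.append(current)   # the current observation becomes history next step
        return current, history


window = ObservationHistory(n=2)
for observation in ["o0", "o1", "o2", "o3"]:
    current, history = window.sample(observation)
    print(current, history)
```

Until N observations have been collected, the returned history is shorter than N. How that warm-up period is handled (for example by padding) is not stated in the claims and is left open here.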
2. The partial observation reinforcement learning method for robots according to claim 1, characterized in that the initialization parameters of the target memory module are the same as those of the memory module, and the training process further includes: after updating the parameters of the memory module, updating the parameters of the target memory module according to the following formula: ; where the symbols of the formula denote the parameters of the target memory module, the parameters of the memory module, and a predefined hyperparameter.
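The update formula of claim 2 is not reproduced in this text. A common form for target-network updates of this kind is an exponential moving average, in which the target parameters move a small step toward the online parameters. The sketch below assumes that form; the function name `soft_update`, the symbol `tau`, and the flat list of floats standing in for network weights are all assumptions made for illustration.

```python
def soft_update(target_params, online_params, tau):
    """Assumed moving-average update: target <- tau * online + (1 - tau) * target."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


# The target module starts with the same parameters as the memory module.
memory_params = [0.5, -1.0, 2.0]
target_params = list(memory_params)

# Suppose a gradient step has changed the memory-module parameters.
memory_params = [1.5, -1.0, 0.0]

# The target follows slowly; with tau = 0.1 it moves 10% of the way.
target_params = soft_update(target_params, memory_params, tau=0.1)
print(target_params)
```

A small `tau` keeps the target memory module nearly fixed between updates, which is the usual motivation for keeping a separate target network.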
3. The partial observation reinforcement learning method for robots according to claim 1, characterized in that the step of predicting the first candidate confidence state at time t+1 based on the confidence state at time t includes: obtaining the historical action information sequence at time t, wherein the historical action information sequence is the sequence of action information from the previous N times, including time t, arranged in chronological order; and inputting the historical action information sequence at time t and the confidence state at time t into the state prediction module to predict the first candidate confidence state at time t+1.
4. The partial observation reinforcement learning method for robots according to claim 1, characterized in that the memory module and the target memory module each include a long short-term memory network composed of fully connected layers (LSTM layers), with the ReLU function as the activation function between layers.
5. The partial observation reinforcement learning method for robots according to claim 1, characterized in that the step of calculating the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1 includes: calculating the MSE loss function value from the first candidate confidence state and the second candidate confidence state at time t+1 according to the following formula: ; where the symbols of the formula denote the first candidate confidence state at time t+1 and the second candidate confidence state at time t+1.
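The loss formula of claim 5 is not reproduced in this text. The sketch below uses the standard definition of mean squared error, the average of the squared element-wise differences between the two candidate confidence states. Whether the patent averages or sums over the state dimensions cannot be recovered from this text, so the averaging is an assumption, as are the function name `mse_loss` and the example vectors.

```python
def mse_loss(first_candidate, second_candidate):
    """Mean of squared element-wise differences between two equal-length vectors."""
    if len(first_candidate) != len(second_candidate) or not first_candidate:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - q) ** 2 for p, q in zip(first_candidate, second_candidate)]
    return sum(squared) / len(squared)


# Hypothetical 3-dimensional confidence states at time t+1.
predicted_state = [0.0, 1.0, 2.0]   # first candidate, from the state prediction
target_state = [0.0, 3.0, 2.0]      # second candidate, from the target memory module

loss = mse_loss(predicted_state, target_state)
print(loss)
```

The loss is zero only when the predicted confidence state matches the one produced by the target memory module in every dimension.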
6. The partial observation reinforcement learning method for robots according to claim 1, characterized in that the step of inputting the current observation features and the historical observation feature sequence into the trained memory module to extract temporal features and obtain the confidence state at the current moment includes: encoding the current observation features to obtain the current encoding state; encoding the historical observation feature sequence to obtain the historical encoding state; and concatenating the current encoding state with the historical encoding state to obtain the confidence state at the current moment.
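The three steps of claim 6 (encode the current features, encode the history, concatenate) can be illustrated with a toy example. The encoders below are deliberately simple stand-ins and are not the memory module of the claims: `encode_current` applies an element-wise tanh, and `encode_history` folds the sequence into a single vector with a fixed recurrent update. All function names and the fixed 0.5 mixing weights are assumptions.

```python
import math


def encode_current(features):
    """Stand-in encoder for the current observation features."""
    return [math.tanh(x) for x in features]


def encode_history(feature_sequence):
    """Stand-in recurrent encoder: folds the sequence (oldest first) into one vector."""
    state = [0.0] * len(feature_sequence[0])
    for features in feature_sequence:
        state = [math.tanh(0.5 * s + 0.5 * x) for s, x in zip(state, features)]
    return state


def confidence_state(current_features, history_features):
    """Concatenate the current encoding state with the historical encoding state."""
    return encode_current(current_features) + encode_history(history_features)


belief = confidence_state([0.2, -0.4], [[0.1, 0.0], [0.3, -0.2]])
print(len(belief))
```

With 2-dimensional features, the concatenated confidence state has 4 dimensions: the first two come from the current observation and the last two summarize the history.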
7. The partial observation reinforcement learning method for robots according to claim 1, characterized in that, before training the memory module, the method further includes training the feature extraction network, wherein the training process for the feature extraction network is as follows:
obtaining a second training dataset, wherein each item of second training data in the second training dataset includes the observation information at the corresponding time and a historical observation information sequence;
inputting the second training data into the feature extraction network for feature extraction, to obtain the current observation features and the historical observation feature sequence of the second training data;
inputting the current observation features and the historical observation feature sequence of the second training data into the classifier for classification prediction, to obtain the classification result, which represents the probability distribution over the execution actions corresponding to the second training data;
calculating the loss function value based on the classification result and the labels of the second training data;
updating the parameters of the feature extraction network based on the loss function value;
reselecting second training data and training the feature extraction network until the loss function value converges or the preset number of training iterations is reached, whereupon the training ends and the trained feature extraction network is obtained.
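Claim 7 states that the classifier outputs a probability distribution over execution actions and that a loss is computed against the labels, but it does not name the loss function. The sketch below assumes the usual pairing of a softmax output with a cross-entropy loss; the function names, the three-action example, and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into a probability distribution."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Negative log-probability assigned to the labelled action (assumed loss)."""
    return -math.log(probabilities[label])


# Hypothetical classifier scores for three candidate execution actions.
logits = [2.0, 0.0, 0.0]
probabilities = softmax(logits)

# The label is the index of the action actually executed for this sample.
loss = cross_entropy(probabilities, label=0)
print(probabilities, loss)
```

The loss shrinks as the classifier assigns more probability to the labelled action, which is the signal used to update the feature extraction network during this pre-training stage.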
8. A partial observation reinforcement learning device for robots, characterized in that the device is used to perform the partial observation reinforcement learning method for robots according to any one of claims 1-7, and includes:
a first training dataset acquisition module, used to acquire the first training dataset, wherein each item of first training data in the first training dataset comprises observation information and a corresponding historical observation information sequence obtained at a different time in a partially observed Markov environment, and the historical observation information sequence is the sequence of observation information from the previous N times, arranged in chronological order;
a first feature extraction module, used to input the first training data at time t into the feature extraction network for feature extraction, to obtain the current observation features and the historical observation feature sequence at time t;
a first confidence state generation module, used to input the current observation features and the historical observation feature sequence at time t into the memory module to obtain the confidence state at time t;
a first confidence state prediction module, used to predict the first candidate confidence state at time t+1 based on the confidence state at time t;
a second feature extraction module, used to input the first training data at time t+1 into the feature extraction network for feature extraction, to obtain the current observation features and the historical observation feature sequence at time t+1;
a second confidence state generation module, used to input the current observation features and the historical observation feature sequence at time t+1 into the target memory module to obtain the second candidate confidence state at time t+1;
a loss function calculation module, used to calculate the MSE loss function value based on the first candidate confidence state and the second candidate confidence state at time t+1;
a parameter update module, used to update the parameters of the memory module and the executor network based on the MSE loss function value;
a training end module, used to reselect first training data and train iteratively until the MSE loss function value converges or the preset number of training iterations is reached, to obtain the trained memory module and the trained executor network;
an action prediction module, used to determine the robot's action at the current moment based on the observation information at the current moment, using the trained memory module and the trained executor network.
9. An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the partial observation reinforcement learning method for robots according to any one of claims 1-7.