A brain-like reinforcement learning method and system based on hierarchical experience replay
By using a hierarchical experience replay method, experiences are divided into short-term and long-term memory pools. An attention discrimination module is used to filter and transfer experiences, which solves the problem of insufficient utilization of experience in traditional methods and achieves more efficient learning and decision-making.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2025-11-20
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention relates to the fields of reinforcement learning and brain-like computing technology, and in particular to a brain-like reinforcement learning method and system based on hierarchical experience replay.
Background Technology
[0002] Reinforcement learning, an important branch of machine learning, aims to learn optimal policies through the interaction between an agent and its environment to maximize cumulative rewards. In traditional reinforcement learning, the agent learns through trial and error, adjusting its behavioral strategies based on environmental feedback as it tries different actions. However, this learning method often requires a large amount of sample data, resulting in low data utilization efficiency. To address this issue, the experience replay method has emerged. This method stores the agent's historical experiences generated during interactions with the environment in an experience pool. During the learning process, it randomly draws experience samples from the pool for learning, allowing the agent to reuse these experiences and effectively improving data utilization efficiency. However, traditional experience replay methods generally use a single experience pool to store all experiences, which has significant limitations.
[0003] On the one hand, this storage method does not fully consider the temporal hierarchy of experience. In practical applications, the impact of experiences generated at different times on the agent's learning and decision-making varies considerably. Recently generated experiences are more relevant to the current environment, helping the agent quickly adapt to changes; while earlier generated experiences, although less applicable in the current environment, contain general information and long-term patterns that are crucial for improving the agent's overall decision-making ability. Traditional single experience pools cannot effectively distinguish and utilize these experiences at different time levels, thus affecting learning efficiency and decision quality.
[0004] On the other hand, traditional experience replay methods fail to adequately differentiate the importance of experiences. Different types of experiences in the experience pool have varying roles in the agent's learning and decision-making. Analogous to the human brain's memory patterns, the information in the experience pool can be divided into short-term and long-term memory. Short-term experiences reflect the agent's recent behavior and immediate feedback from the environment, helping the agent respond quickly and adapt to environmental changes; long-term experiences contain more stable and in-depth information, helping the agent make more rational choices in long-term decision-making. Existing methods fail to effectively distinguish and manage these experiences, making it difficult for the agent to fully explore and utilize the value of different experiences during the learning process, thus limiting the performance improvement of reinforcement learning algorithms.
[0005] In summary, existing reinforcement learning experience replay methods have shortcomings in terms of data utilization efficiency and experience management. There is an urgent need for innovative methods and devices to solve these problems in order to improve the performance and effectiveness of reinforcement learning.
Summary of the Invention
[0006] The purpose of this invention is to propose a brain-like reinforcement learning method and system based on hierarchical experience replay. By introducing a brain-like mechanism to optimize the efficiency of experience replay, and by utilizing the thinking patterns of short-term and long-term memory to optimize the experience replay process, the invention improves the experience utilization rate and decision-making efficiency of the agent in the learning process, thereby enhancing the learning effect and performance of reinforcement learning.
[0007] To achieve the above objectives, this invention proposes a brain-like reinforcement learning method based on hierarchical experience playback, comprising the following steps:
[0008] Step S1: Collect observation data of the interaction between the agent and the environment and perform data preprocessing to construct hierarchical training data for reinforcement learning networks. The observation data includes state feature vector data, action encoding data, instant reward signal data, visual observation frames and task description text.
[0009] Step S2: Initialize the actor network, the critic network, and the corresponding target networks. Simultaneously, initialize the experience buffer pool and perform parameter initialization, including the discount factor γ, the maximum number of episodes M, and the maximum number of time steps T;
[0010] Step S3: At the beginning of each training round, initialize the exploration noise, the agent obtains the initial state of the current environment, selects an action from the actor network based on the current state and executes it, and stores the experience samples obtained after executing the action into the experience buffer pool.
[0011] Step S4: Short-term memory pool update. The short-term memory experience pool obtains new samples from the experience buffer pool. If the short-term memory experience pool is full, outdated samples are deleted according to the first-in-first-out strategy.
[0012] Step S5: Experience screening and transfer. The attention discrimination module is used to calculate the similarity between the new experience sample and the existing experience samples in the short-term memory experience pool. The similarity is compared to see if it exceeds the threshold. The number of times the experience is repeatedly rated as similar is recorded. It is then decided whether to transfer some of the experience in the short-term memory experience pool to the long-term memory experience pool.
[0013] Step S6: Network parameter update. $N_1$ samples are drawn from the short-term memory experience pool and $N_2$ samples are drawn from the long-term memory experience pool, and the drawn samples are input into the actor network and the critic network for parameter updates. The target networks' parameters are updated once every fixed number of network parameter updates. $N_1$ and $N_2$ are positive integers.
[0014] Preferably, in step S1, the data preprocessing includes: denoising the observed data, filtering out outliers and invalid data points, normalizing or standardizing the numerical feature data, and performing one-hot encoding or embedding representation on the discrete action code.
[0015] Preferably, the short-term memory experience pool has a limited capacity first-in-first-out (FIFO) structure, and the FIFO strategy ensures that the experience in the pool always remains timely.
[0016] Preferably, the long-term memory experience pool is a full experience pool, in which the experiences are derived from the screening of the short-term memory experience pool. The attention discrimination module evaluates the long-term preservation value of the experiences in the short-term memory experience pool and decides whether to transfer them.
[0017] Preferably, the attention discrimination module uses the cosine similarity method to calculate the similarity between experience samples. By comparing the cosine similarity between the new experience sample and the feature vectors of each existing sample in the short-term memory experience pool, the degree of similarity between the two is quantified.
[0018] Preferably, in step S3, the exploration noise is Gaussian noise, and its standard deviation is set to a large value in the early stage of training to promote the agent to explore the environment extensively. As training progresses, it is gradually reduced so that the agent can focus on learning the optimal policy.
[0019] Preferably, in step S6, the parameter update of the target network adopts a delayed update strategy. Every fixed number of network parameter updates, the parameters of the actor network and the critic network are copied to the target network to update the parameters of the target network.
[0020] This invention also provides a brain-like reinforcement learning system based on hierarchical experience playback, comprising:
[0021] Initialization module: Used to initialize the actor network, critic network and corresponding target network, as well as the experience buffer pool and parameters;
[0022] Interaction module: In each training round, initialize exploration noise, obtain the initial state of the environment, control the agent to select actions from the actor network to execute, and store experience samples in the experience buffer pool;
[0023] Short-term memory pool management module: responsible for retrieving new samples from the experience buffer pool to update the short-term memory experience pool, and deleting outdated samples according to the FIFO strategy;
[0024] Attention filtering module: Calculates the similarity of experience samples and determines whether an experience is transferred from the short-term memory experience pool to the long-term memory experience pool based on the similarity and the number of repetitions;
[0025] Network update module: Samples are drawn from the short-term memory experience pool and the long-term memory experience pool, input into the actor network and the critic network to update parameters, and the parameters of the actor network and the critic network are copied into the target network at a certain update frequency to achieve delayed update of the target network.
[0026] The present invention also provides a computer device, including: a memory and a processor; the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described brain-like reinforcement learning method based on hierarchical experience playback.
[0027] The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above-described brain-like reinforcement learning method based on hierarchical experience playback.
[0028] Therefore, this invention proposes a brain-like reinforcement learning method and system based on hierarchical experience playback, the beneficial effects of which are as follows:
[0029] (1) Improve experience utilization: By designing a hierarchical experience pool, experiences are divided into short-term memory experiences and long-term memory experiences for separate management and utilization. The short-term memory experience pool can update experiences in a timely manner, allowing the agent to quickly acquire the latest environmental information; the long-term memory experience pool stores important experiences that have been filtered, avoiding the loss of important information. During the learning process, the agent can make full use of experiences of different time levels and importance, effectively improving the utilization rate of experience.
[0030] (2) Optimizing decision-making efficiency: The introduction of the attention discrimination module enables the agent to more accurately filter out experiences valuable for long-term decision-making. By evaluating experience similarity and repetition frequency, experiences with high long-term retention value are transferred to the long-term memory experience pool, providing strong support for the agent to make long-term and stable decisions in complex environments. This helps the agent make more reasonable decisions when facing various situations, improving decision-making efficiency and quality.
[0031] (3) Improved learning effectiveness and performance: The brain-like reinforcement learning method and system based on hierarchical experience replay proposed in this invention comprehensively considers the temporal hierarchy and importance of experience and optimizes the experience replay process. During training, the agent can learn environmental rules more efficiently and converge to the optimal policy faster, thereby improving the learning effectiveness and performance of reinforcement learning. In practical applications, this method and system can significantly improve the performance of the agent in various tasks and has broad application prospects.
[0032] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Attached Figure Description
[0033] Figure 1 is a flowchart of the brain-like reinforcement learning method based on hierarchical experience replay according to the present invention.
Detailed Implementation
[0034] To make the technical solutions, advantages, and objectives of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the described embodiments of the present invention without creative effort are within the protection scope of this application.
[0035] Unless otherwise defined, the technical or scientific terms used in this invention shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.
[0036] Example 1
[0037] As shown in Figure 1, the brain-like reinforcement learning method based on hierarchical experience replay according to the present invention includes the following steps:
[0038] 1. Data acquisition and preprocessing.
[0039] Collect observation data of the interaction between the agent and the environment and perform data preprocessing to construct hierarchical training data for reinforcement learning networks. The observation data includes state feature vector data, action encoding data, instant reward signal data, visual observation frames and task description text.
[0040] The specific operations of data preprocessing include denoising the observed data, filtering out outliers and invalid data points to ensure the stability of the state feature vector. Numerical features (such as state feature vectors and reward signals) are normalized or standardized to improve the numerical stability of the training process. Discrete action codes are encoded using one-hot encoding or embedding representation to adapt to the input requirements of different network structures. The observed data collected in each interaction are integrated into a set of vectors as training input to the reinforcement learning network to improve the stability and efficiency of policy optimization.
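The preprocessing operations described above can be sketched roughly as follows. This is only an illustrative sketch using the Python standard library; the function names (`is_valid`, `min_max_normalize`, `one_hot`) and the choice of min-max scaling are assumptions made for the example and are not specified in this description.

```python
import math


def is_valid(state):
    """Return True if every entry of a state vector is a finite number.

    Assumption: 'invalid data points' are modelled here as NaN or infinite values.
    """
    return all(math.isfinite(x) for x in state)


def min_max_normalize(values):
    """Scale the values of one numerical feature into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(action_index, num_actions):
    """Encode a discrete action code as a one-hot vector."""
    if not 0 <= action_index < num_actions:
        raise ValueError("action_index out of range")
    return [1.0 if i == action_index else 0.0 for i in range(num_actions)]


# Usage: drop invalid states, normalize one feature, encode one action.
states = [[2.0], [float("nan")], [4.0], [6.0]]
clean = [s for s in states if is_valid(s)]
feature = min_max_normalize([s[0] for s in clean])
action_vector = one_hot(1, 3)
```

Standardization (zero mean, unit variance) or a learned embedding could be substituted for the min-max scaling and one-hot encoding shown here, as the description allows either.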
[0041] 2. Network and parameter initialization.
[0042] 2.1 First, initialize the actor network and the critic network. Their parameters are set by random initialization, with the weights and biases of the neurons in each layer drawn from a normal distribution. Then initialize the target actor network and the target critic network, setting their parameters equal to the initial parameters of the actor network and the critic network, respectively. The parameters of the target networks are subsequently updated according to a delayed update strategy to improve the stability of the learning process.
[0043] 2.2 Initialize the experience buffer pool. The experience buffer pool can be implemented using data structures such as queues or lists. Set the maximum capacity of the experience buffer pool. When the experience buffer pool reaches its maximum capacity, new experience samples will overwrite the oldest experience samples.
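As one possible illustration of a bounded buffer in which new samples overwrite the oldest ones, the sketch below wraps `collections.deque` with a `maxlen`. The class name `ExperienceBuffer` and its methods are hypothetical and chosen only for this example.

```python
from collections import deque


class ExperienceBuffer:
    """Bounded experience buffer; the oldest sample is dropped when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) discards items from the left once it is full.
        self._items = deque(maxlen=capacity)

    def add(self, sample):
        """Store a new experience sample, overwriting the oldest if needed."""
        self._items.append(sample)

    def __len__(self):
        return len(self._items)

    def as_list(self):
        """Return the stored samples, oldest first."""
        return list(self._items)


# Usage: with capacity 2, the third sample pushes out the first.
buffer = ExperienceBuffer(capacity=2)
for sample in ("e1", "e2", "e3"):
    buffer.add(sample)
```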
[0044] 2.3 Set a discount factor γ, typically between 0 and 1. The discount factor measures the importance of future rewards; a larger discount factor indicates a greater emphasis on future rewards, while a smaller discount factor focuses more on current rewards.
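To make the effect of the discount factor concrete, the short sketch below computes a discounted cumulative reward for a fixed reward sequence. The helper name `discounted_return` is an assumption for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Usage: the same rewards are worth less in total under a smaller gamma.
far_sighted = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
short_sighted = discounted_return([1.0, 1.0, 1.0], gamma=0.0)
```

With γ = 0.5 the three unit rewards contribute 1 + 0.5 + 0.25, whereas with γ = 0 only the immediate reward counts.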
[0045] 2.4 Set the maximum number of episodes M and the maximum number of time steps T. The maximum number of episodes determines the total number of training rounds, while the maximum number of time steps limits the duration of the agent's interaction with the environment in each training round.
[0046] 3. Training process.
[0047] 3.1 At the beginning of each training round, initialize the exploration noise. The exploration noise is random noise, such as Gaussian noise, defined over the agent's action space A: a Gaussian noise vector ϵ with a mean of 0 and a standard deviation of σ is generated. The noise standard deviation σ can be adjusted according to the training phase. In the early stages of training, a larger σ value helps the agent explore the environment more extensively; as training progresses, the σ value is gradually decreased, allowing the agent to focus more on learning the optimal policy.
[0048] 3.2 The agent obtains the initial state of the current environment. Based on the current state $s_t$, the agent selects an action $a_t$ from the actor network. The actions output by the actor network can be discrete or continuous. If the actor network outputs a probability distribution over actions, a specific action is obtained by sampling from that distribution; if the output is a deterministic action, that action is used directly.
[0049] 3.3 The agent performs the action $a_t$ and interacts with the environment. After the action is performed, the environment returns a reward $r_t$ and the next state $s_{t+1}$. The experience sample $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience buffer pool.
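The exploration step above could be sketched as follows for a continuous action. The exponential decay schedule, its default constants, and the clipping of the noisy action to a bounded range are assumptions made for this example; the description only requires that σ starts large and is gradually reduced.

```python
import random


def sigma_schedule(episode, sigma_start=1.0, sigma_end=0.05, decay=0.99):
    """Return the noise standard deviation for a given episode index.

    Assumption: exponential decay with a lower bound; other schedules would also fit.
    """
    return max(sigma_end, sigma_start * decay ** episode)


def noisy_action(action, sigma, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip.

    Assumption: the action space is a box [low, high] in every dimension.
    """
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [min(high, max(low, x)) for x in noisy]


# Usage: early episodes explore widely, late episodes stay near the actor output.
early_sigma = sigma_schedule(0)
late_sigma = sigma_schedule(10_000)
explored = noisy_action([0.9, -0.9], early_sigma, rng=random.Random(0))
```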
[0050] 4. Experience pool update and transfer.
[0051] 4.1 The short-term memory experience pool retrieves new samples from the experience buffer pool. If the short-term memory experience pool is full, the earliest sample is deleted according to the FIFO strategy to ensure that the experience pool can be updated in a timely manner.
[0052] 4.2 The attention discrimination module then calculates the cosine similarity between the new experience sample and each of the other experience samples in the short-term memory experience pool. Assume that experience sample i can be represented as a feature vector $x_i$ and that the feature vector of the new experience sample is $x_{\mathrm{new}}$. The cosine similarity $\mathrm{sim}_i$ between experience sample i and the new experience sample is then calculated as:
[0053] $$\mathrm{sim}_i = \frac{x_i \cdot x_{\mathrm{new}}}{\lVert x_i \rVert \, \lVert x_{\mathrm{new}} \rVert}.$$
[0054] 4.3 For each experience sample in the short-term memory experience pool, if its similarity $\mathrm{sim}_i$ exceeds a threshold β, the sample is rated as similar, and the number of times it has been rated as similar is recorded. If that number reaches a set number K, the experience sample is transferred from the short-term memory experience pool to the long-term memory experience pool.
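Steps 4.1 to 4.3 can be sketched together as below. This is a minimal illustration, not the patented implementation: the class name `HierarchicalMemory`, the representation of each entry as a feature vector plus a counter, and the decision to remove a sample from the short-term pool once it is transferred are all assumptions.

```python
import math
from collections import deque


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class HierarchicalMemory:
    """Short-term FIFO pool plus a long-term pool fed by similarity counts."""

    def __init__(self, short_capacity, beta, k):
        self.short_capacity = short_capacity
        self.beta = beta          # similarity threshold
        self.k = k                # number of 'similar' ratings needed for transfer
        self.short_term = deque() # entries are [feature_vector, similar_count]
        self.long_term = []

    def add(self, features):
        # Step 4.1: make room for the new sample using first-in-first-out.
        if len(self.short_term) >= self.short_capacity:
            self.short_term.popleft()
        new_entry = [list(features), 0]
        remaining = deque()
        # Steps 4.2 and 4.3: compare the new sample with each existing sample.
        for entry in self.short_term:
            if cosine_similarity(entry[0], new_entry[0]) > self.beta:
                entry[1] += 1
            if entry[1] >= self.k:
                self.long_term.append(entry[0])  # transfer to long-term pool
            else:
                remaining.append(entry)
        remaining.append(new_entry)
        self.short_term = remaining


# Usage: the first sample is rated similar twice and is then transferred.
memory = HierarchicalMemory(short_capacity=3, beta=0.9, k=2)
for vec in ([1.0, 0.0], [1.0, 0.01], [1.0, 0.02]):
    memory.add(vec)
```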
[0055] 5. Network parameter update
[0056] 5.1 Randomly draw $N_1$ samples from the short-term memory experience pool and $N_2$ samples from the long-term memory experience pool. The values of $N_1$ and $N_2$ can be adjusted according to the actual situation.
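A mixed batch drawn from both pools could be assembled as in the following sketch. The function name `sample_mixed_batch` is hypothetical, and capping the draw at the current pool size (for example while the long-term pool is still empty) is an assumption not stated in the description.

```python
import random


def sample_mixed_batch(short_term, long_term, n1, n2, rng=random):
    """Draw up to n1 samples from the short-term pool and n2 from the long-term pool.

    Sampling is uniform and without replacement within each pool.
    """
    k1 = min(n1, len(short_term))
    k2 = min(n2, len(long_term))
    return rng.sample(list(short_term), k1) + rng.sample(list(long_term), k2)


# Usage: integers stand in for experience samples; values >= 100 are 'long-term'.
batch = sample_mixed_batch(range(10), range(100, 105), n1=4, n2=2,
                           rng=random.Random(0))
warmup_batch = sample_mixed_batch(range(10), [], n1=4, n2=2,
                                  rng=random.Random(0))
```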
[0057] 5.2 Input the drawn samples into the networks for parameter updates. First, calculate the target Q value of each sample using the target critic network. For a sample in a terminal state, the target Q value equals the reward $r$. For a sample in a non-terminal state, the target Q value is calculated as follows:
[0058] $$y = r + \gamma \, Q'(s_{t+1}, a_{t+1});$$
[0059] where $y$ is the target Q value, $r$ is the reward, γ is the discount factor, $s_{t+1}$ is the state at time $t+1$, $a_{t+1}$ is the action at time $t+1$ computed by the target actor network, and $Q'(s_{t+1}, a_{t+1})$ is the Q value of the next state and action computed by the target critic network;
[0060] Then, based on the drawn samples and the calculated target Q values, the loss function is calculated as follows:
[0061] $$L = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - Q(s_i, a_i) \bigr)^2;$$
[0062] where $L$ is the loss function, $N = N_1 + N_2$ is the number of drawn samples, $y_i$ is the target Q value of the i-th sample, $Q(s_i, a_i)$ is the Q value computed by the critic network for the i-th sample, $s_i$ is the state of the sample, and $a_i$ is the action selected for the sample;
[0063] The parameters of the critic network are updated by minimizing the loss function.
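The critic update target and loss can be illustrated with plain Python callables standing in for the networks. This is a sketch under assumptions: the `done` flag marking terminal samples, the use of a mean (rather than a sum) of squared errors, and the toy lambda "networks" are all introduced here for illustration.

```python
def td_target(reward, next_state, done, gamma, target_actor, target_critic):
    """Target Q value: the reward alone if terminal, else a bootstrapped value."""
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def critic_loss(batch, gamma, critic, target_actor, target_critic):
    """Mean squared error between target Q values and critic estimates.

    Each batch item is (state, action, reward, next_state, done).
    """
    squared_errors = []
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, next_state, done, gamma, target_actor, target_critic)
        squared_errors.append((y - critic(state, action)) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Usage with toy scalar 'networks' (hypothetical stand-ins for real models).
toy_target_actor = lambda s: 2.0 * s
toy_target_critic = lambda s, a: s + a
toy_critic = lambda s, a: 0.0
toy_batch = [
    (0.0, 0.0, 1.0, 1.0, False),  # non-terminal: y = 1 + 0.5 * (1 + 2) = 2.5
    (0.0, 0.0, 1.0, 1.0, True),   # terminal: y = 1
]
loss = critic_loss(toy_batch, 0.5, toy_critic, toy_target_actor, toy_target_critic)
```

In a real system the loss would be minimized with a gradient-based optimizer over the critic parameters; that part is omitted here.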
[0064] 5.3 Update the actor network using the policy gradient method. The goal of the policy gradient is to maximize the expected return, and its basic form is:
[0065] $$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi}\bigl[ \nabla_\theta \log \pi(a \mid s) \, Q(s, a) \bigr];$$
[0066] where $\nabla_\theta J(\theta)$ is the gradient of the objective function $J(\theta)$ with respect to the policy parameter θ, $\mathbb{E}_{a \sim \pi}[\cdot]$ is the probability-weighted average over all possible actions $a$ under the policy π, and π is the target policy.
[0067] Importance sampling is introduced to correct the policy gradient. Furthermore, to reduce the variance of importance sampling, truncated importance weights are introduced, and the gradient update formula is modified as follows:
[0068] $$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \mu}\bigl[ \bar{\rho} \, \nabla_\theta \log \pi(a \mid s) \, Q(s, a) \bigr];$$
[0069] where $\bar{\rho} = \min(c, \rho)$, $c$ is the truncation threshold, and $\rho = \pi(a \mid s) / \mu(a \mid s)$ is the ratio of the probability of the target policy π to the probability of the behavior policy μ. The actor network parameters are then updated based on the policy gradient:
[0070] $$\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta);$$
[0071] where α is the learning rate.
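The truncated importance weighting and the gradient ascent step can be illustrated numerically as follows. This is only a sketch: it assumes the per-sample quantities (action probabilities under the target and behavior policies, the gradient of the log-probability, and the critic's Q value) have already been computed elsewhere, and the function names are hypothetical.

```python
def truncated_importance_weight(pi_prob, mu_prob, c):
    """Return min(c, rho), where rho = pi_prob / mu_prob."""
    rho = pi_prob / mu_prob
    return min(c, rho)


def truncated_policy_gradient(samples, c):
    """Average the weighted score-function terms over a batch.

    Each sample is (pi_prob, mu_prob, grad_log_pi, q_value), where grad_log_pi
    is a list with one entry per policy parameter.
    """
    dim = len(samples[0][2])
    grad = [0.0] * dim
    for pi_prob, mu_prob, grad_log_pi, q_value in samples:
        weight = truncated_importance_weight(pi_prob, mu_prob, c)
        for j in range(dim):
            grad[j] += weight * q_value * grad_log_pi[j]
    return [g / len(samples) for g in grad]


def gradient_ascent_step(theta, grad, alpha):
    """Move the parameters along the gradient with learning rate alpha."""
    return [t + alpha * g for t, g in zip(theta, grad)]


# Usage: the first sample has rho = 9 and is truncated to c = 2.
toy_samples = [
    (0.9, 0.1, [1.0, 0.0], 1.0),
    (0.2, 0.4, [0.0, 2.0], 1.0),
]
grad = truncated_policy_gradient(toy_samples, c=2.0)
theta = gradient_ascent_step([1.0, 2.0], grad, alpha=0.1)
```

Truncation caps the influence of samples whose behavior-policy probability is very small, which is what keeps the variance of the estimate bounded.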
[0072] 5.4. Copy the parameters of the actor network and the critic network to the target network at a certain update frequency to achieve delayed updates of the target network.
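The delayed target update can be sketched as a counter that triggers a parameter copy every fixed number of updates. Representing network parameters as a dictionary and the class name `DelayedTargetUpdater` are assumptions for this example only.

```python
import copy


class DelayedTargetUpdater:
    """Copy online parameters into the target every `interval` update steps."""

    def __init__(self, interval):
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        self.update_count = 0

    def step(self, online_params, target_params):
        """Record one network update; return True if the target was refreshed."""
        self.update_count += 1
        if self.update_count % self.interval != 0:
            return False
        # Deep copy so later changes to the online network do not leak through.
        target_params.clear()
        target_params.update(copy.deepcopy(online_params))
        return True


# Usage: the target only changes on the third update.
online = {"w": [1.0]}
target = {"w": [0.0]}
updater = DelayedTargetUpdater(interval=3)
first = updater.step(online, target)
second = updater.step(online, target)
online["w"] = [2.0]
third = updater.step(online, target)
```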
[0073] Example 2
[0074] A brain-inspired reinforcement learning system based on hierarchical experience playback includes:
[0075] Initialization module: Used to initialize the actor network, critic network and corresponding target network, as well as the experience buffer pool and parameters;
[0076] Interaction module: In each training round, initialize exploration noise, obtain the initial state of the environment, control the agent to select actions from the actor network to execute, and store experience samples in the experience buffer pool;
[0077] Short-term memory pool management module: responsible for retrieving new samples from the experience buffer pool to update the short-term memory experience pool, and deleting outdated samples according to the FIFO strategy;
[0078] Attention filtering module: Calculates the similarity of experience samples and determines whether an experience is transferred from the short-term memory experience pool to the long-term memory experience pool based on the similarity and the number of repetitions;
[0079] Network update module: Samples are drawn from the short-term memory experience pool and the long-term memory experience pool, input into the actor network and the critic network to update parameters, and the parameters of the actor network and the critic network are copied into the target network at a certain update frequency to achieve delayed update of the target network.
[0080] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, essentially, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0081] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-including system, or other system that can fetch and execute instructions from, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.
[0082] More specific examples (a non-exhaustive list) of computer-readable media include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable and editable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM). Furthermore, computer-readable media can even be paper or other suitable media on which programs can be printed, because programs can be obtained electronically, for example, by optically scanning the paper or other media, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.
[0083] It is worth noting that all contents not described in detail in this invention are existing technologies and are well known to those skilled in the art.
[0084] Therefore, this invention provides a brain-like reinforcement learning method and system based on hierarchical experience replay. By introducing a brain-like mechanism to optimize the efficiency of experience replay, and by utilizing the thinking patterns of short-term and long-term memory to optimize the experience replay process, the invention improves the experience utilization rate and decision-making efficiency of the agent in the learning process, thereby enhancing the learning effect and performance of reinforcement learning.
[0085] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the technical solutions of the present invention, and these modifications or equivalent substitutions cannot cause the modified technical solutions to deviate from the spirit and scope of the technical solutions of the present invention.
Claims
1. A brain-like reinforcement learning method based on hierarchical experience replay, characterized in that it includes the following steps:
Step S1: Collect observation data of the interaction between the agent and the environment and perform data preprocessing to construct hierarchical training data for reinforcement learning networks. The observation data includes state feature vector data, action encoding data, instant reward signal data, visual observation frames and task description text.
Step S2: Initialize the actor network, the critic network, and the corresponding target networks. Simultaneously, initialize the experience buffer pool and perform parameter initialization, including the discount factor γ, the maximum number of episodes M, and the maximum number of time steps T;
Step S3: At the beginning of each training round, initialize the exploration noise; the agent obtains the initial state of the current environment, selects an action from the actor network based on the current state and executes it, and stores the experience samples obtained after executing the action in the experience buffer pool.
Step S4: Short-term memory experience pool update. The short-term memory experience pool obtains new samples from the experience buffer pool. If the short-term memory experience pool is full, outdated samples are deleted according to the first-in-first-out strategy. The short-term memory experience pool has a limited-capacity first-in-first-out (FIFO) structure, and the FIFO strategy ensures that the experience in the pool always remains timely.
Step S5: Experience screening and transfer. The attention discrimination module is used to calculate the similarity between the new experience sample and the existing experience samples in the short-term memory experience pool. The similarity is compared with a threshold to determine whether it exceeds the threshold, and the number of times an experience is repeatedly rated as similar is recorded. It is then decided whether to transfer some of the experience in the short-term memory experience pool to the long-term memory experience pool. The long-term memory experience pool is a complete experience pool, and the experiences in it are derived from the screening of the short-term memory experience pool. The attention discrimination module evaluates the long-term preservation value of the experiences in the short-term memory experience pool and decides whether to transfer them.
Step S6: Network parameter update. $N_1$ samples are drawn from the short-term memory experience pool and $N_2$ samples are drawn from the long-term memory experience pool, and the drawn samples are input into the actor network and the critic network for parameter updates. The target networks' parameters are updated once every fixed number of network parameter updates. $N_1$ and $N_2$ are positive integers.
2. The brain-like reinforcement learning method based on hierarchical experience replay according to claim 1, characterized in that, in step S1, the data preprocessing includes: denoising the observation data, filtering out outliers and invalid data points, normalizing or standardizing the numerical feature data, and performing one-hot encoding or embedding representation on the discrete action code.
3. The brain-like reinforcement learning method based on hierarchical experience replay according to claim 1, characterized in that the attention discrimination module uses the cosine similarity method to calculate the similarity between experience samples. By comparing the cosine similarity between the feature vectors of new experience samples and existing samples in the short-term memory experience pool, the degree of similarity between the two is quantified.
4. The brain-like reinforcement learning method based on hierarchical experience replay according to claim 1, characterized in that, in step S3, Gaussian noise is used as the exploration noise. Its standard deviation is set to a large value in the early stage of training to promote extensive exploration of the environment by the agent. As training progresses, it is gradually reduced so that the agent can focus on learning the optimal policy.
5. The brain-like reinforcement learning method based on hierarchical experience replay according to claim 1, characterized in that, in step S6, the parameter update of the target network adopts a delayed update strategy. Every fixed number of network parameter updates, the parameters of the actor network and the critic network are copied to the target network to update the parameters of the target network.
6. A brain-like reinforcement learning system based on hierarchical experience replay, used to implement the brain-like reinforcement learning method based on hierarchical experience replay as described in any one of claims 1-5, characterized in that it comprises:
Initialization module: used to initialize the actor network, the critic network and the corresponding target networks, as well as the experience buffer pool and parameters;
Interaction module: in each training round, initializes the exploration noise, obtains the initial state of the environment, controls the agent to select actions from the actor network and execute them, and stores experience samples in the experience buffer pool;
Short-term memory experience pool management module: responsible for obtaining new samples from the experience buffer pool to update the short-term memory experience pool, and deleting outdated samples according to the FIFO strategy;
Attention filtering module: calculates the similarity of experience samples and determines whether an experience is transferred from the short-term memory experience pool to the long-term memory experience pool based on the similarity and the number of repetitions;
Network update module: draws samples from the short-term memory experience pool and the long-term memory experience pool, inputs them into the actor network and the critic network to update parameters, and copies the parameters of the actor network and the critic network into the target networks at a certain update frequency to achieve delayed update of the target networks.
7. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the brain-like reinforcement learning method based on hierarchical experience replay as described in any one of claims 1-5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the brain-like reinforcement learning method based on hierarchical experience replay as described in any one of claims 1-5.