A method for a communication-based large-scale reinforcement learning distributed training system

A communication-based large-scale reinforcement learning distributed training system uses multi-agent interaction with the simulation environment and asynchronous parallel training to address the low environment sampling efficiency and low algorithm training efficiency of existing technologies, thereby improving resource utilization and making the training architecture more lightweight.

CN116402125B · Active · Publication Date: 2026-06-30 · SHENZHEN RES INST OF NANJING UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN RES INST OF NANJING UNIV
Filing Date: 2023-04-12
Publication Date: 2026-06-30


Abstract

This invention provides a method for a communication-based large-scale distributed training system for reinforcement learning, belonging to the field of distributed training technology. The system includes a communication repeater, which receives observations that characterize the current states of multiple parallel environments, processes these observations in batches, and transmits them to a shared experience pool. A learner obtains mini-batches of data from the shared experience pool for learning. The executor uses a Q-policy network to select actions from a predetermined action set and interacts with the environment through the communication repeater. The Q-policy network is a deep neural network configured to receive observations and actions as input and to generate neural network outputs based on its parameter set. This invention effectively alleviates the low sampling efficiency and slow training speed of reinforcement learning in single-machine environments.

Description

Technical Field

[0001] This invention belongs to the field of distributed training technology, specifically relating to a method for a large-scale reinforcement learning distributed training system based on communication.

Background Technology

[0002] Distributed reinforcement learning is an effective method for rapidly training agents on large and complex tasks using parallelism. Most work in distributed reinforcement learning assumes a main structure consisting of multiple executors interacting with multiple copies of the same environment, and a central system that stores and optimizes common Q-function parameters or policy parameters shared by all participants. Research in distributed reinforcement learning focuses on rapidly optimizing Q-function parameters by generating more samples through multiple executors within the same wall-clock time. To this end, researchers have investigated various techniques in distributed reinforcement learning, such as asynchronous parameter updates, shared experience replay buffers, GPU-based parallel computation, GPU-based simulation, and policy-based V-trace.

[0003] One of the classic early works to extend deep reinforcement learning to large-scale parallel scenarios is Gorila, an asynchronous distributed reinforcement learning training framework proposed by the DeepMind team in 2015. In this framework, each node has an Actor and its own complete copy of the environment, including an experience replay pool. The framework includes a learner to sample and compute gradients, where the gradients need to be synchronized to a centralized parameter server.

[0004] The current intelligent agent training framework has the following problems: low environment sampling efficiency and algorithm training efficiency, low CPU inference efficiency, low resource utilization, high bandwidth requirements, and tight coupling between the intelligent agent simulation environment and the algorithm.

Summary of the Invention

[0005] Purpose of the invention: To address the problems of low environmental sampling efficiency, low algorithm training efficiency, and low resource utilization in large-scale reinforcement learning training systems, this invention provides a method and system for large-scale distributed reinforcement learning training based on communication.

[0006] Technical Solution: A method for a communication-based large-scale reinforcement learning distributed training system, comprising the following steps:

[0007] Step 1.1: The large-scale reinforcement learning distributed training system includes multiple agents; the agents interact with the simulation environment by receiving observations that characterize the current state of the environment and selecting actions to be executed from a predetermined set of actions.

[0008] Step 1.2: The agent uses a Q-neural network to fit the action to be performed; wherein the Q-neural network is a deep neural network based on a multi-layer neural network model; each layer of the neural network model computes a new representation (also called a feature) of the input data using learned parameters (usually weights and biases); the representation becomes more abstract with increasing depth, thus allowing the network to learn higher-level features of the input data; the output of the last layer of the neural network model (usually a classification or regression prediction) is generated by combining the outputs of all hidden layers; environmental observations are received as input by the deep neural network, and the action is output from the deep neural network; wherein training the large-scale reinforcement learning distributed training system includes adjusting the values of the parameters of the Q-neural network;

[0009] Step 1.3: The large-scale reinforcement learning distributed training system further includes multiple learners; each learner executes on a corresponding computing unit and is configured to operate independently of the other learners; the learners centrally perform neural network inference on dedicated hardware (such as a GPU or TPU), keeping model parameters and state local to accelerate inference and avoid data transmission bottlenecks; the learners aggregate distributed inference inputs from hundreds of machines; wherein each learner maintains a corresponding initial Q-neural network and a target Q-neural network;

[0010] Step 1.4: The multiple learners are further configured to perform the following operations: receiving the current values of the parameters of the Q-neural network from the environment parameter server, and using these current values to update the parameters of the learner's Q-neural network copy; selecting an experience tuple from a corresponding shared experience pool; calculating gradients based on the experience tuple using the learner's Q-neural network copy and the target Q-neural network copy maintained by the learner; and providing the calculated gradients to the environment parameter server and the relevant environment.
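
As a rough illustration of the Q-neural network described in Step 1.2, the following minimal sketch passes an observation through a small fully connected network and picks the action with the highest output. The layer sizes, the ReLU activation, the random initialization, and the names `init_layer` and `forward` are assumptions made for illustration; the patent does not specify them.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: a weight matrix and a bias vector."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, observation):
    """Map an observation to one Q-value per action."""
    x = list(observation)
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            # Hidden layers use ReLU (an assumption); the output layer stays linear.
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
OBS_DIM, HIDDEN, N_ACTIONS = 4, 8, 3  # hypothetical sizes
layers = [init_layer(OBS_DIM, HIDDEN, rng), init_layer(HIDDEN, N_ACTIONS, rng)]

q_values = forward(layers, [0.1, -0.2, 0.3, 0.0])
action = max(range(N_ACTIONS), key=lambda a: q_values[a])  # greedy choice
```

Training, as described in Step 1.4, would adjust the entries of `layers`; this sketch only covers the forward pass.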

[0011] In a further embodiment, the following steps are also included:

[0012] Step 2.1: The large-scale reinforcement learning distributed training system further includes a communication-based repeater, EnvProxy; one repeater is provided for each type of simulation environment and training algorithm, wherein the repeater is a device used to connect the client and the server. The repeater EnvProxy receives the raw observation data from the simulation environment and sends it to the shared experience pool, where it waits for the learner to retrieve it once the algorithm is connected.

[0013] Step 2.2: HTTP/2 further extends the concept of persistent connections by providing a semantic layer on top of the connection: streams;

[0014] Step 2.3: gRPC introduces three new concepts: channels, remote procedure calls, and messages; each channel includes one or more RPCs, and each RPC includes one or more messages.

[0015] Step 2.4: The repeater fits the learner to the action to be performed using a Q-neural network and provides the calculated gradient to the repeater EnvProxy, which then transmits it to the environment parameter server.
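
To illustrate why the stream layer mentioned in Step 2.2 matters, the sketch below compares when messages finish if a connection carries one message at a time with when they finish if several streams interleave their frames on a single connection. Unit-size frames and round-robin scheduling are simplifying assumptions (real HTTP/2 adds flow control and prioritization), and the function names are hypothetical.

```python
def sequential_finish_times(message_sizes):
    """One message at a time per connection: each message waits for those ahead of it."""
    finish, clock = [], 0
    for size in message_sizes:
        clock += size
        finish.append(clock)
    return finish


def multiplexed_finish_times(message_sizes):
    """Streams share one connection; one frame per active stream is sent round-robin."""
    remaining = list(message_sizes)
    finish = [0] * len(remaining)
    clock = 0
    while any(remaining):
        for i in range(len(remaining)):
            if remaining[i]:
                clock += 1            # sending one frame takes one time unit
                remaining[i] -= 1
                if remaining[i] == 0:
                    finish[i] = clock
    return finish


sizes = [100, 2]  # frames: a large message queued ahead of a small one
seq = sequential_finish_times(sizes)   # [100, 102]
mux = multiplexed_finish_times(sizes)  # [102, 4]
```

In this toy model the small message completes after 4 time units instead of 102, while the large one is delayed only slightly; the total work is unchanged.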

[0016] In a further embodiment, the following steps are also included:

[0017] Step 3.1: The large-scale reinforcement learning distributed training system further includes: one or more executors; each executor executes on a corresponding computational unit; each executor is configured to operate independently of each other executor, and each executor interacts with a corresponding copy of the environment; each executor maintains a corresponding copy of the executor Q-neural network, and each executor is further configured to repeatedly execute operations including the following:

[0018] Step 3.2: Receive, via the communication relay EnvProxy, the current values of the parameters of the Q-neural network from the environment parameter server; use the current values to update the parameter values of the Q-neural network copy maintained by the executor; receive an observation characterizing the current state of the environment copy interacting with the executor; in response to the observation, select an action to be performed using the Q-neural network copy maintained by the executor; receive a reward in response to the performed action and a next observation characterizing the next state of the environment copy interacting with the executor; generate an experience tuple including the current observation, the selected action, the reward, and the next observation; and store the experience tuple in a corresponding shared experience pool.
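
The executor operations of Steps 3.1 and 3.2 can be sketched as a simple loop. The toy environment, the table-based stand-in for the Q-neural network copy, and all names below are hypothetical; in the described system the parameters would arrive through EnvProxy and the pool would be shared between processes.

```python
from collections import namedtuple

Experience = namedtuple("Experience", ["observation", "action", "reward", "next_observation"])


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position, the goal is position 3."""

    def __init__(self):
        self.position = 0

    def observe(self):
        return self.position

    def step(self, action):
        self.position += 1 if action == 1 else -1
        reward = 1.0 if self.position == 3 else 0.0
        return reward, self.position


def greedy(q_table, obs):
    # q_table maps (observation, action) to a Q-value; missing entries count as 0.
    return max((0, 1), key=lambda a: q_table.get((obs, a), 0.0))


def executor_loop(env, q_table, shared_pool, steps, select_action):
    """Interact with one environment copy and append experience tuples to the pool."""
    for _ in range(steps):
        obs = env.observe()
        action = select_action(q_table, obs)
        reward, next_obs = env.step(action)
        shared_pool.append(Experience(obs, action, reward, next_obs))


shared_pool = []
params_from_server = {(0, 1): 0.5, (1, 1): 0.5, (2, 1): 0.5}  # stand-in for pulled parameters
executor_loop(ToyEnv(), dict(params_from_server), shared_pool, steps=3, select_action=greedy)
```

Running several such loops, each with its own `ToyEnv` instance, corresponds to executors operating independently on their own environment copies.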

[0019] In a further embodiment, the following steps are also included:

[0020] The large-scale reinforcement learning distributed training system further includes: an environment parameter server; the environment parameter server is configured to repeatedly perform operations including the following:

[0021] Receive a series of gradients from the plurality of learners; use the gradients to compute an update to the parameter values of the Q-neural network; use the computed update to update the parameter values of the Q-neural network; and provide the updated parameter values to the one or more executors and the plurality of learners.
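
A minimal sketch of the parameter server operations above is given here. The patent does not state how gradients are combined, so averaging followed by a plain gradient-descent step is an assumption, as are the class name, the `version` counter, and the learning rate.

```python
class ParameterServer:
    """Holds the Q-network parameters and applies gradients sent by learners."""

    def __init__(self, params, learning_rate=0.1):
        self.params = list(params)
        self.learning_rate = learning_rate
        self.version = 0  # incremented on every update (illustrative bookkeeping)

    def apply_gradients(self, gradients):
        """Average a series of gradients and take one gradient-descent step."""
        n = len(gradients)
        for i in range(len(self.params)):
            mean_grad = sum(g[i] for g in gradients) / n
            self.params[i] -= self.learning_rate * mean_grad
        self.version += 1

    def pull(self):
        """Return a copy of the current parameters for executors and learners."""
        return self.version, list(self.params)


server = ParameterServer([1.0, -2.0], learning_rate=0.5)
server.apply_gradients([[2.0, 0.0], [0.0, 4.0]])  # gradients from two learners
version, params = server.pull()                   # version 1, params [0.5, -3.0]
```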

[0022] In a further embodiment, the following steps are also included:

[0023] The environment parameter server includes multiple environment parameter server shards; wherein each shard is configured to hold the values of the corresponding disjoint partitions of the parameters of the Q-neural network; and each shard is configured to operate asynchronously relative to each other shard.
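
The sharding described above might look like the following sketch, in which every parameter belongs to exactly one shard and a shard can be updated without touching any other. Round-robin assignment by name and the `Shard` class are assumptions; the sketch runs in one thread and only shows that shards need no coordination, not the asynchronous execution itself.

```python
def partition(param_names, num_shards):
    """Assign each parameter name to exactly one shard (round-robin)."""
    shards = [[] for _ in range(num_shards)]
    for i, name in enumerate(param_names):
        shards[i % num_shards].append(name)
    return shards


class Shard:
    """Owns a disjoint subset of the Q-network parameters."""

    def __init__(self, names, values):
        self.values = {n: values[n] for n in names}

    def apply(self, grads, lr):
        # Only parameters owned by this shard are updated; others are ignored.
        for name, g in grads.items():
            if name in self.values:
                self.values[name] -= lr * g


values = {"w1": 1.0, "b1": 0.0, "w2": 2.0, "b2": 0.0}
layout = partition(sorted(values), 2)          # [["b1", "w1"], ["b2", "w2"]]
shards = [Shard(names, values) for names in layout]
shards[0].apply({"b1": 1.0, "w1": 1.0}, lr=0.5)  # shard 1 is unaffected
```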

[0024] In a further embodiment, the parameter server is configured to perform operations that further include the following steps:

[0025] Determine whether the criteria for updating the parameters of the target Q-neural network copy maintained by the learner are met; when the criteria are met, provide the learner with data indicating that the updated parameter values will be used to update the parameters of the target Q-neural network copy.

[0026] In a further embodiment, the operation that each learner is configured to perform further includes the following steps:

[0027] The learner receives data indicating that the updated parameter values will be used to update the parameters of the target Q-neural network copy maintained by the learner, and uses the updated parameter values to update the parameters of the target Q-neural network copy maintained by the learner.
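
The target-network refresh in paragraphs [0025] and [0027] can be sketched as follows. The patent leaves the update criterion open; refreshing every fixed number of parameter updates is a common choice and is used here purely as an assumption, and all names are hypothetical.

```python
class Learner:
    def __init__(self, params):
        self.q_params = list(params)       # learner Q-network copy
        self.target_params = list(params)  # target Q-network copy

    def on_target_update(self, new_params):
        """Overwrite the target copy when the server signals an update."""
        self.target_params = list(new_params)


class TargetSyncPolicy:
    """Hypothetical criterion: refresh target copies every `period` parameter updates."""

    def __init__(self, period):
        self.period = period
        self.updates = 0

    def record_update(self):
        self.updates += 1
        return self.updates % self.period == 0


policy = TargetSyncPolicy(period=3)
learner = Learner([0.0])
for step in range(1, 5):
    server_params = [float(step)]           # stand-in for one gradient update
    learner.q_params = list(server_params)  # learner copy follows every update
    if policy.record_update():              # criterion met on the 3rd update
        learner.on_target_update(server_params)
```

After four updates the learner copy holds the latest values while the target copy still holds those from the third update, which is the lag that the separate target network is meant to provide.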

[0028] In a further embodiment, the following steps are also included:

[0029] Each learner is associated with a corresponding executor and replay memory; each policy experience pool, which groups a learner, an executor, and a replay memory, is implemented on a corresponding computing unit;

[0030] Each policy experience pool is configured to operate independently of other policy experience pools; for each policy experience pool, the corresponding learner selects from the experience tuples generated by the executor in that policy experience pool.

[0031] In a further embodiment, the following steps are also included:

[0032] For each policy experience pool, the current parameter values of the executor Q-neural network copy maintained by the executor in the policy experience pool are synchronized with the current parameter values of the learner Q-neural network copy maintained by the learner in the policy experience pool.

[0033] In a further embodiment, the method further includes the following steps:

[0034] In response to the observation state, selecting an action to be performed using the executor Q-neural network copy includes: determining, from the predetermined action set, the action that, when provided as input to the executor Q-neural network copy together with the current observation state, generates the largest executor Q-neural network copy output.

[0035] In a further embodiment, the following steps are also included:

[0036] In the large-scale reinforcement learning distributed training system, selecting an action to be executed in response to the observation state, using the executor Q-neural network copy maintained by the executor, further includes: selecting a random action from the predetermined action set with probability ε; and selecting, with probability 1-ε, the determined action that generates the maximum value in the output of the executor Q-neural network copy.
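
The ε-greedy selection rule above can be written in a few lines. The sketch assumes the network outputs have already been computed as a list with one Q-value per action; the function name and the sample values are illustrative.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one.

    q_values: list of Q-values, one per action in the predetermined action set.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, 0.0, rng)  # epsilon = 0: always the argmax
random_actions = {epsilon_greedy(q_values, 1.0, rng) for _ in range(200)}  # epsilon = 1: always random
```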

[0037] In a further embodiment, calculating the gradient based on the experience tuple using the learner Q-neural network copy and the target Q-neural network copy maintained by the learner includes the following operations:

[0038] The learner Q-neural network copy is used to process the actions from the experience tuples and the current observation state to determine the output of the learner Q-neural network copy.

[0039] The maximum target Q-neural network copy output is determined using the target Q-neural network copy, by processing each action in the predetermined action set together with the next observation state from the experience tuple.

[0040] The gradient is computed using the learner Q-neural network copy output, the maximum target Q-neural network copy output, and a buffer from a shared experience pool of experience tuples.
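
One common way to combine these quantities is the one-step Q-learning loss, 0.5 * (r + γ * max_a' Q_target(s', a') - Q(s, a))^2. The sketch below computes its gradient for a linear Q-function with one weight vector per action. The linear form, the discount factor γ, the use of the reward from the experience tuple, and the function names are assumptions for illustration, not details given in the patent.

```python
def q_value(weights, observation, action):
    """Linear Q-function: dot product of the action's weight vector with the observation."""
    return sum(w * x for w, x in zip(weights[action], observation))


def td_gradient(learner_w, target_w, experience, gamma=0.9):
    """Gradient of 0.5 * td_error**2 with respect to the learner weights."""
    obs, action, reward, next_obs = experience
    q_learner = q_value(learner_w, obs, action)               # learner copy output
    max_target = max(q_value(target_w, next_obs, a)           # maximum target copy output
                     for a in range(len(target_w)))
    td_error = reward + gamma * max_target - q_learner
    grad = [[0.0] * len(obs) for _ in learner_w]
    grad[action] = [-td_error * x for x in obs]               # only the taken action's weights change
    return grad


learner_w = [[0.0, 0.0], [1.0, 0.0]]
target_w = [[0.0, 1.0], [0.0, 2.0]]
experience = ([1.0, 0.0], 1, 1.0, [0.0, 1.0])  # (observation, action, reward, next observation)
grad = td_gradient(learner_w, target_w, experience, gamma=0.5)
# q_learner = 1.0, max_target = 2.0, td_error = 1.0 + 0.5 * 2.0 - 1.0 = 1.0
```

The target weights are treated as constants when differentiating, which is why a separately maintained target copy is needed.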

[0041] The specific details are as follows: The method includes the following steps:

[0042] Step S1: The environment interface in the environment server establishes a bidirectional connection using a dedicated software development kit, a gRPC stream, and a communication repeater EnvProxy. The repeater provides the learner with the action to be executed by fitting it to a Q-neural network and provides the calculated gradient to the repeater EnvProxy, which then transmits it to the learner of the corresponding algorithm in the reinforcement learning algorithm server.

[0043] Step S2: In each environment server, an actor is created on the reinforcement learning algorithm server side. Each actor uses the communication relay EnvProxy to call the final policy after the learner is trained, and transmits the final policy to the environment server using the communication relay EnvProxy.

[0044] Step S3: The repeater is based on the gRPC framework and uses asynchronous streaming RPC. gRPC is built on HTTP/2, which aims to solve some of the scalability problems of its predecessor and improves on the design of HTTP/1.1 in many ways, most importantly by providing a semantic layer on top of connections. Creating an HTTP connection is expensive: a TCP connection must be established, the connection must be secured with TLS, headers and settings must be exchanged, and so on. HTTP/1.1 mitigates this cost by treating connections as long-lived, reusable objects. HTTP/1.1 connections are kept idle so that new requests to the same destination can be sent over an existing idle connection. Although connection reuse alleviates the problem, a connection can only handle one request at a time; connections and requests are coupled 1:1. If a large message is being sent, new requests must either wait for it to complete (causing head-of-line blocking) or, more frequently, pay the cost of opening another connection.

[0045] Step S4: After the executor has sampled episode data from its interaction with the environment, it stores the data in the shared experience pool through the learner's store interface.

[0046] Step S5: The learner obtains the required training data from the shared experience pool. The learner can centrally perform neural network inference on dedicated hardware such as GPUs or TPUs. Keeping the model parameters and state local accelerates inference and avoids data transmission bottlenecks. The learner aggregates the distributed inference inputs from hundreds of machines. Each learner maintains a corresponding initial Q-neural network and a target Q-neural network.

[0047] Step S6: When the learner is training the algorithm, it returns a feedback interface. The dedicated software development kit sends the corresponding notification to the executor by calling the learner feedback interface of all the actors.
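
The store and sample interactions of Steps S4 and S5 can be sketched with a bounded buffer. The capacity limit, first-in-first-out eviction, uniform sampling, and the class and method names are assumptions; a real shared pool would also need synchronization between executors and learners, which is omitted here.

```python
import random
from collections import deque


class SharedExperiencePool:
    """Bounded FIFO buffer; the oldest tuples are evicted when capacity is reached."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, episode):
        """Append every experience tuple of one sampled episode."""
        self.buffer.extend(episode)

    def sample(self, batch_size):
        """Draw a mini-batch uniformly at random without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


pool = SharedExperiencePool(capacity=4, seed=0)
pool.store([("s0", 0, 0.0, "s1"), ("s1", 1, 0.0, "s2"), ("s2", 1, 1.0, "s3")])
pool.store([("s3", 0, 0.0, "s4"), ("s4", 1, 1.0, "s5")])  # evicts the oldest tuple
batch = pool.sample(2)
```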

[0048] In another technical solution, a large-scale reinforcement learning distributed training system is provided, the system including an agent; the agent interacts with the environment by receiving observation states that characterize the current state of the environment and selecting actions to be executed from a predetermined set of actions;

[0049] The agent uses a Q-neural network to select actions to be performed. The Q-neural network is a deep neural network that takes the observed state and actions as input and generates the neural network output based on a set of parameters. The large-scale reinforcement learning distributed training system is trained by adjusting the values of the parameter set of the Q-neural network.

[0050] Beneficial effects:

[0051] (1) The large-scale reinforcement learning distributed training system described in this invention includes a communication relay. The communication relay receives observations that characterize the current state of multiple parallel environments, processes the parallel environment observations in batches, and transmits them to a shared experience pool. The learner obtains mini-batches of data from the shared experience pool for learning. The executor uses a Q-policy network to select actions to be executed from a predetermined action set and interacts with the environment through the communication relay. The Q-policy network is a deep neural network configured to receive observations and actions as inputs and to generate neural network outputs from the inputs based on its parameter set. Through multi-environment parallel sampling and asynchronous parallel training under a fixed resource budget, the large-scale reinforcement learning distributed training system is compatible with various reinforcement learning simulation and experimental environments and effectively alleviates the low sampling efficiency and slow training speed of single-machine environments.

[0052] (2) The present invention optimizes the architecture for modern accelerators and improves data efficiency by using a larger model; it builds an agent distributed training system based on meta-native, and realizes rapid training of agents on large and complex tasks through parallel training; it designs and implements a scalable multi-agent training architecture that decouples the environment from the algorithm, builds a distributed sampling platform, and connects the environment and algorithm in the form of agents to schedule the training process, which improves environment sampling efficiency, algorithm training efficiency, resource utilization, and system stability, and makes the training system more lightweight.

Attached Figure Description

[0053] Figure 1 is a diagram of the overall architecture of a distributed reinforcement learning training system that utilizes communication.

[0054] Figure 2 is a diagram illustrating the interaction between the gRPC server and client.

[0055] Figure 3 is a diagram of a Q-neural network structure.

[0056] Figure 4 is a data flow graph of a large-scale reinforcement learning distributed training system.

Detailed Implementation

[0057] Example 1

[0058] A method for a communication-based large-scale distributed training system for reinforcement learning includes the following steps:

[0059] Step 1: A large-scale reinforcement learning distributed training system includes at least an agent that interacts with the environment. To interact with the environment, the agent receives an observation state that characterizes the current state of the environment and uses this observation state to select an action to perform. In response to performing the selected action, the agent receives a reward. When interacting with the environment, in response to all actions selected by the agent, the agent attempts to maximize the total reward received by the agent.

[0060] Step 2: In response to a given observation state, the agent selects an action to be performed using the Q-neural network. The Q-neural network is a deep neural network configured to receive the observation state and the action as input and to process that input based on the current values of the Q-neural network's parameter set to generate a neural network output. In some implementations, the agent selects, from the predetermined action set, the action that, when provided together with the given observation state as input to the Q-neural network, causes the Q-neural network to generate the highest neural network output. In other implementations, the agent uses an ε-greedy strategy in selecting the action; that is, the agent randomly selects an action from the predetermined action set with probability ε, and with probability 1-ε selects the action that causes the Q-neural network to generate the highest neural network output.

[0061] Step 3: During training, the distributed reinforcement learning training system adjusts the values of the Q-neural network parameters starting from initial parameter values. In some implementations, the large-scale reinforcement learning distributed training system is trained offline: the training system determines the trained values of the Q-neural network parameters, and at runtime the agent then uses these trained values in its interactions with the environment. In some other implementations, the large-scale reinforcement learning distributed training system is trained online: while the agent interacts with the environment at runtime, the training system continuously adjusts the parameter values of the Q-neural network used by the agent.

[0062] Step 4: A large-scale reinforcement learning distributed training system includes one or more actors, one or more learners, and one or more shared experience pools. The system also includes a parameter server and a communication relay.

[0063] Step 5: Figure 1 illustrates an example distributed reinforcement learning training system. The communication-based distributed large-scale reinforcement learning training system is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the following systems, components, and techniques are implemented. The communication-based distributed reinforcement learning training system includes: an environment pool, a communication repeater EnvProxy, an executor pool, a shared experience pool, learners, and a policy experience pool.

[0064] Step 6: Receive, via the communication relay EnvProxy, the current values of the parameters of the Q-neural network from the environment parameter server; update the parameter values of the executor Q-neural network copy maintained by the executor using the current values; receive observations characterizing the current state of the environment copy interacting with the executor; select an action to be performed using the executor Q-neural network copy maintained by the executor in response to the observations; receive a reward in response to the performed action and a next observation characterizing the next state of the environment copy interacting with the executor; generate an experience tuple including the current observation, the selected action, the reward, and the next observation; and store the experience tuple in a corresponding shared experience pool.

[0065] Step 7: Figure 2 illustrates the gRPC communication mechanism. gRPC is a communication framework based on HTTP/2. In gRPC, a client application can directly invoke methods of a server application on a different machine as if it were a local object, making it easier to create distributed applications and services. Like many RPC systems, gRPC is based on the following principles: define a service, specifying the methods (including parameters and return types) that can be invoked remotely; implement this interface on the server side and run a gRPC server to handle client calls; on the client side, use a stub that provides the same methods as the server.

[0066] Step 8: Figure 3 shows an example Q-neural network structure, in which the agent uses the Q-neural network to fit an action to be performed. The Q-neural network is a deep neural network based on a multi-layer neural network model. Each layer computes a new representation (also called a feature) of the input data using learned parameters (typically weights and biases). This representation becomes increasingly abstract with each layer, allowing the network to learn higher-level features of the input data. The output of the final layer (typically a classification or regression prediction) is generated by combining the outputs of all hidden layers. Environmental observations are received as input to the deep neural network, and the action is output from the neural network. Training the large-scale reinforcement learning distributed training system includes setting the values of the parameters of the Q-neural network.

[0067] Step 9: Figure 4 shows an example data flow graph based on a distributed reinforcement learning framework.

[0068] Step 10: The environment interface in the environment server establishes a bidirectional connection using a dedicated software development kit, a gRPC stream, and a communication repeater EnvProxy. The repeater provides the learner with the action to be executed by fitting it to a Q-neural network and provides the calculated gradient to the repeater EnvProxy, which then transmits it to the learner of the corresponding algorithm in the reinforcement learning algorithm server.

[0069] Step 11: In each environment server, an actor is created on the reinforcement learning algorithm server side. Each actor uses the communication relay EnvProxy to call the final policy after the learner is trained, and transmits the final policy to the environment server using the communication relay EnvProxy.

[0070] Step 12: The repeater is based on the gRPC framework and uses asynchronous streaming RPC. gRPC is based on HTTP/2. One repeater is provided for each simulation environment and training algorithm; the repeater is a device used to connect the client and server. The EnvProxy receives the raw observation data from the simulation environment and sends it to the shared experience pool, where it waits for the learner to retrieve it once the algorithm is connected. The repeater fits the learner to the action to be executed using a Q-neural network and provides the calculated gradient to the repeater EnvProxy, which then transmits it to the environment parameter server.

[0071] Step 13: After the executor has sampled episode data from its interaction with the environment, it stores the data in the shared experience pool through the learner's store interface.

[0072] Step 14: The learner obtains the necessary training data from the shared experience pool. The learner can centrally perform neural network inference on dedicated hardware such as GPUs or TPUs; keeping model parameters and state local accelerates inference and avoids data transmission bottlenecks. The learner aggregates distributed inference inputs from hundreds of machines, and each learner maintains a corresponding initial Q-neural network and a target Q-neural network.

[0073] Step 15: When the learner is training the algorithm, it returns a feedback interface. The dedicated software development kit sends the corresponding notification to the executor by calling the learner feedback interface of all the actors.

Claims

1. A method for a large-scale reinforcement learning distributed training system based on communication, characterized in that the method, using a large-scale reinforcement learning distributed training system, includes the following steps:
Step 1.1: The large-scale reinforcement learning distributed training system includes multiple agents; the agents interact with the simulation environment by receiving characteristic observations of the current state of the environment and selecting actions to be executed from a predetermined set of actions.
Step 1.2: The agent uses a Q-neural network to fit the action to be executed; wherein, the Q-neural network is a deep neural network, and the deep neural network is based on a multi-layer neural network model; each layer of the neural network model calculates the input data through learning parameters to generate a new representation; the representation becomes more abstract as the number of layers increases; the output of the last layer of the neural network model is generated by combining the outputs of all hidden layers, the environmental observations are received by the deep neural network as input, and the action is output from the deep neural network; wherein, training the large-scale reinforcement learning distributed training system includes the values of the parameters of the Q-neural network;
Step 1.3: The large-scale reinforcement learning distributed training system further includes: multiple learners; each learner executes in a corresponding computing unit, and each learner is configured to operate independently of other learners; the learners centrally complete neural network inference on dedicated hardware, maintaining the model parameters and states in local states; the learners synthesize distributed inference inputs from hundreds of machines; wherein, each learner maintains a corresponding initial Q-neural network and a target Q-neural network;
Step 1.4: Multiple learners are further configured to include the following operations: receiving the current values of the parameters of the Q-neural network from an environment parameter server; using these current values to update the parameters of the learner's Q-neural network copy; the learner selecting experience tuples from a corresponding shared experience pool; using the learner's Q-neural network copy and the target Q-neural network copy maintained by the learner; the learner calculating gradients based on the experience tuples using the Q-neural network copy and the target Q-neural network copy; and providing the calculated gradients to the environment parameter server and the relevant environment.
Step 2.1: The large-scale reinforcement learning distributed training system further includes: a communication-based repeater, EnvProxy; each type of simulation environment and training algorithm provides a repeater, which is a device used to connect the client and the server. The repeater EnvProxy receives the raw observation data in the simulation environment and sends it to the shared experience pool, waiting for the learner to call it after the algorithm is connected. The communication relay EnvProxy supports bidirectional streaming communication based on the gRPC framework and HTTP/2 protocol. The relay provides the learner to the action to be executed by fitting it with a Q-neural network and provides the calculated gradient to the relay EnvProxy, which then transmits it to the environment parameter server.

2. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 1, characterized in that it also includes the following steps: Step 3.1: The large-scale reinforcement learning distributed training system further includes one or more executors; each executor executes on a corresponding computing unit; each executor is configured to operate independently of every other executor, and each executor interacts with a corresponding copy of the environment; each executor maintains a corresponding executor Q-neural network copy, and each executor is further configured to repeatedly execute the operations of Step 3.2. Step 3.2: receiving, via the communication repeater EnvProxy, the current values of the parameters of the Q-neural network from the environment parameter server; using the current values to update the parameter values of the Q-neural network copy maintained by the executor; receiving an observation that characterizes the current state of the environment copy interacting with the executor; in response to the observation, selecting an action to be performed using the Q-neural network copy maintained by the executor; receiving a reward in response to the performed action and a next observation that characterizes the next state of the environment copy interacting with the executor; generating an experience tuple including the current observation, the selected action, the reward, and the next observation; and storing the experience tuple in a corresponding shared experience pool.
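One iteration of the executor loop in Step 3.2 could look roughly like the sketch below. It is a simplified illustration under stated assumptions: `ChainEnv` is a hypothetical toy environment, the Q-neural network copy is replaced by any callable returning one score per action, the parameter refresh from the environment parameter server is omitted, and action selection is purely greedy.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Experience(NamedTuple):
    """Experience tuple: current observation, action, reward, next observation."""
    observation: int
    action: int
    reward: float
    next_observation: int


class ChainEnv:
    """Hypothetical toy environment: positions 0..length-1 on a line."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.state = 0

    def step(self, action: int) -> Tuple[int, float]:
        # Action 1 moves right, any other action moves left; the last cell pays 1.
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        return self.state, reward


def executor_step(
    env: ChainEnv,
    q_values: Callable[[int], Sequence[float]],
    pool: List[Experience],
) -> Experience:
    """Observe, pick the highest-scoring action, act, and store the tuple."""
    observation = env.state
    scores = q_values(observation)
    action = max(range(len(scores)), key=scores.__getitem__)
    next_observation, reward = env.step(action)
    experience = Experience(observation, action, reward, next_observation)
    pool.append(experience)
    return experience
```

Here a plain list plays the role of the shared experience pool; in the claimed system the tuple would be stored through the repeater.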

3. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 2, characterized in that it also includes the following steps: The large-scale reinforcement learning distributed training system further includes an environment parameter server; the environment parameter server is configured to repeatedly perform the following operations: receiving a series of gradients from the plurality of learners; using the gradients to compute an update to the parameter values of the Q-neural network; using the computed update to update the parameter values of the Q-neural network; and providing the updated parameter values to the one or more executors and the plurality of learners.
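The parameter server cycle of claim 3 (collect gradients, compute an update, apply it, hand out the new values) can be illustrated with a small sketch. The claim does not say how the update is computed from the gradients; averaging the gradients and taking a plain gradient-descent step, as well as the class name and the dictionary representation of parameters, are assumptions made here for illustration.

```python
from typing import Dict, List


class ParameterServer:
    """Hypothetical single-process parameter server for named scalar parameters."""

    def __init__(self, params: Dict[str, float], learning_rate: float) -> None:
        self._params = dict(params)
        self._lr = learning_rate

    def apply_gradients(self, gradients: List[Dict[str, float]]) -> None:
        # Assumed update rule: average the learners' gradients, then one SGD step.
        for name in self._params:
            mean_grad = sum(g[name] for g in gradients) / len(gradients)
            self._params[name] -= self._lr * mean_grad

    def current_values(self) -> Dict[str, float]:
        # Return a copy so executors and learners cannot mutate server state.
        return dict(self._params)
```

Executors and learners would call `current_values()` to refresh their Q-neural network copies.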

4. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 2, characterized in that it also includes the following steps: The environment parameter server includes multiple environment parameter server shards; wherein each shard is configured to hold the values of a corresponding disjoint partition of the parameters of the Q-neural network, and each shard is configured to operate asynchronously relative to every other shard; the environment parameter server determines whether the criteria for updating the parameters of the target Q-neural network copy maintained by the learner are met; and, when the criteria are met, provides the learner with data indicating that the updated parameter values are to be used to update the parameters of the target Q-neural network copy.
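The two ideas in claim 4, disjoint parameter partitions across shards and a criterion that triggers a target-network refresh, might be sketched as below. The claim specifies neither the partitioning scheme nor the criterion; the round-robin split and the "every `period` updates" rule are hypothetical choices, and asynchronous operation of the shards is not modeled.

```python
from typing import List


def partition_parameters(names: List[str], num_shards: int) -> List[List[str]]:
    """Split parameter names round-robin into disjoint partitions, one per shard."""
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for index, name in enumerate(sorted(names)):
        shards[index % num_shards].append(name)
    return shards


class TargetUpdateCriterion:
    """Hypothetical criterion: signal a target refresh every `period` updates."""

    def __init__(self, period: int) -> None:
        self.period = period
        self.updates_seen = 0

    def record_update(self) -> bool:
        # Returns True when the learner should copy the updated parameter
        # values into its target Q-neural network copy.
        self.updates_seen += 1
        return self.updates_seen % self.period == 0
```

Because the partitions are disjoint, each shard can apply updates to its own parameters without coordinating with the others.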

5. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 4, characterized in that each of the learners is configured to perform operations that further include the following steps: receiving data indicating that the updated parameter values are to be used to update the parameters of the target Q-neural network copy maintained by the learner; and using the updated parameter values to update the parameters of the target Q-neural network copy maintained by the learner.

6. The method for a large-scale reinforcement learning distributed training system based on communication as described in any one of claims 2 to 5, characterized in that it also includes the following steps: Each learner is associated with a corresponding executor and replay memory; each policy experience pool, comprising a learner, an executor, and a replay memory, is implemented on a corresponding computing unit; each policy experience pool is configured to operate independently of the other policy experience pools; and, for each policy experience pool, the corresponding learner selects from the experience tuples generated by the executor in that policy experience pool.

7. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 6, characterized in that it also includes the following steps: For each policy experience pool, the current parameter values of the executor Q-neural network copy maintained by the executor in the policy experience pool are synchronized with the current parameter values of the learner Q-neural network copy maintained by the learner in the policy experience pool. The operation of selecting, in response to the observation, an action to be performed using the executor Q-neural network copy includes: determining, from the predetermined action set, the action that, when provided as input to the executor Q-neural network copy together with the current observation, generates the maximum executor Q-neural network copy output; selecting a random action from the predetermined action set with probability ε; and selecting the determined action, which generates the maximum executor Q-neural network copy output, with probability 1 - ε.
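The selection rule in claim 7 is the familiar ε-greedy policy. A minimal sketch follows; the function name and signature are illustrative, and `q_value` stands in for the executor Q-neural network copy, assumed here to be a callable that maps an observation and an action to a scalar.

```python
import random
from typing import Any, Callable, Sequence


def epsilon_greedy(
    q_value: Callable[[Any, int], float],
    observation: Any,
    actions: Sequence[int],
    epsilon: float,
    rng: random.Random,
) -> int:
    """Pick a random action with probability epsilon, else the greedy action."""
    # Determine the action whose Q output is largest for this observation.
    greedy_action = max(actions, key=lambda a: q_value(observation, a))
    # Explore with probability epsilon, exploit with probability 1 - epsilon.
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return greedy_action
```

Passing an explicit `random.Random` instance keeps the exploration reproducible across executors when each is given its own seed.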

8. The method for a large-scale reinforcement learning distributed training system based on communication as described in claim 7, characterized in that calculating gradients from the experience tuples using the learner Q-neural network copy and the target Q-neural network copy maintained by the learner includes the following operations: processing the action and the current observation from the experience tuple using the learner Q-neural network copy to determine the learner Q-neural network copy output; determining the maximum target Q-neural network copy output using the target Q-neural network copy, by processing each action in the predetermined action set together with the next observation from the experience tuple; and computing the gradient using the learner Q-neural network copy output, the maximum target Q-neural network copy output, and a buffer of experience tuples from the shared experience pool.
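To make the gradient computation of claim 8 concrete, the sketch below uses the standard one-step Q-learning target with a squared error, which is an assumption: the claim names the inputs to the gradient but not the loss, the discount factor `gamma`, or how the reward enters. A lookup table replaces the deep Q-neural network so the gradient reduces to a single entry; all names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

QTable = Dict[Tuple[int, int], float]


def td_gradient(
    learner_q: QTable,
    target_q: QTable,
    experience: Tuple[int, int, float, int],
    actions: Sequence[int],
    gamma: float,
) -> Dict[Tuple[int, int], float]:
    """Gradient of 0.5 * (Q(s, a) - y)^2 with respect to the table entries."""
    observation, action, reward, next_observation = experience
    # Learner output for the stored observation and action.
    q_output = learner_q[(observation, action)]
    # Maximum target output over all actions at the next observation.
    max_target = max(target_q[(next_observation, a)] for a in actions)
    # Assumed target: y = reward + gamma * max_a' Q_target(s', a'), held fixed.
    td_target = reward + gamma * max_target
    # For a table, dQ/dtheta is one-hot, so only one entry is non-zero.
    return {(observation, action): q_output - td_target}
```

With a neural network in place of the table, the same scalar error `q_output - td_target` would be backpropagated through the learner network while the target network stays fixed.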