Neural network training method, distributed system, electronic device, and storage medium
By using point-to-point communication to transmit parameter fragments during distributed training of neural networks, the problem of prolonged iteration cycles caused by faulty nodes is solved, achieving more efficient parameter aggregation and training.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: MOORE THREADS TECHNOLOGY (SHANGHAI) CO LTD
- Filing Date: 2026-05-18
- Publication Date: 2026-06-19
Figure CN122242655A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a neural network training method, a distributed system, an electronic device, and a computer-readable storage medium.
Background Technology
[0002] As the scale of neural networks grows rapidly, the number of parameters also increases. To reduce the memory pressure on computing devices (such as graphics processing units, GPUs) during neural network training, a parallel training method using parameter sharding is typically employed. This parallel training method is also known as distributed training, where each computing device participating in the training (also called a node) updates only one parameter shard from the complete parameters of the neural network.
[0003] In each training round, after every node has updated its own parameter shard, the nodes usually perform a cross-node, global parameter aggregation operation (such as an all-gather operation, i.e., an AllGather operation) to obtain the updated complete parameters for the next round. During parameter aggregation, each node broadcasts its latest parameter shard to all other nodes, so that every node ends up with an updated complete set of parameters.
[0004] To improve training efficiency, fault tolerance mechanisms are typically introduced in distributed training of neural networks. During distributed training, if a node fails, the parameter shard that node is responsible for can be temporarily "frozen": it is not updated in the current iteration, while the other healthy nodes continue to update their respective parameter shards. In this case, the failed node cannot provide an updated parameter shard, and if the AllGather operation is still used for parameter aggregation, the aggregation will fail.
[0005] In related technologies, the following fault-tolerant processing method is usually adopted. When there is only one faulty node, the normal nodes skip the faulty node when performing the AllGather operation and aggregate only the parameter shards of the normal nodes to obtain an aggregation result. The aggregation result is then stitched into the local cache of complete parameters through two memory copies: first, the parameter shards of all nodes before the faulty node are copied from the aggregation result to the corresponding positions in the cache; the positions in the cache that correspond to the faulty node are skipped; and then the parameter shards of all nodes after the faulty node are copied from the aggregation result to the corresponding positions in the cache.
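The copy pattern of this related-art approach can be illustrated with a minimal NumPy sketch. It is not taken from the disclosure: the function name `stitch_gathered`, the equal shard length, and the convention that node i owns the i-th shard are assumptions made for illustration only. The sketch counts one copy per contiguous run of normal nodes, which gives two copies for a single faulty node in the middle and more when several nodes fail.

```python
import numpy as np

def stitch_gathered(cache, gathered, shard_len, num_nodes, failed):
    """Copy gathered shards of normal nodes into the full-parameter cache.

    `gathered` holds only the normal nodes' shards, packed back to back
    (hypothetical layout). Slots of failed nodes in `cache` are left as-is.
    Returns the number of separate copy operations that were needed.
    """
    healthy = [r for r in range(num_nodes) if r not in failed]
    copies = 0
    i = 0
    while i < len(healthy):
        j = i
        # Extend the run while node ranks stay consecutive.
        while j + 1 < len(healthy) and healthy[j + 1] == healthy[j] + 1:
            j += 1
        run = j - i + 1
        dst = healthy[i] * shard_len   # position in the full cache
        src = i * shard_len            # position in the packed result
        cache[dst:dst + run * shard_len] = gathered[src:src + run * shard_len]
        copies += 1
        i = j + 1
    return copies

# One faulty node (rank 1 of 4): two copies, as described above.
cache = np.full(8, -1.0)
gathered = np.array([0.0, 0.0, 2.0, 2.0, 3.0, 3.0])
n_copies = stitch_gathered(cache, gathered, 2, 4, {1})
```

With ranks 3 and 5 of eight nodes failed, the same function needs three copies, which matches the observation that more failures lead to more non-contiguous copies.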
[0006] The above processing method involves one AllGather communication and two non-contiguous memory copies. When multiple nodes fail, this method requires even more non-contiguous memory copies. However, for neural networks with hundreds of billions or even trillions of parameters, the memory copying time overhead is significant, resulting in a long single iteration cycle and reducing the overall training efficiency of the neural network.
Summary of the Invention
[0007] This disclosure provides a neural network training method, a distributed system, an electronic device, and a computer-readable storage medium.
[0008] Firstly, this disclosure provides a neural network training method applied to a distributed system, the distributed system including a control node and multiple training nodes. The neural network training method includes: during the distributed training of the neural network, in the parameter aggregation phase of the current round, determining, through the control node, whether there are any abnormal nodes among the multiple training nodes; if abnormal nodes exist, determining a set of normal nodes among the multiple training nodes; for any normal node in the node set, aggregating the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, obtaining the parameter aggregation result of the normal node in the current round; wherein, the point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round; and training the neural network for the next round based on the parameter aggregation result of each normal node in the node set.
[0009] Secondly, this disclosure provides a neural network training method, executed by a first training node in a distributed system, wherein the first training node is any normal node in the distributed system. The neural network training method includes: during the distributed training of the neural network, in the parameter aggregation phase of the current round, obtaining parameter fragments obtained by the first training node in the current round; in response to the presence of abnormal nodes in the distributed system, receiving and storing parameter fragments updated by multiple second training nodes in the current round via point-to-point communication, obtaining the parameter aggregation result of the first training node in the current round; wherein the multiple second training nodes are other normal nodes in the distributed system besides the first training node; the point-to-point communication is based on a point-to-point communication task, which is used to store the parameter fragments updated by the multiple second training nodes in the current round; and based on the parameter aggregation result of the current round, performing training on the neural network in the next round.
[0010] Thirdly, this disclosure provides a distributed system for training a neural network. The distributed system includes a control node and multiple training nodes. The control node is used to: determine whether there are any abnormal nodes among the multiple training nodes during the parameter aggregation phase of the current round of distributed training of the neural network; if abnormal nodes exist, determine a set of normal nodes among the multiple training nodes; any normal node in the node set is used to: aggregate the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round; wherein, the point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round; and to train the neural network for the next round based on the parameter aggregation result of the normal node in the current round.
[0011] Fourthly, this disclosure provides an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the neural network training method described above.
[0012] Fifthly, this disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the aforementioned neural network training method.
[0013] According to the neural network training method of this disclosure, during the distributed training process of the neural network, when there are faulty abnormal nodes, the updated parameter fragments are transmitted through point-to-point communication between normal nodes to achieve parameter aggregation. Compared with related technologies that require skipping abnormal nodes and performing multiple non-contiguous memory copies during parameter aggregation, the neural network training method of this disclosure achieves parameter fragment transmission through point-to-point communication without multiple memory copies. This significantly reduces the time overhead caused by memory copies during parameter aggregation, thereby reducing the time spent on parameter aggregation operations, effectively shortening the single iteration cycle, and improving the overall training efficiency of the neural network.
[0014] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Brief Description of the Drawings
[0015] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the embodiments of the present disclosure to explain the disclosure and do not constitute a limitation thereof. The above and other features and advantages will become more apparent to those skilled in the art from the description of detailed exemplary embodiments with reference to the accompanying drawings, which are described below.
[0016] Figure 1 is a flowchart of a neural network training method provided in an embodiment of the present disclosure.
[0017] Figure 2 is a schematic diagram of a neural network training method provided in an embodiment of the present disclosure.
[0018] Figure 3 is a schematic diagram of a neural network training method provided in an embodiment of the present disclosure.
[0019] Figure 4 is a flowchart of a neural network training method provided in an embodiment of the present disclosure.
[0020] Figure 5 is a block diagram of a distributed system provided in an embodiment of the present disclosure.
[0021] Figure 6 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
[0022] To enable those skilled in the art to better understand the technical solutions of this disclosure, exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of this disclosure to aid understanding. These should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
[0023] Where there is no conflict, the various embodiments of this disclosure and the features thereof in the embodiments may be combined with each other.
[0024] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0025] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the terms “comprising” and/or “made of”, when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Words such as “connected” or “linked” are not limited to physical or mechanical connections but can include electrical connections, whether direct or indirect.
[0026] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.
[0027] When performing distributed training on a neural network, the optimizer state (e.g., momentum, variance, etc.) can be sharded, or the optimizer state and gradient can be sharded, or the optimizer state, gradient, and neural network parameters can be sharded. This disclosure does not impose any limitations on this. Take the sharding of the optimizer state as an example, where the number of computing devices (i.e., nodes) participating in the training is N (N is a positive integer and N≥2). First, the optimizer state is divided into N parts, and each node stores 1/N of the optimizer state, the complete gradient, and the complete parameters of the neural network. Then, iterative training is performed.
[0028] For any given training iteration, each node reads its corresponding sample data and performs forward computation based on the complete parameters of the neural network stored locally, obtaining the output of the neural network and calculating the network loss based on this output. Then, each node determines its own gradient based on its network loss, and the nodes obtain the complete gradient for this iteration through an all-reduce operation (AllReduce operation, which averages the gradients of all nodes and synchronizes the average to each node). Each node divides the complete gradient for this iteration into N parts and takes from these N parts the 1/N of the gradient that corresponds to its stored optimizer state. For example, if a node stores the second 1/N of the optimizer state, it also takes the second 1/N of the gradient. Next, each node divides the complete parameters of the neural network into N parts and determines its corresponding 1/N of the parameters. Then, each node updates its corresponding 1/N of the parameters based on its own 1/N of the gradient and 1/N of the optimizer state, obtaining its parameter shard for this iteration. Finally, an AllGather operation is performed to aggregate the parameters, giving each node the updated complete parameters for this iteration. Each node then performs the next round of iterative training on the neural network in the manner described above, based on the complete parameters updated in this iteration.
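The iteration described above can be sketched as a single-process NumPy simulation. This is an illustrative sketch only: the function name `sharded_train_step`, the use of plain momentum SGD as the optimizer, and the requirement that the parameter count be divisible by N are assumptions that do not come from the disclosure.

```python
import numpy as np

def sharded_train_step(params, local_grads, momentum_shards, lr=0.1, beta=0.9):
    """Simulate one iteration with the optimizer state split across N nodes.

    params          : complete parameters (every node holds a copy)
    local_grads     : list of N per-node gradients of the full parameter size
    momentum_shards : list of N arrays, node i holding the i-th 1/N of the state
    """
    n = len(local_grads)
    shard_len = params.size // n          # assumes params.size % n == 0
    # AllReduce: average the gradients of all nodes.
    full_grad = np.mean(local_grads, axis=0)
    new_shards = []
    for rank in range(n):
        lo, hi = rank * shard_len, (rank + 1) * shard_len
        # Node `rank` uses only its 1/N of the gradient and optimizer state.
        momentum_shards[rank] = beta * momentum_shards[rank] + full_grad[lo:hi]
        new_shards.append(params[lo:hi] - lr * momentum_shards[rank])
    # AllGather: every node ends up with the concatenation of all shards.
    return np.concatenate(new_shards)

params = np.zeros(4)
grads = [np.ones(4), 3.0 * np.ones(4)]        # two nodes
state = [np.zeros(2), np.zeros(2)]
new_params = sharded_train_step(params, grads, state)
```

In a real deployment the averaging and concatenation would be collective communication calls between devices; here they are ordinary array operations so that the data flow is easy to follow.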
[0029] To improve training efficiency, fault tolerance mechanisms are typically introduced in distributed training of neural networks. Fault tolerance methods in related technologies involve one AllGather communication and two non-contiguous memory copies. When multiple nodes fail, this approach requires even more non-contiguous memory copies. However, for neural networks with hundreds of billions or even trillions of parameters, the memory copying time overhead is significant, resulting in longer single iteration cycles and reducing the overall training efficiency of the neural network.
[0030] To address the aforementioned technical problems, embodiments of this disclosure provide a neural network training method applied to a distributed system. The distributed system includes a control node and multiple training nodes. The method includes: during distributed training of the neural network, in the parameter aggregation phase of the current round, determining, through the control node, whether there are any abnormal nodes among the multiple training nodes; if abnormal nodes exist, determining a set of normal nodes among the multiple training nodes; for any normal node in the node set, aggregating the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, obtaining the parameter aggregation result of the normal node in the current round; wherein, the point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round; and training the neural network for the next round based on the parameter aggregation result of each normal node in the node set.
[0031] According to the neural network training method of this disclosure, during the distributed training process of the neural network, when there are faulty abnormal nodes, the updated parameter fragments are transmitted through point-to-point communication between normal nodes to achieve parameter aggregation. Compared with related technologies that require skipping abnormal nodes and performing multiple non-contiguous memory copies during parameter aggregation, the neural network training method of this disclosure achieves parameter fragment transmission through point-to-point communication without multiple memory copies. This significantly reduces the time overhead caused by memory copies during parameter aggregation, thereby reducing the time spent on parameter aggregation operations, effectively shortening the single iteration cycle, and improving the overall training efficiency of the neural network.
[0032] In some possible implementations, the neural network training method of this disclosure can be applied to a distributed system. The distributed system can be a distributed system including a control node and multiple training nodes (i.e., a centralized distributed system). The distributed system can also be a distributed system including multiple training nodes (i.e., a decentralized distributed system). The distributed system can also be other distributed systems, such as hierarchical distributed systems, graded distributed systems, etc. This disclosure does not limit the specific type of distributed system.
[0033] In some possible implementations, both the control node and training node in the distributed system can be electronic devices such as terminal devices or servers. Terminal devices can be user equipment (UE), mobile devices, user terminals, terminals, cellular phones, personal digital assistants (PDAs), handheld devices, computing devices, in-vehicle devices, wearable devices, etc. In some possible implementations, this neural network training method can be implemented by the processor calling computer-readable program instructions stored in memory.
[0034] Figure 1 is a flowchart illustrating a neural network training method provided in an embodiment of the present disclosure. The neural network training method is applied to a distributed system, which includes a control node and multiple training nodes. Referring to Figure 1, the method includes steps S11-S13, which are described in detail below.
[0035] In step S11, during the distributed training of the neural network, in the parameter aggregation stage of the current round, the control node determines whether there are abnormal nodes among the multiple training nodes, and if there are abnormal nodes, determines the set of normal nodes among the multiple training nodes.
[0036] In step S12, for any normal node in the node set, the parameter fragments updated in the current round are aggregated through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round. The point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round.
[0037] In step S13, the neural network is trained for the next round based on the parameter aggregation results of each normal node in the node set in the current round.
[0038] In some possible implementations, the neural network in this disclosure can be any type of neural network, such as a feedforward neural network (FFNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a graph neural network (GNN), an artificial neural network (ANN), or a spiking neural network (SNN). This disclosure does not limit the specific type of neural network. The neural network in this disclosure can be used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks. The data processed by the neural network may include at least one of images, speech, text, and video. The neural network in this disclosure can also be used to perform other processing tasks, and this disclosure does not limit this.
[0039] In some possible implementations, the neural network can be trained in a distributed manner before performing the aforementioned tasks. A distributed system can be used for distributed training of the neural network. This distributed system may include a control node and multiple training nodes. Based on the number of training nodes, at least one of the optimizer state, gradients, and parameters (including weight parameters) of the neural network can be divided into multiple partitions, with one partition assigned to each training node. The optimizer state partition is used to update the parameter partition held by the training node according to the corresponding gradient during training. Each training node's storage space (e.g., a cache) can store the complete parameters of the neural network. In each training round, the complete parameters of the neural network stored in each training node can be updated through parameter aggregation to prepare for the next training round.
[0040] In some possible implementations, in step S11, during the distributed training of the neural network, after each training node updates its own parameter shards in the current training round, it enters the parameter aggregation phase. During the parameter aggregation phase of the current round, the control node of the distributed system can determine whether there are any abnormal nodes among the multiple training nodes based on heartbeat information and timeout mechanisms. An abnormal node refers to a training node that has experienced a failure.
[0041] When there are abnormal nodes among the multiple training nodes, the control node of the distributed system can determine the set of normal nodes among the multiple training nodes. The multiple training nodes include both faulty and normally functioning nodes. The control node can identify all abnormal nodes based on heartbeat information and timeout mechanisms, and then identify every training node that is not abnormal as a normal node. The set of normal nodes includes all normal nodes among the multiple training nodes. The set of normal nodes can be viewed as the effective communication topology for parameter aggregation in the current round, i.e., it specifies which nodes need to exchange parameter shards during parameter aggregation.
[0042] For example, a distributed system includes eight training nodes: training node 0, training node 1, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7. If training node 2 is an abnormal node, the set of normal nodes can be determined to include the seven nodes: training node 0, training node 1, training node 3, training node 4, training node 5, training node 6, and training node 7. If training nodes 3 and 5 are abnormal nodes, the set of normal nodes can be determined to include the six nodes: training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7.
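The two cases in this example amount to removing the abnormal nodes from the list of training nodes. A minimal sketch follows; the function name `normal_node_set` and the use of integer node identifiers are assumptions for illustration.

```python
def normal_node_set(all_nodes, abnormal_nodes):
    """Return the normal nodes, preserving the original node order."""
    abnormal = set(abnormal_nodes)
    return [node for node in all_nodes if node not in abnormal]

nodes = list(range(8))                          # training nodes 0..7
one_failure = normal_node_set(nodes, [2])
two_failures = normal_node_set(nodes, [3, 5])
```

Keeping the result ordered by node identifier is a design choice of this sketch; it makes the set usable later as a deterministic list of communication peers.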
[0043] In some possible implementations, after determining the set of normal nodes, in step S12, for any normal node in the set, the parameter shards updated in the current round can be aggregated through point-to-point communication between that normal node and the other normal nodes in the set, to obtain the parameter aggregation result for that normal node in the current round. Point-to-point communication includes point-to-point sending and receiving of data. Point-to-point communication can be asynchronous to reduce the impact on other applications.
[0044] For example, the set of normal nodes includes six nodes: training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7. For training node 0 in this set, the parameter shards updated in the current round can be aggregated through point-to-point data transmission and reception between training node 0 and training nodes 1, 2, 4, 6, and 7, yielding the parameter aggregation result for training node 0 in the current round. Similarly, for training node 1 in this set, the parameter shards updated in the current round can be aggregated through point-to-point data transmission and reception between training node 1 and training nodes 0, 2, 4, 6, and 7, yielding the parameter aggregation result for training node 1 in the current round. The processing method for other normal nodes (i.e., training nodes 2, 4, 6, and 7) is similar to that for training nodes 0 and 1, and will not be elaborated further here.
[0045] Point-to-point communication can be based on point-to-point communication tasks. These tasks are used to send parameter fragments updated by normal nodes in the current round to other normal nodes, and to store parameter fragments updated by other normal nodes in the current round. For example, in the example above, when performing point-to-point communication, a point-to-point communication task can be established between six nodes: training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7. Then, point-to-point communication can be performed based on this task.
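One way to picture such a point-to-point communication task is as a per-node list of send and receive operations. The sketch below is hypothetical: the `P2POp` record, the function `build_p2p_task`, the convention that node i owns the i-th shard, and the idea that each receive carries the cache offset where the shard is to be stored are all assumptions made for illustration, not details stated in this passage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class P2POp:
    kind: str     # "send" or "recv"
    peer: int     # the other normal node involved
    shard: int    # index of the shard being transferred
    offset: int   # element offset of that shard in the full-parameter cache

def build_p2p_task(rank, normal_nodes, shard_len):
    """List the sends and receives node `rank` performs in this round."""
    ops = []
    for peer in normal_nodes:
        if peer == rank:
            continue
        # Send this node's freshly updated shard to the peer.
        ops.append(P2POp("send", peer, rank, rank * shard_len))
        # Receive the peer's updated shard and store it at the peer's offset.
        ops.append(P2POp("recv", peer, peer, peer * shard_len))
    return ops

task_for_node0 = build_p2p_task(0, [0, 1, 2, 4, 6, 7], shard_len=4)
```

Because abnormal nodes never appear in `normal_nodes`, no operation targets them, and their regions of the cache are simply never written.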
[0046] After performing the above operations on each normal node in the node set, the parameter aggregation result for each normal node in the current round is obtained. The parameter aggregation result for any normal node includes the parameter shards updated by each normal node in the current round and the parameter shards of the abnormal nodes from the previous round. This ensures that, even in the presence of abnormal nodes, the aggregated complete parameters contain the latest values for the parameter shards of normal nodes, while the parameter shards of abnormal nodes retain their original values. In other words, while a node is in a failed state, the parameters in its parameter shard remain unchanged; if the failure spans multiple training rounds, they remain unchanged across all of those rounds.
[0047] For example, a distributed system includes 8 training nodes: training node 0, training node 1, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7. In the current training round, training nodes 3 and 5 fail. Therefore, the faulty nodes among the 8 training nodes are training nodes 3 and 5. Based on this, the parameter aggregation result for training node 1 in the current round includes: the parameter shards updated by training node 0 in the current round, the parameter shards updated by training node 1 in the current round, the parameter shards updated by training node 2 in the current round, the parameter shards from the previous round for training node 3, the parameter shards updated by training node 4 in the current round, the parameter shards from the previous round for training node 5, the parameter shards updated by training node 6 in the current round, and the parameter shards updated by training node 7 in the current round.
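The composition of this aggregation result can be checked with a small single-process simulation. The sketch is illustrative only: the name `aggregate_p2p`, the dictionary-based stand-in for network transfers, equal shard lengths, and node i owning the i-th shard are assumptions, and real point-to-point transfers are replaced by direct array writes.

```python
import numpy as np

def aggregate_p2p(caches, updated_shards, normal_nodes, shard_len):
    """Write each normal node's updated shard into every normal node's cache.

    caches         : dict rank -> full-parameter array holding last round's values
    updated_shards : dict rank -> shard updated in the current round
    Regions that belong to abnormal nodes are never touched.
    """
    for rank in normal_nodes:
        cache = caches[rank]
        for src in normal_nodes:          # includes the node's own shard
            lo = src * shard_len
            cache[lo:lo + shard_len] = updated_shards[src]
    return caches

normal = [0, 1, 2, 4, 6, 7]               # nodes 3 and 5 have failed
caches = {r: np.zeros(16) for r in normal}            # previous round: all 0.0
updated = {r: np.full(2, r + 1.0) for r in normal}    # node r sends value r+1
aggregate_p2p(caches, updated, normal, shard_len=2)
result_node1 = caches[1]
```

In `result_node1`, the slots of nodes 3 and 5 still hold the previous round's value 0.0, while every other slot holds the value sent by its owner in the current round.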
[0048] In some possible implementations, after obtaining the parameter aggregation result of each normal node in the current round, in step S13, the neural network can be trained for the next round based on the parameter aggregation result of each normal node in the node set in the current round.
[0049] According to the neural network training method of this disclosure, during the distributed training process of the neural network, when there are faulty abnormal nodes, the updated parameter fragments are transmitted through point-to-point communication between normal nodes to achieve parameter aggregation. Compared with related technologies that require skipping abnormal nodes and performing multiple non-contiguous memory copies during parameter aggregation, the neural network training method of this disclosure achieves parameter fragment transmission through point-to-point communication without multiple memory copies. This significantly reduces the time overhead caused by memory copies during parameter aggregation, thereby reducing the time spent on parameter aggregation operations, effectively shortening the single iteration cycle, and improving the overall training efficiency of the neural network.
[0050] Figure 2 is a schematic diagram illustrating a neural network training method provided in an embodiment of this disclosure. Referring to Figure 2, this neural network training method is applied to a distributed system, which includes a control node and multiple training nodes.
[0051] During the distributed training of a neural network, in the parameter aggregation phase of the current round, the control node can determine whether there are any abnormal nodes among the multiple training nodes. If abnormal nodes are found, the control node can determine the set of normal nodes among the multiple training nodes and synchronize this set of nodes to each normal node.
[0052] For any normal node in the node set, the parameter slices updated in the current round can be aggregated through point-to-point communication between the normal node and other normal nodes in the node set, resulting in the parameter aggregation result for the normal node in the current round. Then, based on the parameter aggregation result of each normal node in the node set in the current round, the neural network is trained for the next round.
[0053] The neural network training method according to embodiments of this disclosure will now be described in detail.
[0054] In some possible implementations, step S11 may include: during the parameter aggregation phase of the current round, if the control node does not receive heartbeat information from at least one training node within a first preset duration, the control node determines that there is an abnormal node among the multiple training nodes and identifies the at least one training node as an abnormal node; the control node then determines the node set of normal nodes among the multiple training nodes based on the abnormal node.
[0055] When the neural network training method of this disclosure is applied to a distributed system including a control node and multiple training nodes, each training node can send heartbeat information to the control node according to a first preset period (e.g., 1 second). The control node can determine whether each training node has failed based on the received heartbeat information. In the parameter aggregation phase of the current round, if the control node does not receive heartbeat information from one or more training nodes within a first preset duration (which is longer than the first preset period, e.g., 3 seconds), the control node can consider that those training nodes have failed, determine that there are abnormal nodes among the multiple training nodes participating in the training, and identify the one or more training nodes from which no heartbeat information was received as abnormal nodes.
[0056] For example, a distributed system includes a control node and eight training nodes (training node 0, training node 1, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7). Each training node sends a heartbeat message to the control node with a period of 1 second (the first preset period). During the parameter aggregation phase of the current round, for any training node, if the control node receives the heartbeat message from that training node within 3 seconds (the first preset duration), it can be determined that the training node is operating normally and has not failed; if the control node does not receive the heartbeat message from that training node within 3 seconds, it can be determined that the training node has failed.
[0057] Assuming the control node determines that training nodes 3 and 5 have failed, it can be determined that there are abnormal nodes among the 8 training nodes, and training nodes 3 and 5 are identified as abnormal nodes.
[0058] Once an abnormal node is identified, the control node can add all training nodes (excluding the abnormal node) to the set of normal nodes. The control node can also send node information (such as node identifiers) of all normal nodes in the set to each normal node for use in subsequent point-to-point communication.
[0059] For example, the control node can add the remaining six training nodes (that is, the eight training nodes excluding the two abnormal nodes, training node 3 and training node 5) to a set of normal nodes. This set includes training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7. The control node can then send the node identifiers (0, 1, 2, 4, 6, 7) of all normal nodes in this set to each normal node in the set for use in subsequent point-to-point communication.
[0060] In the embodiments of this disclosure, the neural network training method is applied to a distributed system including a control node and multiple training nodes. During the parameter aggregation phase of the current round, if the control node does not receive heartbeat information from at least one training node within a first preset duration, the control node can determine that there is an abnormal node among the multiple training nodes, and identify the at least one training node from which no heartbeat information was received as an abnormal node. Then, based on this abnormal node, the set of normal nodes among the multiple training nodes is determined. This allows the control node in the distributed system to determine whether an abnormal node exists and to determine the set of normal nodes, thus improving processing efficiency.
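The heartbeat check described above can be illustrated with a minimal sketch. The class name `ControlNode`, its methods, and the use of explicit `now` timestamps are assumptions made for illustration and do not appear in this disclosure; the 1-second period and 3-second duration are the example values given above.

```python
import time

HEARTBEAT_PERIOD_S = 1.0   # first preset period (example value from the text)
HEARTBEAT_TIMEOUT_S = 3.0  # first preset duration; longer than the period


class ControlNode:
    """Hypothetical control-node bookkeeping for heartbeat-based detection."""

    def __init__(self, node_ids, now=None):
        start = time.monotonic() if now is None else now
        # Treat every training node as freshly seen when monitoring starts.
        self.last_seen = {node_id: start for node_id in node_ids}

    def on_heartbeat(self, node_id, now=None):
        # Record the arrival time of the latest heartbeat from node_id.
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def abnormal_nodes(self, now=None):
        # A node is abnormal if no heartbeat arrived within the timeout.
        now = time.monotonic() if now is None else now
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > HEARTBEAT_TIMEOUT_S)

    def normal_node_set(self, now=None):
        abnormal = set(self.abnormal_nodes(now))
        return [n for n in sorted(self.last_seen) if n not in abnormal]


# Eight training nodes; nodes 3 and 5 stop sending heartbeats after t = 0.
control = ControlNode(range(8), now=0.0)
for t in (1.0, 2.0, 3.0, 4.0):
    for node_id in (0, 1, 2, 4, 6, 7):
        control.on_heartbeat(node_id, now=t)

print(control.abnormal_nodes(now=4.0))   # [3, 5]
print(control.normal_node_set(now=4.0))  # [0, 1, 2, 4, 6, 7]
```

Passing timestamps explicitly keeps the sketch deterministic; a real control node would presumably rely on the monotonic clock and on an actual transport for heartbeat messages.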
[0061] In some possible implementations, step S12 may include: establishing a point-to-point communication task for normal nodes; executing the point-to-point communication task to obtain the parameter aggregation result of normal nodes in the current round.
[0062] When performing point-to-point communication for any normal node in the node set, a point-to-point communication task for that normal node can be established first. This task can be used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round. For example, assuming the node set includes training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7, then the point-to-point communication task for training node 0 includes: sending the parameter fragments updated by training node 0 in the current round to training node 1, training node 2, training node 4, training node 6, and training node 7 respectively, and receiving and storing the parameter fragments updated by training node 1, training node 2, training node 4, training node 6, and training node 7 in the current round.
[0063] It should be noted that when establishing a point-to-point communication task, communication tasks related to abnormal nodes are not established. For example, in the example above, training nodes 3 and 5 are abnormal nodes, and the point-to-point communication task of training node 0 does not include communication tasks related to training nodes 3 and 5.
[0064] After establishing the point-to-point communication task for the normal node, the task can be executed to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to receive and store the parameter fragments updated by other normal nodes in the current round, thereby obtaining the parameter aggregation result of the normal node in the current round.
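As an illustration of how such a task might be established, the following sketch builds the send and receive entries for one normal node from the node set. The names `P2PTask` and `build_p2p_tasks` are hypothetical and are used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class P2PTask:
    kind: str  # "send" or "recv"
    peer: int  # identifier of the other normal node


def build_p2p_tasks(self_id, normal_ids):
    """Return the send and receive tasks of one normal node.

    Abnormal nodes never appear in normal_ids, so no task that involves
    them is created.
    """
    if self_id not in normal_ids:
        raise ValueError("only normal nodes establish point-to-point tasks")
    peers = [n for n in normal_ids if n != self_id]
    sends = [P2PTask("send", p) for p in peers]
    recvs = [P2PTask("recv", p) for p in peers]
    return sends + recvs


# Training node 0 with node set {0, 1, 2, 4, 6, 7}; nodes 3 and 5 are abnormal.
tasks = build_p2p_tasks(0, [0, 1, 2, 4, 6, 7])
send_peers = [t.peer for t in tasks if t.kind == "send"]
recv_peers = [t.peer for t in tasks if t.kind == "recv"]
print(send_peers)  # [1, 2, 4, 6, 7]
print(recv_peers)  # [1, 2, 4, 6, 7]
```

Because the task list is derived only from the node set, skipping abnormal nodes requires no special handling beyond leaving them out of that set.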
[0065] In the embodiments of this disclosure, for any normal node in the node set, by establishing and executing a point-to-point communication task for the normal node, the parameter aggregation result of the normal node in the current round is obtained, thereby enabling the parameter aggregation of the normal node in the current round to be executed quickly and accurately, reducing the time overhead of parameter aggregation operation and improving processing efficiency.
[0066] In some possible implementations, the point-to-point communication task includes multiple sending tasks. These sending tasks transmit parameter fragments updated by normal nodes in the current round to a first node, which is one of the other normal nodes. Executing the point-to-point communication task includes: when the normal node is the i-th node among multiple training nodes, for any sending task in the point-to-point communication task, transmitting the parameter fragments updated by the i-th node in the current round to the first node indicated by the sending task. This causes the first node to store the parameter fragments updated by the i-th node in the current round in the storage area corresponding to the i-th node within its own storage space, where i is a positive integer. The maximum value of i is the total number of training nodes in the distributed system.
[0067] For any normal node in the node set, the point-to-point communication task for that normal node may include multiple send tasks. A send task is used to send the parameter fragments updated by that normal node in the current round to a first node, which is one of the other normal nodes in the node set besides the normal node in question. The number of send tasks is the same as the number of first nodes, and both are equal to the total number of normal nodes in the node set minus one.
[0068] For example, assuming the set of normal nodes includes six normal nodes: training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7, then the point-to-point communication task of training node 0 includes five sending tasks. The first sending task sends the parameter fragment updated by training node 0 in the current round to training node 1; the second sending task sends the parameter fragment updated by training node 0 in the current round to training node 2; the third sending task sends the parameter fragment updated by training node 0 in the current round to training node 4; the fourth sending task sends the parameter fragment updated by training node 0 in the current round to training node 6; and the fifth sending task sends the parameter fragment updated by training node 0 in the current round to training node 7.
[0069] When the normal node is the i-th node among multiple training nodes, when executing the point-to-point communication task of this normal node, for any sending task in this point-to-point communication task, the parameter fragment updated by the i-th node in the current round can be sent to the first node indicated by the sending task. After receiving the parameter fragment updated by the i-th node in the current round, the first node can store it in the storage area corresponding to the i-th node in the storage space of the first node.
[0070] For example, in the example above, the point-to-point communication task of training node 0 includes five sending tasks. When executing the first sending task, the parameter fragments updated by training node 0 in the current round can be sent to training node 1 (i.e., the first node indicated by the first sending task). After receiving the parameter fragments updated by training node 0 in the current round, training node 1 will store them in the storage area corresponding to training node 0 in training node 1's storage space.
[0071] In the embodiments of this disclosure, the point-to-point communication task of a normal node includes multiple sending tasks. When the normal node is the i-th node among multiple training nodes, when executing the point-to-point communication task, for any of the sending tasks, the parameter fragment updated by the i-th node in the current round can be sent to the first node indicated by the sending task. The first node can store the parameter fragment updated by the i-th node in the current round in the storage area corresponding to the i-th node in its storage space, thereby enabling the sending task to send the parameter fragment updated by any normal node in the current round to other normal nodes, improving the data sending efficiency during parameter aggregation.
[0072] In some possible implementations, the point-to-point communication task further includes multiple receiving tasks. Each receiving task is used to receive and store the parameter fragments updated by a second node in the current round, where the second node is one of the other normal nodes. Executing the point-to-point communication task includes: for any receiving task in the point-to-point communication task, if the second node indicated by the receiving task is the j-th node among the multiple training nodes, receiving the parameter fragments updated by the j-th node in the current round and storing them in the storage area corresponding to the j-th node in the storage space of the normal node, where j is a positive integer. The maximum value of j is the total number of training nodes in the distributed system.
[0073] For any normal node in the node set, the point-to-point communication task of that normal node may also include multiple receive tasks. The receive task is used to receive and store the parameter fragments updated by the second node in the current round. The second node is one of the other normal nodes in the node set besides the normal node. The number of receive tasks is the same as the number of second nodes, both equal to the total number of normal nodes in the node set minus one.
[0074] For example, assuming the set of normal nodes includes six normal nodes: training node 0, training node 1, training node 2, training node 4, training node 6, and training node 7, then the point-to-point communication task of training node 0 includes five receiving tasks. Specifically, the first receiving task receives and stores the parameter fragments updated by training node 1 in the current round; the second receiving task receives and stores the parameter fragments updated by training node 2 in the current round; the third receiving task receives and stores the parameter fragments updated by training node 4 in the current round; the fourth receiving task receives and stores the parameter fragments updated by training node 6 in the current round; and the fifth receiving task receives and stores the parameter fragments updated by training node 7 in the current round.
[0075] When executing a point-to-point communication task for any normal node, for any receiving task in the point-to-point communication task of the normal node, if the second node indicated by the receiving task is the j-th node among multiple training nodes, the parameter fragment updated by the j-th node in the current round can be received, and the received parameter fragment updated by the j-th node in the current round can be stored in the storage area corresponding to the j-th node in the storage space of the normal node.
[0076] For example, in the example above, the point-to-point communication task of training node 0 includes five receiving tasks. When executing the first receiving task, the parameter shard updated by training node 1 (i.e., the second node indicated by the first receiving task) in the current round can be received, and then the parameter shard updated by training node 1 in the current round can be stored in the storage area corresponding to training node 1 in the storage space of training node 0.
[0077] In the embodiments of this disclosure, the point-to-point communication task of a normal node further includes multiple receiving tasks. When executing the point-to-point communication task, for any of the receiving tasks, if the second node indicated by the receiving task is the j-th node among multiple training nodes, the parameter fragments updated by the j-th node in the current round can be received and stored in the storage area corresponding to the j-th node in the storage space of the normal node. This allows the parameter fragments updated by other normal nodes in the current round to be stored in the corresponding storage area in the storage space of the normal node by executing the receiving task of any normal node, thereby improving the data receiving efficiency during parameter aggregation.
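The combined effect of the sending and receiving tasks can be illustrated with an in-memory simulation. This is a sketch only: the function `exchange_shards` is hypothetical, shards are represented by strings, and no real communication takes place. Storage areas of abnormal nodes are simply left untouched, shown here as `None`.

```python
def exchange_shards(updated_shards, normal_ids, total_nodes):
    """Simulate one round of point-to-point aggregation in memory.

    updated_shards maps node id -> shard updated in the current round.
    Returns node id -> storage space, a list with one storage area per
    training node, indexed by node id.
    """
    storage = {}
    for node in normal_ids:
        space = [None] * total_nodes
        space[node] = updated_shards[node]  # a node's own shard is already local
        storage[node] = space
    for sender in normal_ids:       # one sending task per other normal node
        for receiver in normal_ids:
            if receiver != sender:
                # Matching receiving task: store in the sender's storage area.
                storage[receiver][sender] = updated_shards[sender]
    return storage


normal = [0, 1, 2, 4, 6, 7]
shards = {n: f"shard-{n}-round-t" for n in normal}
storage = exchange_shards(shards, normal, total_nodes=8)

print(storage[1][0])  # shard-0-round-t
print(storage[0][3])  # None
```

Each received shard lands directly in the area indexed by its sender, so every normal node ends the round with the same layout.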
[0078] In some possible implementations, executing the point-to-point communication task to obtain the parameter aggregation result of the normal node in the current round includes: executing the point-to-point communication task in batches by calling a preset batch asynchronous point-to-point communication interface, to obtain the parameter aggregation result of the normal node in the current round.
[0079] The preset batch asynchronous point-to-point communication interface can be used to execute node-to-node data reception or transmission tasks in batches via asynchronous communication. This interface can be a high-efficiency batch interface provided by the underlying communication library. For example, the `torch.distributed.batch_isend_irecv` interface in PyTorch's distributed communication library can be used. Point-to-point communication tasks from normal nodes can be submitted to this interface in batches. This interface is responsible for optimizing and packaging point-to-point communication tasks, initiating them in parallel, and managing their completion status. This process is non-blocking, allowing for a certain degree of overlap between computation and communication.
[0080] After the point-to-point communication task is completed, the parameter aggregation result of the normal node in the current round is obtained. It should be noted that when the point-to-point communication task is executed in batches by calling the preset batch asynchronous point-to-point communication interface, the point-to-point communication tasks of each normal node in the node set can be executed in parallel or concurrently. This disclosure does not impose any restrictions on this.
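A minimal sketch of how the batch interface named above might be called is given below. It assumes that PyTorch is installed, that the default process group has been initialized and remains usable by the normal ranks after the failure, and that `full_params` is a tensor of shape (world size, shard size) whose row r serves as the storage area of rank r. The helper names `plan_p2p_ops` and `aggregate_with_batch_p2p` are hypothetical.

```python
def plan_p2p_ops(rank, normal_ranks):
    """List (kind, peer) pairs for one normal rank, skipping abnormal ranks."""
    peers = [r for r in normal_ranks if r != rank]
    return [("send", p) for p in peers] + [("recv", p) for p in peers]


def aggregate_with_batch_p2p(full_params, rank, normal_ranks):
    """Exchange shards among normal ranks with torch.distributed.

    Row `rank` of full_params must already hold this rank's updated shard.
    """
    import torch.distributed as dist  # lazy import; requires PyTorch

    ops = []
    for kind, peer in plan_p2p_ops(rank, normal_ranks):
        if kind == "send":
            ops.append(dist.P2POp(dist.isend, full_params[rank], peer))
        else:
            # Receive directly into the peer's row, without a staging buffer.
            ops.append(dist.P2POp(dist.irecv, full_params[peer], peer))
    if not ops:  # a single surviving rank has nothing to exchange
        return full_params
    requests = dist.batch_isend_irecv(ops)  # submit all tasks without blocking
    for req in requests:
        req.wait()  # block only once the aggregated result is needed
    return full_params


print(plan_p2p_ops(4, [0, 1, 2, 4, 6, 7])[:5])
# [('send', 0), ('send', 1), ('send', 2), ('send', 6), ('send', 7)]
```

Computation that does not depend on the incoming shards could be placed between the submission and the `wait()` calls, which is where the overlap mentioned above would come from.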
[0081] In the embodiments of this disclosure, point-to-point communication tasks are executed by calling a batch asynchronous point-to-point communication interface to obtain the parameter aggregation results of normal nodes in the current round, thereby improving the execution efficiency of point-to-point communication tasks.
[0082] Figure 3 is a schematic diagram of a neural network training method provided in an embodiment of the present disclosure. The method is applied to a distributed system that includes a control node and multiple training nodes. Referring to Figure 3, the neural network training method includes steps S301 to S307, wherein steps S301 to S306 are executed in the parameter aggregation stage of the current round during the distributed training of the neural network, as described in detail below.
[0083] Step S301: Determine whether there are any abnormal nodes among the multiple training nodes through the control node.
[0084] If there are no abnormal nodes among the multiple training nodes, the following steps are executed: Step S302, perform a global aggregation operation (AllGather operation) through multiple training nodes to obtain the parameter aggregation result of each training node in the current round; Step S303, based on the parameter aggregation result of each training node in the current round, train the neural network for the next round through multiple training nodes.
[0085] If there are abnormal nodes among multiple training nodes, the following steps are executed: Step S304, through the control node, determine the node set of normal nodes among multiple training nodes, and synchronize the node set to each normal node; Step S305, through each normal node, establish a point-to-point communication task between itself and other normal nodes, the point-to-point communication task includes multiple sending tasks and multiple receiving tasks; Step S306, by calling the preset batch asynchronous point-to-point communication interface, execute the point-to-point communication task of each normal node in batches, and obtain the parameter aggregation result of each normal node in the current round; Step S307, through each normal node in the node set, based on the parameter aggregation result of each normal node in the current round, perform the next round of training on the neural network.
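The branching between the two paths of Figure 3 can be summarized in a short sketch. The function `aggregation_steps` is hypothetical; it only returns the step labels and the participating nodes, and performs no training or communication.

```python
def aggregation_steps(all_node_ids, abnormal_ids):
    """Return the Figure 3 step labels taken in one round, in order,
    together with the nodes that take part in the next round."""
    steps = ["S301"]  # control node checks for abnormal nodes
    if not abnormal_ids:
        # S302: AllGather across all nodes; S303: next round of training.
        steps += ["S302", "S303"]
        participants = sorted(all_node_ids)
    else:
        # S304: determine and synchronize the node set; S305: establish tasks;
        # S306: batched point-to-point exchange; S307: next round of training.
        steps += ["S304", "S305", "S306", "S307"]
        participants = sorted(set(all_node_ids) - set(abnormal_ids))
    return steps, participants


print(aggregation_steps(range(8), set()))
# (['S301', 'S302', 'S303'], [0, 1, 2, 3, 4, 5, 6, 7])
print(aggregation_steps(range(8), {3, 5}))
# (['S301', 'S304', 'S305', 'S306', 'S307'], [0, 1, 2, 4, 6, 7])
```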
[0086] Figure 4 is a flowchart of a neural network training method provided in an embodiment of the present disclosure. This neural network training method is executed by a first training node in a distributed system, where the first training node is any normal node in the distributed system. Referring to Figure 4, the neural network training method includes steps S41 to S43, which are described in detail below.
[0087] In step S41, during the distributed training of the neural network, in the parameter aggregation stage of the current round, the parameter slices obtained by the first training node in the current round of training are acquired.
[0088] In step S42, in response to the presence of abnormal nodes in the distributed system, parameter shards updated by multiple second training nodes in the current round are received and stored via point-to-point communication to obtain the parameter aggregation result of the first training node in the current round. The multiple second training nodes are other normal nodes in the distributed system besides the first training node. The point-to-point communication is based on a point-to-point communication task, which is used to store the parameter shards updated by the multiple second training nodes in the current round.
[0089] In step S43, the neural network is trained for the next round based on the parameter aggregation results of the current round.
[0090] In the distributed training of a neural network, during the parameter aggregation phase of the current round, any normal node in the distributed system (including multiple training nodes) can be considered as the first training node. Then, for the first training node, in step S41, the parameter fragments obtained by the first training node in the current round are obtained. Then, in step S42, in response to the presence of abnormal nodes in the distributed system, the first training node can receive parameter fragments updated by multiple second training nodes in the current round via point-to-point communication, and store the parameter fragments updated by each second training node in the current round in its corresponding storage area, thereby obtaining the parameter aggregation result of the first training node in the current round. Here, the multiple second training nodes are other normal nodes in the distributed system besides the first training node. Point-to-point communication is based on a point-to-point communication task, which is used to store the parameter fragments updated by multiple second training nodes in the current round. That is, a point-to-point communication task for the first training node can be established and executed to obtain the parameter aggregation result of the first training node in the current round.
[0091] After obtaining the parameter aggregation result of the first training node in the current round in step S42, the first training node can, in step S43, train the neural network for the next round based on its parameter aggregation result in the current round.
[0092] In the embodiments of this disclosure, during the distributed training of a neural network, when a faulty node exists, the updated parameter fragments are transmitted through point-to-point communication between normal nodes to achieve parameter aggregation. Compared with related technologies that require skipping faulty nodes and performing multiple non-contiguous memory copies during parameter aggregation, the neural network training method of this disclosure achieves parameter fragment transmission through point-to-point communication without requiring multiple memory copies. This significantly reduces the time overhead caused by memory copies during parameter aggregation, thereby reducing the time spent on parameter aggregation operations, effectively shortening the single iteration cycle, and ultimately improving the overall training efficiency of the neural network.
[0093] In some possible implementations, the point-to-point communication task further includes multiple receiving tasks. These receiving tasks receive and store the parameter fragments updated by the second node in the current round. The second node is one of a plurality of second training nodes (i.e., one of the other normal nodes besides the first training node). Executing the point-to-point communication task includes: for any receiving task in the point-to-point communication task, if the second node indicated by the receiving task is the j-th node among the plurality of training nodes, receiving the parameter fragments updated by the j-th node in the current round and storing them in the storage area corresponding to the j-th node in the storage space of the first training node, where j is a positive integer. The maximum value of j is the total number of training nodes in the distributed system.
[0094] In some possible implementations, the method further includes: during the parameter aggregation phase of the current round, if no heartbeat information from at least one other training node is received within a second preset duration, determining that there is an abnormal node among the multiple training nodes, and identifying the at least one other training node as an abnormal node.
[0095] In the case of a decentralized distributed system, the system includes multiple training nodes but excludes a control node. The first training node in this system can send heartbeat information to other training nodes (nodes other than the first training node among the multiple training nodes participating in training) according to a second preset period (e.g., the second preset period is 1 second). The first training node can also determine whether other training nodes have failed based on the received heartbeat information.
[0096] During the parameter aggregation phase of the current round, if the first training node does not receive heartbeat information from one or more other training nodes within a second preset duration (the second preset duration is longer than the second preset period, for example, the second preset duration is 3 seconds), the first training node can consider that the one or more other training nodes have failed. This allows it to determine that there are abnormal nodes among the multiple training nodes participating in the training, and to identify the one or more other training nodes from which no heartbeat information was received as abnormal nodes. The first training node can then send the node information of the abnormal nodes to at least one second training node.
[0097] For example, the first distributed system includes eight training nodes (training node 0, training node 1, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7). Each training node sends heartbeat information to other training nodes at a period of 1 second (a second preset period). That is, training node 0 sends heartbeat information to training node 1, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7 at a period of 1 second; training node 1 sends heartbeat information to training node 0, training node 2, training node 3, training node 4, training node 5, training node 6, and training node 7 at a period of 1 second; training node 2 sends heartbeat information to training node 0, training node 1, training node 3, training node 4, training node 5, training node 6, and training node 7 at a period of 1 second, and so on.
[0098] During the parameter aggregation phase of the current round, for any normal node, such as training node 0, if training node 0 receives heartbeat information from another training node (such as training node 1) within 3 seconds (the second preset duration), it can be determined that the training node that sent the heartbeat information (training node 1) is operating normally and has not failed; if training node 0 does not receive heartbeat information from other training nodes (such as training node 3 and training node 5) within 3 seconds, it can be determined that the training nodes from which no heartbeat information was received (training node 3 and training node 5) have failed.
[0099] If training node 0 determines that training nodes 3 and 5 have failed, then training node 0 can determine that there are abnormal nodes among the 8 training nodes and identify training nodes 3 and 5 as abnormal nodes.
[0100] In the embodiments of this disclosure, when the distributed system is a decentralized distributed system, abnormal nodes can be identified through the first training node, enabling the neural network training method of this disclosure to be applied to decentralized distributed systems and expanding its application scope. Furthermore, identifying abnormal nodes based on heartbeat information can improve the processing efficiency when identifying abnormal nodes.
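For the decentralized case, the local check performed by a training node can be sketched as follows. The function names are hypothetical, and merging reports by set union is an assumed policy: this disclosure only states that the first training node can send the node information of the abnormal nodes to at least one second training node.

```python
PEER_TIMEOUT_S = 3.0  # second preset duration (example value from the text)


def locally_detected_abnormal(self_id, last_seen, now):
    """Peers whose latest heartbeat, as seen by self_id, is older than the timeout."""
    return {peer for peer, t in last_seen.items()
            if peer != self_id and now - t > PEER_TIMEOUT_S}


def merge_reports(reports):
    """Union of abnormal-node reports exchanged between normal nodes (assumed policy)."""
    merged = set()
    for report in reports:
        merged |= report
    return merged


# Arrival times of the latest heartbeat from each peer, as seen by training node 0.
last_seen_by_node0 = {1: 9.5, 2: 9.2, 3: 5.0, 4: 9.8, 5: 6.0, 6: 9.1, 7: 9.9}
found = locally_detected_abnormal(0, last_seen_by_node0, now=10.0)
print(sorted(found))                        # [3, 5]
print(sorted(merge_reports([found, {3}])))  # [3, 5]
```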
[0101] In some possible implementations, the point-to-point communication task is also used to send the parameter fragments updated by the first training node in the current round to multiple second training nodes. The method further includes: sending the parameter fragments updated by the first training node in the current round (i.e., the parameter fragments obtained from training) to multiple second training nodes through point-to-point communication, so that the multiple second training nodes can perform parameter aggregation based on the parameter fragments updated by the first training node in the current round.
[0102] The neural network training method of this disclosure adopts a point-to-point communication method. This method does not require modification or adaptation of the complex internal logic of collective communication primitives to tolerate missing participants. It only needs to simply skip the abnormal nodes that have failed at the task orchestration layer, thus achieving a simpler and more robust approach.
[0103] Furthermore, since no memory copying is required, the neural network training method of this disclosure embodiment can also reduce memory bandwidth pressure and reduce additional memory usage, thereby saving memory resources.
[0104] In some possible implementations, the point-to-point communication task further includes multiple sending tasks. These sending tasks are used to send the parameter fragments updated by the first training node in the current round to the first node, which is one of a plurality of second training nodes (i.e., one of the other normal nodes besides the first training node). Executing the point-to-point communication task includes: if the first training node is the i-th node among the plurality of training nodes, for any sending task in the point-to-point communication task, sending the parameter fragments updated by the i-th node in the current round to the first node indicated by the sending task, so that the first node stores the parameter fragments updated by the i-th node in the current round in the storage area corresponding to the i-th node in the storage space of the first node, where i is a positive integer. The maximum value of i is the total number of training nodes in the distributed system.
[0105] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.
[0106] In addition, this disclosure also provides a distributed system, a neural network training device, an electronic device, and a computer-readable storage medium, all of which can be used to implement any of the neural network training methods provided in this disclosure. The corresponding technical solutions and descriptions are described in the corresponding section of the method and will not be repeated here.
[0107] Figure 5 is a block diagram of a distributed system provided in an embodiment of the present disclosure.
[0108] Referring to Figure 5, an embodiment of this disclosure provides a distributed system 500, which includes a control node 510 and multiple training nodes 520.
[0109] The distributed system 500 is used to train the neural network. The control node 510 is used to: during the distributed training of the neural network, in the parameter aggregation phase of the current round, determine whether there are any abnormal nodes among the multiple training nodes; if abnormal nodes exist, determine the set of normal nodes among the multiple training nodes 520.
[0110] Any normal node in the node set is used to: aggregate the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round; wherein, the point-to-point communication is based on the point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to other normal nodes, and to store the parameter fragments updated by other normal nodes in the current round; based on the parameter aggregation result of the normal node in the current round, the neural network is trained for the next round.
[0111] In some possible implementations, any normal node in the node set is used to: establish a point-to-point communication task for the normal node; execute the point-to-point communication task to obtain the parameter aggregation result of the normal node in the current round.
[0112] In some possible implementations, the point-to-point communication task includes multiple sending tasks. The sending tasks are used to send the parameter fragments updated by normal nodes in the current round to a first node, which is one of the other normal nodes. Any normal node in the node set is used to: when the normal node is the i-th node among multiple training nodes, for any sending task in the point-to-point communication task, send the parameter fragments updated by the i-th node in the current round to the first node indicated by the sending task, so that the first node stores the parameter fragments updated by the i-th node in the current round in the storage area corresponding to the i-th node in the storage space of the first node, where i is a positive integer.
[0113] In some possible implementations, the point-to-point communication task also includes multiple receiving tasks. Each receiving task is used to receive and store the parameter fragments updated by a second node in the current round. The second node is one of the other normal nodes. Any normal node in the node set is used to: for any receiving task in the point-to-point communication task, if the second node indicated by the receiving task is the j-th node among the multiple training nodes, receive the parameter fragments updated by the j-th node in the current round and store them in the storage area corresponding to the j-th node in the storage space of the normal node, where j is a positive integer.
[0114] In some possible implementations, any normal node in the node set is used to: execute the point-to-point communication task in batches by calling a preset batch asynchronous point-to-point communication interface, to obtain the parameter aggregation result of the normal node in the current round.
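The disclosure does not name the batch asynchronous interface. As a hedged illustration of the idea only, the sketch below uses Python's standard `asyncio` as a stand-in: each node submits all of its send and receive operations at once and then waits for the whole batch to complete, rather than issuing them one by one. The helpers `send`, `recv`, `node_round`, and `run`, and the queues that play the role of links between nodes, are hypothetical. Frameworks such as PyTorch expose a comparable call (`torch.distributed.batch_isend_irecv`); whether the disclosure relies on it is not stated.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

Queues = Dict[Tuple[int, int], asyncio.Queue]


async def send(queues: Queues, src: int, dst: int, fragment: str) -> None:
    # Post the fragment on the (src -> dst) link.
    await queues[(src, dst)].put(fragment)


async def recv(queues: Queues, src: int, dst: int, storage: List[Optional[str]]) -> None:
    # Wait for the fragment from `src` and store it in the slot for `src`.
    storage[src] = await queues[(src, dst)].get()


async def node_round(rank: int, normal: List[int], fragments: Dict[int, str],
                     queues: Queues, storage: List[Optional[str]]) -> None:
    storage[rank] = fragments[rank]
    peers = [p for p in normal if p != rank]
    # Build the whole batch first, then launch and await it as one unit.
    batch = [send(queues, rank, p, fragments[rank]) for p in peers]
    batch += [recv(queues, p, rank, storage) for p in peers]
    await asyncio.gather(*batch)


async def run(normal: List[int], fragments: Dict[int, str],
              world_size: int) -> Dict[int, List[Optional[str]]]:
    queues: Queues = {(s, d): asyncio.Queue() for s in normal for d in normal if s != d}
    storages = {r: [None] * world_size for r in normal}
    await asyncio.gather(*(node_round(r, normal, fragments, queues, storages[r])
                           for r in normal))
    return storages


# Hypothetical example: four training nodes, node 2 is abnormal.
result = asyncio.run(run([0, 1, 3], {0: "w0", 1: "w1", 3: "w3"}, world_size=4))
```

Because no operation in any batch targets the abnormal node, no normal node blocks waiting on it, which is the property the point-to-point scheme relies on.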
[0115] In some possible implementations, the control node 510 is used to: during the parameter aggregation phase of the current round, if the control node 510 does not receive heartbeat information from at least one training node within a first preset time period, determine that there is an abnormal node among the multiple training nodes, and identify the at least one training node as an abnormal node; and determine the node set of normal nodes among the multiple training nodes based on the abnormal node.
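A minimal sketch of this heartbeat check follows, assuming the control node records the timestamp of the last heartbeat received from each training node. The function name `classify_nodes`, the timestamp dictionary, and the example values are hypothetical; `timeout` plays the role of the first preset time period.

```python
from typing import Dict, Iterable, Set, Tuple


def classify_nodes(all_nodes: Iterable[int],
                   last_heartbeat: Dict[int, float],
                   now: float,
                   timeout: float) -> Tuple[Set[int], Set[int]]:
    """Split training nodes into (normal, abnormal) by heartbeat age.

    A node with no heartbeat within `timeout` seconds before `now`,
    or with no recorded heartbeat at all, is treated as abnormal.
    """
    nodes = set(all_nodes)
    abnormal = {
        n for n in nodes
        if now - last_heartbeat.get(n, float("-inf")) > timeout
    }
    return nodes - abnormal, abnormal


# Hypothetical example: node 2 last reported 6 seconds ago, timeout is 2 seconds.
normal_set, abnormal_set = classify_nodes(
    all_nodes=range(4),
    last_heartbeat={0: 9.5, 1: 9.8, 2: 4.0, 3: 9.9},
    now=10.0,
    timeout=2.0,
)
```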
[0116] In some possible implementations, the parameter aggregation result of a normal node in the current round includes the parameter fragments updated by each normal node in the node set in the current round and the parameter fragments of the abnormal node from the previous round.
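The composition of this result can be sketched as a simple merge, assuming each node still holds the complete parameters from the previous round. The function `aggregation_result` and the string placeholders for fragments are hypothetical and use 0-based node indices.

```python
from typing import Dict, List, Set


def aggregation_result(previous_full: List[str],
                       updated: Dict[int, str],
                       normal_nodes: Set[int]) -> List[str]:
    """Overwrite the slots of normal nodes; abnormal slots keep last round's value."""
    result = list(previous_full)  # copy so the previous round stays intact
    for n in normal_nodes:
        result[n] = updated[n]
    return result


# Hypothetical example: node 2 is abnormal in round 2.
previous = ["w0_r1", "w1_r1", "w2_r1", "w3_r1"]
updated = {0: "w0_r2", 1: "w1_r2", 3: "w3_r2"}
current = aggregation_result(previous, updated, {0, 1, 3})
```

The fragment of node 2 is carried over unchanged from round 1, which corresponds to the "frozen" fragment of a failed node described in the background.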
[0117] In some possible implementations, the neural network is used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks.
[0118] Figure 6 is a block diagram of an electronic device provided in an embodiment of the present disclosure.
[0119] Referring to Figure 6, this disclosure provides an electronic device, which includes: at least one processor 601; at least one memory 602; and one or more I/O interfaces 603 connected between the processor 601 and the memory 602; wherein the memory 602 stores one or more computer programs that can be executed by the at least one processor 601, and the one or more computer programs are executed by the at least one processor 601 to enable the at least one processor 601 to perform the neural network training method described above.
[0120] This disclosure also provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the aforementioned neural network training method. The computer-readable storage medium may be volatile or non-volatile.
[0121] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in the processor of an electronic device, the processor in the electronic device executes the above-described neural network training method.
[0122] Those skilled in the art will understand that all or some of the steps, systems, and apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software can be distributed on a computer-readable storage medium, which may include computer storage media (or non-transitory media) and communication media (or transient media).
[0123] As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable program instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, it is known to those skilled in the art that communication media typically contain computer-readable program instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0124] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0125] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement various aspects of this disclosure.
[0126] The computer program product described herein can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.
[0127] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0128] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0129] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more blocks of a flowchart and / or block diagram.
[0130] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction, which contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0131] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for illustrative purposes only and should be construed as such, and is not intended to be limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in connection with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in connection with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this disclosure as set forth by the appended claims.
Claims
1. A neural network training method, characterized in that the method is applied to a distributed system, the distributed system including a control node and multiple training nodes, and the method includes: during the distributed training of the neural network, in the parameter aggregation stage of the current round, determining, through the control node, whether there are abnormal nodes among the multiple training nodes, and if there are abnormal nodes, determining the node set of normal nodes among the multiple training nodes; for any normal node in the node set, aggregating the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round; wherein the point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to the other normal nodes, and to store the parameter fragments updated by the other normal nodes in the current round; and training the neural network for the next round by using each normal node in the node set and the parameter aggregation result of each normal node in the current round.
2. The method according to claim 1, characterized in that the step of aggregating the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round, includes: establishing a point-to-point communication task for the normal node; and executing the point-to-point communication task to obtain the parameter aggregation result of the normal node in the current round.
3. The method according to claim 2, characterized in that the point-to-point communication task includes multiple sending tasks, each sending task being used to send the parameter fragments updated by the normal node in the current round to a first node, where the first node is one of the other normal nodes; and the execution of the point-to-point communication task includes: when the normal node is the i-th node among the multiple training nodes, for any sending task in the point-to-point communication task, sending the parameter fragments updated by the i-th node in the current round to the first node indicated by the sending task, so that the first node stores the parameter fragments updated by the i-th node in the current round in the storage area corresponding to the i-th node in the storage space of the first node, where i is a positive integer.
4. The method according to claim 3, characterized in that the point-to-point communication task also includes multiple receiving tasks, each receiving task being used to receive and store the parameter fragments updated by a second node in the current round, where the second node is one of the other normal nodes; and the execution of the point-to-point communication task includes: for any receiving task in the point-to-point communication task, if the second node indicated by the receiving task is the j-th node among the multiple training nodes, receiving the parameter fragments updated by the j-th node in the current round and storing them in the storage area corresponding to the j-th node in the storage space of the normal node, where j is a positive integer.
5. The method according to claim 2, characterized in that the execution of the point-to-point communication task to obtain the parameter aggregation result of the normal node in the current round includes: executing the point-to-point communication task in batches by calling a preset batch asynchronous point-to-point communication interface, to obtain the parameter aggregation result of the normal node in the current round.
6. The method according to claim 1, characterized in that the step of determining, through the control node, whether there are abnormal nodes among the multiple training nodes, and determining the node set of normal nodes among the multiple training nodes if there are abnormal nodes, includes: during the parameter aggregation phase of the current round, if the control node does not receive heartbeat information from at least one training node within a first preset time period, determining, by the control node, that there is an abnormal node among the multiple training nodes, and identifying the at least one training node as an abnormal node; and determining, by the control node, the node set of normal nodes among the multiple training nodes based on the abnormal node.
7. The method according to claim 1, characterized in that the parameter aggregation result of the normal node in the current round includes the parameter fragments updated by each normal node in the node set in the current round and the parameter fragments of the abnormal node from the previous round.
8. The method according to any one of claims 1-7, characterized in that the neural network is used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks.
9. A neural network training method, characterized in that the method is executed by a first training node in a distributed system, where the first training node is any normal node in the distributed system, and the method includes: during the distributed training of the neural network, in the parameter aggregation stage of the current round, acquiring the parameter fragments obtained by the first training node in the current round of training; in response to the presence of abnormal nodes in the distributed system, receiving and storing, via point-to-point communication, the parameter fragments updated by multiple second training nodes in the current round, thereby obtaining the parameter aggregation result of the first training node in the current round, wherein the multiple second training nodes are the other normal nodes in the distributed system besides the first training node, and the point-to-point communication is based on a point-to-point communication task, which is used to store the parameter fragments updated by the multiple second training nodes in the current round; and based on the parameter aggregation result of the current round, training the neural network for the next round.
10. The method according to claim 9, characterized in that the method further includes: during the parameter aggregation phase of the current round, if no heartbeat information from at least one other training node is received within a second preset time period, determining that there is an abnormal node among the multiple training nodes, and identifying the at least one other training node as an abnormal node.
11. A distributed system, characterized in that the distributed system is used to train a neural network, and the distributed system includes a control node and multiple training nodes; the control node is used to: during the distributed training of the neural network, in the parameter aggregation stage of the current round, determine whether there are abnormal nodes among the multiple training nodes, and if there are abnormal nodes, determine the node set of normal nodes among the multiple training nodes; any normal node in the node set is used to: aggregate the parameter fragments updated in the current round through point-to-point communication between the normal node and other normal nodes in the node set, to obtain the parameter aggregation result of the normal node in the current round; wherein the point-to-point communication is based on a point-to-point communication task, which is used to send the parameter fragments updated by the normal node in the current round to the other normal nodes, and to store the parameter fragments updated by the other normal nodes in the current round; and based on the parameter aggregation result of the normal node in the current round, train the neural network for the next round.
12. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs that can be executed by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the neural network training method as described in any one of claims 1-10.
13. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the neural network training method as described in any one of claims 1-10.