Distributed task exception handling method and system
A distributed task and exception handling technology, applied in the field of computer data processing, that can solve problems such as overflow.
Examples
Embodiment 1
[0060] See Figure 3 and Figure 4. Figure 3 is a schematic diagram of the principle of an embodiment of the distributed task processing architecture of the present application, and Figure 4 is a schematic flowchart of an embodiment of the distributed task processing method of the present application. Taking the single-tenant mode as an example, when the working node has executed the current round of training tasks and obtained the task parameters (such as the gradient values obtained from the training subset), the working node executes step S101, as shown in Figure 3 and Figure 4.
[0061] In step S101, the working node sends a first data packet containing the task parameters and identification information of the node that is to perform the aggregation operation, wherein the identification information instructs a forwarding node or a parameter node to perform an aggregation operation on the task parameters, and wherein the task parameters are obtained by executing a distributed...
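As an illustration of the first data packet described in step S101, the following is a minimal sketch only. The byte layout, the name `FirstPacket`, and the helper `should_aggregate` are assumptions made for this example and do not come from the application; the task parameters are modeled as a list of gradient values.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical wire layout (an assumption, not taken from the application):
# 1 byte aggregator node ID, 2 bytes gradient count, then float32 values,
# all in network byte order.
HEADER = struct.Struct("!BH")


@dataclass
class FirstPacket:
    aggregator_id: int      # node that is instructed to perform the aggregation
    gradients: List[float]  # task parameters from the current training round

    def encode(self) -> bytes:
        """Serialize the header followed by the gradient values."""
        body = struct.pack(f"!{len(self.gradients)}f", *self.gradients)
        return HEADER.pack(self.aggregator_id, len(self.gradients)) + body

    @classmethod
    def decode(cls, data: bytes) -> "FirstPacket":
        """Parse a packet produced by encode()."""
        aggregator_id, count = HEADER.unpack_from(data)
        values = struct.unpack_from(f"!{count}f", data, HEADER.size)
        return cls(aggregator_id, list(values))


def should_aggregate(node_id: int, packet: FirstPacket) -> bool:
    """A forwarding or parameter node aggregates only if the packet names it."""
    return packet.aggregator_id == node_id
```

In this sketch, a node that receives the packet compares its own ID with the identification information and would pass the packet on unchanged when the two differ.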
Embodiment 2
[0093] The above embodiment describes the workflow of each node in the single-tenant mode. In the multi-tenant mode (that is, when the working node performs multiple tasks at the same time), the node performing the aggregation operation needs to know which task the currently received data packet corresponds to, so that the parameters of different tasks are not confused. Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, the first data packet sent by the working node further includes task identification information, so that the node performing the aggregation operation can confirm the distributed computing task corresponding to the first data packet.
[0094] Here, the working node adds task identification information to the header of the first data packet, and when the forwarding node or the parameter node receives and parses the first data packet, it can learn the current first data according to the task identificat...
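One way to picture the multi-tenant case is an aggregating node that keeps a separate accumulator for each task identifier. The sketch below is illustrative only: the class name `TaskAggregator` is hypothetical, and element-wise summation is assumed as the aggregation operation, which the text above does not specify. The `task_id` argument stands for the value parsed from the packet header.

```python
from typing import Dict, List


class TaskAggregator:
    """Keeps one running sum per task ID so concurrent tasks never mix."""

    def __init__(self) -> None:
        self._sums: Dict[int, List[float]] = {}
        self._counts: Dict[int, int] = {}

    def receive(self, task_id: int, gradients: List[float]) -> None:
        """Fold one packet's gradients into the accumulator for its task."""
        current = self._sums.get(task_id)
        if current is None:
            self._sums[task_id] = list(gradients)
        else:
            if len(current) != len(gradients):
                raise ValueError("gradient length mismatch for task %d" % task_id)
            # Assumed aggregation: element-wise sum.
            self._sums[task_id] = [a + b for a, b in zip(current, gradients)]
        self._counts[task_id] = self._counts.get(task_id, 0) + 1

    def result(self, task_id: int) -> List[float]:
        """Return a copy of the aggregated values for one task."""
        return list(self._sums[task_id])

    def packets_seen(self, task_id: int) -> int:
        """Number of packets aggregated so far for one task."""
        return self._counts.get(task_id, 0)
```

Because the accumulators are keyed by task ID, packets from two tasks that arrive interleaved at the same node are aggregated independently.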
Embodiment 3
[0096] The above embodiments describe the flow under normal conditions, that is, when no packet loss is caused by communication link interruption, network delay, or similar faults. In actual scenarios, such packet loss may occur.
[0097] Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, the method further includes performing packet loss detection during sending of the first data packet, and resending the first data packet when packet loss is detected.
[0098] Here, taking Figure 3 as an example, the working node starts a timer when it sends the first data packet. If the timer reaches the preset threshold and no returned data packet has been received, the working node determines that packet loss has occurred and resends the first data packet containing the task parameters. When the...
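The timer-based detection described above can be sketched as follows. This is a simplified illustration, not the method of the application: the function name `send_with_retransmit`, the injected `send` callable, the reply queue, and the `max_attempts` cap are all assumptions introduced so that the example runs without a real network.

```python
import queue
from typing import Callable


def send_with_retransmit(
    send: Callable[[bytes], None],
    replies: "queue.Queue[bytes]",
    packet: bytes,
    timeout_s: float,
    max_attempts: int = 3,  # hypothetical cap; the text does not state one
) -> bytes:
    """Send the packet, wait up to timeout_s for a reply, resend on timeout."""
    for _ in range(max_attempts):
        send(packet)  # timing starts when the packet is sent
        try:
            # Block until a returned packet arrives or the threshold is reached.
            return replies.get(timeout=timeout_s)
        except queue.Empty:
            # Threshold reached with no reply: treat as packet loss and resend.
            continue
    raise TimeoutError("no reply after %d attempts" % max_attempts)
```

For example, with a `send` function that drops the first transmission and only triggers a reply on the second, the call returns the reply after exactly two sends.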


