Distributed task exception handling method and system

A distributed task exception handling technique, applied in the field of computer data processing, that addresses overflow during parameter aggregation.

Pending Publication Date: 2022-05-27
中关村海华信息技术前沿研究院

AI Technical Summary

Problems solved by technology

When the amount of data that the working nodes must compute increases, the aggregation node can overflow while collecting the data transmitted by each working node and performing calculations.



Examples

Embodiment 1

[0060] See Figure 3 and Figure 4. Figure 3 is a schematic diagram of the principle of an embodiment of the distributed task processing architecture of the present application, and Figure 4 is a schematic flowchart of an embodiment of the distributed task processing method of the present application. Taking the single-tenant mode as an example, when the working node executes the current round of training tasks and obtains the task parameters (such as the gradient values obtained from the training subset), the working node executes step S101, as shown in Figure 3 and Figure 4.

[0061] In step S101, a first data packet is sent that contains the task parameters and identification information of the node that is to perform the aggregation operation; the identification information instructs a forwarding node or a parameter node to perform an aggregation operation on the task parameters; the task parameters are obtained by executing a distributed...
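The first data packet of step S101 can be pictured as a small record holding the identification information and the task parameters. The following is a minimal sketch only: the class name `FirstDataPacket`, the field names, and the wire layout (1-byte aggregator id, 2-byte count, 32-bit floats) are assumptions made for illustration and are not specified in the source.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical wire layout (an assumption, not from the source):
# 1-byte aggregator id, 2-byte parameter count, then 32-bit floats.
HEADER = struct.Struct("!BH")


@dataclass
class FirstDataPacket:
    aggregator_id: int        # identifies the node that should aggregate
    task_params: List[float]  # e.g. gradient values from the training subset

    def encode(self) -> bytes:
        """Serialize the header followed by the task parameters."""
        header = HEADER.pack(self.aggregator_id, len(self.task_params))
        body = struct.pack(f"!{len(self.task_params)}f", *self.task_params)
        return header + body

    @classmethod
    def decode(cls, data: bytes) -> "FirstDataPacket":
        """Parse a packet produced by encode()."""
        aggregator_id, count = HEADER.unpack_from(data, 0)
        params = list(struct.unpack_from(f"!{count}f", data, HEADER.size))
        return cls(aggregator_id, params)
```

A forwarding node or parameter node would call `decode` on receipt and compare `aggregator_id` with its own identity to decide whether to aggregate or pass the packet on.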

Embodiment 2

[0093] The above embodiment describes the workflow of each node in the single-tenant mode. In the multi-tenant mode (that is, when the working node performs multiple tasks at the same time), the node performing the aggregation operation needs to know which task the currently received data packet corresponds to, so that the parameters of different tasks are not confused. Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, the first data packet sent by the working node further includes task identification information, so that the node performing the aggregation operation can confirm the distributed computing task corresponding to the first data packet.

[0094] Here, the working node adds task identification information in the header of the first data packet, and when the forwarding node or the parameter node receives and parses the first data packet, it can learn the current first data according to the task identificat...
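One way to picture how task identification keeps tenants apart is an aggregator that buckets incoming parameters by task id. This is a sketch under stated assumptions: the class `TaskAggregator`, the fixed number of workers per task, and the use of an element-wise sum as the aggregation operation are all hypothetical choices, not details given in the source.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class TaskAggregator:
    """Keeps one bucket of pending parameters per task id (hypothetical)."""

    def __init__(self, workers_per_task: int) -> None:
        self.workers_per_task = workers_per_task
        self._pending: Dict[int, List[List[float]]] = defaultdict(list)

    def receive(self, task_id: int, params: List[float]) -> Optional[List[float]]:
        """Store one worker's parameters; return the sum once all have arrived."""
        bucket = self._pending[task_id]
        bucket.append(params)
        if len(bucket) < self.workers_per_task:
            return None  # still waiting for other workers of this task
        del self._pending[task_id]
        # Element-wise sum is assumed here as the aggregation operation.
        return [sum(column) for column in zip(*bucket)]
```

Because each task id has its own bucket, packets from different tasks can arrive interleaved without their parameters being mixed.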

Embodiment 3

[0096] The above embodiment describes the flow in a normal situation (here, normal means that no packet loss is caused by communication link interruption, network delay, and so on). In real scenarios, packet loss caused by communication link interruptions, network delays, and similar faults can occur.

[0097] Therefore, on the basis of any of the foregoing embodiments or any combination thereof, in some embodiments, the method further includes performing packet loss detection while sending the first data packet, and resending the first data packet when packet loss is detected.

[0098] Here, taking Figure 3 as an example, the working node starts a timer when it sends the first data packet. If the timer reaches the preset threshold and the returned data packet has still not been received, the working node determines that packet loss has occurred and resends the first data packet containing the task parameters. When the...
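The timer-based loss detection described above can be sketched as a send loop that retransmits whenever no reply arrives within the threshold. The function name, the callback parameters `send` and `wait_for_reply`, the default threshold, and the bound on attempts are assumptions made for illustration; the source does not state whether retransmission is bounded.

```python
from typing import Callable, Optional


def send_with_retransmit(
    send: Callable[[bytes], None],
    wait_for_reply: Callable[[float], Optional[bytes]],
    packet: bytes,
    timeout_s: float = 0.5,   # preset threshold (value is an assumption)
    max_attempts: int = 5,    # retry bound (an assumption, not in the source)
) -> bytes:
    """Send `packet`; resend it each time the reply timer expires."""
    for _ in range(max_attempts):
        send(packet)
        # wait_for_reply returns None when the timer reaches the threshold
        # without a returned data packet, which is treated as packet loss.
        reply = wait_for_reply(timeout_s)
        if reply is not None:
            return reply
    raise TimeoutError(f"no reply after {max_attempts} attempts")
```

In a real deployment `send` and `wait_for_reply` would wrap a socket; here they are plain callables so the retry logic can be exercised without a network.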


Abstract

The invention discloses a distributed task exception handling method and system. The method comprises: when an exception retransmission instruction is received, transmitting a first data packet that comprises a task parameter in a first data format and identification information instructing a parameter node to execute an aggregation operation, wherein the exception retransmission instruction instructs a working node to execute data format conversion and the task parameter is obtained by executing a distributed computing task; and receiving a second data packet containing an aggregation parameter, wherein the aggregation parameter is used by each working node corresponding to the same distributed computing task to perform data processing and is obtained after the parameter node executes the aggregation operation according to the first data packet. When the parameter node detects an aggregation parameter overflow, the working node retransmits the task parameters in a different data format, which solves the problem that the forwarding node cannot handle aggregation parameter overflow.
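The overflow-and-retransmit idea in the abstract can be illustrated with a small simulation. The source does not name the data formats involved, so this sketch assumes a 32-bit fixed-point format for the first transmission and a floating-point format for the retransmission; the scale factor and all function names are hypothetical.

```python
from typing import List, Tuple

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
SCALE = 1 << 16  # hypothetical fixed-point scale factor


def to_fixed(values: List[float]) -> List[int]:
    """Convert floats to the assumed fixed-point integer format."""
    return [round(v * SCALE) for v in values]


def aggregate_int32(contributions: List[List[int]]) -> Tuple[List[int], bool]:
    """Sum element-wise and report whether any total leaves the int32 range."""
    totals = [sum(column) for column in zip(*contributions)]
    overflow = any(t < INT32_MIN or t > INT32_MAX for t in totals)
    return totals, overflow


def aggregate_with_fallback(worker_values: List[List[float]]) -> Tuple[List[float], str]:
    """Aggregate in fixed point; on overflow, redo the sum in floating point."""
    totals, overflow = aggregate_int32([to_fixed(v) for v in worker_values])
    if not overflow:
        return [t / SCALE for t in totals], "int32"
    # Overflow detected: this models the exception retransmission instruction,
    # after which workers resend their parameters in a different format.
    return [sum(column) for column in zip(*worker_values)], "float"
```

For example, two workers each contributing `20000.0` fit in the assumed fixed-point format individually, but their sum does not, so the second return value is `"float"`.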

Description

Technical field

[0001] The present application relates to the field of computer data processing, and in particular to a method and system for exception handling of distributed tasks.

Background

[0002] Distributed computing systems can be used for sample training and gradient updating of deep neural networks. When the amount of data required to be calculated by the worker nodes increases, the aggregation node sometimes overflows when it aggregates the data transmitted by each worker node and performs computations.

Summary of the invention

[0003] In view of the above-mentioned shortcomings of the related art, the purpose of the present application is to provide an exception handling method and system for distributed tasks, so as to overcome the technical problem of overflow in distributed computing in the above-mentioned related art.

[0004] In order to achieve the above purpose and other related purposes, a first aspect disclosed in the present application p...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L67/1008, H04L67/565, H04L1/16, G06N3/08, G06F9/50
CPC: G06F9/5083, G06N3/08
Inventor: 吴文斐, 刘俊林, 陈奕熹
Owner: 中关村海华信息技术前沿研究院