Method for resetting data transmission network in distributed training task training process

A data transmission and distributed training technology, applied to data exchange networks, digital transmission systems, and transmission systems. It addresses deployment failures, the inability to provide an RDMA network for data transmission, and the failure of training applications to discover and effectively use an RDMA network, so as to achieve efficient data transmission.

Active Publication Date: 2021-02-09
CLUSTAR TECH LO LTD

AI Technical Summary

Problems solved by technology

[0010] To sum up, because the RDMA network information (such as the RDMA network IP) cannot be obtained in advance, the training application running on each computing node (that is, the container or container group used for training) cannot discover and efficiently use the RDMA network, even if the current container training cluster has one.
[0011] In addition, although some methods mentioned above can also implement training tasks d




Embodiment Construction

[0036] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0037] The following are some preferred embodiments of the present invention. Some of these preferred embodiments provide a method for resetting the data transmission network during the training process of the distributed training task. The method includes:

[0038] When distributed training tasks are scheduled to a training cluster with an RDMA network,

[0039] Before the distributed training starts, start the process on each computing node and obtain the...



Abstract

The invention provides a method for resetting a data transmission network during the training process of a distributed training task. The method comprises the following steps: after a distributed training task is scheduled to a training cluster with an RDMA (Remote Direct Memory Access) network, acquiring the RDMA network IP (Internet Protocol) address of each sub-task computing node, and determining a master node from among the sub-task computing nodes (the rest being slave nodes), wherein a collection module is used to collect the RDMA network information of the training cluster; and, after the collection is completed, updating the environment configuration parameters of each sub-task according to the cluster RDMA network information, so that communication during distributed training follows the updated environment configuration parameters and the data transmission network is thereby reset to the RDMA network.
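The steps in the abstract can be illustrated with a minimal in-process sketch. It is not the patented implementation: the `SubTask` structure, the rule for choosing the master (lowest name), the assumption that the master performs the collection, and the environment keys `CLUSTER_PEERS` and `SELF_ADDR` are all hypothetical, and the IP addresses are made-up sample values. A real cluster would discover each RDMA IP on the node itself and exchange it over the default network.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    """One sub-task of the training job (hypothetical structure)."""
    name: str
    default_ip: str  # address on the default (non-RDMA) container network
    rdma_ip: str     # address on the RDMA network, discovered on the node
    port: int


def choose_master(subtasks):
    # Assumption: the sub-task with the lowest name acts as master. The source
    # only states that one master is determined and the rest are slaves.
    return min(subtasks, key=lambda t: t.name)


def collect_rdma_info(subtasks):
    # Stand-in for the collection step: here every RDMA IP is simply read
    # from memory instead of being reported by each node.
    return {t.name: t.rdma_ip for t in subtasks}


def reset_env(subtasks, cluster_info):
    # Rebuild each sub-task's environment so peers are addressed by RDMA IP.
    # CLUSTER_PEERS and SELF_ADDR are illustrative key names only.
    ordered = sorted(subtasks, key=lambda t: t.name)
    peers = ",".join(f"{cluster_info[t.name]}:{t.port}" for t in ordered)
    return {
        t.name: {
            "CLUSTER_PEERS": peers,
            "SELF_ADDR": f"{cluster_info[t.name]}:{t.port}",
        }
        for t in subtasks
    }


# Sample data (made up for illustration).
tasks = [
    SubTask("task-1", "10.0.0.11", "192.168.100.11", 2222),
    SubTask("task-0", "10.0.0.10", "192.168.100.10", 2222),
    SubTask("task-2", "10.0.0.12", "192.168.100.12", 2222),
]
master = choose_master(tasks)
cluster_info = collect_rdma_info(tasks)
env = reset_env(tasks, cluster_info)
```

After this runs, every entry in `env` refers to peers only by their RDMA addresses, which is the sense in which the data transmission network is "reset": the training framework reads the updated configuration and opens its connections over the RDMA network instead of the default one.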

Description

Technical field

[0001] The present invention relates to the field of distributed machine learning and the field of container cloud technology; in particular, it relates to a method for resetting a data transmission network during the training process of a distributed training task.

Background technique

[0002] Machine learning, especially deep learning, has seen widespread success in AI-driven services. As models become more complex, their training becomes increasingly computationally expensive. To achieve efficient and timely training, it is necessary to explore the advantages of parallel computing in distributed systems. Industry leaders such as Microsoft, Facebook, and Google have begun to try to run distributed machine learning training tasks on production clusters consisting of hundreds or thousands of servers.

[0003] However, a physical cluster for distributed training with practical significance, from construction and deployment to operation and maintenance, is e...


Application Information

IPC(8): H04L12/24, H04L29/08, G06F9/455, G06N3/063
CPC: H04L41/0813, H04L67/10, G06F9/45558, G06F2009/45595, G06N3/063
Inventor: 张翔宇, 郭昊, 张曼妮, 孙军欢, 赵来松
Owner: CLUSTAR TECH LO LTD