Distributed machine learning system and method of adaptive RDMA network

A distributed machine learning technology, applied in machine learning, transmission systems, instruments, etc. It addresses problems such as the inability to break through communication bottlenecks, unsuitability for large-scale cluster deployment, and the inability of training applications to discover and effectively use RDMA networks. The effect is improved distributed training efficiency and relief of communication bottlenecks.

Pending Publication Date: 2021-02-09
CLUSTAR TECH LO LTD

AI Technical Summary

Problems solved by technology

[0010] Therefore, even when distributed training tasks are scheduled to a dual-network container training cluster with an RDMA network, once distributed training starts, the training application running on each computing node of the cluster (here, the container or container group used for training) still cannot discover and effectively use the RDMA network. It therefore cannot break through the communication bottleneck and achieve efficient training.
[0011] In addition, even if distributed training tasks are deployed in a multi-network physical cluster with an RDMA network, special environment configuration parameters (those that use the RDMA network IP as the network connection parameter) must be generated manually or with scripts. Manual configuration is inevitably error-prone and is not suitable for large-scale cluster deployments.
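The configuration step described in [0011] can be generated rather than typed by hand. The following is a minimal sketch only: the patent text does not name a framework or a parameter format, so the TF_CONFIG-style JSON layout, the per-node record shape, the function name `build_cluster_config`, and the port value are all assumptions made for illustration.

```python
import json

def build_cluster_config(nodes, network, task_index, port=2222):
    """Build a TF_CONFIG-style JSON string from each node's IP on `network`.

    `nodes` is a hypothetical list of dicts with one IP per network, e.g.
    {"default": "10.0.0.1", "rdma": "192.168.1.1"}. `port` is an assumed value.
    """
    workers = [f"{node[network]}:{port}" for node in nodes]
    config = {
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)

# Example inventory (made-up addresses) and the config for worker 0 on RDMA.
nodes = [
    {"default": "10.0.0.1", "rdma": "192.168.1.1"},
    {"default": "10.0.0.2", "rdma": "192.168.1.2"},
]
rdma_config = build_cluster_config(nodes, "rdma", task_index=0)
```

Deriving every address from one node inventory removes the per-node hand editing that the paragraph identifies as error-prone.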
[0012] It should be pointed out that most distributed machine learning frameworks rely heavily, during training, on the network environment set and provided at deployment scheduling time, and are not aware of the network type of the training cluster they are scheduled to. As a result, after a distributed training task is scheduled to the training cluster, it can only use the default network (normal configuration) or the RDMA network (custom configuration) during training. It cannot adaptively select, according to the actual network conditions of the scheduled training cluster, the network used for the training application's data transmission.
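Adaptive selection of the kind [0012] calls for might look like the sketch below. The patent text shown here does not describe a detection mechanism, so this is an assumption: it relies on Linux exposing RDMA-capable devices under /sys/class/infiniband, and the function names are hypothetical. A listed device does not guarantee that its link is up, so a real check would need to go further.

```python
import os

def detect_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the RDMA device names visible under `sysfs_root`, sorted.

    A missing or unreadable directory is treated as "no RDMA devices".
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        return []

def select_network(sysfs_root="/sys/class/infiniband"):
    """Choose "rdma" when at least one RDMA device is present, else "default"."""
    return "rdma" if detect_rdma_devices(sysfs_root) else "default"
```

Running `select_network()` inside each training container after scheduling, rather than fixing the choice at deployment time, is the behaviour the paragraph says current frameworks lack.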

Embodiment Construction

[0039] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by persons of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

[0040] The following are some preferred embodiments of the present invention. Some of them provide a distributed machine learning system that adapts to RDMA networks. The system includes a network environment adaptive unit and a distributed training execution unit. The network environment adaptive unit detects the network environment of the training cluster and adaptively selects the training cluster network for distribu...
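The two-unit structure named in [0040] could be modelled as in the sketch below. The source text is truncated before it describes the execution unit, so the execution unit's role here (using the selected network's address for training communication) is an assumption, as are the class and field names and the injected probe function.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class NetworkEnvironmentAdaptiveUnit:
    """Detects the cluster network and picks one for training communication."""
    rdma_available: Callable[[], bool]  # hypothetical probe supplied by the caller

    def select(self) -> str:
        return "rdma" if self.rdma_available() else "default"

@dataclass
class DistributedTrainingExecutionUnit:
    """Assumed role: communicates over whichever network was selected."""
    node_ips: Dict[str, str]  # hypothetical map: network name -> this node's IP

    def connection_address(self, network: str) -> str:
        return self.node_ips[network]

# Made-up addresses; the probe is stubbed to report that RDMA is present.
adaptive = NetworkEnvironmentAdaptiveUnit(rdma_available=lambda: True)
executor = DistributedTrainingExecutionUnit(
    {"default": "10.0.0.1", "rdma": "192.168.1.1"}
)
address = executor.connection_address(adaptive.select())
```

Keeping detection behind an injected probe lets the selection logic be exercised without RDMA hardware.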

Abstract

Disclosed are a distributed machine learning system and method for an adaptive RDMA network. After a distributed training task is scheduled to a training cluster, the training cluster's network environment is detected, and the cluster network used for the task's communication is selected adaptively according to the detection result. The best available network environment is thus selected for the distributed task wherever possible, so that the efficient RDMA network is fully used for communication during distributed training. This solves the communication bottleneck that affects distributed training task deployment in the prior art and improves distributed training efficiency.

Description

Technical field

[0001] The present invention relates to the field of distributed machine learning; in particular, it relates to a distributed machine learning system and method for an adaptive RDMA network.

Background

[0002] Machine learning, especially deep learning, has seen widespread success in AI-driven services. As models become more complex, their training becomes increasingly computationally expensive. To achieve efficient and timely training, it is necessary to exploit the advantages of parallel computing in distributed systems. Industry leaders such as Microsoft, Facebook, and Google have begun to run distributed machine learning training tasks on production clusters consisting of hundreds or thousands of servers.

[0003] However, building, deploying, operating, and maintaining a practical physical cluster for distributed training is highly specialized, complex, and even cumbersome. Applying container cloud t...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06N20/00, G06F13/28, H04L29/08
CPC: G06F13/28, G06N20/00, H04L67/10, H04L67/12
Inventor: 郭昊, 张曼妮, 张翔宇, 孙军欢, 赵来松
Owner: CLUSTAR TECH LO LTD