Distributed machine learning system acceleration method based on network reconfiguration

A network acceleration and machine learning technology, applied to neural learning methods, biological neural network models, instruments, etc. It addresses the problem that existing schedulers do not fully consider the characteristics of machine learning task loads, which leads to long-tail delay and other problems. Its effects are to ensure efficient operation, guarantee fair distribution, and avoid long-tail delay.

Pending Publication Date: 2020-10-13
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

[0009] Since the schedulers used in the current distributed machine learning system are designed for big data processing tasks, these schedulers do not fu


Examples


Embodiment Construction

[0030] The present invention is further described below with reference to the accompanying drawings; please refer to Figures 1 and 3. Figure 1 shows the architecture of the proposed method for improving the distributed training speed of machine learning models based on network reconfiguration. In it, 1 is the model database; 2, 3, and 4 are the scheduling policy manager, the scheduler, and the state memory, which together constitute the resource coordinator; 5, 6, and 7 are the top-of-rack wireless router, the worker machines inside the rack, and the switches.

[0031] The important components of the system structure of the present invention will be described in detail below.

[0032] (1) Model database

[0033] The model database is used to store the machine learning model to be trained submitted by the user, and to store the relevant parameters of the model to be trained. The resource coordinator will actively pull the model to be train...


Abstract

The invention relates to a distributed machine learning system acceleration method based on network reconfiguration. The method comprises the following steps. Step 1: a scheduler obtains a certain number of to-be-trained models from a database. Step 2: each training task is divided into a plurality of sub-tasks according to its location preference, and the sub-tasks are deployed to the servers. Step 3: the scheduling policy manager continuously allocates GPU resources according to how far each task is from reaching time fairness. Step 4: the size of the TCP buffer is dynamically adjusted according to the current network condition. Step 5: the scheduling policy manager records the corresponding result in the database according to the load operating condition. Step 6: whether to continue scheduling is determined by the state of the to-be-trained models in the database. With this time-fair and location-sensitive task scheduling strategy, problems such as long-tail delay in the training of machine learning models can be avoided, and the service quality of the cluster is improved.
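Steps 3 and 4 of the abstract can be illustrated with a minimal Python sketch. It is not the patented algorithm: the fairness metric (`fairness_gap`), the round-robin grant order, the `Task` fields, and the bandwidth-delay-product rule for the TCP buffer are all assumptions chosen for illustration, since the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    ideal_time: float      # hypothetical: estimated run time under a fair share (s)
    elapsed: float = 0.0   # time since the task was submitted (s)
    progress: float = 0.0  # fraction of training completed, in [0, 1]

    def fairness_gap(self) -> float:
        """Assumed metric: expected progress under a time-fair schedule
        minus actual progress. Larger means further from fairness."""
        expected = min(1.0, self.elapsed / self.ideal_time)
        return expected - self.progress


def allocate_gpus(tasks, free_gpus):
    """Grant free GPUs round-robin, starting with the task furthest from fairness."""
    if not tasks:
        return {}
    ranked = sorted(tasks, key=lambda t: t.fairness_gap(), reverse=True)
    grants = {t.name: 0 for t in tasks}
    for i in range(free_gpus):
        grants[ranked[i % len(ranked)].name] += 1
    return grants


def tcp_buffer_bytes(bandwidth_bps, rtt_s, lo=4096, hi=16 * 1024 * 1024):
    """Assumed rule: size the buffer to the bandwidth-delay product,
    clamped to [lo, hi] bytes."""
    bdp = round(bandwidth_bps / 8 * rtt_s)
    return max(lo, min(hi, bdp))


tasks = [
    Task("A", ideal_time=100, elapsed=50, progress=0.1),  # lags the most
    Task("B", ideal_time=100, elapsed=50, progress=0.5),  # on schedule
    Task("C", ideal_time=100, elapsed=20, progress=0.1),  # lags slightly
]
grants = allocate_gpus(tasks, free_gpus=4)
buffer_size = tcp_buffer_bytes(10e9, 0.0005)  # 10 Gbit/s link, 0.5 ms RTT
```

Re-running `allocate_gpus` each scheduling round with updated `elapsed` and `progress` values would correspond to the "continuous" allocation the abstract describes; in a real deployment the computed buffer size would be applied through socket options or kernel settings, which this sketch leaves out.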

Description

Technical field

[0001] The invention relates to a method for communication and load scheduling in a machine learning system, in particular a method for reducing model training time through well-optimized scheduling in a large-scale distributed machine learning system.

Background

[0002] With the continuous development of artificial intelligence technology, training on massive data and the emergence of large-scale models make single-machine model training increasingly unable to meet the performance requirements of artificial intelligence applications. Therefore, some scholars have proposed distributed machine learning techniques, such as data parallelism and model parallelism, to improve the training speed of the model. Data parallelism has been extensively researched and optimized for performance. However, as the scale of the model increases, for large models that cannot be accommodated in the memory of a single machine, model parallelism is the only way to ...


Application Information

IPC(8): G06F9/50, G06N3/04, G06N3/063, G06N3/08
CPC: G06F9/5072, G06F9/5027, G06N3/063, G06N3/08, G06N3/045
Inventor: 裘翼滔, 蒋从锋, 欧东阳, 闫龙川, 殷昱煜, 张纪林, 黄震, 赵子岩, 李妍
Owner HANGZHOU DIANZI UNIV