Distributed machine learning system acceleration method based on network reconfiguration

A network acceleration and machine learning technology, applied in neural learning methods, biological neural network models, instruments, etc. It addresses schedulers that do not fully consider the characteristics of machine learning task loads, which leads to problems such as long-tail delay. The effects are to ensure efficient operation, guarantee fair distribution, and avoid long-tail delay.

Pending Publication Date: 2020-10-13

HANGZHOU DIANZI UNIV

0 Cites, 4 Cited by



Problems solved by technology

[0009] The schedulers used in current distributed machine learning systems were designed for big data processing tasks. They therefore do not fully consider the characteristics of machine learning task loads when scheduling, which often causes problems such as long-tail delays in model training.


Embodiment Construction

[0030] The present invention is further described below with reference to the accompanying drawings; please refer to Figure 1 and Figure 3. Figure 1 shows the architecture of the method proposed by the present invention for improving the distributed training speed of machine learning models based on network reconfiguration. In it, 1 is the model database; 2, 3, and 4 are the scheduling policy manager, the scheduler, and the state memory, which together constitute the resource coordinator; 5, 6, and 7 are the wireless router on the top of the rack, the working machines inside the rack, and the switches.

[0031] The important components of the system structure of the present invention will be described in detail below.

[0032] (1) Model database

[0033] The model database is used to store the machine learning model to be trained submitted by the user, and to store the relevant parameters of the model to be trained. The resource coordinator will actively pull the model to be train...


Abstract

The invention relates to a distributed machine learning system acceleration method based on network reconfiguration. The method comprises the following steps: step 1, a scheduler obtains a certain number of to-be-trained models from a database; step 2, each training task is divided into a plurality of sub-tasks according to its position preference, and the sub-tasks are deployed to each server; step 3, the scheduling policy manager continuously allocates GPU resources according to how far each task is from reaching time fairness; step 4, the size of the TCP buffer is dynamically adjusted according to the current network condition; step 5, the scheduling policy manager records the corresponding result in the database according to the load operation condition; and step 6, whether to continue scheduling is determined according to the state of the to-be-trained models in the database. With this time-fair and position-sensitive task scheduling strategy, problems such as long-tail delay in the training of machine learning models can be avoided, and the service quality of the cluster is improved.
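Steps 3 and 4 of the abstract can be illustrated with a minimal sketch. The patent text available here does not define the time-fairness metric or the buffer-sizing rule, so both are assumptions made for illustration: fairness is measured as the ratio of a task's estimated finish time under its current allocation to its finish time under an equal share, and the TCP buffer is sized from the bandwidth-delay product. The names `Task`, `fairness_gap`, `allocate`, and `tcp_buffer_bytes`, and all parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    work: float          # hypothetical: remaining work, in GPU-seconds
    fair_gpus: int       # hypothetical: GPUs under an equal split of the cluster
    waited: float = 0.0  # hypothetical: seconds the task has already spent queued
    gpus: int = 0        # GPUs granted so far

    def fairness_gap(self) -> float:
        """Assumed metric (not from the patent): estimated finish time with the
        current allocation divided by the finish time under the fair share.
        A larger value means the task is further from time fairness."""
        if self.gpus == 0:
            return float("inf")
        shared = self.waited + self.work / self.gpus
        ideal = self.work / self.fair_gpus
        return shared / ideal


def allocate(tasks, free_gpus):
    """Step 3 (sketch): grant GPUs one at a time to whichever task is
    currently furthest from time fairness under the assumed metric."""
    if not tasks:
        return {}
    for _ in range(free_gpus):
        neediest = max(tasks, key=Task.fairness_gap)
        neediest.gpus += 1
    return {t.name: t.gpus for t in tasks}


def tcp_buffer_bytes(bandwidth_bps, rtt_ms, floor=4096, ceiling=16 * 1024 * 1024):
    """Step 4 (sketch): size the TCP buffer to the bandwidth-delay product,
    clamped to an assumed [floor, ceiling] range in bytes."""
    bdp = bandwidth_bps * rtt_ms // 8000  # bits/s * ms -> bytes
    return max(floor, min(ceiling, bdp))
```

In a real deployment the computed size could be applied per connection with `socket.setsockopt` and the `SO_SNDBUF` / `SO_RCVBUF` options. Whether the patented method adjusts buffers per socket or system-wide is not stated in the text above.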

Description

Technical field

[0001] The invention relates to a method for realizing communication and load scheduling in a machine learning system, and in particular to a method for reducing model training time through well-optimized scheduling in a large-scale distributed machine learning system.

Background technology

[0002] With the continuous development of artificial intelligence technology, training on massive data and the emergence of large-scale models make single-machine model training increasingly unable to meet the performance requirements of artificial intelligence applications. Therefore, some scholars have proposed distributed machine learning techniques, such as data parallelism and model parallelism, to improve the training speed of the model. Data parallelism has been extensively researched and optimized for performance. However, as the scale of the model increases, for large models that cannot be accommodated in the memory of a single machine, model parallelism is the only way to ...


Application Information


IPC(8): G06F9/50, G06N3/04, G06N3/063, G06N3/08

CPC: G06F9/5072, G06F9/5027, G06N3/063, G06N3/08, G06N3/045

Inventor: 裘翼滔, 蒋从锋, 欧东阳, 闫龙川, 殷昱煜, 张纪林, 黄震, 赵子岩, 李妍

Owner: HANGZHOU DIANZI UNIV
