Distributed job coordination control method, device, computer device and storage medium

A distributed job coordination control technology, applied in the fields of program control design, computing, multi-program devices, etc. It addresses the problems of complex implementation logic, low resource utilization, and the inability to isolate jobs, thereby simplifying the implementation logic and improving resource utilization.

Active Publication Date: 2018-12-28
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] However, in the former method, the machine learning algorithm or deep learning framework must depend on MPI internally and implement MPI's control logic inside the algorithm or framework. Jobs therefore cannot be effectively isolated, which results in low resource utilization.
[0006] In the latter method, the Master must be implemented separately, as must the communication between the parameter server processes, the training processes and the Master, so the implementation logic is more complicated.

Embodiment Construction

[0034] A distributed machine learning job or a distributed deep learning job usually includes several parameter server processes and several training processes. Each training process needs to communicate with all parameter server processes: it downloads model parameters from the parameter server processes and, after training, writes the updated model parameters back to them.
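The download-train-update cycle described in [0034] can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the `ParameterServer` class, the `train_step` function, the sharding of parameters by name, and the toy update rule are hypothetical and do not come from the patent.

```python
from typing import Dict, List


class ParameterServer:
    """Hypothetical in-memory stand-in that holds one shard of the model parameters."""

    def __init__(self, shard: Dict[str, float]) -> None:
        self.shard = dict(shard)

    def pull(self) -> Dict[str, float]:
        # Return a copy so callers cannot mutate the shard directly.
        return dict(self.shard)

    def push(self, updates: Dict[str, float]) -> None:
        # Accept only the parameters that belong to this shard.
        for name, value in updates.items():
            if name in self.shard:
                self.shard[name] = value


def train_step(servers: List[ParameterServer], lr: float = 0.1) -> Dict[str, float]:
    # Download: the training process talks to every parameter server.
    params: Dict[str, float] = {}
    for server in servers:
        params.update(server.pull())
    # Toy "training" (assumed): one gradient step on sum(p**2) / 2, whose gradient is p.
    updated = {name: value - lr * value for name, value in params.items()}
    # Update: write the new values back to every parameter server.
    for server in servers:
        server.push(updated)
    return updated
```

Each server keeps only its own shard, which is why the training process has to contact all of them on both the download and the update.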

[0035] In the solution of the present invention, containers and Kubernetes can be used to coordinate and control the startup and running status of a job. Kubernetes does not distinguish between the Pods that host a job's containers, so it may schedule them simultaneously or in any order; a mechanism is therefore needed to coordinate and control the job.
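Because Pods may start in any order, one possible coordination mechanism is a polling barrier that blocks until enough Pods report the `Running` phase. The sketch below is only an illustration under assumptions: `wait_until_all_running` and its parameters are hypothetical names, and the `list_phases` callable is injected so that no particular Kubernetes client API is implied.

```python
import time
from typing import Callable, Dict


def wait_until_all_running(
    list_phases: Callable[[], Dict[str, str]],
    expected_count: int,
    poll_interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll Pod phases until `expected_count` Pods are Running, or give up at the timeout."""
    deadline = clock() + timeout
    while True:
        # list_phases is assumed to return a mapping of Pod name to phase string.
        phases = list_phases()
        running = sum(1 for phase in phases.values() if phase == "Running")
        if running >= expected_count:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in a real container, `list_phases` would query the cluster for the Pods that belong to the job.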

[0036] Kubernetes is an open source large-scale container cluster management system that provides resource scheduling, deployment, service discovery, and scaling mechanisms f...

Abstract

The invention discloses a distributed job coordination control method, device, computer device and storage medium. The method may comprise the following. A first job_coordinator process, located in the container in which a parameter server process resides, starts; performs spin synchronization until all parameter server Pods in the job are in the running state; assigns a unique number to the parameter server process in the container; starts the parameter server process in the container; and controls the return status of the job. A second job_coordinator process, located in the container in which a training process resides, starts; performs spin synchronization until all training process Pods in the job are in the running state; assigns a unique number to the training process in the container; starts the training process in the container; establishes links between the training process and all parameter server processes in the job; and controls the return status of the job. The scheme of the invention can improve resource utilization and simplify the implementation logic.
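The sequence of steps in the abstract can be sketched roughly as follows. This is not the patented implementation: `run_coordinator` and `assign_unique_number` are hypothetical names, numbering by sorted Pod name is an assumed scheme (the abstract does not say how numbers are assigned), and the readiness check and process launcher are injected as callables.

```python
import time
from typing import Callable, List, Sequence


def assign_unique_number(pod_names: Sequence[str], own_name: str) -> int:
    """Derive a unique number from the sorted position of this Pod's name (assumed scheme)."""
    return sorted(pod_names).index(own_name)


def run_coordinator(
    own_name: str,
    peer_names: Sequence[str],
    all_running: Callable[[], bool],
    start_process: Callable[[int, List[str]], int],
    ps_endpoints: Sequence[str] = (),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    # Step 1: spin synchronization until every peer Pod reports running.
    while not all_running():
        sleep(0.5)
    # Step 2: assign the process in this container a unique number.
    number = assign_unique_number(peer_names, own_name)
    # Step 3: start the process. A training process also receives every
    # parameter server endpoint so it can link to all of them; a parameter
    # server process receives an empty list.
    exit_code = start_process(number, list(ps_endpoints))
    # Step 4: the coordinator's return value stands in for the job's return status.
    return exit_code
```

The same function serves both roles here: a parameter server container would call it with its parameter server peers and no endpoints, while a training container would pass its training peers plus the parameter server endpoints.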

Description

【Technical field】

[0001] The invention relates to computer application technology, in particular to a distributed job coordination control method, device, computer device and storage medium.

【Background technique】

[0002] In distributed jobs such as distributed machine learning jobs and distributed deep learning jobs, the models used are getting larger and the amount of data is growing accordingly. Although this improves learning accuracy, it also increases training time; the most common remedy is to use a large-scale machine cluster for parallel training.

[0003] A distributed machine learning job or a distributed deep learning job usually includes several parameter server processes and several training processes. Each training process needs to communicate with all parameter server processes. The training process downloads model parameters from the parameter server processes and, after training, updates the final model parameters to the par...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/52
CPC: G06F9/526
Inventor: 夏燕明
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD