A resource management system and method for deep learning

A resource management system and method for deep learning, applied in neural network learning methods, resource allocation, and electrical digital data processing. It addresses problems such as the lack of life cycle management, the added burden on programmers, and training interruptions, so as to save training resources and cost, avoid resource contention and waste, and improve operational efficiency.

Pending Publication Date: 2019-03-22
咪付(广西)网络技术有限公司

AI Technical Summary

Problems solved by technology

[0003] However, deploying TensorFlow in practice also raises the following problems: (1) Resources cannot be isolated: during training, TensorFlow tasks may affect each other because they compete for computing resources. A GPU card consists of GPU computing units and video memory, so if multiple tasks share one GPU and the video memory is insufficient, training is interrupted or other unknown errors occur. (2) Lack of scheduling capabilities: users have to configure and manage the computing resources for each task manually, and these settings must be hard-coded. (3) Abnormal training interruption: when a fault in a parameter server (PS) or worker causes the task process to exit, manual intervention is required to resume training, because TensorFlow has no self-healing ability. (4) No life cycle management: there is no effective way to manage the execution of multiple tasks or to monitor their status. (5) Complex distributed deployment: each time AI developers release a training task, they must perform a distributed deployment, which adds to the burden on programmers. In addition to implementing the logic of the training task, they also have to worry about which machine resources are available and how to get the task running.
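Problem (1) can be made concrete with a small check. The following is a minimal sketch, not part of the patent: it assumes each task declares how much video memory it needs and flags every GPU whose combined requests exceed its capacity. The function name, task names, and memory figures are hypothetical.

```python
from collections import defaultdict


def find_oversubscribed_gpus(gpu_memory_mib, task_requests):
    """Return {gpu_id: (requested_mib, capacity_mib)} for each overcommitted GPU.

    gpu_memory_mib: {gpu_id: capacity in MiB}
    task_requests:  list of (task_name, gpu_id, requested MiB)
    """
    requested = defaultdict(int)
    for _task_name, gpu_id, mib in task_requests:
        requested[gpu_id] += mib
    return {
        gpu_id: (total, gpu_memory_mib[gpu_id])
        for gpu_id, total in requested.items()
        if total > gpu_memory_mib[gpu_id]
    }


# Hypothetical example: two tasks share GPU 0, one task has GPU 1 to itself.
gpus = {0: 16384, 1: 16384}
tasks = [("train-a", 0, 10240), ("train-b", 0, 8192), ("train-c", 1, 12288)]
overloaded = find_oversubscribed_gpus(gpus, tasks)
print(overloaded)  # {0: (18432, 16384)}
```

Without a scheduler that performs this kind of accounting before placement, the overcommitment on GPU 0 would only surface at run time, as an interrupted training task.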
[0004] With the continuous development of AI business, the requirements on the training time of TensorFlow-based neural network models keep rising, and stand-alone mode struggles to cope with the training of large-scale deep neural network models. Although distributed TensorFlow cluster training solves the problem of insufficient single-machine computing power, it does not provide cluster management functions such as task scheduling, monitoring, and restart after failure, which creates many difficulties for AI developers in large-scale automated model training.
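The missing "restart after failure" function can be sketched as a supervisor loop around a training task that saves its progress periodically. This is an illustrative stand-in only: the patent excerpt does not describe its restart mechanism, and the JSON checkpoint file, the function names, and the simulated fault are all assumptions.

```python
import json
import os
import tempfile


def load_step(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path, step):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, save_every, fail_at=None):
    """Stand-in for a training loop: resumes from the checkpoint, saves periodically."""
    step = load_step(path)
    while step < total_steps:
        step += 1
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated worker failure at step {step}")
        if step % save_every == 0 or step == total_steps:
            save_step(path, step)
    return step


def run_with_restart(path, total_steps, save_every, max_restarts, fail_at=None):
    """Restart the task after a failure, at most max_restarts times.

    Returns (final_step, number_of_restarts).
    """
    restarts = 0
    while True:
        try:
            return train(path, total_steps, save_every, fail_at), restarts
        except RuntimeError:
            if restarts == max_restarts:
                raise
            restarts += 1
            fail_at = None  # the simulated fault happens only once


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "checkpoint.json")
    result = run_with_restart(ckpt, total_steps=10, save_every=3,
                              max_restarts=2, fail_at=8)
print(result)  # (10, 1)
```

In this run the task fails at step 8, the supervisor restarts it, and training resumes from the checkpoint written at step 6 rather than from the beginning.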


Examples


Embodiment Construction

[0033] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0034] The purpose of the present invention is to provide a resource management system and method for deep learning that realizes unified scheduling and management of the resources of TensorFlow-based deep learning training tasks, monitors the training process, supports automatic restart after interruption, reduces the workload of AI developers, and improves task training efficiency. The principle and implementation of the resource management system and method for deep learning of the present invention will be described in detail below, so that those skilled in the art can understand the technical cont...


Abstract

The invention discloses a resource management system and method for deep learning. The system comprises a Kubernetes cluster, a MySQL storage module, and distributed storage. The Kubernetes cluster comprises a training management platform and a TensorFlow project platform; the training management platform comprises a registrar and a controller, and the TensorFlow project platform is composed of a cluster generator. The method comprises the following steps: S100: creating a Docker image containing a TensorFlow training script and pushing the image to an image registry; S200: registering a TensorFlow project and configuring the project information; S300: creating a TensorFlow project platform to generate a TensorFlow cluster; S400: starting task training and saving the training files at regular intervals; S500: finishing the task training and generating a result model. The system and method can realize unified scheduling and management of the resources of TensorFlow-based deep learning training tasks, monitor the training process, support automatic restart after interruption, lighten the workload of AI developers, and improve task training efficiency.
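Step S300 can be illustrated with a sketch of what a cluster generator might emit. The abstract does not say how the patent's cluster generator works, so everything below is an assumption: the function name, the project name, the image URL, port 2222, and the choice of one Pod per task. The sketch builds Kubernetes Pod manifests as plain dictionaries, gives each task a `TF_CONFIG` environment variable (the JSON format distributed TensorFlow reads), and requests GPUs through the `nvidia.com/gpu` limit, which requires the NVIDIA device plugin on the cluster. It also assumes that matching Services exist so that the generated host names resolve.

```python
import json


def generate_cluster(project, image, num_ps, num_workers,
                     gpus_per_worker=1, port=2222):
    """Build one Pod manifest (as a dict) per parameter server and per worker."""

    def host(role, index):
        # Assumed naming scheme; a Service with this name must exist for DNS to work.
        return f"{project}-{role}-{index}:{port}"

    cluster = {
        "ps": [host("ps", i) for i in range(num_ps)],
        "worker": [host("worker", i) for i in range(num_workers)],
    }

    pods = []
    for role, hosts in cluster.items():
        for index in range(len(hosts)):
            tf_config = {"cluster": cluster,
                         "task": {"type": role, "index": index}}
            container = {
                "name": "tensorflow",
                "image": image,
                "ports": [{"containerPort": port}],
                "env": [{"name": "TF_CONFIG", "value": json.dumps(tf_config)}],
            }
            if role == "worker" and gpus_per_worker > 0:
                # A GPU limit gives the container whole GPUs of its own.
                container["resources"] = {
                    "limits": {"nvidia.com/gpu": gpus_per_worker}
                }
            pods.append({
                "apiVersion": "v1",
                "kind": "Pod",
                "metadata": {
                    "name": f"{project}-{role}-{index}",
                    "labels": {"project": project, "role": role},
                },
                # Let the kubelet restart a container that exits with an error.
                "spec": {"restartPolicy": "OnFailure", "containers": [container]},
            })
    return pods


# Hypothetical project name and image URL.
pods = generate_cluster("mnist", "registry.example.com/mnist-train:latest",
                        num_ps=1, num_workers=2)
print([p["metadata"]["name"] for p in pods])
# ['mnist-ps-0', 'mnist-worker-0', 'mnist-worker-1']
```

Generating the manifests from a project name and replica counts removes the hard-coded host lists from the training script, and the `OnFailure` restart policy gives a basic form of restart after a task exits abnormally.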

Description

Technical field

[0001] The present invention relates to the technical field of deep learning, in particular to a resource management system and method for deep learning.

Background

[0002] TensorFlow, as the latest and most widely used open source framework for deep learning, has received extensive attention in recent years. It is portable, efficient, scalable, flexible, and fast to compile, and it runs on different kinds of computers, from devices as small as a smartphone to systems as large as a computer cluster. TensorFlow has been widely adopted by users ranging from individuals to enterprises and from start-ups to large companies, and has shown great application value in industry, commerce, and scientific research, so it has become the most popular deep learning framework today.

[0003] However, deploying TensorFlow in practice also raises the following problems: (1) Resources cannot be isolated: during train...


Application Information

IPC(8): G06F9/50, G06N3/08
CPC: G06F9/5005, G06N3/08
Inventor: 代豪, 蒙孝宗, 李清
Owner: 咪付(广西)网络技术有限公司