GPU cluster service management system and method

A GPU cluster service management technology, applied in the fields of electrical digital data processing, resource allocation, and program control design. It addresses the reduced processing efficiency of GPU clusters and their failure to operate normally, so as to achieve efficient resource utilization, ensure normal operation, and improve processing efficiency.

Active Publication Date: 2020-08-18
北京中科云脑智能技术有限公司 +1

AI Technical Summary

Problems solved by technology

The existing management system mistakes the resources of a crashed or frozen node for idle resources and re-allocates them. This causes repeated allocation of the same resources, so the GPU cluster cannot operate normally and its processing efficiency drops sharply.



Embodiment Construction

[0046] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0047] The invention provides a GPU cluster service management system, and the management system is based on Kubernetes technology.

[0048] Specifically, Kubernetes is a container orchestration engine open-sourced by Google that supports automated deployment, large-scale scaling, and management of containerized applications. W...


Abstract

The invention belongs to the field of computer management, and particularly relates to a GPU cluster service management system and method. The GPU cluster service management system comprises: a resource monitoring module, which monitors GPU cluster resources, generates cluster resource data, and sends that data; a resource allocation module, which acquires task information and the cluster resource data and allocates task resources accordingly; a checking module, which obtains the cluster resource data sent by the resource monitoring module, checks the state of the GPU cluster resources against that data, generates a checking result, and sends it; and an isolation module, which acquires the checking result and isolates abnormal resources accordingly. With this system and method, all resource states in the GPU cluster can be monitored in real time, so that resources are used efficiently. Abnormal resources are detected and isolated automatically, which keeps the GPU cluster operating normally and improves its processing efficiency.
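The four modules named in the abstract can be sketched as one management cycle. This is a minimal illustration only, not the patented implementation: the `GpuStatus` schema, the class and method names, the `manage_once` helper, and the use of a plain GPU count as the "task information" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

GpuKey = Tuple[str, int]  # (node name, GPU index); hypothetical identifier


@dataclass
class GpuStatus:
    """One GPU's entry in the cluster resource data (assumed schema)."""
    node: str
    gpu_id: int
    busy: bool      # True if a task currently holds this GPU
    healthy: bool   # False if the node or device stopped responding


class ResourceMonitor:
    """Resource monitoring module: produces cluster resource data."""

    def __init__(self, probe: Callable[[], List[GpuStatus]]):
        self._probe = probe  # stand-in for a real metrics source

    def snapshot(self) -> List[GpuStatus]:
        return list(self._probe())


class Checker:
    """Checking module: derives a checking result from the resource data."""

    def check(self, data: List[GpuStatus]) -> List[GpuKey]:
        return [(g.node, g.gpu_id) for g in data if not g.healthy]


class Isolator:
    """Isolation module: remembers which resources are fenced off."""

    def __init__(self) -> None:
        self.isolated: Set[GpuKey] = set()

    def isolate(self, abnormal: List[GpuKey]) -> None:
        self.isolated.update(abnormal)


class Allocator:
    """Resource allocation module: picks idle, non-isolated GPUs."""

    def allocate(self, gpus_needed: int, data: List[GpuStatus],
                 isolated: Set[GpuKey]) -> Optional[List[GpuKey]]:
        free = [(g.node, g.gpu_id) for g in data
                if not g.busy and (g.node, g.gpu_id) not in isolated]
        # Return None rather than a partial allocation.
        return free[:gpus_needed] if len(free) >= gpus_needed else None


def manage_once(monitor: ResourceMonitor, checker: Checker,
                isolator: Isolator, allocator: Allocator,
                gpus_needed: int) -> Optional[List[GpuKey]]:
    """Run one monitor -> check -> isolate -> allocate cycle."""
    data = monitor.snapshot()
    isolator.isolate(checker.check(data))
    return allocator.allocate(gpus_needed, data, isolator.isolated)
```

In this sketch the allocator never sees an isolated GPU as a candidate, even if that GPU reports itself as idle, which is the property the abstract attributes to the checking and isolation modules.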

Description

Technical field

[0001] The invention belongs to the field of computer management, and in particular relates to a GPU cluster service management system and method.

Background

[0002] A GPU cluster is a computer cluster in which each node is equipped with a graphics processing unit (GPU), which gives the cluster high computing speed. GPU clusters can use hardware from two major independent hardware vendors (AMD and NVIDIA).

[0003] Systems for managing GPU clusters already exist and are used to monitor and allocate cluster resources. However, the existing management system cannot monitor the status of the cluster in real time, and cannot automatically identify and handle GPU cluster failures. For example, when a node in the GPU cluster crashes or freezes, some resources in that node are idle at that moment, so the management system mistakenly concludes that those resources have completed their work and are free. The management system will...
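The failure mode described in the background, where a crashed node looks idle and is handed out again, can be illustrated with a small comparison. This is a hedged sketch: the source does not say how a crash is detected, so the heartbeat timestamps, the `HEARTBEAT_TIMEOUT_S` threshold, and both function names are assumptions introduced for the example.

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical staleness threshold, in seconds


def naive_free_gpus(utilization: Dict[str, float]) -> List[str]:
    """Flawed policy: any GPU showing zero utilization counts as free."""
    return sorted(gpu for gpu, util in utilization.items() if util == 0.0)


def checked_free_gpus(utilization: Dict[str, float],
                      last_heartbeat: Dict[str, float],
                      now: float) -> List[str]:
    """Treat a GPU as free only if it is idle AND its node reported recently.

    A GPU on a crashed or frozen node stops sending heartbeats, so it is
    excluded here even though its utilization reads as zero.
    """
    free = []
    for gpu, util in utilization.items():
        stale = now - last_heartbeat[gpu] > HEARTBEAT_TIMEOUT_S
        if util == 0.0 and not stale:
            free.append(gpu)
    return sorted(free)


if __name__ == "__main__":
    utilization = {"node1/gpu0": 0.0, "node2/gpu0": 0.0}
    # node2 last reported 95 seconds ago and is presumed crashed.
    last_heartbeat = {"node1/gpu0": 100.0, "node2/gpu0": 10.0}
    print(naive_free_gpus(utilization))
    print(checked_free_gpus(utilization, last_heartbeat, now=105.0))
```

Under the naive policy both GPUs are offered to new tasks, including the one on the unresponsive node; the checked policy offers only the GPU whose node is still reporting.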


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06F9/455, G06F11/30
CPC: G06F9/5027, G06F9/45558, G06F11/3006, G06F2009/45587, Y02D10/00
Inventor: 孟家祥, 常峰, 查甘望, 谷家磊, 刘海峰
Owner: 北京中科云脑智能技术有限公司