Method, system and device for operating and maintaining large-scale cluster and storage medium

The large-scale cluster operation and maintenance technology, applied in transmission systems, character and pattern recognition, instruments, etc., addresses the inability of existing systems to adapt to large-scale clusters. It achieves fast and simple resource positioning and removes the dependence on deployment nodes.

Pending Publication Date: 2022-04-08
SUZHOU LANGCHAO INTELLIGENT TECH CO LTD

AI Technical Summary

Problems solved by technology

The operation and maintenance system used by the deep learning platform in th




Embodiment Construction

[0026] In order to make the object, technical solution and advantages of the present invention clearer, the embodiments of the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0027] It should be noted that all expressions using "first" and "second" in the embodiments of the present invention serve to distinguish two entities that share a name but are not the same, or two parameters that are not the same. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; this will not be explained again in the subsequent embodiments.

[0028] In the first aspect of the embodiments of the present invention, an embodiment of a method for operating and maintaining a large-scale cluster is proposed. Figure 1 shows a schematic diagram of an embodiment of the method for opera...


Abstract

The invention provides a method, a system and a device for operating and maintaining a large-scale cluster, and a storage medium. The method comprises the following steps: deploying a management module and a client module at a management node, and deploying the client module at a computing node; obtaining, by the management module, the monitoring values of all management nodes and computing nodes from the client modules, and comparing the monitoring values with the currently configured monitoring strategy thresholds; in response to an alarm condition being met, generating alarm data at the management node and matching a task processing strategy against the alarm data; and in response to a successful match, generating a fault processing task at the management node and delivering it to the corresponding client module when that client module actively pulls it. With the invention, the execution load is shifted from the management node to the computing nodes. The management node is responsible only for fault judgment and task state management and does not need to distribute or execute tasks, so the scheme can adapt to the operation and maintenance of ultra-large-scale clusters.
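The flow in the abstract can be illustrated with a minimal in-process sketch. It is not the patented implementation: the class names, the metric name `gpu_temp_c`, the action `throttle_gpu`, the "value above an upper limit" alarm rule, and the metric-to-action strategy table are all hypothetical choices made for illustration. The client also reports its values directly here, whereas a real deployment would exchange them over the network.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Alarm:
    node: str
    metric: str
    value: float


@dataclass
class Task:
    node: str
    action: str
    state: str = "pending"


class ManagementModule:
    """Judges faults and tracks task state; it never pushes tasks to nodes."""

    def __init__(self, thresholds: Dict[str, float], strategies: Dict[str, str]):
        self.thresholds = thresholds      # metric -> upper limit (assumed policy form)
        self.strategies = strategies      # metric -> remediation action (assumed)
        self.alarms: List[Alarm] = []
        self.queues = defaultdict(deque)  # node -> tasks waiting to be pulled

    def ingest(self, node: str, metrics: Dict[str, float]) -> None:
        """Compare monitoring values with the configured thresholds."""
        for metric, value in metrics.items():
            limit = self.thresholds.get(metric)
            if limit is None or value <= limit:
                continue  # no threshold configured, or alarm condition not met
            self.alarms.append(Alarm(node, metric, value))
            action = self.strategies.get(metric)  # task processing strategy match
            if action is not None:
                self.queues[node].append(Task(node, action))

    def pull(self, node: str) -> Optional[Task]:
        """Hand the next pending task to the node that asked for it."""
        queue = self.queues[node]
        if not queue:
            return None
        task = queue.popleft()
        task.state = "dispatched"
        return task


class ClientModule:
    """Runs on every node: reports metrics and pulls its own tasks."""

    def __init__(self, node: str, manager: ManagementModule):
        self.node = node
        self.manager = manager

    def report(self, metrics: Dict[str, float]) -> None:
        self.manager.ingest(self.node, metrics)

    def poll_once(self) -> Optional[Task]:
        task = self.manager.pull(self.node)
        if task is None:
            return None
        # A real client would execute task.action locally at this point.
        task.state = "done"
        return task
```

In this sketch a client calls `report(...)` with its current values and then `poll_once()` on its own schedule. Because the management module only queues tasks and updates their state, the cost of executing them stays on the nodes that pull them, which is the load-sharing property the abstract describes.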

Description

Technical field

[0001] The present invention relates to the field of large-scale clusters, and more specifically to a method, system, device and storage medium for operating and maintaining large-scale clusters.

Background technique

[0002] At present, artificial intelligence technology is developing rapidly, and various industries are quickly undergoing intelligent transformation. With deep learning as a representative artificial intelligence technology, various fields have produced a large number of constantly changing and rapidly growing demands for deep learning training. A larger training scale and a larger data set significantly improve the effect of deep learning training. However, large-scale clusters undoubtedly pose greater challenges to the capabilities of existing operation and maintenance systems.

[0003] The systems in the industry can generally realize the monitoring and alarm functions of small-scale clusters, but cannot support automatic fault hand...


Application Information

IPC(8): H04L41/0631, H04L41/0654, H04L67/30, G06K9/62
Inventor: 荆荣讯
Owner: SUZHOU LANGCHAO INTELLIGENT TECH CO LTD