Task scheduling method in high-performance computing system, and electronic device

By transmitting status and identification information of graphics processing units to the scheduler, the system ensures tasks are assigned to idle units, addressing inefficiencies in high-performance computing systems and enhancing resource utilization.

WO2026121604A1 · PCT designated stage · Publication Date: 2026-06-11 · CLUNIX

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CLUNIX
Filing Date
2025-11-07
Publication Date
2026-06-11

AI Technical Summary

Technical Problem

Schedulers in high-performance computing systems fail to accurately identify idle graphics processing units, leading to inefficient task assignment and performance degradation when multiple instances of a graphics processing unit are used in parallel.

Method used

A master node in the system collects and transmits identification and status information of graphics processing units to a scheduler, ensuring tasks are assigned to idle units by overriding the scheduler's incorrect assignments.

Benefits of technology

This approach enhances task scheduling efficiency by preventing idle graphics processing units from being overlooked, thereby optimizing resource utilization and improving overall system performance.

Generated by Eureka AI based on patent content.

Smart Images

  • Figure KR2025018237_11062026_PF_FP_ABST

Abstract

The method of the present disclosure may comprise the steps of: transmitting a task request to a scheduler for scheduling a task such that a plurality of instances are processed in parallel in one graphics processing device; receiving identification information and state information of one or more first graphics processing devices included in each of a plurality of compute nodes from a plurality of compute nodes included in a high-performance computing system; identifying, by the scheduler, a second graphics processing device in an idle state on the basis of the state information, among the one or more first graphics processing devices; and transmitting, to the scheduler, identification information of the second graphics processing device and a command for allocating a task to the second graphics processing device.

Description

Task scheduling method and electronic device in a high-performance computing system

[0001] The technical concept of the present disclosure relates to a task scheduling method in a high-performance computing system and to an electronic device for providing the method.

[0002] High-Performance Computing (HPC) refers to computer systems and technologies designed to rapidly perform large-scale data processing and complex calculations. HPC processes tasks quickly by using supercomputers to execute numerous processors in parallel. It is used in various fields, including science, engineering, finance, and medicine. HPC can consist of computing nodes containing processors and memory that perform computational tasks, storage devices capable of rapidly storing and reading data, and network interfaces for exchanging data between nodes.

[0003] Job scheduling is a technique for effectively managing multiple computational jobs and coordinating their execution order in high-performance computing environments. It can increase processing efficiency by optimally distributing resources within clusters included in a high-performance computing system and setting job priorities.

[0004] The technical problem that the present invention aims to solve is to prevent a scheduler that schedules tasks for parallel processing of multiple instances in a single graphics processing unit from failing to assign tasks to an idle graphics processing unit.

[0005] The technical problem that the present invention aims to solve is to forcibly send a command to the scheduler to assign a task to an idle graphics processing unit, even when the scheduler does not assign a task to the idle graphics processing unit.

[0006] The technical problem that the present invention aims to solve is to assign a task to the idle instances of a specific graphics processing device when some of the multiple instances of the device are in a task processing state while the others are idle.

[0007] In one embodiment, a job scheduling method in a high-performance computing system, performed by a master node, may include: transmitting a job request to a scheduler that schedules jobs so that a plurality of instances of a single graphics processing unit are processed in parallel; receiving identification information and status information of one or more first graphics processing units included in each of the plurality of computing nodes included in the high-performance computing system from a plurality of computing nodes included in the high-performance computing system; identifying a second graphics processing unit that is idle among the one or more first graphics processing units based on the status information; and transmitting identification information of the second graphics processing unit and a command to the scheduler to assign a job to the second graphics processing unit.

[0008] In one embodiment, the scheduler acquires resource information of a plurality of clusters included in a high-performance computing system and schedules parallel processing tasks based on the resource information, and may include SLURM (Simple Linux Utility for Resource Management).

[0009] In one embodiment, a plurality of instances are executed independently and may be allocated resources of a single graphics processing unit based on commands.

[0010] In one embodiment, the scheduler may assign a task to a graphics processing unit corresponding to the identification information upon receiving a command containing identification information of a graphics processing unit to process the task.

[0011] In one embodiment, the step of receiving identification information and status information may include requesting identification information and status information of one or more first graphics processing devices from a plurality of computing nodes; and receiving identification information and status information from a plurality of computing nodes.

[0012] In one embodiment, the step of receiving identification information and status information may include requesting identification information and status information of one or more first graphics processing devices that are idle from a plurality of computing nodes; and receiving identification information of a second graphics processing device from a plurality of computing nodes.

[0013] In one embodiment, one or more first graphics processing devices in an idle state may include at least one second graphics processing device in an idle state or at least one of a plurality of instances of a third graphics processing device in which at least some instances are in an idle state.

[0014] In one embodiment, the step of transmitting a command to a scheduler may include the step of transmitting a command to the scheduler to assign a task to an instance that is idle among a plurality of instances of the third graphics processing device.
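By way of non-limiting illustration, the distinction between a fully idle graphics processing device and a device in which only some instances are idle can be sketched as follows. The `Gpu` class, the state strings "idle" and "busy", and the instance identifiers are assumptions made for this sketch and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    gpu_id: str
    instance_states: dict  # instance id -> "idle" or "busy" (hypothetical values)


def classify(gpu: Gpu) -> str:
    """Label a GPU as idle, partially idle, or busy from its instance states."""
    states = set(gpu.instance_states.values())
    if states == {"idle"}:
        return "idle"            # every instance is idle
    if "idle" in states:
        return "partially_idle"  # some instances busy, some idle
    return "busy"                # no idle instance (or no instances reported)


def idle_instances(gpus):
    """List (gpu_id, instance_id) pairs that could still accept a task."""
    return [
        (gpu.gpu_id, instance_id)
        for gpu in gpus
        for instance_id, state in gpu.instance_states.items()
        if state == "idle"
    ]


gpus = [
    Gpu("GPU A", {"i0": "busy", "i1": "idle"}),
    Gpu("GPU B", {"i0": "idle", "i1": "idle"}),
    Gpu("GPU C", {"i0": "busy"}),
]
print([classify(g) for g in gpus])
print(idle_instances(gpus))
```

In this sketch, a command to the scheduler would name one of the returned pairs, so that an idle instance of a partially used device is not overlooked.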

[0015] In one embodiment, the method further comprises the steps of: generating visualization information based on identification information and status information of one or more first graphics processing devices; and transmitting the visualization information to a terminal, wherein the visualization information may include at least one of whether a plurality of computation nodes included in a high-performance computing system are normal, whether they are in the process of processing a task, whether a central processing unit is in use, or whether a graphics processing unit is in use, based on the status information.
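As a minimal sketch of how such visualization information might be derived from status information, the following builds one summary record per computation node. The input dictionary layout and all key names are hypothetical, and deriving "processing_task" from processor usage is an assumption of this sketch, not something the disclosure specifies.

```python
def summarize_node(status: dict) -> dict:
    """Reduce one node's status information to the flags a terminal could display."""
    gpu_in_use = any(state == "busy" for state in status["gpu_states"].values())
    cpu_in_use = bool(status["cpu_busy"])
    return {
        "node": status["node_id"],
        "normal": bool(status["reachable"]),          # whether the node is normal
        "processing_task": cpu_in_use or gpu_in_use,  # assumption: any busy processor
        "cpu_in_use": cpu_in_use,
        "gpu_in_use": gpu_in_use,
    }


node_status = {
    "node_id": "node-1",
    "reachable": True,
    "cpu_busy": False,
    "gpu_states": {"GPU A": "busy", "GPU B": "idle"},
}
print(summarize_node(node_status))
```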

[0016] In one embodiment, the scheduler can assign a task to a second graphics processing unit based on instructions and process the task.

[0017] In another embodiment, a computer-readable recording medium may store a program for scheduling tasks in a high-performance computing system, wherein the program transmits a task request to a scheduler that schedules tasks so that a plurality of instances of a single graphics processing unit are processed in parallel, and receives, from a plurality of computing nodes included in the high-performance computing system, identification information and status information of one or more first graphics processing units included in each of the plurality of computing nodes; the scheduler identifies a second graphics processing unit that is idle among the one or more first graphics processing units based on the status information; and the program transmits, to the scheduler, identification information of the second graphics processing unit and a command to assign tasks to the second graphics processing unit.

[0018] In another embodiment, the electronic device includes a memory; and one or more processors, and the one or more processors transmit a task request to a scheduler that schedules tasks so that a plurality of instances of a single graphics processing unit are processed in parallel, receive identification information and status information of one or more first graphics processing units included in each of the plurality of computing nodes from a plurality of computing nodes included in a high-performance computing system, and the scheduler may identify a second graphics processing unit that is idle among the one or more first graphics processing units based on the status information, and transmit identification information of the second graphics processing unit and a command to assign tasks to the second graphics processing unit to the scheduler.

[0019] In one embodiment of the present disclosure, a scheduler that schedules tasks so that a plurality of instances are processed in parallel on a single graphics processing unit can be prevented from failing to assign tasks to an idle graphics processing unit.

[0020] In one embodiment of the present disclosure, even when the scheduler does not assign a task to an idle graphics processing unit, a command may be forcibly sent to the scheduler to assign a task to the idle graphics processing unit.

[0021] In one embodiment of the present disclosure, when some of a plurality of instances of a specific graphics processing device are in a task processing state while others are idle, a task may also be assigned to the idle instances of the graphics processing device.

[0022] FIG. 1 is a drawing for explaining a task scheduling method according to one embodiment of the present disclosure.

[0023] FIG. 2 is a diagram illustrating a task scheduling method in a cluster including a plurality of computation nodes according to one embodiment of the present disclosure.

[0024] FIG. 3 is a diagram illustrating a method for assigning a task to an idle instance according to one embodiment of the present disclosure.

[0025] FIG. 4 is a drawing illustrating a screen on which status information is displayed according to one embodiment of the present disclosure.

[0026] FIG. 5 is a flowchart illustrating a method for an electronic device to schedule tasks according to one embodiment of the present disclosure.

[0027] FIG. 6 is a drawing illustrating an electronic device according to various embodiments of the present disclosure.

[0028] Preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be noted that identical components in the drawings are represented by the same reference numerals and symbols as much as possible, even if they are shown in different drawings. In the following description of the present invention, detailed descriptions of related known functions or configurations will be omitted if it is determined that such detailed descriptions may unnecessarily obscure the essence of the invention.

[0029] Furthermore, when it is stated that a part "includes" a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.

[0030] The terms used herein are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form includes the plural form as appropriate unless specifically stated otherwise in the text. In this specification, terms such as "comprising," "providing," "making arrangements," or "having" do not exclude the presence or addition of one or more other components in addition to the components mentioned.

[0031] In this specification, terms such as "or", "at least one," etc., may indicate one of the words listed together or a combination of two or more. For example, "A or B" or "at least one of A and B" may include only one of A or B, or may include both A and B.

[0032] In this specification, descriptions following "e.g." do not limit the embodiments of the invention, because the information presented, such as cited characteristics, variables, or values, may not match exactly owing to variations including tolerances, measurement errors, limits of measurement accuracy, and other commonly known factors.

[0033] In this specification, terms such as 'first,' 'second,' etc., may be used to describe various components, but such components should not be limited by these terms. Furthermore, these terms should not be interpreted as limiting the order of each component, but may be used for the purpose of distinguishing one component from another. For example, 'first component' may be named 'second component,' and similarly, 'second component' may be named 'first component.'

[0034] Each block of the process flow diagrams attached to this specification and combinations of the flow diagrams may be executed by computer program instructions. Since these computer program instructions may be loaded into the processor of a general-purpose computer, a computer for special purposes, or other programmable data processing equipment, the instructions executed through the processor of the computer or other programmable data processing equipment create means for performing the functions described in the flow diagram block(s).

[0035] These computer program instructions may be stored in computer-available or computer-readable memory that can be directed toward a computer or other programmable data processing equipment to implement a function in a specific way, and the instructions stored in said computer-available or computer-readable memory may also produce a manufactured item containing instruction means that performs the function described in the flowchart block(s).

[0036] Since computer program instructions can be loaded onto a computer or other programmable data processing equipment, instructions that perform a series of operation steps on the computer or other programmable data processing equipment to create a process executed by the computer can also provide steps for executing the functions described in the flowchart block(s).

[0037] Additionally, each block may represent a module, segment, or part of code containing one or more executable instructions for executing a specified logical function(s). Furthermore, in some alternative execution examples, the functions mentioned in the blocks may occur out of order. For instance, two blocks described in succession may actually be executed substantially simultaneously, or the blocks may be executed in reverse order according to their corresponding functions.

[0038] The "electronic device" or "terminal" mentioned in this specification may be implemented as a computer or portable terminal capable of connecting to a server or other terminal via a network. Here, the computer includes, for example, a notebook, desktop, or laptop equipped with a web browser, and the portable terminal may include, for example, any type of handheld wireless communication device that ensures portability and mobility, such as a communication-based terminal like IMT (International Mobile Telecommunication), CDMA (Code Division Multiple Access), W-CDMA (W-Code Division Multiple Access), or LTE (Long Term Evolution), as well as a smartphone or tablet PC. Additionally, the "electronic device" or "terminal" mentioned in this specification may also include a processor, memory for storing and executing program data, permanent storage such as a disk drive, a communication port for communicating with an external device, and user interface devices such as a touch panel, a key, or a button.

[0039] In the present disclosure, methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on a processor. The computer-readable recording medium may include magnetic storage media (e.g., ROM (read-only memory), RAM (random-access memory), floppy disks, hard disks, etc.) and optical reading media (e.g., CD-ROM, DVD: Digital Versatile Disc). The computer-readable recording medium may be distributed and executed across networked computer systems.

[0040] An artificial intelligence model (or model) can be implemented as a neural network (or artificial neural network) and can operate based on statistical learning algorithms that mimic biological neurons in machine learning and cognitive science. A neural network can refer to a model in which artificial neurons (nodes), which form a network through synaptic connections, change the strength of synaptic connections through learning to possess problem-solving capabilities. A neural network can be composed of multiple neural network layers; for example, a neural network may include an input layer, a hidden layer, and an output layer. Each of the multiple neural network layers may include at least one node and at least one weight, and neural network operations can be performed through operations between the results of operations of the previous layer and the weights. At least one weight possessed by the multiple neural network layers may be optimized based on the learning results of the artificial intelligence model. For example, at least one weight may be updated so that the loss value or cost value obtained from the artificial intelligence model during the learning process is reduced or minimized. Neural networks can infer a result to be predicted from an arbitrary input.

[0041] The learning methods of artificial intelligence models can be classified according to the learning approach into supervised learning, where input and output data are provided as training data and the correct answer (output data) corresponding to the problem (input data) is predetermined; unsupervised learning, where only input data is provided without output data and the correct answer (output data) corresponding to the problem (input data) is not predetermined; and reinforcement learning, where a reward is granted whenever an action is taken from the current state and learning proceeds in a direction that maximizes this reward. Alternatively, they can be classified according to the architecture, which is the structure of the learning model.

[0042] In the embodiments of the present disclosure, the artificial intelligence model may use at least one of various artificial intelligence structures and algorithms, such as a Convolutional Neural Network (CNN) (e.g., GoogLeNet, AlexNet, VGG Network), Region with Convolutional Neural Network (R-CNN), Region Proposal Network (RPN), Recurrent Neural Network (RNN), Stacking-based Deep Neural Network (S-DNN), State-Space Dynamic Neural Network (S-SDNN), Deconvolution Network, Deep Belief Network (DBN), Restricted Boltzmann Machine (RBM), Fully Convolutional Network, Long Short-Term Memory (LSTM) Network, Classification Network, Generative Modeling, eXplainable AI, Continual AI, Representation Learning, and AI for Material Design; BERT, SP-BERT, MRC / QA, Text Analysis, Dialog System, GPT-3, and GPT-4 for Natural Language Processing; Visual Analytics, Visual Understanding, Video Synthesis, and ResNet for Vision Processing; and Anomaly Detection, Prediction, Time-Series Forecasting, Optimization, Recommendation, and Data Creation for Data Intelligence. The examples described above are merely examples of artificial intelligence structures and algorithms used according to the embodiments of the present disclosure and do not limit the artificial intelligence structures and algorithms used according to the embodiments of the present disclosure.

[0043] Unless otherwise defined, all terms used in this specification may be used in a meaning commonly understood by those skilled in the art to which the present invention pertains. Additionally, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined otherwise.

[0044] Hereinafter, various embodiments of the present invention are described with reference to the accompanying drawings.

[0045] FIG. 1 is a drawing for explaining a task scheduling method according to one embodiment of the present disclosure.

[0046] The high-performance computing system of the present disclosure may be configured to form a cluster of computing nodes to enable efficient execution of high-performance computations. The high-performance computing system may include a master node (or management server) that performs the role of a master and / or computing nodes. The master node (110) performs the role of a master server that manages the entire cluster in the high-performance computing system. The master node (110) continuously collects resource information and status information of the computing nodes (120) and manages them in an integrated manner. The master node (110) receives status information collected from each computing node via UDP, identifies resource information of the entire cluster, and efficiently distributes tasks. A computing node (120) may be a device that receives a task from the master node (110) and processes the task. Status information may refer to information related to the processing of tasks by the cluster or computing nodes. Status information may include tasks being processed by the cluster or computing nodes, the resources required to process the tasks, and the estimated time required. Resource information may include information related to computing resources such as the cluster, computing nodes, and graphics processing units. For example, resource information may include at least one of central processing unit information, accelerator information, memory information, network interface information, power information, temperature information of the computing node, software environment information installed on the computing node, or input / output device information.
Central processing unit information is resource information related to the central processing unit and may include the number of cores (Core count), clock speed, cache memory (Cache), architecture (e.g., various CPU designs such as x86-64 (Intel, AMD), ARM, etc.), or SIMD (Single Instruction Multiple Data), a technology that increases parallel processing efficiency. Accelerator information is resource information related to the accelerator and may include information regarding Graphics Processing Units (e.g., NVIDIA A100, H100, AMD Instinct MI250, etc.), Field Programmable Gate Arrays (FPGAs), Tensor Processing Units (TPUs), etc. Memory information may include Random Access Memory (RAM) information, Non-Uniform Memory Access (NUMA) information, High Bandwidth Memory (HBM) information, etc. Network interface information may include information regarding Network Interface Cards (NICs) (e.g., InfiniBand, Ethernet (10GbE, 100GbE), Omni-Path, etc.), network bandwidth, latency, etc. Power information may include information regarding the power consumption of each component, such as the central processing unit, accelerator, and memory. Temperature information may include information regarding heat generation of each component. Software environment information may include information regarding the operating system, parallel processing libraries (e.g., MPI), task schedulers (e.g., SLURM), drivers, accelerator support programs, etc.
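By way of non-limiting illustration, a status report that a computing node might send to the master node in a single UDP datagram can be sketched as follows. The JSON encoding, the `GpuStatusReport` class, and its field names are assumptions made for this sketch; the disclosure states only that status information is received via UDP.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GpuStatusReport:
    """One GPU's identification and status, as a computing node might report it."""
    cluster_id: str
    node_id: str
    gpu_id: str
    state: str  # hypothetical values: "idle" or "busy"


def encode_report(report: GpuStatusReport) -> bytes:
    """Serialize a report into a payload that fits in one UDP datagram."""
    return json.dumps(asdict(report), sort_keys=True).encode("utf-8")


def decode_report(payload: bytes) -> GpuStatusReport:
    """Rebuild the report on the master node from a received payload."""
    return GpuStatusReport(**json.loads(payload.decode("utf-8")))


# A computing node would hand the payload to socket.sendto(), and the master
# node would read it with socket.recvfrom(); only the round trip is shown here.
report = GpuStatusReport("cluster-1", "node-120", "GPU B", "idle")
payload = encode_report(report)
print(decode_report(payload))
```

Because UDP does not guarantee delivery, a design along these lines would typically have nodes resend their status periodically so that a lost datagram is corrected by the next report.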

[0047] The master node (110) and the computing node (120) can communicate with each other through a network. A network refers to a connection structure that enables information exchange between nodes such as devices, terminals, and servers. Examples of such networks include, but are not limited to, 3GPP (3rd Generation Partnership Project) networks, LTE (Long Term Evolution) networks, 5G networks, WiMAX (Worldwide Interoperability for Microwave Access) networks, the Internet, LAN (Local Area Network), Wireless LAN (Wireless Local Area Network), WAN (Wide Area Network), PAN (Personal Area Network), Wi-Fi networks, Bluetooth networks, satellite broadcasting networks, analog broadcasting networks, and DMB (Digital Multimedia Broadcasting) networks.

[0048] In one embodiment, the master node (110) may send a job request to a scheduler (111) that schedules jobs so that multiple instances are processed in parallel on a single graphics processing unit. The operations of the master node (110) described below may be understood as the operations of one or more processors included in the master node (110). The scheduler (111) may acquire resource information of multiple clusters included in a high-performance computing system and schedule parallel processing jobs based on the resource information. The scheduler (111) may include SLURM. SLURM is used to manage and execute jobs in a high-performance computing cluster. SLURM can track cluster resources such as CPUs, memory, and graphics processing units (GPUs) and efficiently monitor the usage status of reserved resources. SLURM can perform job scheduling, such as managing job queues and determining the execution priority of jobs. For example, job scheduling may include FIFO (First In, First Out) scheduling, which executes jobs in the order they are submitted; priority-based scheduling; or Fairshare scheduling, which allocates resources based on the user's resource consumption history. SLURM can support parallel processing libraries such as MPI and OpenMP. MPI is a standard library for message passing between compute nodes with different architectures.
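As a minimal sketch of the three ordering policies named above, the following sorts a small job queue under each policy. The `Job` class, the usage figures, and the tie-breaking rules are assumptions made for illustration; this is not the priority calculation that SLURM itself performs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    submit_order: int  # lower means submitted earlier
    priority: int      # higher means more urgent


def fifo(jobs):
    """Run jobs strictly in submission order."""
    return sorted(jobs, key=lambda j: j.submit_order)


def by_priority(jobs):
    """Run higher-priority jobs first; ties fall back to submission order."""
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_order))


def fairshare(jobs, past_usage):
    """Favor users who have consumed fewer resources so far."""
    return sorted(jobs, key=lambda j: (past_usage.get(j.user, 0.0), j.submit_order))


jobs = [Job("a", "alice", 0, 1), Job("b", "bob", 1, 5), Job("c", "carol", 2, 3)]
usage = {"alice": 120.0, "bob": 30.0, "carol": 0.0}  # hypothetical GPU-hours

print([j.name for j in fifo(jobs)])              # ['a', 'b', 'c']
print([j.name for j in by_priority(jobs)])       # ['b', 'c', 'a']
print([j.name for j in fairshare(jobs, usage)])  # ['c', 'b', 'a']
```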

[0049] In one embodiment, the master node (110) can enable multiple instances to be processed in parallel on a single graphics processing unit. For example, the master node (110) may include a Multi-Instance GPU (MIG) function. MIG is a function that allows a single graphics processing unit manufactured by NVIDIA to be divided into up to N independent instances. MIG provides various profiles, each of which may include information for dividing GPU resources at a specific ratio. MIG profiles may be included in state information. For example, the profile name "2g.10gb" denotes a GPU instance that is allocated two compute slices and 10 GB of memory of a single graphics processing unit.
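By way of non-limiting illustration, a profile name of this form can be split into its two numeric parts as follows. The sketch reads the name as "<compute slices>g.<memory in GB>gb", following NVIDIA's naming convention; the `MigProfile` type and `parse_mig_profile` function are hypothetical names, and profile names carrying additional suffixes are not handled.

```python
import re
from typing import NamedTuple


class MigProfile(NamedTuple):
    compute_slices: int  # number before "g"
    memory_gb: int       # number before "gb"


# Matches plain names such as "1g.5gb" or "2g.10gb" only.
_PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_mig_profile(name: str) -> MigProfile:
    """Split a plain MIG profile name into compute slices and memory size."""
    match = _PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile name: {name!r}")
    return MigProfile(int(match.group(1)), int(match.group(2)))


print(parse_mig_profile("2g.10gb"))
```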

[0050] However, when features that split a single graphics processing unit into multiple instances (e.g., MIG features) are enabled, difficulties may arise in assigning tasks to specific graphics processing units in high-performance computing systems. For instance, the scheduler (e.g., SLURM) may fail to accurately identify the status of the graphics processing units, preventing the smooth assignment of tasks. This can lead to performance degradation in high-performance computing systems where various tasks require the simultaneous use of multiple graphics processing units.

[0051] Accordingly, the present disclosure can solve the problem in which a scheduler fails to assign a task because it cannot determine the status of an idle graphics processing unit by having the master node transmit status information of the graphics processing unit or a command to assign a task to an idle graphics processing unit to the scheduler. The means for solving the aforementioned problem are described in detail below.

[0052] In one embodiment, the master node (110) may receive identification information and status information of one or more first graphics processing units included in each of the plurality of computing nodes from a plurality of computing nodes included in a high-performance computing system. The identification information may include cluster identification information, identification information of computing nodes included in the cluster, and identification information of graphics processing units included in the computing nodes. Through this, the master node (110) can determine whether the graphics processing unit included in a computing node of a cluster is performing a task or is in an idle state.

[0053] In one embodiment, the master node (110) can identify a second graphics processing unit that is idle among one or more first graphics processing units based on state information. The second graphics processing unit may refer to a graphics processing unit, among the first graphics processing units, that the computation node (120) reports as idle at a given point in time.
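As a minimal sketch of this identification step, the following filters reported statuses down to the idle devices. The `GpuStatus` class and the state strings are assumptions made for illustration; the GPU names follow the GPU A / GPU B example used in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuStatus:
    node_id: str
    gpu_id: str
    state: str  # hypothetical values: "idle" or "busy"


def find_idle_gpus(statuses):
    """Return (node_id, gpu_id) for every GPU whose reported state is idle."""
    return [(s.node_id, s.gpu_id) for s in statuses if s.state == "idle"]


statuses = [
    GpuStatus("node-120", "GPU A", "busy"),
    GpuStatus("node-120", "GPU B", "idle"),
]
print(find_idle_gpus(statuses))  # [('node-120', 'GPU B')]
```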

[0054] In one embodiment, the master node (110) may request identification information and status information of one or more first graphics processing units from a plurality of computing nodes. For example, the master node (110) may request status information of GPU A (140) and status information of GPU B (150) from the computing node (120). In another example, upon receiving a job request, the master node (110) may request status information of GPU A (140) and status information of GPU B (150) from the computing node (120). Through this, the master node (110) can identify graphics processing units that are idle at the time the job request is received and efficiently schedule the job.

[0055] In one embodiment, the master node (110) may receive identification information and status information of graphics processing devices from a plurality of computation nodes. For example, the master node (110) may receive identification information and status information of GPU A (140) and identification information and status information of GPU B (150).

[0056] In one embodiment, the GPU management module (112) may receive status information and identification information of a graphics processing unit from the database (113). The GPU management module (112) may be intended to prevent the scheduler from scheduling tasks without assigning tasks to an idle graphics processing unit. The database (113) may contain resource information and / or status information of a compute node included in a high-performance computing system. The database (113) may update the status information through communication with the compute node (120).

[0057] In one embodiment, the master node (110) may transmit identification information of the second graphics processing unit and a command to assign a task to the second graphics processing unit to the scheduler (111). For example, while transmitting the command, the master node (110) may transmit a request to the compute node (120) to ensure that no other task is assigned to the second graphics processing unit in order to process the task corresponding to the task request.

[0058] The GPU management module (112) can transmit received status information and identification information to the master node's scheduler (111). For example, based on status information that GPU B (150) is in an idle state, the GPU management module (112) can transmit identification information and status information of GPU B (150) to the scheduler (111). The GPU management module (112) can transmit a command to the master node's scheduler (111) to assign a task to the idle graphics processing unit. For example, based on status information that GPU B (150) is in an idle state, the GPU management module (112) can transmit a command to assign a task to GPU B (150).
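The disclosure does not state how the command to the scheduler is expressed. Purely as an assumed illustration for a SLURM-based scheduler, the following builds an `sbatch` invocation that directs a job to the node hosting the idle device, using the `--nodelist` and `--gres` options. These options select a node and a GPU count rather than one particular device, so pinning the job to the specific idle GPU is left open in this sketch. The function name, node name, and script name are hypothetical.

```python
import shlex


def build_assignment_command(node_name: str, script_path: str, gpu_count: int = 1):
    """Build an sbatch argument list that targets the node holding the idle GPU."""
    return [
        "sbatch",
        f"--nodelist={node_name}",   # restrict the job to the named node
        f"--gres=gpu:{gpu_count}",   # request this many GPUs on that node
        script_path,
    ]


command = build_assignment_command("node-120", "train.sh")
print(shlex.join(command))  # sbatch --nodelist=node-120 --gres=gpu:1 train.sh
```

Building the command as an argument list rather than a single string avoids shell quoting problems if it is later passed to a process launcher.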

[0059] The scheduler (111) can assign a task to the second graphics processing unit and process the task based on at least one of the identification information, the status information, or the command. For example, the scheduler (111) can schedule the task by assigning it to GPU B (150).

[0060] The scheduler (111) can transmit job scheduling information to the scheduler (121) of the computation node. Based on the job scheduling information, the scheduler (121) of the computation node can transmit a job execution command to one or more graphics processing units. For example, the scheduler (121) of the computation node can transmit a command to execute a job to GPU B (150).

[0061] Through this, the master node (110) can efficiently perform parallel processing for various tasks without missing any idle graphics processing units.

[0062] FIG. 2 is a diagram illustrating a task scheduling method in a cluster including a plurality of computation nodes according to one embodiment of the present disclosure.

[0063] For example, a user can send a deep learning-related task request to the master node (110) through the deep learning task module (250). The task may be of various kinds, such as a deep learning task, an image processing task, a time series data processing task, or a task in the medical or legal field, but the present disclosure is not limited thereto.

[0064] The GPU management module (112) can obtain status information upon receiving a job request. For example, the GPU management module (112) can receive status information from a database (113). As another example, the GPU management module (112) can receive status information by requesting it from a cluster (210) or a compute node (211, 212, 213, 214) included in the cluster (210). The GPU management module (112) can identify an idle graphics processing unit based on the status information. The GPU management module (112) can transmit the identification information of the idle graphics processing unit to the scheduler (111). The scheduler (111) can schedule a job based on the received identification information of the idle graphics processing unit, generate job scheduling information, and transmit it to the cluster (210). The cluster (210) can receive the job scheduling information and process the job accordingly.

[0065] For example, the GPU management module (112) may receive status information indicating that GPU A (310) of compute node 1 (211) and GPU B (320) of compute node 3 (213) are idle. The scheduler (111), however, may not have this information. Because of the MIG (Multi-Instance GPU) function, which allows multiple instances to be processed in parallel on a single graphics processing unit, the scheduler (111) may not accurately receive status information of the graphics processing unit (e.g., the idle state of an instance). As a result, the scheduler (111) may schedule tasks without assigning any to GPU A (310) of compute node 1 (211) or GPU B (320) of compute node 3 (213). When the GPU management module (112) informs the scheduler (111) that GPU A (310) of compute node 1 (211) and GPU B (320) of compute node 3 (213) are idle, the scheduler (111) can schedule tasks for the cluster (210) so that no graphics processing unit is left idle.
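
The mismatch between what the scheduler believes and what the compute nodes report can be illustrated with a small comparison. The sketch below is an assumption-laden illustration: the dictionary layout, the `find_overlooked_idle` helper, and the extra entry `GPU X` on compute node 2 are hypothetical and do not appear in the disclosure.

```python
def find_overlooked_idle(scheduler_view, reported_state):
    """Return GPUs reported idle that the scheduler does not treat as idle.

    Both arguments map a (node, gpu) pair to an assumed state string,
    "idle" or "busy".
    """
    return sorted(
        gpu
        for gpu, state in reported_state.items()
        if state == "idle" and scheduler_view.get(gpu) != "idle"
    )


# What the scheduler believes (stale, for example because of MIG).
scheduler_view = {
    ("compute node 1", "GPU A"): "busy",
    ("compute node 2", "GPU X"): "busy",  # hypothetical extra GPU
    ("compute node 3", "GPU B"): "busy",
}

# What the GPU management module received from the compute nodes.
reported_state = {
    ("compute node 1", "GPU A"): "idle",
    ("compute node 2", "GPU X"): "busy",
    ("compute node 3", "GPU B"): "idle",
}

overlooked = find_overlooked_idle(scheduler_view, reported_state)
```

Here `overlooked` holds the two units that the management module would report to the scheduler so that tasks can be placed on them.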

[0066] FIG. 3 is a diagram illustrating a method for assigning a task to an idle instance according to one embodiment of the present disclosure.

[0067] Some of the multiple instances of a single graphics processing unit may be processing tasks while others are idle. If a single graphics processing unit capable of processing multiple instances in parallel does not process tasks for some instances, task processing may be delayed while numerous tasks are waiting in a queue. Therefore, the speed of task processing can be increased by assigning tasks to idle instances for processing.

[0068] One or more first graphics processing devices in an idle state may include at least one second graphics processing device in an idle state or at least one of a plurality of instances of a third graphics processing device in which at least some instances are in an idle state.

[0069] In one embodiment, the master node (110) may send a command to the scheduler (111) to assign a task to an idle instance among a plurality of instances of the third graphics processing unit. In one embodiment, the master node (110) may send identification information of the idle instance to the scheduler (111). The identification information of the instance may include identification information of the cluster containing the instance, identification information of the task node included in the cluster, identification information of the graphics processing unit included in the task node, and / or identification information of the instance.
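
One way to picture the layered identification information in [0069] is as a record with one field per level. The following is a minimal sketch; the `InstanceId` class, its field names, the slash-separated path format, and the sample identifiers are assumptions made for illustration, not a format defined by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceId:
    """Hierarchical identifier: cluster, node, GPU, and optionally an instance."""
    cluster_id: str
    node_id: str
    gpu_id: str
    instance_id: Optional[str] = None  # None addresses the whole GPU

    def as_path(self) -> str:
        """Join the populated levels into a single string (assumed format)."""
        parts = [self.cluster_id, self.node_id, self.gpu_id]
        if self.instance_id is not None:
            parts.append(self.instance_id)
        return "/".join(parts)


# Hypothetical identifiers for an idle instance and for a whole idle GPU.
instance_d = InstanceId("cluster-210", "node-1", "gpu-c", "instance-d")
whole_gpu = InstanceId("cluster-210", "node-1", "gpu-a")
```

Making the instance level optional lets the same record address either a whole idle graphics processing unit or a single idle instance within one.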

[0070] For example, GPU C (330) can process multiple instances (331, 332, 333, 334, 335) in parallel. Among the multiple instances (331, 332, 333, 334, 335), instances A (331), B (332), and C (333) may be processing tasks, while instances D (334) and E (335) may be idle. The master node (110) can receive identification information and status information for each of instances D (334) and E (335). For example, the status information of an instance may include profile information of the instance (e.g., MIG profile information). The master node (110) can transmit (360) the identification information and status information of each of instances D (334) and E (335) to the scheduler (111). Alternatively, the master node (110) may send a command to the scheduler (111) to assign tasks to each of instances D (334) and E (335). The scheduler (111) may receive status information, identification information and / or commands and assign tasks to instances D (334) and E (335). This allows instances included in a single graphics processing unit to be operated efficiently without the problem of prolonged idle time.
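
The GPU C example can be sketched as a filter over per-instance status followed by a pairing of queued tasks with idle instances. The `InstanceStatus` record, the `assign_tasks` helper, the task names, and the placeholder profile label are hypothetical; real MIG profile names are deliberately not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceStatus:
    name: str     # instance label from the example, e.g. "D"
    busy: bool    # True while the instance is processing a task
    profile: str  # MIG profile label; the value used here is a placeholder


def idle_instances(instances):
    """Return the instances that are not processing a task."""
    return [inst for inst in instances if not inst.busy]


def assign_tasks(pending_tasks, instances):
    """Pair queued tasks with idle instances in order.

    Returns (assignments, remaining): a task-to-instance mapping and the
    tasks that stay queued because no idle instance was left for them.
    """
    idle = idle_instances(instances)
    assignments = dict(zip(pending_tasks, (inst.name for inst in idle)))
    remaining = pending_tasks[len(idle):]
    return assignments, remaining


# Instances A to C are busy; D and E are idle, as in the text.
gpu_c = [
    InstanceStatus("A", True, "placeholder-profile"),
    InstanceStatus("B", True, "placeholder-profile"),
    InstanceStatus("C", True, "placeholder-profile"),
    InstanceStatus("D", False, "placeholder-profile"),
    InstanceStatus("E", False, "placeholder-profile"),
]
assignments, remaining = assign_tasks(["task-1", "task-2", "task-3"], gpu_c)
```

With three queued tasks and two idle instances, two tasks are placed and the third remains in the queue until another instance becomes idle.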

[0071] In another example, GPU A (310) may be idle, and the master node (110) may transmit identification information of GPU A (310) to the scheduler (111) (350). The scheduler (111) may receive identification information of GPU A (310) and perform scheduling to assign work to GPU A (310) of a specific computing node.

[0072] FIG. 4 is a drawing illustrating a screen on which status information is displayed according to one embodiment of the present disclosure.

[0073] In one embodiment, the master node (110) may generate visualization information based on identification information and status information of one or more first graphics processing units. The visualization information may be used to display, on a screen, the status of a cluster, a compute node, or a graphics processing unit included in the high-performance computing system. Based on the status information, the visualization information may include, for the plurality of compute nodes included in the high-performance computing system, at least one of the normal status (410), the task processing status (420), the central processing unit usage (430), or the graphics processing unit usage (440). The normal status (410) may include the number of compute nodes operating normally and the number of compute nodes that have failed among the compute nodes included in the cluster. The task processing status (420) may include the number of compute nodes currently processing a task and the number of compute nodes in an idle state. The central processing unit usage (430) may include the number of central processing units currently processing a task and the number in an idle state. The graphics processing unit usage (440) may include the number of graphics processing units currently processing tasks and the number in an idle state. Through this, the user can see in real time how many resources are processing tasks and how many are idle.
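
The four counters shown on the screen of FIG. 4 can be derived from per-node status records by simple aggregation. The sketch below assumes a hypothetical record layout (`healthy`, `processing`, `cpus_total`, `cpus_busy`, `gpus_total`, `gpus_busy`) and assumes that resources on failed nodes are excluded from the idle counts; the disclosure specifies neither.

```python
def summarize(nodes):
    """Aggregate per-node status records into dashboard counters."""
    healthy = [n for n in nodes if n["healthy"]]
    processing = sum(1 for n in healthy if n["processing"])
    cpus_busy = sum(n["cpus_busy"] for n in healthy)
    gpus_busy = sum(n["gpus_busy"] for n in healthy)
    return {
        # Normal status (410): normal versus failed compute nodes.
        "nodes": {"normal": len(healthy), "failed": len(nodes) - len(healthy)},
        # Task processing status (420): nodes processing a task versus idle.
        "tasks": {"processing": processing, "idle": len(healthy) - processing},
        # CPU usage (430), counted on healthy nodes only (assumption).
        "cpu": {
            "in_use": cpus_busy,
            "idle": sum(n["cpus_total"] for n in healthy) - cpus_busy,
        },
        # GPU usage (440), counted on healthy nodes only (assumption).
        "gpu": {
            "in_use": gpus_busy,
            "idle": sum(n["gpus_total"] for n in healthy) - gpus_busy,
        },
    }


# Three hypothetical compute nodes: one working, one idle, one failed.
nodes = [
    {"healthy": True, "processing": True,
     "cpus_total": 2, "cpus_busy": 1, "gpus_total": 4, "gpus_busy": 3},
    {"healthy": True, "processing": False,
     "cpus_total": 2, "cpus_busy": 0, "gpus_total": 4, "gpus_busy": 0},
    {"healthy": False, "processing": False,
     "cpus_total": 2, "cpus_busy": 0, "gpus_total": 4, "gpus_busy": 0},
]
summary = summarize(nodes)
```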

[0074] The queue status (450) may include waiting work requests.

[0075] The task ranking (470) may display the tasks currently being processed, sorted by criteria such as processing duration, resource consumption, execution time, and waiting time. Through this, the user can easily identify which tasks consume many resources, wait a long time, or take a long time to process.
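
A ranking of this kind amounts to sorting the running tasks by a chosen criterion. In the sketch below, the field names `execution_time_s`, `waiting_time_s`, and `gpus`, the sample values, and the `rank_tasks` helper are all hypothetical.

```python
def rank_tasks(tasks, criterion, top_n=None):
    """Return task IDs sorted by the given criterion, largest value first."""
    ranked = sorted(tasks, key=lambda task: task[criterion], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return [task["task_id"] for task in ranked]


# Hypothetical tasks currently being processed.
tasks = [
    {"task_id": "task-1", "execution_time_s": 120, "waiting_time_s": 30, "gpus": 1},
    {"task_id": "task-2", "execution_time_s": 3600, "waiting_time_s": 5, "gpus": 4},
    {"task_id": "task-3", "execution_time_s": 900, "waiting_time_s": 600, "gpus": 2},
]

longest_running = rank_tasks(tasks, "execution_time_s")
longest_waiting = rank_tasks(tasks, "waiting_time_s")
heaviest = rank_tasks(tasks, "gpus", top_n=1)
```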

[0076] FIG. 5 is a flowchart illustrating a method for an electronic device to schedule tasks according to one embodiment of the present disclosure.

[0077] In one embodiment, the electronic device can send a job request to a scheduler that schedules tasks so that a plurality of instances are processed in parallel on a single graphics processing unit (S510).

[0078] In one embodiment, the electronic device may receive identification information and status information of one or more first graphics processing devices included in each of the plurality of computation nodes from a plurality of computation nodes included in a high-performance computing system (S520).

[0079] In one embodiment, the electronic device can identify a second graphics processing unit that is idle based on state information among one or more first graphics processing units (S530).

[0080] In one embodiment, the electronic device may transmit identification information of the second graphics processing device and a command to assign a task to the second graphics processing device to the scheduler (S540). The operation of transmitting to the scheduler may include an operation to redefine the GPU information assigned by the scheduler.
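
Steps S510 to S540 can be strung together in a short sketch. The `Scheduler` class below is a stand-in written for this example and is not the interface of SLURM or of any real scheduler; the `schedule` function, the state strings, and the first-idle selection policy are likewise assumptions. Following [0079], the idle unit is identified here by the electronic device.

```python
class Scheduler:
    """Minimal stand-in for the scheduler (hypothetical, not a real API)."""

    def __init__(self):
        self.queue = []        # tasks waiting for placement
        self.assignments = {}  # task_id -> gpu_id

    def submit(self, task_id):
        """Accept a job request (S510)."""
        self.queue.append(task_id)

    def assign(self, task_id, gpu_id):
        """Apply an assignment command, replacing any earlier choice (S540)."""
        self.assignments[task_id] = gpu_id
        if task_id in self.queue:
            self.queue.remove(task_id)


def schedule(scheduler, task_id, reports):
    """Run S510 to S540 for one task; return the chosen GPU or None.

    `reports` maps a GPU identifier to an assumed state string and stands
    for the information received from the compute nodes (S520).
    """
    scheduler.submit(task_id)                                          # S510
    idle = [gpu for gpu, state in reports.items() if state == "idle"]  # S530
    if not idle:
        return None  # nothing idle; the task stays queued
    scheduler.assign(task_id, idle[0])                                 # S540
    return idle[0]


scheduler = Scheduler()
chosen = schedule(scheduler, "task-1", {"GPU A": "busy", "GPU B": "idle"})
```

Because `assign` overwrites any existing entry for the task, it also illustrates the idea of redefining GPU information that the scheduler had assigned earlier.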

[0081] FIG. 6 is a drawing illustrating an electronic device according to various embodiments of the present disclosure.

[0082] An electronic device (600) according to one embodiment may be a server or a user terminal (e.g., a mobile device, a desktop, a laptop, a personal computer, etc.). Referring to FIG. 6, an electronic device (600) according to one embodiment may include a user interface (610), a processor (630), a display (650), and a memory (670). The user interface (610), the processor (630), the display (650), and the memory (670) may be connected to each other via a communication bus (605).

[0083] The user interface (610) encompasses any means that enables interaction between a person and a machine, allowing a user to operate and control a system, software, an application, a website, and the like. For example, the user interface (610) may include a graphical user interface, a text-based interface, a voice user interface, or a natural user interface (e.g., gestures, touch, etc.).

[0084] The display (650) can display information generated by the processor (630).

[0085] The memory (670) can store generated information. In addition, the memory (670) can store various information generated during the processing of the processor (630) described above. In addition, the memory (670) can store various data and programs. The memory (670) may include volatile memory or non-volatile memory. The memory (670) may store various data by being equipped with a large-capacity storage medium such as a hard disk.

[0086] Additionally, the processor (630) may perform at least one of the methods described above with reference to FIGS. 1 to 5, or an algorithm corresponding to at least one of those methods. The processor (630) may be a data processing device implemented in hardware, having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. The processor (630) may be composed of, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or an NPU (Neural Network Processing Unit). For example, the electronic device implemented in hardware may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an ASIC (Application-Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array).

[0087] The processor (630) can execute a program and control an electronic device. The program code executed by the processor (630) can be stored in memory.

[0088] Meanwhile, the embodiments disclosed in this specification may be implemented in the form of a recording medium that stores instructions executable by a computer. The instructions may be stored in the form of program code and, when executed by a processor, may generate a program module to perform the operations of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium. A computer-readable recording medium may include all types of recording media that store instructions decipherable by a computer. Examples include ROM, RAM, magnetic tape, magnetic disk, flash memory, optical data storage devices, etc.

[0089] The above descriptions are specific embodiments for carrying out the present disclosure. The present disclosure includes not only the embodiments described above but also embodiments that can be obtained from them by simple or easy modification. Furthermore, the present disclosure includes technologies that can be readily modified and implemented using the embodiments described above. Accordingly, the scope of the present disclosure should not be limited to the embodiments described above, but should be defined by the claims set forth below as well as equivalents to the claims of the present disclosure.

Claims

1. A job scheduling method in a high-performance computing system, performed by a master node, the method comprising: transmitting a job request to a scheduler that schedules jobs such that a plurality of instances are processed in parallel on a single graphics processing unit; receiving, from a plurality of computation nodes included in the high-performance computing system, identification information and status information of one or more first graphics processing units included in each of the plurality of computation nodes; identifying, by the scheduler, a second graphics processing unit in an idle state among the one or more first graphics processing units based on the status information; and transmitting, to the scheduler, identification information of the second graphics processing unit and a command for assigning a task to the second graphics processing unit.

2. The method of claim 1, wherein the scheduler comprises SLURM (Simple Linux Utility for Resource Management), which obtains resource information of a plurality of clusters included in the high-performance computing system and schedules parallel processing tasks based on the resource information.

3. The method of claim 1, wherein the plurality of instances are executed independently of one another and are allocated resources of the graphics processing unit based on a command.

4. The method of claim 1, wherein the scheduler, upon receiving a command including identification information of a graphics processing unit that is to process a task, assigns the task to the graphics processing unit corresponding to the identification information.

5. The method of claim 1, wherein receiving the identification information and the status information comprises: requesting, from the plurality of computation nodes, the identification information and the status information of the one or more first graphics processing units; and receiving the identification information and the status information from the plurality of computation nodes.

6. The method of claim 1, wherein receiving the identification information and the status information comprises: requesting, from the plurality of computation nodes, the identification information and the status information of the one or more first graphics processing units that are in an idle state; and receiving the identification information of the second graphics processing unit from the plurality of computation nodes.

7. The method of claim 6, wherein the one or more first graphics processing units in the idle state comprise at least one of a second graphics processing unit in an idle state or at least one of a plurality of instances of a third graphics processing unit, the at least one instance being in an idle state.

8. The method of claim 7, wherein transmitting the command to the scheduler comprises transmitting, to the scheduler, a command to assign a task to an idle instance among the plurality of instances of the third graphics processing unit.

9. The method of claim 1, further comprising: generating visualization information based on the identification information and the status information of the one or more first graphics processing units; and transmitting the visualization information to a terminal, wherein the visualization information includes, based on the status information, at least one of whether the plurality of computation nodes included in the high-performance computing system are normal, whether they are processing a task, whether a central processing unit is in use, or whether a graphics processing unit is in use.

10. The method of claim 1, wherein the scheduler assigns the task to the second graphics processing unit and processes the task based on the command.

11. A computer-readable recording medium storing a program for scheduling tasks in a high-performance computing system, wherein the program, when executed, performs: transmitting a job request to a scheduler that schedules tasks such that a plurality of instances are processed in parallel on a single graphics processing unit; receiving, from a plurality of computation nodes included in the high-performance computing system, identification information and status information of one or more first graphics processing units included in each of the plurality of computation nodes; identifying, by the scheduler, a second graphics processing unit in an idle state among the one or more first graphics processing units based on the status information; and transmitting, to the scheduler, identification information of the second graphics processing unit and a command to assign a task to the second graphics processing unit.

12. An electronic device comprising: a memory; and one or more processors, wherein the one or more processors are configured to perform: transmitting a job request to a scheduler that schedules tasks such that a plurality of instances are processed in parallel on a single graphics processing unit; receiving, from a plurality of computation nodes included in a high-performance computing system, identification information and status information of one or more first graphics processing units included in each of the plurality of computation nodes; identifying, by the scheduler, a second graphics processing unit in an idle state among the one or more first graphics processing units based on the status information; and transmitting, to the scheduler, identification information of the second graphics processing unit and a command to assign a task to the second graphics processing unit.