Layered domain-based large model inference task scheduling method and system

By using a hierarchical and domain-based scheduling method, GPU computing unit resources are matched in real time, which solves the problem of limited video memory and computing in GPU resource scheduling, improves the success rate and resource utilization of large model inference tasks, and reduces response latency.

CN122309040A · Pending · Publication Date: 2026-06-30 · CHINA ELECTRONICS CYBERSPACE RESEARCH INSTITUTE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA ELECTRONICS CYBERSPACE RESEARCH INSTITUTE CO LTD
Filing Date
2024-12-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing GPU resource scheduling does not account for the irregular size of large model inference tasks. As a result, tasks fail when their memory demand exceeds the available GPU memory, and an overloaded central node causes task scheduling failures, so the service quality requirements of concurrent inference tasks are not met.

Method used

A hierarchical and domain-based large model inference task scheduling method is adopted. The scheduling level nodes receive tasks in real time and obtain the remaining resources of GPU computing units. Based on task requirements and resource matching principles, tasks are allocated to appropriate GPU computing units. Considering the dual constraints of video memory and computing resources, appropriate computing units are selected first to improve resource utilization.

Benefits of technology

It improves the scheduling success rate of large-scale parallel large model inference tasks, reduces response latency, meets the service quality requirements of concurrent tasks, and avoids task failures caused by unreasonable resource allocation.



Abstract

This invention provides a method and system for scheduling large model inference tasks based on hierarchical and domain-based scheduling. The method includes: each sub-scheduling level node within a scheduling domain receives, in real time, large model inference tasks allocated and forwarded by the main scheduling level node within the same scheduling domain based on a preset scheduling strategy, and obtains the memory and computing resource requirements of the tasks; in real time, it acquires the remaining memory and computing resources of each GPU unit in the corresponding computing subdomain connected to the sub-scheduling level node, and determines all candidate GPU units matching the task resources based on the task's memory and computing resource requirements and the remaining memory and computing resources of each GPU unit in the corresponding computing subdomain; it determines the task type based on the task's memory and computing resource requirements, and determines the target GPU unit corresponding to the task from all candidate GPU units based on the correspondence between the task type and the GPU; and it schedules the task to the corresponding target GPU unit so that the target GPU unit can execute the computation of the corresponding task, thereby achieving efficient parallel scheduling.

Description

Technical Field

[0001] This invention relates to the field of GPU scheduling and computing technology, and in particular to a method and system for scheduling large model inference tasks based on hierarchical and domain-based approaches.

Background Technology

[0002] Because GPUs excel at handling parallel computations and matrix operations on large models, the core computations used for large model inference computations primarily rely on GPU Tensor Cores. These are dedicated execution units designed specifically for performing tensor or matrix operations, which are the core computational functions employed in deep learning. Consequently, GPU-based model inference computations have gained widespread support from mainstream large model frameworks and have become the mainstay of large model computing power in mainstream servers.

[0003] In typical business scenarios, online inference services for large models mostly operate in OLTP (Online Transaction Processing) mode, meaning computational tasks need to respond in real time to meet the latency requirements of business applications (such as search and recommendation). In these scenarios, the inference latency requirements for large models are on the order of seconds or even milliseconds. GPU computation is often limited by memory I/O speed or by computational capacity, either of which can prevent GPU tasks from executing smoothly. Therefore, both factors must be fully considered when scheduling GPU computing resources in order to ensure resource utilization and task success rate.

[0004] Existing inference frameworks fail to consider the irregularity of large-scale model inference tasks—that is, different inference tasks have varying and unpredictable requirements for GPU memory and computing power. Directly allocating inference tasks to GPU computing clusters may result in tasks requiring more GPU memory than currently available, leading to task failures. Furthermore, in practical applications, to accommodate scenarios where multiple users submit inference tasks in batches, a system framework capable of handling GPU computing cluster scheduling is urgently needed to meet the quality of service requirements of concurrent inference tasks.

[0005] Currently, GPU resource scheduling is at a relatively low level of automation, with most computing tasks being scheduled through a central node in many existing frameworks. In large-scale batch task scheduling, if all resource retrieval and allocation are handled by the central node, it will undoubtedly face overload, leading to task scheduling failures.

Summary of the Invention

[0006] In view of this, embodiments of the present invention provide a method and system for scheduling large model inference tasks based on hierarchical and domain-based approaches, in order to eliminate or improve one or more defects existing in the prior art.

[0007] One aspect of the present invention provides a method for scheduling large-scale model inference tasks based on hierarchical and domain-based approaches, the method comprising the following steps:

[0008] Each sub-scheduling level node within the scheduling domain receives the large model inference task allocated and forwarded by the main scheduling level node within the scheduling domain in real time based on a preset scheduling strategy, and obtains the memory and computing resource requirements of the large model inference task.

[0009] The remaining amount of video memory and computing resources of each GPU computing unit in the computing subdomain connected to the sub-scheduling level node is obtained in real time. Based on the video memory and computing resource requirements of the large model inference task and the remaining amount of video memory and computing resources of each GPU computing unit in the corresponding computing subdomain, all candidate GPU computing units that match the resources of the large model inference task are determined.

[0010] Based on the memory and computing resource requirements of the large model inference task, the task type of the large model inference task is determined to be either a memory-limited task or a computing-limited task. Based on the correspondence between task type and GPU, the target GPU computing unit corresponding to the large model inference task is determined from all candidate GPU computing units.

[0011] The large model inference task is scheduled to the corresponding target GPU computing unit so that the target GPU computing unit can perform the computation of the corresponding large model inference task.

[0012] In some embodiments of the present invention, all candidate GPU computing units matching the resources of the large model inference task are determined based on the memory and computing resource requirements of the large model inference task, and the remaining memory and computing resources of each GPU computing unit in the corresponding computing subdomain, including:

[0013] If the remaining video memory and computing resources of each GPU computing unit in the corresponding computing subdomain, which are obtained in real time, cannot simultaneously meet the video memory and computing resource requirements of the large model inference task, then wait for the GPU computing unit to release resources until a GPU computing unit can simultaneously meet the video memory and computing resource requirements of the large model inference task.

[0014] If the remaining video memory and computing resources of the GPU computing unit in the corresponding computing subdomain, which are obtained in real time, can simultaneously meet the video memory and computing resource requirements of the large model inference task, then the GPU computing unit is determined as a candidate GPU computing unit that matches the resources of the large model inference task, so as to obtain all candidate GPU computing units that match the resources of the large model inference task.

[0015] In some embodiments of the present invention, determining the target GPU computing unit corresponding to the large model inference task from all candidate GPU computing units based on the correspondence between task type and GPU includes:

[0016] With first priority, a candidate GPU computing unit that is not executing any large model inference task of any task type is selected from all the candidate GPU computing units as the target GPU computing unit.

[0017] With second priority, based on the task type, a candidate GPU computing unit that is executing large model inference tasks of a different task type is selected from all the candidate GPU computing units as the target GPU computing unit.
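The two priorities above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `GpuUnit`, `TaskType`, and `select_target` are hypothetical, "a different task type" is read here as "no running task shares the new task's type", and returning `None` when neither priority matches is an assumption, since that case is not specified in this passage.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskType(Enum):
    MEMORY_BOUND = "memory_bound"
    COMPUTE_BOUND = "compute_bound"


@dataclass
class GpuUnit:
    name: str
    # Types of the inference tasks currently running on this unit (hypothetical field).
    running_types: List[TaskType] = field(default_factory=list)


def select_target(candidates: List[GpuUnit], task_type: TaskType) -> Optional[GpuUnit]:
    """Pick a target among candidates that already satisfy the resource check."""
    # First priority: a unit that is not running any inference task.
    for gpu in candidates:
        if not gpu.running_types:
            return gpu
    # Second priority: a unit whose running tasks are all of a different type
    # (assumed interpretation of the source wording).
    for gpu in candidates:
        if task_type not in gpu.running_types:
            return gpu
    # Assumption: no match is reported to the caller rather than forced.
    return None
```

Ties within a priority level are broken by list order in this sketch; the source does not state a tie-breaking rule.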

[0018] In some embodiments of the present invention, the preset scheduling strategy includes:

[0019] Based on the memory and computing resource requirements of the large model inference task received in real time, and on the remaining memory and computing resources, acquired in real time, of the computing subdomain corresponding to each sub-scheduling level node, it is determined whether the remaining memory and computing resources of each computing subdomain can simultaneously meet the memory and computing resource requirements of the large model inference task.

[0020] All computational subdomains that can simultaneously meet the memory and computational resource requirements of the large model inference task are identified as the candidate computational subdomain set.

[0021] The computing subdomains in the candidate computing subdomain set are sorted by remaining memory resources and by remaining computing resources to obtain the first sorting result and the second sorting result, respectively.

[0022] Based on the first and second sorting results, a comprehensive score is obtained for each computational subdomain. The large model inference task received in real time is then assigned and forwarded to the sub-scheduling level node connected to the computational subdomain with the highest comprehensive score.

[0023] In some embodiments of the present invention, each scheduling level node includes a task queue, a video memory manager, a computing unit manager, and a task scheduler.

[0024] In some embodiments of the present invention, after the step of scheduling the large model inference task to the corresponding target GPU computing unit so that the target GPU computing unit performs the computation of the corresponding large model inference task, the method further includes:

[0025] Each sub-scheduling level node in the scheduling domain receives task completion information from the target GPU computing unit in the corresponding computing subdomain, and dequeues the corresponding large model inference task from the corresponding task queue based on the received task completion information.

[0026] The received task completion information is sent to the main scheduling level node so that the main scheduling level node dequeues the corresponding large model inference task from its task queue based on the received task completion information.
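The completion flow in the two steps above can be sketched as follows. The class `SchedulingNode`, its `parent` link, and the use of a dictionary keyed by task ID as the task queue are assumptions made for illustration; the source does not specify how the completion information is encoded or transported.

```python
from typing import Any, Dict, Optional


class SchedulingNode:
    """Hypothetical scheduling level node that keeps tasks queued until completion."""

    def __init__(self, name: str, parent: Optional["SchedulingNode"] = None) -> None:
        self.name = name
        self.parent = parent  # the main scheduling level node, if this is a sub-node
        self.task_queue: Dict[str, Any] = {}  # insertion-ordered: task_id -> task

    def enqueue(self, task_id: str, task: Any) -> None:
        self.task_queue[task_id] = task

    def on_task_complete(self, task_id: str) -> None:
        # Dequeue the finished task locally ...
        self.task_queue.pop(task_id, None)
        # ... then forward the completion information one level up.
        if self.parent is not None:
            self.parent.on_task_complete(task_id)
```

In this sketch a GPU computing unit would call `on_task_complete` on its sub-scheduling level node, and the main scheduling level node's queue is updated as a consequence.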

[0027] In some embodiments of the present invention, the task types of the large model inference task include memory-constrained tasks and computationally-constrained tasks.

[0028] In some embodiments of the present invention, the task type of the large model inference task is determined to be a memory-constrained task or a computationally-constrained task based on the memory and computational resource requirements of the large model inference task, including:

[0029] If the large model inference task requires more video memory resources than computing resources, then the large model inference task is a video memory-constrained task.

[0030] If the computational resource requirements of the large model inference task are higher than the video memory resource requirements, then the large model inference task is a computationally constrained task.
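A minimal Python sketch of this classification rule is given below. Memory and computing requirements are measured in different units, so the sketch assumes both have already been normalized to comparable shares (for example, fractions of a reference GPU's capacity); that normalization, the names `TaskType` and `classify_task`, and the handling of an exact tie are assumptions not stated in the source.

```python
from enum import Enum


class TaskType(Enum):
    MEMORY_BOUND = "memory_bound"
    COMPUTE_BOUND = "compute_bound"


def classify_task(memory_share: float, compute_share: float) -> TaskType:
    """Classify a task from normalized resource requirements in the range [0, 1]."""
    if memory_share > compute_share:
        return TaskType.MEMORY_BOUND
    if compute_share > memory_share:
        return TaskType.COMPUTE_BOUND
    # Assumption: the source does not cover equal requirements; this sketch
    # arbitrarily treats a tie as memory-bound.
    return TaskType.MEMORY_BOUND
```

For example, a task that needs 60% of the reference memory but only 20% of the reference compute would be classified as memory-constrained.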

[0031] Another aspect of the present invention provides a large model inference task scheduling system based on hierarchical domain division. The system includes: a computer device, the computer device including a processor and a memory, the memory storing computer instructions, the processor executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the system implements the steps of the aforementioned method.

[0032] Another aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned method.

[0033] Another aspect of the present invention provides a computer program product including computer instructions that, when executed by a processor, implement the steps of the aforementioned method.

[0034] This invention presents a hierarchical, domain-based large model inference task scheduling method and system. For the batch scheduling of large-scale large model inference tasks on GPU computing clusters, it provides a hierarchical, domain-based scheduling architecture. Under this architecture, large model inference tasks are allocated and scheduled layer by layer, from the scheduling domain down to the corresponding GPU computing units in the computing domain, where they are executed. Both memory I/O speed limitations and computational limitations are considered during this process, which improves the overall resource utilization of the GPU computing cluster and the concurrent processing capability of the architecture and system for large-scale large model inference tasks. This in turn improves the success rate of scheduling and executing large-scale parallel large model inference tasks, meets the service quality requirements of concurrent large model inference tasks, reduces their response latency, and avoids task scheduling failures caused by overload and unreasonable resource allocation.

[0035] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.

[0036] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.

Description of the Drawings

[0037] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.

[0038] Figure 1 is a schematic diagram of a large model inference task scheduling architecture based on hierarchical and domain-based methods in one embodiment of the present invention;

[0039] Figure 2 is a schematic diagram of the structure of each scheduling level node within the scheduling domain in one embodiment of the present invention;

[0040] Figure 3 is a flowchart illustrating a large model inference task scheduling method based on hierarchical and domain-based approaches in one embodiment of the present invention.

Detailed Implementation

[0041] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.

[0042] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and/or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.

[0043] It should be emphasized that the term "including/comprising" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.

[0044] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.

[0045] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.

[0046] To efficiently schedule large-scale model inference tasks to GPU computing clusters in batches and improve the success rate of task scheduling, this invention provides a hierarchical and domain-based method and system for scheduling large-scale model inference tasks. Based on a hierarchical and domain-based large-scale model inference task scheduling architecture, and implemented on an architecture capable of supporting GPU computing cluster scheduling, this method effectively avoids task scheduling failures caused by overload and unreasonable resource allocation. It significantly improves the success rate of scheduling and executing large-scale parallel large-scale model inference tasks and meets the quality of service requirements for concurrent large-scale model inference tasks.

[0047] Figure 1 is a schematic diagram of a hierarchical and domain-based large model inference task scheduling architecture according to an embodiment of the present invention. As shown in Figure 1, this architecture supports parallel batch scheduling of large-scale large model inference tasks and efficiently manages the batch execution of these tasks on GPU computing clusters. It mainly consists of two parts: a scheduling domain and a computing domain. The scheduling domain includes a main scheduling level node and multiple sub-scheduling level nodes. Each sub-scheduling level node is a downstream node one level below the main scheduling level node. The sub-scheduling level nodes are located in multiple sub-scheduling subdomains (e.g., sub-scheduling subdomains 0-3 shown in Figure 1), each at the top level of its sub-scheduling subdomain. The computing domain includes multiple computing subdomains (e.g., computing subdomains 0-3 shown in Figure 1), which are connected one-to-one with the sub-scheduling level nodes. Each computing subdomain deploys multiple GPU computing units of two types (e.g., G0, G1, G3, G4, etc. shown in Figure 1). One type is primarily used for large model inference tasks limited by memory I/O speed, while the other type is primarily used for large model inference tasks limited by computation. The number of GPU computing units deployed within each computing subdomain can be the same or different. The ratio of the two types of GPU computing units within each computing subdomain can be 1:1 or another ratio close to 1:1. Furthermore, any two GPU computing units within a computing subdomain can communicate with each other to share computing resource information. All GPU computing units within all computing subdomains form the GPU computing cluster of the computing domain.

[0048] The main scheduling level node, as the top layer in the scheduling domain, is responsible for receiving submission requests for multiple large model inference tasks from users in real time, as well as the submitted large model inference tasks. Based on the first scheduling strategy (i.e., the preset scheduling strategy), it schedules and forwards these large model inference tasks to the corresponding sub-scheduling level nodes at the next lower level. Each sub-scheduling level node, as a downstream node of the main scheduling level node, is responsible for receiving the large model inference tasks scheduled and allocated by the main scheduling level node in real time. Based on the second scheduling strategy (i.e., subsequent steps S320-S340), it allocates GPU computing resources in the corresponding computing subdomain connected to the sub-scheduling level node to the received large model inference tasks, and schedules the corresponding GPU computing units in that computing subdomain to execute and compute the large model inference task. By using hierarchical scheduling nodes in the scheduling domain to sequentially allocate large model inference tasks according to the first and second scheduling strategies, until the large model inference tasks are scheduled to the corresponding GPU computing units in the GPU computing cluster within the computing domain for computing and executing the tasks, the resource utilization of the GPU computing cluster can be improved. This also reduces the computational overhead of central scheduling to a certain extent, improves the response speed of each scheduling node and each GPU computing unit within the scheduling domain, thereby improving the scheduling efficiency of the entire architecture and significantly reducing the response latency of large model inference tasks.

[0049] Figure 2 is a schematic diagram of the structure of each scheduling level node within the scheduling domain in one embodiment of the present invention. As shown in Figure 2, each scheduling level node includes a task queue, a memory manager, a compute unit manager, and a task scheduler. The task queue receives large model inference tasks in real time; the memory manager monitors and calculates the remaining memory resources (GPU storage capacity) in real time; the compute unit manager monitors and calculates the remaining compute resources (GPU computing capacity) in real time; and the task scheduler invokes the relevant scheduling policy (the first or second scheduling policy) and performs real-time scheduling of the received large model inference tasks based on the invoked scheduling policy, the remaining memory resources, and the remaining compute resources.
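The four components of a scheduling level node can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the class names `Task`, `ResourceManager`, and `SchedulingLevelNode`, the pluggable `policy` callable, and the placeholder `first_fit` policy are all hypothetical and stand in for the first or second scheduling policy described in the text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional, Tuple


@dataclass
class Task:
    task_id: str
    mem_required: float      # e.g. GiB (assumed unit)
    compute_required: float  # normalized compute share (assumed unit)


@dataclass
class ResourceManager:
    """Tracks the remaining amount of one resource per downstream target."""
    remaining: Dict[str, float] = field(default_factory=dict)

    def update(self, target: str, amount: float) -> None:
        self.remaining[target] = amount

    def get(self, target: str) -> float:
        return self.remaining.get(target, 0.0)


Policy = Callable[[Task, ResourceManager, ResourceManager], Optional[str]]


@dataclass
class SchedulingLevelNode:
    policy: Policy  # stands in for the first or second scheduling policy
    task_queue: Deque[Task] = field(default_factory=deque)
    memory_manager: ResourceManager = field(default_factory=ResourceManager)
    compute_manager: ResourceManager = field(default_factory=ResourceManager)

    def submit(self, task: Task) -> None:
        self.task_queue.append(task)

    def schedule_next(self) -> Optional[Tuple[str, Optional[str]]]:
        """Task scheduler: apply the policy to the task at the head of the queue."""
        if not self.task_queue:
            return None
        task = self.task_queue[0]  # peek only; dequeuing is not modeled here
        target = self.policy(task, self.memory_manager, self.compute_manager)
        return task.task_id, target


def first_fit(task: Task, mem: ResourceManager, comp: ResourceManager) -> Optional[str]:
    """Placeholder policy: first target that satisfies both requirements."""
    for target in mem.remaining:
        if mem.get(target) >= task.mem_required and comp.get(target) >= task.compute_required:
            return target
    return None
```

The same node structure can serve at either level: for the main scheduling level node the targets would be computing subdomains, and for a sub-scheduling level node they would be GPU computing units.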

[0050] Figure 3 is a flowchart illustrating a large model inference task scheduling method based on hierarchical and domain-based approaches, according to an embodiment of the present invention. As shown in Figure 3, the method is executed by a sub-scheduling level node and includes the following steps:

[0051] In step S310, each sub-scheduling level node in the scheduling domain receives, in real time, the large model inference task allocated and forwarded by the main scheduling level node in the scheduling domain based on a preset scheduling strategy, and obtains the memory and computing resource requirements of the large model inference task. The sub-scheduling level node is the node one level below the main scheduling level node.

[0052] Specifically, the preset scheduling strategy in step S310 includes the following:

[0053] Based on the memory and computing resource requirements of the large model inference task received in real time, and on the remaining memory and computing resources, acquired in real time, of the computing subdomain corresponding to each sub-scheduling level node, it is determined whether the remaining memory and computing resources of each computing subdomain can simultaneously meet the memory and computing resource requirements of the large model inference task.

[0054] All computational subdomains that can simultaneously meet the memory and computational resource requirements of the large model inference task are identified as the candidate computational subdomain set.

[0055] The computing subdomains in the candidate computing subdomain set are sorted by remaining memory resources and by remaining computing resources to obtain the first sorting result and the second sorting result, respectively.

[0056] Based on the first and second sorting results, a comprehensive score is obtained for each computational subdomain. The large model inference task received in real time is then assigned and forwarded to the sub-scheduling level node connected to the computational subdomain with the highest comprehensive score.

[0057] The task queue of the main scheduling level node receives multiple large model inference tasks submitted by users in real time. Tasks in the task queue are processed in queue order (or by task priority, which can be determined from task importance and/or task latency requirements and/or task quality requirements). One or more tasks can share the same order (or priority). The task scheduler in the main scheduling level node parses each large model inference task to obtain its memory and computing resource requirements. The memory manager and compute unit manager in the main scheduling level node monitor and tally, in real time, the remaining memory and computing resources of the computing subdomain corresponding to each sub-scheduling level node. Based on the task's memory and computing resource requirements and the remaining memory and computing resources of each computing subdomain, the task scheduler determines all computing subdomains (one or more) that can simultaneously meet the task's memory and computing resource requirements. (If no computing subdomain can do so, the node waits for the computing subdomains to release resources until one can.) These computing subdomains form the candidate computing subdomain set. The computing subdomains in the set are then sorted by remaining memory resources and by remaining computing resources to obtain a first sorting result and a second sorting result, respectively. The average of the first and second sorting results for each computing subdomain is calculated as its comprehensive score. The one or more large model inference tasks received in real time from the task queue are then assigned and forwarded to the sub-scheduling level node connected to the one or more computing subdomains with the highest comprehensive score. The remaining memory resources and remaining computing resources of a computing subdomain are the sums of the remaining memory resources and remaining computing resources, respectively, of all GPU computing units within that subdomain.
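The subdomain selection described in this paragraph can be sketched as follows. The function name `pick_subdomain` and the input layout are hypothetical. The sketch assumes that each sorting result is an ascending rank (so a subdomain with more remaining resources gets a larger rank and therefore a higher score) and that the per-subdomain totals have already been summed over its GPU computing units; the source does not state the sort direction.

```python
from typing import Dict, Optional, Tuple


def pick_subdomain(
    subdomains: Dict[str, Tuple[float, float]],
    mem_required: float,
    compute_required: float,
) -> Optional[str]:
    """subdomains maps name -> (remaining memory, remaining compute), already summed."""
    # Candidate set: subdomains that satisfy both requirements at once.
    candidates = {
        name: res
        for name, res in subdomains.items()
        if res[0] >= mem_required and res[1] >= compute_required
    }
    if not candidates:
        return None  # caller would wait for resources to be released

    # First and second sorting results, as 1-based ascending ranks (assumed direction).
    by_mem = sorted(candidates, key=lambda n: candidates[n][0])
    by_compute = sorted(candidates, key=lambda n: candidates[n][1])
    mem_rank = {name: i + 1 for i, name in enumerate(by_mem)}
    compute_rank = {name: i + 1 for i, name in enumerate(by_compute)}

    # Comprehensive score: average of the two ranks.
    scores = {n: (mem_rank[n] + compute_rank[n]) / 2 for n in candidates}
    return max(scores, key=scores.get)
```

With four subdomains holding (100, 2.0), (60, 3.5), (20, 5.0), and (90, 4.0) units of remaining memory and compute, a task needing 30 memory and 1.0 compute excludes the third subdomain, and the fourth one obtains the highest average rank. Ties are broken by dictionary order here, which is an arbitrary choice of this sketch.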

[0058] Then, the task queue of the sub-scheduling level node receives large model inference tasks from the main scheduling level node in real time, and the task scheduler of the sub-scheduling level node parses the received large model inference tasks to obtain their memory and computing resource requirements. In other embodiments, the main scheduling level node can forward the already parsed memory and computing resource requirements together with the large model inference task to the assigned sub-scheduling level node, so that the sub-scheduling level node does not need to parse the task again to obtain its memory and computing resource requirements.

[0059] Step S320: In real time, obtain the remaining video memory and computing resources of each GPU computing unit in the computing subdomain connected to the sub-scheduling level node. Based on the video memory and computing resource requirements of the large model inference task, and the remaining video memory and computing resources of each GPU computing unit in the corresponding computing subdomain obtained in real time, determine all candidate GPU computing units that match the resources of the large model inference task.

[0060] The memory manager and compute unit manager of the sub-scheduling level node monitor and tally, in real time, the remaining memory and compute resources of all GPU compute units deployed in the corresponding compute subdomain. The task scheduler of the sub-scheduling level node compares the memory and compute resource requirements of the large model inference task with the remaining memory and compute resources of each GPU compute unit in the corresponding compute subdomain, determines all GPU compute units that can simultaneously meet both requirements, and selects them as the candidate GPU compute units.

[0061] Specifically, in step S320, determining all candidate GPU computing units that match the resources of the large model inference task, based on the memory and computing resource requirements of the task and the remaining memory and computing resources of each GPU computing unit in the corresponding computing subdomain acquired in real time, includes the following steps:

[0062] If the remaining video memory and computing resources of each GPU computing unit in the corresponding computing subdomain, which are obtained in real time, cannot simultaneously meet the video memory and computing resource requirements of the large model inference task, then wait for the GPU computing unit to release resources until a GPU computing unit can simultaneously meet the video memory and computing resource requirements of the large model inference task.

[0063] If the remaining video memory and computing resources of the GPU computing unit in the corresponding computing subdomain obtained in real time can simultaneously meet the video memory and computing resource requirements of the large model inference task, then the GPU computing unit is determined as a candidate GPU computing unit that matches the resources of the large model inference task, so as to obtain all candidate GPU computing units (including at least one candidate GPU computing unit) that match the resources of the large model inference task.
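The two cases above (wait when no unit fits, otherwise collect every unit that fits) can be sketched as below. The names `find_candidates` and `wait_for_candidates`, the `poll` callback that returns a fresh resource snapshot, and the `max_polls` bound are assumptions for illustration; the source describes waiting until resources are released and does not mention a polling interval or a retry limit.

```python
import time
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, Tuple[float, float]]  # GPU name -> (remaining memory, remaining compute)


def find_candidates(gpus: Snapshot, mem_required: float, compute_required: float) -> List[str]:
    """Return every GPU unit that meets the memory and compute requirements together."""
    return [
        name
        for name, (mem, compute) in gpus.items()
        if mem >= mem_required and compute >= compute_required
    ]


def wait_for_candidates(
    poll: Callable[[], Snapshot],
    mem_required: float,
    compute_required: float,
    interval_s: float = 0.1,
    max_polls: int = 100,  # assumption: bounded so the sketch always terminates
) -> List[str]:
    """Re-read remaining resources until at least one GPU unit qualifies."""
    for _ in range(max_polls):
        candidates = find_candidates(poll(), mem_required, compute_required)
        if candidates:
            return candidates
        time.sleep(interval_s)  # wait for running tasks to release resources
    return []
```

A unit with enough memory but too little remaining compute (or the reverse) is excluded, which reflects the requirement that both constraints hold simultaneously.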

[0064] Step S330: Based on the memory and computing resource requirements of the large model inference task, determine whether the task type of the large model inference task is a memory-limited task or a computing-limited task. Based on the correspondence between task type and GPU, determine the target GPU computing unit corresponding to the large model inference task from all candidate GPU computing units.

[0065] Specifically, large model inference tasks are categorized into memory-constrained tasks and computation-constrained tasks, according to whether the task, when executed on the GPU computing cluster, is more likely to be limited by memory I/O or by computation, either of which can prevent successful execution. In other words, if a large model inference task has higher memory resource requirements than computational resource requirements (or memory accounts for the larger proportion of its requirements), it can be considered a memory I/O-constrained task; conversely, if it has higher computational resource requirements than memory resource requirements (or computation accounts for the larger proportion), it can be considered a computation-constrained task. Because either limitation can cause a task to fail, the second scheduling strategy used by the task scheduler of the sub-scheduling level node needs to consider the task types of the large model inference tasks that each GPU computing unit in the corresponding computing subdomain can undertake and execute. This allows each large model inference task to be matched to a GPU computing unit suited to its type, thereby improving the utilization of GPU computing resources and the success rate of task execution.
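As an illustration of this classification, the following Python sketch compares the two requirements as proportions of a single GPU computing unit's capacity. The normalization by `unit_memory_gb` and `unit_compute`, the function name `classify_task`, and the handling of ties are assumptions made for the example; the disclosure states only that the larger requirement (or larger proportion) decides the type.

```python
def classify_task(memory_gb, compute, unit_memory_gb=80.0, unit_compute=1.0):
    """Classify a task by which resource takes the larger share of one unit.

    The capacities used for normalization are assumed values, not taken
    from the source.
    """
    memory_share = memory_gb / unit_memory_gb
    compute_share = compute / unit_compute
    # Assumption: a tie is treated as compute-limited; the source does not
    # specify how equal shares are handled.
    if memory_share > compute_share:
        return "memory_limited"
    return "compute_limited"


print(classify_task(memory_gb=60, compute=0.3))  # memory share 0.75 vs compute share 0.3
print(classify_task(memory_gb=8, compute=0.6))   # memory share 0.10 vs compute share 0.6
```

Expressing both requirements as shares of a unit's capacity is one way to make quantities with different units comparable; other normalizations would serve equally well.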

[0066] Specifically, in step S330, the target GPU computing unit corresponding to the large model inference task is determined from all candidate GPU computing units based on the correspondence between task type and GPU, including the following steps:

[0067] With first priority, a candidate GPU computing unit that is not currently executing a large model inference task of any task type is selected from all the candidate GPU computing units as the target GPU computing unit.

[0068] With second priority, and based on the task type, a candidate GPU computing unit that is currently executing a large model inference task of a different task type from the task to be scheduled is selected from all candidate GPU computing units as the target GPU computing unit. For example, if the large model inference task to be scheduled and executed is a computation-constrained task, then a candidate GPU computing unit currently executing a memory I/O-constrained task should be selected as the target GPU computing unit.

[0069] The target GPU computing unit that matches the large model inference task is selected according to the priority order given in the above steps.
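The two-level priority of steps [0067] and [0068] can be sketched in Python as follows. The names `Candidate` and `pick_target` and the type labels are hypothetical. The final fallback, used when every candidate is already executing a task of the same type, is an assumption, since the disclosure does not state what happens in that case.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    # Types of the tasks this unit is currently executing (empty if idle).
    running_types: set = field(default_factory=set)


def pick_target(task_type, candidates):
    """Pick a target unit: idle units first, then units running another type."""
    # First priority: a unit that is not executing a task of any type.
    for c in candidates:
        if not c.running_types:
            return c
    # Second priority: a unit executing only tasks of a different type.
    for c in candidates:
        if task_type not in c.running_types:
            return c
    # Assumed fallback (not specified in the source): take any candidate,
    # since every candidate already satisfies the resource requirements.
    return candidates[0] if candidates else None


busy_compute = Candidate("gpu0", {"compute_limited"})
busy_memory = Candidate("gpu1", {"memory_limited"})
idle = Candidate("gpu2")

print(pick_target("compute_limited", [busy_compute, busy_memory, idle]).name)
print(pick_target("compute_limited", [busy_compute, busy_memory]).name)
```

Pairing a computation-constrained task with a unit that is executing a memory I/O-constrained task follows the example given in [0068].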

[0070] Step S340: Schedule the large model inference task to the corresponding target GPU computing unit so that the target GPU computing unit can perform the computation of the corresponding large model inference task.

[0071] In the above scheme, concurrent large model inference tasks are organized into multi-level task queues through hierarchical and domain-based scheduling. Real-time monitoring of computing and storage resources is achieved through scheduling nodes at each level. Large model inference tasks are then allocated and scheduled layer by layer within each scheduling node, and dynamically and in real-time batch-scheduled from the scheduling domain to the corresponding GPU computing subdomains and their respective GPU computing units. This hierarchical and domain-based large model inference task scheduling framework improves the hardware resource utilization of the GPU computing cluster, meets the quality of service requirements for parallel large model inference tasks, and reduces the response latency of large model inference tasks.

[0072] In some embodiments, after scheduling the large model inference task to the corresponding target GPU computing unit so that the target GPU computing unit performs the computation of the corresponding large model inference task, the method further includes the following steps:

[0073] Each sub-scheduling level node in the scheduling domain receives task completion information from the target GPU computing unit in the corresponding computing subdomain, and dequeues the corresponding large model inference task from the corresponding task queue based on the received task completion information.

[0074] Each sub-scheduling level node sends the received task completion information to the overall scheduling level node, so that the overall scheduling level node dequeues the corresponding large model inference task from its task queue based on the received task completion information.

[0075] Through the above steps, after the scheduling and computation of a large model inference task is completed, the GPU computing resources occupied by the task are released. The task completion information is fed back upward in sequence, from the GPU computing unit in the computing subdomain to the sub-scheduling level node and then to the overall scheduling level node. Based on the task completion information, the corresponding large model inference task is dequeued from the task queues of the corresponding sub-scheduling level node and the overall scheduling level node, which prevents a task from being scheduled and executed again after it has already been executed successfully, thereby avoiding the waste of computing resources.
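The upward propagation of task completion information can be illustrated with a minimal Python sketch. The class names `OverallNode` and `SubNode`, the method `on_completion`, and the use of a dictionary keyed by task identifier as the task queue are all hypothetical simplifications and are not taken from the disclosure.

```python
class OverallNode:
    """Hypothetical stand-in for the overall scheduling level node."""

    def __init__(self):
        self.queue = {}  # task_id -> task description

    def on_completion(self, task_id):
        # Dequeue so the finished task cannot be scheduled again.
        self.queue.pop(task_id, None)


class SubNode:
    """Hypothetical stand-in for a sub-scheduling level node."""

    def __init__(self, parent):
        self.parent = parent
        self.queue = {}

    def on_completion(self, task_id):
        # Dequeue locally, then forward the completion information upward.
        self.queue.pop(task_id, None)
        self.parent.on_completion(task_id)


overall = OverallNode()
sub = SubNode(overall)
for task_id in ("t1", "t2"):
    overall.queue[task_id] = "inference task"
    sub.queue[task_id] = "inference task"

# A GPU computing unit reports that task t1 has finished.
sub.on_completion("t1")
print(sorted(sub.queue), sorted(overall.queue))
```

Using `pop` with a default value makes a repeated completion message for the same task harmless in this sketch.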

[0076] In summary, the hierarchical and domain-based large model inference task scheduling method of this invention addresses the batch scheduling of large-scale large model inference tasks in GPU computing clusters by designing a hierarchical and domain-based large model inference task scheduling architecture. Supported by this architecture, large model inference tasks are allocated and scheduled layer by layer, across levels and domains, from the scheduling domain to the corresponding GPU computing units in the computing domain for execution. During this process, both memory I/O limitations and computational limitations are considered simultaneously, which improves the overall resource utilization of the GPU computing cluster and enhances the concurrent processing capability of the entire architecture and system for large-scale large model inference tasks. This in turn improves the success rate of scheduling and executing large-scale parallel large model inference tasks, meets the service quality requirements of concurrent large model inference tasks, and reduces the response latency of large model inference tasks.

[0077] Corresponding to the above method, the present invention also provides a large model inference task scheduling system based on hierarchical domain division. The system includes a computer device, which includes a processor and a memory. The memory stores computer instructions, and the processor is used to execute the computer instructions stored in the memory. When the computer instructions are executed by the processor, the system implements the steps of the aforementioned method.

[0078] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned method. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.

[0079] This invention also provides a computer program product, including computer instructions that, when executed by a processor, implement the steps of the aforementioned method.

[0080] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.

[0081] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.

[0082] In this invention, features described and/or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and/or combined with or in place of features of other embodiments.

[0083] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A hierarchical and domain-based large model inference task scheduling method, characterized in that the method includes: each sub-scheduling level node within the scheduling domain receiving, in real time, the large model inference task allocated and forwarded by the overall scheduling level node within the scheduling domain based on a preset scheduling strategy, and obtaining the memory and computing resource requirements of the large model inference task; obtaining, in real time, the remaining video memory and computing resources of each GPU computing unit in the computing subdomain connected to the sub-scheduling level node; determining, based on the video memory and computing resource requirements of the large model inference task and the remaining video memory and computing resources of each GPU computing unit in the corresponding computing subdomain, all candidate GPU computing units that match the resources of the large model inference task; determining, based on the memory and computing resource requirements of the large model inference task, whether the task type of the large model inference task is a memory-limited task or a computing-limited task; determining, based on the correspondence between task type and GPU, the target GPU computing unit corresponding to the large model inference task from all candidate GPU computing units; and scheduling the large model inference task to the corresponding target GPU computing unit so that the target GPU computing unit performs the computation of the corresponding large model inference task.

2. The method according to claim 1, characterized in that, Based on the memory and computing resource requirements of the large model inference task, and the remaining memory and computing resources of each GPU computing unit in the corresponding computing subdomain, all candidate GPU computing units matching the resources of the large model inference task are determined, including: If the remaining video memory and computing resources of each GPU computing unit in the corresponding computing subdomain, which are obtained in real time, cannot simultaneously meet the video memory and computing resource requirements of the large model inference task, then wait for the GPU computing unit to release resources until a GPU computing unit can simultaneously meet the video memory and computing resource requirements of the large model inference task. If the remaining video memory and computing resources of the GPU computing unit in the corresponding computing subdomain, which are obtained in real time, can simultaneously meet the video memory and computing resource requirements of the large model inference task, then the GPU computing unit is determined as a candidate GPU computing unit that matches the resources of the large model inference task, so as to obtain all candidate GPU computing units that match the resources of the large model inference task.

3. The method according to claim 1, characterized in that determining, based on the correspondence between task type and GPU, the target GPU computing unit corresponding to the large model inference task from all candidate GPU computing units includes: selecting, with first priority, from all the candidate GPU computing units, a candidate GPU computing unit that is not executing a large model inference task of any task type as the target GPU computing unit; and selecting, with second priority and based on the task type, from all candidate GPU computing units, a candidate GPU computing unit that is executing a large model inference task of a different task type as the target GPU computing unit.

4. The method according to claim 1, characterized in that, The preset scheduling strategy includes: Based on the real-time received large model inference task's memory and computing resource requirements, and the real-time acquisition of the remaining memory and computing resources of each sub-scheduling level node in the corresponding computing subdomain, it is determined whether the remaining memory and computing resources of each computing subdomain can simultaneously meet the memory and computing resource requirements of the large model inference task. All computational subdomains that can simultaneously meet the memory and computational resource requirements of the large model inference task are identified as the candidate computational subdomain set. The remaining memory resources and remaining computing resources of all computing subdomains in the candidate computing subdomain set are sorted to obtain the first sorting result and the second sorting result. Based on the first and second sorting results, a comprehensive score is obtained for each computational subdomain. The large model inference task received in real time is then assigned and forwarded to the sub-scheduling level node connected to the computational subdomain with the highest comprehensive score.

5. The method according to claim 1, characterized in that each scheduling level node includes a task queue, a memory manager, a compute unit manager, and a task scheduler.

6. The method according to claim 5, characterized in that, After scheduling the large model inference task to the corresponding target GPU computing unit so that the target GPU computing unit performs the computation of the corresponding large model inference task, the method further includes: Each sub-scheduling level node in the scheduling domain receives task completion information from the target GPU computing unit in the corresponding computing subdomain, and dequeues the corresponding large model inference task from the corresponding task queue based on the received task completion information. The received task completion information is sent to the overall scheduling level node so that the overall scheduling level node dequeues the corresponding large model inference task from the task queue based on the received task completion information.

7. The method according to any one of claims 1 to 6, characterized in that determining, based on the memory and computing resource requirements of the large model inference task, whether the task type is a memory-limited task or a computing-limited task includes: if the video memory resource requirements of the large model inference task are higher than its computing resource requirements, the large model inference task is a memory-limited task; and if the computing resource requirements of the large model inference task are higher than its video memory resource requirements, the large model inference task is a computing-limited task.

8. A hierarchical and domain-based large model inference task scheduling system, comprising a processor, a memory, and computer instructions stored in the memory, characterized in that the processor is configured to execute the computer instructions, and when the computer instructions are executed, the system implements the steps of the method as described in any one of claims 1 to 7.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the steps of the method as described in any one of claims 1 to 7.

10. A computer program product comprising computer instructions, characterized in that, when executed by a processor, the computer instructions implement the steps of the method according to any one of claims 1 to 7.