Distributed cluster construction method, distributed inference method and resource scheduler
By encapsulating and adapting hardware accelerators, the problem of difficult resource scheduling of hardware accelerators on container orchestration platforms is solved, enabling stable and efficient deployment of Ray head nodes, supporting log acquisition and fault recovery, optimizing resource utilization and performance of distributed inference, and improving the observability and availability of the system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING KECHENG TECH DEV CO LTD
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-30
AI Technical Summary
When deploying large-scale model inference using distributed clusters on a container orchestration platform, there are problems such as difficulty in scheduling hardware accelerator resources, inability to deploy the Ray head node for management and control only, difficulty in obtaining inference service logs, incomplete cross-node fault recovery, and inconvenience in obtaining performance metrics.
By encapsulating the hardware accelerator's development toolkit and distributed inference framework, a target inference image adapted to the hardware accelerator is created. The resource scheduler of the container orchestration platform is used to realize the automatic discovery and registration of the hardware accelerator, ensuring that the Ray head node is only used for management and control, supporting convenient log acquisition and fault recovery, and dynamically adjusting resource configuration to optimize performance indicators.
It enables automatic discovery and scheduling of hardware accelerator resources, ensuring the stability and high availability of Ray head nodes, supporting convenient log retrieval and fault recovery, optimizing resource utilization and performance isolation, and improving the efficiency and observability of distributed inference.
Figure CN121996434B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, specifically to a distributed cluster construction method, a distributed inference method, and a resource scheduler.
Background Technology
[0002] Container orchestration platforms offer features such as automated deployment and automatic scaling. In large-scale model inference methods, resource orchestration is performed using container orchestration platforms, followed by distributed task scheduling and service-oriented inference using distributed clusters. However, when deploying large-scale model inference using distributed clusters on container orchestration platforms, a technical challenge arises: the distributed cluster struggles to schedule the resources of the hardware accelerators on the container orchestration platform.
Summary of the Invention
[0003] In view of the above problems, this application provides a distributed cluster construction method, a distributed inference method, and a resource scheduler.
[0004] According to a first aspect of this application, a method for constructing a distributed cluster is provided, comprising: in response to a commit operation on a system orchestration file, binding a virtual node created based on the system orchestration file to a physical node equipped with a hardware accelerator; invoking multiple physical nodes to pull a target inference image from an image repository to launch container instances of the multiple physical nodes using the target inference image, wherein the target inference image is encapsulated based on a development kit for the hardware accelerator of the physical node and a distributed inference framework; invoking an accelerator plugin of the physical node to allocate the resources of the hardware accelerator on the physical node to the container instances; and constructing a distributed cluster based on the multiple container instances.
[0005] According to the embodiments of this application, by encapsulating the development toolkit and distributed inference framework of the hardware accelerator, a target inference image adapted to the hardware accelerator is obtained. Multiple physical nodes in the container orchestration platform are called to pull the target inference image and start the corresponding container instance. By deploying the accelerator plugin in the resource scheduler of the container orchestration platform, the automatic discovery and registration of the hardware accelerator at the container orchestration platform level is realized. The accelerator plugin is called to allocate the resources of the hardware accelerator on the physical node to the container instance, forming a computing resource pool of the distributed cluster. The container instance serves as the carrier for executing the hardware accelerator resources, realizing the discovery and resource scheduling of the hardware accelerator by the distributed cluster.
[0006] According to an embodiment of this application, in response to a submission operation for a system orchestration file, binding a virtual node created based on the system orchestration file to a physical node equipped with a hardware accelerator includes: parsing the system orchestration file to obtain master node configuration information and worker node configuration information; creating a virtual master node and virtual worker nodes based on the master node configuration information and worker node configuration information; invoking the virtual master node to load multiple model shards of the inference model onto the virtual worker nodes based on the system orchestration file and loading rules, so as to support parallel inference of multiple model shards based on multiple virtual worker nodes in task inference; and binding the virtual master node and multiple virtual target worker nodes to the physical node equipped with a hardware accelerator.
[0007] According to the embodiments of this application, the virtual master node can be scheduled to any idle physical node since it does not occupy accelerator resources, while each virtual worker node is bound to a specific physical node equipped with a hardware accelerator through a device plugin according to the resource requirements, thereby creating a container instance within the virtual worker node, realizing the complete link from logical virtual node to physical hardware resources, and ensuring the computing power supply and performance isolation of distributed inference tasks.
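The parsing and binding steps above can be sketched as follows, assuming the system orchestration file has already been loaded into a Python dictionary. The field names (`workers`, `replicas`, `accelerators`), the `VirtualNode` class, and the first-fit `bind` helper are hypothetical and chosen only for illustration; the application does not specify a file schema or a binding policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VirtualNode:
    name: str
    role: str                       # "head" or "worker"
    accelerators: int               # accelerators requested (0 for the head)
    physical_node: Optional[str] = None


def create_virtual_nodes(spec: Dict) -> List[VirtualNode]:
    """Create one virtual head node and the virtual worker nodes from a parsed file."""
    nodes = [VirtualNode("head", "head", 0)]   # head occupies no accelerator
    worker = spec["workers"]                   # hypothetical schema
    for i in range(worker["replicas"]):
        nodes.append(VirtualNode(f"worker-{i}", "worker", worker["accelerators"]))
    return nodes


def bind(nodes: List[VirtualNode], free: Dict[str, int]) -> None:
    """First-fit binding of each virtual node to a physical node.

    `free` maps a physical node name to its number of unallocated accelerators
    and is updated in place.
    """
    for node in nodes:
        for host, available in free.items():
            if available >= node.accelerators:
                node.physical_node = host
                free[host] = available - node.accelerators
                break
        else:
            raise RuntimeError(f"no physical node can host {node.name}")
```

Because the head requests zero accelerators in this sketch, it fits on any physical node, while each worker is only bound where enough accelerators remain free.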
[0008] According to an embodiment of this application, the system orchestration file includes the hardware accelerator resources of the virtual worker node, the number of model shards, and the number of initial model service instances. The number of model shards represents the degree of parallel partitioning within a single initial model service instance. The process of loading multiple model shards onto the virtual worker node based on the system orchestration file and loading rules includes: determining the total resource requirements based on the number of initial model service instances and the number of model shards; determining the target virtual worker node based on the total resource requirements and the hardware accelerator resources of the virtual worker node; and loading the multiple model shards onto the target virtual worker node that satisfies the loading rules. The loading rules include loading model shards of the same initial model service instance onto different hardware accelerators of the same target virtual worker node, where the hardware accelerator resources are greater than the resources required by the model shards.
[0009] According to the embodiments of this application, multiple model shards are loaded into target virtual worker nodes that meet the loading rules, ensuring that each model shard of a single initial model service instance is in the same process space to reduce communication latency and cross-node communication costs. At the same time, physical card isolation is used to avoid resource contention, and the resources of each hardware accelerator are greater than the resources required by the model shards, with a buffer reserved to prevent performance jitter.
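As an illustration of the loading rules, the following sketch computes the total requirement and places shards so that all shards of one service instance land on distinct accelerators of the same worker node. It assumes one accelerator per shard and expresses the "resources" of an accelerator as memory in GB; both are simplifying assumptions, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def total_accelerators(instances: int, shards_per_instance: int) -> int:
    """Total requirement, assuming one accelerator per shard of every instance."""
    return instances * shards_per_instance


def place_shards(
    instances: int,
    shards_per_instance: int,
    workers: Dict[str, int],
    accel_mem_gb: float,
    shard_mem_gb: float,
) -> Dict[Tuple[int, int], Tuple[str, int]]:
    """Map (instance, shard) -> (worker, accelerator index).

    `workers` maps a worker name to its accelerator count. All shards of one
    instance are placed on distinct accelerators of a single worker.
    """
    # Loading rule: accelerator resources must exceed what a shard needs.
    if accel_mem_gb <= shard_mem_gb:
        raise ValueError("each accelerator must offer more memory than a shard needs")
    next_free = {name: 0 for name in workers}   # next unused accelerator index
    placement: Dict[Tuple[int, int], Tuple[str, int]] = {}
    for inst in range(instances):
        for name, count in workers.items():
            if count - next_free[name] >= shards_per_instance:
                for shard in range(shards_per_instance):
                    placement[(inst, shard)] = (name, next_free[name] + shard)
                next_free[name] += shards_per_instance
                break
        else:
            raise RuntimeError(f"instance {inst} does not fit on any single worker")
    return placement
```

Keeping an instance on one worker is what avoids cross-node communication between its shards; the strict memory inequality leaves the buffer mentioned above.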
[0010] According to embodiments of this application, the system orchestration file further includes a preset range of the number of worker nodes; wherein, determining the target virtual worker node based on the total resource requirements and the resources of the hardware accelerators of the virtual worker nodes includes: determining an initial number of virtual worker nodes based on the total resource requirements and the resources of the hardware accelerators of the virtual worker nodes; if the initial number is within the preset range, determining the target virtual worker node from multiple virtual worker nodes based on the initial number; if the initial number is greater than the upper limit of the preset range, increasing the configured number of hardware accelerators of the virtual worker nodes or decreasing the number of initial model service instances, and determining the target virtual worker node from multiple virtual worker nodes based on the initial number; if the initial number is less than the lower limit of the preset range, determining the target virtual worker node from multiple virtual worker nodes based on the lower limit.
[0011] According to the embodiments of this application, the lower limit value set by the preset quantity range ensures high availability and basic concurrency capability of the service and avoids resource fragmentation, while the upper limit value prevents resource exhaustion and cost overrun. This enables the system to automatically adjust the number of initial model service instances within a controllable range based on real-time load, thereby achieving a dynamic balance between resource utilization, service stability and cost-effectiveness.
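The range check can be sketched as below. The sketch handles the over-limit case by raising the accelerator count per worker one step at a time and recomputing, which is only one possible reading of "increasing the configured number of hardware accelerators"; the `max_accel_per_worker` cap and the function name are assumptions made for illustration.

```python
import math
from typing import Tuple


def plan_workers(
    total_required: int,
    accel_per_worker: int,
    lower: int,
    upper: int,
    max_accel_per_worker: int = 8,   # hypothetical hardware cap
) -> Tuple[int, int]:
    """Return (worker count, accelerators per worker) within [lower, upper]."""
    initial = math.ceil(total_required / accel_per_worker)
    while initial > upper:
        if accel_per_worker >= max_accel_per_worker:
            # The remaining option in the text: fewer initial service instances.
            raise ValueError("reduce the number of initial model service instances")
        accel_per_worker += 1
        initial = math.ceil(total_required / accel_per_worker)
    # Below the lower limit, the lower limit is used instead.
    return max(initial, lower), accel_per_worker
```

For example, 16 required accelerators with 2 per worker and a range of 1 to 4 workers would need 8 workers, so the sketch raises the per-worker count until 4 workers with 4 accelerators each satisfy the range.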
[0012] According to an embodiment of this application, the distributed cluster construction method further includes: replacing the device identifier in the configuration environment with the hardware accelerator identifier to obtain a replaced configuration environment; and driving the distributed cluster to identify the hardware accelerator based on the replaced configuration environment.
[0013] According to the embodiments of this application, the device identifier in the configuration environment is replaced with the hardware accelerator identifier, and the container image in the resource scheduler is modified to obtain an adjusted image adapted to the hardware accelerator. This enables the components managing the distributed cluster to identify the resources of the hardware accelerator when the distributed cluster is running, that is, to identify that the resources of the hardware accelerator can be used to compute inference tasks, thereby realizing the discovery and scheduling of hardware accelerators by the distributed cluster.
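A minimal sketch of the identifier replacement, assuming the configuration environment is a flat mapping of strings. The keys and the identifiers `example.com/gpu` and `example.com/dcu` are placeholders, not names taken from the application or from any real device plugin.

```python
from typing import Dict


def replace_device_identifier(env: Dict[str, str], old: str, new: str) -> Dict[str, str]:
    """Return a copy of `env` with `old` replaced by `new` in every key and value."""
    return {key.replace(old, new): value.replace(old, new) for key, value in env.items()}


# Placeholder configuration environment (hypothetical keys and identifiers).
env = {"DEVICE_RESOURCE": "example.com/gpu", "DEVICE_COUNT": "4"}
replaced = replace_device_identifier(env, "example.com/gpu", "example.com/dcu")
```

The original mapping is left untouched, so the unmodified environment remains available for comparison.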
[0014] According to a second aspect of this application, a distributed inference method is provided, comprising: in response to at least one inference request, allocating target tasks indicated by each of the at least one inference request to obtain target model shards for each of the at least one target task; allocating virtual nodes to the target model shards for each of the at least one target task to obtain target virtual nodes for each of the at least one target task; invoking a container instance corresponding to the target virtual node in a distributed cluster, and using the resources of the hardware accelerator of the container instance to perform inference on the target task to obtain an inference result.
[0015] According to the embodiments of this application, a target model shard is determined for the target task to be executed according to the inference request indication. Target virtual nodes are dynamically matched according to the target model shard. Each target task is assigned to the optimal container instance, thereby directly calling the resources of the container instance's hardware accelerator to execute multiple target tasks in parallel inference, maximizing cluster resource utilization while ensuring high-concurrency inference performance.
[0016] According to an embodiment of this application, allocating virtual nodes to the target model shards of the at least one target task to obtain the target virtual node of each target task includes: determining the target virtual node of each target model shard from a preset mapping relationship, wherein the preset mapping relationship represents the loading relationship between model shards and target virtual worker nodes; and obtaining the target virtual node of each target task based on the target model shards of that task and the target virtual nodes of those model shards.
[0017] According to the embodiments of this application, when building a distributed cluster, a preset mapping relationship between model shards and target virtual worker nodes is pre-configured. After starting the inference service of the distributed cluster, after allocating a target model shard to the target task of the inference request, the target virtual node allocated to the target task can be directly determined, and the resources of the hardware accelerator can be directly called to infer the target task within the container instance corresponding to the target virtual node, thereby improving inference efficiency.
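The lookup described above amounts to a table lookup from shard to node. The sketch below assumes the preset mapping is a dictionary built at cluster construction time; the shard and node names in the usage example are hypothetical.

```python
from typing import Dict, List


def route_tasks(
    task_shards: Dict[str, List[str]],
    shard_to_node: Dict[str, str],
) -> Dict[str, List[str]]:
    """For each task, return the virtual worker nodes hosting its shards.

    Nodes are deduplicated and kept in first-seen order. A KeyError is raised
    if a shard was never loaded onto any node.
    """
    routes: Dict[str, List[str]] = {}
    for task, shards in task_shards.items():
        nodes: List[str] = []
        for shard in shards:
            node = shard_to_node[shard]
            if node not in nodes:
                nodes.append(node)
        routes[task] = nodes
    return routes


# Preset mapping fixed when the cluster is built (hypothetical names).
shard_to_node = {"s0": "worker-0", "s1": "worker-0", "s2": "worker-1"}
routes = route_tasks({"t1": ["s0", "s1"], "t2": ["s2"]}, shard_to_node)
```

Since the mapping is fixed in advance, routing a request needs no scheduling decision at inference time, which is the efficiency gain the paragraph describes.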
[0018] According to an embodiment of this application, the distributed inference method further includes: embedding a collection interface based on a transmission protocol during the inference process of the target task; collecting the fluctuation level of inference performance and the number of tasks based on the collection interface, wherein the inference performance includes at least one of the task inference performance and the resource utilization performance of the hardware accelerator; adjusting the number of inference model shards of the target task according to the fluctuation level of the number of tasks and a preset fluctuation threshold; and adjusting the number of target virtual worker nodes according to the inference performance and a preset performance threshold.
[0019] According to an embodiment of this application, a data collection interface is added to the inference service script, allowing the interface to be exposed to users via a communication protocol. This enables the distributed cluster to collect metrics such as the fluctuation level of inference performance and the number of tasks. If the inference service requests fluctuate dynamically or periodically, the initial number of model service instances can be automatically adjusted as needed: when the inference latency exceeds a threshold, the number of virtual worker nodes is automatically increased; when the load decreases, the number is automatically reduced to save resources, thus solving the problem of not being able to obtain performance metrics due to native deployment of the distributed cluster.
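One way to picture the threshold-based adjustment is the sketch below. It measures fluctuation as the coefficient of variation of recent task counts and scales workers one at a time; the fluctuation measure, the scale-down factor of 0.5, and the function names are assumptions, since the application only states that thresholds are compared.

```python
from statistics import mean, pstdev
from typing import Sequence


def fluctuation(task_counts: Sequence[int]) -> float:
    """Coefficient of variation of the task counts (an assumed fluctuation measure)."""
    avg = mean(task_counts)
    return pstdev(task_counts) / avg if avg else 0.0


def scale_workers(
    current: int,
    latency_ms: float,
    threshold_ms: float,
    lower: int,
    upper: int,
) -> int:
    """Add a worker when latency exceeds the threshold, remove one when load is low."""
    if latency_ms > threshold_ms:
        return min(current + 1, upper)
    if latency_ms < 0.5 * threshold_ms:      # assumed low-load condition
        return max(current - 1, lower)
    return current
```

The result is clamped to the preset worker range, so the scaling step never leaves the bounds defined in the orchestration file.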
[0020] According to an embodiment of this application, the distributed inference method further includes: during the inference process of the target task, detecting the state changes and inference logs of the target virtual node to obtain the detection result; and restarting the inference process of the target task if the detection result is an abnormal result.
[0021] According to the embodiments of this application, a health check strategy is added to the inference service script, which solves the problem of not being able to view inference logs and the state changes of target virtual nodes caused by native deployment of distributed clusters. It continuously monitors the state changes of target virtual nodes and inference logs in the inference service. In the event of a failure, the inference process of the target task is restarted and the failure is resolved in a timely manner, realizing the fault recovery of distributed inference service. When a node is abnormal or a task fails, the service can be automatically restarted or the task can be migrated to ensure the continuous availability of inference service. It reduces manual intervention in service state fluctuation scenarios, reduces operation and maintenance complexity, and enables the system to have self-healing capabilities and observability. The inference service can run stably for a long time and supports high-availability cluster deployment.
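The detect-and-restart behavior can be sketched as a small supervision loop. The `"Running"` state string, the `"ERROR"` log marker, and the restart budget are hypothetical; a real health check would also wait between polls, which is omitted here to keep the sketch short.

```python
from typing import Callable, Iterable, List, Tuple


def is_abnormal(state: str, log_lines: Iterable[str]) -> bool:
    """Treat a non-running node or an error line in the inference log as abnormal."""
    return state != "Running" or any("ERROR" in line for line in log_lines)


def supervise(
    probe: Callable[[], Tuple[str, List[str]]],
    restart: Callable[[], None],
    rounds: int,
    max_restarts: int = 3,
) -> int:
    """Poll `probe` for `rounds` iterations and restart on abnormal results.

    `probe` returns the node state and the latest log lines. Returns the
    number of restarts performed.
    """
    restarts = 0
    for _ in range(rounds):
        state, logs = probe()
        if not is_abnormal(state, logs):
            continue
        if restarts >= max_restarts:
            raise RuntimeError("restart budget exhausted")
        restart()
        restarts += 1
    return restarts
```

Passing `probe` and `restart` as callables keeps the loop independent of how node state and logs are actually obtained.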
[0022] A third aspect of this application provides a resource scheduler, comprising: one or more processors; and a memory for storing one or more computer programs, wherein the one or more processors execute the one or more computer programs to implement the steps of the method described above.
Attached Figure Description
[0023] The above-mentioned contents, other objects, features and advantages of this application will become clearer from the following description of embodiments of this application with reference to the accompanying drawings.
[0024] Figure 1 illustrates an application scenario of the distributed cluster construction method and the distributed inference method according to embodiments of this application.
[0025] Figure 2 shows a flowchart of a distributed cluster construction method according to an embodiment of this application.
[0026] Figure 3 shows a schematic diagram of accelerator plugin scheduling according to an embodiment of this application.
[0027] Figure 4 shows a schematic diagram of the construction of a distributed cluster according to an embodiment of this application.
[0028] Figure 5 shows a flowchart of a distributed inference method according to an embodiment of this application.
[0029] Figure 6 shows a schematic diagram of the processing of inference requests according to an embodiment of this application.
[0030] Figure 7 shows a block diagram suitable for implementing a distributed cluster construction method, a distributed inference method, and a resource scheduler according to embodiments of this application.
Detailed Implementation
[0031] The embodiments of this application will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of this application. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of this application for ease of explanation. However, it will be apparent that one or more embodiments may be implemented without these specific details. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concepts of this application.
[0032] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The terms "comprising," "including," etc., as used herein indicate the presence of features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0033] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.
[0034] Expressions such as "at least one of A, B and C" should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" should include, but is not limited to, a system having A alone, a system having B alone, a system having C alone, a system having A and B, a system having A and C, a system having B and C, and/or a system having A, B and C, etc.).
[0035] Container orchestration platforms are used to automate the deployment, scaling, and management of containerized applications.
[0036] A Pod is the smallest container instance to be scheduled in a container orchestration platform. It is the carrier of the target task and contains one or more tightly coupled container instances that share network, storage, and computing resources.
[0037] A physical node is the physical machine or virtual machine that actually runs a Pod in a container orchestration platform. It is equipped with computing resources such as a central processing unit (CPU), memory, storage, and hardware accelerators, and is the actual running carrier of the Pod.
[0038] Hardware accelerators are dedicated computing devices deployed on nodes. For example, a hardware accelerator can be an accelerator device such as a graphics processing unit (GPU) or a deep computing unit (DCU).
[0039] Ray is a high-performance distributed computing framework that allows Ray clusters to be deployed and managed on container orchestration platforms.
[0040] The Virtual Large Language Model (vLLM) inference framework is a fast and easy-to-use large language model inference and deployment service library that optimizes memory usage and improves throughput, thereby accelerating the inference speed of large language models (LLM).
[0041] With the emergence of large language models and multimodal models, the scale of these models has reached tens of billions or even trillions of parameters, bringing enormous challenges to inference computation and resource management. For example, a single model instance often exceeds the memory of a single GPU, computational parallelization is complex, and resource scheduling is difficult.
[0042] Container orchestration platforms offer powerful automated deployment and scaling capabilities; they also possess robust fault tolerance mechanisms, ensuring high availability of applications in the event of failures through multiple replicas, Pod replication, and automatic fault recovery. Furthermore, they can automatically scale applications based on changes in request traffic, ensuring efficient resource utilization. Therefore, to address the aforementioned challenges, container orchestration platforms can be used for resource orchestration, and distributed clusters can be leveraged to achieve distributed task scheduling and service-oriented inference.
[0043] When deploying large-scale model inference using distributed clusters on a container orchestration platform, the following main problems exist:
[0044] First, the distributed cluster cannot recognize DCU devices. A container orchestration platform identifies and schedules graphics processing unit (GPU) resources through the corresponding GPU driver and GPU device plugin, and identifies and schedules DCU resources through a DCU device plugin. The two plugins register different resource names on the physical nodes. At runtime, the distributed cluster must obtain DCU resources from the container orchestration platform and schedule tasks to the nodes. Because the resource name registered by the DCU device plugin differs from the one registered by the GPU device plugin, the GPU device plugin cannot be used to identify and schedule DCU resources. The distributed cluster therefore fails to identify DCU resources, and nodes cannot be allocated DCU resources.
[0045] Second, the Ray head (master) node in the distributed cluster cannot be deployed for management and control only. Generally, GPU resources can be allocated to the head node so that it serves as both a control node and a compute node. This deployment is simpler and has higher resource utilization. However, it may suffer increased scheduling latency under high load, and because the head node is scheduled onto a GPU node, a failure of that node can bring down the entire distributed cluster. This deployment is therefore recommended only for the functional verification phase. In a production environment, the head node should be scheduled onto a highly stable physical node and used only for management and control, not for inference. That deployment offers resource independence, high reliability, and more flexible horizontal scaling of worker nodes. However, because the image used for the DCU contains adapted versions of its dependencies, startup errors occur when no DCU is allocated to the container, so the requirement that the Ray head node be used solely for management and control cannot be met.
[0046] Third, the inference service cannot meet the log acquisition requirement. In the managed inference cluster, vLLM runs on worker nodes as the backend of the service component. The main process of a Pod launched by the distributed cluster is the cluster startup command, so vLLM logs cannot be obtained through the regular kubectl logs command. vLLM logs are needed to determine whether model loading succeeded, to check memory usage and token generation latency, and to locate faults during startup, loading, and inference, so convenient access to vLLM inference logs in the distributed cluster is an important requirement.
[0047] Fourth, cross-node distributed deployment cannot handle fault recovery. For models with a large number of parameters, a single GPU cannot complete the task, and the model must be deployed across multiple GPUs on multiple nodes. If a node in the deployment fails and the inference service fails to restart, inference requests fail but the GPU resources are not released. Although a cross-node vLLM deployment supports fault recovery by itself, fault recovery is not available when the cross-node vLLM service is started through the service component of the cluster. This is because the service component decides whether to restart or create a vLLM service process based on the state of that process, and its way of obtaining this state is incomplete.
[0048] Fifth, performance metrics cannot be obtained through the communication protocol interface. vLLM provides a series of metrics for monitoring system health, exposed through the /metrics endpoint of the vLLM-compatible interface service. These metrics cover throughput, latency, resource usage, task execution, and other levels. They help users understand whether the system is working properly and meeting performance expectations, and provide data for troubleshooting and performance tuning. However, in a distributed cluster, the vLLM metrics cannot be obtained through /metrics after startup, which hinders subsequent real-time monitoring and alerting with external tools.
[0049] To address at least one of the aforementioned problems, embodiments of this application provide a distributed cluster construction method. In response to a submission operation for a system orchestration file, a virtual node created based on the system orchestration file is bound to a physical node equipped with a hardware accelerator. Multiple physical nodes are invoked to pull a target inference image from an image repository to launch container instances on each of the multiple physical nodes using the target inference image. The target inference image is encapsulated based on a development kit for the hardware accelerator of the physical node and a distributed inference framework. The accelerator plugins on the physical nodes are invoked to allocate the resources of the hardware accelerator on the physical nodes to the container instances. A distributed cluster is then constructed based on the multiple container instances.
[0050] In scenarios involving automated decision-making using personal information, the methods, devices, and systems provided in this application all offer users corresponding entry points for choosing to agree to or reject the automated decision-making results. If the user chooses to reject, the process proceeds to the expert decision-making stage. Here, "automated decision-making" refers to the activity of automatically analyzing and evaluating an individual's behavioral habits, interests, or economic, health, and credit status through computer programs, and then making a decision. Here, "expert decision-making" refers to the activity of making decisions by personnel who specialize in a particular field, possess specialized experience, knowledge, and skills, and have reached a certain level of professional expertise.
[0051] Figure 1 illustrates an application scenario of the distributed cluster construction method and the distributed inference method according to embodiments of this application.
[0052] As shown in Figure 1, the system comprises a hardware layer, a basic software layer, a scheduling and operation layer, an inference service layer, and an operation and maintenance monitoring layer. The hardware layer consists of multiple physical nodes equipped with hardware accelerators. The basic software layer includes a container orchestration platform, DCU drivers, and accelerator plugins. The scheduling and operation layer deploys a resource scheduler to manage the distributed cluster's runtime component RayCluster, service component RayServe, and worker components, and triggers joint scheduling between the distributed cluster and the container orchestration platform to migrate and restart services on failure. The inference service layer uses the service component RayServe to run distributed model inference applications, uses hardware accelerator resources to execute model computation tasks through virtual worker nodes, and manages the inference service lifecycle through the virtual master (head) node. The operation and maintenance monitoring layer monitors the resource utilization of the hardware accelerators and the running status of the inference service through the monitoring component Prometheus.
[0053] Figure 2 shows a flowchart of a distributed cluster construction method according to an embodiment of this application.
[0054] As shown in Figure 2, the distributed cluster construction method of this embodiment includes operations S210 to S240.
[0055] In operation S210, in response to a submission operation for the system orchestration file, the virtual node created based on the system orchestration file is bound to the physical node equipped with a hardware accelerator.
[0056] In operation S220, multiple physical nodes are invoked to pull the target inference image from the image repository in order to launch container instances on each physical node using the target inference image.
[0057] In operation S230, the accelerator plugin of the physical node is invoked to allocate the resources of the hardware accelerator on the physical node to the container instance.
[0058] In operation S240, a distributed cluster is built based on the multiple container instances.
[0059] The system orchestration file is a configuration file used to define and deploy containerized applications on virtual machine nodes in a distributed cluster. It contains all the information required for application deployment, such as images, resource requirements, scheduling policies, and environment variables. The system orchestration file specifies the expected build state of the distributed cluster.
[0060] Virtual nodes are logical node objects abstracted based on virtualization technology in container orchestration platforms. They are not real physical machines, but rather encapsulations and mappings of underlying computing resources, such as physical nodes, cloud servers, and edge devices.
[0061] Physical nodes are actual computing devices in a container orchestration platform, namely physical servers or virtual machine instances, equipped with actual computing resources such as central processing units (CPUs), memory, networks, and hardware accelerators. They are the final running carriers for containers and inference task Pods.
[0062] Hardware accelerators are dedicated computing devices deployed on physical nodes, such as graphics processing units (GPUs) or deep computing units (DCUs), used to accelerate specific workloads such as model training / inference and scientific computing.
[0063] The target inference image is built by packaging the development kit of the physical node's hardware accelerator together with the distributed inference framework. For example, the target inference image is built from the DCU development kit and the inference framework vLLM (Virtual Large Language Model), and is then pushed to an image repository available to the cluster; the image repository address is saved.
[0064] Users or systems submit orchestration files to the service components of the container orchestration platform. In response to the submission operation for the system orchestration file, the resource scheduler binds the virtual nodes created based on the system orchestration file to the physical nodes equipped with hardware accelerators.
[0065] Multiple physical nodes are invoked to pull the target inference image from the image repository based on the image repository address, and the container instance of each physical node is started using the target inference image.
[0066] Deploy a DCU accelerator plugin in the container orchestration platform. This plugin interacts with the physical node kernel driver to achieve automatic discovery and registration of DCU devices. The registration information is reported to the service component to generate schedulable device resource types.
[0067] Invoke the accelerator plugins of multiple physical nodes to allocate the hardware accelerator resources on the physical nodes to the container instance.
[0068] Figure 3 shows a schematic diagram of accelerator plugin scheduling according to an embodiment of this application.
[0069] As shown in Figure 3, the container orchestration platform has multiple physical nodes, many of which are equipped with available hardware accelerators. The physical nodes have the hardware accelerator drivers and runtime environments installed, and have version-compatible software environments deployed. An accelerator plugin is deployed in the container orchestration platform; it interacts with the kernel driver of the physical node to achieve automatic discovery and registration of hardware accelerators.
[0070] The accelerator plugin includes a DCU plugin. The DCU plugin registers with the container orchestration platform and defines resource types. Detailed information such as the number and health status of DCUs on physical nodes is obtained from the DCU plugin. The DCU resource information on physical nodes is reported to the service components of the container orchestration platform to update the resource status of the container orchestration platform. When a user submits an inference request for a target task to the service components, the virtual node to which the inference task is scheduled is determined. Based on the DCU resources required by the container instance Pod allocated by the virtual node and the available resources reported by each physical node, specific DCU resources are allocated to the Pod, and device access permissions are configured during container instance runtime.
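The allocation step described above can be illustrated with a minimal sketch. It assumes a simple first-fit policy and a hypothetical `NodeInventory` structure holding the device health reported by the plugin; neither the structure nor the policy is specified in this application, and a real device plugin would communicate with the container orchestration platform rather than operate on in-memory objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NodeInventory:
    """DCU devices reported for one physical node (hypothetical structure)."""
    name: str
    healthy: dict[str, bool]                      # device id -> health status
    allocated: set[str] = field(default_factory=set)

    def free_devices(self) -> list[str]:
        # Only healthy devices that are not yet allocated are schedulable.
        return sorted(d for d, ok in self.healthy.items()
                      if ok and d not in self.allocated)


def allocate(nodes: list[NodeInventory], requested: int) -> tuple[str, list[str]] | None:
    """Reserve `requested` DCUs on the first node that has enough free devices."""
    for node in nodes:
        free = node.free_devices()
        if len(free) >= requested:
            chosen = free[:requested]
            node.allocated.update(chosen)
            return node.name, chosen
    return None  # no node can satisfy the request
```

Calling `allocate(nodes, 2)` returns the node name and the device IDs reserved for the Pod, or `None` when no physical node has enough healthy, unallocated devices.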
[0071] By using the target image, multiple physical nodes of the container orchestration platform are created as multiple container instances. A distributed cluster is built within these container instances. Once the virtual node to which the inference task is scheduled is determined, the distributed cluster can directly call the computing resources of the corresponding hardware accelerator based on the container instance corresponding to the virtual node to execute the inference task.
[0072] Figure 4 shows a schematic diagram illustrating the construction of a distributed cluster according to an embodiment of this application.
[0073] As shown in Figure 4, during the development phase, developers write system orchestration files and application code on their local development machines and test and debug them in the local Ray runtime. After development is completed, the hardware accelerator's development toolkit and the distributed inference framework are packaged to obtain the target inference image, which is then pushed to the image repository for storage and version management. On the production machines, the target inference image is pulled from the image repository through containerized deployment, multiple container instances are started, and a distributed cluster is built within these container instances, so that the application runs stably in the Ray runtime of the production environment.
[0074] Ray, the framework on which the distributed cluster is built, is a general-purpose distributed computing framework that is used here for large model inference. It supports dynamic parallel distribution of inference tasks, distributed loading of model shards, task isolation and scheduling for multi-user requests, and distributed caching and weight sharing.
[0075] The distributed cluster is a distributed computing cluster built on the Ray framework. It distributes multiple inference tasks to multiple virtual nodes in the cluster to call computing resources for parallel execution. It supports elastic scheduling and automatic scaling of heterogeneous resources, and optimizes the load during the inference process of tasks such as large-scale machine learning training, hyperparameter tuning, reinforcement learning and distributed inference.
[0076] According to the embodiments of this application, by encapsulating the development toolkit and distributed inference framework of the hardware accelerator, a target inference image adapted to the hardware accelerator is obtained. Multiple physical nodes in the container orchestration platform are called to pull the target inference image and start the corresponding container instance. By deploying the accelerator plugin in the resource scheduler of the container orchestration platform, the automatic discovery and registration of the hardware accelerator at the container orchestration platform level is realized. The accelerator plugin is called to allocate the resources of the hardware accelerator on the physical node to the container instance, forming a computing resource pool of the distributed cluster. The container instance serves as the carrier for executing the hardware accelerator resources, realizing the discovery and resource scheduling of the hardware accelerator by the distributed cluster.
[0077] According to an embodiment of this application, the distributed cluster construction method further includes: replacing the device identifier in the configuration environment with the hardware accelerator identifier to obtain a replaced configuration environment; and driving the distributed cluster to identify the hardware accelerator based on the replaced configuration environment.
[0078] Deploy custom resources such as the resource scheduler Operator, the service component RayServe for managing distributed clusters, and the node running component RayCluster on the control plane of the container orchestration platform.
[0079] Resource schedulers are used to manage the lifecycle of specific applications, such as the creation, scaling, and upgrading of node running components.
[0080] The node runtime component runs virtual nodes of a distributed cluster as Pods on a container orchestration platform.
[0081] A device identifier can consist of the device model, driver path, resource name, and the like.
[0082] Hardware accelerator identifiers can consist of hardware accelerator model, driver path, and resource name.
[0083] The container images in the resource scheduler need to be replaced with a recompiled version adapted to the hardware accelerator.
[0084] For example, the device identifier in the configuration environment of the container image is replaced with the hardware accelerator identifier to obtain the replaced configuration environment. The driver is then developed based on the replaced configuration environment to obtain a recompiled version adapted to the hardware accelerator.
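The identifier replacement can be sketched as a plain string substitution over the configuration environment. All identifier values below (`GENERIC-GPU`, `/dev/accel-generic`, `example.com/gpu` and their DCU counterparts) are hypothetical placeholders chosen for illustration; this application does not give the actual values.

```python
from __future__ import annotations


def replace_device_identifier(env: dict[str, str],
                              old: dict[str, str],
                              new: dict[str, str]) -> dict[str, str]:
    """Return a copy of `env` in which each part of the old device identifier
    (model, driver path, resource name) is replaced by the matching part of
    the hardware accelerator identifier."""
    replaced = {}
    for key, value in env.items():
        for part, old_value in old.items():
            value = value.replace(old_value, new[part])
        replaced[key] = value
    return replaced


# Hypothetical identifiers; the real values depend on the accelerator vendor.
OLD_ID = {"model": "GENERIC-GPU", "driver_path": "/dev/accel-generic",
          "resource_name": "example.com/gpu"}
DCU_ID = {"model": "DCU-X", "driver_path": "/dev/accel-dcu",
          "resource_name": "example.com/dcu"}
```

The function leaves the original environment untouched, so the replaced configuration environment can be compared against the original before the image is rebuilt.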
[0085] Modify the configuration environment to support the driver interface, library path, and scheduling logic of the hardware accelerator, so that the service components or node running components can recognize the resources of the hardware accelerator when starting the distributed cluster.
[0086] According to the embodiments of this application, the device identifier in the configuration environment is replaced with the hardware accelerator identifier, and the container image in the resource scheduler is modified to obtain an adjusted image adapted to the hardware accelerator. This enables the components managing the distributed cluster to identify the resources of the hardware accelerator when the distributed cluster is running, that is, to identify that the resources of the hardware accelerator can be used to compute inference tasks, thereby realizing the discovery and scheduling of hardware accelerators by the distributed cluster.
[0087] According to an embodiment of this application, in response to a submission operation for a system orchestration file, binding a virtual node created based on the system orchestration file to a physical node equipped with a hardware accelerator includes: parsing the system orchestration file to obtain master node configuration information and worker node configuration information; creating a virtual master node and virtual worker nodes based on the master node configuration information and worker node configuration information; invoking the virtual master node to load multiple model shards of the inference model onto the virtual worker nodes based on the system orchestration file and loading rules, so as to support parallel inference of multiple model shards based on multiple virtual worker nodes in task inference; and binding the virtual master node and multiple virtual worker nodes to the physical node equipped with a hardware accelerator.
[0088] The system orchestration file in the RayCluster node runtime component controls the configuration of the virtual master node (head) and virtual worker nodes (workers) of the distributed cluster. The service component RayServe references the configuration of the distributed cluster and defines the startup and configuration information of inference tasks, providing capabilities such as zero-downtime upgrades and high availability.
[0089] The master node configuration information includes the management attributes of the virtual master node (head). Specifying `headGroupSpec-rayStartParams-num-gpus` as 0 in the `rayClusterConfig` field of the system orchestration file indicates that the head node acts as a pure management node and does not consume hardware accelerator resources.
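As a minimal sketch, the head node setting can be expressed as a nested Python dictionary that mirrors the `rayClusterConfig` field path named above, together with a check that the head is configured as a pure management node. The dictionary representation and the helper name `head_is_management_only` are assumptions made for illustration; the actual orchestration file format is not reproduced here.

```python
# Fragment of the orchestration file, written as a Python dict for illustration.
ray_service_spec = {
    "rayClusterConfig": {
        "headGroupSpec": {
            # "0" tells Ray that the head advertises no accelerator resources.
            "rayStartParams": {"num-gpus": "0"},
        },
    },
}


def head_is_management_only(spec: dict) -> bool:
    """Return True only if the head explicitly sets num-gpus to 0."""
    head = spec["rayClusterConfig"]["headGroupSpec"]
    params = head.get("rayStartParams", {})
    return str(params.get("num-gpus", "")) == "0"
```

A missing `num-gpus` entry is treated as a failure here, on the assumption that the head should state explicitly that it uses no accelerators.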
[0090] Using a custom head image, ensure that the container instance running the head node process is used only to host the global control process of the distributed cluster, and schedule the head Pod to a stable physical node.
[0091] The worker node configuration information includes the number of virtual worker nodes, the range of elastic scaling up and down, and the hardware accelerator resources required for the inference model used for inference tasks within the container instance running the worker node process.
[0092] Execute the master node configuration information and worker node configuration information, and automatically create a virtual master node (head) and a virtual worker node (worker).
[0093] Loading rules can include model sharding strategies, video memory allocation ratios, and other rules.
[0094] The virtual master node is a logical abstraction rather than a physical entity. Acting as the scheduling and control center, the virtual master node first divides the inference model into multiple model shards according to the system orchestration file and the loading rules, where each model shard contains part of the network layer parameters. The model shards are then loaded onto multiple virtual worker nodes.
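One possible sharding strategy, splitting by layer, can be sketched as follows. The contiguous near-equal split and the one-shard-per-worker assignment are assumptions made for illustration; the loading rules may prescribe a different strategy.

```python
from __future__ import annotations


def split_layers(layers: list[str], num_shards: int) -> list[list[str]]:
    """Split an ordered list of layers into contiguous, near-equal shards."""
    if not 1 <= num_shards <= len(layers):
        raise ValueError("num_shards must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)   # first `extra` shards get one more layer
        shards.append(layers[start:start + size])
        start += size
    return shards


def assign_shards(shards: list[list[str]], workers: list[str]) -> dict[str, list[str]]:
    """Load shard i onto worker i (one shard per virtual worker node)."""
    if len(workers) < len(shards):
        raise ValueError("not enough virtual worker nodes for the shards")
    return dict(zip(workers, shards))
```

For a five-layer model and two workers, the first worker receives three layers and the second receives two.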
[0095] The node binding mechanism maps the virtual master node and the multiple target virtual worker nodes to physical nodes equipped with hardware accelerators.
[0096] Each virtual worker node corresponds to a container instance with independent resource quotas, enabling the distributed deployment of model shards across different container instances. During the task inference phase, multiple model shards perform forward computations in parallel based on multiple virtual worker nodes. Intermediate results are aggregated through the distributed communication mechanism of the distributed cluster, ultimately outputting the inference result.
[0097] By configuring environment variables in the system orchestration file and adjusting them through mounting configurations, relevant service logs can be redirected to files, facilitating fault location.
[0098] According to the embodiments of this application, the head image, resource constraints, and number of nodes are customized in the system orchestration file, realizing one-click startup and automated scheduling of the inference service. This fulfills the deployment requirement that the head node is only used for management in DCU inference, and supports the requirement that container instances of the head node and worker node required in the production environment be separated on physical nodes, thus enhancing stability. In addition, since the virtual master node does not occupy accelerator resources, it can be scheduled to any idle physical node, while each virtual worker node is bound to a specific physical node equipped with a hardware accelerator through a device plugin according to resource requirements, thereby creating container instances within the virtual worker node. This achieves a complete link from logical virtual nodes to physical hardware resources, ensuring the computing power supply and performance isolation of distributed inference tasks.
[0099] According to an embodiment of this application, the system orchestration file includes the hardware accelerator resources of the virtual worker node, the number of model shards, and the number of initial model service instances. The number of model shards represents the degree of parallel partitioning within a single initial model service instance. The process of loading multiple model shards onto the virtual worker node based on the system orchestration file and loading rules includes: determining the total resource requirements based on the number of initial model service instances and the number of model shards; determining the target virtual worker node based on the total resource requirements and the hardware accelerator resources of the virtual worker node; and loading the multiple model shards onto the target virtual worker node that satisfies the loading rules. The loading rules include loading model shards of the same initial model service instance onto different hardware accelerators of the same target virtual worker node, where the hardware accelerator resources are greater than the resources required by the model shards.
[0100] In the system orchestration file, specify the replicas under workerGroupSpecs to configure the number of virtual worker nodes; in the container-resources field of worker Pod, configure the hardware accelerator resources required by the inference model for inference tasks; in the deployments-num_replicas field of serveConfigV2, configure the initial number of model service instances, and pass environment parameters under runtime_env to control the model configuration.
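The fields named above can be pictured with the following sketch, which writes the relevant part of the orchestration file as nested Python dictionaries and reads the three quantities back. The resource name `example.com/dcu`, the application and deployment names, and the `MODEL_NAME` variable are hypothetical, and the dictionary layout is a simplification of the real file format.

```python
spec = {
    "rayClusterConfig": {
        "workerGroupSpecs": [{
            "replicas": 4,                       # number of virtual worker nodes
            "template": {"spec": {"containers": [{
                "name": "ray-worker",
                # accelerator resources requested by each worker Pod
                "resources": {"limits": {"example.com/dcu": 8}},
            }]}},
        }],
    },
    "serveConfigV2": {
        "applications": [{
            "name": "llm",
            "runtime_env": {"env_vars": {"MODEL_NAME": "demo-model"}},
            "deployments": [{"name": "LLMDeployment", "num_replicas": 4}],
        }],
    },
}


def summarize(spec: dict, resource_name: str) -> dict:
    """Extract worker count, accelerators per worker and service instances."""
    group = spec["rayClusterConfig"]["workerGroupSpecs"][0]
    container = group["template"]["spec"]["containers"][0]
    deployment = spec["serveConfigV2"]["applications"][0]["deployments"][0]
    return {
        "worker_nodes": group["replicas"],
        "accelerators_per_worker": container["resources"]["limits"][resource_name],
        "initial_service_instances": deployment["num_replicas"],
    }
```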
[0101] The initial model service instance is an inference service unit running in the service component deployment. It represents a process that has loaded complete or partial model parameters and exposes a HyperText Transfer Protocol (HTTP) interface to receive inference requests and return prediction results.
[0102] The network model adopts a layered and decoupled design. Tensor data is passed between layers through standardized interfaces, and the parameter matrix can be losslessly partitioned according to the tensor dimension or the layer dimension. Combined with model parallel technology, a single initial model service instance can be split into multiple sub-instances for distributed model sharding. Each sub-instance only loads a portion of the shard parameters and completes the full forward propagation through cross-node communication. This enables model inference when a single GPU has insufficient memory and improves throughput and resource utilization.
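The claim that a parameter matrix can be partitioned losslessly along a tensor dimension can be checked with a small sketch. It assumes a column-wise split of a single linear layer and uses plain Python lists in place of accelerator tensors; the helper names are illustrative only.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split w column-wise into equal-width shards, one per sub-instance."""
    cols = len(w[0])
    if cols % num_shards:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [[row[k * width:(k + 1) * width] for row in w] for k in range(num_shards)]


def sharded_matmul(x, w, num_shards):
    """Each shard computes its slice of the output; slices are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partials for value in part]
```

Because every output element depends on exactly one column of the weight matrix, concatenating the partial outputs reproduces the unsharded result exactly.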
[0103] The total resource requirement is the total amount of hardware accelerator resources needed to run all initial model service instances.
[0104] For example, suppose the initial number of model service instances is 4 and there are 8 model shards in total, and the total resource requirement is calculated using the resource requirement model. Each initial model service instance then loads 2 model shards, that is, the degree of parallelism is 2. If each model shard occupies the resources of 1 hardware accelerator, the total resource requirement is 4 × 2 × 1 = 8 hardware accelerators.
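The arithmetic in this example can be written as a one-line function. The function name and the `accelerators_per_shard` parameter are introduced here for illustration and are not terms used by the resource requirement model itself.

```python
def total_accelerators(instances: int, shards_per_instance: int,
                       accelerators_per_shard: int = 1) -> int:
    """Total hardware accelerators needed to run all model service instances."""
    return instances * shards_per_instance * accelerators_per_shard
```

With 4 instances, 2 shards per instance and 1 accelerator per shard, the function returns 8, matching the example.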
[0105] The resources of a hardware accelerator can include video memory resources and the number of computing units.
[0106] The hardware accelerator resources of the virtual worker node are the resources of the hardware accelerator required by the inference model for inference tasks within the container instance that runs the virtual worker node process and is pre-configured in the system orchestration file.
[0107] The total resource requirements are matched and calculated with the hardware accelerator resources of the virtual worker nodes to select target virtual worker nodes that meet the resource constraints.
[0108] The mapping relationship between the target virtual worker node and the model fragment is determined according to the loading rules, and multiple model fragments are loaded into multiple target virtual worker nodes respectively.
[0109] The loading rules can be for multiple model shards corresponding to the same initial model service instance. For example, for two model shards split by tensor parallel partitioning, the resource scheduler will load the two model shards into a container instance within the same target virtual worker node. The two model shards can be bound to different hardware accelerators corresponding to this container instance.
[0110] In addition, the resources of each hardware accelerator must exceed the resources required by the model shard loaded onto it.
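A minimal sketch of these loading rules follows. It assumes that resources are measured as a single number (for example, free video memory in GB) and uses a largest-shard-first, smallest-sufficient-accelerator heuristic; both choices are assumptions, not the method prescribed by this application.

```python
from __future__ import annotations


def place_instance(shard_needs: list[int],
                   workers: dict[str, dict[str, int]]) -> tuple[str, dict[int, str]] | None:
    """Place all shards of one instance on different accelerators of one worker.

    shard_needs: resources required by each shard of the instance.
    workers:     worker name -> {accelerator id -> free resources}.
    """
    # Handle the most demanding shard first so it is not starved.
    order = sorted(range(len(shard_needs)), key=lambda i: shard_needs[i], reverse=True)
    for worker, accelerators in workers.items():
        free = dict(accelerators)
        placement = {}
        for idx in order:
            # Strictly greater: leaves a buffer beyond the shard's requirement.
            fitting = [a for a, mem in free.items() if mem > shard_needs[idx]]
            if not fitting:
                break
            chosen = min(fitting, key=lambda a: (free[a], a))
            placement[idx] = chosen
            del free[chosen]          # each shard gets its own accelerator
        else:
            return worker, placement
    return None
```

The function returns the chosen worker and a shard-to-accelerator mapping, or `None` when no single worker can host every shard of the instance.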
[0111] According to the embodiments of this application, multiple model shards are loaded into target virtual worker nodes that meet the loading rules, ensuring that each model shard of a single initial model service instance is in the same process space to reduce communication latency and cross-node communication costs. At the same time, physical card isolation is used to avoid resource contention, and the resources of each hardware accelerator are greater than the resources required by the model shards, with a buffer reserved to prevent performance jitter.
[0112] According to embodiments of this application, the system orchestration file further includes a preset range of the number of worker nodes; wherein, determining the target virtual worker node based on the total resource requirements and the resources of the hardware accelerators of the virtual worker nodes includes: determining an initial number of virtual worker nodes based on the total resource requirements and the resources of the hardware accelerators of the virtual worker nodes; if the initial number is within the preset range, determining the target virtual worker node from multiple virtual worker nodes based on the initial number; if the initial number is greater than the upper limit of the preset range, increasing the configuration number of the hardware accelerators of the virtual worker nodes or decreasing the initial number of model service instances, and determining the target virtual worker node from multiple virtual worker nodes based on the initial number; if the initial number is less than the lower limit of the preset range, determining the target virtual worker node from multiple virtual worker nodes based on the lower limit.
[0113] The preset quantity range is the range of the number of replicas that can be flexibly expanded or reduced in the control of virtual worker nodes as defined in the system orchestration file.
[0114] In the system orchestration file, specify minReplicas to control the lower limit of the preset number range and maxReplicas to control the upper limit of the preset number range.
[0115] The initial number of virtual worker nodes is determined based on the total resource requirements and the resources of the hardware accelerators for the virtual worker nodes.
[0116] The initial quantity is within the preset quantity range, which means that the number of pre-configured initial model service instances is within the range of virtual worker nodes that can be automatically expanded based on the current physical node resources. Therefore, there is no need to expand or shrink the capacity. The target virtual node can be determined directly from multiple virtual worker nodes based on the initial quantity.
[0117] If the initial number exceeds the upper limit of the preset range, it means that the number of pre-configured initial model service instances exceeds the maximum number of virtual worker nodes that can be automatically expanded based on the current physical node resources. This may trigger scheduling failure or resource alarms. Therefore, it is necessary to increase the configuration number of hardware accelerators corresponding to the container instances in the virtual worker node process or reduce the number of initial model service instances in the system orchestration file.
[0118] If the initial number is less than the lower limit of the preset range, it means that the number of pre-configured initial model service instances is lower than the minimum number of virtual worker nodes that can be automatically expanded based on the current physical node resources. In this case the system automatically scales to the lower limit, so as to maintain basic concurrent processing capability and avoid excessive resource fragmentation and single-point bottlenecks.
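The three cases above can be sketched as a small decision function. It assumes that the initial number is the total accelerator requirement divided by the accelerators per virtual worker node, rounded up; the returned action labels are hypothetical names introduced for illustration.

```python
def plan_worker_count(total_accelerators: int, accelerators_per_worker: int,
                      min_replicas: int, max_replicas: int) -> tuple[str, int]:
    """Compare the initial worker count with the preset range.

    Returns an action label and the relevant worker count.
    """
    if accelerators_per_worker <= 0 or min_replicas > max_replicas:
        raise ValueError("invalid configuration")
    # Ceiling division without floating point.
    initial = -(-total_accelerators // accelerators_per_worker)
    if initial > max_replicas:
        # Caller must add accelerators per worker or reduce service instances.
        return "reconfigure", initial
    if initial < min_replicas:
        return "use_lower_limit", min_replicas
    return "use_initial", initial
```

For example, 8 required accelerators on workers with 8 accelerators each yield an initial number of 1, which is used directly when the preset range is 1 to 4 and raised to 2 when the range is 2 to 4.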
[0119] According to the embodiments of this application, the lower limit value set by the preset quantity range ensures high availability and basic concurrency capability of the service and avoids resource fragmentation, while the upper limit value prevents resource exhaustion and cost overrun. This enables the system to automatically adjust the number of initial model service instances within a controllable range based on real-time load, thereby achieving a dynamic balance between resource utilization, service stability and cost-effectiveness.
[0120] Figure 5 shows a flowchart of a distributed inference method according to an embodiment of this application.
[0121] As shown in Figure 5, the distributed inference method of this embodiment includes operations S510 to S530.
[0122] In operation S510, in response to at least one inference request, a model shard is assigned to the target task indicated by each inference request, yielding a target model shard for each of the at least one target task.
[0123] In operation S520, virtual nodes are allocated to the target model shards of the at least one target task, yielding a target virtual node for each of the at least one target task.
[0124] In operation S530, the container instance corresponding to the target virtual node in the distributed cluster is invoked, and the hardware accelerator resources of the container instance are used to perform inference on the target task and obtain the inference result.
[0125] The inference service is exposed to external users via an HTTP interface, through which users send inference requests. An inference request is a model inference call sent by a user or an upstream service, for example a batch image recognition request.
[0126] The target task is the specific inference task parsed from the inference request, which includes execution context information such as model identifier, input tensor, and output format.
[0127] Based on information such as model identifier and model size in the target task, as well as the status of the initial model service instance and model shard in the distributed cluster, the target task is routed to the model shard corresponding to the appropriate model service instance.
[0128] The target model shard is the model shard that the target task needs to access, such as the 3rd-4th layer shard of the Transformer.
[0129] Based on the loading relationship between model fragments and virtual worker nodes, the target model fragments corresponding to the target task are mapped to the target virtual nodes, thereby obtaining the target virtual worker nodes corresponding to the target task.
[0130] For example, the target task of Residual Network (ResNet) recognition is routed to the target virtual node Worker-2 of the model slice loaded with convolution (Conv) parameters, and the target task of Bidirectional Encoder Representations from Transformers (BERT) prediction is routed to the target virtual node Worker-5 of the model slice loaded with attention mechanism (Attention) parameters.
[0131] A container instance is a pod object that runs the target virtual worker node and is the actual inference environment for the target task inference.
[0132] Figure 6 shows a schematic diagram illustrating the processing of inference requests according to an embodiment of this application.
[0133] As shown in Figure 6, 8 DCUs are deployed on each of the physical nodes Node1, Node2, Node3, and Node4. The 8 DCU resources of physical node Node1 are allocated to container instance 1, the 8 DCU resources of physical node Node2 are allocated to container instance 2, the 8 DCU resources of physical node Node3 are allocated to container instance 3, and the 8 DCU resources of physical node Node4 are allocated to container instance 4. The user sends an inference request to HTTP interface 620 through client 610 to obtain target task 1 and target task 2. Target task 1 is inferred by target model shard 1A and target model shard 2A of the initial model service instance A, and target task 2 is inferred by target model shard 1B and target model shard 2B of the initial model service instance B.
[0134] The target model shard 1A is loaded on the target virtual node 1, and the target model shard 2A is loaded on the target virtual node 2. Therefore, the container instance 1 corresponding to the target virtual node 1 and the container instance 2 corresponding to the target virtual node 2 in the distributed cluster are called to use the resources of the DCU of the container instance 1 and the container instance 2 to perform inference on the target task 1.
[0135] The target model shard 1B is loaded on the target virtual node 3, and the target model shard 2B is loaded on the target virtual node 4. Therefore, the container instance 3 corresponding to the target virtual node 3 and the container instance 4 corresponding to the target virtual node 4 in the distributed cluster are called to use the resources of the DCU of the container instance 3 and the container instance 4 to perform inference on the target task 2.
[0136] The system invokes the container instance corresponding to the target virtual node in the distributed cluster. Inside the container instance, the inference process accesses the resources of the mounted hardware accelerator through the device driver interface, loads the input data of the target task, such as image tensors, into the hardware accelerator's video memory, performs forward computations, such as convolution, attention, and fully connected operations, and finally reads the output tensor from the video memory and serializes it into an inference result to return to the caller, thus completing the end-to-end inference service from request routing and resource scheduling to computation execution.
[0137] The inference result is the predicted value output by the model, such as classification probability, generated text, and bounding box coordinates.
[0138] Ray Serve automatically aggregates and caches intermediate results to obtain the inference results.
[0139] By default, RayServe SVC access type is ClusterIP. When RayService is in the Running state, since ClusterIP is only accessible within the distributed cluster, port 8000 in the container instance needs to be mapped to a port on an external node (such as a local development machine or gateway server) so that external clients can bypass the cluster network isolation and directly call the inference service. In addition, during the system orchestration file deployment phase, the Service configuration can be explicitly declared through the serveService field to specify parameters such as type and port mapping, thereby customizing the service exposure strategy and avoiding manual forwarding afterwards.
[0140] According to the embodiments of this application, a target model shard is determined for the target task indicated by each inference request, and target virtual nodes are dynamically matched according to the target model shard. Each target task is thereby assigned to the optimal container instance, whose hardware accelerator resources are called directly so that multiple target tasks are inferred in parallel, maximizing cluster resource utilization while ensuring high-concurrency inference performance.
[0141] According to an embodiment of this application, virtual node allocation is performed on the target model fragments of at least one target task to obtain the target virtual node of at least one target task, including: determining the target virtual node of at least one target model fragment from a preset mapping relationship, wherein the preset mapping relationship represents the loading relationship between the model fragment and the target virtual working node; and obtaining the target virtual node of at least one target task based on the target model fragments of at least one target task and the target virtual node of at least one target model fragment.
[0142] The pre-defined mapping relationship between model fragments and target virtual working nodes is determined based on the system orchestration file.
[0143] For example, the preset mapping relationships include the loading relationship between model fragment A1 and target virtual working node B1, the loading relationship between model fragment A2 and target virtual working node B2, and the loading relationship between model fragment A3 and target virtual working node B3.
[0144] For example, if the target model segment for target task 1 is A1 and the target model segment for target task 2 is A2, then based on the preset mapping relationship, the target virtual node for inference target task 1 is B1 and the target virtual node for inference target task 2 is B2.
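The lookup in this example can be sketched as a dictionary lookup. The identifiers A1 to A3 and B1 to B3 come from the example above; the function name and the choice to raise an error for an unmapped fragment are assumptions made for illustration.

```python
from __future__ import annotations

# Preset mapping: model fragment -> target virtual working node.
PRESET_MAPPING = {"A1": "B1", "A2": "B2", "A3": "B3"}


def resolve_target_nodes(task_fragments: dict[str, list[str]],
                         mapping: dict[str, str]) -> dict[str, list[str]]:
    """Map each target task to the virtual nodes holding its model fragments."""
    result = {}
    for task, fragments in task_fragments.items():
        missing = [f for f in fragments if f not in mapping]
        if missing:
            raise KeyError(f"no virtual node loaded for fragments {missing}")
        result[task] = [mapping[f] for f in fragments]
    return result
```

Because the mapping is fixed when the cluster is built, resolving a task's virtual nodes at request time is a constant-time lookup per fragment.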
[0145] According to the embodiments of this application, when building a distributed cluster, a preset mapping relationship between model shards and target virtual worker nodes is pre-configured. After starting the inference service of the distributed cluster, after allocating a target model shard to the target task of the inference request, the target virtual node allocated to the target task can be directly determined, and the resources of the hardware accelerator can be directly called to infer the target task within the container instance corresponding to the target virtual node, thereby improving inference efficiency.
[0146] According to an embodiment of this application, the distributed inference method further includes: embedding a collection interface based on a transmission protocol during the inference process of the target task; collecting the fluctuation level of inference performance and the number of tasks based on the collection interface, wherein the inference performance includes at least one of the task inference performance and the resource utilization performance of the hardware accelerator; adjusting the number of inference model shards of the target task according to the fluctuation level of the number of tasks and a preset fluctuation threshold; and adjusting the number of target virtual worker nodes according to the inference performance and a preset performance threshold.
[0147] The transport protocol can be Hypertext Transfer Protocol (HTTP).
[0148] A metrics method is added to the inference service script. Through an HTTP server built on the transport protocol, the metrics method lets the data collection interface obtain information such as the fluctuation level of inference performance and the number of tasks.
[0149] Inference performance includes task inference performance and hardware accelerator resource utilization performance. Task inference performance includes information such as inference latency and communication bandwidth; hardware accelerator resource utilization performance can include hardware accelerator utilization rate, etc.
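A collection interface of this kind might look like the sketch below. It uses only the Python standard library. The `/metrics` path, the JSON layout, and the names `MetricsStore`, `render_metrics`, and `start_metrics_server` are assumptions made for illustration; the application does not specify them.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsStore:
    """Thread-safe holder for the metrics named in the text (field names assumed)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {
            "inference_latency_ms": 0.0,
            "communication_bandwidth_mbps": 0.0,
            "accelerator_utilization": 0.0,
            "task_count": 0,
        }

    def update(self, **values):
        with self._lock:
            self._data.update(values)

    def snapshot(self):
        with self._lock:
            return dict(self._data)


STORE = MetricsStore()


def render_metrics(snapshot):
    """Serialize a metrics snapshot as the HTTP response body."""
    return json.dumps(snapshot, sort_keys=True).encode("utf-8")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":  # hypothetical endpoint path
            self.send_error(404)
            return
        body = render_metrics(STORE.snapshot())
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep per-request logging out of the inference logs


def start_metrics_server(port=0):
    """Serve the metrics on a background thread; port 0 lets the OS pick a port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# The inference service would call update() as requests complete.
STORE.update(inference_latency_ms=87.5, accelerator_utilization=0.62, task_count=3)
print(render_metrics(STORE.snapshot()).decode("utf-8"))
```

In a deployment, the inference service script would call `start_metrics_server` once at startup, and an external collector would poll the endpoint over HTTP.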
[0150] If the fluctuation in the number of tasks exceeds a preset fluctuation threshold, the number of inference model shards for the target task is increased to distribute the load of the distributed cluster.
[0151] For example, if the utilization rate of the hardware accelerator remains too low, the number of nodes may be reduced to save costs; if the inference latency exceeds a preset performance threshold, the number of target virtual worker nodes may be increased.
[0152] According to an embodiment of this application, a data collection interface is added to the inference service script, allowing the interface to be exposed to users via HTTP. This enables the distributed cluster to collect metrics such as the fluctuation level of inference performance and the number of tasks. If the inference service requests fluctuate dynamically or periodically, the initial number of model service instances can be automatically adjusted as needed: when the inference latency exceeds a threshold, the number of virtual worker nodes is automatically increased; when the load decreases, the number is automatically reduced to save resources, thus solving the problem of not being able to obtain performance metrics due to native deployment of the distributed cluster.
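The threshold rules above can be sketched as a single decision function. This is an illustrative simplification under stated assumptions: the step size of one shard or one node, the worker bounds, and the names `Thresholds`, `Observation`, and `plan_scaling` are hypothetical, and a single observation stands in for the sustained low utilization that the text describes.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    task_fluctuation: float  # preset fluctuation threshold
    max_latency_ms: float    # preset performance threshold (latency)
    min_utilization: float   # utilization below this counts as "too low" (assumed)


@dataclass
class Observation:
    task_fluctuation: float  # fluctuation level of the number of tasks
    latency_ms: float        # inference latency
    utilization: float       # hardware accelerator utilization, 0.0 to 1.0


def plan_scaling(obs, thr, shards, workers, min_workers=1, max_workers=8):
    """Return the adjusted (shard count, worker node count)."""
    new_shards, new_workers = shards, workers
    # Task count fluctuating beyond the threshold: add a shard to spread load.
    if obs.task_fluctuation > thr.task_fluctuation:
        new_shards = shards + 1
    # Latency above the threshold: add a worker node, within the allowed range.
    if obs.latency_ms > thr.max_latency_ms:
        new_workers = min(workers + 1, max_workers)
    # Otherwise, low utilization: remove a worker node to save resources.
    elif obs.utilization < thr.min_utilization:
        new_workers = max(workers - 1, min_workers)
    return new_shards, new_workers


thr = Thresholds(task_fluctuation=0.3, max_latency_ms=100.0, min_utilization=0.2)
busy = Observation(task_fluctuation=0.5, latency_ms=120.0, utilization=0.9)
print(plan_scaling(busy, thr, shards=2, workers=3))
```

Checking latency before utilization means a slow but under-utilized cluster is scaled up rather than down, which favors service quality over cost; the opposite priority would be an equally valid design choice.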
[0153] According to an embodiment of this application, the distributed inference method further includes: during the inference process of the target task, detecting the state changes and inference logs of the target virtual node to obtain the detection result; and restarting the inference process of the target task if the detection result is an abnormal result.
[0154] In cross-node deployment mode, multiple container instances correspond to one inference framework service engine. If the container instance where the service engine's service process resides fails, RayServe will automatically restart the inference service process. In other failure scenarios, RayServe will not automatically restart the inference service process, thus causing the service failure to persist.
[0155] Add a health check policy to the inference service script to detect state changes of the target virtual node and inference logs in the inference service, and obtain the detection results.
[0156] The detection results include normal results and abnormal results.
[0157] If the detection result is abnormal, the vllm.engine component can be used to check for faults and restart the inference process of the target task.
[0158] According to the embodiments of this application, a health check strategy is added to the inference service script, which solves the problem of not being able to view inference logs and the state changes of target virtual nodes caused by native deployment of distributed clusters. It continuously monitors the state changes of target virtual nodes and inference logs in the inference service. In the event of a failure, the inference process of the target task is restarted and the failure is resolved in a timely manner, realizing the fault recovery of distributed inference service. When a node is abnormal or a task fails, the service can be automatically restarted or the task can be migrated to ensure the continuous availability of inference service. It reduces manual intervention in service state fluctuation scenarios, reduces operation and maintenance complexity, and enables the system to have self-healing capabilities and observability. The inference service can run stably for a long time and supports high-availability cluster deployment.
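The detect-then-restart flow can be sketched as below. The node state string `"RUNNING"`, the log markers, and the function names are hypothetical. The restart action is passed in as a callback because the application does not describe the interface through which the inference process is restarted.

```python
from typing import Callable, Iterable

# Hypothetical markers that flag a failure in the inference logs.
ERROR_MARKERS = ("ERROR", "Traceback")


def detect(node_state: str, log_lines: Iterable[str]) -> str:
    """Classify the detection result as 'normal' or 'abnormal'."""
    if node_state != "RUNNING":  # assumed name of the healthy state
        return "abnormal"
    if any(marker in line for line in log_lines for marker in ERROR_MARKERS):
        return "abnormal"
    return "normal"


def check_and_recover(
    node_state: str, log_lines: Iterable[str], restart: Callable[[], None]
) -> bool:
    """Restart the inference process on an abnormal result; return True if restarted."""
    if detect(node_state, log_lines) == "abnormal":
        restart()
        return True
    return False


restarts = []
print(check_and_recover("RUNNING", ["INFO request served"], lambda: restarts.append("r")))
print(check_and_recover("FAILED", [], lambda: restarts.append("r")))
```

A monitoring loop would call `check_and_recover` periodically for each target virtual node, so that failures not covered by the framework's own restart behavior are still handled.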
[0159] Figure 7 shows a block diagram suitable for implementing a distributed cluster construction method, a distributed inference method, and a resource scheduler according to embodiments of this application.
[0160] As shown in Figure 7, the resource scheduler 700 according to an embodiment of this application includes a processor 701, which can perform various appropriate actions and processes based on a program stored in read-only memory (ROM) 702 or a program loaded from storage section 708 into random access memory (RAM) 703. The processor 701 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and / or an associated chipset and / or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 701 may also include onboard memory for caching purposes. The processor 701 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of this application.
[0161] RAM 703 stores various programs and data required for the operation of resource scheduler 700. Processor 701, ROM 702, and RAM 703 are interconnected via bus 704. Processor 701 executes various operations of the method flow according to embodiments of this application by executing programs in ROM 702 and / or RAM 703. It should be noted that programs may also be stored in one or more memories other than ROM 702 and RAM 703. Processor 701 may also execute various operations of the method flow according to embodiments of this application by executing programs stored in one or more memories.
[0162] According to embodiments of this application, the resource scheduler 700 may further include an input / output (I / O) interface 705, which is also connected to a bus 704. The resource scheduler 700 may also include one or more of the following components connected to the input / output (I / O) interface 705: an input section 706 including a keyboard, mouse, etc.; an output section 707 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 708 including a hard disk, etc.; and a communication section 709 including a network interface card such as a LAN card, modem, etc. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the input / output (I / O) interface 705 as needed. A removable medium 711, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 710 as needed so that computer programs read from it can be installed into the storage section 708 as needed.
[0163] This application also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The aforementioned computer-readable storage medium carries one or more programs, which, when executed, implement the distributed cluster construction method according to the embodiments of this application.
[0164] According to embodiments of this application, the computer-readable storage medium can be a non-volatile computer-readable storage medium, such as including but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this application, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this application, the computer-readable storage medium may include ROM 702 and / or RAM 703 and / or one or more memories other than ROM 702 and RAM 703 described above.
[0165] Embodiments of this application also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code enables the computer system to implement the distributed cluster construction method provided in the embodiments of this application.
[0166] When the computer program is executed by the processor 701, it performs the functions defined in the system / apparatus of this application embodiment. According to the embodiments of this application, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0167] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and may be downloaded and installed via the communication section 709, and / or installed from a removable medium 711. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0168] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 709, and / or installed from the removable medium 711. When the computer program is executed by the processor 701, it performs the functions defined in the system of this application embodiment. According to the embodiments of this application, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0169] According to embodiments of this application, program code for executing the computer programs provided in the embodiments of this application can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and / or object-oriented programming languages, and / or assembly / machine languages. Programming languages include, but are not limited to, languages such as Java, C++, Python, "C", or similar programming languages. The program code can be executed entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0170] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0171] Those skilled in the art will understand that the features described in the various embodiments of this application can be combined and / or integrated in various ways, even if such combinations or integrations are not explicitly described in this application. In particular, the features described in the various embodiments of this application can be combined and / or integrated in various ways without departing from the spirit and teachings of this application. All such combinations and / or integrations fall within the scope of this application.
Claims
1. A method for constructing a distributed cluster, characterized in that the method is applied to a resource scheduler and includes: in response to a commit operation for a system orchestration file, binding a virtual node created based on the system orchestration file to a physical node equipped with a hardware accelerator, the system orchestration file indicating the desired build state of the distributed cluster; invoking multiple physical nodes to pull a target inference image from an image repository and launching container instances on the multiple physical nodes using the target inference image, wherein the target inference image is obtained by encapsulating the development kit of the hardware accelerator of the physical node and a distributed inference framework; invoking the accelerator plugin of the physical node to allocate the resources of the hardware accelerator on the physical node to the container instance; and constructing the distributed cluster from the multiple container instances.
2. The method according to claim 1, characterized in that, The step of binding virtual nodes created based on system orchestration files to physical nodes equipped with hardware accelerators in response to a commit operation for a system orchestration file includes: The system orchestration file is parsed to obtain master node configuration information and worker node configuration information; Based on the master node configuration information and the worker node configuration information, create a virtual master node and a virtual worker node; The virtual master node is invoked, and multiple model shards of the inference model are loaded onto the virtual worker node based on the system orchestration file and loading rules, so as to support parallel inference of multiple model shards based on multiple virtual worker nodes in task inference. The virtual master node and the plurality of virtual worker nodes are each bound to the physical node equipped with a hardware accelerator.
3. The method according to claim 2, characterized in that, The system orchestration file includes the hardware accelerator resources of the virtual worker node, the number of model shards, and the number of initial model service instances. The number of model shards represents the degree of parallel partitioning within a single initial model service instance. The step of loading multiple model shards onto virtual worker nodes based on the system orchestration file and loading rules includes: The total resource requirements are determined based on the initial number of model service instances and the number of model shards. Based on the total resource requirements and the hardware accelerator resources of the virtual worker node, determine the target virtual worker node; Multiple model shards are each loaded to a target virtual worker node that meets the loading rules, which include loading model shards of the same initial model service instance to different hardware accelerators of the same target virtual worker node, and the hardware accelerator resources are greater than the resources required by the model shard.
4. The method according to claim 3, characterized in that, The system orchestration file also includes a preset range of the number of working nodes; The step of determining the target virtual worker node based on the total resource requirements and the resources of the hardware accelerator of the virtual worker node includes: The initial number of virtual worker nodes is determined based on the total resource requirements and the resources of the hardware accelerators of the virtual worker nodes. If the initial quantity is within the preset quantity range, a target virtual node is determined from the plurality of virtual working nodes based on the initial quantity; If the initial number is greater than the upper limit of the preset number range, increase the number of hardware accelerators configured for the virtual worker node or decrease the initial number of model service instances, and determine the target virtual worker node from the multiple virtual worker nodes based on the initial number. If the initial number is less than the lower limit of the preset number range, the target virtual working node is determined from the plurality of virtual working nodes based on the lower limit.
5. The method according to claim 1, characterized in that the method further includes: replacing the device identifier in the configuration environment with the identifier of the hardware accelerator to obtain a replaced configuration environment; and driving the distributed cluster, based on the replaced configuration environment, to identify the hardware accelerator.
6. A distributed inference method, characterized in that the method is applied to a distributed cluster and includes: in response to at least one inference request, allocating model fragments to the target task indicated by each of the at least one inference request to obtain a target model fragment for each of the at least one target task; allocating virtual nodes to the target model fragments of the at least one target task to obtain a target virtual node for each of the at least one target task; and invoking the container instance corresponding to the target virtual node in the distributed cluster and using the hardware accelerator resources of the container instance to perform inference on the target task to obtain an inference result; wherein the distributed cluster is constructed using the method described in any one of claims 1 to 5.
7. The method according to claim 6, characterized in that allocating virtual nodes to the target model fragments of the at least one target task to obtain the target virtual node of each target task includes: determining the target virtual node of each of the at least one target model fragment from a preset mapping relationship, wherein the preset mapping relationship represents the loading relationship between model fragments and target virtual working nodes; and obtaining the target virtual node of each of the at least one target task based on the target model fragment of each target task and the target virtual node of each target model fragment.
8. The method according to claim 6, characterized in that the method further includes: during the inference process of the target task, embedding a collection interface based on a transmission protocol; collecting the fluctuation level of inference performance and the number of tasks based on the collection interface, wherein the inference performance includes at least one of task inference performance and hardware accelerator resource utilization performance; adjusting the number of inference model fragments for the target task based on the fluctuation level of the number of tasks and a preset fluctuation threshold; and adjusting the number of target virtual worker nodes based on the inference performance and a preset performance threshold.
9. The method according to claim 6, characterized in that the method further includes: during the inference process of the target task, detecting the state changes and inference logs of the target virtual node to obtain a detection result; and restarting the inference process of the target task if the detection result is abnormal.
10. A resource scheduler, comprising: one or more processors; and a memory for storing one or more computer programs, characterized in that the one or more processors execute the one or more computer programs to implement the steps of the method according to any one of claims 1 to 9.