Inference service dynamic scheduling method and device based on multi-source heterogeneous computing power resources
By accessing multi-source heterogeneous computing power resource nodes, performing performance stress testing and dynamic scheduling, the problems of high cost and resource waste in existing inference services are solved, and efficient and flexible inference service resource management and performance optimization are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TAOKEN TECHNOLOGY CO LTD
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-23
Figure CN122268950A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a dynamic scheduling method and device for inference services based on multi-source heterogeneous computing resources.
Background Technology
[0002] With the widespread application of Large Language Models (LLM) and multimodal inference models, the computational demands of inference services are growing exponentially. Existing inference services typically rely on GPU clusters from fixed cloud vendors or self-built computing centers, which are costly, inflexible, and difficult to efficiently schedule based on different regions, hardware performance, and memory capacity.
[0003] On the other hand, there are a large number of idle computing resources distributed throughout the country (such as small and medium-sized data centers, edge nodes, and personal servers). However, due to the significant differences in processor architecture (GPU / NPU / ARM / x86), memory size, and network bandwidth among these computing resources, traditional inference service platforms struggle to fully utilize them, resulting in a waste of computing resources.
[0004] Therefore, a dynamic scheduling method for inference services based on multi-source heterogeneous computing resources is needed to solve the problems existing in the above technical solutions.
Summary of the Invention
[0005] To address this, the present invention provides a dynamic scheduling method and apparatus for inference services based on multi-source heterogeneous computing power resources, in order to solve or at least alleviate the problems mentioned above.
[0006] According to one aspect of the present invention, a dynamic scheduling method for inference services based on multi-source heterogeneous computing resources is provided, executed in a computing device, comprising: accessing multiple computing resource nodes distributed across multiple regions and with multiple architectures, each computing resource node being adapted to deploy a corresponding model to provide inference services; performing performance stress tests on each computing resource node, and determining the weight of each computing resource node based on the performance stress test results, the performance stress test results including a maximum TPM value or a maximum RPM value; receiving an inference request from a client, determining a target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node, and distributing the inference request to the target computing resource node so as to provide inference services to the client based on the model deployed on the target computing resource node.
[0007] Optionally, the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention further includes: collecting real-time inference performance index data of each computing power resource node in real time, and dynamically adjusting the weight of each computing power resource node according to the real-time inference performance index data of each computing power resource node.
[0008] Optionally, the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention further includes: for each computing power resource node, deploying a model of a corresponding scale and adopting a corresponding parallel strategy on the computing power resource node according to the memory capacity, architecture type and architecture characteristics of the computing power resource node.
[0009] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, determining the target computing power resource node according to the specified model information in the inference request and the weight and real-time load of each computing power resource node includes: selecting one or more candidate computing power resource nodes that match the specified model information from the plurality of computing power resource nodes; determining the idle score of each candidate computing power resource node according to the weight and real-time load of each candidate computing power resource node, and determining the candidate computing power resource node with the highest idle score as the target computing power resource node.
[0010] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, determining the idle score of each candidate computing power resource node according to the weight and real-time load of each candidate computing power resource node includes: for each candidate computing power resource node, determining the ratio of the real-time TPM value of the candidate computing power resource node to the maximum TPM value to obtain the real-time load coefficient of the candidate computing power resource node; and determining the idle score of the candidate computing power resource node according to the weight and real-time load coefficient of the candidate computing power resource node.
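The load coefficient and idle score described above can be sketched as follows. The load coefficient (real-time TPM divided by maximum TPM) comes from the text; the specific idle score formula, weight × (1 − load coefficient), is an assumption made for illustration, since this passage does not state how the weight and load coefficient are combined. The `NodeState` type and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    weight: float        # weight derived from the stress test
    max_tpm: float       # maximum TPM measured in the stress test
    realtime_tpm: float  # TPM currently observed on the node


def load_coefficient(node: NodeState) -> float:
    """Real-time load coefficient = real-time TPM / maximum TPM (from the text)."""
    if node.max_tpm <= 0:
        return 1.0  # assumption: treat an unmeasured node as fully loaded
    return min(node.realtime_tpm / node.max_tpm, 1.0)


def idle_score(node: NodeState) -> float:
    """ASSUMPTION: idle score = weight * (1 - load coefficient)."""
    return node.weight * (1.0 - load_coefficient(node))


def pick_target(candidates: list[NodeState]) -> NodeState:
    """Return the candidate with the highest idle score."""
    return max(candidates, key=idle_score)


nodes = [
    NodeState("A", weight=1.0, max_tpm=100, realtime_tpm=20),
    NodeState("B", weight=2.0, max_tpm=200, realtime_tpm=150),
]
target = pick_target(nodes)
```

Under this assumed formula, node A (lightly loaded) is preferred over node B even though B has the larger weight, because B is already running at 75% of its measured capacity.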
[0011] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, the inference request further includes the client's geographical location; determining the idle score of each candidate computing power resource node according to the weight and real-time load of each candidate computing power resource node, and determining the candidate computing power resource node with the highest idle score as the target computing power resource node, further includes: preferentially selecting the candidate computing power resource node closest to the client's geographical location from one or more candidate computing power resource nodes, and distributing the inference request to the nearest candidate computing power resource node; determining the idle score of each candidate computing power resource node according to the weight and real-time load of each candidate computing power resource node, and determining the candidate computing power resource node with the highest idle score as the target computing power resource node; switching the inference request from the nearest candidate computing power resource node to the target computing power resource node.
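A minimal sketch of the proximity-first strategy, under stated assumptions: node locations are (latitude, longitude) pairs, distance is the great-circle distance, and each candidate already carries a precomputed idle score. The dictionary keys, the example coordinates, and the idle score values are hypothetical; the passage does not specify how distance is measured or when the switch to the target node is triggered.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_node(client_loc, candidates):
    """Candidate geographically closest to the client."""
    return min(candidates, key=lambda n: haversine_km(client_loc, n["location"]))


def best_idle_node(candidates):
    """Candidate with the highest idle score."""
    return max(candidates, key=lambda n: n["idle_score"])


def route(client_loc, candidates):
    """Return (initial node, target node): dispatch to the nearest node first,
    then switch to the node with the highest idle score."""
    return nearest_node(client_loc, candidates), best_idle_node(candidates)


candidates = [
    {"name": "bj", "location": (39.9, 116.4), "idle_score": 0.1},
    {"name": "sh", "location": (31.2, 121.5), "idle_score": 0.7},
]
initial, target = route((39.9, 116.4), candidates)
```

In this example the request starts on the nearby, busy node `bj` and is then switched to `sh`, which has more spare capacity.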
[0012] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, determining the target computing power resource node according to the specified model information in the inference request and the weight and real-time load of each computing power resource node includes: determining the target computing power resource node using a weighted random or weighted round-robin algorithm according to the specified model information in the inference request and the weight and real-time load of each computing power resource node.
[0013] Optionally, the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention further includes: in response to the request to reclaim the target computing power resource node, switching the inference request to a standby computing power resource node.
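The reclaim behavior can be illustrated with a small sketch. It assumes the scheduler keeps a map from request identifiers to node identifiers and an ordered list of standby nodes; the `Router` class, its fields, and the node names are hypothetical and are not taken from the source.

```python
class Router:
    def __init__(self, assignments, standby):
        self.assignments = dict(assignments)  # request id -> node id
        self.standby = list(standby)          # ordered standby node ids

    def on_reclaim(self, node_id):
        """Move every request on the reclaimed node to the first standby node
        and return the list of request ids that were switched."""
        fallback = next((n for n in self.standby if n != node_id), None)
        if fallback is None:
            raise RuntimeError("no standby node available")
        moved = [r for r, n in self.assignments.items() if n == node_id]
        for r in moved:
            self.assignments[r] = fallback
        return moved


router = Router({"req-1": "node-a", "req-2": "node-b"}, standby=["node-c"])
moved = router.on_reclaim("node-a")
```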
[0014] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, accessing multiple computing power resource nodes distributed across multiple regions and multiple architectures includes: for each computing power resource node, obtaining and storing the node attribute information registered by the computing power resource node, wherein the node attribute information includes the architecture type, architecture characteristics, graphics card model, video memory capacity, network bandwidth, and geographical location of the computing power resource node; and generating a resource identifier corresponding to the computing power resource node based on the node attribute information and including it in the scheduling pool.
[0015] Optionally, in the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention, the real-time inference performance index data includes one or more of a real-time TPM value, latency, and error rate; dynamically adjusting the weight of each computing power resource node according to the real-time inference performance index data of each computing power resource node includes: determining the health of each computing power resource node according to its real-time inference performance index data, and dynamically adjusting the weight of each computing power resource node according to its health, wherein if the health of a computing power resource node is less than a health threshold, the weight of that computing power resource node is reduced.
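One possible reading of the health-based adjustment is sketched below. The source only states that health is derived from the performance index data and that the weight is reduced when health falls below a threshold; the health formula, the latency budget, the threshold of 0.8, and the decay factor of 0.5 are all assumptions chosen for illustration.

```python
def health(latency_ms: float, error_rate: float,
           latency_budget_ms: float = 2000.0) -> float:
    """ASSUMPTION: health in [0, 1] = (1 - error rate) * latency factor,
    where the latency factor drops below 1 once latency exceeds the budget."""
    latency_factor = min(1.0, latency_budget_ms / latency_ms) if latency_ms > 0 else 1.0
    return max(0.0, 1.0 - error_rate) * latency_factor


def adjust_weight(weight: float, node_health: float,
                  threshold: float = 0.8, decay: float = 0.5) -> float:
    """Reduce the weight when health is below the threshold; otherwise keep it."""
    if node_health < threshold:
        return weight * decay
    return weight


healthy = adjust_weight(2.0, health(latency_ms=1000.0, error_rate=0.0))
degraded = adjust_weight(2.0, health(latency_ms=4000.0, error_rate=0.1))
```

A multiplicative decay is used here so that repeated unhealthy observations keep lowering the weight; a fixed decrement would be an equally valid reading of the text.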
[0016] Optionally, the dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources according to the present invention further includes: recording the real-time inference performance index data of each computing power resource node in a time series to generate a performance log; and generating and displaying a computing power performance graph based on the performance log.
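As an illustration of the performance log, the sketch below keeps timestamped metric records per node in memory and extracts a time-ordered series that a plotting library could render as the performance graph. The `PerformanceLog` class, its method names, and the metric keys are hypothetical; the source does not describe the storage format or the plotting tool.

```python
import time
from collections import defaultdict


class PerformanceLog:
    def __init__(self):
        self._records = defaultdict(list)  # node id -> [(timestamp, metrics)]

    def record(self, node_id, tpm, latency_ms, error_rate, ts=None):
        """Append one observation; the current time is used if ts is omitted."""
        ts = time.time() if ts is None else ts
        metrics = {"tpm": tpm, "latency_ms": latency_ms, "error_rate": error_rate}
        self._records[node_id].append((ts, metrics))

    def series(self, node_id, metric):
        """Return (timestamps, values) sorted by time, ready for plotting."""
        rows = sorted(self._records[node_id], key=lambda r: r[0])
        return [ts for ts, _ in rows], [m[metric] for _, m in rows]


log = PerformanceLog()
log.record("n1", tpm=120, latency_ms=300, error_rate=0.0, ts=2.0)
log.record("n1", tpm=90, latency_ms=450, error_rate=0.01, ts=1.0)
timestamps, tpm_values = log.series("n1", "tpm")
```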
[0017] According to one aspect of the present invention, a dynamic scheduling apparatus for inference services is provided, deployed in a computing device and suitable for performing the method described above, the apparatus comprising: a resource access module, adapted to access multiple computing resource nodes distributed across multiple regions and with multiple architectures, each computing resource node being adapted to deploy a corresponding model to provide inference services; a weight configuration module, adapted to perform performance stress tests on each computing resource node and determine the weight of each computing resource node based on the performance stress test results, the performance stress test results including a maximum TPM value or a maximum RPM value; and a scheduling gateway module, adapted to receive an inference request from a client, determine a target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node, and distribute the inference request to the target computing resource node so as to provide inference services to the client based on the model deployed on the target computing resource node.
[0018] Optionally, the inference service dynamic scheduling device according to the present invention further includes: a performance monitoring module, adapted to collect real-time inference performance index data of each computing power resource node in real time and send it to the scheduling gateway module, so that the scheduling gateway module can dynamically adjust the weight of each computing power resource node according to the real-time inference performance index data of each computing power resource node.
[0019] According to one aspect of the present invention, a computing device is provided, comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions including instructions for executing the dynamic scheduling method for inference services based on multi-source heterogeneous computing resources as described above.
[0020] According to one aspect of the present invention, a computer program product is provided, comprising computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method as described above.
[0021] According to one aspect of the present invention, a readable storage medium storing program instructions is provided, which, when read and executed by a computing device, causes the computing device to perform the dynamic scheduling method for inference services based on multi-source heterogeneous computing resources as described above.
[0022] According to the technical solution of this invention, a dynamic scheduling method for inference services based on multi-source heterogeneous computing power resources is provided. This method deploys models by accessing multiple computing power resource nodes distributed across multiple regions and with various architectures. The weights of each computing power resource node are determined based on performance stress test results. When an inference request is received from a client, a target computing power resource node is determined based on the specified model information in the request, the weights of each computing power resource node, and its real-time load. The inference request is then distributed to the target computing power resource node, thereby providing inference services to the client based on the model deployed on the target computing power resource node. Based on this, the present invention can integrate multi-source heterogeneous computing power resources and dynamically allocate inference requests according to the weights and real-time loads of each computing power resource node, thus enabling full and efficient utilization of computing power resources while reducing computing costs.
[0023] Furthermore, this invention dynamically adjusts the weight of each computing resource node based on the real-time inference performance index data of each computing resource node, thereby enabling continuous optimization of the performance of each computing resource node and achieving a closed loop of performance optimization.
[0024] Furthermore, this invention supports a location-based proximity-first strategy, which can reduce network latency and improve inference response speed by prioritizing the distribution of inference requests to candidate computing resource nodes that are geographically closest to the client.
[0025] The above description is merely an overview of the technical solution of the present invention. In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are described below.
Attached Figure Description
[0026] To achieve the foregoing and related objectives, certain illustrative aspects are described herein in conjunction with the following description and accompanying drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The foregoing and other objectives, features, and advantages of the invention will become more apparent from the following detailed description, taken in conjunction with the accompanying drawings. Throughout the invention, the same reference numerals generally refer to the same parts or elements.
[0027] Figure 1 shows a schematic diagram of a computing device 100 according to an embodiment of the present invention; Figure 2 shows a flowchart of a dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources according to an embodiment of the present invention; Figure 3 shows a schematic diagram of a dynamic scheduling device 300 for inference services according to an embodiment of the present invention.
Detailed Implementation
[0028] Exemplary embodiments of the invention will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the invention and to fully convey the scope of the invention to those skilled in the art.
[0029] To address the issues of high computing power costs and difficulty in fully utilizing computing resources in existing inference services, this invention proposes a dynamic scheduling method for inference services based on multi-source heterogeneous computing resources. This method can integrate multi-source heterogeneous computing resources and dynamically allocate inference requests based on the weight and real-time load of each computing resource node, thereby enabling full and efficient utilization of computing resources while reducing computing costs.
[0030] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0031] Figure 1 shows a schematic diagram of a computing device 100 according to an embodiment of the present invention. As shown in Figure 1, in a basic configuration, computing device 100 includes at least one processing unit 102 and system memory 104. According to one aspect, depending on the configuration and type of the computing device, the processing unit 102 may be implemented as a processor. System memory 104 includes, but is not limited to, volatile memory (e.g., random access memory), non-volatile memory (e.g., read-only memory), flash memory, or any combination of such memories. According to one aspect, system memory 104 includes an operating system 105.
[0032] According to one aspect, operating system 105 is, for example, suitable for controlling the operation of computing device 100. Furthermore, examples are practiced in conjunction with graphics libraries, other operating systems, or any other applications, and are not limited to any particular application or system. This basic configuration is illustrated in Figure 1 by the components within the dashed lines. According to one aspect, the computing device 100 has additional features or functions. For example, according to one aspect, the computing device 100 includes additional data storage devices (removable and / or non-removable), such as magnetic disks, optical discs, or magnetic tapes. Such additional storage is illustrated in Figure 1 by removable storage device 109 and non-removable storage device 110.
[0033] As stated above, according to one aspect, program module 103 is stored in system memory 104. According to one aspect, program module 103 may include one or more applications. The present invention does not limit the type of application; for example, applications may include: email and contact applications, word processing applications, spreadsheet applications, database applications, slideshow applications, drawing or computer-aided applications, web browser applications, etc.
[0034] According to one aspect, program module 103 may include a plurality of program instructions adapted to execute the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention, such that computing device 100 is configured to execute the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention.
[0035] According to one aspect, program module 103 may include inference service dynamic scheduling device 300, which may be configured to execute the inference service dynamic scheduling method 200 based on multi-source heterogeneous computing resources of the present invention.
[0036] According to one aspect, examples can be practiced on circuits including discrete electronic components, packaged or integrated electronic chips containing logic gates, circuits utilizing microprocessors, or on a single chip containing electronic components or a microprocessor. For example, examples can be practiced via a System-on-a-Chip (SOC), in which each or many of the components shown in Figure 1 are integrated on a single integrated circuit. According to one aspect, such an SOC device may include one or more processing units, graphics units, communication units, system virtualization units, and various application functions, all integrated (or “burned in”) as a single integrated circuit onto a chip substrate. When operating via the SOC, the functions described herein can be operated via dedicated logic integrated on a single integrated circuit (chip) with other components of the computing device 100. Embodiments of the invention can also be implemented using other techniques capable of performing logical operations (e.g., AND, OR, and NOT), including but not limited to mechanical, optical, fluid, and quantum technologies. Additionally, embodiments of the invention can be implemented within a general-purpose computer or in any other circuit or system.
[0037] According to one aspect, computing device 100 may also have one or more input devices 112, such as a keyboard, mouse, pen, voice input device, touch input device, etc. It may also include output devices 114, such as a display, speaker, printer, etc. The foregoing devices are examples and other devices may also be used. Computing device 100 may include one or more communication connections 116 that allow communication with other computing devices 118. Examples of suitable communication connections 116 include, but are not limited to: RF transmitter, receiver and / or transceiver circuitry; Universal Serial Bus (USB), parallel and / or serial ports.
[0038] As used herein, the term computer-readable medium includes computer storage media. Computer storage media can include volatile and non-volatile, removable and non-removable media implemented using any method or technology for storing information (e.g., computer-readable instructions, data structures, or program module 103). System memory 104, removable storage device 109, and non-removable storage device 110 are examples of computer storage media (i.e., memory storage). Computer storage media can include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other article of manufacture that can be used to store information and is accessible by computing device 100. According to one aspect, any such computer storage medium can be part of computing device 100. Computer storage media do not include carrier waves or other transmitted data signals.
[0039] According to one aspect, the communication medium is implemented by computer-readable instructions, data structures, program modules 103, or other data in a modulated data signal (e.g., a carrier wave or other transmission mechanism), and includes any information transmission medium. According to one aspect, the term "modulated data signal" describes a signal that has one or more of its characteristics set or altered in a manner that encodes information in the signal. By way of example and not limitation, the communication medium includes wired media such as wired networks or direct wired connections, and wireless media such as acoustic, radio frequency (RF), infrared, and other wireless media.
[0040] In an embodiment of the present invention, a computing device 100 is configured to execute the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention. The computing device 100 includes one or more processors and one or more readable storage media storing program instructions, which, when executed by the one or more processors, cause the computing device to execute the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention.
[0041] Figure 2 shows a flowchart of a dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources according to an embodiment of the present invention. The dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources can be executed in a computing device (e.g., the aforementioned computing device 100).
[0042] In embodiments of the present invention, the computing device 100 for executing the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention can be a service platform. Specifically, the computing device 100 can be implemented as a model-as-a-service platform or an inference service platform. The computing device 100 can communicate with one or more clients.
[0043] As shown in Figure 2, the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing power resources includes at least the following steps 210-230.
[0044] Step 210: The computing device 100 can connect to multiple computing resource nodes distributed across multiple regions and with various architectures. Each computing resource node can deploy a corresponding model to provide inference services.
[0045] It should be noted that the computing resource nodes connected to the computing device 100 can be provided by one or more upstream suppliers. This invention supports unified access to multiple computing resource nodes with different architectures, and also supports access to idle computing resource nodes from different regions (such as small and medium-sized data centers, edge nodes, personal servers, etc.), thereby integrating multi-source heterogeneous computing resources to make full use of computing resources.
[0046] In some embodiments, the model deployed on each computing resource node can be a large language model (LLM) or a multimodal inference model, so as to provide inference services based on the model.
[0047] In some embodiments, in step 210, for each computing resource node to be accessed, the computing device 100 can obtain the node attribute information registered by the computing resource node and store this information in a configuration file. The node attribute information of the computing resource node may include its architecture type (GPU / NPU), architecture characteristics (x86 / ARM), graphics card model, video memory capacity, network bandwidth, geographical location, and other information. Furthermore, the computing device 100 can generate a resource identifier (unique resource identifier) corresponding to the computing resource node based on its node attribute information and include this identifier in a scheduling pool so that inference service scheduling can be performed based on the computing resource nodes corresponding to each resource identifier in the scheduling pool. In this embodiment, the node attribute information of the computing resource node and its corresponding resource identifier can be associated and stored in a configuration file.
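The registration flow in this paragraph can be sketched as follows. The attribute fields mirror the node attribute information listed in the text; deriving the resource identifier from a SHA-256 hash of those attributes is an assumption, as are the `res-` prefix, the `NodeAttributes` and `SchedulingPool` names, and the example values. The source only states that the identifier is generated from the node attribute information.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class NodeAttributes:
    arch_type: str       # e.g. "GPU" or "NPU"
    arch_features: str   # e.g. "x86" or "ARM"
    card_model: str      # graphics card model
    vram_gb: int         # video memory capacity
    bandwidth_mbps: int  # network bandwidth
    location: str        # geographical location


def resource_id(attrs: NodeAttributes) -> str:
    """ASSUMPTION: identifier = short hash of the serialized attributes."""
    payload = json.dumps(asdict(attrs), sort_keys=True).encode("utf-8")
    return "res-" + hashlib.sha256(payload).hexdigest()[:12]


class SchedulingPool:
    def __init__(self):
        self.nodes = {}  # resource identifier -> node attributes

    def register(self, attrs: NodeAttributes) -> str:
        """Store the attributes under the generated identifier and return it."""
        rid = resource_id(attrs)
        self.nodes[rid] = attrs
        return rid


pool = SchedulingPool()
attrs = NodeAttributes("GPU", "x86", "example-card", 80, 1000, "Beijing")
rid = pool.register(attrs)
```

Note that a purely attribute-derived hash gives two nodes with identical attributes the same identifier; a real deployment would likely mix in a per-node value such as a hostname or serial number.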
[0048] In some embodiments, in step 210, for each accessed computing resource node, the computing device 100 can deploy a model of appropriate scale and adopt a corresponding parallel strategy (tensor parallelism) on the computing resource node according to its memory capacity, architecture type, and architecture characteristics, thereby achieving heterogeneous model deployment. This allows the same model to be deployed on computing resource nodes with different architectures (GPU / NPU), enabling heterogeneous distribution among computing resource nodes with different architectures. For example, in some embodiments, for computing resource nodes with large memory capacity, large models such as DeepSeek and Llama70B can be deployed, and a lower tensor parallelism (e.g., tp=4) can be used for large models; for computing resource nodes with small memory capacity, small models such as Baichuan and Llama7B can be deployed, and a higher tensor parallelism (e.g., tp=9) can be used for small models.
[0049] Furthermore, the computing device 100 can deploy models and manage models (including loading models to provide inference services) on the computing resource nodes according to the architecture type of the computing resource nodes, using the corresponding inference engine. For example, CUDA or TensorRT inference engines can be used for GPU computing resource nodes, and Ascend or other adaptive inference engines can be used for NPU computing resource nodes.
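A hedged sketch of how a deployment plan might be chosen from a node's attributes. The model names, the tp values, and the engine labels are taken from the examples in the text; the 80 GB memory threshold, the `DeploymentPlan` type, and the function name are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeploymentPlan:
    model: str
    tensor_parallel: int
    engine: str


# Engine labels follow the examples in the text (TensorRT for GPU, Ascend for NPU).
ENGINES = {"GPU": "TensorRT", "NPU": "Ascend"}


def plan_deployment(arch_type: str, vram_gb: int,
                    large_vram_threshold_gb: int = 80) -> DeploymentPlan:
    """Pick a model scale, tensor parallelism, and engine for one node.

    ASSUMPTION: a single memory threshold separates "large" from "small" nodes.
    """
    if arch_type not in ENGINES:
        raise ValueError(f"unsupported architecture type: {arch_type}")
    engine = ENGINES[arch_type]
    if vram_gb >= large_vram_threshold_gb:
        return DeploymentPlan("Llama70B", 4, engine)  # tp=4, per the text's example
    return DeploymentPlan("Llama7B", 9, engine)       # tp=9, per the text's example


big_gpu = plan_deployment("GPU", 80)
small_npu = plan_deployment("NPU", 16)
```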
[0050] In some embodiments, the computing device 100 may store the model information (including the model name) of the model deployed on each computing resource node (associated with the resource identifier corresponding to the computing resource node) in a configuration file.
[0051] Step 220: The computing device 100 can perform performance stress tests on each connected computing resource node, obtain the performance stress test results of each computing resource node, and determine the weight of each computing resource node based on the performance stress test results of each computing resource node.
[0052] In this embodiment of the invention, the performance stress test results include the maximum TPM (Tokens Per Minute) value or the maximum RPM (Requests Per Minute) value. The maximum TPM value represents the maximum number of tokens the model can process per minute, and the maximum RPM value represents the maximum number of requests the model can respond to per minute.
[0053] In this embodiment of the invention, the computing device 100 can determine the weight configuration ratio of each computing resource node based on the performance stress test results (maximum TPM value or maximum RPM value) of each computing resource node, and then determine the weight of each computing resource node based on the weight configuration ratio of each computing resource node.
[0054] For example, in some embodiments, performance test results include a maximum TPM value. The maximum TPM value of a computing resource node represents the maximum number of tokens that the model of that computing resource node can process per minute. The computing device 100 can determine the weight configuration ratio of each computing resource node based on its maximum TPM value. For example, if the maximum TPM value of computing resource node A is 100 and the maximum TPM value of computing resource node B is 200, then the weight configuration ratio of computing resource node A to computing resource node B is determined to be 1:2. Furthermore, the weight of each computing resource node can be determined based on its weight configuration ratio. In addition, the weight of each computing resource node can be written into a configuration file and stored.
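The weight configuration ratio in the example above (100 : 200 reducing to 1 : 2) can be reproduced with a short sketch. Reducing the maximum TPM values by their greatest common divisor is one straightforward way to obtain integer weights; the source does not prescribe this particular reduction, and the function name is hypothetical.

```python
from functools import reduce
from math import gcd


def weights_from_max_tpm(max_tpm: dict[str, int]) -> dict[str, int]:
    """Reduce maximum TPM values to the smallest integer ratio and use it as weights."""
    if not max_tpm:
        return {}
    divisor = reduce(gcd, max_tpm.values())
    if divisor == 0:
        raise ValueError("at least one node must have a non-zero maximum TPM")
    return {node: value // divisor for node, value in max_tpm.items()}


weights = weights_from_max_tpm({"A": 100, "B": 200})
```

Here `weights` maps node A to 1 and node B to 2, matching the 1:2 ratio in the text.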
[0055] Step 230: The computing device 100 can receive inference requests from clients. Subsequently, it can extract specified model information from the inference request. Based on the specified model information in the inference request (such as the specified model name and specified token requirement information) and the weight and real-time load of each computing power resource node, it determines the target computing power resource node. Then, it distributes the inference request to the target computing power resource node so as to provide inference services (execute inference tasks) to the client based on the model deployed on the target computing power resource node.
[0056] Based on this, the present invention can integrate multi-source heterogeneous computing power resources and dynamically allocate inference tasks according to the weight and real-time load of each computing power resource node, thereby making full use of computing power resources.
[0057] In some embodiments, the computing device 100 may determine the target computing resource node by using a weighted random or weighted round-robin algorithm based on the specified model information in the inference request and the weight and real-time load of each computing resource node.
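A weighted random selection of this kind might look like the following minimal sketch. It uses only the standard-library `random.choices` call; the function name `pick_weighted_random` is hypothetical, and how the real-time load is folded into the weights is not specified in the description, so the sketch takes already-computed weights as input.

```python
import random
from typing import Optional


def pick_weighted_random(
    weights: dict[str, float], rng: Optional[random.Random] = None
) -> str:
    """Pick one node ID with probability proportional to its weight."""
    if not weights:
        raise ValueError("no candidate nodes")
    rng = rng or random.Random()
    node_ids = list(weights)
    node_weights = [weights[node_id] for node_id in node_ids]
    return rng.choices(node_ids, weights=node_weights, k=1)[0]


rng = random.Random(42)  # fixed seed only to make the demonstration repeatable
counts = {"A": 0, "B": 0}
for _ in range(3000):
    counts[pick_weighted_random({"A": 1, "B": 2}, rng)] += 1
print(counts)  # B is drawn roughly twice as often as A
```

One possible way to account for real-time load, under the same assumptions, is to pass load-adjusted weights to the function instead of the static stress-test weights.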
[0058] In some embodiments, the target computing resource node can load the model deployed on the target computing resource node through the corresponding inference engine and provide inference services to the client.
[0059] In some embodiments, in step 230, after receiving an inference request from a client, the computing device 100 can first select one or more candidate computing resource nodes that match the specified model information from the multiple accessed computing resource nodes (each computing resource node corresponding to each resource identifier in the resource pool). Specifically, the model information corresponding to each computing resource node can be matched with the specified model information to determine one or more candidate computing resource nodes that match the specified model information. A candidate node list can be formed based on one or more candidate computing resource nodes.
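The matching step described above can be sketched as a simple filter over the resource pool. This is an illustration under stated assumptions: the `NodeRecord` type, its field names, and matching by exact model name are hypothetical simplifications, since the description does not fix a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    resource_id: str  # resource identifier in the resource pool
    model_name: str   # name of the model deployed on the node


def candidate_nodes(pool: list[NodeRecord], requested_model: str) -> list[NodeRecord]:
    """Return the nodes whose deployed model matches the requested model name."""
    return [node for node in pool if node.model_name == requested_model]


pool = [
    NodeRecord("node-1", "model-x"),
    NodeRecord("node-2", "model-y"),
    NodeRecord("node-3", "model-x"),
]
candidates = candidate_nodes(pool, "model-x")
print([node.resource_id for node in candidates])  # ['node-1', 'node-3']
```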
[0060] Furthermore, the computing device 100 can determine the idle score of each candidate computing resource node based on the weight and real-time load of each candidate computing resource node, and determine the candidate computing resource node with the highest idle score as the target computing resource node.
[0061] In some embodiments, the computing device 100 may determine the idle score of each candidate computing resource node in the following manner: for each candidate computing resource node, the ratio of the node's real-time TPM value to its maximum TPM value (i.e., the maximum TPM value obtained in the performance stress test in step 220) is calculated and used as the real-time load coefficient of that node.
[0062] Furthermore, the idle score of a candidate computing resource node can be determined based on its weight and real-time load coefficient. The idle score is calculated as follows: available_score = weight * (1 - load_ratio), where available_score represents the idle score, weight represents the weight, and load_ratio represents the real-time load coefficient.
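The idle score calculation and the selection of the highest-scoring candidate can be sketched as follows. The formula is the one given above; the `Candidate` type, the function names, the sample values, and the clamping of the load ratio to the range 0 to 1 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    resource_id: str
    weight: float
    realtime_tpm: float
    max_tpm: float  # maximum TPM measured in the stress test of step 220


def load_ratio(node: Candidate) -> float:
    """Real-time TPM divided by maximum TPM, clamped to [0, 1] (an assumption)."""
    if node.max_tpm <= 0:
        raise ValueError("max_tpm must be positive")
    return min(max(node.realtime_tpm / node.max_tpm, 0.0), 1.0)


def available_score(node: Candidate) -> float:
    """available_score = weight * (1 - load_ratio)."""
    return node.weight * (1.0 - load_ratio(node))


def pick_target(candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest idle score."""
    return max(candidates, key=available_score)


node_a = Candidate("A", weight=1, realtime_tpm=20, max_tpm=100)   # load 0.20
node_b = Candidate("B", weight=2, realtime_tpm=150, max_tpm=200)  # load 0.75
target = pick_target([node_a, node_b])
print(target.resource_id)  # A
```

In this example node B has the larger weight, but its higher load leaves it an idle score of 0.5 against 0.8 for node A, so node A is chosen.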
[0063] In some embodiments, the inference request also includes the client's geographical location. That is, after receiving an inference request from the client, the computing device 100 can also obtain the client's geographical location from the inference request. In step 230, after selecting one or more candidate computing resource nodes that match the specified model information from multiple computing resource nodes, the computing device 100 can, based on the node attribute information (geographical location) of each candidate computing resource node, preferentially select the candidate computing resource node that is closest to the client's geographical location from the one or more candidate computing resource nodes, and first distribute the inference request to the nearest candidate computing resource node.
[0064] After distributing inference requests to the nearest candidate computing power resource nodes, the idle score of each candidate computing power resource node can be determined based on its weight and real-time load. The candidate computing power resource node with the highest idle score is then identified as the target computing power resource node. The inference request is then switched from the nearest candidate computing power resource node to the target computing power resource node.
[0065] It should be understood that by prioritizing the distribution of inference requests to candidate computing resource nodes that are geographically closest to the client, network latency can be reduced and inference response speed can be improved.
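A proximity-first selection might be sketched as below. The description does not say how geographical distance is measured; this sketch assumes that locations are stored as (latitude, longitude) pairs in degrees and uses the great-circle (haversine) distance. The function names and node identifiers are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_candidate(
    client: tuple[float, float], candidates: dict[str, tuple[float, float]]
) -> str:
    """Return the ID of the candidate node closest to the client."""
    return min(candidates, key=lambda node_id: haversine_km(client, candidates[node_id]))


client = (39.9, 116.4)  # roughly Beijing
candidates = {
    "node-east": (31.2, 121.5),   # roughly Shanghai
    "node-south": (23.1, 113.3),  # roughly Guangzhou
}
nearest = nearest_candidate(client, candidates)
print(nearest)  # node-east
```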
[0066] In some embodiments, after distributing inference requests to target computing resource nodes, if an upstream supplier requests the reclamation of the target computing resource node, the computing device 100 can also respond to the reclamation request from the upstream supplier by switching the inference requests to standby computing resource nodes to provide inference services to clients instead of the target computing resource node. Standby computing resource nodes include, for example, Volcano Cloud and Alibaba Cloud. Based on the dynamic replacement mechanism of standby computing resource nodes, this invention supports the reclamation of computing resource nodes and ensures the continuity of inference services.
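The replacement mechanism can be pictured with the following minimal sketch. The `FailoverRouter` class, its method names, and the policy of always using the first configured standby node are assumptions for illustration; the description does not specify how a standby node is chosen.

```python
class FailoverRouter:
    """Tracks which node serves each request and moves requests off reclaimed nodes."""

    def __init__(self, standby_nodes: list[str]) -> None:
        if not standby_nodes:
            raise ValueError("at least one standby node is required")
        self.standby_nodes = list(standby_nodes)
        self.assignments: dict[str, str] = {}  # request ID -> node ID

    def assign(self, request_id: str, node_id: str) -> None:
        """Record that a request has been distributed to a node."""
        self.assignments[request_id] = node_id

    def reclaim(self, node_id: str) -> list[str]:
        """Switch every request on the reclaimed node to the first standby node."""
        standby = self.standby_nodes[0]
        moved = [rid for rid, nid in self.assignments.items() if nid == node_id]
        for rid in moved:
            self.assignments[rid] = standby
        return moved


router = FailoverRouter(standby_nodes=["standby-1"])
router.assign("req-1", "node-7")
router.assign("req-2", "node-8")
moved = router.reclaim("node-7")  # upstream supplier reclaims node-7
print(moved, router.assignments)
```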
[0067] In some embodiments, the computing device 100 can also collect inference performance index data of each computing resource node in real time, and dynamically adjust the weight of each computing resource node based on the real-time inference performance index data. It should be understood that, once the weights have been adjusted, the target computing resource node is determined and inference requests are distributed based on the adjusted weight of each computing resource node and its real-time load (real-time load coefficient). In this way, the performance of each computing resource node can be continuously optimized, forming a closed loop of performance optimization.
[0068] Specifically, the computing device 100 can determine whether the performance of each computing resource node is abnormal based on the real-time inference performance index data of each computing resource node. If the performance of a computing resource node is abnormal, the weight of that computing resource node can be reduced, and an alarm can be triggered.
[0069] In some embodiments, real-time inference performance metrics may include one or more of real-time TPM values, latency, and error rate. Latency may be determined based on the TTFT (Time To First Token) metric and / or TPOT (Time Per Output Token) metric. The computing device 100 can determine the health of each computing resource node based on its real-time inference performance metrics and dynamically adjust the weight of each computing resource node based on its health. Specifically, if the health of a computing resource node is less than a health threshold (indicating abnormal performance), its weight can be reduced, or the resource identifier corresponding to that node can be removed from the scheduling pool (stopping the allocation of inference requests to that computing resource node).
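The threshold-based adjustment described above might be sketched as follows. The health values are taken as given inputs. The function name, the numeric thresholds, the halving penalty, and the use of a second, lower threshold to decide between reducing a weight and removing the node from the scheduling pool are all assumptions for illustration; the description only states that either action may be taken when the health falls below the health threshold.

```python
def adjust_weights(
    weights: dict[str, float],
    health: dict[str, float],
    health_threshold: float = 0.6,   # hypothetical value
    removal_threshold: float = 0.3,  # hypothetical value
    penalty: float = 0.5,            # hypothetical reduction factor
) -> dict[str, float]:
    """Return adjusted weights; nodes missing from the result have left the pool."""
    adjusted: dict[str, float] = {}
    for node_id, weight in weights.items():
        score = health[node_id]
        if score < removal_threshold:
            continue  # stop allocating inference requests to this node
        if score < health_threshold:
            weight = weight * penalty  # abnormal performance: reduce the weight
        adjusted[node_id] = weight
    return adjusted


weights = {"A": 2.0, "B": 2.0, "C": 1.0}
health = {"A": 0.9, "B": 0.5, "C": 0.1}
print(adjust_weights(weights, health))  # {'A': 2.0, 'B': 1.0}
```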
[0070] In some embodiments, real-time inference performance metrics include real-time TPM value, latency, and error rate. The computing device 100 can calculate the health score of the computing resource node using the following formula:
[0071] The quantities used in the formula are: the real-time TPM value of the computing resource node; the maximum TPM value of the computing resource node; the average latency; the latency threshold; and the error rate.
[0072] In some embodiments, the computing device 100 collects inference performance index data of each computing resource node in real time, records the data of each computing resource node as a time series to generate a performance log, and then generates and displays a Power performance graph based on the performance log. The Power performance graph visualizes the performance trend and load status of the computing resource nodes.
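A time-series performance log of this kind could be sketched as below. The `Sample` and `PerformanceLog` types and their field names are hypothetical; the sketch only shows how samples might be recorded and turned into a chronologically ordered series that a plotting tool could draw, and it does not render the graph itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float     # seconds since the start of monitoring
    realtime_tpm: float
    latency_ms: float
    error_rate: float


class PerformanceLog:
    """Keeps per-node samples and exposes them as time-ordered series."""

    def __init__(self) -> None:
        self._samples: dict[str, list[Sample]] = defaultdict(list)

    def record(self, node_id: str, sample: Sample) -> None:
        self._samples[node_id].append(sample)

    def series(self, node_id: str, metric: str) -> list[tuple[float, float]]:
        """Return (timestamp, value) pairs for one metric, sorted by time."""
        ordered = sorted(self._samples.get(node_id, []), key=lambda s: s.timestamp)
        return [(s.timestamp, getattr(s, metric)) for s in ordered]


log = PerformanceLog()
log.record("node-1", Sample(120.0, 900.0, 210.0, 0.01))
log.record("node-1", Sample(60.0, 800.0, 180.0, 0.0))
print(log.series("node-1", "realtime_tpm"))  # [(60.0, 800.0), (120.0, 900.0)]
```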
[0073] Figure 3 shows a schematic diagram of a dynamic scheduling device 300 for inference services according to an embodiment of the present invention. The dynamic scheduling device 300 for inference services can be deployed in a computing device 100, and is configured to execute the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing resources of the present invention.
[0074] As shown in Figure 3, in an embodiment of the present invention, the inference service dynamic scheduling device 300 includes a resource access module 310, a weight configuration module 320, and a scheduling gateway module 330 that are connected in sequence.
[0075] Among them, the resource access module 310 can access multiple computing resource nodes distributed in multiple regions and with multiple architectures. Each computing resource node can deploy a corresponding model to provide inference services.
[0076] The weight configuration module 320 can perform performance stress tests on each computing resource node and determine the weight of each computing resource node based on the performance stress test results. The performance stress test results include the maximum TPM value or the maximum RPM value.
[0077] The scheduling gateway module 330 can receive inference requests from clients, determine target computing resource nodes based on the specified model information in the inference request and the weight and real-time load of each computing resource node, and distribute the inference request to the target computing resource node so as to provide inference services to clients based on the model deployed on the target computing resource node.
[0078] In some embodiments, the inference service dynamic scheduling device 300 further includes a performance monitoring module (not shown in the figure), which can be communicatively connected to the scheduling gateway module 330. The performance monitoring module can collect inference performance index data of each computing resource node in real time and send the collected data to the scheduling gateway module 330. The scheduling gateway module 330 can then dynamically adjust the weight of each computing resource node based on the real-time inference performance index data, and afterwards determine the target computing resource node and distribute inference requests based on the adjusted weight of each computing resource node and its real-time load (real-time load coefficient). In this way, the performance of each computing resource node can be continuously optimized, forming a closed loop of performance optimization.
[0079] It should be noted that the resource access module 310, the weight configuration module 320, and the scheduling gateway module 330 are used to execute the aforementioned steps 210, 220, and 230, respectively. The specific execution logic of each module can be found in the description of steps 210 to 230 in method 200 above, and will not be repeated here.
[0080] According to the embodiment of the present invention, the dynamic scheduling method 200 for inference services based on multi-source heterogeneous computing power resources deploys models by accessing multiple computing power resource nodes distributed across multiple regions and with various architectures. The weight of each computing power resource node is determined based on the performance stress test results of each node. When an inference request is received from a client, the target computing power resource node is determined based on the specified model information in the request, the weight of each node, and the real-time load. The inference request is then distributed to the target node, thereby providing inference services to the client based on the model deployed on the target node. Based on this, the present invention can integrate multi-source heterogeneous computing power resources and dynamically allocate inference requests according to the weight and real-time load of each node, thus fully and efficiently utilizing computing power resources while reducing computing costs.
[0081] Furthermore, this invention dynamically adjusts the weight of each computing resource node based on the real-time inference performance index data of each computing resource node, thereby enabling continuous optimization of the performance of each computing resource node and achieving a closed loop of performance optimization.
[0082] Furthermore, this invention supports a location-based proximity-first strategy, which can reduce network latency and improve inference response speed by prioritizing the distribution of inference requests to candidate computing resource nodes that are geographically closest to the client.
[0083] The various techniques described herein can be implemented in combination with hardware or software, or a combination thereof. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, can take the form of program code (i.e., instructions) embedded in a tangible medium, such as a removable hard disk, USB flash drive, floppy disk, CD-ROM, or any other machine-readable storage medium, wherein when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the present invention.
[0084] When the program code is executed on a programmable computer, the computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and / or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the dynamic scheduling method for inference services based on multi-source heterogeneous computing resources of the present invention according to instructions in the program code stored in the memory.
[0085] By way of example, and not limitation, readable media include readable storage media and communication media. Readable storage media stores information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embodies computer-readable instructions, data structures, program modules, or other data in the form of modulated data signals such as carrier waves or other transmission mechanisms, and includes any information delivery medium. Any combination of the above is also included within the scope of readable media.
[0086] In the specification provided herein, the algorithms and displays are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with the examples of this invention. The required structure for constructing such systems is apparent from the above description. Furthermore, this invention is not directed to any particular programming language. It should be understood that the contents of the invention described herein can be implemented using various programming languages, and the above description of specific languages is for the purpose of disclosing the best mode of implementation of the invention.
[0087] Numerous specific details are set forth in the specification provided herein. However, it will be understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
[0088] Similarly, it should be understood that, in order to streamline this disclosure and aid in understanding one or more of the various aspects of the invention, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof.
[0089] Those skilled in the art will understand that modules, units, or components of the devices disclosed in the examples herein can be arranged in the devices described in this embodiment, or alternatively, can be located in one or more devices different from the devices in this example. The modules in the foregoing examples can be combined into a single module or, in addition, can be divided into multiple sub-modules.
[0090] Unless otherwise specified, the use of ordinal numbers such as “first,” “second,” “third,” etc., to describe ordinary objects merely indicates different instances of similar objects and is not intended to imply that the objects being described must have a given order in time, space, ordering, or any other manner.
Claims
1. A dynamic scheduling method for inference services based on multi-source heterogeneous computing resources, executed in a computing device, comprising: accessing multiple computing resource nodes distributed across multiple regions and having multiple architectures, each of the computing resource nodes being suitable for deploying a corresponding model to provide inference services; performing a performance stress test on each computing resource node, and determining the weight of each computing resource node based on the performance stress test results, wherein the performance stress test results include the maximum TPM value or the maximum RPM value; and receiving an inference request from a client, determining a target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node, and distributing the inference request to the target computing resource node so as to provide inference services to the client based on the model deployed on the target computing resource node.
2. The method as described in claim 1, further comprising: collecting real-time inference performance metrics data of each computing resource node, and dynamically adjusting the weight of each computing resource node based on the real-time inference performance metrics data of each computing resource node.
3. The method as described in claim 1 or 2, further comprising: for each computing resource node, deploying a model of a corresponding scale on the computing resource node and adopting a corresponding parallel strategy according to the memory capacity, architecture type, and architecture characteristics of the computing resource node.
4. The method according to any one of claims 1-3, wherein determining the target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node comprises: selecting, from the multiple computing resource nodes, one or more candidate computing resource nodes that match the specified model information; and determining the idle score of each candidate computing resource node based on the weight and real-time load of each candidate computing resource node, and determining the candidate computing resource node with the highest idle score as the target computing resource node.
5. The method of claim 4, wherein determining the idle score of each candidate computing resource node based on the weight and real-time load of each candidate computing resource node comprises: for each candidate computing resource node, determining the ratio of the real-time TPM value of the candidate computing resource node to its maximum TPM value to obtain the real-time load coefficient of the candidate computing resource node; and determining the idle score of the candidate computing resource node based on the weight and real-time load coefficient of the candidate computing resource node.
6. The method of claim 4, wherein the inference request further includes the client's geographical location, and determining the idle score of each candidate computing resource node based on the weight and real-time load of each candidate computing resource node and determining the candidate computing resource node with the highest idle score as the target computing resource node further comprises: first selecting, from the one or more candidate computing resource nodes, the candidate computing resource node that is geographically closest to the client, and distributing the inference request to the nearest candidate computing resource node; determining the idle score of each candidate computing resource node based on the weight and real-time load of each candidate computing resource node, and determining the candidate computing resource node with the highest idle score as the target computing resource node; and switching the inference request from the nearest candidate computing resource node to the target computing resource node.
7. The method according to any one of claims 1-3, wherein determining the target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node comprises: determining the target computing resource node using a weighted random or weighted round-robin algorithm based on the specified model information in the inference request and the weight and real-time load of each computing resource node.
8. The method according to any one of claims 1-7, further comprising: in response to a request to reclaim the target computing resource node, switching the inference request to a standby computing resource node.
9. The method according to any one of claims 1-8, wherein accessing multiple computing resource nodes distributed across multiple regions and having multiple architectures comprises: for each computing resource node, obtaining and storing the node attribute information registered by the computing resource node, the node attribute information including the architecture type, architecture characteristics, graphics card model, video memory capacity, network bandwidth, and geographical location of the computing resource node; and generating, based on the node attribute information, a resource identifier corresponding to the computing resource node and including the resource identifier in the scheduling pool.
10. The method of claim 2, wherein the real-time inference performance metrics data include one or more of the following: real-time TPM value, latency, and error rate; and dynamically adjusting the weight of each computing resource node based on the real-time inference performance metrics data of each computing resource node comprises: determining the health status of each computing resource node based on the real-time inference performance metrics data of each computing resource node; and dynamically adjusting the weight of each computing resource node based on its health status, including reducing the weight of a computing resource node if its health status is less than a health threshold.
11. The method of claim 2, further comprising: recording the real-time inference performance metrics data of each computing resource node as a time series to generate a performance log; and generating and displaying a Power performance graph based on the performance log.
12. A dynamic scheduling apparatus for inference services, deployed in a computing device and adapted to perform the method as described in any one of claims 1-11, the apparatus comprising: a resource access module, adapted to access multiple computing resource nodes distributed across multiple regions and having multiple architectures, each of the computing resource nodes being suitable for deploying a corresponding model to provide inference services; a weight configuration module, adapted to perform a performance stress test on each computing resource node and determine the weight of each computing resource node based on the performance stress test results, wherein the performance stress test results include the maximum TPM value or the maximum RPM value; and a scheduling gateway module, adapted to receive an inference request from a client, determine a target computing resource node based on the specified model information in the inference request and the weight and real-time load of each computing resource node, and distribute the inference request to the target computing resource node so as to provide inference services to the client based on the model deployed on the target computing resource node.
13. The apparatus of claim 12, further comprising: a performance monitoring module, adapted to collect real-time inference performance index data of each computing resource node and send it to the scheduling gateway module, so that the scheduling gateway module can dynamically adjust the weight of each computing resource node based on the real-time inference performance index data of each computing resource node.
14. A computing device, comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing the method as claimed in any one of claims 1-11.
15. A computer program product comprising computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method as described in any one of claims 1-11.
16. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method as described in any one of claims 1-11.