System, modules, and method for selecting device for running machine learning workloads

The system addresses inefficient accelerator selection by predicting workload performance on heterogeneous devices, optimizing resource allocation and reducing costs and energy consumption in datacenters through a profiler, estimator, and scheduler module.

WO2026123203A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2024-12-10
Publication Date: 2026-06-18

Patent Timeline

10 Dec 2024: Application
18 Jun 2026: Publication (WO2026123203A1)

IPC: G06F9/50

AI Tagging

Application Domain

Resource allocation


Similar Technology Patents

A scheduling method and device for encrypted card virtual resources, equipment and medium
CN122195659A · Implement dynamic scheduling · Achieve on-demand allocation · Resource allocation · Biological models
Resource management method and apparatus, communication device, and storage medium
CN122228483A · Resource allocation
A multi-level cache method and system based on structure-enhanced prediction and reinforcement learning
CN122220262A · Program initiation/switching · Digital data information retrieval
Model inference method and device, computer device, computer readable storage medium, and computer program product
CN121960793B · Resource allocation · Inference methods
Helium speech processing method across deep acoustic feature fusion and dynamic phoneme error correction
CN122224202A · Resource allocation · Speech analysis


AI Technical Summary

⚠Technical Problem

Existing methods for selecting accelerators to run machine learning workloads are wasteful and costly due to extensive benchmarking requirements and lack of knowledge about device performance, leading to inefficient resource allocation and increased energy consumption.

⚗Method used

A system and method for predicting the performance of machine learning workloads on heterogeneous devices using a profiler module to extract model structure, an estimator module to predict performance, and a scheduler module to select the optimal device based on Service Level Objectives, reducing power and energy requirements by deploying workloads on the most suitable accelerators.

🎯Benefits of technology

This approach improves datacenter efficiency by reducing power and energy consumption, increasing utilization, and lowering costs by suggesting cheaper alternatives that meet performance requirements, while minimizing Service-Level-Objective violations.

✦ Generated by Eureka AI based on patent content.


Abstract

A method of selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload. The method comprises: receiving a workload at a profiler module; extracting the model structure of the workload; passing the model structure to a performance estimator module; generating a list of predicted performance of the workload on each of the devices based on the model structure; feeding the list to a scheduler module alongside the SLO; and selecting, based on the predicted performance and the SLO for the workload, on which device to run the workload. There is also provided a method for training the profiler module, the estimator module, and a workload builder module. There is also provided a system comprising a profiler module and an estimator module.


Description

SYSTEM, MODULES, AND METHOD FOR SELECTING DEVICE FOR RUNNING MACHINE LEARNING WORKLOADS

FIELD OF THE INVENTION

[0001] This invention relates to methods of scheduling workloads on accelerators for training and inference of ML and AI models. In particular, it relates to accurately predicting the behaviour of the models on various types of hardware to determine the best location for running the model for training and / or inference.

BACKGROUND

[0002] There has recently been an explosion in the number of Machine Learning (ML) models developed. For example, Hugging Face (RTM) added on average more than 40,000 new models per month in 2023, with more than 1 million models in total hosted on its platform. In parallel, hardware companies have developed and released many accelerators that focus on ML model training and inference optimization. To serve the demands of the growing ML model space and of their customers, cloud providers have started adopting more accelerator families in their datacentres. For example, today most cloud providers offer at least 10 different accelerator types through their cloud services.

[0003] Given the large diversity in accelerators offered by cloud providers, when deploying workloads, customers need to choose not only on which provider to run their workload, but also on which accelerator family to deploy it. The choice of the accelerator can significantly affect the costs of deployments, the energy consumption, and the performance of the models in terms of Service-Level Objectives (SLOs) such as throughput and latency requirements. With the diversity of accelerators, cloud users are left with one of two choices: either to extensively benchmark the performance of the model they plan to run on all available hardware to see which hardware will provide the required SLOs; or to choose an accelerator based on, e.g., their available budget to run this workload.

[0004] The above two choices are wasteful and expensive for cloud clients. Benchmarking workloads across devices can be expensive, and non-trivial. Choosing based on budget means that the client can be missing out on potentially large cost savings by running on smaller instances. In addition, with the proliferation of ML models in many domains, many cloud clients do not have the necessary knowledge on how to choose the best accelerator for their applications.

[0005] From a cloud provider's perspective, optimizing and lowering the running costs, increasing the overall utilization of the (expensive) accelerators, and reducing the overall energy and carbon footprints of their datacentres are all priorities. However, these goals cannot be fully met for most GPU clusters since users typically over-provision their resources significantly. In addition, mismatches between the actual resource requirements of an ML model and the requested resources are another major source of resource wastage.

[0006] However, the existing techniques require knowledge of the performance characteristics of a neural network on the different devices, which in turn typically requires offline profiling for each workload on each device. This is an expensive and costly process and is unlikely to work for most cloud applications.

[0007] Further, the existing techniques do not allow for abstracting the resources meaningfully enough for schedulers to be able to estimate how running a computation on different accelerators affects the performance.

[0008] The above-described problem is strongly related to three research and innovation directions from academia and industry comprising Heterogeneous Resource Allocation, Resource abstractions, and ML optimization.

[0009] In an existing approach, there is presented a system and method for computation graph mapping in heterogeneous computer systems. The system takes a computational workload of NN models and generates a graph of the computations, with nodes representing the model. A graph adapter can then be used to augment the graph with, e.g., performance values so that an optimal path can be found using a path finder. The tasks are then allocated. However, this approach requires full knowledge of the performance of each node in the graph on each device in the system.

[0010] In another existing approach, there is presented an automated hardware resource optimization system. However, the system cannot schedule multiple workloads.

[0011] In another existing approach, there is presented an automated hardware resource optimization system which includes a computing platform having a hardware processor and a system memory storing a software code. The software code optimizes, based on the determined batch size and the tuned second performance parameter, a process flow for performing the data processing, and generates a configuration file identifying the computing hardware, the neural network based application, the determined batch size, the tuned second performance parameter, and the optimized process flow.

[0012] However, this approach focuses on automatic updates to a configuration file identifying the computing hardware and the batch size based on two performance parameters. Examples include the memory load size for the first performance parameter and the memory load rate for the second. This does not include any performance metric such as response time or XPU utilization.

[0013] In another existing approach, there is presented a system for workload selection and placement in heterogeneous computing systems that support virtual GPUs (vGPUs). A number of vGPU placement neural networks are trained to maximize a composite efficiency metric based on workload data and GPU data for the plurality of vGPU placement models, and a combined NN selector is used to assign workloads to vGPUs. However, this system does not model heterogeneous devices and cannot predict the expected performance of an ML algorithm on a device.

[0014] In another existing approach, Google Compute Units (GCUs) are presented, but the design is not discussed in detail. GCUs are a virtual unit for CPU performance, which abstract the heterogeneous hardware deployed in Google’s data centres. 1 GCU represents the amount of compute resources needed to generate the same amount of compute performance for a hypothetical “average” Google application. Different machines in the fleet run different processor types and models, and therefore 1 GCU should provide about the same amount of computational performance across any machine in the fleet. Google’s cluster manager, Borg, then performs a mapping of GCUs onto the appropriate number of physical CPU cores. However, GCUs only abstract CPU platforms and the output is a fixed ratio of one platform’s relative performance against another.

SUMMARY OF THE INVENTION

[0015] According to one aspect there is provided a method of selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload, the method comprises: receiving a workload at a profiler module; extracting by the profiler module the model structure of the workload; passing the model structure to a performance estimator module; generating, by the performance estimator module, a list of predicted performance of the workload on each of the devices of the set of devices based on the model structure; feeding the list to a scheduler module alongside the SLO; and selecting by the scheduler, based on the predicted performance and the SLO for the workload, on which device of the set of devices to run the workload. This reduces the power and energy requirements of the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.
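The selection step of this aspect can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the SLO is a latency bound and that each device has a known power figure; the `Prediction` type, the `select_device` function, and all numeric values are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    device: str
    latency_ms: float  # predicted latency of the workload on this device
    power_w: float     # assumed power draw of the device (illustrative)


def select_device(predictions, slo_latency_ms):
    """Return the lowest-power device whose predicted latency meets the SLO.

    Returns None when no device is predicted to satisfy the SLO.
    """
    feasible = [p for p in predictions if p.latency_ms <= slo_latency_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.power_w).device


# Hypothetical estimator output for one workload on three devices.
predictions = [
    Prediction("D1", latency_ms=12.0, power_w=400.0),
    Prediction("D2", latency_ms=35.0, power_w=70.0),
    Prediction("D3", latency_ms=80.0, power_w=30.0),
]
chosen = select_device(predictions, slo_latency_ms=50.0)
print(chosen)
```

In this sketch D3 is excluded because it misses the latency bound, and D2 is preferred over D1 because it meets the bound at lower power. A real scheduler could use any other optimality criterion, such as cost.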

[0016] In an embodiment, the method may comprise receiving a plurality of different workloads and selecting by the scheduler, based on a plurality of predicted performances and SLOs, on which of the set of devices to run the plurality of workloads. This improves the power and energy efficiency of the datacentre.

[0017] In an embodiment, the estimator module may output a list of the predicted performance of the workload on each device with a confidence value of the prediction. This reduces the ML users’ costs by suggesting cheaper alternatives that are capable of meeting the required ML performance but with a lower cost.

[0018] In an embodiment, the list of predicted performance may be provided according to the format W1: {D1: (E(Perf): C), …, Dn: (E(Perf): C)}, where W is the workload, D1-n are the devices, E(Perf) is the expected performance, and C is the confidence. This improves the power and energy efficiency of the datacentre.
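One possible in-memory rendering of this format is a nested Python dictionary, sketched below. The numeric values, the use of a `(performance, confidence)` tuple, and the `confident_devices` helper are illustrative assumptions only.

```python
# Hypothetical estimator output: workload -> device -> (expected performance, confidence).
predicted = {
    "W1": {
        "D1": (12.0, 0.91),
        "D2": (35.0, 0.78),
    }
}


def confident_devices(entry, min_confidence):
    """Keep only the devices whose prediction confidence reaches the threshold."""
    return {
        device: perf
        for device, (perf, conf) in entry.items()
        if conf >= min_confidence
    }


print(confident_devices(predicted["W1"], min_confidence=0.8))
```

A scheduler could use the confidence value in this way to ignore low-confidence predictions before applying the SLO.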

[0019] In an embodiment, the predicted performance may comprise an estimation of any combination of the operator level, the model level, or the layer level performance expected on each of the devices. This reduces the power and energy requirements by the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.

[0020] In an embodiment, the method may comprise monitoring the actual performance of the workload on the selected device and updating the estimator module based on the actual performance. This improves the power and energy efficiency of the datacentre.
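As a rough illustration of such a feedback loop, the sketch below wraps a base predictor and rescales its output per device using an exponential moving average of the observed-to-predicted ratio. This update rule is an assumption chosen for brevity; the application does not specify how the estimator is updated, and the class and parameter names are hypothetical.

```python
class CalibratedEstimator:
    """Wraps a base predictor and corrects it with monitored runtimes."""

    def __init__(self, base_predict, alpha=0.5):
        self.base_predict = base_predict  # callable(workload, device) -> float
        self.alpha = alpha                # weight given to the newest observation
        self.scale = {}                   # per-device correction factor

    def predict(self, workload, device):
        return self.base_predict(workload, device) * self.scale.get(device, 1.0)

    def observe(self, workload, device, actual):
        """Update the device's correction factor from one monitored run."""
        ratio = actual / self.base_predict(workload, device)
        old = self.scale.get(device, 1.0)
        self.scale[device] = (1 - self.alpha) * old + self.alpha * ratio


# Hypothetical base model that always predicts 10 ms.
estimator = CalibratedEstimator(lambda workload, device: 10.0, alpha=0.5)
estimator.observe("W1", "D1", actual=20.0)  # the workload ran slower than predicted
print(estimator.predict("W1", "D1"))
```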

[0021] According to another aspect there is provided a method of training a performance estimator module for estimating the runtime of a workload, the method comprises: receiving, from a profiler module, workload runtime metrics; extracting static features from the workload without running the workload; and estimating runtime information of a workload based on the received workload runtime metrics and the static features. This improves the power and energy efficiency of the datacentre.

[0022] In an embodiment, the estimated runtime information may be passed back to a workload builder module for estimating coverage of the workload configuration. In an embodiment, the estimator may be passed back to the workload builder module for estimating coverage of the workload configuration after each iteration of training. In an embodiment, the workload builder module may provide one or more new workload configurations to the profiler module based on the estimated coverage of a previous workload configuration. This reduces the power and energy requirements of the datacentre.

[0023] In an embodiment, for each workload configuration the profiler module may receive a set of data pairs from a workload builder module; each data pair may comprise a workload and a device on which to run the workload. In an embodiment, the workload may comprise a machine learning model, model parameters, and an example model input. This reduces the power and energy requirements of the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.

[0024] According to another aspect there is provided a method of training a profiler module for determining the optimal accelerator for running a workload, the method comprises: selecting a set of predefined machine learning models with representative workloads;

[0025] for each model of the set: decomposing the model into its constituent components at a specified layer; running the model on each of a set of devices; and obtaining runtime data by monitoring the performance of each component of the model on each device of the set of devices; passing the runtime data for each model to one of a plurality of machine learning models representing a device of the set of devices; passing corresponding model structure data to said one of the plurality of machine learning models representing said device of the set of devices; and training each machine learning model to estimate the performance of a different device of the plurality of devices of the set by associating the respective model runtime data with the corresponding model structure data. This reduces the power and energy requirements of the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.
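The idea of one learned model per device, trained from per-component runtime data, can be sketched as follows. As a deliberately simple stand-in for the per-device machine learning models, the sketch uses a lookup table of mean runtimes per kernel type and estimates a model's runtime as the sum over its kernels. The record layout, function names, and numbers are hypothetical.

```python
from collections import defaultdict


def train_device_models(records):
    """Build {device: {kernel_type: mean runtime}} from monitored runs.

    records is an iterable of (device, kernel_type, runtime_ms) tuples.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for device, kernel, runtime in records:
        acc = sums[device][kernel]
        acc[0] += runtime
        acc[1] += 1
    return {
        device: {kernel: total / count for kernel, (total, count) in kernels.items()}
        for device, kernels in sums.items()
    }


def estimate_runtime(device_model, kernels):
    """Estimate the runtime of a model from its list of kernels.

    Raises KeyError for a kernel type never observed on the device.
    """
    return sum(device_model[kernel] for kernel in kernels)


# Hypothetical monitored runtimes of individual kernels on two devices.
records = [
    ("D1", "conv", 2.0), ("D1", "conv", 4.0), ("D1", "matmul", 1.0),
    ("D2", "conv", 6.0), ("D2", "matmul", 3.0),
]
device_models = train_device_models(records)

# A model not seen during training, built from previously seen kernel types.
unseen_model = ["conv", "conv", "matmul"]
print({d: estimate_runtime(m, unseen_model) for d, m in device_models.items()})
```

The additive estimate ignores effects such as kernel fusion and memory contention, which a trained per-device model would be expected to capture.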

[0026] In an embodiment, each device may represent a different accelerator for running the workload of the model. In an embodiment, the specified layer may be any one of the machine learning layer, the compiler layer, or the hardware layer. In an embodiment, the constituent components of the compiler layer may comprise the computational kernels of the workload. This improves the performance of ML workloads and reduces any Service-Level-Objective violations by suggesting to an uninformed user a more powerful accelerator that is capable of servicing the ML workload according to the user’s requirements.

[0027] In an embodiment, the model structure data may comprise any combination of number of kernels, type of kernels, a directed acyclic graph, DAG, of kernels, and one or more other hyper-parameters of the model. In an embodiment, the DAGs may be provided to the device performance estimator model as dictionaries with keys equal to each model and values comprising a list of model layers or kernels. This increases the datacentre utilization by increasing the number of ML workloads that can be packed on the same set of hardware since the most powerful hardware is not allocated to ML workloads with low accelerator requirements.

[0028] In an embodiment, the monitoring may comprise deploying an exporter agent on each computing node representing a device of the set of devices within the system. This increases the datacentre utilization.

[0029] In an embodiment, the runtime data may comprise any combination of overall device utilization rate, specific usage of device memory, response time, throughput, model accuracy, and one or more user-defined QoS metrics. This reduces the ML users’ costs by suggesting cheaper alternatives that are capable of meeting the required ML performance but with a lower cost.

[0030] According to another aspect there is provided a method for generating a set of workloads for training a profiler module or estimator module, the method comprising: receiving a plurality of machine learning models, a plurality of device types, and a plurality of neural network layers; and generating a set of data pairs, each data pair comprising a workload and a device on which to run the workload.

[0031] In an embodiment, each workload may comprise a machine learning model, a set of parameters of the model, and an example input of the model. In an embodiment, the method may comprise receiving a list of multiple workloads and their associated runtime metrics when run on a particular device. In an embodiment, the plurality of device types may comprise model numbers and associated specs of each device which may be used to run the model. In an embodiment, the plurality of machine learning models may comprise a list of neural network layers of the model. In an embodiment, the plurality of neural network layers may comprise a list of types of layers used in the machine learning models. This reduces the power and energy requirements of the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.

[0032] In an embodiment, the method may comprise estimating coverage of previously sampled workload configurations using variational autoencoders or Gaussian processes and determining new workload configurations based on the estimated coverage. This increases the datacentre utilization by increasing the number of ML workloads that can be packed on the same set of hardware since the most powerful hardware is not allocated to ML workloads with low accelerator requirements.

[0033] According to another aspect there is provided a system for selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload, the system comprises: a profiler module configured to extract the model structure of the workload; an estimator module configured to estimate the predicted performance of the workload on each device based on the model structure and one or more pre-trained device models; and a scheduler node configured to select a device from the set of devices for running the workload based on the predicted performance for each device and the SLO of the workload. This reduces the power and energy requirements of the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles. This also improves the performance of ML workloads and reduces any Service-Level-Objective violations by suggesting to an uninformed user a more powerful accelerator that is capable of servicing the ML workload according to the user’s requirements.

[0034] According to another aspect there is provided a profiler module configured to determine the optimal accelerator for running a workload and trained according to the method of claim 13.

[0035] According to another aspect there is provided an estimator module configured to estimate the runtime of a workload and trained according to the method of claim 7.

[0036] According to another aspect there is provided a computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method of claim 1 or the method of claim 7 or the method of claim 13 or the method of claim 21 alone or in combination when the program code is executed by the computer or the processor.

[0037] The proposed system, modules, and methods reduce the power and energy requirements by the datacentre as the workloads are deployed on the most suitable accelerator for their required performance with the lowest power or energy profiles.

[0038] The proposed system, modules, and methods increase the datacentre utilization by increasing the number of ML workloads that can be packed on the same set of hardware since the most powerful hardware is not allocated to ML workloads with low accelerator requirements.

[0039] The proposed system, modules, and methods reduce the ML users’ costs by suggesting cheaper alternatives that are capable of meeting the required ML performance but with a lower cost.

[0040] The proposed system, modules, and methods improve the performance of ML workloads and reduce any Service-Level-Objective violations by suggesting to an uninformed user a more powerful accelerator that is capable of servicing the ML workload according to the user’s requirements.

[0041] BRIEF DESCRIPTION OF THE FIGURES

[0042] The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:

[0043] Figure 1 shows a schematic overview of the offline training architecture for the HCU generator. Figure 2 shows a schematic diagram of the workload builder module.

[0044] Figure 3 shows a schematic diagram of the profiler module.

[0045] Figure 4 shows a schematic diagram of the estimator training operation.

[0046] Figure 5 shows a schematic diagram of the modules once trained, with their inference forming part of an online system.

[0047] Figure 6 shows a schematic diagram providing an overview of the online system components similar to that shown in figure 5.

DETAILED DESCRIPTION OF THE INVENTION

[0048] To overcome these challenges there is introduced a system, methods, and modules, for automatic selection of a computational resource from a set of heterogeneous devices for running workloads given an SLO and a workload.

[0049] There is provided a Profiler module and a Scheduler module for workloads that need to run on accelerators. The system takes as input a workload, e.g., an ML model, with a desired Service-Level Objective (SLO), and may output the predicted performance, costs, and a confidence interval of these predicted metrics for running the workload on each accelerator. The prediction is then used to find a schedule for placing the workload on different devices.

[0050] As mentioned above, the system and method proposed herein aim to tackle the following issues with existing approaches: the requirement for extensive benchmarking of each workload on all available heterogeneous accelerators by a user to learn the best accelerator for a given workload; performance prediction for a workload on heterogeneous accelerators, with a confidence in the predictions, without running the workload; and the scheduling of diverse, never-before-seen workloads on a set of heterogeneous accelerators while maintaining a certain optimality criterion, e.g., at least a desired SLO.

[0051] The landscape of machine learning (ML) and artificial intelligence (AI) workloads is notably diverse, spanning Large Language Models (LLM) to deep reinforcement learning and simpler inference tasks. For example, operations may encompass a wide array of workloads, ranging from advanced Pangu models (as seen in "PanGu-Sigma: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing." by Ren, Xiaozhe, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li et al.; arXiv preprint arXiv:2303.10845 (2023)) to MILP solvers (as seen in "A Deep Instance Generative Framework for MILP Solvers Under Limited Data Availability." by Geng, Zijie, Xijun Li, Jie Wang, Xiao Li, Yongdong Zhang, and Feng Wu; Thirty-seventh Conference on Neural Information Processing Systems, 2023) and Deep Reinforcement Learning (RL) applications (as seen in "Optimizing communication in deep reinforcement learning with XingTian." by Pan, Lichen, Jun Qian, Wei Xia, Hangyu Mao, Jun Yao, Pengze Li, and Zhen Xiao; in Proceedings of the 23rd ACM / IFIP International Middleware Conference, pp. 255-268, 2022). Each of these models undergoes extensive training on vast datasets before being deployed for inference, with some involving a tuning phase aimed at real-time model updates.

[0052] Furthermore, many applications necessitate a sequential flow of models, akin to microservice-based or workflow applications. This diverse range of ML / AI workloads poses significant challenges in optimizing their deployments solely at the application-framework level.

[0053] Some datacentres, which run these ML and AI workloads, are heterogeneous and contain different types of hardware including CPUs and accelerators such as GPUs, NPUs, FPGAs, etc. Each of these accelerator classes has multiple different models with different computing capabilities. This variation means that these different types of hardware exhibit varying levels of performance in relation to a given workload.

[0054] It has been realised that when a cluster scheduler is assigning jobs to machines it needs to take into account the hardware heterogeneity across the data centre and abstract away the physical hardware characteristics and differences to more uniform metrics. A job's resource assignments can then be decoupled from the actual hardware and allow the scheduler to fulfil each request with equivalent heterogeneous resources. In order to achieve this objective, the heterogeneous device types need to be abstracted away from the point of view of the scheduler, presenting instead the relative performance of a workload.

[0055] The proposed Heterogeneous Computing System (HCS) is a system to generate performance abstractions for different device types. The abstraction will enable vis-à-vis performance comparison between different device types, e.g., different versions of Ascend NPUs.

[0056] Figure 1 shows a schematic overview of the offline training architecture for the HCU generator. The proposed system 100 comprises a profiler module 102, a plurality of device models 104, a monitoring module 106, and an estimator training module 108 which undergoes training in this first mode.

[0057] The system has two modes: an offline mode for training performance prediction models for different devices; and an online mode that is used during production. The present description focuses on ML workloads by way of example. However, the same concepts and provisions can be used for other workloads that can run on an accelerator.

[0058] Tw indicates one or more training workloads. Each training workload Tw is passed as a set to each device model D1-n which are trained on these workloads to determine the corresponding performance of the device. The monitoring module 106 may be used to monitor the training of the device models and pass performance data on to the estimator training module 108. Information regarding the model structures may be passed forward to the estimator module 108 to add context to the determined performance data and develop device performance estimation models.

[0059] Thus, there is provided a system for selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload. The system comprises a profiler module configured to extract the model structure of the workload. There is also an estimator module configured to estimate the predicted performance of the workload on each device based on the model structure and one or more pre-trained device models. There is also provided a scheduler node configured to select a device from the set of devices for running the workload based on the predicted performance for each device and the SLO of the workload.

[0060] In order to provide the training workloads, a workload builder module is used.

[0061] Figure 2 shows a schematic diagram of the workload builder module 200. A plurality of device types D1-Dp may be provided as one input. A plurality of machine learning models M1-Mq may be provided as another input. A plurality of neural network layers L1-Ls may be provided as another input. A list of multiple sets comprising a workload W1-Wr, a device D1-Dr, and a runtime metric S1-Sr (i.e. the runtime of that workload on that device, etc.) may also be provided as an input.

[0062] The workload builder 200 creates a set of training workloads to be executed on a specific device. The workloads comprise the machine learning model M, a set of parameters P and input I, and are associated with a device type D. The training workload and device sets are generated from a list of devices, for example, A100, T4, A900, etc.; a list of neural network models, for example, ResNet, YOLO, Llama, etc.; and a list of all possible PyTorch layers, for example, fully connected, convolution, batch norm, etc. In some cases, a list of multiple [workload; device run on; runtime metric] sets may be present.
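A minimal Python sketch of how such (workload, device) pairs might be enumerated is given below. It exhaustively combines every model, parameter set, and device, whereas the described builder samples a diverse subset; the `Workload` type, the `build_pairs` function, and the parameter values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Workload:
    model: str                  # model M, e.g. "ResNet"
    params: tuple               # parameters P, e.g. (("batch_size", 8),)
    example_input_shape: tuple  # shape of the example input I


def build_pairs(models, param_sets, input_shape, devices):
    """Return every (workload, device) pair for the given inputs."""
    workloads = [
        Workload(model, params, input_shape)
        for model, params in product(models, param_sets)
    ]
    return list(product(workloads, devices))


pairs = build_pairs(
    models=["ResNet", "YOLO"],
    param_sets=[(("batch_size", 1),), (("batch_size", 8),)],
    input_shape=(3, 224, 224),
    devices=["A100", "T4"],
)
print(len(pairs))
```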

[0063] The job of the workload builder is to sample a diverse set of workloads, while also accounting for the most popular workloads in the datacentre and previously sampled workloads. The sampled set is subsequently used by the profiler and the estimator model. The workload builder may use, for example, variational autoencoders, or Gaussian processes, to estimate coverage of previously sampled workload configurations and suggest new ones.
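The coverage-driven sampling loop can be illustrated with a much simpler stand-in than a variational autoencoder or Gaussian process: the sketch below treats the distance to the nearest previously sampled configuration as a proxy for lack of coverage and suggests the candidate with the largest such distance. The configuration encoding (number of layers, batch size) and the function name are assumptions made for illustration.

```python
import math


def suggest_next(sampled, candidates):
    """Suggest the candidate configuration farthest from all sampled ones.

    Distance to the nearest sampled point is used as a crude proxy for
    how poorly a region of the configuration space is covered.
    """
    def coverage_gap(candidate):
        return min(math.dist(candidate, s) for s in sampled)

    return max(candidates, key=coverage_gap)


# Hypothetical configurations encoded as (number of layers, batch size).
sampled = [(1, 8), (2, 8)]
candidates = [(1, 9), (8, 64), (3, 8)]
print(suggest_next(sampled, candidates))
```

In practice the features would need scaling so that no single dimension dominates the distance, and a probabilistic model would additionally provide an uncertainty estimate.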

[0064] There is therefore provided a method for generating a set of workloads for training a profiler module or estimator module. The method comprises receiving a plurality of machine learning models, a plurality of device types, and a plurality of neural network layers; and generating a set of data pairs, each data pair comprising a workload and a device on which to run the workload.

[0065] Each workload comprises a machine learning model, a set of parameters of the model, and an example input of the model. The method may comprise receiving a list of multiple workloads and their associated runtime metrics when run on a particular device.

[0066] The plurality of device types may comprise model numbers and associated specs of each device which may be used to run the model. The specs may include, for example, hardware metrics such as the memory size or the number of computing cores in the device. The plurality of machine learning models may comprise a list of neural network layers of the model. The plurality of neural network layers may comprise a list of types of layers used in the machine learning models. That is, a model M may be split into layers. A model that has never been seen before can also be split into L layers. This will be a different order / configuration of layers than everything previously seen, but the layers themselves comprise the same or similar basic building blocks. Based on the training, it can be seen how this model should perform on each hardware from the previous experience with how these blocks have performed in different previous configurations.

[0067] The method may comprise estimating coverage of previously sampled workload configurations using variational autoencoders or Gaussian processes and determining new workload configurations based on the estimated coverage.
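The sampling step of the workload builder can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the variational autoencoder or Gaussian process coverage estimate is replaced by a simple nearest-neighbour distance, and the names `candidate_workloads`, `coverage_gap`, and `suggest_next`, as well as the parameter grid, are hypothetical.

```python
import itertools
import math

# Hypothetical inputs: model names, device types, and a small parameter grid.
MODELS = ["ResNet", "YOLO", "Llama"]
DEVICES = ["A100", "T4"]
BATCH_SIZES = [1, 8, 64]


def candidate_workloads():
    """Enumerate every (model, batch size, device) configuration."""
    return list(itertools.product(MODELS, BATCH_SIZES, DEVICES))


def coverage_gap(candidate, sampled):
    """Distance from a candidate to its nearest already-sampled neighbour.

    Stand-in for a learned coverage estimate: a larger gap means the
    configuration is less well covered by earlier samples.
    """
    model, batch, device = candidate
    gaps = [
        (model != m) + (device != d) + abs(math.log2(batch) - math.log2(b))
        for m, b, d in sampled
    ]
    return min(gaps) if gaps else math.inf


def suggest_next(sampled, k=2):
    """Return the k unsampled configurations with the largest coverage gap."""
    pool = [c for c in candidate_workloads() if c not in sampled]
    pool.sort(key=lambda c: coverage_gap(c, sampled), reverse=True)
    return pool[:k]
```

Calling `suggest_next([("ResNet", 1, "A100")])` proposes configurations that differ from the sampled one in model, device, and batch size at once, which is the behaviour a coverage-driven sampler is meant to approximate.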

[0068] Offline Profiler

[0069] An offline profiler module 102 is at the heart of the implementation. To build initial models for the generation of abstractions for AI workloads, it is necessary to profile across various NPU and GPU devices and their respective device generations using different frameworks. The offline profiler 102 may extract ML workload information such as the kernels the model has, the directed acyclic graph (DAG) of computations, and any other relevant hyper-parameters. It also manages the launching of workloads on all the different devices which it is desired to model / abstract, acting as a workload and experiments driver.

[0070] The profiler 102 takes as input a set of predefined ML workloads W that are representative of the workloads running in the datacentre and associated device type (e.g. the input to the profiler module is the workload builder output). The profiler 102 may output the Model DAGs to the estimator training 108 as dictionaries with keys equal to each model and values as a list of all model layers / kernels. In addition, the profiler 102 launches the experiments on the devices.

[0071] The offline model profiling operation is executed as follows. The goal of the offline profiling stage is to gather information to help optimise or improve the performance of estimator models. Models exhibit different behaviours on different devices. Therefore, a profiling stage on each device is important to gain a better understanding of the performance. The profiler 102 emits operator level statistics when given a model and expected inputs. The outputs of the profiling stage may then be transformed and fed to the estimator training 108 for model generation. Profiling enables the comparing of the performance impact of different model versions, hardware configurations, or optimization techniques.

[0072] Thus, there is provided a method of training a profiler module for determining the optimal accelerator for running a workload. The method comprises selecting a set of predefined machine learning models with representative workloads. For each model of the set the method comprises decomposing the model into its constituent components at a specified layer, running the model on each of a set of devices, and then obtaining runtime data by monitoring the performance of each component of the model on each device of the set of devices. The method continues by passing the runtime data for each model to one of a plurality of machine learning models representing a device of the set of devices, and then passing corresponding model structure data to said one of the plurality of machine learning models representing said device of the set of devices. Finally, the method comprises training each machine learning model to estimate the performance of a different device of the plurality of devices of the set by associating the respective model runtime data with the corresponding model structure data.

[0073] Each device may represent a different accelerator for running the workload of the model. The specified layer may be any one of the machine learning layer, the compiler layer, or the hardware layer. The constituent components of the compiler layer may comprise the computational kernels of the workload.

[0074] The model structure data may comprise any combination of number of kernels, type of kernels, a directed acyclic graph, DAG, of kernels, and one or more other hyper-parameters of the model. The DAGs may be provided to the device performance estimator model as dictionaries with keys equal to each model and values comprising a list of model layers or kernels.
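As an illustration of the dictionary format described above, the following sketch stores a model's kernels and DAG edges in plain Python structures and derives a few structure features from them. The layout, the field names, and the kernel names are assumptions made for this example only; the specification does not fix a schema.

```python
from collections import Counter

# Keys are model names; values are the ordered list of layers or kernels.
model_dags = {
    "toy_cnn": ["conv", "batch_norm", "relu", "conv", "relu", "linear"],
}

# Hypothetical edge list for the same model: (producer index, consumer index).
dag_edges = {
    "toy_cnn": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)],
}


def structure_features(name):
    """Summarise one model as kernel count, kernel-type counts, and DAG depth."""
    kernels = model_dags[name]
    depth = [1] * len(kernels)
    # Edges are listed in topological order, so one pass gives the longest path.
    for src, dst in dag_edges[name]:
        depth[dst] = max(depth[dst], depth[src] + 1)
    return {
        "num_kernels": len(kernels),
        "kernel_types": dict(Counter(kernels)),
        "dag_depth": max(depth),
    }
```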

[0075] The monitoring may comprise deploying an exporter agent on each computing node representing a device of the set of devices within the system. The runtime data may comprise any combination of overall device utilization rate, specific usage of device memory, response time, throughput, model accuracy, and one or more user-defined QoS metrics.

[0076] Figure 3 shows a schematic diagram of the profiler module 102. The profiler 102 can run in static or dynamic mode. In static mode, the profiler extracts information without running the model. In dynamic mode, the profiler runs the model and collects detailed runtime metrics.

[0077] The profiler 102 has three main components. The first is a model structure analyser 302 which finds the different workload layers in a neural network, the connectivity between these layers, the communication patterns, input parameters, and the parallelism used in the model.

[0078] The second is an operator analyser 304 which splits the different layers into their main high-level operators, e.g., ATen operators or ONNX operators.

[0079] The third is a hardware operator analyser 306 which analyses how the high-level models are translated on the accelerators, e.g., into CUDA operators or Ascend operators, together with different properties of the device, e.g. size of HBM / cache, number of AI-cores, etc. This enables the system to predict performance more accurately for a given hardware.

[0080] In a different task configuration, for example preparing training data or running in deployment, the profiler 102 can contain any combination of the analysers 302-306, with different settings for each of the static mode and dynamic mode.
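The difference between the static and dynamic modes can be sketched as below. This is an illustrative toy, not the profiler 102 itself: the model is represented as a list of named Python callables, and the function name `profile` and the record layout are assumptions.

```python
import time


def profile(layers, example_input=None, mode="static"):
    """Profile a model given as a list of (name, callable) layers.

    Static mode records only the structure and never calls a layer.
    Dynamic mode runs each layer on the example input and times it.
    """
    if mode == "static":
        return [{"layer": name} for name, _ in layers]
    if mode != "dynamic":
        raise ValueError(f"unknown mode: {mode}")
    records, value = [], example_input
    for name, fn in layers:
        start = time.perf_counter()
        value = fn(value)
        records.append({"layer": name, "seconds": time.perf_counter() - start})
    return records


# Hypothetical two-layer "model" built from plain Python functions.
toy_model = [
    ("scale", lambda xs: [2 * x for x in xs]),
    ("sum", lambda xs: sum(xs)),
]
```

Here `profile(toy_model)` returns the layer names without executing anything, while `profile(toy_model, [1, 2, 3], mode="dynamic")` additionally reports a wall-clock time per layer.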

[0081] Offline Monitoring

[0082] In the model building phase, the inclusion of a monitoring component 106 is used for tracking the performance and efficiency of the computational resources involved. This monitoring component 106 operates by deploying an exporter agent on each computing module within the system. These agents are responsible for the continuous collection of detailed metrics related to the utilization of the devices. Key metrics may include overall device utilization rates and the specific usage of device memory. This data is important for understanding how different workloads impact the performance of the hardware, providing a foundation for optimising resource allocation and improving model accuracy. In addition, the monitoring system 106 collects workload level metrics such as response times, throughput, model accuracy, and any other user-defined quality of service (QoS) metrics needed for comparison of the different devices. Once collected, the metrics may be passed on to inform the training process of the estimator models 108. This step ensures that the models are grounded in real-world performance data, enhancing their predictive accuracy.

[0083] The input to the monitoring module may be user defined performance metrics for comparing the models on the devices. The output of the monitoring module 106 may be measured values for performance and device level metrics.

[0084] Estimator Training

[0085] The primary objective of the training process for the estimator is to develop and refine models capable of accurately predicting the performance of various workloads when executed on different accelerator technologies, such as GPUs and NPUs. These predictive models are crucial for understanding how different types of workloads will perform across a diverse range of computing resources. By leveraging these models, it becomes possible to make informed decisions about which accelerator is best suited for a particular job or workload, taking into consideration the specific requirements for Service Level Objectives (SLOs) or Quality of Service (QoS) that need to be met.

[0086] The input to the estimator training module is the data collected by the offline monitoring framework and the model information from the profiler 102.

[0087] The output from the estimator training module 108 is a set of device abstraction models. Each device can have an ensemble of models that correspond to the expected device performance for a different model family, e.g., LLMs vs ResNets.

[0088] The process of model development in the estimator training operation involves the following steps. First, performance data from different accelerators is collected and processed using a dynamic workload system that can be sampled from using a small set of parameters (e.g., batch size, model width, model depth, etc.). Then, a regressive prediction model is trained and implemented to approximate the performance of a given accelerator under a specific workload. Following this, iterative development of said model is performed using various known techniques, a well-defined data split (train, validation, and test) for hyper-parameter tuning, and additional data collection. Lastly, the model is tested on held-out extrapolation data (i.e., workload settings that are outside of the training / validation scope).
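The fit-then-extrapolate part of these steps can be sketched with a single-feature least-squares model. Everything specific here is an assumption for illustration: the `cost` feature, the function names, and in particular `synthetic_runtime_ms`, which is a made-up ground truth standing in for measured runtimes. The validation split and hyper-parameter tuning are omitted for brevity.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cost(batch, width, depth):
    """Hypothetical static feature: a rough proxy for work per forward pass."""
    return batch * width * width * depth


def synthetic_runtime_ms(batch, width, depth):
    """Made-up ground truth used only to exercise the code, not measured data."""
    return 0.5 + 2e-6 * cost(batch, width, depth)


# In-range configurations for fitting; larger, unseen ones for extrapolation.
train_cfgs = [(b, w, d) for b in (1, 4, 16) for w in (64, 128) for d in (2, 4)]
extrap_cfgs = [(64, 256, 8), (128, 256, 16)]

slope, intercept = fit_line(
    [cost(*c) for c in train_cfgs],
    [synthetic_runtime_ms(*c) for c in train_cfgs],
)


def predict_ms(batch, width, depth):
    """Predict runtime in milliseconds from the fitted line."""
    return slope * cost(batch, width, depth) + intercept


# Worst relative error on the held-out extrapolation set.
extrap_error = max(
    abs(predict_ms(*c) - synthetic_runtime_ms(*c)) / synthetic_runtime_ms(*c)
    for c in extrap_cfgs
)
```

With real measurements the extrapolation error would not be negligible, and that error is the quantity the final testing step is intended to expose.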

[0089] Once developed, these models act as a foundational step in a larger workflow management system, enabling automated and intelligent mapping of jobs and workloads to the most appropriate accelerators. This not only helps in optimising the utilisation of computing resources but also ensures that applications meet their performance targets and service quality requirements. In real-world scenarios, this capability can significantly enhance the efficiency and responsiveness of computing environments, particularly in cloud computing and datacentres where a wide variety of workloads are processed, and resource allocation decisions can have a substantial impact on operational costs and user satisfaction.

[0090] Moreover, the deployment of these models in a production environment involves integrating them with the scheduler and resource management systems. This integration allows for real-time decision-making and dynamic allocation of resources based on the current workload demands and available infrastructure, further enhancing the flexibility and efficiency of the system.

[0091] Figure 4 shows a schematic diagram of the estimator training operation. The idea is to build workload and estimator models based on data that are used to predict performance on new workloads on devices deployed in the datacentre. First, there is the workload building phase that is done offline. This is based on a set of target devices (D1, D2, ...), machine learning models (M1, M2, ...) and neural network layers (L1, L2, ...). The workload builder 200 generates pairs of a workload and a target device to run it on. Each workload contains the machine learning model M, parameters for the model P and example input I.

[0092] The workloads W are then run and profiled, with detailed metrics collected. Based on the information from the profiler 102, an estimator is trained 108 to predict runtime information of a workload based on static features that can be extracted without running the workload.

[0093] The trained estimator 402, together with the runtime estimation S, is passed back to the workload builder 200, which can generate more training data based on the performance.

[0094] Once the models are trained, they may be deployed and integrated in a cluster management infrastructure such as Kubernetes. Integration of the trained modules enables a step towards achieving an efficient and dynamic resource allocation mechanism. It allows the models to directly influence the scheduling decisions made within the cluster, ensuring that workloads are assigned to the most appropriate computing resources. This setup ensures that the predictive capabilities of the models are effectively utilised to optimise resource allocation, based on the current state and requirements of the cluster.

[0095] Accordingly, there is provided a method of training a performance estimator module for estimating the runtime of a workload. The method comprises receiving, from a profiler module, workload runtime metrics, and then extracting static features from the workload without running the workload. Lastly, the method comprises estimating runtime information of a workload based on the received workload runtime metrics and the static features.

[0096] The estimated runtime information may be passed back to a workload builder module for estimating coverage of the workload configuration. The estimator may be passed back to the workload builder module for estimating coverage of the workload configuration after each iteration of training.

[0097] The workload builder module may provide one or more new workload configurations to the profiler module based on the estimated coverage of a previous workload configuration. For each workload configuration, the profiler module may receive a set of data pairs from a workload builder module, each data pair comprising a workload and a device on which to run the workload. The workload may comprise a machine learning model, model parameters, and an example model input.

[0098] Figure 5 shows a schematic diagram of the modules once trained where their inference is part of an online system. The performance estimator 502 is trained offline and then tuned using continuous online monitoring 504.

[0099] Online Profiler and Classifier

[0100] The profiler 506 leverages information available on never seen before models such as the model DAG, the model operators, and the model layers to identify the submitted model. This component has two stages:

[0101] Online Profiling: Online profiling makes use of the fact that during runtime, a workload W needs to be loaded from the model file to the host memory. Focus herein is on examples for ML workloads. When a workload gets deployed in host memory, the model is typically deployed from a model file. Generally, during the process of loading the model, and before the model is sent to the accelerator, the model has to be loaded in the host memory within the serving framework. The model is then stored in a special structure, state_dict. The state_dict contains all the model data, and when printed for a simple network gives an output like: conv1.weight torch.Size([6, 3, 5, 5]), conv1.bias torch.Size([6]), conv2.weight torch.Size([16, 6, 5, 5]), conv2.bias torch.Size([16]), fc1.weight torch.Size([120, 400]), fc1.bias torch.Size([120]), fc2.weight torch.Size([84, 120]), fc2.bias torch.Size([84]), fc3.weight torch.Size([10, 84]), fc3.bias torch.Size([10]).

[0102] The profiler 506 reads the state_dict structure for all the deployed models using a simple plugin that can be loaded at runtime.

[0103] Accordingly, the inputs for the online profiler 506 may be model files, a process ID to watch, or a container to watch. The output may be a model structure as a list including the size of each layer and its weight.
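The conversion from a state_dict-style listing to a per-layer model structure might look like the sketch below. To keep it free of framework dependencies, the input is a hypothetical plain mapping from parameter names to shapes rather than a real PyTorch state_dict; the function name `model_structure` and the choice of parameter count as the "size" of a layer are assumptions.

```python
from math import prod

# Hypothetical name-to-shape mapping in the style of a PyTorch state_dict;
# a real profiler plugin would read the shapes from the loaded model.
shapes = {
    "conv1.weight": (6, 3, 5, 5), "conv1.bias": (6,),
    "conv2.weight": (16, 6, 5, 5), "conv2.bias": (16,),
    "fc1.weight": (120, 400), "fc1.bias": (120,),
    "fc2.weight": (84, 120), "fc2.bias": (84,),
    "fc3.weight": (10, 84), "fc3.bias": (10,),
}


def model_structure(shapes):
    """Group parameter tensors by layer and count the parameters in each."""
    layers = {}
    for name, shape in shapes.items():
        # "conv1.weight" belongs to layer "conv1".
        layer, _, _ = name.rpartition(".")
        layers[layer] = layers.get(layer, 0) + prod(shape)
    return list(layers.items())
```

The result is an ordered list of (layer name, parameter count) pairs, which can be produced without running the model on any accelerator.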

[0104] For online inference there is also provided a classifier module. In figure 5 the classifier and profiler are shown as the same module 506. However, their respective processes may be schematically represented as separate modules with no change to their functions or interaction.

[0105] The classifier 506 takes as input the output from the profiler 506. The classifier finds the closest existing model from the offline profiling stage to the newly generated model. The decision-making process for assigning workloads within this integrated system works differently based on the workload.

[0106] Workloads that the system has previously encountered leverage historical performance data and model predictions to quickly identify the best-fitting accelerator for the job. Since machine-learning operator / layer types are limited, the training of the models should be able to capture most, if not all, of the commonly used operator / layer types. The profiler 506 then extracts the most important workload characteristics and uses the classifier to choose a family of workloads that this new model belongs to. This is then used to classify the model based on the previously learnt classes. The system then relies on the online tuning module 508 to tune the model using simple transfer learning methods.

[0107] Therefore, the input to the classifier is the online profiler output. The output of the classifier is the most suitable prediction model for each device for the given workload.
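One simple way to find the closest previously profiled model family is sketched below. The similarity measure (cosine similarity over operator-count histograms), the function names, and the example operator lists are all assumptions chosen for illustration; the description does not prescribe a particular classifier.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two operator-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def closest_family(new_ops, known_families):
    """Return the name of the known model family most similar to the new model."""
    new_hist = Counter(new_ops)
    return max(
        known_families,
        key=lambda f: cosine(new_hist, Counter(known_families[f])),
    )


# Hypothetical operator lists recorded during offline profiling.
families = {
    "cnn": ["conv", "relu", "conv", "relu", "linear"],
    "transformer": ["linear", "softmax", "linear", "layer_norm", "linear"],
}
```

A model built mostly from convolutions maps to the "cnn" family even when it contains an operator, such as batch norm, that the family's reference model lacks.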

[0108] Performance Estimator inference

[0109] The performance estimator subcomponent 502 utilises the models produced from the offline profiling stage to predict the workload performance on each device. For each device, the estimators take as input the workload parameters, e.g., the DAG, the batch size, etc., and output the predicted performance with respect to the response time and throughput for each device. This enables the administrator to evaluate the speed-ups or the slow-downs for a given workload running on different devices, enabling decisions such as migrating a workload from a given cluster with a certain XPU type to another cluster with a different XPU type. Here XPU stands for Accelerator Processing Unit and may include NPUs and / or GPUs, and / or any other accelerator types.

[0110] In addition, when coupled with a scheduler 510, it enables a datacentre or a cluster level scheduler to make placement and scheduling decisions across multiple heterogeneous resources. Further, the output from this module, if exposed to cloud users, can serve as a tool that enables cloud users to decide on which resource types they should provision for their workloads.

[0111] The input for the performance estimator module 502 may be the workload DAG structure along with other hyperparameters. The output from the performance estimator module 502 is a dictionary with the predicted performance of the workload on each device with a confidence in the prediction. For example, the format of the output may resemble: {Device1: (Performance1, confidence1), Device2: (Performance2, confidence2), ...}.

[0112] The Scheduler

[0113] The scheduler module 510 is designed to navigate the complexities of device heterogeneity, leveraging predictions from the performance estimator 502 to guide its decisions. Through standard scheduling stages, it filters and assigns workloads to the most suitable device, ensuring an efficient mapping between tasks and computing resources. This process includes evaluating potential matches for each workload, with the scheduler 510 selecting a device D from the pool of suitable candidates, sometimes choosing randomly among equally fitting options to optimize resource distribution.

[0114] In scenarios where the scheduler 510 faces challenges in finding a match, either due to resource constraints or specific requirements of the workload, it activates a fallback mechanism, so long as this mechanism is not going to violate the user provided workload SLOs 512. This might involve assigning the workload to the next most compatible device or re-queuing the job for future scheduling attempts. Such flexibility ensures that the system can adapt to varying conditions and maintain operational efficiency, even when ideal matches are not immediately available.

[0115] Thus, there is provided a method of selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload. The method comprises receiving a workload at a profiler module, extracting by the profiler module the model structure of the workload, and passing the model structure to a performance estimator module. The method further comprises generating, by the performance estimator module, a list of predicted performance of the workload on each of the devices of the set of devices based on the model structure and feeding the list to a scheduler module alongside the SLO. Next, the method comprises selecting by the scheduler, based on the predicted performance and the SLO for the workload, on which device of the set of devices to run the workload.
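The selection step can be sketched as a filter followed by a choice. This sketch makes several assumptions that go beyond the description: performance is interpreted as predicted latency in milliseconds, the SLO is a latency bound, low-confidence predictions are excluded by a hypothetical `min_confidence` threshold, and the lowest-latency device is preferred.

```python
import random


def select_device(predictions, slo_ms, min_confidence=0.5, rng=random):
    """Pick a device whose predicted latency meets the SLO.

    predictions maps device name to (predicted latency in ms, confidence).
    Returns None when no device qualifies, so the caller can re-queue the job.
    """
    suitable = {
        dev: latency
        for dev, (latency, conf) in predictions.items()
        if latency <= slo_ms and conf >= min_confidence
    }
    if not suitable:
        return None
    best = min(suitable.values())
    # Break ties randomly among equally fitting devices.
    return rng.choice(sorted(d for d, lat in suitable.items() if lat == best))


# Hypothetical estimator output for one workload.
predictions = {"D1": (12.0, 0.9), "D2": (8.0, 0.8), "D3": (5.0, 0.3)}
```

With a 10 ms SLO, `select_device(predictions, 10.0)` skips D1 for missing the SLO and D3 for low confidence. A `None` result corresponds to the fallback path in which the job is re-queued.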

[0116] The method may comprise receiving a plurality of different workloads and selecting by the scheduler, based on a plurality of predicted performances and SLOs, on which of the set of devices to run the plurality of workloads. The estimator module may output a list of the predicted performance of the workload on each device with a confidence value of the prediction.

[0117] The list of predicted performance may be provided according to the format:

[0118] W1: {D1: (E(Perf): C), …, Dn: (E(Perf): C)},

[0119] where W is the workload, D1-n is the device, E (Perf) is the expected performance, and C is the confidence.

[0120] The predicted performance may comprise an estimation of any combination of the operator level, the model level, or the layer level performance expected on each of the devices. The method may comprise monitoring the actual performance of the workload on the selected device and updating the estimator module based on the actual performance.

[0121] Figure 6 shows a schematic diagram providing an overview of the online system components 600 similar to that shown in figure 5.

[0122] The generated performance models from the offline training are used in the live deployment in the estimator inference component 602. When a new deep neural network (DNN) workload arrives (W1…Wn), a profiler 604 is run in static mode and used to extract the model structure. The model structure is then fed to the estimator inference component 602 which outputs a list of the predicted performance of the workload on each device with a confidence value in the prediction: W1: {D1: (E(Perf): C), …, Dn: (E(Perf): C)}; where E(Perf) is the expected performance, and C is the confidence.

[0123] The output of the estimator inference component 602 is then fed to a scheduler 606. The scheduler 606 gets the user-provided SLOs 608 for each of the workloads. The scheduler 606 then decides on which device 610 to run each workload W.

[0124] The performance is then monitored 612 and is fed to an online model tuning component 614 to improve the models.

[0125] According to the above description there is provided a computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method of selecting a device from a set of heterogeneous devices for running workloads given a Service Level Objective, SLO, and a workload; or the method of training a performance estimator module; or the method of training a profiler module; or the method for generating a set of workloads for training, alone or in combination when the program code is executed by the computer or the processor.

[0126] Online Monitoring & Tuning

[0127] After deploying a workload to a device, its performance may be continuously monitored to detect any signs of performance degradation. This monitoring is an important component of ensuring optimal performance and is the first step in a performance tuning process designed to address any model inaccuracies. If the system performance is different from the predicted model performance, a model online tuning process is triggered that leverages the newly collected data to retune the prediction models for the workload. If the problem is that the performance of the workload is violating the SLOs 608, the scheduler 606 can respond by either migrating the workload to a more powerful device 610 or scheduling additional instances of the workload on alternative devices.
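The two triggers described above, a mismatch between predicted and measured performance and an SLO violation, can be expressed as a small decision function. The relative tolerance of 20%, the latency-based interpretation of performance, and the action labels are assumptions for this sketch.

```python
def monitoring_action(predicted_ms, measured_ms, slo_ms, tolerance=0.2):
    """Decide which follow-up actions a latency measurement should trigger.

    Returns a list that may contain "retune" (the prediction was off by more
    than the relative tolerance) and "migrate_or_scale" (the SLO is violated).
    """
    actions = []
    if abs(measured_ms - predicted_ms) > tolerance * predicted_ms:
        actions.append("retune")
    if measured_ms > slo_ms:
        actions.append("migrate_or_scale")
    return actions
```

The two checks are independent: a workload can violate its SLO even when the prediction was accurate, in which case only the scheduler needs to react.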

[0128] There are two cases for this process. First, for inference jobs, which often require real-time or near-real-time processing, employing a robust mechanism to manage performance without causing downtime is essential. Strategies might include using load balancing to distribute inference requests across multiple devices 610 or dynamically scaling resources to meet demand without interrupting service. Such mechanisms ensure that inference jobs continue to operate smoothly, even as the system adjusts to optimize performance.

[0129] Second, training jobs offer more flexibility in managing performance issues. If a device 610 no longer meets the requirements of a training job due to performance degradation, the job can be checkpointed, saving its current state, and then moved to a device better suited to its demands. This process minimizes the impact on the training progress while ensuring that the job benefits from the most appropriate computational resources available, ultimately leading to more efficient model development.

[0130] An example of a specific embodiment without cross device learning will now be described.

[0131] The process starts with building a model for a new device. In order to model a new device X, first a set of predefined workloads is run on X, one by one, using an ML framework that has the profiling plugin. The profiling plugin, which includes both the profiler and monitoring components, collects per-operator statistics and per-layer statistics, including but not limited to, X’s resource utilization while each operator is running, energy consumption for each operator, latency for each operator, and any other hardware metrics. Using the collected data, multi-variate prediction models are built, which are capable of taking as input the operator, layer, model, and / or parameters, and outputting a performance / utilization prediction. Said models can use DNN, linear models, or any other prediction methods.

[0132] The process then continues to Model Deployment. The performance estimator and the online profiler are deployed as part of either the ML serving / training frameworks, e.g., as part of PyTorch Train / Serve, Ray Serve / Train, and Mindspore Serve / Train, or as independent components.

[0133] When a new model for device X is ready, it is automatically added to the Performance Estimator inference component.

[0134] A new workload that requires scheduling is first profiled using the online profiler to extract high-level model information such as the type of operators used, the workload graph, or the model layers. This is an online light-weight profiling stage that makes use of data already available at the framework level. The profiled data for each of these workloads is forwarded to the Performance estimator inference component.

[0135] The performance estimator outputs a list of the predicted performance of the workload on each device with a confidence value in the prediction: W1: {D1: (E(Perf): C), …, Dn: (E(Perf): C)}, where E(Perf) is the estimated performance and C is the confidence value. This estimation can be an estimation of the operator level, the model level, or the layer level performance expected on each of the different devices.

[0136] The list of the predicted performance of the workload on each device with a confidence value in the prediction is passed to the scheduler. The scheduler takes as input the above list along with SLO and QoS requirements for the given workload / operator. The scheduler can also take as input a prediction of the expected workload increase or decrease, in terms of, e.g., number of inference requests, amount of data to ingest, or some other workload level metrics. The scheduler uses the above estimations of performance on each device, the confidence, the workload level predictions, and the required QoS to decide on a placement of the workload / operators.

[0137] Once the workload / operator is deployed on a device, the performance is monitored. If there are large deviations between the expected performance on the device and the actual measured performance, the monitoring data is passed to a model tuning component that updates the device model based on newly collected data.

[0138] An example of a specific embodiment comprising cross device learning will now be described. It is assumed that a new hardware XPUUnknown is added to the hardware clusters that cannot be profiled offline for model building. The device behaviour needs to be learnt in production. So, the device is added with no model building stage.

[0139] To model the device during runtime, the initial model in the model performance estimator is set to be similar to the average performance of all the device models. This enables the average error to be reduced when any workload is scheduled.
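Initialising the model of the unknown device from the average of the known device models can be sketched as follows. The per-device predictors, their coefficients, and the scalar `cost` input are hypothetical placeholders; real device models would take the profiled workload structure as input.

```python
def average_model(device_models):
    """Build a predictor that averages the predictions of all known device models."""
    models = list(device_models.values())

    def predict(workload_cost):
        return sum(m(workload_cost) for m in models) / len(models)

    return predict


# Hypothetical per-device predictors mapping a workload cost to latency in ms.
known_models = {
    "D1": lambda cost: 1.0 + 0.010 * cost,
    "D2": lambda cost: 2.0 + 0.030 * cost,
}

# Starting point for the unknown device, to be refined by online tuning.
initial_unknown_model = average_model(known_models)
```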

[0140] As workloads arrive, the scheduler chooses the most suitable device from the set of other known devices, XPUknown, similar to the above described example. The workload is then scheduled on both XPUUnknown and XPUknown.

[0141] There can be one of two cases for the next step. The first device to finish may be XPUknown. In this case, the workload is left to run on the XPUunknown device until it is done. The system will rely on the tuning component to tune the model from the average model the system started with. The results from the XPUknown may be reported back as soon as they are ready to the end-user. Alternatively, the first device to finish is XPUunknown. In this case, the workload is stopped on the XPUknown device. The system still relies on the tuning component to tune the model from the average model the system started with. The results from the XPUunknown may be reported back as soon as they are ready to the end-user.
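The two cases can be simulated with two concurrent tasks. This is only a control-flow sketch under stated assumptions: `run_on` sleeps instead of executing a workload, the durations are arbitrary, and the function returns only after the unknown device has finished, so it does not model reporting the known device's result early.

```python
import asyncio


async def run_on(device, seconds):
    """Stand-in for running the workload on a device; it only sleeps."""
    await asyncio.sleep(seconds)
    return device


async def race(known_seconds, unknown_seconds):
    """Run on both devices and report which result reaches the user.

    Returns (device whose result is reported, whether the known run was stopped).
    """
    known = asyncio.create_task(run_on("XPUknown", known_seconds))
    unknown = asyncio.create_task(run_on("XPUunknown", unknown_seconds))
    done, _ = await asyncio.wait(
        {known, unknown}, return_when=asyncio.FIRST_COMPLETED
    )
    if unknown in done:
        # The unknown device won: stop the redundant run on the known device.
        known.cancel()
        await asyncio.gather(known, return_exceptions=True)
        return "XPUunknown", known.cancelled()
    # The known device won: its result is reported, but the unknown device
    # keeps running so that its measured runtime can feed the tuning component.
    await unknown
    return "XPUknown", False
```

For example, `asyncio.run(race(0.2, 0.01))` reports the result from the unknown device and cancels the run on the known one.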

[0142] In an embodiment, the system can rely on deep reinforcement learning to tune the models. In another embodiment, the system can sample the workloads forwarded to XPUUnknown based on a selection method that can use random sampling or optimization methods such as submodular optimization, or any other suitable known methods.

[0143] According to the proposed system, method, and modules described above there is provided heterogeneous accelerator management with reduced costs to cloud providers and internal cloud operation. There are also provided performance predictions for running a workload on an accelerator, which reduce costs for pre-testing deployments and enable better scheduling.

[0144] There is provided a system and method for training a DNN for automatic generation of hardware performance estimation device models.

[0145] There is provided a system and method for using said device models to profile and schedule AI workloads on heterogeneous devices based on the predicted performance of the workload on the devices to meet pre-specified user-defined Service Level Objectives (SLOs).

[0146] There is provided a system and method to auto-tune the DNN used to generate the device models based on measured real performance.

[0147] There are provided simpler scheduling interfaces that enable a user to only specify their target QoS without having to specify how many XPUs they need, or of what type.

[0148] Accordingly, accurate performance models can be provided that enable better scheduling on XPU resources such that the utilization is increased with higher and stronger QoS guarantees.

[0149] Similarly, schedulers are provided which are capable of increasing both spatial and temporal utilization of resources, reducing the overall cost of operation.

[0150] Online auto-tuning of performance models is enabled that reduces the need for costly and slow offline profiling. Seamless interfaces to the users reduce the burden on the programmers and on the datacentre operators. There is provided a methodological way to compare different accelerators without running workloads on them.

[0151] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A method of selecting a device (D1) from a set of heterogeneous devices (104) for running workloads given a Service Level Objective, SLO, (512) and a workload (W1), the method comprises:
receiving a workload (W1) at a profiler module (102);
extracting by the profiler module the model structure of the workload;
passing the model structure to a performance estimator module (108);
generating, by the performance estimator module, a list of predicted performance of the workload on each of the devices of the set of devices based on the model structure;
feeding the list to a scheduler module (510) alongside the SLO; and
selecting by the scheduler, based on the predicted performance and the SLO for the workload, on which device of the set of devices to run the workload.

2. The method according to claim 1, wherein the method comprises receiving a plurality of different workloads and selecting by the scheduler, based on a plurality of predicted performances and SLOs, on which of the set of devices to run the plurality of workloads.

3. The method according to claim 1 or 2, wherein the estimator module outputs a list of the predicted performance of the workload on each device with a confidence value of the prediction.

4. The method according to claim 3, wherein the list of predicted performance is provided according to the format W1: {D1: (E(Perf): C), …Dn: (E(Perf): C)}, where W is the workload, D1-n is the device, E(Perf) is the expected performance, and C is the confidence.

5. The method according to any preceding claim, wherein the predicted performance comprises an estimation of any combination of the operator level, the model level, or the layer level performance expected on each of the devices.

6. The method according to any preceding claim, wherein the method comprises monitoring the actual performance of the workload on the selected device and updating the estimator module based on the actual performance.

7. A method of training a performance estimator module (108) for estimating the runtime of a workload, the method comprises:
receiving, from a profiler module (102), workload runtime metrics;
extracting static features from the workload without running the workload; and
estimating runtime information of a workload based on the received workload runtime metrics and the static features.

8. The method according to claim 7, wherein the estimated runtime information is passed back to a workload builder module (200) for estimating coverage of the workload configuration.

9. The method according to claim 7 or 8, wherein the estimator is passed back to the workload builder module for estimating coverage of the workload configuration after each iteration of training.

10. The method according to claim 8 or 9, wherein the workload builder module provides one or more new workload configurations to the profiler module based on the estimated coverage of a previous workload configuration.

11. The method according to any of claims 7 to 10, wherein for each workload configuration the profiler module receives a set of data pairs from a workload builder module, each data pair comprises a workload and a device on which to run the workload.

12. The method according to any of claims 7 to 11, wherein the workload comprises a machine learning model, model parameters, and an example model input.

13. A method of training a profiler module (102) for determining the optimal accelerator for running a workload (W1), the method comprises:
selecting a set of predefined machine learning models with representative workloads;
for each model of the set:
decomposing the model into its constituent components at a specified layer;
running the model on each of a set of devices (104); and
obtaining runtime data by monitoring the performance of each component of the model on each device of the set of devices;
passing the runtime data for each model to one of a plurality of machine learning models representing a device of the set of devices;
passing corresponding model structure data to said one of the plurality of machine learning models representing said device of the set of devices; and
training each machine learning model to estimate the performance of a different device of the plurality of devices of the set by associating the respective model runtime data with the corresponding model structure data.

14. The method according to claim 13, wherein each device represents a different accelerator for running the workload of the model.

15. The method according to claim 13 or 14, wherein the specified layer is any one of the machine learning layer, the compiler layer, or the hardware layer.

16. The method according to any of claims 13 to 15, wherein the constituent components of the compiler layer comprise the computational kernels of the workload.

17. The method according to any of claims 13 to 16, wherein the model structure data comprises any combination of number of kernels, type of kernels, a directed acyclic graph, DAG, of kernels, and one or more other hyper-parameters of the model.

18. The method according to claim 17, wherein the DAGs are provided to the device performance estimator model as dictionaries with keys equal to each model and values comprising a list of model layers or kernels.

19. The method according to any of claims 13 to 18, wherein the monitoring (612) comprises deploying an exporter agent on each computing node representing a device of the set of devices within the system.

20. The method according to any of claims 13 to 19, wherein the runtime data comprises any combination of overall device utilization rate, specific usage of device memory, response time, throughput, model accuracy, and one or more user-defined QoS metrics.

21. A method for generating a set of workloads for training a profiler module (102) or estimator module (108), the method comprising:
receiving a plurality of machine learning models, a plurality of device types, and a plurality of neural network layers; and
generating a set of data pairs, each data pair comprising a workload and a device on which to run the workload.

22. The method according to claim 21, wherein each workload comprises a machine learning model, a set of parameters of the model, and an example input of the model.

23. The method according to claim 21 or 22, wherein the method comprises receiving a list of multiple workloads and their associated runtime metrics when run on a particular device.

24. The method according to any of claims 21 to 23, wherein the plurality of device types comprises model numbers and associated specs of each device which may be used to run the model.

25. The method according to any of claims 21 to 24, wherein the plurality of machine learning models comprises a list of neural network layers of the model.

26. The method according to any of claims 21 to 25, wherein the plurality of neural network layers comprises a list of types of layers used in the machine learning models.

27. The method according to any of claims 21 to 26, wherein the method comprises estimating coverage of previously sampled workload configurations using variational autoencoders or Gaussian processes and determining new workload configurations based on the estimated coverage.

28. A system (600) for selecting a device from a set of heterogeneous devices (104) for running workloads (W1) given a Service Level Objective, SLO, (512) and a workload, the system comprises:
a profiler module (604) configured to extract the model structure of the workload;
an estimator module (602) configured to estimate the predicted performance of the workload on each device based on the model structure and one or more pre-trained device models; and
a scheduler module (606) configured to select a device from the set of devices (610) for running the workload (W1) based on the predicted performance for each device and the SLO (608) of the workload.

29. A workload building module (200) for generating a set of workloads for training a profiler module or estimator module, the module configured to perform the operations of:
receiving a plurality of machine learning models, a plurality of device types, and a plurality of neural network layers; and
generating a set of data pairs, each data pair comprising a workload and a device on which to run the workload.

30. A profiler module (102) configured to determine the optimal accelerator for running a workload and trained according to the method of claim 13.

31. An estimator module (108) configured to estimate the runtime of a workload and trained according to the method of claim 7.

32. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method of claim 1 or the method of claim 7 or the method of claim 13 or the method of claim 21 alone or in combination when the program code is executed by the computer or the processor.