A heterogeneous inference scheduling method and electronic device for sparse activation models

By acquiring the running information of the sparse activation model within a preset period, determining the activation distribution characteristics, and dynamically selecting heterogeneous devices, the problem of resource mismatch in heterogeneous inference of the sparse activation model is solved, thereby improving inference efficiency and stability.

CN122309162A · Pending · Publication Date: 2026-06-30 · ANT BLOCKCHAIN TECHNOLOGY (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
ANT BLOCKCHAIN TECHNOLOGY (SHANGHAI) CO LTD
Filing Date
2026-03-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In heterogeneous inference scenarios, existing scheduling methods for sparse activation models struggle to reflect dynamic changes in the model's activation state in a timely manner, leading to a mismatch in resource requirements and impacting inference efficiency and stability.

Method used

By acquiring the running information of the target task within a preset period, the activation distribution characteristics are determined, and suitable heterogeneous candidate devices are dynamically selected for scheduling based on computing, communication, and memory requirements.

Benefits of technology

It improves the adaptability of heterogeneous inference scheduling, taking into account inference efficiency, execution stability and resource utilization.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122309162A_ABST
Patent Text Reader

Abstract

This invention relates to the field of deep learning model inference optimization technology, and discloses a heterogeneous inference scheduling method and electronic device for sparse activation models. The method includes: acquiring task data for a target task and determining a target model for executing the target task; acquiring runtime information within a preset period corresponding to the target task; determining the activation distribution characteristics of the target task in the target model based on the activation-related information in the runtime information; determining the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device based on the computational requirements, data exchange requirements, and memory carrying requirements characterized by the activation distribution characteristics, combined with the resource-related information in the runtime information; determining the target device for executing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device; and calling the target device to process the task data to obtain the task processing result. This method can improve the inference scheduling adaptability of sparse activation models in heterogeneous environments.

Description

Technical Field

[0001] This invention relates to the field of deep learning model inference optimization technology, specifically to a heterogeneous inference scheduling method and electronic device for sparse activation models.

Background Technology

[0002] Sparse activation models are a class of neural network models that, in a single inference iteration, do not invoke all model parameters or all computational units. Instead, they select a subset of computational units from multiple candidate units based on the characteristics of the current input. Sparse activation models can include Mixture of Experts (MoE) models, gated sparse networks, Conditional Computation networks, dynamic routing networks, and modular neural networks that select specific sub-networks for execution based on the input. These models share the characteristic that not all computational units participate in computation during a single inference iteration; instead, only certain layers, branches, experts, or sub-modules are activated based on the input characteristics. Therefore, the actual execution path of the task typically changes with the input, further leading to variations in computational requirements, data exchange requirements, and memory requirements.

[0003] These models typically employ gating, routing, conditional computation, or path selection mechanisms to choose the actual execution path corresponding to the current input. Compared to dense models where all parameters participate in the computation simultaneously, sparse activation models can maintain a large model capacity and strong expressive power while activating only a subset of parameters or sub-modules in a single inference, thus achieving a balance between model capacity and computational cost per inference. Based on this characteristic, sparse activation models are increasingly being applied to scenarios with high requirements for model size and inference efficiency, such as natural language processing, multimodal understanding and generation, and recommendation decision-making.

[0004] Take Mixture of Experts models (referred to below as hybrid expert models) as an example. As a typical form of sparse activation model, a hybrid expert model usually includes multiple expert sub-modules and a gating network that selects among them. The gating network selects some expert sub-modules to participate in the inference computation based on the characteristics of the current input data, while the remaining expert sub-modules typically do not participate. In other words, the number of expert sub-modules that actually participate in a single inference is usually smaller than the total number of selectable expert sub-modules in the model. Hybrid expert models therefore typically exhibit pronounced sparse activation characteristics during the inference phase. Because hybrid expert models are highly representative of sparse activation models, the following explanation uses them to illustrate the technical challenges that sparse activation models face in heterogeneous inference scenarios. However, those skilled in the art will understand that the analysis also applies to other sparse activation models that dynamically select some computational units to perform inference based on the input.
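The gating behavior described above can be illustrated with a minimal sketch. It assumes a top-k selection rule with a softmax over the selected scores, which is one common gating scheme and is not stated in the source; the function name `top_k_gate` and the score values are hypothetical.

```python
import math

def top_k_gate(scores, k):
    """Return {expert_index: normalized_weight} for the k highest-scoring experts."""
    # Pick the k experts with the largest gating scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the returned weights sum to 1.
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Hypothetical gating outputs for a layer with 4 expert sub-modules.
gate_scores = [0.1, 2.0, -1.0, 1.5]
active = top_k_gate(gate_scores, k=2)
print(sorted(active))  # -> [1, 3]: only 2 of the 4 experts participate
```

Under this assumption, a different input yields different gating scores and therefore a different active set, which is the source of the input-dependent execution path discussed in the following paragraphs.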

[0005] In the inference process of sparse activation models, different input data typically trigger different activation modes. Taking a hybrid expert model as an example, the set of expert submodules actually participating in the computation during a single forward pass may differ for different inference requests, and the activation intensity of each activated expert submodule may also differ. For some model structures, this difference is further reflected in the changes in activation distribution between different model layers. Therefore, sparse activation models typically do not run along a fixed computational path during the inference phase, but rather exhibit dynamic execution path characteristics that vary with the input. As the actual execution path changes, the computational load, data exchange scale, and device-side storage usage of the task often change within different time slices.

[0006] This dynamic execution path characteristic further impacts the utilization of underlying hardware resources. For a specific inference task, a larger number of activated computational units, higher activation intensity, or larger parameter scales for activated computational units typically mean a higher actual computational load, more data transmission or exchange, and greater storage resource consumption. Taking a hybrid expert model as an example, when the number of activated expert sub-modules increases within a certain period, or when frequently activated expert sub-modules correspond to larger weight scales, the task's inference time, data communication overhead, and memory consumption during execution will usually change accordingly. Therefore, the resource requirements of sparse activation models during the inference phase are not only determined by the model's static structure but are also closely related to the activation distribution caused by the current input.

[0007] To support online inference for such models, existing systems typically employ heterogeneous computing environments composed of a central processing unit (CPU), a graphics processing unit (GPU), and a neural processing unit (NPU). The hardware architecture and resource characteristics of different types of candidate devices vary significantly. For example, CPUs are generally better suited for general control logic, serial scheduling, and some tasks with relatively low parallelism; their execution typically relies on accessing relevant data in the system's runtime memory. GPUs are generally better suited for large-scale parallel tensor computations; their efficient execution usually depends on model weights, intermediate activations, and cached data already residing in local video memory. NPUs are typically optimized for specific operators in neural network inference; their execution also usually relies on local on-chip storage resources or their associated storage resources. Therefore, different candidate devices differ not only in computational power but also in the location of the data they rely on and the storage resources they can directly access.

[0008] In common heterogeneous inference environments, when a task is executed on a target device, if the model weights, input data, intermediate activation results, or cached data required by the task are not yet located in a storage area that the target device can directly and efficiently access, data transfer is usually required. For example, when data in the CPU's system memory needs to be provided to the GPU for execution, the relevant data usually needs to be transferred to the GPU's video memory; when the GPU's execution results need to be handed over to the CPU or other devices for further processing, the corresponding intermediate results also usually need to be transferred from the video memory back to a storage area accessible to other devices. For heterogeneous inference processes involving the NPU, there are also often situations where input data, model parameters, or intermediate results are transferred between different storage areas. Therefore, data access in heterogeneous inference processes involves not only the computation itself but also often the data exchange between devices.

[0009] One direct impact of this is data communication overhead. This overhead depends not only on the amount of data to be transmitted, but also on the connection method between devices, available bandwidth, and current transmission load. In some scenarios, even if a candidate device has high theoretical computing power, if the task needs to transfer relevant parameters before execution, or if intermediate activation results need to be frequently exchanged during execution, data transmission waiting can still significantly increase the overall task inference time. In other words, the theoretical computing power of a candidate device cannot solely determine the actual inference efficiency. Whether the task-related data already resides in the accessible storage area of the target device, and the communication cost of data transmission across devices, also significantly affect heterogeneous inference performance.
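A small worked example may make this point concrete. The sketch below assumes a deliberately simple latency model (transfer time plus compute time, with no overlap between the two); the function name and all device figures are hypothetical and chosen only for illustration.

```python
def estimated_latency_s(flops_needed, bytes_to_move, flops_per_s, bandwidth_bytes_per_s):
    """Rough latency estimate: data transfer time plus compute time, no overlap."""
    transfer_s = bytes_to_move / bandwidth_bytes_per_s
    compute_s = flops_needed / flops_per_s
    return transfer_s + compute_s

task_flops = 1e11  # hypothetical compute demand of one request

# GPU: 50x the compute throughput, but 2 GB of weights must first be copied in.
gpu_latency = estimated_latency_s(task_flops, bytes_to_move=2e9,
                                  flops_per_s=100e12, bandwidth_bytes_per_s=16e9)
# CPU: much slower compute, but the data already sits in system memory.
cpu_latency = estimated_latency_s(task_flops, bytes_to_move=0,
                                  flops_per_s=2e12, bandwidth_bytes_per_s=16e9)

print(f"gpu={gpu_latency:.3f}s cpu={cpu_latency:.3f}s")
```

With these assumed figures the transfer term dominates the GPU estimate (about 0.126 s against about 0.050 s for the CPU), so the device with the higher theoretical computing power is not the faster choice for this request.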

[0010] Besides communication factors, memory resources are also crucial in heterogeneous inference using sparse activation models. Since sparse activation models typically contain many candidate computational units, the overall model parameter scale is large. Even if only a portion of these computational units are activated in a single inference iteration, the relevant weights, intermediate activation values, and runtime cache still require device-side memory or GPU memory resources. Taking hybrid expert models as an example, even if only some expert sub-modules are activated in a single inference iteration, the relevant expert parameters, gating results, intermediate activation values, and cached data still require system runtime memory, GPU memory, or other device-side storage resources. Different candidate devices typically differ in system runtime memory capacity, GPU memory capacity, and currently available memory space. When a candidate device's local available storage resources are insufficient, even with high computing power, its execution may be limited due to the inability to accommodate the model data and runtime data required for the current task. Furthermore, with a large number of concurrent tasks or rapidly changing activation paths, the system runtime memory and GPU memory usage may fluctuate within a short period, affecting the candidate device's ability to handle subsequent tasks.

[0011] Therefore, in heterogeneous inference scenarios using sparse activation models, the actual execution performance of the same inference task on different candidate devices is typically influenced by a combination of factors, including computational resource usage, data communication overhead, and memory consumption. If a task has a high computational load, it is more concerned with the computational capabilities of the candidate devices; if a task involves significant data transfer or intermediate result exchange before and after execution, it is more sensitive to communication links and bandwidth conditions; if a task activates many computational units or involves large parameters and caches, it is more dependent on system memory, video memory, or other device-side storage resources. Thus, the compatibility between tasks and devices is usually multi-dimensional, rather than determined by a single hardware metric.

[0012] In existing technologies, a common approach is to pre-establish fixed mappings between model tasks and devices during the deployment phase. For example, certain model layers, task types, or computational units are fixedly assigned to specific GPUs, CPUs, or other devices, and the device allocation is not adjusted based on task state changes during runtime. This approach is relatively straightforward and has low runtime scheduling overhead, making it suitable for scenarios with minimal load variations. However, for sparse activation models, the actual computational units involved in a single inference are not fixed, and the computational resource usage, data communication overhead, and memory usage of a task often fluctuate across different time slices. Fixed mappings are typically established based on static configurations, focusing more on the model structure itself than the actual activation state of the current task, and failing to reflect the changing relationship between the current location of task-related data and the target device's access path. Therefore, when input distribution or activation patterns change significantly, fixed mappings often fail to reflect the real-time adaptation of different candidate devices, easily leading to situations where some devices are overloaded while others are underutilized.

[0013] Another approach employs a heuristic scheduling method based on empirical rules. For example, candidate devices are screened or ranked based on whether the task is computationally intensive, data-intensive, or whether the remaining resources of the current device meet preset conditions. Compared to a completely static mapping method, this approach considers the matching relationship between tasks and devices, thus offering some flexibility. However, empirical rules are typically based on prior classifications of task types and hardware characteristics, and their rule boundaries are often quite coarse. When the task activation state continuously changes during execution, relying solely on preset rules makes it difficult to fully characterize the task's comprehensive resource requirements in terms of computation, communication, and memory. Especially in sparse activation models, even different requests belonging to the same task category may exhibit different computational resource usage, data communication overhead, and memory usage due to different sets of actual activation units. In this case, if devices are still allocated to tasks according to fixed rules, the scheduling results often fail to maintain a high degree of fit.

[0014] Another approach uses a single performance objective as the basis for device selection, such as prioritizing candidate devices with higher theoretical computing power or those with lower current utilization. This approach is easy to implement and facilitates rapid scheduling decisions. However, the inference performance of sparse activation models in heterogeneous environments is usually not determined by a single factor. If the target device is selected solely based on the theoretical computing power of the candidate devices, data communication overhead can become a new limiting factor when there is significant parameter transfer, intermediate activation transmission, or data exchange between devices during task execution. If the target device is selected solely based on the current device utilization, insufficient remaining memory to accommodate the currently activated computing units and their runtime data can also lead to decreased task execution stability. Therefore, while a single metric can reflect the device status in a certain resource dimension, it is often difficult to comprehensively characterize the overall suitability of the candidate device for the current inference task.

[0015] Furthermore, inference scheduling in sparse activation models not only faces the challenges of heterogeneous device architecture but also the problem of state changes over time. During continuous operation, the request arrival rate, input length distribution, path selection results, and real-time device load can all change rapidly. A candidate device might have high available computing power and low memory pressure in one time slice, but become busy in another due to increased concurrent tasks, data transfer, or cache usage. Correspondingly, some computational units might be frequently activated for a period, while being less frequently invoked in another. For example, in a hybrid expert model, some expert submodules might become hot experts for a period, while being less frequently selected by the gating network in another. This indicates that device adaptation relationships are not static and stable over the long term, but rather dynamic relationships that are jointly related to recent task states and device states.
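The shift in hot experts between time slices can be illustrated by counting routing decisions per slice. This is a minimal sketch: the routing logs are invented, and `hot_experts` is a hypothetical helper, not part of the described method.

```python
from collections import Counter

def hot_experts(routing_log, top_n):
    """Return the top_n most frequently selected expert indices in one time slice."""
    counts = Counter(routing_log)
    return [expert for expert, _ in counts.most_common(top_n)]

# Hypothetical expert indices chosen by the gating network in two time slices.
slice_1 = [0, 0, 0, 1, 2, 0, 1]
slice_2 = [3, 3, 2, 3, 2, 0]

print(hot_experts(slice_1, top_n=2))  # -> [0, 1]
print(hot_experts(slice_2, top_n=2))  # -> [3, 2]
```

Expert 0 is the hottest expert in the first slice and is rarely selected in the second, so a device placement tuned to the first slice would no longer match the load in the second.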

[0016] However, many existing scheduling methods still do not make full use of recent runtime status when selecting devices. Some solutions make scheduling decisions based on long-term statistical results or instantaneous single-point states, and pay less attention to continuously collecting task runtime metrics within a preset period and using these metrics to characterize the computing resource consumption, data communication overhead, and memory usage of candidate devices when executing the current task. Without a continuous record of recent status, the scheduling system's judgment of the task's actual resource requirements is prone to lag; without a joint analysis of recent multi-dimensional resource metrics, the scheduling system has difficulty distinguishing whether the current limiting factor lies mainly on the computing side, the communication side, or the memory side. Therefore, even if the device selection process is nominally dynamic, its scheduling basis may still be relatively coarse.

[0017] This problem is often more pronounced in sparse activation models. The dynamic activation characteristics of these models mean that the activation intensity of different computational units can vary across different time slices, impacting inference time, data communication overhead, and memory usage. For example, in hybrid expert models, changes in the activation sets and intensity of expert submodules further affect the number of experts actually participating in the computation, the scale of inter-layer data exchange, and runtime cache usage. If the scheduling system cannot combine the activation data collected within a preset period with task performance metrics to uniformly evaluate the execution suitability of each candidate device, it becomes difficult to accurately determine which type of device is more suitable for the current task. Especially when some tasks have high computational pressure, some have high communication pressure, and some have high memory pressure, and these pressures may change simultaneously, relying solely on static mapping, empirical rules, or a single metric often fails to balance inference efficiency, execution stability, and the effective utilization of heterogeneous resources.

[0018] In summary, existing heterogeneous inference solutions for sparse activation models often fail to continuously represent the current task state by incorporating recent runtime information within a preset period, and rarely conduct unified adaptation analysis of candidate devices based on the computational, data exchange, and memory requirements corresponding to the activation distribution. Due to the lack of a device adaptation mechanism that can dynamically update with changes in model activation state, existing solutions typically struggle to reflect changes in resource requirements caused by dynamic execution paths, making it difficult to balance inference efficiency, execution stability, and resource utilization in heterogeneous computing environments.

Summary of the Invention

[0019] This invention provides a heterogeneous inference scheduling method and electronic device for sparse activation models, which alleviates the problem in the prior art where heterogeneous inference scheduling is unable to reflect the recent activation state changes of sparse activation models and the state changes of candidate devices in terms of computing resource occupation, data communication overhead and memory occupation in a timely manner, thus making it difficult to accurately determine the execution device that matches the activation distribution characteristics and resource requirements of the current task.

[0020] To achieve the above objectives, embodiments of the present invention provide a heterogeneous inference scheduling method for sparse activation models, comprising: acquiring task data of a target task; determining a target model for executing the target task, wherein the target model is a sparse activation model; acquiring runtime information within a preset period corresponding to the target task, wherein the runtime information within the preset period includes at least activation-related information characterizing the activation state of the target model, and resource-related information characterizing the difference in execution cost of the target task on heterogeneous candidate devices; determining the activation distribution characteristics of the target task in the target model based on the activation-related information; determining the computational adaptation results of each heterogeneous candidate device for the target task based on the computational requirements characterized by the activation distribution characteristics and the information related to task inference time and device computing power in the resource-related information; determining the communication adaptation results of each heterogeneous candidate device for the target task based on the data exchange requirements characterized by the activation distribution characteristics and the information related to task communication overhead and device communication capabilities in the resource-related information; determining the memory adaptation results of each heterogeneous candidate device for the target task based on the memory carrying requirements characterized by the activation distribution characteristics and the information related to task memory usage and device storage resources in the resource-related information; determining the target device for executing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device; and invoking the target device to process the task data based on the target model and obtain the task processing result.

[0021] In one alternative implementation, activation-related information includes at least one of the activation count and activation weight of the activated computing units in the model, and resource-related information includes at least one of the task inference time, task communication metrics, and task memory metrics.

[0022] In one optional implementation, determining the activation distribution characteristics of the target task in the target model based on activation-related information includes: determining at least one of the following as activation distribution characteristics: the set of activated computational units of the target task in at least one model layer, the activation intensity distribution of each activated computational unit, and the inter-layer distribution of each activated computational unit in multiple model layers.
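The three kinds of activation distribution characteristics named above could be derived from raw activation records roughly as follows. This is a sketch under stated assumptions: the record layout `(layer, unit, weight)`, the use of summed gating weight as "activation intensity", and the function name are all hypothetical.

```python
from collections import defaultdict

def activation_distribution(records):
    """Summarize (layer, unit, weight) activation events collected in one period.

    Returns a tuple of:
      active_sets: {layer: set of activated units}
      intensity:   {(layer, unit): share of total activation weight}
      per_layer:   {layer: number of activated units}  (inter-layer distribution)
    """
    weight_sum = defaultdict(float)
    for layer, unit, weight in records:
        weight_sum[(layer, unit)] += weight

    active_sets = defaultdict(set)
    for layer, unit in weight_sum:
        active_sets[layer].add(unit)

    total = sum(weight_sum.values())
    # Guard against an empty or all-zero window to avoid division by zero.
    intensity = {key: (w / total if total > 0 else 0.0) for key, w in weight_sum.items()}
    per_layer = {layer: len(units) for layer, units in active_sets.items()}
    return dict(active_sets), intensity, per_layer

# Hypothetical activation events: (layer index, expert index, gating weight).
events = [(0, 1, 0.5), (0, 3, 0.5), (1, 2, 1.0), (0, 1, 1.0)]
active_sets, intensity, per_layer = activation_distribution(events)
print(active_sets, per_layer)
```

In this example layer 0 activates units 1 and 3, layer 1 activates unit 2, and unit 1 of layer 0 carries half of the total activation weight.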

[0023] In one optional implementation, determining the computational adaptation results of each heterogeneous candidate device for the target task, based on the computational requirements characterized by the activation distribution characteristics and the information related to task inference time and device computing power in the resource-related information, includes: determining the computational adaptation results of each heterogeneous candidate device for the target task based on the number of activated computing units in the set, the activation intensity distribution of each activated computing unit, and the corresponding task inference time, combined with the peak performance indicators of each heterogeneous candidate device. Determining the communication adaptation results of each heterogeneous candidate device for the target task, based on the data exchange requirements characterized by the activation distribution characteristics and the information related to task communication overhead and device communication capabilities in the resource-related information, includes: determining the communication adaptation results of each heterogeneous candidate device for the target task based on the inter-layer distribution of the activated computing unit set across multiple model layers and the corresponding task communication indicators, combined with the device communication indicators of each heterogeneous candidate device. Determining the memory adaptation results of each heterogeneous candidate device for the target task, based on the memory carrying requirements characterized by the activation distribution characteristics and the information related to task memory usage and device storage resources in the resource-related information, includes: determining the memory adaptation results of each heterogeneous candidate device for the target task based on the number of activated computing units in the set, the activation intensity distribution of each activated computing unit, and the corresponding task memory indicators, combined with the device storage resources of each heterogeneous candidate device.
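The source does not give formulas for the three adaptation results, so the sketch below is only one plausible reading. It assumes the task's demand in each dimension has already been derived from the activation distribution characteristics and the collected task indicators, and it scores each dimension as a capacity-to-demand ratio clipped to [0, 1]. The function names, dictionary keys, and figures are hypothetical.

```python
def capped_ratio(capacity, demand):
    """Return capacity / demand clipped to [0, 1]; 1.0 means demand is fully covered."""
    if demand <= 0:
        return 1.0
    return min(1.0, capacity / demand)

def adaptation_results(demand, device):
    """Score one candidate device in the compute, communication, and memory dimensions."""
    return {
        "compute": capped_ratio(device["peak_flops_per_s"], demand["flops_per_s"]),
        "communication": capped_ratio(device["bandwidth_bytes_per_s"], demand["bytes_per_s"]),
        "memory": capped_ratio(device["free_memory_bytes"], demand["memory_bytes"]),
    }

# Hypothetical demand derived from the activated unit set and recent task indicators.
demand = {"flops_per_s": 50e12, "bytes_per_s": 8e9, "memory_bytes": 12e9}
# Hypothetical GPU: ample compute and bandwidth, but only 8 GB of free video memory.
gpu = {"peak_flops_per_s": 100e12, "bandwidth_bytes_per_s": 16e9, "free_memory_bytes": 8e9}

print(adaptation_results(demand, gpu))
```

Here the GPU fully covers the compute and communication demand but only about two thirds of the memory demand, so the memory dimension is the limiting factor for this device.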

[0024] In one optional implementation, the heterogeneous candidate devices include at least two of a central processing unit (CPU), a graphics processing unit (GPU), and a neural processing unit (NPU). The device computing power includes the floating-point operation capability of the heterogeneous candidate device; the device communication capability includes the data transmission bandwidth of the heterogeneous candidate device; and the device storage resources include at least one of the following: system runtime memory, video memory, on-chip storage resources, or storage resources associated with the heterogeneous candidate device.

[0025] In one optional implementation, acquiring the runtime information within a preset period corresponding to the target task includes: collecting the runtime metrics of the target model on the heterogeneous candidate devices; storing the runtime metrics in a metric storage unit; and reading the runtime information within the preset period corresponding to the target task from the metric storage unit.

[0026] In one optional implementation, the runtime metrics include historical runtime metrics and currently collected runtime metrics, written into the metric storage unit in order of collection time. Reading the runtime information within the preset period corresponding to the target task from the metric storage unit includes: extracting the activation-related information and resource-related information corresponding to the target task from the historical runtime metrics and the currently collected runtime metrics, as the runtime information within the preset period.
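The collect, store, and read-by-period flow could look like the following sketch. It assumes an in-memory store with a bounded number of records and numeric timestamps in seconds; the class name `MetricStore`, its methods, and the record fields are hypothetical.

```python
from collections import deque

class MetricStore:
    """In-memory stand-in for the metric storage unit (hypothetical design)."""

    def __init__(self, max_records=10000):
        # Records are appended in collection-time order; the oldest are dropped first.
        self._records = deque(maxlen=max_records)

    def write(self, timestamp, task_id, activation, resource):
        self._records.append(
            {"t": timestamp, "task": task_id, "activation": activation, "resource": resource}
        )

    def read_window(self, task_id, now, period_s):
        """Return this task's records collected within the last period_s seconds."""
        return [
            r for r in self._records
            if r["task"] == task_id and now - period_s <= r["t"] <= now
        ]

store = MetricStore()
store.write(0, "task_a", {"active_units": 2}, {"latency_ms": 40})
store.write(5, "task_a", {"active_units": 3}, {"latency_ms": 55})
store.write(8, "task_b", {"active_units": 1}, {"latency_ms": 20})
store.write(9, "task_a", {"active_units": 4}, {"latency_ms": 70})

window = store.read_window("task_a", now=10, period_s=6)
print([r["t"] for r in window])  # -> [5, 9]
```

Only the two most recent `task_a` records fall inside the 6-second period, so older history and other tasks' records do not influence the scheduling decision for this task.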

[0027] In one optional implementation, the target device for performing the target task is determined based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device. This includes: fusing the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device to obtain a comprehensive adaptation result of each heterogeneous candidate device for the target task; and determining the target device based on the comprehensive adaptation result of each heterogeneous candidate device.
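The source does not specify how the three results are fused. The sketch below assumes a weighted sum followed by picking the highest-scoring device, which is one simple option; the weights, scores, and function names are hypothetical.

```python
def fuse(result, weights):
    """Weighted sum of the per-dimension adaptation results (assumed fusion rule)."""
    return sum(weights[dim] * result[dim] for dim in weights)

def select_target(results_by_device, weights):
    """Return the device whose comprehensive adaptation result is highest."""
    return max(results_by_device, key=lambda dev: fuse(results_by_device[dev], weights))

# Hypothetical per-device adaptation results in [0, 1].
results = {
    "cpu": {"compute": 0.2, "communication": 1.0, "memory": 1.0},
    "gpu": {"compute": 1.0, "communication": 0.6, "memory": 0.5},
    "npu": {"compute": 0.8, "communication": 0.8, "memory": 0.9},
}
weights = {"compute": 0.4, "communication": 0.3, "memory": 0.3}

print(select_target(results, weights))  # -> npu
```

With these assumed numbers the comprehensive results are roughly 0.68 (CPU), 0.73 (GPU), and 0.83 (NPU), so the NPU is chosen even though the GPU has the best compute score. Other fusion rules, such as taking the minimum across dimensions, would also fit the description.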

[0028] In one optional implementation, the target model includes multiple model layers, and each model layer includes multiple sub-models. For each sub-model, the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device for the corresponding sub-model are determined. Based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device for each sub-model, corresponding sub-devices are determined for each sub-model. Subsequently, each sub-device is invoked to execute the corresponding sub-model, and the task processing result is generated based on the processing results of multiple sub-models.
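Per-sub-model assignment can be sketched in the same spirit. The example assumes each sub-model already has one comprehensive score per device, that device execution is represented by plain callables, and that partial results are combined by summation; all of these are illustrative assumptions, as are the names used.

```python
def assign_sub_devices(scores):
    """scores: {sub_model: {device: comprehensive_score}} -> {sub_model: device}."""
    return {sub: max(per_dev, key=per_dev.get) for sub, per_dev in scores.items()}

def run_sub_models(assignment, runners, task_data):
    """Run each sub-model on its assigned sub-device and combine the partial results."""
    # runners[device] is a hypothetical callable standing in for execution on that device.
    partials = [runners[dev](sub, task_data) for sub, dev in assignment.items()]
    return sum(partials)  # assumed combination rule

scores = {
    "expert_0": {"gpu": 0.9, "cpu": 0.4},
    "expert_1": {"gpu": 0.3, "cpu": 0.7},
}
runners = {
    "gpu": lambda sub, x: x * 2,  # placeholder computation
    "cpu": lambda sub, x: x + 1,  # placeholder computation
}

assignment = assign_sub_devices(scores)
print(assignment)                               # -> {'expert_0': 'gpu', 'expert_1': 'cpu'}
print(run_sub_models(assignment, runners, 3))   # -> 10
```

Each sub-model is placed independently, so two experts in the same layer can end up on different devices when their scores differ.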

[0029] This invention also provides an electronic device, including a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the above-described heterogeneous inference scheduling method for sparse activation models.

[0030] The heterogeneous inference scheduling method and electronic device for sparse activation models provided in this invention acquire runtime information within a preset period corresponding to the target task, determine the activation distribution characteristics of the target task in the target model based on the activation-related information, and then use the activation distribution characteristics to characterize the computational requirements, data exchange requirements, and memory carrying requirements. Combining these with the resource-related information of the candidate devices in the computation, communication, and storage dimensions, the computational adaptation results, communication adaptation results, and memory adaptation results are determined, and the target device is identified from them. This allows heterogeneous inference scheduling to track the actual dynamic execution path of the sparse activation model more closely, better reflects the real requirements of the current task in each resource dimension, improves the adaptability of target device selection, and helps to balance inference efficiency, execution stability, and heterogeneous resource utilization.

Brief Description of the Drawings

[0031] To illustrate the technical solutions of the embodiments in this specification more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some of the embodiments recorded in this specification; those skilled in the art can obtain other drawings from them without creative effort.

[0032] Figure 1 is a schematic diagram of the overall process of a heterogeneous inference scheduling method for sparse activation models, provided in one embodiment of the present invention.

[0033] Figure 2 is a schematic diagram of a process for acquiring runtime information within a preset period, provided in one embodiment of the present invention.

[0034] Figure 3 is a schematic diagram of an organization structure for runtime information within a preset period, provided in one embodiment of the present invention.

[0035] Figure 4 is a schematic diagram of an activation distribution feature determination process, provided in one embodiment of the present invention.

[0036] Figure 5 is a schematic diagram of the principle of generating task requirement features from activation distribution features, provided in one embodiment of the present invention.

[0037] Figure 6 is a schematic diagram of a multidimensional adaptation analysis process, provided in one embodiment of the present invention.

[0038] Figure 7 is a schematic diagram of a target device determination process, provided in one embodiment of the present invention.

[0039] Figure 8 is a schematic diagram of the structure of an electronic device, provided in one embodiment of the present invention.

[0040] Figure 9 is a schematic diagram of a task processing device, provided in one embodiment of the present invention.

[0041] Figure 10 is a schematic diagram of an application scenario of a heterogeneous inference scheduling system for sparse activation models, provided in one embodiment of the present invention.

Detailed Description

[0042] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.

[0043] The technical solutions of one or more embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described below are used to illustrate the technical concept of the present invention, and those skilled in the art can make various equivalent substitutions or modifications without departing from the technical concept of the present invention.

[0044] In one embodiment of the present invention, Figure 1 is a schematic diagram of the overall process of a heterogeneous inference scheduling method for sparse activation models. The process of Figure 1 can be performed by the electronic device 800 shown in Figure 8, or by the task processing device 900 shown in Figure 9. Figure 10 illustrates an application scenario for a heterogeneous inference scheduling system for sparse activation models. Client 1001 can send a target task to server 1002. Server 1002 can invoke candidate devices from a heterogeneous device resource pool 1003 to perform inference processing and return the task processing result to client 1001. The heterogeneous device resource pool 1003 may include a central processing unit (CPU) 10031, a graphics processing unit (GPU) 10032, a neural processing unit (NPU) 10033, and other heterogeneous computing devices that can be used for neural network inference.

[0045] In one embodiment of the present invention, a sparse activation model is a type of neural network model that, in a single inference process, does not invoke all model parameters or all computational units, but instead selects a subset of computational units from multiple candidate computational units to participate in the computation based on the characteristics of the current input. Sparse activation models can include Mixture of Experts (MoE) models, gated sparse networks, Conditional Computation networks, dynamic routing networks, and modular neural networks that select specific sub-networks for execution based on the input. The common feature of these models is that, in a single inference process, not all computational units participate in the computation; instead, only some layers, branches, experts, or sub-modules are activated based on the input characteristics. Therefore, the actual execution path of the target task in the model usually changes with the input, further leading to variations in computational requirements, data exchange requirements, and memory requirements.

[0046] For ease of explanation, in one embodiment of the present invention, a Mixture of Experts (MoE) model is used as a typical example of a sparse activation model for detailed description. It should be understood that the following descriptions of expert submodules, gating networks, and expert activation distribution characteristics also apply to the activated computational units, path selection units, or modular subnetworks in other sparse activation models. That is, when the target model is not an MoE model but another sparse activation model that selects computational units based on input, the expert submodules mentioned below can be understood as the selected computational branches, conditional computational units, or subnetwork modules, and the corresponding expert activation distribution characteristics can be understood as activation distribution characteristics.

[0047] In one embodiment of the present invention, step S110 of Figure 1 may include: acquiring task data for the target task and determining the target model, wherein the target model is used to execute the target task. The target task may be a currently pending inference request, and the task data may include text, images, audio, video, or multimodal combined data. The target model may be a sparse activation model that matches the target task. If the target task itself carries a model identifier, the target model can be determined directly from the model identifier; if the target task does not explicitly carry a model identifier, the target model matching the target task can be selected from multiple candidate models based on task type, input modality, item label, or preset routing rules.
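The two ways of determining the target model can be sketched as follows. This is a minimal illustration only: the `Task` fields, the routing table, and the model names are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_type: str
    modality: str
    model_id: Optional[str] = None  # explicit model identifier, if the task carries one


# Hypothetical preset routing rules: (task type, input modality) -> model name.
ROUTING_RULES = {
    ("text_understanding", "text"): "moe-text-base",
    ("multimodal_understanding", "image+text"): "moe-vision-language",
}


def resolve_target_model(task, candidates):
    """Return the target model for a task, preferring an explicit identifier."""
    if task.model_id is not None:
        if task.model_id not in candidates:
            raise LookupError(f"unknown model identifier: {task.model_id}")
        return task.model_id
    model = ROUTING_RULES.get((task.task_type, task.modality))
    if model is None or model not in candidates:
        raise LookupError("no candidate model matches the task")
    return model


candidates = {"moe-text-base", "moe-vision-language"}
by_rule = resolve_target_model(Task("text_understanding", "text"), candidates)
by_id = resolve_target_model(
    Task("text_understanding", "text", model_id="moe-vision-language"), candidates
)
```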

[0048] In a specific example, if the target task is a text understanding task, the task data may include the text sequence to be analyzed, context fragments, or prompt sequences; if the target task is a multimodal understanding task, the task data may also include image features, video frame features, or audio features. Since different input data typically differ in semantic content, input length, statistical distribution, and modal composition, the same sparse activation model will usually exhibit different activation patterns when processing different target tasks. This difference is reflected not only in the model output but also in the computational load, data exchange volume, and device-side storage usage of the subsequent inference process. Therefore, in one embodiment of the present invention, step S110 not only determines the input object and target model but also provides the entry point for subsequently extracting runtime information within a preset period around the target task, determining activation distribution characteristics, and analyzing heterogeneous device adaptation relationships.

[0049] In one embodiment of the present invention, step S120 of Figure 1 may include: obtaining runtime information within a preset period corresponding to the target task. The runtime information within the preset period includes at least activation-related information representing the activation state of the target model, and resource-related information representing the difference in execution cost of the target task on heterogeneous candidate devices. Figure 2 illustrates a process for acquiring runtime information within a preset period. In Figure 2, the runtime information acquisition unit 201 can collect activation-related information from the model execution process; the resource status acquisition unit 202 can collect information related to task inference time, communication overhead, and memory usage; the runtime information organization unit 203 can organize information from different sources according to the target task, target model, candidate device, and acquisition time; and the runtime information output unit 204 can output the runtime information within the preset period corresponding to the target task.

[0050] The runtime information within the preset period is not limited to an instantaneous sampling result at a single moment; it is a set of information reflecting the trend of activation changes and resource demand changes of the target task within a recent time range. For sparse activation models, the activation state of a single inference often changes with the input features. If device selection is based solely on the state at a single moment, the result is easily disturbed by instantaneous fluctuations; if it is based solely on an average over an excessively long period, activation changes that occur over a short period are easily masked. Therefore, in one embodiment of the present invention, constructing runtime information within a preset period corresponding to the target task balances state timeliness against statistical stability, so that the subsequent device adaptation judgment reflects recent changes while avoiding oversensitivity to individual instantaneous fluctuations.
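The trade-off between a single-point state and an overly long average can be illustrated with a small sliding-window sketch. The class name `PeriodWindow`, the 10-second period, and the latency samples are assumptions for illustration; timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class PeriodWindow:
    """Keeps only the samples recorded within the most recent `period` seconds."""

    def __init__(self, period):
        self.period = period
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Discard samples that have fallen out of the preset period.
        while self.samples and self.samples[0][0] <= timestamp - self.period:
            self.samples.popleft()

    def mean(self):
        if not self.samples:
            return 0.0
        return sum(value for _, value in self.samples) / len(self.samples)


window = PeriodWindow(period=10.0)
history = [(0.0, 100.0), (4.0, 20.0), (8.0, 30.0), (12.0, 40.0)]
for timestamp, latency_ms in history:
    window.add(timestamp, latency_ms)

windowed_mean = window.mean()                              # recent trend
latest_value = history[-1][1]                              # single-point state
all_time_mean = sum(v for _, v in history) / len(history)  # long-period average
```

In this made-up trace the early outlier of 100 ms no longer influences the windowed mean, while the all-time average is still dominated by it.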

[0051] In one embodiment of the present invention, activation-related information may include at least one of the activation count and activation weight of the activated computational units in the model. Taking an MoE model as an example, an activated computational unit may correspond to an expert submodule. The activation count may represent the number of times an expert submodule is selected by the gating network to participate in inference computation within a preset period. The activation weight may represent the routing strength, gating output value, or contribution weight of the expert submodule when it is activated. The activation count is better suited to characterizing how often a computational unit is invoked within a preset period, while the activation weight is better suited to characterizing how strongly the computational unit participates in the execution path of the current target task. If an expert submodule is frequently activated in multiple consecutive time slices, it usually has a high invocation frequency; if an expert submodule is activated only a few times but with a high activation weight each time, it may still have a strong influence on the inference path of the current task.
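A minimal sketch of how activation counts and activation weights might be accumulated from routing decisions is given below. The event format `(layer, expert, gate_weight)` and the sample values are assumptions for illustration; a real gating network would emit these per token or per request.

```python
from collections import defaultdict


def accumulate_activations(routing_events):
    """Accumulate per-unit activation counts and cumulative gate weights.

    routing_events: iterable of (layer, expert, gate_weight) tuples, one per
    time an expert is selected by the gating network (hypothetical format).
    """
    counts = defaultdict(int)
    weights = defaultdict(float)
    for layer, expert, gate_weight in routing_events:
        counts[(layer, expert)] += 1
        weights[(layer, expert)] += gate_weight
    return dict(counts), dict(weights)


# Expert "e1" is selected often with a low weight; "e2" rarely with a high weight.
events = [
    (0, "e1", 0.25), (0, "e1", 0.25), (0, "e1", 0.25), (0, "e1", 0.25),
    (0, "e2", 0.75),
]
counts, weights = accumulate_activations(events)
mean_weight = {unit: weights[unit] / counts[unit] for unit in counts}
```

Here `e1` has the higher count while `e2` has the higher per-activation weight, which is the situation the paragraph above distinguishes.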

[0052] In one embodiment of the present invention, resource-related information may include at least one of task inference time, task communication metrics, and task memory metrics. Task inference time may represent the execution time of the target model, or of some of its model layers or sub-modules, on the corresponding candidate device. Task communication metrics may represent the communication overhead associated with parameter transfer, intermediate activation exchange, result transmission, or data return during the execution of the target task. Task memory metrics may represent the extent to which the target task occupies system memory, video memory, on-chip storage resources, or device-side storage resources during execution. This resource-related information can characterize the difference in execution cost of the target task on heterogeneous candidate devices because, although the execution path of the same target task may be logically identical on different candidate devices such as CPU, GPU, and NPU, the corresponding inference time, data transfer scale, and storage occupancy during actual execution usually differ.

[0053] In one embodiment, task inference time can be obtained by recording the start and end times of the task on the candidate device, or by recording the start and end times of a model layer, a computing unit, or a subtask on the candidate device. Task communication metrics can be obtained by recording the amount of model weights loaded before and after task execution, the amount of intermediate results transmitted, the size of the returned results, the transmission duration, or the link occupancy level. Task memory metrics can be obtained by recording the peak memory usage, average memory usage, video memory usage, or cache usage during task execution. The internal activation objects differ between types of sparse activation models; nevertheless, any of the aforementioned information that reflects the difference in resource cost when the target task is executed on different candidate devices can be used as resource-related information in one embodiment of the present invention.

[0054] In one embodiment of the present invention, Figure 3 illustrates an organizational structure for runtime information within a preset period. When runtime information is recorded as shown in Figure 3, the task identifier 301 can be used to distinguish different target tasks; the model identifier 302 can be used to distinguish different sparse activation models; the candidate device identifier 303 can be used to distinguish different types of devices such as CPU, GPU, and NPU; the time stamp 304 can be used to distinguish different sampling times within a preset period; the activation-related field 305 can be used to record information such as activation count and activation weight; and the resource-related field 306 can be used to record information such as task inference time, task communication metrics, and task memory metrics. With this organization, the system can extract the activation-related and resource-related information corresponding to the target task from the historical runtime information accumulated within the preset period and the currently collected runtime information, and use this information as the input to the subsequent heterogeneous inference scheduling process.
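One possible in-memory layout for the record structure of Figure 3 is sketched below. The field names mirror reference numerals 301 to 306, but the class name, the dictionary keys, and the selection helper are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RuntimeRecord:
    task_id: str        # task identifier 301
    model_id: str       # model identifier 302
    device_id: str      # candidate device identifier 303
    timestamp: float    # time stamp 304
    activation: Dict[str, float] = field(default_factory=dict)  # activation-related field 305
    resource: Dict[str, float] = field(default_factory=dict)    # resource-related field 306


def select_records(records, task_id, now, period):
    """Return the records of one task whose timestamp lies in (now - period, now]."""
    return [
        r for r in records
        if r.task_id == task_id and now - period < r.timestamp <= now
    ]


records = [
    RuntimeRecord("t1", "moe-a", "GPU", 95.0,
                  {"activation_count": 4.0}, {"inference_time_ms": 12.0}),
    RuntimeRecord("t1", "moe-a", "CPU", 80.0,
                  {"activation_count": 3.0}, {"inference_time_ms": 55.0}),
    RuntimeRecord("t2", "moe-a", "GPU", 99.0),
]
recent = select_records(records, "t1", now=100.0, period=10.0)
```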

[0055] In one embodiment of the present invention, the runtime information within a preset period can be organized at the overall model level, or at the model layer level, sub-model level, or computation unit level. Using overall model-level information is more suitable for overall task-level device selection; using model layer-level or computation unit-level information is more suitable for fine-grained device adaptation analysis. In a different embodiment, the system can simultaneously maintain overall model-level runtime information and sub-module-level runtime information, thereby switching between coarse-grained and fine-grained scheduling. For example, when the target task is small or has high requirements for response latency, overall model-level runtime information can be used to quickly determine the target device; when the target task is large or the activation differences between different layers or computation units are significant, sub-module-level runtime information can be further combined to perform finer-grained adaptation analysis. In this way, a balance can be achieved between scheduling overhead and scheduling accuracy.

[0056] In one embodiment of the present invention, the processing procedure consisting of steps S110 and S120 enables the system to first determine the sparse activation model corresponding to the current task to be processed, and then extract the running information within a preset period around the target task. Since the extracted running information already includes the activation state information and resource cost information of the target model on heterogeneous candidate devices, it is no longer necessary to rely solely on static model configuration, fixed device rules, or instantaneous single-point states to infer device adaptation relationships. Instead, the activation distribution characteristics of the target task in the target model can be further determined based on the actual running information. This makes the heterogeneous inference scheduling process closer to the actual execution state of the sparse activation model and lays the foundation for subsequent analysis of the adaptation degree of different candidate devices to the current target task from the three dimensions of computing, communication, and memory.

[0057] In one embodiment of the present invention, step S130 of Figure 1 may include: determining the activation distribution characteristics of the target task in the target model based on activation-related information; and forming computational requirements, data exchange requirements, and memory carrying requirements based on the activation distribution characteristics. Figure 4 is a schematic diagram of the process for determining activation distribution characteristics. In Figure 4, the activation information parsing unit 401 can receive the activation-related information from the runtime information output unit 204 shown in Figure 2; the hierarchical merging unit 402 can merge this information by model layer or by execution unit level; the distribution construction unit 403 can form the activation distribution features corresponding to the target task; and the distribution output unit 404 can output the distribution results for subsequent adaptation analysis.

[0058] The activation distribution features mentioned here can be understood as a structured representation of the actual activation state of the target task in a sparse activation model. This structured representation is not limited to a single numerical value but can include multiple interrelated distribution information. In one embodiment of the present invention, the activation distribution features may include at least one of the following: a set of activated computational units in at least one model layer, the activation intensity distribution of each activated computational unit, and the inter-layer distribution of each activated computational unit across multiple model layers. The set of activated computational units can characterize the range of units actually participating in computation in one or more model layers for the current target task; the activation intensity distribution can characterize the differences in the degree of participation of different activated computational units in the current target task; and the inter-layer distribution can characterize the distribution pattern and changing relationships of activated computational units across multiple model layers.

[0059] For MoE models, activated computational units can correspond to expert submodules; for gated sparse networks, activated computational units can correspond to gated functional branches; for conditional computation networks or dynamic routing networks, activated computational units can correspond to path nodes, routing branches, or subnetwork modules triggered by the current input. Therefore, in one embodiment of this invention, although expert submodules in MoE are still used as the subject of description, the activation distribution features themselves are not limited to expert selection scenarios and cover any sparse activation model that dynamically selects some computational units to execute based on input. Accordingly, the computational requirements, data exchange requirements, and memory carrying requirements subsequently mapped from the activation distribution features also apply to a more general sparse activation inference process.

[0060] In a specific example, if the target model is an MoE model comprising multiple model layers, each containing several expert sub-modules, then within a preset period the system can first group activation-related information by model layer, and then identify the expert sub-modules within each model layer whose activation frequency exceeds a preset frequency, thereby forming the set of activated computational units in that model layer. Further, the system can form the activation intensity distribution within that model layer based on the cumulative or average activation weight of the expert sub-modules within the preset period. If the activation relationships between multiple model layers also need to be described, the overlap of the activated computational unit sets in different model layers, the frequency of inter-layer switching, or the concentration trend of cross-layer activation can be compared to form inter-layer distribution information. Through this processing, the system no longer merely retains discrete activation counts and activation weights, but obtains activation distribution characteristics that correspond more closely to the actual execution state of the target task.
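The grouping and thresholding steps in this example can be sketched as follows. The threshold value, the sample counts, and the use of a Jaccard ratio over unit indices as the inter-layer overlap measure are all assumptions for illustration; the specification only states that overlap between layers may be compared.

```python
def activated_set(counts, min_count):
    """Return the units whose activation count exceeds the preset threshold."""
    return {unit for unit, count in counts.items() if count > min_count}


def jaccard_overlap(set_a, set_b):
    """Overlap of two activated sets as |A & B| / |A | B| (assumed measure)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical activation counts per model layer: layer -> {unit index: count}.
layer_counts = {
    0: {0: 12, 1: 1, 2: 7, 3: 0},
    1: {0: 9, 1: 0, 2: 2, 3: 8},
}

layer_sets = {
    layer: activated_set(counts, min_count=5)
    for layer, counts in layer_counts.items()
}
overlap = jaccard_overlap(layer_sets[0], layer_sets[1])
```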

[0061] In one embodiment of the present invention, the activation intensity distribution in a given model layer can be formed in the following manner. Let $w_{l,i}$ denote the cumulative activation weight generated by the target task on computing unit $i$ in model layer $l$ within the preset period. The sum of the cumulative activation weights of all activated computing units in this model layer is $\sum_{j \in E_l} w_{l,j}$, where $E_l$ denotes the set of activated computing units in model layer $l$. The normalized activation intensity of computing unit $i$ in model layer $l$ can then be expressed as:

$$p_{l,i} = \frac{w_{l,i}}{\sum_{j \in E_l} w_{l,j}} \tag{1}$$

[0062] where $p_{l,i}$ denotes the activation intensity distribution value of computing unit $i$ in model layer $l$, $w_{l,i}$ denotes the cumulative activation weight of computing unit $i$ within the preset period, and $E_l$ denotes the set of activated computing units in model layer $l$. Using formula (1), the activation weights of different computing units can be mapped to a relative intensity distribution within the same model layer. This relative intensity distribution reflects which computing units are more likely to be in a hotspot activation state for the current target task, and also whether the activation intensity is dispersed among multiple computing units.
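The normalization within one model layer can be illustrated numerically. In this sketch the unit names and cumulative weights are made up, and the handling of an all-zero layer (returning zeros) is an assumption not addressed in the text.

```python
def normalized_intensity(cumulative_weights):
    """Divide each unit's cumulative activation weight by the layer total."""
    total = sum(cumulative_weights.values())
    if total == 0:
        # Assumed convention: no activation in this layer yields all zeros.
        return {unit: 0.0 for unit in cumulative_weights}
    return {unit: weight / total for unit, weight in cumulative_weights.items()}


# Hypothetical cumulative activation weights of the activated units in one layer.
layer_weights = {"e0": 3.0, "e1": 1.0, "e2": 4.0}
intensity = normalized_intensity(layer_weights)
```

With these values, unit `e2` carries half of the layer's activation intensity, which marks it as the hotspot unit for this layer.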

[0063] In one embodiment, if the system needs to consider activation frequency and activation intensity at the same time, it can construct an overall activation level from the activation count $c_{l,i}$ and the normalized activation intensity $p_{l,i}$. Let the maximum activation count among the computing units in model layer $l$ be $c_l^{\max}$. The overall activation level $a_{l,i}$ of computing unit $i$ in model layer $l$ can then be expressed as:

$$a_{l,i} = \alpha \cdot \frac{c_{l,i}}{c_l^{\max}} + (1 - \alpha) \cdot p_{l,i} \tag{2}$$

[0064] where $a_{l,i}$ denotes the overall activation level of computing unit $i$ in model layer $l$, $c_{l,i}$ denotes the activation count of computing unit $i$ within the preset period, and $\alpha$ denotes the frequency term weight, which can range from 0 to 1. Formula (2) unifies activation frequency and activation intensity into a single metric, resulting in a more stable activation distribution characteristic. In certain target tasks, if a computing unit is frequently activated but has a low weight per activation, the frequency term reflects its long-term activity; if a computing unit is activated infrequently but has a high weight per activation, the intensity term reflects its importance to the current task execution path. Combining the two reduces the bias caused by relying solely on activation frequency or solely on activation weight.
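A small numerical sketch of combining the frequency term and the intensity term is given below. It assumes the combination is a convex mix of the count normalized by the layer maximum and the normalized activation intensity, with a frequency term weight `alpha` between 0 and 1; the sample counts and intensities are invented.

```python
def overall_activation(counts, intensity, alpha=0.5):
    """Mix normalized activation frequency and normalized activation intensity.

    Assumed form: alpha * count / max_count + (1 - alpha) * intensity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    max_count = max(counts.values(), default=0)
    levels = {}
    for unit, count in counts.items():
        frequency = count / max_count if max_count else 0.0
        levels[unit] = alpha * frequency + (1.0 - alpha) * intensity.get(unit, 0.0)
    return levels


# "e0" is activated often with low intensity; "e1" rarely with high intensity.
counts = {"e0": 8, "e1": 2}
intensity = {"e0": 0.25, "e1": 0.75}
levels = overall_activation(counts, intensity, alpha=0.5)
```

With equal weighting, the rarely used but heavily weighted unit `e1` ends up close to the frequently used unit `e0`, rather than being suppressed by its low count.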

[0065] In one embodiment of the present invention, the activation distribution feature is a key input to the subsequent device scheduling process because it can be further mapped to the computational requirements, data exchange requirements, and memory carrying requirements of the target task in the current preset period. Figure 5 illustrates the principle of generating task requirement features from activation distribution features. In Figure 5, the computational requirement generation unit 501 can generate computational requirements based on the set of activated computational units and the activation intensity distribution; the data exchange requirement generation unit 502 can generate data exchange requirements based on the inter-layer distribution and cross-device data transmission information; the memory carrying requirement generation unit 503 can generate memory carrying requirements based on the set of activated computational units and the activation intensity distribution; and the requirement aggregation unit 504 can output the above requirements to the subsequent adaptation analysis process.

[0066] In one embodiment of the present invention, the computational requirement mainly corresponds to the actual computational scale that the target task needs to perform in the current activation state. Since sparse activation models typically involve only a subset of computational units in a single inference iteration, the computational requirement of the current target task is not simply equivalent to the total parameter size of the target model, but rather closer to the effective computational scale jointly determined by the set of activated computational units and their activation intensity. The larger the number of activated computational units, or the more concentrated the high-intensity units are in computationally expensive units, the higher the matrix operation requirement of the current target task. Conversely, if the set of activated computational units is small, and the activation intensity is concentrated in computationally inexpensive units, the computational requirement of the current target task can be relatively low.

[0067] In one embodiment, the computational requirement of the target task can be estimated from the overall activation level and the static computational scale of the computing units in each model layer. Let $F_{l,i}$ denote the static computational scale of computing unit $i$ in model layer $l$. The computational requirement $D_{\mathrm{comp}}$ of the target task within the preset period can then be expressed as:

$$D_{\mathrm{comp}} = \sum_{l} \sum_{i \in E_l} a_{l,i} \cdot F_{l,i} \tag{3}$$

[0068] where $D_{\mathrm{comp}}$ denotes the computational requirement of the target task, $a_{l,i}$ denotes the overall activation level of computing unit $i$ in model layer $l$, $F_{l,i}$ denotes the static computational scale of computing unit $i$, and $E_l$ denotes the set of activated computing units in model layer $l$. Through formula (3), the activation distribution characteristics can be converted into the effective computational demand of the target task. This computational demand is not a static accumulation over all computing units, but a weighted aggregation over the currently activated parts, and therefore better reflects the real computational pressure of the target task in the current time period.
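The contrast between the effective computational demand and a static accumulation over all units can be sketched numerically. The nested dictionary layout, the unit names, and the GFLOP figures are assumptions for illustration.

```python
def computational_demand(activation, static_scale):
    """Sum, over activated units only, overall activation level times static scale."""
    return sum(
        level * static_scale[layer][unit]
        for layer, units in activation.items()
        for unit, level in units.items()
    )


# Overall activation levels of the activated units: layer -> {unit: level}.
activation = {0: {"e0": 0.5, "e1": 0.25}, 1: {"e2": 1.0}}
# Static computational scale of every unit, activated or not (hypothetical GFLOPs).
static_scale = {
    0: {"e0": 4.0, "e1": 8.0, "e3": 8.0},
    1: {"e2": 2.0, "e4": 2.0},
}

effective_demand = computational_demand(activation, static_scale)
static_total = sum(sum(units.values()) for units in static_scale.values())
```

In this example the effective demand is a quarter of the static total, because units `e3` and `e4` are not activated and the activated units are weighted by their activation level.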

[0069] In one embodiment of the present invention, the data exchange requirement mainly corresponds to the scale and frequency of data transmission required by the target task during execution. This requirement is typically related to the following factors: first, whether the currently activated set of computing units is distributed across multiple model layers or multiple device-related execution units; second, whether intermediate activation results need to be transferred across layers or devices; and third, whether parameter, cache, and result data already reside in an efficiently accessible storage area on the target device side. If the activation distribution exhibits strong inter-layer dispersion, or if the same task involves numerous cross-layer result interactions during execution, the data exchange requirement usually increases. If the relevant model parameters, intermediate results, or cached data do not reside in a directly accessible storage area of the candidate device, data transfer requirements may also increase before and after execution.

[0070] In one embodiment, the data exchange requirement of the target task can be constructed from the inter-layer distribution and historical communication metrics. Let $X_{l,l+1}$ denote the inter-layer exchange volume between model layer $l$ and model layer $l+1$ caused by the switching of activated units, and let $L$ denote the number of model layers. The data exchange requirement $D_{\mathrm{comm}}$ of the target task within the preset period can then be expressed as:

$$D_{\mathrm{comm}} = \sum_{l=1}^{L-1} X_{l,l+1} \tag{4}$$

[0071] where $D_{\mathrm{comm}}$ denotes the data exchange requirement of the target task, and $X_{l,l+1}$ denotes the inter-layer exchange volume between adjacent model layers corresponding to the current activation distribution. The inter-layer exchange volume can be determined from the degree of change in the set of activated computing units, the scale of intermediate activation results, or historical communication metrics. Through formula (4), the system can aggregate the originally discrete inter-layer interactions into a single communication requirement for device adaptation analysis.
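One way to obtain an inter-layer exchange volume from the degree of change in the activated sets, and then sum over adjacent layers, is sketched below. The specific rule (fraction of changed units multiplied by the size of the intermediate activation) is an assumption chosen for illustration, as are the sample sets and byte counts.

```python
def exchange_volume(set_a, set_b, activation_bytes):
    """Assumed rule: fraction of units that differ, times the intermediate result size."""
    union = set_a | set_b
    if not union:
        return 0.0
    change = len(set_a ^ set_b) / len(union)  # symmetric difference ratio
    return change * activation_bytes


def data_exchange_demand(layer_sets, activation_bytes):
    """Sum the exchange volume over every pair of adjacent model layers.

    layer_sets[l] is the activated set of layer l; activation_bytes[l] is the
    size of the intermediate result passed from layer l to layer l + 1.
    """
    return sum(
        exchange_volume(layer_sets[l], layer_sets[l + 1], activation_bytes[l])
        for l in range(len(layer_sets) - 1)
    )


layer_sets = [{0, 1}, {0, 1}, {2, 3}]
activation_bytes = [1024.0, 2048.0]
demand = data_exchange_demand(layer_sets, activation_bytes)
```

The first two layers activate the same units and contribute nothing, while the switch to a disjoint set at the third layer contributes the full intermediate result size.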

[0072] In one embodiment of the present invention, the memory carrying capacity requirement mainly corresponds to the scale of device-side storage resources required by the target task during execution, including but not limited to system running memory, video memory, on-chip cache, and device-side supporting storage resources. The larger the number of activated computing units, the more parameters, intermediate activation values, and cached data typically need to reside simultaneously; the more concentrated the activation intensity distribution is in computing units with larger parameter scales, the higher the demand for device-side storage resources is usually. Therefore, the memory carrying capacity requirement can also be derived from the activation distribution characteristics, rather than being directly determined by the total scale of the static model.

[0073] In one embodiment, let $M_{l,i}$ denote the static storage requirement of computing unit $i$ in model layer $l$. The memory carrying requirement $D_{\mathrm{mem}}$ of the target task within the preset period can then be expressed as:

$$D_{\mathrm{mem}} = \sum_{l} \sum_{i \in E_l} a_{l,i} \cdot M_{l,i} \tag{5}$$

[0074] where $D_{\mathrm{mem}}$ denotes the memory carrying requirement of the target task, $a_{l,i}$ denotes the overall activation level of computing unit $i$ in model layer $l$, $M_{l,i}$ denotes the static storage requirement of computing unit $i$ in model layer $l$, and $E_l$ denotes the set of activated computing units in model layer $l$. Through formula (5), the system can obtain the effective storage requirement of the target task from the currently activated set of computing units and their overall activation levels. This storage requirement more accurately reflects the current task's usage trend of system memory, video memory, or other device-side storage resources, and thus provides a basis for the subsequent memory adaptation analysis.
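For the memory dimension, the same activation-weighted aggregation can be compared against the size of the whole static model. The figures below (in GiB) and the unit names are invented, and weighting storage by the overall activation level follows the description above rather than any measured behavior.

```python
def memory_demand(activation, static_bytes):
    """Activation-weighted storage requirement over the activated units only."""
    total = 0.0
    for layer, units in activation.items():
        for unit, level in units.items():
            total += level * static_bytes[layer][unit]
    return total


# Overall activation levels of the activated units in a single layer.
activation = {0: {"e0": 1.0, "e1": 0.5}}
# Static storage requirement of every unit in that layer (hypothetical GiB).
static_bytes = {0: {"e0": 2.0, "e1": 4.0, "e2": 4.0, "e3": 6.0}}

effective_memory = memory_demand(activation, static_bytes)
static_model_size = sum(static_bytes[0].values())
```

A device with 8 GiB of free memory would be ruled out by the static model size in this example, yet could still carry the effective requirement.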

[0075] In one embodiment of the present invention, the activation distribution features formed in step S130 can elevate the original activation counts and activation weights into a more interpretable intermediate representation of the actual execution state of the target task. Furthermore, the computational requirements, data exchange requirements, and memory carrying requirements mapped from the activation distribution features enable subsequent adaptation analysis to no longer directly deal with discrete and fragmented activation information, but rather to determine the capacity of different heterogeneous candidate devices to handle the current target task from the perspective of task requirements. In this way, the dynamic execution path in the sparse activation model can be naturally propagated to the heterogeneous inference scheduling stage, thereby avoiding the bias caused by selecting devices solely based on static specifications or instantaneous utilization.

[0076] In one embodiment of the present invention, step S140 in Figure 1 may include determining the computational adaptation result, the communication adaptation result, and the memory adaptation result respectively by combining resource-related information. This may include: determining the computational adaptation result of each heterogeneous candidate device for the target task based on the computational requirements represented by the activation distribution characteristics and the information in the resource-related information relating to task inference time and device computing capabilities; determining the communication adaptation result of each heterogeneous candidate device for the target task based on the data exchange requirements represented by the activation distribution characteristics and the information in the resource-related information relating to task communication overhead and device communication capabilities; and determining the memory adaptation result of each heterogeneous candidate device for the target task based on the memory carrying requirements represented by the activation distribution characteristics and the information in the resource-related information relating to task memory usage and device storage resources. Figure 6 shows a schematic diagram of a multidimensional adaptation analysis process. In Figure 6, the computational adaptation analysis unit 601 can receive the computational requirements output in Figure 5 and information related to the computing capabilities of candidate devices, the communication adaptation analysis unit 602 can receive data exchange requirements and information related to the communication capabilities of candidate devices, the memory adaptation analysis unit 603 can receive memory carrying requirements and information related to the storage resources of candidate devices, and the adaptation result aggregation unit 604 can output multi-dimensional adaptation results corresponding to different candidate devices. The multi-dimensional adaptation results include computational adaptation results, communication adaptation results, and memory adaptation results.

[0077] In one embodiment of the present invention, the computational adaptation result is used to characterize the suitability of a candidate device for undertaking the computational requirements of the current target task. For the same target task, even if its computational requirements have been mapped by the activation distribution characteristics, different candidate devices may exhibit different computational capabilities due to differences in floating-point operation capabilities, parallel execution capabilities, operator optimization capabilities, or historical task inference time. Taking CPUs, GPUs, and NPUs as examples, CPUs are generally suitable for undertaking general control logic and some low-parallel computation tasks, GPUs are generally more suitable for undertaking large-scale tensor parallel operations, and NPUs are generally more suitable for undertaking specially optimized neural network operator inference tasks. Therefore, in one embodiment of the present invention, the computational adaptation result is not simply equivalent to the theoretical peak computing power of the device, but is determined jointly by the current computational requirements of the target task and the actual computational capabilities of the candidate device.

[0078] In one embodiment, the computational adaptation result can be constructed based on the computational requirements of the target task, the task inference time, and the peak performance metric of each candidate device. Given the computational requirement corresponding to the target task, the average inference time of similar tasks on a candidate device within the preset period, and the peak performance metric of that candidate device, the computational adaptation value of the candidate device for the target task can be represented as:

[0079] where the symbols denote, respectively, the computational adaptation value of the candidate device, the computational requirement of the target task, the average inference time of the corresponding task on the candidate device within the preset period, the peak performance metric of the candidate device, two weighting coefficients, and a smoothing term that prevents the denominator from being zero. On the one hand, formula (6) uses the relationship between the peak performance metric and the computational requirement to characterize the theoretical capacity of the candidate device to meet the computational requirements of the target task. On the other hand, it uses the historical average inference time to characterize the actual performance of the candidate device under recent operating conditions. The resulting computational adaptation value therefore reflects both device specifications and actual operating conditions.
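As a non-limiting illustration, the following minimal Python sketch combines the quantities named for formula (6): the computational requirement, the average inference time, the peak performance metric, two weighting coefficients, and a smoothing term. The function name `compute_fit`, the parameter names, the default values, and the specific additive form (a capacity ratio plus an inverse-latency term) are assumptions introduced here for illustration and are not taken from formula (6) itself.

```python
def compute_fit(demand, avg_infer_time, peak_perf,
                alpha=0.5, beta=0.5, eps=1e-6):
    """Illustrative computational adaptation value for one candidate device.

    demand:         computational requirement of the target task
    avg_infer_time: average inference time of similar tasks on the device
    peak_perf:      peak performance metric of the device
    alpha, beta:    weighting coefficients (assumed values)
    eps:            smoothing term that keeps both denominators non-zero
    """
    # Theoretical capacity: how much peak performance is available per unit of demand.
    capacity_term = peak_perf / (demand + eps)
    # Recent behaviour: shorter historical inference time gives a larger value.
    history_term = 1.0 / (avg_infer_time + eps)
    return alpha * capacity_term + beta * history_term


# Hypothetical candidates: a fast accelerator and a slower general-purpose device.
fit_fast = compute_fit(demand=10.0, avg_infer_time=0.1, peak_perf=100.0)
fit_slow = compute_fit(demand=10.0, avg_infer_time=0.5, peak_perf=20.0)
```

Under these assumptions a larger value indicates a device that is better suited to the computational requirement, and a device with no recorded inference time (a value of zero) does not cause a division error.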

[0080] In one embodiment of the present invention, a larger computational adaptation value for a candidate device indicates that it is more suitable for undertaking the computational requirements of the current target task; conversely, a smaller computational adaptation value indicates that the candidate device is relatively weak in terms of computational capacity to undertake the current target task. In some embodiments, the computational adaptation value obtained by formula (6) can be further normalized to ensure that the computational adaptation results between different candidate devices are on a uniform scale. The normalized computational adaptation results can be more easily fused with subsequent communication adaptation results and memory adaptation results. The normalization method here is not limited to a single method; as long as the relative size relationship between different candidate devices can be maintained, it can be used for subsequent adaptation result fusion.
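As a non-limiting illustration of one normalization that preserves the relative ordering of candidate devices, the following Python sketch applies min-max scaling. The function name `min_max_normalize` and the device labels are hypothetical, and min-max scaling is only one of the methods the paragraph above would permit.

```python
def min_max_normalize(values):
    """Scale adaptation values to [0, 1] while keeping their relative order.

    values: {device_name: raw adaptation value}
    """
    lo = min(values.values())
    hi = max(values.values())
    if hi == lo:
        # All devices are equally suitable; give them the same score.
        return {name: 1.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


# Hypothetical raw computational adaptation values.
raw = {"cpu": 2.0, "gpu": 10.0, "npu": 6.0}
normalized = min_max_normalize(raw)
```

Because the three dimensions may use very different units, scaling each of them in this way before fusion keeps any single dimension from dominating purely because of its magnitude.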

[0081] In one embodiment of the present invention, the communication adaptation result is used to characterize the suitability of a candidate device for undertaking the data exchange requirements of the current target task. Since the actual execution path of the sparse activation model affects the exchange of intermediate activation results, parameter transport, and the scale of result transmission, the communication adaptation result is not solely determined by the device bandwidth index, but is also related to the data exchange requirements of the current target task and recent communication overhead. If the target task currently corresponds to a large data exchange requirement, it is more necessary to select a candidate device with higher data transmission bandwidth, lower communication latency, or lower cross-device transmission cost.

[0082] In one embodiment, the communication adaptation result can be constructed based on the data exchange requirements of the target task, task communication metrics, and the device communication metrics of each candidate device. Given the data exchange requirement corresponding to the target task, the average task communication overhead on a candidate device within the preset period, and the device communication metric of that candidate device, the communication adaptation value of the candidate device for the target task can be represented as:

[0083] where the symbols denote, respectively, the communication adaptation value of the candidate device, the data exchange requirement of the target task, the average communication overhead of the corresponding task on the candidate device within the preset period, the device communication metric of the candidate device, and two weighting coefficients. The device communication metric may include bandwidth capacity, link transmission capacity, or device-side data exchange efficiency. The average communication overhead may include historical data transfer time, cross-device transmission time, or intermediate result exchange time. Formula (7) establishes a quantitative relationship between the data exchange requirements of the target task and the device communication capabilities while also incorporating recent communication performance, so that the resulting communication adaptation result is closer to the communication capacity of the candidate device in the actual inference scenario.

[0084] In one embodiment of the present invention, the memory adaptation result is used to characterize the suitability of a candidate device for handling the memory requirements of the current target task. Since different candidate devices differ in system runtime memory, video memory, on-chip storage resources, and currently available storage space, the memory requirements corresponding to the same target task may exhibit completely different execution feasibility and efficiency on different candidate devices. If the device's storage resources are insufficient to support the parameters, intermediate activation values, and cache residency corresponding to the current target task, even if the device has high computing power, execution may be limited or data swapping may occur frequently due to storage resource constraints, thereby increasing the overall inference time.

[0085] In one embodiment, the memory adaptation result can be constructed based on the memory carrying requirements of the target task, task memory metrics, and the device storage resources of each candidate device. Given the memory carrying requirement corresponding to the target task, the average memory usage of the corresponding task on a candidate device within the preset period, and the currently available storage resources of that candidate device, the memory adaptation value of the candidate device for the target task can be represented as:

[0086] where the symbols denote, respectively, the memory adaptation value of the candidate device, the memory carrying requirement of the target task, the average memory usage of the corresponding task on the candidate device within the preset period, the currently available storage resources of the candidate device, and two weighting coefficients. The device storage resources may include the system memory corresponding to the CPU, the video memory corresponding to the GPU, the on-chip storage resources corresponding to the NPU, or the storage resources associated with the NPU. Formula (8) relates the memory carrying requirements of the target task to the currently available storage resources and historical memory usage of the candidate device, thereby giving the suitability of the candidate device for undertaking the current target task in terms of memory.

[0087] In one embodiment of the present invention, the computational adaptation results, communication adaptation results, and memory adaptation results can constitute a three-dimensional adaptation description of the candidate device for the target task. This three-dimensional adaptation description differs significantly from traditional device selection methods that rely solely on a single computing power metric, a single utilization rate metric, or static rules. Traditional methods focus more on whether the device itself is idle, while the multi-dimensional adaptation results in one embodiment of the present invention focus more on whether the current target task is suitable for execution by the candidate device in the current active state. In other words, one embodiment of the present invention does not evaluate devices in isolation, but rather establishes an adaptation relationship between the target task and the candidate device that is coupled with the activation distribution characteristics.

[0088] In one embodiment of the present invention, step S150 in Figure 1 may include determining the target device based on the multidimensional adaptation results, that is, determining the target device for performing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device. Figure 7 shows a schematic diagram of a target device determination process. In Figure 7, the adaptation fusion unit 701 can receive the computational adaptation results, communication adaptation results, and memory adaptation results output in Figure 6; the comprehensive adaptation generation unit 702 can generate a comprehensive adaptation result of each candidate device for the target task; the device selection unit 703 can determine the target device based on the comprehensive adaptation results; and the task distribution unit 704 can distribute the target task to the determined target device for execution.

[0089] In one embodiment, the three types of adaptation results of a candidate device can be fused to generate a comprehensive adaptation result. For example, the comprehensive adaptation result can be represented as:

[0090] where the symbols denote, respectively, the comprehensive adaptation result of the candidate device for the target task, the computational adaptation result, the communication adaptation result, the memory adaptation result, and three fusion weight coefficients. The fusion weight coefficients can be preset values or can be configured according to engineering requirements. For example, in scenarios emphasizing low-latency inference, the weights of the computational adaptation result and the communication adaptation result can be appropriately increased; in scenarios emphasizing the stable operation of large models, the weight of the memory adaptation result can be appropriately increased. Through formula (9), adaptation results from different dimensions can be unified into the same evaluation space, thereby forming a more complete device adaptation conclusion.
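As a non-limiting illustration, the following Python sketch fuses the three adaptation results with fusion weight coefficients and shows how a scenario-specific weight profile can change which device scores higher. The weighted-sum form, the function name `fuse`, the two weight profiles, and all numeric values are assumptions introduced here for illustration.

```python
DIMENSIONS = ("compute", "comm", "memory")


def fuse(fit, weights):
    """Weighted sum of normalized adaptation results for one candidate device."""
    return sum(weights[d] * fit[d] for d in DIMENSIONS)


# Hypothetical weight profiles for the two scenarios described above.
LOW_LATENCY = {"compute": 0.45, "comm": 0.45, "memory": 0.10}
LARGE_MODEL = {"compute": 0.20, "comm": 0.20, "memory": 0.60}

# Hypothetical normalized adaptation results for two candidate devices.
gpu_fit = {"compute": 0.9, "comm": 0.8, "memory": 0.3}
cpu_fit = {"compute": 0.4, "comm": 0.5, "memory": 0.9}

low_latency_scores = {"gpu": fuse(gpu_fit, LOW_LATENCY), "cpu": fuse(cpu_fit, LOW_LATENCY)}
large_model_scores = {"gpu": fuse(gpu_fit, LARGE_MODEL), "cpu": fuse(cpu_fit, LARGE_MODEL)}
```

With these example values the GPU scores higher under the low-latency profile, while the CPU, which has more available storage in this example, scores higher under the large-model profile.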

[0091] In one embodiment of the present invention, a larger comprehensive adaptation result indicates that the candidate device is more suitable for performing the current target task; a smaller comprehensive adaptation result indicates that the candidate device has a lower overall degree of adaptation to the current target task. The device selection unit 703 can select, from multiple candidate devices, the candidate device with the largest comprehensive adaptation result as the target device. In a different embodiment, pre-screening can first be performed based on the adaptation result of a certain dimension, and the comprehensive adaptation results can then be compared among the pre-screened candidate devices. This avoids selecting a candidate device that has a high comprehensive adaptation result but clearly fails to meet the acceptance conditions in a key resource dimension. For example, when the memory adaptation result of a candidate device is lower than a preset threshold, that device can be excluded first, and the device with the highest comprehensive adaptation result can then be selected from the remaining candidate devices.
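As a non-limiting illustration, the following Python sketch selects the target device by the largest comprehensive (overall) adaptation result, with optional pre-screening on the memory dimension. The function name `select_device`, the dictionary layout, the weighted-sum fusion, and the threshold values are assumptions introduced here for illustration.

```python
def select_device(candidates, weights, memory_threshold=None):
    """Pick the candidate with the largest fused adaptation result.

    candidates:       {device: {"compute": x, "comm": y, "memory": z}}
    weights:          {"compute": w1, "comm": w2, "memory": w3}
    memory_threshold: if set, devices whose memory adaptation result is
                      below this value are excluded before comparison
    Returns the selected device name, or None if every device is excluded.
    """
    eligible = {
        name: fit
        for name, fit in candidates.items()
        if memory_threshold is None or fit["memory"] >= memory_threshold
    }
    if not eligible:
        return None

    def fused(name):
        return sum(weights[d] * eligible[name][d] for d in weights)

    return max(eligible, key=fused)


# Hypothetical candidates: the GPU wins on the fused score but has little free memory.
candidates = {
    "gpu": {"compute": 0.9, "comm": 0.9, "memory": 0.1},
    "cpu": {"compute": 0.5, "comm": 0.5, "memory": 0.8},
}
equal_weights = {"compute": 1 / 3, "comm": 1 / 3, "memory": 1 / 3}

without_screening = select_device(candidates, equal_weights)
with_screening = select_device(candidates, equal_weights, memory_threshold=0.3)
```

Returning `None` when no device passes pre-screening leaves the fallback policy (for example, relaxing the threshold or queueing the task) to the caller; that policy is outside the scope of this sketch.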

[0092] In one embodiment of the present invention, step S150 in Figure 1 may further include invoking the target device to perform inference processing and obtain task processing results. That is, once the target device is determined, the task distribution unit 704 can invoke the target device to process the task data of the target task based on the target model and obtain task processing results. The invocation process may include loading the required model parameters, input data, and running configuration onto the target device or confirming that they are already present, or it may include placing the task to be executed into the execution queue corresponding to the target device. After the target device completes inference, it can output prediction results, generation results, classification results, retrieval results, or other task processing results. If the target device still exchanges data with other devices during execution, this interaction can also continue to be collected and written into the running information within the preset period for reference in subsequent task scheduling.

[0093] In one embodiment of the present invention, the multidimensional adaptation analysis process formed by step S140 and the target device determination process implemented in Figure 7 transform heterogeneous inference scheduling from a single-resource-dimension judgment into a multi-dimensional adaptation judgment coupled with activation distribution characteristics. Since the computational adaptation results, communication adaptation results, and memory adaptation results all originate from the activation-related and resource-related information of the current target task within the preset period, the device selection result is no longer merely a reflection of static device specifications, but a comprehensive response to the current task execution path, current resource requirements, and current device availability. This allows changes in resource demand caused by the dynamic execution path of the sparse activation model to be reflected in a more timely manner, thus better balancing inference efficiency, execution stability, and resource utilization in a heterogeneous computing environment.

[0094] In one embodiment of the present invention, the determination of the aforementioned target device can be performed at the overall task level, or further refined to the model layer level, sub-model level, or computing unit level. For some sparse activation models, although the entire target task can be executed on a single candidate device, different model layers or different sub-models may differ significantly in computational characteristics, data exchange characteristics, and memory usage characteristics. If a uniform device allocation is always performed at the overall task level, the capability differences between heterogeneous devices for some local sub-tasks may be difficult to exploit fully. Therefore, in one embodiment of the present invention, steps S140 and S150 in Figure 1 can be expanded at a finer granularity to determine the corresponding sub-device for each of the multiple sub-models in the target model.

[0095] Taking a mixture-of-experts model as an example, the target model can include multiple model layers, and each model layer can include multiple expert sub-modules, shared modules, gating modules, or other computational substructures. For gated sparse networks, conditional computation networks, and dynamic routing networks, different branches, different path segments, or different functional blocks can also be regarded as different sub-models. In this case, the organization of operating information in Figure 3 can be further subdivided according to sub-model identifiers, the activation distribution features in Figure 4 can be further refined into sub-model-level activation distribution features, and the computational requirements, data exchange requirements, and memory carrying requirements in Figure 5 can be further refined into multiple sub-requirements, thereby supporting the analysis of device adaptation relationships at the sub-model level.

[0096] In one embodiment of the present invention, if the target model includes multiple sub-models, the corresponding activation-related information and resource-related information can be extracted for each sub-model, and the computational adaptation result, communication adaptation result, and memory adaptation result of each heterogeneous candidate device for that sub-model can be determined respectively. Let the target model include a number of sub-models, and let a given sub-model have a computational adaptation result, a communication adaptation result, and a memory adaptation result on a given candidate device. The sub-comprehensive adaptation result of that candidate device for that sub-model can then be represented as:

[0097] where the symbols denote, respectively, the sub-comprehensive adaptation result of the candidate device for the sub-model and the fusion weights of that sub-model across the dimensions of computation, communication, and memory. The fusion weights of different sub-models can be the same or different. When a sub-model is more inclined towards large-scale parallel computing, its computation dimension weight can be appropriately increased; when a sub-model relies more on cross-layer data exchange or cross-device intermediate result transmission, its communication dimension weight can be appropriately increased; when a sub-model involves large parameter residency or cache usage, its memory dimension weight can be appropriately increased.

[0098] In one embodiment of the present invention, the system can select, for each sub-model, the candidate device with the highest sub-comprehensive adaptation result as its sub-device. The index of the optimal candidate device corresponding to a given sub-model can then be expressed as:

[0099] where the symbols denote, respectively, the target sub-device index corresponding to the sub-model and the sub-comprehensive adaptation result of a candidate device for that sub-model. Through formula (11), different sub-devices can be selected for each of the multiple sub-models in the target model. In this way, when one sub-model is more suitable for execution on the GPU while another sub-model is more suitable for execution on the CPU or NPU, the system can make full use of the capability differences between different heterogeneous devices.
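As a non-limiting illustration, the following Python sketch applies the per-sub-model selection described for formula (11): for each sub-model, it fuses the three adaptation results with that sub-model's own weights and takes the candidate device with the largest value. The function name `assign_sub_devices`, the sub-model names, the weighted-sum fusion, and all numeric values are assumptions introduced here for illustration.

```python
def assign_sub_devices(sub_fit, sub_weights):
    """Choose one sub-device per sub-model by an argmax over candidate devices.

    sub_fit:     {sub_model: {device: {"compute": x, "comm": y, "memory": z}}}
    sub_weights: {sub_model: {"compute": w1, "comm": w2, "memory": w3}}
    """
    assignment = {}
    for sub_model, per_device in sub_fit.items():
        w = sub_weights[sub_model]

        def fused(device):
            return sum(w[d] * per_device[device][d] for d in w)

        assignment[sub_model] = max(per_device, key=fused)
    return assignment


# Hypothetical example: a compute-heavy expert and a memory-heavy gating module.
sub_fit = {
    "expert_0": {
        "gpu": {"compute": 0.9, "comm": 0.6, "memory": 0.5},
        "cpu": {"compute": 0.3, "comm": 0.7, "memory": 0.9},
    },
    "gate": {
        "gpu": {"compute": 0.9, "comm": 0.6, "memory": 0.5},
        "cpu": {"compute": 0.3, "comm": 0.7, "memory": 0.9},
    },
}
sub_weights = {
    "expert_0": {"compute": 0.6, "comm": 0.2, "memory": 0.2},
    "gate": {"compute": 0.2, "comm": 0.2, "memory": 0.6},
}
assignment = assign_sub_devices(sub_fit, sub_weights)
```

In this example both sub-models see identical device-side adaptation results, and only the per-sub-model fusion weights cause them to be assigned to different devices.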

[0100] In one embodiment, the execution order of multiple sub-models can follow the original topological order of the target model, or it can form an execution graph based on dependencies, with the scheduler scheduling different sub-models in parallel while satisfying the dependencies. If there is no direct data dependency between two sub-models, they can be sent to different sub-devices for parallel execution; if a later sub-model depends on the output of a previous sub-model, its output can be transmitted to the corresponding sub-device of the later sub-model after the previous sub-model has finished executing. This result transmission process can also be included in the aforementioned task communication metrics to reflect the data exchange overhead between devices under fine-grained scheduling.
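As a non-limiting illustration, the following Python sketch builds an execution order from sub-model dependencies and groups sub-models that have no unmet dependencies into waves that could be dispatched in parallel. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The function names `parallel_waves` and `cross_device_edges`, the sub-model names, and the device assignment are assumptions introduced here for illustration.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps):
    """Group sub-models into waves; members of one wave have no mutual dependency.

    deps: {sub_model: set of sub-models whose output it depends on}
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


def cross_device_edges(deps, assignment):
    """List (producer, consumer) pairs whose output must cross devices."""
    return sorted(
        (pred, node)
        for node, preds in deps.items()
        for pred in preds
        if assignment[pred] != assignment[node]
    )


# Hypothetical sub-model graph: a gate feeds two experts, whose outputs are combined.
deps = {
    "gate": set(),
    "expert_a": {"gate"},
    "expert_b": {"gate"},
    "combine": {"expert_a", "expert_b"},
}
assignment = {"gate": "cpu", "expert_a": "gpu", "expert_b": "npu", "combine": "gpu"}

waves = parallel_waves(deps)
transfers = cross_device_edges(deps, assignment)
```

The list of cross-device edges identifies the result transmissions that would be counted in the task communication metrics under fine-grained scheduling.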

[0101] In one embodiment of the present invention, to avoid additional communication overhead caused by excessively dispersed sub-model allocation under fine-grained scheduling, the system can also add constraints when performing sub-device selection. For example, it can be stipulated that adjacent sub-models are preferentially allocated to the same candidate device if the data exchange volume exceeds a preset threshold; or, when a sub-model has a high-frequency data sharing relationship with multiple subsequent sub-models, these sub-models can be preferentially combined and allocated to the same candidate device. In this way, the differences in capabilities of heterogeneous devices can be utilized while avoiding additional communication waiting caused by excessive splitting.
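As a non-limiting illustration of the first constraint above, the following Python sketch merges sub-models whose pairwise data exchange volume exceeds a preset threshold into one group and then assigns each group to the single device with the largest summed adaptation result. The function name `colocate`, the union-find grouping, the summed-score rule for a group, and all numeric values are assumptions introduced here for illustration.

```python
def colocate(scores, exchange, threshold):
    """Assign devices so that heavily communicating sub-models share a device.

    scores:    {sub_model: {device: fused adaptation result}}
    exchange:  {(sub_model_a, sub_model_b): data exchange volume}
    threshold: pairs exchanging more than this volume are kept together
    """
    parent = {s: s for s in scores}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), volume in exchange.items():
        if volume > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in scores:
        groups.setdefault(find(s), []).append(s)

    assignment = {}
    for members in groups.values():
        devices = list(scores[members[0]])
        # Pick the device that is best for the group as a whole.
        best = max(devices, key=lambda dev: sum(scores[m][dev] for m in members))
        for m in members:
            assignment[m] = best
    return assignment


# Hypothetical example: "a" and "b" exchange a large volume, "b" and "c" do not.
scores = {
    "a": {"gpu": 0.9, "cpu": 0.4},
    "b": {"gpu": 0.5, "cpu": 0.6},
    "c": {"gpu": 0.2, "cpu": 0.8},
}
exchange = {("a", "b"): 100.0, ("b", "c"): 5.0}

grouped = colocate(scores, exchange, threshold=50.0)
ungrouped = colocate(scores, exchange, threshold=1000.0)
```

In this example sub-model "b" would individually prefer the CPU, but is placed on the GPU together with "a" once their exchange volume exceeds the threshold.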

[0102] In one embodiment of the present invention, the device selection unit 703 in Figure 7 can further include an overall device selection subunit and a sub-model device selection subunit. The overall device selection subunit can determine the target device at the overall task level, while the sub-model device selection subunit can determine sub-devices for different sub-models at a fine-grained level. The scheduling system can switch between overall task-level device selection and sub-model-level device selection based on the target task size, model complexity, preset latency requirements, or current system load. Thus, when the scheduling system prioritizes rapid response, it can adopt the overall device selection method; when the scheduling system prioritizes fully utilizing the differences in heterogeneous resources, it can adopt the sub-model-level device selection method.

[0103] In one embodiment of the present invention, Figure 9 shows a schematic diagram of a task processing device. The task processing device 900 can be deployed as a software functional module on a server, or as a hardware-software hybrid device deployed in a heterogeneous inference system. The task processing device 900 may include a task acquisition module 901, a runtime information acquisition module 902, an activation distribution analysis module 903, an adaptation analysis module 904, and a device scheduling module 905. The task acquisition module 901 is used to acquire task data for the target task and determine the target model; the runtime information acquisition module 902 is used to acquire runtime information within a preset period corresponding to the target task; the activation distribution analysis module 903 is used to determine the activation distribution characteristics of the target task in the target model based on activation-related information, and to form computational requirements, data exchange requirements, and memory carrying requirements based on the activation distribution characteristics; the adaptation analysis module 904 is used to determine the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device for the target task; the device scheduling module 905 is used to determine the target device based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device, and to call the target device to execute the target task.

[0104] In one embodiment of the present invention, Figure 8 shows a schematic diagram of an electronic device 800. The electronic device 800 may include a processor 801, a memory 802, a communication interface 803, and a bus 804. The processor 801 may include one or more core processing units, and may further include a graphics acceleration unit, a dedicated inference acceleration unit, or other programmable logic units. The memory 802 may include volatile and non-volatile memory for storing program instructions and runtime data. The communication interface 803 can be used for data interaction with clients, external servers, heterogeneous device resource pools, or other system components. The bus 804 can be used to connect the processor 801, the memory 802, and the communication interface 803. When the program instructions stored in the memory 802 are executed by the processor 801, the processor 801 can implement the steps in the aforementioned heterogeneous inference scheduling method for sparse activation models.

[0105] In one embodiment of the present invention, the server 1002 shown in Figure 10 may internally include a task receiving component, a runtime information management component, an adaptation analysis component, and a task execution control component. The task receiving component can receive target tasks sent by the client 1001; the runtime information management component can manage runtime information within a preset period for the heterogeneous device resource pool 1003; the adaptation analysis component can generate multi-dimensional adaptation results based on activation distribution characteristics and resource-related information; the task execution control component can schedule tasks to at least one device among CPU 10031, GPU 10032, and NPU 10033 for execution based on the determined target device or target sub-device. In some embodiments, the server 1002 may also include a model management component for maintaining model versions, parameter locations, runtime caches, and sub-model partitioning information for different sparse activation models.

[0106] In one embodiment of the present invention, the foregoing method and device embodiments jointly illustrate the following technical approach: instead of directly selecting heterogeneous devices based on static device specifications or a single utilization rate indicator, activation-related information is first extracted based on the operating information within a preset period corresponding to the target task. Then, activation-related information forms activation distribution characteristics, which are further mapped to computational requirements, data exchange requirements, and memory carrying requirements. Finally, a multi-dimensional adaptation result is formed by combining the actual carrying capacity of candidate devices in terms of computing power, communication capabilities, and storage resources. Through this process, the relationship between the target task and heterogeneous candidate devices is no longer a coarse-grained fixed mapping relationship, but rather an adaptation relationship corresponding to the dynamic execution path triggered by the current input. Therefore, when the actual execution path of the target task changes, the device scheduling result can also be adjusted accordingly, thereby better adapting to the dynamic resource requirements of the sparse activation model in heterogeneous inference scenarios.

[0107] In one embodiment of the present invention, the above processing procedure can bring about the following technical effects. Since the determination of the target device is based on the computational requirements, data exchange requirements, and memory carrying requirements represented by the activation distribution characteristics, it can more timely reflect the actual resource requirements of the target task in the current time period. Furthermore, since the computational adaptation results, communication adaptation results, and memory adaptation results all consider both the target task requirements and the capacity of candidate devices, the device selection results are no longer limited to a single-dimensional optimal solution, but are more conducive to achieving a balance between inference time consumption, execution stability, and heterogeneous resource utilization. Further, through adaptation analysis and device allocation at the sub-model level, the heterogeneous capability differences between CPUs, GPUs, and NPUs can be more fully utilized, enabling subtasks with different load characteristics to be allocated to more suitable execution devices.

[0108] In a different embodiment, the division of the aforementioned functional modules or components is not uniquely limited. Some modules can be further split, and some modules can be combined. For example, the runtime information acquisition module 902 and the activation distribution analysis module 903 can be implemented by the same processing unit, and the adaptation analysis module 904 and the device scheduling module 905 can also be implemented by the same scheduling engine. For the electronic device 800, the program instructions executed by the processor 801 can be fully deployed in a standalone environment or partially deployed in a cloud-edge collaborative environment. As long as the runtime information acquisition, activation distribution feature formation, multi-dimensional adaptation analysis, and device selection for the target task can be completed, the technical solution of one embodiment of the present invention can be realized.

[0109] In one embodiment of the present invention, the aforementioned operating information within the preset period can be organized and maintained in a variety of ways. The operational information organization unit 203 shown in Figure 2 is not limited to a single data structure, but can adopt different information organization methods based on task scale, concurrency level, model complexity, and number of devices. For scenarios with frequent inference requests and rapid changes in device status, the operational information organization unit 203 can focus more on the timeliness of recent operational information; for scenarios with relatively stable task types and relatively small workload fluctuations, the operational information organization unit 203 can focus more on the stability of statistical results. Therefore, the aforementioned preset period is not limited to a single length, but can be set dynamically according to system configuration. A shorter preset period is more conducive to reflecting recent status changes; a longer preset period is more conducive to smoothing out occasional fluctuations. The specific period length can be determined by weighing the deployment's requirements for scheduling sensitivity and stability.

[0110] In one embodiment of the present invention, after acquiring historical and currently collected operational metrics, the operational information organization unit 203 can first group the data according to model identifier, task category, input feature label, or sub-model identifier, and then organize activation-related information and resource-related information within each group according to candidate device identifiers. In this way, when it is necessary to read operational information within a preset period corresponding to the target task, the system can not only extract recent operational data from the time dimension, but also filter historical samples that are closer to the current target task from dimensions such as model type, task category, and device type. This multi-dimensional organization method is highly valuable for sparse activation models because even within the same broad task category, different input data may trigger different activation paths; by introducing additional indexes such as model, task, and device, the extracted operational information within the preset period can be closer to the actual execution mode of the current target task.
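
One possible way to realize this grouping is sketched below in Python. The class `RunInfoStore`, the record fields, and the choice of (model identifier, task category) as the group key are all hypothetical; the original text leaves the concrete data structure open.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RunRecord:
    timestamp: float
    model_id: str
    task_category: str
    device_id: str
    activation_counts: Dict[str, int]  # activation-related information
    inference_ms: float                # resource-related information


class RunInfoStore:
    """Groups records by (model, task category), then by candidate device."""

    def __init__(self) -> None:
        self._groups: Dict[Tuple[str, str], Dict[str, List[RunRecord]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add(self, rec: RunRecord) -> None:
        self._groups[(rec.model_id, rec.task_category)][rec.device_id].append(rec)

    def query(
        self, model_id: str, task_category: str, now: float, period: float
    ) -> Dict[str, List[RunRecord]]:
        """Return, per device, the records of the matching group within the period."""
        group = self._groups.get((model_id, task_category), {})
        result: Dict[str, List[RunRecord]] = {}
        for device_id, records in group.items():
            recent = [r for r in records if now - period <= r.timestamp <= now]
            if recent:
                result[device_id] = recent
        return result
```

A query therefore filters along both the time dimension (the preset period) and the similarity dimensions (model and task category) before any activation analysis takes place.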

[0111] In one embodiment, if the system maintains both overall model-level and sub-model-level operational information, it can first filter historical operational records similar to the target task from the overall model-level operational information, and then extract more granular activation-related and resource-related information from the corresponding sub-model-level operational information. The advantage of this approach is that the overall model-level operational information provides faster coarse-grained localization capabilities, while the sub-model-level operational information provides a more refined description of activation differences. The former helps the system quickly narrow down the candidate sample range, while the latter helps the system further improve the accuracy of subsequent adaptation analysis.
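
The two-stage lookup can be sketched as follows in Python. The dictionary keys `run_id` and `task_category` and the function name `coarse_to_fine` are assumptions for illustration; the original text does not specify how model-level and sub-model-level records are linked.

```python
from typing import Dict, List


def coarse_to_fine(
    model_level: List[Dict], submodel_level: List[Dict], task_category: str
) -> List[Dict]:
    """Filter at model level first, then fetch sub-model details for the hits."""
    # Stage 1: coarse-grained localization on the overall model-level records.
    run_ids = {r["run_id"] for r in model_level if r["task_category"] == task_category}
    # Stage 2: fine-grained activation details for the selected runs only.
    return [s for s in submodel_level if s["run_id"] in run_ids]
```

Only the sub-model records belonging to runs that passed the coarse filter are read, which keeps the fine-grained stage small.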

[0112] In one embodiment of the present invention, the aforementioned computational adaptation results, communication adaptation results, and memory adaptation results can be uniformly scaled before fusion. This is because the numerical ranges of adaptation results across different dimensions are typically not the same. For example, computational adaptation results may be more influenced by the peak performance of the device, communication adaptation results by bandwidth and transmission overhead, and memory adaptation results by the capacity of video memory or system RAM. If the original adaptation values of the three dimensions are fused directly, one dimension may become overly dominant in the final comprehensive adaptation result because of its larger numerical scale. Therefore, in one embodiment of the present invention, the adaptation fusion unit 701 shown in Figure 7 first normalizes the three types of adaptation results and then performs the fusion operation.

[0113] In one embodiment, if min-max linear normalization is applied to one type of adaptation result over the candidate device set, the normalized value $\tilde{A}_{d,k}$ of candidate device $d$ on that type of adaptation result can be expressed as:

$$\tilde{A}_{d,k} = \frac{A_{d,k} - A_{k}^{\min}}{A_{k}^{\max} - A_{k}^{\min} + \varepsilon} \tag{12}$$

[0114] where $k$ denotes the computational dimension, communication dimension, or memory dimension; $A_{d,k}$ denotes the original adaptation value of candidate device $d$ in dimension $k$; $A_{k}^{\min}$ denotes the minimum adaptation value of all candidate devices in dimension $k$; $A_{k}^{\max}$ denotes the maximum adaptation value of all candidate devices in dimension $k$; $\tilde{A}_{d,k}$ denotes the normalized adaptation value; and $\varepsilon$ denotes the smoothing term. Through formula (12), the adaptation results of different dimensions can be compressed to a similar scale range, which facilitates subsequent unified fusion.
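
A minimal Python sketch of the min-max normalization described for formula (12) is given below. The function name `min_max_normalize` is hypothetical, and placing the smoothing term in the denominator (so that identical values do not cause a division by zero) is an assumption about how the smoothing term is used.

```python
from typing import Dict


def min_max_normalize(values: Dict[str, float], eps: float = 1e-8) -> Dict[str, float]:
    """Scale one dimension's adaptation values across all candidate devices.

    `values` maps a device identifier to its original adaptation value.
    `eps` is the smoothing term; it is assumed here to guard the denominator.
    """
    lo = min(values.values())
    hi = max(values.values())
    return {dev: (v - lo) / (hi - lo + eps) for dev, v in values.items()}
```

Applying the same function to the computational, communication, and memory dimensions separately brings all three onto roughly the range from 0 to 1 before fusion.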

[0115] In a different embodiment, standardization based on mean and standard deviation, quantile normalization based on rank position, or other transformations that preserve the relative adaptation relationship among candidate devices can also be used to unify the scale of the adaptation results in different dimensions. For different systems, any processing method that reduces the unintended influence of numerical scale differences on the comprehensive adaptation result can be used in the adaptation fusion process of an embodiment of the present invention.
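
The two alternatives named above can be sketched in Python as follows. The function names are hypothetical, the population standard deviation is an arbitrary choice, and the rank-based variant shown here breaks ties by input order, which a real system might handle differently.

```python
import statistics
from typing import Dict


def z_score(values: Dict[str, float], eps: float = 1e-8) -> Dict[str, float]:
    """Standardize by mean and (population) standard deviation."""
    data = list(values.values())
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return {dev: (v - mean) / (std + eps) for dev, v in values.items()}


def rank_normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Map each device to its rank position, scaled to the range 0 to 1."""
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 1.0}
    return {dev: i / (n - 1) for i, dev in enumerate(ordered)}
```

Both transformations keep the relative ordering of the candidate devices, which is the property the paragraph above requires; the rank-based variant is additionally insensitive to outliers such as one device with a very large raw value.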

[0116] In one embodiment of the present invention, after obtaining the normalized multi-dimensional adaptation results, the comprehensive adaptation generation unit 702 shown in Figure 7 can form the comprehensive adaptation result either by weighted summation or by hierarchical screening combined with weighted fusion. Hierarchical screening combined with weighted fusion means first performing a preliminary screening of candidate devices based on a key dimension, and then performing multi-dimensional adaptation fusion on the candidate devices that pass the screening. For example, in some large-model inference scenarios, if the available GPU memory of a candidate device is significantly lower than the memory requirements of the target task, the device may not be suitable as a target device even if its computational adaptation result is high. In this case, screening can first be performed on the memory dimension, and the computational adaptation results and communication adaptation results of the remaining candidate devices can then be evaluated together. This method helps keep candidate devices that clearly fail the key resource constraints out of the final comparison stage.

[0117] In one embodiment, if a fusion method based on key-constraint screening is used, whether candidate device $d$ enters the final fusion set can be expressed by an indicator function $I_{d}$:

$$I_{d} = \begin{cases} 1, & A_{d,\mathrm{mem}} \ge \theta_{\mathrm{mem}} \\ 0, & A_{d,\mathrm{mem}} < \theta_{\mathrm{mem}} \end{cases} \tag{13}$$

[0118] where $I_{d}$ indicates whether candidate device $d$ meets the preset memory adaptation threshold, $A_{d,\mathrm{mem}}$ denotes the memory adaptation result of candidate device $d$, and $\theta_{\mathrm{mem}}$ denotes the memory adaptation threshold. When $I_{d} = 1$, candidate device $d$ enters the subsequent comprehensive adaptation fusion process; when $I_{d} = 0$, candidate device $d$ is excluded from the subsequent comprehensive comparison. This approach helps avoid mistakenly selecting candidate devices that are clearly deficient in a key resource dimension simply because they score high in other dimensions.
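
The screening-then-fusion procedure can be sketched in Python as follows. The function name `select_device`, the dimension keys, and the weight values in the usage example are hypothetical; treating "meets the threshold" as greater than or equal to the threshold is also an assumption.

```python
from typing import Dict, Optional


def select_device(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    mem_threshold: float,
) -> Optional[str]:
    """Screen on the memory dimension, then fuse the remaining scores by weight.

    `scores` maps device id -> normalized adaptation results per dimension
    ("compute", "comm", "memory"). Returns None if no device passes screening.
    """
    # Key-constraint screening: the indicator is 1 only above the threshold.
    eligible = {d: s for d, s in scores.items() if s["memory"] >= mem_threshold}
    if not eligible:
        return None
    # Weighted summation over the devices that survived the screening.
    fused = {d: sum(weights[k] * s[k] for k in weights) for d, s in eligible.items()}
    return max(fused, key=fused.get)


scores = {
    "gpu_small": {"compute": 1.0, "comm": 0.9, "memory": 0.1},
    "gpu_large": {"compute": 0.8, "comm": 0.7, "memory": 0.9},
    "cpu": {"compute": 0.2, "comm": 0.5, "memory": 1.0},
}
weights = {"compute": 0.6, "comm": 0.3, "memory": 0.1}
```

In this hypothetical example, `gpu_small` has the highest weighted sum but a very low memory adaptation result, so it is selected only when the threshold is disabled; with a threshold of 0.3 the choice moves to `gpu_large`.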

[0119] In one embodiment of the present invention, the adaptation analysis and device selection process reflected in Figure 6 and Figure 7 can also be adjusted according to the structural characteristics of different sparse activation models. For mixture-of-experts (MoE) models, the activation distribution characteristics are usually represented most directly by the set of activated expert sub-modules and their activation intensity distribution, so the aforementioned formulas (1) to (5) can be applied most directly. For gated sparse networks, the activation distribution characteristics can be represented by the opening state and opening weight of different functional branches; for conditional computation networks, by the triggering frequency and triggering intensity of different conditional computation units; for dynamic routing networks, by the path distribution and path switching relationship between different routing nodes. Although the internal activation objects differ, these models share a common property: at least one of the set of activated computation units, the activation intensity distribution, and the inter-layer distribution can be extracted from the activation-related information and then mapped to the computing requirements, data exchange requirements, and memory carrying requirements.

[0120] In one embodiment, if the target model is a gated sparse network, the opening probability or opening count of a gated branch can replace the expert activation weights and activation counts in the aforementioned MoE model; if the target model is a dynamic routing network, the forwarding ratio or path selection frequency between routing nodes can replace the aforementioned inter-layer distribution indicators; if the target model is a modular neural network that selects some sub-networks to execute based on the input, the set of sub-networks selected for execution can replace the aforementioned set of activated computational units. Therefore, the aforementioned adaptation analysis process applies not only to typical mixture-of-experts models but also to more general sparse activation inference scenarios.
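
The substitution described above can be illustrated with a small Python sketch in which different model types feed the same feature structure. The type `ActivationFeatures`, the function `from_signal`, and the normalization of the signal to shares that sum to 1 are assumptions for illustration; they are not the formulas (1) and (2) of the original text, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class ActivationFeatures:
    active_units: FrozenSet[str]   # set of activated computation units
    intensity: Dict[str, float]    # share of the total activation signal per unit


def from_signal(signal: Dict[str, float]) -> ActivationFeatures:
    """Build features from any non-negative per-unit activation signal.

    The signal may be expert activation counts (MoE), branch opening
    probabilities (gated sparse network), or path selection frequencies
    (dynamic routing network).
    """
    active = {unit: s for unit, s in signal.items() if s > 0}
    total = sum(active.values())
    if total == 0:
        return ActivationFeatures(frozenset(), {})
    return ActivationFeatures(
        frozenset(active), {unit: s / total for unit, s in active.items()}
    )
```

Because the downstream requirement and adaptation steps only consume this common structure, the model-specific part is confined to choosing which raw signal is passed in.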

[0121] In one embodiment of the present invention, the aforementioned processing does not require all runtime information to originate from the same physical node. For distributed inference systems, different nodes can collect task inference time, communication metrics, and memory metrics from their local candidate devices, and then aggregate this information into a unified runtime information management component to form runtime information within a preset period. Thus, when the target task needs to be scheduled across a set of heterogeneous devices across nodes, the system can still utilize a unified approach to form activation distribution characteristics and multi-dimensional adaptation results. In other words, the aforementioned method can be used in both single-machine, multi-device heterogeneous inference systems and multi-machine, multi-device collaborative heterogeneous inference systems.
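
A minimal Python sketch of such aggregation is shown below. The report layout (each node reports a mapping from device identifier to a list of metric samples) and the function name `aggregate_reports` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_reports(
    node_reports: List[Dict[str, List[dict]]]
) -> Dict[str, List[dict]]:
    """Merge per-node metric reports into one device-indexed view.

    Each report maps a globally unique device id (for example "node1/gpu0")
    to the metric samples collected locally on that node.
    """
    merged: Dict[str, List[dict]] = defaultdict(list)
    for report in node_reports:
        for device_id, samples in report.items():
            merged[device_id].extend(samples)
    return dict(merged)
```

After aggregation, the scheduler can treat devices from different nodes as one heterogeneous candidate set, which is what allows the same analysis to serve both single-machine and multi-machine systems.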

[0122] In one embodiment of the present invention, from the perspective of the execution chain, the entire heterogeneous inference scheduling process can be summarized as the following continuous derivation: The system first identifies the target model around the target task and extracts activation-related information and resource-related information corresponding to the current task from the running information within a preset period; then, activation-related information forms activation distribution features, and these activation distribution features further generate computing requirements, data exchange requirements, and memory carrying requirements; subsequently, combining the carrying capacity of different candidate devices in terms of computing power, communication capabilities, and storage resources, computing adaptation results, communication adaptation results, and memory adaptation results are formed respectively; finally, the target device or target sub-device is determined by the multi-dimensional adaptation results. Through this continuous derivation process, the dynamic execution path changes of the target task can be passed to the device scheduling decision layer by layer, rather than being simplified to static model labels or single hardware indicators during the scheduling process.

[0123] In one embodiment of the present invention, the above technical solution achieves the following effects in actual deployment. Because the scheduling criteria explicitly incorporate running information within a preset period corresponding to the target task, the system can more promptly perceive recent changes in activation status and device status. Because the scheduling process explicitly constructs activation distribution characteristics, and derives computational requirements, data exchange requirements, and memory carrying requirements from these characteristics, the system can more accurately distinguish the main pressure sources of the current task in different resource dimensions. Because the final device selection simultaneously considers computational adaptation results, communication adaptation results, and memory adaptation results, the system can avoid the one-sidedness caused by scheduling based solely on a single computing power indicator, a single utilization rate indicator, or static empirical rules. Therefore, this technical solution is more conducive to improving the device adaptation accuracy during the sparse activation model inference process in heterogeneous environments, while also considering inference efficiency, execution stability, and resource utilization.

[0124] In one embodiment of the present invention, with reference to the processes, structures, and application scenarios illustrated in Figures 1 to 10, the aforementioned technical solution can be further understood as a heterogeneous inference scheduling mechanism centered on the dynamic execution path of the target task. After the target task arrives, this mechanism does not select a device directly from the static specifications of the candidate devices. Instead, the task acquisition module 901 or the task receiving component shown in Figure 10 first acquires the target task, and the operation information acquisition module 902 or the operation information management component then extracts the operation information within the preset period corresponding to the target task. After the operation information within the preset period is obtained, the activation distribution analysis module 903 or the distribution building unit 403 shown in Figure 4 can determine the activation distribution characteristics of the target task in the target model based on the activation-related information, and the computational requirement generation unit 501, data exchange requirement generation unit 502, and memory capacity requirement generation unit 503 shown in Figure 5 then generate the computational requirements, data exchange requirements, and memory capacity requirements, respectively. Subsequently, the adaptation analysis module 904, or the computational adaptation analysis unit 601, communication adaptation analysis unit 602, and memory adaptation analysis unit 603 shown in Figure 6, can generate multi-dimensional adaptation results for the different heterogeneous candidate devices. The device scheduling module 905, or the comprehensive adaptation generation unit 702 and device selection unit 703 shown in Figure 7, can determine the target device or target sub-device on this basis, and the task issuing unit 704 or the task execution control component shown in Figure 10 completes task scheduling and execution. Therefore, the main process shown in Figure 1, the intermediate processing structures shown in Figures 2 to 7, and the devices and system architectures shown in Figures 8 to 10 form a consistent data flow and control flow relationship.

[0125] In one embodiment of the present invention, the overall processing procedure consisting of steps S110 to S150 in Figure 1 can be viewed as a continuous process of gradually deriving the device selection result from the task input. Step S110 is used to establish the correspondence between the target task and the target model, thereby determining the sparse activation objects that need to be analyzed subsequently; Step S120 is used to extract the activation-related information and resource-related information most relevant to the current task from the running information within a preset period, thereby providing basic data for subsequent dynamic analysis; Step S130 is used to form activation distribution characteristics based on activation-related information, and further derive computing requirements, data exchange requirements, and memory carrying requirements, thereby transforming the dynamic execution path changes of the target task into a requirement expression that can be used for scheduling analysis; Step S140 is used to combine the computing power, communication capabilities, and storage resources of candidate devices to form computing adaptation results, communication adaptation results, and memory adaptation results, respectively, thereby establishing a multi-dimensional adaptation relationship between the target task and heterogeneous candidate devices; Step S150 is used to determine the target device based on the multi-dimensional adaptation results and call the determined target device to complete the inference processing. Through this continuous process, changes in the target task's input, model activation, resource requirements, and device adaptation are connected within the same inference scheduling chain.
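
The chain from activation information to device selection (steps S130 to S150) can be sketched end to end in Python. Everything below is a simplified illustration under stated assumptions: the demand estimates and the capacity-to-demand ratios are placeholders chosen for readability, not the formulas (1) to (9) of the original text, and all names and field keys are hypothetical.

```python
from typing import Dict


def activation_distribution(counts: Dict[str, int]) -> Dict[str, float]:
    """S130, part 1: share of activations for each activated unit."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no activated computation units")
    return {u: c / total for u, c in counts.items() if c > 0}


def derive_requirements(
    dist: Dict[str, float], unit_flops: Dict[str, float], unit_bytes: Dict[str, float]
) -> Dict[str, float]:
    """S130, part 2: placeholder demand estimates per resource dimension."""
    return {
        "compute": sum(p * unit_flops[u] for u, p in dist.items()),
        "comm": float(len(dist)),  # assumed: more active units, more exchange
        "memory": sum(unit_bytes[u] for u in dist),
    }


def adaptation(req: Dict[str, float], device: Dict[str, float]) -> Dict[str, float]:
    """S140: placeholder capacity-to-demand ratios, capped at 1.0."""
    return {
        "compute": min(1.0, device["flops"] / req["compute"]),
        "comm": min(1.0, device["bandwidth"] / req["comm"]),
        "memory": min(1.0, device["free_bytes"] / req["memory"]),
    }


def schedule(counts, unit_flops, unit_bytes, devices, weights) -> str:
    """S150: pick the device with the highest weighted adaptation score."""
    req = derive_requirements(activation_distribution(counts), unit_flops, unit_bytes)
    fused = {}
    for name, dev in devices.items():
        a = adaptation(req, dev)
        fused[name] = sum(weights[k] * a[k] for k in weights)
    return max(fused, key=fused.get)
```

The point of the sketch is the data flow: the device choice depends on the activation counts only through the derived requirements, so a change in the activation pattern of the task can change the selected device even when the device specifications stay the same.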

[0126] In one embodiment of the present invention, the aforementioned formulas are mainly used to illustrate how intermediate and final results are derived step by step from known quantities. Formulas (1) and (2) are used to obtain the activation intensity distribution and comprehensive activation degree from known quantities such as activation count and activation weight; Formulas (3) to (5) are used to obtain the computing requirements, data exchange requirements, and memory carrying requirements from known quantities such as comprehensive activation degree, static computing scale of computing unit, and static storage requirements; Formulas (6) to (8) are used to obtain the computing adaptation results, communication adaptation results, and memory adaptation results from information related to task requirements and device capabilities; Formula (9) is used to integrate the three types of adaptation results into a comprehensive adaptation result; Formulas (10) and (11) are used to select the corresponding sub-device at the sub-model level; Formulas (12) and (13) are used to normalize or filter key constraints on the multi-dimensional adaptation results. It can be seen that the entire scheduling process does not directly give the device conclusion by a single rule, but transforms the dynamic execution characteristics of the task into heterogeneous device selection results through a series of intermediate variables that can be interpreted layer by layer. This approach not only ensures a clear implementation path for the technical solution but also allows for the replacement of some specific calculation methods in different application scenarios while maintaining the overall technical approach.

[0127] In one embodiment of the present invention, the specific expression of the above formula does not constitute a unique limitation. As long as the following technical logic can be implemented, the technical solution of one embodiment of the present invention can be used: first, an activation distribution feature is formed based on activation-related information; then, computational requirements, data exchange requirements, and memory carrying requirements are formed based on the activation distribution feature; then, multi-dimensional adaptation results are formed by combining the capability information of candidate devices in the three dimensions of computation, communication, and storage; finally, the target device is determined based on the multi-dimensional adaptation results. That is, in some embodiments, the weight coefficients, normalization methods, smoothing terms, or adaptation function forms in the formula can be adjusted; in some embodiments, equivalent ratio forms, difference forms, ranking forms, or scoring forms can also be used to achieve the same analysis process. For those skilled in the art, as long as the activation distribution feature can ultimately be incorporated into the device selection process, and the device selection result is simultaneously affected by computational requirements, data exchange requirements, and memory carrying requirements, it still falls within the technical concept disclosed in one embodiment of the present invention.

[0128] In one embodiment of the present invention, the aforementioned operational information within a preset period can be continuously maintained before the target task arrives, or it can be supplemented in real time after the target task arrives, in conjunction with the current collection results. For some low-latency scenarios, the operational information of each sparse activation model and each candidate device can be pre-maintained before the target task arrives. When the target task arrives, the operational information within the preset period corresponding to it can be directly read and subsequent analysis can be performed. For some scenarios with extremely high real-time requirements, after reading historical operational information, newly collected activation-related information and resource-related information at the current moment can be supplemented to make the final activation distribution characteristics and multi-dimensional adaptation results closer to the current state. Thus, the operational information within the preset period can be represented as pure historical statistics or as a set of information combining historical statistics and current sampling.
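
One simple way to combine historical statistics with a current sample is sketched below in Python. The linear blend and the weight `alpha` are assumptions; the original text only states that current sampling can supplement the historical information, not how the two are combined.

```python
from typing import List, Optional


def effective_metric(
    history: List[float], current: Optional[float] = None, alpha: float = 0.5
) -> float:
    """Return the historical mean, optionally blended with a fresh sample.

    With `current=None` the result is pure historical statistics; otherwise
    `alpha` controls how strongly the current sample pulls the estimate.
    """
    if not history:
        raise ValueError("history must not be empty")
    hist_mean = sum(history) / len(history)
    if current is None:
        return hist_mean
    return (1 - alpha) * hist_mean + alpha * current
```

A low-latency deployment would call the function without `current` and rely on pre-maintained statistics, while a deployment with stricter real-time requirements would pass the newly collected value.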

[0129] In one embodiment of the present invention, the scope of candidate devices is not uniquely limited. Although the preceding description mainly uses CPU 10031, GPU 10032, and NPU 10033 as examples, in other embodiments, candidate devices may also include digital signal processors, field-programmable gate array acceleration devices, dedicated artificial intelligence inference cards, or other heterogeneous processing resources capable of performing neural network inference computations. For different types of candidate devices, as long as their computing power information related to task inference time, device communication information related to task communication overhead, and device storage resource information related to task memory usage can be obtained, they can be included in the aforementioned heterogeneous candidate device set for unified adaptation analysis.
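
The admission condition stated above (all three kinds of capability information must be obtainable) can be sketched in Python as follows. The dataclass `CandidateDevice`, its field names, and the use of `None` to mark unavailable information are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateDevice:
    device_id: str
    kind: str                       # e.g. "cpu", "gpu", "npu", "dsp", "fpga"
    compute_info: Optional[float]   # computing power information
    comm_info: Optional[float]      # device communication information
    storage_info: Optional[float]   # device storage resource information


def admissible(devices: List[CandidateDevice]) -> List[CandidateDevice]:
    """Keep only devices for which all three kinds of information are known."""
    return [
        d for d in devices
        if None not in (d.compute_info, d.comm_info, d.storage_info)
    ]
```

The device type itself plays no role in the check, which reflects the statement that any heterogeneous processing resource can join the candidate set once the three kinds of information are available.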

[0130] In one embodiment of the present invention, the data type, application type, and model type of the target task are not uniquely limited. Although examples of text understanding and multimodal understanding have been given above, an embodiment of the present invention can also be applied to text generation, image classification, visual detection, speech recognition, recommendation ranking, search retrieval, content understanding, or other application scenarios that use sparse activation models for inference processing. In different application scenarios, the target task may differ in how activation-related information and resource-related information are represented, but as long as the basic characteristic of dynamically activating some computing units based on the input to perform inference is still met, the aforementioned heterogeneous inference scheduling method can be used.

[0131] In one embodiment of the present invention, from the perspective of technical effect, the embodiment consisting of steps S110 to S150 in Figure 1 can achieve the following results: Since the method acquires the running information within a preset period corresponding to the target task before device scheduling, it can better characterize recent activation state changes and recent device state changes; since the method further determines the activation distribution characteristics of the target task in the target model based on activation-related information, and uses these activation distribution characteristics to characterize computational requirements, data exchange requirements, and memory carrying requirements, the dynamic execution path changes of the sparse activation model can be explicitly incorporated into the scheduling analysis; since the method separately forms the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device for the target task, and further determines the target device based on the multi-dimensional adaptation results, it can avoid the problem of insufficient characterization of actual task resource requirements by a single device indicator or static mapping method. Therefore, in a heterogeneous computing environment, the selection result of the target device can more closely reflect the actual execution characteristics of the current target task, thus being more conducive to improving inference efficiency, execution stability, and the utilization effect of heterogeneous resources.

[0132] In one embodiment of the present invention, the electronic device 800 shown in Figure 8, the task processing device 900 shown in Figure 9, and the heterogeneous inference scheduling system shown in Figure 10 can be deployed in three ways, respectively: single-machine deployment, device-module deployment, and system collaborative deployment. In the single-machine deployment, processor 801 can directly call local or locally accessible candidate device resources to perform inference. In the device-module deployment, task acquisition module 901, runtime information acquisition module 902, activation distribution analysis module 903, adaptation analysis module 904, and device scheduling module 905 can be deployed as program modules within server processes, edge containers, or the inference framework. In the system collaborative deployment, client 1001 and server 1002 can interact over a network, and server 1002 then collaborates with the heterogeneous device resource pool 1003 to complete task scheduling and execution. These deployment modes differ mainly in their implementation vehicles; their technical logic still revolves around the same activation distribution analysis and multi-dimensional adaptation scheduling process.

[0133] In one embodiment of the present invention, the foregoing implementation methods can also be combined with each other. For example, a system can select the target device at the overall task level, or select devices at the sub-model level for some key sub-models; it can normalize the multi-dimensional adaptation results uniformly before fusion, or first perform key resource threshold screening before fusion; it can be deployed in a single-machine multi-device system or in a multi-machine multi-device system. For those skilled in the art, as long as the core idea of multi-dimensional adaptation scheduling driven by activation distribution characteristics is unchanged, adjustments, combinations, and substitutions of the various local implementation methods can all be considered optional implementations of an embodiment of the present invention.

[0134] It should be understood that the method steps in the foregoing embodiments can be implemented by program instructions controlling related hardware, or by dedicated circuits, programmable logic devices, or a combination thereof. Correspondingly, the systems, devices, modules, units, or components in the foregoing embodiments can be implemented in software, hardware, or a combination of both. The division of modules, units, or components is merely a logical division for the purpose of illustrating the technical solution; in actual implementation, they can be combined, split, or integrated as needed.

[0135] In one embodiment, the electronic device may include a processor, a memory, and a communication interface, wherein the memory is used to store program instructions, and the processor is used to call and execute the program instructions to implement all or part of the steps in the foregoing method embodiments. The electronic device may be a server, a terminal device, an edge computing node, a cloud computing device, or other device with data processing capabilities.

[0136] In one embodiment, this application may also be implemented in the form of a computer-readable storage medium. The computer-readable storage medium stores a computer program, which, when executed by a processor, causes the processor to implement all or part of the steps in the foregoing method embodiments. The computer-readable storage medium may be a read-only memory, random access memory, flash memory, hard disk, solid-state drive, optical disk, or other non-transitory storage medium.

[0137] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, system embodiments are basically similar to method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments. In the description of this specification, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this specification. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification and the features of different embodiments or examples.

[0138] Furthermore, the terms "including," "comprising," and "having" used in the specification are all non-exclusive inclusions; the terms "first," "second," etc., are only used to distinguish technical features and do not indicate limitations on order, quantity, or importance. The execution order of each step in the method embodiments is not strictly limited. Without departing from the technical concept of this application, the steps can be adjusted in order, executed in parallel, combined, or split.

Claims

1. A heterogeneous inference scheduling method for sparse activation models, comprising:
obtaining task data for a target task and determining a target model for executing the target task, wherein the target model is a sparse activation model;
obtaining running information within a preset period corresponding to the target task, wherein the running information within the preset period includes at least activation-related information representing the activation state of the target model, and resource-related information representing the difference in execution cost of the target task on heterogeneous candidate devices;
determining, based on the activation-related information, the activation distribution characteristics of the target task in the target model;
determining the computational adaptation result of each heterogeneous candidate device for the target task based on the computational requirements represented by the activation distribution characteristics and the information related to task inference time and device computing power in the resource-related information;
determining the communication adaptation result of each heterogeneous candidate device for the target task based on the data exchange requirements represented by the activation distribution characteristics and the information related to task communication overhead and device communication capabilities in the resource-related information;
determining the memory adaptation result of each heterogeneous candidate device for the target task based on the memory carrying requirements represented by the activation distribution characteristics and the information related to task memory usage and device storage resources in the resource-related information;
determining a target device for performing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device; and
invoking the target device to process the task data based on the target model, thereby obtaining a task processing result.

2. The method according to claim 1, wherein the activation-related information includes at least one of the activation count and the activation weight of the activated computing units in the model, and the resource-related information includes at least one of the task inference time, task communication metrics, and task memory metrics.

3. The method according to claim 1 or 2, wherein determining the activation distribution characteristics of the target task in the target model based on the activation-related information comprises: determining, based on the activation-related information, at least one of the following as the activation distribution characteristics: the set of activated computing units of the target task in at least one model layer, the activation intensity distribution of each activated computing unit, and the inter-layer distribution of the activated computing units across multiple model layers.
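
As an illustration of the three characteristics named in claim 3, the following Python sketch derives them from a list of activation records. The record layout `(layer, unit, weight)` and both normalizations (each unit's share of total activation weight, and each layer's share of distinct activated units) are assumptions chosen for the example. The claims do not define these quantities numerically.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[int, str, float]  # (layer index, unit id, activation weight)


def activation_features(records: Iterable[Record]):
    """Return (active sets per layer, intensity distribution, inter-layer distribution)."""
    # Accumulate activation weight per (layer, unit).
    weight: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for layer, unit, w in records:
        weight[layer][unit] += w

    # Set of activated computing units in each model layer.
    active_sets: Dict[int, Set[str]] = {
        layer: set(units) for layer, units in weight.items()
    }

    # Activation intensity: each unit's share of the total activation weight.
    total_w = sum(w for units in weight.values() for w in units.values())
    intensity = {
        (layer, unit): (w / total_w if total_w else 0.0)
        for layer, units in weight.items()
        for unit, w in units.items()
    }

    # Inter-layer distribution: each layer's share of the activated units.
    total_units = sum(len(s) for s in active_sets.values())
    inter_layer = {
        layer: (len(s) / total_units if total_units else 0.0)
        for layer, s in active_sets.items()
    }
    return active_sets, intensity, inter_layer


records = [(0, "e1", 0.5), (0, "e2", 0.25), (1, "e1", 0.25)]
active_sets, intensity, inter_layer = activation_features(records)
```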

4. The method according to claim 3, wherein:
the step of determining the computational adaptation result of each heterogeneous candidate device for the target task comprises: determining the computational adaptation result of each heterogeneous candidate device for the target task based on the number of activated computing units, the activation intensity distribution of each activated computing unit, and the corresponding task inference time, combined with the peak performance metric of each heterogeneous candidate device;
the step of determining the communication adaptation result of each heterogeneous candidate device for the target task comprises: determining the communication adaptation result of each heterogeneous candidate device for the target task based on the inter-layer distribution of the set of activated computing units across multiple model layers and the corresponding task communication metric, combined with the device communication metric of each heterogeneous candidate device; and
the step of determining the memory adaptation result of each heterogeneous candidate device for the target task comprises: determining the memory adaptation result of each heterogeneous candidate device for the target task based on the number of activated computing units, the activation intensity distribution of each activated computing unit, and the corresponding task memory metric, combined with the device storage resources of each heterogeneous candidate device.
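
One way to turn the inputs listed in claim 4 into three numeric adaptation results is sketched below in Python. The formulas are assumptions for illustration only: each result is a ratio clamped to [0, 1], the latency and communication budgets are hypothetical parameters, and the activation intensity distribution is left out for brevity even though the claim lists it as an input.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Demand:
    active_units: int      # number of activated computing units
    active_layers: int     # layers containing at least one activated unit
    gflop_per_unit: float  # assumed compute cost of one activated unit
    mb_per_layer: float    # task communication metric per active layer
    gb_per_unit: float     # task memory metric per activated unit
    infer_ms: float        # observed task inference time on this device


@dataclass
class Device:
    peak_tflops: float     # peak performance; 1 TFLOP/s equals 1 GFLOP per ms
    bandwidth_gb_s: float  # transfer bandwidth; 1 GB/s equals 1 MB per ms
    free_gb: float         # available device storage


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def adaptation(d: Demand, dev: Device,
               compute_budget_ms: float,
               comm_budget_ms: float) -> Tuple[float, float, float]:
    """Return (compute, communication, memory) results, each in [0, 1]."""
    # Compute: latency is at least the observed time and the peak-rate floor.
    floor_ms = d.active_units * d.gflop_per_unit / dev.peak_tflops
    est_ms = max(d.infer_ms, floor_ms)
    compute = clamp01(compute_budget_ms / est_ms) if est_ms else 1.0

    # Communication: traffic grows with the number of active layers.
    transfer_ms = d.active_layers * d.mb_per_layer / dev.bandwidth_gb_s
    comm = clamp01(comm_budget_ms / transfer_ms) if transfer_ms else 1.0

    # Memory: activated units must fit in the available device storage.
    need_gb = d.active_units * d.gb_per_unit
    memory = clamp01(dev.free_gb / need_gb) if need_gb else 1.0
    return compute, comm, memory


demand = Demand(active_units=8, active_layers=4, gflop_per_unit=50.0,
                mb_per_layer=100.0, gb_per_unit=4.0, infer_ms=10.0)
gpu = Device(peak_tflops=100.0, bandwidth_gb_s=50.0, free_gb=16.0)
scores = adaptation(demand, gpu, compute_budget_ms=20.0, comm_budget_ms=4.0)
```

In this example the device meets the compute budget but covers only half of the communication budget and half of the memory need, so the three results differ and a later fusion step has something to trade off.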

5. The method according to claim 1, wherein the heterogeneous candidate devices include at least two of the following: a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processing unit (NPU); the device computing power includes the floating-point operation capability of the heterogeneous candidate devices; the device communication capability includes the data transmission bandwidth of the heterogeneous candidate devices; and the device storage resources include at least one of the following: system memory, video memory, on-chip storage resources, or storage resources associated with the heterogeneous candidate device.

6. The method according to claim 1, wherein the step of obtaining the running information within the preset period corresponding to the target task comprises: collecting operating metrics of the target model on the heterogeneous candidate devices; storing the operating metrics in a metric storage unit; and reading the running information within the preset period corresponding to the target task from the metric storage unit.

7. The method according to claim 6, wherein the operating metrics include historical operating metrics written to the metric storage unit according to collection time and currently collected operating metrics; and the step of reading the running information within the preset period corresponding to the target task from the metric storage unit comprises: extracting the activation-related information and the resource-related information corresponding to the target task from the historical operating metrics and the currently collected operating metrics, and using the extracted information as the running information within the preset period.
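
The collect, store, and read steps of claims 6 and 7 can be sketched in Python as a small in-memory store that is filtered by task and by a time window. The `MetricStore` class, its method names, and the bounded `deque` are illustrative assumptions. The claims do not specify how the metric storage unit is implemented.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple

Sample = Tuple[float, str, Dict[str, Any], Dict[str, Any]]


class MetricStore:
    """In-memory stand-in for the metric storage unit (hypothetical design)."""

    def __init__(self, max_samples: int = 1000) -> None:
        # Oldest samples are dropped once the capacity is reached.
        self._samples: Deque[Sample] = deque(maxlen=max_samples)

    def write(self, ts: float, task_id: str,
              activation: Dict[str, Any], resources: Dict[str, Any]) -> None:
        """Store one collected sample together with its collection time."""
        self._samples.append((ts, task_id, activation, resources))

    def read_window(self, task_id: str, now: float,
                    period: float) -> Dict[str, List[Dict[str, Any]]]:
        """Return the task's samples collected within [now - period, now]."""
        start = now - period
        rows = [s for s in self._samples
                if s[1] == task_id and start <= s[0] <= now]
        return {
            "activation": [r[2] for r in rows],  # activation-related information
            "resources": [r[3] for r in rows],   # resource-related information
        }


store = MetricStore()
store.write(0.0, "task-a", {"e1": 3}, {"gpu_infer_ms": 12.0})    # too old
store.write(50.0, "task-a", {"e1": 5}, {"gpu_infer_ms": 11.0})   # historical
store.write(90.0, "task-b", {"e7": 1}, {"gpu_infer_ms": 30.0})   # other task
store.write(100.0, "task-a", {"e2": 2}, {"gpu_infer_ms": 9.0})   # current
window = store.read_window("task-a", now=100.0, period=60.0)
```

Here the window combines one historical sample with the currently collected one, which matches the combination of historical and current metrics described in claim 7.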

8. The method according to claim 1, wherein the step of determining the target device for executing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of the heterogeneous candidate devices comprises: fusing the computational adaptation result, communication adaptation result, and memory adaptation result of each heterogeneous candidate device to obtain a comprehensive adaptation result of that heterogeneous candidate device for the target task; and determining the target device based on the comprehensive adaptation results of the heterogeneous candidate devices.
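
The fusion step of claim 8 is not tied to a particular formula. A weighted sum is one simple option, sketched below in Python. The weights, the device names, and the rule of picking the highest fused value are assumptions for illustration.

```python
from typing import Dict, Tuple

Triple = Tuple[float, float, float]  # (compute, communication, memory) results


def fuse(results: Dict[str, Triple],
         weights: Triple = (0.5, 0.25, 0.25)) -> Dict[str, float]:
    """Fuse the three adaptation results of each device into one value."""
    return {
        device: sum(w * r for w, r in zip(weights, triple))
        for device, triple in results.items()
    }


def pick_target(results: Dict[str, Triple],
                weights: Triple = (0.5, 0.25, 0.25)) -> str:
    """Choose the device with the highest comprehensive adaptation result."""
    fused = fuse(results, weights)
    return max(fused, key=fused.get)


results = {
    "gpu": (1.0, 0.5, 0.5),   # fast, but tight on bandwidth and memory
    "npu": (0.75, 1.0, 1.0),  # slightly slower, with ample headroom
}
fused = fuse(results)
target = pick_target(results)
```

With these weights the NPU is selected even though the GPU has the better compute result, which shows how fusion lets communication and memory headroom outweigh raw speed.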

9. The method according to claim 1, wherein the target model includes multiple model layers, and each model layer includes multiple sub-models;
the step of determining the target device for executing the target task based on the computational adaptation results, communication adaptation results, and memory adaptation results of the heterogeneous candidate devices comprises: determining, for each of the multiple sub-models, the computational adaptation result, communication adaptation result, and memory adaptation result of each heterogeneous candidate device for that sub-model; and determining, based on the computational adaptation results, communication adaptation results, and memory adaptation results of each heterogeneous candidate device for each sub-model, a corresponding sub-device for each sub-model; and
the step of invoking the target device to process the task data based on the target model and obtaining the task processing result comprises: invoking each sub-device to execute its corresponding sub-model to obtain the processing results of the multiple sub-models; and generating the task processing result based on the multiple processing results.
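
The per-sub-model assignment of claim 9 can be sketched in Python as follows. The fused scores in `fit`, the executor table that stands in for real devices, and the choice of summing the partial results into the task processing result are all assumptions for illustration. The claim does not specify how the processing results are combined.

```python
from typing import Callable, Dict, List, Tuple

SubModel = Callable[[float], float]
Executor = Callable[[SubModel, float], float]


def assign_sub_devices(fit: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick, for each sub-model, the device with the highest fused score."""
    return {sub: max(per_device, key=per_device.get)
            for sub, per_device in fit.items()}


def run_task(assignment: Dict[str, str],
             sub_models: Dict[str, SubModel],
             executors: Dict[str, Executor],
             x: float) -> Tuple[Dict[str, float], float]:
    """Run each sub-model on its sub-device, then merge the partial results."""
    partial = {name: executors[assignment[name]](fn, x)
               for name, fn in sub_models.items()}
    # Hypothetical merge rule: sum the sub-model outputs.
    return partial, sum(partial.values())


trace: List[str] = []  # records which device ran each sub-model


def make_executor(device: str) -> Executor:
    def execute(fn: SubModel, x: float) -> float:
        trace.append(device)
        return fn(x)
    return execute


fit = {
    "expert_a": {"gpu": 0.9, "cpu": 0.4},
    "expert_b": {"gpu": 0.3, "cpu": 0.7},
}
sub_models = {"expert_a": lambda x: 2 * x, "expert_b": lambda x: x + 1}
executors = {device: make_executor(device) for device in ("gpu", "cpu")}

assignment = assign_sub_devices(fit)
partial, result = run_task(assignment, sub_models, executors, 3)
```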

10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method of any one of claims 1 to 9.