Systems, methods and nodes in distributed inference networks

By prefetching data to on-chip caches during collective communication operations, the system addresses memory-bandwidth bottlenecks and inter-device communication overheads, improving performance and scalability in LLM inference systems.

WO2026123241A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2024-12-11
Publication Date: 2026-06-18

AI Technical Summary

⚠Technical Problem

Existing systems face memory-bandwidth bottlenecks and inter-device communication overheads in large language models (LLMs), limiting performance and scalability.

⚗Method used

Implementing a distributed system with interconnected nodes that prefetch data to on-chip caches during collective communication operations, overlapping memory reads with communication tasks to enhance resource utilization and reduce latency.

🎯Benefits of technology

This approach reduces latency, increases resource utilization, and enhances scalability by effectively addressing memory bandwidth bottlenecks and inter-device communication overheads in LLM inference systems.

✦ Generated by Eureka AI based on patent content.

[Figure: abstract drawing, CN2024138448_18062026_PF_FP_ABST]

Patent Text Reader

Abstract

In some examples, a node is provided in a distributed system comprising multiple nodes configured for neural network inference, the node comprising at least one processor, and at least one memory including computer program code, wherein the multiple nodes are communicatively interconnected with one another for inter-node collective communication operations, the at least one memory and computer program code being configured to, with the at least one processor, cause the node to prefetch data to a second storage of the node during a collective communication operation of the multiple nodes.

Description

SYSTEMS, METHODS AND NODES IN DISTRIBUTED INFERENCE NETWORKS

TECHNICAL FIELD

[0001] The present disclosure relates, in general, to artificial intelligence inference systems.

BACKGROUND

[0002] Large language models (LLMs) consist of multiple layers (e.g., Linear, Self-attention, Activation) that require pretrained weights and contextual data such as key-value caches (KV-cache) during inference. These weights and KV-cache data are typically stored in off-chip memory and accessed by artificial intelligence (AI) accelerators, which can comprise compute cores and on-chip memory (e.g., level 2 / L2 cache).

[0003] A key challenge is the memory-bandwidth bottleneck, where reading data from off-chip memory takes longer than processing it, even with high-bandwidth memory technologies like HBM (high-bandwidth memory).

[0004] For large LLMs, data is often distributed across multiple accelerators in a cluster using parallelization techniques (e.g., tensor parallelism). This requires inter-device communication, which can further limit scalability and performance due to latency.

SUMMARY

[0005] An objective of the present disclosure is to provide systems, methods and apparatus for improving the performance of AI inference systems, particularly for large language models (LLMs), by addressing memory-bandwidth bottlenecks and inter-device communication overheads.

[0006] The foregoing and other objectives are achieved by the features of the independent claims.

[0007] Further implementation forms are apparent from the dependent claims, the description and the Figures.

[0008] A first aspect of the present disclosure provides a node in a distributed system comprising multiple nodes configured for neural network inference, the node comprising at least one processor, and at least one memory including computer program code, wherein the multiple nodes are communicatively interconnected with one another for inter-node collective communication operations, the at least one memory and computer program code being configured to, with the at least one processor, cause the node to prefetch data to a second storage of the node during a collective communication operation of the multiple nodes.

[0009] Prefetching during a collective or common communication operation of nodes in a cluster of multiple nodes reduces latency since memory reads are overlapped with communication operations. It also increases resource utilization by minimizing idle compute time and enhances scalability for distributed systems with interconnected accelerators / nodes.

[0010] In an implementation of the first aspect, the node can comprise a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit.

[0011] As the paradigm of LLM inference systems is shifting from a few powerful AI accelerators to smaller but more cost-effective chiplet-based accelerators in a cluster, the present disclosure provides a cost-effective inference system that can comprise a large number of interconnected chiplet-based accelerators.

[0012] In an example, the node can prefetch the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus. The at least one memory and computer program code can be configured to, with the at least one processor, further cause the node to perform the collective communication operation as part of a first execution stream for the node, and prefetch the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node. The at least one memory and computer program code can be configured to, with the at least one processor, further cause the node to calculate an estimated size for respective ones of multiple data items for prefetching by the node, and prefetch selected ones of the multiple data items whose aggregate size is less than a preselected threshold value. For example, a preselected threshold value can be related to the size of an on-chip memory. For example, a preselected threshold value can be a maximum size of an on-chip memory.

[0013] In an example, the node can receive a prefetch trigger instruction from a host node of the distributed system, wherein the prefetch trigger instruction is configured to trigger the prefetch of the data by the node. The prefetch trigger can comprise a message for one or more nodes (e.g., a broadcast, unicast, multicast, anycast etc.) configured to cause a node or nodes (e.g., all or selected nodes) to perform a prefetch of data during a collective communication operation.
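The host-issued trigger described above can be illustrated with a minimal Python sketch. The disclosure does not specify a message format or transport, so everything here is an assumption made for illustration: the `PrefetchTrigger` fields, the `Node` class, the in-process `queue.Queue` standing in for the network, and the dictionaries standing in for off-chip and on-chip memory are all hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class PrefetchTrigger:
    """Hypothetical trigger message; the field names are assumptions."""
    collective_id: str            # which collective operation to overlap with
    item_names: Tuple[str, ...]   # data items the node should prefetch


class Node:
    """Toy node: dictionaries stand in for off-chip and on-chip memory."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.inbox: "queue.Queue[PrefetchTrigger]" = queue.Queue()
        self.off_chip: Dict[str, bytes] = {}
        self.on_chip: Dict[str, bytes] = {}

    def handle_pending(self) -> int:
        """Process all queued triggers and return how many were handled."""
        handled = 0
        while True:
            try:
                trigger = self.inbox.get_nowait()
            except queue.Empty:
                return handled
            for name in trigger.item_names:
                if name in self.off_chip:
                    # Copy from "off-chip" to "on-chip" storage.
                    self.on_chip[name] = self.off_chip[name]
            handled += 1


def send_trigger(nodes: Iterable[Node], trigger: PrefetchTrigger,
                 targets: Optional[Set[int]] = None) -> None:
    """Deliver to all nodes (broadcast) or only to `targets` (multicast)."""
    for node in nodes:
        if targets is None or node.node_id in targets:
            node.inbox.put(trigger)


nodes = [Node(i) for i in range(3)]
for n in nodes:
    n.off_chip["w1"] = b"weights"

# Multicast: only nodes 0 and 2 are asked to prefetch "w1".
send_trigger(nodes, PrefetchTrigger("cc0", ("w1",)), targets={0, 2})
for n in nodes:
    n.handle_pending()
```

Passing `targets=None` corresponds to the broadcast case, and a single-element set to the unicast case.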

[0014] A second aspect of the present disclosure provides a distributed system comprising multiple nodes configured for neural network inference, wherein at least one of the nodes of the multiple nodes comprises a node as provided according to the first aspect.

[0015] A third aspect of the present disclosure provides a method for a node in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations, the method comprising prefetching data to a second storage of the node during a collective communication operation of the multiple nodes.

[0016] In an implementation of the third aspect, the node can comprise a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit. Prefetching the data can comprise prefetching the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus. The method can further comprise performing the collective communication operation as part of a first execution stream for the node, and prefetching the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node. The method can further comprise calculating an estimated size for respective ones of multiple data items for prefetching by the node, and prefetching selected ones of the multiple data items whose aggregate size is less than a preselected threshold value. The method can further comprise receiving a prefetch trigger instruction from a host node of the distributed system, wherein the prefetch trigger instruction is configured to trigger the prefetch of the data by the node.

[0017] A fourth aspect of the present disclosure provides a non-transitory computer readable medium comprising program instructions for causing a node to perform a method in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations, the method comprising prefetching data to a second storage of the node during a collective communication operation of the multiple nodes.

[0018] In an implementation of the fourth aspect, the node can comprise a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit. The non-transitory computer readable medium can comprise program instructions for causing the node to prefetch the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus. The non-transitory computer readable medium can comprise program instructions for causing the node to perform the collective communication operation as part of a first execution stream for the node, and to prefetch the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node. The non-transitory computer readable medium can comprise program instructions for causing the node to calculate an estimated size for respective ones of multiple data items for prefetching by the node, and to prefetch selected ones of the multiple data items whose aggregate size is less than a preselected threshold value.

[0019] These and other aspects of the invention will be apparent from the embodiment(s) described below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] In order that the present disclosure may be more readily understood, embodiments will now be described, by way of example, with reference to the accompanying drawings, in which:

[0021] FIG. 1 is a schematic representation of part of an LLM architecture, according to an example;

[0022] FIG. 2 is a schematic representation of an AI accelerator, according to an example;

[0023] FIG. 3 is a schematic representation of a system, according to an example;

[0024] FIG. 4 is a schematic representation of a method, according to an example;

[0025] FIG. 5 is a schematic representation of a method, according to an example;

[0026] FIG. 6 is a schematic representation of a machine according to an example; and

[0027] FIG. 7 is a schematic representation of a method for a node in a distributed system configured for neural network inference.

DETAILED DESCRIPTION

[0028] Example embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein.

[0029] Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.

[0030] The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof. The term “and/or” is only an association relationship for describing associated objects and represents that three relationships may exist such that A and/or B may indicate that A exists alone, A and B exist at the same time, or B exists alone. The character “/” generally represents that the associated objects are in an “or” relationship.

[0031] Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.

[0032] The following contains specific information related to implementations of the present disclosure. The drawings and their accompanying detailed disclosure are merely directed to implementations. However, the present disclosure is not limited to these implementations. Other variations and implementations of the present disclosure will be obvious to those skilled in the art.

[0033] The phrases “in one implementation,” or “in some implementations,” may each refer to one or more of the same or different implementations. The term “coupled” is defined as connected whether directly or indirectly through intervening components and is not necessarily limited to physical connections. The expression “at least one of A, B and C” or “at least one of the following: A, B and C” means “only A, or only B, or only C, or any combination of A, B and C.”

[0034] The terms “system” and “network” may be used interchangeably.

[0035] For the purposes of explanation and non-limitation, specific details such as functional entities, techniques, protocols, and standards are set forth for providing an understanding of the present disclosure. In other examples, detailed disclosure of well-known methods, technologies, systems, and architectures are omitted so as not to obscure the present disclosure with unnecessary details.

[0036] Persons skilled in the art will immediately recognize that any network function(s) or algorithm(s) disclosed may be implemented by hardware, software or a combination of software and hardware. Disclosed functions may correspond to modules which may be software, hardware, firmware, or any combination thereof.

[0037] A software implementation may include machine- and/or computer-readable and/or executable instructions stored on a machine- and/or computer-readable medium such as memory or other types of storage devices. One or more microprocessors or general-purpose computers with communication processing capability may be programmed with corresponding executable instructions and perform the disclosed network function(s) or algorithm(s).

[0038] The microprocessors or general-purpose computers may include Application Specific Integrated Circuits (ASICs), programmable logic arrays, and/or one or more Digital Signal Processors (DSPs). Although some of the disclosed implementations are oriented to software installed and executing on computer hardware, alternative implementations implemented as firmware or as hardware or as a combination of hardware and software are well within the scope of the present disclosure. The computer readable medium includes but is not limited to Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory, Compact Disc Read-Only Memory (CD-ROM), magnetic cassettes, magnetic tape, magnetic disk storage, or any other equivalent medium capable of storing computer-readable instructions.

[0039] Large language models (LLMs) are a type of machine learning model. FIG. 1 is a schematic representation of part of an LLM architecture, according to an example. LLMs consist of a set of layers, such as Linear, Self-attention, Activation etc. Some of these layers (e.g., Linear) have a set of pretrained weights, which are trained using large amounts of data in a “training” process prior to “inference” (inference is also known as model serving or deployment). Some of these layers (e.g., Self-attention) have contextual data (also known as a key-value (KV) cache), which is produced during the previous steps of the inference and stored temporarily until the completion of the inference process. The weights and KV-cache are typically stored in the off-chip memory of an AI accelerator, which is a type of processor specialized for processing AI workloads.

[0040] FIG. 2 is a schematic representation of an AI accelerator, according to an example. The accelerator 200 of FIG. 2 comprises an “off-chip memory” 201 and a “compute die” 203, which are connected to each other through a memory bus 205. The compute die 203 consists of a number of “cores” 207, each of which can execute arithmetic operations to process the layers of LLMs, such as that depicted in FIG. 1 for example. The compute die 203 also comprises an “on-chip memory” 209 (often referred to as L2 cache or buffer) to temporarily store data in-between memory operations to reduce the number of off-chip accesses and to reduce the latency of the memory operations.

[0041] During inference, an AI accelerator 200 sequentially executes the layers of an LLM. While executing a layer, the accelerator reads the required pretrained weights or KV-cache from the off-chip memory 201 to the on-chip cache 209. The pretrained weights and KV-cache are typically so large that reading them from the off-chip memory 201 usually takes longer than processing the layers of the LLM, even with high-end memory technologies (e.g., HBM) that provide the highest bandwidth. This phenomenon is called the “memory-bandwidth bottleneck” and it is often the limiting factor in the performance of AI accelerators for LLM inference.
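The bottleneck can be made concrete with a short back-of-the-envelope calculation in Python. All figures below (layer shape, 16-bit weights, 1 TB/s memory bandwidth, 100 TFLOP/s peak compute) are hypothetical values chosen only for illustration and are not taken from the disclosure.

```python
def read_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream `num_bytes` from off-chip memory."""
    return num_bytes / bandwidth_bytes_per_s


def compute_time_s(flops: float, peak_flops_per_s: float) -> float:
    """Time to execute `flops` floating-point operations at peak rate."""
    return flops / peak_flops_per_s


# Hypothetical linear layer: 4096 x 4096 weights stored in 16 bits each.
rows, cols, bytes_per_weight = 4096, 4096, 2
weight_bytes = rows * cols * bytes_per_weight

# One token: a matrix-vector product needs one multiply and one add per weight.
flops = 2 * rows * cols

bandwidth = 1e12   # assumed 1 TB/s off-chip bandwidth
peak = 100e12      # assumed 100 TFLOP/s peak compute

t_read = read_time_s(weight_bytes, bandwidth)
t_compute = compute_time_s(flops, peak)
memory_bound = t_read > t_compute

print(f"read {t_read * 1e6:.2f} us, compute {t_compute * 1e6:.2f} us")
```

Under these assumed numbers the weight read takes about 100 times longer than the arithmetic, which is the situation that prefetching during otherwise idle periods is meant to ease.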

[0042] For many large LLMs, the capacity of the off-chip memory 201 is not sufficient to store all the required weights and KV-cache. In these cases, LLMs are partitioned using various parallelization techniques (e.g., tensor parallelism, pipeline parallelism, sequence parallelism etc.) and distributed across multiple accelerators (i.e., a “cluster”). For example, a cluster can comprise multiple accelerators 200.

[0043] During inference, each accelerator processes a partition of the computation graph with the partitioned weights and/or KV-cache. After executing certain layers, the accelerators need to share their intermediate data with each other. The intermediate data transfer often occurs in the form of collective communications such as Allreduce 101 for tensor parallel (see FIG. 1) or Allgather for sequence parallel etc. Depending on the size of the cluster, the amount of transferred data, and the specifications of the network between the accelerators, the collective operations may take a considerable amount of time, which limits the scalability of the system and decreases the overall performance of the LLM inference.
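For readers unfamiliar with these collectives, the following Python sketch shows only their input/output semantics on plain lists. It is a single-process reference model and says nothing about how a real interconnect implements them; the function names are chosen for illustration.

```python
from typing import List


def allreduce_sum(per_node: List[List[float]]) -> List[List[float]]:
    """Every node ends up with the element-wise sum of all nodes' vectors."""
    length = len(per_node[0])
    if any(len(v) != length for v in per_node):
        raise ValueError("all nodes must contribute vectors of equal length")
    total = [sum(v[i] for v in per_node) for i in range(length)]
    return [list(total) for _ in per_node]


def allgather(per_node: List[List[float]]) -> List[List[float]]:
    """Every node ends up with the concatenation of all nodes' vectors."""
    gathered = [x for v in per_node for x in v]
    return [list(gathered) for _ in per_node]


# Two nodes, each holding a partial result.
partials = [[1.0, 2.0], [3.0, 4.0]]
reduced = allreduce_sum(partials)    # each node now holds [4.0, 6.0]
gathered = allgather(partials)       # each node now holds [1.0, 2.0, 3.0, 4.0]
```

Both operations require every node to wait for data from the others, which is why their duration grows with cluster size and transfer volume.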

[0044] According to an example, there is provided a method for overlapping prefetching with communication operations in order to address both the off-chip memory-bandwidth bottleneck and inter-device communication overheads. For example, in the context of distributed LLM inference, model weights and KV-cache data can be prefetched from an off-chip memory to the on-chip L2 cache of an accelerator at a time that overlaps with inter-device communication operations.

[0045] According to an example, a host CPU and multiple AI accelerators are interconnected through a network. Each accelerator has off-chip memory (e.g., HBM) and on-chip cache (e.g., L2). During collective communication operations (e.g., Allreduce, Allgather), the system prefetches data required for subsequent tasks into an on-chip cache, ensuring efficient utilization of resources. In an example, prefetching can be performed only if the total size of prefetched data does not exceed an on-chip cache capacity or some other predefined threshold value.

[0046] FIG. 3 is a schematic representation of a system, according to an example. Specifically, in the example of FIG. 3, a system architecture of a distributed inference platform is shown, such as a distributed neural network inference platform, or distributed LLM inference platform. The system consists of a host 301 (denoted as CPU) and a number of nodes 303 comprising AI accelerators (denoted as NPU, enumerated as device 0-3). The host 301 is equipped with an off-chip memory 305 (e.g., DDR) and a storage unit 307 (e.g., disk), which stores, e.g., pretrained model weights and KV-cache of an LLM. Each node 303 has an off-chip memory 309 (e.g., HBM) and an on-chip memory (e.g., L2 cache). The nodes 303 are connected to each other through an interconnection network 311, thereby forming a cluster.

[0047] Accordingly, each node 303 is part of a distributed system 300 comprising multiple nodes configured for neural network inference. The multiple nodes 303 are communicatively interconnected with one another (311) for inter-node collective communication operations.

[0048] In an example, host 301 can execute 313 a program that enables queries received from an external network 315 to be served. Host 301 loads pretrained model weights (for, e.g., an LLM) from storage 307 to its memory 305, copies partitions of the weights to the off-chip memories 309 of the nodes 303, dispatches parts of the computation graph to the nodes 303, and orchestrates communication and synchronization between the nodes 303. Each node 303 can therefore execute a part of the computation graph and produce intermediate results. These intermediate results are shared with other nodes 303 through the interconnect network 311. When the nodes 303 complete their part of the execution, the outputs are sent back to the host 301, which performs any post-processing steps and completes the query. Accordingly, in the context of an LLM for example, host 301 can receive a query from network 315 for inference using an LLM across the distributed nodes 303, with each node 303 executing a given part of a computation graph for the LLM that has been provided by the host 301 in order to produce an intermediate result.
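The host's role of partitioning weights and collecting node outputs can be sketched in Python as follows. The disclosure does not state how the weights are split, so the column-wise split used here is an assumption (one common form of tensor parallelism), and the helper names `split_columns`, `matvec` and `host_run` are hypothetical. The nodes are simulated sequentially in one process.

```python
from typing import List

Matrix = List[List[float]]


def split_columns(weight: Matrix, num_nodes: int) -> List[Matrix]:
    """Host side: cut the weight matrix into equal column shards, one per node."""
    cols = len(weight[0])
    if cols % num_nodes != 0:
        raise ValueError("column count must be divisible by the number of nodes")
    width = cols // num_nodes
    return [[row[n * width:(n + 1) * width] for row in weight]
            for n in range(num_nodes)]


def matvec(x: List[float], weight: Matrix) -> List[float]:
    """Node side: compute x @ weight for the shard held by this node."""
    return [sum(x[r] * weight[r][c] for r in range(len(x)))
            for c in range(len(weight[0]))]


def host_run(x: List[float], weight: Matrix, num_nodes: int) -> List[float]:
    """Dispatch shards, let each simulated node compute, then merge outputs."""
    shards = split_columns(weight, num_nodes)
    partials = [matvec(x, shard) for shard in shards]   # one result per node
    return [value for part in partials for value in part]


weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0]

distributed = host_run(x, weight, num_nodes=2)
single_device = matvec(x, weight)
```

The distributed result matches the single-device result, which is the property any partitioning scheme chosen by the host has to preserve.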

[0049] According to an example, in order to reduce node latency, increase resource utilization, and enhance scalability, each node can be configured to perform memory reads for data prefetching that overlap with communication operations. That is, a node 303 can prefetch data during a collective communication operation of the multiple nodes 303 of the system 300. The data can be prefetched to a second storage of a node 303, such as an on-chip memory, from an off-chip memory 309. The prefetching during a collective communication operation of the multiple nodes can be orchestrated in one of three ways: each node 303 can be configured to initiate the prefetch to its second storage itself, each node 303 can be configured by the host 301 to perform the prefetch, or a combination of node-orchestrated and host-orchestrated prefetching can be used.

[0050] FIG. 4 is a schematic representation of a method, according to an example. With reference to FIG. 3, each node 303 can execute a sequence of tasks (e.g., collective communication operation (CC0) → Task1, 401 → Task2, 403 → Task3, 405), as shown in FIG. 4 (only three tasks are shown but it can be any number of tasks).

[0051] CC0 407 is a collective communication operation (e.g., allreduce, allgather etc.) during which a number of nodes 303 share data with each other. Task1, Task2, Task3 can be any operation, including but not limited to matrix multiplication, self-attention, activation, element-wise arithmetic, and so on. Task1, Task2, and Task3 may or may not have an input operand that is stored in off-chip memory 309 (denoted as Data 1 and Data 3 in FIG. 4). A node may execute multiple streams 409, 411 of tasks in parallel (denoted as Stream 1 and Stream 2 in FIG. 4).

[0052] According to an example, a node 303 starts executing CC0 407 in Stream 1 409. The node 303 can prefetch 412 Data 1 413 and prefetch 414 Data 3 415 from the off-chip memory 309 of the node to on-chip memory, such as a cache or buffer of the node, in Stream 2 411 in parallel to CC0 407. That is, during a CCO 407 of a cluster of nodes, a node can prefetch data to an on-chip memory of the node (from an off-chip memory) for a task or tasks. The prefetch of data is performed in parallel to the execution of the CCO 407.

[0053] In an example, when the execution of CC0 407 is completed, node 303 executes Task1 401, Task2 403, Task3 405 in Stream 1 409 using the data (Data1 413, Data3 415) prefetched in stream 2 411 during the CCO 407.

[0054] A node can repeat this process for every CCO in a computation graph.
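The two-stream schedule of FIG. 4 can be imitated in Python with a worker thread standing in for Stream 2. This is only a sketch of the control flow under stated assumptions: `time.sleep` stands in for the collective and for off-chip reads, a dictionary stands in for the on-chip cache, and the function names and task bodies are hypothetical. A real accelerator would use hardware execution streams rather than operating-system threads.

```python
import threading
import time
from typing import Any, Callable, Dict, List, Tuple


def run_step(collective: Callable[[], Any],
             prefetch_jobs: List[Tuple[str, Callable[[], Any]]],
             tasks: List[Callable[[Any, Dict[str, Any]], Any]]) -> Any:
    """Run the collective (Stream 1) while prefetching (Stream 2), then the tasks."""
    cache: Dict[str, Any] = {}   # stand-in for on-chip memory

    def stream2() -> None:
        for name, load in prefetch_jobs:
            cache[name] = load()   # stand-in for an off-chip to on-chip copy

    prefetcher = threading.Thread(target=stream2)
    prefetcher.start()             # Stream 2 begins prefetching
    value = collective()           # Stream 1 executes CC0 in the meantime
    prefetcher.join()              # prefetched data must be resident before use
    for task in tasks:
        value = task(value, cache)
    return value


def fake_allreduce() -> int:
    time.sleep(0.05)               # pretend the collective takes 50 ms
    return 10


def load_data1() -> int:
    time.sleep(0.03)               # pretend the read takes 30 ms
    return 2


def load_data3() -> int:
    time.sleep(0.02)               # pretend the read takes 20 ms
    return 3


tasks = [
    lambda v, c: v * c["Data1"],   # Task1 consumes Data 1
    lambda v, c: v + 1,            # Task2 has no off-chip operand
    lambda v, c: v * c["Data3"],   # Task3 consumes Data 3
]

result = run_step(fake_allreduce,
                  [("Data1", load_data1), ("Data3", load_data3)],
                  tasks)
```

The `join()` before the first task is the synchronization point between the two streams: tasks only start once both the collective and the prefetches have finished.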

[0055] FIG. 5 is a schematic representation of a method, according to an example. With reference to FIG. 3, each node 303 can execute a sequence of tasks (e.g., collective communication operation (CC0) → Task1, 501 → Task2, 503 → Task3, 505 → Task4, 507), as shown in FIG. 5 (only four tasks are shown but it can be any number of tasks).

[0056] In the example of FIG. 5, a total data size that is prefetched can be calculated such that prefetching is only performed if the total prefetched data size is below a threshold (e.g., an on-chip cache capacity of a node).

[0057] Accordingly, with reference to FIG. 5, a node 303 can estimate the memory size of Data 1 513, Data 3 515, and Data 4 517 and calculate that the total size of, e.g., Data 1 513, Data 3 515, and Data 4 517 exceeds a predefined threshold, whereas the total size of Data 1 513 and Data 3 515 is below the predefined threshold. Therefore, the node can perform prefetching only for Data 1 513 and Data 3 515 and skip prefetching of Data 4 517.
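One possible way to realize this selection is sketched below in Python. The disclosure does not prescribe a selection policy, so the in-order, first-fit policy used here is an assumption, as are the tensor shapes, the 16-bit element size, the 16 MiB threshold and the helper names. The strict "less than" comparison follows the wording of the summary.

```python
import math
from typing import List, Sequence, Tuple

MIB = 1024 * 1024


def estimate_size_bytes(shape: Sequence[int], bytes_per_element: int) -> int:
    """Estimated footprint of a tensor: element count times element size."""
    return math.prod(shape) * bytes_per_element


def select_for_prefetch(items: List[Tuple[str, int]],
                        threshold_bytes: int) -> Tuple[List[str], List[str], int]:
    """Walk the items in task order and keep each one that still fits.

    Returns (selected names, skipped names, aggregate size of selected items).
    """
    selected: List[str] = []
    skipped: List[str] = []
    total = 0
    for name, size in items:
        if total + size < threshold_bytes:   # aggregate must stay below threshold
            selected.append(name)
            total += size
        else:
            skipped.append(name)             # fetched on demand later instead
    return selected, skipped, total


# Hypothetical operand shapes with 16-bit elements.
items = [
    ("Data1", estimate_size_bytes((2048, 2048), 2)),   # 8 MiB
    ("Data3", estimate_size_bytes((2048, 1536), 2)),   # 6 MiB
    ("Data4", estimate_size_bytes((2048, 1024), 2)),   # 4 MiB
]

selected, skipped, total = select_for_prefetch(items, threshold_bytes=16 * MIB)
```

With these assumed sizes, Data 1 and Data 3 together occupy 14 MiB and are selected, while adding Data 4 would reach 18 MiB, so it is skipped, mirroring the outcome described for FIG. 5.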

[0058] A node 303 starts executing CC0 507 in Stream 1 509. The node 303 can prefetch 512 Data 1 513 and prefetch 514 Data 3 515 from the off-chip memory 309 of the node to on-chip memory, such as a cache or buffer of the node, in Stream 2 511 in parallel to CC0 507. That is, during a CCO 507 of a cluster of nodes, a node can prefetch data to an on-chip memory of the node (from an off-chip memory) for a task or tasks. The prefetch of data is performed in parallel to the execution of the CCO 507. The amount of data to prefetch is estimated dynamically to ensure prefetching does not exceed, e.g., cache (on-chip memory) capacity.

[0059] As with the example of FIG. 4, when the execution of CC0 507 is completed, node 303 executes Task1 501, Task2 503, Task3 505 and Task4 507 in Stream 1 509 using the data (Data1 513, Data3 515) prefetched in stream 2 511 during the CCO 507 and the Data4 517 that is obtained just before or as Task 4 507 is executed (i.e., Data 4 517 is not prefetched).

[0060] A node can repeat this process for every CCO in a computation graph.

[0061] A CCO, such as that performed in the examples of FIG. 4 and FIG. 5, can comprise, e.g., a Broadcast, All-to-all, All-to-one, Reduce, All-reduce, Scan, Gather, All-Gather, Scatter as well as other communication operations such as Send and Receive.

[0062] In the examples of FIG. 4 and FIG. 5, the tasks can comprise linear, attention, activation, normalization, embedding tasks, positional encoding layers as well as other algebraic operations such as matrix-matrix multiplication, matrix-vector multiplication, vector-vector multiplication, grouped matrix multiplication, matrix transpose as well as elementwise operations such as addition, subtraction, division, multiplication etc.

[0063] Prefetched data as described herein, such as with reference to the examples of FIG. 4 and FIG. 5, can comprise weights and parameters of an AI model or data generated during an inference process (e.g., KV-cache).

[0064] In an example, a node 303 can comprise an AI accelerator comprising a processing device in the form of a central processing unit (CPU), graphics processing unit (GPU), data processing unit (DPU), neural processing unit / AI accelerator (NPU), Application Specific Integrated Circuit (ASIC) or field programmable gate array (FPGA).

[0065] Data can be prefetched not only to an on-chip memory of a node, such as a L2 cache memory, but also other types of memory at various levels of a memory hierarchy of a node including but not limited to register files, scratchpad memory, on-chip buffers, L0, L1, L3 cache. Some of these memories might be on the compute die of a node, or may be connected to the compute die through various packaging technologies such as 2.5D, 3D, Wire Bonding, Hybrid Bonding etc.

[0066] According to an example, there is therefore provided a node, a method and a system for overlapping prefetching of, e.g., model weights and KV-cache with collective communication operations in a cluster of nodes forming a distributed neural network inference system. As such, latency of collective communications is hidden by the memory reads, resource utilization of the nodes (e.g., AI accelerators in a distributed LLM inference system) is increased, end-to-end token generation latency is decreased, throughput is increased, and the cost per token is decreased.
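The latency benefit can be approximated with a simple analytic model, sketched here in Python. The model is an assumption made for illustration: it treats the collective and the memory read as perfectly overlappable with no contention between them, and the timings are hypothetical.

```python
def step_latency_us(t_comm_us: float, t_read_us: float,
                    t_compute_us: float, overlap: bool) -> float:
    """Latency of one collective-plus-tasks step under an idealized model.

    Without overlap the three phases run back to back. With overlap the
    read runs concurrently with the collective, so only the longer of the
    two contributes before compute starts.
    """
    if overlap:
        return max(t_comm_us, t_read_us) + t_compute_us
    return t_comm_us + t_read_us + t_compute_us


# Hypothetical timings in microseconds.
t_comm, t_read, t_compute = 40.0, 30.0, 10.0

sequential = step_latency_us(t_comm, t_read, t_compute, overlap=False)
overlapped = step_latency_us(t_comm, t_read, t_compute, overlap=True)
saving = sequential - overlapped
```

In this model the saving per step equals the shorter of the communication time and the read time, so the benefit is largest when the two are of similar duration.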

[0067] Examples in the present disclosure can be provided as methods, systems or machine-readable instructions, such as any combination of software, hardware, firmware or the like. Such machine-readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program codes therein or thereon.

[0068] The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow diagrams described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. In some examples, some blocks of the flow diagrams may not be necessary and/or additional blocks may be added. It shall be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or diagrams in the flow charts and/or block diagrams can be realized by machine readable instructions.

[0069] The machine-readable instructions may, for example, be executed by a machine such as a general-purpose computer, a platform comprising user equipment such as a smart device, e.g., a smart phone, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams. In particular, a processor or processing apparatus may execute the machine-readable instructions. Thus, modules of apparatus may be implemented by a processor executing machine readable instructions stored in a memory, or a processor operating in accordance with instructions embedded in logic circuitry. The term 'processor' is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate set etc. The methods and modules may all be performed by a single processor or divided amongst several processors.

[0070] Such machine-readable instructions may also be stored in a computer readable storage that can guide the computer or other programmable data processing devices to operate in a specific mode. For example, the instructions may be provided on a non-transitory computer readable storage medium encoded with instructions, executable by a processor.

[0071] FIG. 6 is a schematic representation of a machine according to an example. The machine 600 can be, e.g., a system or apparatus, user equipment, or part thereof, or a node in a cluster of nodes, where each node can comprise, e.g., an AI accelerator or processor. The machine 600 comprises a processor 603, and a memory 605 to store instructions 607, executable by the processor 603. The machine comprises a storage 609, such as an on-chip memory comprising, e.g., an L2 cache, that can be used to store data 611 representing, e.g., prefetched data and so on, as described above.

[0072] The instructions 607, executable by the processor 603, can cause the machine 600 to prefetch data during a collective communication operation of multiple nodes, wherein the machine 600 can comprise a node in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations.

[0073] Accordingly, the machine 600 can implement a method for prefetching data during a collective communication operation of multiple nodes in a cluster.
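
By way of illustration only, the overlap of a collective communication operation with a prefetch can be sketched in Python using two threads as stand-ins for two execution streams. The sketch is not the implementation of the disclosure: the names run_overlapped, fake_all_reduce and fake_prefetch are hypothetical, the collective operation is simulated by a short sleep followed by a sum, and the on-chip cache and off-chip memory are modelled as plain dictionaries.

```python
import threading
import time


def run_overlapped(collective_op, prefetch_op):
    """Run a collective operation and a prefetch in parallel threads.

    Both arguments are zero-argument callables (hypothetical stand-ins
    for work issued on two execution streams of a node).
    """
    results = {}

    def _run(name, fn):
        results[name] = fn()

    comm = threading.Thread(target=_run, args=("collective", collective_op))
    pre = threading.Thread(target=_run, args=("prefetch", prefetch_op))
    comm.start()
    pre.start()   # the prefetch proceeds while the collective is in flight
    comm.join()
    pre.join()
    return results


def fake_all_reduce(values):
    """Simulated collective: pretend to wait on the interconnect, then reduce."""
    time.sleep(0.05)
    return sum(values)


def fake_prefetch(off_chip, keys, cache):
    """Simulated prefetch: copy the named items from off-chip memory to the cache."""
    for key in keys:
        cache[key] = off_chip[key]
    return list(keys)


off_chip_memory = {"w0": [1, 2], "w1": [3, 4]}   # assumed off-chip contents
on_chip_cache = {}                                # assumed second storage

out = run_overlapped(
    lambda: fake_all_reduce([1, 2, 3]),
    lambda: fake_prefetch(off_chip_memory, ["w0", "w1"], on_chip_cache),
)
```

In this sketch the cache is populated by the time the simulated collective returns, so a subsequent compute step could read its inputs from the cache rather than from off-chip memory.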

[0074] Such machine-readable instructions may also be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing, thus the instructions executed on the computer or other programmable devices provide an operation for realizing functions specified by flow(s) in the flow charts and/or block(s) in the block diagrams.

[0075] Further, the teachings herein may be implemented in the form of a computer or software product, such as a non-transitory machine-readable storage medium, the computer software or product being stored in a storage medium and comprising a plurality of instructions, e.g., machine readable instructions, for making a computer device implement the methods recited in the examples of the present disclosure.

[0076] In some examples, some methods can be performed in a cloud-computing or network-based environment. Cloud-computing environments may provide various services and applications via the Internet. These cloud-based services (e.g., software as a service, platform as a service, infrastructure as a service, etc.) may be accessible through a web browser or other remote interface of the user equipment, for example. Various functions described herein may be provided through a remote desktop environment or any other cloud-based computing environment.

[0077] FIG. 7 is a schematic representation of a method for a node in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations. In block 701, the method comprises prefetching data to a second storage of the node during a collective communication operation of the multiple nodes.

[0078] While various embodiments have been described and/or illustrated herein in the context of fully functional computing systems, one or more of these exemplary embodiments may be distributed as a program product in a variety of forms, regardless of the particular type of computer-readable storage media used to actually carry out the distribution. The embodiments disclosed herein may also be implemented using software modules that perform certain tasks. These software modules may include script, batch, or other executable files that may be stored on a computer-readable storage medium or in a computing system. In some embodiments, these software modules may configure a computing system to perform one or more of the exemplary embodiments disclosed herein. In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another.

[0079] The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the exemplary embodiments disclosed herein. This exemplary description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the spirit and scope of the instant disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims and their equivalents in determining the scope of the instant disclosure.

Claims

1. A node in a distributed system comprising multiple nodes configured for neural network inference, the node comprising: at least one processor; and at least one memory including computer program code, wherein the multiple nodes are communicatively interconnected with one another for inter-node collective communication operations, the at least one memory and computer program code being configured to, with the at least one processor, cause the node to: prefetch data to a second storage of the node during a collective communication operation of the multiple nodes.

2. The node as claimed in claim 1, wherein the node comprises a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit.

3. The node as claimed in claim 2, wherein the node is configured to prefetch the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus.

4. The node as claimed in any preceding claim, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the node to: perform the collective communication operation as part of a first execution stream for the node; and prefetch the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node.

5. The node as claimed in any preceding claim, wherein the at least one memory and computer program code are configured to, with the at least one processor, further cause the node to: calculate an estimated size for respective ones of multiple data items for prefetching by the node; and prefetch selected ones of the multiple data items whose aggregate size is less than a preselected threshold value.

6. The node as claimed in any preceding claim, wherein the node is configured to: receive a prefetch trigger instruction from a host node of the distributed system, wherein the prefetch trigger instruction is configured to trigger the prefetch of the data by the node.

7. A distributed system comprising multiple nodes configured for neural network inference, wherein at least one of the nodes of the multiple nodes comprises a node as claimed in any preceding claim.

8. A method for a node in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations, the method comprising: prefetching data to a second storage of the node during a collective communication operation of the multiple nodes.

9. The method as claimed in claim 8, wherein the node comprises a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit.

10. The method as claimed in claim 9, wherein prefetching the data comprises prefetching the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus.

11. The method as claimed in any of claims 8 to 10, further comprising: performing the collective communication operation as part of a first execution stream for the node; and prefetching the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node.

12. The method as claimed in any of claims 8 to 11, further comprising: calculating an estimated size for respective ones of multiple data items for prefetching by the node; and prefetching selected ones of the multiple data items whose aggregate size is less than a preselected threshold value.

13. The method as claimed in any of claims 8 to 12, further comprising: receiving a prefetch trigger instruction from a host node of the distributed system, wherein the prefetch trigger instruction is configured to trigger the prefetch of the data by the node.

14. A non-transitory computer readable medium comprising program instructions for causing a node to perform a method in a distributed system configured for neural network inference, wherein the distributed system comprises multiple nodes communicatively interconnected with one another for inter-node collective communication operations, the method comprising: prefetching data to a second storage of the node during a collective communication operation of the multiple nodes.

15. The non-transitory computer readable medium as claimed in claim 14, wherein the node comprises a neural processing unit, NPU, wherein the second storage comprises an on-chip cache of the neural processing unit.

16. The non-transitory computer readable medium as claimed in claim 15, further comprising program instructions for causing the node to perform: prefetching the data from an off-chip memory for the NPU, wherein the off-chip memory is communicatively coupled to the second storage using a memory bus.

17. The non-transitory computer readable medium as claimed in any of claims 14 to 16, further comprising program instructions for causing the node to perform: performing the collective communication operation as part of a first execution stream for the node; and prefetching the data as part of a second execution stream for the node, wherein the first execution stream and the second execution stream are executed in parallel by the node.

18. The non-transitory computer readable medium as claimed in any of claims 14 to 17, further comprising program instructions for causing the node to perform: calculating an estimated size for respective ones of multiple data items for prefetching by the node; and prefetching selected ones of the multiple data items whose aggregate size is less than a preselected threshold value.
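
Illustrative note (not part of the claims): claims 5, 12 and 18 recite estimating a size for each candidate data item and prefetching selected items whose aggregate size is less than a preselected threshold. One possible reading can be sketched in Python as below. The sketch rests on assumptions not stated in the claims: the size is estimated as the number of elements multiplied by the element width in bytes, candidates are visited in a given priority order, and a simple greedy rule skips any item that would take the running total to or past the threshold. The names estimate_size_bytes and select_for_prefetch are hypothetical.

```python
from math import prod


def estimate_size_bytes(shape, dtype_bytes):
    """Assumed size estimate: element count times bytes per element."""
    return prod(shape) * dtype_bytes


def select_for_prefetch(items, threshold_bytes):
    """Greedily pick items whose aggregate size stays below the threshold.

    items: list of (name, shape, dtype_bytes) tuples in priority order
    (the ordering itself is an assumption of this sketch).
    Returns the selected names and their aggregate estimated size.
    """
    selected, total = [], 0
    for name, shape, dtype_bytes in items:
        size = estimate_size_bytes(shape, dtype_bytes)
        if total + size < threshold_bytes:   # strictly less than the threshold
            selected.append(name)
            total += size
    return selected, total


candidates = [
    ("a", (1024,), 2),        # 2048 bytes
    ("b", (1024, 1024), 2),   # 2097152 bytes, too large for the budget below
    ("c", (512,), 4),         # 2048 bytes
]
chosen, chosen_bytes = select_for_prefetch(candidates, threshold_bytes=8192)
```

With an assumed 8192-byte threshold, items "a" and "c" are chosen for a total of 4096 bytes and "b" is skipped. In practice the threshold might be tied to the capacity of the on-chip cache, although the claims leave its value open.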