Inference method, computing cluster and computing device
By building a cache consistency protocol and shared memory devices in the computing cluster, local cache reuse and dynamic control management are achieved, solving the problem of historical key-value cache migration delay in distributed large language model inference tasks, and improving overall performance and system stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XFUSION DIGITAL TECH CO LTD
- Filing Date
- 2026-02-14
- Publication Date
- 2026-06-23
AI Technical Summary
In large language model inference tasks deployed in a distributed manner, the migration of historical key-value caches across nodes leads to high latency and affects overall performance.
By constructing a computing cluster framework, utilizing cache consistency protocols and shared memory devices, local cache reuse and on-demand retrieval of historical key-value caches are achieved, control nodes are dynamically managed, and cache access paths are optimized.
This significantly reduces cache access latency, improves overall inference performance, avoids interruptions caused by single points of failure, and keeps the system running efficiently.
Figure CN122264136A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of large language model technology, and in particular to an inference method, a computing cluster, and a computing device.
Background Technology
[0002] In interactive applications such as intelligent question answering, code generation, and dialogue systems, large language model (LLM) inference needs to balance high throughput and low latency. Especially when handling long contexts or multi-turn dialogues, recalculating the key and value vectors of all historical tokens every time a new token is generated would result in a large amount of redundant computation. To address this, the industry commonly employs a key-value cache (KV Cache) mechanism—caching the key / value vectors of processed tokens during inference, allowing them to be directly reused in subsequent token generation, thereby significantly improving inference efficiency.
[0003] However, in actual distributed deployments, the KV Cache is usually stored locally on the computing node that first processes the inference request. When subsequent related inference requests are scheduled to other target nodes, the target nodes have neither a local copy of the original KV Cache nor direct access to it, so the system is often forced to migrate the complete KV Cache over the network. This transfer introduces a delay of hundreds of milliseconds or more, which seriously affects the overall performance of the inference task.
Summary of the Invention
[0004] This application provides an inference method, a computing cluster, and a computing device. By constructing a computing cluster framework, and by reusing local cache and retrieving missing historical key-value cache on demand based on the computing node currently in control, efficient reuse of historical key-value cache can be achieved, thereby significantly reducing cache access latency and improving overall inference performance.
[0005] To achieve the above objectives, the embodiments of this application adopt the following technical solutions: In a first aspect, embodiments of this application provide an inference method applied to a first computing node among multiple computing nodes. The multiple computing nodes are connected to a shared memory device based on a cache consistency protocol. The method includes: in response to an inference request, obtaining a first historical key-value cache that is cached locally by the first computing node, the first historical key-value cache being a historical key-value cache required to execute the inference request; if a second historical key-value cache needs to be obtained, sending a first read request to a control node, wherein the control node is the computing node that manages access rights to the complete historical key-value cache corresponding to the inference request, and the second historical key-value cache is a historical key-value cache that is required for processing the inference request but is not present in the first computing node; receiving the second historical key-value cache, which the control node reads from the shared memory device and returns in response to the first read request; and performing an inference task based on the first and second historical key-value caches.
[0006] Based on this scheme, after receiving an inference request, the first computing node looks up the historical key-value cache corresponding to the inference request in its local storage. If only a portion (the first historical key-value cache) is stored locally, it interacts with the control node that manages the historical key-value cache corresponding to the inference request to obtain the missing portion (the second historical key-value cache). The inference task is then executed based on these two portions. In this way, by reusing the local cache and retrieving the missing historical key-value cache on demand through the computing node that currently holds control, the historical key-value cache is reused efficiently, which significantly reduces cache access latency and improves overall inference performance. In addition, because control of the historical key-value cache is not fixed to one node, a single point of failure does not interrupt the entire inference process, which protects overall performance.
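The split between a locally cached portion and a portion fetched through the control node can be illustrated with a minimal sketch. All names here (`SharedMemory`, `ControlNode`, `ComputeNode`, `gather_history`) are hypothetical, the shared memory device is modeled as an in-process dictionary, and the cache is assumed to be divided into numbered blocks; none of this is taken from the application itself.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Stand-in for the shared memory device: request id -> {block id: KV block}."""
    blocks: dict = field(default_factory=dict)


@dataclass
class ControlNode:
    """Hypothetical control node that serves read requests from shared memory."""
    shared: SharedMemory

    def read(self, request_id, block_ids):
        stored = self.shared.blocks[request_id]
        return {b: stored[b] for b in block_ids}


@dataclass
class ComputeNode:
    """Hypothetical first computing node with a local KV cache."""
    local_cache: dict = field(default_factory=dict)  # (request id, block id) -> KV block

    def gather_history(self, request_id, needed_blocks, control):
        # First historical KV cache: required blocks that are already cached locally.
        first = {
            b: self.local_cache[(request_id, b)]
            for b in needed_blocks
            if (request_id, b) in self.local_cache
        }
        # Second historical KV cache: required blocks that are missing locally.
        missing = [b for b in needed_blocks if b not in first]
        # Only send a read request when something is actually missing.
        second = control.read(request_id, missing) if missing else {}
        return first, second
```

In this sketch the read request carries only the identifiers of the missing blocks, so the amount of data transferred shrinks as the local hit rate grows.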
[0007] In one possible implementation, the multiple computing nodes establish communication connections with a management node, and sending the first read request to the control node includes: sending an addressing request to the management node, the addressing request being used to obtain the control node of the second historical key-value cache corresponding to the inference request; receiving an addressing response from the management node corresponding to the addressing request, the addressing response including the control node of the second historical key-value cache; and sending the first read request to the control node based on the address of the control node, so that the control node obtains the second historical key-value cache based on the first read request.
[0008] Based on this scheme, when the first computing node processes an inference request, it sends an addressing request to the management node, which lets the management node accurately map the historical key-value cache of the inference request to its control node. The first computing node then obtains the corresponding historical key-value cache by interacting with that control node. Because the address of the control node is obtained first and the interaction follows, blind probing is avoided and a reliable path is provided for efficiently obtaining the historical key-value cache. This effectively reduces cache access latency and improves overall inference performance.
[0009] One possible approach also includes: generating a new key-value cache during the inference task based on the first historical key-value cache and the second historical key-value cache; writing the new key-value cache to a shared memory device in the form of incremental storage, so that the control node can associate the new key-value cache with the managed second historical key-value cache; and updating the access counts of the second historical key-value cache and the new key-value cache after the inference task corresponding to the inference request is completed.
[0010] Based on this solution, by maintaining a historical key-value cache corresponding to inference requests in a shared memory device, and asynchronously writing newly generated key-value caches to this storage area in an incremental manner during the inference process, the repeated writing of the complete cache in each inference step is effectively avoided, thereby effectively improving the overall inference performance.
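As a rough illustration of incremental storage, the sketch below appends only the newly generated entries to a per-request log and updates an access count when the inference task finishes. `SharedKVStore` and its methods are hypothetical names, and the in-memory dictionaries merely stand in for the shared memory device.

```python
class SharedKVStore:
    """Hypothetical stand-in for the KV region of the shared memory device."""

    def __init__(self):
        self.entries = {}       # request id -> list of per-token KV entries
        self.access_count = {}  # request id -> number of completed inference tasks

    def append(self, request_id, new_entries):
        # Incremental storage: write only the newly generated entries,
        # leaving the existing historical entries untouched.
        log = self.entries.setdefault(request_id, [])
        log.extend(new_entries)
        return len(new_entries)

    def finish(self, request_id):
        # Update the access count once the inference task has completed.
        self.access_count[request_id] = self.access_count.get(request_id, 0) + 1
        return self.access_count[request_id]
```

The number of entries written per decoding step stays constant, whereas rewriting the complete cache would grow with the sequence length.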
[0011] In a second aspect, embodiments of this application provide another inference method applied to a management node, which establishes communication connections with multiple computing nodes. The method includes: in response to an inference request, obtaining metadata corresponding to the inference request, the metadata being used to characterize the context information of the inference request and the address information at which the corresponding historical key-value cache is stored; based on the metadata, determining a first computing node from among the multiple computing nodes to execute the inference request; if the first computing node is different from the control node, obtaining the weight parameters corresponding to each computing node when processing the inference request, the weight parameters being used to characterize the overall scheduling performance when processing the inference request, wherein the control node is the computing node that currently holds control of the second historical key-value cache; and, if the result of comparing the weight parameters with a preset threshold satisfies the condition for changing the access right, determining a new control node and enabling the new control node to interact, through a cache consistency protocol, with the computing node that currently holds the access right; wherein the access right is the management authority of the control node over the second historical key-value cache.
[0012] Based on this scheme, the management node dynamically evaluates the data using weight parameters and proactively triggers access right migration when necessary, thereby reducing cross-node cache access overhead. At the same time, access right migration is automatically completed based on the cache consistency protocol, ensuring data consistency and avoiding redundant copying.
[0013] One possible implementation also includes updating the address of the computing node where the control right corresponding to the historical key-value cache resides, so that the first computing node can obtain the new control node by interacting with the management node.
[0014] Based on this solution, by dynamically updating the control node address in the metadata, the management node can always accurately locate the current access right holder of the historical key-value cache. This provides the correct addressing basis for subsequent related requests and avoids access failures or redundant migrations caused by control drift, thereby improving the overall consistency of the system.
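The addressing table and the access right migration can be sketched together. The text does not define how the weight parameters are compared with the preset threshold, so the rule below (migrate when the executing node's weight exceeds the current holder's weight by more than the threshold) is purely an assumption, as are the names `ControlDirectory`, `resolve`, and `maybe_migrate`.

```python
class ControlDirectory:
    """Hypothetical management-node table: context id -> address of the control node."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.control_of = {}

    def resolve(self, context_id):
        # Addressing request: return the current control node, or None if unknown.
        return self.control_of.get(context_id)

    def maybe_migrate(self, context_id, first_node, weights):
        """Return True if control was assigned or moved to first_node."""
        current = self.control_of.get(context_id)
        if current is None:
            # No holder yet: the executing node becomes the control node.
            self.control_of[context_id] = first_node
            return True
        if current == first_node:
            return False
        # Assumed rule: migrate only when the gain exceeds the preset threshold.
        if weights[first_node] - weights[current] > self.threshold:
            # Update the stored address so later addressing requests see the new holder.
            self.control_of[context_id] = first_node
            return True
        return False
```

Keeping the threshold above zero gives the decision some hysteresis, so control does not bounce between nodes whose weights are nearly equal.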
[0015] In another possible implementation, in response to an inference request, the metadata corresponding to the inference request is obtained, including: performing word frequency encoding on the text in the inference request to generate a word frequency vector; performing semantic encoding on the word frequency vector to generate semantic features corresponding to the inference request; and obtaining the metadata corresponding to the inference request based on the semantic features.
[0016] Based on this scheme, the original text is mapped into structured semantic features by combining word frequency and semantic encoding, thereby efficiently associating historical reasoning context. This enables the caching and reuse of similar semantic requests, significantly improves the generalization ability and hit rate of metadata retrieval, and effectively enhances the overall reasoning performance.
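A minimal sketch of the two encoding steps follows. The word frequency step is a plain bag-of-words count over a fixed vocabulary. The semantic encoder is not specified in the text, so it is replaced here by simple L2 normalization, which is only a placeholder and not a real semantic model. The function names and the vocabulary are hypothetical.

```python
import math
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the request text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def encode(vector):
    """Placeholder for semantic encoding: L2-normalize the frequency vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else [0.0 for _ in vector]


def similarity(a, b):
    """Dot product of two encoded vectors (cosine similarity after normalization)."""
    return sum(x * y for x, y in zip(a, b))
```

With such features, metadata for a new request could be retrieved by picking the stored context with the highest similarity, which is how similar requests end up sharing cached history.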
[0017] In another possible implementation, a first computing node for executing the inference request is determined from multiple computing nodes based on metadata, including: determining the matching degree between semantic features and the context information of each historical inference request in the metadata; determining the affinity score of each computing node associated with the inference request based on each matching degree and a preset weight; wherein the affinity score is used to quantify the degree of adaptation between the computing node and the inference request; and determining the first computing node for executing the inference request from each computing node associated with the metadata based on each affinity score.
[0018] Based on this scheme, by using affinity scheduling driven by semantic similarity, inference requests are preferentially allocated to computing nodes that cache semantically related historical contexts, which significantly improves the hit rate of historical key-value cache and avoids redundant calculations. In this way, while ensuring the efficient reuse of historical key-value cache, the overall load balancing, response latency and resource utilization of the system are also taken into account.
[0019] In another possible implementation, based on metadata, a first computing node for executing the inference request is determined from multiple computing nodes, including: determining the matching degree between semantic features and the context information of each historical inference request in the metadata; determining the affinity score of the computing node associated with the inference request based on the matching degree and a preset weight; wherein the affinity score is used to quantify the degree of fit between the computing node associated with the historical inference request and the inference request; setting the affinity score to a default value for computing nodes not associated with inference requests among the multiple computing nodes; obtaining the load score and access latency score of each computing node in the multiple computing nodes when processing the inference request; determining multiple candidate computing nodes from the multiple computing nodes based on the filtering conditions preset by the service level agreement of the inference request; determining the comprehensive score of the candidate computing nodes based on the affinity score, the load score, and the access latency score of the candidate computing nodes; and determining the first computing node for executing the inference request based on the multiple comprehensive scores.
[0020] Based on this scheme, a unified comprehensive scoring mechanism is established by integrating semantic affinity, real-time load, access latency scores, and the filtering conditions preset by the service level agreement (SLA). This comprehensive score is used to select the node with the best overall performance from all computing nodes to execute the inference task. At the same time, nodes without historical records are assigned a default affinity score, which ensures that the scheduling scope covers all available resources and prevents high-quality nodes from being excluded merely because they lack a historical cache. In this way, efficient reuse of historical key-value caches is guaranteed while overall system load balancing, response latency, and resource utilization are also taken into account.
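The selection steps described above can be sketched as follows. The weighted sum, the weight values, the default affinity, and the use of an expected latency bound as the SLA filter are all assumptions made for illustration; the text does not give the scoring formula. Every score is assumed to be normalized so that a higher value is better.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    name: str
    affinity: Optional[float]   # None when the node has no related history
    load_score: float           # higher means more spare capacity (assumed)
    latency_score: float        # higher means lower access latency (assumed)
    expected_latency_ms: float  # used only for the SLA filter (assumed)


def pick_node(nodes, sla_max_latency_ms, weights=(0.5, 0.3, 0.2), default_affinity=0.1):
    """Filter by the SLA condition, then return the name of the best-scoring node."""
    candidates = [n for n in nodes if n.expected_latency_ms <= sla_max_latency_ms]
    if not candidates:
        return None

    w_aff, w_load, w_lat = weights

    def total(n):
        # Nodes without history fall back to the default affinity score.
        affinity = default_affinity if n.affinity is None else n.affinity
        return w_aff * affinity + w_load * n.load_score + w_lat * n.latency_score

    return max(candidates, key=total).name
```

Because nodes without history still receive a nonzero affinity, a lightly loaded node can win when the node holding the cache is overloaded or filtered out by the SLA condition.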
[0021] In another possible implementation, obtaining the load score of each of the multiple computing nodes when processing the inference request includes: obtaining the sequence length corresponding to the inference request; determining the predicted computation amount for processing the inference request based on the sequence length, the model parameters of the target inference model, and the numerical precision of the target inference model; obtaining the real-time load parameters of each of the multiple computing nodes; and obtaining the load score of each of the multiple computing nodes based on the predicted computation amount and the real-time load parameters of each computing node.
[0022] Based on this scheme, a request-aware dynamic load score is generated by combining the computational requirements of the inference request (such as sequence length, model size, and numerical precision) with the real-time load status of the nodes. This lets scheduling decisions predict the actual pressure that executing the current inference request would place on each node. In this way, the overload or idle resources that result from relying solely on static or historical load figures are avoided, and a basis is provided for the subsequent selection of the globally optimal node.
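A request-aware load score might be sketched as below. The computation estimate uses the common rule of thumb of roughly 2 floating-point operations per parameter per token for a forward pass. Scaling that estimate by the bytes per parameter, relative to 16-bit, is an additional assumption standing in for the precision term, and the mapping from estimated seconds to a score in the range 0 to 1 is likewise hypothetical.

```python
def predicted_compute(seq_len, n_params, bytes_per_param=2):
    """Rough forward-pass cost estimate, scaled by precision relative to 16-bit."""
    return 2.0 * n_params * seq_len * (bytes_per_param / 2.0)


def load_score(predicted, peak_flops, utilization):
    """Score in [0, 1]; higher means the node can absorb the request more easily."""
    free = peak_flops * (1.0 - utilization)  # throughput currently unused
    if free <= 0:
        return 0.0
    estimated_seconds = predicted / free
    return 1.0 / (1.0 + estimated_seconds)
```

The same request therefore yields a lower score on a busy node than on an idle one, and a longer sequence lowers the score on every node.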
[0023] In another possible implementation, the multiple computing nodes are divided into a pre-filled resource pool and a decoding resource pool. The computing nodes in the pre-filled resource pool are used to perform pre-filling computations for inference requests and generate the corresponding historical key-value caches. The computing nodes in the decoding resource pool are used to perform autoregressive decoding based on the historical key-value caches to generate new tokens and the corresponding new key-value caches. The method further includes: generating an adaptation score for each computing node based on its running status information when processing inference requests; determining the resource type corresponding to each computing node based on the adaptation score and a preset resource pool allocation threshold; and dynamically migrating each computing node based on its corresponding resource type to adjust the number of computing nodes in the pre-filled and decoding resource pools.
[0024] Based on this scheme, computing nodes are assigned roles in real time through adaptation scores driven by multi-dimensional operational states. This allows for dynamic adjustment of the resource pool size. For example, in long-context scenarios, the pre-filled resource pool is automatically expanded to accelerate context encoding, while in high-throughput dialogue scenarios, the decoding resource pool is enhanced to improve token generation efficiency. This effectively matches the characteristics of computing resource supply and task demand, avoiding resource mismatch caused by static partitioning and significantly improving hardware utilization.
[0025] In a third aspect, embodiments of this application provide a computing cluster, comprising: multiple computing nodes, which are divided into a pre-filled resource pool and a decoding resource pool; a management node, configured to determine a first computing node from the multiple computing nodes to execute an inference request, and to trigger a migration of access rights for the historical key-value cache if the first computing node is different from the computing node that holds the access rights to the historical key-value cache corresponding to the inference request; and a shared memory device, to which the multiple computing nodes are connected based on a cache consistency protocol and which is used to store the historical key-value cache corresponding to the inference request.
[0026] In a fourth aspect, embodiments of this application also provide a computing device, including a processor and a memory; the processor and the memory are coupled; the memory is used to store program instructions; and the processor is used to execute the program instructions to perform the method described in either the first or second aspect above.
[0027] In a fifth aspect, embodiments of this application provide a chip for performing the method described in either the first or second aspect above.
[0028] In a sixth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions which, when executed by a computer, implement the method described in either the first or second aspect.
[0029] In a seventh aspect, embodiments of this application provide a program product including a computer program that, when executed by a processor, implements the method described in either the first or second aspect.
Attached Figure Description
[0030] Figure 1 is a schematic diagram of a distributed inference architecture for an inference method provided in an embodiment of this application;
Figure 2 is a schematic diagram of a computing cluster framework provided in an embodiment of this application;
Figure 3 is a schematic diagram of another computing cluster framework provided in an embodiment of this application;
Figure 4 is a schematic diagram of another computing cluster framework provided in an embodiment of this application;
Figure 5 is a first flowchart of an inference method provided in an embodiment of this application;
Figure 6 is a schematic diagram of an interface for obtaining an inference request provided in an embodiment of this application;
Figure 7 is an interaction diagram of an inference method provided in an embodiment of this application;
Figure 8 is a schematic diagram of a process for obtaining metadata provided in an embodiment of this application;
Figure 9A is a schematic flowchart of a method for determining a first computing node provided in an embodiment of this application;
Figure 9B is a schematic flowchart of another method for determining the first computing node provided in an embodiment of this application;
Figure 10 is a flowchart of a method for determining a control node provided in an embodiment of this application;
Figure 11 is a second flowchart of an inference method provided in an embodiment of this application;
Figure 12 is an interaction diagram of another inference method provided in an embodiment of this application;
Figure 13 is a third flowchart of an inference method provided in an embodiment of this application;
Figure 14 is an interaction diagram of another inference method provided in an embodiment of this application;
Figure 15 is a schematic diagram of the complete process of an inference method provided in an embodiment of this application;
Figure 16 is a fourth flowchart of an inference method provided in an embodiment of this application;
Figure 17 is a fifth flowchart of an inference method provided in an embodiment of this application;
Figure 18 is a schematic diagram of a computing device provided in an embodiment of this application.
Detailed Implementation
[0031] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. To facilitate a clear description of the technical solutions of the embodiments of this application, the use of terms such as "first," "second," etc., in the embodiments of this application is for illustrative purposes and to distinguish the objects being described. There is no particular order between them, nor does it indicate a specific limitation on the number of devices in the embodiments of this application, and they do not constitute any limitation on the embodiments of this application.
[0032] To enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of this application.
[0033] It should be noted that many specific details are set forth in the following description in order to provide a full understanding of this application. However, this application may also be implemented in other ways different from those described herein. Therefore, the scope of protection of this application is not limited to the specific embodiments disclosed below.
[0034] The following explanations of the technical terms mentioned in the embodiments of this application are provided to facilitate understanding by those skilled in the art.
[0035] Large language models (LLMs) are neural network models based on deep learning techniques with a large number of parameters (usually hundreds of millions to hundreds of billions or even more). By pre-training on massive amounts of text data, they have the ability to understand and generate natural language and can be applied to various natural language processing and target reasoning tasks such as text generation, question answering, translation, and summarization.
[0036] Key-value cache (KV cache) refers to a mechanism within a large language model where each input token generates a corresponding key and value vector during the attention process. To avoid repeatedly calculating the attention results of historical tokens when generating each new token, the large language model caches the keys and values of already processed tokens, forming a key-value cache.
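The caching idea can be shown with a toy single-head attention step. This is a simplified sketch, not the mechanism of any particular model: the key, value, and query vectors are passed in directly rather than being produced by learned projections, and `attend` and `KVCache` are hypothetical names.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


class KVCache:
    """Keeps the keys and values of processed tokens so they are never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are added; history is reused as is.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)
```

Each call to `step` adds one key and one value, so the work per new token is a single pass over the cached entries instead of recomputing keys and values for the whole history.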
[0037] Compute Express Link (CXL) is a high-performance, low-latency open interconnect technology primarily used to improve communication efficiency between processors and components such as accelerators and memory expansion devices.
[0038] Shared memory devices, also known as CXL shared memory devices, provide a memory sharing mechanism based on the Compute Express Link (CXL) protocol that supports efficient, low-latency, cache-consistent memory sharing across nodes.
[0039] The embodiments of this application will now be described with reference to the accompanying drawings.
[0040] In large-scale model inference scenarios, due to the sheer size of the model, a single computing node often struggles to handle its computational and storage demands. Therefore, a distributed inference architecture is commonly employed, where the large language model is broken down and deployed across a computing cluster composed of multiple computing devices. Through task decomposition and parallel processing, the overall throughput and response efficiency of the system are improved.
[0041] The distributed inference architecture is illustrated below with reference to Figure 1.
[0042] Figure 1 is a schematic diagram of a distributed inference architecture for an inference method provided in an embodiment of this application.
[0043] As shown in Figure 1, a general distributed inference architecture deploys a large language model on three computing devices (Node 1, Node 2, and Node 3). The large language model is divided into three main logical parts, each deployed on a different computing node. The computing devices exchange the necessary data through efficient network communication protocols to ensure the smooth operation of the entire inference process.
[0044] Based on this, the embodiments of this application further propose a computing cluster, which is described below by way of example with reference to Figure 2.
[0045] Figure 2 is a schematic diagram of a computing cluster framework provided in an embodiment of this application.
[0046] As shown in Figure 2, the computing cluster includes multiple computing nodes 21, a management node 22, and a shared memory device 23. The multiple computing nodes 21 are connected to the shared memory device 23 via a high-speed interconnect (e.g., CXL) and achieve consistent access to shared data based on a cache consistency protocol. The management node 22 establishes communication connections with each computing node 21 to coordinate resource allocation and scheduling decisions. In this embodiment, the multiple computing nodes 21 are divided into two functional resource pools based on the phased characteristics of the large language model (LLM) inference task: a pre-filling resource pool M1 and a decoding resource pool M2. The computing nodes in the pre-filling resource pool M1 perform the pre-filling stage of an inference request: they run one forward pass over the prompt in the inference request to generate the complete set of attention key-value pairs (the key-value cache) and write the result to the shared memory device 23 as the historical key-value cache (KV Cache). The computing nodes in the decoding resource pool M2 perform the autoregressive decoding stage: they generate new tokens one by one based on the historical key-value cache stored in the shared memory device 23 and, at each step, append the newly generated key-value cache to the shared memory device 23.
[0047] Furthermore, to adapt to dynamically changing workloads, the resource pool can be adjusted in real time, that is, the number of computing nodes in the resource pool can be dynamically adjusted based on the running status information.
[0048] Specifically, firstly, the management node 22 continuously monitors the operational status information of each compute node 21 during the processing of inference requests. This operational status information includes compute load, cache hit rate, network latency, service level agreement (SLA) compliance, energy efficiency, etc., which are used to characterize the multi-dimensional service capabilities of each compute node under the current workload.
[0049] The computational load refers to the current resource usage intensity of a compute node. Resources include computational resources, such as central processing unit (CPU) utilization, graphics processing unit (GPU) utilization, and compute unit occupancy; storage resources, such as video memory or system memory bandwidth usage and cache utilization; and communication resources, such as data transfer bandwidth with shared memory devices. This metric is used to assess the remaining capacity of a compute node to handle new inference requests. Cache hit rate refers to the proportion of the historical key-value cache (KV Cache) that can be retrieved directly from local or shared memory devices when executing an inference request. A higher hit rate means less cached data needs to be recalculated or fetched remotely, resulting in higher inference efficiency. Network latency refers to the round-trip latency between the compute node and shared memory devices or other nodes. Service level agreement (SLA) compliance indicates whether the compute node currently meets its promised response time, throughput, or priority requirements. Energy efficiency refers to the power consumption or energy efficiency ratio per unit of computational task.
[0050] Secondly, the management node 22 generates the adaptation score for each computing node based on the running status information of each computing node when processing inference requests.
[0051] Among them, the adaptation score is used to quantify the overall matching degree of each computing node in performing inference tasks under the current workload, and serves as the basis for resource pool partitioning or scheduling decisions.
[0052] For example, when executing an inference request i, the management node calculates the adaptation score for each matching compute node j.
[0053] Optionally, the adaptation score can be determined using formula (1). The formula and its symbols are not reproduced in this text; the terms it combines are defined as follows: the adaptation score of compute node j when executing inference request i; the normalized value of the computing resources of compute node j when executing inference request i, covering core resources such as CPU utilization, GPU utilization, and compute unit utilization; the predicted hit rate of the required historical KV cache when compute node j executes inference request i; the normalized value of the network latency overhead of compute node j interacting with the shared memory device while executing inference request i; the business priority weight preset for compute node j when executing inference request i, for example according to whether the request is an interactive dialogue or a batch job; the energy efficiency cost coefficient of compute node j, which reflects its power consumption per unit of computing work; and the network topology score between compute node j and the shared memory device, which can be determined from the physical connection topology (such as CXL link level or NUMA distance). These terms are combined using dynamic weights that the management node determines based on system configuration.
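Since formula (1) itself is not legible in this text, the following is only a sketch of one plausible weighted form built from the six defined terms. The signs are assumptions: spare compute capacity, predicted hit rate, priority, and topology score are taken to raise the score, while latency overhead and energy cost are taken to lower it. The dictionary keys and the function name are hypothetical.

```python
def adaptation_score(metrics, weights):
    """Assumed weighted combination of the six terms; not the application's formula."""
    return (
        weights["compute"] * (1.0 - metrics["utilization"])  # spare compute capacity
        + weights["hit"] * metrics["hit_rate"]                # predicted KV cache hit rate
        - weights["latency"] * metrics["latency"]             # normalized latency overhead
        + weights["priority"] * metrics["priority"]           # business priority weight
        - weights["energy"] * metrics["energy_cost"]          # energy efficiency cost
        + weights["topology"] * metrics["topology"]           # network topology score
    )
```

Under these assumed signs, a node whose GPU load rises sees its score fall, which agrees with the example in which an overloaded pre-filled node drops below the threshold.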
[0054] Next, based on the adaptation score and the preset resource pool allocation threshold, the resource type corresponding to each computing node is determined.
[0055] The preset resource pool allocation threshold characterizes the decision boundary for classifying compute nodes into pre-filled or decoding types. Its value is based on pre-configured or dynamically adjusted system settings for the resource requirements of pre-filled and decoding tasks. The resource type indicates whether the compute node should belong to the pre-filled or decoding resource pool to match its capabilities under the current workload. If a compute node's fit score is higher than the threshold, it is determined to be more suitable for executing computationally intensive pre-filled tasks, and its resource type is assigned as "pre-filled." If it is lower than or equal to the threshold, it is determined to be more suitable for executing high-concurrency, low-computational-intensity decoding tasks, and its resource type is assigned as "decoding."
[0056] For example, suppose that in the initial system configuration, the pre-filled resource pool M1 contains 5 compute nodes and the decoding resource pool M2 contains 5 compute nodes. If compute node j is currently located in the pre-filled resource pool M1, but its adaptation score is lower than the preset pre-filled threshold after evaluation when processing a new round of inference requests (e.g., due to excessive GPU load or decreased cache affinity), and is more in line with the characteristics of the decoding task, then the management node updates its resource type to "decoding".
[0057] Finally, based on the resource type corresponding to each computing node, each computing node is dynamically migrated to adjust the number of computing nodes in the pre-filled resource pool and the decoding resource pool.
[0058] Continuing with the example above, the management node removes compute node j from the pre-filled resource pool M1 and adds it to the decoding resource pool M2, thus completing its role switch. Accordingly, the number of nodes in the pre-filled resource pool M1 decreases to 4, and the number of nodes in the decoding resource pool M2 increases to 6.
[0059] Through such dynamic adjustments, computing resources are allocated elastically according to real-time load characteristics, so that the processing capacity of the pre-filling and decoding stages keeps matching the needs of the current inference requests.
[0060] Building on the above system architecture based on dynamic partitioning of resource pools (as shown in Figure 2), this application further provides a distributed inference execution mechanism, which is described below with reference to Figure 3 as an example.
[0061] Figure 3 is a schematic diagram of another computing cluster framework provided in an embodiment of this application.
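The threshold-based classification and pool migration described in paragraphs [0054] to [0058] can be sketched as follows. This is a minimal illustration, not the implementation of this application: the class `ResourcePools`, the functions `resource_type` and `rebalance`, the node names, the score 0.42, and the threshold 0.60 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePools:
    """Pre-filled pool (M1) and decoding pool (M2), holding node identifiers."""
    prefill: set = field(default_factory=set)
    decode: set = field(default_factory=set)


def resource_type(adaptation_score: float, threshold: float) -> str:
    """Score above the threshold -> "prefill"; lower or equal -> "decode"."""
    return "prefill" if adaptation_score > threshold else "decode"


def rebalance(pools: ResourcePools, scores: dict, threshold: float) -> None:
    """Move every re-evaluated node into the pool matching its resource type."""
    for node, score in scores.items():
        target = resource_type(score, threshold)
        pools.prefill.discard(node)
        pools.decode.discard(node)
        (pools.prefill if target == "prefill" else pools.decode).add(node)


pools = ResourcePools(
    prefill={"p1", "p2", "p3", "p4", "j"},   # M1: 5 nodes, including node j
    decode={"d1", "d2", "d3", "d4", "d5"},   # M2: 5 nodes
)
# Hypothetical values: node j is re-evaluated and falls below the threshold.
rebalance(pools, {"j": 0.42}, threshold=0.60)
print(len(pools.prefill), len(pools.decode))
```

With these assumed values, node j moves from M1 to M2, so the pool sizes change from 5 and 5 to 4 and 6, matching the example above.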
[0062] As shown in Figure 3, the computing cluster includes multiple computing nodes 21, a management node 22, and a shared memory device 23. For ease of explanation, the figure uses three computing nodes as an example: computing node 21a, computing node 21b, and computing node 21c. These computing nodes are all connected to the management node 22 via communication interfaces and can interconnect with each other through a shared switching device 24 to support efficient data interaction. In this distributed inference framework, the management node 22 selects the first computing node, i.e., the computing node that executes the inference request. Specifically, the management node 22 first receives the inference request and extracts the corresponding metadata. Then, based on this metadata, it selects one of the multiple computing nodes as the first computing node; for example, computing node 21a is selected. The first computing node retrieves the first historical key-value cache corresponding to the inference request from its local cache. The first computing node (computing node 21a) then identifies the control node through interaction with the management node 22. The control node is the computing node that manages access rights to the second historical key-value cache. The control node is dynamically adjusted: the management node decides, based on the system scheduling policy, whether to migrate the control role from the current computing node to another. This enables precise location and efficient reuse of context states, thereby avoiding redundant computations and reducing latency.
[0063] For example, assume the control node is computing node 21b. The first computing node (computing node 21a) sends a first read request to the control node (computing node 21b) to obtain the second historical key-value cache. The control node responds to this first read request, reads the required historical key-value cache from the shared memory device 23, and returns it to the first computing node. Finally, the first computing node uses the obtained first and second historical key-value caches to perform the inference task, thereby achieving a context-aware distributed inference process that reuses cached state.
[0064] In summary, within the computing cluster framework, the first computing node reuses its local cache and retrieves the missing historical key-value cache on demand via the currently accessible computing nodes, so that the historical key-value cache is reused efficiently, which reduces cache access latency and improves overall inference performance.
[0065] Furthermore, to clarify how each computing node implements its functions under the above cluster architecture, its internal module structure is described in detail below.
[0066] Figure 4 is a schematic diagram of another computing cluster framework provided in the embodiments of this application.
[0067] As shown in Figure 4, each compute node 21 not only serves as the execution unit for inference tasks but also integrates multiple collaborative functional modules. Specifically, the compute node 21 may include: a central processing unit (CPU) and / or a graphics processing unit (GPU) for performing model inference computations; a key-value cache manager responsible for generating, maintaining, and retrieving historical key-value caches, supporting key-value cache reuse and efficient autoregressive decoding; and a hot-warm data module, serving as a local storage unit that hierarchically caches frequently accessed "hot" data (such as recently active KV caches) and less frequently accessed "warm" data, thereby balancing memory efficiency and access performance.
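The read path described above (local reuse first, then a read request to the control node, which reads from the shared memory device) can be illustrated with a small sketch. All class and method names here (`SharedMemoryDevice`, `ComputeNode`, `handle_read`, `gather_kv`) and the cache keys are hypothetical; a plain dictionary stands in for the shared memory device, and access-right checks are omitted.

```python
class SharedMemoryDevice:
    """Stand-in for shared memory device 23; a dict replaces real pooled memory."""

    def __init__(self):
        self.blocks = {}


class ComputeNode:
    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        self.local_cache = {}  # historical KV cache held locally

    def handle_read(self, key):
        """Control-node role: serve a read request from the shared device."""
        return self.shared.blocks[key]

    def gather_kv(self, keys, control_node):
        """First-node role: reuse local entries and fetch the rest on demand."""
        first = {k: self.local_cache[k] for k in keys if k in self.local_cache}
        missing = [k for k in keys if k not in self.local_cache]
        second = {k: control_node.handle_read(k) for k in missing}
        return first, second


shared = SharedMemoryDevice()
shared.blocks = {"turn-1": "kv-1", "turn-2": "kv-2"}

node_a = ComputeNode("21a", shared)   # first computing node
node_b = ComputeNode("21b", shared)   # control node
node_a.local_cache["turn-1"] = "kv-1"

first_kv, second_kv = node_a.gather_kv(["turn-1", "turn-2"], node_b)
print(first_kv, second_kv)
```

In this sketch `first_kv` plays the role of the first historical key-value cache (local hit) and `second_kv` the role of the second historical key-value cache (obtained through the control node).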
[0068] Management node 22, serving as the scheduling and coordination hub of the computing cluster, comprises two core components: a cache-aware scheduler and a metadata cluster. The cache-aware scheduler is further subdivided into several functional units, primarily a request analyzer and a routing decision engine. The request analyzer is responsible for deep parsing of input inference requests, performing text feature extraction (such as term frequency encoding and semantic vector generation) and sequence feature prediction (such as sequence length and computational cost estimation), thereby constructing a semantic and load context for scheduling decisions. The routing decision engine, based on these features and combined with multi-dimensional metrics such as real-time load, access latency, affinity, and cache status of each computing node, calculates a comprehensive score for each node and selects the optimal first computing node accordingly.
[0069] Meanwhile, the metadata cluster is responsible for recording and managing metadata such as the location, status, and access statistics of all KV caches in the entire cluster. In other words, the metadata represents the context information of inference requests and stores the address information of the corresponding historical key-value caches. For example, the metadata cluster adopts a distributed architecture containing multiple metadata managers. Each metadata manager is logically bound to a compute node and is responsible for managing the metadata related to that node. Each metadata manager consists of three core modules: a metadata storage engine, a consistency synchronization module, and a fault recovery service. The metadata storage engine is based on efficient index structures such as B+ trees and supports fast key-value and range queries; it also uses a high-availability storage mechanism that redundantly backs up metadata across multiple nodes, so that a single point of failure does not affect service continuity. The consistency synchronization module coordinates metadata update operations across nodes and ensures data consistency under concurrent writes through distributed consensus protocols (such as Raft or Paxos), resolving conflicts that may arise when multiple nodes modify the same metadata item simultaneously. The fault recovery service monitors the health status of each metadata manager and underlying computing node in real time. When the control node fails, an election mechanism is automatically triggered to elect a new control node, which takes over the access rights to the historical key-value cache. After the faulty node recovers, it can quickly synchronize the missing metadata to achieve state reconstruction and system self-healing.
[0070] The shared memory device is implemented using a pooled architecture in which multiple CXL memory modules form a shared memory pool with a terabyte-scale total capacity, dedicated to storing globally accessible key-value cache (KV Cache) data. This memory pool employs a multi-controller architecture that supports concurrent access from multiple nodes and load balancing, achieving hardware-level memory resource pooling: memory is no longer bound to a single compute node but is treated as a cluster-level resource and dynamically allocated according to the actual needs of each node, which improves overall memory utilization and system elasticity. The CXL memory pool is divided into three tiers managed according to data access frequency. Hot layer: uses low-latency CXL devices dedicated to storing frequently accessed "hot" data (such as the KV cache of the current session), with access latency below 100 nanoseconds, keeping critical inference paths responsive. Warm layer: based on standard-performance CXL devices, used to store data with moderate access frequency, with access latency between 100 and 500 nanoseconds, striking a balance between performance and capacity. Cold layer: uses high-density, low-cost CXL devices for archiving infrequently accessed "cold" data, with access latency ranging from 500 nanoseconds to 1 microsecond.
[0071] The following describes in detail, with reference to the accompanying drawings, the specific implementation of the inference method of the computing cluster.
[0072] Figure 5 is a first flowchart of an inference method provided in an embodiment of this application.
[0073] As shown in Figure 5, the inference method includes the following steps: S1: In response to the user's input operation, the terminal device obtains the inference request.
[0074] Figure 6 is a schematic diagram of an interface for obtaining an inference request provided in an embodiment of this application.
[0075] As shown in (a) of Figure 6, in response to the user's input operation for opening the inference interface in browser page 1 (such as entering "locallhost / ........html"), the terminal device displays the first interface 6 shown in (b) of Figure 6. The first interface 6 includes an input box 61 for the user to input an inference request, i.e., to submit a target inference task.
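The latency bands of the three tiers can be written down as a small lookup. This is only an illustrative sketch: the function name `tier_for_latency` is hypothetical, and because the text lets the warm and cold bands meet at 500 nanoseconds, assigning exactly 500 ns to the warm layer is an assumption.

```python
def tier_for_latency(latency_ns: float) -> str:
    """Map a device's access latency (in nanoseconds) to a memory pool tier."""
    if latency_ns < 0:
        raise ValueError("latency must be non-negative")
    if latency_ns < 100:
        return "hot"     # below 100 ns
    if latency_ns <= 500:
        return "warm"    # 100 ns to 500 ns (boundary handling is an assumption)
    if latency_ns <= 1000:
        return "cold"    # 500 ns to 1 microsecond
    raise ValueError("latency exceeds the 1 microsecond bound of the cold layer")


for sample in (80, 250, 900):
    print(sample, tier_for_latency(sample))
```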
[0076] For example, the terminal device obtains inference request A, which the user enters in the input box 61 of the first interface 6: "What is the reason why Einstein won the Nobel Prize?"
[0077] S2: The terminal device sends an inference request to the management node through the communication interface.
[0078] Continuing with the example above, the terminal device sends inference request A to the management node through the communication interface (API server).
[0079] S3: The management node receives the inference request and, based on the inference request, obtains the corresponding metadata.
[0080] Figure 7 is an interaction diagram of an inference method provided in an embodiment of this application.
[0081] As shown in Figure 7, the management node includes a cache-aware scheduler, which receives inference requests, extracts semantic features, evaluates the status of computing nodes, and performs scheduling based on multi-dimensional information such as semantic matching, load pressure, and access latency, so as to achieve efficient reuse of historical key-value caches and low-latency inference execution. The inference method includes the following steps: S1001: The cache-aware scheduler of the management node receives the inference request.
[0082] Optionally, the cache-aware scheduler includes a request analyzer and a routing decision engine. In step S1001, the request analyzer receives inference request A.
[0083] Upon receiving an inference request, the management node retrieves the corresponding metadata; this process is described below with reference to Figure 8 as an example.
[0084] Figure 8 is a schematic diagram of a process for obtaining metadata provided in an embodiment of this application.
[0085] As shown in Figure 8, step S3 includes steps S31-S33.
[0086] S31: The management node performs word frequency encoding on the text in the inference request to generate a word frequency vector.
[0087] Still referring to Figure 8, step S31 includes steps S311-S312.
[0088] S311: The management node performs word segmentation on the text in the inference request to obtain a word sequence.
[0089] Still referring to Figure 7, step S311 includes step S1002.
[0090] S1002: The cache-aware scheduler of the management node performs word segmentation on the text in the inference request to obtain a word sequence.
[0091] A lexical unit, as the basic semantic unit for language models to process text, can be a complete word, a fragment of a word, or a specific symbol. Optionally, the management node can segment the text in the inference request using a sub-word algorithm (such as Byte-Pair Encoding (BPE) or SentencePiece) to obtain multiple lexical units. This can effectively handle out-of-vocabulary words, spelling variations, and multilingual mixed inputs within a limited vocabulary size, ensuring semantic integrity and computational efficiency.
[0092] Continuing with the example above, the management node segments the text of inference request A using a sub-word algorithm, resulting in the word sequence ["Einstein", "obtained", "Nobel Prize", "of", "reason", "is", "what"].
[0093] S312: The management node performs frequency statistics on the word sequence and generates word frequency vectors.
[0094] Still referring to Figure 7, step S312 includes step S1003.
[0095] S1003: The cache-aware scheduler of the management node performs frequency statistics on the word sequence and generates word frequency vectors.
[0096] Frequency statistics refer to counting the number of times each unique word appears in the word sequence obtained after word segmentation, thereby constructing a sparse or dense vector representation in which each word is a dimension and its frequency is the value.
[0097] Continuing with the example above, the word sequence is ["Einstein", "obtained", "Nobel Prize", "of", "reason", "is", "what"]. Among them, the content words (such as "Einstein", "Nobel Prize", "reason") usually have higher semantic weights, while the function words (such as "of", "is"), although having a frequency of 1, may be eliminated or downweighted by the stop word filtering mechanism in subsequent processing. Based on this, a specific word frequency vector V is constructed.
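As a minimal sketch of steps S311 to S312, the word sequence of the example can be turned into a word frequency vector with Python's `collections.Counter`. The stop-word set and the choice to order dimensions alphabetically are assumptions made for illustration; the text does not specify either.

```python
from collections import Counter

# Word sequence from the example above.
tokens = ["Einstein", "obtained", "Nobel Prize", "of", "reason", "is", "what"]

# Hypothetical stop-word list; the actual filtering mechanism is not specified.
STOP_WORDS = frozenset({"of", "is"})


def word_frequency_vector(tokens, stop_words=frozenset()):
    """Count each unique token and return (vocabulary, frequency vector)."""
    counts = Counter(t for t in tokens if t not in stop_words)
    vocabulary = sorted(counts)          # fixed dimension order (an assumption)
    return vocabulary, [counts[word] for word in vocabulary]


vocabulary, V = word_frequency_vector(tokens, STOP_WORDS)
print(vocabulary)
print(V)
```

Every remaining word occurs once in this request, so each dimension of V is 1; longer requests with repeated words produce larger counts.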
[0098] S32: The management node performs semantic encoding on the word frequency vector to generate semantic features corresponding to the reasoning request.
[0099] Still referring to Figure 7, step S32 includes step S1004.
[0100] S1004: The cache-aware scheduler of the management node performs semantic encoding on the word frequency vector to generate semantic features corresponding to the inference request.
[0101] Semantic features, also known as semantic hash fingerprints, are unique identifiers generated by structurally encoding the text content of a reasoning request. They represent the core semantic intent of the request. Semantic encoding can be flexibly implemented using either a model-driven approach or a hash encoding approach, depending on the system's requirements for accuracy, efficiency, and resource consumption. The following provides illustrative examples of two typical implementation methods: The first type is model-based semantic encoding.
[0102] Specifically, the management node feeds the word frequency vector into the first model and obtains from the first model the semantic features corresponding to the inference request.
[0103] The first model can be a lightweight neural network model that has been pre-trained with historical word frequency vectors corresponding to a large number of historical reasoning requests, such as BERT or other pre-trained models.
[0104] It should be noted that the above-described method of generating semantic features using the model is not limited to using word frequency vectors as the sole input. In practical applications, the input to the first model can take various forms, such as inference requests, and is not limited to a single form here.
[0105] The second method is semantic encoding based on hashing.
[0106] Specifically, the management node applies a hash algorithm to the word frequency vector to generate the semantic features corresponding to the inference request. For example, the word frequency vector is first reduced in dimension by random projection (e.g., a linear transformation with a sparse random matrix) and then compressed into a binary code using a sign function or a locality-sensitive hashing mechanism. The output is a fixed-length binary hash code, such as a 64-bit or 128-bit hash fingerprint, which is the semantic feature.
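The hash-based encoding can be sketched as a sign random projection, a common locality-sensitive hashing scheme. The sketch below is an assumption-laden illustration: the function name `sign_random_projection` is hypothetical, a dense Gaussian projection is used in place of the sparse random matrix mentioned above, and the seed is fixed so that every request is projected with the same matrix.

```python
import random


def sign_random_projection(vector, bits=64, seed=0):
    """Project `vector` onto `bits` random directions and keep only the signs."""
    rng = random.Random(seed)  # same seed -> same projection directions each call
    fingerprint = 0
    for _ in range(bits):
        direction = [rng.gauss(0.0, 1.0) for _ in vector]
        dot = sum(d * x for d, x in zip(direction, vector))
        # Append one bit per direction: 1 for a non-negative projection, else 0.
        fingerprint = (fingerprint << 1) | (1 if dot >= 0 else 0)
    return fingerprint


fp = sign_random_projection([1, 1, 1, 1, 1])
print(format(fp, "064b"))
```

Fingerprints are only comparable when the input vectors share the same dimension order, so in practice the vocabulary would have to be fixed across requests. Vectors pointing in similar directions tend to agree on most bits, which is what makes Hamming distance usable as a matching measure.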
[0107] Continuing with the example above, the management node uses either of the two methods described above to obtain the semantic feature B corresponding to the inference request.
[0108] In summary, historical dialogues themselves constitute high-value contextual prefixes, possessing natural reuse potential in multi-turn interactions. By abstracting and indexing historical dialogues through semantic features, the system can identify the semantic relationship between the current request and past dialogues, thereby directly reusing previously calculated and cached key-value states (KV Cache).
[0109] An inference request refers to one or more raw task requests received by the management node in the initial stage, awaiting inference processing. These task requests are typically initiated by users and include natural language tasks such as question answering, text generation, and summary extraction, which require the participation of a target inference model, namely a large language model (LLM).
[0110] S33: Based on semantic features, obtain the metadata corresponding to the reasoning request.
[0111] Still referring to Figure 7, step S33 includes step S1005.
[0112] S1005: The cache-aware scheduler of the management node looks up the metadata corresponding to the inference request from the metadata cluster.
[0113] The metadata represents the context information of the inference request and the address information of the corresponding historical key-value cache. After the management node obtains the metadata corresponding to the inference request, it proceeds to step S4, described below. S4: Based on the metadata, determine, from multiple compute nodes, the first compute node to execute the inference request.
[0114] Still referring to Figure 7, step S4 includes step S1006.
[0115] S1006: The cache-aware scheduler of the management node determines the first compute node to execute the inference request from multiple compute nodes based on metadata.
[0116] Optionally, the specific implementation method for determining the first compute node based on metadata can be flexibly configured to adapt to scheduling requirements in different scenarios. One approach is to select the first compute node solely based on the affinity score associated with the metadata. Another approach is to further combine the real-time load score of each compute node and the network access latency score for accessing shared memory devices to construct a multi-dimensional comprehensive score to determine the first compute node. The affinity score is used to quantify the degree of matching between the available resources of the compute node and the resource requirements of inference requests.
[0117] In one implementation, the first compute node is selected solely based on the affinity score associated with the metadata.
[0118] Figure 9A is a schematic flowchart of a method for determining the first computing node provided in an embodiment of this application.
[0119] As shown in Figure 9A, determining the first computing node includes the following steps: S41a: The management node determines the degree of matching between the semantic features and the context information of each historical inference request in the metadata.
[0120] The context information of the historical inference requests of each computing node can be derived from the set of context information of the inference requests that the node has successfully processed in the past, for example by aggregating it into cluster centers, average embedding vectors, or Top-K representative fingerprints. The matching degree can be calculated using methods such as cosine similarity, Euclidean distance, or Hamming distance (for hash-based semantic features). A higher matching degree indicates that the computing node is more likely to hold the relevant historical key-value cache.
[0121] Continuing with the example above, take the three compute nodes of the computing cluster shown in Figure 2 as an example. Based on semantic feature B and the context information of the historical inference requests of compute node 21a, the routing decision engine obtains a matching degree 21a′ for compute node 21a; based on semantic feature B and the context information of the historical inference requests of compute node 21b, it obtains a matching degree 21b′ for compute node 21b; and based on semantic feature B and the context information of the historical inference requests of compute node 21c, it obtains a matching degree 21c′ for compute node 21c.
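Two of the matching measures named above can be sketched as follows. The function names are hypothetical; converting the Hamming distance into a similarity in the range 0 to 1 (so that "higher means better match" holds for both measures) is an assumption, since the text only names the distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hamming_similarity(fp_a, fp_b, bits=64):
    """1 minus the fraction of differing bits between two integer fingerprints."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / bits


print(cosine_similarity([1, 2, 0], [2, 4, 0]))
print(hamming_similarity(0b1011, 0b1001, bits=4))
```

The first call compares two parallel vectors and the second compares two 4-bit fingerprints that differ in one bit.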
[0122] S42a: The management node determines the affinity score of each computing node associated with the inference request based on the matching degree and preset weight.
[0123] The affinity score quantifies the degree of matching between the available resources of the computing node and the resource requirements of the inference request; that is, the affinity score is a quantitative value.
[0124] Specifically, the affinity score can be obtained using formula (2): affinity score = preset weight × matching degree (formula (2)). The preset weight characterizes the hit probability of the historical key-value cache. It can be set flexibly according to actual needs, for example to 0.6, 0.4, or 1, and is not uniquely limited here.
[0125] Continuing with the example above, let's take a preset weight of 0.6 as an example. The cache-aware scheduler of the management node obtains the affinity scores of the three computing nodes based on the product of the matching degree of the computing nodes and the preset weight. That is, the affinity score of computing node 21a is C1, the affinity score of computing node 21b is C2, and the affinity score of computing node 21c is C3.
[0126] S43a: The management node determines the first computing node to execute the inference request from among the computing nodes associated with the inference request, based on each affinity score.
[0127] Optionally, a target affinity score, i.e., the highest of the multiple affinity scores, is determined, and the computing node corresponding to it is taken as the first computing node.
[0128] Continuing with the example above, suppose affinity score C1 is the highest of the three; the cache-aware scheduler of the management node then selects compute node 21a, which corresponds to affinity score C1, as the first compute node.
[0129] Figure 9B is a schematic flowchart of another method for determining the first computing node provided in an embodiment of this application.
[0130] In another implementation, as shown in Figure 9B, another method for determining the first computing node includes the following steps: S41b: The management node determines the degree of matching between the semantic features and the context information of each historical inference request in the metadata.
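Steps S42a and S43a can be sketched in a few lines: each affinity score is the product of the matching degree and the preset weight, and the node with the highest score is selected. The matching degrees below are hypothetical placeholders chosen only so that the example runs; the text does not give numeric values.

```python
PRESET_WEIGHT = 0.6  # preset weight taken from the example above

# Hypothetical matching degrees for compute nodes 21a, 21b and 21c.
matching_degree = {"21a": 0.9, "21b": 0.7, "21c": 0.4}

# Affinity score = preset weight x matching degree.
affinity = {node: PRESET_WEIGHT * m for node, m in matching_degree.items()}

# The target affinity score is the highest one; its node becomes the first node.
first_node = max(affinity, key=affinity.get)
print(first_node, affinity)
```

Because the same weight multiplies every matching degree in this variant, the ranking is decided by the matching degree alone; the weight matters once affinity is combined with other scores.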
[0131] The specific details of step S41b can be found in step S41a above, and will not be repeated here.
[0132] S42b: The management node determines the affinity score of each computing node associated with the inference request based on the matching degree and preset weight.
[0133] The details of step S42b can be found in step S42a above, and will not be repeated here.
[0134] For example, assume that the affinity scores of the three computing nodes are obtained respectively, namely, the affinity score of computing node 21a is C1', the affinity score of computing node 21b is C2', and the affinity score of computing node 21c is C3'.
[0135] It should be noted that the cluster may contain compute nodes that are not associated with the inference request. For example, assume the cluster has five compute nodes: compute node 21a, compute node 21b, compute node 21c, compute node 21d, and compute node 21e. Compute nodes 21a, 21b, and 21c are associated with the inference request, so their affinity scores can be calculated; compute nodes 21d and 21e are not associated with any related inference request, so their affinity scores are set to preset default values, as explained in step S43b below.
[0136] S43b: For compute nodes that are not associated with inference requests among multiple compute nodes, the management node sets the affinity score to the default value.
[0137] Continuing with the example above, compute nodes 21d and 21e are not associated with any related inference requests, so their affinity scores are uniformly set to the preset default values. Assume that the affinity score of compute node 21d is C4 and the affinity score of compute node 21e is C5.
[0138] It should be noted that the default value can be flexibly set according to the system configuration strategy. In one implementation, all computing nodes without historical records adopt the same default value, i.e., C4=C5. In another implementation, different default values can be assigned according to factors such as the hardware capabilities, network location or energy efficiency characteristics of each node, i.e., C4≠C5.
[0139] S44b: The management node obtains the load score and access latency score of each compute node among multiple compute nodes when processing inference requests.
[0140] Still referring to Figure 9B, step S44b includes steps S441b-S446b.
[0141] S441b: The management node obtains the sequence length of the inference request, the model parameters of the target inference model, and the precision of the target inference model.
[0142] Sequence length reflects the number of tokens in the input text, directly affecting efficiency during autoregressive inference. Model parameters refer to the structural configuration of the target inference model, including the number of network layers, hidden layer dimensions, number of attention heads, and total number of parameters. These factors collectively determine the computational complexity of a single inference operation. Model precision is used to characterize the numerical representation format (such as FP16, INT8, or FP32). Different precisions not only affect the computational cost of a single operation but also relate to the hardware's execution efficiency and throughput.
[0143] Continuing with the example above, the cache-aware scheduler of the management node obtains the sequence length A1 of inference request A, the model parameters A2 of the target inference model, and the precision A3 of the target inference model.
[0144] S442b: The management node determines the predicted computation amount required to process the inference request based on the sequence length, the model parameters of the target inference model, and the precision of the target inference model.
[0145] The predicted computation amount is a quantitative assessment of the computational resources required to perform inference for the current inference request. It is used to assess the load pressure that each computing node may face after taking on the request.
[0146] Continuing with the example above, the cache-aware scheduler of the management node determines the predicted computation amount M1 for processing the inference request based on the sequence length A1 of inference request A, the model parameters A2 of the target inference model, and the precision A3 of the target inference model.
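The text does not give the function that maps sequence length, model parameters, and precision to the predicted computation amount. One rough way to sketch it is the common rule of thumb that a dense transformer forward pass costs about 2 floating-point operations per parameter per token, scaled by a precision factor. Everything below is an assumption for illustration: the function name, the use of total parameter count alone, and the precision factors based on bytes per value relative to FP16.

```python
# Hypothetical relative cost per precision, based on bytes per value.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}


def predicted_compute(sequence_length, total_parameters, precision):
    """Rough cost estimate: 2 * parameters * tokens, scaled relative to FP16."""
    flops = 2 * total_parameters * sequence_length
    relative_cost = BYTES_PER_VALUE[precision] / BYTES_PER_VALUE["FP16"]
    return flops * relative_cost


# Hypothetical request: 128 tokens on a 7-billion-parameter model in FP16.
M1 = predicted_compute(128, 7_000_000_000, "FP16")
print(f"{M1:.3e}")
```

The estimate ignores the attention term that grows with the square of the sequence length, so it is better read as a relative ranking signal between requests than as an absolute cost.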
[0147] S443b: The management node obtains the real-time load parameters of each compute node among multiple compute nodes.
[0148] Real-time load parameters may include: resource utilization (such as GPU or CPU utilization, SM utilization), memory usage (such as used / remaining capacity of video memory or system memory), the number of inference requests currently being processed, network bandwidth usage, and temperature and power consumption status. These metrics are periodically collected by the monitoring modules built into each compute node and reported to the management node.
[0149] Continuing with the example above, the cache-aware scheduler obtains the real-time load parameters corresponding to compute nodes 21a, 21b, 21c, 21d and 21e, respectively, denoted as M21, M22, M23, M24 and M25, for subsequent load score calculation.
[0150] S444b: The management node obtains the load score of each computing node among multiple computing nodes based on the predicted computing volume and the real-time load parameters of each computing node.
[0151] The predicted computational load is considered as the resource demand intensity of the tasks to be assigned. Combined with the current resource occupancy status of each computing node, a normalized load score is calculated using a preset load assessment function. The lower the score, the more likely the node can maintain low resource pressure and high response efficiency after handling the current inference request. Conversely, the higher the score, the less suitable the node is to be assigned high-load tasks.
[0152] Following the example above, the cache-aware scheduler of the management node calculates the corresponding load scores, denoted as M21′, M22′, M23′, M24′, and M25′.
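The preset load assessment function is not specified in the text. As one possible sketch, the load score can be taken as the node's projected utilization after accepting the request, clamped to the range 0 to 1, so that a lower score means more headroom. The function name, the single utilization figure, and the capacity parameter are all hypothetical.

```python
def load_score(current_utilization, predicted_compute, node_capacity):
    """Projected utilization after taking the request, clamped to [0, 1].

    current_utilization: fraction of the node already in use (0 to 1)
    predicted_compute:   predicted computation amount of the request
    node_capacity:       amount of computation the node can absorb (same unit)
    """
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    projected = current_utilization + predicted_compute / node_capacity
    return min(1.0, max(0.0, projected))


# Hypothetical values: a lightly loaded node and a nearly saturated node.
print(load_score(0.5, 2.0, 10.0))
print(load_score(0.9, 5.0, 10.0))
```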
[0153] S445b: The management node obtains the access latency score of each compute node among multiple compute nodes when processing inference requests.
[0154] The access latency of a compute node refers to the time from when the compute node initiates a request to when it receives the response from the other party. The access latency is normalized against a preset baseline value to obtain a quantified value, denoted as the access latency score. Optionally, if the preset maximum tolerable latency (i.e., the preset baseline value) is Tmax and the actually measured latency is t, the access latency score equals 1 minus the ratio of the actual access latency to the preset maximum tolerable latency, i.e., 1 − t / Tmax.
[0155] For example, access latency may include network transmission latency, shared memory access latency (such as the latency of accessing a remote KV Cache via a CXL interconnect), and intra-node scheduling queuing latency.
[0156] Continuing with the example above, the cache-aware scheduler of the management node obtains the access latency score of each compute node, denoted as M11, M12, M13, M14 and M15.
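The optional access latency score described in step S445b (1 minus the ratio of the measured latency t to the maximum tolerable latency Tmax) can be sketched as follows. Clamping the result to [0, 1] and the millisecond values are assumptions made for illustration.

```python
def access_latency_score(t: float, t_max: float) -> float:
    """Return 1 - t / Tmax, clamped to [0, 1] (the clamping is an assumption)."""
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    return max(0.0, min(1.0, 1.0 - t / t_max))


# Hypothetical measured latencies in milliseconds for three compute nodes.
measured = {"21a": 2.0, "21b": 5.0, "21c": 12.0}
scores = {node: access_latency_score(t, t_max=10.0) for node, t in measured.items()}
```

A node whose latency exceeds Tmax scores 0 here, so it can never outrank a node that stays within the tolerable latency.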
[0157] S45b: The management node determines multiple candidate compute nodes from multiple compute nodes based on the filtering conditions pre-set in the service level agreement of the inference request.
[0158] The service level agreement (SLA) pre-defined filtering criteria are used to exclude compute nodes that do not meet the performance, reliability, or resource isolation requirements of the current inference request. For example, if the inference request is a high-priority interactive task, nodes with excessively high response latency, those in maintenance mode, or those that do not support low-latency communication will be filtered out.
[0159] Continuing with the example above, assume that after SLA filtering only computing nodes 21a, 21b, and 21c among the original five computing nodes (21a, 21b, 21c, 21d, 21e) meet the preset conditions. These three nodes are therefore identified as the candidate computing nodes for subsequent comprehensive scoring.
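The SLA filtering of step S45b can be illustrated with a minimal sketch. The `NodeStatus` fields, the latency limit, and the per-node values are hypothetical; they only mirror the example criteria named above (excessive response latency, maintenance mode, no low-latency communication).

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    response_latency_ms: float
    in_maintenance: bool
    supports_low_latency_comm: bool


def sla_filter(nodes, max_latency_ms):
    """Keep nodes that satisfy an assumed high-priority interactive SLA."""
    return [
        n.name
        for n in nodes
        if n.response_latency_ms <= max_latency_ms
        and not n.in_maintenance
        and n.supports_low_latency_comm
    ]


nodes = [
    NodeStatus("21a", 3.0, False, True),
    NodeStatus("21b", 4.5, False, True),
    NodeStatus("21c", 8.0, False, True),
    NodeStatus("21d", 2.0, True, True),     # in maintenance mode
    NodeStatus("21e", 25.0, False, False),  # too slow, no low-latency link
]
candidates = sla_filter(nodes, max_latency_ms=10.0)
```

With these assumed values, nodes 21d and 21e are excluded and nodes 21a, 21b, and 21c remain as candidates.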
[0160] S46b: The management node determines the comprehensive score of each candidate compute node based on its affinity score, load score, and access latency score.
[0161] The comprehensive score measures how well each candidate computing node fits the scheduling of the current inference request.
[0162] Optionally, the comprehensive score corresponding to each computing node can be obtained through formula (3): Formula (3). Following the example above, the cache-aware scheduler of the management node uses formula (3) to obtain the comprehensive score corresponding to each candidate computing node: comprehensive score N1 of 0.97 (corresponding to computing node 21a), comprehensive score N2 of 0.94 (corresponding to computing node 21b), and comprehensive score N3 of 0.79 (corresponding to computing node 21c).
[0163] S47b: The management node determines the first computing node to execute the inference request based on the multiple comprehensive scores.
[0164] Optionally, the management node determines a target comprehensive score from the multiple comprehensive scores, and determines the first computing node to execute the inference request based on the target comprehensive score. The target comprehensive score is the highest value among the comprehensive scores of all candidate computing nodes and identifies the node most suitable for executing the inference request.
[0165] Continuing with the example above, since comprehensive score N1 = 0.97 is higher than both comprehensive scores N2 and N3, comprehensive score N1 is determined as the target comprehensive score. The cache-aware scheduler of the management node then selects compute node 21a, which corresponds to the target comprehensive score, as the first compute node.
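The selection in step S47b amounts to taking the candidate with the highest score. The sketch below uses the three example values from the description; the function name `pick_first_node` is hypothetical, and the scores are taken as given rather than recomputed from formula (3).

```python
def pick_first_node(scores):
    """Return the candidate node whose score is the highest."""
    if not scores:
        raise ValueError("no candidate compute nodes")
    return max(scores, key=scores.get)


# Example scores N1, N2, N3 from the description, keyed by compute node.
example_scores = {"21a": 0.97, "21b": 0.94, "21c": 0.79}
first_node = pick_first_node(example_scores)
```

If two nodes tie, `max` returns the first one encountered; a real scheduler would need an explicit tie-breaking rule, which the description does not specify.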
[0166] After the first computing node is determined, the management node sends the inference request to the first computing node, which will be explained in conjunction with step S1007 below.
[0167] S1007: The cache-aware scheduler of the management node sends an inference request to the first compute node.
[0168] Continuing with the example above, the cache-aware scheduler of the management node sends inference request A to the first compute node.
[0169] S5: The first compute node responds to the inference request and obtains the hit results (partial hits: first historical key-value cache) from the local cache of the first compute node.
[0170] The hit result is used to characterize the coverage of the historical key-value cache required by the inference request in the local cache, including three cases: full hit, partial hit, or no hit. In this embodiment, partial hit is used as an example for explanation, and the other cases will be described in detail in subsequent embodiments.
[0171] Continuing with the example above, step S5 includes step S1008.
[0172] S1008: In response to the inference request, retrieve the first historical key-value cache from the local cache of the first compute node.
[0173] The first historical key-value cache is the historical key-value cache required for executing inference requests.
[0174] Continuing with the example above, the first compute node responds to inference request A by determining the first historical key-value cache A1 in the first compute node's local cache.
[0175] It should be noted that the first compute node also needs to determine, based on the hit result of the local cache, whether to retrieve a supplementary historical key-value cache from the remote cache. If the local cache hits only part of the required data, the remaining historical key-value cache corresponding to the inference request, i.e., the second historical key-value cache, can be retrieved from the remote cache. If the local cache hits all the data required for the inference request, i.e., both the first historical key-value cache and the second historical key-value cache, no remote retrieval is needed. If the local cache hits none of the data required for the inference request, the global historical key-value cache can be retrieved from the shared memory device. In this embodiment, it is determined that the second historical key-value cache needs to be retrieved.
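The three hit cases (full hit, partial hit, no hit) can be illustrated with a minimal sketch. Representing the required historical key-value cache as a list of block identifiers is an assumption made for illustration; the names `HitResult` and `classify_hit` are hypothetical.

```python
from enum import Enum


class HitResult(Enum):
    FULL = "full hit"
    PARTIAL = "partial hit"
    MISS = "no hit"


def classify_hit(required, local_cache):
    """Split the required blocks into locally held and missing ones."""
    hit = [b for b in required if b in local_cache]          # first historical KV cache
    missing = [b for b in required if b not in local_cache]  # second historical KV cache
    if not missing:
        result = HitResult.FULL
    elif hit:
        result = HitResult.PARTIAL
    else:
        result = HitResult.MISS
    return result, hit, missing


result, first_cache, second_cache = classify_hit(
    required=["blk0", "blk1", "blk2"],
    local_cache={"blk0", "blk1", "blk9"},
)
```

In this example the local cache covers two of the three required blocks, so the result is a partial hit and only `blk2` has to be fetched remotely.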
[0176] S6: When it is necessary to obtain the second historical key-value cache, the first compute node sends the first read request to the control node.
[0177] Continuing with reference to Figure 7, step S6 includes steps S1009-S1011.
[0178] S1009: The first compute node sends an addressing request to the metadata cluster of the management node.
[0179] In this embodiment, the addressing request retrieves the compute node currently holding access rights to the second historical key-value cache. This node can also be described as the node controlling the second historical key-value cache, and can be called the home node. The home node is the compute node that manages access rights to the second historical key-value cache. Through this addressing request, the first compute node can determine the home node of the second historical key-value cache from the metadata cluster and then send a first read request to it, as described in step S1010 below.
[0180] S1010: The cache-aware scheduler of the first computing node receives the addressing response corresponding to the addressing request returned by the management node, and determines the control node based on the addressing response.
[0181] The addressing response includes the address of the control node managing the second historical key-value cache, indicating which specific compute node has the authority to manage the second historical key-value cache.
[0182] Continuing with the example above, we will use compute node b as the control node.
[0183] S1011: The first computing node sends a first read request to the control node so that the control node can obtain the second historical key-value cache based on the first read request.
[0184] It should be noted that in step S1011, the first computing node needs to determine whether the control node is the same node as the first computing node. Only if they are not the same node will the first read request be sent to the control node.
[0185] It should be noted that when the control node and the first computing node are not the same node, the control node (home node) may be dynamically adjusted by enumerating a new home node to obtain the final control node. The final control node may be the same as or different from the previous one; no specific limitation is imposed here. Specific implementation examples of the enumeration process are described in subsequent embodiments.
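Steps S1009-S1011 can be sketched as a lookup followed by a same-node check. The dictionary-based metadata cluster, the `send_read_request` callback, and the node and cache names are assumptions for illustration; re-enumeration of the home node is omitted from this sketch.

```python
def fetch_second_cache(self_addr, cache_id, metadata, shared_memory, send_read_request):
    """Resolve the home node, then read locally or send a first read request."""
    home = metadata[cache_id]  # addressing request and addressing response
    if home == self_addr:
        # The first compute node is itself the home node: read directly.
        return shared_memory[cache_id]
    # Otherwise ask the control node to read the cache on our behalf.
    return send_read_request(home, cache_id)


sent = []


def fake_send(home, cache_id):
    """Stand-in for the network call to the control node."""
    sent.append((home, cache_id))
    return f"kv:{cache_id}@{home}"


metadata = {"A2": "node-b"}
shared = {"A2": "kv:A2@shared"}
remote = fetch_second_cache("node-a", "A2", metadata, shared, fake_send)
local = fetch_second_cache("node-b", "A2", metadata, shared, fake_send)
```

Only the call issued from `node-a` produces a read request; the call issued from `node-b`, which is the home node, reads from the shared memory stand-in directly.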
[0186] S7: The controlling node reads the second historical key-value cache from the shared memory device and sends the second historical key-value cache to the first compute node.
[0187] Continuing with reference to Figure 7, step S7 includes steps S1012-S1013.
[0188] S1012: The controlling node responds to the first read request by reading the second historical key-value cache from the shared memory device.
[0189] The first read request is used to retrieve the second historical key-value cache.
[0190] To ensure data access consistency and security, before reading the second historical key-value cache from the shared memory device, the controlling node can query the metadata cluster for the metadata corresponding to the second historical key-value cache. This metadata includes the address information of the corresponding historical key-value cache, i.e., the location index. The location index identifies the data storage area in which the second historical key-value cache is currently located.
[0191] For example, if the index D of the second historical key-value cache points to the active layer (low-latency CXL device), it will be read first through the high-speed path; if it points to the cold data layer, it may trigger the background prefetch or latency compensation mechanism.
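A minimal sketch of reading by location index follows. Modelling the index as a `(tier, key)` pair, the two dictionaries standing in for storage layers, and the prefetch queue are all assumptions; the description only states that a cold-layer read may trigger background prefetch or latency compensation.

```python
def read_second_cache(index, active_layer, cold_layer, prefetch_queue):
    """Read a cache entry according to the tier named in its location index."""
    tier, key = index
    if tier == "active":
        return active_layer[key]  # high-speed path
    # Cold data: serve the read and ask a background worker to promote the entry.
    prefetch_queue.append(key)
    return cold_layer[key]


active = {"D": "kv-bytes-hot"}
cold = {"E": "kv-bytes-cold"}
queue = []
hot = read_second_cache(("active", "D"), active, cold, queue)
slow = read_second_cache(("cold", "E"), active, cold, queue)
```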
[0192] S1013: The controlling node sends the second historical key-value cache to the first computing node.
[0193] The controlling node sends the successfully read second historical key-value cache to the first compute node that initiated the request via a high-speed interconnect link (such as a CXL or RDMA network). This cached data will be used by the first compute node to initialize the inference context, avoiding the recalculation of already generated attention key-value pairs, thereby significantly reducing latency and improving overall throughput efficiency.
[0194] Optionally, after receiving the historical key-value cache, the first computing node can cache it in local memory according to a strategy for reuse in subsequent requests with the same or similar semantics.
[0195] S8: The first computing node performs inference tasks based on the first historical key-value cache and the second historical key-value cache.
[0196] Continuing with reference to Figure 7, step S8 includes steps S1014-S1015.
[0197] S1014: The first computing node uses the first historical key-value cache and the second historical key-value cache to perform the corresponding inference task.
[0198] The first computing node loads the received historical key-value cache into the inference engine as the initial context state for autoregressive generation, and continues to perform forward inference based on this to generate new output tokens. By reusing the historical key-value cache, repeated attention calculations on already processed input sequences are avoided, significantly reducing computational overhead and latency.
[0199] It should be noted that the above inference task corresponds to the prefill phase. During the current inference task based on the first historical key-value cache and the second historical key-value cache, decoding then proceeds step by step, and a new key-value cache is output and stored in the shared memory device. This is explained below in conjunction with steps S1015-S1018.
[0200] S1015: During the inference task based on the first historical key-value cache and the second historical key-value cache, the first computing node generates a new key-value cache.
[0201] In step S1015, the first computing node merges the first historical key-value cache with the second historical key-value cache to form a complete global key-value cache. This global cache is then used as the Key and Value input in the attention mechanism. Combined with the Query vector of the token to be predicted, the node performs forward computation of the Transformer decoder, thereby outputting one or more new tokens. For each newly output token, the first computing node synchronously generates corresponding Key and Value vectors and organizes these vectors into a new key-value cache. This new key-value cache represents the expanded context state of the current inference step. Its length is equal to the number of newly added tokens, and it has the same dimensions and format as the historical key-value cache for subsequent concatenation or unified management.
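The merge-then-decode flow of step S1015 can be illustrated with a toy single-head attention in plain Python. This is a sketch under strong assumptions: each cache entry is a `(key, value)` pair of small vectors, the first cache is placed before the second, and the new token's Key and Value are taken to be its hidden state instead of learned projections.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


def decode(first_cache, second_cache, hidden_states):
    """Attend over the merged cache and collect K/V for each new token."""
    global_cache = first_cache + second_cache  # complete global key-value cache
    new_cache, outputs = [], []
    for h in hidden_states:
        new_cache.append((h, h))  # toy stand-in for learned K/V projections
        keys, values = zip(*(global_cache + new_cache))
        outputs.append(attend(h, keys, values))
    return outputs, new_cache


first = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
second = [([1.0, 1.0], [0.5, 0.5])]
outputs, new_cache = decode(first, second, [[1.0, 0.0], [0.0, 1.0]])
```

The new key-value cache holds one entry per newly processed token and uses the same vector dimensions as the historical entries, so it can be concatenated with them later.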
[0202] S1016: Write the new key-value cache to the shared memory device in the form of incremental storage, so that the controlling node can perform association management operations between the new key-value cache and the managed second historical key-value cache.
[0203] In step S1016, after generating a new key-value cache, the first computing node writes it to the shared memory device in the form of incremental storage. "Incremental storage" means that only the key-value cache newly generated in the current inference step (i.e., the Key and Value vectors corresponding to the new output token) is written to the shared memory device, without overwriting or modifying the existing second historical key-value cache. The write location is typically at the end of the pre-allocated cache area, ensuring logical continuity or association with the existing cache content (the second historical key-value cache).
[0204] Optionally, write operations can be performed using asynchronous I / O mechanisms (such as RDMA-based asynchronous writes, CUDA asynchronous memory copies, or user-space I / O queues), allowing the main inference thread to continue subsequent computations without waiting for the write to complete, thereby reducing latency.
[0205] Optionally, a cache update notification can be used to instruct the controlling node to associate the new key-value cache with the already managed second historical key-value cache. In other words, the cache update notification indicates that the second historical key-value cache managed by the controlling node has been expanded and needs to be associated with the new key-value cache, for example by adding the new key-value cache to the same cache segment list or by updating the total length of the full cache. Optionally, the cache update notification may include the starting storage address of the new key-value cache, the length of the new key-value cache, etc.
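Incremental storage and the cache update notification of step S1016 can be sketched as an append-only region plus a segment list kept by the controlling node. The classes `SharedRegion` and `ControlNode` and their methods are hypothetical stand-ins, not interfaces of the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class SharedRegion:
    """Append-only stand-in for a pre-allocated area of the shared memory device."""
    data: list = field(default_factory=list)

    def append(self, entries):
        start = len(self.data)  # write at the end; existing entries stay untouched
        self.data.extend(entries)
        return start, len(entries)


@dataclass
class ControlNode:
    segments: dict = field(default_factory=dict)  # cache id -> [(start, length)]

    def on_cache_update(self, cache_id, start, length):
        """Handle a notification carrying the start address and length."""
        self.segments.setdefault(cache_id, []).append((start, length))

    def total_length(self, cache_id):
        return sum(length for _, length in self.segments.get(cache_id, []))


region = SharedRegion(data=["h0", "h1", "h2"])     # second historical KV cache
control = ControlNode(segments={"A": [(0, 3)]})
start, length = region.append(["n0", "n1"])        # new KV cache for two tokens
control.on_cache_update("A", start, length)
```

After the notification, the controlling node tracks two segments for cache `A` with a combined length of five entries, while the first three entries remain unchanged.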
[0206] S1017: After the inference task corresponding to the inference request is completed, update the access count of the second historical key-value cache and the new key-value cache.
[0207] In step S1017, in response to the completion of the inference task corresponding to the inference request (e.g., generating an end token, reaching the maximum output length, or receiving a client termination instruction), the access counts of the second historical key-value cache and the new key-value cache are updated, i.e., a unified refresh, to facilitate subsequent determination of whether it is a hot key-value cache.
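The access count update of step S1017 can be sketched with a counter that is refreshed once per completed inference task. The `AccessTracker` class, the cache identifiers, and the hot threshold are assumptions; the description does not specify how a hot key-value cache is determined.

```python
from collections import Counter


class AccessTracker:
    def __init__(self, hot_threshold):
        self.hot_threshold = hot_threshold  # assumed cut-off for a hot cache
        self.counts = Counter()

    def on_inference_complete(self, cache_ids):
        """Refresh the access counts of all caches used by the finished task."""
        self.counts.update(cache_ids)

    def is_hot(self, cache_id):
        return self.counts[cache_id] >= self.hot_threshold


tracker = AccessTracker(hot_threshold=2)
tracker.on_inference_complete(["second-A", "new-A"])
tracker.on_inference_complete(["second-A", "new-A2"])
```

Here `second-A` has been used by two completed tasks and counts as hot, while `new-A` has been used once and does not.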
[0208] In summary, by constructing a computing cluster framework, and reusing the local cache and retrieving the missing historical key-value cache on demand based on the computing node with current access rights, efficient reuse of the historical key-value cache can be achieved, thereby significantly reducing cache access latency and improving overall inference performance.
[0209] The compute nodes with historical key-value cache access rights mentioned above are not statically fixed, but dynamically adjusted. This is explained in detail below with reference to Figure 10.
[0210] Figure 10 is a flowchart illustrating a process for determining a control node, as provided in an embodiment of this application.
[0211] As shown in Figure 10, determining the control node may include the following steps: S101: The first compute node obtains the control node (home node) corresponding to the second historical key-value cache.
[0212] The specific details of step S101 can be found in steps S1009-S1010 above, and will not be repeated here.
[0213] The current control node refers to the compute node in the compute cluster that was most recently responsible for managing the corresponding KV cache, either pre-configured or based on historical records. This node, which manages the historical key-value cache, can be called the home node.
[0214] S102: The first computing node determines the matching result between the control node (home node) and itself (i.e., the first computing node that executes the inference request).
[0215] Specifically, the first computing node obtains the address of the home node and determines whether it is the same as its own address; if they are the same, it is considered a match. In other words, the matching result is used to characterize whether the home node is the same computing node as the first computing node that will subsequently execute the inference task (i.e., whether it is a "local node").
[0216] If the match is positive (i.e., the home node is the first compute node), the inference request is executed directly. In this case, if the local cache contains the entire global historical key-value cache required for the inference request (i.e., a full hit, including both the first and second historical key-value caches), inference is performed directly. If the local cache only partially hits (e.g., it contains the first historical key-value cache but lacks the second), the second historical key-value cache is retrieved directly from the shared memory device. If the match is negative (i.e., the home node is not the first compute node), a new home node is enumerated, and the second historical key-value cache is retrieved based on the newly determined home node. Dynamically migrating the home node to the vicinity of the first compute node that actually executes inference (or even to the first compute node itself) significantly reduces the cross-node access overhead for subsequent inference requests.
[0217] The following example illustrates this using the case where the first compute node and the home node are not the same.
[0218] S103: When the first computing node and the control node (home node) are different, the management node determines the weight parameters corresponding to each computing node when processing inference requests.
[0219] Optionally, the management node first collects node parameters corresponding to each compute node's processing of inference requests. Weight parameters characterize the overall scheduling performance when processing inference requests. Overall scheduling performance is a multi-dimensional quantification used to quantify whether a compute node is suitable as a control node. Node parameters include at least one of the following: the historical access frequency of compute node inference requests, the communication latency between each compute node and the compute node with the current access rights to the historical key-value cache, and the real-time load status of each compute node when processing the inference request. Next, based on the above node parameters and preset weight thresholds (e.g., assigning higher priority based on access frequency, or imposing a penalty factor on network latency), the management node calculates the weight parameters corresponding to each compute node's processing of inference requests using a weighted fusion or scoring model.
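The weighted fusion mentioned in step S103 can be sketched as a linear combination of the three node parameters. The coefficients, the normalization bounds, and the example values are assumptions; the description names the inputs but not the scoring model.

```python
def weight_parameter(access_freq, comm_latency_ms, load,
                     max_freq, max_latency_ms,
                     w_freq=0.5, w_latency=0.3, w_load=0.2):
    """Hypothetical weighted fusion; a higher value suits the home-node role better."""
    freq_term = access_freq / max_freq                      # favors frequent access
    latency_term = 1.0 - min(comm_latency_ms / max_latency_ms, 1.0)  # penalizes latency
    load_term = 1.0 - load                                  # penalizes busy nodes
    return w_freq * freq_term + w_latency * latency_term + w_load * load_term


weights = {
    "21a": weight_parameter(80, 1.0, 0.3, max_freq=100, max_latency_ms=10.0),
    "21b": weight_parameter(20, 0.0, 0.6, max_freq=100, max_latency_ms=10.0),
}
```

With these assumed inputs, node 21a scores higher than node 21b because its high access frequency outweighs its slightly higher communication latency.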
[0220] S104: Based on the comparison result of the weight parameter and the preset threshold, determine whether to change the access rights of the control node (home node) to the second historical key-value cache.
[0221] The management node compares the weight parameter of each compute node with the preset threshold to determine whether a home node better than the current control node exists. If no compute node's weight parameter exceeds the preset threshold, the current control node remains the home node, which helps avoid frequent home node switching due to minor fluctuations and ensures system stability. If the weight parameters of one or more compute nodes exceed the preset threshold, more suitable candidates for the home node exist. In this case, the management node selects, from the nodes that meet the threshold condition, the compute node with the highest weight parameter as the new home node, and enables the new home node to interact with the compute node currently holding the access rights through a cache consistency protocol, so that the new home node can manage the second historical key-value cache.
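The threshold comparison of step S104 can be sketched as follows. The function name `choose_home_node` and the numeric weights and thresholds are hypothetical; the handover over the cache consistency protocol is not modelled.

```python
def choose_home_node(current_home, weights, threshold):
    """Keep the current home node unless some node's weight exceeds the threshold."""
    above = {node: w for node, w in weights.items() if w > threshold}
    if not above:
        return current_home  # no better candidate: avoid unnecessary switching
    return max(above, key=above.get)


weights = {"21a": 0.81, "21b": 0.48, "21c": 0.66}
new_home = choose_home_node("21b", weights, threshold=0.6)
unchanged = choose_home_node("21b", weights, threshold=0.9)
```

With a threshold of 0.6, nodes 21a and 21c qualify and 21a is selected; with a threshold of 0.9, no node qualifies and 21b remains the home node.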
[0222] It should be noted that after the new control node (home node) is determined, the address of the compute node where the access rights corresponding to the historical key-value cache are located is updated, and the metadata cluster of the management node is informed, so that the first compute node can obtain the new control node by interacting with the management node, in order to ensure data consistency.
[0223] Furthermore, the management node can monitor the status of the home node in real time to determine whether a failure has occurred. When a single point of failure, network partition, or resource overload occurs in the home node, a new home node can be quickly elected to ensure that inference tasks are not interrupted and improve overall availability and robustness.
[0224] In summary, by dynamically defining the home node, the home node is no longer a statically bound central node, but is dynamically adjusted according to the current task distribution, node load, data locality, etc. This helps to balance the use of cluster resources, avoid single points of failure, ensure that inference tasks are not interrupted, and improve overall availability and robustness.
[0225] Corresponding to the first embodiment, this application also provides another embodiment, in which the local cache of the first computing node fully hits the historical key-value cache corresponding to the inference request. That is, the local cache stores the global historical key-value cache (first historical key-value cache and second historical key-value cache) corresponding to the inference request. This is explained in detail below with reference to Figure 11.
[0226] Figure 11 is a second flowchart illustrating a reasoning method provided in an embodiment of this application.
[0227] As shown in Figure 11, the reasoning method includes the following steps: S01: The terminal device responds to the user's input operation and obtains the inference request.
[0228] S02: The terminal device sends an inference request to the management node through the communication interface.
[0229] The specific details of steps S01-S02 can be found in steps S1-S2 above, and will not be repeated here.
[0230] S03: The management node receives the inference request and obtains the metadata corresponding to the inference request based on the inference request.
[0231] Figure 12 is an interaction diagram of another reasoning method provided in the embodiments of this application.
[0232] As shown in Figure 12, the reasoning method includes the following steps: S2001: The cache-aware scheduler of the management node receives an inference request.
[0233] S2002: The cache-aware scheduler of the management node performs word segmentation on the text in the inference request to obtain a word sequence.
[0234] S2003: The cache-aware scheduler of the management node performs frequency statistics on the word sequence and generates word frequency vectors.
[0235] S2004: The cache-aware scheduler of the management node performs semantic encoding on the word frequency vector to generate semantic features corresponding to the inference request.
[0236] S2005: The cache-aware scheduler of the management node looks up the metadata corresponding to the inference request from the metadata cluster.
[0237] The specific details of steps S2001-S2005 can be found in steps S1001-S1005 above, and will not be repeated here.
[0238] S04: The management node determines the first compute node to execute the inference request from multiple compute nodes based on metadata.
[0239] Continuing with reference to Figure 12, step S04 includes steps S2006-S2007.
[0240] S2006: The cache-aware scheduler of the management node determines the first compute node to execute inference requests from multiple compute nodes based on metadata.
[0241] S2007: The cache-aware scheduler of the management node sends an inference request to the first compute node.
[0242] The specific details of steps S2006-S2007 can be found in steps S1006-S1007 above, and will not be repeated here.
[0243] S05: The first compute node responds to the inference request and obtains the hit result (full hit: global history key-value cache) from the local cache of the first compute node.
[0244] The hit result characterizes how much of the historical key-value cache required by the inference request is covered by the local cache.
[0245] In this embodiment, a full hit is used as an example for explanation.
[0246] Continuing with reference to Figure 12, step S05 includes steps S2008-S2012.
[0247] S2008: The first compute node responds to the inference request by retrieving the global historical key-value cache from the local cache of the first compute node.
[0248] The specific details of step S2008 can be found in step S1008 above, and will not be repeated here.
[0249] Since this global historical key-value cache is previously generated and managed by this node, it eliminates the need for cross-node communication or remote access to shared memory devices, thus enabling context reuse with minimal latency and maximum bandwidth.
[0250] S06: The first computing node uses the global historical key-value cache to execute the corresponding inference task.
[0251] Continuing with reference to Figure 11, step S06 includes step S2009.
[0252] S2009: The first compute node uses the global historical key-value cache (first historical key-value cache and second historical key-value cache) to perform the corresponding inference task.
[0253] The specific details of step S2009 can be found in step S1014 above, and will not be repeated here.
[0254] In summary, when the first compute node can directly read the local cache of the global historical key-value cache, the overall system performance and efficiency can be significantly improved. This approach avoids cross-node communication and remote shared memory access, significantly reducing cache access latency and improving overall inference performance.
[0255] It should be noted that after completing the current inference task, the newly generated key-value cache also needs to be persisted and the index updated to support efficient reuse for future requests.
[0256] S2010: During the inference task based on the first historical key-value cache and the second historical key-value cache, the first computing node generates a new key-value cache.
[0257] S2011: Write the new key-value cache to the shared memory device in the form of incremental storage, so that the controlling node can perform association management operations between the new key-value cache and the already managed second historical key-value cache.
[0258] S2012: After the inference task corresponding to the inference request is completed, update the access counts of the second historical key-value cache and the new key-value cache.
[0259] The specific details of steps S2010-S2012 can be found in steps S1015-S1017 above, and will not be repeated here.
[0260] Corresponding to the above embodiments, this application also provides a further embodiment. In this embodiment, the local cache of the first computing node completely misses the historical key-value cache corresponding to the inference request. That is, the local cache does not store the global historical key-value cache (first historical key-value cache and second historical key-value cache) corresponding to the inference request, which therefore needs to be obtained remotely.
[0261] This is explained in detail below with reference to Figure 13.
[0262] Figure 13 is a third flowchart illustrating a reasoning method provided in an embodiment of this application.
[0263] As shown in Figure 13, the reasoning method includes the following steps: S001: The terminal device responds to the user's input operation and obtains the inference request.
[0264] S002: The terminal device sends an inference request to the management node through the communication interface.
[0265] The specific details of steps S001-S002 can be found in steps S1-S2 above, and will not be repeated here.
[0266] S003: The management node receives the inference request and obtains the metadata corresponding to the inference request based on the inference request.
[0267] Figure 14 is an interaction diagram of another reasoning method provided in the embodiments of this application.
[0268] As shown in Figure 14, the reasoning method includes the following steps: S3001: The cache-aware scheduler of the management node receives an inference request.
[0269] S3002: The cache-aware scheduler of the management node performs word segmentation on the text in the inference request to obtain a word sequence.
[0270] S3003: The cache-aware scheduler of the management node performs frequency statistics on the word sequence and generates word frequency vectors.
[0271] S3004: The cache-aware scheduler of the management node performs semantic encoding on the word frequency vector to generate semantic features corresponding to the inference request.
[0272] S3005: The cache-aware scheduler of the management node looks up the metadata corresponding to the inference request from the metadata cluster.
[0273] The specific details of steps S3001-S3005 can be found in steps S1001-S1005 above, and will not be repeated here.
[0274] S004: The management node determines the first compute node to execute the inference request from multiple compute nodes based on metadata.
[0275] Continuing with reference to Figure 14, step S004 includes step S3006.
[0276] S3006: The cache-aware scheduler of the management node determines the first compute node to execute inference requests from multiple compute nodes based on metadata.
[0277] The specific details of step S3006 can be found in step S1006 above, and will not be repeated here.
[0278] After the first computing node is determined, the management node sends the inference request to the first computing node, which will be explained in conjunction with step S3007 below.
[0279] S3007: The cache-aware scheduler of the management node sends an inference request to the first compute node.
[0280] S005: The first compute node responds to the inference request and determines the hit result (complete miss) in the first compute node's local cache.
[0281] Continuing with reference to Figure 14, step S005 includes step S3008.
[0282] S3008: The first compute node responds to the inference request and determines that there is no historical key-value cache in the local cache of the first compute node.
[0283] The specific content of step S3008 can be similar to that of step S1008 above, and will not be repeated here.
[0284] It should be noted that, in this embodiment, the global historical key-value cache needs to be obtained based on actual requirements.
[0285] S006: When it is necessary to obtain the global historical key-value cache, the first compute node sends a second read request to the control node.
[0286] The second read request is used to retrieve the global historical key-value cache corresponding to the inference request.
[0287] Continuing with reference to Figure 14, step S006 includes steps S3009-S3011.
[0288] S3009: The first compute node sends an addressing request to the metadata cluster of the management node.
[0289] S3010: The metadata cluster of the management node receives the addressing response corresponding to the addressing request returned by the management node, and the cache-aware scheduler of the management node determines the control node based on the addressing response.
[0290] S3011: The first compute node sends a second read request to the control node based on the address of the control node, so that the control node can obtain the global historical key-value cache based on the second read request.
[0291] The specific content of steps S3009-S3011 is similar to that of steps S1009-S1011 above, and will not be repeated here.
[0292] S007: The control node reads the global historical key-value cache from the shared memory device and sends the global historical key-value cache to the first compute node.
[0293] Continuing with Figure 14, step S007 includes step S3012.
[0294] S3012: The control node responds to the second read request by reading the global historical key-value cache from the shared memory device.
[0295] S3013: The control node sends the global historical key-value cache to the first compute node.
[0296] The specific content of steps S3012-S3013 is similar to that of steps S1012-S1013 above, and will not be repeated here.
[0297] S008: The first computing node performs inference tasks based on the global historical key-value cache.
[0298] Continuing with Figure 14, step S008 includes step S3014.
[0299] S3014: The first compute node performs inference tasks based on the global historical key-value cache.
[0300] The specific content of step S3014 can be similar to that of step S2009 above, and will not be repeated here.
[0301] In summary, when the local cache of the first compute node does not include the global historical key-value cache corresponding to the inference request, reading the global historical key-value cache directly from the remote home node can significantly improve the overall performance and efficiency of the system.
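The complete-miss path of steps S3008-S3014 can be illustrated with a minimal sketch. It is not the implementation of this application: the class names `SharedMemoryDevice`, `ComputeNode`, and `ManagementNode`, the function `fetch_on_complete_miss`, and the use of plain Python dictionaries in place of a shared memory device are all assumptions made for illustration. Keeping the fetched cache in the local cache afterwards is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemoryDevice:
    """Stand-in for the shared memory device: cache key -> list of KV blocks."""
    blocks: dict = field(default_factory=dict)


@dataclass
class ComputeNode:
    name: str
    shared: SharedMemoryDevice
    local_cache: dict = field(default_factory=dict)

    def handle_read(self, cache_key):
        # S3012-S3013: the control node reads from shared memory and returns the data.
        return list(self.shared.blocks[cache_key])


@dataclass
class ManagementNode:
    # Metadata (hypothetical layout): cache key -> name of the current control node.
    control_of: dict = field(default_factory=dict)

    def resolve_control_node(self, cache_key):
        # S3009-S3010: answer an addressing request with the control node.
        return self.control_of[cache_key]


def fetch_on_complete_miss(first, mgmt, nodes, cache_key):
    """Return the global historical KV cache when the local cache holds none of it."""
    if cache_key in first.local_cache:
        # Not a complete miss: the local copy can be used directly.
        return first.local_cache[cache_key]
    control_name = mgmt.resolve_control_node(cache_key)  # addressing request
    kv = nodes[control_name].handle_read(cache_key)      # second read request (S3011)
    first.local_cache[cache_key] = kv                    # assumption: keep for reuse
    return kv
```

In this sketch the first compute node never touches the shared memory device itself; it only learns the control node from the management node and asks that node to perform the read, which mirrors the division of roles described above.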
[0302] It should be noted that after completing the current inference task, the newly generated key-value cache also needs to be persisted and the index updated to support efficient reuse for future requests.
[0303] S3015: During the inference task based on the global historical key-value cache, the first compute node generates a new key-value cache.
[0304] S3016: The new key-value cache is written to the shared memory device in the form of incremental storage, so that the control node can associate the new key-value cache with the managed global historical key-value cache.
[0305] S3017: After the inference task corresponding to the inference request is completed, the access counts of the global historical key-value cache and the new key-value cache are updated.
[0306] The specific details of steps S3015-S3017 can be found in steps S1015-S1017 above, and will not be repeated here.
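As an illustration of steps S3015-S3017, the following sketch stores each newly generated key-value cache as a separate appended segment and updates per-segment access counts after the task completes. The class `SharedKVStore`, its method names, and the segment identifier format are hypothetical; the application does not specify the storage layout.

```python
from collections import defaultdict


class SharedKVStore:
    """Toy stand-in for the shared memory device with append-only segments."""

    def __init__(self):
        self.segments = defaultdict(list)     # cache key -> list of KV segments
        self.access_count = defaultdict(int)  # segment id -> number of uses

    def append_increment(self, cache_key, new_kv):
        """S3016: store only the newly generated KV entries as a new segment."""
        seg_id = (cache_key, len(self.segments[cache_key]))
        self.segments[cache_key].append(list(new_kv))
        return seg_id

    def read_all(self, cache_key):
        """Return the historical cache followed by all incremental segments."""
        return [kv for seg in self.segments[cache_key] for kv in seg]

    def touch(self, seg_ids):
        """S3017: update access counts once the inference task is complete."""
        for seg_id in seg_ids:
            self.access_count[seg_id] += 1
```

Because existing segments are never rewritten, the historical key-value cache and the new key-value cache remain associated under the same cache key, and a later request can read them back in order with `read_all`.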
[0307] Corresponding to the above embodiments, this application also provides a complete embodiment.
[0308] Figure 15 is a schematic diagram of the complete process of a reasoning method provided in an embodiment of this application.
[0309] As shown in Figure 15, the reasoning method includes the following steps: S1501: The management node receives an inference request.
[0310] S1502: The management node obtains the metadata corresponding to the inference request based on the inference request.
[0311] The specific details of steps S1501-S1502 can be found in step S3 above, and will not be repeated here.
[0312] S1503: The management node determines the first compute node from multiple compute nodes based on metadata.
[0313] S1504: The management node sends the inference request to the first compute node.
[0314] The specific details of steps S1503-S1504 can be found in step S1014 above, and will not be repeated here.
[0315] S1505: The first compute node responds to the inference request and determines whether the home node is the first compute node.
[0316] S1506: If the home node and the first compute node are the same compute node, check whether the hit result of the local cache is a full hit.
[0317] S1507: In the case of a full hit, perform the inference task based on the global historical key-value cache corresponding to the inference request in the local cache.
[0318] The specific details of step S1507 can be found in steps S05-S06 above, and will not be repeated here.
[0319] S1508: If the hit result is not a full hit, determine whether it is a partial hit.
[0320] S1509: In the case of partial hits, retrieve the first historical key-value cache corresponding to the inference request in the local cache, and interact with the home node to retrieve the second historical key-value cache in the shared memory device.
[0321] The second historical key-value cache is a historical key-value cache that is not in the local cache of the first compute node and is required by the inference request.
[0322] The specific details of steps S1508-S1509 can be found in steps S1008-S1013 above, and will not be repeated here.
[0323] S1510: Perform inference tasks based on the first and second historical key-value caches.
[0324] S1511: If the hit result is not a partial hit either (that is, a complete miss), determine that the global historical key-value cache corresponding to the inference request needs to be retrieved, and interact with the home node to retrieve the global historical key-value cache in the shared memory device.
[0325] The specific details of step S1511 can be found in steps S005-S007 above, and will not be repeated here.
[0326] S1512: Perform the inference task based on the global historical key-value cache.
[0327] The specific details of step S1512 can be found in step S008 above, and will not be repeated here.
[0328] S1513: If the home node is not the same as the first compute node, determine the new home node.
[0329] The specific details of step S1513 can be found in steps S101-S102 above, and will not be repeated here.
[0330] S1514: Retrieve the first historical key-value cache corresponding to the inference request in the local cache, and interact with the new home node to retrieve the second historical key-value cache.
[0331] The specific details of step S1514 can be found in steps S1008-S1013 above, and will not be repeated here.
[0332] S1515: Perform inference tasks based on the first and second historical key-value caches.
[0333] The specific details of step S1515 can be found in steps S1014-S1015 above, and will not be repeated here.
[0334] In summary, a computing cluster framework is constructed in which the local cache is reused and missing historical key-value caches are retrieved on demand through the computing node that currently holds the access right. This enables efficient reuse of historical key-value caches according to the dynamic semantic features of inference requests, thereby significantly reducing cache access latency and improving overall inference performance.
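The branching of steps S1505-S1515 can be summarized in a short sketch. The names `Hit`, `classify_hit`, and `plan_request`, the action labels, and the representation of a cache as a list of block identifiers are assumptions chosen for illustration only.

```python
from enum import Enum


class Hit(Enum):
    FULL = "full"
    PARTIAL = "partial"
    MISS = "miss"


def classify_hit(required_blocks, local_blocks):
    """Compare the blocks a request needs with the blocks held locally."""
    required = set(required_blocks)
    present = required & set(local_blocks)
    if present == required:
        return Hit.FULL
    return Hit.PARTIAL if present else Hit.MISS


def plan_request(first_node, home_node, required_blocks, local_blocks):
    """Return (action, blocks_to_fetch_remotely) following S1505-S1515."""
    local = set(local_blocks)
    missing = [b for b in required_blocks if b not in local]
    if home_node != first_node:
        # S1513-S1514: determine a new home node, then fetch what is missing.
        return ("determine_new_home_then_fetch", missing)
    hit = classify_hit(required_blocks, local_blocks)
    if hit is Hit.FULL:
        return ("use_local", [])                  # S1507
    if hit is Hit.PARTIAL:
        return ("use_local_and_fetch", missing)   # S1509-S1510
    return ("fetch_global", missing)              # S1511-S1512
```

For example, `plan_request("n1", "n1", ["a", "b"], ["a"])` yields the partial-hit action together with the single missing block `"b"`, which corresponds to the second historical key-value cache in the description above.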
[0335] Corresponding to the above embodiments, this application also provides another embodiment, which is applied to a first computing node among a plurality of computing nodes.
[0336] Figure 16 is the fourth flowchart of a reasoning method provided in an embodiment of this application.
[0337] As shown in Figure 16, the method is applied to a first compute node among multiple compute nodes, where the multiple compute nodes are connected to a shared memory device based on a cache coherence protocol. The inference method includes the following steps: S161: In response to the inference request, the first compute node retrieves the first historical key-value cache from its local cache.
[0338] The first historical key-value cache is the historical key-value cache required for processing inference requests.
[0339] S162: If it is necessary to obtain the second historical key-value cache, send the first read request to the control node.
[0340] The control node is the compute node where the current access right to the second historical key-value cache resides. The second historical key-value cache is required for processing inference requests and is not located within the first compute node.
[0341] S163: Receive the second historical key-value cache that the control node, in response to the first read request, reads from the shared memory device and returns.
[0342] S164: Perform inference tasks based on the first and second historical key-value caches.
[0343] In summary, by constructing a computing cluster framework, and reusing the local cache and retrieving the missing historical key-value cache on demand based on the computing node with current access rights, efficient reuse of the historical key-value cache can be achieved, thereby significantly reducing cache access latency and improving overall inference performance.
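Steps S161-S164 can be sketched from the point of view of the first compute node as follows. The function `assemble_history`, the indexing of cache entries by token position, and the callable `read_from_control_node` (which stands in for sending the first read request) are hypothetical and serve only to make the data flow concrete.

```python
def assemble_history(required_positions, local_cache, read_from_control_node):
    """Merge locally cached KV entries with entries fetched from the control node.

    local_cache: dict mapping position -> KV entry held by the first compute node.
    read_from_control_node: callable taking a list of positions and returning a
        dict position -> KV entry (stand-in for the first read request).
    """
    # S161: first historical key-value cache, taken from the local cache.
    first = {p: local_cache[p] for p in required_positions if p in local_cache}
    # S162-S163: request only what is missing; skip the request on a full hit.
    missing = [p for p in required_positions if p not in first]
    second = read_from_control_node(missing) if missing else {}
    merged = {**first, **second}
    # S164: hand the history to the model in position order.
    return [merged[p] for p in required_positions]
```

Only the missing positions cross the node boundary in this sketch, so a request whose history is mostly cached locally issues a correspondingly small read to the control node.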
[0344] Corresponding to the above embodiments, this application also provides another embodiment, which is applied to a management node.
[0345] Figure 17 is the fifth flowchart of a reasoning method provided in an embodiment of this application.
[0346] As shown in Figure 17, this method is applied to a management node, which establishes communication connections with multiple compute nodes. The method includes: S171: In response to the inference request, obtain the metadata corresponding to the inference request.
[0347] Metadata is used to represent the context information of inference requests and the address information of the corresponding historical key-value cache.
[0348] The specific details of step S171 can be found in step S1005 above, and will not be repeated here.
[0349] S172: Based on metadata, determine the first compute node from multiple compute nodes to execute the inference request.
[0350] The details of step S172 can be found in step S1006 above, and will not be repeated here.
[0351] S173: When the first computing node and the control node are different, determine the weight parameters corresponding to each computing node when processing inference requests; the weight parameters are used to characterize the overall scheduling performance when processing inference requests.
[0352] Among them, the control node is the computing node where the current access right of the second historical key-value cache is located.
[0353] The details of step S173 can be found in step S103 above, and will not be repeated here.
[0354] S174: If the comparison result between the weight parameter and the preset threshold satisfies the condition for changing the access right, determine the new control node and enable the new control node to interact, through the cache coherence protocol, with the computing node where the current access right is located.
[0355] The details of step S174 can be found in step S104 above, and will not be repeated here.
[0356] In summary, by constructing a computing cluster framework, and reusing the local cache and retrieving the missing historical key-value cache on demand based on the computing node with current access rights, efficient reuse of the historical key-value cache can be achieved, thereby significantly reducing cache access latency and improving overall inference performance.
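The access right change of steps S173-S174 can be sketched as below. This section does not define how the weight parameter is computed or in which direction it is compared with the preset threshold, so the `weight` formula, its inputs (`remote_reads`, `avg_latency_ms`), the coefficients `alpha` and `beta`, and the "greater than threshold" rule are all assumptions. The cache coherence protocol exchange between the old and new control nodes is not modeled; only the management node's record is updated.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    remote_reads: int       # hypothetical input: reads this node issued for the cache
    avg_latency_ms: float   # hypothetical input: observed access latency


def weight(stats, alpha=1.0, beta=0.1):
    """Hypothetical weight parameter: frequent, slow remote readers score higher."""
    return alpha * stats.remote_reads + beta * stats.avg_latency_ms


def maybe_migrate(control_of, cache_key, first_node, stats_by_node, threshold):
    """S173-S174: move the access right to first_node if its weight exceeds threshold."""
    current = control_of[cache_key]
    if current == first_node:
        return current                      # first node already is the control node
    if weight(stats_by_node[first_node]) > threshold:
        control_of[cache_key] = first_node  # management node records the new control node
    return control_of[cache_key]
```

After a migration, later addressing requests resolve to the new control node because the management node's record has changed.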
[0357] It should be noted that the application scenarios of this application are not specifically limited; the solution can be applied to typical AI inference scenarios such as large-scale multi-turn dialogue services, batch content generation tasks, and real-time inference on the edge cloud. In large-scale multi-turn dialogue services, long-running conversations with thousands of concurrent users are supported, and reusing the KV caches of similar dialogues through semantic feature matching significantly reduces first-packet latency. In batch content generation tasks (such as news writing and code completion), a large number of requests often share highly similar prefix prompts, so reusing the KV caches of similar requests through semantic feature matching effectively improves overall throughput and resource utilization. In real-time inference edge cloud scenarios, where edge devices have limited computing power and memory, the system uses the CXL shared memory pool to expand the available memory capacity, supports larger-scale model deployment, and combines a cache-aware scheduling mechanism to ensure that critical tasks receive low-latency, high-priority responses.
[0358] Figure 18 is a schematic diagram of a computing device provided in an embodiment of this application.
[0359] As shown in Figure 18, the computing device 1800 includes a processor 1801 and a memory 1802. For example, the computing device 1800 may also include a communication interface 1803 and a communication bus 1804.
[0360] The processor 1801, memory 1802, and communication interface 1803 communicate with each other via communication bus 1804. The communication interface 1803 may include a transmitter and receiver for communicating with other devices or communication networks. It can be a wired interface (port), such as a fiber distributed data interface (FDDI) or a gigabit Ethernet interface (GE).
[0361] In some embodiments, the processor 1801 is used to execute program 1805, specifically performing the relevant steps in the above-described inference execution method embodiments. Specifically, program 1805 may include program code, which includes computer-executable instructions.
[0362] For example, processor 1801 may be a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement some embodiments of this application. Computing device 1800 may include one or more processors, which may be processors of the same type, such as one or more CPUs; or processors of different types, such as one or more CPUs and one or more ASICs. The CPU may be a single-core CPU or a multi-core CPU.
[0363] In some embodiments, memory 1802 is used to store program 1805. Memory 1802 may include high-speed random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device.
[0364] Specifically, program 1805 can be called by processor 1801 to cause computing device 1800 to perform inference generation operations.
[0365] Some embodiments of this application provide a computer-readable storage medium storing at least one executable instruction that, when executed on a computing device 1800, causes the computing device 1800 to perform the reasoning method described in the above embodiments.
[0366] For example, the computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), magnetic tape, a floppy disk, or an optical data storage device.
[0367] This application provides a chip system in some embodiments, which is applied to a server. The chip system includes one or more interface circuits and one or more processors. The interface circuits and processors are interconnected via lines. The interface circuits are used to receive signals from the server's memory and send signals to the processors, the signals including computer instructions stored in the memory. When the server processor executes the computer instructions, the server performs the various steps of the reasoning method shown in the above-described method embodiments.
[0368] For the beneficial effects achievable by the computer-readable storage medium provided in some embodiments of this application, refer to the beneficial effects of the corresponding reasoning methods provided above; they will not be repeated here.
[0369] It should be noted that, in this application, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0370] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the apparatus embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0371] The logic and/or steps represented in the flowcharts or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them).
[0372] For the purposes of this specification, "computer-readable medium" can mean any means that can contain, store, communicate, propagate, or transmit programs for use by or in conjunction with an instruction execution system, apparatus, or device.
[0373] More specific examples of computer-readable media (a non-exhaustive list) include the following: electrical connections having one or more wires (electronic devices), portable computer disks (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM).
[0374] Furthermore, the computer-readable medium can even be paper or other suitable media on which the program can be printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory. It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof.
[0375] In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc. The above embodiments are merely specific embodiments of this application and are not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made based on the technical solutions of this application should be included within the scope of protection of this application.
Claims
1. A reasoning method, characterized in that, The method, applied to a first compute node in a plurality of compute nodes connected to a shared memory device based on a cache coherence protocol, includes: In response to an inference request, the first historical key-value cache is retrieved from the local cache of the first computing node; the first historical key-value cache is the historical key-value cache required to execute the inference request; If it is necessary to retrieve the second historical key-value cache, a first read request is sent to the control node; wherein, the control node is a computing node that manages the access rights to the second historical key-value cache; the second historical key-value cache is a historical key-value cache that is required to process the inference request and is not within the first computing node; The second historical key-value cache, which the control node reads from the shared memory device and returns in response to the first read request, is received; Inference tasks are performed based on the first historical key-value cache and the second historical key-value cache.
2. The reasoning method according to claim 1, characterized in that, The multiple computing nodes establish communication connections with the management node; Sending the first read request to the control node includes: Send an addressing request to the management node; the addressing request is used to obtain the control node of the second historical key-value cache; Receive the addressing response corresponding to the addressing request returned by the management node; the addressing response includes the control node of the second historical key-value cache; The first read request is sent to the control node so that the control node can obtain the second historical key-value cache based on the first read request.
3. The reasoning method according to claim 1, characterized in that, Also includes: During the inference task based on the first historical key-value cache and the second historical key-value cache, a new key-value cache is generated; The new key-value cache is written to the shared memory device in the form of incremental storage, so that the control node can perform association management operation between the new key-value cache and the managed second historical key-value cache; After the inference task corresponding to the inference request is completed, the access counts of the second historical key-value cache and the new key-value cache are updated.
4. A reasoning method, characterized in that, Applied to a management node, the management node establishes communication connections with multiple computing nodes; the method includes: In response to an inference request, the metadata corresponding to the inference request is obtained; the metadata is used to characterize the context information of the inference request and the address information for storing the corresponding historical key-value cache. Based on the metadata, a first computing node for executing the inference request is determined from the plurality of computing nodes; When the first computing node and the control node are different, obtain the weight parameters corresponding to each computing node when processing the inference request; the weight parameters are used to characterize the overall scheduling performance when processing the inference request; wherein, the control node is the computing node that manages the access rights of the second historical key-value cache; If the comparison result between the weight parameter and the preset threshold satisfies the access right change, a new control node is determined, and the new control node interacts with the computing node where the current access right is located through the cache consistency protocol; wherein, the access right is the management authority of the control node over the second historical key-value cache.
5. The reasoning method according to claim 4, characterized in that, Also includes: Update the address of the computing node where the access rights corresponding to the historical key-value cache reside, so that the first computing node can obtain a new control node by interacting with the management node.
6. The reasoning method according to claim 4 or 5, characterized in that, The step of responding to the inference request and obtaining the metadata corresponding to the inference request includes: The text in the inference request is subjected to word frequency encoding to generate a word frequency vector; The word frequency vector is semantically encoded to generate the semantic features corresponding to the reasoning request; Based on the semantic features, obtain the metadata corresponding to the reasoning request.
7. The reasoning method according to any one of claims 4-6, characterized in that, The step of determining a first computing node for executing the inference request from the plurality of computing nodes based on the metadata includes: Determine the degree of matching between the semantic features and the context information of each historical reasoning request in the metadata; Based on the matching degree and preset weight, the affinity score of each computing node associated with the inference request is determined; wherein, the affinity score is used to quantify the degree of matching between the available resources of the computing node and the resource requirements of the inference request; Based on the affinity scores, a first computing node for executing the inference request is determined from the computing nodes associated with the inference request.
8. The reasoning method according to any one of claims 4-6, characterized in that, The step of determining a first computing node for executing the inference request from the plurality of computing nodes based on the metadata includes: Determine the degree of matching between the semantic features and the context information of each historical reasoning request in the metadata; Based on the matching degree and preset weights, the affinity score of the computing node associated with the inference request is determined; wherein, the affinity score is used to quantify the degree of fit between the computing node and the inference request; For computing nodes that are not associated with the inference request among multiple computing nodes, the affinity score is set to a default value; Obtain the load score and access latency of each computing node among the plurality of computing nodes when processing the inference request; Based on the filtering conditions pre-set in the service level agreement of the inference request, multiple candidate computing nodes are determined from the multiple computing nodes; A comprehensive score for each candidate computing node is determined based on the affinity score, load score, and access latency score of the candidate computing node; Based on the comprehensive scores of the multiple candidate computing nodes, a first computing node is determined for executing the inference request.
9. The reasoning method according to claim 7, characterized in that, The step of obtaining the load score of each computing node among the plurality of computing nodes when processing the inference request includes: Obtain the sequence length corresponding to the inference request; Based on the sequence length, the model parameters of the target inference model, and the accuracy of the target inference model, the prediction computation amount for processing the inference request is determined; Obtain the real-time load parameters of each of the plurality of computing nodes; Based on the predicted computational load and the real-time load parameters of each computing node, the load score of each computing node among the plurality of computing nodes is obtained.
10. The reasoning method according to any one of claims 3-8, characterized in that, The multiple computing nodes are divided into a pre-filled resource pool and a decoding resource pool; wherein, the computing nodes in the pre-filled resource pool are used to perform pre-filled calculations for the inference request and generate corresponding historical key-value caches; the computing nodes in the decoding resource pool are used to perform autoregressive decoding based on the historical key-value caches to generate new tokens and corresponding new key-value caches; The method further includes: Based on the running status information of each computing node when processing inference requests, an adaptation score is generated for each computing node; Based on the adaptation score and the preset resource pool allocation threshold, the resource type corresponding to each computing node is determined; Based on the resource type corresponding to each computing node, each computing node is dynamically migrated to adjust the number of computing nodes in the pre-filled resource pool and the decoding resource pool.
11. A computing cluster, characterized in that, The computing cluster includes: Multiple computing nodes, which are divided into a pre-filled resource pool and a decoding resource pool; A management node is used to determine the first computing node from multiple computing nodes to execute the inference request, and to trigger the migration of access rights to the second historical key-value cache if the first computing node is different from the computing node that has access rights to the second historical key-value cache. A shared memory device, wherein the plurality of computing nodes are connected to the shared memory device based on a cache coherency protocol; used to store the second historical key-value cache.
12. A computing device, characterized in that, The computing device includes a memory and a processor; the memory and the processor are coupled; the memory is used to store computer program code, the computer program code including computer instructions, which, when executed by the processor, cause the computing device to apply the inference method as described in any one of claims 1 to 10 and the computing cluster as described in claim 11.
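For illustration only, and not as part of the claims, the candidate filtering and comprehensive scoring recited in claim 8 might be sketched as follows. The claims do not give a formula, so the `Candidate` fields, the normalization of each term to the range 0 to 1, the weights `w_affinity`, `w_load`, and `w_latency`, and the SLA limits `max_latency_ms` and `max_load` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    affinity: float     # 0..1, higher is better; a default value if not associated
    load: float         # 0..1, higher means busier
    latency_ms: float   # expected cache access latency


def pick_first_node(candidates, max_latency_ms, max_load,
                    w_affinity=0.5, w_load=0.3, w_latency=0.2):
    """Filter by SLA-style limits, then rank by a weighted comprehensive score."""
    eligible = [c for c in candidates
                if c.latency_ms <= max_latency_ms and c.load <= max_load]
    if not eligible:
        return None

    def comprehensive(c):
        # Assumes max_latency_ms > 0; a score of 1.0 means the lowest latency.
        latency_score = 1.0 - c.latency_ms / max_latency_ms
        return (w_affinity * c.affinity
                + w_load * (1.0 - c.load)
                + w_latency * latency_score)

    return max(eligible, key=comprehensive).name
```

A node with high affinity can still lose to a lightly loaded, nearby node if the weights favor load and latency, which is the trade-off that a comprehensive score is meant to capture.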