Processing inferencing requests in hierarchical inferencing overlay network
A hierarchical inferencing overlay network with iPoPs optimizes KV cache management to address computational and network constraints, improving TTFT and reducing costs in LLM inferencing.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2024-12-02
- Publication Date
- 2026-06-11
Abstract
Description
PROCESSING INFERENCING REQUESTS IN HIERARCHICAL INFERENCING OVERLAY NETWORK
TECHNICAL FIELD
[0001] The present disclosure relates, in general, to processing inferencing requests in a hierarchical inferencing overlay network. Aspects of the disclosure relate to optimising streaming of key value (KV) cache information.
BACKGROUND
[0002] Large Language Models (LLMs) have emerged as a pivotal technology in artificial intelligence, finding applications across numerous fields including marketing, customer assistance, and a wide range of other sectors. The training of these models, however, represents a significant and growing cost factor in the deployment of effective AI-based services. While the training phase is resource-intensive, the inferencing phase (the process in which the model generates responses to user prompts) is crucial for generating revenue from AI services, thus offsetting training costs and enabling the profitability of LLMs.
[0003] Nonetheless, inferencing itself entails substantial operational expenses. One of the critical challenges lies in the response time for each inference request, particularly the "time to first token" (TTFT), a measure of how quickly the LLM begins producing a response to a given prompt. Reducing TTFT is essential in maintaining user engagement and delivering seamless AI-driven interactions, as delays in response can degrade the quality of the user experience and limit the practical value of these models for real-time applications.
[0004] Current approaches to improving TTFT have included various methods aimed at enhancing efficiency, yet these are often constrained by the inherent computational demands of inferencing and by network latency in distributed computing environments. Context information, which significantly enhances inferencing accuracy and relevance, adds another layer of complexity. While context data can be transferred with relative ease, the computational load required to process it is high. To mitigate this, processed context information may be stored in key value (KV) caches, which reduces processing demands but increases the cost of transferring data over the network. Compressing these KV caches offers some relief by making data transfer less costly, yet decompression and processing remain resource-intensive.
[0005] As such, the problem of optimising TTFT remains a key obstacle in maximising the viability and commercial effectiveness of LLMs. Addressing this challenge would be highly advantageous, allowing for more responsive AI interactions and a more cost-effective inferencing process, thus making LLM-based services more scalable, reliable, and profitable.
SUMMARY
[0006] An objective of the present disclosure is to enable optimised streaming of KV cache information so as to reduce the TTFT.
[0007] The foregoing and other objectives are achieved by the features of the independent claims.
[0008] Further implementation forms are apparent from the dependent claims, the description and the Figures.
[0009] A first aspect of the present disclosure provides a system for processing inferencing requests in a hierarchical inferencing overlay network, the system comprising multiple inferencing Points of Presence (iPoP), wherein each iPoP is arranged to store a key value (KV) cache indexed using KV cache identifiers, receive an inferencing request from a user and, in response to receiving the inferencing request, determine whether relevant KV cache information is available, in response to determining that the relevant KV cache information is available, process the inferencing request locally, or, in response to determining that the relevant KV cache information is not available, forward the inferencing request to an upstream iPoP of the multiple iPoPs, wherein each iPoP of the multiple iPoPs comprises a KV cache manager arranged to aggregate the inferencing requests and to coordinate retrieval, storage and distribution of KV cache information between the multiple iPoPs, a KV streaming engine arranged to stream KV cache entries between the multiple iPoPs, and an inferencing engine arranged to generate context information and KV cache identifiers based on the received inferencing request.
[0010] The KV cache identifiers may encode context information, a quantification of context and compression levels of the KV cache.
[0011] The inferencing engine may be further arranged to pre-fill the KV cache with compressed context information based on historical inferencing requests.
[0012] To stream KV cache entries between the multiple iPoPs, the KV streaming engine may be further arranged to
[0013] push KV cache entries downstream from upstream iPoPs to downstream iPoPs of the multiple iPoPs based on predicted demand, and pull KV cache entries from upstream iPoPs based on inferencing requests aggregated by the KV cache manager.
[0014] The KV cache manager may be further arranged to determine a popularity metric based on at least one of the aggregated inferencing requests, time of day, and geographic distribution, and instruct the KV streaming engine to proactively push KV cache entries to downstream iPoPs of the multiple iPoPs based on the popularity metric and historical usage patterns.
[0015] The KV cache manager may be further arranged to calculate the popularity metric for specific context information indexed over the context portion of KV cache entries, and instruct the KV streaming engine to reactively pull popular KV cache entries from upstream iPoPs based on this metric.
[0016] The KV cache manager, in coordination with the KV streaming engine, may be arranged to control a compression level encoded in the KV cache identifier to manage network bandwidth usage for streaming KV cache entries.
[0017] Each iPoP of the multiple iPoPs may be arranged to generate a response in response to the received inferencing request, and / or forward the inferencing request to an endpoint for the response to be generated outside the iPoP.
[0018] A second aspect of the present disclosure provides a method for processing inferencing requests in a hierarchical overlay network of inferencing Points of Presence (iPoP), the method comprising storing a key value (KV) cache indexed using KV cache identifiers, receiving, from a user, an inferencing request at an iPoP, determining, by the iPoP, whether the local KV cache contains relevant KV cache information, in response to determining that relevant KV cache information is available, processing the inferencing request locally by the iPoP, or, in response to determining that the relevant KV cache information is not available, forwarding, by the iPoP, the inferencing request to an upstream iPoP of the multiple iPoPs, and aggregating the inferencing requests and generating context information and KV cache identifiers based on the received inferencing request.
[0019] The method may further comprise, in response to determining that the relevant KV cache information is available at the iPoP, decompressing the KV cache entry, processing the inferencing request using the decompressed KV cache entry, and returning an inference response to the user.
[0020] The method may further comprise pre-filling, by the iPoP, the KV cache with compressed context information based on historical inferencing requests.
[0021] The method may further comprise pushing KV cache entries downstream from upstream iPoPs to downstream iPoPs of the multiple iPoPs based on predicted demand, and pulling KV cache entries from upstream iPoPs based on the aggregated inferencing requests.
[0022] The method may further comprise determining a popularity metric based on at least one of the aggregated inferencing requests, time of day, and geographic distribution, and proactively pushing KV cache entries to downstream iPoPs of the multiple iPoPs based on the popularity metric and historical usage patterns.
[0023] The method may further comprise calculating the popularity metric for specific context information indexed over the context portion of KV cache entries, and reactively pulling popular KV cache entries from upstream iPoPs based on this metric.
[0024] A third aspect of the present disclosure provides a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the system to perform the method described herein.
[0025] These and other aspects of the disclosure will be apparent from the embodiment(s) described below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] In order that the present disclosure may be more readily understood, embodiments of the disclosure will now be described, by way of example, with reference to the accompanying drawings, in which:
[0027] FIG. 1 is a schematic representation of a hierarchical inferencing overlay network according to an example;
[0028] FIG. 2 is a flow chart of a method for processing inferencing requests in a hierarchical overlay network of inferencing Points of Presence (iPoP) according to an example; and
[0029] FIG. 3 is a schematic representation of a system for processing inferencing requests in a hierarchical inferencing overlay network according to an example.
DETAILED DESCRIPTION
[0030] Example embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein.
[0031] Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.
[0032] The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof.
[0033] Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.
[0034] Large language model inference for AI applications often faces significant challenges due to the inherent computational and network latency costs involved in delivering rapid, relevant responses to users. Specifically, reducing the "time to first token" (TTFT) for inference requests has become critical to providing a seamless and responsive user experience, as latency in response times can negatively impact the effectiveness of AI services. Current solutions rely heavily on key value (KV) cache systems to expedite inferencing by storing and accessing relevant context data. However, these systems are often constrained by network limitations and the computational demands of decompressing and processing the context data at various network endpoints. To address these constraints, prior methods have focused on compressing KV caches to reduce network load and expedite data transfer, though they often still suffer from inefficiencies when deployed in distributed environments, such as data centres, or when faced with varying demand across user locations.
[0035] According to an example, there is provided an optimised system to reduce TTFT by streaming KV cache information to strategically located inferencing Points of Presence (iPoPs) that act as proxies for inferencing requests. More specifically, these iPoPs serve as intermediate caches and processing nodes, dynamically managing inferencing requests based on the availability of context data. The mechanism leverages a chunking and bitstream representation of KV caches, allowing for efficient, both proactive and reactive, streaming of context data to iPoPs. This structure minimises the need for frequent upstream requests and provides localised, low-latency responses where possible. In operation, a user’s device or service connects to a designated iPoP, where inferencing requests are processed if relevant KV cache data is available. Should the iPoP lack necessary data or processing capabilities, it forwards the request to a higher-level iPoP or the LLM provider for completion. This hierarchical iPoP overlay, designed similarly to DNS or content delivery networks, can autonomously adjust to user demand patterns by prefetching or caching popular KV entries, thus optimising the balance between computational and network resources.
[0036] Examples in the present disclosure can be provided as methods, systems or machine-readable instructions, such as any combination of software, hardware, firmware or the like. Such machine-readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc. ) having computer readable program codes therein or thereon.
[0037] The present disclosure is described with reference to flow charts and / or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow diagrams described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. In some examples, some blocks of the flow diagrams may not be necessary and / or additional blocks may be added. It shall be understood that each flow and / or block in the flow charts and / or block diagrams, as well as combinations of the flows and / or diagrams in the flow charts and / or block diagrams can be realized by machine readable instructions.
[0038] The machine-readable instructions may, for example, be executed by a machine such as a general-purpose computer, user equipment such as a smart device, e.g., a smart phone, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams. In particular, a processor or processing apparatus may execute the machine-readable instructions. Thus, modules of apparatus (for example, a module implementing a comparator unit, or a firewall structure and so on) may be implemented by a processor executing machine readable instructions stored in a memory, or a processor operating in accordance with instructions embedded in logic circuitry. The term 'processor' is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate set etc. The methods and modules may all be performed by a single processor or divided amongst several processors.
[0039] Such machine-readable instructions may also be stored in a computer readable storage that can guide the computer or other programmable data processing devices to operate in a specific mode. For example, the instructions may be provided on a non-transitory computer readable storage medium encoded with instructions, executable by a processor.
[0040] FIG. 1 is a schematic representation of a hierarchical inferencing overlay network 100 according to an example. In particular, FIG. 1 depicts the hierarchical arrangement of multiple inferencing Points of Presence (iPoP). The arrangement may be structured to optimise response times by distributing iPoPs across various network layers. The network 100 may comprise multiple iPoPs 101, 102, 103, organised in layers from local networks up through regional and wide-area networks. A user device 105 may initiate inferencing requests, which may be directed to the closest available iPoP (in the example of FIG. 1, iPoP 101). In this example, the iPoPs 101, 102, 103 are connected hierarchically, with local networks 111 forming the bottom layer and regional/metro networks 112 positioned above. These regional networks 112 connect to a wide-area network 113 that ultimately links to an LLM processing centre 107.
[0041] FIG. 2 is a flow chart of a method for processing inferencing requests in a hierarchical overlay network of inferencing Points of Presence (iPoP) according to an example. The method comprises, in block 201, storing a key value (KV) cache indexed using KV cache identifiers. As used herein, the term “key value cache” may refer to a data storage structure that stores pairs of keys and values, wherein the key is a unique identifier representing specific context information, and the value is the data or context associated with that key.
[0042] In block 202, the method comprises receiving, from a user, an inferencing request at an iPoP. The inferencing request may refer to a query or task requiring the system to process and generate a response based on large language model capabilities. For example, the request could involve natural language processing tasks such as text generation, summarisation, question answering, or sentiment analysis. The iPoP may not necessarily be limited to a central server or processing unit; it could also be a user device, such as a smartphone, personal computer, or any other endpoint device capable of making inferencing requests. Alternatively, the request may be transmitted from the user device to the iPoP for processing.
[0043] The method comprises, in block 203, determining, by the iPoP, whether the local KV cache contains relevant KV cache information. In other words, the iPoP checks its local KV cache to determine whether the data required to process the inferencing request is already stored. The relevant KV cache information may include, for example, previously stored context data, which can be used to generate a response to the inferencing request. As such, the iPoP can quickly assess whether it can serve the request using locally available cached information, or if additional data retrieval is necessary.
[0044] The method comprises, in block 204, in response to determining that relevant KV cache information is available, processing the inferencing request locally by the iPoP, or, in response to determining that the relevant KV cache information is not available, forwarding, by the iPoP, the inferencing request to an upstream iPoP of the multiple iPoPs. This means that, if the relevant KV cache information is available in the current iPoP’s local KV cache, the iPoP can use the cached context data to handle the inferencing request without needing to forward the request elsewhere, achieving a reduction in latency. In the case of iPoP inferencing, KV cache entries may be pushed to the iPoP closest to the user. This setup enables the iPoP to respond to requests using its local KV cache without requiring further processing resources on the endpoint device itself, though it may introduce some additional latency compared to a local endpoint response (discussed later in this specification) .
[0045] In cases where the relevant KV cache information is not available locally, the iPoP forwards the inferencing request to an upstream iPoP of the multiple iPoPs. The decision to forward the request upstream assumes that a higher-level or upstream iPoP may possess the required data, due to the hierarchical nature of the network as depicted in (and described in relation to) FIG. 1.
[0046] The upstream iPoP (i.e., the iPoP which the request has been forwarded to) may then similarly check its own cache to determine whether the relevant KV cache information is available locally. If the upstream iPoP cannot fulfil the request, the inferencing request may be passed on to the next upstream iPoP in the chain. This process may continue until the inferencing request reaches an iPoP that has the required context or, in some cases, until it reaches the LLM provider or data centre, which holds the most comprehensive context data. The LLM provider may then process the request using its own knowledge base and return the response, which can then be forwarded back down the chain of iPoPs to the originally requesting user.
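The forwarding chain described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the class `IPoP`, the method `handle`, the node names and the `"llm-provider"` fallback label are all hypothetical, and the KV cache is reduced to a plain dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IPoP:
    """Hypothetical iPoP node: a local KV cache plus a link to its upstream."""
    name: str
    upstream: Optional["IPoP"] = None
    kv_cache: dict[str, bytes] = field(default_factory=dict)

    def handle(self, kv_id: str) -> str:
        """Return the name of the node that ends up serving the request."""
        if kv_id in self.kv_cache:
            return self.name                    # hit: process locally
        if self.upstream is not None:
            return self.upstream.handle(kv_id)  # miss: forward upstream
        return "llm-provider"                   # top of the chain reached


# Three layers, loosely mirroring iPoPs 101, 102 and 103 of FIG. 1.
wide = IPoP("ipop-103", kv_cache={"ctx-a": b"..."})
regional = IPoP("ipop-102", upstream=wide)
local = IPoP("ipop-101", upstream=regional, kv_cache={"ctx-b": b"..."})

served_by = [local.handle(k) for k in ("ctx-b", "ctx-a", "ctx-z")]
```

In this toy run the first request is served at the local node, the second two levels up, and the third falls through to the provider.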
[0047] To further optimise data availability, the system may aggregate inferencing requests based on the received requests’ popularity. This allows the system to assess which KV cache entries are most frequently used over time or in response to certain thresholds, and accordingly place popular context information into the local KV cache based on observed local demand information. This reactive pulling of KV cache entries from upstream iPoPs enables faster response to commonly requested inferencing contexts and allows the cache contents to be adjusted as demand evolves.
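One way to picture threshold-based reactive pulling is the sketch below. It assumes a simple per-identifier request counter and a fixed threshold; `RequestAggregator`, `pull_threshold` and `record` are hypothetical names, and the actual upstream transfer is replaced by adding the identifier to a set.

```python
from collections import Counter


class RequestAggregator:
    """Counts requests per KV cache identifier and flags popular ones."""

    def __init__(self, pull_threshold):
        self.pull_threshold = pull_threshold  # assumed tuning parameter
        self.counts = Counter()
        self.local_cache = set()

    def record(self, kv_id):
        """Record one request; return True if the entry is pulled now."""
        self.counts[kv_id] += 1
        should_pull = (
            kv_id not in self.local_cache
            and self.counts[kv_id] >= self.pull_threshold
        )
        if should_pull:
            self.local_cache.add(kv_id)  # stand-in for an upstream pull
        return should_pull


agg = RequestAggregator(pull_threshold=3)
decisions = [agg.record("ctx-a") for _ in range(4)]
```

Here the third request for the same context triggers the pull, and later requests find the entry already cached.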
[0048] In block 205, the method comprises aggregating the inferencing requests and generating context information and KV cache identifiers based on the received inferencing request. Additionally, proactive strategies may also be employed to push KV cache entries downstream, enabling iPoPs to anticipate demand. This proactive caching may rely on planning statistics, including expected usage at specific times of day, geographic distribution of expected demand, and patterns from previous context usage to pre-fill downstream iPoP caches, thereby improving TTFT for anticipated requests and minimising the need for upstream requests.
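A proactive push plan driven by planning statistics might look like the following sketch. The disclosure does not define how the popularity metric is computed, so the scoring here (summing historical request counts that match an hour of day and a region) is purely an assumption, as are the names `UsageRecord`, `popularity` and `plan_push`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    kv_id: str
    hour: int      # hour of day, 0-23
    region: str
    requests: int


def popularity(records, kv_id, hour, region):
    """Sum historical requests for kv_id at the given hour and region."""
    return sum(
        r.requests for r in records
        if r.kv_id == kv_id and r.hour == hour and r.region == region
    )


def plan_push(records, hour, region, top_n):
    """Pick up to top_n entries with non-zero popularity to push downstream."""
    kv_ids = sorted({r.kv_id for r in records})
    ranked = sorted(kv_ids, key=lambda k: -popularity(records, k, hour, region))
    return [k for k in ranked[:top_n] if popularity(records, k, hour, region) > 0]


history = [
    UsageRecord("ctx-news", 8, "eu", 120),
    UsageRecord("ctx-news", 20, "eu", 15),
    UsageRecord("ctx-sport", 20, "eu", 90),
    UsageRecord("ctx-docs", 8, "eu", 40),
    UsageRecord("ctx-docs", 8, "us", 300),
]
morning_push = plan_push(history, hour=8, region="eu", top_n=2)
evening_push = plan_push(history, hour=20, region="eu", top_n=2)
```

The same history yields different push lists for the morning and the evening, which is the point of conditioning on time of day and geography.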
[0049] In another example, where iPoPs act as storage-only points, KV cache entries may be pulled directly to endpoint devices, such as smartphones or personal computers, which may be equipped with NPUs or similar processors to decompress and decode the KV cache for local inferencing tasks. This approach allows endpoints to perform inferencing requests using locally held data, achieving the fastest TTFT in cases where the KV cache contains the required context, as there is no need for additional network traversal beyond the iPoP-to-endpoint transmission. This endpoint-local inferencing approach is particularly advantageous for frequently repeated requests, as it eliminates latency associated with iPoP processing and reduces overall system load by decentralising processing requirements.
[0050] Advantageously, the above-described approach improves the TTFT metric for inferencing by strategically pushing context information closer to the user. The system leverages KV caches to balance the computational costs associated with real-time context calculation for inferencing against the network costs of transmitting compressed KV cache information. Additionally, the system may function as a cache overlay, inferencing overlay, or a hybrid of both, in order to meet varied processing requirements. By reactively pulling KV cache information, the system can “frontload” frequently requested local information to improve response times. Alternatively, proactively pushing KV cache data based on anticipated inferencing usage patterns and system-wide analytics allows the system to meet demand in advance, thereby optimising resource distribution and reducing latency. The iPoP overlay further facilitates efficient distribution of KV cache data across the network, ensuring that context information is accessible at optimal locations to enhance TTFT.
[0051] FIG. 3 is a schematic representation of a system for processing inferencing requests in a hierarchical inferencing overlay network according to an example. The system 300 comprises multiple components within any one of the iPoPs 101, 102, 103 shown in FIG. 1. In other words, the system 300 (i.e., the components 310, 311, 312) resides in a single iPoP.
[0052] Each iPoP of the multiple iPoPs 101-103 is arranged to store a key value (KV) cache indexed using KV cache identifiers. The KV cache identifier may comprise a unique reference or key that is used to index and access specific pieces of cached data (the "value") in the KV cache. In particular, the KV cache identifiers may encode context information that reflects the specific conditions or data relevant to a particular inferencing task, such as prior interactions, system states, or environmental factors. Additionally, the KV cache identifiers may comprise a quantification of context, which represents the level of detail or precision of the cached data, indicating whether the cache contains broad overviews or more granular, specific information.
[0053] Furthermore, the KV cache identifiers may also encode the compression level of the KV cache, indicating how much the data has been compressed for transmission. While a higher compression level may reduce network overhead, it may also increase the computational load at the downstream iPoP when decompressing the data. The level of compression may be determined based on local iPoP decisions, with the system evaluating factors such as available bandwidth and processing capacity to choose the appropriate compression level. For efficiency purposes, the KV cache identifier may be hashed.
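As a rough illustration of a structured, hashable identifier and a local compression decision, consider the sketch below. The field layout, the 0 to 9 compression scale, the thresholds in `choose_compression` and the use of SHA-256 are all assumptions made for the example; the disclosure does not specify any of them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheId:
    context: str         # label for the context the entry represents
    quantification: int  # assumed: level of detail of the cached context
    compression: int     # assumed scale: 0 (none) to 9 (strongest)

    def structured(self) -> str:
        """Human-readable form carrying all three encoded parts."""
        return f"{self.context}/q{self.quantification}/c{self.compression}"

    def hashed(self) -> str:
        """Fixed-length key derived from the structured identifier."""
        return hashlib.sha256(self.structured().encode("utf-8")).hexdigest()


def choose_compression(bandwidth_mbps: float, cpu_headroom: float) -> int:
    """Toy policy: compress harder on slow links, but only if CPU allows."""
    if cpu_headroom < 0.2:
        return 1   # little spare compute downstream: keep decompression cheap
    if bandwidth_mbps < 100:
        return 9   # constrained link: trade compute for bytes
    return 4


level = choose_compression(bandwidth_mbps=50, cpu_headroom=0.6)
kv_id = KVCacheId(context="support-faq", quantification=2, compression=level)
key = kv_id.hashed()
```

Because the compression level is part of the identifier, two encodings of the same context at different levels hash to different keys and can coexist in the cache.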
[0054] Each iPoP of the multiple iPoPs 101-103 is further arranged to receive an inferencing request from a user and, in response to receiving the inferencing request, determine whether relevant KV cache information is available. In other words, the iPoP may check its local KV cache to see if it holds the necessary KV pairs that correspond to the specific context required to process the request.
[0055] In response to determining that the relevant KV cache information is available, each iPoP is arranged to process the inferencing request locally. Processing the inferencing request locally may involve retrieving the cached KV pairs, decompressing the data if necessary, and using the stored context to generate a response for the user.
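The local hit path (retrieve, decompress if necessary, use) can be sketched as follows. This is only an illustration: `LocalKVCache` is a hypothetical class, and general-purpose `zlib` compression stands in for whatever KV cache codec a real deployment would use.

```python
import zlib
from typing import Optional


class LocalKVCache:
    """Stores KV cache entries compressed and decompresses them on a hit."""

    def __init__(self):
        self._entries = {}

    def put(self, kv_id: str, kv_blob: bytes, level: int = 6) -> None:
        # zlib is a placeholder for a dedicated KV cache compression scheme.
        self._entries[kv_id] = zlib.compress(kv_blob, level)

    def get(self, kv_id: str) -> Optional[bytes]:
        compressed = self._entries.get(kv_id)
        if compressed is None:
            return None                    # miss: caller forwards upstream
        return zlib.decompress(compressed)  # hit: ready for local inferencing


cache = LocalKVCache()
cache.put("ctx-a", b"kv-tensor-bytes" * 100)
hit = cache.get("ctx-a")
miss = cache.get("ctx-z")
```

A `None` result signals the miss case, in which the request would be forwarded to an upstream iPoP instead.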
[0056] Conversely, in response to determining that the relevant KV cache information is not available, each iPoP of the multiple iPoPs 101-103 is arranged to forward the inferencing request to an upstream iPoP of the multiple iPoPs 101-103. When considered in relation to the iPoP 102, the iPoP 101 may comprise a downstream iPoP (or a user endpoint, e.g., a user device), and the iPoP 103 may comprise an upstream iPoP.
[0057] Each of the multiple iPoPs 101-103 comprises a KV cache manager 310 arranged to aggregate the inferencing requests received from a user and to coordinate retrieval, storage and distribution of KV cache information. The KV cache manager 310 may utilise a database indexed using the KV cache identifiers described above. Importantly, the KV cache manager 310 can help to ensure that the right KV cache data is available, enabling efficient processing of inferencing requests by either serving the cached data locally or forwarding the request to another iPoP that holds the necessary context.
[0058] Each of the multiple iPoPs 101-103 further comprises a KV streaming engine 311 and an inferencing engine 312. The KV streaming engine 311 is arranged to stream KV cache entries between the multiple iPoPs 101-103. In particular, the KV streaming engine 311 may facilitate the transfer of KV cache information in both directions (i.e., upstream or downstream). In other words, the KV streaming engine 311 may pull KV cache entries from upstream iPoPs when they are needed for processing requests that cannot be fulfilled locally, and push KV cache entries downstream to other iPoPs in anticipation of future requests, thus optimising both proactive and reactive KV cache management across the iPoP network.
[0059] The inferencing engine 312 is arranged to generate context information based on the inferencing request received from the user. In particular, the inferencing engine 312 may generate KV cache identifiers that correspond to the required context, allowing the KV cache manager 310 to locate and retrieve the necessary cached data. If the relevant KV cache information is not found locally, the KV cache manager 310 may instruct the KV streaming engine 311 to pull the required KV cache entries from an upstream iPoP.
[0060] Additionally, the inferencing engine 312 may be further arranged to pre-fill the local KV cache with compressed context information based on historical inferencing requests. This pre-filling may aid in optimising future requests, similar to content production in video streaming, where frequently requested context is stored in advance. The newly generated KV cache content may then be pushed to downstream iPoPs for storage and future use, thereby improving the overall system's performance by reducing the time to first token (TTFT) in subsequent inferencing tasks.
[0061] Various transport protocols may be employed to facilitate the pulling and pushing of KV cache entries within the network. For upstream KV cache pulls, protocols like HTTP or Media over QUIC may be used, leveraging container formats such as DASH with manifest files that define KV cache chunks for selective retrieval. Downstream KV cache pushes may use RTP, utilising stream identifiers derived from hashed structured KV cache identifiers to manage specific data streams efficiently. Additionally, different protocols can be adopted for iPoP overlay distribution, such as BIER or SD-WAN, which allow for optimised multicast distribution. For example, a push via RTP can use IP multicast addresses to deliver data to multiple iPoPs simultaneously over a BIER multicast overlay. Pull requests can also benefit from BIER by consolidating multiple requests for the same KV cache identifier into a single multicast response, thus reducing the overall network load.
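The consolidation of several pull requests for the same KV cache identifier into one response can be sketched at the application level as below. This does not model BIER, RTP or any real transport; `coalesce_pulls`, `serve` and `fake_fetch` are hypothetical helpers, and the fan-out loop merely stands in for a multicast delivery.

```python
from collections import defaultdict


def coalesce_pulls(pending):
    """Group (requesting_ipop, kv_id) pairs so each kv_id is fetched once."""
    waiters = defaultdict(list)
    for ipop, kv_id in pending:
        if ipop not in waiters[kv_id]:
            waiters[kv_id].append(ipop)
    return dict(waiters)


def serve(pending, fetch_upstream):
    """Fetch each distinct kv_id once and fan the result out to all waiters."""
    deliveries = {}
    for kv_id, ipops in coalesce_pulls(pending).items():
        payload = fetch_upstream(kv_id)          # one upstream response
        for ipop in ipops:
            deliveries[(ipop, kv_id)] = payload  # stand-in for multicast
    return deliveries


fetch_log = []


def fake_fetch(kv_id):
    """Hypothetical upstream fetch that records how often it is called."""
    fetch_log.append(kv_id)
    return f"chunks-of-{kv_id}".encode("utf-8")


pending = [
    ("ipop-a", "ctx-1"),
    ("ipop-b", "ctx-1"),
    ("ipop-c", "ctx-2"),
    ("ipop-a", "ctx-1"),  # duplicate request from the same iPoP
]
deliveries = serve(pending, fake_fetch)
```

Four pending requests result in two upstream fetches and three deliveries, which is the load reduction the paragraph above describes.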
[0062] According to an example, machine-readable instructions can be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing, thus the instructions executed on the computer or other programmable devices provide an operation for realizing functions specified by flow(s) in the flow charts and/or block(s) in the block diagrams.
[0063] Further, the teachings herein may be implemented in the form of a computer or software product, such as a non-transitory machine-readable storage medium, the computer software or product being stored in a storage medium and comprising a plurality of instructions, e.g., machine readable instructions, for making a computer device implement the methods recited in the examples of the present disclosure.
[0064] In some examples, some methods can be performed in a cloud-computing or network-based environment. Cloud-computing environments may provide various services and applications via the Internet. These cloud-based services (e.g., software as a service, platform as a service, infrastructure as a service, etc. ) may be accessible through a web browser or other remote interface of the user equipment for example. Various functions described herein may be provided through a remote desktop environment or any other cloud-based computing environment.
[0065] While various embodiments have been described and/or illustrated herein in the context of fully functional computing systems, one or more of these exemplary embodiments may be distributed as a program product in a variety of forms, regardless of the particular type of computer-readable-storage media used to actually carry out the distribution. The embodiments disclosed herein may also be implemented using software modules that perform certain tasks. These software modules may include script, batch, or other executable files that may be stored on a computer-readable storage medium or in a computing system. In some embodiments, these software modules may configure a computing system to perform one or more of the exemplary embodiments disclosed herein. In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another.
[0066] The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the exemplary embodiments disclosed herein. This exemplary description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the spirit and scope of the instant disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims and their equivalents in determining the scope of the instant disclosure.
Claims
1. A system (100) for processing inferencing requests in a hierarchical inferencing overlay network, the system comprising:
multiple inferencing Points of Presence, iPoP, (101, 102, 103) wherein each iPoP (101, 102, 103) is arranged to:
store a key value, KV, cache indexed using KV cache identifiers;
receive an inferencing request from a user and, in response to receiving the inferencing request, determine whether relevant KV cache information is available;
in response to determining that the relevant KV cache information is available, processing the inferencing request locally, or, in response to determining that the relevant KV cache information is not available, forward the inferencing request to an upstream iPoP of the multiple iPoPs (101, 102, 103),
wherein each iPoP of the multiple iPoPs (101, 102, 103) comprises:
a KV cache manager (310) arranged to aggregate the inferencing requests and to coordinate retrieval, storage and distribution of KV cache information between the multiple iPoPs (301, 302, 303),
a KV streaming engine (311) arranged to stream KV cache entries between the multiple iPoPs (101, 102, 103), and
an inferencing engine (312) arranged to generate context information and KV cache identifiers based on the received inferencing request.

2. The system of claim 1, wherein the KV cache identifiers encode context information, a quantification of context and compression levels of the KV cache.

3. The system of claim 1 or 2, wherein the inferencing engine (312) is further arranged to pre-fill the KV cache with compressed context information based on historical inferencing requests.

4. The system of claim 1, 2 or 3, wherein, to stream KV cache entries between the multiple iPoPs, the KV streaming engine (311) is further arranged to:
push KV cache entries downstream from upstream iPoPs to downstream iPoPs of the multiple iPoPs (101, 102, 103) based on predicted demand; and
pull KV cache entries from upstream iPoPs based on inferencing requests aggregated by the KV cache manager.

5. The system of claim 4, wherein the KV cache manager (310) is further arranged to determine a popularity metric based on at least one of the aggregated inferencing requests, time of day, and geographic distribution, and
instruct the KV streaming engine to proactively push KV cache entries to downstream iPoPs of the multiple iPoPs (101, 102, 103) based on the popularity metric and historical usage patterns.

6. The system of claim 5, wherein the KV cache manager (310) is further arranged to calculate the popularity metric for specific context information indexed over the context portion of KV cache entries, and instruct the KV streaming engine to reactively pull popular KV cache entries from upstream iPoPs based on this metric.

7. The system of any one of claims 1 to 6, wherein the KV cache manager (310), in coordination with the KV streaming engine (311), is arranged to control a compression level encoded in the KV cache identifier to manage network bandwidth usage for streaming KV cache entries.

8. The system of any one of claims 1 to 7, wherein each iPoP of the multiple iPoPs (101, 102, 103) is arranged to: generate a response in response to the received inferencing request, and/or forward the inferencing request to an endpoint for the response to be generated outside the iPoP.

9. A method for processing inferencing requests in a hierarchical overlay network of inferencing Points of Presence, iPoP, the method comprising:
storing a key value, KV, cache indexed using KV cache identifiers (201);
receiving, from a user, an inferencing request at an iPoP (202);
determining, by the iPoP, whether the local KV cache contains relevant KV cache information (203);
in response to determining that relevant KV cache information is available, processing the inferencing request locally by the iPoP, or, in response to determining that the relevant KV cache information is not available, forwarding, by the iPoP, the inferencing request to an upstream iPoP of the multiple iPoPs (204); and
aggregating the inferencing requests and generating context information and KV cache identifiers based on the received inferencing request (205).

10. The method of claim 9, further comprising, in response to determining that the relevant KV cache information is available at the iPoP:
decompressing the KV cache entry;
processing the inferencing request using the decompressed KV cache entry; and
returning an inference response to the user.

11. The method of claim 9 or 10, further comprising:
pre-filling, by the iPoP, the KV cache with compressed context information based on historical inferencing requests.

12. The method of claim 9, 10 or 11, further comprising:
pushing KV cache entries downstream from upstream iPoPs to downstream iPoPs of the multiple iPoPs based on predicted demand; and
pulling KV cache entries from upstream iPoPs based on the aggregated inferencing requests.

13. The method of claim 12, further comprising:
determining a popularity metric based on at least one of the aggregated inferencing requests, time of day, and geographic distribution, and
proactively pushing KV cache entries to downstream iPoPs of the multiple iPoPs based on the popularity metric and historical usage patterns.

14. The method of claim 13, further comprising:
calculating the popularity metric for specific context information indexed over the context portion of KV cache entries, and reactively pulling popular KV cache entries from upstream iPoPs based on this metric.

15. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, causes the system to perform the method of any of claims 9 to 14.