Resource sharing method and computing device
By allocating a dedicated region for each user and using a region controller and cache manager, the security risks caused by sharing the large language model's GPU memory space are resolved, achieving physical isolation and enhanced security, and improving resource utilization and the stability of the inference process.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XFUSION DIGITAL TECH CO LTD
- Filing Date
- 2026-01-26
- Publication Date
- 2026-06-12
Figure CN122195884A_ABST
Description
Technical Field
[0001] This application relates to the field of storage technology, and in particular to a resource sharing method and computing device.
Background Technology
[0002] Large language models can be deployed on cloud computing or hyperconverged infrastructure platforms to provide users with AI (Artificial Intelligence) inference services. When providing inference services, large language models use a KVCache (Key-Value Cache) to cache the keys and values of historical tokens, so that the calculation of each new token only needs to be based on the cached historical data, rather than recalculating, thus reducing time complexity.
[0003] Currently, the KVCaches of all users share the same GPU memory space. For any inference request, a storage space is allocated from this GPU memory space, and the KVCache of the request is kept in that storage space. However, because this GPU memory space is shared by multiple users, it carries security risks, and the GPU memory space is poorly protected.
Summary of the Invention
[0004] This application provides a resource sharing method and computing device that store each user's data in a dedicated region, so that different users do not share the same GPU memory space. This removes the associated security risks and improves the storage security of the GPU memory space.
[0005] According to a first aspect of the embodiments of this application, a resource sharing method is provided, including: receiving an inference request that includes a target user identifier, where the target user is the user who needs to use the inference service of the large language model; querying a user region table based on the target user identifier, where the user region table includes the association between a user's user identifier and the region identifier of the region assigned to that user; if a target region identifier that matches the target user identifier is found in the user region table, obtaining the target physical block corresponding to the target region identifier; and storing, based on the target physical block, the key-value cache generated during the execution of the inference request.
[0006] In this embodiment, upon receiving an inference request containing a target user identifier, the system queries the user region table based on the target user identifier to obtain the matching target region identifier. The target region identifier identifies the region allocated to the user, and physical blocks are allocated to that region, so the key-value cache generated during the execution of the inference request is stored in the target physical block corresponding to the target region identifier. This ensures that the key-value cache of a given user is stored within that user's target region, which isolates users from one another at the level of physical blocks, avoids the reuse of physical blocks across users, reduces the risk of unauthorized access, and improves the storage security of the key-value cache.
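The lookup-then-store path of the first aspect can be illustrated with a minimal Python sketch. The names `Region`, `RegionStore`, `lookup_target_blocks`, and `store_kv` are hypothetical and do not appear in the application, and writing into the first block of the region is a simplification made only for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Region:
    """A per-user region that owns a list of physical block ids (hypothetical layout)."""
    region_id: int
    physical_blocks: list = field(default_factory=list)


class RegionStore:
    def __init__(self) -> None:
        self.user_region_table = {}  # user identifier -> region identifier
        self.regions = {}            # region identifier -> Region
        self.block_data = {}         # physical block id -> cached KV entries

    def lookup_target_blocks(self, user_id: str) -> Optional[list]:
        """Return the physical blocks of the user's region, or None if no region is assigned."""
        region_id = self.user_region_table.get(user_id)
        if region_id is None:
            return None
        return self.regions[region_id].physical_blocks

    def store_kv(self, user_id: str, kv_entry: tuple) -> int:
        """Store one KV entry in the user's own region and return the block id used.

        Simplification: the entry always goes into the first block of the region.
        """
        blocks = self.lookup_target_blocks(user_id)
        if not blocks:
            raise LookupError(f"no region or physical blocks for user {user_id!r}")
        block_id = blocks[0]
        self.block_data.setdefault(block_id, []).append(kv_entry)
        return block_id


store = RegionStore()
store.regions[1] = Region(1, [4, 5])
store.regions[2] = Region(2, [6])
store.user_region_table.update({"user_a": 1, "user_b": 2})
```

Because each user identifier resolves to exactly one region, a request from `user_a` can only ever reach blocks 4 and 5 in this sketch, and a request from `user_b` only block 6.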
[0007] In conjunction with the first aspect, in certain implementations of the first aspect, obtaining the target physical block corresponding to the target region identifier includes: determining, based on the target region identifier, the target logical block associated with the target region allocated to the target user; and querying the cache table of the target region to obtain the target physical block associated with the target logical block, where the cache table includes the association between logical blocks and physical blocks.
[0008] In this embodiment, logical blocks serve as a bridge between regions and physical blocks. This allows the target logical block associated with the target region allocated to the target user to be identified through the target region identifier. Then, the cache table for that target region is queried. This cache table records the association between logical blocks and physical blocks, thus enabling the accurate retrieval of the target physical block corresponding to the target logical block. Based on the region association and cache table query mechanism, the targeted location and efficient association of cached resources are ensured. This not only guarantees the regional isolation of user cached data but also improves the efficiency of cache resource lookup and reuse during large model inference.
[0009] In conjunction with the first aspect, in certain implementations of the first aspect, determining the target logical block associated with the target region allocated to the target user based on the target region identifier includes: calculating the target root hash value of the inference request based on the target region identifier and the token identifiers (token_ids); querying the prefix cache index table based on the target root hash value, where the prefix cache index table includes the association between root hash values and logical block queues; if the target root hash value is found in the prefix cache index table, determining the logical block queue associated with the target root hash value as the target logical block queue; and identifying the logical blocks in the target logical block queue as the target logical blocks.
[0010] In this embodiment, the target region identifier is added to the prefix cache index calculation, ensuring that the prefix cache index is obtained through a unique target region identifier. Since each user's region identifier is unique, the target root hash value is calculated using this unique identifier. This uniqueness ensures that the logical block queue corresponding to the target root hash value can only be hit by users with the same region identifier, thus avoiding access to logical block queues across users. Furthermore, the use of the prefix cache still allows for the reuse of key-value caches with the same prefix for different requests, guaranteeing key-value cache performance. Therefore, cache security can be improved without sacrificing computational performance.
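The effect of mixing the region identifier into the root hash can be sketched in Python as follows. The application does not name a hash function or an encoding, so SHA-256 and the string encoding used here are assumptions, as are the names `root_hash`, `prefix_cache_index`, and `find_logical_blocks`.

```python
import hashlib
from typing import Optional


def root_hash(region_id: int, token_ids: list) -> str:
    """Hash the region identifier together with the token ids.

    SHA-256 and the "region|t1,t2,..." encoding are assumptions for illustration.
    """
    payload = f"{region_id}|" + ",".join(str(t) for t in token_ids)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Prefix cache index table: root hash value -> logical block queue.
prefix_cache_index = {}


def find_logical_blocks(region_id: int, token_ids: list) -> Optional[list]:
    """Return the logical block queue for this region and prefix, or None on a miss."""
    return prefix_cache_index.get(root_hash(region_id, token_ids))


prompt = [101, 7592, 2088]                       # hypothetical token ids
prefix_cache_index[root_hash(1, prompt)] = [0, 1]

same_region = find_logical_blocks(1, prompt)     # same region, same prefix: hit
other_region = find_logical_blocks(2, prompt)    # other region, same prefix: miss
```

The same prompt produces different root hash values in different regions, so a second request from the same user reuses the cached prefix while another user's identical prompt cannot hit it.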
[0011] In conjunction with the first aspect, some implementations of the first aspect also include: If the target root hash value is not found in the prefix cache index table, a new logical block queue is constructed. The new logical block queue includes multiple logical blocks, and each logical block is associated with a physical block. The target root hash value and the new logical block queue are associated and stored in the prefix cache index table.
[0012] In this embodiment, when the target root hash value is not found in the prefix cache index table, a new logical block queue containing multiple associated physical blocks can be constructed. The target root hash value is then associated with and stored in the prefix cache index table, enabling dynamic creation and indexing of the new logical block queue. This ensures the timely supply of cache resources required for inference in miss scenarios, provides a cache foundation for the reuse of the same prefix in subsequent scenarios, improves the prefix cache index system, and further enhances the coverage and long-term utilization efficiency of cache resources in large model inference. Simultaneously, relying on the association between logical blocks and physical blocks, it ensures the effective storage and management of the new block queue.
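The miss path can be sketched in Python as a get-or-create operation on the prefix cache index table. The block identifiers, the `free_physical` pool, and the function name `get_or_create_queue` are hypothetical, and how many blocks a request needs is simply passed in as a parameter.

```python
from itertools import count

_logical_ids = count()              # hypothetical generator of logical block ids
free_physical = [10, 11, 12, 13]    # hypothetical pool of free physical block ids

prefix_cache_index = {}             # root hash value -> logical block queue
cache_table = {}                    # logical block id -> physical block id


def get_or_create_queue(target_root_hash: str, blocks_needed: int) -> list:
    """Return the queue indexed by the root hash, building and indexing it on a miss."""
    queue = prefix_cache_index.get(target_root_hash)
    if queue is not None:
        return queue                # hit: reuse the existing logical block queue
    if blocks_needed > len(free_physical):
        raise MemoryError("not enough free physical blocks")
    queue = []
    for _ in range(blocks_needed):
        logical = next(_logical_ids)
        cache_table[logical] = free_physical.pop(0)  # bind logical block to a physical block
        queue.append(logical)
    prefix_cache_index[target_root_hash] = queue     # store the new association
    return queue


first = get_or_create_queue("abc", 2)    # miss: builds a queue of two logical blocks
second = get_or_create_queue("abc", 2)   # hit: returns the same queue
```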
[0013] In conjunction with the first aspect, in certain implementations of the first aspect, constructing a new logical block queue includes: allocating new physical blocks according to the storage requirements of the inference request and determining a new logical block queue based on the new physical blocks; or, alternatively, requesting a new physical block queue from the region controller and associating the new physical block queue with a new logical block queue.
[0014] In this embodiment, new physical blocks can be allocated according to the storage requirements of inference requests, and new logical block queues can be determined based on the new physical blocks, or new physical block queues can be requested from the region controller and associated with the new logical block queues. This makes the new logical block queue construction mechanism more flexible, supporting both dynamic allocation of cache resources based on storage requirements of inference requests and coordinated acquisition of physical resources through the region controller to ensure cache supply when resources are scarce in the local region, further improving the flexibility and efficiency of cache management in large model inference.
[0015] In conjunction with the first aspect, in some implementations of the first aspect, before storing the key-value cache generated during the execution of the inference request based on the target physical block, the following is also included: If no target region identifier matching the target user identifier is found in the user region table, a new region is assigned to the target user, and physical blocks are allocated to the new region. The physical block allocated to the new region is identified as the target physical block.
[0016] In this embodiment, when no target region identifier matching the target user identifier is found in the user region table, a new region can be allocated to the target user, and physical blocks can be allocated to the new region to obtain the target physical blocks. This achieves dynamic resource adaptation for new users or users without allocated regions. By automatically creating dedicated regions and allocating physical blocks, it ensures that each user can obtain an independent cache storage medium. This not only guarantees the smooth execution of inference requests from new users but also maintains the independence and security of cached data for different users through the region isolation mechanism, improving the compatibility of the large model inference service with new users and the efficiency of resource allocation.
[0017] In conjunction with the first aspect, in certain implementations of the first aspect, physical blocks are allocated to the new region, including: Allocate free physical blocks from the free block list to the new region. The free block list includes at least one free physical block, which is a physical block that is in an idle state.
[0018] In this embodiment, if no physical blocks have been allocated to the target user's region, free physical blocks can be allocated from the free block list to the target user's new region. Because the free block list tracks free physical blocks in real time, free physical blocks can be scheduled and allocated promptly, which makes establishing a user region faster and more reliable.
[0019] In conjunction with the first aspect, in certain implementations of the first aspect, a key-value cache generated during the execution of the inference request is stored based on the target physical block, including: If the target physical block can meet the storage requirements of the inference request, execute the inference request and cache the key-value pairs generated during the inference request process in the target physical block; If the target physical block cannot meet the storage requirements of the inference request, a new physical block is allocated from the free block list to the target physical block.
[0020] In this embodiment, if the target physical block can meet the storage requirements of the inference request, inference is executed directly and the key-value cache is stored therein. If the target physical block cannot meet the storage requirements of the inference request, a new physical block is allocated from the free block list to supplement the target physical block. This ensures efficient execution of the inference process when storage resources are sufficient, and flexibly adapts to the growth of cache requirements during inference by dynamically supplementing free blocks, avoiding inference interruptions due to insufficient resources. In addition, managing idle resources through the free list enables on-demand scheduling and efficient reuse of physical blocks, improving the utilization efficiency of storage resources and ensuring the continuity and stability of inference service execution.
[0021] In conjunction with the first aspect, in certain implementations of the first aspect, allocating a new physical block from the free block list to the target physical block includes: If a free physical block exists in the free block list, add the free physical block to the target physical block; If no free physical blocks exist in the free block list, a physical block reclamation strategy is executed for each region. The reclaimed physical blocks are added to the free block list as free physical blocks, and the free physical blocks in the free block list are added to the target physical block.
[0022] In this embodiment, when available free physical blocks exist in the free block list, they can be directly added to the target physical block to meet storage requirements. When no available free physical blocks exist in the free block list, a physical block reclamation strategy can be implemented for each region, and the reclaimed physical blocks can be added to the free list and redistributed to the target physical block. This approach provides a rapid response when storage resources are sufficient, and addresses resource scarcity through proactive reclamation strategies when storage resources are lacking, preventing inference interruptions due to insufficient resources. Furthermore, the reclamation strategy revitalizes and reuses cached resources, improving the overall utilization of physical blocks while ensuring the continuous fulfillment of cache requirements during large model inference through a dynamic replenishment mechanism, thus enhancing the stability and resource supply elasticity of the inference service.
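The allocate-or-reclaim behavior can be sketched in Python as follows. The function name `extend_target_blocks` and the `reclaim` callback are hypothetical; the callback stands in for whatever reclamation strategy is executed for each region.

```python
from typing import Callable


def extend_target_blocks(
    target_blocks: list,
    free_list: list,
    reclaim: Callable[[], list],
) -> int:
    """Add one free physical block to target_blocks and return its id.

    If the free block list is empty, the reclamation callback runs first and
    the blocks it returns are added to the free block list.
    """
    if not free_list:
        free_list.extend(reclaim())
    if not free_list:
        raise MemoryError("no physical block could be reclaimed")
    block = free_list.pop(0)
    target_blocks.append(block)
    return block


target = [1]
free = [2]
first = extend_target_blocks(target, free, lambda: [])        # served from the free list
second = extend_target_blocks(target, free, lambda: [7, 8])   # free list empty: reclaim first
```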
[0023] In conjunction with the first aspect, in certain implementations of the first aspect, executing the physical block reclamation strategy for each region includes: iterating through all regions to obtain the regions that are currently idle; querying unused physical blocks among the physical blocks associated with the idle regions or, if none are found there, querying unused physical blocks among the physical blocks associated with the target region corresponding to the target region identifier; and releasing the unused physical blocks.
[0024] In this embodiment, all regions are traversed to filter out idle regions. Unused physical blocks are first queried from the physical blocks associated with idle regions. If no unused blocks are found, the search continues from the physical blocks associated with the target region. Finally, the found unused physical blocks are released. This achieves accurate identification and efficient release of physical storage resources, prioritizing the reclamation of unused physical blocks in globally idle regions. Furthermore, the reclamation of physical blocks associated with the target region is a fallback query for resources in the target region, maximizing the utilization of idle cache resources while avoiding the erroneous release of valid resources within the target region. This significantly improves the recycling rate of cache resources and provides ample idle resource reserves for the dynamic scheduling of resources for large model inference services.
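The two-stage reclamation order (idle regions first, target region as a fallback) can be sketched in Python. The `idle` flag and the `in_use` set are hypothetical ways of representing "currently idle" and "unused"; the application does not specify how these states are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    region_id: int
    idle: bool                                   # hypothetical "currently idle" flag
    blocks: list = field(default_factory=list)   # physical blocks owned by the region
    in_use: set = field(default_factory=set)     # blocks that still hold live KV data

    def release_unused(self) -> list:
        """Detach and return the blocks that are not in use."""
        unused = [b for b in self.blocks if b not in self.in_use]
        self.blocks = [b for b in self.blocks if b in self.in_use]
        return unused


def reclaim(regions: list, target_region_id: int) -> list:
    """Release unused blocks from idle regions, falling back to the target region."""
    released = []
    for region in regions:
        if region.idle:
            released.extend(region.release_unused())
    if not released:
        for region in regions:
            if region.region_id == target_region_id:
                released.extend(region.release_unused())
    return released


a = Region(1, idle=False, blocks=[0, 1], in_use={0})
b = Region(2, idle=True, blocks=[2, 3], in_use={2})
first_pass = reclaim([a, b], target_region_id=1)    # block 3 comes from idle region 2
second_pass = reclaim([a, b], target_region_id=1)   # nothing idle left: fall back to region 1
```

Blocks listed in `in_use` are never released in either stage, which mirrors the requirement that valid resources of the target region are not freed by mistake.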
[0025] According to a second aspect of the embodiments of this application, a resource sharing device is provided, the resource sharing device comprising: a transceiver unit and a processing unit.
[0026] The transceiver unit is used to receive inference requests, which include the target user identifier of the target user. The target user refers to the user who needs to use the inference service of the large language model.
[0027] The processing unit can be used to: query a user region table based on the target user identifier, where the user region table includes the association between the user's user identifier and the region identifiers of the regions assigned to the user; if a target region identifier matching the target user identifier is found in the user region table, retrieve the target physical block corresponding to the target region identifier; and store the key-value cache generated during the execution of the inference request based on the target physical block.
[0028] According to a third aspect of the embodiments of this application, a computing device is provided, including: a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program to implement any of the above-described resource sharing methods.
[0029] According to a fourth aspect of the embodiments of this application, a communication device is provided, including a transceiver and a processor, wherein the transceiver is used to receive or send data, and the processor is used to execute any of the resource sharing methods of the embodiments of this application.
[0030] According to a fifth aspect of the embodiments of this application, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, it implements any of the resource sharing methods.
[0031] According to a sixth aspect of the present application, a computer product is provided, comprising: a computer program that, when executed by a processor, implements the steps of any resource sharing method.
[0032] It should be understood that both the foregoing general description and the following detailed description are exemplary and intended to provide further illustration of the claimed technology.
Attached Figure Description
[0033] The above and other objects, features, and advantages of the embodiments of this application will become more apparent from the more detailed description of the embodiments in conjunction with the accompanying drawings. The accompanying drawings are used to provide a further understanding of the embodiments of this application and constitute a part of the specification. They are used together with the embodiments of this application to explain the embodiments of this application and do not constitute a limitation thereof. In the accompanying drawings, the same reference numerals generally represent the same components or steps.
[0034] Figure 1 shows an example of an application scenario of a large language model according to an embodiment of this application;
Figure 2 shows an example diagram of a resource reclamation principle according to an embodiment of this application;
Figure 3 shows an example diagram of a physical block mapping according to an embodiment of this application;
Figure 4 shows a flowchart of a resource sharing method according to an embodiment of this application;
Figure 5 shows an application example of a resource sharing method according to an embodiment of this application;
Figure 6 shows another flowchart of a resource sharing method according to an embodiment of this application;
Figure 7 shows an example diagram of a region state machine according to an embodiment of this application;
Figure 8 shows an example of a resource management process according to an embodiment of this application;
Figure 9 shows an application example of a resource sharing method according to an embodiment of this application;
Figure 10 shows an example of hash calculation according to an embodiment of this application;
Figure 11 shows a schematic diagram of a resource sharing device according to an embodiment of this application;
Figure 12 shows a hardware block diagram of a computing device according to an embodiment of this application.
Detailed Implementation
[0035] To make the objectives, technical solutions, and advantages of the embodiments of this application more apparent, exemplary embodiments according to the embodiments of this application will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the embodiments of this application, and not all embodiments of the embodiments of this application. It should be understood that the embodiments of this application are not limited to the exemplary embodiments described herein.
[0036] The technical solution of this application embodiment can be applied to the storage field. By establishing user-level regions, the KVCache (Key-Value Cache) generated by the inference requests of the corresponding users is cached in the regions, avoiding the reuse of the same region by different users, reducing the risk of unauthorized access to the region, and improving the security of the storage space.
[0037] The technical solutions of the embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0038] Figure 1 shows an example of an application scenario for a large language model provided in an embodiment of this application. The application scenario may include a service platform 10 and multiple user terminals 20.
[0039] The service platform 10 can be a server platform such as a cloud computing platform, a cloud AI (Artificial Intelligence) platform, a private cloud platform, or a hyper-converged infrastructure.
[0040] The large language model can be deployed on service platform 10, and inference services can be provided to multiple users through service platform 10. In other words, service platform 10 can provide inference services of the large language model to multiple users. When users use the inference service of the large language model, they generally need to send an inference request to service platform 10 through user terminal 20. Service platform 10 executes the inference request according to the deployed large language model, obtains the corresponding inference result, and feeds back the inference result to the corresponding user terminal.
[0041] While the large language model provides inference services, the token that the service platform 10 generates at each step depends on all previous tokens, which is where the KVCache comes in. KVCache refers to caching the keys and values of historical tokens, so that the calculation of each new token only needs the cached keys and values rather than recalculating the keys and values of all historical tokens, thus reducing time complexity.
[0042] Figure 2 shows an example of a resource reclamation principle provided in an embodiment of this application. The memory 201 of the service platform 10, such as GPU (Graphics Processing Unit) memory or DRAM (Dynamic Random Access Memory), can provide the KV cache. Specifically, it can be divided into multiple independent blocks 202 based on the PagedAttention CacheBlock; these blocks are referred to as physical blocks. In the KVCache scenario, a block is the smallest allocation and management unit (i.e., a "page") of the KV cache. Each block can store the KV data of a fixed number of tokens (e.g., 256 or 512 tokens). A physical block specifically refers to a physical KVCache Block.
[0043] To address the security risks, resource contention, and cache pollution caused by all users sharing the same GPU memory space, this embodiment allocates a dynamic, dedicated region to each user. A region is essentially a logical storage unit that can be associated with one or more physical blocks in the GPU memory, and each physical block caches data at the physical level. Dedicated regions logically isolate the physical storage spaces of different users: a user can only use the physical blocks within the allocated region and cannot access physical blocks in other users' regions. This isolates physical blocks between users and improves the security and utilization of the physical storage space. As shown in Figure 2, assume there are user A and user B; Region 1 (203) is allocated to user A, and Region 2 (204) is allocated to user B.
[0044] It should be understood that allocating a region to each user requires unified management of the users' regions. In this embodiment, a region controller 205 is used to manage each user's region. Specifically, the region controller can maintain a user region table, which includes the association between the user's user identifier and the region identifier of the region assigned to the user. The user identifier can be used to uniquely identify the corresponding user, and the region identifier can be used to uniquely identify the corresponding region.
[0045] In addition, the region controller is also used to manage the lifecycle of each region (such as creation, destruction, and state changes) and to distribute requests to the region's cache manager. In the KVCache scenario, the cache manager specifically refers to the KVCacheManager.
[0046] To improve the efficiency of managing dynamic user regions, this embodiment also establishes a cache manager for each region, with one cache manager corresponding to one region. After the region controller determines the target region corresponding to the inference request based on the user region table, it sends the inference request to the cache manager corresponding to the target region. The cache manager performs operations such as resource allocation and deletion for the inference request, for example allocating physical blocks for the inference request. As shown in Figure 2, Cache Manager 1 (206) is created for Region 1 (203), and Cache Manager 2 (207) is created for Region 2 (204).
[0047] In addition, to improve the management efficiency of storage resources, a free block list 208 is also maintained, as shown in Figure 2. The free block list can contain multiple free physical blocks. It is a shared resource visible to the cache managers of all regions, allowing each of them to allocate suitable free physical blocks for inference requests within its region and thereby add cache resources.
[0048] Figure 3 shows an example of physical block mapping provided in an embodiment of this application. The GPU memory space of the service platform 10 is divided into multiple physical blocks (Physical KV Blocks). For example, in Figure 3 the GPU memory space includes a physical block queue 301 consisting of 8 physical blocks, namely Block 0 to Block 7. The blocks are of equal size, and each block can be used to store the keys and values of the corresponding tokens. For example, Block 1 can be used to store "Alan Turing is a", and Block 7 can be used to store "computer scientist and mathematics".
[0049] A physical block is the actual storage medium for data; it is a memory block at the hardware level. Physical blocks can be logically mapped, with each physical block associated with one or more logical key-value blocks (Logical KV Blocks).
[0050] Understandably, a logical block is a logical unit used to represent "a certain key-value data" from the user's or sequence's perspective; it is an abstract mapping of physical blocks. A logical block is a software-level logical concept and does not directly correspond to physical storage space, but rather is mapped to a physical storage space.
[0051] Based on this, a region 302 is created for each user, and one or more logical blocks are associated with each region 302. For example, region 302 created for user A is associated with Block0 and Block1 in logical block queue 304. Each logical block is associated with a corresponding physical block. For example, logical block Block0 in logical block queue 304 is associated with physical block Block1 in physical block queue 301, and logical block Block2 in logical block queue 304 is associated with physical block Block7 in physical block queue 301.
[0052] The association between logical blocks and physical blocks can be represented by a cache table 303. The cache table 303 refers to the mapping relationship between the physical block queue 301 and the logical block queue 304 of region 302. This mapping relationship can use the association relationship between logical block identifiers and physical block identifiers.
[0053] As shown in cache table 303, there is a mapping relationship between the identifier of logical block Block L1: 1 and the identifier of physical block Block 1: 1; there is a mapping relationship between the identifier of logical block Block L7: 7 and the identifier of physical block Block 0: 0.
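The cache table can be pictured as a small per-region dictionary from logical block identifiers to physical block identifiers. The Python sketch below uses the two identifier pairs mentioned for cache table 303; the class name `CacheTable` and its methods are hypothetical.

```python
class CacheTable:
    """Per-region table mapping logical block ids to physical block ids."""

    def __init__(self) -> None:
        self._logical_to_physical = {}

    def bind(self, logical_id: int, physical_id: int) -> None:
        """Record that a logical block is backed by a physical block."""
        self._logical_to_physical[logical_id] = physical_id

    def resolve(self, logical_queue: list) -> list:
        """Translate a logical block queue into physical block ids, preserving order."""
        return [self._logical_to_physical[lid] for lid in logical_queue]


table = CacheTable()
table.bind(1, 1)   # logical block identifier 1 -> physical block identifier 1
table.bind(7, 0)   # logical block identifier 7 -> physical block identifier 0
resolved = table.resolve([1, 7])
```

Resolving a queue rather than a single block keeps the known ordering of the logical block queue intact in the resulting physical block sequence.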
[0054] It should also be understood that the logical block queue 304 shown in Figure 3 is merely an example. In practical applications, a logical block is a block in the software sense only; it does not itself store data. For example, there may be only a virtual block called "Block1", while the specific data of "Block1", such as "computer scientist and mathematics", is actually still stored in the physical block "Block1".
[0055] It should be noted that, in the embodiments of this application, the physical block queue refers to a queue composed of multiple physical blocks in sequence, and the arrangement order of each physical block is known. The logical block queue refers to a queue composed of multiple logical blocks in sequence, and the arrangement order of each logical block is known.
[0056] Based on the content shown in Figures 1-3 above, and as shown in Figure 4, this application provides a resource sharing method, which may include the following steps:
S401. Receive an inference request. The inference request includes the target user identifier of the target user. The target user refers to the user who needs to use the inference service of the large language model.
[0057] In this embodiment of the application, a request user table may also be set up, which is used to register inference requests and the users corresponding to the inference requests.
[0058] Inference requests can refer to a set of instructions or data submitted to a system / model in artificial intelligence, data processing, or logical operation scenarios, requiring it to perform logical deduction, analysis, calculation, or conclusion generation based on existing data (input information, training parameters, rule base, etc.).
[0059] The storage requirements for inference can be initially determined by analyzing the inference requests. Storage requirements refer to the storage needs of the tokens corresponding to the inference request, which can be predicted by the number of tokens in the inference request.
[0060] Specifically, the number of tokens that an inference request might generate, and the average number of blocks required for each token, can be predicted. The product of the number of tokens and the average number of blocks is then calculated to obtain the storage requirements for the inference request. These storage requirements can be used to create new logical block queues for users and allocate corresponding physical block queues to them, enabling on-demand scheduling of physical storage resources.
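As a worked example of this estimate, the Python sketch below assumes that every block holds a fixed number of tokens, so the average number of blocks per token is the reciprocal of the block capacity. The function name `estimate_blocks` and the default capacity of 256 tokens are assumptions; the product is computed with integer ceiling division so that a partially filled block still counts as one block.

```python
def estimate_blocks(predicted_tokens: int, tokens_per_block: int = 256) -> int:
    """Estimate how many blocks a request needs.

    Equivalent to predicted_tokens * (1 / tokens_per_block), rounded up.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if predicted_tokens < 0 or tokens_per_block <= 0:
        raise ValueError("token counts must be non-negative and block size positive")
    return -(-predicted_tokens // tokens_per_block)


needed = estimate_blocks(1000)   # 1000 tokens at 256 tokens per block
```

With 1000 predicted tokens and 256 tokens per block, the product is about 3.9, so four blocks are requested.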
[0061] S402. Based on the target user identifier, query the user region table. The user region table includes the relationship between the user's user identifier (tenant_id) and the region identifier (zone_id) of the region assigned to the user.
[0062] The user region table includes a mapping relationship between user identifiers and region identifiers. For example, as shown in Figure 5, each row in the user region table 502 represents a mapping relationship, which includes a user identifier and a region identifier. For example, user A and region 1 can be associated to form a mapping relationship between user identifier and region identifier.
[0063] User identifiers can be carried in inference requests. Parsing the inference request can yield information such as user identifiers.
[0064] Region identifiers can be used to distinguish different user regions.
[0065] S403. Determine whether a target region identifier matching the target user identifier is found in the user region table. If yes, execute S404; otherwise, execute S405.
[0066] If a target region identifier matching the target user identifier is found in the user region table, it means that a user region has been created for the user, and the storage resources within that region can be used directly. If no target region identifier matching the target user identifier is found in the user region table, it means that a user region has not been created for the user, and a new user region needs to be created and its storage resources allocated.
[0067] S404. Obtain the target physical block corresponding to the target region identifier.
[0068] If the user's target region already exists and has been pre-associated with physical blocks, the physical blocks associated with the target region can be identified as the target physical blocks corresponding to the target region identifier.
[0069] S405. Allocate a new region for the target user and allocate physical blocks to the new region.
[0070] Optionally, allocating a new region to a target user can refer to assigning a new region to the user, constructing a logical block queue for that new region, and associating the logical block queue with a physical block queue. The physical blocks in the physical block queue are the physical blocks of the new region. After creating a region for a user, a region identifier can also be generated. Different regions are distinguished by this region identifier.
[0071] Optionally, after allocating physical blocks for the new region, the target user's user region table can be updated. Specifically, the target user's user identifier and region identifier are added to the user region table as new first mapping information.
[0072] As one example, allocating physical blocks to a new region may include: allocating free physical blocks from the free block list to the new region. The free block list includes at least one free physical block, which is a physical block that is in an idle state.
[0073] Specifically, a free block list is a block sequence consisting of at least one free physical block arranged in order.
[0074] Understandably, a region controller can allocate new regions to target users and allocate physical blocks to those new regions.
[0075] In one possible design, the region controller can allocate physical blocks to a new region according to a preset fixed number. That is, during the region initialization phase, the number of physical blocks allocated to the region is a preset fixed number.
[0076] Furthermore, during the execution of inference requests in a region, the number of physical blocks in the region can be dynamically adjusted based on the storage requirements of the inference request. For example, if the storage requirements are not met, the number of physical blocks associated with the region can be increased; if the storage requirements are met and there are a large number of unused physical blocks, the unused physical blocks can be released to reduce the number of physical blocks associated with the region.
[0077] S406. The physical block allocated to the new region is identified as the target physical block.
[0078] S407. Based on the target physical block, store the key-value cache generated during the execution of the inference request.
[0079] Optionally, the target physical block can be either the physical block acquired in S404 or the physical block newly allocated in S405 and identified in S406.
[0080] In this embodiment, upon receiving an inference request containing a target user identifier, the system queries the user region table based on the target user identifier to obtain the matching target region identifier. The target region identifier identifies the region allocated to the user, and since physical blocks are allocated to that region, the key-value cache generated during the execution of the inference request is stored in the target physical block corresponding to the target region identifier. This ensures that the key-value cache of a given user is stored within that user's own target region, achieving physical isolation, avoiding the reuse of physical blocks across users, reducing the risk of unauthorized access between users, and improving the storage security of the key-value cache.
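Steps S402 to S406 can be sketched as a lookup with allocation on a miss. This is a minimal sketch, not the implementation of the application: the class names `Region` and `RegionController`, the in-memory dictionaries, and the default of two initial blocks per region are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    zone_id: int
    physical_blocks: list = field(default_factory=list)


class RegionController:
    """Hypothetical controller holding the user region table and the free block list."""

    def __init__(self, free_blocks, initial_blocks_per_region=2):
        self.free_blocks = list(free_blocks)      # free block list
        self.user_region_table = {}               # tenant_id -> zone_id
        self.regions = {}                         # zone_id -> Region
        self.initial_blocks = initial_blocks_per_region
        self._next_zone_id = 1

    def get_target_blocks(self, tenant_id):
        zone_id = self.user_region_table.get(tenant_id)    # S402 / S403
        if zone_id is not None:
            return self.regions[zone_id].physical_blocks   # S404
        # S405: create a region and give it a preset fixed number of free blocks.
        if len(self.free_blocks) < self.initial_blocks:
            raise RuntimeError("not enough free physical blocks")
        zone_id = self._next_zone_id
        self._next_zone_id += 1
        blocks = [self.free_blocks.pop(0) for _ in range(self.initial_blocks)]
        self.regions[zone_id] = Region(zone_id, blocks)
        self.user_region_table[tenant_id] = zone_id        # update the user region table
        return blocks                                      # S406


controller = RegionController(free_blocks=range(8))
blocks_a = controller.get_target_blocks("user_a")
blocks_b = controller.get_target_blocks("user_b")
```

In this sketch a repeated request from the same user returns the blocks of the existing region, and the blocks of two different users never overlap.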
[0081] For ease of understanding, Figure 5 shows an application example of a resource sharing method provided in this application embodiment. The request user table 501 can include a correspondence between requests and users. Based on this request user table 501, the user identifier corresponding to each inference request can be determined. For example, the user identifier for request 1 is user A, and the user identifier for request 2 is user B.
[0082] Next, the user region table 502 can be queried. The user region table 502 can include the relationships between user identifiers and region identifiers, so the region identifier of the region assigned to a user can be queried based on the user identifier. For instance, the region identifier associated with user A is region 1, and the region identifier associated with user B is region 2.
[0083] In the cache manager 1 of region 1, the logical block 1 corresponding to request 1 can be queried, and the physical block 1 corresponding to logical block 1 can be queried according to the cache table 1 of region 1. For example, physical block 1 is blocks 11-17 shown in Figure 5.
[0084] In cache manager 2 of region 2, the logical block 2 corresponding to request 2 can be queried, and the physical block 2 corresponding to logical block 2 can be queried according to cache table 2 of region 2. For example, physical block 2 is Block21-Block2a shown in Figure 5.
[0085] Figure 6 is another flowchart of a resource sharing method provided in this application embodiment. The difference from the previous embodiments is that, in step S407, storing the key-value cache generated during the execution of the inference request based on the target physical block may include: S601. Determine whether the target physical block can meet the storage requirements of the inference request. If yes, execute S605; otherwise, execute S602.
[0086] Optionally, if the target physical block can meet the storage requirements of the inference request, the inference request can be executed, and the key-value pairs generated during the inference request process can be cached in the target physical block.
[0087] Optionally, if the target physical block cannot meet the storage requirements of the inference request, a new physical block can be allocated from the free block list to the target physical block.
[0088] Specifically, if the target physical block cannot meet the storage requirements of the inference request, and there is a free physical block in the free block list, the free physical block is added to the target physical block.
[0089] If the target physical block cannot meet the storage requirements of the inference request, and there is no free physical block in the free block list, a physical block reclamation strategy is executed for each region, and the reclaimed physical block is added to the free block list as a free physical block.
[0090] S602. Determine if there is a free physical block in the free block list. If yes, execute S604; otherwise, execute S603.
[0091] Optionally, determining whether there is a free physical block in the free block list can specifically mean starting from the first physical block in the free block list, sequentially traversing each physical block in the free block list. If the current physical block is free, unoccupied, or unused, then it is determined that there is a free physical block. If no free physical block is found after traversing to the end of the free block list, then it is determined that there is no free physical block.
[0092] S603. Implement a physical block reclamation strategy for each region and add the reclaimed physical blocks as free physical blocks to the free block list.
[0093] The physical block reclamation strategy specifically refers to traversing the physical blocks in each region, obtaining unoccupied / unused physical blocks, and identifying these unoccupied / unused physical blocks as reclaimable physical blocks. An unoccupied or free physical block can mean that the physical block has not been filled with data or is empty.
[0094] S604. Add the free physical block from the free block list to the target physical block.
[0095] Optionally, after adding a free physical block from the free block list to the target physical block, S605 can continue to be executed until the target physical block can meet the storage requirements of the inference request.
[0096] Understandably, if available free physical blocks exist in the free block list, they can be directly added to the target physical block to meet storage requirements. If no available free physical blocks exist in the free block list, a physical block reclamation strategy can be implemented for each region, adding the reclaimed physical blocks to the free list and redistributing them to the target physical block. This approach allows for rapid response when storage resources are sufficient, and proactive reclamation strategies address resource constraints when storage resources are lacking, preventing inference interruptions due to insufficient resources. Furthermore, the reclamation strategy revitalizes and reuses cached resources, improving the overall utilization of physical blocks while ensuring the continuous fulfillment of cache requirements during large model inference through a dynamic replenishment mechanism, thus enhancing the stability and resource supply elasticity of the inference service.
[0097] S605. Execute the inference request and cache the key-value pairs generated during the execution of the inference request to the target physical block.
[0098] In this embodiment, if the target physical block can meet the storage requirements of the inference request, inference is executed directly and the key-value cache is stored therein. If the target physical block cannot meet the storage requirements of the inference request, a new physical block is allocated from the free block list to supplement the target physical block. This ensures efficient execution of the inference process when storage resources are sufficient, and flexibly adapts to the growth of cache requirements during inference by dynamically supplementing free blocks, avoiding inference interruptions due to insufficient resources. In addition, managing idle resources through the free list enables on-demand scheduling and efficient reuse of physical blocks, improving the utilization efficiency of storage resources and ensuring the continuity and stability of inference service execution.
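The flow of S601 to S604 can be sketched as a loop that tops up the target physical blocks from the free block list and falls back to reclamation when the list is empty. This is a hedged sketch: `ensure_capacity`, the `reclaim` callable, and the use of `MemoryError` when nothing can be reclaimed are assumptions, since the application does not specify what happens in that case.

```python
def ensure_capacity(target_blocks, required, free_list, reclaim):
    """Grow target_blocks until it holds `required` blocks (S601-S604).

    `reclaim` is a hypothetical callable that runs the reclamation strategy
    and returns the list of reclaimed physical blocks (possibly empty).
    """
    while len(target_blocks) < required:           # S601
        if not free_list:                          # S602
            reclaimed = reclaim()                  # S603
            if not reclaimed:
                raise MemoryError("no physical blocks can be reclaimed")
            free_list.extend(reclaimed)
        target_blocks.append(free_list.pop(0))     # S604
    return target_blocks                           # ready for S605


free_list = [10]
idle_region_blocks = [20, 21]


def reclaim_from_idle_region():
    # Hypothetical reclamation: hand over every block of one idle region.
    reclaimed = list(idle_region_blocks)
    idle_region_blocks.clear()
    return reclaimed


target = ensure_capacity([1, 2], 5, free_list, reclaim_from_idle_region)
```

Here the first missing block comes from the free block list and the remaining two are obtained only after reclamation, which mirrors the order of S602, S603 and S604.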
[0099] As an example, in S603, implementing a physical block reclamation strategy for each region may specifically include: Query unused physical blocks and release unused physical blocks.
[0100] It is understandable that unused physical blocks can refer to physical blocks allocated to a user area that have not been used by users within that area, or physical blocks that are used infrequently. "Unused" can mean a usage count of 0, and "used infrequently" can mean a usage count less than a usage threshold. The usage threshold can be preset.
[0101] For any region, a reference count can be maintained for each physical block within that region. If the reference count of a physical block reaches 0, the physical block is no longer referenced by any program or thread. At this point, the reclamation process is triggered to release the block or make it available again, thereby avoiding resource waste.
[0102] Furthermore, the earliest used blocks can be prioritized for eviction, and an LRU / Clock strategy can be used to release unused physical blocks. The LRU (Least Recently Used) strategy refers to querying the least recently used physical blocks. The Clock strategy refers to marking the recent usage status of blocks by an "access bit," using a circular linked list to simulate a "clock," and scanning the pointer cyclically to prioritize the eviction of unaccessed blocks (access bit is 0), thus reducing implementation complexity.
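The Clock strategy described above can be illustrated with a small ring of access bits. This is a generic sketch of the well-known Clock (second-chance) algorithm rather than the application's implementation; the class name `ClockEvictor` and its methods are hypothetical, and the ring is assumed to be non-empty.

```python
class ClockEvictor:
    """Minimal Clock (second-chance) sketch over a fixed, non-empty ring of block ids."""

    def __init__(self, block_ids):
        self.blocks = list(block_ids)
        self.access_bit = {b: 0 for b in self.blocks}
        self.hand = 0

    def touch(self, block_id):
        # Mark the block as recently used.
        self.access_bit[block_id] = 1

    def pick_victim(self):
        # Sweep the ring: clear set bits (second chance) and
        # return the first block whose access bit is already 0.
        while True:
            block = self.blocks[self.hand]
            self.hand = (self.hand + 1) % len(self.blocks)
            if self.access_bit[block] == 0:
                return block
            self.access_bit[block] = 0


evictor = ClockEvictor([11, 12, 13])
evictor.touch(11)
evictor.touch(12)
first_victim = evictor.pick_victim()
second_victim = evictor.pick_victim()
```

The sweep always terminates because every set bit is cleared as the pointer passes it, so at most one full turn is needed before a block with access bit 0 is found.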
[0103] To avoid affecting the use of storage resources in normal areas, areas that are in an idle state can be searched and physical blocks in those areas can be reclaimed.
[0104] Specifically, releasing unused physical blocks can mean adding physical blocks to the free block queue.
[0105] As one example, querying unused physical blocks includes: A1. Traverse all regions and obtain the regions that are in an idle state.
[0106] A2. Query unused physical blocks from the physical blocks associated with the idle region.
[0107] Furthermore, if no unused physical blocks are found in the physical blocks associated with the idle region, then unused physical blocks are found in the physical blocks associated with the target region corresponding to the target region identifier.
[0108] As yet another embodiment, after obtaining the unused physical block, the method further includes: Determine the user priority associated with each unused physical block, and release the physical blocks of users with lower priority.
[0109] Optionally, if the unused physical blocks found still do not meet the storage requirements of the inference request, a storage space compression strategy can be applied to the target region. This storage space compression strategy may include, for example, merging fragments and forcibly downgrading inactive sessions.
[0110] In this embodiment, all regions are traversed to filter out idle regions. Unused physical blocks are first queried from the physical blocks associated with idle regions. If no unused blocks are found, the search continues from the physical blocks associated with the target region. Finally, the found unused physical blocks are released. This achieves accurate identification and efficient release of physical storage resources, prioritizing the reclamation of unused physical blocks in globally idle regions. Furthermore, the reclamation of physical blocks associated with the target region is a fallback query for resources in the target region, maximizing the utilization of idle cache resources while avoiding the erroneous release of valid resources within the target region. This significantly improves the recycling rate of cache resources and provides ample idle resource reserves for the dynamic scheduling of resources for large model inference services.
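The search order of steps A1 and A2, with the fallback to the target region, can be sketched as follows. The data layout (a dictionary with `state` and `usage` fields) and the name `find_unused_blocks` are assumptions for illustration; a block is treated as unused when its usage count is below a preset threshold, as described in paragraph [0100].

```python
def find_unused_blocks(regions, target_zone_id, usage_threshold=1):
    """Return (zone_id, block_id) pairs to release, searching idle regions first.

    `regions` maps zone_id -> {"state": str, "usage": {block_id: use_count}}.
    A block counts as unused when its use count is below `usage_threshold`.
    """
    def unused_in(zone_id):
        usage = regions[zone_id]["usage"]
        return [(zone_id, b) for b, n in usage.items() if n < usage_threshold]

    found = []
    for zone_id, region in regions.items():          # A1: collect idle regions
        if region["state"] == "idle":
            found.extend(unused_in(zone_id))         # A2: unused blocks in idle regions
    if not found:                                    # fallback: the target region itself
        found = unused_in(target_zone_id)
    return found


regions = {
    1: {"state": "active", "usage": {11: 3, 12: 0}},
    2: {"state": "idle", "usage": {21: 0, 22: 5}},
    3: {"state": "evicted", "usage": {}},
}
from_idle = find_unused_blocks(regions, target_zone_id=1)

regions_busy_idle = {
    1: {"state": "active", "usage": {11: 3, 12: 0}},
    2: {"state": "idle", "usage": {21: 2}},
}
from_target = find_unused_blocks(regions_busy_idle, target_zone_id=1)
```

In the first call the unused block of the idle region is chosen and the target region is left untouched; in the second call the idle region has nothing to give, so the search falls back to the target region.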
[0111] As described above, the region controller is also used to manage the lifecycle of each region. Specifically, the region controller can set a state machine for each region, which indicates the state of the region. The state machine can take the value of an active, idle, or evicted state.
[0112] Figure 7 shows an example of a region state machine provided in an embodiment of this application. In the active state 701, there are inference requests being processed within the region. Physical blocks can be obtained from the free block list 700, and key-value caches generated during the execution of the inference request are stored based on these physical blocks. After all inference requests in the region have finished processing, the region can enter the idle state 702. In the idle state 702, there are no inference requests within the region, but the region's physical blocks are reserved for later use. If the region is reclaimed due to resource recycling or has not been used for a long time, the physical blocks associated with the region are evicted, and the region then enters the evicted state 703. In the evicted state 703, the region's physical blocks are released, but the association between the region identifier and the user identifier is retained.
[0113] Since the physical blocks associated with an evicted region have been released, the evicted region is no longer associated with physical blocks. Therefore, regions in the evicted state do not participate in the physical block reclamation strategy.
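The region lifecycle can be sketched as a small transition table. The event names are hypothetical, and two edges are assumptions that the text only implies: an idle region returning to the active state when a request arrives, and an evicted region becoming active again when its user returns.

```python
from enum import Enum


class RegionState(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    EVICTED = "evicted"


# Allowed transitions. The two "request_arrived" edges are assumptions of this sketch.
TRANSITIONS = {
    (RegionState.ACTIVE, "all_requests_done"): RegionState.IDLE,
    (RegionState.IDLE, "request_arrived"): RegionState.ACTIVE,
    (RegionState.IDLE, "reclaimed_or_timed_out"): RegionState.EVICTED,
    (RegionState.EVICTED, "request_arrived"): RegionState.ACTIVE,
}


def next_state(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None


def participates_in_reclamation(state):
    # Evicted regions hold no physical blocks, so reclamation skips them.
    return state is not RegionState.EVICTED
```

Keeping the transitions in one table makes it easy to reject illegal moves, such as evicting a region that still has requests in flight.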
[0114] It is understandable that the physical block reclamation strategy for each region can be executed by the region controller. In other words, resource reclamation for each region is performed by the region controller.
[0115] Furthermore, adding free physical blocks from the free block list to the target physical block can be performed by the region controller. In other words, resource requests for each region are executed by the region controller.
[0116] Furthermore, the allocation and release of physical blocks in each region are performed by the corresponding region's cache manager.
[0117] Specifically, if it is necessary to allocate new physical blocks or release unused physical blocks, the region controller needs to send a request to the region's cache manager, and then the cache manager will obtain free physical blocks and allocate them to the region or release the unused physical blocks associated with the region.
[0118] Figure 8 shows an example of a resource management process provided in an embodiment of this application. The region controller 801 can send allocation requests to cache manager 1 (802) of region 1 and cache manager 2 (803) of region 2. Cache manager 1 (802) and cache manager 2 (803) can allocate physical blocks based on the allocation requests.
[0119] In addition, the region controller 801 can send release requests to cache manager 1 (802) of region 1 and cache manager 2 (803) of region 2. Cache manager 1 (802) and cache manager 2 (803) can release physical blocks based on the release requests.
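The division of work between the region controller and the per-region cache managers can be sketched as simple delegation. The class names `CacheManager` and `DelegatingRegionController`, the method names, and the shared Python list standing in for the free block list are all assumptions made for illustration.

```python
class CacheManager:
    """Hypothetical per-region cache manager that owns the region's block list."""

    def __init__(self, zone_id, free_list):
        self.zone_id = zone_id
        self.free_list = free_list      # shared free block list
        self.blocks = []                # physical blocks held by this region

    def allocate(self, count):
        # Take up to `count` blocks from the head of the free block list.
        granted = self.free_list[:count]
        del self.free_list[:count]
        self.blocks.extend(granted)
        return granted

    def release(self, block_ids):
        # Return the given blocks of this region to the free block list.
        for block in block_ids:
            self.blocks.remove(block)
            self.free_list.append(block)


class DelegatingRegionController:
    """Hypothetical controller that only forwards requests to cache managers."""

    def __init__(self, free_list):
        self.free_list = free_list
        self.managers = {}

    def add_region(self, zone_id):
        self.managers[zone_id] = CacheManager(zone_id, self.free_list)

    def request_allocation(self, zone_id, count):
        return self.managers[zone_id].allocate(count)

    def request_release(self, zone_id, block_ids):
        self.managers[zone_id].release(block_ids)


free = [1, 2, 3, 4]
ctl = DelegatingRegionController(free)
ctl.add_region(1)
ctl.add_region(2)
granted_1 = ctl.request_allocation(1, 2)
granted_2 = ctl.request_allocation(2, 1)
ctl.request_release(1, [1])
```

In this sketch the controller decides when to allocate or release, while each cache manager performs the actual change to its own region's block list.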
[0120] Figure 9 shows an application example of a resource sharing method provided in an embodiment of this application. This resource sharing method may include: S901. Receive an inference request. The inference request includes the target user identifier of the target user.
[0121] S902. Register the inference request in the request user table.
[0122] S903. Based on the target user identifier, query the user region table to see if the target region identifier is matched. If yes, execute S906; otherwise, execute S904.
[0123] S904. Create the target region for the target user.
[0124] S905. Pre-allocate physical blocks for the target region.
[0125] S906. Schedule and process the inference request.
[0126] S907. Perform GPU memory management operations based on the target physical block.
[0127] S908. Determine whether the target physical block can meet the storage requirements of the inference request. If yes, execute S912; otherwise, execute S909.
[0128] S909. Determine if there is a free physical block in the free block list. If yes, execute S911; otherwise, execute S910.
[0129] S910. Implement a physical block reclamation strategy for each region and add the reclaimed physical blocks as free physical blocks to the free block list.
[0130] S911. Add the free physical block from the free block list to the target physical block.
[0131] S912. Continue executing the inference request until the inference request is completed.
[0132] S913. Cache the key-value pairs generated during the execution of the inference request to the target physical block.
[0133] As mentioned above, the corresponding target physical block can be obtained based on the target region identifier. The logical block serves as the bridge between the target region identifier and the target physical block. In other words, the target logical block can be determined first using the target region identifier, and then the target physical block can be determined based on the target logical block.
[0134] Therefore, as an example, obtaining the target physical block corresponding to the target region identifier may include: determining the target logical block associated with the target region allocated to the target user based on the target region identifier. The cache table of the target region can be queried to obtain the target physical block associated with the target logical block.
[0135] In this embodiment, logical blocks serve as a bridge between regions and physical blocks. This allows the target logical block associated with the target region allocated to the target user to be identified through the target region identifier. Then, the cache table for that target region is queried. This cache table records the association between logical blocks and physical blocks, thus enabling the accurate retrieval of the target physical block corresponding to the target logical block. Based on the region association and cache table query mechanism, the targeted location and efficient association of cached resources are ensured. This not only guarantees the regional isolation of user cached data but also improves the efficiency of cache resource lookup and reuse during large model inference.
[0136] In large model inference, reusing key-value caches can reduce redundant computations, save GPU memory, and improve inference speed. Key-value cache reuse can be achieved through prefix caching. A prefix can refer to a common sequence of starting tokens shared by multiple requests (such as conversations between different users or text generation tasks), and prefix caching reuses the key-value cache computed for that prefix.
[0137] For example, request A is "Introduce artificial intelligence development", and request B is "Introduce artificial intelligence applications". Both share the prefix "Introduce artificial intelligence", so both requests can use the same key-value cache for that prefix.
[0138] In order to isolate the prefix cache for different users, in this embodiment of the application, the user's user identifier is included in the index calculation of the prefix cache, so that the prefix cache of different users is different, and each user can only access the prefix cache under their name.
[0139] Therefore, as an example, determining the target logical block associated with the target region allocated to the target user based on the target region identifier includes: calculating the target root hash value of the inference request based on the target region identifier and the token identifier (token_ids); querying the prefix cache index table, which includes the association between root hash values and logical block queues; if the target root hash value is found in the prefix cache index table, determining the logical block queue associated with the target root hash value as the target logical block queue; and determining the logical blocks in the target logical block queue as the target logical blocks.
[0140] Optionally, the token identifier can refer to a string used to identify all tokens within the target region. For example, the token identifier can be obtained by sequentially concatenating the identifiers of the individual tokens of the inference request.
[0141] As shown in Figure 3, the target region identifier of region 302 and the token identifier under that region can be used to calculate the target root hash value of the inference request. Then, the logical block is queried using the root hash value; specifically, the logical block queue associated with the root hash value of the inference request can be queried through the prefix cache index table. This logical block queue is, for example, the logical block queue 304 shown in Figure 3. Region 302 is associated with a cache table, which can include a mapping between physical block identifiers and logical block identifiers. Therefore, after the target logical block is obtained, the target physical block associated with the target logical block can be queried, for example in the physical block queue 301 shown in Figure 3.
[0142] Figure 10 shows an example of hash calculation provided in an embodiment of this application.
[0143] In the prefix token sequence, each token corresponds to a block, and each token can serve as the parent node of the next token. Therefore, the blocks corresponding to the tokens also form a physical block sequence. Referring to Figure 10, suppose there are 3 blocks, where Block1 (1001) is the parent node of Block2 (1002), and Block2 (1002) is the parent node of Block3 (1003).
[0144] If Block1 is the root node, then Block1 can use Hash(zone id, token ids) as an index, where token ids is the token identifier and zone id is the user's zone identifier.
[0145] Since Block2 is a child node of the previous node Block1 and Block3 is a child node of the previous node Block2, Block2 and Block3 can use Hash(parent-hash, token ids) as indexes, where parent-hash is the hash value of the parent node and token ids is the token identifier.
[0146] Therefore, the hash value of Block1 is calculated using the region identifier and the token identifier. Since each user's region identifier is unique, distinguishing the root nodes of different regions by the region identifier ensures that only inference requests from the same user can hit the same prefix cache, thus guaranteeing cache security without sacrificing performance.
[0147] In this embodiment, the target region identifier is added to the prefix cache index calculation, ensuring that the prefix cache index is obtained through a unique target region identifier. Since each user's region identifier is unique, the target root hash value is calculated using this unique identifier. This uniqueness ensures that the logical block queue corresponding to the target root hash value can only be hit by users with the same region identifier, thus avoiding access to logical block queues across users. Furthermore, the use of the prefix cache still allows for the reuse of key-value caches with the same prefix for different requests, guaranteeing key-value cache performance. Therefore, cache security can be improved without sacrificing computational performance.
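The hash chain of Figure 10 can be sketched as follows, with the root hash computed from the region identifier and the token identifiers, and each later hash computed from the parent hash and the token identifiers. The choice of SHA-256, the string encoding of the inputs, and the grouping of several tokens into one block are assumptions of this sketch; the application does not name a hash function.

```python
import hashlib


def _digest(prefix: str, token_ids) -> str:
    # Encode the prefix (zone id or parent hash) together with the token ids.
    payload = prefix + "|" + ",".join(str(t) for t in token_ids)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def block_hashes(zone_id, token_blocks):
    """Root = Hash(zone_id, token_ids); child = Hash(parent_hash, token_ids)."""
    hashes = []
    parent = f"zone:{zone_id}"
    for token_ids in token_blocks:
        parent = _digest(parent, token_ids)
        hashes.append(parent)
    return hashes


# The same prompt, split into three blocks of token ids, for two different regions.
prompt = [[101, 102], [103, 104], [105, 106]]
user_a = block_hashes(zone_id=1, token_blocks=prompt)
user_b = block_hashes(zone_id=2, token_blocks=prompt)
```

Because the region identifier enters the root hash and every child hash depends on its parent, identical prompts from different regions yield different indexes at every level, while two requests from the same region that share a prefix yield identical indexes for that prefix.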
[0148] As yet another embodiment, it also includes: If the target root hash value is not found in the prefix cache index table, a new logical block queue is constructed. The new logical block queue consists of multiple logical blocks, each of which is associated with a physical block.
[0149] The target root hash value and the new logical block queue are associated and stored in the prefix cache index table.
[0150] In this embodiment, when the target root hash value is not found in the prefix cache index table, a new logical block queue containing multiple associated physical blocks can be constructed. The target root hash value is then associated with and stored in the prefix cache index table, enabling dynamic creation and indexing of the new logical block queue. This ensures the timely supply of cache resources required for inference in miss scenarios, provides a cache foundation for the reuse of the same prefix in subsequent scenarios, improves the prefix cache index system, and further enhances the coverage and long-term utilization efficiency of cache resources in large model inference. Simultaneously, relying on the association between logical blocks and physical blocks, it ensures the effective storage and management of the new block queue.
[0151] The construction of a new logical block queue includes: allocating new physical blocks according to the storage requirements of the inference request, and determining a new logical block queue based on the new physical blocks; or, alternatively, requesting a new physical block queue from the region controller and associating the new physical block queue with a new logical block queue.
[0152] Understandably, allocating new physical blocks according to the storage requirements of the inference request includes: determining the number N of physical blocks required for the storage requirements of the inference request, allocating N physical blocks as new physical blocks, and arranging the N physical blocks in order to obtain a new logical block queue.
[0153] A region controller can be used to manage the allocation of storage resources to each user, such as allocating physical blocks and associating allocated logical blocks with physical blocks. Allocating storage resources to users through a region controller can improve storage resource utilization and flexibility.
[0154] In this embodiment, new physical blocks can be allocated according to the storage requirements of inference requests, and a new logical block queue corresponding to the new physical blocks can be determined. Alternatively, a new physical block queue can be requested from the region controller and associated with the new logical block queue. This makes the new logical block queue construction mechanism more flexible, supporting both dynamic allocation of cache resources based on the storage requirements of inference requests and coordinated acquisition of physical resources through the region controller, ensuring cache supply when resources are scarce within the local region. This further improves the flexibility and efficiency of cache management in large model inference.
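The hit and miss handling for the prefix cache index table can be sketched as a lookup that builds and registers a new logical block queue on a miss. The function name `lookup_or_create_queue`, the representation of a queue as a list of (logical block, physical block) pairs, and the `allocate_blocks` callable are assumptions for illustration.

```python
def lookup_or_create_queue(index_table, root_hash, allocate_blocks, required_blocks):
    """Return (logical_block_queue, hit).

    `index_table` maps root hash -> logical block queue, where a queue is a list of
    (logical_block_id, physical_block_id) pairs. `allocate_blocks(n)` is a
    hypothetical callable that returns n physical block ids.
    """
    queue = index_table.get(root_hash)
    if queue is not None:
        return queue, True                         # hit: reuse the existing queue
    physical = allocate_blocks(required_blocks)    # miss: allocate N physical blocks
    queue = [(logical_id, phys) for logical_id, phys in enumerate(physical)]
    index_table[root_hash] = queue                 # register for later reuse
    return queue, False


free_blocks = [31, 32, 33, 34]


def take_from_free_list(n):
    # Hypothetical allocator: take the first n blocks of the free block list.
    taken = free_blocks[:n]
    del free_blocks[:n]
    return taken


index = {}
queue_1, hit_1 = lookup_or_create_queue(index, "root-hash-1", take_from_free_list, 2)
queue_2, hit_2 = lookup_or_create_queue(index, "root-hash-1", take_from_free_list, 2)
```

The first call misses and consumes two physical blocks; the second call with the same root hash hits and returns the same queue without further allocation.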
[0155] Figure 11 is a structural schematic of a resource sharing device provided in an embodiment of this application. The resource sharing device 1100 includes a transceiver unit 1101 and a processing unit 1102.
[0156] The transceiver unit 1101 is used to receive inference requests. The inference request includes the target user identifier of the target user, which refers to the user who needs to use the inference service of the large language model.
[0157] Processing unit 1102 can be used to: query a user region table based on the target user identifier, the user region table including the association between the user identifier and the region identifier of the region assigned to the user; if a target region identifier matching the target user identifier is found in the user region table, obtain the target physical block corresponding to the target region identifier; and store the key-value cache generated during the execution of the inference request based on the target physical block.
[0158] As one embodiment, the processing unit 1102 performs the task of obtaining the target physical block corresponding to the target region identifier, specifically including: Based on the target region identifier, determine the target logical block associated with the target region allocated to the target user; query the cache table of the target region to obtain the target physical block associated with the target logical block. The cache table includes the association relationship between logical blocks and physical blocks.
[0159] As another embodiment, the processing unit 1102 executes the determination of the target logical block associated with the target region allocated to the target user based on the target region identifier, specifically including: Calculate the target root hash value for the inference request based on the target region identifier and token identifier (token_ids); query the prefix cache index table based on the target root hash value, which includes the association between the root hash value and the logical block queue; if the target root hash value is found in the prefix cache index table, determine the logical block queue associated with the target root hash value as the target logical block queue; determine the logical blocks in the target logical block queue as the target logical blocks.
[0160] As yet another embodiment, the processing unit 1102 is also configured to perform: If the target root hash value is not found in the prefix cache index table, a new logical block queue is constructed, which includes multiple logical blocks, each of which is associated with a physical block; the target root hash value and the new logical block queue are associated and stored in the prefix cache index table.
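The root hash calculation and the hit and miss paths of the prefix cache index table can be sketched as follows. This application does not specify the hash function; SHA-256 is used here purely as an assumption, and `root_hash`, `lookup_or_create_queue`, and `new_queue_factory` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


def root_hash(region_id: int, token_ids: Sequence[int]) -> str:
    """Hash the region identifier together with token_ids (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    digest.update(f"region:{region_id}|".encode())
    digest.update(",".join(str(t) for t in token_ids).encode())
    return digest.hexdigest()


def lookup_or_create_queue(
    prefix_index: Dict[str, List[int]],
    region_id: int,
    token_ids: Sequence[int],
    new_queue_factory: Callable[[], List[int]],
) -> List[int]:
    """Return the logical block queue for this request, creating it on a miss."""
    key = root_hash(region_id, token_ids)
    queue = prefix_index.get(key)
    if queue is None:
        # Miss: build a new logical block queue and store the association.
        queue = new_queue_factory()
        prefix_index[key] = queue
    return queue
```

Mixing the region identifier into the hash means that identical token prefixes submitted by users in different regions produce different keys, so one user's cached prefix is never matched by another user's request.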
[0161] In another embodiment, when constructing the new logical block queue, the processing unit 1102 is specifically configured to: allocate new physical blocks according to the storage requirements of the inference request and determine the new logical block queue based on the new physical blocks; or request a new physical block queue from the region controller and associate the new physical block queue with a new logical block queue; or request a new logical block queue from the region controller and associate the new logical block queue with a new physical block queue.
[0162] As yet another embodiment, the processing unit 1102 is also configured to perform: If no target region identifier matching the target user identifier is found in the user region table, a new region is assigned to the target user, and physical blocks are allocated to the new region. The physical block allocated to the new region is identified as the target physical block.
[0163] As another embodiment, the processing unit 1102 performs the allocation of physical blocks for the new region, specifically including: Allocate free physical blocks from the free block list to the new region. The free block list includes at least one free physical block, which is a physical block that is in an idle state.
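Assigning a new region and populating it from the free block list can be sketched as below. The sequential region numbering, the `blocks_needed` parameter, and the function name `assign_new_region` are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List


def assign_new_region(
    user_region_table: Dict[str, int],
    region_blocks: Dict[int, List[int]],
    free_blocks: Deque[int],
    user_id: str,
    blocks_needed: int,
) -> int:
    """Create a region for user_id and move free physical blocks into it."""
    if len(free_blocks) < blocks_needed:
        raise MemoryError("not enough free physical blocks")
    # Hypothetical numbering scheme: next integer after the largest region id.
    region_id = max(region_blocks, default=0) + 1
    # Blocks leave the free list, so no other region can receive them.
    region_blocks[region_id] = [free_blocks.popleft() for _ in range(blocks_needed)]
    user_region_table[user_id] = region_id
    return region_id
```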
[0164] In another embodiment, when storing, based on the target physical block, the key-value cache generated during the execution of the inference request, the processing unit 1102 is specifically configured to: if the target physical block can meet the storage requirements of the inference request, execute the inference request and store the key-value cache generated during its execution in the target physical block; if the target physical block cannot meet the storage requirements of the inference request, allocate a new physical block from the free block list to expand the target physical block.
[0165] As another embodiment, the processing unit 1102 performs the allocation of a new physical block from the free block list to the target physical block, including: If a free physical block exists in the free block list, add the free physical block to the target physical block; If no free physical blocks exist in the free block list, a physical block reclamation strategy is executed for each region. The reclaimed physical blocks are added to the free block list as free physical blocks, and the free physical blocks in the free block list are added to the target physical block.
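The expansion of the target physical blocks, including the fallback to reclamation when the free block list is empty, can be sketched as follows. Measuring the storage requirement as a block count and passing the reclamation strategy as a `reclaim` callback are assumptions of this example.

```python
from collections import deque
from typing import Callable, Deque, List


def ensure_capacity(
    target_blocks: List[int],
    blocks_needed: int,
    free_blocks: Deque[int],
    reclaim: Callable[[], List[int]],
) -> None:
    """Grow target_blocks in place until it holds blocks_needed blocks."""
    while len(target_blocks) < blocks_needed:
        if not free_blocks:
            # Free list is empty: run the reclamation strategy and refill it.
            reclaimed = reclaim()
            if not reclaimed:
                raise MemoryError("no physical blocks could be reclaimed")
            free_blocks.extend(reclaimed)
        target_blocks.append(free_blocks.popleft())
```

Reclaimed blocks pass through the free block list before being added to the target, which mirrors the order of operations described above.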
[0166] As another embodiment, the processing unit 1102 executes a physical block reclamation strategy for each region, including: Iterate through all regions and obtain the regions that are currently idle. Query unused physical blocks from the physical blocks associated with the idle region, or, if no unused physical blocks are found from the physical blocks associated with the idle region, query unused physical blocks from the physical blocks associated with the target region corresponding to the target region identifier. Release unused physical blocks.
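One possible reading of the reclamation strategy is sketched below. The `RegionState` structure, its `idle` flag, and the `in_use` set used to decide whether a block is unused are assumptions, since this application does not specify how idleness or usage is tracked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RegionState:
    """Hypothetical bookkeeping for one region."""
    blocks: List[int] = field(default_factory=list)
    in_use: Set[int] = field(default_factory=set)
    idle: bool = False

    def release_unused(self) -> List[int]:
        """Detach and return the blocks that are not currently in use."""
        unused = [b for b in self.blocks if b not in self.in_use]
        self.blocks = [b for b in self.blocks if b in self.in_use]
        return unused


def reclaim_blocks(regions: Dict[int, RegionState], target_region_id: int) -> List[int]:
    """Reclaim unused blocks from idle regions, else from the target region."""
    reclaimed: List[int] = []
    for region_id, region in regions.items():
        if region.idle and region_id != target_region_id:
            reclaimed.extend(region.release_unused())
    if not reclaimed:
        # Nothing found in idle regions: fall back to the target region itself.
        reclaimed = regions[target_region_id].release_unused()
    return reclaimed
```

In this sketch, regions that are neither idle nor the target are never touched, so a request in one region cannot take blocks from another region that is actively serving its own user.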
[0167] In the embodiments of this application, the device shown in Figure 11 can also be a chip or a chip system, such as a system on chip (SoC) or a baseboard management controller (BMC).
[0168] Figure 12 is a hardware block diagram of a computing device provided in an embodiment of this application. The computing device 1200 according to an embodiment of this application includes at least a memory 1201, a processor 1202, and a transceiver 1203. The memory 1201 is used to store computer programs, and the transceiver 1203 is used to communicate with a user terminal to receive inference requests or send inference results to the user terminal. The processor 1202 is used to execute the computer program to implement the resource sharing method of any of the above embodiments.
[0169] In addition, the memory 1201, processor 1202, and transceiver 1203 are all electrically connected to the communication bus 1204.
[0170] Furthermore, embodiments of this application also provide a computer-readable storage medium for storing a computer program. When executed by a processor, the computer program implements the resource-sharing method of any of the preceding embodiments of this application.
[0171] Computer-readable storage media include, but are not limited to, volatile storage media and / or non-volatile storage media. Volatile storage media may include, for example, random access memory (RAM) and / or cache memory. Non-volatile storage media may include, for example, read-only memory (ROM), hard disks, flash memory, optical disks, magnetic disks, etc.
[0172] This application also provides a computer program product, including a computer program / instruction, which, when executed by a processor, implements the resource sharing method of any of the preceding embodiments of this application.
[0173] The basic principles of the embodiments of this application have been described above with reference to specific examples. However, it should be noted that the advantages, benefits, and effects mentioned in the embodiments of this application are merely examples and not limitations, and should not be considered essential to every embodiment of this application. Furthermore, the specific details disclosed above are provided only for illustration and ease of understanding, not as limitations. The embodiments of this application are not required to be implemented using those specific details.
[0174] The block diagrams of components, apparatuses, devices, and systems involved in the embodiments of this application are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and may be used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and / or” and may be used interchangeably with it, unless the context explicitly indicates otherwise. The term “such as” as used herein means “such as but not limited to” and may be used interchangeably with it.
[0175] Additionally, as used herein, the "or" used in a list of items beginning with "at least one" indicates a separate list, such that a list of, for example, "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not imply that the described example is preferred or better than other examples.
[0176] It should also be noted that in the systems and methods of the embodiments of this application, each component or step can be decomposed and / or recombined. Such decompositions and / or recombinations should be considered equivalent solutions of the embodiments of this application.
[0177] Various changes, substitutions, and modifications can be made to the technology herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims of the embodiments of this application is not limited to the specific aspects of the processes, machines, manufactures, events, means, methods, and actions described above. Currently existing or later-developed processes, machines, manufactures, events, means, methods, or actions that perform substantially the same function or achieve substantially the same result as the corresponding aspects herein can be utilized. Therefore, the appended claims include such processes, machines, manufactures, events, means, methods, or actions within their scope.
[0178] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use embodiments of this application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of embodiments of this application. Therefore, embodiments of this application are not intended to be limited to the aspects shown herein, but rather to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[0179] The above description has been given for illustrative and descriptive purposes. Furthermore, this description is not intended to limit the embodiments of this application to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.
Claims
1. A resource sharing method, characterized in that, the method includes: Receive an inference request, the inference request including the target user identifier of the target user, the target user being a user who needs to use the inference service of the large language model; Based on the target user identifier, query the user region table, which includes the association between the user's user identifier and the region identifier of the region assigned to the user; If a target region identifier that matches the target user identifier is found in the user region table, the target physical block corresponding to the target region identifier is obtained; Based on the target physical block, store the key-value cache generated during the execution of the inference request.
2. The method according to claim 1, characterized in that, The step of obtaining the target physical block corresponding to the target region identifier includes: Based on the target region identifier, the target logical block associated with the target region allocated to the target user is determined; Query the cache table of the target region to obtain the target physical block associated with the target logical block. The cache table includes the association relationship between logical blocks and physical blocks.
3. The method according to claim 2, characterized in that, The step of determining the target logical block associated with the target region allocated to the target user based on the target region identifier includes: Calculate the target root hash value of the inference request based on the target region identifier and the token identifier; Based on the target root hash value, query the prefix cache index table, which includes the association between the root hash value and the logical block queue; If the target root hash value is found in the prefix cache index table, the logical block queue associated with the target root hash value is determined as the target logical block queue; The logical blocks in the target logical block queue are determined as the target logical blocks.
4. The method according to claim 3, characterized in that, The method further includes: If the target root hash value is not found in the prefix cache index table, a new logical block queue is constructed, which includes multiple logical blocks, each of which is associated with a physical block; The target root hash value and the new logical block queue are associated and stored in the prefix cache index table.
5. The method according to claim 4, characterized in that, The construction of the new logical block queue includes: New physical blocks are allocated according to the storage requirements of the inference request, and a new logical block queue is determined based on the new physical blocks; Alternatively, a new physical block queue is requested from the region controller, and the new physical block queue is associated with a new logical block queue; Alternatively, a new logical block queue is requested from the region controller, and the new logical block queue is associated with a new physical block queue.
6. The method according to any one of claims 1-5, characterized in that, Before storing the key-value cache generated during the execution of the inference request based on the target physical block, the method further includes: If no target region identifier matching the target user identifier is found in the user region table, a new region is assigned to the target user, and a physical block is allocated to the new region. The physical block allocated to the new region is determined as the target physical block.
7. The method according to claim 6, characterized in that, The allocation of physical blocks to the new region includes: The free physical blocks in the free block list are allocated to the new region. The free block list includes at least one free physical block, which is a physical block that is in an idle state.
8. The method according to any one of claims 1-6, characterized in that, The storage of the key-value cache generated during the execution of the inference request, based on the target physical block, includes: If the target physical block can meet the storage requirements of the inference request, the inference request is executed, and the key-value cache generated during the execution of the inference request is stored in the target physical block; If the target physical block cannot meet the storage requirements of the inference request, a new physical block is allocated from the free block list to expand the target physical block.
9. The method according to claim 8, characterized in that, The allocation of a new physical block from the free block list includes: If a free physical block exists in the free block list, the free physical block is added to the target physical block; If no free physical blocks exist in the free block list, a physical block reclamation strategy is executed for each region. The reclaimed physical blocks are added to the free block list as free physical blocks, and the free physical blocks in the free block list are added to the target physical block.
10. The method according to claim 9, characterized in that, The implementation of the physical block reclamation strategy for each region includes: Iterate through all regions and obtain the regions that are currently idle. Query unused physical blocks from the physical blocks associated with the idle region, or, if no unused physical blocks are found from the physical blocks associated with the idle region, query unused physical blocks from the physical blocks associated with the target region corresponding to the target region identifier. Release the unused physical block.
11. A computing device, characterized in that, the computing device includes: A processor and a memory, the memory storing a computer program that is invoked by the processor to execute the resource sharing method according to any one of claims 1-10.