System and method for moving data between system components
By enabling continuous data storage and automatic overflow management across multiple buffers, the system addresses buffer overflow and latency issues in accelerators, ensuring efficient handling of non-deterministic data sizes and maximizing system performance.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2024-05-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing data management systems face inefficiencies due to buffer overflows and latency issues when handling non-deterministic output sizes, particularly in accelerators that require additional processing steps to manage buffer usage, leading to reduced system performance.
The accelerator continues processing instructions without waiting for previous outputs to complete. Automatic overflow management across multiple buffers stores non-deterministic data results contiguously, without added latency or resource overhead.
This enables efficient storage and retrieval of non-deterministic data results across multiple buffers, allows instructions to be processed in parallel, and reduces the need for additional memory management, improving system efficiency and throughput.

Figure 2026519917000001_ABST
Abstract
Description
【Technical Field】
【0001】 This disclosure relates generally to data handling within digital systems, and more particularly to systems and methods for moving data between system components.
【Background Art】
【0002】 As modern processors handle ever-increasing amounts of data, efficient data management and handling become more important. One common data movement artifact is the buffer. When a processing component of a digital system completes a data processing task, it commonly stores the result in a buffer. However, because a buffer has limited size, several measures must be taken when it reaches capacity. Moving data out of the buffer, handling overflows, and managing buffer usage can all reduce the efficiency of the system.
【0003】 These challenges apply in particular to accelerators that support direct submission of client workloads, for example through the command queues of virtual function portals (i.e., SR-IOV) (e.g., Intel DSA and QAT, MS SDM and SDED). For workloads with deterministic output sizes, the accelerator can accept many commands and execute them in parallel, since the output destination address of each command can be pre-computed. For workloads with non-deterministic output sizes, such as compression, the accelerator must wait for the previous command to complete and report its output size, and only then use that size to issue the next command. Alternatively, the accelerator can issue many commands to be executed in parallel into a side buffer, but a second pass is then required to copy their outputs into the packed destination buffer. Both techniques add latency and/or consume more system resources.
【Summary of the Invention】
【Means for Solving the Problems】
【0004】
[0004] The following disclosure includes improved techniques to address these and other issues.
[Brief explanation of the drawings]
【0005】 [Figure 1] shows an exemplary system for moving data according to one embodiment.
【0006】 [Figure 2] shows a method for moving data according to one embodiment.
【0007】 [Figure 3] shows an exemplary system for moving data between an accelerator and a disk according to one embodiment.
【0008】 [Figure 4] shows an exemplary instruction, data result, and buffer according to one embodiment.
[Modes for carrying out the invention]
[0006] Detailed explanation 【0009】 This specification describes techniques for moving data between components of a system. For illustrative purposes, numerous examples and specific details are provided below to allow for a full understanding of several embodiments. The various embodiments defined by the claims may include some or all of the features in these examples alone or in combination with other features described below, and may further include modifications and equivalents of the features and concepts described herein.
[0007] 【0010】 The features and benefits of this disclosure include a mechanism that allows an accelerator or other digital processor to continue storing output where the previous instruction finished, without incurring completion and submission waits to or from the client process. This may be called write buffer continuity.
[0008] 【0011】 Some embodiments of this disclosure may include ordering outputs such that the next output does not begin until the previous output has completed. In some applications, ordering may occur for each supported stream. Ordering is a challenge when there are outputs that depend on the previous output (for example, when the previous output has finished so that the next output can be placed sequentially into a buffer). However, while outputs can be ordered, the input and computation of the workload may begin in parallel (or even randomly) to absorb as much latency as possible and maximize performance.
[0009] 【0012】 Figure 1 shows an exemplary system for moving data according to one embodiment. A digital processor 101 receives a plurality of instructions 150. Instructions 150 can cause the digital processor 101 to perform the same processing function (for example, using functional processing blocks 110a-n) to generate a plurality of data results 150a-n. Exemplary digital processors 101 include a wide variety of accelerators, which are hardware devices configured to enhance the overall performance of a computer or system. Various types of accelerators are available to enhance different aspects of a computer's functions (e.g., compression accelerators or artificial intelligence accelerators). Features and advantages of this disclosure include sequentially sending a plurality of instructions 150 to the processor 101 for execution. In some embodiments, instructions may be sent in batches, for example without waiting for each instruction to complete its processing function before the next instruction is issued. The execution of instructions 150 thus causes the digital processor 101 to generate a plurality of data results 150a-n. The data results 150a-n may be non-deterministic. For example, the output data size of each data result may not be known in advance, which makes managing the storage of data results 150a-n a challenge.
[0010] 【0013】 The digital processor 101 may store data results 150a-n in a plurality of buffers 120a-b in memory 102. Memory 102 may be, for example, one or more dynamic random access memory integrated circuits (DRAM ICs). Advantageously, the system may track where data is stored in the buffers and provide an automatic overflow feature, so that the data results generated by an instruction are automatically stored at a buffer location (e.g., an address) following the last address used by the data results of the previous instruction. When the data results of an instruction fill a buffer, the remaining data results are automatically stored in the next buffer. For example, a data result 150a generated in response to an initial instruction may be stored in buffer 120a beginning at a starting location (<stloc>). Data result 150a may be stored between <stloc> and an end (or final) location (<lloc>). The digital processor 101 may track the final location <lloc> of a particular data result in the first buffer, so that a subsequent data result generated in response to a subsequent instruction is stored in buffer 120a at a subsequent location (<sloc>) following the final location <lloc>. However, subsequent data results may fill buffer 120a. Advantageously, an instruction can be associated with a reference ("ref" 151) to the second buffer 120b. Thus, when a data result fills the first buffer 120a, the remainder of the data result that caused the overflow is automatically stored in the second buffer 120b using the reference (e.g., automatic overflow of data into the next buffer). In this example, the remainder is stored in buffer 120b beginning at the start address of the buffer (<top>), assuming the buffer was completely flushed before being used as an overflow buffer.
Subsequently, the data results that follow the first data result are stored sequentially in buffer 120b. When an overflow occurs, instructions 150 received thereafter may carry an updated reference, for example to yet another overflow buffer. Thus, at least a portion of the instructions received by the digital processor 101 have their data results stored in one buffer (e.g., buffer 120a) but reference a different buffer for automatic overflow.
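The tracking and automatic overflow behavior described above can be illustrated with a minimal software sketch. It is not the disclosed hardware implementation: the names `Buffer`, `OverflowWriter`, `store`, `last_loc`, and `overflow_ref` are hypothetical, byte strings stand in for data results, and the sketch assumes that each result is smaller than a buffer and that the referenced overflow buffer has already been flushed.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    """Fixed-size output buffer; `data` stands in for a DRAM region (hypothetical)."""
    size: int
    data: bytearray = field(default_factory=bytearray)


class OverflowWriter:
    """Packs variable-size results back to back and spills into a referenced buffer."""

    def __init__(self, first: Buffer):
        self.current = first
        self.last_loc = 0  # next free offset, one past the final location used so far

    def store(self, result: bytes, overflow_ref: Buffer) -> None:
        """Store one result; `overflow_ref` models the reference carried by the instruction."""
        room = self.current.size - self.last_loc
        head, tail = result[:room], result[room:]
        self.current.data += head
        self.last_loc += len(head)
        if self.last_loc == self.current.size:
            # Current buffer is full: continue at the start of the referenced buffer.
            # Assumption: that buffer is empty (already flushed) and large enough for `tail`.
            self.current = overflow_ref
            self.current.data += tail
            self.last_loc = len(tail)


first, second = Buffer(8), Buffer(8)
writer = OverflowWriter(first)
writer.store(b"abcde", second)   # fits entirely in the first buffer
writer.store(b"fghij", second)   # "fgh" fills the first buffer, "ij" overflows
```

After the second call, the first buffer holds `abcdefgh`, the second holds `ij`, and the tracked location is offset 2 in the second buffer, so the next result would be packed immediately after the overflowed remainder.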
[0011] 【0014】 In some cases, instructions may be received when buffer 120a is empty (for example, flushed, as described in detail below). When a first instruction is received, it may be associated with a starting location in the first buffer. Thus, the first entry of the data result corresponding to the first instruction can be stored at the starting address of the buffer (for example, <top> of buffer 120a).
[0012] 【0015】 By using the techniques described above, output data sequentially generated by a series of instructions by functional processing blocks 110a-n can be efficiently stored in multiple buffers without delays caused by buffer fullness and without associated instruction submission and instruction completion wait times that would prevent the accelerator from achieving maximum performance. Once data is present in the buffers, other system components can retrieve this data for various uses, as will be described in detail below. For example, an electronic device 190 may be coupled to memory 102. The electronic device 190 may be, for example, a hard drive, a processor, or an accelerator. The electronic device 190 retrieves the data result stored in the first buffer when the first data result of multiple data results fills the first buffer. As will be described in more detail below, the electronic device may send a signal indicating that the buffer has been emptied (also known as flushed) so that the original buffer can be used as an overflow buffer for the data result. Thus, the first buffer can be used to store the data result while the second buffer can act as an overflow buffer. When the first buffer is filled, it may be flushed and used as an overflow buffer for the buffer currently receiving and storing data. In various embodiments, various numbers of buffers may be used. In some cases, two buffers may be used, and in other cases, three or more buffers may be used to store data from a digital processor that generates multiple streams of output data, for example.
[0013] 【0016】 Figure 2 shows a method for moving data according to one embodiment. In 201, instructions are received sequentially within a digital processor, such as an accelerator. The instructions may perform the same processing function to generate multiple data results. In 202, the digital processor stores the data results in a first buffer and a second buffer of memory. At least a portion of the instructions are associated with a reference to the second buffer, for example, while their data results are loaded into the first buffer. In 203, the final location of a particular data result in the first buffer is tracked. In 204, a particular subsequent data result is stored in the first buffer at a location subsequent to that final location, after the previous data result. In 205, the system may detect that the first buffer is full. When a first data result of the multiple data results fills the first buffer, in 207, the remainder of the first data result and the subsequent data results after the first data result are automatically stored in the second buffer using the reference. If the first buffer is not full, results continue to be stored in the first buffer, and the final location is saved for the next data result.
[0014] 【0017】 Figure 3 shows an exemplary system for moving data between an accelerator 302 and a disk 304 according to one embodiment. In this example, instructions are generated by a software client 301 (which may be a virtual machine). In other embodiments, however, instructions may be submitted by another accelerator, which may include hardware components, for example. When issued by the software client 301, instructions may include operation codes and pointers; for example, a pointer may point to a buffer for use as an overflow buffer. The pointer may be used as a reference, for example. Advantageously, a stream of instructions may be issued automatically from the software client 301 to the accelerator 302 to generate a number of data results to be stored in buffers 321-323, without the software client waiting for an indication that execution of the instructions is complete. For example, the client 301 may start issuing instructions sequentially. The first instruction is received by the accelerator 302 and may be executed in a functional block 310a (for example, a compression circuit). The next instruction may be received by the accelerator 302 and executed in functional block 310b before functional block 310a has finished processing the first instruction. Instructions are thus sent to the accelerator 302 and may be processed, for example, in functional blocks 310a-n at least partially in parallel. The first instruction may include a reference to buffer 321, where the output data generated in response to the instruction will be stored. A subsequent instruction, however, may include a reference to buffer 322, which acts as an overflow buffer.
[0015] 【0018】 Functional block 310a may be the first functional block to output data results. As described above, the first instruction is received along with a reference to buffer 321 (which specifies where the data results of the first instruction will be stored). The first instruction can also be associated with a start location <stloc> (for example, the starting address in buffer 321 for storing data results). Thus, the first entry of the data result of the first instruction can be stored starting at the first address in buffer 321 (for example, <top>). The final location <lloc> (for example, the final address) where the last entry of the data result is stored can be tracked by accelerator 302. Therefore, when the output data result of functional block 310b becomes available, it can be stored starting at the next address after <lloc> in buffer 321. Functional blocks 310b-n continue to store data results in buffer 321 by starting at the address after the last address of the previous result and tracking the last address of each data result. However, once the data results of a particular instruction fill the buffer, the reference received with the instruction is used to automatically stop storing data result entries in buffer 321 and to start storing the remaining data result entries in the buffer specified by the reference (in this case, buffer 322).
[0016] 【0019】 As described above, in some cases, the output data results are non-deterministic (for example, the size of the data results stored in each buffer may not be the same and may not be known). Using this technique, which tracks the last location when the current buffer is full and automatically fills the next buffer, allows the system to efficiently store non-deterministic data results without the additional memory management overhead and / or delay caused by, for example, interaction with the client to manage buffers. In particular, various embodiments may not need to wait for a completion notification indicating the output data result size or buffer status after each instruction is issued. In addition, various embodiments may not need to allocate a separate output space for the data results generated for each instruction.
[0017] 【0020】 In some embodiments, the memory 303 further includes a third buffer 323. Thus, when data results fill one buffer, the electronic device can retrieve the data results stored in that buffer while the accelerator stores data results in the other buffers. In this example, data results from accelerator 302 are stored in memory 303, and the three buffers 321-323 are accessed by hard drive 304. For example, while data generated by instructions is loaded into buffers 321 and 322, hard drive 304 may empty buffer 323, which is already full. Client 301 may send a signal ("flush") to hard drive 304 indicating that buffer 323 is full and ready to be flushed. In response to this signal, hard drive 304 may begin retrieving data from buffer 323 while buffers 321 and 322 are being filled. When hard drive 304 is finished and buffer 323 is empty, hard drive 304 may send a signal ("Flush complete") indicating that the buffer is empty. Buffer 323 is then available for use as an overflow buffer. In particular, if buffer 321 is full and data is being stored in buffer 322 when the "Flush complete" signal is received, a subsequent instruction may be issued with a reference to buffer 323, for example (buffer 323 becomes the overflow buffer for buffer 322). Thus, data is stored contiguously across multiple buffers: at least one buffer is used as the starting buffer, instructions contain a reference pointing to the overflow buffer, and a downstream device empties buffers that have already been filled.
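The rotation of buffers between the filling, overflow, and flushing roles can be sketched as a small state tracker. This is an illustrative model only: the class `BufferRotation` and its method names are hypothetical, buffers are identified by their reference numerals, and the sketch assumes the downstream device completes flushes in the order they were requested.

```python
from collections import deque


class BufferRotation:
    """Tracks which buffer is filling, which can take overflow, and which are draining."""

    def __init__(self, names):
        self.free = deque(names)            # empty buffers, usable as overflow targets
        self.active = self.free.popleft()   # buffer currently receiving data results
        self.flushing = deque()             # full buffers being emptied downstream

    def overflow_ref(self):
        """Reference to attach to the next instruction, or None if no buffer is empty."""
        return self.free[0] if self.free else None

    def on_full(self):
        """Active buffer filled: hand it to the flusher and continue in the overflow buffer."""
        if not self.free:
            raise RuntimeError("no empty overflow buffer available")
        self.flushing.append(self.active)   # corresponds to the "flush" signal
        self.active = self.free.popleft()

    def on_flush_complete(self):
        """Downstream device reports a buffer empty; it can serve as overflow again."""
        # Assumption: flushes complete in the order they were requested.
        self.free.append(self.flushing.popleft())


rotation = BufferRotation([321, 322, 323])
rotation.on_full()            # 321 is full; writing continues in 322, overflow is 323
rotation.on_flush_complete()  # 321 has been emptied and rejoins the free list
```

With three buffers, one can be drained while a second is filled and a third stands by as the overflow target, so the producer only stalls if the downstream device falls behind by two full buffers.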
[0018] 【0021】 Figure 4 shows an exemplary instruction, data result, and buffer according to one embodiment. In some embodiments, the client may determine the number of instructions issued in a batch. In this example, the buffer size is 65536 bytes, and the maximum data result of an instruction is 8129 bytes. The remaining size of the buffer is 65536 bytes, and the pointer offset into the buffer is 1 (for example, the <top> address; the buffer is empty). Based on this information, the client calculates that 15 instructions can currently fit into the buffer and the overflow buffer. The 15 instructions (CMD0...CMD14) are shown in the table. These instructions are sent by the client along with the destination buffer address (Dest Buf Addr), the destination buffer size (Dest Buf Size), and a continue-write-buffer flag (ContWrBuffer). The destination buffer address indicates the buffer for storing the instruction output data. ContWrBuffer is a bit in the instruction descriptor that indicates whether the instruction should continue using the previous write buffer (ContWrBuffer==1) or explicitly use the write buffer supplied by the instruction descriptor (ContWrBuffer==0).
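A batch of descriptors carrying these three fields might be assembled as in the sketch below. The layout is an assumption made for illustration, not the actual descriptor format: the `Descriptor` class, the `build_batch` helper, the `"compress"` opcode, and the base addresses are hypothetical. The sketch assumes the first descriptor explicitly supplies the write buffer (ContWrBuffer==0) while the remaining descriptors continue in the previous write buffer (ContWrBuffer==1) and carry the base address and total size of the overflow buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical instruction descriptor with the fields named in the example."""
    opcode: str
    dest_buf_addr: int
    dest_buf_size: int
    cont_wr_buffer: int  # 1: continue in the previous write buffer; 0: use the supplied buffer


def build_batch(count, base, offset, buf_size, overflow_base, overflow_size, opcode="compress"):
    """Build `count` (>= 1) descriptors for one batch."""
    # First command: explicit write buffer, start address includes the offset,
    # size is what remains of the initial buffer.
    first = Descriptor(opcode, base + offset, buf_size - offset, 0)
    # Later commands: continue the previous write buffer and reference the overflow buffer.
    rest = [Descriptor(opcode, overflow_base, overflow_size, 1) for _ in range(count - 1)]
    return [first] + rest


# 15 commands, empty 65536-byte initial buffer, 65536-byte overflow buffer (addresses made up).
batch = build_batch(15, base=0x10000, offset=0, buf_size=65536,
                    overflow_base=0x20000, overflow_size=65536)
```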
[0019] 【0022】 The exemplary output bytes for each instruction are shown. In this example, the output bytes are randomly generated to demonstrate the non-deterministic size of the output data. As shown, the outputs of instructions CMD0 to CMD10 are stored in the initial buffer. However, the output of CMD11, at 4230 bytes, begins at address 61524. It fills the initial buffer, and the system automatically begins filling the new buffer. After the CMD11 output data is stored, the final address is 218 in the new buffer (61524 + 4230 - 65536 = 218). The accelerator or other digital system tracks the buffer end offset (the last column in the table).
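The CMD11 figures can be checked with a few lines of arithmetic. The helper `place` below is a hypothetical name introduced only for this check; it treats the end offset as the position at which the next result would begin.

```python
def place(start_offset, size, buf_size):
    """Return (bytes written to the current buffer, bytes spilled, end offset after the write).

    The end offset is relative to whichever buffer holds the last byte written.
    """
    room = buf_size - start_offset
    if size < room:
        return size, 0, start_offset + size
    spilled = size - room
    return room, spilled, spilled


# CMD11: 4230 bytes starting at offset 61524 of a 65536-byte buffer.
in_old, in_new, end_offset = place(61524, 4230, 65536)
```

Here 4012 bytes complete the initial buffer and the remaining 218 bytes land at the start of the new buffer, which matches the final address of 218 given in the example.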
[0020] 【0023】 The first command (CMD0) stores data in the initial buffer (Old), while all subsequent commands store data in the initial buffer until it is full and then use the overflow buffer (New). For the first command of the sequence, the write buffer start address (including the offset) and the remaining size are provided. For the overflow buffer, the start address (the base address of the buffer) and the total size of the buffer are provided in the subsequent command descriptors. If the initial buffer is partially full, the offset of the first command (e.g., 12000) will be the last address before the empty part of the buffer. In this case, the accelerator may receive this offset and may update the buffer start and end offsets.
[0021] Further examples 【0024】 Each of the non-limiting features in the following examples can function independently or can be combined in various permutations or combinations with one or more of the other features in the following examples. In various embodiments, the present disclosure can be implemented as a system or a method.
[0022] 【0025】 In one embodiment, the present disclosure includes a digital processor that continuously receives a plurality of commands that perform the same processing function to generate a plurality of data results, and a memory that includes at least a first buffer and a second buffer. The digital processor stores the plurality of data results in the first buffer and the second buffer. The digital processor tracks the final location in the first buffer of a particular data result of the plurality of data results, and a particular subsequent data result of the plurality of data results after the particular data result is stored in a location subsequent to the final location in the first buffer. At least a portion of the commands are associated with a reference to the second buffer, and when a first data result of the plurality of data results fills the first buffer, the remainder of the first data result and the plurality of subsequent data results after the first data result are automatically stored in the second buffer by using the reference.
[0023] 【0026】 In one embodiment, the present disclosure includes a method of moving data, the method comprising: continuously receiving, within a digital processor, a plurality of instructions that perform the same processing function to generate a plurality of data results; storing the plurality of data results in a first buffer and a second buffer of a memory by the digital processor, wherein at least a portion of the instructions are associated with a reference to the second buffer; tracking a final location in the first buffer of a particular data result of the plurality of data results; storing a particular subsequent data result of the plurality of data results after the particular data result in a location subsequent to the final location in the first buffer; and automatically storing the remainder of a first data result and the plurality of subsequent data results after the first data result in the second buffer by using the reference when the first data result of the plurality of data results fills the first buffer.
[0024] 【0027】 In one embodiment, one of the instructions is the first instruction of the plurality of instructions, the first instruction is associated with a starting location in the first buffer, and the first entry in the data result corresponding to the first instruction is stored at the starting address in the buffer.
[0025] 【0028】 In one embodiment, the instructions are generated by a software client or another digital processor.
[0026] 【0029】 In one embodiment, when the instructions are issued by a software client, they are associated with a reference to the second buffer, and the digital processor stores at least a portion of the plurality of data results in the first buffer.
[0027] 【0030】 In one embodiment, the software client is a virtual machine.
[0028] 【0031】 In one embodiment, the instructions are automatically issued from the software client to the digital processor to generate a plurality of data results stored in the first and second buffers, without the software client waiting for an indication that execution of the instructions has been completed.
[0029] 【0032】 In one embodiment, the digital processor is an accelerator that generates a non-deterministic output size.
[0030] 【0033】 In one embodiment, the accelerator is one of a data compression circuit and an artificial intelligence (AI) accelerator circuit.
[0031] 【0034】 In one embodiment, the system further includes an electronic device coupled to memory, which retrieves data results stored in the first buffer when the first data result of a plurality of data results fills the first buffer.
[0032] 【0035】 In one embodiment, the first electronic device is a hard drive or another processor.
[0033] 【0036】 In one embodiment, the memory further includes a third buffer, and when the first data result of a plurality of data results fills the first buffer, the electronic device retrieves the data result stored in the first buffer while the digital processor stores the plurality of data results in the second and third buffers.
[0034] 【0037】 The above description illustrates various embodiments, along with examples of how some aspects of the embodiments may be implemented. The above examples and embodiments should not be considered the only embodiments, and are presented to demonstrate the flexibility and advantages of some embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents may be adopted without departing from the scope of the invention as defined by the claims.
Claims
[Claim 1] A system comprising: a digital processor (101) that continuously receives a plurality of instructions (150) that perform the same processing function to generate a plurality of data results (150a to n); and a memory (102) that includes at least a first buffer (120a) and a second buffer (120b), wherein the digital processor (101) stores the plurality of data results (150a to n) in the first buffer (120a) and the second buffer (120b), the digital processor (101) tracks a final location, within the first buffer (120a), of a particular data result of the plurality of data results (150a to n), a particular subsequent data result of the plurality of data results (150a to n) following the particular data result is stored in a location subsequent to the final location within the first buffer (120a), at least a portion of the instructions (150) are associated with a reference to the second buffer (120b), and when a first data result of the plurality of data results fills the first buffer (120a), the remainder of the first data result and a plurality of subsequent data results after the first data result are automatically stored in the second buffer (120b) using the reference.
[Claim 2] The system according to claim 1, wherein one of the instructions is the first instruction of the plurality of instructions, the first instruction is associated with a start location in the first buffer, and a first entry in the data result corresponding to the first instruction is stored at the start address in the buffer.
[Claim 3] The system according to claim 1, wherein the instructions are generated by a software client or another digital processor.
[Claim 4] The system according to claim 3, wherein when the instructions are issued by the software client, they are associated with a reference to the second buffer, and the digital processor stores at least a portion of the plurality of data results in the first buffer.
[Claim 5] The system according to claim 3, wherein the software client is a virtual machine.
[Claim 6] The system according to claim 3, wherein the instructions are automatically issued from the software client to the digital processor to generate the plurality of data results stored in the first and second buffers, without the software client waiting for an indication that execution of the instructions has been completed.
[Claim 7] The system according to claim 1, wherein the digital processor is an accelerator that generates a non-deterministic output size.
[Claim 8] The system according to claim 7, wherein the accelerator is one of a data compression circuit and an artificial intelligence (AI) accelerator circuit.
[Claim 9] The system according to claim 1, further comprising an electronic device coupled to the memory, wherein when the first data result of the plurality of data results fills the first buffer, the electronic device retrieves the data result stored in the first buffer.
[Claim 10] The system according to claim 9, wherein the first electronic device is a hard drive or another processor.
[Claim 11] The system according to claim 9, wherein the memory further includes a third buffer, and when the first data result of the plurality of data results fills the first buffer, the electronic device retrieves the data result stored in the first buffer while the digital processor stores the plurality of data results in the second buffer and the third buffer.
[Claim 12] A method of moving data, comprising: sequentially receiving, within a digital processor, a plurality of instructions that perform the same processing function to generate a plurality of data results; storing, by the digital processor, the plurality of data results in a first buffer and a second buffer of a memory, wherein at least a portion of the instructions are associated with a reference to the second buffer; tracking a final location, within the first buffer, of a particular data result of the plurality of data results; storing a particular subsequent data result of the plurality of data results after the particular data result in a location subsequent to the final location in the first buffer; and when a first data result of the plurality of data results fills the first buffer, automatically storing the remainder of the first data result and a plurality of subsequent data results after the first data result in the second buffer by using the reference.
[Claim 13] The method according to claim 12, wherein one of the instructions is the first instruction of the plurality of instructions, the first instruction is associated with a starting location in the first buffer, and a first entry in the data result corresponding to the first instruction is stored at the starting address in the buffer.
[Claim 14] The method according to claim 12, wherein the instructions are generated by a software client.
[Claim 15] The method according to claim 14, wherein when the instructions are issued by the software client, they are associated with a reference to the second buffer, and the digital processor stores at least a portion of the plurality of data results in the first buffer.
[Claim 16] The method according to claim 14, wherein the software client is a virtual machine.
[Claim 17] The method according to claim 14, wherein the instructions are automatically issued from the software client to the digital processor to generate the plurality of data results stored in the first and second buffers, without the software client waiting for an indication that execution of the instructions has been completed.
[Claim 18] The method according to claim 12, further comprising an electronic device coupled to the memory, wherein when the first data result of the plurality of data results fills the first buffer, the electronic device retrieves the data result stored in the first buffer.
[Claim 19] The method according to claim 18, wherein the first electronic device is a hard drive or another processor.
[Claim 20] The method according to claim 19, wherein the memory further includes a third buffer, and when the first data result of the plurality of data results fills the first buffer, the electronic device retrieves the data result stored in the first buffer while the digital processor stores the plurality of data results in the second buffer and the third buffer.