Task allocation method for processing in memory-based language model service
By allocating the Prefill phase to processors and the Decode phase to PIM memory, the method addresses the execution challenges of large language models, enhancing efficiency and speed in language model execution.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- KOREA ELECTRONICS TECH INST
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-18
AI Technical Summary
Conventional large language model systems run the entire model on a single type of device, such as a GPU or NPU, and therefore struggle to satisfy the different execution characteristics of the Prefill and Decode phases at the same time.
The method manages task allocation between a processor and PIM memory according to the operation phase of the language model: the Prefill phase is executed on a processor and the Decode phase is executed on PIM memory, so that each phase runs on the device whose strengths match it.
This approach enables efficient execution of language models by leveraging the computational advantages of both processors and PIM memory, providing high-speed services to multiple users.
Description
Task assignment method for Processing in memory-based language model services
[0001] The present invention relates to the execution of a large language model service, and more specifically, to a method for managing task assignment / distribution according to the operation phases of a language model when executing a large language model.
[0002] Conventional large language models are mostly executed on systems utilizing GPUs (Graphics Processing Units), and devices such as NPUs (Neural Processing Units) also execute the entire model on a single type of device based on the same concept as GPUs.
[0003] However, the Prefill phase, which interprets the user's input sentence and computes Attention, and the Decode phase, which generates the output sentence token by token, have different execution characteristics, so it is difficult to satisfy both sets of characteristics simultaneously on a single type of device.
[0004] Accordingly, a method for managing task assignment / distribution to appropriate devices based on the operation phase of the language model is required.
[0005] The present invention has been devised to solve the aforementioned problems, and the objective of the present invention is to provide a method for distributing and managing work allocation between a processor and a PIM memory according to the operation phase of a language model, as a means to efficiently execute a language model in a system equipped with a PIM (Processing In Memory) memory and provide optimal high-speed services to multiple users.
[0006] A language model task allocation method according to an embodiment of the present invention for achieving the above objective comprises: a first allocation step in which a language model driving system allocates and executes a Prefill task for a language model service to a processor; and a second allocation step in which a language model driving system allocates and executes a Decode task for a language model service to a PIM memory.
[0007] The first allocation step may include: a step in which the language model driving system checks whether a Prefill task can be executed on a processor; and a step in which, if the Prefill task can be executed on a processor, the language model driving system allocates a Prefill task to the processor.
[0008] The first allocation step may further include a step in which, if the Prefill task cannot be executed on the processor, the language model driving system waits until the task can be executed on the processor.
[0009] The second allocation step may be performed when the Prefill operation is completed.
[0010] The second allocation step may include: a step of checking whether the language model driving system can execute a Decode task in the PIM memory; and, if the language model driving system can execute a Decode task in the PIM memory, a step of allocating a Decode task to the PIM memory.
[0011] The language model task assignment method according to the present invention may further include: a step of checking whether a language model driving system can execute a Decode task in a processor if a Decode task cannot be executed in the PIM memory; and a step of assigning a Decode task to a processor if a Decode task can be executed in the processor.
[0012] The language model task assignment method according to the present invention may further include the step of waiting until the language model driving system can execute the task in the PIM memory or processor if the Decode task cannot be executed even in the processor.
[0013] The language model task assignment method according to the present invention may further include the step of the language model driving system returning the task result to the user when the Decode task is completed.
[0014] The processor may include at least one of a CPU, a GPU, and an NPU.
[0015] According to another aspect of the present invention, a language model driving system is provided, characterized by comprising: a processor that assigns and executes a Prefill task for a language model service to itself; and a PIM memory that receives and executes a Decode task for a language model service assigned by the processor.
[0016] According to another aspect of the present invention, a method for assigning a language model task is provided, characterized by comprising: a step of determining whether a language model driving system can execute a Decode task for a language model service in a PIM memory; a step of the language model driving system assigning a Decode task to the PIM memory if the Decode task can be executed in the PIM memory; a step of determining whether the language model driving system can execute a Decode task in a processor if the Decode task cannot be executed in the PIM memory; and a step of the language model driving system assigning a Decode task to the processor if the Decode task can be executed in the processor.
[0017] According to another aspect of the present invention, a language model driving system is provided, characterized by comprising: a PIM memory for executing a Decode task for a language model service; and a processor that assigns the Decode task to the PIM memory if the Decode task can be executed in the PIM memory, and, if the Decode task cannot be executed in the PIM memory, checks whether it can execute the Decode task itself and, if so, assigns the Decode task to itself.
[0018] As described above, according to embodiments of the present invention, in a system equipped with a PIM memory, tasks in the Prefill phase are assigned to and performed by a processor, and tasks in the Decode phase are assigned preferentially to and performed by the PIM memory, thereby enabling the efficient execution of a language model and the provision of optimal high-speed services to multiple users.
[0019] Fig. 1. Structure of a language model execution system using general memory
[0020] Fig. 2. Prefill phase and Decode phase of the language model
[0021] Fig. 3. Prefill phase operation time of a language model on a single GPU
[0022] Fig. 4. Decode phase operation time of a language model in GPU and PIM memory
[0023] Fig. 5. Example of a language model execution system utilizing both a processor and PIM memory simultaneously
[0024] FIGS. 6, 7. Task allocation method in a system utilizing a processor (NPU) and PIM memory simultaneously.
[0025] The present invention will be described in more detail below with reference to the drawings.
[0026] An embodiment of the present invention presents a task allocation method for a PIM (Processing In Memory)-based language model service. It is a technology for managing task allocation and distribution to accelerate language model execution and optimally service multiple users simultaneously in a system that utilizes both a processor (CPU, GPU, NPU, etc.) and PIM memory.
[0027] In particular, because a language model has different execution characteristics depending on the phase, an embodiment of the present invention uses PIM to execute the memory-intensive phase of the language model more efficiently than the prior art, so that the model as a whole can be executed at high speed.
[0028] Figure 1 is a simplified schematic diagram of the structure of a system that executes a language model using general memory. In a system using general memory, all operations are performed by a processor (a semiconductor that performs operations, such as a CPU, GPU, or NPU), and the memory supplies data to the processor through a memory interface.
[0029] In this structure, all data used for the computation of the language model is loaded into memory before the model runs, and the processor receives the data necessary for the computation from memory through the memory interface.
[0030] Figure 2 is a diagram illustrating the Prefill phase and Decode phase when executing a language model.
[0031] In the Prefill phase, the language model receives the user input and performs an Attention operation to calculate the relationships between the input words and their importance; after the calculated values are stored in the memory cache (KV-cache), the first token is output.
[0032] In the Decode phase, the token generated in the previous iteration is fed back into the language model as input data. The relationship and importance between the KV-cache stored in the Prefill phase and the newly input token are calculated to update the KV-cache, after which a new output token is generated. This decoding process is repeated until the <termination> token, which terminates the language model's operation, is generated.
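The two phases described above can be sketched in a few lines of Python. This is a toy illustration only, not the model or implementation of the invention: the functions `embed` and `next_token`, the single-head attention, the integer token ids, and the `eos_token` value standing in for the <termination> token are all hypothetical placeholders chosen so that the example runs without external libraries.

```python
import math

def embed(token, dim=4):
    """Hypothetical deterministic embedding; stands in for learned weights."""
    return [math.sin(token * (i + 1)) for i in range(dim)]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def next_token(context_vector, vocab_size):
    """Hypothetical output head: maps an attention output to a token id."""
    return int(abs(sum(context_vector)) * 1000) % vocab_size

def prefill(prompt, vocab_size):
    """Prefill: attend over every prompt position, store the KV-cache, emit the first token."""
    vectors = [embed(t) for t in prompt]
    kv_cache = {"keys": list(vectors), "values": list(vectors)}
    # Attention is computed for all input positions, which is what makes this phase compute-heavy.
    outputs = [attend(q, vectors[: i + 1], vectors[: i + 1]) for i, q in enumerate(vectors)]
    return next_token(outputs[-1], vocab_size), kv_cache

def decode(first_token, kv_cache, vocab_size, eos_token, max_new_tokens):
    """Decode: feed the previous token back in, update the KV-cache, emit one token per iteration."""
    generated = [first_token]
    while generated[-1] != eos_token and len(generated) < max_new_tokens:
        vec = embed(generated[-1])
        kv_cache["keys"].append(vec)      # the cache grows by one entry per generated token
        kv_cache["values"].append(vec)
        out = attend(vec, kv_cache["keys"], kv_cache["values"])
        generated.append(next_token(out, vocab_size))
    return generated
```

Calling `prefill([3, 1, 4], vocab_size=50)` followed by `decode(...)` on the returned cache shows the asymmetry the text relies on: Prefill touches every input position in one pass, while each Decode iteration handles a single new token against the whole cache.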
[0033] Figure 3 shows the analysis of the execution time of the Prefill phase on a GPU device (based on NVIDIA H100) when running the LLAMA-7B language model. On the GPU, the Prefill operation takes 0.0044 seconds. On PIM memory (based on SK Hynix AIMX), the Prefill phase operation time increases proportionally to the length of the input data, so it is variable, but generally requires a much longer time compared to the Prefill time of the GPU. This is because the Prefill phase is a computationally intensive operation in which matrix multiplication is repeatedly performed when calculating the Attention score for all input words, and the performance of matrix multiplication depends heavily on the performance of the arithmetic unit rather than the memory. Therefore, processors such as GPUs or NPUs are more advantageous than PIM memory, which cannot accommodate many arithmetic units due to limitations in the memory process.
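The compute-bound character of the Prefill phase can be made concrete with a rough arithmetic-intensity estimate (floating-point operations per byte moved). The sketch below is an illustrative assumption rather than a measurement from the source: it models one square weight matrix of size `d_model` × `d_model` applied to `n_tokens` input vectors, 2-byte elements, and counts only the weights, inputs, and outputs as memory traffic. The values 4096 and 512 are example parameters.

```python
def arithmetic_intensity(d_model, n_tokens, bytes_per_element=2):
    """FLOPs per byte for multiplying a (d_model x d_model) weight matrix by n_tokens input vectors.

    Simplified model (assumption): each operand is read or written exactly once.
    """
    flops = 2 * d_model * d_model * n_tokens  # one multiply and one add per weight per token
    bytes_moved = (d_model * d_model + 2 * d_model * n_tokens) * bytes_per_element
    return flops / bytes_moved

# Prefill-like case: many input tokens share one pass over the weights.
prefill_intensity = arithmetic_intensity(d_model=4096, n_tokens=512)

# Decode-like case: a single token per pass, so every weight is fetched for only two FLOPs.
decode_intensity = arithmetic_intensity(d_model=4096, n_tokens=1)

print(f"prefill-like: {prefill_intensity:.1f} FLOPs/byte")
print(f"decode-like:  {decode_intensity:.2f} FLOPs/byte")
```

Under these assumptions the matrix-matrix case performs hundreds of operations per byte while the single-token case performs about one, which is consistent with the text's argument that Prefill favors devices with many arithmetic units.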
[0034] Figure 4 shows the analysis of the execution time of the Decode phase when running the LLAMA-7B language model on a GPU and on PIM memory. For convenience of comparison, the Decode phase was limited to generating 500 tokens on both devices. Generating the 500 tokens took 2.2 seconds on the GPU and 0.925 seconds on PIM memory. Unlike the Prefill phase, the Decode phase involves repeated matrix-vector multiplication rather than matrix-matrix multiplication. Since matrix-vector multiplication is a memory-intensive operation whose performance depends heavily on memory performance rather than on the performance of the arithmetic unit, PIM memory, which can utilize the internal bandwidth within the memory, is advantageous compared to processors such as GPUs or NPUs.
[0035] FIG. 5 is a diagram illustrating the configuration of a language model driving system, to which an embodiment of the present invention is applicable, in which a processor (110) and a PIM memory (120) are utilized simultaneously. The processor (110) used in the language model driving system according to the embodiment of the present invention is the same as in the existing system, but the PIM memory (120) is used in place of the existing general memory.
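The reported timings can be turned into per-token latency and an end-to-end estimate with simple arithmetic. The sketch below uses only the figures quoted in paragraphs [0033] and [0034]; the end-to-end comparison additionally assumes, as a simplification not stated in the source, that handing a request from the processor to PIM memory adds no overhead.

```python
PREFILL_GPU_S = 0.0044   # Prefill time on the GPU, from [0033]
DECODE_GPU_S = 2.2       # Decode time for 500 tokens on the GPU, from [0034]
DECODE_PIM_S = 0.925     # Decode time for 500 tokens on PIM memory, from [0034]
N_TOKENS = 500

def per_token_latency_ms(total_seconds, n_tokens):
    """Average time per generated token, in milliseconds."""
    return total_seconds / n_tokens * 1000.0

gpu_ms = per_token_latency_ms(DECODE_GPU_S, N_TOKENS)   # about 4.4 ms per token
pim_ms = per_token_latency_ms(DECODE_PIM_S, N_TOKENS)   # about 1.85 ms per token
decode_speedup = DECODE_GPU_S / DECODE_PIM_S            # about 2.38x

# End-to-end estimate, assuming zero hand-off cost between devices (assumption).
gpu_only_s = PREFILL_GPU_S + DECODE_GPU_S
hybrid_s = PREFILL_GPU_S + DECODE_PIM_S

print(f"GPU decode: {gpu_ms:.2f} ms/token, PIM decode: {pim_ms:.2f} ms/token")
print(f"decode speedup: {decode_speedup:.2f}x")
print(f"GPU only: {gpu_only_s:.4f} s, Prefill on GPU + Decode on PIM: {hybrid_s:.4f} s")
```

Because the Decode phase dominates the total time in this example, moving only that phase to PIM memory accounts for almost all of the end-to-end improvement.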
[0036] As before, all data required for running the language model is loaded into memory cells of the PIM memory (120) before running the model, and during the execution of the language model, the processor (110) assigns tasks by selecting whether to process the operations directly or to process them at the arithmetic unit of the PIM memory (120) according to the execution phase. Additionally, the KV-cache, which is continuously generated and updated during the Prefill phase and Decode phase, is also stored in memory cells within the PIM memory (120).
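A minimal sketch of this arrangement follows, assuming a purely static phase-to-executor mapping. The names `Phase`, `Executor`, `PimMemoryCells`, `load_model`, `select_executor`, and `run_phase` are hypothetical and introduced only for illustration; the sketch models where data lives and which unit is selected, not the computation itself.

```python
from dataclasses import dataclass, field
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"

class Executor(Enum):
    PROCESSOR = "processor (110)"
    PIM_ARITHMETIC_UNIT = "arithmetic unit of PIM memory (120)"

@dataclass
class PimMemoryCells:
    """Hypothetical model of the memory cells of the PIM memory (120)."""
    weights: dict = field(default_factory=dict)
    kv_cache: dict = field(default_factory=dict)  # request id -> list of cache entries

def load_model(cells, weights):
    """All model data is loaded into the PIM memory cells before the model runs."""
    cells.weights.update(weights)

def select_executor(phase):
    """The processor chooses to compute directly (Prefill) or to delegate to PIM (Decode)."""
    return Executor.PROCESSOR if phase is Phase.PREFILL else Executor.PIM_ARITHMETIC_UNIT

def run_phase(cells, request_id, phase, new_entries):
    """Select the executor for the phase and record the KV-cache update in the PIM cells."""
    executor = select_executor(phase)
    # Whichever unit computes, the KV-cache stays in the same memory cells.
    cells.kv_cache.setdefault(request_id, []).extend(new_entries)
    return executor
```

Keeping the KV-cache in one place for both phases means the Decode phase can start on the PIM arithmetic unit without the cache first being copied elsewhere; that benefit is an inference from the described layout rather than a claim made in the source.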
[0037] FIGS. 6 and 7 illustrate a method of assigning tasks in a situation where a language model is provided to multiple users in a system that utilizes an NPU as a processor (110) and simultaneously utilizes a PIM memory (120).
[0038] To assign a task requested by the user to the language model, as shown in FIG. 6, the system is first initialized, and the Prefill task queue and Decode task queue are initialized (S210). The two task queues may be implemented in hardware within the processor (110) or in software by a program running on the processor (110).
[0039] When the task queue initialization is complete, the user request queue is checked to see whether there are any user requests (S220). If there are no user requests (S220 - no requests), the system waits until a user request is added (S280).
[0040] On the other hand, if there is a user request (S220 - request exists), the Prefill task is added to the Prefill task queue (S230), and it is checked whether the Prefill task can be executed on the NPU (110) (S240). If the Prefill task cannot be executed on the NPU (110) (S240 - impossible), the system waits until the task can be performed on the NPU (110) (S290).
[0041] If the NPU (110) is able to perform the task (S240 - possible), a task is assigned to the NPU (110) from the Prefill task queue (S250), and the NPU (110) performs the Prefill task (S260). When the execution of the Prefill task is completed at the NPU (110) (S270), the subsequent Decode task is performed.
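The flow of FIG. 6 can be sketched as a software implementation of the queues, one of the two options the description allows. Everything below is an illustrative assumption: the `Npu` class with its `busy` flag, the function names, and the choice to express "waiting" as returning `None` from a polling step are hypothetical and not taken from the source.

```python
from collections import deque

class Npu:
    """Hypothetical stand-in for the NPU (110); runs one task at a time."""
    def __init__(self):
        self.busy = False

    def can_execute(self):
        return not self.busy

    def run_prefill(self, request):
        # Placeholder for the real Prefill computation.
        return {"request": request, "first_token": f"<first token for {request}>"}

def init_queues():
    """S210: initialize the Prefill task queue and the Decode task queue."""
    return deque(), deque()

def prefill_step(user_requests, prefill_queue, npu):
    """One pass through S220-S270.

    Returns the Prefill result, or None when the system has to wait (S280 or S290).
    """
    if user_requests:                                   # S220 - request exists
        prefill_queue.append(user_requests.popleft())   # S230
    if not prefill_queue:
        return None                                     # S280: wait for a user request
    if not npu.can_execute():                           # S240 - impossible
        return None                                     # S290: task stays queued until the NPU is free
    task = prefill_queue.popleft()                      # S250
    return npu.run_prefill(task)                        # S260, S270
```

A caller would invoke `prefill_step` repeatedly and hand each non-`None` result to the Decode flow. A queued task that could not run remains at the head of the Prefill task queue and is retried on the next pass.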
[0042] For the Decode task, as shown in FIG. 7, the Decode task following the Prefill task completed in step S270 of FIG. 6 is first added to the Decode task queue (S310), and it is checked whether the Decode task can be performed in the PIM memory (120) (S320).
[0043] If the Decode task can be performed in the PIM memory (120) (S320-possible), the Decode task is assigned to the PIM memory (120) (S330), and the PIM memory (120) performs the Decode task (S340). When the Decode task is completed (S350), the result of the task is returned to the user (S360). Afterwards, the process returns to the initial state, step S220 of FIG. 6, and either serves another user according to the user request queue or returns to a waiting state.
[0044] Meanwhile, if it is confirmed at step S320 that the PIM memory (120) is currently performing other tasks and therefore the Decode task cannot be performed (S320-impossible), it is checked whether the NPU (110) can perform the Decode task (S370). If the NPU (110) can perform the Decode task (S370-possible), the Decode task is assigned to the NPU (110) (S380), and the NPU (110) performs the Decode task (S390). When the Decode task is completed (S350), the result of the task is returned to the user (S360), and the process returns to the initial state at step S220 of FIG. 6.
[0045] Meanwhile, if the Decode task cannot be performed in the NPU (110) either (S370-impossible), the system waits until the task can be performed in the PIM memory (120) or the NPU (110) (S395).
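The Decode flow of FIG. 7, with PIM memory preferred and the NPU as fallback, can be sketched in the same polling style. The `Device` class, its `busy` flag, and the string returned by `run_decode` are hypothetical placeholders, and returning `None` is only one possible way to represent the waiting state of S395.

```python
from collections import deque

class Device:
    """Hypothetical stand-in for either the PIM memory (120) or the NPU (110)."""
    def __init__(self, name, busy=False):
        self.name = name
        self.busy = busy

    def can_execute(self):
        return not self.busy

    def run_decode(self, task):
        # Placeholder for the real token-by-token generation.
        return f"{task} decoded on {self.name}"

def decode_step(decode_queue, pim, npu):
    """One pass through S320-S395 for the task at the head of the Decode task queue.

    Returns the result to hand back to the user (S360), or None when the system must wait.
    """
    if not decode_queue:
        return None
    if pim.can_execute():                 # S320 - possible
        device = pim                      # S330
    elif npu.can_execute():               # S370 - possible
        device = npu                      # S380
    else:
        return None                       # S395: wait for the PIM memory or the NPU
    task = decode_queue.popleft()
    return device.run_decode(task)        # S340 or S390, then S350
```

The order of the two checks encodes the preference stated in the description: the NPU is considered only when the PIM memory is occupied, so Decode work moves to the processor only under load.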
[0046] Up to now, a task assignment method for PIM-based language model services has been described in detail with reference to preferred embodiments.
[0047] In the above embodiment, in a system equipped with PIM memory, tasks in the Prefill phase are assigned to the processor for execution and tasks in the Decode phase are assigned to the PIM memory for execution, so that the language model is executed efficiently and optimal high-speed services are provided to multiple users.
[0048] Meanwhile, it goes without saying that the technical concept of the present invention may also be applied to a computer-readable recording medium containing a computer program that enables the device and method according to the present embodiment to perform their functions. Furthermore, the technical concept according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. A computer-readable recording medium may be any data storage device that can be read by a computer and store data. For example, a computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc. Additionally, computer-readable code or a program stored on a computer-readable recording medium may be transmitted through a network connected between computers.
[0049] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above. Various modifications are possible by those skilled in the art without departing from the essence of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present invention.
Claims
1. A language model task allocation method characterized by including: a first allocation step in which a language model driving system allocates a Prefill task for a language model service to a processor and executes it; and a second allocation step in which the language model driving system allocates a Decode task for the language model service to a PIM memory and executes it.
2. The language model task allocation method of Claim 1, characterized in that the first allocation step includes: a step in which the language model driving system checks whether the Prefill task can be executed on the processor; and a step in which, if the Prefill task can be executed on the processor, the language model driving system allocates the Prefill task to the processor.
3. The language model task allocation method of Claim 2, characterized in that the first allocation step further includes a step in which, if the Prefill task cannot be executed on the processor, the language model driving system waits until the task can be executed on the processor.
4. The language model task allocation method of Claim 3, characterized in that the second allocation step is performed when the Prefill task is completed.
5. The language model task allocation method of Claim 4, characterized in that the second allocation step includes: a step in which the language model driving system checks whether the Decode task can be executed in the PIM memory; and a step in which, if the Decode task can be executed in the PIM memory, the language model driving system allocates the Decode task to the PIM memory.
6. The language model task allocation method of Claim 5, characterized by further including: a step in which, if the Decode task cannot be executed in the PIM memory, the language model driving system checks whether the Decode task can be executed on the processor; and a step in which, if the Decode task can be executed on the processor, the language model driving system allocates the Decode task to the processor.
7. The language model task allocation method of Claim 6, characterized by further including a step in which, if the Decode task cannot be executed even on the processor, the language model driving system waits until the task can be executed in the PIM memory or on the processor.
8. The language model task allocation method of Claim 7, characterized by further including a step in which, when the Decode task is completed, the language model driving system returns the task result to the user.
9. The language model task allocation method of Claim 1, characterized in that the processor includes at least one of a CPU, a GPU, and an NPU.
10. A language model driving system characterized by including: a processor that allocates a Prefill task for a language model service to itself and executes it; and a PIM memory that is allocated a Decode task for the language model service by the processor and executes it.
11. A language model task allocation method characterized by including: a step in which a language model driving system determines whether a Decode task for a language model service can be executed in a PIM memory; a step in which, if the Decode task can be executed in the PIM memory, the language model driving system allocates the Decode task to the PIM memory; a step in which, if the Decode task cannot be executed in the PIM memory, the language model driving system checks whether the Decode task can be executed on a processor; and a step in which, if the Decode task can be executed on the processor, the language model driving system allocates the Decode task to the processor.
12. A language model driving system characterized by including: a PIM memory for executing a Decode task for a language model service; and a processor that allocates the Decode task to the PIM memory if the Decode task can be executed in the PIM memory, and, if the Decode task cannot be executed in the PIM memory, checks whether it can execute the Decode task itself and, if so, allocates the Decode task to itself.