Weight memory mapping method and system for streaming computations in large-scale generative artificial intelligence hardware
The weight memory mapping system addresses latency issues in large-scale generative AI hardware by employing partial sum reuse and optimized weight positioning, resulting in reduced latency and improved computational efficiency.
Patent Information
- Authority / Receiving Office: JP
- Patent Type: Applications
- Current Assignee / Owner: HYPERACCEL CO LTD
- Filing Date: 2024-05-20
- Publication Date: 2026-06-11
AI Technical Summary
The low parallelism in Transformer structures of generative artificial intelligence models leads to high latency, necessitating an efficient weight memory mapping method to reduce latency in large-scale hardware operations.
A weight memory mapping system with a hardware arithmetic unit that processes matrix multiplication in lanes, utilizing partial sum reuse and pre-processing, and includes a preprocessing routing unit to optimize weight matrix positioning for adjacent operations, along with quantization of rotary embedding parameters for efficient streaming computations.
The system significantly reduces hardware latency by enabling early calculation of final sums and optimizing weight data access, enhancing the efficiency of large-scale generative AI operations.
Abstract
Description
【Technical Field】 【0001】 Embodiments of the present invention relate to a weight memory mapping method and system for streaming operations of large-scale generative artificial intelligence hardware. 【Background Art】 【0002】 Most generative artificial intelligence models exhibit high performance by adopting a Transformer structure. However, due to the low parallelism of operations in the Transformer structure, latency becomes crucial. To reduce the latency of hardware, an efficient weight mapping method is required. 【Summary of the Invention】 【Problems to be Solved by the Invention】 【0003】 A weight memory mapping method and system for streaming operations of large-scale generative artificial intelligence hardware can be provided. 【0004】 The technical problems of the present invention should not be limited to those described above, and other technical problems not described will be clearly understood by those skilled in the art from the following description. 【Means for Solving the Problems】 【0005】 A weight memory mapping system is provided, which includes a weight memory for storing a weight matrix for a pre-trained artificial intelligence model, an input register for storing a plurality of input data, a first hardware arithmetic unit that processes a matrix multiplication between the plurality of input data and the weight matrix and calculates a final sum in units of lanes during the progress of the matrix multiplication by reusing partial sums of the matrix multiplication, and a second hardware arithmetic unit that pre-processes a next matrix multiplication during the progress of the matrix multiplication using the final sum. 
【0006】 According to one aspect, the partial sum may include the result of matrix multiplication between one column of the weight matrix and the plurality of input data, and the final sum per lane may include the accumulated value of the partial sums that result from matrix multiplication between each column of the weight matrix assigned to the lane and the plurality of input data. 【0007】 In another aspect, the first hardware arithmetic unit may include a plurality of MAC (multiply-and-accumulation) trees that process the matrix multiplication on a lane-by-lane basis, a partial sum register that stores the partial sums of the matrix multiplication, and a plurality of partial sum accumulators that accumulate the partial sums held in the partial sum register and calculate the final sum on a lane-by-lane basis. 【0008】 In another aspect, the number of MAC trees and the number of partial sum accumulators may each correspond to the number of lanes. 【0009】 In another aspect, the second hardware arithmetic unit may pre-process the next operation using the final sum per lane calculated by at least one of the plurality of partial sum accumulators. 【0010】 In another aspect, the artificial intelligence model may include a transformer model, the weight memory may store, as the weight matrix, weight data shared by the token embedding operation and the LM (Language Modeling) head operation of the transformer model, and the first hardware arithmetic unit may, during the token embedding operation, read the weights of a specific column of the weight memory to process the token embedding operation.
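The lane-wise partial sum reuse described in these aspects can be illustrated with a minimal Python sketch. It is not the claimed hardware: the function names, the `lanes` and `tile` parameters, the step counter, and the `input_reuse` baseline are all assumptions introduced for illustration. Each lane is modeled as one MAC tree plus one accumulator working on one output column, and `done_at` records the step at which each final sum becomes available.

```python
def mac_tree(weights, inputs):
    """One MAC tree: element-wise multiplies followed by a pairwise adder tree."""
    level = [w * x for w, x in zip(weights, inputs)]
    while len(level) > 1:
        if len(level) % 2:              # pad odd levels so every adder has two inputs
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def tile_partial_sum(x, W, k0, tile, n):
    """Partial sum for output column n over reduction rows k0 .. k0+tile-1."""
    rows = range(k0, min(k0 + tile, len(x)))
    return mac_tree([W[k][n] for k in rows], [x[k] for k in rows])


def partial_sum_reuse(x, W, lanes, tile):
    """Finish one group of `lanes` output columns before starting the next group."""
    K, N = len(x), len(W[0])
    y, done_at, step = [0] * N, [0] * N, 0
    for n0 in range(0, N, lanes):
        cols = range(n0, min(n0 + lanes, N))
        for k0 in range(0, K, tile):
            step += 1                   # one step: every lane fires once
            for n in cols:
                y[n] += tile_partial_sum(x, W, k0, tile, n)  # per-lane accumulator
        for n in cols:
            done_at[n] = step           # final sum of this lane is ready here
    return y, done_at


def input_reuse(x, W, lanes, tile):
    """Hypothetical baseline: hold one input tile and sweep all output columns."""
    K, N = len(x), len(W[0])
    y, done_at, step = [0] * N, [0] * N, 0
    for k0 in range(0, K, tile):
        for n0 in range(0, N, lanes):
            step += 1
            for n in range(n0, min(n0 + lanes, N)):
                y[n] += tile_partial_sum(x, W, k0, tile, n)
                done_at[n] = step       # the last update is the final sum
    return y, done_at


x = [1, 2, 3, 4]
W = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1], [1, 2, 1, 0]]
y_ps, t_ps = partial_sum_reuse(x, W, lanes=2, tile=2)
y_in, t_in = input_reuse(x, W, lanes=2, tile=2)
```

In this toy run both orderings produce the same result and take four steps in total, but with partial sum reuse the first two final sums are ready after step 2, while the input reuse baseline delivers its first final sums only at step 3. That early availability is what a second arithmetic unit could exploit to start the next operation.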
【0011】 In another aspect, the weight memory mapping system may include a preprocessing routing unit that adjusts, before the matrix multiplication, the positions of the weight matrix values stored in the weight memory so that the values required in the operation following the matrix multiplication are adjacent to each other. 【0012】 In another aspect, the artificial intelligence model may include a transformer model, and the following operation may include a rotary embedding operation for the transformer model. 【0013】 In another aspect, the artificial intelligence model may include a transformer model, and the weight memory mapping system may include a rotary embedding parameter processing unit that quantizes the sine and cosine values for one position of the rotary embedding parameter, used in the rotary embedding calculation of the transformer model, to 8-bit fixed point, and then stores the set of quantized sine and cosine values, packed into 16 bits, in the weight memory. 【0014】 In another aspect, the rotary embedding parameter processing unit may store the set in the weight memory, or read the set from the weight memory, according to the value obtained by dividing the position of the set by the number of channels, the channel determined by the head number, and the address determined by the remainder obtained by dividing the position of the set by the number of channels.
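As an illustration of the 8-bit quantization and 16-bit packing of a sine and cosine pair, the following Python sketch may help. The text does not specify the fixed-point format or the byte order, so the signed format with 7 fractional bits, the clamping to [-128, 127], the placement of sine in the high byte, and all function names are assumptions made for this example only.

```python
import math

FRAC_BITS = 7  # assumption: signed 8-bit fixed point with 7 fractional bits


def to_q8(value):
    """Quantize a value in [-1, 1] to a signed 8-bit integer, clamping at the ends."""
    q = round(value * (1 << FRAC_BITS))
    return max(-128, min(127, q))


def pack_sin_cos(angle):
    """Pack quantized sin (high byte, assumed) and cos (low byte) into one 16-bit word."""
    s, c = to_q8(math.sin(angle)), to_q8(math.cos(angle))
    return ((s & 0xFF) << 8) | (c & 0xFF)


def _sign_extend8(byte):
    """Interpret an unsigned byte as a two's-complement signed value."""
    return byte - 256 if byte >= 128 else byte


def unpack_sin_cos(word):
    """Recover the approximate (sin, cos) pair from a packed 16-bit word."""
    s = _sign_extend8((word >> 8) & 0xFF)
    c = _sign_extend8(word & 0xFF)
    scale = 1 << FRAC_BITS
    return s / scale, c / scale
```

Under these assumptions the round-trip error per value is at most 1/128, with the largest error occurring at +1.0, which clamps to 127/128. Packing the pair into one word means a single 16-bit read delivers both parameters needed for one rotation.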
【0015】 A weight memory mapping method is provided, comprising the steps of: storing a weight matrix for a pre-trained artificial intelligence model in weight memory; storing a plurality of input data in input registers; processing matrix multiplication between the plurality of input data and the weight matrix through a first hardware arithmetic unit, and reusing the partial sum of the matrix multiplication to calculate a final sum in lane units while the matrix multiplication is in progress; and using the final sum to pre-process the next matrix multiplication while the matrix multiplication is in progress through a second hardware arithmetic unit. 【0016】 Specific details of other embodiments will be included in the detailed description and drawings. [Effects of the Invention] 【0017】 A weight memory mapping method and system for streaming operations of large-scale generative artificial intelligence hardware can be provided. 【0018】 The effects of the present invention should not be limited to those described above, and other effects not described will be clearly understood by those skilled in the art from the description of the claims. 【Brief Description of the Drawings】 【0019】 [Figure 1] It is a diagram showing an example of the structure of a latency processing unit according to an embodiment of the present invention. [Figure 2] It is a diagram showing an example of an implementation model of an LPU according to an embodiment of the present invention. [Figure 3] It is a diagram showing an example of an implementation model of an LPU according to an embodiment of the present invention. [Figure 4] It is a diagram showing an example of an implementation model of an LPU according to an embodiment of the present invention. [Figure 5] It is a diagram showing an example of an implementation model of an LPU according to an embodiment of the present invention. 
[Figure 6] It is a diagram for explaining the weight matrix data mapping of a high-bandwidth memory for matrix multiplication of a latency processing unit according to an embodiment of the present invention. [Figure 7] It is a diagram for explaining the high-bandwidth memory interface included in a latency processing unit according to an embodiment of the present invention. [Figure 8] It is a diagram for explaining a reconfigurable multifunctional arithmetic unit included in a latency processing unit according to an embodiment of the present invention. [Figure 9] It is a diagram for explaining the configuration of an address-based non-sequential multi-unit scheduler included in a latency processing unit according to an embodiment of the present invention. [Figure 10] It is a diagram for explaining the concepts of input reuse and partial sum (output) reuse in an embodiment of the present invention. [Figure 11] It is a diagram for explaining the concepts of input reuse and partial sum (output) reuse in one embodiment of the present invention. [Figure 12] It schematically shows an example of a weight memory mapping system based on partial sum reuse according to one embodiment of the present invention. [Figure 13] It shows an example of data deduplication in one embodiment of the present invention. [Figure 14] It shows an example of a process of matrix multiplication in one embodiment of the present invention. [Figure 15] It shows an example of a process of processing a rotary embedding operation in one embodiment of the present invention. [Figure 16] It shows an example of a process of processing a rotary embedding operation in one embodiment of the present invention. [Figure 17] It is a block diagram showing an example of the internal configuration of a weight memory mapping system according to one embodiment of the present invention. [Figure 18] It
is a flowchart showing an example of a weight memory mapping method according to one embodiment of the present invention. 【BEST MODE FOR CARRYING OUT THE INVENTION】 【0020】 The advantages, features, and methods for achieving them of the present invention will become apparent by referring to the embodiments described in detail below together with the accompanying drawings. However, the present invention should not be limited to the embodiments disclosed below and may be realized in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those with ordinary knowledge in the technical field to which the present invention belongs of the scope of the invention, and the present invention is only defined based on the claims. Throughout the specification, the same reference numerals indicate the same components. 【0021】 When a component is described as “connected to” or “coupled to” another component, this includes both direct connection or coupling, and cases where the other component is an intermediary. Conversely, when a component is described as “directly connected to” or “directly coupled to” another component, this means that no other component is an intermediary. “And / or” includes each of the listed items, and all combinations of one or more of them. 【0022】 The terms used herein are for illustrative purposes only and are not intended to limit the invention. In this specification, the singular form includes the plural form unless otherwise specified in the context. As used in this specification, “comprises” and / or “comprising” means that the components, steps, operations, and / or elements described herein do not exclude the presence or addition of one or more other components, steps, operations, and / or elements. 【0023】 While terms such as "first," "second," etc., are used to describe various components, these components should not be limited by these terms. 
These terms are merely used to distinguish one component from another. Therefore, it goes without saying that the first component described below may also be the second component within the technical concept of the present invention. 【0024】 Unless otherwise defined, all terms used herein (including technical and scientific terms) shall be used in the sense commonly understood by a person of ordinary skill in the art to which this invention pertains. Furthermore, commonly used, dictionary-defined terms shall not be interpreted ideally or excessively unless otherwise specified. 【0025】 Figure 1 shows an example of the structure of a latency processing unit according to an embodiment of the present invention. 【0026】 Referring to Figure 1, the LPU (Latency Processing Unit) 100 according to an embodiment of the present invention may include an SMA (Streamlined Memory Access) 110, an OIU (Operand Issue Unit) 120, an SXE (Streamlined eXecution Engine) 130, a VXE (Vector eXecution Engine) 140, an LMU (Local Memory Unit) 150, an ISU (Instruction Scheduling Unit) 160, a PCIe (Peripheral Component Interconnect express) Interface 170, and a P2P (Peer to Peer) Interface 180. 【0027】 The SMA110 may be a special DMA (Direct Memory Access). For example, the SMA110 may connect all channels (for example, 32) of the HBM200 to an execution engine (for example, SXE130) and transmit FP16 (half precision floating point) data at maximum bandwidth. The SMA110 may be designed with a deep FIFO (First In First Out) to transmit contiguous memory requests based on preloaded memory (MEM) instructions. Hardware-aware memory mapping may reduce latency by eliminating matrix modification or transposition operations. Thus, the SMA110 can stream data received at maximum burst size to the execution engine with minimal latency. Furthermore, the SMA110 may efficiently perform matrix transposition using strobe signals.
The streaming data may include parameters for vector-matrix execution (e.g., weights, bias) and for other vector-related execution (e.g., gamma / beta, embedding). 【0028】 The OIU120 may align the data streamed by the SMA110 (e.g., the first operand) with the input held in on-chip memory (e.g., the second operand) before issuing them to the execution engine. Based on the EXE instruction, the OIU120 may generate microcode to configure the execution engine and determine the target engine for the operands. The OIU120 may also have a reuse buffer to eliminate read wait times for static operands (e.g., input vectors) and an asymmetric buffer to maintain vectorized data used as scalars (e.g., biases). Thus, appropriate operands are almost always prefetched and ready to be issued immediately to the execution engine. 【0029】 SXE130 is the primary computing hardware of LPU100 and may be designed to make full use of the incoming bandwidth when performing vector-matrix multiplication (V·M) for operations such as attention, 1D convolution, and feedforward networks. SXE130 may include enough MAC (multiply-and-accumulation) trees 131 to match the processing bandwidth to the receiving bandwidth of HBM200. For example, if 1024 elements are received from HBM200 per cycle, 16 MAC trees 131, each taking a 64-element input vector, may match the processing bandwidth to the receiving bandwidth. Furthermore, a MAC tree 131 with 64 inputs may consist of 64 multipliers and 63 adders. 【0030】 Multiple MAC trees 131 may perform matrix multiplication and may be connected channel by channel, through the SMA110, to the HBM200, which is an external high-bandwidth memory. Specifically, each of the multiple MAC trees 131 connects to the HBM200 through one channel to maximize the transmission bandwidth between the LPU 100 and the HBM200, so that the matrix multiplication required for the ultra-large artificial intelligence model is performed without a bottleneck.
Therefore, the number of multiple MAC trees 131 and the number of memory channels in the HBM200 may be configured to be the same. 【0031】 The matrix multiplication results of multiple MAC trees 131 may be provided to VXE140. VXE140 may be implemented using a user-specified low-latency ALU (Arithmetic Logic Unit) and may perform vector operations such as token embedding, softmax, normalization, and residual calculations. Since such vector operations occur relatively infrequently, hardware resources can be reduced with negligible performance loss by adjusting the fan-in to this path in OIU120. While VXE140 receives the calculation results of multiple MAC trees 131, it may also receive an activation value from LMU150 before performing subsequent calculations. VXE140 may be configured to include various combinations of arithmetic units by including multiple multifunction arithmetic data paths. 【0032】 The LMU150 may transmit activation values to multiple MAC trees 131 and VXE140. In this case, the LMU150 may copy and transmit the activation values to multiple MAC trees 131 to transmit the same activation values. The LMU150 may also store the results of calculations performed by the multiple MAC trees 131 and VXE140. In other words, the LMU150 may function within the LPU100 as an internal buffer corresponding to the HBM200. In this case, the LPU100 may store activation values or model parameters with high reuse rates in matrix multiplication in the LMU150, and weights with low reuse rates in the HBM200. The LMU150 may be implemented as a 4MB multibank register file with scalar vector separation for high-speed, high-bandwidth access to input, output, and intermediate data. The LMU150 may also be a multiport that simultaneously supports reading and writing during the write and storage phase of the OIU120 and execution engine. 【0033】 The ISU160 may control the entire execution flow of the LPU100. 
The ISU160 may utilize the PIC (Parallel Instruction Chaining) method, in which dependent instructions are executed sequentially within a chain. PIC separates instructions requiring independent hardware into groups of dependent instructions (e.g., memory (MEM) instructions, execution (EXE) instructions, network (NET) instructions), so that the instruction chains of the groups are executed in parallel with one another, achieving low control overhead and latency savings. The ISU160 may also update control registers (e.g., token and layer numbers) for engine execution. An internal scheduler may support the non-sequential execution of SXE130 and VXE140 to maximize hardware utilization, and a robust scoreboard may be designed to handle data hazards. For example, the ISU160 may schedule multiple MAC trees 131 and VXE140 to perform calculations simultaneously. Furthermore, the ISU160 may improve computational throughput and latency by pre-executing dependency-free instructions in order to maximize parallel processing, thereby minimizing the idle time of each arithmetic unit and memory access unit. 【0034】 The LPU100 may be connected to a host computer via a PCIe interface 170, and may receive from the host computer the instructions necessary for the operation of the LPU100 and the input values and weights of the ultra-large-scale artificial intelligence model, perform calculations, and then transmit the results back to the host computer. 【0035】 The LPU100 may be scaled out as a cluster of multiple LPUs connected via the P2P interface 180. The expanded cluster structure can further accelerate the computation of ultra-large-scale artificial intelligence models. 【0036】 Figures 2-5 show examples of LPU implementation models according to embodiments of the present invention. In the embodiment shown in Figure 1, an example of an implementation model using HBM200 as the external memory was described. Instead of HBM200, DDR (Double Data Rate) may be used as the external memory.
In this case, since it is difficult to store a large-scale model on a single device, it may be divided into multiple partitions and stored in separate partitions in external memory for multiple devices (multiple LPUs). In this case, synchronization between multiple devices may be required for inference of the large-scale model. 【0037】 In the embodiment shown in Figure 2, similar to the embodiment described in Figure 1, multiple external memories 320 are shown to store multiple partitions 310 of a large-scale model, and multiple LPUs 330 are shown connected in parallel to the multiple external memories 320. One LPU may be implemented on one FPGA (Field Programmable Gate Array), and one partition may be connected in parallel to one FPGA. The transformer structure includes multi-head attention, layer normalization, feedforward, etc., within the decoder layer, but multi-head attention and feedforward may be model-parallelized. In this case, when multi-head attention is completed, one embedding vector may be output as a result. Since one device only has the embedding vector portion, multiple devices need to share each embedding vector in order to move to the next operation, and this requires synchronization. In this case, considering scalability, one LPU may be implemented in a form that has multiple external memories (for example, two, four, etc.). As an example, the embodiment in Figure 1 shows an example in which two HBM200s, each storing one partition, are used. 【0038】 In the embodiment shown in Figure 3, as an example of a PIM (Processing-in-Memory) model, one LPU is implemented on a PIM chip, and both the partition and the LPU arithmetic unit are integrated on a single chip. In the embodiment of Figure 3, multiple LPUs 410, multiple partitions 310, and multiple LPU arithmetic units 420 that can be implemented on a PIM chip are shown. In this case, each of the multiple LPUs 410 may include one partition and one LPU arithmetic unit. 
【0039】 The embodiment in Figure 4 shows an example of a PNM (Processing-near-Memory) model. It can be difficult to include the configuration for processing all LPU calculations within a single PIM chip. The embodiment in Figure 4 shows a model in which multiple memory chips 510 store multiple partitions 310, and a buffer chip 520, such as a PNM chip, includes an LPU arithmetic unit 521 for LPU calculations. 【0040】 The embodiment shown in Figure 5 illustrates an example of a model in which PIM and PNM are combined. For example, multiple memory chips 610 may store multiple partitions 310. Furthermore, each of the multiple memory chips 610 may have a PIM-type LPU arithmetic unit 611, which is an accumulation unit such as a MAC tree. In this case, the buffer chip 620 may have an LPU arithmetic unit 621, implemented in a PNM manner, for the remaining high-level operations of the LPU. 【0041】 Figure 6 illustrates the weight matrix data mapping of a high-bandwidth memory for matrix multiplication in a latency processing unit according to one embodiment of the present invention. 【0042】 Referring to Figure 6, in this embodiment, the LPU 100 may have the same number of MAC trees 131 as the number of memory channels of the SMA 110. Therefore, weight matrix data mapped to a high-bandwidth memory 610 such as an HBM 200 can be stored so that each MAC tree can retrieve its weight data during matrix multiplication without accessing other memory channels. 【0043】 Specifically, the weight matrix data may be stored in the high-bandwidth memory 610 such that the column direction D1 of the weight matrix is mapped to each channel 620-n for each of the multiple MAC trees 131. Since matrix multiplication can be performed in parallel in the column direction of the weight matrix, the multiple MAC trees 131 may each read the column direction data on their allocated memory channel 620-n and perform matrix multiplication.
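The column-to-channel mapping can be sketched in a few lines of Python. The text only states that the column direction of the weight matrix is mapped to the channels, one channel per MAC tree; the round-robin assignment `n % num_channels`, the data layout, and the function names below are assumptions chosen to keep the example small.

```python
def map_columns_to_channels(W, num_channels):
    """Store each weight column in exactly one channel (round-robin, assumed)."""
    channels = [[] for _ in range(num_channels)]
    for n in range(len(W[0])):
        column = [row[n] for row in W]          # walk down the column of W
        channels[n % num_channels].append((n, column))
    return channels


def channelwise_matvec(x, channels):
    """Each MAC tree reads only its own channel and produces that channel's outputs."""
    y = [0] * sum(len(ch) for ch in channels)
    for ch in channels:                          # conceptually one MAC tree per channel
        for n, column in ch:
            y[n] = sum(w * xi for w, xi in zip(column, x))
    return y
```

Because every column lives in a single channel, the loop body for one channel never touches data held by another, which is the property that lets the hardware avoid a crossbar between MAC trees and memory channels.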
【0044】 Next, the weight matrix data may be mapped so that multiple MAC trees 131 can accumulate the weight matrix in the row direction D2 to complete the final calculation result. The number of rows of data mapped at one time is determined by the bandwidth of the high-bandwidth memory 610, which may be determined by the size of the tile that multiple MAC trees 131 can process at one time. 【0045】 Figure 7 is a diagram illustrating a high-bandwidth memory interface included in a latency processing unit according to one embodiment of the present invention. 【0046】 Referring to Figure 7, the SMA110 may connect the LMU150, multiple MAC trees 131, and high-bandwidth memory 610. The SMA110 is not connected to other computing units of the LPU100, and therefore the high-bandwidth memory interface can be minimized in terms of hardware resources. 【0047】 Multiple MAC trees 131 and memory channels 620-n may be connected in a one-to-one correspondence. That is, multiple MAC trees 131 do not need to access channels other than the one directly assigned to them, and matrix multiplication can be performed without using complex interfaces such as crossbars, which use many resources and have high latency. 【0048】 The SMA110 may consist only of a read interface for multiple MAC trees 131 to retrieve weight matrix data stored in the high-bandwidth memory 610. In other words, as described below, the results of the calculations are stored in the high-bandwidth memory 610 via the LMU150, so no write interface for multiple MAC trees 131 to the high-bandwidth memory 610 is configured, and hardware resources can be reduced accordingly. 【0049】 Conversely, the SMA110 may only configure a write interface between the LMU150 and the high-bandwidth memory 610. The calculation results stored in the internal buffer LMU150 may be transmitted through the SMA110 to be recorded in the high-bandwidth memory 610, and the memory channel to be recorded may be selected using the demultiplexer 710. 
【0050】 Figure 8 is a diagram illustrating a reconfigurable multifunction arithmetic unit included in a latency processing unit according to one embodiment of the present invention. 【0051】 Referring to Figure 8, the VXE140 may include multiple multifunction arithmetic data paths 810 and 820, and these multiple multifunction arithmetic data paths 810 and 820 may be connected to an operand / result value chain network 830 to form various combinations of arithmetic units. 【0052】 As shown in Figure 8, the multiple multifunction arithmetic data paths 810 and 820 may include various arithmetic units necessary for, for example, lookup table-based nonlinear activation functions and masking operations. However, the configuration of the arithmetic units of the reconfigurable multifunction arithmetic data paths 810 and 820 shown in Figure 8 is merely illustrative, and it goes without saying that any additional arithmetic units necessary for large-scale model computations can be included in the multifunction arithmetic data paths 810 and 820. The results calculated by the VXE140 may be transmitted to the LMU150. 【0053】 Figure 9 is a diagram illustrating the configuration of an address-based non-sequential multi-unit scheduler included in a latency processing unit according to one embodiment of the present invention. 【0054】 Referring to Figure 9, the ISU160, which is the address-based non-sequential multi-unit scheduler included in the LPU100 according to this embodiment, may include an address-based instruction dependency determination and scheduling controller 910, multiple instruction issue controllers 921, 922, 923, 924, a multi-bank buffer address state table 950, an instruction buffer 960, a result address state update logic 970, and a multi-unit instruction dispatcher 980.
【0055】 The ISU160 may operate each arithmetic unit and data transfer unit simultaneously through the address-based instruction dependency determination and scheduling controller 910 and the multiple instruction issue controllers 921, 922, 923, and 924. In this case, the ISU160 may change to 1 the state of the operand addresses and result address, in the multi-bank buffer address state table 950, of the instructions being executed by each arithmetic unit. 【0056】 The multi-bank buffer address state table 950 may change the state of the result address of an instruction that has finished executing to 0 through the result address state update logic 970. 【0057】 The address-based instruction dependency determination and scheduling controller 910 may refer to the address states in the multi-bank buffer address state table 950 and determine the dependencies between instructions to be executed and instructions currently being executed, as well as the dependencies among the instructions to be executed. This allows instructions with no dependencies to be processed in advance, thereby minimizing the idle time of each arithmetic unit and data transfer unit. 【0058】 The address-based instruction dependency determination and scheduling controller 910 included in the ISU160 may load and process instructions from the instruction buffer 960. At this time, the address-based instruction dependency determination and scheduling controller 910 may execute loop instructions, decode and classify the other instructions, and transmit them through the multi-unit instruction dispatcher 980 to the device-to-device instruction issue controller 921, the direct memory access instruction issue controller 922, the MAC tree instruction issue controller 923, and the reconfigurable multifunction arithmetic unit instruction issue controller 924.
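The address state table behaves like a simple scoreboard, which the following Python sketch tries to make concrete. It is a simplified model under stated assumptions: the class and function names are hypothetical, operand addresses are cleared together with the result address on completion (the text only mentions the result address), and an instruction is held back if it touches any address used by an earlier instruction that is still blocked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    operands: tuple   # buffer addresses read by the instruction
    result: int       # buffer address written by the instruction


class AddressScoreboard:
    """Address state table: 1 while an in-flight instruction uses the address."""

    def __init__(self):
        self.state = {}

    def _addrs(self, ins):
        return (*ins.operands, ins.result)

    def can_issue(self, ins):
        return all(self.state.get(a, 0) == 0 for a in self._addrs(ins))

    def issue(self, ins):
        for a in self._addrs(ins):
            self.state[a] = 1

    def complete(self, ins):
        # Assumption: operand addresses are released along with the result address.
        for a in self._addrs(ins):
            self.state[a] = 0


def schedule(pending, board):
    """Issue every pending instruction that has no dependency, skipping blocked ones."""
    issued, blocked_addrs = [], set()
    for ins in pending:
        addrs = set(ins.operands) | {ins.result}
        if board.can_issue(ins) and not (addrs & blocked_addrs):
            board.issue(ins)
            issued.append(ins)
        else:
            blocked_addrs |= addrs   # keep later instructions from overtaking unsafely
    return issued
```

For example, with a matrix multiplication writing address 2, a softmax reading address 2, and an unrelated load, one call to `schedule` issues the matrix multiplication and the load together, and the softmax is issued only after `complete` is called for the matrix multiplication.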
【0059】 The ISU160 may receive and store instructions for the LPU100 from the host computer via the PCIe interface 170, and store the current state of the LPU100 in a state register. The host computer may check the state register via the PCIe interface 170. 【0060】 On the other hand, the weights of the artificial intelligence are pre-trained and called from external memory to perform calculations during the inference of the artificial intelligence model. Therefore, if there is no need to process the weights at runtime, they may be handed over to the hardware that processes the inference of the artificial intelligence model after performing maximum optimization in advance. Embodiments of the present invention provide a memory mapping method and system for large-scale generative artificial intelligence hardware. 【0061】 Artificial intelligence hardware must decide whether to reuse inputs, weights, or outputs during computation. For example, transformer-based models cannot reuse weights during inference, so it is necessary to decide whether to reuse either inputs or outputs to reduce hardware latency and perform computations efficiently. 【0062】 Furthermore, it is necessary to determine whether there are any inefficient calculations when executed at runtime and to check whether such calculations can be handled during mapping. It should also be possible to perform quantization by utilizing the parameter distribution and the maximum and / or minimum values. 【0063】 The memory mapping methods and systems according to embodiments of the present invention must be implemented with a focus on reducing hardware latency. In a transformer structure where the output of the current operation is connected to the input of the next operation, techniques that increase the utilization rate of the arithmetic unit, such as out-of-order processing, cannot be used. 
Therefore, it is necessary to adopt a method that allows the final result value to be obtained quickly so that the next operation can be processed sooner. In embodiments of the present invention, the final sum is calculated early using an output-reuse memory mapping method, so that the next operation can be started in advance and the efficiency of the arithmetic unit is increased. 【0064】 Furthermore, a mapping method is proposed for cases where the same weights are used in different ways, such as the token embedding and LM (Language Modeling) head operations. The LM head operation is a matrix multiplication, whereas the token embedding operation reads a vector from a specific address, so the two would ordinarily require different mapping methods. However, since the weight data itself is the same, if the mapping to external memory differs for each, duplicate data will be written, and the external memory cannot be used efficiently. To solve this, a method is proposed in which token embedding uses the matrix multiplication mapping. 【0065】 Furthermore, in one embodiment of the present invention, a dedicated mapping method may be provided to reduce the complexity of hardware routing that may arise during rotary embedding. In the weight memory mapping method and system according to this embodiment, the routing is resolved in advance during weight memory mapping, so the runtime hardware can proceed without the associated latency. 【0066】 Figures 10 and 11 illustrate the concepts of input reuse and partial sum (output) reuse in one embodiment of the present invention. When mapping weights, one may decide whether to fetch the input once and reuse it, or to hold the output once and reuse it. When reusing inputs as in Figure 10, the number of partial sums grows with the number of rows, so FIFO (First In First Out) resources are needed for the number of rows.
Conversely, when reusing partial sums as in Figure 11, registers are required to store input data in proportion to the number of columns. In other words, whichever method is used, the amount of hardware resources required is about the same, because the row and column sizes are roughly the same in the transformer. However, the two methods differ in when the final result is obtained. When reusing inputs, the final sums are completed one after another only at the end of the operation, whereas when reusing partial sums, the final sums are completed early. Since the objective of the embodiments of the present invention is to reduce latency, it is desirable for the final sum to be calculated as early as possible. For example, if there are arithmetic units that are not participating in the calculation, they can perform the next calculation in advance using the final sum that has already been calculated. 【0067】 Figure 12 is a schematic diagram showing an example configuration of a weight memory mapping system based on partial sum reuse according to one embodiment of the present invention. The weight memory mapping system 1200 according to this embodiment may include a weight memory 1210, an input register 1220, and a hardware arithmetic unit 1230. Here, the hardware arithmetic unit 1230 may correspond to the LPU 100 described above. The weight memory 1210 is a memory that stores the weight matrix of a pre-trained artificial intelligence model and may correspond to the HBM 200. As explained above, partial sum reuse requires a register that stores input data in proportion to the number of columns of the weight matrix, and the input register 1220 may correspond to such a register. 【0068】 The embodiment in Figure 12 shows an example with four lanes. For this purpose, the hardware arithmetic unit 1230 may include four MAC trees 1231 and four partial sum accumulators 1232, and may further include a partial sum register 1233 for storing the partial sums.
The hardware arithmetic unit 1230 reads the data from the weight memory 1210 as a stream, directly calculates the partial sums through the MAC trees 1231 and stores them in the partial sum register 1233, and accumulates the partial sums through the partial sum accumulators 1232, so that the final sum is calculated early, in the middle of the matrix multiplication. In this way, as shown in Figure 11, the final sum is calculated in the middle of the operation rather than at the end. Therefore, if there is a second hardware arithmetic unit 1240 that is not participating in the operation, it can receive the early-calculated final sum from the partial sum accumulators 1232 and use it to execute the next operation in advance, thereby reducing hardware latency. For example, the second hardware arithmetic unit 1240 may correspond to another LPU. 【0069】 Figure 13 shows an example of data deduplication in one embodiment of the present invention. The transformer has a token embedding 1310 operation and an LM head 1320 operation. Since the transformer itself is a well-known technology, the embodiment of Figure 13 shows only the token embedding 1310 operation and the LM head 1320 operation. These two operations share weight data. In other words, the weight data can be stored in only one place in the HBM200; a vector is read from the corresponding address during the token embedding 1310 operation, and matrix multiplication is performed using the weights during the LM head 1320 operation. However, when different operations are performed using the same data in this way, a GPU treats them as completely different data, and the weight data may be stored redundantly in the HBM200, which wastes HBM200 capacity. 【0070】 In this embodiment, the token embedding 1310 operation can also be processed using the memory mapping for matrix multiplication.
As shown in Figure 13, when memory mapping is performed using the partial sum reuse method, it can be seen that all the data is gathered into one row (one column). Therefore, during the token embedding 1310 operation, the necessary weight data can be read easily by specifying only the lane and the column from which to read it. In this case, although the bandwidth of the HBM200 is not fully utilized, memory waste is reduced because the data need not be stored redundantly. 【0071】 Figure 14 shows an example of matrix multiplication processing in one embodiment of the present invention. In this embodiment, rotary embedding is described as an example, but the invention is not limited to this. In the case of rotary embedding, after matrix multiplication, operations are performed between values that are far apart in the weight matrix, for example between the 0th and 64th values, and then between the 1st and 65th values. If this is implemented in hardware as is, it not only causes congestion in hardware routing but also makes streaming processing difficult when the values to be operated on are too far apart. 【0072】 Therefore, by changing the positions of the weight matrix values at the compiler stage, operations between adjacent data can be processed immediately without complex hardware routing. Although routing is performed at the pre-processing stage, hardware routing is eliminated at runtime, which is advantageous from a hardware perspective. 【0073】 Furthermore, even if the order of the values in the weight matrix is changed arbitrarily in this way, there is no routing overhead at runtime because the next operation is an order-independent inner product operation. 【0074】 Figures 15 and 16 illustrate an example of the process for processing rotary embedding operations in one embodiment of the present invention. Rotary embedding operations involve sine and cosine calculations.
However, since the results of sine and cosine calculations lie in the range of -1 to 1, using a typical 16-bit floating-point number (float16) is inefficient. It was confirmed using a source-code-level simulator that calculation accuracy is not affected even when the values are quantized to 8-bit fixed-point numbers. As a result, in this embodiment, memory usage can be reduced by packing the two 8-bit values so that the sine and cosine values are handled as a single 16-bit set. Furthermore, sine and cosine values exist for each head and for each position. These can be stored in a memory such as the HBM200, and when reading them, the memory channel and address can be determined to retrieve them. 【0075】 Figure 17 is a block diagram showing an example of the internal configuration of a weight memory mapping system according to one embodiment of the present invention, and Figure 18 is a flowchart showing an example of a weight memory mapping method according to one embodiment of the present invention. 【0076】 The weight memory mapping system 1700 according to this embodiment may include a weight memory 1710, an input register 1720, a preprocessing routing unit 1730, a rotary embedding parameter processing unit 1740, and a plurality of hardware arithmetic units 1750. Depending on the embodiment, the weight memory 1710 may be a collection of weight memories for each of the plurality of hardware arithmetic units 1750. For example, each weight memory may correspond to the HBM200 described above. Also, depending on the embodiment, at least some of the components included in the weight memory mapping system 1700 may be implemented in physically different hardware devices.
For example, each of the plurality of hardware arithmetic units 1750 may be implemented in a separate hardware device, while the weight memory 1710, input register 1720, preprocessing routing unit 1730, and rotary embedding parameter processing unit 1740 may be implemented in other hardware devices. The hardware devices may be connected to one another via a network and communicate with one another. Also, at least some components may be excluded from the weight memory mapping system 1700, or additional components may be included. For example, for embodiments that do not handle rotary embedding calculations for transformer models, the rotary embedding parameter processing unit 1740 may be excluded from the weight memory mapping system 1700. In other examples, the weight memory mapping system 1700 may further include input / output interfaces for connection to input / output devices, or further include communication interfaces. 【0077】 In step 1810, the weight memory mapping system 1700 may store a weight matrix for a pre-trained artificial intelligence model in the weight memory 1710. Here, the weight matrix stored in the weight memory 1710 may be a part of the overall weight matrix of the artificial intelligence model. For example, if a large-scale artificial intelligence model is separated into multiple partitions and processed through multiple LPUs, there may be a weight memory for each of the multiple LPUs, and the weight matrix to be processed by each LPU may be stored in the weight memory of the corresponding LPU. In this embodiment, the weight memory 1710 may store a weight matrix for a first hardware arithmetic unit 1751, which is one of the multiple hardware arithmetic units 1750.
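As a hedged illustration of the weight repositioning described for rotary embedding in paragraphs 0071 to 0073, the following Python sketch reorders the output columns of two projection matrices before they are stored, so that values i and i + d/2 of the matrix multiplication result become adjacent. The function names, the head dimension of 8, the query / key projection setup, and all numeric data are assumptions made for illustration and do not appear in the specification.

```python
import math

def interleave_order(head_dim):
    """Column order that places value i directly before value i + head_dim // 2."""
    half = head_dim // 2
    order = []
    for i in range(half):
        order += [i, i + half]
    return order

def permute_columns(weight, order):
    """Offline step: reorder the output columns of a row-major weight matrix."""
    return [[row[j] for j in order] for row in weight]

def matvec(x, weight):
    """Compute x @ weight for a vector x and a row-major matrix."""
    return [sum(xr * row[c] for xr, row in zip(x, weight))
            for c in range(len(weight[0]))]

def rotate(v, angles, pairs):
    """Apply a 2-D rotation by angles[i] to each index pair (a, b) in pairs."""
    out = list(v)
    for (a, b), angle in zip(pairs, angles):
        c, s = math.cos(angle), math.sin(angle)
        out[a] = v[a] * c - v[b] * s
        out[b] = v[a] * s + v[b] * c
    return out

def dot(u, v):
    """Order-independent inner product."""
    return sum(a * b for a, b in zip(u, v))

HEAD_DIM = 8
HALF = HEAD_DIM // 2
far_pairs = [(i, i + HALF) for i in range(HALF)]        # 0 with 4, 1 with 5, ...
near_pairs = [(2 * i, 2 * i + 1) for i in range(HALF)]  # 0 with 1, 2 with 3, ...

# Hypothetical query / key projection weights and input vector.
w_q = [[math.sin(1 + 3 * r + c) for c in range(HEAD_DIM)] for r in range(6)]
w_k = [[math.cos(2 + r + 5 * c) for c in range(HEAD_DIM)] for r in range(6)]
x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
angles_q = [0.3 * (i + 1) for i in range(HALF)]
angles_k = [0.7 * (i + 1) for i in range(HALF)]

# Reference: original layout, rotation between far-apart values.
score_ref = dot(rotate(matvec(x, w_q), angles_q, far_pairs),
                rotate(matvec(x, w_k), angles_k, far_pairs))

# Remapped: columns permuted offline, rotation between adjacent values only.
order = interleave_order(HEAD_DIM)
score_new = dot(
    rotate(matvec(x, permute_columns(w_q, order)), angles_q, near_pairs),
    rotate(matvec(x, permute_columns(w_k, order)), angles_k, near_pairs))
```

In this sketch the two scores agree up to floating-point rounding, because the inner product does not depend on element order as long as both operands are permuted in the same way.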
【0078】 In step 1820, the weight memory mapping system 1700 may, through the preprocessing routing unit 1730, adjust the positions of the weight matrix values stored in the weight memory 1710 so that the values required for the operation following the matrix multiplication are adjacent to each other. For example, if the artificial intelligence model is a transformer model, the following operation may include, but is not limited to, a rotary embedding operation for the transformer model. If the operation following the matrix multiplication operates on values that are scattered far apart, the weight memory mapping system 1700 may adjust the positions of the weight matrix values in advance for that operation. Depending on the embodiment, step 1820 may be omitted. 【0079】 In step 1830, the weight memory mapping system 1700 may, through the rotary embedding parameter processing unit 1740, quantize the sine and cosine values of one position of the rotary embedding parameter for the rotary embedding calculation of the transformer model to 8-bit fixed-point values, and then store the quantized sine and cosine values in the weight memory 1710 as a set packed into 16 bits. In this case, the rotary embedding parameter processing unit 1740 may store the set in, or read the set from, the weight memory 1710 according to the value obtained by dividing the position of the set by the number of channels, the channel determined by the head number, and the address determined by the remainder obtained by dividing the position of the set by the number of channels. Step 1830 is for the rotary embedding calculation of the transformer model and may be omitted depending on the embodiment. Also, the order of steps 1820 and 1830 may be changed. 【0080】 In step 1840, the weight memory mapping system 1700 may store a plurality of input data in the input register 1720.
For the reuse of partial sums, matrix multiplication between the plurality of input data and each column of the weight matrix may be performed sequentially. Since the plurality of input data are therefore used repeatedly, as many times as there are columns in the weight matrix, the weight memory mapping system 1700 may store them in the input register 1720. 【0081】 In step 1850, the weight memory mapping system 1700 may process matrix multiplication between the plurality of input data and the weight matrix through the first hardware arithmetic unit 1751, reusing partial sums of the matrix multiplication to calculate the final sum on a lane-by-lane basis while the matrix multiplication is in progress. Here, the first hardware arithmetic unit 1751 may correspond to the hardware arithmetic unit 1230 described with reference to Figure 12. In other words, the first hardware arithmetic unit 1751 may include a plurality of MAC trees 1231 that process matrix multiplication on a lane-by-lane basis, a partial sum register 1233 that stores partial sums of the matrix multiplication, and a plurality of partial sum accumulators 1232 that accumulate the partial sums in the partial sum register 1233 to calculate the final sum on a lane-by-lane basis. In this case, the number of MAC trees 1231 and the number of partial sum accumulators 1232 may correspond to the number of lanes. For example, if the number of lanes is 4, the number of MAC trees 1231 and the number of partial sum accumulators 1232 may also be 4. Meanwhile, the partial sum may include the result of matrix multiplication between one column of the weight matrix and the plurality of input data. Furthermore, the lane-by-lane final sum may include the accumulated value of the partial sums, which are the results of matrix multiplication between each lane-by-lane column of the weight matrix and the plurality of input data.
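The difference in timing between input reuse and partial sum reuse can be sketched with a small cycle-counting model in Python. This is only an illustrative model under stated assumptions: one cycle per tile step with all lanes working in parallel, a tile size of 4, four lanes, and a hypothetical 16 x 8 integer weight matrix. None of the function names or sizes come from the specification.

```python
def tile_dot(x, weight, col, row0, tile):
    """One MAC-tree step: dot product of one input tile with one column tile."""
    rows = range(row0, min(row0 + tile, len(weight)))
    return sum(x[r] * weight[r][col] for r in rows)

def partial_sum_reuse(x, weight, lanes, tile):
    """Each lane keeps one column's partial sum until its final sum is complete."""
    n_rows, n_cols = len(weight), len(weight[0])
    final, done_at, cycle = [0] * n_cols, [0] * n_cols, 0
    for col0 in range(0, n_cols, lanes):
        cols = range(col0, min(col0 + lanes, n_cols))
        for row0 in range(0, n_rows, tile):
            cycle += 1                      # all lanes work in the same cycle
            for c in cols:
                final[c] += tile_dot(x, weight, c, row0, tile)
        for c in cols:
            done_at[c] = cycle              # final sums ready for these lanes
    return final, done_at

def input_reuse(x, weight, lanes, tile):
    """One input tile is applied to every column before the next tile is loaded."""
    n_rows, n_cols = len(weight), len(weight[0])
    partial, done_at, cycle = [0] * n_cols, [0] * n_cols, 0
    for row0 in range(0, n_rows, tile):
        is_last_tile = row0 + tile >= n_rows
        for col0 in range(0, n_cols, lanes):
            cycle += 1
            for c in range(col0, min(col0 + lanes, n_cols)):
                partial[c] += tile_dot(x, weight, c, row0, tile)
                if is_last_tile:
                    done_at[c] = cycle      # complete only in the last pass
    return partial, done_at

# Hypothetical 16 x 8 integer weight matrix and 16 input values.
weight = [[(3 * r + 5 * c) % 7 - 3 for c in range(8)] for r in range(16)]
x = [(2 * r) % 5 - 2 for r in range(16)]

psr_out, psr_done = partial_sum_reuse(x, weight, lanes=4, tile=4)
ir_out, ir_done = input_reuse(x, weight, lanes=4, tile=4)
print(psr_done)  # [4, 4, 4, 4, 8, 8, 8, 8]
print(ir_done)   # [7, 7, 7, 7, 8, 8, 8, 8]
```

Both schedules take 8 cycles in total and produce the same results, but in this model the first final sums are available at cycle 4 with partial sum reuse and only at cycle 7 with input reuse, which is the window an idle arithmetic unit could use to start the next operation.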
【0082】 In step 1860, the weight memory mapping system 1700 may, through the second hardware arithmetic unit 1752, use the final sum to pre-process the next matrix multiplication while the matrix multiplication is in progress. The second hardware arithmetic unit 1752 may be one of the multiple hardware arithmetic units 1750 that is not currently participating in the calculation. In other words, the weight memory mapping system 1700 may transmit the early-calculated final sum to the second hardware arithmetic unit 1752, which is an arithmetic unit not currently participating in the calculation, so that the next calculation is pre-processed. At this time, the second hardware arithmetic unit 1752 may pre-process the next calculation by using the lane-by-lane final sum calculated by at least one of the multiple partial sum accumulators 1232. 【0083】 Furthermore, depending on the embodiment, if the artificial intelligence model is a transformer model, the token embedding operation and the LM head operation for the transformer model may share weight data. For this purpose, in step 1810, the weight memory mapping system 1700 may store the shared weight data as a weight matrix in the weight memory 1710. Also, in step 1850, the weight memory mapping system 1700 may, during the token embedding operation, read the weights of a specific column in the weight memory through the first hardware arithmetic unit 1751 and process the token embedding operation. As an example, it was described above that vector operations such as token embedding, softmax, normalization, and residual operations may be performed in the VXE 140 included in the LPU 100. 【0084】 Thus, according to embodiments of the present invention, a weight memory mapping method and system for streaming computation of large-scale generative artificial intelligence hardware can be provided.
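The sharing of one weight matrix between the token embedding operation and the LM head operation (paragraph 0083) can be sketched as follows in Python. The layout is an assumption made for illustration: each column of a hidden x vocabulary matrix is stored contiguously, and column c is assigned to lane c modulo the number of lanes. The function names, the lane assignment rule, and the small sizes are hypothetical and are not taken from the specification.

```python
def map_for_matmul(weight, lanes):
    """Store each column of a hidden x vocab matrix contiguously in one lane."""
    hidden, vocab = len(weight), len(weight[0])
    memory = [[] for _ in range(lanes)]
    for col in range(vocab):
        memory[col % lanes].append([weight[r][col] for r in range(hidden)])
    return memory

def read_column(memory, col):
    """Locate a column by lane and by slot within that lane."""
    lanes = len(memory)
    return memory[col % lanes][col // lanes]

def token_embedding(memory, token_id):
    """Token embedding: read one column, no arithmetic needed."""
    return list(read_column(memory, token_id))

def lm_head(memory, hidden_state, vocab):
    """LM head: matrix multiplication over the same stored columns."""
    return [sum(h * w for h, w in zip(hidden_state, read_column(memory, t)))
            for t in range(vocab)]

# Hypothetical 3 x 10 shared weight matrix (hidden size 3, vocabulary size 10).
weight = [[10 * r + c for c in range(10)] for r in range(3)]
memory = map_for_matmul(weight, lanes=4)

embedding = token_embedding(memory, 6)      # column 6 of the shared matrix
logits = lm_head(memory, [1, 2, 3], vocab=10)
print(embedding)  # [6, 16, 26]
```

Only one copy of the weights is stored in this sketch. The token embedding reads a single column, which, as the description notes, does not use the full memory bandwidth but avoids redundant storage.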
【0085】 As described above, the embodiments have been explained with reference to a limited set of embodiments and drawings, but those skilled in the art will be able to make various modifications and variations based on the above description. Therefore, other embodiments and equivalents of the claims also fall within the scope of the attached claims.
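The 8-bit quantization and 16-bit packing of the rotary embedding sine and cosine values (paragraphs 0074 and 0079) can be sketched as follows in Python. The scale factor of 127, the placement of the sine value in the upper byte, and the function names are assumptions made for illustration. The specification states only that fixed-point 8-bit values are packed into 16 bits, and the channel and address selection is not modeled here.

```python
import math

SCALE = 127  # assumed signed 8-bit fixed-point scale for values in [-1, 1]

def to_int8(value):
    """Quantize a value in [-1, 1] to a signed 8-bit integer."""
    q = round(value * SCALE)
    return max(-SCALE, min(SCALE, q))

def pack_sin_cos(angle):
    """Quantize sin and cos of one angle and pack them into one 16-bit word."""
    s = to_int8(math.sin(angle)) & 0xFF   # assumed: sine in the upper byte
    c = to_int8(math.cos(angle)) & 0xFF   # assumed: cosine in the lower byte
    return (s << 8) | c

def sign_extend(byte):
    """Interpret an unsigned byte as a signed 8-bit value."""
    return byte - 256 if byte >= 128 else byte

def unpack_sin_cos(word):
    """Recover the approximate (sin, cos) pair from a packed 16-bit word."""
    s = sign_extend((word >> 8) & 0xFF) / SCALE
    c = sign_extend(word & 0xFF) / SCALE
    return s, c

# Measure the worst-case reconstruction error over a range of test angles.
angles = [0.1 * k for k in range(-40, 41)]
max_error = 0.0
for a in angles:
    s, c = unpack_sin_cos(pack_sin_cos(a))
    max_error = max(max_error, abs(s - math.sin(a)), abs(c - math.cos(a)))
```

With this assumed scale, the reconstruction error of each value is bounded by half a quantization step, 0.5 / 127, and each sine and cosine pair occupies 16 bits instead of the 32 bits needed for two float16 values.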
Claims
[Claim 1] A weight memory mapping system comprising: a weight memory that stores a weight matrix for a pre-trained artificial intelligence model; an input register that stores a plurality of input data; a first hardware arithmetic unit that processes a matrix multiplication between the plurality of input data and the weight matrix, and reuses partial sums of the matrix multiplication to calculate a final sum in units of lanes while the matrix multiplication is in progress; and a second hardware arithmetic unit that uses the final sum to pre-process a next matrix multiplication while the matrix multiplication is in progress. [Claim 2] The weight memory mapping system according to claim 1, characterized in that the partial sum includes a result of matrix multiplication between one column of the weight matrix and the plurality of input data, and the final sum for each lane includes a value obtained by accumulating the partial sums, which are results of matrix multiplication between each lane-by-lane column of the weight matrix and the plurality of input data. [Claim 3] The weight memory mapping system according to claim 1, characterized in that the first hardware arithmetic unit includes: a plurality of MAC (multiply-and-accumulation) trees that process the matrix multiplication on a lane-by-lane basis; a partial sum register that stores the partial sums of the matrix multiplication; and a plurality of partial sum accumulators that accumulate the partial sums in the partial sum register and calculate the final sum for each lane. [Claim 4] The weight memory mapping system according to claim 3, characterized in that the number of the plurality of MAC trees and the number of the plurality of partial sum accumulators correspond to the number of lanes, respectively.
[Claim 5] The weight memory mapping system according to claim 3, characterized in that the second hardware arithmetic unit pre-processes the next matrix multiplication using the lane-by-lane final sum calculated by at least one of the plurality of partial sum accumulators. [Claim 6] The weight memory mapping system according to claim 1, characterized in that the artificial intelligence model includes a transformer model, the weight memory stores, as the weight matrix, weight data shared by a token embedding operation and an LM (Language Modeling) head operation for the transformer model, and the first hardware arithmetic unit, during the token embedding operation, reads the weights of a specific column in the weight memory and processes the token embedding operation. [Claim 7] The weight memory mapping system according to claim 1, further comprising a preprocessing routing unit that adjusts, before the matrix multiplication, the positions of the values of the weight matrix stored in the weight memory so that the values required for the operation following the matrix multiplication are adjacent to each other. [Claim 8] The weight memory mapping system according to claim 7, characterized in that the artificial intelligence model includes a transformer model, and the operation following the matrix multiplication includes a rotary embedding operation for the transformer model. [Claim 9] The weight memory mapping system according to claim 1, characterized in that the artificial intelligence model includes a transformer model, the system further comprising a rotary embedding parameter processing unit that quantizes the sine and cosine values of one position of the rotary embedding parameter for the rotary embedding calculation of the transformer model to 8-bit fixed-point values, and then stores a set of the quantized sine and cosine values packed into 16 bits in the weight memory.
[Claim 10] The weight memory mapping system according to claim 9, characterized in that the rotary embedding parameter processing unit stores the set in, or reads the set from, the weight memory according to the value obtained by dividing the position of the set by the number of channels, the channel determined by the head number, and the address determined by the remainder obtained by dividing the position of the set by the number of channels. [Claim 11] A weight memory mapping method comprising the steps of: storing a weight matrix for a pre-trained artificial intelligence model in a weight memory; storing a plurality of input data in an input register; processing, through a first hardware arithmetic unit, a matrix multiplication between the plurality of input data and the weight matrix, and reusing partial sums of the matrix multiplication to calculate a final sum in units of lanes while the matrix multiplication is in progress; and pre-processing, through a second hardware arithmetic unit, a next matrix multiplication using the final sum while the matrix multiplication is in progress. [Claim 12] The weight memory mapping method according to claim 11, characterized in that the partial sum includes a result of matrix multiplication between one column of the weight matrix and the plurality of input data, and the final sum for each lane includes a value obtained by accumulating the partial sums, which are results of matrix multiplication between each lane-by-lane column of the weight matrix and the plurality of input data.
[Claim 13] The weight memory mapping method according to claim 11, characterized in that the first hardware arithmetic unit includes: a plurality of MAC (multiply-and-accumulation) trees that process the matrix multiplication on a lane-by-lane basis; a partial sum register that stores the partial sums of the matrix multiplication; and a plurality of partial sum accumulators that accumulate the partial sums in the partial sum register and calculate the final sum for each lane. [Claim 14] The weight memory mapping method according to claim 11, characterized in that the artificial intelligence model includes a transformer model, the step of storing the weight matrix stores, in the weight memory as the weight matrix, weight data shared by a token embedding operation and an LM head operation for the transformer model, and the step of calculating the final sum reads, during the token embedding operation, the weights of a specific column in the weight memory through the first hardware arithmetic unit and processes the token embedding operation. [Claim 15] The weight memory mapping method according to claim 11, further comprising the step of adjusting, before the matrix multiplication, the positions of the values of the weight matrix stored in the weight memory so that the values required for the operation following the matrix multiplication are adjacent to each other. [Claim 16] The weight memory mapping method according to claim 11, characterized in that the artificial intelligence model includes a transformer model, the method further comprising the step of quantizing the sine and cosine values of one position of the rotary embedding parameter for the rotary embedding calculation of the transformer model to 8-bit fixed-point values, and then storing a set of the quantized sine and cosine values packed into 16 bits in the weight memory.