Computing device with non-volatile weight storage
By employing a hierarchical memory architecture in the computing device and utilizing independent direct channels to directly transfer weight parameters to SRAM, the data bus bottleneck caused by the slow read speed of NVM is resolved, thereby improving the efficiency and performance of neural network computing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANAFLASH INC
- Filing Date
- 2024-11-28
- Publication Date
- 2026-06-19
Figure CN122249803A_ABST
Abstract
Description
[0001] Related applications
[0002] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/603,122, filed on November 28, 2023.
Technical Field
[0003] This invention belongs to the field of computing systems and, more specifically, relates to processors for neural network computing with non-volatile memory (NVM).
Background Technology
[0004] Artificial neural networks are increasingly used in artificial intelligence and machine learning applications, with large language models (LLMs) such as ChatGPT being among the most powerful and widely used tools in this field. Transformer-based LLMs require large amounts of memory to store pre-trained weight parameters and must perform numerous matrix multiplication operations. Typically, non-volatile memory (NVM) is used to store large amounts of data, while dynamic data is stored in dynamic random access memory (DRAM) during processing or computation. In current systems, a small portion of the data required for computation is transferred from DRAM to the processor's internal static random access memory (SRAM). The processor's arithmetic logic unit (ALU) performs computations on this data in SRAM and then stores the results back in DRAM.
[0005] Each type of memory (SRAM, DRAM, and NVM) has a different typical capacity and read speed. SRAM capacity is typically several megabytes (MB), and SRAM can be read within a processor clock cycle, usually in less than 1 nanosecond. DRAM capacity is typically several gigabytes (GB), with read times of tens of nanoseconds.
[0006] NVM, on the other hand, can store several terabytes (TB), but its read time is much longer, approximately tens of microseconds. In traditional computing devices, the processor, DRAM, and NVM share a single data bus. Therefore, when large amounts of pre-trained weight parameters stored in the NVM are transferred for LLM calculations, the slow read speed of the NVM can create a bottleneck on the data bus. This can significantly degrade the performance of the computing device. This invention describes a computing device with a hierarchical memory architecture, comprising DRAM for dynamic data and NVM for static data, such as pre-trained weight parameters. In this architecture, the processor and each memory exchange data in a novel way.
Summary of the Invention
[0007] In one embodiment of the present invention, a computing device for facilitating neural network operation by transforming input data through a series of layers includes: a dynamic random access memory (DRAM) storing one or more input matrices, each input matrix containing digital inputs; a non-volatile memory device storing one or more weight matrices, each weight matrix containing weight parameters; and a processor including a pair of static random access memories (SRAMs), the processor being adapted to: load the input matrices from the DRAM into a first SRAM of the pair, and load the weight matrices from the non-volatile memory device into a second SRAM of the pair, and perform matrix operations on the loaded input matrices and loaded weight matrices, wherein the first SRAM is connected to the DRAM via a data bus, and the second SRAM is connected to the non-volatile memory device via one or more direct channels independent of the data bus, thereby allowing the weight parameters to be directly transferred from the non-volatile memory to the second SRAM.
[0008] In another embodiment, the processor is further configured to transmit and load the corresponding output matrix generated by the matrix operation into DRAM.
[0009] In another embodiment, the second SRAM has a size to store all the segmented weight matrix parameters, allowing the processor to immediately read the segmented weight parameters from the non-volatile device and transfer them to the second SRAM to complete the neural network operation for each layer.
[0010] In another implementation, the second SRAM has a specified size to store a significant portion of the weight matrix parameters, thereby reducing the number of weight parameter transfers from non-volatile memory to the second SRAM to complete the neural network operation for each layer.
[0011] In another embodiment, the processor is configured to partition at least one of the input matrix and the weight matrix into several partial matrices, each of a size less than or equal to that of the corresponding SRAM of the pair.
[0012] In another embodiment, the processor is configured to divide the weight matrix into several partial weight matrices along the row direction so as to load them into a second SRAM via a direct channel.
[0013] In another embodiment, when the input matrix from DRAM is loaded into the first SRAM via the data bus, the processor is configured to load one or more partial weight matrices into the second SRAM via a direct channel.
[0014] In another embodiment, the processor is configured to perform matrix multiplication on the loaded input matrix and the loaded segmented weight matrix, and load the corresponding output matrix into DRAM via a data bus.
[0015] In another embodiment, the non-volatile storage device includes a plurality of non-volatile memory chips, each storing multiple rows of weight matrices.
[0016] In another embodiment, each non-volatile memory chip is connected to a second SRAM via one or more parallel direct channels.
[0017] In another embodiment, the processor is configured to segment multiple rows of a weight matrix stored in a non-volatile memory chip along the column direction.
[0018] In another embodiment, the processor is configured to load columns of the segmented weight matrix via one or more parallel direct channels and merge them into a second SRAM.
[0019] In another embodiment, the processor is configured to simultaneously load multiple rows of a specified column of the weight matrix into a second SRAM via multiple direct channels when the corresponding input matrix is loaded from DRAM into a first SRAM via a data bus.
[0020] In another embodiment, the processor is configured to transmit the corresponding output generated by the matrix operation via a data bus and load it into DRAM.
[0021] In another embodiment, the processor is configured to: (a) divide the input matrix into row groups, each group having one or more rows, and adapt them to a first SRAM; (b) divide the weight matrix into one or more columns adapted to a second SRAM; (c) load one or more columns of the weight matrix into the second SRAM via a direct channel; (d) load a row group of the input matrix into the first SRAM via a data bus; (e) perform matrix multiplication on the row group of the input matrix and the one or more columns of the weight matrix; (f) transmit and load the corresponding output generated by the matrix multiplication into DRAM via a data bus; (g) repeat steps (d) to (f) from the first row group of the input matrix to the last row group of the input matrix; and (h) repeat steps (c) to (f) from the first column group to the last column group of the weight matrix.
[0022] In another embodiment, the processor is configured to: (a) divide the input matrix into multiple column groups, each group having one or more columns, and adapt them to a first SRAM; (b) divide the weight matrix into multiple row groups, each group having one or more rows, and adapt them to a second SRAM; (c) load the entire column of the divided input matrix into the first SRAM via a data bus; (d) load one or more columns of the corresponding divided weight matrix into the second SRAM via one or more direct channels; (e) take the loaded entire column of the divided input matrix and the loaded one or more columns of the divided weight matrix; (f) perform matrix multiplication on them; (g) transmit the corresponding output generated by the matrix multiplication via the data bus and load it into DRAM; (h) repeat steps (c) to (f) from the first column to the last column of the segmented weight matrix; and (i) load the output matrices stored in DRAM, which are generated by the matrix multiplication of each set of input matrices and weight matrices, and perform element-wise addition on the outputs of each segmented input matrix and the corresponding segmented weight matrix.
[0023] In another embodiment, the processor is configured to transfer the element-wise addition of the output to at least one of a second SRAM and a DRAM.
[0024] In one embodiment, a non-transitory computer-readable storage medium stores instructions thereon, wherein the instructions, when executed by a computing device, cause the computing device to: store one or more input matrices in dynamic random access memory (DRAM), each input matrix containing digital inputs; store one or more weight matrices in a non-volatile memory device, each weight matrix containing weight parameters; load the input matrices from the DRAM into a first SRAM and load the weight matrices from the non-volatile memory device into a second SRAM; and partition at least one of the input matrices and weight matrices into partial matrices of a size less than or equal to a corresponding size in the first and second SRAMs, wherein the first SRAM is connected to the DRAM via a data bus and the second SRAM is connected to the non-volatile memory device via one or more direct channels independent of the data bus, allowing the weight parameters to be directly transferred from the non-volatile memory to the second SRAM.
[0025] In another embodiment, the non-transitory computer-readable storage medium described above, wherein the processor in the computing device performs: (a) dividing the input matrix into row groups, each row group being adapted to a first SRAM; (b) dividing the weight matrix into one or more columns adapted to a second SRAM; (c) loading one or more columns of the weight matrix into the second SRAM via a direct channel; (d) loading a row group of the input matrix into the first SRAM via a data bus; (e) performing matrix multiplication on the row groups of the input matrix and one or more columns of the weight matrix; (f) transmitting and loading the corresponding output generated by the matrix multiplication into DRAM via a data bus; (g) repeating steps (d) to (f) from the first row group of the input matrix to the last row group of the input matrix; and (h) repeating steps (c) to (f) from the first column group of the weight matrix to the last column group.
[0026] In another embodiment, the processor in the computing device performs: (a) partitioning the input matrix into multiple column groups, each column group being adapted to a first SRAM; (b) partitioning the weight matrix into row groups, each row group being adapted to a second SRAM; (c) loading the entire column of the partitioned input matrix into the first SRAM via a data bus; (d) loading one or more columns of the corresponding partitioned weight matrix into the second SRAM via one or more direct channels; (e) performing matrix multiplication on the partitioned input matrix with the entire column loaded and the partitioned weight matrix with one or more columns loaded; (f) transmitting and loading the corresponding output generated by the matrix multiplication into DRAM via a data bus; (g) repeating steps (d) to (f) from the first column to the last column of the partitioned weight matrix; (h) repeating steps (c) to (f) from the first to the last partitioned input matrix; (i) loading the output matrix stored in DRAM, the matrix being generated by each group of matrix multiplications of the input matrix and the weight matrix, and performing element-wise addition on the output of each partitioned input matrix and one of the corresponding partitioned weight matrices.
[0027] In another implementation, the processor transfers the element-wise addition of the output to at least one of the second SRAM and the DRAM.
Description of the Drawings
[0028] Figure 1 is a diagram of matrix multiplication in a transformer operation.
[0029] Figure 2 illustrates the processor and various types of memory in a conventional computing device.
[0030] Figure 3 illustrates a conventional system for performing matrix multiplication in LLM computation.
[0031] Figure 4 illustrates the system proposed in this invention, which directly uses weight parameters stored in flash memory.
[0032] Figure 5 is a block diagram illustrating an example of a computing device having multiple memory chips according to one embodiment of the present invention.
[0033] Figures 6A and 6B are block diagrams illustrating examples of a computing device using two different input data partitions according to some embodiments of the present invention.
[0034] Figure 7 is a block diagram illustrating an example of a computing device using an element-wise addition method of two matrices according to an embodiment of the present invention.
[0035] Figures 8A and 8B are simplified block diagrams illustrating examples of an application system according to some embodiments of the present invention.
[0036] Figure 9 is a simplified block diagram illustrating a multiprocessor architecture based on a system proposed according to an embodiment of the present invention.
Detailed Description
[0037] In the following description, the invention will be explained in detail with reference to the accompanying drawings. The features of the invention will become readily apparent to those skilled in the art through the description of the drawings. It should be understood that the drawings depict only typical embodiments of the invention and do not limit its scope, and that the invention can be described through features and details beyond those shown in the drawings.
[0038] Terms containing ordinal numbers (such as "first", "second", etc.) can describe various elements, but these terms do not limit the elements. The terms mentioned above are only used to distinguish elements relatively.
[0039] When a component is referred to as "connected" or "accessed" to another component, it may be directly connected to or accessed by the other component, but it should be understood that other components may exist between the two. On the other hand, when a component is referred to as "directly connected" or "directly accessed" to another component, it should be understood that there are no other components between the two.
[0040] Singular expressions include plural expressions, unless the context clearly specifies otherwise.
[0041] In this document, terms such as "comprise" or "have" are intended to indicate the presence of the features, quantities, steps, operations, elements, components, or combinations thereof described in the specification; they do not preclude the presence or addition of other features, quantities, steps, operations, elements, components, or combinations thereof.
[0042] Figure 1 depicts matrix multiplication 100 in a transformer operation. In Figure 1, the OUT matrix 130 is obtained by multiplying the IN matrix 110 by the W matrix 120; this operation is referred to as a "layer" in neural network terminology.
[0043] In neural networks, "layer as a matrix output" means that the output of a layer is represented as a matrix. In the IN matrix (110), each row (A, B, C, … M) can represent a single input data point. In the W matrix (120), each column (1, 2, 3, … L) can represent a single feature (or neuron). The number of columns can be equal to the number of neurons in the layer. For example, a fully connected layer with 500 neurons will have an output matrix with 500 columns. In the OUT matrix (130), the value in a specific row and column indicates the activation level of that neuron for a specific data point.
[0044] The M×N dimension IN matrix 110 contains dynamic data computed and transmitted in real time from previous matrix multiplications. In small models (such as LLaMA[2]), the size of IN matrix 110 is typically tens of megabytes (MB), while in large models (such as GPT-3[3]), it can reach several gigabytes (GB). The N×L dimension W matrix 120 contains weight parameters that remain unchanged during inference once training is complete. The size of each W matrix 120 ranges from tens to hundreds of MB, and the total size of W matrices in large language models (LLMs) like GPT-4 can reach several terabytes (TB). The M×L dimension OUT matrix 130, derived from matrix multiplication, is then used as the IN matrix 110 for the next matrix multiplication.
[0045] Similar to standard matrix multiplication, element A1 in OUT matrix 130 is the sum of the products of each element in the first row (row A) of IN matrix 110 and the corresponding element in the first column (column 1) of W matrix 120. Similarly, element A2 in OUT matrix 130 is the sum of the products of each element in the first row (row A) of IN matrix 110 and the elements in the second column (column 2) of W matrix 120. Element AL is the sum of the products of each element in the first row (row A) of IN matrix 110 and the elements in the last column (column L) of W matrix 120. Finally, element ML in OUT matrix 130 is the sum of the products of each element in the last row (row M) of IN matrix 110 and the elements in the last column (column L) of W matrix 120. Therefore, multiplying the M×N IN matrix 110 and the N×L W matrix 120 yields the M×L OUT matrix 130. This type of matrix multiplication is widely used in transformer operations.
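As an illustration only, the sum-of-products rule described above can be sketched in plain Python. The function name `matmul` and the small example values are assumptions chosen for this sketch; they do not come from the application.

```python
def matmul(in_matrix, w_matrix):
    """Multiply an M x N matrix by an N x L matrix, giving an M x L matrix."""
    rows, inner = len(in_matrix), len(in_matrix[0])
    assert len(w_matrix) == inner, "inner dimensions must match"
    cols = len(w_matrix[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):        # one row of IN (A, B, ... M)
        for j in range(cols):    # one column of W (1, 2, ... L)
            # Element (i, j) of OUT is the sum of products of row i and column j.
            out[i][j] = sum(in_matrix[i][k] * w_matrix[k][j] for k in range(inner))
    return out


# Hypothetical example with M = 2, N = 3, L = 2.
IN = [[1, 2, 3],
      [4, 5, 6]]
W = [[1, 0],
     [0, 1],
     [2, 2]]
OUT = matmul(IN, W)
print(OUT)
```

Here the first element of `OUT` plays the role of element A1: it combines row A of the IN matrix with column 1 of the W matrix.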
[0046] Figure 2 illustrates the processor 210 and the various types of memory 220, 230 in a conventional computing device 200.
[0047] Processor 210 includes an arithmetic logic unit (ALU) 211 for computation and SRAM 212 for storing data. Data in SRAM 212 can be read within a system clock cycle (e.g., <1 ns), but the capacity of SRAM 212 is typically limited to several megabytes (MB). Therefore, processor 210 requires external memory to store and manage large amounts of data. DRAM 220 is a high-density volatile memory used to store gigabytes (GB) of data; because DRAM reads are fast (approximately tens of nanoseconds), it connects to the processor over a high-speed interface such as LPDDR5 (128 GB/s). Flash memory 230, on the other hand, is a non-volatile memory used to store large amounts of data, up to several terabytes, such as LLM weight parameters, although its reads are slow (tens of microseconds).
[0048] In the conventional computing device 200, data exchange between the processor 210 and the memories 220 and 230 occurs via the data bus 240. Transferring the required data from the large LLM weight parameters stored in the flash memory 230 to the DRAM 220 is necessary for LLM computation. However, the slow read speed of the flash memory 230 creates a bottleneck on the data bus, reducing the performance of the conventional computing device 200. Therefore, a new data transfer method is needed to efficiently handle large-scale LLM weight parameters.
[0049] Figure 3 illustrates a conventional system 300 used to perform matrix multiplication in LLM computation.
[0050] For example, a pair of SRAMs (static random access memories) can be used in processor 310 for neural network operations. One SRAM 312 can be used to store the network's weights, while the other SRAM 311 can be used to store the activation values (the neurons' outputs). This allows fast access to both during computation. The weights can be stored in a compressed format to save space, and various data layouts (e.g., row-major, column-major) can be optimized for specific network architectures and operations.
[0051] As in Figure 1, IN matrix 321 is an M×N matrix, W matrix 331 is an N×L matrix, and OUT matrix 323 is an M×L matrix. First, W matrix 331 is copied from flash memory 330 to DRAM 320 via data bus 340. Second, one or more rows 3211 of IN matrix 321 and one or more columns 3221 of the copied W matrix 322 in DRAM 320 are transferred to SRAM 311 and SRAM 312, respectively, and the resulting sums of products 3231 are stored in OUT matrix 323 in DRAM 320 via data bus 340. Note that the elements of the W matrix are duplicated in flash memory 330 (as W matrix 331) and in DRAM 320 (as W matrix 322), and that copying these elements from the slow flash memory 330 to DRAM 320 generates data flow between processor 310 and memories 320 and 330.
[0052] Although the flash memory 330 has a slow read speed, its large capacity allows it to store multiple W matrices 331 containing numerous weight parameters for LLM calculations, as shown in Figure 3. The W matrix 331 used for the current calculation is copied to DRAM 320. Since the SRAMs 311 and 312 within the processor 310 are fast but have limited capacity, single or multiple rows 3211 of the IN matrix 321 and single or multiple columns 3221 of the W matrix 322 are transferred from DRAM 320 to SRAMs 311 and 312 via data bus 340, where the processor's ALU (not shown) performs the calculations.
[0053] The sum of the products of a single row 3111 of the IN matrix and a single column 3121 of the W matrix becomes an element of the OUT matrix 323, stored in DRAM 320. When several rows 3111 of the IN matrix are transferred to SRAM 311 and several columns 3121 of the W matrix are transferred to SRAM 312, the corresponding number of sum-of-products elements is stored as the result 3231 in the OUT matrix 323 in DRAM 320. When the sums of the products of all rows of the IN matrix 321 and all columns of the W matrix 322 have been calculated and stored as elements of the OUT matrix 323 in DRAM 320, one set of calculations is complete. The resulting OUT matrix 323 then becomes the IN matrix for the next set of calculations. Furthermore, the next W matrix 331 in flash memory 330 is copied to DRAM 320 for the subsequent set of calculations.
[0054] Figure 4 illustrates a system 400 according to the present invention that transfers weight parameters directly from flash memory 430 to processor 410 for calculation.
[0055] Similar to the conventional system shown in Figure 3, the IN matrix 421 can be an M×N matrix, the W matrix 431 can be an N×L matrix, and the OUT matrix 423 can be an M×L matrix. The direct channel 450 can be used to transfer the weight matrix 431 directly from the flash memory 430 to the SRAM 412 without sending it to the DRAM 420. Considering that read operations from the flash memory 430 are significantly slower than those from the DRAM 420, system performance can be improved by reducing the frequency of weight parameter transfers from the flash memory 430.
[0056] The computation time of matrix multiplication, as well as the DRAM 420 access time for the IN 421 and OUT 423 matrices, can also be hidden within the flash memory 430 read time. Here, the weight parameter matrix 431 can be directly used for computation from the flash memory 430 without being copied to the DRAM 420. For computation, the entire IN matrix 421 is first copied to the SRAM 411 via the data bus 440 to avoid redundant movement of the weight parameter matrix 431. When the IN matrix 421 is copied to the SRAM 411, one or more columns 4311 of the W matrix 431 in the flash memory 430 are also copied to the SRAM 412 via the direct channel 450.
[0057] The ALU (not shown) in processor 410 performs matrix multiplication between the copied IN matrix 4111 and one or more columns 4121 of the W matrix, and stores the result 4231 in the OUT matrix 423 in DRAM 420 via data bus 440. As in the matrix multiplication of Figure 1, the sum of the products of a row from IN matrix 4111 and a column 4121 of the W matrix forms a single element of OUT matrix 423. For example, as shown in Figure 4, IN matrix 4111 is copied to SRAM 411, and a single column 4121 of the W matrix is copied to SRAM 412. Calculating the sums of the products between that column 4121 of the W matrix and all rows of IN matrix 4111 generates one column 4231 of OUT matrix 423, which is stored in DRAM 420. If multiple columns 4121 of the W matrix are copied to SRAM 412 and multiplied by the entire IN matrix 4111, an equal number of columns 4231 of OUT matrix 423 are generated as the result and stored in DRAM 420.
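The data flow of Figure 4 can be pictured with a minimal Python sketch. It is only a software model under stated assumptions: the memories are ordinary lists, the function name `direct_channel_matmul`, the parameter `cols_per_load`, and the read counter are hypothetical, and no timing behavior is modeled.

```python
def direct_channel_matmul(in_dram, w_flash, cols_per_load=1):
    """Model of the Figure 4 flow; returns (OUT matrix, number of flash reads)."""
    # The whole IN matrix is copied once from DRAM to the input SRAM (data bus).
    sram_in = [row[:] for row in in_dram]
    inner, total_cols = len(w_flash), len(w_flash[0])
    out_dram = [[0] * total_cols for _ in sram_in]
    flash_reads = 0
    for start in range(0, total_cols, cols_per_load):
        stop = min(start + cols_per_load, total_cols)
        # Direct channel: the selected W columns go straight from flash
        # to the weight SRAM, without a copy in DRAM.
        sram_w = [row[start:stop] for row in w_flash]
        flash_reads += 1
        for i, in_row in enumerate(sram_in):
            for j in range(stop - start):
                # Each result lands in the matching column of OUT in DRAM.
                out_dram[i][start + j] = sum(
                    in_row[k] * sram_w[k][j] for k in range(inner)
                )
    return out_dram, flash_reads


# Hypothetical 2 x 2 input and 2 x 3 weight matrix.
out, reads = direct_channel_matmul([[1, 2], [3, 4]], [[1, 2, 3], [4, 5, 6]])
print(out, reads)
```

Loading more columns per transfer (a larger `cols_per_load`) lowers the number of flash reads in this model, which corresponds to the motivation for a larger weight SRAM.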
[0058] As mentioned earlier, flash memory 430 is significantly slower than DRAM 420. The proposed system aims to reduce data transfer bottlenecks and maximize system performance by addressing the slow read operations and data transfer limitations of flash memory 430.
[0059] Specifically, for each inference operation, accessing the weight parameters from flash memory 430 will significantly slow down the entire process. Therefore, it is recommended to use a larger SRAM 412 to cache more weights, thereby minimizing the number of accesses to the slow flash memory 430. This directly impacts inference latency and throughput.
[0060] Therefore, according to one embodiment of the present invention, the second SRAM 412 may have a size that stores all the segmented weight matrix parameters, thereby allowing the processor to immediately read and transfer the segmented weight parameters from the non-volatile device into the second SRAM to complete the neural network operation for each layer.
[0061] In another embodiment, the second SRAM 412 may have a specified size to store a significant portion of the segmented weight matrix parameters 4311, thereby reducing the number of weight parameter transfers from non-volatile memory to the second SRAM to complete the neural network operation for each layer.
[0062] In another implementation, by having a sufficiently large SRAM 412, the W matrix transferred from the flash memory can be reused for the next operation, thereby reducing the total number of W matrix transfers from the flash memory. In this way, system performance can be improved by reducing the slow flash memory access frequency.
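The effect of SRAM size on flash traffic can be estimated with a rough counting model. This sketch assumes a deliberately simple policy that is not stated in the application: if the whole W matrix fits in the weight SRAM it is loaded once and reused, otherwise every use reloads all column groups. The function name `flash_transfers` and its parameters are hypothetical.

```python
import math


def flash_transfers(w_cols, sram_cols, uses):
    """Count column-group transfers from flash when one W matrix is used `uses` times.

    w_cols:    number of columns in the W matrix
    sram_cols: number of W columns the weight SRAM can hold at once
    uses:      how many operations need this W matrix
    """
    if sram_cols >= w_cols:
        # Whole matrix stays resident in SRAM and is reused.
        return 1
    # Otherwise each use streams every column group from flash again.
    return uses * math.ceil(w_cols / sram_cols)


print(flash_transfers(w_cols=8, sram_cols=2, uses=3))  # small weight SRAM
print(flash_transfers(w_cols=8, sram_cols=8, uses=3))  # SRAM holds the whole matrix
```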
[0063] Figure 5 is a block diagram illustrating an example of a computing device having multiple memory chips according to an embodiment of the present invention. Figure 5 describes a multi-chip (X) and multi-channel (Y) method of transferring data from flash memory according to an embodiment of the present invention, which is used to improve the performance of the neural network operation for each layer of the neural network.
[0064] In the configuration of Figure 5, the IN matrix 521 is copied from DRAM 520 to SRAM 511 in processor 510 via data bus 550, and the calculation results are stored in OUT matrix 523 within DRAM 520 via data bus 550. IN matrix 521 is an M×N matrix, W matrix 531 is an N×L matrix, and OUT matrix 523 is an M×L matrix. However, the method for retrieving one or more columns of the W matrix from flash memory differs from that of Figure 4.
[0065] The system proposed in this application includes two flash memory chips, X1 530 and X2 540, each containing two data transmission channels. Flash memory X1 530 has channels Y11 and Y12, while flash memory X2 540 has channels Y21 and Y22. One half of the W matrix 531, portions 531a and 531b, is stored in X1 530, and the other half, portions 531c and 531d, is stored in X2 540.
[0066] In the example shown in Figure 5, when a column of W matrix 531 is transferred to SRAM 512 and multiplied by IN matrix 5111 copied from DRAM 520 to calculate column 5231 of OUT matrix 523, half of the desired column of W matrix 531 is stored in X1 530 and the other half in X2 540. The first quarter 5311 of the column, stored in X1 530, is transferred to SRAM 512 via channel Y11, while the second quarter 5312 of the column, also in X1 530, is transferred via channel Y12. For the portion in X2 540, the third quarter 5313 of the column is transferred to SRAM 512 via Y21, and the last quarter 5314 is transferred via Y22.
[0067] Because data transmission through each channel is performed in parallel, an entire column 5121 of the W matrix can be transferred from flash memories 530 and 540 to SRAM 512 in only a quarter of the time required to transfer the column over a single channel. Therefore, in the configuration shown in Figure 5, the W matrix column 5121 is transferred from flash memories 530 and 540 to SRAM 512 four times faster than in Figure 4 (X × Y = 2 × 2 = 4).
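The X × Y speed-up can be checked with a short, idealized Python model. It assumes that every channel moves elements at the same rate and that segments are contiguous and merged in channel order (Y11, Y12, Y21, Y22); the function names and the `time_per_element` parameter are hypothetical and serve only this sketch.

```python
def split_column(column, chips, channels_per_chip):
    """Split one W column into contiguous segments, one per channel."""
    parts = chips * channels_per_chip
    size = -(-len(column) // parts)  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]


def merge_segments(segments):
    """Concatenate the segments in channel order inside the weight SRAM."""
    return [value for segment in segments for value in segment]


def transfer_time(column_len, time_per_element, chips, channels_per_chip):
    """Idealized time: channels run in parallel, so the longest segment dominates."""
    parts = chips * channels_per_chip
    longest = -(-column_len // parts)
    return longest * time_per_element


column = list(range(8))                 # hypothetical 8-element W column
segments = split_column(column, 2, 2)   # X = 2 chips, Y = 2 channels each
single = transfer_time(len(column), 1.0, 1, 1)
parallel = transfer_time(len(column), 1.0, 2, 2)
print(segments, single / parallel)
```

In this model the ratio of the two transfer times equals X × Y whenever the column length is a multiple of the number of channels.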
[0068] Figures 6A and 6B describe two different matrix segmentation methods, distinguished by the segmentation direction.
[0069] When the size of the IN matrix exceeds the capacity of the SRAM, the IN matrix must be partitioned, and the matrix multiplication must be processed over multiple cycles. In Figures 6A and 6B, as in Figures 3, 4, and 5, the IN matrix is an M×N matrix, the W matrix is an N×L matrix, and the OUT matrix is an M×L matrix.
[0070] In Figure 6A, the IN matrix is divided along the column direction into an IN(1) matrix 621 and an IN(2) matrix 622. For each of the IN(1) matrix 621 and the IN(2) matrix 622, the W matrix 631 is read and transmitted again, resulting in repeated data transmission from the flash memory 630. As in Figures 4 and 5, one or more columns 6311 of the W matrix 631 stored in flash memory 630 can be directly transferred to SRAM 612. Figure 6A describes an example in which columns 6311 of the W matrix 631 are sequentially transferred to SRAM 612 for matrix multiplication.
[0071] First, the IN(1) matrix 621, a portion of the IN matrix stored in DRAM 620, is transferred via data bus 640 to SRAM 611 within processor 610. While the IN(1) matrix 621 is copied to SRAM 611, one or more columns 6311 of the W matrix 631 stored in flash memory 630 are also directly transferred to SRAM 612, and matrix multiplication of the transferred data 6111 and 6121 is performed, storing the result 6231 in OUT matrix 623 in DRAM 620. Since the transferred IN(1) matrix 6111 is shown as half of the IN matrix in the column direction, only half of the first column of the entire OUT matrix 623 is calculated and stored. Next, the next one or more columns of the W matrix 631 are transferred from flash memory 630 to SRAM 612 to perform multiplication with the IN(1) matrix 6111, and the result is stored in the first half of the next column of OUT matrix 623 in DRAM 620. The process is repeated sequentially for each subsequent column of the W matrix 631 until the last column has been transferred to the SRAM 612, multiplied by the IN(1) matrix 6111, and the result stored in the OUT matrix 623 of the DRAM 620. This process produces partial computation results along the column direction of the entire OUT matrix 623, corresponding to splitting the IN matrix in the column direction into the IN(1) matrix 621 and the IN(2) matrix 622.
[0072] Next, the IN(2) matrix 622, the other portion of the IN matrix stored in DRAM 620, is transferred to SRAM 611. In the same manner as the multiplication performed on IN(1) matrix 621, each column of W matrix 631, from the first column to the last, is sequentially transferred to SRAM 612 to perform multiplication with the IN(2) matrix 6111, which has been copied from DRAM 620 to SRAM 611. The remaining half of OUT matrix 623 in DRAM 620 is thereby calculated and stored. In summary, in Figure 6A, the process of sequentially transferring each column of the W matrix 631 stored in flash memory 630, from the first column to the last, to SRAM 612 is performed twice: once for multiplication with IN(1) matrix 621 and again for multiplication with IN(2) matrix 622, both requiring repeated read and transfer operations. As Figure 6A shows, the need to repeatedly transfer columns 6311 of the W matrix 631 stored in flash memory 630 to SRAM 612 can be considered a drawback of dividing the IN matrix in the column direction.
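The repeated flash reads of the Figure 6A method can be made visible with a small Python sketch. One interpretation is assumed here: since each partial result fills half of every OUT column, IN(1) and IN(2) are modeled as the upper and lower groups of rows of the IN matrix. The function name `matmul_fig6a`, the parameter `rows_per_group`, and the read counter are hypothetical.

```python
def matmul_fig6a(in_dram, w_flash, rows_per_group):
    """Model of the Figure 6A flow; returns (OUT matrix, W column reads from flash)."""
    total_rows, inner, total_cols = len(in_dram), len(w_flash), len(w_flash[0])
    out_dram = [[0] * total_cols for _ in range(total_rows)]
    flash_column_reads = 0
    for r0 in range(0, total_rows, rows_per_group):
        # IN(1), then IN(2): one group of IN rows is copied to the input SRAM.
        sram_in = in_dram[r0:r0 + rows_per_group]
        for j in range(total_cols):
            # Every W column is read from flash again for each IN group.
            sram_w = [w_flash[k][j] for k in range(inner)]
            flash_column_reads += 1
            for i, in_row in enumerate(sram_in):
                out_dram[r0 + i][j] = sum(a * b for a, b in zip(in_row, sram_w))
    return out_dram, flash_column_reads


# Hypothetical 4 x 2 input split into two groups of two rows, 2 x 3 weights.
IN = [[1, 0], [0, 1], [1, 1], [2, 3]]
W = [[1, 2, 3], [4, 5, 6]]
out_split, reads_split = matmul_fig6a(IN, W, rows_per_group=2)
out_whole, reads_whole = matmul_fig6a(IN, W, rows_per_group=4)
print(reads_split, reads_whole)
```

Both calls produce the same OUT matrix, but the split version reads each W column twice, which is the drawback noted above.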
[0073] In Figure 6B, the IN matrix is divided in the row direction into IN(1) 661 and IN(2) 662. The W(1) 671 and W(2) 672 matrices are read and transferred for the IN(1) 661 and IN(2) 662 matrices, respectively, which avoids redundant data transfer from the flash memory 670, although it requires element-wise addition of the OUT(1) matrix 663 and the OUT(2) matrix 664 to obtain the final OUT matrix. First, the IN(1) matrix 661, divided from the IN matrix in the row direction, is transferred from DRAM 660 to SRAM 651 inside the processor 650 via the data bus 680. Since the product of the IN matrix and the W matrix is calculated as the sum of the element-wise products of the rows of the IN matrix and the columns of the W matrix, the W matrix is correspondingly divided in the column direction into the W(1) 671 and W(2) 672 matrices. While the IN(1) matrix 661 is copied to SRAM 651, one or more columns 6711 of the W(1) matrix 671 in flash memory 670 are directly transferred to SRAM 652. The result of multiplying the IN(1) matrix 6511 and the one or more columns 6521 of the W(1) matrix 671 is stored via data bus 680 as the corresponding number of columns 6631 of the OUT(1) matrix 663 in DRAM 660. For the next calculation, the next one or more columns of the W(1) matrix 671 are transferred to SRAM 652, and the result of multiplying them with the IN(1) matrix 6511 is stored as the next column or columns of the OUT(1) matrix 663. In turn, each column of the W(1) matrix 671 is transferred to SRAM 652 and multiplied, and the result is stored in the OUT(1) matrix 663.
[0074] Next, the IN(2) matrix 662, divided from the IN matrix in the row direction, is transferred from DRAM 660 to SRAM 651 via data bus 680. As in the calculation using the IN(1) matrix 661, each column 6721 of the W(2) matrix 672, from the first column to the last, is transferred directly from flash memory 670 to SRAM 652, where it is multiplied by the IN(2) matrix 6511. The resulting output is stored as column 6641 of the OUT(2) matrix 664 in DRAM 660. Finally, element-wise addition is performed on the OUT(1) matrix 663 and the OUT(2) matrix 664 to generate the entire OUT matrix, which is then stored in DRAM 660 as the final result. The element-wise addition of two matrices is explained in detail with reference to Figure 7.
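The row-direction partitioning of Figure 6B might be sketched as follows. The sketch reads "dividing IN in the row direction" as cutting each row of IN into segments, with W cut into matching groups of rows; this reading, the function names, and the counter `flash_elements_read` are illustrative assumptions, and plain lists stand in for the memories.

```python
def matmul(a, b):
    """Plain-Python product of a (rows x inner) and b (inner x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def multiply_row_split(in_matrix, w_matrix, parts=2):
    """Figure 6B style schedule (illustrative): IN is split along its inner
    dimension, W is split to match, and the partial OUT matrices are summed.
    Returns (OUT, partial OUT matrices, number of W elements read from flash)."""
    n = len(w_matrix)
    n_cols = len(w_matrix[0])
    step = (n + parts - 1) // parts
    partial_outs = []
    flash_elements_read = 0
    for start in range(0, n, step):
        in_part = [row[start:start + step] for row in in_matrix]  # IN(k): DRAM -> first SRAM
        w_part = w_matrix[start:start + step]                     # W(k) held in flash
        out_part = [[0] * n_cols for _ in in_matrix]              # OUT(k) in DRAM
        for j in range(n_cols):
            w_col = [[row[j]] for row in w_part]                  # flash -> second SRAM
            flash_elements_read += len(w_col)
            for i, value in enumerate(matmul(in_part, w_col)):
                out_part[i][j] = value[0]
        partial_outs.append(out_part)
    # Element-wise addition of OUT(1), OUT(2), ... gives the final OUT.
    out = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partial_outs)]
    return out, partial_outs, flash_elements_read
```

In this sketch each element of W is read from flash exactly once, at the cost of holding one partial OUT matrix per part and adding them at the end.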
[0075] In the case of Figure 6A, one advantage is that element-wise addition of the OUT matrix is not required. Its disadvantage, however, is that the W matrix is repeatedly transferred from flash memory to SRAM. Unlike Figure 6A, the example of Figure 6B does not involve repeated reads and transfers of the W matrix from flash memory 670 to SRAM 652. Because the proposed system focuses on reducing the number of read and transfer operations from flash memory to SRAM, the row-direction matrix partitioning of Figure 6B is more suitable.
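The trade-off can be made concrete with a rough count. This is a sketch under stated assumptions, not a figure from the specification: the IN matrix is M×N, the W matrix is N×L, the IN matrix is divided into P parts (P = 2 in Figures 6A and 6B), every W column is read from flash once per pass, and nothing is cached in SRAM between passes. The symbols P, R_{6A}, and R_{6B} are introduced here for illustration.

```latex
\begin{aligned}
\text{Figure 6A (IN split in the column direction):}\quad & R_{6A} = P \cdot N \cdot L \\
\text{Figure 6B (IN split in the row direction):}\quad & R_{6B} = P \cdot \tfrac{N}{P} \cdot L = N \cdot L \\
\text{Extra cost of Figure 6B:}\quad & (P-1) \cdot M \cdot L \ \text{element-wise additions}
\end{aligned}
```

Here R denotes the number of weight elements read from flash memory. For P = 2, Figure 6A reads 2NL weight elements while Figure 6B reads NL, in exchange for ML additions performed on data in DRAM and SRAM.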
[0076] Figure 7 describes an element-wise addition method for two matrices. One or more columns or portions 7211 and 7221 of the OUT(1) matrix 721 and the OUT(2) matrix 722 are copied from DRAM 720 to SRAM 711 in processor 710 via data bus 740. Element-wise addition is performed on corresponding elements 7111 and 7112 of the OUT(1) matrix 721 and the OUT(2) matrix 722. For example, element (1, 1) of the OUT(1) matrix 721 is added to element (1, 1) of the OUT(2) matrix 722 to obtain element (1, 1) of the final OUT matrix. Similarly, element (2, 4) of the OUT(1) matrix 721 is added to element (2, 4) of the OUT(2) matrix 722 to obtain element (2, 4) of the final OUT matrix. In general, element (m, n) of the OUT(1) matrix 721 is added to element (m, n) of the OUT(2) matrix 722 to form element (m, n) of the final OUT matrix. Once the element-wise addition of the one or more columns or portions of the OUT(1) matrix 721 and the OUT(2) matrix 722 copied to SRAM is completed, the remaining columns or portions of the OUT(1) matrix 721 and the OUT(2) matrix 722 stored in DRAM 720 are sequentially copied to SRAM 711 via data bus 740 for element-wise addition. After all element-wise additions are completed, the final OUT matrix is complete. After the element-wise addition is performed in processor 710, the final OUT matrix 723 is stored in DRAM 720 via data bus 740; alternatively, the final OUT matrix 7121 can be stored in another SRAM 712 for the matrix multiplication of the next layer without being transferred to DRAM 720.
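The chunked addition of Figure 7 could look like the following minimal sketch. The function name `add_in_column_chunks` and the parameter `chunk_cols` are hypothetical; slicing a few columns at a time stands in for copying part of each partial result from DRAM to SRAM.

```python
def add_in_column_chunks(out1, out2, chunk_cols=1):
    """Element-wise addition of two equally sized matrices, processed a few
    columns at a time (illustrative model of the Figure 7 data flow)."""
    rows, cols = len(out1), len(out1[0])
    final = [[0] * cols for _ in range(rows)]
    for start in range(0, cols, chunk_cols):
        stop = min(start + chunk_cols, cols)
        # Copy the current columns of OUT(1) and OUT(2) (DRAM -> SRAM in the text).
        a = [row[start:stop] for row in out1]
        b = [row[start:stop] for row in out2]
        # Element (m, n) of the result is the sum of elements (m, n) of the inputs.
        for m in range(rows):
            for n in range(stop - start):
                final[m][start + n] = a[m][n] + b[m][n]
    return final
```

The result has the same dimensions as each input, and the chunk size only changes how much data is resident at once, not the values produced.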
[0077] A supplementary explanation of the matrix partitioning method of Figure 6B and the element-wise matrix addition of Figure 7 follows. As shown in Figure 1, the IN matrix is an M×N matrix, the W matrix is an N×L matrix, and the OUT matrix is an M×L matrix.
[0078] When the IN matrix is divided into two matrices along the row direction, the IN(1) matrix 661 and the IN(2) matrix 662 are each M×(N / 2) matrices. Accordingly, the W matrix is divided into two matrices along the column direction, resulting in the W(1) matrix 671 and the W(2) matrix 672, which are each (N / 2)×L matrices.
[0079] The matrix multiplication of the IN(1) matrix 661 with the W(1) matrix 671, and of the IN(2) matrix 662 with the W(2) matrix 672, each multiplies an M×(N / 2) matrix by an (N / 2)×L matrix and yields an M×L matrix. Therefore, the OUT(1) matrix 663 and the OUT(2) matrix 664 are both M×L matrices.
[0080] The element-wise matrix addition of Figure 7 adds corresponding elements of two matrices, 721 and 722, which does not change the dimensions of the matrices. Therefore, the final OUT matrix 723 also remains an M×L matrix.
[0081] In summary, the dimension (M×L) of the OUT matrix 130 obtained by multiplying the IN matrix 110 (M×N) and the W matrix 120 (N×L) as described in Figure 1 is the same as the dimension (M×L) of the OUT matrix 723 obtained by the matrix partitioning method of Figure 6B followed by the element-wise matrix addition of Figure 7.
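The same conclusion can be written as a block-matrix identity, which also shows that the partitioned result matches the unpartitioned product in value as well as in dimension. The superscript notation below is introduced here for illustration, and N is assumed to be even so that the two halves are equal in size.

```latex
\mathrm{IN} = \begin{bmatrix} \mathrm{IN}^{(1)} & \mathrm{IN}^{(2)} \end{bmatrix},
\qquad
\mathrm{W} = \begin{bmatrix} \mathrm{W}^{(1)} \\ \mathrm{W}^{(2)} \end{bmatrix},
\qquad
\mathrm{OUT} = \mathrm{IN}\,\mathrm{W}
  = \mathrm{IN}^{(1)}\mathrm{W}^{(1)} + \mathrm{IN}^{(2)}\mathrm{W}^{(2)}
  = \mathrm{OUT}^{(1)} + \mathrm{OUT}^{(2)}

\mathrm{OUT}_{m,l}
  = \sum_{n=1}^{N} \mathrm{IN}_{m,n}\,\mathrm{W}_{n,l}
  = \underbrace{\sum_{n=1}^{N/2} \mathrm{IN}_{m,n}\,\mathrm{W}_{n,l}}_{\mathrm{OUT}^{(1)}_{m,l}}
  + \underbrace{\sum_{n=N/2+1}^{N} \mathrm{IN}_{m,n}\,\mathrm{W}_{n,l}}_{\mathrm{OUT}^{(2)}_{m,l}},
\qquad 1 \le m \le M,\; 1 \le l \le L
```

Here IN^{(1)} and IN^{(2)} are M×(N/2), W^{(1)} and W^{(2)} are (N/2)×L, and each term on the right is M×L.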
[0082] Figures 8A and 8B describe example applications according to embodiments of the present invention. As shown in Figure 8A, the processor 810 may be an application processor (AP), a microcontroller unit (MCU), a CPU, a GPU, or the like, integrated with the NVM 830 and DRAM 820, with a data flow optimized for the NVM 830. The integrated system 800 of Figure 8A illustrates the data flow of the static weight parameters 831, as shown in Figures 4 to 6A and Figure 6B. The static weight parameters 831, which are large-capacity static pre-trained data, are stored in the non-volatile memory 830. Dynamic data 821, which changes during LLM calculation, is stored in DRAM 820 and corresponds to the IN and OUT matrices in Figures 4 to 6A and Figure 6B. For each calculation, a portion of the weight parameter 831 data is transferred directly from the NVM 830 to the SRAM 812 within the processor 810. The dynamic data 821, such as the IN matrix stored in DRAM 820, is transferred through the data bus 840 to another SRAM 811 in the processor 810, where the matrix calculation is performed by the ALU (not shown) of the processor, and the result is stored as the OUT matrix in DRAM 820 through the data bus 840.
[0083] Figure 8B describes another example, a system 850 in which a stand-alone artificial intelligence (AI) accelerator 870 interfaces with an application system 860. The application system 860 can take various forms, such as a mobile device, a personal computer, or an automobile. Thus, the processor 861 of the application system 860 can be a general-purpose CPU, AP, MCU, GPU, etc. Like other systems, the application system 860 also includes a DRAM 862 and an NVM 863 to execute the necessary application programs. The artificial intelligence accelerator 870, similar to the system in Figure 8A, includes SRAMs 8711 and 8712 in its processor 871, a DRAM 872, and a non-volatile memory 873. As in Figure 8A, the dynamic data 8721 in the DRAM 872 is transferred to the SRAM 8711 through the data bus 874, and the static weight data 8731 in the non-volatile memory 873 is transferred directly to the SRAM 8712. The result calculated in the processor 871 is stored in the DRAM 872 through the data bus 874. In the case of Figure 8B, however, the artificial intelligence accelerator 870 is mainly responsible for performing the LLM calculations, thereby reducing the load on the application system 860, and delivers only the calculation results through the interface 880. This design improves the efficiency of the entire system. In this case, the interface 880 between the application system 860 and the artificial intelligence accelerator 870 can be USB, Bluetooth, Wi-Fi, etc.
[0084] Figure 9 shows a multi-processor architecture 900 for large-scale acceleration that exploits the parallelism of multiple computing devices. Taking the proposed system shown in Figure 4 as a single computing device, the entire architecture 900 includes a host 910 and N computing devices 920. Each of the N processors, from processor 1 to processor N, controls its corresponding computing device 920 and executes the LLM calculations assigned to it. Each computing device 920 includes a DRAM 922, a non-volatile memory 923, and a data bus 924, and, as shown in Figure 4, the static data 9231, namely the weight parameters stored in each non-volatile memory 923, is transferred directly to the SRAM 9212 within the processor 921 of the corresponding computing device. Dynamic data 9221, such as the IN and OUT matrices, is stored in the DRAM 922 of each computing device. The IN matrix is transferred via data bus 924 to another SRAM 9211 within each processor 921, where matrix multiplication with the weight parameters is performed by the processor's ALU (not shown). The calculation result is stored as the OUT matrix in DRAM 922 via data bus 924 and serves as the IN matrix for subsequent calculations. The host 910 distributes the overall LLM calculation among the processors 921 in the N computing devices 920 so as to maximize the efficiency of the parallel system. The pre-trained static weight parameter data 9231 is allocated among the N non-volatile memories 923 to match the LLM calculations assigned to each computing device, thereby minimizing data movement between computing devices 920.
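The text does not specify how the host 910 distributes the calculation. As one possible illustration only, the sketch below assumes that the columns of a layer's weight matrix are sharded across the devices, so each device multiplies the full IN matrix by its own locally stored slice and the host concatenates the resulting OUT columns. The functions `shard_columns` and `host_multiply` are hypothetical and are not taken from the specification.

```python
def matmul(a, b):
    """Plain-Python product of a (rows x inner) and b (inner x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shard_columns(w_matrix, num_devices):
    """Assumed placement: split W into column slices, one per device NVM."""
    n_cols = len(w_matrix[0])
    step = (n_cols + num_devices - 1) // num_devices
    return [[row[s:s + step] for row in w_matrix] for s in range(0, n_cols, step)]


def host_multiply(in_matrix, shards):
    """Each device computes its OUT column slice from its own shard;
    the host then concatenates the slices row by row."""
    slices = [matmul(in_matrix, shard) for shard in shards]  # independent per device
    return [sum((s[i] for s in slices), []) for i in range(len(in_matrix))]
```

Under this assumed placement no weight data moves between devices; only the IN matrix is broadcast and the OUT slices are gathered.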
Claims
1. A computing device that transforms input data through a series of layers to facilitate neural network operation, the computing device comprising: a dynamic random access memory (DRAM) that stores one or more input matrices, each of which contains digital inputs; a non-volatile storage device that stores one or more weight matrices, each weight matrix containing weight parameters; and a processor including a pair of static random access memories (SRAMs), the processor being adapted to: load the input matrix from the DRAM into the first SRAM of the pair, and load the weight matrix from the non-volatile storage device into the second SRAM of the pair; and perform matrix operations on the loaded input matrix and the loaded weight matrix, wherein the first SRAM is connected to the DRAM via a data bus, and the second SRAM is connected to the non-volatile storage device via one or more direct channels independent of the data bus, thereby allowing the weight parameters to be transferred directly from the non-volatile storage device to the second SRAM.
2. The computing device according to claim 1, wherein the processor is further configured to transmit and load the corresponding output matrix generated by the matrix operation into the DRAM.
3. The computing device according to claim 1, wherein the second SRAM has a size for storing all the segmented weight matrix parameters, allowing the processor to immediately read the segmented weight parameters from the non-volatile storage device and transfer them to the second SRAM to complete the neural network operation for each layer.
4. The computing device according to claim 1, wherein the second SRAM has a specified size to store most of the segmented weight matrix parameters, thereby reducing the number of weight parameters transferred from the non-volatile storage device to the second SRAM to complete the neural network operation for each layer.
5. The computing device according to claim 1, wherein the processor is configured to divide at least one of the input matrix and the weight matrix into several partial matrices, each of which is smaller than or equal to the size of the corresponding SRAM in the pair of SRAMs.
6. The computing device according to claim 5, wherein the processor is configured to divide the weight matrix into several partial weight matrices along the row direction so that they can be loaded into the second SRAM through the direct channel.
7. The computing device according to claim 6, wherein the processor is configured to load one or more of the partial weight matrices into the second SRAM when the input matrix from the DRAM is loaded into the first SRAM via the data bus.
8. The computing device according to claim 7, wherein the processor is configured to perform matrix multiplication on the loaded input matrix and the loaded segmented weight matrix, and to load the corresponding output matrix into the DRAM via the data bus.
9. The computing device according to claim 5, wherein the non-volatile storage device includes a plurality of non-volatile memory chips, each of which stores multiple rows of the weight matrix.
10. The computing device according to claim 9, wherein each of the non-volatile memory chips is connected to the second SRAM via one or more parallel direct channels.
11. The computing device according to claim 10, wherein the processor is configured to divide the multiple rows of the weight matrix stored in the non-volatile memory chips along the column direction.
12. The computing device according to claim 11, wherein the processor is configured to load and merge the columns of the weight matrix into the second SRAM via the one or more parallel direct channels.
13. The computing device according to claim 12, wherein the processor is configured to simultaneously load several rows of a specified column of the weight matrix into the second SRAM via multiple direct channels when the corresponding input matrix of the DRAM is loaded into the first SRAM via the data bus.
14. The computing device according to claim 13, wherein the processor is configured to transmit and load the corresponding output generated by the matrix operation into the DRAM via the data bus.
15. The computing device according to claim 5, wherein the processor is configured to: (a) divide the input matrix into multiple row groups, each row group having one or more rows and being adapted to the first SRAM; (b) divide the weight matrix into one or more columns adapted to the second SRAM; (c) load one or more columns of the weight matrix into the second SRAM through the direct channel; (d) load one of the plurality of row groups of the input matrix into the first SRAM via the data bus; (e) perform matrix multiplication on the loaded row group of the input matrix and the loaded one or more columns of the weight matrix; (f) transmit and load the corresponding output generated by the matrix multiplication into the DRAM via the data bus; (g) repeat steps (d) to (f) for the first group to the last group of the plurality of row groups of the input matrix; and (h) repeat steps (c) to (f) for the first group to the last group of the multiple column groups of the weight matrix.
16. The computing device according to claim 5, wherein the processor is configured to: (a) divide the input matrix into multiple column groups, each column group having one or more columns and being adapted to the first SRAM; (b) divide the weight matrix into multiple row groups, each row group having one or more rows and being adapted to the second SRAM; (c) load the entire segmented input matrix into the first SRAM via the data bus; (d) load one or more columns of the corresponding segmented weight matrix into the second SRAM through one or more of the direct channels; (e) perform matrix multiplication on the loaded entire segmented input matrix and the loaded one or more columns of the segmented weight matrix; (f) transmit and load the corresponding output generated by the matrix multiplication into the DRAM via the data bus; (g) repeat steps (d) to (f) for the first column to the last column of the segmented weight matrix; (h) repeat steps (c) to (f) for the first column group to the last column group of the segmented input matrix; and (i) load the output matrices stored in the DRAM, each obtained by matrix multiplication of one segmented input matrix and its corresponding segmented weight matrix, and perform element-wise addition on those output matrices.
17. The computing device according to claim 16, wherein the processor is configured to transfer the result of the element-wise addition of the outputs to at least one of the second SRAM and the DRAM.
18. A non-transitory computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a computing device, cause the computing device to: store one or more input matrices in a dynamic random access memory (DRAM), each input matrix containing digital inputs; store one or more weight matrices, each weight matrix containing weight parameters; load the input matrix from the DRAM into a first SRAM, and load the weight matrix from a non-volatile memory device into a second SRAM; and divide at least one of the input matrix and the weight matrix into several partial matrices, wherein each partial matrix is smaller than or equal to the size of the corresponding one of the first SRAM and the second SRAM, and wherein the first SRAM is connected to the DRAM via a data bus, and the second SRAM is connected to the non-volatile memory device via one or more direct channels independent of the data bus, thereby allowing the weight parameters to be transferred directly from the non-volatile memory device to the second SRAM.
19. The non-transitory computer-readable storage medium according to claim 18, wherein a processor in the computing device executes the following: (a) dividing the input matrix into multiple row groups, each row group being adapted to the first SRAM; (b) dividing the weight matrix into one or more columns adapted to the second SRAM; (c) loading one or more columns of the weight matrix into the second SRAM through the direct channel; (d) loading one of the plurality of row groups of the input matrix into the first SRAM via the data bus; (e) performing matrix multiplication on the loaded row group of the input matrix and the loaded one or more columns of the weight matrix; (f) transmitting and loading the corresponding output generated by the matrix multiplication into the DRAM via the data bus; (g) repeating steps (d) to (f) for the first group to the last group of the plurality of row groups of the input matrix; and (h) repeating steps (c) to (f) for the first group to the last group of the multiple column groups of the weight matrix.
20. The non-transitory computer-readable storage medium according to claim 18, wherein a processor in the computing device executes the following: (a) dividing the input matrix into multiple column groups, each column group being adapted to the first SRAM; (b) dividing the weight matrix into multiple row groups, each row group being adapted to the second SRAM; (c) loading the entire segmented input matrix into the first SRAM via the data bus; (d) loading one or more columns of the corresponding segmented weight matrix into the second SRAM through one or more of the direct channels; (e) performing matrix multiplication on the loaded entire segmented input matrix and the loaded one or more columns of the segmented weight matrix; (f) transmitting and loading the corresponding output generated by the matrix multiplication into the DRAM via the data bus; (g) repeating steps (d) to (f) for the first column to the last column of the segmented weight matrix; (h) repeating steps (c) to (f) for the first column group (661) to the last column group of the segmented input matrix; and (i) loading the output matrices stored in the DRAM, each obtained by matrix multiplication of one segmented input matrix and its corresponding segmented weight matrix, and performing element-wise addition on those output matrices.
21. The non-transitory computer-readable storage medium according to claim 20, wherein the processor transfers the result of the element-wise addition of the outputs to at least one of the second SRAM and the DRAM.