Near-memory computing cascading method combining non-volatile storage and dynamic memory
By combining non-volatile storage and dynamic memory in a near-memory computing cascade, the method addresses the high cost and slow fault recovery of large model training and inference, achieving fast fault recovery and efficient computation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING YUXIN MICRO INFORMATION TECH CO LTD
- Filing Date
- 2026-01-14
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, the large number of parameters during large model training and inference processes leads to high costs for dedicated high-bandwidth volatile memory chips, and loading parameters from the outside during fault recovery takes too long, affecting the system recovery speed.
A near-memory computation cascade method combining non-volatile memory and dynamic memory is adopted. By storing active parameters in local dynamic memory and inactive parameters in non-volatile memory, the access operations of inactive parameters are reduced by taking advantage of the active characteristics of model parameters, and the model parameters are quickly loaded from non-volatile memory to dynamic memory in the event of system failure.
It significantly reduces the need for expensive dynamic memory chips, shortens fault recovery time, improves system availability and reliability, and optimizes the balance between cost and efficiency.
Figure CN122195342A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a near-memory computing cascade method that combines non-volatile memory and dynamic memory.
Background Technology
[0002] The increasing number of parameters in deep learning and other models leads to massive input and output volumes during training and inference. To address this, the industry has designed expensive, large-area dedicated chips such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and Neural Processing Units (NPUs), while also utilizing high-bandwidth memory (HBM) and graphics double-data-rate memory (GDDR) for volatile storage. While volatile storage chips offer high read / write bandwidth, they are expensive, and some foreign chips and technologies are subject to various restrictions, making them unable to meet the demands of high-capacity, high-concurrency model inference. Furthermore, in the event of system failures such as power outages, the parameters stored in volatile storage chips are lost. When a large model inference system resumes operation, all model parameters need to be reloaded into the volatile storage chips, making the recovery process extremely cumbersome, slow, and time-consuming.
[0003] Therefore, overcoming the shortcomings of the existing technology is an urgent problem to be solved in this technical field.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to provide a near-memory computing cascade method that combines non-volatile storage and dynamic memory. The purpose is to solve two problems of existing technologies in model training and inference: first, because of the huge number of parameters and the high cost of dedicated high-bandwidth volatile (dynamic) memory chips, model inference requirements cannot be met; second, during fault recovery, reading a large number of model parameters level by level from external sources takes a long time, resulting in slow recovery of large model inference systems.
[0005] The present invention adopts the following technical solution: In a first aspect, the present invention provides a near-memory computing cascade method combining non-volatile storage and volatile (dynamic) memory, comprising: connecting each cascaded near-memory computing chip to both local dynamic memory and non-volatile memory; storing the model parameters and cached data required for computation in different ways depending on the data, such that during training and / or inference computation, active parameters are stored in local dynamic memory and inactive parameters are stored in non-volatile memory; and using the active and inactive parameters to complete the training and / or inference computation through the cooperation of multiple cascaded near-memory computing chips.
[0006] Furthermore, the method also includes: The complete model parameters are stored hierarchically in the non-volatile memory connected to the near-memory computing chip; When the system meets the preset startup conditions, the local model parameters are read from the non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.
[0007] Furthermore, the preset startup conditions include: a system failure occurs, or a rapid model cold start is required; The system failure scenarios include power failure of the near-memory computing chip and / or power failure of the dynamic memory.
[0008] Furthermore, the method also includes: By using multiple cascaded near-memory computing chips and connected local memory, a distributed approach is adopted to store model parameters and cached data during inference computation.
[0009] Furthermore, the method also includes: Each near-memory computing chip obtains the corresponding level's model parameters and cached data from its connected local memory. Using the model parameters, the cached data, and the input data generated by the previous level, it completes the hierarchical calculation of the corresponding model decoding stage to obtain updated cached data and the next level's input. If the output of the final-level near-memory computation does not generate a termination token, it returns to the first-level near-memory computation chip to continue inference computation, so that the distributed and cascaded inference computation of the model is completed in the cooperation of multiple cascaded near-memory computation chips.
[0010] Furthermore, the local memory includes dynamic memory and non-volatile memory; The method includes: Based on the activity level of the model parameters during the calculation process, the model parameters are divided into active parameters and inactive parameters; The active parameters are stored in dynamic memory; The inactive parameters are stored in non-volatile memory.
[0011] Furthermore, the method includes: The near-memory computing chip integrates a dynamic memory interface and a non-volatile memory interface, so that when the system meets the preset startup operation conditions, inactive parameters can be read from the non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.
[0012] In a second aspect, the present invention also provides a near-memory computing cascade device combining non-volatile memory and dynamic memory, for implementing the near-memory computing cascade method combining non-volatile memory and dynamic memory described in the first aspect, the device comprising: At least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the processor for performing the near-memory computing cascade method combining non-volatile memory and dynamic memory as described in the first aspect.
[0013] In a third aspect, the present invention also provides a non-volatile computer storage medium storing computer-executable instructions, which are executed by one or more processors to perform the near-memory computing cascade method combining non-volatile storage and dynamic memory described in the first aspect.
[0014] During model training or inference computation, not all model parameters are activated and participate in the calculation each time; rather, most model parameters remain in an inactive state. This invention connects each cascaded near-memory computing chip to both local dynamic memory and non-volatile memory, employing different methods to store the required model parameters for different active states. During model training and inference computation, by storing active parameters in local dynamic memory and inactive parameters in non-volatile memory, combined with the active characteristics of model parameters, multiple cascaded near-memory computing chips can significantly reduce access operations to most inactive parameters in each iteration, significantly reducing the need for expensive dynamic memory chips. Furthermore, active parameters can be accessed at high speed through dynamic memory; this reduces the number of parameters read and the number of calculations during model training and inference computation without affecting model inference performance; and it solves the problem of high cost of dedicated high-bandwidth dynamic memory chips, achieving an optimized balance between cost and efficiency.
Attached Figure Description
[0015] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments of the present invention will be briefly described below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.
[0016] Figure 1 is a schematic diagram of a computing architecture based on cascaded near-memory computing and parameter broadcasting provided by an embodiment of the present invention;
Figure 2 is a flowchart illustrating a near-memory computing cascade method combining non-volatile memory and dynamic memory provided in an embodiment of the present invention;
Figure 3 is a schematic diagram illustrating a specific example of the structure of a computing unit provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of a specific example of a rack-level large model computing server provided in an embodiment of the present invention;
Figure 5 is a schematic diagram illustrating a specific example of how computing servers are interconnected at the rack level to form different topologies, as provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of a specific example of a 2D Mesh network structure formed by computing units according to an embodiment of the present invention;
Figure 7 is a schematic diagram of a specific example of a 3D Mesh network structure formed by computing units according to an embodiment of the present invention;
Figure 8 is a schematic diagram of a specific example of a 2D Torus network structure formed by computing units according to an embodiment of the present invention;
Figure 9 is a flowchart illustrating another near-memory computing cascade method combining non-volatile storage and dynamic memory provided in an embodiment of the present invention.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0018] In the description of this invention, the terms "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", "bottom", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and do not require that this invention must be constructed and operated in a specific orientation. Therefore, they should not be construed as limiting this invention.
[0019] In this invention, the terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined with "first," "second," etc., may explicitly or implicitly include one or more of that feature. In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0020] In this application, unless otherwise expressly specified and limited, the term "connection" should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral part; it can be a direct connection or an indirect connection through an intermediate medium. Furthermore, the term "coupled" can refer to an electrical connection that enables signal transmission.
[0021] Furthermore, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
[0022] Embodiment 1: To address the aforementioned problems, embodiments of the present invention provide a near-memory computing cascade method combining non-volatile storage and dynamic memory. Figure 1 shows a pre-filling stage computational architecture for a large deep learning model based on cascaded near-memory computing and parameter broadcasting, according to an embodiment of the present invention. Each near-memory computing chip is connected to a corresponding local memory; multiple near-memory computing chips are connected in sequence, and these chips together with their connected local memory form a string, giving a cascaded structure.
[0023] In Figure 1, the elliptical boxes represent the near-memory computing chips, the rectangular boxes represent the local memory, and the local memory drawn next to each near-memory computing chip is the local memory connected to that chip. The near-memory computing chips connected end-to-end by arrows in each horizontal row of Figure 1, together with the local memory connected to them, form a string; for example, Figure 1 contains three strings. The specific types of the local memory and the near-memory computing chip are determined by those skilled in the art based on the specific application scenario. In one embodiment, a low-cost, general-purpose memory with moderate storage capacity can be used as the local memory, such as Double Data Rate SDRAM (DDR), Graphics Double Data Rate SDRAM (GDDR), or Low Power Double Data Rate SDRAM (LPDDR), where SDRAM stands for Synchronous Dynamic Random-Access Memory. The near-memory computing chip can integrate the controller intellectual property (IP) core corresponding to the local memory, programmable matrix and vector computing units, and cascading interfaces; the controller IP core corresponding to the local memory can be the controller IP of DDR, GDDR, or LPDDR.
[0024] Based on the architecture shown in Figure 1, and as shown in Figure 2, the method includes: Step 10: Each cascaded near-memory computing chip is simultaneously connected to local dynamic memory and non-volatile memory.
[0025] In one embodiment, the near-memory computing chip integrates a dynamic memory interface and a non-volatile memory interface to read inactive parameters from the non-volatile memory and load them into the dynamic memory connected to the near-memory computing chip when the system meets preset startup conditions.
[0026] In one embodiment, a near-memory computing chip and its local memory are integrated into a computing unit. The local memory includes non-volatile memory and dynamic memory (DRAM). Each computing unit simultaneously supports both DRAM (i.e., volatile memory) and non-volatile memory chip (e.g., flash memory) interfaces. A specific implementation involves integrating a faster DRAM controller interface, such as a DDR, GDDR, or LPDDR controller interface, and a slower but larger-capacity non-volatile memory interface, such as a flash controller interface. This allows the near-memory computing chip to directly perform fast data read / write and copy operations between non-volatile memory and DRAM. By simultaneously supporting DRAM and non-volatile memory chip (e.g., flash) interfaces on the near-memory computing chip, the model parameters required for near-memory computing can be stored in a more flexible, low-cost, and localized manner.
[0027] Figure 3 shows a specific example of the structure of a computing unit. A computing unit (300) includes: a near-memory computing chip (500), dynamic memory (501), a horizontal expansion interface (502), non-volatile memory (503), and a vertical expansion interface (504). The specific connection methods of the horizontal and vertical expansion interfaces are determined by those skilled in the art based on the specific application scenario; in one embodiment, the high-speed serial computer expansion bus standard Peripheral Component Interconnect Express (PCIe) can be used.
[0028] Step 20: For different data, use different methods to store the model parameters and cached data that need to be calculated; during training and / or inference calculations, store active parameters in local dynamic memory and inactive parameters in non-volatile memory.
[0029] In one embodiment, the multiple cascaded near-memory computing chips shown in Figure 1 collectively complete the computation of the training or inference phase of a model based on a network structure. The specific network structure of this model is determined by those skilled in the art based on the specific application scenario. The model can have a small number of parameters (e.g., a traditional convolutional neural network) or be a large model (e.g., a large model based on a transformer structure), and is not limited here. Each near-memory computing chip is responsible for completing the computation of one or more specified layers of the network structure during each iteration. "Different data" refers to the model parameters and / or cache (KV-Cache) data required by each near-memory computing chip during training and / or inference computation. These data are categorized as active or inactive according to their activity level during computation (i.e., their frequency of use), and are stored in dynamic memory and non-volatile memory respectively.
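The split into active and inactive parameters can be illustrated with a minimal Python sketch. The source does not specify how activity is measured or how placement is decided; the `ParamBlock` type, the `use_count` profiling counter, and the greedy "most used first until DRAM is full" policy are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ParamBlock:
    name: str        # hypothetical identifier for a block of weights
    size_bytes: int  # storage footprint of the block
    use_count: int   # assumed activity metric: uses seen in a profiling window


def place_parameters(blocks, dram_capacity_bytes):
    """Greedy placement (an assumed policy): the most frequently used blocks
    go to dynamic memory until it is full; everything else goes to
    non-volatile memory."""
    dram, nvm = [], []
    free = dram_capacity_bytes
    for block in sorted(blocks, key=lambda b: b.use_count, reverse=True):
        if block.use_count > 0 and block.size_bytes <= free:
            dram.append(block.name)
            free -= block.size_bytes
        else:
            nvm.append(block.name)
    return dram, nvm


blocks = [
    ParamBlock("block_0", 4, use_count=90),
    ParamBlock("block_1", 4, use_count=0),
    ParamBlock("block_2", 4, use_count=55),
    ParamBlock("block_3", 4, use_count=3),
]
dram, nvm = place_parameters(blocks, dram_capacity_bytes=8)
```

With a DRAM budget of 8 bytes, the two most frequently used blocks land in dynamic memory and the remaining two in non-volatile memory. Other policies (thresholds on frequency, per-layer budgets) would fit the description equally well.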
[0030] Step 30: Using the active parameters and the inactive parameters, complete the training and / or inference computations with the cooperation of multiple cascaded near-memory computing chips.
[0031] During model training or inference computation, not all model parameters are activated and participate in the calculation each time; rather, most model parameters remain in an inactive state. This invention connects each cascaded near-memory computing chip to both local dynamic memory and non-volatile memory, employing different methods to store the required model parameters for different active states. During model training and inference computation, by storing active parameters in local dynamic memory and inactive parameters in non-volatile memory, combined with the active characteristics of model parameters, multiple cascaded near-memory computing chips can significantly reduce access operations to most inactive parameters in each iteration, significantly reducing the need for expensive dynamic memory chips. Furthermore, active parameters can be accessed at high speed through dynamic memory; this reduces the number of parameters read and the number of calculations during model training and inference computation without affecting model inference performance; and it solves the problem of high cost of dedicated high-bandwidth dynamic memory chips, achieving an optimized balance between cost and efficiency.
[0032] Based on the aforementioned computing unit, Figure 4 shows a specific example of a rack-level large model computing server (200), wherein each server (200) includes multiple computing units (300). Figure 5 shows a specific example of servers (200) interconnected at the rack level to form different topologies, wherein the rack includes an input interface (201) and an output interface (202). In one embodiment, different interconnection methods form different structures: by configuring the interfaces of the servers (200), the computing units can form the 2D Mesh network structure shown in Figure 6, the 3D Mesh network structure shown in Figure 7, the 2D Torus network structure shown in Figure 8, or a 3D Torus network structure.
[0033] Embodiment 2: In real-world applications, chips often encounter errors during computation. In existing technologies, when a chip detects an error, it often reloads the required model parameters from a remote network. The large number of these model parameters creates a significant storage burden, resulting in excessively long fault recovery times.
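The difference between the mesh and torus interconnections can be sketched by listing the neighbors of a computing unit on a grid. This is a generic illustration of the two topologies, not taken from the source; the function names and the `(x, y)` coordinate scheme are assumptions.

```python
def mesh_neighbors(x, y, width, height):
    """Neighbors in a 2D mesh: units on an edge have fewer links."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]


def torus_neighbors(x, y, width, height):
    """Neighbors in a 2D torus: links wrap around at the edges, so every
    unit has four links."""
    return [((x - 1) % width, y), ((x + 1) % width, y),
            (x, (y - 1) % height), (x, (y + 1) % height)]


corner_mesh = mesh_neighbors(0, 0, 3, 3)
corner_torus = torus_neighbors(0, 0, 3, 3)
```

On a 3 x 3 grid, the corner unit has two neighbors in the mesh and four in the torus, because the torus adds wrap-around links. The 3D variants extend the same idea with a third coordinate.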
[0034] Furthermore, in existing technologies, deep learning, especially large models based on the transformer architecture, involves a massive number of parameters and computational demands. During training and inference, the sheer volume of parameters places high demands on input / output bandwidth, typically requiring storage on volatile memory chips such as HBM or DDR. If the system experiences a power outage or other failure, the relevant parameters (e.g., model parameters and KV-Cache) will be lost and need to be reloaded. Reloading these parameters in the event of system failures or power outages is extremely cumbersome, slow, and leads to slow system recovery.
[0035] To solve the above problems, on the basis of Embodiment 1 of the present invention and as shown in Figure 9, the method further includes: Step 40: Store the complete model parameters hierarchically in the non-volatile memory connected to the near-memory computing chip.
[0036] Step 50: When the system meets the preset startup conditions, read the local model parameters from the non-volatile memory and load them into the dynamic memory connected to the near-memory computing chip.
[0037] In one embodiment, the preset startup conditions include: a system failure occurring, or a need for a fast model cold start. The system failure conditions include power failure of the near-memory computing chip and / or power failure of the dynamic memory.
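Steps 40 and 50 can be sketched as a small simulation of one computing unit. This is an illustrative model only: the `ComputeUnit` class, the dictionary-based memories, and the reason strings `"system_failure"` and `"cold_start"` are hypothetical names, not part of the source.

```python
class ComputeUnit:
    """Toy model of one computing unit with non-volatile and dynamic memory."""

    STARTUP_REASONS = ("system_failure", "cold_start")  # assumed labels

    def __init__(self, local_params):
        # Step 40: this unit's share of the model is kept in non-volatile memory.
        self.nvm = dict(local_params)
        self.dram = {}

    def power_loss(self):
        # Volatile contents are lost; non-volatile contents survive.
        self.dram.clear()

    def startup(self, reason):
        # Step 50: on a preset startup condition, copy the local parameters
        # from non-volatile memory into dynamic memory, with no remote reload.
        if reason not in self.STARTUP_REASONS:
            raise ValueError(f"not a preset startup condition: {reason}")
        self.dram.update(self.nvm)
        return len(self.dram)


unit = ComputeUnit({"layer0.weight": b"\x01\x02", "layer1.weight": b"\x03"})
unit.startup("cold_start")
unit.power_loss()
lost = dict(unit.dram)
restored = unit.startup("system_failure")
```

After the simulated power loss the dynamic memory is empty, and the second `startup` call restores both parameter blocks purely from the unit's own non-volatile memory.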
[0038] In one embodiment, in step 40, the KV-Cache generated during the near-memory computing chip's computation process can also be stored in the non-volatile memory connected to the near-memory computing chip; in step 50, when the system meets the preset startup operation conditions, the local KV-Cache can be read from the non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.
[0039] The system stores complete model parameters hierarchically in non-volatile memory connected to the near-memory computing chip. In the event of system failures such as power loss of the near-memory computing chip or dynamic memory, the system can read local model parameters from the non-volatile memory into the dynamic memory, quickly restoring the inference function of the distributed large model. This solves the problems of existing technologies in model training and inference, such as the enormous number of parameters, computational load, and parameter input / output, coupled with the high cost of dedicated chips, and the long time required to read large amounts of model parameters from external sources step by step during fault recovery.
[0040] Each computing unit in this embodiment includes local non-volatile memory. When a near-memory computing chip in a computing unit detects an error (i.e., a system failure), step 40 has already been executed, so there is no need to reload the massive model parameters from the remote network. Instead, the chip is reset locally and, following step 50, the required model parameters are loaded directly from the local non-volatile memory of its computing unit into the local dynamic memory, which greatly improves the speed of fault recovery. This also improves data loading speed during system startup. Direct recovery from local storage avoids time-consuming network transmission, significantly shortens fault recovery time, and improves the overall availability and reliability of the system, which is highly beneficial for large-scale online inference services with large models.
[0041] For scenarios such as initial model deployment, restart after prolonged inactivity, and initialization after resource allocation (e.g., elastic scaling of cloud services), rapid model cold starts are required to shorten the startup time from model loading to being able to handle requests, reducing user-perceived latency (e.g., first token time). Because this embodiment enables local reading of model parameters through steps 40 and 50, a large amount of data involved in model weight loading and computing resource initialization (GPU memory allocation, CPU memory allocation) during the cold start process does not need to be reloaded from the remote network but is directly read from non-volatile memory, significantly shortening the cold start time and improving performance efficiency.
[0042] Embodiment 3: This embodiment can be considered a preferred implementation of Embodiments 1 and 2 of the present invention.
[0043] Based on the cascaded architecture of Embodiment 1 of the present invention (i.e., Figure 1), this embodiment provides a specific example for a large model. The following example uses a typical autoregressive generative language model based on the transformer architecture to illustrate how the architecture of this embodiment works. Background: Large language models are based on the Transformer structure and generally consist of several parts: (1) input text embedding; (2) multiple structurally identical layers, each comprising self-attention and feed-forward sublayers and containing a large number of network weight parameters; (3) an output layer, which converts the vectors output by the preceding layers into token text.
[0044] Autoregressive generation works in two stages: (1) Pre-filling stage: the input text is converted into L tokens, which are then converted into a two-dimensional matrix of length L and width d. This matrix then undergoes matrix computation with the self-attention and feed-forward sublayers of each layer, which outputs a new two-dimensional matrix of length L and width d.
[0045] (2) Decoding stage: The aforementioned two-dimensional matrix of length L and width d is merged along the length dimension to obtain a one-dimensional vector of width d; this vector is then appended to the two-dimensional matrix of length L to form a two-dimensional matrix of length L+1. This new matrix is then used as input to the first layer (self-attention and feed-forward sublayers) of the transformer structure for the next round of computation. Each pass through all layers produces a new one-dimensional vector, until a terminator is generated.
[0046] In one embodiment, to reduce the amount of computation during the decoding stage, the key vectors and value vectors already computed in each layer for the L existing positions (among the query, key, and value vectors) can be cached; this is the KV-Cache. These results are not recomputed in later rounds, and each layer only needs to compute the vectors for the newest position (e.g., position L+1).
[0047] Based on this, first, each local memory stores the KV-Cache of the near-memory computing chip connected to it. Then, each near-memory computing chip retrieves the model parameters and cached data of the corresponding level from its connected local memory. Using the model parameters, the cached data, and the input data generated by the previous level, it completes the hierarchical calculation of the corresponding model decoding stage, obtaining updated cached data and the input for the next level. If the output of the last-level near-memory computing chip does not contain a termination token, it is returned to the first-level near-memory computing chip to continue inference calculation, so that the distributed and cascaded inference calculation of the model is completed through the cooperation of multiple cascaded near-memory computing chips.
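The effect of the KV-Cache can be shown with a minimal single-head attention sketch in plain Python. It is a simplification under stated assumptions: the query, key, and value vectors are passed in directly (the learned projections are omitted), and `attend` and `LayerKVCache` are hypothetical names rather than anything defined in the source.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(width)]


class LayerKVCache:
    """Per-layer cache: keys and values of earlier positions are kept and
    only the newest position's key and value are added on each step."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode_step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


cache = LayerKVCache()
first = cache.decode_step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])
second = cache.decode_step([0.0, 1.0], [0.0, 1.0], [0.0, 4.0])
```

Each call to `decode_step` adds exactly one key and one value, so the work per step grows with the cache length for the attention itself but never repeats the key and value computation for earlier positions.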
[0048] The specific method of using KV-Cache in the calculation is determined by those skilled in the art based on the specific application scenario and is not limited here. In one embodiment, when each near-memory computing chip needs to perform calculations during the large model decoding stage, it obtains the KV-Cache required for this calculation from its own connected local memory, and uses the model parameters to be calculated, the obtained KV-Cache, and the input data to perform calculations to obtain intermediate quantities.
[0049] As shown in Figure 1, in one embodiment, a central processor manages multiple strings, and the near-memory computing chips of the different strings managed by the central processor are interconnected. In a specific example, as shown in Figure 1, a string consists of one or more central processors connected in series with multiple near-memory computing chips; one central processor can connect to and manage multiple strings. In Figure 1, the central processor connects to and manages three strings. The near-memory computing chips of different strings can be interconnected using the vertical bus shown in Figure 1 or using point-to-point connections. Each string processes its own input data; the three strings in Figure 1 process three "inputs". For the first near-memory computing chip of each string (i.e., the leftmost near-memory computing chip in Figure 1), the input data involved in the calculation in step 20 is the "input" in Figure 1. The intermediate value is the result of the current calculation by a near-memory computing chip and can be understood as an intermediate calculation result of the string in the current round of calculation.
[0050] Based on this, in one embodiment, in order to illustrate the process of obtaining the model parameters to be calculated, the local memory corresponding to one of the strings managed by the central processor stores the model parameters of a preset number of layers in a distributed manner.
[0051] In this context, the local memory corresponding to one of the strings managed by the central processor refers to the local memory connected to all the near-memory computing chips in that string. The preset number of layers refers to the number of network layers of a model (e.g., a deep learning model) network structure that a near-memory computing chip needs to compute each time. This preset number of layers is determined by those skilled in the art based on the specific application scenario and is not limited here. Model parameters refer to the network weight parameters that participate in the computation and are iteratively trained during the training phase.
[0052] In one embodiment, such as Figure 1 As shown, the model parameters of each layer of the deep learning model used in the current computation are stored in a distributed manner. Figure 1The model parameters for one or more layers can be stored in the local memory connected to a local memory chip, which is located at the bottommost level. Each local memory chip is responsible for calculating the model parameters and input data for only one or more layers.
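One simple way to distribute layers over the chips of a string is a contiguous, near-equal split, sketched below. The source only says that each chip handles a preset number of layers chosen by the implementer; the even-split policy and the `assign_layers` name are assumptions for illustration.

```python
def assign_layers(num_layers, num_chips):
    """Split layer indices into contiguous, near-equal ranges, one per chip.
    Earlier chips receive one extra layer when the split is uneven."""
    base, extra = divmod(num_layers, num_chips)
    ranges, start = [], 0
    for chip in range(num_chips):
        count = base + (1 if chip < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


plan = [list(r) for r in assign_layers(10, 4)]
```

For ten layers on four chips, the first two chips hold three layers each and the last two hold two each. Keeping the ranges contiguous means each chip only needs the intermediate value of its immediate predecessor.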
[0053] To support the computation of autoregressive generative models, the intermediate values are transmitted to the next near-memory computing chip and used as input data in the computation to obtain the intermediate values of the next near-memory computing chip, thereby obtaining the output result corresponding to the string and completing the response generation. That is, after the last near-memory computing chip completes the computation, the vector of the output result obtained by the current string in the current round is returned through the input feedback loop to the first near-memory computing chip of the current string (e.g., the leftmost near-memory computing chip in Figure 1).
[0054] Here, the next near-memory computing chip refers to the following near-memory computing chip that is directly connected to the near-memory computing chip currently performing the computation. For example, when the chip performing the computation is the leftmost near-memory computing chip in Figure 1, the next near-memory computing chip is the second near-memory computing chip from the left in Figure 1. Each near-memory computing chip obtains its intermediate value in the same way, using the intermediate value received from the preceding chip as the input data for its current calculation. That is, the next near-memory computing chip uses its required model parameters, its acquired KV-Cache, and the received intermediate value to complete the large model decoding stage calculation, obtaining the intermediate value of its current calculation. This process is repeated level by level: the intermediate value obtained by each near-memory computing chip on a string serves as the input data for the next near-memory computing chip on the same string. After the last near-memory computing chip of each string completes its calculation, the output result of that string in the current round of calculation is obtained.
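The level-by-level hand-off described above can be sketched as follows. This is a toy model under stated assumptions, not the patented implementation: each layer is reduced to a single scalar weight, the `cache` list is only a simplified stand-in for the KV-Cache, and the names `NearMemoryChip` and `run_string` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NearMemoryChip:
    """Toy chip: holds the weights of its own layers plus a local cache."""

    weights: list[float]                              # one scalar per layer (illustrative only)
    cache: list[float] = field(default_factory=list)  # simplified stand-in for the KV-Cache

    def compute(self, value: float) -> float:
        # Apply this chip's layers in order to the incoming intermediate value.
        for w in self.weights:
            value = w * value
        # Record the result locally, mimicking a per-chip cache update.
        self.cache.append(value)
        return value


def run_string(chips: list[NearMemoryChip], input_value: float) -> float:
    """Pass the value through every chip of one string, left to right."""
    value = input_value
    for chip in chips:
        value = chip.compute(value)  # output of one chip is the input of the next
    return value


chips = [NearMemoryChip([2.0]), NearMemoryChip([3.0, 1.0]), NearMemoryChip([0.5])]
output = run_string(chips, 1.0)
print(output)
```

The value returned by the last chip corresponds to the output result of the string for the current round.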
[0055] For example, as shown in Figure 1, each input requires multiple rounds of computation to generate its final output token by token. In a scenario where the user provides an input and the model needs to generate a dialogue response, each string produces the output result of the current round after each round of computation. The output result of the current round is then used as the input for the next round of computation, and the iteration continues to produce the output result of the next round. After the final iteration, the statement generated token by token across all rounds is the final output, which is the complete response generated for the user input.
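The round-by-round feedback described above can be sketched as a simple loop. The sketch assumes integer token ids, a hypothetical termination token id `EOS`, and a hypothetical round limit `max_rounds`; the callable `run_string` stands for one complete pass through the cascaded chips of a string and is replaced here by a toy function.

```python
from __future__ import annotations

from typing import Callable

EOS = 0  # hypothetical termination token id, chosen for illustration


def generate(run_string: Callable[[int], int], first_input: int, max_rounds: int = 16) -> list[int]:
    """Generate tokens round by round, feeding each output back as the next input."""
    tokens: list[int] = []
    current = first_input
    for _ in range(max_rounds):
        current = run_string(current)  # one full pass through the string
        if current == EOS:             # stop once the termination token appears
            break
        tokens.append(current)         # otherwise keep it and feed it back
    return tokens


def toy_string(token: int) -> int:
    """Toy stand-in for a string of chips: simply counts down by one."""
    return token - 1


response = generate(toy_string, first_input=4)
print(response)
```

The list of tokens accumulated over all rounds plays the role of the complete response.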
[0056] In one embodiment, the near-memory computing chip can also integrate a general-purpose processor as a preprocessing unit for deep learning and large models, and together with the cascaded near-memory computing chip, form a complete architecture for large model computing.
[0057] It is worth noting that, in large model training computations, gradient-based backpropagation can be achieved through a single forward inference process and a single backpropagation process across multiple cascaded near-memory computing chips. In large model fine-tuning computations, the implementation process is similar to that of large model training, except that only the parameter changes produced by fine-tuning on the basis of the pre-trained parameters are distributed across the different near-memory computing chips. During the training phase, a single cascaded near-memory computing chip can support backpropagation training through distributed multi-level computation, or support training solely through parameter optimization methods that use forward propagation.
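One forward pass followed by one backward pass across cascaded stages can be illustrated with a deliberately small sketch. It assumes each chip holds a single scalar weight, a squared-error loss, and plain gradient descent; the names `ChipStage` and `train_step` and the learning rate are hypothetical and are not taken from the specification.

```python
from __future__ import annotations


class ChipStage:
    """Toy stage: y = weight * x, remembering its input for the backward pass."""

    def __init__(self, weight: float) -> None:
        self.weight = weight
        self.last_input = 0.0
        self.grad = 0.0

    def forward(self, x: float) -> float:
        self.last_input = x
        return self.weight * x

    def backward(self, grad_out: float) -> float:
        self.grad = grad_out * self.last_input  # gradient with respect to this stage's weight
        return grad_out * self.weight           # gradient handed to the previous stage


def train_step(stages: list[ChipStage], x: float, target: float, lr: float) -> float:
    """Run one forward pass and one backward pass over the cascade, then update weights."""
    y = x
    for stage in stages:            # forward: first chip to last chip
        y = stage.forward(y)
    loss = 0.5 * (y - target) ** 2
    grad = y - target               # derivative of the loss with respect to y
    for stage in reversed(stages):  # backward: last chip to first chip
        grad = stage.backward(grad)
    for stage in stages:
        stage.weight -= lr * stage.grad
    return loss


stages = [ChipStage(1.0), ChipStage(1.0)]
first_loss = train_step(stages, x=1.0, target=2.0, lr=0.1)
second_loss = train_step(stages, x=1.0, target=2.0, lr=0.1)
print(first_loss, second_loss)
```

Each stage only needs its own stored input and the gradient arriving from its successor, which is why the backward pass can travel through the cascade in the reverse of the forward order.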
[0058] Example 4: Based on Embodiment 3 of the present invention, the method further includes: the local memory stores model parameters of a preset number of layers in a distributed manner; and the local memory corresponding to the near-memory computing chip broadcasts the model parameters to other near-memory computing chips.
[0059] In one embodiment, for all strings managed by the central processor in Figure 1, the model parameters used in the current computation are stored in a distributed manner in the bottommost row of local memory in Figure 1. The near-memory computing chips in the bottom row (or in another row) then broadcast the model parameters needed by the large model during inference or training to the near-memory computing chips in the other rows.
[0060] In the pre-filling stage of large model inference computation, the scheme of reading and cascading broadcasting model parameters according to the embodiments of the present invention allows the same data to be reused among multiple strings, reducing the storage capacity required to store model parameters and reducing the number of memory reads. In the decoding stage of large model inference computation, each near-memory chip obtains one or more layers of parameters from the vertically broadcast parameters. In this process, parameter broadcasting also reduces the number of memory reads. Multiple parallel cascaded near-memory computing chips simultaneously support the computation of the same model through parameter broadcasting during the parallel pre-filling stage of inference computation.
[0061] In the architecture shown in Figure 1, the multiple strings managed by the central processor use the same network model. Therefore, the model parameters required by the near-memory computing chips that compute the same network layer are identical across strings. Only one string needs to read the corresponding model parameters once and broadcast them to the near-memory computing chips of the other strings. The other strings then do not need to read the same model parameters again, which greatly reduces the number of read operations they perform and the corresponding resource consumption.
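The saving in memory reads can be made concrete with a small counting sketch. It assumes a simple in-memory model in which every call to `read` counts as one physical read; the class `LocalMemory` and both loader functions are hypothetical illustrations, not interfaces defined by the specification.

```python
from __future__ import annotations


class LocalMemory:
    """Toy local memory that counts how many times it is read."""

    def __init__(self, layer_params: dict[int, list[float]]) -> None:
        self.layer_params = layer_params
        self.read_count = 0

    def read(self, layer: int) -> list[float]:
        self.read_count += 1
        return self.layer_params[layer]


def load_with_broadcast(memory: LocalMemory, layer: int, num_strings: int) -> list[list[float]]:
    """One string reads the layer once; the same data is handed to every string."""
    params = memory.read(layer)
    return [params for _ in range(num_strings)]


def load_without_broadcast(memory: LocalMemory, layer: int, num_strings: int) -> list[list[float]]:
    """Every string reads the same layer separately."""
    return [memory.read(layer) for _ in range(num_strings)]


shared = LocalMemory({0: [0.1, 0.2]})
load_with_broadcast(shared, layer=0, num_strings=3)

separate = LocalMemory({0: [0.1, 0.2]})
load_without_broadcast(separate, layer=0, num_strings=3)

print(shared.read_count, separate.read_count)
```

With three strings, the broadcast variant performs a single read where the per-string variant performs three.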
[0062] Example 5: For large models with over a hundred billion parameters, not all model parameters are activated and participate in the computation every time a single inference task is performed. During the computation process, the weights of most model parameters are actually inactive because their values are zero (or close to zero).
[0063] To further optimize cost and efficiency, based on Embodiment 3 of this invention, the local memory includes dynamic memory and non-volatile memory. The method further includes: classifying the model parameters into active and inactive parameters according to their activity level during the calculation process; storing the active parameters in dynamic memory; and storing the inactive parameters in non-volatile memory. The specific method for classifying the model parameters into active and inactive parameters shall be determined by those skilled in the art based on the specific application scenario.
[0064] In one embodiment, inactive model parameters can be identified in advance through offline computation and designated as inactive parameters; typically, 80% to 90% of the model parameters in large models belong to this category of rarely used inactive parameters. Conversely, the few model parameters that are frequently activated are identified as active parameters. Based on this characteristic, embodiments of the present invention employ a hierarchical storage strategy: frequently accessed active parameters are stored in local, faster dynamic memory to ensure high-speed access, while infrequently used inactive parameters are stored in slower, larger-capacity, and lower-cost non-volatile memory. This significantly reduces the need for expensive DRAM capacity without affecting the overall computational speed, thereby optimizing cost and efficiency. During inference, the differences in read and write speeds of the different memories are fully utilized to achieve faster inference and training speeds. This enables low-cost, more efficient support for larger-scale models, or for scenarios in which the inference and training of multiple models coexist, thereby improving the efficiency of dialogue response generation.
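The specification leaves the classification method to those skilled in the art. One possible approach, shown purely as an assumption, is to count offline how often each parameter group is activated over a set of sample inputs and to compare the resulting frequency with a threshold. The function name, the threshold value, the parameter group names, and the counts below are all hypothetical.

```python
from __future__ import annotations


def classify_parameters(
    activation_counts: dict[str, int],
    num_samples: int,
    threshold: float = 0.1,
) -> tuple[list[str], list[str]]:
    """Split parameter groups into (active, inactive) by offline activation frequency.

    Assumption: a group is "active" if it was activated in at least `threshold`
    of the sample inputs; the specification does not prescribe this rule.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    active: list[str] = []
    inactive: list[str] = []
    for name, count in activation_counts.items():
        frequency = count / num_samples
        if frequency >= threshold:
            active.append(name)
        else:
            inactive.append(name)
    return active, inactive


# Made-up activation counts gathered over 1000 hypothetical sample inputs.
counts = {"w0": 950, "w1": 3, "w2": 0, "w3": 400, "w4": 12}
active, inactive = classify_parameters(counts, num_samples=1000)

# Tiered placement: active groups go to dynamic memory, the rest to non-volatile memory.
placement = {name: "DRAM" for name in active}
placement.update({name: "NVM" for name in inactive})
print(placement)
```

In this sketch the `placement` mapping is the only output needed to decide which memory tier holds each parameter group.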
[0065] Based on this, in one embodiment, when the system meets the aforementioned preset startup operation conditions, inactive parameters can also be read from non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.
[0066] By simultaneously supporting a dynamic memory (i.e., volatile memory, such as DRAM) interface and a non-volatile memory chip (e.g., FLASH) interface, the hierarchical large model parameters required by each local near-memory computing chip can be stored in a more flexible, low-cost, and localized manner. This meets the needs of rapid fault recovery, rapid cold start, simultaneous support of multiple models, and distributed storage and computation of the cold (inactive) and active parameters of large models.
[0067] In one embodiment, during large model computation and inference, the near-memory computing chip can automatically read the model parameters stored in local memory via the flash controller interface and store them in the integrated DDR memory as local parameters. This enables localized parameter reading, rapid model cold start, and rapid recovery in case of failure. When a system failure and / or a power supply problem occurs, the model parameters are reloaded directly from the corresponding non-volatile memory.
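The recovery path described above can be sketched with two dictionaries standing in for the two memory tiers. The sketch assumes that the non-volatile memory holds the chip's complete local parameter set and that only the active parameters are copied into dynamic memory; the class `ChipLocalStorage` and its method names are hypothetical and do not model any real flash or DDR controller interface.

```python
from __future__ import annotations


class ChipLocalStorage:
    """Toy model of one chip's local dynamic memory and non-volatile memory."""

    def __init__(self, nvm_params: dict[str, float], active_names: list[str]) -> None:
        self.nvm = dict(nvm_params)           # persists across power loss
        self.active_names = set(active_names)
        self.dram: dict[str, float] = {}      # volatile: lost on power failure
        self.restore()

    def restore(self) -> None:
        """Cold start or fault recovery: reload active parameters from local NVM."""
        self.dram = {name: self.nvm[name] for name in self.active_names}

    def power_loss(self) -> None:
        """Simulate a power failure: volatile contents disappear."""
        self.dram.clear()

    def get(self, name: str) -> float:
        """Serve from dynamic memory when present, otherwise from non-volatile memory."""
        if name in self.dram:
            return self.dram[name]
        return self.nvm[name]


storage = ChipLocalStorage({"w_active": 1.5, "w_cold": 0.0}, active_names=["w_active"])
storage.power_loss()
after_failure = dict(storage.dram)  # snapshot of the emptied dynamic memory
storage.restore()
print(after_failure, storage.dram)
```

Because the reload reads only from memory attached to the same chip, no parameters have to be fetched from an external source in this model.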
[0068] This invention connects each cascaded near-memory computing chip to both local dynamic memory and non-volatile memory. Different methods are used to store the required model parameters and cached (KV-Cache) data for different types of data. During large model inference computation, by storing active parameters in local dynamic memory and inactive parameters in non-volatile memory, combined with the active nature of large model parameters, the number of parameters read and the amount of inference computation are reduced without affecting the inference performance of the large model. Simultaneously, the system stores the complete large model parameters hierarchically in the non-volatile memory connected to the near-memory computing chips. In the event of system failures such as power loss of the near-memory computing chips or dynamic memory, the local model parameters can be read from the non-volatile memory into the dynamic memory, quickly restoring the inference function of the distributed large model. This solves the problems of existing technologies in the training and inference of large models, which involve huge numbers of parameters, huge computational loads, and huge parameter input / output volumes, as well as the high cost of dedicated chips and the long time required to read large numbers of model parameters from external sources during fault recovery.
[0069] It is worth noting that the information interaction and execution processes between the modules and units in the above-mentioned device and system are based on the same concept as the processing method embodiments of the present invention. For details, refer to the description in the method embodiments of the present invention, which will not be repeated here.
[0070] Those skilled in the art will understand that all or part of the steps in the various methods of the embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, etc.
[0071] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A near-memory computing cascade method combining non-volatile memory and dynamic memory, the method comprising: each cascaded near-memory computing chip is simultaneously connected to local dynamic memory and non-volatile memory; different methods are used to store the model parameters and cached data that need to be calculated for different types of data; during training and / or inference computation, active parameters are stored in local dynamic memory and inactive parameters are stored in non-volatile memory; and, using the active and inactive parameters, training and / or inference computations are completed in collaboration with multiple cascaded near-memory computing chips.
2. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to claim 1, characterized in that the method further includes: the complete model parameters are stored hierarchically in the non-volatile memory connected to the near-memory computing chip; and when the system meets the preset startup conditions, the local model parameters are read from the non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.
3. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to claim 2, characterized in that the preset startup conditions include: a system failure occurs, or a rapid model cold start is required; and the system failure scenarios include power failure of the near-memory computing chip and / or power failure of the dynamic memory.
4. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to claim 1, characterized in that the method further includes: by using multiple cascaded near-memory computing chips and their connected local memory, a distributed approach is adopted to store model parameters and cached data during inference computation.
5. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to claim 4, characterized in that the method further includes: each near-memory computing chip obtains the model parameters and cached data of the corresponding level from its connected local memory; using the model parameters, the cached data, and the input data generated by the previous level, it completes the hierarchical calculation of the corresponding model decoding stage to obtain updated cached data and the input of the next level; and if the output of the final-level near-memory computing chip does not contain a termination token, the computation returns to the first-level near-memory computing chip to continue inference computation, so that the distributed and cascaded inference computation of the model is completed through the cooperation of multiple cascaded near-memory computing chips.
6. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to claim 4, characterized in that the local memory includes dynamic memory and non-volatile memory; and the method includes: based on the activity level of the model parameters during the calculation process, the model parameters are divided into active parameters and inactive parameters; the active parameters are stored in dynamic memory; and the inactive parameters are stored in non-volatile memory.
7. The near-memory computing cascade method combining non-volatile memory and dynamic memory according to any one of claims 1-6, characterized in that the method includes: the near-memory computing chip integrates a dynamic memory interface and a non-volatile memory interface, so that when the system meets the preset startup operation conditions, inactive parameters can be read from the non-volatile memory and loaded into the dynamic memory connected to the near-memory computing chip.