Computing devices, data processing methods, apparatuses, and media for high-bandwidth inference
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ICY TECHNOLOGY (BEIJING) CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, matrix-vector multiplication in neural network model inference is inefficient and lacks timing determinism, and external interfaces face high bandwidth requirements for transporting weight matrices.
A weight-locked memory array based on MRAM is adopted. A fixed readout tick is determined by the pre-calibrated readout latency of the MRAM. During inference of the neural network model, the input vector streaming interface, readout scheduling unit, parallel multiplication unit, and accumulation unit perform parallel operations based on this tick, which reduces the bandwidth requirements of the external interface.
This improves the efficiency of neural network model inference and the timing determinism of matrix-vector multiplication, and reduces the bandwidth that external interfaces need for transporting weight matrices.
Figure CN122263993A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of integrated circuit technology, and in particular to a computing device, a data processing method, an apparatus, and a medium for high-bandwidth inference.
Background Technology
[0002] Matrix-vector multiplication is a fundamental computational operation in neural network model inference and is widely present in various linear layers. Magnetic Random Access Memory (MRAM) is a non-volatile memory based on a magnetic tunnel junction (MTJ), characterized by high storage density, low static leakage, and no need for dynamic refresh.
[0003] The methods described in this section are not necessarily methods that have been previously conceived or adopted. Unless otherwise specified, no method described in this section should be assumed to be prior art simply because it is included in this section. Similarly, unless otherwise specified, the issues mentioned in this section should not be assumed to have been recognized in any prior art.
Summary of the Invention
[0004] According to one aspect of this disclosure, a computing device is provided for performing matrix-vector multiplication on a weight matrix and an activation vector during inference of a neural network model. The computing device includes: a weight-locked memory array based on MRAM, comprising multiple memory partitions, and configured to statically store the weight matrix in the multiple memory partitions during inference of the neural network model, wherein the MRAM has a pre-calibrated readout latency, and the computing device has a fixed readout tick predetermined based on the readout latency; an input vector streaming interface configured to receive the activation vector based on the fixed readout tick; a readout scheduling unit configured to read multiple sets of weight values in parallel from the multiple memory partitions based on the fixed readout tick; a parallel multiplication unit configured to perform parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector in each readout tick; an accumulation unit configured to accumulate the product results generated by the parallel multiplication unit in each readout tick to generate a result vector of the matrix-vector multiplication operation; and a result output interface configured to output the result vector.
[0005] According to another aspect of this disclosure, a data processing method for high-bandwidth inference is provided for performing matrix-vector multiplication on a weight matrix and an activation vector during inference of a neural network model. The method includes: pre-storing the weight matrix in a statically resident manner in multiple storage partitions of a weight-locked memory array based on MRAM; obtaining a fixed readout tick, the fixed readout tick being determined based on a pre-calibrated readout latency of the MRAM; and, during inference of the neural network model, performing the following operations based on the fixed readout tick: receiving the activation vector; reading multiple sets of weight values in parallel from the multiple storage partitions; performing parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector; and accumulating the product results generated by each readout tick; wherein, the result vector of the matrix-vector multiplication operation is generated after accumulating over multiple readout ticks.
[0006] According to one or more embodiments of this disclosure, the weight matrix is stored in a statically resident manner in a weight-locked memory array based on MRAM, and a fixed readout tick is determined by the pre-calibrated readout latency of the MRAM. This enables the input vector streaming interface, readout scheduling unit, parallel multiplication unit, and accumulation unit to work synchronously based on the fixed readout tick, thereby completing weight reading and matrix-vector multiplication operations on-chip during neural network model inference. This reduces the bandwidth requirements of the external interface for transporting the weight matrix, improves model inference efficiency, and enhances the determinism of the execution timing of matrix-vector multiplication operations.
[0007] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0008] Further details, features, and advantages of this disclosure are disclosed in the following description of exemplary embodiments in conjunction with the accompanying drawings, in which:
Figure 1 shows a structural block diagram of a computing device according to an exemplary embodiment of the present disclosure;
Figure 2 shows a flowchart of a data processing method according to an exemplary embodiment of the present disclosure;
Figure 3 shows a structural block diagram of a computer device according to an exemplary embodiment of the present disclosure.
Detailed Implementation
[0009] In this disclosure, unless otherwise stated, the use of terms such as "first," "second," etc., to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in other cases, based on the context, they may refer to different instances.
[0010] The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context expressly indicates otherwise, an element may be one or more unless the number of elements is specifically limited. As used herein, the term "multiple" means two or more, and the term "based on" should be interpreted as "at least partially based on". Furthermore, the terms "and / or" and "at least one of..." cover any one of the listed items and all possible combinations thereof.
[0011] According to one aspect of this disclosure, a computing device is proposed for performing matrix-vector multiplication on a weight matrix and an activation vector during inference of a neural network model. Figure 1 is a structural block diagram illustrating a computing device 100 according to an exemplary embodiment.
[0012] Referring to Figure 1, the computing device 100 includes: a weight-locked memory array 110 based on MRAM, comprising multiple memory partitions, and configured to store the weight matrix in a statically resident manner in the multiple memory partitions during inference of the neural network model, wherein the MRAM has a pre-calibrated readout latency, and the computing device has a fixed readout tick predetermined based on the readout latency; an input vector streaming interface 120 configured to receive the activation vector based on the fixed readout tick; a readout scheduling unit 130 configured to read multiple sets of weight values in parallel from the multiple memory partitions based on the fixed readout tick; a parallel multiplication unit 140 configured to perform parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector in each readout tick; an accumulation unit 150 configured to accumulate the product results generated by the parallel multiplication unit in each readout tick to generate a result vector of the matrix-vector multiplication operation; and a result output interface 160 configured to output the result vector.
[0013] Therefore, this disclosure stores the weight matrix in a statically resident manner in a weight-locked memory array based on MRAM, and uses the pre-calibrated readout latency of the MRAM to determine a fixed readout tick. This enables the input vector streaming interface, readout scheduling unit, parallel multiplication unit, and accumulation unit to work synchronously based on the fixed readout tick. As a result, weight reading and matrix-vector multiplication operations are completed on-chip during neural network model inference, reducing the bandwidth requirements of the external interface for transporting the weight matrix, improving model inference efficiency, and enhancing the determinism of the execution timing of matrix-vector multiplication operations.
[0014] The following will describe in detail the various components of the computing device 100.
[0015] The weight-locked storage array 110 is a storage module built on MRAM and includes multiple storage partitions. The weight matrix is stored in these multiple storage partitions in a statically resident manner. Static resident means that after the weight matrix is written to the weight-locked storage array 110, it remains unchanged during the inference process involving matrix-vector multiplication, and is not overwritten, updated, or replaced. In other words, during the inference computation of the weight matrix, the corresponding weight data in the weight-locked storage array 110 is only read and not modified.
[0016] In some embodiments, the entire weight matrix can be written all at once during the deployment phase and remain statically resident throughout the entire model inference process. Because MRAM is non-volatile, the weight matrix is retained even when power is lost. When model upgrades, parameter reconfigurations, or offline reprogramming are required, the weight data in the storage partitions can be rewritten after the inference service stops, and the array then re-enters the static resident state. In all cases the weight matrix is written via the weight-locked storage array's own write path, without going through the input vector streaming interface 120 or the result output interface 160.
[0017] MRAM is a non-volatile memory that retains its stored data even when power is lost, and its read path does not depend on any periodic refresh or restore process. Compared to memories that require dynamic refresh, the readout latency of MRAM is unaffected by refresh cycles and can therefore be predetermined through device characterization during the design or deployment phase, remaining stable during inference. This characteristic is referred to in this disclosure as pre-calibrated readout latency.
[0018] Furthermore, compared to SRAM, MRAM offers higher storage density and lower static leakage. Compared to DRAM, MRAM requires no dynamic refresh, and its read path is unaffected by refresh cycles, making it more suitable for constructing deterministic streaming read timings. MRAM read speeds can reach nanosecond levels and support high-bit-width parallel reads.
[0019] Based on the aforementioned pre-calibrated readout latency, the computing device 100 has a fixed readout tick that is determined before inference. The fixed readout tick is a timing parameter predetermined based on the MRAM readout latency; it remains unchanged during inference and serves as the timing reference for the synchronous operation of each functional unit in the computing device 100.
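As a non-limiting illustration of how a fixed readout tick could be derived from a pre-calibrated readout latency, the following Python sketch rounds the latency up to a whole number of clock cycles. The function name, the clock period, and the guard band are assumptions introduced for illustration only; the disclosure does not specify how the tick is quantized.

```python
import math


def fixed_readout_tick_cycles(read_latency_ns: float,
                              clock_period_ns: float,
                              guard_band: float = 0.0) -> int:
    """Return the smallest whole number of clock cycles covering the latency.

    read_latency_ns: pre-calibrated readout latency (assumed known from
        device characterization).
    clock_period_ns: period of a hypothetical system clock.
    guard_band: hypothetical fractional margin added on top of the latency.
    """
    if read_latency_ns <= 0 or clock_period_ns <= 0 or guard_band < 0:
        raise ValueError("latency and period must be positive, guard band non-negative")
    return math.ceil(read_latency_ns * (1.0 + guard_band) / clock_period_ns)


# Hypothetical numbers: 4 ns calibrated latency, 1 ns clock, 25 % margin.
tick_cycles = fixed_readout_tick_cycles(4.0, 1.0, guard_band=0.25)
```

Because the tick is computed once before inference and never recomputed, every functional unit can treat it as a constant.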
[0020] In one exemplary embodiment, the input vector streaming interface 120 and the readout scheduling unit 130 can operate strictly synchronously according to the fixed readout tick, meaning that the start of each fixed readout tick simultaneously triggers the reception of the activation vector and the reading of the weight values. In other embodiments, the input vector streaming interface 120 and the readout scheduling unit 130 can each introduce an additional timing offset relative to the fixed readout tick, chosen according to the propagation delay or buffer depth of its data path, so that its actual operation time is shifted relative to the tick while the fixed readout tick remains the global timing reference.
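The two timing variants described above can be sketched, purely by way of example, as start times computed from one shared tick plus a per-unit offset. The function name and the numeric values are hypothetical.

```python
from typing import List


def unit_start_times_ns(tick_ns: float, num_ticks: int,
                        offset_ns: float = 0.0) -> List[float]:
    """Start time of a unit's operation in each tick.

    All units share the same tick length; a unit may add its own constant
    offset (for example, to absorb a propagation delay) without changing
    the global timing reference.
    """
    return [k * tick_ns + offset_ns for k in range(num_ticks)]


# Strictly synchronous variant: both units start at the tick boundary.
interface_times = unit_start_times_ns(tick_ns=5.0, num_ticks=3)
# Offset variant: the readout scheduler is shifted by a hypothetical 1.5 ns.
scheduler_times = unit_start_times_ns(tick_ns=5.0, num_ticks=3, offset_ns=1.5)
```

In both variants the spacing between consecutive operations of a unit equals the fixed readout tick, which is what keeps the timing deterministic.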
[0021] The multiple storage partitions are the units into which the weight-locked storage array 110 is divided. The weight matrix is distributed across the storage partitions, so that weight data can be read from multiple storage partitions simultaneously in each readout tick. In an exemplary embodiment, different parts of the weight matrix can be stored in different storage partitions, with each storage partition independently providing a set of weight values in the same readout tick to support subsequent parallel multiplication operations.
[0022] The input vector streaming interface 120 is the entry point through which the computing device 100 receives external activation vectors. During inference, activation vectors are input to the computing device 100 via the input vector streaming interface 120 based on the fixed readout tick. In one exemplary embodiment, the elements or element groups of the activation vector may arrive sequentially, tick by tick, with each readout tick corresponding to one batch of activation data, so that the data is aligned in time with the weight values read out within the same tick.
[0023] The readout scheduling unit 130 may include functional modules for scheduling logic and sensing paths. The scheduling logic is responsible for issuing read control signals to each storage partition based on the fixed readout tick, determining which storage partitions are accessed within each tick and in what order. The sensing path is responsible for converting the physical state of storage cells in each storage partition into weight values that can be used for subsequent calculations. Within each readout tick, the readout scheduling unit 130 reads multiple sets of weight values in parallel from multiple storage partitions, with each set of weight values coming from a different storage partition.
[0024] In each readout tick, the parallel multiplication unit 140 receives the multiple sets of weight values read in parallel by the readout scheduling unit 130 and the activation vector provided by the input vector streaming interface 120, and performs parallel multiplication operations on the two. Specifically, each set of weight values from a different storage partition is multiplied by the activation vector, and the multiple sets of multiplication operations are executed in parallel within the same readout tick. In an exemplary embodiment, different rows of the weight matrix can be stored in different storage partitions. In the current tick, each storage partition reads out the weight value in its row that corresponds to the current input activation data. The parallel multiplication unit 140 multiplies the weight value of each row by the corresponding activation data, thereby generating multiple product results within the same tick.
[0025] The accumulation unit 150 accumulates the product results generated by the parallel multiplication unit 140 in each readout tick. As the various parts of the activation vector are sequentially input in consecutive readout ticks, each tick produces a batch of product results, and the accumulation unit 150 accumulates the product results of each tick step by step. After accumulation over multiple readout ticks, the accumulation unit 150 generates the result vector of the matrix-vector multiplication operation.
[0026] The result output interface 160 outputs the result vector generated by the accumulation unit 150 to the outside of the computing device 100. During inference, the data stream carried by the external interface of the computing device 100 can be: the activation vector input via the input vector streaming interface 120, and the result vector output via the result output interface 160. The weight matrix is kept stored in the weight-locked memory array 110 during inference and is not moved via the external interface.
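To make the bandwidth effect concrete, the following Python sketch counts the bytes crossing the external interface for one matrix-vector multiplication, with and without resident weights. The matrix size and the one-byte value width are hypothetical values chosen for illustration, and the function name is not part of the disclosure.

```python
def external_traffic_bytes(rows: int, cols: int, bytes_per_value: int,
                           weights_resident: bool) -> int:
    """Bytes moved over the external interface for one matrix-vector product."""
    activations = cols * bytes_per_value   # activation vector streamed in
    results = rows * bytes_per_value       # result vector streamed out
    # Weights cross the interface only if they are not already on-chip.
    weights = 0 if weights_resident else rows * cols * bytes_per_value
    return activations + results + weights


# Hypothetical 4096 x 4096 layer with one byte per value.
streamed = external_traffic_bytes(4096, 4096, 1, weights_resident=False)
resident = external_traffic_bytes(4096, 4096, 1, weights_resident=True)
```

Under these assumptions the resident case moves only the activation and result vectors, so external traffic grows with the sum of the matrix dimensions rather than their product.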
[0027] In some embodiments, the computing device 100 can be used as a linear layer, a fully connected layer, an attention projection layer, or a feedforward network layer in the forward propagation of a neural network. In an exemplary embodiment, the weight matrix stored in the weight-locked storage array 110 can be a query projection matrix, a key projection matrix, a value projection matrix in an attention mechanism, or a weight matrix in a feedforward network layer.
[0028] According to some embodiments, multiple storage partitions may each include multiple storage cells, each storage cell may include a magnetic tunnel junction, the magnetic tunnel junction may include a reference layer, a tunneling barrier layer and a free layer, and each weight value in the weight matrix may be characterized by the resistive state of the magnetic tunnel junction.
[0029] Therefore, by characterizing the weight values through the resistive state of the magnetic tunnel junction, and utilizing the resistance difference corresponding to the relative relationship between the magnetization directions of the reference layer and the free layer, a device-level foundation is provided for the storage and readout of the weight matrix in the weight-locked memory array.
[0030] The magnetic tunnel junction (MTJ) is the core memory structure of MRAM. The magnetization direction of the reference layer remains fixed after device fabrication, while the magnetization direction of the free layer can be changed during writing and is retained under power-off conditions. A tunneling barrier layer is located between the reference and free layers, providing a barrier for electron tunneling. When the magnetization directions of the free and reference layers are parallel, the MTJ exhibits a low-resistance state; when they are antiparallel, it exhibits a high-resistance state. Weight values are characterized by these two resistance states.
[0031] According to some embodiments, magnetic tunnel junctions can be formed between back-end metal interconnect layers and can be connected to underlying CMOS driving circuits and sensing circuits through vias.
[0032] Therefore, by integrating the magnetic tunnel junction between the back-end metal interconnect layers, the weight-locked memory array and the underlying CMOS circuit are vertically integrated on the same chip. This shortens the signal path between the memory cell and the driving and sensing circuits, which helps to reduce read latency and write latency and thereby further improves inference efficiency.
[0033] The back-end metal interconnect layer is a metal wiring layer located above the transistor in integrated circuit manufacturing. In some embodiments, the multilayer film structure of the magnetic tunnel junction can be deposited between adjacent metal layers and connected to the underlying CMOS driving circuit and sensing circuit through vias, respectively. The CMOS driving circuit is used to apply a driving signal to the memory cell during the write phase. The sensing circuit is used to detect the resistance state of the memory cell during the read phase.
[0034] According to some embodiments, MRAM can be based on STT-MRAM, SOT-MRAM, VCMA-MRAM, toggle MRAM, thermally assisted MRAM, racetrack memory, memory based on domain wall motion, memory based on skyrmion motion, nanomagnet array memory, artificial spin ice storage structure, or a combination thereof.
[0035] In this disclosure, MRAM generally refers to a storage device that stores data based on magnetic principles and can be written to and read from electrically.
[0036] In some embodiments, STT-MRAM achieves writing via spin-polarized current passing through a magnetic tunnel junction, and has a mature engineering foundation for nanosecond-level write speeds. SOT-MRAM achieves writing by generating spin current through in-plane current in a nearby heavy metal layer; its read / write paths are geometrically decoupled, which helps improve read stability and write durability. VCMA-MRAM assists writing by voltage-modulating the interface magnetic anisotropy, which can reduce write power consumption.
[0037] In some embodiments, toggle MRAM flips the magnetization of the free layer by applying a sequence of magnetic field pulses. Thermally assisted MRAM reduces the coercivity of the free layer by heating during writing, thereby reducing the current or magnetic field required for writing.
[0038] In some embodiments, racetrack memory locates data by driving magnetic domain walls to move along nanowires, while skyrmion-based memory locates data by driving magnetic skyrmions to move within a magnetic thin film. In the aforementioned devices, once the target data is moved to a fixed readout position, the readout latency can be pre-calibrated based on the sensing path characteristics at that readout position.
[0039] In some embodiments, nanomagnet array memories utilize dipole coupling or exchange coupling between nanoscale magnets to store data. Artificial spin ice storage structures utilize magnetization configurations within geometrically arranged nanomagnet arrays to store data.
[0040] The aforementioned types of devices can be used individually or in combination. In some embodiments, different memory partitions in a weight-locked memory array can be based on different types of magnetic storage devices.
[0041] According to some embodiments, MRAM can be addressed or accessed by one or more of row selection, column selection, intersection selection, local switch selection, port selection, or shift selection, and each memory cell can be accessed by a gating device, a selection device, or an access controller.
[0042] In some embodiments, row selection and column selection refer to selecting the target memory cell along the row and column directions of the array, respectively, and together they constitute two-dimensional addressing. Intersection selection refers to locating the target memory cell at the intersection of the row and column selection lines, and is suitable for cross-point array structures. Local switch selection refers to selecting the target memory cell using switching devices within a local area, which can reduce parasitic load within the addressing range. Port selection refers to accessing the target memory cell through a dedicated read / write port, and is suitable for multi-port storage structures. Shift selection refers to accessing data by driving it to a fixed readout position within the storage medium, and is suitable for devices based on domain wall or skyrmion motion, such as racetrack memories.
[0043] In some embodiments, a gating device, a selection device, or an access controller works with a memory cell to control the current path of that memory cell, such that only the cell being addressed is activated during a read or write operation.
[0044] According to some embodiments, the gating device may be based on a transistor, a selector, a diode, a threshold switch, a two-terminal nonlinear gating device, a three-terminal gating device, or a combination thereof, and the memory cell may be based on 1T1MTJ, 2T1MTJ, 1S1MTJ, 1D1MTJ, a read / write split transistor structure, a cross-point array selector structure, or a combination thereof.
[0045] In some embodiments, a 1T1MTJ structure refers to each magnetic tunnel junction being paired with a select transistor, which controls the on / off state of the read / write path. A 2T1MTJ structure refers to each magnetic tunnel junction being paired with two transistors, providing independent control for the read and write paths, which is beneficial for optimizing drive conditions separately. A 1S1MTJ structure refers to each magnetic tunnel junction being paired with a selector device, which conducts when the applied voltage exceeds a threshold and is cut off when it falls below the threshold, reducing cell area. A 1D1MTJ structure refers to each magnetic tunnel junction being paired with a diode, achieving selection through the unidirectional conduction characteristic of the diode. A read / write split transistor structure refers to configuring independent transistors for the read and write paths. A cross-point array selector structure refers to using selectors instead of transistors as the selection elements in a cross-point array.
[0046] In some embodiments, threshold switching devices and two-terminal nonlinear gating devices can be used in a cross-point array structure, exhibiting a high-impedance state when not selected and rapidly conducting when selected. Three-terminal gating devices can be used in scenarios requiring independent control terminals, controlling the on / off state of the device via a third terminal.
[0047] According to some embodiments, multiple storage partitions can be subarrays divided by rows, columns, blocks, subblocks, or banks.
[0048] Therefore, by dividing the weight-locked storage array into subarrays of various granularities, it is possible to flexibly adapt to weight matrices of different sizes and different parallel readout requirements.
[0049] In one exemplary embodiment, the weight-locked memory array is divided into multiple banks, each bank containing several rows and columns of storage cells and forming a subarray that can be independently addressed and read. In another exemplary embodiment, a bank can be further divided into smaller-granularity blocks or sub-blocks to support finer-grained parallel read control.
[0050] It should be understood that the weight-locked memory array can also be divided in other ways, which are not limited here.
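By way of a non-limiting example, the bank and block hierarchy can be sketched as an address decomposition. The contiguous layout, the function name, and the geometry values below are assumptions introduced for illustration; the disclosure leaves the division scheme open.

```python
from typing import Tuple


def locate_row(row: int, rows_per_block: int,
               blocks_per_bank: int) -> Tuple[int, int, int]:
    """Map a flat row index to (bank, block within bank, row within block).

    Assumes a contiguous layout: consecutive rows fill one block, and
    consecutive blocks fill one bank.
    """
    if row < 0 or rows_per_block <= 0 or blocks_per_bank <= 0:
        raise ValueError("row must be non-negative and geometry positive")
    block_index, local_row = divmod(row, rows_per_block)
    bank, block = divmod(block_index, blocks_per_bank)
    return bank, block, local_row


# Hypothetical geometry: 8 rows per block, 4 blocks per bank.
location = locate_row(37, rows_per_block=8, blocks_per_bank=4)
```

An interleaved layout, in which consecutive rows land in different banks, would be an equally valid choice and may suit parallel readout better.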
[0051] According to some embodiments, different output dimensions of the weight matrix can be mapped to different subarrays, and different input dimensions of the activation vector can be input via the input vector streaming interface in different readout ticks.
[0052] Therefore, the weight matrix is distributed along the output dimension to different subarrays, and the activation vector is unrolled in time along the input dimension over multiple ticks, so that in each readout tick multiple subarrays provide their respective weight values in parallel and multiply them with the current input activation data, thereby improving the throughput of parallel multiplication operations.
[0053] In some embodiments, the output dimension can refer to the row direction of the weight matrix, corresponding to the output neurons; the input dimension can refer to the column direction of the weight matrix, corresponding to the input features. In an exemplary embodiment, each row of the weight matrix is mapped to a subarray, and each subarray reads out the weight value at the current input dimension position of its row within the same readout tick. As different input dimensions of the activation vector are input sequentially in consecutive readout ticks, each subarray sequentially provides the weight value at the corresponding column position. After multiple ticks of multiplication and accumulation, the matrix-vector multiplication operation is completed.
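The row-to-subarray mapping and tick-by-tick accumulation described above can be sketched in software as follows. This is a minimal functional model, assuming one matrix row per subarray and one activation element per readout tick; the function name is hypothetical, and the sketch models only the data flow, not the timing or the hardware.

```python
from typing import List, Sequence


def tick_by_tick_matvec(weights: Sequence[Sequence[float]],
                        activation: Sequence[float]) -> List[float]:
    """Compute W @ x by streaming one activation element per readout tick."""
    rows = len(weights)
    cols = len(activation)
    if any(len(row) != cols for row in weights):
        raise ValueError("every weight row must match the activation length")

    accumulators = [0.0] * rows                  # one accumulator per output
    for tick in range(cols):
        x = activation[tick]                     # element arriving this tick
        # Each subarray (row) supplies its weight at the current column.
        read_out = [weights[r][tick] for r in range(rows)]
        products = [w * x for w in read_out]     # parallel multiplications
        accumulators = [a + p for a, p in zip(accumulators, products)]
    return accumulators


W = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
result = tick_by_tick_matvec(W, x)
```

The number of ticks equals the input dimension, so the completion time of the operation is known in advance from the matrix shape and the tick length.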
[0054] According to some embodiments, the weight matrix can be written to a weight-locked storage array during the neural network model deployment phase, and the static residency mode can indicate that the weight matrix remains in a read-only locked state during neural network model inference.
[0055] Therefore, writing of the weight matrix is limited to the deployment phase, and the weight matrix is kept in a read-only locked state during inference, so that no write operations occur in the weight-locked storage array during inference. This eliminates the latency fluctuations and power consumption caused by runtime writes and improves the stability of inference execution timing.
[0056] The model deployment phase refers to the parameter loading stage after the neural network model has completed training and before it enters the inference service. During this phase, the weight matrix is written to each storage partition via the write path of the MRAM-based weight-locked memory array. After writing is complete, the array enters a read-only locked state, in which each storage cell responds only to read operations and not to write operations. Because MRAM has non-volatile retention capabilities, the weight matrix is retained under power-off conditions, and the array can be reprogrammed offline during model upgrades or parameter reconfiguration.
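The deployment-then-lock life cycle can be modeled, as a non-limiting sketch, by a small state holder that rejects writes once locked. The class and method names are hypothetical, and the software exception merely stands in for whatever hardware mechanism enforces the read-only state.

```python
from typing import List, Sequence


class WeightLockedArray:
    """Toy model of partitions that are writable only before locking."""

    def __init__(self, num_partitions: int) -> None:
        self._partitions: List[List[float]] = [[] for _ in range(num_partitions)]
        self._locked = False

    def write_partition(self, index: int, values: Sequence[float]) -> None:
        """Deployment-phase write through the array's own write path."""
        if self._locked:
            raise PermissionError("array is in the read-only locked state")
        self._partitions[index] = list(values)

    def lock(self) -> None:
        """Enter the read-only locked state before inference starts."""
        self._locked = True

    def unlock_for_reprogramming(self) -> None:
        """Leave the locked state, e.g. for an offline model upgrade."""
        self._locked = False

    def read(self, index: int, offset: int) -> float:
        """Reads are allowed in both states."""
        return self._partitions[index][offset]


array = WeightLockedArray(num_partitions=2)
array.write_partition(0, [1, 2, 3])
array.write_partition(1, [4, 5, 6])
array.lock()
```

After `lock()` is called, any further `write_partition` call raises an error until `unlock_for_reprogramming()` is invoked, mirroring the rule that rewriting happens only after the inference service stops.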
[0057] According to some embodiments, the weight matrix can be written to the weight-locked storage array in one of the following ways: spin-transfer torque applied by spin-polarized current, spin-orbit torque applied by transverse current in adjacent heavy metal layers, or bias-adjusted interface magnetic anisotropy.
[0058] In some embodiments, spin-transfer torque writing refers to applying a torque to the free layer through a spin-polarized current passing through the magnetic tunnel junction, causing it to flip its magnetization direction, corresponding to the writing method of STT-MRAM. Spin-orbit torque writing refers to inducing a spin current through a transverse current in the adjacent heavy metal layer, applying a torque to the free layer to flip its magnetization, corresponding to the writing method of SOT-MRAM. Bias-adjusted interface magnetic anisotropy writing refers to changing the magnetic anisotropy at the free layer interface by applying a bias voltage, lowering the switching energy barrier to assist writing, corresponding to the writing method of VCMA-MRAM.
[0059] According to some embodiments, the driving conditions for writing the weight matrix can be determined based on device size, thermal stability factor, target bit error rate, and write time budget.
[0060] Therefore, by comprehensively considering device parameters and system-level constraints to determine the driving conditions, the requirements for write reliability and write speed can be met simultaneously, and the bit error rate during the write phase can be reduced.
[0061] Driving conditions refer to the amplitude and duration of the current or voltage pulse applied to the target memory cell during a write operation. Device size affects the energy required for the magnetization reversal of the free layer. The thermal stability factor indicates the memory cell's ability to resist thermal disturbances; a higher thermal stability factor results in better data retention but also requires more write energy. The target bit error rate indicates the system's upper limit for the write error rate. The write time budget indicates the time constraint allocated to weight write operations during the deployment phase. In an exemplary embodiment, the driving circuit determines the amplitude and width of the write pulse based on the above parameters, minimizing the write time while meeting the target bit error rate.
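One way to picture the selection of driving conditions is a lookup over characterization data, sketched below. The table values are invented for illustration and are not measurements; the sketch further assumes that device size and thermal stability factor are already reflected in the characterized error rates, and that the write time budget is expressed per pulse. The names `PulsePoint` and `pick_write_pulse` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class PulsePoint(NamedTuple):
    amplitude_ua: float        # write pulse amplitude in microamperes
    width_ns: float            # write pulse width in nanoseconds
    write_error_rate: float    # error rate characterized for this pulse


def pick_write_pulse(table: Sequence[PulsePoint],
                     target_error_rate: float,
                     time_budget_ns: float) -> Optional[PulsePoint]:
    """Pick the shortest pulse meeting the error target within the budget."""
    feasible = [p for p in table
                if p.write_error_rate <= target_error_rate
                and p.width_ns <= time_budget_ns]
    if not feasible:
        return None
    # Shortest width first; lower amplitude breaks ties.
    return min(feasible, key=lambda p: (p.width_ns, p.amplitude_ua))


# Invented characterization points for one hypothetical device.
table = [
    PulsePoint(60.0, 10.0, 1e-9),
    PulsePoint(80.0, 5.0, 1e-7),
    PulsePoint(100.0, 5.0, 1e-9),
    PulsePoint(120.0, 3.0, 1e-5),
]
chosen = pick_write_pulse(table, target_error_rate=1e-8, time_budget_ns=20.0)
```

The 3 ns and the 80 uA pulses are rejected because they miss the error target, so the sketch returns the 5 ns, 100 uA point as the fastest acceptable option.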
[0062] According to some embodiments, the readout scheduling unit can read out multiple sets of weight values by sensing the tunneling magnetoresistance difference of the magnetic tunnel junction in parallel and antiparallel states.
[0063] Therefore, by sensing the difference in tunneling magnetoresistance between the two magnetization states of the magnetic tunnel junction, the physical state of the memory cell is converted into a weight value that can be used for calculation, providing a deterministic electrical readout path for the readout scheduling unit.
[0064] In some embodiments, a parallel state refers to a state where the magnetization directions of the free layer and the reference layer are the same, in which case the magnetic tunnel junction exhibits a low-resistance state. An antiparallel state refers to a state where the magnetization directions of the free layer and the reference layer are opposite, in which case the magnetic tunnel junction exhibits a high-resistance state. The tunneling magnetoresistance difference refers to the difference in resistance values between these two states. During the readout operation, a read bias is applied between the bit line and the source line, enabling a read current to form in the target memory cell after the word line is enabled. The sensing path in the readout scheduling unit determines whether the memory cell is in a parallel or antiparallel state based on this read current or the corresponding voltage, thereby obtaining the corresponding weight value.
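The state decision described above can be sketched as a comparison of the cell's read current against a reference. This is a minimal sketch only: the resistance values, the read bias, the midpoint reference, and the mapping of the parallel state to bit 0 are all illustrative assumptions rather than values taken from this disclosure.

```python
# Illustrative values only (assumptions, not taken from the disclosure).
R_P = 5_000.0     # ohms, parallel (low-resistance) state
R_AP = 10_000.0   # ohms, antiparallel (high-resistance) state
V_READ = 0.1      # volts, read bias between bit line and source line


def read_current(resistance_ohm: float, v_read: float = V_READ) -> float:
    """Read current through the selected cell, using Ohm's law."""
    return v_read / resistance_ohm


# Assumed reference: midpoint between the two nominal read currents.
I_REF = 0.5 * (read_current(R_P) + read_current(R_AP))


def sense_bit(cell_resistance_ohm: float) -> int:
    """Return 0 for the parallel state, 1 for the antiparallel state (assumed mapping)."""
    # A low-resistance (parallel) cell draws more current than the reference.
    return 0 if read_current(cell_resistance_ohm) > I_REF else 1
```

A cell whose resistance has drifted slightly from its nominal value is still resolved correctly as long as its read current stays on the same side of the reference.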
[0065] According to some embodiments, the readout scheduling unit may include a read bias generation unit, a sense amplifier, a reference branch, and decision logic.
[0066] Therefore, by refining the readout scheduling unit into four stages (read bias generation, signal amplification, reference comparison, and state decision), a complete sensing path is formed, improving the accuracy of weight value readout.
[0067] In some embodiments, a read bias generation unit is used to generate a read bias voltage or current applied to the memory cell. A sense amplifier is used to amplify the weak signal produced by the resistance difference between memory cell states into a discernible electrical signal. A reference branch is used to provide a comparison reference, and the sense amplifier compares the signal generated by the target memory cell with the reference signal provided by the reference branch. Decision logic is used to output a state determination of the memory cell based on the comparison result, i.e., the weight value corresponding to the memory cell. In an exemplary embodiment, the reference branch may employ a fixed reference cell, a reference resistor, a mirror branch, or a differential comparator structure.
[0068] According to some embodiments, the pre-calibrated readout delay can be calibrated based on the tunneling magnetoresistance ratio of the magnetic tunnel junction, the RC parameters of the weight-locked memory array, the sensing time budget, and the misjudgment tolerance.
[0069] Therefore, by calibrating the readout delay by combining device-level parameters and circuit-level parameters, the setting of the fixed readout cycle can be better quantified, thus improving the accuracy of the readout timing setting.
[0070] In some embodiments, the tunneling magnetoresistance ratio (TMR) refers to the ratio of the resistance difference of the magnetic tunnel junction between the antiparallel and parallel states to the resistance in the low-resistance state. A higher TMR results in a larger signal margin available to the sense amplifier and a shorter required sensing time. The RC parameters refer to the resistance and parasitic capacitance of the bit lines, word lines, and associated interconnects in the weight-locked memory array, and affect the signal propagation delay within the array. The sensing time budget is the time constraint allocated to the sense amplifier to complete signal determination. The misjudgment tolerance is the allowable range of noise and process variations for the sensing circuit. In an exemplary embodiment, during the design or deployment phase, the readout delay value is determined through device characterization and circuit simulation based on the above parameters, and a fixed readout cycle is set accordingly.
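As a worked illustration of the TMR definition above, write R_P and R_AP for the resistances in the parallel and antiparallel states. These symbols are introduced here for convenience, and the numerical values are illustrative assumptions, not values from this disclosure.

```latex
\mathrm{TMR} = \frac{R_{AP} - R_{P}}{R_{P}}, \qquad
R_{P} = 5\,\mathrm{k\Omega},\; R_{AP} = 10\,\mathrm{k\Omega}
\;\Rightarrow\; \mathrm{TMR} = \frac{10 - 5}{5} = 1 = 100\%
```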
[0071] According to some embodiments, the computing device may further include a timing control unit configured to generate multi-stage pipeline control signals corresponding to a readout scheduling unit, a parallel multiplication unit, and an accumulation unit, respectively, based on a fixed readout clock cycle. The multi-stage pipeline control signals define the inter-stage delay relationship between the readout times of multiple sets of weight values, the execution times of parallel multiplication operations, and the accumulation sampling times of the product results.
[0072] Therefore, by generating multi-level pipeline control signals through the timing control unit and specifying the delay relationship between each level, the three stages of reading, multiplication, and accumulation are executed in a pipelined manner, thereby improving the throughput of the computing device in continuous reading cycles.
[0073] In some embodiments, the timing control unit can generate multi-stage pipeline control signals based on a fixed readout cycle. The multi-stage pipeline control signals can include at least a readout control signal corresponding to the readout scheduling unit, a multiplication control signal corresponding to the parallel multiplication unit, and an accumulation sampling signal corresponding to the accumulation unit. The inter-stage delay relationship refers to the time offset between adjacent pipeline stages. In an exemplary embodiment, within the k-th readout cycle, the readout scheduling unit reads the k-th set of weight values, the parallel multiplication unit performs a multiplication operation on the weight values read in the (k-1)-th cycle, and the accumulation unit accumulates and samples the product result in the (k-2)-th cycle. These three operations are executed concurrently in time.
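The three-stage overlap in the exemplary embodiment above can be sketched as a schedule table. This is a minimal sketch: the function name and the dictionary layout are hypothetical, and groups and cycles are numbered from 1 purely for illustration.

```python
def pipeline_schedule(num_groups: int):
    """List, per readout cycle, which weight group each stage handles.

    Groups and cycles are numbered from 1. A stage is None while the
    pipeline is filling or draining.
    """
    schedule = []
    # Two extra cycles drain the multiply and accumulate stages.
    for k in range(1, num_groups + 3):
        read = k if k <= num_groups else None
        multiply = k - 1 if 1 <= k - 1 <= num_groups else None
        accumulate = k - 2 if 1 <= k - 2 <= num_groups else None
        schedule.append({"cycle": k, "read": read,
                         "multiply": multiply, "accumulate": accumulate})
    return schedule


for row in pipeline_schedule(4):
    print(row)
```

With four weight groups, the pipeline is full from cycle 3 onward: group 3 is read while group 2 is multiplied and group 1 is accumulated.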
[0074] In an exemplary embodiment, the multi-stage pipeline control signals generated by the timing control unit may include an input activation value transmission clock, a bank gating clock, a sense amplification clock, a local latch clock, and an accumulation sampling clock. The input activation value transmission clock controls when the input vector streaming interface receives activation data. The bank gating clock controls when the read scheduling unit gates each storage partition. The sense amplification clock controls when the sensing path completes signal amplification and state determination. The local latch clock controls when the read weight value is latched and sent to the parallel multiplication unit. The accumulation sampling clock controls when the accumulation unit samples the product result.
[0075] According to some embodiments, a fixed readout clock cycle can correspond to a steady-state readout delay window determined by offline device characterization of the MRAM.
[0076] Therefore, by obtaining the steady-state readout delay window through offline device characterization and setting the fixed readout timing accordingly, the readout timing setting is based on measured data rather than theoretical estimation, which improves the matching degree between the fixed readout timing and the actual device characteristics.
[0077] In some embodiments, the steady-state readout delay window refers to the reusable, stable readout time interval determined after offline characterization of the MRAM read path without relying on the refresh recovery process. Offline device characterization refers to testing and statistically analyzing the readout delay of the MRAM device during the design or deployment phase. In an exemplary embodiment, the readout delay distribution range can be determined by sampling the readout delay of multiple memory cells under different temperature and voltage conditions, and the interval within this distribution range that meets the misjudgment tolerance requirement can be used as the steady-state readout delay window, with the fixed readout cycle time set to be no shorter than the upper bound of this window.
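One way to turn characterization samples into a fixed readout cycle is sketched below. This is only a sketch under stated assumptions: the function name, the guard band, the clock granularity, and the use of the worst sampled delay as the upper bound of the window are all hypothetical choices, not details from this disclosure.

```python
import math


def fixed_readout_cycle_ns(delay_samples_ns, clock_period_ns, guard_band_ns=0.0):
    """Pick a fixed readout cycle no shorter than the worst sampled delay.

    delay_samples_ns: readout delays measured across cells, temperatures,
        and voltages (hypothetical characterization data).
    clock_period_ns: granularity of the timing control unit (assumed).
    guard_band_ns: extra margin added to the upper bound (assumed).
    """
    if not delay_samples_ns:
        raise ValueError("need at least one characterization sample")
    upper_bound = max(delay_samples_ns) + guard_band_ns
    # Round up to a whole number of clock periods so the cycle never undershoots.
    n_clocks = math.ceil(upper_bound / clock_period_ns)
    return n_clocks * clock_period_ns
```

For example, samples of 3.1 ns, 4.7 ns, and 3.9 ns with a 1 ns clock give a 5 ns cycle; adding a 0.5 ns guard band pushes it to 6 ns.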
[0078] In some embodiments, since the memory cells in the weight-locked memory array are in a read-only state for a long time during the inference phase, the sensing window and array tick can be optimized around stable readout conditions.
[0079] According to some embodiments, the read scheduling unit can be further configured to read multiple sets of weight values from multiple storage partitions in a multi-bank parallel manner, a block pipeline manner, or a multi-task pipeline manner.
[0080] Therefore, by supporting multiple read scheduling methods, the read strategy can be flexibly selected according to the size of the weight matrix and the organization structure of the storage partition, thereby improving the read bandwidth utilization of the weight-locked storage array.
[0081] In some embodiments, the multi-bank parallel approach refers to activating multiple banks simultaneously for reading within the same read cycle, with each bank independently providing a set of weight values. The block-based pipelined approach divides the weight matrix into multiple blocks, with each block sequentially undergoing read, multiplication, and accumulation stages, allowing different processing stages of different blocks to overlap in time. The multi-task pipelined approach involves multiple different matrix-vector multiplication tasks sharing the read resources of the weight-locked storage array, with each task's read operations executed alternately in a pipelined manner.
[0082] In some embodiments, the parallel multiplication unit may perform parallel multiplication operations using one or more of the following: a digital multiplication path, a hybrid path of bit-serial and bit-parallel multiplication, a column-parallel partial-sum aggregation path, or a concurrent pipeline path between banks.
[0083] According to some embodiments, the weight matrix can be encoded in floating-point format, fixed-point format, or mixed-precision format and stored in multiple storage partitions. The floating-point format may include FP32, TF32, BF16, FP16, or FP8, and the fixed-point format may include INT16, INT8, INT4, INT2, or binary format.
[0084] Floating-point formats use a sign bit, exponent bit, and mantissa bit to represent weight values. Different floating-point formats differ in the total bit width and the allocation of exponent and mantissa bits. In an exemplary embodiment, FP32 is a 32-bit floating-point format, TF32 is a 19-bit floating-point format, BF16 and FP16 are 16-bit floating-point formats, and FP8 is an 8-bit floating-point format. Fixed-point formats use a fixed integer bit width to represent weight values. In an exemplary embodiment, INT16, INT8, INT4, and INT2 use 16, 8, 4, and 2 bits to represent weight values, respectively. Binary formats use 1 bit to represent weight values, with each weight taking only two discrete values.
[0085] In some embodiments, for layers with high precision requirements, the weight matrix can be encoded and stored in FP16, BF16, or FP32 format. For layers with low precision requirements, the weight matrix can be encoded and stored in INT8, INT4, INT2, or binary format, thereby accommodating more weight parameters with the same storage capacity.
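The capacity trade-off between formats can be illustrated with a short calculation. This is a sketch only: the 1 MiB subarray size is a hypothetical figure, and error correction bits and redundant units are ignored.

```python
# Bits per weight for the formats named above. TF32 is listed with the
# 19 bits it represents; actual storage layouts may pad it (assumption).
BITS_PER_WEIGHT = {
    "FP32": 32, "TF32": 19, "BF16": 16, "FP16": 16, "FP8": 8,
    "INT16": 16, "INT8": 8, "INT4": 4, "INT2": 2, "BINARY": 1,
}


def weights_per_capacity(capacity_bits: int, fmt: str) -> int:
    """Number of weights that fit in capacity_bits, ignoring ECC and redundancy."""
    return capacity_bits // BITS_PER_WEIGHT[fmt]


capacity = 8 * 1024 * 1024  # hypothetical 1 MiB subarray, in bits
for fmt in ("FP16", "INT8", "INT4"):
    print(fmt, weights_per_capacity(capacity, fmt))
```

Under these assumptions, halving the bit width doubles the number of weights that the same subarray can hold.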
[0086] In some embodiments, when a mixed-precision format is used, different layers of the same model can use different encoding formats. Furthermore, weight data of different precision levels can be deployed in different subarrays of the weight-locked storage array, and the read and multiplication paths of each subarray can be configured according to the precision level it stores.
[0087] According to some embodiments, each weight value in the weight matrix can be sliced by bit, separated by sign bit and magnitude bit, or mapped to different storage locations of the weight-locked storage array by high and low bits respectively. After reading, it can be reassembled into effective weight data for parallel multiplication operations through bit alignment logic.
[0088] Therefore, by splitting the multi-bit weight value into multiple bit slices and mapping them to different storage locations, each storage cell only needs to represent one or a few bits, which reduces the requirement for the impedance resolution of a single storage cell and improves the reliability of readout.
[0089] In some embodiments, bit-by-bit fragmentation refers to storing each binary bit of a multi-bit weight value in different storage locations. In an exemplary embodiment, an 8-bit weight value is split into 8 single-bit fragments, each of which is written to a different storage cell in a weight-locked memory array. Separating the sign bit and magnitude bit means storing the sign bit and magnitude bit of the weight value in different storage locations, so that the sign information and magnitude information can be obtained independently after reading. Mapping high and low bits separately means mapping the high-order bits and low-order bits of the weight value to different subarrays or different storage areas. The above three methods can be selected according to the precision encoding format and array organization structure.
[0090] In some embodiments, bit alignment logic is a circuit module that reassembles each bit slice after reading. After each storage location outputs the value of its corresponding bit slice, the bit alignment logic shifts and concatenates them according to the bit weights of each bit slice, reassembling them into complete weight data for use by the parallel multiplication unit. In an exemplary embodiment, for weight values stored separately by sign bit and magnitude bit, the bit alignment logic merges the sign bit and magnitude bit during reassembly to generate signed valid weight data.
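Bit slicing and the reassembly performed by the bit alignment logic can be sketched in software as follows. The function names, the least-significant-bit-first ordering, and the 7-bit magnitude width are illustrative assumptions; the disclosure describes a circuit module, not this code.

```python
def slice_bits(value: int, width: int = 8):
    """Split an unsigned weight into single-bit slices, least significant first."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return [(value >> i) & 1 for i in range(width)]


def align_bits(slices):
    """Bit alignment: shift each slice by its bit weight and combine."""
    value = 0
    for position, bit in enumerate(slices):
        value |= bit << position
    return value


def split_sign_magnitude(weight: int, magnitude_width: int = 7):
    """Separate a signed weight into a sign bit and magnitude bit slices."""
    sign = 1 if weight < 0 else 0
    return sign, slice_bits(abs(weight), magnitude_width)


def merge_sign_magnitude(sign: int, magnitude_slices) -> int:
    """Merge sign and magnitude back into a signed weight."""
    magnitude = align_bits(magnitude_slices)
    return -magnitude if sign else magnitude
```

Slicing followed by alignment is a round trip: `align_bits(slice_bits(w))` returns `w` for any value that fits in the chosen width.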
[0091] According to some embodiments, the accumulation unit may include a local accumulation level within a cycle and a global accumulation level across cycles. The local accumulation level within a cycle can aggregate the product results from different storage partitions within the same read cycle, and the global accumulation level across cycles can accumulate the aggregated results of each read cycle step by step to generate a result vector.
[0092] Therefore, by dividing the accumulation process into two levels, namely local aggregation within a cycle and global accumulation across cycles, the product results from multiple storage partitions within the same cycle are first reduced spatially and then accumulated step by step in time. This reduces the amount of input data for the cross-cycle global accumulation level and improves the accumulation efficiency.
[0093] The following section explains the two-level accumulation process by referring to the distribution of the weight matrix in the storage partition.
[0094] Different parts of the weight matrix are distributed across multiple storage partitions. Within each read cycle, the read scheduling unit reads multiple sets of weight values in parallel from these storage partitions. The parallel multiplication unit performs parallel multiplication operations on these weight values and the activation vector, producing multiple product results. These product results originate from different storage partitions, corresponding to the products of different parts of the weight matrix and the activation vector.
[0095] The intra-cycle local accumulation stage aggregates the multi-way product results within each readout cycle. In one exemplary embodiment, different rows of the weight matrix are stored in different storage partitions. Each storage partition reads the weight value corresponding to the current input activation data in its corresponding row within the same cycle. The parallel multiplication unit multiplies the weight value of each row with the corresponding activation data to generate local products for each row. For multiple local products along the same output dimension, the intra-cycle local accumulation stage aggregates them into the aggregated result for that cycle. In another exemplary embodiment, when multiple banks are read out in parallel within the same cycle, the product results generated by each bank can first be locally aggregated within the bank, and then summarized across banks by the intra-cycle local accumulation stage.
[0096] The cross-cycle global accumulation stage operates across multiple readout cycles. As the components of the activation vector are sequentially input within consecutive readout cycles, each cycle produces an aggregated result, and the cross-cycle global accumulation stage accumulates the aggregated results from each cycle sequentially. In an exemplary embodiment, the cross-cycle global accumulation stage maintains a set of accumulation registers, each corresponding to an element in the result vector. Whenever the intra-cycle local accumulation stage outputs a new aggregated result, the cross-cycle global accumulation stage adds that aggregated result to the corresponding accumulation register. After accumulation across all readout cycles, the values in the accumulation registers are the final values of the elements in the result vector.
[0097] The two-stage accumulation structure described above allows the parallel reduction in space and the stepwise accumulation in time to be completed by independent circuit stages, avoiding the data path congestion that would result from handling both multi-way product reduction and multi-cycle accumulation in a single accumulation stage.
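The two-stage accumulation can be sketched in software as follows. This is a minimal sketch under an assumed mapping: column j of the weight matrix is taken to reside in bank j % num_banks, and each readout cycle consumes num_banks consecutive activation components. The function name and this mapping are hypothetical and are only one of the arrangements the description allows.

```python
def matvec_two_level(weights, activations, num_banks):
    """Compute weights @ activations with intra-cycle and cross-cycle accumulation.

    Assumed mapping (illustrative): column j of the weight matrix lives in
    bank j % num_banks, and each readout cycle consumes num_banks
    consecutive activation components.
    """
    num_rows = len(weights)
    num_cols = len(activations)
    accumulators = [0] * num_rows  # one register per result element

    for start in range(0, num_cols, num_banks):
        cols = range(start, min(start + num_banks, num_cols))
        # Parallel multiplication: one product per (row, bank) in this cycle.
        products = [[weights[i][j] * activations[j] for j in cols]
                    for i in range(num_rows)]
        # Intra-cycle local accumulation: reduce across banks per output row.
        cycle_sums = [sum(row_products) for row_products in products]
        # Cross-cycle global accumulation: add into the running registers.
        for i in range(num_rows):
            accumulators[i] += cycle_sums[i]
    return accumulators
```

The result does not depend on the number of banks; only the number of readout cycles needed changes.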
[0098] According to some embodiments, the result output interface can be configured to output the result vector to an external host processor or an external bus.
[0099] Therefore, by outputting the result vector to an external main processor or external bus, the computing device can work in conjunction with an external system, and the external main processor can perform subsequent processing on the result vector.
[0100] In one exemplary embodiment, the result output interface transmits the result vector to an external main processor via an on-chip bus or inter-chip interconnect, where the external main processor performs subsequent nonlinear activation, normalization, or other operations. In another exemplary embodiment, the result output interface outputs the result vector to an external bus for access by other computing or storage units in the system.
[0101] According to some embodiments, the weight-locked storage array in a read-only locked state may be configured with at least one of the following: a read disturbance control mechanism, a sensing window calibration mechanism, a reference branch calibration mechanism, an error correction coding mechanism, and redundant units.
[0102] Therefore, by configuring at least one of the above reliability mechanisms in read-only locking state, the read reliability of the weight-locked storage array during long-term read-only inference is improved, and the probability of misreading weight values is reduced.
[0103] In some embodiments, a read disturbance control mechanism is used to control the effect of read operations on the magnetization state of the memory cell. During inference, the memory cell is repeatedly read at high frequency, and the current flowing through the magnetic tunnel junction during each read may disturb the magnetization state of the free layer. The read disturbance control mechanism can reduce this disturbance by limiting the read current amplitude or shortening the read pulse width.
[0104] In some embodiments, the sensing window calibration mechanism is used to adjust the decision threshold of the sensing amplifier to compensate for resistance shifts caused by temperature drift or aging. The reference branch calibration mechanism is used to calibrate the reference signal provided by the reference branch to match the actual resistance distribution of the memory cells.
[0105] In some embodiments, the error correction coding mechanism adds redundant check bits to the weight data, and detects and corrects erroneous bits after reading. Redundant units refer to spare storage units reserved in the storage array for use when some storage units fail.
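The disclosure does not name a specific error correction code. As one hedged illustration of adding check bits and correcting a misread bit, the sketch below uses the classic Hamming(7,4) code, which is swapped in here purely as an example; the function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Any single misread bit in a 7-bit codeword, whether in a data bit or a check bit, is corrected; two or more errors in the same codeword are beyond what this particular code can repair.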
[0106] In some embodiments, when the MRAM employs SOT-MRAM, its read-write separation structure can be used to enhance read stability during the inference phase. When the MRAM employs STT-MRAM, sufficient verification can be performed during the write phase through device size, drive transistor capability, sensing threshold, and reference branch design, and high-frequency reading can be performed at a fixed clock speed during the inference phase.
[0107] According to some embodiments, the weight-locked storage array may also include a volatile cache coupled to multiple storage partitions, the volatile cache being configured to temporarily store weight values read from the storage partitions and provide them to parallel multiplication units.
[0108] In some embodiments, the volatile cache may be SRAM or other volatile storage devices. Weight values in the storage partition are read from the MRAM and written to the volatile cache. Parallel multiplication units retrieve weight values from the volatile cache to participate in the computation. In embodiments including a volatile cache, the determination of the fixed read cycle can be based on the read latency of the MRAM, further combined with the write latency, read latency, or other timing characteristics of the volatile cache.
[0109] According to another aspect of this disclosure, a data processing method for high-bandwidth inference is provided, which performs matrix-vector multiplication on a weight matrix and an activation vector during inference of a neural network model. As shown in Figure 2, the data processing method 200 includes: step S201, pre-storing the weight matrix in a statically resident manner in multiple storage partitions of a weight-locked storage array based on MRAM; step S202, obtaining a fixed readout timing, which is determined based on a pre-calibrated readout latency of the MRAM; step S203, during neural network model inference, performing the following operations based on the fixed readout timing: step S2031, receiving the activation vector; step S2032, reading multiple sets of weight values in parallel from the multiple storage partitions; and step S2033, performing parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector; and step S204, accumulating the product results generated at each readout timing, so that the result vector of the matrix-vector multiplication operation is generated after accumulation over multiple readout timings.
[0110] Therefore, this disclosure pre-stores the weight matrix in a statically resident manner in a weight-locked memory array based on MRAM, and uses the pre-calibrated readout latency of MRAM to determine a fixed readout clock. During inference, activation vector reception, parallel readout of weight values, parallel multiplication operations, and accumulation of product results are executed sequentially based on this fixed readout clock, thereby completing weight reading and matrix-vector multiplication operations on-chip. This reduces the bandwidth requirements of external interfaces for transporting the weight matrix, improves model inference efficiency, and enhances the determinism of the execution timing of matrix-vector multiplication operations.
[0111] Steps S201 to S204 and their sub-steps in data processing method 200 correspond to the operations performed by each functional unit in the computing device 100 described above. Specific implementation methods, optional embodiments, and technical details of each step can be found in the above description of the weight-locked storage array 110, input vector streaming interface 120, read scheduling unit 130, parallel multiplication unit 140, accumulation unit 150, and result output interface 160 in the computing device 100, and will not be repeated here.
[0112] According to one aspect of this disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory. The processor is configured to execute the computer program to implement the steps of any of the method embodiments described above.
[0113] According to one aspect of this disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of any of the method embodiments described above.
[0114] According to one aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the steps of any of the method embodiments described above.
[0115] Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below with reference to Figure 3.
[0116] Figure 3 shows an example configuration of a computer device 300 that can be used to implement the methods described herein.
[0117] Computer device 300 can be a variety of different types of devices. Examples of computer device 300 include, but are not limited to: desktop computers, server computers, laptop or netbook computers, mobile devices (e.g., tablet computers, cellular or other wireless phones (e.g., smartphones), notebook computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to a display device, game consoles), televisions or other display devices, automotive computers, and so on.
[0118] Computer device 300 may include at least one processor 302, memory 304, one or more communication interfaces 306, display device 308, other input/output (I/O) devices 310, and one or more mass storage devices 312, which are capable of communicating with each other, such as via system bus 314 or other suitable connections.
[0119] Processor 302 may be a single processing unit or multiple processing units, and all processing units may include single or multiple computing units or multiple cores. Processor 302 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and / or any device that manipulates signals based on operating instructions. Among other capabilities, processor 302 may be configured to acquire and execute computer-readable instructions stored in memory 304, mass storage device 312, or other computer-readable media, such as program code of operating system 316, program code of application program 318, program code of other program 320, etc.
[0120] Memory 304 and mass storage device 312 are examples of computer-readable storage media for storing instructions executed by processor 302 to perform the various functions described above. For example, memory 304 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). Furthermore, mass storage device 312 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network-attached storage, storage area networks, etc. Both memory 304 and mass storage device 312 may be collectively referred to herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code, which may be executed by processor 302 as a specific machine configured to perform the operations and functions described in the examples herein.
[0121] Multiple programs may be stored on mass storage device 312. These programs include operating system 316, one or more application programs 318, other programs 320, and program data 322, and they may be loaded into memory 304 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components / functions: the methods described herein (including any suitable steps of the methods), and / or other embodiments described herein.
[0122] Although illustrated as being stored in memory 304 of computer device 300, modules 316, 318, 320, and 322, or portions thereof, may be implemented using any form of computer-readable medium accessible by computer device 300. As used herein, “computer-readable medium” includes at least two types of computer-readable media: computer-readable storage media and communication media.
[0123] Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, DVD, or other optical storage devices, magnetic cassettes, magnetic tapes, disk storage devices or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by computer devices. In contrast, communication media can embody computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms. Computer-readable storage media as defined herein do not include communication media.
[0124] One or more communication interfaces 306 are used for exchanging data with other devices, such as via a network, direct connection, etc. Such communication interfaces can be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless (e.g., WLAN) interface, a Wi-MAX interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. Communication interface 306 can facilitate communication across various network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, etc. Communication interface 306 can also provide communication with external storage devices (not shown), such as storage arrays, network-attached storage, storage area networks, etc.
[0125] In some examples, a display device 308, such as a monitor, may be included for displaying information and images to the user. Other I / O devices 310 may be devices that receive various inputs from the user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input / output devices, and so on.
[0126] The technologies described herein can be supported by these various configurations of computer device 300, and are not limited to specific examples of the technologies described herein. For example, the functionality can also be implemented wholly or partially on a “cloud” using a distributed system. A cloud includes and / or represents a platform for resources. The platform abstracts the underlying functionality of the cloud’s hardware (e.g., servers) and software resources. Resources may include applications and / or data that can be used when performing computational processing on servers remote from computer device 300. Resources may also include services provided via the Internet and / or via subscriber networks such as cellular or Wi-Fi networks. The platform can abstract resources and functionality to connect computer device 300 to other computer devices. Therefore, the implementation of the functionality described herein can be distributed throughout the cloud. For example, the functionality may be implemented partly on computer device 300 and partly through a platform that abstracts the functionality of the cloud.
[0127] Although this disclosure has been described and illustrated in detail in the accompanying drawings and the foregoing description, such description and illustration should be considered illustrative and exemplary, not restrictive; this disclosure is not limited to the disclosed embodiments. By studying the drawings, the disclosure, and the appended claims, those skilled in the art will be able to understand and implement variations of the disclosed embodiments when practicing the claimed subject matter. In the claims, the word "comprising" does not exclude other elements or steps not listed, the indefinite article "a" or "an" does not exclude a plurality, the term "a plurality" means two or more, and the term "based on" should be interpreted as "at least partially based on". The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Claims
1. A computing device for performing a matrix-vector multiplication operation on a weight matrix and an activation vector during inference of a neural network model, the computing device comprising: an MRAM-based weight-locked memory array that includes multiple storage partitions and is configured to store the weight matrix in a statically resident manner in the multiple storage partitions during the inference of the neural network model, wherein the MRAM has a pre-calibrated readout latency and the computing device has a fixed readout beat predetermined based on the readout latency; an input vector streaming interface configured to receive the activation vector based on the fixed readout beat; a readout scheduling unit configured to read multiple sets of weight values in parallel from the multiple storage partitions based on the fixed readout beat; a parallel multiplication unit configured to perform, within each readout beat, parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector; an accumulation unit configured to accumulate the product results generated by the parallel multiplication unit at each readout beat to generate a result vector of the matrix-vector multiplication operation; and a result output interface configured to output the result vector.
2. The computing device according to claim 1, wherein each of the multiple storage partitions includes multiple storage cells, each storage cell includes a magnetic tunnel junction, the magnetic tunnel junction includes a reference layer, a tunneling barrier layer, and a free layer, and each weight value in the weight matrix is represented by the resistance state of the magnetic tunnel junction.
3. The computing device according to claim 2, wherein the magnetic tunnel junction is formed between back-end-of-line metal interconnect layers and is connected through vias to an underlying CMOS driving circuit and sensing circuit.
4. The computing device according to claim 1, wherein the MRAM is based on STT-MRAM, SOT-MRAM, VCMA-MRAM, toggle MRAM, thermally assisted MRAM, racetrack memory, memory based on magnetic domain wall motion, memory based on skyrmion motion, nanomagnet array memory, an artificial spin ice storage structure, or a combination thereof.
5. The computing device according to claim 2, wherein the MRAM is addressed or accessed through one or more of row selection, column selection, cross-point selection, local switch selection, port selection, or shift selection, and each storage cell is accessed via a gating device, a selection device, or an access controller.
6. The computing device according to claim 5, wherein the gating device is based on a transistor, a selector, a diode, a threshold switch, a two-terminal nonlinear gating device, a three-terminal gating device, or a combination thereof, and the storage cell is based on a 1T1MTJ structure, a 2T1MTJ structure, a 1S1MTJ structure, a 1D1MTJ structure, a transistor structure with separate read and write paths, a crossbar array selector structure, or a combination thereof.
7. The computing device according to any one of claims 1-6, wherein the multiple storage partitions are subarrays divided by rows, columns, blocks, sub-blocks, or banks.
8. The computing device according to claim 7, wherein different output dimensions of the weight matrix are mapped to different subarrays, and different input dimensions of the activation vector are input via the input vector streaming interface at different readout beats.
9. The computing device according to any one of claims 1-6, wherein the weight matrix is written to the weight-locked memory array during a deployment phase of the neural network model, and the statically resident manner indicates that the weight matrix remains in a read-only locked state during the inference of the neural network model.
10. The computing device according to claim 9, wherein the weight matrix is written into the weight-locked memory array in one of the following ways: by a spin-transfer torque applied by a spin-polarized current, by a spin-orbit torque applied by a transverse current in an adjacent heavy metal layer, or by a bias voltage that modulates the interfacial magnetic anisotropy.
11. The computing device according to claim 9, wherein the driving conditions for writing the weight matrix are determined based on the device size, the thermal stability factor, the target bit error rate, and the write time budget.
12. The computing device according to claim 2, wherein the readout scheduling unit reads out the multiple sets of weight values by sensing the tunneling magnetoresistance difference between the parallel state and the antiparallel state of the magnetic tunnel junction.
13. The computing device according to claim 12, wherein the readout scheduling unit includes a read bias generation unit, a sense amplifier, a reference branch, and decision logic.
14. The computing device according to claim 2, wherein the pre-calibrated readout latency is calibrated based on the tunneling magnetoresistance ratio of the magnetic tunnel junction, the RC parameters of the weight-locked memory array, the sensing time budget, and the misjudgment tolerance.
15. The computing device according to any one of claims 1-6, further comprising a timing control unit configured to generate, based on the fixed readout beat, multi-stage pipeline control signals corresponding respectively to the readout scheduling unit, the parallel multiplication unit, and the accumulation unit, wherein the multi-stage pipeline control signals define the inter-stage delay relationship among the readout time of the multiple sets of weight values, the execution time of the parallel multiplication operations, and the accumulation sampling time of the product results.
16. The computing device according to any one of claims 1-6, wherein the fixed readout beat corresponds to a steady-state readout latency window of the MRAM determined by offline device characterization.
17. The computing device according to any one of claims 1-6, wherein the readout scheduling unit is further configured to read the multiple sets of weight values from the multiple storage partitions in a multi-bank parallel manner, a block pipeline manner, or a multi-task pipeline manner.
18. The computing device according to any one of claims 1-6, wherein the weight matrix is encoded in a floating-point format, a fixed-point format, or a mixed-precision format and stored in the multiple storage partitions, wherein the floating-point format includes FP32, TF32, BF16, FP16, or FP8, and the fixed-point format includes INT16, INT8, INT4, INT2, or a binary format.
19. The computing device according to claim 18, wherein each weight value in the weight matrix is mapped to different storage locations in the weight-locked memory array by bit slicing, by separating sign bits from magnitude bits, or by separating high-order bits from low-order bits, and, after being read out, is reassembled by bit alignment logic into effective weight data that participates in the parallel multiplication operations.
20. The computing device according to any one of claims 1-6, wherein the accumulation unit includes an intra-beat local accumulation stage and a cross-beat global accumulation stage, the intra-beat local accumulation stage aggregates the product results from different storage partitions within the same readout beat, and the cross-beat global accumulation stage successively accumulates the aggregated results of the individual readout beats to generate the result vector.
21. The computing device according to any one of claims 1-6, wherein the result output interface is configured to output the result vector to an external main processor or an external bus.
22. The computing device according to claim 9, wherein the weight-locked memory array is configured, in the read-only locked state, with at least one of the following: a read disturbance control mechanism, a sensing window calibration mechanism, a reference branch calibration mechanism, an error correction coding mechanism, and a redundant cell.
23. The computing device according to any one of claims 1-6, wherein the weight-locked memory array further includes a volatile cache coupled to the multiple storage partitions, the volatile cache being configured to temporarily store weight values read from the multiple storage partitions and provide them to the parallel multiplication unit.
24. A data processing method for high-bandwidth inference, for performing a matrix-vector multiplication operation on a weight matrix and an activation vector during inference of a neural network model, the method comprising: pre-storing the weight matrix in a statically resident manner in multiple storage partitions of an MRAM-based weight-locked memory array; obtaining a fixed readout beat, which is determined based on a pre-calibrated readout latency of the MRAM; and, during the inference of the neural network model, performing the following operations based on the fixed readout beat: receiving the activation vector; reading multiple sets of weight values in parallel from the multiple storage partitions; performing parallel multiplication operations on the multiple sets of weight values read in parallel and the activation vector; and accumulating the product results generated at each readout beat, wherein a result vector of the matrix-vector multiplication operation is generated after accumulation over multiple readout beats.
25. A computer device, comprising the computing device according to any one of claims 1-23.
26. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the method of claim 24.
27. A computer program product comprising a computer program that, when executed by a processor, causes the processor to perform the method of claim 24.
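The following software sketch is offered only as an informal illustration of the data flow recited in claims 1, 20, and 24; it is not a description of the claimed hardware. It assumes, purely for illustration, that column j of the weight matrix is stored in storage partition j mod P and that P activation elements arrive per readout beat. Under that assumption each partition contributes one set of weight values per beat, the products are first summed across partitions within the beat, and the per-beat sums are then accumulated across beats. The names `partition_columns` and `beat_synchronous_mvm` are hypothetical, and the mapping of claim 8 (output dimensions mapped to different subarrays) would be arranged differently.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def partition_columns(weights: Matrix, num_partitions: int) -> List[List[Vector]]:
    """Deployment step (assumed mapping): column j goes to partition j % P.

    Each partition ends up holding a list of weight columns, one per beat.
    """
    num_cols = len(weights[0])
    partitions: List[List[Vector]] = [[] for _ in range(num_partitions)]
    for j in range(num_cols):
        column = [row[j] for row in weights]
        partitions[j % num_partitions].append(column)
    return partitions


def beat_synchronous_mvm(partitions: List[List[Vector]], activation: Vector) -> Vector:
    """Model one matrix-vector multiplication as a sequence of readout beats."""
    num_partitions = len(partitions)
    num_outputs = len(partitions[0][0])
    # Ceiling division: number of beats needed to stream the whole activation.
    num_beats = -(-len(activation) // num_partitions)
    result = [0.0] * num_outputs  # cross-beat (global) accumulator

    for beat in range(num_beats):
        local = [0.0] * num_outputs  # intra-beat (local) accumulator
        for p in range(num_partitions):
            j = beat * num_partitions + p  # input dimension served this beat
            if j >= len(activation):
                continue  # last beat may be only partially filled
            weight_set = partitions[p][beat]  # one set of weights per partition
            x = activation[j]
            for i in range(num_outputs):
                local[i] += weight_set[i] * x  # multiply, then local aggregate
        for i in range(num_outputs):
            result[i] += local[i]  # global accumulation across beats
    return result


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    x = [1.0, 0.0, -1.0]
    print(beat_synchronous_mvm(partition_columns(W, 2), x))
```

In this sketch the loop over `p` is sequential, whereas the claimed device reads the partitions and multiplies in parallel within a beat; the loop order only fixes which products belong to the same beat.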