An in-memory operation system compliant with DDR memory access timing

By optimizing the VMM operation process of the cross-point RAM, using row block ACT_BULK and column group VMM commands, combined with decoupling circuits and row block interleaving operations, the problem of insufficient row-level parallelism and column-level parallelism in the cross-point RAM is solved, achieving compatibility with DDR memory access timing and performance improvement.

CN116610604B · Active · Publication Date: 2026-06-26 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2023-04-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing cross-point RAM hardware designs suffer from insufficient row-level and column-level parallelism when performing analog vector-matrix multiplication, which causes excessive latency, makes them hard to reconcile with DDR memory access timing, and degrades computational performance.

Method used

Design an in-memory arithmetic system that conforms to DDR memory access timing. Optimize the VMM operation process through row block ACT_BULK and column group VMM commands. Combine decoupling circuits and row block interleaving operations to achieve coordinated development of row-level parallelism and column-level parallelism, thereby reducing latency.

Benefits of technology

It effectively reduces the overall latency of VMM operations, improves computing performance, enables the cross-point RAM system to be compatible with DDR memory access timing, and improves computing efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a memory operation system compliant with DDR memory access timing, and belongs to the field of memory and scientific calculation, comprising: a row-level parallelism driven timing termination mechanism, which utilizes row nonlinear charging characteristics to reduce tail delay of tRCD and tRP; and a row block interleaving, row-column collaborative vector-matrix multiplication (VMM) access mechanism, which reduces tRAS and overlaps CL timing parameters without increasing peripheral column ADC precision overhead, and reduces cross-point RAM internal core delay; the proposed row access and column access collaborative optimization memory hardware design can enable VMM operation compliant with DDR memory access timing in a memory-centric manner, so as to realize efficient hardware execution of low-delay, high-bit-width data-intensive scientific calculation (computational physics) loads.

Description

Technical Field

[0001] This invention belongs to the field of internal memory and scientific computing, and more specifically, relates to an in-memory computing system that conforms to the DDR memory access timing.

Background Technology

[0002] The emerging random-access memory (RAM) can perform analogue vector-matrix multiplication (VMM) in situ based on Ohm's law and Kirchhoff's current law, and can efficiently iteratively solve linear systems derived from complex problems in the physical world, such as computational fluid dynamics, semiconductor device simulation, circuit simulation and structural mechanics.

[0003] As shown in Figure 1, a crosspoint resistive RAM module includes multiple ranks, as well as burst buffers, shift-and-accumulate circuits, vector logic operation circuits, and a memory controller. Each rank is divided into memory banks that can be accessed in parallel with one another; the layout of each memory bank is shown in Figure 2. Each memory bank consists of a two-dimensional array of memory tiles. A memory tile includes a cell array and the peripheral core transistors used for sensing, driving, and latching. Each cell has a cuboid, capacitor-like metal-insulator-metal (MIM) layered structure with a thin dielectric sandwiched between top and bottom electrodes, storing a binary 1 as a high-conductance state and a binary 0 as a low-conductance state. The peripheral core transistors in a memory tile include a row of sense amplifiers connected to the bit lines and the sub-word line drivers. A row of memory tiles sharing the same set of global word lines forms a sub-array. These memory tiles operate in lockstep, meaning they execute the same commands and start and end simultaneously. Only one sub-array can be accessed in a memory bank at a time. Note the asymmetry between rows and columns: a row refers to a single row of storage cells, while a column refers to a single bit line. Each bit line is connected to a multi-bit latch (S/H, sample and hold) that temporarily stores the accumulated result. A row of bit-line latches is called a row buffer.

[0004] Traditional CMOS inverter-based row buffers (such as StrongARM latches, i.e., sense amplifiers) perform three functions when enabled: 1) sensing (evaluating) the potential difference between the true bit line and the dummy bit line used as a reference; 2) actively toggling and strongly pulling (driving) the true bit line; 3) latching a single bit, so that after a sufficient toggling time the row buffer holds the same data as the accessed cells. The data held in the local row buffers within a subarray is also called a physical page; it is typically 8 Kb in size, which, at one bit per cell, corresponds to 8192 memory cells.

[0005] Activating an entire row of cells as input completely eliminates sneak currents in the crosspoint cell array during VMM operations, similar to a DRAM read access. Because an entire row of cells is accessed at a time, the local word lines in crosspoint RAM carry a significant current, which the sub-wordline driver (SWD) delivers to the cell row. Driving involves raising or lowering the word line potential by charging or discharging parasitic capacitors. Write operations on crosspoint RAM subarrays involve cell state changes that take hundreds of nanoseconds, so write latency is dominated by the cell toggle delay. Unlike write operations, VMM operations do not change the cell state; they only involve the precise charging and discharging of word line and cell parasitic capacitances, so the latency of a single-step operation is only tens of nanoseconds.

[0006] Existing cross-point RAM hardware designs activate all rows in the subarray simultaneously during VMM operations and let all columns in a memory tile share a single ADC (analog-to-digital converter). First, as more rows of cells are activated or precharged simultaneously, these rows must be driven more precisely to the target $V_{PP}$ state or ground state during the time-varying charging or discharging process, so that the column ADC can correctly sense the analog accumulation result. Since the row charging and discharging processes are highly nonlinear and have very long tail delays, activating more rows simultaneously significantly increases the required charging and discharging delays. Second, activating more rows requires higher ADC precision to sense the analog accumulation result at the bit line ends, which increases the area, delay, and power overhead of the sense amplifier-based ADC. Therefore, within a given subarray width, more columns must share a single ADC; that is, as row-level parallelism (RLP) increases, column-level parallelism (CLP) decreases, as shown in Figure 3. Because of the reduced CLP, a subarray row access must stay open longer until all peripheral column accesses have passed through the ADC, which wastes a significant amount of row access latency and cell array energy.

[0007] In general, row charging and discharging delays caused by parasitic capacitance and line resistance are the main performance bottlenecks for performing high-frequency VMM operations on practical-scale (e.g., 512×512) cell arrays. In fact, because CMOS inverter-based word line drivers must charge or discharge the parasitic capacitances of word lines and bit lines, as well as the parasitic capacitances between the top and bottom electrodes of the entire row of memory cells, the internal cell array core of crosspoint RAM operates in an analog manner. To properly charge and discharge the parasitic capacitances of the cell array RC network during VMM operations and ensure the accuracy of the underlying analog signals of the word lines and bit lines in the cell array, the internal cell array core in crosspoint RAM has a longer delay than the peripheral circuitry. This high operational latency severely impacts the performance of VMM operations. Therefore, how to reduce the latency of VMM operations in crosspoint RAM systems to improve VMM performance is a pressing problem to be solved.

Summary of the Invention

[0008] To address the shortcomings and improvement needs of existing technologies, this invention provides an in-memory computing system that conforms to the DDR memory access timing. The purpose is to design a VMM operation mechanism in the cross-point RAM that conforms to the DDR memory access timing, reduce the latency of VMM operations to improve the performance of VMM operations, and enable the proposed in-memory computing system to function as a DDR slave device and interface with modern computer system hosts.

[0009] To achieve the above objectives, the present invention provides an in-memory computing system that conforms to DDR memory access timing, comprising: a cross-point RAM and a memory controller on the module; the memory controller is used to control the cross-point RAM to perform VMM operations in-situ through the following operations:

[0010] Send a row block ACT_BULK command to the target subarray in the cross-point RAM, causing the target subarray to: open the rows with non-zero inputs in the target row block so that the target row block performs analog vector-matrix multiplication in situ, and then enable the row buffer so that the vector-matrix multiplication results of the target row block are stored and latched in the row buffer; a row block is N1 adjacent rows in the cell array;

[0011] When the row buffer latch is stable, a column group VMM command is sent to the target subarray to cause the target subarray to perform the following: read the contents corresponding to the target column group from the row buffer, perform analog-to-digital conversion on the ADC, and then transmit the data to the global shift-accumulate data path for shift-accumulation; the column group consists of N2 columns with the same spacing in the cell array;

[0012] Where N1 and N2 are both integers greater than 1.

[0013] Furthermore, in the cross-point random access memory, each bit line within each memory tile is connected to the row buffer via a decoupling circuit;

[0014] The decoupling circuit includes: an NMOS pull-down transistor PD, a PMOS isolation transistor ISO, and a row buffer decoupling line; the NMOS pull-down transistor PD is connected between the bit line and ground, the PMOS isolation transistor ISO is connected between the bit line and the row buffer, and the gates of both the NMOS pull-down transistor PD and the PMOS isolation transistor ISO are connected to the row buffer decoupling line.

[0015] Furthermore, the memory controller is also used for:

[0016] While issuing the row block ACT_BULK command, enable the row buffer decoupling line to turn on the NMOS pull-down transistor PD and turn off the PMOS isolation transistor ISO, thereby enabling the decoupling circuit, decoupling the bit lines in the target column group from the row buffer and precharging the bit lines to ground potential.

[0017] Additionally, after all rows with non-zero inputs in the current row block are opened, the row buffer decoupling line is disabled to turn on the PMOS isolation transistor ISO and turn off the NMOS pull-down transistor PD, thereby disabling the decoupling circuit and connecting the bit lines in the target column group to the row buffer to latch the operation result.
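
The switching behavior of the decoupling circuit described in [0014] to [0017] can be summarized as a two-state truth table. The following is a minimal sketch, assuming ideal switches (an NMOS conducts when its gate is high, a PMOS conducts when its gate is low); the names `BitLineState` and `decoupling_circuit` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitLineState:
    pulled_to_ground: bool          # NMOS pull-down transistor PD is conducting
    connected_to_row_buffer: bool   # PMOS isolation transistor ISO is conducting


def decoupling_circuit(decoupling_line_high: bool) -> BitLineState:
    """Ideal-switch model: both gates are tied to the row buffer decoupling line."""
    pd_on = decoupling_line_high        # NMOS conducts when its gate is high
    iso_on = not decoupling_line_high   # PMOS conducts when its gate is low
    return BitLineState(pulled_to_ground=pd_on, connected_to_row_buffer=iso_on)


# Decoupling enabled: bit line precharged to ground and isolated from the row buffer.
enabled = decoupling_circuit(True)
# Decoupling disabled: bit line connected to the row buffer so the result can be latched.
disabled = decoupling_circuit(False)
print(enabled, disabled)
```

Because both gates share one control line, the two transistors can never conduct at the same time in this idealized model, which is what allows precharging to proceed without disturbing the latched contents.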

[0018] Furthermore, the memory controller is also used for:

[0019] After all rows with non-zero inputs in the preceding row block are turned on, a row block PREACT_BULK command is sent to the target subarray, causing the target subarray to: turn off the word lines that are turned on in the target row block, enable the row buffer decoupling lines, and precharge the bit lines in the target column group to the baseline reference voltage; at the same time, turn on the rows with non-zero inputs in the next row block.

[0020] Furthermore, each ADC connected to the row buffer in the memory bank is provided with an output register at its end; the output register is used to temporarily store the output of the ADC after analog-to-digital conversion, and its bit width is not less than the accuracy of the ADC;

[0021] Furthermore, after the target subarray executes the current column group VMM command until the ADC completes the analog-to-digital conversion, the memory controller begins sending the next column group VMM command.
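
The effect of the output register in [0020] and [0021] can be illustrated with a simple two-stage pipeline model. This is a sketch under stated assumptions: each column group VMM command is reduced to an ADC sensing stage followed by a shift-accumulate stage, the stage times are hypothetical placeholders rather than values from the patent, and the function name `column_phase_latency` is invented for illustration.

```python
def column_phase_latency(num_groups: int, t_sense: float, t_shift_acc: float,
                         pipelined: bool) -> float:
    """Total time to process num_groups column group VMM commands.

    Not pipelined: the next command waits for the previous shift-accumulate.
    Pipelined: the next command is issued once the ADC conversion is done,
    because the output register holds the converted value meanwhile.
    """
    if num_groups <= 0:
        return 0.0
    if not pipelined:
        return num_groups * (t_sense + t_shift_acc)
    # Classic two-stage pipeline: fill, then one result per slowest stage, then drain.
    slowest = max(t_sense, t_shift_acc)
    return t_sense + (num_groups - 1) * slowest + t_shift_acc


# Hypothetical stage times in nanoseconds, for illustration only.
serial = column_phase_latency(4, t_sense=10, t_shift_acc=5, pipelined=False)
overlapped = column_phase_latency(4, t_sense=10, t_shift_acc=5, pipelined=True)
print(serial, overlapped)
```

In this model the gain grows with the number of column groups, since only the first sensing stage and the last shift-accumulate stage remain exposed.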

[0022] Furthermore, the memory controller is also used for:

[0023] After all rows with non-zero inputs in the current row block are opened, a row block PRE_BULK command is sent to the target subarray to cause the target subarray to perform the following actions: close the word lines that are open in the target row block, disable the row buffer, and precharge the bit lines in the target column group to the baseline reference voltage.
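
The commands introduced so far (row block ACT_BULK, column group VMM, row block PREACT_BULK, and row block PRE_BULK) can be arranged into a per-subarray command list. The sketch below is an illustrative assumption, not the controller logic of the patent: it lists commands grouped by row block and does not model their overlap in time (in hardware, the PREACT_BULK of a block may overlap with that block's column group VMM commands). The function name and tuple layout are hypothetical.

```python
def vmm_command_sequence(num_row_blocks: int, num_column_groups: int) -> list:
    """Logical command list for one subarray VMM access with row block interleaving."""
    commands = []
    for block in range(num_row_blocks):
        if block == 0:
            # Only the first row block needs a standalone activation.
            commands.append(("ACT_BULK", block))
        # Read out every column group of the latched row block result.
        for group in range(num_column_groups):
            commands.append(("VMM", block, group))
        if block + 1 < num_row_blocks:
            # Close the current row block while opening the next one.
            commands.append(("PREACT_BULK", block, block + 1))
        else:
            # The last row block is closed with a plain bulk precharge.
            commands.append(("PRE_BULK", block))
    return commands


for command in vmm_command_sequence(num_row_blocks=2, num_column_groups=2):
    print(command)
```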

[0024] Furthermore, when the target subarray executes the row block ACT_BULK command, it opens rows within the target row block that have non-zero inputs, including:

[0025] The number of rows with non-zero inputs within the target row block, i.e., the RLP, is sensed, and the lower bound $\eta_{p2}$ of the corresponding row charging coefficient $\eta_p$ is calculated; the rows with non-zero inputs within the target row block are then charged until the voltage across each cell reaches [value missing], where [symbol missing] denotes the maximum effective voltage across the cell.

[0026] Furthermore, when the target subarray executes the row block PRE_BULK command, it closes the word lines that are open within the target row block, including:

[0027] Discharge the parasitic capacitances of the word lines and cells corresponding to the open rows within the target row block until the voltage across each cell is [value missing].

[0028] Furthermore, [formula missing].

[0029] Furthermore, N1 = 16; N2 = 16 or N2 = 32.

[0030] In summary, the above-described technical solutions conceived in this invention can achieve the following beneficial effects:

[0031] (1) This invention discovers that when binary potential is used as the word line input configuration, the VMM operation is actually a read access. Based on this, this invention redesigns the memory access commands in the VMM operation process with reference to the memory access timing, namely the row block ACT_BULK command and the column group VMM command. The row block ACT_BULK command activates multiple rows in a row block at the same time, and the column group VMM command transmits multiple column results in a column group at the same time. Moreover, the timing of the row block ACT_BULK command and the column group VMM command is completely consistent with the timing of the row activation (ACT) command and the column read (RD) command in the memory access timing. Thus, the similarity between VMM operation and memory access operation can be fully utilized. On the basis of correctly completing the VMM operation, row-level parallelism and column-level parallelism are developed in tandem. In the VMM operation process, only a part of the rows are activated in parallel, which effectively reduces the row activation delay (tRCD) and row precharge delay (tRP), and increases the column-level parallelism, which effectively reduces the ADC sensing delay and row buffer setup delay. Finally, the latency of VMM operation is effectively reduced and the performance of VMM operation is improved.

[0032] (2) This invention achieves the decoupling of the pre-charging function and latching function of the peripheral circuit at the end of the bit line by setting a decoupling circuit between the bit line and the row buffer. When the row block ACT_BULK or row block PREACT_BULK command is issued, the decoupling line of the row buffer is set to a high potential, so that the bit line in the target column group is decoupled from the row buffer and the bit line is pre-charged to ground potential. Since the bit line is decoupled from the row buffer, the pre-charging process of the bit line will not affect the contents of the row buffer. Pre-charging and row buffering (i.e. latching) can be performed in parallel, effectively reducing the time interval tRAS from the start of row opening to the imminent closing, and further reducing the overall delay of VMM operation.

[0033] (3) The present invention found that the row charging and discharging process is symmetrical. When closing one row block and opening the next row block at the same time, it can be ensured that the current row block is completely closed when the next row block is fully opened. Based on this, the present invention designed a new high-level command, namely the row block PREACT_BULK command. This command activates the next row block while precharging the current row block, realizing row block interleaving. When the current row block completes its precharging process, the next row block has been fully opened and the row buffer can be reactivated to start re-latching new data results for the next row block. This allows the precharging and activation processes of adjacent rows to be executed in parallel, which can further reduce the overall VMM operation latency.

[0034] (4) The present invention sets an output register at the end of each ADC to temporarily store the operation result of the current row block. After the target subarray executes the current column group VMM command until the content corresponding to the target column group is sensed by the ADC (i.e. the ADC completes the analog-to-digital conversion), the memory controller sends the next column group VMM command. This realizes the pipeline between the sensing and shift accumulation steps, so that the issuance of the next column group VMM command does not have to wait until the shift accumulation step of the current column group VMM command is completed. This effectively reduces the column read latency CL and the latency tCCD_L between two adjacent column commands in the same memory bank, further reducing the overall latency of VMM operation.

[0035] (5) After the current row block activation process is completed (i.e., all rows with non-zero inputs in the current row block are opened), the present invention sends a row block PRE_BULK command to the target subarray. The row block PRE_BULK command simultaneously completes the pre-charge of multiple rows in a row block. The timing of the row block ACT_BULK command, the column group VMM command, and the row block PRE_BULK command is completely consistent with the timing and timing parameter physical meaning of the row activation (ACT) command, column read (RD) command, and row pre-charge (PRE) command in the DDR memory access operation, i.e., it follows the DDR memory access timing.

[0036] (6) Compared to activating all rows in the subarray simultaneously during VMM execution, this invention only activates a portion of the rows simultaneously, reducing row-level parallelism (RLP), but at the same time increasing column-level parallelism (CLP). After the RLP is reduced, the lower limit of the row charging coefficient can be adjusted accordingly to ensure correct row charging and discharging. Based on this, when performing VMM operations, this invention calculates the corresponding lower limit of the row charging coefficient based on the sensed RLP and adjusts the charging and discharging voltages. The adjusted charging voltage decreases while the discharging voltage increases. Consequently, the delay tRCD from issuing the row block ACT_BULK command to the row buffer latch stabilization and the delay tRP from issuing the row block PRE_BULK command to the completion of row block precharging are both reduced, further reducing the overall latency of VMM operations.

Attached Figure Description

[0037] Figure 1 shows the organizational structure of an existing cross-point RAM;

[0038] Figure 2 is a schematic diagram of the memory bank layout in an existing cross-point RAM;

[0039] Figure 3 is a schematic diagram of the row-level and column-level parallelism when VMM operations are performed on an existing cross-point RAM;

[0040] Figure 4 is a schematic diagram of memory access in an existing cross-point RAM;

[0041] Figure 5 is a memory access timing diagram of an existing cross-point RAM;

[0042] Figure 6 is a schematic diagram of the row-level and column-level parallelism of VMM operations performed by the cross-point RAM according to an embodiment of the present invention;

[0043] Figure 7 shows the nonlinear row voltage curves during the row charging and discharging processes provided in an embodiment of the present invention;

[0044] Figure 8 is a schematic diagram of the subarray VMM access operation provided in an embodiment of the present invention, where (a) is the original VMM access timing diagram, i.e., the subarray VMM access operation without parallel precharging and row buffering, and (b) is the VMM access timing diagram with row block interleaving and row-column coordination, i.e., the subarray VMM access operation with parallel precharging and row buffering;

[0045] Figure 9 shows the connection between the bit line and the row buffer in different systems, where (a) shows the direct connection between the bit line and the row buffer in an existing cross-point RAM system, and (b) shows the connection between the bit line and the row buffer through a decoupling circuit in the cross-point RAM system provided in an embodiment of the present invention;

[0046] Figure 10 shows the cross-point RAM and its VMM access provided in an embodiment of the present invention;

[0047] Figure 11 is a memory access timing diagram with row block interleaving and row-column coordination provided in an embodiment of the present invention.

Detailed Implementation

[0048] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0049] In this invention, the terms "first," "second," etc. (if present) in the invention and the accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0050] Before the technical solution of this invention is explained in detail, the following briefly introduces, from the perspective of the controller, the underlying physical meaning of the row and column access commands in random access memory, as well as the timing parameters of the double data rate (DDR) memory access standard recommended by the Joint Electron Device Engineering Council (JEDEC). These parameters ensure the integrity of analog signals and that parasitic capacitances are charged correctly without component conflicts. Note that since the row address (row command) and the column address (column command) do not arrive simultaneously, they are time-division multiplexed on a shared address bus to save address pins. The term DDR is used here generically; it is not limited to any generation and covers DDRx, where x = 1, 2, 3, 4, 5.

[0051] As shown in Figures 4 and 5, from the controller's perspective, the main commands and timing parameters involved are as follows:

[0052] (1) Row Activation (ACT) command applied to cell array

[0053] The ACT command, accompanied by a row address, first opens (pumps) the closed row to achieve a stable active state. Then, the row buffer surrounding the bit line is activated to copy the contents of the entire row cell into the row buffer storage node. The delay from issuing the ACT command to the row buffer latching stabilization is called the RAS to CAS delay time (tRCD), and the time interval from when the row begins to open to when it is about to close is called the RAS time (tRAS), also known as the row activation period. Here, RAS and CAS represent the row address strobe and column address strobe signals, respectively, and they are active low.

[0054] (2) Row precharge (PRE) command applied to the cell array

[0055] The PRE command, accompanied by a row address, immediately begins closing the word line that has been opened and is in the $V_{PP}$ state (i.e., discharging the selected row) so that it returns to the fully off state, disables the row buffer (disabling the bit line latch so that it cannot toggle and corrupt the data in the row buffer), and precharges the bit line back to the baseline reference voltage, thereby deselecting the currently accessed column in preparation for accessing another row. The delay of the entire process is called the RAS precharge time (tRP).

[0056] (3) Read (RD) command applied to the peripheral row buffer

[0057] When the row buffer storage node and its connected bit lines are sufficiently stable during activation and reach a predetermined latching stability state, the RD command is issued along with the column address to enable a specific set of column select lines (CSLs) and transfer a block of data from the row buffer storage node to the target global I / O data line via the local data line. The latency of this entire process is called the read CAS latency (CL, also known as tAA). It is worth noting that the write CAS latency is CWL, and CL is longer than CWL. The latency between two adjacent column commands in the same memory bank is called the CAS-to-CAS latency time (tCCD_L).
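
The timing parameters defined in [0053] to [0057] combine in a few well-known ways, sketched below. The relations used are the usual DDR ones (row cycle time tRC = tRAS + tRP; a read to a closed row costs tRCD + CL; a read that must first close another row costs tRP + tRCD + CL). The class name `DdrTiming` and all numeric values are hypothetical placeholders, not figures from the patent or from any JEDEC speed grade.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DdrTiming:
    """Timing parameters in nanoseconds (placeholder values supplied by the caller)."""
    tRCD: int    # ACT issue to row buffer latch stable
    tRAS: int    # row begins to open until it may begin to close
    tRP: int     # PRE issue to precharge complete
    CL: int      # RD issue to data on the global I/O lines
    tCCD_L: int  # spacing of adjacent column commands in the same bank

    def row_cycle(self) -> int:
        # Minimum time between two activations of different rows in one bank.
        return self.tRAS + self.tRP

    def closed_row_read(self) -> int:
        # Bank is precharged: activate, then read.
        return self.tRCD + self.CL

    def row_conflict_read(self) -> int:
        # Another row is open: precharge, activate, then read.
        return self.tRP + self.tRCD + self.CL


timing = DdrTiming(tRCD=14, tRAS=32, tRP=14, CL=14, tCCD_L=5)
print(timing.row_cycle(), timing.closed_row_read(), timing.row_conflict_read())
```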

[0058] When the accessed row is fully activated (i.e., opened) under the read voltage, an electrical read performs sensing at the end of the bit line. A non-destructive VMM operation also senses at the end of the bit line. The difference between a VMM and a read operation is that the VMM activates multiple rows simultaneously (word lines with zero input remain deactivated) and the data stored in the subarray is known. This invention finds that when a binary potential (either the $V_{PP}$ state or the ground state) is used as the word line input configuration, a VMM operation is actually a single read access that activates multiple rows. Simply stacking multiple ordinary read access operations would multiply the latency, and compared with the single-bit sensing of a read operation, multi-row activation requires multi-bit sensing, which significantly raises the required sensing precision. In fact, VMM access operations that activate multiple rows place much stricter requirements on row voltage accuracy than read access operations that activate only a single row.

[0059] Existing hardware designs for cross-point RAM in-memory operations have two main drawbacks. First, they lack the ability to follow the DDR memory access timing, making it difficult to interface with modern computer systems. Second, they lack optimization for row-column coordination, leaving significant room for performance improvement.

[0060] In order to fully exploit the similarity between memory access operations and VMM operations without increasing latency or the required ADC sensing precision, this invention redesigns the memory access commands used in the VMM operation process with reference to the memory access timing, so as to support memory bank VMM access.

[0061] The VMM access commands designed in this invention operate on row blocks or column groups. A row block consists of N1 adjacent rows in a cell array, and a column group consists of N2 equally spaced columns in the cell array; both N1 and N2 are integers greater than 1, typically powers of 2. Through these commands, this invention can operate simultaneously on multiple rows within a row block or on multiple columns within a column group. Because of the array width constraint, the product of the row block granularity and the column group granularity is constant. Figure 6 shows an example of row block and column group partitioning in which the row block granularity is 128 and the column group granularity is 4. The sizes of the row block and the column group determine the precision required of the ADC: the larger the granularity of the row block and column group, the higher the required ADC precision, and correspondingly the larger the ADC area and the higher its power consumption. To avoid excessively high ADC precision requirements, the row block granularity is preferably 16 and the column group granularity is preferably 32.
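
The granularity trade-off in [0061] can be checked with a few lines of arithmetic. The sketch below assumes that the constant product equals an array width of 512, which is consistent with both pairs quoted in the text (128 × 4 and 16 × 32) and with the 512×512 array size mentioned in [0007], but is an inference rather than a statement of the patent. It also assumes that the equally spaced columns of a group are separated by a stride of width / N2; the function names are hypothetical.

```python
def column_group_granularity(array_width: int, row_block_granularity: int) -> int:
    """N2 implied by a constant product N1 * N2 == array_width (assumed)."""
    if array_width % row_block_granularity != 0:
        raise ValueError("row block granularity must divide the array width")
    return array_width // row_block_granularity


def column_group_indices(array_width: int, n2: int, group: int) -> list:
    """Columns of one group: n2 equally spaced columns (stride is an assumption)."""
    stride = array_width // n2
    if not 0 <= group < stride:
        raise ValueError("group index out of range")
    return [group + k * stride for k in range(n2)]


print(column_group_granularity(512, 128))   # Figure 6 example
print(column_group_granularity(512, 16))    # preferred granularity
print(column_group_indices(512, 4, 0))
```

Under these assumptions every column belongs to exactly one column group, so issuing one column group VMM command per group reads out the whole row buffer.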

[0062] The following is an example.

[0063] Example 1:

[0064] An in-memory arithmetic system conforming to DDR memory access timing includes: a cross-point RAM and a memory controller on its module; the memory controller is used to control the cross-point RAM cell array to perform in-situ VMM operations through the following operations:

[0065] Send a row block ACT_BULK command to the target subarray in the cross-point RAM, causing the target subarray to: open the rows with non-zero inputs in the target row block so that the target row block performs analog vector-matrix multiplication in situ, and then enable the row buffer so that the vector-matrix multiplication results of the target row block are stored and latched in the row buffer; a row block is N1 adjacent rows in the cell array; the target subarray is a row of memory tiles in the selected memory bank;

[0066] When the row buffer latch is stable, a column group VMM command is sent to the target subarray to cause the target subarray to perform the following: read the contents corresponding to the target column group from the row buffer, perform analog-to-digital conversion on the ADC, and then transmit the data to the global shift-accumulate data path for shift-accumulation; the column group consists of N2 columns with the same spacing in the cell array;

[0067] After all rows with non-zero inputs in the current row block are opened, a row block PRE_BULK command is sent to the target subarray to cause the target subarray to perform the following: close the word lines that are open in the target row block, disable the row buffer, and precharge the bit lines in the target column group to the baseline reference voltage; optionally, in this embodiment, the baseline reference voltage is a low potential.

[0068] In this embodiment, the row block ACT_BULK command simultaneously activates multiple rows within a row block, the column group VMM command simultaneously transmits the data of multiple column output latch results within a column group, and the row block PRE_BULK command simultaneously precharges multiple rows within a row block. Furthermore, the timing of the row block ACT_BULK command, column group VMM command, and row block PRE_BULK command is completely consistent with the timing of the row activation (ACT) command, column read (RD) command, and row precharge (PRE) command in memory access operations. This fully leverages the similarity between VMM operations and memory access operations. Based on the correct completion of VMM operations, row-level parallelism and column-level parallelism are collaboratively developed. During VMM operations, only a portion of rows are activated in parallel, effectively reducing row activation and precharge latency. Column-level parallelism is also increased, effectively reducing ADC sensing latency and row buffer setup latency. Ultimately, this effectively reduces the overall latency of VMM operations performed by the cross-point RAM, improving the performance of VMM operations.
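
The functional effect of processing a VMM one row block at a time can be sketched in software. The model below is a deliberate simplification and not the hardware of this embodiment: cells and inputs are binary, bit-line accumulation is an exact integer sum (no analog error, no ADC quantization), and the partial results of successive row blocks are simply added, without the shift step that bit-sliced operands would need. The function names are hypothetical.

```python
def row_block_vmm(cells: list, inputs: list, n1: int) -> list:
    """Accumulate per-column sums one row block (n1 adjacent rows) at a time."""
    num_rows, num_cols = len(cells), len(cells[0])
    total = [0] * num_cols
    for start in range(0, num_rows, n1):
        block = range(start, min(start + n1, num_rows))
        # ACT_BULK opens only the rows of this block whose input is non-zero.
        active = [r for r in block if inputs[r] != 0]
        # Each bit line accumulates the contributions of the open rows.
        partial = [sum(cells[r][c] for r in active) for c in range(num_cols)]
        # Partial results of successive row blocks are accumulated.
        total = [t + p for t, p in zip(total, partial)]
    return total


def reference_vmm(cells: list, inputs: list) -> list:
    """Plain vector-matrix product, used only as a check."""
    return [sum(inputs[r] * cells[r][c] for r in range(len(cells)))
            for c in range(len(cells[0]))]


cells = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
inputs = [1, 0, 1, 1]
result = row_block_vmm(cells, inputs, n1=2)
print(result, reference_vmm(cells, inputs))
```

The block-wise result matches the full product for any block size in this idealized model, which is why reducing the number of simultaneously open rows does not change what is computed, only how it is scheduled.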

[0069] The cumulative current error on the bit line limits the minimum charging coefficient that can be achieved within the bit line sensing tolerance, and the voltage across the capacitor plates in an RC circuit never reaches the full voltage unless the charging time is infinitely long. To further optimize the performance of the VMM operation, this embodiment analyzes the intrinsic relationship between the row charging coefficient and row-level parallelism (RLP). Figure 7 shows the nonlinear row voltage curves during row charging and discharging in the subarray, where η_p1 denotes the lower limit of the row charging coefficient. In the cell array, the selector connected in series with each memory cell acts as a resistor, so the cell's parasitic capacitance charges more slowly than the parasitic capacitance of the accessed word line. As Figure 7 shows, the charging and discharging processes are highly nonlinear and asymptotically approach V_PP and ground potential, respectively; consequently, to ensure proper charging and discharging, both tRCD and tRP have very long tail delays. If more rows are activated simultaneously in the subarray, these rows must be brought closer to V_PP or ground potential, so the tail delays of tRCD and tRP become longer.

[0070] Further analysis reveals that the cell current-voltage characteristic curve can be considered linear at low voltage. When the inputs to all word lines corresponding to all row blocks are at high potential, and all cells on the bit line segments within the selected row block are in a high-conductivity state, the maximum cumulative relative error at the bit line end can be expressed as RE_Acc = RLP × (1 − η_p), where 0 < η_p < 1 is the row charging coefficient. RE_Acc must not exceed the ADC sensing static noise margin RE_margin = 1 / (2 × (2^ADC_bits − 1)), where ADC_bits is the precision of the ADC. If a column is all "1", the contents of that column are inverted; therefore, ADC_bits = log2(RLP). When RLP ≥ 2, substituting this into RE_Acc ≤ RE_margin gives the lower bound η_p ≥ 1 − 1 / (2 × RLP × (RLP − 1)). Therefore, the delay increases monotonically with RLP, and analysis reveals that this conclusion does not change with variations in the cell current-voltage characteristic curve. To reduce row access latency, this embodiment proposes an RLP-driven timing termination mechanism with scalable controller delay management for tRCD and tRP. Based on RLP, this mechanism uses nonlinear row charging to truncate unnecessary long-tail delays in tRCD and tRP during row activation and precharging. In other words, it uses the remaining delay margin to reduce latency by relaxing the tRCD and tRP timing parameters through an RLP-driven word line charging coefficient adjustment.
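
The bound on the row charging coefficient follows from the two stated relations: requiring RE_Acc ≤ RE_margin with ADC_bits = log2(RLP) gives η_p ≥ 1 − RE_margin / RLP. The sketch below evaluates that bound. The function names are hypothetical, and `rc_charge_time` additionally assumes an ideal first-order RC charging curve with time constant `tau`, which the text does not specify; it is included only to illustrate why the delay grows with RLP.

```python
import math


def re_margin(adc_bits: float) -> float:
    """ADC sensing static noise margin: 1 / (2 * (2**ADC_bits - 1))."""
    return 1.0 / (2.0 * (2.0 ** adc_bits - 1.0))


def eta_lower_bound(rlp: int) -> float:
    """Smallest row charging coefficient satisfying RLP * (1 - eta) <= RE_margin."""
    if rlp < 2:
        raise ValueError("the bound is stated for RLP >= 2")
    adc_bits = math.log2(rlp)  # ADC_bits = log2(RLP), as stated in the text
    return 1.0 - re_margin(adc_bits) / rlp


def rc_charge_time(eta: float, tau: float = 1.0) -> float:
    """Time to reach eta * V_PP, ASSUMING V(t) = V_PP * (1 - exp(-t / tau))."""
    return -tau * math.log(1.0 - eta)
```

For RLP = 2 the bound evaluates to 0.75, and it moves closer to 1 as RLP grows, so under the assumed RC model the required charging time increases with RLP.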

[0071] Specifically, in this embodiment, when the target subarray executes the row block ACT_BULK command, it opens the rows with non-zero inputs within the target row block, including:

[0072] The number of rows with non-zero inputs (RLP) within the target row block is sensed, and the lower bound η_p2 of the corresponding row charging coefficient η_p is calculated; the rows with non-zero inputs within the target row block are charged until the voltage of each cell is [value missing]; [symbol missing] denotes the maximum effective voltage across the cell; optionally, [value missing];

[0073] Furthermore, when the target subarray executes the row block PRE_BULK command, it closes the word lines that are open within the target row block, including:

[0074] The parasitic capacitances of the word lines and cells corresponding to the open rows within the target row block are discharged until the voltage of each cell is [value missing].

[0075] It is easy to understand that when the cell current-voltage characteristic curve changes, the lower limit of the row charging coefficient can be calculated according to the corresponding relationship.

[0076] In this embodiment, because RLP is reduced, the rows need not be charged so close to the final voltage V_PP for proper charging and discharging to be ensured. As shown in Figure 7, by reducing RLP, this embodiment can reduce tRCD and tRP to a certain extent.

[0077] In practical applications, RLP and its corresponding latency values can be managed as metadata. The RLP-driven timing termination mechanism requires a global lookup table that stores the RLP-dependent tRCD and tRP latency parameters. This metadata is stored in a reserved area of the extended serial presence detect (SPD) extreme memory profile (XMP).
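
A lookup table of this kind could be organized as in the following sketch. The class and field names are hypothetical, the SPD/XMP storage format is not modeled, and any latency values passed in are placeholders supplied by the caller rather than figures from the patent. Rounding an untabulated RLP up to the next tabulated entry is an assumed policy; it is conservative because the delay increases monotonically with RLP.

```python
import bisect
from typing import Dict, NamedTuple


class RowTiming(NamedTuple):
    t_rcd_ns: float  # row activation latency for this RLP
    t_rp_ns: float   # row precharge latency for this RLP


class RlpTimingTable:
    """Global lookup table mapping RLP to relaxed tRCD/tRP values (hypothetical layout)."""

    def __init__(self, entries: Dict[int, RowTiming]) -> None:
        self._keys = sorted(entries)
        self._entries = dict(entries)

    def lookup(self, rlp: int) -> RowTiming:
        """Return the timing of the smallest tabulated RLP that is >= rlp."""
        i = bisect.bisect_left(self._keys, rlp)
        if i == len(self._keys):
            raise KeyError(f"RLP {rlp} exceeds the largest tabulated value")
        return self._entries[self._keys[i]]
```

A controller would consult `lookup` once per row block ACT_BULK command, using the sensed number of rows with non-zero inputs as the key.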

[0078] In summary, based on the similarity between VMM operations and memory read accesses, this embodiment constructs new memory access commands, in the form of column commands, for analog VMM operations. It proposes a block-wise VMM access mechanism for cross-point RAM that conforms to DDR memory access timing and jointly exploits row-level and column-level parallelism, together with an RLP-driven timing termination mechanism that uses nonlinear row charging to truncate unnecessary long-tail delays in tRCD and tRP. This embodiment can therefore effectively reduce the overall latency of VMM operations in cross-point RAM systems and improve VMM performance, and the in-memory computing system it provides can act as a DDR slave device that interfaces with current computing system hosts.

[0079] Example 2:

[0080] An in-memory arithmetic system conforming to DDR memory access timing includes: a cross-point RAM and a memory controller on its module; the memory controller is used to control the cross-point RAM to perform VMM operations in-situ through the following operations:

[0081] Send the row block ACT_BULK command to the target subarray in the cross-point RAM to cause the target subarray to perform the following: open the rows with non-zero inputs in the target row block so that the target row block performs analog vector-matrix multiplication in situ, and then enable the row buffer to store and latch the vector-matrix multiplication results of the target row block in the row buffer; the row block is N1 adjacent rows in the cell array;

[0082] When the row buffer latch is stable, a column group VMM command is sent to the target subarray to cause the target subarray to perform the following: read the contents corresponding to the target column group from the row buffer, perform analog-to-digital conversion on the ADC, and then transmit the data to the global shift-accumulate data path for shift-accumulation; the column group consists of N2 columns with the same spacing in the cell array;

[0083] Where N1 and N2 are both integers greater than 1.

[0084] The precharge phase includes word line shutdown (i.e., row closing) and bit line precharge. In the precharge phase of a conventional DRAM cell array, the bit lines begin to be precharged only after the accessed word lines are completely shut down; for VMM access in cross-point RAM, all bit lines must be precisely precharged to ground before the row buffer can be enabled. Figure 8(a) is a schematic diagram of the VMM operation implemented according to Embodiment 1, in which parallelization and other optimization techniques are not applied. For ease of description, it is referred to below as the native VMM access timing. Considering that bit line precharging does not disturb the memory cells on a word line that is being turned off, this embodiment aims to precharge the bit lines to ground potential before the word line is completely turned off, thereby parallelizing word line shutdown and bit line precharging and reducing latency. However, as shown in Figure 9(a), in conventional cross-point RAM the bit lines are directly connected to the row buffer. Because of this connection, the row buffer and the bit line precharge circuit are mutually exclusive and cannot be enabled simultaneously; that is, there is a parallelism conflict between bit line precharge and row buffer enablement. This embodiment therefore aims to decouple the row buffer from the bit lines at appropriate times. Moreover, unlike in DRAM, the bit line current in cross-point RAM changes immediately once the word line is turned off, so the row buffer storage node must be decoupled from the bit lines at the start of precharge (i.e., when the selected word line is turned off). Furthermore, this embodiment aims to keep all bit lines grounded during the shutdown period of the current word line while the bit lines are decoupled from the row buffer, so that the row buffer remains on and retains its contents without being disturbed by changes in bit line current.

[0085] Based on the above considerations, this embodiment proposes a decoupling circuit to achieve parallelization between the precharge process and the row buffer (latch). As shown in Figure 9(b), each bit line within each memory tile is connected to the row buffer via a decoupling circuit.

[0086] The decoupling circuit includes an NMOS pull-down transistor (PD), a PMOS isolation transistor (ISO), and a row buffer decoupling line. The NMOS pull-down transistor (PD) is connected between the bit line and ground, and the PMOS isolation transistor (ISO) is connected between the bit line and the row buffer. The gates of both the NMOS pull-down transistor (PD) and the PMOS isolation transistor (ISO) are connected to the row buffer decoupling line. The row buffer decoupling line is essentially a metal (typically tungsten) interconnect in the memory bank. Enabling the row buffer decoupling line (setting it to a high potential) turns on the NMOS pull-down transistor (PD) and turns off the PMOS isolation transistor (ISO), decoupling the bit line from the row buffer and precharging it to ground. When the row buffer is enabled, disabling the row buffer decoupling line (setting it to a low potential) turns on the PMOS isolation transistor (ISO) and turns off the NMOS pull-down transistor (PD), connecting the bit line in the target column group to the row buffer. It is easy to understand that the values of "high potential" and "low potential" can be set flexibly as long as the on/off states of the specific transistors are ensured.
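
The switching behavior of the decoupling circuit reduces to a two-row truth table, sketched below as a logic-level model. The names `DecouplerState` and `decoupler` are hypothetical, and analog effects such as threshold voltages and switching delay are ignored.

```python
from typing import NamedTuple


class DecouplerState(NamedTuple):
    pd_on: bool        # NMOS pull-down transistor (PD) conducting
    iso_on: bool       # PMOS isolation transistor (ISO) conducting
    bitline_path: str  # "precharge" (tied to ground) or "sense" (tied to row buffer)


def decoupler(decoupling_line_high: bool) -> DecouplerState:
    """Logic-level state of one bit line's decoupling circuit."""
    pd_on = decoupling_line_high       # NMOS conducts when its gate is high
    iso_on = not decoupling_line_high  # PMOS conducts when its gate is low
    path = "precharge" if pd_on else "sense"
    return DecouplerState(pd_on, iso_on, path)
```

Because both gates share one control line and the transistors are of opposite polarity, the two paths are never enabled at the same time in this model.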

[0087] This embodiment introduces a decoupling circuit between the bit line and the row buffer, creating two paths at the end of the bit line, a sensing path and a precharge path, as shown in Figure 9(b). Because the charge stored in the parasitic capacitance gives the bit line inertia and lets it hold its state, the bit line can be properly switched from the precharge path to the sensing path by means of the row buffer decoupling line.

[0088] Specifically, in this embodiment, the memory controller is further configured to:

[0089] While issuing the row block ACT_BULK command, enable the row buffer decoupling line to turn on the NMOS pull-down transistor PD and turn off the PMOS isolation transistor ISO, thereby enabling the decoupling circuit, decoupling the bit lines in the target column group from the row buffer and precharging the bit lines to ground potential.

[0090] Additionally, after all rows with non-zero inputs in the current row block are opened, the row buffer decoupling line is disabled to turn on the PMOS isolation transistor ISO and turn off the NMOS pull-down transistor PD, thereby disabling the decoupling circuit and connecting the bit lines in the target column group to the row buffer to latch the operation result.

[0091] As shown in Figure 8(b), this embodiment parallelizes precharging and row buffering through circuit decoupling, effectively reducing tRAS. Based on the proposed decoupling circuit, the VMM access of the cross-point RAM in this embodiment is as shown in Figure 10.

[0092] Because VMM operations are non-destructive, cell content restoration is no longer required, so tRAS depends only on other timing parameters. However, row opening and closing still dominate the native VMM access latency, as shown in Figure 9(a). To further optimize the latency of VMM operations, this embodiment analyzes the characteristics of the row charging and discharging process and finds that the process is axisymmetric, as shown in Figure 11. Based on this, this embodiment further proposes a row block interleaving operation mechanism, which executes the activation and precharge micro-operations simultaneously in order to overlap the row activation and precharge delays of the native VMM access process. When the next row block has been activated to η_p × V_PP, the current row block has been precharged to (1 − η_p) × V_PP, where η_p is the row charging coefficient. Based on axisymmetry, tRCD = tRP + LL, where LL = k_LL × log2(RLP) is the row buffer latch latency and k_LL is the delay slope. The row block interleaving mechanism ensures that the current row block is completely closed when the next row block is fully open. In other words, when the current row block completes its precharge process, the row buffer can be re-enabled to begin latching new data for the next row block. The row block interleaving mechanism requires a modified local row address decoder and twice the number of sub-word line drivers in order to select two adjacent row blocks simultaneously, so that the precharge command and the activation command can be executed at the same time; that is, the next adjacent row block is selected for activation while the current row block is selected for precharge.
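
The two relations stated above, the complementary voltages of the interleaved row blocks and tRCD = tRP + LL with LL = k_LL × log2(RLP), can be written out as a short sketch. The function names are hypothetical, and the numeric arguments used to exercise them are arbitrary placeholders rather than values from the patent.

```python
import math
from typing import Tuple


def interleaved_voltages(eta_p: float, v_pp: float) -> Tuple[float, float]:
    """Return (next row block voltage, current row block voltage) at the handover point."""
    if not 0.0 < eta_p < 1.0:
        raise ValueError("eta_p must lie strictly between 0 and 1")
    return eta_p * v_pp, (1.0 - eta_p) * v_pp


def latch_latency(rlp: int, k_ll: float) -> float:
    """Row buffer latch latency LL = k_LL * log2(RLP)."""
    return k_ll * math.log2(rlp)


def t_rcd(t_rp: float, rlp: int, k_ll: float) -> float:
    """tRCD = tRP + LL, which follows from the axisymmetry of charging and discharging."""
    return t_rp + latch_latency(rlp, k_ll)
```

The two voltages returned by `interleaved_voltages` sum to V_PP up to floating-point rounding, which is the axisymmetry the interleaving mechanism relies on.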

[0093] Based on the above analysis, this embodiment further designs a new advanced command, namely the row block PREACT_BULK command, to replace the row block PRE_BULK command in Embodiment 1. Based on this command, the next row block is activated while the current row block is precharged. Specifically, in this embodiment, the memory controller is also used for:

[0094] After the data transmission of the target column group is completed, a row block PREACT_BULK command is sent to the target subarray to cause the target subarray to perform the following actions: turn off the word lines that are open in the target row block, disable the row buffer, and precharge the bit lines in the target column group to the baseline reference voltage; at the same time, turn on the row with non-zero inputs in the next row block.

[0095] Based on the row block PREACT_BULK command, this embodiment implements a row block interleaving operation mechanism. This mechanism automatically issues the activation command for the next row block as soon as the precharge command for the current row block is issued. Since the number of activated rows does not change when the row buffer is enabled, the row block interleaving operation mechanism does not increase the required ADC accuracy.

[0096] It should be noted that there is a timing constraint in this embodiment, namely, the activation command of the current row block cannot be issued before the precharge command of the previous row block is issued. This is because, due to the axisymmetry of the activation and precharge processes, the bit line sample / hold circuit must wait until the precharge process of the previous row block is completed in this case.
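
The resulting command order for a sequence of row blocks can be sketched as follows. This is an illustrative model only: `interleaved_stream` is a hypothetical name, commands are represented as plain tuples, and closing the final row block with a PRE_BULK command is an assumption, since the text does not state how the last block (which has no successor to activate) is closed.

```python
from typing import List, Tuple

Command = Tuple[str, int]  # (mnemonic, row block index or column group index)


def interleaved_stream(n_blocks: int, n_groups: int) -> List[Command]:
    """Command order for n_blocks row blocks with n_groups column group VMMs per block."""
    if n_blocks < 1 or n_groups < 1:
        raise ValueError("need at least one row block and one column group")
    cmds: List[Command] = [("ACT_BULK", 0)]  # only the first block needs a plain activation
    for block in range(n_blocks):
        cmds += [("VMM", group) for group in range(n_groups)]
        if block + 1 < n_blocks:
            # Precharge `block` and activate `block + 1` with a single command.
            cmds.append(("PREACT_BULK", block))
        else:
            # Assumption: the last block is closed with an ordinary PRE_BULK.
            cmds.append(("PRE_BULK", block))
    return cmds
```

In this ordering, the activation of each row block never precedes the precharge of the previous one, because both are issued by the same PREACT_BULK command.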

[0097] In the native VMM access timing shown in Figure 8(a), the peripheral column access (CL) latency includes the ADC sensing latency (SSL) and the shift-accumulation latency (ACL), where SSL = k_SSL × log2(RLP) and k_SSL is the corresponding coefficient. Since the sensing and latching functions of the peripheral circuitry are decoupled into a sample-and-hold circuit (S/H) and an ADC, this embodiment further proposes a sensing and shift-accumulation pipeline between the memory-bank ADC sensing and the global shift-accumulation logic. This pipeline overlaps the CL delay (i.e., the column CAS delay) and reduces the delay between adjacent VMM commands. To implement the pipeline, in this embodiment an output register is provided at the end of each ADC. The output register temporarily stores the calculation result of the current row block, and its bit width is not less than the ADC precision. The output registers across all memory banks can be regarded as a global row buffer.

[0098] Based on the buffering function of the output register at the end of the ADC, the issuance of the next VMM command does not need to wait for the completion of the current VMM shift and accumulation step. Accordingly, in the embodiment, after the target subarray executes the current column group VMM command until the content corresponding to the target column group is sensed by the ADC (i.e., the ADC completes the analog-to-digital conversion), the memory controller starts sending the next column group VMM command.

[0099] Based on the designed pipeline, tRAS is reduced. As shown in Figure 8(b), the tCCD_L timing parameter within the same VMM command group is reduced from SSL+ACL to SSL. It is easy to understand that, in this case, the Read-To-Precharge time (tRTP) parameter is redefined as the delay from the first VMM command in the column group to the precharge command.
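
The effect of the pipeline on a burst of column group VMM commands can be estimated with the sketch below. The function names are hypothetical, and the pipelined total assumes that ACL does not exceed SSL, so that each shift-accumulation finishes while the next group is being sensed and only the last one adds to the total; the text states only that tCCD_L drops from SSL+ACL to SSL.

```python
import math


def sensing_latency(rlp: int, k_ssl: float) -> float:
    """ADC sensing latency SSL = k_SSL * log2(RLP)."""
    return k_ssl * math.log2(rlp)


def column_phase_latency(n_cmds: int, ssl: float, acl: float, pipelined: bool) -> float:
    """Total latency of n_cmds back-to-back column group VMM commands."""
    if n_cmds < 1:
        raise ValueError("need at least one VMM command")
    if not pipelined:
        return n_cmds * (ssl + acl)  # tCCD_L = SSL + ACL
    if acl > ssl:
        raise ValueError("this estimate assumes ACL <= SSL")
    return n_cmds * ssl + acl  # tCCD_L = SSL; the last shift-accumulation drains afterwards
```

Under these assumptions the saving grows with the number of VMM commands issued per row block, since every command but the last hides its shift-accumulation behind the next sensing step.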

[0100] Considering the high locality of accessed row addresses, this embodiment uses an open-page row buffer management strategy for VMM access. In burst-mode VMM access, when the row buffer is hit, this strategy issues only the subsequent column commands and does not reissue the same row access command, thus saving row cycles. Figure 11 shows the timing diagram of the cross-point memory in the overlapped row and column access mode. In Figure 8 and Figure 11, "ShAc" denotes shift-accumulation.

[0101] The pipeline designed for column group VMM commands in this embodiment can also be applied to the above embodiment 1.

[0102] This embodiment is similar to Embodiment 1 above, and also follows the DDR memory access timing. The difference is that this embodiment further analyzes the internal physical mechanism of the cross-point RAM. Based on the row block ACT_BULK command and column group VMM command, the timing of row block activation, column group VMM operation and row block precharge is further optimized.

[0103] Overall, this embodiment optimizes row charging and discharging latency of subarray VMM access by coordinating row and column access and utilizing micro-operation interactions between the subarray and peripheral circuits, effectively reducing the overall latency of VMM operations and improving VMM operation performance.

[0104] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. An in-memory computing system conforming to DDR memory access timing, characterized in that it comprises: a cross-point RAM and a memory controller on its module; the memory controller is used to control the cross-point RAM to perform VMM operations in situ through the following operations:
sending a row block ACT_BULK command to the target subarray within the cross-point RAM to cause the target subarray to perform: opening the rows with non-zero inputs within the target row block, so that the target row block performs analog vector-matrix multiplication in situ, and then enabling the row buffer to store and latch the vector-matrix multiplication result of the target row block in the row buffer; the row block is N1 adjacent rows in the cell array;
when the row buffer latch is stable, sending a column group VMM command to the target subarray to cause the target subarray to perform: reading the content corresponding to the target column group from the row buffer, performing analog-to-digital conversion in the ADC, and then transmitting the result to the global shift-accumulate data path for shift-accumulation; the column group is N2 columns with the same spacing in the cell array;
wherein N1 and N2 are both integers greater than 1;
in the cross-point RAM, each bit line in each memory tile is connected to the row buffer through a decoupling circuit;
the decoupling circuit includes: an NMOS pull-down transistor PD, a PMOS isolation transistor ISO, and a row buffer decoupling line; the NMOS pull-down transistor PD is connected between the bit line and ground, the PMOS isolation transistor ISO is connected between the bit line and the row buffer, and the gates of the NMOS pull-down transistor PD and the PMOS isolation transistor ISO are both connected to the row buffer decoupling line;
furthermore, the memory controller is also used for: while issuing the row block ACT_BULK command, enabling the row buffer decoupling line to turn on the NMOS pull-down transistor PD and turn off the PMOS isolation transistor ISO, thereby enabling the decoupling circuit, decoupling the bit lines in the target column group from the row buffer and precharging the bit lines to ground potential; and, after all rows with non-zero inputs in the current row block are opened, disabling the row buffer decoupling line to turn on the PMOS isolation transistor ISO and turn off the NMOS pull-down transistor PD, thereby disabling the decoupling circuit and connecting the bit lines in the target column group to the row buffer to latch the operation result.

2. The in-memory computing system conforming to DDR memory access timing as described in claim 1, characterized in that the memory controller is also used for: after all rows with non-zero inputs in the preceding row block are opened, sending a row block PREACT_BULK command to the target subarray, causing the target subarray to perform the following actions: close the word lines that are open in the target row block, enable the row buffer decoupling line, and precharge the bit lines in the target column group to the baseline reference voltage; at the same time, open the rows with non-zero inputs in the next row block.

3. The in-memory computing system conforming to DDR memory access timing as described in claim 1, characterized in that each ADC connected to the row buffer in the memory bank is provided with an output register at its end; the output register is used to temporarily store the output of the ADC after analog-to-digital conversion, and its bit width is not less than the precision of the ADC; furthermore, once the target subarray has executed the current column group VMM command to the point where the ADC completes the analog-to-digital conversion, the memory controller begins to send the next column group VMM command.

4. The in-memory computing system conforming to DDR memory access timing as described in claim 1, characterized in that the memory controller is also used for: after all rows with non-zero inputs in the current row block are opened, sending a row block PRE_BULK command to the target subarray to cause the target subarray to perform the following actions: close the word lines that are open in the target row block, disable the row buffer, and precharge the bit lines in the target column group to the baseline reference voltage.

5. The in-memory computing system conforming to DDR memory access timing as described in claim 4, characterized in that, when the target subarray executes the row block ACT_BULK command, opening the rows within the target row block that have non-zero inputs includes: sensing the number of rows with non-zero inputs (RLP) within the target row block, calculating the lower bound of the corresponding row charging coefficient, and charging the rows with non-zero inputs within the target row block until the voltage of each cell is [value missing], where [symbol missing] denotes the maximum effective voltage across the cell; furthermore, when the target subarray executes the row block PRE_BULK command, closing the word lines that are in the open state within the target row block includes: discharging the parasitic capacitances of the word lines and cells corresponding to the open rows within the target row block until the voltage of each cell is [value missing].

6. The in-memory computing system conforming to DDR memory access timing as described in claim 5, characterized in that [value missing].

7. The in-memory computing system conforming to DDR memory access timing as described in any one of claims 1 to 6, characterized in that N1=16; and N2=16 or N2=32.