A multi-core in-memory processing architecture

By designing a multi-core in-memory processor architecture, storage and computing are combined into one, solving the problems of the 'memory wall' and 'power consumption wall' in traditional architectures, and realizing efficient AI edge computing.

CN115456155B · Active · Publication Date: 2026-06-16 · ZHEJIANG UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2022-09-15
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Traditional computing architectures cannot meet the needs of energy-constrained AI edge applications, mainly because ML computing is data-centric and memory access consumes a lot of energy, leading to the 'memory wall' and 'power wall' problems.

Method used

Design a multi-core in-memory processor architecture based on the Rocket Chip architecture. By configuring the RoCC coprocessor module, storage and computation are combined into one: the in-memory cores perform calculations inside memory, so data is processed directly in the storage module and only the results are fed back to the processor.

Benefits of technology

It significantly reduces the time and energy consumption of data transmission on the bus, improves computing throughput and energy efficiency, and solves the performance bottleneck in traditional architectures.


Abstract

The application discloses a multi-core in-memory computing processor architecture that includes a system bus, a memory module, a front-side bus, a peripheral bus, a control bus, and a Rocket Tile module. The Rocket Tile module includes a Rocket Core and a Rocket coprocessor (RoCC). According to different instructions, the Rocket Core either controls the RoCC module to exchange data with the memory module or controls the in-memory computing cores inside the RoCC to enter computation mode. The RoCC is used to configure an input cache module, a weight cache module, a decoding and logic control module, and in-memory computing core modules (CIM Cores) to complete the data storage and computation process. By modifying the configurable RoCC coprocessor module in the architecture, the application integrates the in-memory computing cores (CIM Cores), the cache modules, and the decoding and logic control module. Different convolutional neural networks are segmented according to the data size supported by the in-memory computing cores, and the minimum number of cores that can realize the network mapping is configured to complete the computation.

Description

Technical Field

[0001] This invention belongs to the field of integrated circuit technology and relates to near-memory computing technology, specifically a multi-core in-memory processor architecture that can target various convolutional neural network mappings.

Background Technology

[0002] With the advent of the era of big data and the Internet of Things, artificial intelligence (AI) and machine learning (ML) are widely used in many cognitive tasks, such as image classification and speech recognition, from the cloud to edge devices. In recent years, research on hardware accelerators for AI edge devices has received increasing attention, primarily because of the advantages of running AI at the edge: privacy, low latency, greater reliability, and more efficient use of network bandwidth. However, traditional computing architectures (such as CPUs, GPUs, FPGAs, and even existing AI accelerator ASICs) cannot meet the demands of future energy-constrained AI edge applications. This is because ML computing is data-centric, and a large portion of the energy in these architectures is consumed by memory access. To improve energy efficiency, academia and industry are exploring a new class of computing architecture: near-memory computing or in-memory computing.

[0003] The basic idea behind in-memory computing is to combine computation and storage into one, thereby reducing how often the processor accesses memory (because most of the computation is already done in memory). It uses the storage units themselves for calculation, which largely eliminates the overhead caused by data movement and addresses the "memory wall" and "power wall" problems that traditional chips face when running artificial intelligence algorithms. It can improve the efficiency of artificial intelligence computing by tens or even hundreds of times and reduce costs.

[0004] Rocket Chip is an open-source SoC generator based on Chisel. It includes a module library consisting of cores, caches, and interconnects, which forms the basis for building a complete SoC and can generate synthesizable RTL code. It offers highly flexible parametric design, allowing for customization to specific application scenarios. By changing just one configuration, we can obtain SoCs of vastly different sizes, ranging from embedded microprocessors to multi-core server chips.

Summary of the Invention

[0005] To address the problems in the existing technology, this invention aims to design a multi-core in-memory processor architecture that can support various convolutional neural network mappings based on the Rocket Chip architecture.

[0006] The technical solution of the present invention is as follows:

[0007] This invention provides a multi-core in-memory processor architecture, including a system bus, a memory module, a front-side bus, a peripheral bus, and a control bus, and also includes a Rocket Tile module;

[0008] The Rocket Tile module is used to configure the in-memory computing process on the in-memory computing architecture, and includes a Rocket Core and a Rocket coprocessor RoCC;

[0009] The Rocket Core is used to control the RoCC module to interact with the memory module or to control the internal computing core of RoCC to enter computing mode according to different instructions.

[0010] The Rocket coprocessor RoCC includes an input/output (I/O) module, an input buffer module, a weight buffer module, a decoding and logic control module, and several in-memory computing core modules (CIM Cores). The Rocket coprocessor RoCC is used to configure its internal input buffer module, weight buffer module, decoding and logic control module, and different numbers of in-memory computing core modules (CIM Cores) to complete the data storage and computation process.

[0011] Furthermore, the input/output (I/O) module is used for data interaction between the RoCC module and other modules; the input buffer module is used to store input activation data read from the memory module; the weight buffer module is used to store weight data read from the memory module; the decoding and logic control module is used to pre-store the weight data held in the weight buffer module into the in-memory computing cores (CIM Cores) in time sequence, to control the in-memory computing core array to perform calculations, and to process and output the calculation results of the in-memory computing array; the in-memory computing core module (CIM Core) is used to implement the multiply-add operation on input activations and weights.

[0012] Furthermore, the in-memory computing core (CIM Core) includes an n×(m×8b) SRAM array, a row decoding circuit, a multiplication module, an addition module, and an accumulation control module. The SRAM array is used to store weight data, and the row decoding circuit is used to receive address data and store the weight data in a specific row of the SRAM array.

[0013] Furthermore, the multiplication module is used to multiply each 8-bit activation value by the corresponding 8-bit weight; the addition module is used to add the results of the multiplication module three at a time, over two levels; the accumulation control module is used to accumulate and output the results of the addition module according to the control logic of the external logic control module.

[0014] Furthermore, the weight data is stored in the SRAM array, and the input activation is externally input. In each multiplication-addition operation, the data in one row of the SRAM array is multiplied by the externally input activation, and a two-level addition operation is performed with three inputs. The accumulation control module accumulates and outputs the data according to the control logic of the external logic control module.
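The multiply-add step described above can be sketched in software. This is a minimal illustration under stated assumptions, not the patented circuit: the function names (`adder_tree_3x2`, `row_mac`, `accumulate`) are hypothetical, and the sketch assumes that "two levels of three-input addition" means each adder tree reduces nine products, with the accumulation control module then combining tree outputs according to the kernel size.

```python
def add3(a, b, c):
    """One three-input adder."""
    return a + b + c


def adder_tree_3x2(products):
    """Reduce nine products with two levels of three-input adders (assumed grouping)."""
    if len(products) != 9:
        raise ValueError("this sketch assumes nine products per adder tree")
    level1 = [add3(*products[i:i + 3]) for i in range(0, 9, 3)]  # first level: 3 adders
    return add3(*level1)                                         # second level: 1 adder


def row_mac(weight_row, activations):
    """Multiply one SRAM row by the external activations, then run the adder trees."""
    if len(weight_row) != len(activations) or len(weight_row) % 9:
        raise ValueError("row and activations must match and be a multiple of 9 long")
    products = [w * a for w, a in zip(weight_row, activations)]
    return [adder_tree_3x2(products[i:i + 9]) for i in range(0, len(products), 9)]


def accumulate(tree_sums, trees_per_output):
    """Accumulation control: combine adder-tree outputs belonging to the same kernel."""
    return [sum(tree_sums[i:i + trees_per_output])
            for i in range(0, len(tree_sums), trees_per_output)]


# One 3x3x3 kernel (27 weights) against an all-ones activation window.
weights = list(range(1, 28))
activations = [1] * 27
tree_sums = row_mac(weights, activations)
print(tree_sums)                 # [45, 126, 207]
print(accumulate(tree_sums, 3))  # [378]
```

With a 27-element kernel, three adder-tree outputs are accumulated into one result, which matches the plain dot product of the weights and activations.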

[0015] Furthermore, the system bus is used to transmit data between the Rocket Tile module, memory module, front-side bus, peripheral bus and control bus on the in-memory computing architecture;

[0016] The system bus includes a data bus and an address bus. The data bus is used to carry data, and the address bus determines where the data is sent. The control and data transmission processes of each module are realized through the instructions of the control bus.

[0017] Furthermore, the memory module serves as a cache for storing the input and output data of the Rocket Tile module; the front-side bus interconnects with the outside world for data, command, address, and control signals; the peripheral bus is used to connect peripherals; and the control bus is used to transmit control signals and timing signals.

[0018] Furthermore, the Rocket coprocessor RoCC interacts directly with the memory module via DMA.

[0019] Furthermore, the in-memory computing core (CIM Core) supports both storage and computation modes. Before computation begins, it operates in storage mode by default. The logic control module pre-stores the weight data stored in the weight cache module into the in-memory computing core according to the time sequence. After computation begins, the system bus sends computation instructions to the Rocket Core. The Rocket Core module controls the in-memory computing core inside the RoCC to enter computation mode. At this time, the logic control module controls the in-memory computing core to no longer perform storage operations, but instead reads the data in the input cache module for computation. The decoding and logic control modules can control the in-memory computing core array to perform computation based on external input, realize input activation and weight multiplication and addition operations, and output the results to the memory module.
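The two operating modes can be modeled as a small state machine. The following is a behavioral sketch only; the class name `CimCoreModel`, its methods, and the choice to raise an error on an out-of-mode operation are assumptions made for illustration and do not come from the patent text.

```python
from enum import Enum


class Mode(Enum):
    STORAGE = "storage"
    COMPUTE = "compute"


class CimCoreModel:
    """Behavioral model of one CIM Core with storage and computation modes."""

    def __init__(self, rows, cols):
        self.mode = Mode.STORAGE  # storage mode is the default before computation
        self.sram = [[0] * cols for _ in range(rows)]

    def write_row(self, row, weights):
        """Pre-store one row of weights; only allowed in storage mode."""
        if self.mode is not Mode.STORAGE:
            raise RuntimeError("weights can only be written in storage mode")
        if len(weights) != len(self.sram[row]):
            raise ValueError("weight row has the wrong width")
        self.sram[row] = list(weights)

    def enter_compute(self):
        """Models the Rocket Core switching the core into computation mode."""
        self.mode = Mode.COMPUTE

    def mac(self, row, activations):
        """Multiply-add one stored row with activations from the input buffer."""
        if self.mode is not Mode.COMPUTE:
            raise RuntimeError("computation requires computation mode")
        return sum(w * a for w, a in zip(self.sram[row], activations))


core = CimCoreModel(rows=2, cols=3)
core.write_row(0, [1, 2, 3])       # storage mode: pre-store weights
core.enter_compute()               # computation instruction arrives
print(core.mac(0, [1, 1, 1]))      # 6
```

Keeping the mode check inside the model makes the ordering constraint explicit: weights are written first, and no further storage operations happen once computation starts.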

[0020] Compared to existing technologies, this invention is based on the Rocket Chip architecture. By modifying the configurable RoCC coprocessor module within the architecture, it incorporates the in-memory computing core (CIM Core), a cache module, and a decoding and logic control module. Different convolutional neural networks are segmented according to the data size supported by the in-memory cores, and the minimum number of in-memory cores needed to complete the computation is configured. The in-memory core supports both storage and computation modes. Before computation begins, the in-memory core operates in storage mode, in which network weight information is pre-stored in the in-memory core in time sequence via an interface. After computation begins, the Rocket Core module controls the in-memory core to enter computation mode. In this mode, data fed into the in-memory core is no longer stored but is placed directly into registers, where it waits for the weights to be read before computation begins. This invention allows the internal architecture to be configured according to computational needs, generating different processors from different configurations, whereas traditional architectures are generally fixed and cannot be configured. Furthermore, in the traditional von Neumann architecture, the storage process and the computation process are separated, and the "memory wall" has become a bottleneck for improving computing performance. The architecture designed in this invention uses in-memory computing, which supports direct computation on data in the storage module and feeds only the final result back to the processor, thereby greatly reducing the time and energy spent on data transmission over the bus and greatly improving computing throughput and energy efficiency.

Attached Figure Description

[0021] Figure 1 is a schematic diagram of the overall architecture of the multi-core in-memory processor.

[0022] Figure 2 is a schematic diagram of the Rocket Tile module for in-memory processing and its data interaction with the memory module.

[0023] Figure 3 is a diagram of the internal structure of the in-memory computing core (CIM Core).

[0024] Figure 4 illustrates the process of unfolding convolutional kernel data into a matrix.

[0025] Figure 5 shows the mapping for convolutional layers.

[0026] Figure 6 illustrates the process of unfolding the convolutional kernel data of a fully connected layer into a matrix.

[0027] Figure 7 shows the mapping for fully connected layers.

[0028] Figure 8 shows the power consumption percentage of each module when this architecture is deployed on the ZCU102 evaluation board.

Detailed Implementation

[0029] The present invention will be further described and illustrated below with reference to specific embodiments. The embodiments described are merely examples of the content of this disclosure and do not limit the scope of the invention. The technical features of each embodiment in the present invention can be combined accordingly, provided that there is no mutual conflict.

[0030] The multi-core in-memory processor architecture of this invention is based on Rocket Chip, an open-source SoC generator developed using Chisel. It includes a module library consisting of cores, caches, and interconnects, upon which a complete SoC can be constructed and synthesizable RTL code can be generated. It offers highly flexible parametric design, allowing for customization to specific application scenarios. By changing just one configuration, we can obtain SoCs of vastly different sizes, ranging from embedded microprocessors to multi-core server chips.

[0031] As shown in Figure 1, the multi-core in-memory processor architecture includes a system bus, memory modules, a front-side bus, a peripheral bus, a control bus, and Rocket Tile modules. The system bus is used to transmit data between the Rocket Tile modules, memory modules, front-side bus, peripheral bus, and control bus on the in-memory processor architecture. The system bus includes a data bus and an address bus. The data bus carries the data, the address bus determines where the data is sent, and the control and data transmission processes of each module are implemented through instructions from the control bus.

[0032] The Rocket Tile module includes the Rocket Core and the Rocket coprocessor (RoCC). The Rocket Core executes sequentially, continuously running a specific program until completion to avoid resource waste. The RoCC coprocessor has a configurable internal structure. The memory module stores the input/output data of the Rocket Tile module and interacts with it via the system bus. The front-side bus (FSB), or external data bus, is the channel that interconnects the CPU with the outside world for data, commands, addresses, and control signals. The peripheral bus connects other peripherals, such as network cards and block devices, and can interact with an external PC through various interface communication protocols. The control bus includes: the Boot ROM, which loads the bootloader upon power-on or reset and also contains a Device Tree to identify connected peripherals; the CLINT, which provides software interrupts and timer interrupts for each CPU; the PLIC, which aggregates and masks device interrupts and external interrupts; and a Debug Unit, which can be used with an external control chip to load data and instructions into memory or retrieve data from it. The Debug Unit can be controlled via a custom DMI (Debug Module Interface) or the standard JTAG protocol. All of the above modules are controlled by the system bus and together constitute the processor architecture.

[0033] Figure 2 shows the internal structure of the configurable coprocessor module (RoCC) within the Rocket Tile module and its data interaction with memory. The system bus controls the data interaction and internal computation of the RoCC coprocessor module by controlling the Rocket Core. Data from the RoCC coprocessor module can be input to and output from the memory module directly via DMA (Direct Memory Access), without needing to be transferred through the bus. DMA is an interface technology that allows external devices to exchange data directly with system memory without going through the CPU, which solves the problem of batch data input and output. The input activations, convolutional kernel weights, and outputs of the neural network are all stored in the memory module.

[0034] The RoCC coprocessor module internally includes an input/output (I/O) module, an input buffer module, a weight buffer module, a decoding and logic control module, and a configurable number of in-memory compute cores (CIM Cores). The neural network's input activation and weight data are stored in the memory module. The data interaction between RoCC and the memory module is controlled by the Rocket Core module. The system bus sends instructions to the Rocket Core, which, based on these instructions, controls the RoCC module to read the input activation and weight data from the memory module and store it in the input and weight buffer modules, and then outputs the calculated data back to the memory module. The CIM Cores support both storage and computation modes. Before computation begins, they operate in storage mode by default. The logic control module pre-stores the weight data stored in the weight buffer module into the CIM Core according to a time sequence. After computation begins, the system bus sends computation instructions to the Rocket Core, which then controls the RoCC's internal CIM Cores to enter computation mode. In this mode, the logic control module controls the CIM Cores to stop storing data and instead read data from the input buffer module for computation. The decoding and logic control module can control the in-memory array to perform calculations based on external inputs, realize input activation and weight multiplication and addition operations, and output the results to the memory module.
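The data flow in this paragraph (memory to buffers, buffers to CIM Core, results back to memory) can be summarized as a short sequence. This is a simplified sketch under assumptions: the function `run_rocc_pass` and the dictionary keys `"activations"`, `"weights"`, and `"outputs"` are hypothetical names, a single core is modeled, and DMA is represented by ordinary list copies.

```python
def run_rocc_pass(memory):
    """Model one pass: load buffers, pre-store weights, compute, write results back."""
    # Step 1: RoCC reads activations and weights from the memory module into buffers.
    input_buffer = list(memory["activations"])
    weight_buffer = [list(row) for row in memory["weights"]]

    # Step 2 (storage mode): weights are pre-stored row by row, in time order.
    sram = []
    for row in weight_buffer:
        sram.append(row)

    # Step 3 (computation mode): each stored row is multiplied and added
    # with the same activations taken from the input buffer.
    results = [sum(w * a for w, a in zip(row, input_buffer)) for row in sram]

    # Step 4: results are written back to the memory module.
    memory["outputs"] = results
    return memory


memory = {"activations": [1, 2], "weights": [[3, 4], [5, 6]]}
print(run_rocc_pass(memory)["outputs"])  # [11, 17]
```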

[0035] As shown in Figure 3, the internal structure of the CIM Core includes an n×(m×8b) SRAM array, a row decoding circuit, a multiplication module, an addition module, and an accumulation control module. The SRAM array stores weight data. The row decoding circuit receives address data and stores the weight data in a specific row of the SRAM array. The multiplication module multiplies each 8-bit activation by the corresponding 8-bit weight. The addition module performs a two-stage addition of the results from the multiplication module, three inputs at a time. The accumulation control module accumulates and outputs the results from the addition module according to the control logic of the external logic control module. In this invention, the weight data is stored in the SRAM array, and the input activation is supplied externally. Each multiply-add operation multiplies the data in one row of the SRAM array by the externally supplied activation and performs a two-stage, three-input addition; the accumulation control module then accumulates and outputs the data according to the control logic of the external logic control module.

[0036] The neural network mapping process can be illustrated using convolutional layers and fully connected layers as examples. The process of converting convolutional kernel data into a matrix is shown in Figure 4: each convolutional kernel is unfolded into a row vector along the channel direction. The data mapping and calculation process of the convolutional layer is shown in Figure 5, where f_i represents the i-th convolutional kernel, X_j represents the input corresponding to the j-th sliding window, and each output entry represents the data of the n-th channel at the m-th position in the output feature map. Taking a convolutional layer of size 3×3×3×64 (64 kernels of size 3×3×3) as an example, the kernel weights are mapped to the rows of the SRAM array in channel order. Using a CIM Core with a 256×288 SRAM array as an example, each row can store 288 8-bit values. Since one 3×3×3 kernel has 27 weights, each row can hold 10 complete kernels, and the 64 kernels require 7 rows. The calculation follows the sliding-window order: one sliding window is computed before the next. Within the same sliding window, every convolutional kernel receives the same input, so the input is copied multiple times across a single row. In each cycle, the input data is multiplied and added with one row of the SRAM array. In the next cycle, the input data is multiplied and added with the weight data stored in the next row, and so on, until the 7th cycle. At this point, all weight data has been multiplied and added with the input data. The input data remains unchanged across these cycles, which achieves input data reuse. The accumulation control module accumulates and outputs the results according to the convolutional kernel size. Each sliding-window calculation yields the data of all channels at the corresponding position of the output feature map. For the next sliding window, only the input data needs to change, which achieves weight reuse. Multiple sliding windows can be configured to be computed simultaneously, depending on the number of CIM Cores.
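The row-count arithmetic and the cycle-by-cycle reuse of one input window can be checked with a short sketch. The helper names (`conv_rows_needed`, `pack_kernels`, `window_outputs`) are hypothetical, and zero-padding the unused columns of a row is an assumption; the source only states that 10 kernels fit in a 288-wide row and that 64 kernels need 7 rows.

```python
import math


def conv_rows_needed(kernel_elems, num_kernels, row_width):
    """Return (complete kernels per SRAM row, SRAM rows needed)."""
    kernels_per_row = row_width // kernel_elems
    if kernels_per_row == 0:
        raise ValueError("one kernel does not fit in a single row")
    return kernels_per_row, math.ceil(num_kernels / kernels_per_row)


def pack_kernels(kernels, row_width):
    """Pack flattened kernels into SRAM rows, zero-padding unused columns (assumed)."""
    per_row, _ = conv_rows_needed(len(kernels[0]), len(kernels), row_width)
    rows = []
    for start in range(0, len(kernels), per_row):
        row = [w for kernel in kernels[start:start + per_row] for w in kernel]
        rows.append(row + [0] * (row_width - len(row)))
    return rows


def window_outputs(rows, window, num_kernels):
    """One cycle per SRAM row; the same flattened window is reused in every cycle."""
    elems = len(window)
    per_row = len(rows[0]) // elems
    outputs = []
    for row in rows:                      # cycle 1, cycle 2, ...
        for slot in range(per_row):       # the window is replicated across the row
            segment = row[slot * elems:(slot + 1) * elems]
            outputs.append(sum(w * x for w, x in zip(segment, window)))
    return outputs[:num_kernels]


print(conv_rows_needed(27, 64, 288))      # (10, 7)

# Toy example: five 4-element kernels in a 10-wide row (2 kernels per row, 3 rows).
kernels = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
rows = pack_kernels(kernels, row_width=10)
print(window_outputs(rows, [2, 3, 5, 7], len(kernels)))  # [2, 3, 5, 7, 17]
```

Moving to the next sliding window changes only the `window` argument while `rows` stays fixed, which is the weight reuse described in the paragraph.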

[0037] The process of converting the kernel data of a fully connected layer into a matrix is shown in Figure 6: each convolutional kernel is expanded into a row vector. As shown in Figure 7, the mapping process for fully connected layers is similar to that of convolutional layers. The input activation is 1×1×IC (IC being the number of input channels), the output data size is 1×1×OC (OC being the number of output channels), the convolution kernel size is 1×1×IC, and the number of convolution kernels is OC. The data of each convolution kernel is stored in one row of an SRAM array. Taking a CIM Core with a 256×288 SRAM array as an example, if a convolution kernel contains more than 288 values, multiple CIM Cores need to be configured in the column direction until all data is stored. Different convolution kernels are stored in separate rows. If the number of output channels exceeds 256, multiple CIM Cores need to be configured in the row direction until all data is stored.

[0038] As shown in Figure 7, f_{i,j} represents the weight at the i-th row and j-th column of the SRAM array in a given in-memory core, X_k represents the k-th data point of the fully connected layer input, each output entry represents the m-th channel of the output feature map, and each partial sum is obtained by multiplying the inputs X_n to X_l by the corresponding weights and adding the products. Taking a fully connected layer of size 4096×4096 as an example, each convolutional kernel is 4096×1 in size. Since one SRAM row holds 288 values, 15 CIM Cores are needed in the column direction; with 4096 output channels and 256 rows per core, 16 CIM Cores are needed in the row direction. Together they form a CIM Core array. The CIM Cores in the same column of the core array receive the same input activation data. Along each row of the core array, the input activation of the fully connected layer is split and fed to different cores. The outputs of the cores in each row must be accumulated externally in each calculation cycle to finally obtain the values of the different channels of the output feature map.
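The core counts for the 4096×4096 example, and the external accumulation of partial sums, can be reproduced with a small tiling sketch. The names `fc_core_grid` and `fc_tiled` are hypothetical, and the loop order is an arbitrary choice for illustration; only the 256×288 array size and the resulting counts of 15 and 16 come from the text.

```python
import math


def fc_core_grid(in_features, out_features, rows=256, cols=288):
    """Return (cores needed to split the input, cores needed to split the outputs)."""
    return math.ceil(in_features / cols), math.ceil(out_features / rows)


def fc_tiled(weights, x, rows, cols):
    """Compute y = W x by tiles of rows x cols, accumulating partial sums externally.

    weights[o][i] is the weight for output channel o and input element i.
    Each (o0, i0) tile stands for the SRAM array of one CIM Core.
    """
    out_features, in_features = len(weights), len(weights[0])
    y = [0] * out_features
    for o0 in range(0, out_features, rows):          # cores covering output channels
        for i0 in range(0, in_features, cols):       # cores covering the split input
            for o in range(o0, min(o0 + rows, out_features)):
                partial = sum(weights[o][i] * x[i]
                              for i in range(i0, min(i0 + cols, in_features)))
                y[o] += partial                      # external accumulation
    return y


print(fc_core_grid(4096, 4096))  # (15, 16)

# Toy check with 2x2 tiles: the tiled result equals the direct matrix-vector product.
W = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
x = [1, 0, 2, 0, 1]
print(fc_tiled(W, x, rows=2, cols=2))  # [12, 32, 52]
```

For the 4096×4096 layer this gives a 15 × 16 grid, that is, 240 cores if the whole layer is held at once.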

[0039] The in-memory processor architecture was further mapped onto an FPGA, with the number of in-memory cores (CIM Cores) inside the RoCC fixed at 8. Deployment and testing were conducted on a ZCU102 evaluation board, using 45,608 lookup tables (LUTs), 25,529 registers (FFs), and 198 BRAMs, corresponding to hardware resource utilization rates on this board of 16.64%, 4.66%, and 21.71%, respectively. Power consumption was tested at 20 MHz, with a total power consumption of 1.021 W, a maximum throughput of 543 GOPS, and an energy efficiency of 186 GOPS/W. The power consumption percentage of each module is shown in Figure 8: static power consumption is 0.652 W, accounting for 64%, and dynamic power consumption is 0.369 W, accounting for 36%. Dynamic power consumption can be further divided into clocks (Clocks), signals (Signals), logic units (Logic), block RAM (BRAM), digital signal processing modules (DSP), and input/output (I/O). Specific power consumption values and percentages are detailed in Figure 8.
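The utilization and power percentages can be cross-checked with simple arithmetic. The device totals below (274,080 LUTs, 548,160 flip-flops, 912 BRAM blocks) are not stated in the patent; they are assumed from the published figures for the XCZU9EG device on the ZCU102 board and should be verified against the device data sheet. The function names are hypothetical.

```python
# Assumed XCZU9EG (ZCU102) device totals; not given in the patent text.
ZCU102_TOTALS = {"LUT": 274_080, "FF": 548_160, "BRAM": 912}

# Resource usage reported in the patent.
USED = {"LUT": 45_608, "FF": 25_529, "BRAM": 198}


def utilization(used, totals):
    """Percentage of each resource type in use, rounded to two decimals."""
    return {name: round(100 * used[name] / totals[name], 2) for name in used}


def power_split(static_w, dynamic_w):
    """Return (total watts, static share in %, dynamic share in %)."""
    total = static_w + dynamic_w
    return round(total, 3), round(100 * static_w / total), round(100 * dynamic_w / total)


print(utilization(USED, ZCU102_TOTALS))  # {'LUT': 16.64, 'FF': 4.66, 'BRAM': 21.71}
print(power_split(0.652, 0.369))         # (1.021, 64, 36)
```

Under these assumed totals, 16.64% corresponds to LUTs, 4.66% to registers, and 21.71% to BRAMs, and the static and dynamic figures add up to the reported 1.021 W.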

[0040] The above-described embodiments are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. Those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention.

Claims

1. A multi-core in-memory processor device, comprising a system bus, a memory module, a front-side bus, a peripheral bus, and a control bus, characterized in that it also includes a Rocket Tile module; the Rocket Tile module is used to configure the in-memory computing process on the in-memory computing device and includes a Rocket Core and a Rocket coprocessor RoCC; the Rocket Core is used to control the RoCC module to interact with the memory module or to control the internal computing core of the RoCC to enter computing mode according to different instructions; the Rocket coprocessor RoCC includes an input/output (I/O) module, an input buffer module, a weight buffer module, a decoding and logic control module, and several in-memory computing core modules (CIM Cores); the Rocket coprocessor RoCC is used to configure its internal input buffer module, weight buffer module, decoding and logic control module, and different numbers of in-memory computing core modules (CIM Cores) to complete the data storage and calculation process; the input/output (I/O) module is used for data interaction between the RoCC module and other modules; the input buffer module is used to store input activation data read from the memory module; the weight buffer module is used to store weight data read from the memory module; the decoding and logic control module is used to pre-store the weight data held in the weight buffer module into the in-memory computing core (CIM Core) in time sequence, to control the in-memory computing core array to perform calculations, and to process and output the calculation results of the in-memory computing array; the in-memory computing core module (CIM Core) is used to implement the multiply-add operation on input activations and weights; the in-memory computing core (CIM Core) includes an n×(m×8b) SRAM array, a row decoding circuit, a multiplication module, an addition module, and an accumulation control module; the SRAM array is used to store weight data, and the row decoding circuit is used to receive address data and store the weight data in a specific row of the SRAM array; the multiplication module is used to multiply each 8-bit activation value by the corresponding 8-bit weight; the addition module is used to add the results of the multiplication module three at a time, over two levels; the accumulation control module is used to accumulate and output the results of the addition module according to the control logic of the external logic control module; the weight data is stored in the SRAM array, and the input activation is obtained from external input; in each multiply-add operation, the data in one row of the SRAM array is multiplied by the externally input activation, and a two-level, three-input addition is performed; the accumulation control module accumulates and outputs data according to the control logic of the external logic control module.

2. The multi-core in-memory processor device according to claim 1, characterized in that the system bus is used to transmit data between the Rocket Tile module, memory module, front-side bus, peripheral bus, and control bus on the in-memory computing device; the system bus includes a data bus and an address bus; the data bus is used to carry data, and the address bus determines where the data is sent; the control and data transmission processes of each module are realized through the instructions of the control bus.

3. The multi-core in-memory processor device according to claim 1, characterized in that the memory module acts as a cache to store the input and output data of the Rocket Tile module; the front-side bus interconnects with the outside world for data, command, address, and control signals; the peripheral bus is used to connect peripherals; and the control bus is used to transmit control signals and timing signals.

4. The multi-core in-memory processor device according to claim 1, characterized in that the data from the Rocket coprocessor (RoCC) interacts directly with the memory module via DMA.

5. The multi-core in-memory processor device according to claim 1, characterized in that the in-memory computing core (CIM Core) supports both storage and computation modes; before computation begins, it operates in storage mode by default; the logic control module pre-stores the weight data held in the weight cache module into the CIM Core in time sequence; after computation begins, the system bus sends a computation command to the Rocket Core; the Rocket Core module controls the in-memory computing core inside the RoCC to enter computation mode; at this time, the logic control module controls the in-memory computing core to no longer perform storage operations but instead to read the data in the input cache module for computation; the decoding and logic control module can control the in-memory computing core array to perform computation based on external input, realize the multiply-add operation on input activations and weights, and output the results to the memory module.