Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

Status: Pending
Publication Date: 2020-06-11
SAMSUNG ELECTRONICS CO LTD
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Benefits of technology

The patent describes a new type of chip that speeds up tensor computations. The chip places an array of multiply-and-add units next to a DRAM bank, so data can be processed close to where it is stored. By chaining these units into a pipeline, the chip can quickly accumulate partial results, and it can also perform a partial matrix transposition. This is useful for applications like machine learning and image processing; in short, the chip makes complicated calculations faster and more efficient.
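A minimal sketch of that accumulation idea, assuming nothing beyond the pipelined chain described here (names such as MacUnit and chained_dot are illustrative, not from the patent):

class MacUnit:
    """One multiply-and-add unit holding a single weight."""
    def __init__(self, weight: float):
        self.weight = weight

    def step(self, activation: float, partial_in: float) -> float:
        # Partial output = incoming partial sum + local multiply.
        return partial_in + self.weight * activation

def chained_dot(weights: list[float], activations: list[float]) -> float:
    """Feed each unit's partial output into the next unit in the chain."""
    chain = [MacUnit(w) for w in weights]
    partial = 0.0
    for unit, a in zip(chain, activations):
        partial = unit.step(a, partial)  # accumulate along the chain
    return partial

print(chained_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 1*4 + 2*5 + 3*6 = 32.0

In hardware the chain is pipelined, so a new input can enter every cycle while earlier partial sums are still moving down the chain; this loop serializes the process only for clarity.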

Problems solved by technology

Compute-centric accelerators for tensor computation suffer from a “memory wall” issue: computation performance scales much faster than memory bandwidth and latency, and off-chip data movement consumes two orders of magnitude more energy than a floating point operation.
Conventional processing-in-memory (PIM) approaches mainly target deep learning inference, which can tolerate reduced precision, but they are generally incapable of the complex floating point operations that training requires.
In near-data-processing (NDP) approaches, the number of arithmetic logic units (ALUs) is strictly limited by the area budget, and NDP designs also lose a significant amount of internal bandwidth compared with compute-centric architectures.
These shortcomings make NDP architectures less effective in floating point performance than compute-centric approaches.
Moreover, simply adding floating point units to satisfy the computational demands of tensor processing results in significant and unacceptable area overhead in a DRAM die.
In addition, emerging non-volatile memory based accelerators suffer from poor write endurance and long write latency, making them unsuitable for write-intensive deep learning training tasks.
Furthermore, static random-access memory (SRAM)-based accelerators do not have enough on-chip memory capacity to store all of the model parameters and intermediate results needed for deep learning training.


Embodiment Construction

[0033]Reference will now be made in detail to embodiments of the inventive concept, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth to enable a thorough understanding of the inventive concept. It should be understood, however, that persons having ordinary skill in the art may practice the inventive concept without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

[0034]It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first stack could be termed a second stack, and, similarly, a second stack could be termed a first stack, without departin...



Abstract

A tensor computation dataflow accelerator semiconductor circuit is disclosed. The dataflow accelerator includes a DRAM bank and a peripheral array of multiply-and-add units disposed adjacent to the DRAM bank. The peripheral array of multiply-and-add units is configured to form a pipelined dataflow chain in which partial output data from one multiply-and-add unit in the array is fed into another multiply-and-add unit in the array for data accumulation. Near-DRAM-processing dataflow (NDP-DF) accelerator unit dies may be stacked atop a base die. The base die may be disposed on a passive silicon interposer adjacent to a processor or a controller. The NDP-DF accelerator units may process partial matrix output data in parallel. The partial matrix output data may be propagated in a forward or backward direction. The tensor computation dataflow accelerator may perform a partial matrix transposition.
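One way to picture the parallel partial-matrix processing is the sketch below, where the inner (K) dimension of a matrix multiplication is split across stacked units and each unit's partial output matrix is accumulated as it propagates; the K-dimension split and the name ndp_df_gemm are assumptions for illustration, not details from the abstract.

import numpy as np

def ndp_df_gemm(A, B, num_units):
    # Split the inner (K) dimension across the stacked units; each unit
    # computes one partial output matrix for its slice of K.
    K = A.shape[1]
    partial = np.zeros((A.shape[0], B.shape[1]))
    for ks in np.array_split(np.arange(K), num_units):
        partial = partial + A[:, ks] @ B[ks, :]  # accumulate while propagating
    return partial

A = np.random.rand(4, 6)
B = np.random.rand(6, 3)
assert np.allclose(ndp_df_gemm(A, B, num_units=3), A @ B)

Running the accumulation in the reverse order corresponds to the backward propagation direction the abstract mentions, and transposing each unit's slice locally is one way a partial matrix transposition could be realized.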

Description

RELATED APPLICATION DATA
[0001] This application claims the benefit of U.S. Patent Application Ser. No. 62/777,046, filed Dec. 7, 2018, which is hereby incorporated by reference.
BACKGROUND
[0002] The present inventive concepts relate to deep learning, and more particularly, to a dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning.
[0003] Deep neural networks are considered a promising approach to realizing artificial intelligence, and have demonstrated their effectiveness in a number of applications. Training deep neural networks requires both high precision and wide dynamic range, which demand efficient floating point operations. Tensor computation, which includes the majority of floating point operations and contributes the most time in training deep neural networks, is a key primitive operation for acceleration. Compute-centric accelerators for tensor computation suffer from a “memory wall” issue, since computation perfor...
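The memory-wall argument can be made concrete with back-of-the-envelope arithmetic: a general matrix-matrix multiplication of an M-by-K matrix with a K-by-N matrix performs 2*M*N*K floating point operations but, absent on-chip reuse, moves (M*K + K*N + M*N) values. The sketch below works through that standard calculation; the sizes and byte width are illustrative assumptions, not figures from the patent.

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=4):
    # FLOPs per byte moved for C = A @ B with no operand reuse on chip.
    flops = 2 * M * N * K
    bytes_moved = (M * K + K * N + M * N) * bytes_per_elem
    return flops / bytes_moved

print(gemm_arithmetic_intensity(64, 64, 64))        # ~10.7 FLOPs/byte
print(gemm_arithmetic_intensity(4096, 4096, 4096))  # ~682.7 FLOPs/byte

When this ratio falls below a device's compute-to-bandwidth ratio, memory bandwidth rather than arithmetic bounds throughput, which is the motivation for placing the multiply-and-add units next to the DRAM bank.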

Claims


Application Information

IPC(8): G06F 12/0802; G06F 17/16
CPC: G06F 2212/1036; G06F 17/16; G06F 2212/1024; G06F 12/0802; G06F 2212/22; G06N 3/063; G06N 3/084; G06F 9/3867; G06N 3/045; G06F 15/7821; G06F 12/0207; G06F 12/0292; G06F 15/8046; G06N 3/08; H10B 12/00; G06F 12/0877; G06N 3/008
Inventors: GU, PENG; MALLADI, KRISHNA; ZHENG, HONGZHONG; NIU, DIMIN
Owner SAMSUNG ELECTRONICS CO LTD