Memory management for overlapping data between blocks of a neural network
By introducing overlapping data buffers into the accelerator circuit, the problem of excessive consumption of computing resources and bandwidth between neural network layers is solved, achieving more efficient computing utilization and bandwidth management, and improving the performance of deep learning inference.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2021-06-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies consume excessive computing resources and system bandwidth when processing overlapping data between neural network layers, especially when multiple layers are linked: traditional methods over-fetch halo data and waste computing resources recomputing it.
By employing overlapping data buffer technology, an auxiliary buffer is used in the accelerator circuit to store overlapping data between neural network layers, reducing the need for reacquiring halo data and the consumption of computational resources, thereby optimizing computational utilization and system bandwidth consumption.
It improves computational utilization, reduces system bandwidth consumption, and enhances the efficiency of deep learning inference, especially in convolution and pooling operations.
Figure CN115867921B_ABST
Abstract
Description
Technical Field
[0001] At least one embodiment relates to processing resources for performing and facilitating artificial intelligence. For example, at least one embodiment relates to an auxiliary buffer for overlapping data between blocks of linked neural network layers.
Background Technology
[0002] In many cases, much of the computational work in deep learning inference is based on mathematical operations, which can generally be divided into four parts: convolution, activation, pooling, and normalization. These operations share some common characteristics that make them particularly well-suited for hardware implementation: their memory access patterns are predictable and they are easily parallelized.
Attached Figure Description
[0003] Figure 1 is a block diagram of an accelerator core having an overlapping data buffer for tiling between linked layers executed by fixed-function circuitry, according to at least some embodiments;
[0004] Figure 2A is a diagram illustrating a persistent weight option according to at least one embodiment;
[0005] Figure 2B is a diagram illustrating a persistent feature option according to at least one embodiment;
[0006] Figure 3 is a diagram illustrating an accelerator circuit with two linked hardware layers, which uses an auxiliary buffer for overlapping data between two paths, according to at least some embodiments;
[0007] Figure 4 is a diagram illustrating an accelerator circuit with four linked hardware layers, which uses overlapping data buffers between three paths, according to at least some embodiments;
[0008] Figure 5 is a diagram illustrating two convolutional layers that use hardware instructions to store and retrieve overlapping data of blocks between paths, according to at least some embodiments;
[0009] Figure 6 is a flowchart of a method for identifying a portion of an output block and storing it in an auxiliary buffer, according to at least some embodiments;
[0010] Figure 7 is a block diagram of a deep learning accelerator (DLA) system according to at least some embodiments; and
[0011] Figure 8 is a block diagram of a DLA system according to at least some embodiments.
Detailed Implementation
[0012] As mentioned above, deep learning inference is based on operations that are well-suited for hardware implementation. Deep learning accelerator (DLA) circuits, such as the NVIDIA® Deep Learning Accelerator (NVDLA), can be used to meet the computational demands of inference by providing building blocks that accelerate core deep learning operations. Deep learning accelerators can be used to accelerate various neural networks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), fully connected neural networks, etc. These neural networks may have very different architectures, may not follow any predefined network structures, and new neural networks are being introduced regularly.
[0013] Currently, DLA circuits use fixed-function engines (also referred to herein as fixed-function units, fixed-function circuits, or computational units) for different types of layers in these neural networks, such as fixed-function engines for convolution, activation functions, pooling, batch normalization, etc. Each layer can be a basic hardware instruction that directs a fixed-function engine to perform an operation, and each layer communicates with another layer through a memory interface. For example, the first layer can be executed by a first fixed-function engine that receives an input tensor, performs operations on the input tensor to generate an output tensor, and stores the output tensor in system memory, such as dynamic random access memory (DRAM) coupled to the accelerator. The second layer can be executed by a second fixed-function engine that receives the output tensor from the first layer as a second input tensor, performs operations on the second input tensor to generate a second output tensor, and stores the second output tensor in DRAM. Each communication introduces a tensor read operation and a tensor write operation in the memory interface.
[0014] Linking (chaining) is a mechanism that utilizes the internal memory of an accelerator, such as internal static random access memory (SRAM). In linking, intermediate tensors are written to the internal SRAM by the current layer, and subsequent layers fetch data from the internal SRAM. Using linking reduces memory interface traffic, thereby improving performance and power efficiency. In the best case, all layers of the network can be linked together, leaving external memory accesses only for the input of the first layer and the output of the last layer. The compiler can use a linking algorithm to determine how to link the layers to utilize the internal SRAM. Linking algorithms consider the following items: persistent weight or persistent feature options, when to terminate linking, convolution buffer allocation, and feature crossing (e.g., linking plus batching, linking plus weight prefetching, etc.).
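The effect of linking on memory interface traffic can be illustrated with a minimal sketch. It assumes a simplified model in which each group of linked layers costs one tensor read and one tensor write on the external memory interface, and it ignores weight traffic. The function name `dram_tensor_accesses` and the layer names are hypothetical and are used only for illustration.

```python
def dram_tensor_accesses(chains):
    """Count feature-tensor reads and writes on the external memory interface.

    `chains` is a list of groups of linked layers. Within a group, intermediate
    tensors stay in internal SRAM, so only the first input is read from DRAM
    and only the last output is written to DRAM (simplifying assumption).
    """
    reads = 0
    writes = 0
    for chain in chains:
        if not chain:
            continue
        reads += 1   # input of the first layer in the chain
        writes += 1  # output of the last layer in the chain
    return reads, writes


layers = ["conv0", "act0", "pool0", "conv1"]

no_linking = dram_tensor_accesses([[name] for name in layers])
partial_linking = dram_tensor_accesses([layers[:2], layers[2:]])
full_linking = dram_tensor_accesses([layers])

print(no_linking, partial_linking, full_linking)
```

Under this model, four unlinked layers need four tensor reads and four tensor writes, while a single chain of the same four layers needs one of each, which corresponds to the best case described above.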
[0015] To avoid an imbalance between processing-unit throughput and memory bandwidth, data can reside in the accelerator's internal SRAM. Tiling is a popular technique when the input feature data is too large to fit in the internal SRAM. For example, each layer can be divided into N tiles corresponding to N passes. A pass is defined as a set of hardware layers communicating via SRAM; if a chain is divided into N tiles, then that chain contains N passes. For example, suppose a set of linked hardware layers (hardware layer instructions) is divided into N tiles. In this case, the set of linked hardware layers (hardware layer instructions) has N passes, and each hardware layer is executed N times in a linked manner with other hardware layers. It should be noted that from an algorithmic perspective, a neural network can be specified using a set of layers (referred to herein as "raw layers"), such as bias and batch normalization. Those raw layers can be compiled or transformed into another set of layers (referred to herein as "hardware layers"), where each hardware layer serves as a basic element for scheduling execution on the accelerator circuitry. The mapping between raw layers and hardware layers can be m:n, where m is the number of raw layers and n is the number of hardware layers. For example, the raw layers bias, batch normalization, and local response normalization (LRN), such as the rectified linear unit (ReLU), in a neural network can be compiled into a single hardware layer. In this case, the m:n ratio is 3:1. Each hardware layer can be represented by a basic hardware instruction, which is executed by one of the fixed-function engines, and each layer communicates with another layer through a memory interface. 
For example, the first layer can be executed by a first fixed-function engine that receives an input tensor, performs operations on the input tensor to generate an output tensor, and stores the output tensor in system memory, such as dynamic random-access memory (DRAM) coupled to the accelerator. The second layer can be executed by a second fixed-function engine that receives the output tensor from the first layer in memory as a second input tensor, performs operations on the second input tensor to generate a second output tensor, and stores the second output tensor in DRAM. Each communication introduces a tensor read operation and a tensor write operation in the memory interface.
[0016] Convolution and pooling are common operators in neural networks. These operators take a window of input pixels to produce a single output pixel. If tiling is used, adjacent blocks (tiles) share overlapping data. This overlapping data is called a halo. For single-layer operations, a halo is fetched for each block, thus consuming additional bandwidth. As mentioned above, when multiple layers are linked together, traditional methods recompute the halo for each block, wasting computational resources. Traditional methods typically either over-fetch halos to handle single-layer halos or over-compute to handle multiple-layer halos.
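The size of a halo can be worked out from the window size and the stride. The following is a minimal one-dimensional sketch, assuming no padding and no dilation; the helper names `input_range` and `halo_between` are hypothetical and do not come from the disclosure.

```python
def input_range(out_start, out_stop, kernel, stride=1):
    """Input index range [start, stop) needed to produce outputs [out_start, out_stop)."""
    return out_start * stride, (out_stop - 1) * stride + kernel


def halo_between(tile_a, tile_b, kernel, stride=1):
    """Input range shared by two output tiles, or None if they share nothing."""
    a_start, a_stop = input_range(tile_a[0], tile_a[1], kernel, stride)
    b_start, b_stop = input_range(tile_b[0], tile_b[1], kernel, stride)
    lo, hi = max(a_start, b_start), min(a_stop, b_stop)
    return (lo, hi) if lo < hi else None


# Two adjacent output tiles of four elements each.
tile0, tile1 = (0, 4), (4, 8)

conv_halo = halo_between(tile0, tile1, kernel=3, stride=1)  # 3-wide window, stride 1
pool_halo = halo_between(tile0, tile1, kernel=2, stride=2)  # non-overlapping 2-wide pooling

print(conv_halo, pool_halo)
```

For a 3-wide window with stride 1, the two tiles share input elements 4 and 5, so the halo is kernel minus stride elements wide. For a window whose size equals its stride, the tiles share no input and there is no halo.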
[0017] The aspects and embodiments of this disclosure address these and other challenges by providing, for example, overlapping data buffers that store portions of blocks between passes (also referred to herein as paths or pathways) of linked layers in a neural network. An accelerator circuit includes at least one processing unit for executing instructions corresponding to the linked layers in multiple passes. In a first pass, the at least one processing unit receives a first input block of an input feature map from a main buffer and performs a first operation on the first input block to obtain a first output block. The processing unit stores the first output block in the main buffer and identifies a portion of the first output block as corresponding to overlapping data between blocks of the input feature map. The processing unit stores this portion in an auxiliary buffer. In a second pass, the processing unit retrieves this portion from the auxiliary buffer, which avoids re-fetching the overlapping portion and recomputing the overlapping data (halo). Using the embodiments described herein, when multiple layers are linked together, halos are stored in auxiliary buffers between passes, thereby reducing the additional bandwidth consumed by re-fetching halos and the computational resources consumed by recomputing them. The required buffer size between passes is fixed in advance and can be determined during offline compilation. For example, the compiler may reserve a small SRAM region (also referred to herein as an auxiliary buffer, user-defined buffer (UBUF), or overlapping data buffer). The compiler can create instructions in one pass to output the halo to the reserved SRAM, and in a subsequent pass to retrieve the halo from the reserved SRAM into the tensor data SRAM. In at least one embodiment, the stride is carefully programmed to avoid contaminating valid tensor data. Aspects and embodiments of this disclosure can improve computational utilization while reducing system bandwidth consumption. 
For example, a large portion of deep learning workloads consists of convolution and pooling operations. Using aspects and embodiments of this disclosure in convolutional and pooling layers improves computational utilization while reducing system bandwidth consumption. Aspects and embodiments of this disclosure can be used in accelerator circuits, graphics processing units (GPUs), etc.
[0018] Figure 1 is a block diagram of an accelerator core 100 with an overlapping data buffer 102 according to at least some embodiments, the overlapping data buffer 102 being used for tiling between linked layers executed by fixed-function circuits 104-114 (also referred to herein as fixed-function engines). Accelerator core 100 (also referred to herein as a DLA core or accelerator circuitry) includes the overlapping data buffer 102 and various fixed-function circuits, such as a convolution engine 104 (also referred to herein as a convolution core); an activation engine 106 (also referred to herein as a single data processor (SDP)), which is a single-point lookup engine for activation functions; a pooling engine 108 (also referred to herein as a planar data processor (PDP)), which is a planar averaging engine for pooling; a local response normalization (LRN) engine 110 (also referred to herein as a cross-channel data processor (CDP)), which is a dedicated unit that applies an LRN function operating in the channel dimension rather than the spatial dimension; a data shaping engine 112 (also referred to herein as RUBIK), which performs data format conversions (e.g., splitting or slicing, merging, shrinking, shaping transfers); and a bridge direct memory access (DMA) engine 114, which can move data between system DRAM and a dedicated memory interface. Additional details of the overlapping data buffer 102 are described below. In other embodiments, accelerator core 100 may include more or fewer engines than those shown in Figure 1. Each of these engines can be separate and independently configurable. For example, a system that does not require pooling operations can remove the planar averaging engine entirely. As another example, a system requiring additional convolution performance can scale up the convolution engine without modifying other units in accelerator core 100.
[0019] As shown in Figure 1, the accelerator core 100 has multiple connections to the rest of the DLA system, including a configuration interface block 116, which includes a configuration space bus (CSB) interface and an interrupt interface. The configuration interface block 116 may be a control channel interface implementing a register file (e.g., configuration registers) and an interrupt interface (labeled CSB/interrupt interface 118). In at least one embodiment, the CSB interface is a synchronous, low-bandwidth, low-power, 32-bit control bus designed to be used by a central processing unit (CPU) (not shown in Figure 1) to access the configuration registers in configuration interface block 116. The interrupt interface can be a 1-bit level-driven interrupt. The interrupt line can be asserted when a task completes or an error occurs. Accelerator core 100 may also include a memory interface block 120 that interfaces with memory using one or more bus interfaces. In at least one embodiment, memory interface block 120 has a main memory interface 122 that connects to system memory (not shown in Figure 1). System memory may include DRAM. Main memory interface 122 may be shared with the CPU and input/output (I/O) peripherals. In at least one embodiment, main memory interface 122 is a data backbone (DBB) interface connecting accelerator core 100 and other memory subsystems. The DBB interface is a configurable data bus that can specify different address sizes, different data sizes, and issue requests of different sizes. In at least one embodiment, the DBB interface uses an interface protocol such as AXI (Advanced Extensible Interface) or other similar protocols. In at least one embodiment, memory interface block 120 has a second memory interface 124 that allows connection to higher-bandwidth memory dedicated to accelerator core 100 or to a computer vision subsystem. 
For example, second memory interface 124 may be used with on-chip SRAM to provide higher throughput and lower access latency.
[0020] Memory interface block 120 is coupled to each of the fixed-function circuits 104-114. A convolution buffer 126 may be used between memory interface block 120 and convolution engine 104 to avoid repeated access to system memory. Convolution buffer 126 may be internal RAM reserved for weights and input feature / pixel storage. In at least one embodiment, overlap data buffer 102 may be a reserved area of convolution buffer 126. Overlap data buffer 102 may be internal SRAM reserved for overlapping data storage between paths when tiling is used.
[0021] During the operation of accelerator core 100, the processing flow begins with a management processor (a microcontroller or CPU) coupled to accelerator core 100 sending the configuration of a hardware layer together with an activate command. If data dependencies do not preclude it (i.e., if the input of one layer does not depend on the output of another), multiple hardware layers can be sent to different engines and activated simultaneously. In at least one embodiment, each engine may have a double buffer for its configuration registers, allowing the configuration of a second layer to begin processing when the active first layer has completed. Once a hardware engine has completed its active task, configuration interface block 116 can interrupt the management processor to report completion, and the management processor can begin the process again. This command-execute-interrupt flow repeats until the inference of the entire network is complete.
[0022] Referring back to Figure 1, each of the fixed-function circuits 104-114 processes one compiled hardware layer of a neural network at a time, and the fixed-function circuits process different layer types of the neural network. In at least one embodiment, the first fixed-function circuit is any one of the convolution engine 104, activation engine 106, pooling engine 108, LRN engine 110, data shaping engine 112, or bridge DMA engine 114. Alternatively, the first fixed-function circuit may be another computing unit of the accelerator core 100 or a computing unit external to the accelerator core 100.
[0023] One technique for loading tensors into a local cache, namely the convolution buffer 126, is called tiling. Tiling divides a tensor into one or more blocks with pre-specified dimensions that can be loaded into the convolution buffer 126. Each block can be loaded one at a time from global memory into the convolution buffer 126 of the convolution engine 104, and convolution can be performed on the block. Although not shown in Figure 1, the other fixed-function circuits 106-114 can also access shared/cache memory. Each block can be loaded from global memory one at a time into the shared/cache memory for the processing unit to perform computations on the block. When used with algorithms based on General Matrix Multiplication (GEMM), the block-based technique may not require data copying.
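As an illustration of dividing a tensor into blocks with pre-specified dimensions, the following minimal sketch enumerates the blocks of a two-dimensional feature map. It is a simplified model, not the accelerator's actual tiling logic: the function name `split_into_tiles` is hypothetical, and blocks on the bottom and right edges are simply allowed to be smaller than the requested size.

```python
def split_into_tiles(height, width, tile_h, tile_w):
    """Return blocks as (row_start, row_stop, col_start, col_stop), stops exclusive."""
    tiles = []
    for row in range(0, height, tile_h):
        for col in range(0, width, tile_w):
            tiles.append((row, min(row + tile_h, height),
                          col, min(col + tile_w, width)))
    return tiles


# A 5 x 6 feature map divided into blocks of at most 2 x 4 elements.
tiles = split_into_tiles(height=5, width=6, tile_h=2, tile_w=4)
for tile in tiles:
    print(tile)
```

Each block can then be loaded into the local buffer and processed one at a time. This sketch produces non-overlapping blocks, so any halo needed by a windowed operator would have to be fetched in addition to the block itself.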
[0024] Processing units can access tensors to perform operations on them. One such operation is convolution, a computational operation used in deep learning applications. However, embodiments are not limited to convolution. Convolution is used in a layer of a convolutional neural network (CNN) to analyze images for machine learning applications such as image classification, object detection, image segmentation, etc. Convolution can be performed on convolutional layers of a CNN during inference and/or training. For example, a convolutional layer can apply a convolution function with a weighted filter to a window of elements (a receptive field) in the input tensor, where the receptive field corresponds to a location in the input tensor, to detect the presence of a feature at that location. Applying the filter with a stride across different locations in the input tensor generates activation maps (or feature maps), where the feature maps indicate the intensity of the detected features in the input tensor.
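The sliding-window behavior described above can be sketched in a few lines. This is a generic illustration rather than the disclosure's arithmetic framework: it assumes a single channel, no padding, and the cross-correlation form commonly used in deep learning. The function name `conv2d` and the sample data are hypothetical.

```python
def conv2d(image, kernel, stride=1):
    """Slide a weighted filter over a 2-D input and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            # Weighted sum over the receptive field at output position (r, c).
            sum(kernel[i][j] * image[r * stride + i][c * stride + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 4 x 4 input with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# A 2 x 2 filter that responds to a dark-to-bright transition from left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map is largest where the filter sits on the edge and zero elsewhere, which is the sense in which a feature map indicates the intensity of a detected feature at each location.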
[0025] In at least one embodiment, the arithmetic framework for convolution operations can be:
[0026] [equation not reproduced in this text]
[0027] where
[0028] Let [expression not reproduced in this text].
[0029] The convolution buffer 126 can be the primary buffer, and the overlapping data buffer 102 can be an auxiliary buffer. For example, if there are two layers and the input tensors are too large to be stored in the primary buffer (e.g., internal SRAM), one layer is split into three hardware instructions. There are two options for persisting data between layers, as described below with respect to Figures 2A-2B. The overlapping data buffer 102 is described below with respect to Figures 3-5.
[0030] Figure 2A is a diagram illustrating a persistent weight option 200 according to at least one implementation. For the persistent weight option 200 in a main buffer (e.g., internal SRAM), a first instruction 206 of the first layer 202 retrieves a weight from external memory (e.g., DRAM), performs a first operation using the weight, and stores the weight in the internal SRAM 204 (main buffer). The weight remains in the internal SRAM 204, and a second instruction 208 of the first layer 202 retrieves the weight from the internal SRAM 204 instead of the external memory. Similarly, a third instruction 210 of the first layer 202 retrieves the weight from the internal SRAM 204 instead of the external memory. A second layer 212 is linked to the first layer 202. The weight of the second layer 212 is retrieved from external memory (e.g., DRAM) and then stored in the internal SRAM 204, and the first, second, and third instructions 214, 216, and 218 of the second layer 212 retrieve the weight from the internal SRAM 204 instead of the external memory. In this implementation, there are two weight read accesses from external DRAM, two feature read accesses from external DRAM, and two feature write accesses to external DRAM.
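The weight traffic of the persistent weight option can be checked with a minimal sketch. It assumes, for illustration only, that the internal SRAM holds the weights of one layer at a time and that the instructions run in the order described above (all three instructions of the first layer, then all three of the second). The function name `count_weight_reads` is hypothetical.

```python
def count_weight_reads(instructions, persistent):
    """Count weight fetches from external DRAM for a sequence of (layer, tile) instructions."""
    weights_in_sram = None  # layer whose weights currently reside in internal SRAM
    dram_reads = 0
    for layer, _tile in instructions:
        if not persistent or weights_in_sram != layer:
            dram_reads += 1          # fetch this layer's weights from DRAM
            weights_in_sram = layer  # and keep them in internal SRAM
    return dram_reads


# Two layers, each split into three hardware instructions.
instructions = [(layer, tile) for layer in range(2) for tile in range(3)]

with_persistence = count_weight_reads(instructions, persistent=True)
without_persistence = count_weight_reads(instructions, persistent=False)
print(with_persistence, without_persistence)
```

With persistent weights, each layer's weights are read from DRAM once, giving the two weight read accesses stated above; without persistence, every instruction would fetch its weights again.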
[0031] Figure 2B is a diagram illustrating a persistent feature option 250 according to at least one implementation. For persistent feature option 250 in a main buffer (e.g., internal SRAM), a first instruction 256 (hw inst0) of the first layer 252 retrieves first feature data 240 from external memory (e.g., DRAM) and performs a first operation using the first feature data to obtain second feature data 254. The first instruction 256 stores the second feature data 254 in the main buffer (e.g., internal SRAM). Due to the halo property of the convolution/pooling operation, the compiler determines the overlap that can be used in the second path, including input feature map 270 (halo) and input feature map 274 (halo). In at least one embodiment, the compiler generates DMA instructions to trim the halos (input feature maps 270, 274). Input feature map 270 will be used by a second instruction 262 (hw inst2) of the second path to produce feature data 255. Similarly, input feature map 274 will be used by a second instruction 264 (hw inst3) of the second path to produce output feature data 244. Therefore, the compiler stores input feature map 270 (halo) and input feature map 274 (halo) into an auxiliary buffer. The auxiliary buffer can be a reserved area of SRAM. It is important to note that the auxiliary buffer can be a logical construct or a physical one. From a physical implementation perspective, it can be the same SRAM as the main buffer (i.e., a unified SRAM) or another level of SRAM. The first instruction 258 (hw inst1) of the second layer 260, linked to the first layer 252, retrieves the second feature data 254 from the main buffer instead of external memory, and uses the second feature data 254 to perform a second operation to obtain the third feature data 243. The first instruction 258 stores the third feature data 243 in external memory (e.g., DRAM). The third feature data 243 is part of the final result of the linked layers.
[0032] The second instruction 262 (hw inst2) of the first layer 252 retrieves third feature data 241 from external memory (e.g., DRAM) and performs a first operation using the third feature data 241 to obtain fourth feature data 255. Similar to the first instruction 256, the compiler generates DMA instructions to trim halos (e.g., 274, 276) and stores the halos in an auxiliary buffer for future use. The second instruction 262 stores the fourth feature data 255 in a main buffer (e.g., internal SRAM). The second instruction 264 (hw inst3) of the second layer 260 retrieves the fourth feature data 255 from the main buffer, retrieves the input feature map 274 (halo) from the auxiliary buffer, and performs a second operation using the fourth feature data 255 and the input feature map 274 to obtain fifth feature data 244. Assuming that the fifth feature data 244 is the final output of the linked layers, the second instruction 264 stores the fifth feature data 244 in external memory (e.g., DRAM). It should be noted that separate SRAMs need not be allocated for input feature maps 270, 272, 274, and 276. In at least one embodiment, input feature maps 270, 272, 274, and 276 can use the same memory region. For example, the lifetimes of two of these input feature maps may not overlap, so the same memory region can be used for both in a time-division multiplexing manner.
[0033] The third instruction 266 (hw inst4) of the first layer 252 retrieves the fifth feature data 242 from external memory (e.g., DRAM) and performs a first operation using the fifth feature data 242 to obtain the sixth feature data 256. The third instruction 266 stores the sixth feature data 256 in the main buffer (e.g., internal SRAM). Unlike the first instruction 256 and the second instruction 262, the third instruction 266 belongs to the last path and does not need to store halo data. The third instruction 268 (hw inst5) of the second layer 260 retrieves the sixth feature data 256 from the main buffer and the input feature map 276 (halo) from the auxiliary buffer, and performs a second operation using the sixth feature data 256 and the input feature map 276 to obtain the seventh feature data 245. The third instruction 268 stores the seventh feature data 245 in external memory (e.g., DRAM). After the third instructions 266 and 268, the entire output tensor has been computed.
[0034] Referring back to Figure 1, in at least one embodiment, the accelerator core 100 is a deep learning accelerator (DLA) core, which includes a register file to store configuration information associated with at least a portion of a neural network having multiple layers. The DLA core includes a memory interface (e.g., memory interface block 120) coupled to an external memory device (not shown in Figure 1), convolution buffer 126, and convolution engine 104. The convolution buffer 126 includes a reserved region serving as the overlapping data buffer 102 (shown as the hatched block within 126); alternatively, the overlapping data buffer 102 may be separate from the convolution buffer 126, as shown in Figure 1 and described herein. The convolution engine 104 receives a first input block of an input feature map from the convolution buffer 126 in a first path. The size of the input feature map may exceed the storage capacity of the convolution buffer 126. In at least one embodiment, the input feature map includes at least a first input block and a second input block. The convolution engine 104 performs a first hardware layer on the first input block to obtain a first output block and stores the first output block in the convolution buffer 126. The convolution engine 104 identifies a portion of the first output block as corresponding to overlapping data between the first input block and the second input block, and stores that portion of the first output block in the reserved region (overlapping data buffer 102).
[0035] In at least one embodiment, in a second path following the first path, convolution engine 104 receives a portion of a second input block from convolution buffer 126. This portion may be the part of the second input block that does not overlap with the first input block, since the overlapping data has already been fetched and computed. Convolution engine 104 performs the first hardware layer on the portion of the second input block to obtain a portion of the second output block and retrieves the portion of the first output block from the reserved region. Convolution engine 104 stores the second output block, which includes both the portion of the second output block and the portion of the first output block, in convolution buffer 126.
[0036] In at least one embodiment, convolution engine 104 retrieves a first output block from convolution buffer 126 in a first path and performs a second hardware layer on the first output block to obtain a third output block. Convolution engine 104 stores the third output block in convolution buffer 126. In this embodiment, convolution engine 104 does not store overlapping data in a reserved region. In other embodiments, convolution engine 104 may identify additional overlapping data and store it in a reserved region. Convolution engine 104 retrieves a second output block from convolution buffer 126 in a second path and performs a second hardware layer on the second output block to obtain a fourth output block. Convolution engine 104 stores the fourth output block in a convolution buffer. In at least one embodiment, convolution engine 104 identifies a portion of the second output block in the second path as overlapping data corresponding to the first input block and the third input block, and stores this portion of the second output block in a reserved region. In this embodiment, the input feature map includes a first input block, a second input block, and a third input block. In the third path, convolution engine 104 receives a portion of a third input block from convolution buffer 126 and performs a first hardware layer on that portion to obtain a portion of a third output block. Convolution engine 104 retrieves a portion of a second output block from the reserved region and stores that portion in convolution buffer 126 as part of the third output block. The third output block includes the portion of the third output block based on the execution of the first hardware layer and the portion of the second output block retrieved from the reserved region.
[0037] Figure 3 is a diagram illustrating an accelerator circuit 300 with two linked hardware layers according to at least some embodiments, which uses an auxiliary buffer for overlapping data between two paths. The accelerator circuit 300 includes a main buffer 302 (e.g., internal SRAM), an auxiliary buffer 304, a memory interface 306, and one or more processing units that execute multiple linked layers. The main buffer 302 may be the internal SRAM of the accelerator circuit. The auxiliary buffer 304 may be another internal SRAM of the accelerator circuit or a reserved area of the internal SRAM of the main buffer 302. The auxiliary buffer 304 corresponds to the overlapping data buffer 102 of Figure 1. The memory interface 306 is coupled to an external memory device (external DRAM) coupled to the accelerator circuit 300. The accelerator circuit 300 uses the one or more processing units to execute multiple linked, tiled hardware layers of a neural network in multiple paths. The number of paths is equal to the number of blocks used by the linked hardware layers. The accelerator circuit 300 includes a first layer 308 and a second layer 310. The first layer 308 and the second layer 310 are each divided into two blocks. Therefore, the first layer 308 executes a first hardware instruction (HW1) on both blocks and the second layer 310 executes a second hardware instruction (HW2) on both blocks. Specifically, the first layer 308 (Layer 0) executes on the first block (tile 0), then the second layer 310 (Layer 1) executes on the first block (tile 0), then the first layer 308 (Layer 0) executes on the second block (tile 1), and then the second layer 310 (Layer 1) executes on the second block (tile 1). In the illustrated embodiment, the accelerator circuit 300 includes two linked layers and two paths for simplicity, but in other embodiments it may include more than two linked layers and more than two paths.
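The execution order described above, in which every linked layer runs on one block before any layer runs on the next block, can be written as a short schedule generator. This is only a sketch of the ordering; the function name `pass_schedule` is hypothetical.

```python
def pass_schedule(num_layers, num_tiles):
    """Return the (layer, tile) execution order for linked, tiled hardware layers.

    Each pass (path) handles one tile and runs every linked layer on it in order,
    so the number of passes equals the number of tiles.
    """
    return [(layer, tile)
            for tile in range(num_tiles)
            for layer in range(num_layers)]


# Two linked layers, two tiles, as in the illustrated embodiment.
for layer, tile in pass_schedule(num_layers=2, num_tiles=2):
    print(f"Layer {layer} on tile {tile}")
```

For two layers and two tiles this yields Layer 0 on tile 0, Layer 1 on tile 0, Layer 0 on tile 1, and Layer 1 on tile 1, so each hardware layer executes once per pass.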
[0038] In the first path 312, the first layer 308 receives a first tensor 301. The first tensor 301 includes a first input block of an input feature map from the memory interface 306. In some embodiments, the input feature map is too large to be stored in the main buffer 302. For example, the size of the input feature map exceeds the storage capacity of the main buffer 302, so in the illustrated embodiment, the input feature map is divided into two blocks, including a first input block and a second input block. The first layer 308 performs a first operation corresponding to a first hardware layer instruction on the first tensor 301 (the first input block) to obtain a second tensor 303. The second tensor 303 includes a first output block. The first output block is also an input block of the second layer 310. The first layer 308 stores the second tensor 303 in the main buffer 302. The first layer 308 also identifies a portion of the first output block as overlapping data 305 between the first input block and the second input block. The first layer 308 stores the overlapping data 305 in the auxiliary buffer 304.
[0039] In at least one embodiment, in the first path 312, the second layer 310 retrieves a second tensor 303, including the first output block, from the main buffer 302, instead of obtaining data from external memory. The second layer 310 performs a second operation on the first output block corresponding to a second hardware layer instruction to obtain a third tensor 307, including the third output block. The second layer 310 stores the third tensor 307 in the main buffer 302 or in external memory (e.g., DRAM).
[0040] In the second path 314, the first layer 308 receives a fourth tensor 309. The fourth tensor 309 includes a portion of a second input block from the memory interface 306 (or from the main buffer 302). The first layer 308 performs a first operation on the portion of the second input block, corresponding to a third hardware layer instruction, to obtain a fifth tensor 311, which includes a portion of the second output block. The first layer 308 also receives a portion of a first output block from an auxiliary buffer 304 for the fifth tensor 311, which corresponds to overlapping data 305 between the first and second input blocks. The second output block includes portions of the second output block and portions of the first output block (e.g., overlapping data 305). The first layer 308 stores the fifth tensor 311, including the second output block, in the main buffer 302. Using the auxiliary buffer 304, the overlapping data is not over-fetched and over-computed, as described herein.
[0041] In at least one embodiment, in the second path 314, the second layer 310 retrieves the fifth tensor 311, including the second output block, from the main buffer 302, instead of obtaining data from external memory. The second layer 310 performs a second operation on the second output block, corresponding to a fourth hardware layer instruction, to obtain a sixth tensor 313, including the fourth output block. The second layer 310 stores the sixth tensor 313 in the main buffer 302 or in external memory (e.g., DRAM).
[0042] In at least one embodiment, the first input block and the second input block are retrieved from external memory and stored in the main buffer before the first path 312. In this embodiment, the first layer 308 in the second path retrieves a portion of the second input block from the main buffer 302 and retrieves overlapping data 305 from the auxiliary buffer 304.
[0043] In one embodiment, the first layer 308 is executed by a fixed-function engine, such as convolution engine 104, and the same fixed-function engine executes the second layer 310. In another embodiment, the first layer 308 is executed by a first fixed-function engine, such as convolution engine 104. The second layer 310 is executed by a second fixed-function engine, different from the first fixed-function engine, such as pooling engine 108. Alternatively, the first layer 308 and the second layer 310 may be executed by other fixed-function engines.
[0044] It should also be noted that in the case of more than two paths, the first layer 308 in the second path 314 identifies a portion of the second output block as overlapping data corresponding to the first input block and the third input block and stores the portion of the second output block in the auxiliary buffer 304. For example, if the input feature map includes a first input block, a second input block, and a third input block, a third path can be used, wherein the first layer 308 receives a portion of the third input block from the main buffer 302 and performs a first operation on that portion of the third input block to obtain a portion of the third output block. The first layer 308 also retrieves the portion of the second output block from the auxiliary buffer 304 and stores that portion in the main buffer 302 as part of the third output block. The third output block includes the portion of the third output block based on the first operation and the portion of the second output block retrieved from the auxiliary buffer 304.
[0045] In one embodiment, the main buffer 302 and the auxiliary buffer 304 may be implemented in the same internal memory device. In at least one embodiment, the main buffer is a first region of the internal memory device reserved as Level 1 (L1) memory. The auxiliary buffer is a second region of the internal memory device reserved as Level 2 (L2) memory. In this embodiment, the external memory device is reserved as Level 3 (L3) memory. In another embodiment, the main buffer 302 is implemented in a first internal memory device, while the auxiliary buffer 304 is implemented in a second internal memory device.
[0046] In one example, suppose there is an image convolutional layer with the following parameters: i) Input: WxHxC = 960x480x3; ii) Kernel: 7x7x3x48, stride: 2x2, padding: 3x2; and iii) Output: WxHxC = 480x240x48. The input size parameter can exceed the capacity of the main buffer 302. Therefore, the image convolutional layer can be divided into three compiled hardware layers, executed by the accelerator circuit 300. Each hardware layer produces a 160x240x48 output block. According to the convolution dimension formula: Input = Stride * (Output - 1) + Kernel - Padding Left - Padding Right, the compiler can define the following parameters for the three hardware layers.
[0047] HWL1:
[0048] Input: WxHxC=322x480x3
[0049] Output: W x H x C = 160 x 240 x 48
[0050] HWL2:
[0051] Input: WxHxC=325x480x3
[0052] Output: W x H x C = 160 x 240 x 48
[0053] HWL3:
[0054] Input: WxHxC=323x480x3
[0055] Output: W x H x C = 160 x 240 x 48
[0056] Adding up the three input widths gives 970 lines, which is 10 more than the 960-line width of the original input; these 10 lines are the overlap among the three hardware layers.
[0057] In at least one embodiment, overlap can be modeled. For example, if the width of the first block is N, then the last pixel of the first block is N-1, and the first pixel of the second block is N. The last pixel of the first block has a corresponding last input coordinate, calculated as (N-1) * stride - padding left + (kernel - 1). The first pixel of the second block has a corresponding first input coordinate, calculated as N * stride - padding left. The overlap between the two blocks is the number of input coordinates in the range from the latter to the former, which simplifies to kernel - stride. For the example above, the total overlap is the first overlap, between the first and second blocks, plus the second overlap, between the second and third blocks, expressed as total overlap = overlap 1 + overlap 2 = (7-2) + (7-2) = 10 lines, which is the same as determined above. In at least one embodiment, the compiler determines the overlap between each hardware layer and creates instructions to correctly retrieve tensor data and overlap data from the main buffer and auxiliary buffer, respectively.
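The input widths listed for the three hardware layers follow directly from the convolution dimension formula. The following minimal sketch reproduces the arithmetic; the function and variable names are illustrative assumptions, and the assignment of left padding to HWL1 and right padding to HWL3 is inferred from the listed widths rather than stated explicitly.

```python
def tile_input_width(out_w, kernel, stride, pad_left=0, pad_right=0):
    """Input = Stride * (Output - 1) + Kernel - Padding Left - Padding Right."""
    return stride * (out_w - 1) + kernel - pad_left - pad_right


KERNEL, STRIDE = 7, 2
PAD_LEFT, PAD_RIGHT = 3, 2
FULL_INPUT_W = 960
TILE_OUTPUT_W = 160

# Assumption: only the first tile sees the left padding and only the
# last tile sees the right padding; the middle tile sees neither.
hw_layers = [("HWL1", PAD_LEFT, 0), ("HWL2", 0, 0), ("HWL3", 0, PAD_RIGHT)]

widths = {name: tile_input_width(TILE_OUTPUT_W, KERNEL, STRIDE, pl, pr)
          for name, pl, pr in hw_layers}
total = sum(widths.values())
overlap = total - FULL_INPUT_W

print(widths)          # per-tile input widths
print(total, overlap)  # summed width and the resulting overlap
```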
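The coordinate model above can be checked numerically. The sketch below is illustrative only (the function names are assumptions); it computes the last input coordinate needed by one block and the first input coordinate needed by the next, and counts the coordinates they share.

```python
def last_input_coord(out_index, kernel, stride, pad_left):
    """Last input coordinate read when producing output pixel out_index."""
    return out_index * stride - pad_left + (kernel - 1)


def first_input_coord(out_index, stride, pad_left):
    """First input coordinate read when producing output pixel out_index."""
    return out_index * stride - pad_left


def boundary_overlap(n, kernel, stride, pad_left):
    """Input coordinates shared by output pixels n-1 and n (a block boundary)."""
    last_of_prev = last_input_coord(n - 1, kernel, stride, pad_left)
    first_of_next = first_input_coord(n, stride, pad_left)
    return last_of_prev - first_of_next + 1


# Block boundaries at output pixels 160 and 320 (three blocks of width 160).
overlaps = [boundary_overlap(n, kernel=7, stride=2, pad_left=3) for n in (160, 320)]
total_overlap = sum(overlaps)
print(overlaps, total_overlap)
```

The result is independent of N and of the padding, which is why the overlap can be written simply as kernel - stride.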
[0058] Several factors can influence where linking is terminated, including computational overhead and bandwidth overhead. Convolution operations are region-based operations (when kernel size > 1); therefore, if a layer is divided into multiple hardware instructions, there can be overlap between the inputs of those instructions. Without linking, the entire tensor is already available in external memory (DRAM), so the overlap only requires some additional over-fetching. With linking, however, the entire intermediate tensor between instructions is not available; therefore, the overlapping regions must be computed by the preceding instructions, which introduces over-computation. This overhead increases with the depth of the linked layers: the more layers are linked, the greater the computational overhead. On the other hand, the more layers are linked, the less DRAM traffic is required; thus, there is a trade-off between computational overhead and DRAM bandwidth savings. This trade-off can depend on layer parameters and boundary factors.
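To illustrate how over-computation grows with link depth when no auxiliary buffer is used, the following sketch applies the convolution dimension formula backward through a chain of linked layers. It is a simplified model under stated assumptions: one dimension only, an interior block with no padding, and helper names (`required_widths`, `overcomputed_columns`) that are hypothetical.

```python
def required_widths(chain, out_w):
    """chain: list of (kernel, stride), first linked layer first.

    Returns widths where widths[0] is the input width the first layer must
    fetch and widths[i + 1] is the output width layer i must produce so that
    the last layer can emit out_w columns.
    """
    widths = [out_w]
    for kernel, stride in reversed(chain):
        widths.append(stride * (widths[-1] - 1) + kernel)
    widths.reverse()
    return widths


def overcomputed_columns(chain, out_w):
    """Columns of intermediate outputs that an interior block recomputes
    because they are also needed by a neighboring block."""
    widths = required_widths(chain, out_w)
    extra = 0
    share = out_w  # non-overlapping share of the final output
    for i in range(len(chain) - 1, 0, -1):
        share *= chain[i][1]          # scale the share by the stride of layer i
        extra += widths[i] - share    # widths[i] is the output of layer i - 1
    return extra


for depth in range(1, 5):
    chain = [(3, 1)] * depth  # illustrative 3-wide, stride-1 layers
    print(depth, required_widths(chain, 8), overcomputed_columns(chain, 8))
```

In this model a single layer has no over-computation, and each additional linked layer adds more recomputed columns than the one before it, which is the overhead that storing the overlap in an auxiliary buffer is intended to avoid.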
[0059] Returning to Figure 3, feature data is stored in the main buffer. In at least one embodiment, the same weights can be shared by every path. Storing weights in the main buffer reduces weight traffic on the memory interface. However, the more layers are linked, the more main buffer storage is required to hold the corresponding weight data. The capacity of the main buffer used for storing weights is another factor that can be considered when evaluating link depth. From a performance perspective, layers that are not limited by weight fetching do not need to keep their weights in the main buffer. However, from a power consumption perspective, reducing memory traffic on the memory interface is beneficial. Again, it is a trade-off between the power consumed by weight fetching and the power consumed by activation fetching. In at least one embodiment, the first layer 308 may use a first number of memory blocks (e.g., 10 CBUF memory blocks) to store weight data in the main buffer 302 and a second number of memory blocks (e.g., 2 CBUF memory blocks) to store feature data in the main buffer 302. The second layer 310 may use a first number of memory blocks (e.g., 2 CBUF memory blocks) to store weight data in the main buffer 302, and a second number of memory blocks (e.g., 10 CBUF memory blocks) to store feature data in the main buffer 302.
[0060] In at least one embodiment, the accelerator circuit 300 may combine linking with other features, such as batching or weight prefetching. Batching can offer several advantages, including sharing weights across different frames to save memory bandwidth for weight data, and in some cases, improving efficiency. In one case, if the main buffer 302 is large enough to store all batches, batching can be performed inside the chain while the different batches are still scheduled by software; otherwise, the workloads should be scheduled at the chain boundaries.
[0061] Figure 4 is a diagram illustrating an accelerator circuit 400 with four linked layers, according to at least some embodiments, which uses an overlapping data buffer between three paths. The accelerator circuit 400 includes internal SRAM with an area reserved as an auxiliary buffer 404 (referred to as UB or UBUF). The remainder of the internal SRAM may be reserved as a main buffer (not shown in Figure 4), referred to as CBUF. The accelerator circuit 400 also includes a memory interface and one or more processing units that execute four linked layers: an input layer 402 (layer 0), a first layer 406, a second layer 408, and a third layer 410. Here, the input data is divided into three blocks, processed in three paths. Therefore, the first layer 406 executes a first hardware instruction three times on the three blocks, the second layer 408 executes a second hardware instruction three times on the three blocks, and the third layer 410 executes a third hardware instruction three times on the three blocks.
[0062] In the first path, the first layer 406 executes a first instruction that identifies a first portion 412 of a first block 414 to be stored in the auxiliary buffer 404. The first portion 412 represents overlapping data between the first block 414 and the second block 416. In the second path, the first layer 406 executes a second instruction that retrieves the first portion 412 from the auxiliary buffer 404 for the second block 416. In the second path, the first layer 406 may also execute another instruction that identifies a second portion 418 of the second block 416 to be stored in the auxiliary buffer 404. The second portion 418 represents overlapping data between the second block 416 and the third block 420. In the third path, the first layer 406 executes a third instruction that retrieves the second portion 418 from the auxiliary buffer 404 for the third block 420. In at least one embodiment, the first layer 406 may execute other instructions to retrieve feature data from the main buffer (not shown in Figure 4). In at least one embodiment, a single instruction can be used to retrieve both feature data from the main buffer and overlapping data from the auxiliary buffer 404. In at least one embodiment, separate instructions can be used to retrieve feature data from the main buffer and overlapping data from the auxiliary buffer 404.
[0063] In at least one embodiment, in the first path, the second layer 408 executes a first instruction that identifies a first portion 422 of a first block 424 to be stored in the auxiliary buffer 404. The first portion 422 represents overlapping data between the first block 424 and the second block 426. In the second path, the second layer 408 executes a second instruction that retrieves the first portion 422 from the auxiliary buffer 404 for the second block 426. In the second path, the second layer 408 may also execute another instruction that identifies a second portion 428 of the second block 426 to be stored in the auxiliary buffer 404. The second portion 428 represents overlapping data between the second block 426 and the third block 430. In the third path, the second layer 408 executes a third instruction that retrieves the second portion 428 from the auxiliary buffer 404 for the third block 430. In at least one embodiment, the second layer 408 may execute other instructions to retrieve feature data from the main buffer (not shown in Figure 4). In at least one embodiment, a single instruction can be used to retrieve both feature data from the main buffer and overlapping data from the auxiliary buffer 404. In at least one embodiment, separate instructions can be used to retrieve feature data from the main buffer and overlapping data from the auxiliary buffer 404.
[0064] In at least one embodiment, the third layer 410 can execute other instructions to retrieve feature data from the main buffer (not shown in Figure 4) without retrieving overlapping data from the auxiliary buffer 404. Similarly, the input layer 402 can execute one or more instructions to retrieve input data or feature data from an external memory device or the main buffer (not shown in Figure 4). The input layer 402 can execute one or more instructions to store output data or output feature data into the main buffer.
[0065] As described herein, the compiler can include various parameters that allow it to generate a set of hardware instructions that identify overlapping data and store it in an auxiliary buffer, as explained with respect to Figure 5.
[0066] Figure 5 is a diagram illustrating two convolutional layers that use hardware instructions to store and retrieve overlapping data between blocks across paths, according to at least some embodiments. The first convolutional layer 502 has the following parameters: input size: 16x16, kernel size: 7x7, stride: 1x1, padding: 3x3, and output size: 16x16. The second convolutional layer 504 has the following parameters: input size: 16x16, kernel size: 5x5, stride: 1x1, padding: 2x2, and output size: 16x16. As shown in Figure 5, layer 502 receives a first input 506, which is 16x16. Since the first input may exceed the capacity of the main buffer, the first input 506 is divided into two blocks: a first block 508 and a second block 510. Layer 502 outputs a first output 512, which is also 16x16. Because the first input 506 is divided into two blocks, the first output 512 is also divided into two blocks: a first block 514 and a second block 516. Because layer 502 and layer 504 are linked, the first output 512 of layer 502 is also the second input of layer 504. Layer 504 outputs a second output 518, which is also 16x16. Because the first output 512 is divided into two blocks, the second output 518 is also divided into two blocks: a first block 520 and a second block 522.
[0067] In at least one embodiment, the compiler can generate a set of instructions to perform two convolutions in two paths using two blocks with the main buffer (CBUF). A set of example instructions is listed below:
[0068] Convolution 0 (Input: DRAM, 13x16, Output: CBUF, 10x16, Path 0)
[0069] Convolution 1 (Input: CBUF, 10x16, Output: CBUF, 8x16, Path 0)
[0070] Convolution 0 (Input: DRAM, 13x16, Output: CBUF, 10x16, Path 1)
[0071] Convolution 1 (Input: CBUF, 10x16, Output: DRAM, 8x16, Path 1)
[0072] In at least one embodiment, when using an auxiliary buffer to store overlapping data, the compiler can generate a set of instructions to perform two convolutions in two paths using two blocks with both the primary buffer (CBUF) and the auxiliary buffer (UBUF). The compiler can also generate other instructions besides those described above to store and retrieve appropriate feature data from the primary buffer and overlapping data from the auxiliary buffer. A set of example instructions is listed below:
[0073] Convolution 0 (Input: DRAM, 13x16, Output: CBUF, 10x16, Path 0)
[0074] Crop (Input: CBUF, 4x16, Stride: 1x1, Output: UBUF, 4x16, Path 0)
[0075] Convolution 1 (Input: CBUF, 10x16, Output: CBUF, 8x16, Path 0)
[0076] Convolution 0 (Input: DRAM, 9x16, Output: CBUF, 6x16, Path 1)
[0077] Get + Convolution 1 (Input: CBUF(6x16) + UBUF(4x16), Output: DRAM, 8x16, Path 1)
[0078] In at least one embodiment, by storing overlapping data in an auxiliary buffer, about 31% of memory bandwidth (e.g., (13-9) / 13 ≈ 31% bandwidth saving) and 40% of computation (e.g., (10-6) / 10 = 40% MAC saving) can be saved in the second path. The preceding set of instructions can be used when the accelerator circuit has no hardware crop support. In at least one embodiment, the compiler can use instructions with hardware crop support to store the appropriate feature data in the main buffer and the overlapping data in the auxiliary buffer. A set of example instructions is listed below:
[0079] Convolution 0+Crop (Input: DRAM, 13x16, Output: CBUF, 10x16, UBUF, 4x16, Path 0)
[0080] Convolution 1 (Input: CBUF, 10x16, Output: CBUF, 8x16, Path 0)
[0081] Convolution 0 (Input: DRAM, 9x16, Output: CBUF, 6x16, Path 1)
[0082] Get + Convolution 1 (Input: CBUF(6x16) + UBUF(4x16), Output: DRAM, 8x16, Path 1)
[0083] Alternatively, the compiler can generate other instruction sets to identify, store, and retrieve overlapping data between blocks across pathways.
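The example instruction sets for Figure 5 can be mimicked with a small one-dimensional simulation. The sketch below is an illustration under stated assumptions, not the accelerator's behavior: it models only the width dimension, uses made-up kernel weights, and represents DRAM, CBUF, and UBUF as plain Python lists. It checks that both instruction sets reproduce the untiled result and counts the columns fetched and computed by Convolution 0 in Path 1.

```python
def conv1d(x, kernel, pad_left=0, pad_right=0):
    """Stride-1 1D convolution (no kernel flip) with zero padding."""
    padded = [0] * pad_left + list(x) + [0] * pad_right
    k = len(kernel)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(padded) - k + 1)]


# Hypothetical weights; only the kernel widths (7 and 5) come from the text.
K0 = [1, 2, 3, 4, 3, 2, 1]   # Convolution 0: kernel 7, padding 3
K1 = [1, -1, 2, -1, 1]       # Convolution 1: kernel 5, padding 2
dram = list(range(1, 17))    # 16 input columns in external memory

# Reference: both layers applied to the whole input without tiling.
reference = conv1d(conv1d(dram, K0, 3, 3), K1, 2, 2)


def path0():
    cbuf = conv1d(dram[0:13], K0, pad_left=3)   # 13 columns in, 10 out
    ubuf = cbuf[-4:]                            # Crop: keep the last 4 columns
    out = conv1d(cbuf, K1, pad_left=2)          # 10 columns in, 8 out
    return out, ubuf


def path1_without_ubuf():
    fetched = dram[3:16]                        # 13 columns fetched again
    cbuf = conv1d(fetched, K0, pad_right=3)     # 13 columns in, 10 out
    out = conv1d(cbuf, K1, pad_right=2)         # 10 columns in, 8 out
    return out, len(fetched), len(cbuf)


def path1_with_ubuf(ubuf):
    fetched = dram[7:16]                        # only 9 columns fetched
    cbuf = conv1d(fetched, K0, pad_right=3)     # 9 columns in, 6 out
    out = conv1d(ubuf + cbuf, K1, pad_right=2)  # Get: 4 + 6 columns in, 8 out
    return out, len(fetched), len(cbuf)


out0, ubuf = path0()
out_a, fetch_a, conv_a = path1_without_ubuf()
out_b, fetch_b, conv_b = path1_with_ubuf(ubuf)

# Both schemes must reproduce the untiled result.
assert out0 + out_a == reference
assert out0 + out_b == reference

bandwidth_saving = (fetch_a - fetch_b) / fetch_a
mac_saving = (conv_a - conv_b) / conv_a
print(f"Path 1 fetch: {fetch_a} -> {fetch_b} columns ({bandwidth_saving:.1%} saved)")
print(f"Path 1 Convolution 0 output: {conv_a} -> {conv_b} columns ({mac_saving:.1%} saved)")
```

The four cropped columns correspond to kernel - stride = 5 - 1 for Convolution 1, which is why Convolution 0 only needs to produce six new columns in Path 1.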
[0084] Figure 6 is a flowchart of a method 600 for identifying a portion of an output block and storing it in an auxiliary buffer, according to at least some embodiments. Method 600 can be executed by processing logic including hardware, software, firmware, or any combination thereof. In at least one embodiment, method 600 is performed by hardware of the accelerator core 100 of Figure 1. In at least one embodiment, method 600 is performed by the convolution engine 104 of Figure 1. In at least one embodiment, method 600 is performed by the pooling engine 108 of Figure 1.
[0085] Returning to Figure 6, method 600 begins with the processing logic receiving a first input block of an input feature map from either the main buffer of the accelerator circuit or an external memory coupled to the accelerator circuit (box 602). The size of the input feature map exceeds the storage capacity of the main buffer, and the input feature map includes at least a first input block and a second input block. The processing logic performs a first operation on the first input block to obtain a first output block (box 604). The processing logic stores the first output block in the main buffer (box 606). The processing logic identifies a portion of the first output block as overlapping data corresponding to the first and second input blocks (box 608). The processing logic stores the portion of the first output block in an auxiliary buffer of the accelerator circuit (box 610), and method 600 ends.
[0086] In at least one embodiment, the processing logic identifies the portion in the first path and stores that portion of the first output block in an auxiliary buffer. In the second path following the first path, the processing logic receives a portion of the second input block from the main buffer and performs a first operation on that portion of the second input block to obtain a portion of the second output block. The processing logic retrieves a portion of the first output block from the auxiliary buffer. The processing logic stores the second output block in the main buffer. The second output block includes a portion of the second output block and a portion of the first output block.
[0087] In at least one embodiment, the processing logic in the second path identifies a portion of the second output block as overlapping data corresponding to the first input block and the third input block. In this embodiment, the input feature map includes a first input block, a second input block, and a third input block. The processing logic stores a portion of the second output block in an auxiliary buffer. In the third path, the processing logic receives a portion of the third input block from the main buffer and performs a first operation on that portion of the third input block to obtain a portion of the third output block. The processing logic retrieves a portion of the second output block from the auxiliary buffer and stores that portion in the main buffer as a portion of the third output block. The third output block includes a portion of the third output block based on the first operation and a portion of the second output block retrieved from the auxiliary buffer.
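The multi-path flow described above, in which each path reuses the overlap saved by the previous path and then saves its own, can be sketched as a short loop. This is a deliberately simplified illustration: the first operation is modeled as an element-wise function so that the buffer handling stays visible, the buffers are plain lists, and the names `run_paths`, `first_op`, and `overlap` are assumptions rather than terms from the disclosure.

```python
def run_paths(input_portions, first_op, overlap):
    """Process one input portion per path and return the output blocks.

    input_portions: new input columns received in each path.
    first_op: function mapping a list of input columns to output columns.
    overlap: number of trailing output columns shared with the next block.
    """
    aux_buffer = []       # auxiliary buffer: overlap carried between paths
    main_buffer = []      # main buffer: one assembled output block per path
    for portion in input_portions:
        new_part = first_op(portion)       # receive a block and operate on it
        block = aux_buffer + new_part      # prepend overlap from the previous path
        main_buffer.append(block)          # store the output block
        # Identify the overlap for the next path and store it.
        aux_buffer = block[-overlap:] if overlap else []
    return main_buffer


def double(cols):
    """Stand-in for the first operation (assumption: element-wise)."""
    return [2 * c for c in cols]


portions = [list(range(0, 10)), list(range(10, 16)), list(range(16, 22))]
blocks = run_paths(portions, double, overlap=4)
for i, block in enumerate(blocks):
    print(f"path {i}: output block of {len(block)} columns")
```

Each output block after the first begins with the four columns saved from the previous block, so only the new columns have to be computed in the second and third paths.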
[0088] In at least one embodiment, in the first path, the processing logic retrieves a first output block from the main buffer and performs a second operation on the first output block to obtain a third output block, and stores the third output block in the main buffer. In the second path, the processing logic retrieves a second output block from the main buffer and performs a second operation on the second output block to obtain a fourth output block. The processing logic stores the fourth output block in the main buffer.
[0089] In at least one embodiment, the first operation is performed by a first fixed-function engine that processes a first-layer type, and the second operation is performed by a second fixed-function engine that processes a second-layer type. In at least one embodiment, the first operation and the second operation are performed by the same fixed-function engine.
[0090] Figure 7 is a block diagram of a DLA system 700 according to at least some embodiments. The DLA system 700 is considered a headless system, in which unit-by-unit management of the DLA subsystem 702 occurs on the main system processor, CPU 704. The DLA subsystem 702 includes an interrupt interface 706, a configuration space bus (CSB) interface 708, a main data bus interface 710 (DBBIF), an auxiliary data bus interface 712, and the overlapping data buffer 102 described above with respect to Figure 1. The CPU 704 and the DLA subsystem 702 are coupled to system memory 714 (e.g., DRAM). The DLA subsystem 702 is coupled to the system memory 714 via the main data bus interface 710. The DLA subsystem 702 may be coupled to auxiliary memory, such as SRAM (not shown in Figure 7). It should be noted that the DLA system 700 may not include the optional auxiliary data bus interface 712, because the system memory 714 can consume less power than SRAM when overall system performance is less critical. The DLA system 700 can use the system memory 714 as a compute cache in a more power-efficient manner.
[0091] The DLA system 700 of Figure 7 represents a more cost-sensitive system than a DLA system with a dedicated controller or coprocessor for unit-by-unit management of the DLA subsystem 702. The DLA system 700 can be considered a small-scale system model. Small-scale system models can be used for cost-sensitive connected Internet of Things (IoT) devices, artificial intelligence (AI) systems, and automation-oriented systems with well-defined tasks, where cost, area, and power are the primary drivers. Cost, area, and power savings can be achieved through the configurable resources of the DLA subsystem 702. Neural network models can be pre-compiled and their performance optimized, allowing larger models to be reduced in load complexity. In turn, the reduced load complexity enables a scaled-down DLA implementation in which models consume less storage and take less time for system software to load and process. In at least one embodiment, the DLA system 700 can perform one task at a time. Alternatively, the DLA system 700 can perform multiple tasks simultaneously. For the DLA system 700, context switching does not overburden the CPU 704 with servicing numerous interrupts from the DLA subsystem 702. This eliminates the need for an additional microcontroller, and the CPU 704 performs memory allocation and other DLA subsystem management operations. As described herein, the DLA subsystem 702 includes an overlapping data buffer 102 for tiling between linked layers executed by the fixed-function engines and other operations performed by the CPU 704.
[0092] Figure 8 is a block diagram of a DLA system 800 according to at least some embodiments. The DLA system 800 is considered a headed system, in which a main system processor, CPU 802, delegates high-interrupt-frequency tasks to a companion microcontroller 804 coupled to a DLA subsystem 702. The DLA system 800 is similar to the DLA system 700, as indicated by similar reference numerals, except that the DLA system 800 includes the companion microcontroller 804. The DLA system 800 can be considered a larger system model, characterized by the addition of a dedicated control coprocessor and high-bandwidth SRAM to support the DLA subsystem 702. This larger system model can be used for IoT devices that may run many tasks simultaneously.
[0093] In some cases, the larger DLA model of Figure 8 is used when higher performance and versatility are required. Performance-oriented IoT systems can perform inference over many different network topologies; therefore, they maintain a high degree of flexibility. Furthermore, these systems may perform many tasks simultaneously, rather than serializing inference operations, so that inference operations do not consume excessive processing power on the CPU 704. To meet these needs, the DLA subsystem 702 includes an auxiliary data bus interface 712 coupled to a dedicated high-bandwidth SRAM 812. The SRAM 812 can be used as a cache for the DLA subsystem 702. The SRAM 812 can also be used by other high-performance computer-vision-related components on the system to further reduce traffic to the main system memory 714 (e.g., DRAM). The DLA subsystem 702 can interface with the microcontroller 804 (or a dedicated control coprocessor) to limit the interrupt load on the CPU 704. In at least one embodiment, the microcontroller 804 can be a RISC-V-based PicoRV32 processor, an ARM Cortex-M or Cortex-R processor, or another microcontroller design. When a dedicated coprocessor (microcontroller 804) is used, the main processor (CPU 704) can still handle some tasks related to managing the DLA subsystem 702. For example, the microcontroller 804 or the CPU 704 can still handle fine-grained or coarse-grained scheduling of the DLA hardware, input-output memory management unit (IOMMU) mapping for DLA memory access (as needed), memory allocation of input data and fixed-weight arrays on the DLA subsystem 702, and synchronization between other system components and tasks running on the DLA subsystem 702.
[0094] In at least one embodiment, the DLA subsystem 702 is programmable into multiple operating modes, such as a standalone mode, a fusion mode, etc. In standalone mode, each functional block can be configured for when and what it executes, with each block performing its assigned task (similar to independent layers in a deep learning framework). A standalone operation begins and ends with the assigned block performing memory-to-memory operations, into and out of main system memory or dedicated SRAM. In fusion mode, some blocks can be assembled into pipelines. Pipelines can improve performance by bypassing the round trip through memory, instead having blocks communicate with each other through small first-in-first-out (FIFO) queues. For example, a convolution engine can pass data to a single data point processor, which can then pass data to a planar data processor and a cross-channel data processor.
[0095] The techniques disclosed herein can be incorporated into any processor capable of processing neural networks, such as a central processing unit (CPU), GPU, intelligent processing unit (IPU), neural processing unit (NPU), tensor processing unit (TPU), neural network processor (NNP), data processing unit (DPU), vision processing unit (VPU), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), etc. Such processors can be integrated into personal computers (e.g., laptops), data centers, Internet of Things (IoT) devices, handheld devices (e.g., smartphones), vehicles, robots, voice-controlled devices, or any other device that performs inference, training, or any other processing on neural networks. Such processors can be used in virtualization systems, enabling an operating system running in a virtual machine on the system to utilize the processor.
[0096] As an example, a processor incorporating the techniques disclosed herein can be used to process one or more neural networks in a machine to identify, classify, manipulate, process, operate, modify, or navigate physical objects in the real world. For instance, such a processor can be used in autonomous vehicles (e.g., cars, motorcycles, helicopters, drones, airplanes, ships, submarines, delivery robots, etc.) to enable the vehicle to move in the real world. Furthermore, such a processor can be used in robots in factories to select parts and assemble parts into components.
[0097] As an example, a processor incorporating the techniques disclosed herein can be used to process one or more neural networks to identify one or more features in an image or to alter, generate, or compress the image. For instance, such a processor can be employed to enhance images rendered using rasterization, ray tracing (e.g., using NVIDIA RTX), and / or other rendering techniques. In another example, such a processor can be employed to reduce the amount of image data transmitted from a rendering device to a display device over a network (e.g., the Internet, mobile telecommunications networks, Wi-Fi networks, and any other wired or wireless network system). Such transmissions can be used to stream image data from servers or data centers in the cloud to user devices (e.g., personal computers, video game consoles, smartphones, other mobile devices, etc.) to enhance streaming image services such as NVIDIA GeForce Now (GFN), Google Stadia, etc.
[0098] As an example, a processor incorporating the techniques disclosed herein can be used to process one or more neural networks for any other type of application that can utilize neural networks. Such applications might involve translating languages, recognizing and removing sounds from audio, detecting anomalies or defects in the production of goods and services, monitoring living and non-living things, medical diagnosis, decision-making, and so on.
[0099] Other variations are within the spirit of this disclosure. Therefore, while the disclosed technology is susceptible to various modifications and alternative constructions, certain illustrated embodiments are shown in the accompanying drawings and described in detail above. However, it should be understood that this disclosure is not intended to be limited to a particular one or more forms, but rather is intended to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of this disclosure as defined in the appended claims.
[0100] Unless otherwise stated or obviously contradicted by the context, the use of the terms “a” and “an” and “the” and similar designations in the context of describing the disclosed embodiments (especially in the context of the following claims) will be interpreted as covering both singular and plural, rather than as definitions of the terms. Unless otherwise stated, the terms “comprising,” “having,” “protecting,” and “containing” will be interpreted as open-ended terms (meaning “including but not limited to”). “Connection,” when unmodified and referring to a physical connection, should be interpreted as partially or wholly contained within, attached to, or joined together, even if something intervenes. Unless otherwise stated herein, the enumeration of numerical ranges herein is intended only as a shorthand method for individually referring to each individual value falling within that range. Each individual value is included in the specification as if it were individually referenced herein. In at least one embodiment, unless otherwise stated or contradicted by the context, the use of the terms “set” (e.g., “set of items”) or “subset” will be interpreted as a non-empty set comprising one or more members. Furthermore, unless otherwise stated or contradicted by the context, the term “subset” of a corresponding set does not necessarily mean a proper subset of the corresponding set, but rather that the subset and the corresponding set may be equal.
[0101] Conjunctive language, such as phrases of the form "at least one of A, B, and C," is generally understood in context to indicate that an item, term, etc., can be A or B or C, or any non-empty subset of the set of A and B and C, unless explicitly stated otherwise or clearly contradicted by context. For example, in the illustrative case of a set with three members, the conjunctive phrase "at least one of A, B, and C" refers to any one of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Therefore, such conjunctive language is generally not intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present. Furthermore, unless explicitly stated otherwise or contradicted by context, the term "plurality" indicates a state of being plural (e.g., "a plurality of items" indicates multiple items). In at least one embodiment, the number of items in a plurality is at least two, but may be more when so indicated explicitly or by context. Furthermore, unless otherwise stated or clear from context, the phrase "based on" means "based at least in part on" rather than "based solely on."
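The seven sets listed above are exactly the non-empty subsets of a three-member set. As a small illustration only (the function name `non_empty_subsets` is hypothetical and not part of this disclosure), they can be enumerated in Python as follows:

```python
from itertools import combinations


def non_empty_subsets(items):
    """Return every non-empty subset of `items` as a list of sets."""
    items = list(items)
    return [
        set(combo)
        for size in range(1, len(items) + 1)  # size 0 (the empty set) is skipped
        for combo in combinations(items, size)
    ]


subsets = non_empty_subsets(["A", "B", "C"])
```

A set with n members has 2^n - 1 non-empty subsets, which gives seven for n = 3.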
[0102] The operations of the processes described herein can be performed in any suitable order unless otherwise stated herein or clearly contradicted by context. In at least one embodiment, processes such as those described herein (or variations and/or combinations thereof) are performed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or by a combination thereof. In at least one embodiment, the code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transitory signals (e.g., a propagating transient electrical or electromagnetic transmission) but includes non-transitory data storage circuitry (e.g., buffers, caches, and queues) within transceivers of transitory signals. In at least one embodiment, code (e.g., executable code or source code) is stored on a set of one or more non-transitory computer-readable storage media (or other memory storing executable instructions) having stored thereon executable instructions that, when executed by one or more processors of a computer system (i.e., as a result of being executed), cause the computer system to perform the operations described herein. In at least one embodiment, the set of non-transitory computer-readable storage media comprises multiple non-transitory computer-readable storage media, and one or more of the individual media lack all of the code, while the multiple media collectively store all of the code.
In at least one embodiment, the executable instructions are executed such that different instructions are executed by different processors. For example, a non-transitory computer-readable storage medium stores instructions, and a main central processing unit (“CPU”) executes some of the instructions while a graphics processing unit (“GPU”) and/or a data processing unit (“DPU”), possibly operating in conjunction with the GPU, executes other instructions. In at least one embodiment, different components of the computer system have separate processors, and different processors execute different subsets of the instructions.
[0103] Therefore, in at least one embodiment, a computer system is configured to implement one or more services that individually or collectively perform the operations of the processes described herein, and such a computer system is configured with suitable hardware and/or software that enables the operations to be performed. Furthermore, a computer system implementing at least one embodiment of this disclosure is a single device, and in another embodiment, it is a distributed computer system comprising multiple devices that operate differently, such that the distributed computer system performs the operations described herein and no single device performs all of the operations.
[0104] The use of any and all examples or exemplary language (e.g., “such as”) provided herein is intended only to better illustrate embodiments of this disclosure and does not constitute a limitation on the scope of this disclosure, unless otherwise stated. No language in the specification should be construed as indicating that any unclaimed element is essential to the practice of this disclosure.
[0105] All references cited herein, including publications, patent applications, and patents, are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
[0106] The terms “coupled” and “connected,” along with their derivatives, may be used in the specification and claims. It should be understood that these terms are not necessarily intended as synonyms for each other. Rather, in particular examples, “connected” or “coupled” may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. “Coupled” may also mean that two or more elements are not in direct contact with each other, yet still cooperate or interact with each other.
[0107] Unless otherwise expressly stated, it will be understood that terms used throughout this specification, such as “processing,” “computing,” “calculating,” “determining,” and the like, refer to the actions and/or processes of a computer or computing system, or a similar electronic computing device, that manipulates and/or transforms data represented as physical quantities (e.g., electronic quantities) within the registers and/or memories of the computing system into other data similarly represented as physical quantities within the memories, registers, or other such information storage, transmission, or display devices of the computing system.
[0108] Similarly, the term "processor" can refer to any device or portion of a device that processes electronic data from registers and/or memory and transforms that electronic data into other electronic data that can be stored in registers and/or memory. As a non-limiting example, a "processor" can be a CPU or a GPU. A "computing platform" can include one or more processors. As used herein, "software" processes can include, for example, software and/or hardware entities that perform work over time, such as tasks, threads, and intelligent agents. Furthermore, each process can refer to multiple processes for carrying out instructions in sequence or in parallel, continuously or intermittently. In at least one embodiment, the terms "system" and "method" are used interchangeably herein insofar as a system can embody one or more methods and a method can be considered a system.
[0109] In this document, reference may be made to obtaining, acquiring, receiving, or inputting analog or digital data into a subsystem, computer system, or computer-implemented machine. In at least one embodiment, the process of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished in a variety of ways, such as by receiving the data as a parameter of a function call or of a call to an application programming interface (API). In at least one embodiment, the process of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring the data via a serial or parallel interface. In at least one embodiment, the process of obtaining, acquiring, receiving, or inputting analog or digital data can be accomplished by transferring the data via a computer network from a providing entity to an acquiring entity. In at least one embodiment, reference may also be made to providing, outputting, transmitting, sending, or presenting analog or digital data. In various examples, the process of providing, outputting, transmitting, sending, or presenting analog or digital data can be accomplished by transferring the data as an input or output parameter of a function call, a parameter of an application programming interface, or an interprocess communication mechanism.
[0110] While the description herein illustrates exemplary embodiments of the described technologies, other architectures may be used to implement the described functionality and are intended to fall within the scope of this disclosure. Furthermore, although a specific allocation of responsibilities may be defined above for the purposes of description, various functions and responsibilities may be allocated and divided in different ways depending on the circumstances.
[0111] Furthermore, although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter claimed in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.
Claims
1. An accelerator circuit, comprising: a main buffer; an auxiliary buffer; a memory interface for coupling to an external memory device; and one or more processing units coupled to the main buffer, the auxiliary buffer, and the memory interface, wherein the one or more processing units are configured to execute instructions corresponding to a plurality of linked layers of a neural network in a plurality of paths corresponding to a plurality of blocks, wherein, in each of the plurality of paths, each of the plurality of linked layers is executed in a manner linked to the other layers of the plurality of linked layers and communicates with the other layers via the main buffer, and wherein the instructions, when executed by the one or more processing units in a first path of the plurality of paths, cause the one or more processing units to: receive a first input block of an input feature map from the main buffer, wherein a size of the input feature map exceeds a storage capacity of the main buffer, and wherein the input feature map includes at least the first input block and a second input block; perform a first operation on the first input block to obtain a first output block; store the first output block in the main buffer; identify a portion of the first output block as overlapping data corresponding to the first input block and the second input block; and store the portion of the first output block in the auxiliary buffer, wherein the auxiliary buffer has a storage capacity smaller than a size of the first output block and is specifically reserved for the overlapping data in a second path following the first path.
2. The accelerator circuit of claim 1, wherein the instructions, when executed by the one or more processing units in the second path following the first path, further cause the one or more processing units to: receive a portion of the second input block from the main buffer; perform the first operation on the portion of the second input block to obtain a portion of a second output block; retrieve the portion of the first output block from the auxiliary buffer; and store the portion of the first output block in the main buffer as part of the second output block.
3. The accelerator circuit of claim 2, wherein the instructions, when executed by the one or more processing units, further cause the one or more processing units to: in the first path, retrieve the first output block from the main buffer, perform a second operation on the first output block to obtain a third output block, and store the third output block in the main buffer; and in the second path, retrieve the second output block from the main buffer, perform the second operation on the second output block to obtain a fourth output block, and store the fourth output block in the main buffer.
4. The accelerator circuit of claim 3, wherein the one or more processing units comprise: a first fixed-function engine configured to process a first layer type of the plurality of linked layers, wherein the first fixed-function engine is configured to receive the first input block, perform the first operation on the first input block, store the first output block, receive the portion of the second input block, perform the first operation on the portion of the second input block, retrieve the portion of the first output block, and store the second output block; and a second fixed-function engine configured to process a second layer type of the plurality of linked layers, wherein the second fixed-function engine is configured to retrieve the first output block, perform the second operation on the first output block, store the third output block, retrieve the second output block, perform the second operation on the second output block, and store the fourth output block.
5. The accelerator circuit of claim 3, wherein the one or more processing units comprise a first fixed-function engine configured to process a first layer type of the plurality of linked layers, wherein the first fixed-function engine is configured to receive the first input block, perform the first operation on the first input block, store the first output block, receive the portion of the second input block, perform the first operation on the portion of the second input block, retrieve the portion of the first output block, store the second output block, retrieve the first output block, perform the second operation on the first output block, store the third output block, retrieve the second output block, perform the second operation on the second output block, and store the fourth output block.
6. The accelerator circuit of claim 2, wherein the instructions, when executed by the one or more processing units in the second path, further cause the one or more processing units to: identify the portion of the second output block as overlapping data corresponding to the first input block and a third input block; and store the portion of the second output block in the auxiliary buffer.
7. The accelerator circuit of claim 6, wherein the input feature map includes the first input block, the second input block, and the third input block, and wherein the instructions, when executed by the one or more processing units in a third path following the second path, further cause the one or more processing units to: receive a portion of the third input block from the main buffer; perform the first operation on the portion of the third input block to obtain a portion of a third output block; retrieve the portion of the second output block from the auxiliary buffer; and store the portion of the second output block in the main buffer as part of the third output block.
8. The accelerator circuit of claim 1, further comprising an internal memory device that includes the main buffer and the auxiliary buffer, wherein the main buffer is a first region of the internal memory device reserved as a first-level (L1) memory, wherein the auxiliary buffer is a second region of the internal memory device reserved as a second-level (L2) memory, and wherein the external memory device is reserved as a third-level (L3) memory.
9. The accelerator circuit of claim 1, wherein the one or more processing units include at least one of a convolution engine or a pooling engine.
10. A deep learning accelerator (DLA) core, comprising: a register file for storing configuration information associated with at least a portion of a neural network comprising a plurality of layers, wherein each of the plurality of layers is executed in a manner linked to other layers in a plurality of paths corresponding to a plurality of blocks; a memory interface for coupling to an external memory device; a convolution buffer, wherein the convolution buffer includes a reserved region; and a convolution engine coupled to the convolution buffer and the memory interface, wherein the convolution engine is configured, in a first path of the plurality of paths, to: receive a first input block of an input feature map from the convolution buffer, wherein a size of the input feature map exceeds a storage capacity of the convolution buffer, and wherein the input feature map includes at least the first input block and a second input block; execute a first hardware layer on the first input block to obtain a first output block; store the first output block in the convolution buffer; identify a portion of the first output block as overlapping data corresponding to the first input block and the second input block; and store the portion of the first output block in the reserved region, wherein the reserved region has a storage capacity smaller than a size of the first output block and is specifically reserved for the overlapping data in a second path following the first path.
11. The DLA core of claim 10, wherein the convolution engine is further configured, in the second path following the first path, to: receive a portion of the second input block from the convolution buffer; execute the first hardware layer on the portion of the second input block to obtain a portion of a second output block; retrieve the portion of the first output block from the reserved region; and store the portion of the first output block in the convolution buffer as part of the second output block.
12. The DLA core of claim 11, wherein the convolution engine is further configured, in the first path, to: retrieve the first output block from the convolution buffer; execute a second hardware layer on the first output block to obtain a third output block; and store the third output block in the convolution buffer; and wherein the convolution engine is further configured, in the second path, to: retrieve the second output block from the convolution buffer; execute the second hardware layer on the second output block to obtain a fourth output block; and store the fourth output block in the convolution buffer.
13. The DLA core of claim 11, wherein the convolution engine is further configured, in the second path, to: identify the portion of the second output block as overlapping data corresponding to the first input block and a third input block; and store the portion of the second output block in the reserved region.
14. The DLA core of claim 13, wherein the input feature map includes the first input block, the second input block, and the third input block, and wherein the convolution engine is further configured, in a third path following the second path, to: receive a portion of the third input block from the convolution buffer; execute the first hardware layer on the portion of the third input block to obtain a portion of a third output block; retrieve the portion of the second output block from the reserved region; and store the portion of the second output block in the convolution buffer as part of the third output block.
15. A memory management method, comprising: receiving, by a processing unit of an accelerator circuit, a first input block of an input feature map from a main buffer of the accelerator circuit, wherein a size of the input feature map exceeds a storage capacity of the main buffer, and wherein the input feature map includes at least the first input block and a second input block; performing, by the processing unit, a first operation on the first input block to obtain a first output block; storing the first output block in the main buffer; identifying a portion of the first output block as overlapping data corresponding to the first input block and the second input block, wherein identifying the portion of the first output block and storing the portion of the first output block in an auxiliary buffer of the accelerator circuit are performed in a first path of a plurality of paths corresponding to a plurality of blocks, and wherein, in each of the plurality of paths, each of a plurality of linked layers is executed in a manner linked to the other layers of the plurality of linked layers and communicates with the other layers via the main buffer; and storing the portion of the first output block in the auxiliary buffer of the accelerator circuit, wherein the auxiliary buffer has a storage capacity smaller than a size of the first output block and is specifically reserved for the overlapping data in a second path following the first path.
16. The method of claim 15, further comprising, in the second path following the first path: receiving a portion of the second input block from the main buffer; performing the first operation on the portion of the second input block to obtain a portion of a second output block; retrieving the portion of the first output block from the auxiliary buffer; and storing the portion of the first output block in the main buffer as part of the second output block.
17. The method of claim 16, further comprising, in the second path: identifying the portion of the second output block as overlapping data corresponding to the first input block and a third input block, wherein the input feature map includes the first input block, the second input block, and the third input block; storing the portion of the second output block in the auxiliary buffer; receiving a portion of the third input block from the main buffer; performing the first operation on the portion of the third input block to obtain a portion of a third output block; retrieving the portion of the second output block from the auxiliary buffer; and storing the portion of the second output block in the main buffer as part of the third output block.
18. The method of claim 16, further comprising: in the first path, retrieving the first output block from the main buffer, performing a second operation on the first output block to obtain a third output block, and storing the third output block in the main buffer; and in the second path, retrieving the second output block from the main buffer, performing the second operation on the second output block to obtain a fourth output block, and storing the fourth output block in the main buffer.
19. The method of claim 18, wherein performing the first operation includes performing the first operation using a fixed-function engine, and performing the second operation includes performing the second operation using the fixed-function engine.
20. The method of claim 18, wherein performing the first operation includes performing the first operation using a first fixed-function engine that processes a first layer type of a plurality of layers, and performing the second operation includes performing the second operation using a second fixed-function engine that processes a second layer type of the plurality of layers.
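The block-by-block flow recited in claims 1 to 3 (and mirrored in claims 15, 16, and 18) can be sketched in software. The following Python sketch is illustrative only and rests on assumptions that do not come from this disclosure: both linked layers are modeled as one-dimensional, stride-1, unpadded sliding-window operations; the names `conv1d_valid`, `run_linked_layers`, `aux`, and `block_len` are hypothetical; and NumPy arrays stand in for the main buffer, the auxiliary buffer, and external memory. In each path, the first operation is applied only to the portion of the input block that yields new outputs, the overlapping outputs saved during the previous path are retrieved from the auxiliary buffer to complete the output block, and the trailing outputs are saved for the next path.

```python
import numpy as np


def conv1d_valid(x, w):
    """Unpadded, stride-1 sliding-window correlation (stand-in for a layer)."""
    return np.correlate(x, w, mode="valid")


def run_linked_layers(x, w1, w2, block_len):
    """Run two linked layers block by block, reusing overlap via `aux`.

    Returns the concatenated second-layer output and the number of
    first-layer outputs that were actually computed.
    """
    k1, k2 = len(w1), len(w2)
    halo = k2 - 1                      # first-layer outputs shared by adjacent blocks
    n_mid = len(x) - k1 + 1            # length of the full first-layer output
    aux = np.empty(0)                  # auxiliary buffer: holds only the overlap
    results, computed = [], 0

    for start in range(0, n_mid, block_len):      # one iteration per path
        stop = min(start + block_len, n_mid)
        # Portion of the input block needed for the NEW first-layer outputs only.
        new_part = conv1d_valid(x[start:stop + k1 - 1], w1)
        computed += len(new_part)
        # Output block = overlap saved in the previous path + newly computed part.
        out_block = np.concatenate([aux, new_part])
        # Save the trailing overlap for the next path.
        aux = out_block[max(len(out_block) - halo, 0):].copy()
        # Second linked layer consumes the completed output block.
        if len(out_block) >= k2:
            results.append(conv1d_valid(out_block, w2))

    return np.concatenate(results), computed


x = np.arange(20, dtype=float)
w1 = np.array([1.0, 2.0, 1.0])
w2 = np.array([1.0, 0.0, -1.0])
tiled, computed = run_linked_layers(x, w1, w2, block_len=5)
reference = conv1d_valid(conv1d_valid(x, w1), w2)
```

Under these assumptions, the block-wise result matches the unblocked reference and each first-layer output is computed exactly once. Without the auxiliary buffer, every path after the first would have to recompute the `len(w2) - 1` overlapping first-layer outputs, which in turn requires fetching the corresponding halo input data again.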