Convolutional neural network (CNN) accelerator using look-ahead convolution
The CNN accelerator addresses inefficiencies in CNN hardware by serializing input data and processing through pipelined layersets with parallel filter cores, reducing energy and latency while optimizing memory usage and supporting low-precision computations for edge devices.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INSTITUT MINES TELECOM TELECOM BRETAGNE
- Filing Date
- 2025-12-19
- Publication Date
- 2026-07-02
AI Technical Summary
Existing CNN hardware implementations face challenges in computational efficiency, memory management, energy consumption, and latency, particularly in resource-constrained edge devices, with a lack of portability and integrability with existing systems.
A CNN accelerator architecture that serializes input feature maps into 1D vectors, processes them through interconnected pipelined layersets with filter cores operating in parallel, and performs convolutional operations without intermediate storage, using a weight cache and max pooling blocks to optimize memory usage and computational efficiency.
Reduces energy consumption and latency by minimizing external memory access, optimizes memory usage, and supports low-precision implementations, enabling real-time processing and easy integration into existing hardware systems.
Abstract
Description
[0001] CONVOLUTIONAL NEURAL NETWORK (CNN) ACCELERATOR USING LOOK-AHEAD CONVOLUTION
[0002] The invention generally relates to digital Integrated Circuits and, in particular, to a Convolutional Neural Network (CNN) accelerator.
[0003] BACKGROUND
[0004] Convolutional Neural Networks (CNNs) are currently intensively and widely used in many industrial fields. Today, different types of hardware implementations of Convolutional Neural Networks are being used for various on-edge deep learning applications.
[0005] In the past decade, tremendous efforts have been directed toward novel problems relating to deep learning, and new breakthroughs were achieved in relation with Deep Learning models. This resulted in more complex and computationally intensive Deep Learning models. However, in terms of hardware implementations, implementations of complex models, such as CNNs, are faced with many technical challenges. These challenges can mostly be attributed to computational efficiency and memory management.
[0006] As part of huge endeavors to enhance the performances of CNNs, improved hardware implementations of CNNs are also required to meet the needs for low power, low latency, and easily integrated hardware architectures.
[0007] In addition, the development of CNN hardware implementations faces the challenge of deploying CNN models on hardware without a significant decrease in accuracy and precision. Further, in such implementations, the data and parameters cannot be represented in floating point, which makes it necessary to rely on low-bit-width fixed-point representations that balance performance with accuracy.
[0008] Hardware CNN architectures were proposed to try to address the problems of efficiency and memory management as disclosed for example in:
[0009] - Li, H., Yue, X., Wang, Z., Chai, Z., Wang, W., Tomiyama, H. and Meng, L., 2022. Optimizing the deep neural networks by layer-wise refined pruning and the acceleration on FPGA. Computational Intelligence and Neuroscience, 2022;
[0010] - Wu Z, Shen C, Van Den Hengel A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition. 2019 Jun 1;90:119-33;
[0011] - Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy efficient reconfigurable accelerator for deep convolutional neural networks,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 262–263;
[0012] - J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, “14.6 A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems,” in 2016 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 264–265;
[0013] - B. Moons and M. Verhelst, “A 0.3-2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” IEEE Symposium on VLSI Circuits, Digest of Technical Papers, vol. 2016-Septe, pp. 1–2, 2016;
[0014] - B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017, pp. 246–247. [Online]. Available: http://ieeexplore.ieee.org/document/7870353/;
- Lu, L., Xie, J., Huang, R., Zhang, J., Lin, W. and Liang, Y., 2019, April. An efficient hardware accelerator for sparse convolutional neural networks on FPGAs. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (pp. 17-25). IEEE;
[0015] - Aimar A, Mostafa H, Calabrese E, Rios-Navarro A, Tapiador-Morales R, Lungu IA, Milde MB, Corradi F, Linares-Barranco A, Liu SC, Delbruck T. Nullhop: A flexible convolutional neural network accelerator based on sparse representations of feature maps. IEEE transactions on neural networks and learning systems. 2018 Jul 26;30(3):644-56.
[0016] In S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in Proceedings of the Annual International Symposium on Microarchitecture, MICRO, vol. 2016-Dec. IEEE, Oct. 2016, pp. 1–12, Tensor Processing Units (TPUs) were proposed. Such TPUs form chips built for efficient deep learning applications, and use Systolic Array Multipliers to handle convolutional computations. Systolic Array Multipliers have been widely used to address the problem of computational efficiency and to perform the high number of operations required within a CNN. In recent years, many edge devices with smaller form factors (low power consumption and limited resources) were released.
[0017] Such small-form devices can be split into four categories, namely: Edge ML, Mobile, TinyML, and Offline ML. These devices and platforms are dedicated Artificial Intelligence inference engines.
[0018] These dedicated devices correspond to application-specific hardware solutions to the above complex problems associated with CNN hardware implementations. The use of dedicated processing units or tensor cores in these dedicated devices allows users to quickly deploy deep learning models, on edge, with reduced power and cost overheads. These devices adopt different architectures, depending on a specific target application. The majority of these devices adopt a Systolic Array of Processing Elements approach. However, even if these existing devices address the challenge of implementing deep learning on the edge (i.e. locally, without using a distant server like the cloud), they are not adapted to support a wide range of possible applications and integration. Moreover, their over-reliance on custom-made architectures and blocks restricts their portability and integrability with existing systems.
[0019] There is consequently a need for CNN hardware architectures capable of reducing energy consumption and latency, and of reducing the computational complexity of the implementation.
[0020] SUMMARY
[0021] To address these and other problems, there is provided an accelerator implementing a convolutional neural network configured to determine an output feature map, in response to the receipt of an input feature map, the input feature map forming a Q-dimensional tensor, the convolutional neural network comprising one or more convolution layers, a convolution layer being associated with a set of weights. The accelerator comprises:
[0022] - At least one serializer configured to serialize the input feature map into a series of 1D serialized vectors comprising elements of the input feature map,
[0023] - A plurality of layersets, the at least one serializer being configured to broadcast the serialized vectors to at least one of the layersets, the layersets being interconnected in a pipelined fashion, each layerset corresponding to a pipeline stage, each layerset being configured to receive a continuous input stream, each layerset comprising one or more filter cores, each filter core implementing K convolutional filters, each layerset being configured to perform the equivalent computation of a convolutional layer using the outputs from the layerset filter cores.
[0024] In response to the receipt of an input stream from a previous pipeline stage by a given layerset, the input stream is broadcasted to the filter cores of the given layerset, the filter cores of the given layerset being configured to operate in parallel, the outputs of the filter cores being combined to form the output of the given layerset, the output of a given interconnected layerset being directly transmitted to the next layerset, without storage of the given layerset output in an external memory, the last interconnected layerset being configured to provide the output feature map.
[0025] In some aspects, the accelerator may comprise a main configurable interconnection unit configured to broadcast the serialized vectors to at least some of the layersets.
[0026] Each layerset may be implemented using a single hardware cluster. In some embodiments, the accelerator may comprise at least two clusters to implement the layersets, at least one cluster implementing two or more layersets.
[0027] In some embodiments, each filter core may comprise:
[0028] - K convolutional lanes operating in parallel, each convolutional lane being configured to apply a convolutional filter, defined by a subset of weights, to the input feature stream received by the filter core, which provides a convolutional lane output,
[0029] the convolutional lane output provided by each convolutional lane being used to determine a row of the output feature map.
[0030] In some embodiments, each filter core may comprise one or more max pooling blocks, wherein the Max Pooling blocks are configured to apply a max pooling operation to the convolutional outputs delivered by the convolutional lanes, immediately, without storage of the convolutional output.
[0031] Each convolution lane may be associated with a max pooling block arranged at the output of the convolutional lane, each max pooling block being configured to apply a max pooling operation to the convolutional output delivered by the associated convolutional lane immediately, without storage of the convolutional output.
[0032] Each max pooling block may be configured to apply a max pooling sliding window of given dimensions.
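The idea of pooling convolutional outputs immediately, without storing them, can be illustrated in software. The following is a minimal sketch only, assuming a one-dimensional pooling window and a stride parameter; the function name `streaming_max_pool` and the sample values are hypothetical and do not come from the disclosure, which describes hardware blocks and windows of given dimensions.

```python
from collections import deque
from typing import Iterable, Iterator


def streaming_max_pool(stream: Iterable[int], window: int, stride: int) -> Iterator[int]:
    """Yield the maximum of each pooling window as soon as the window is full.

    Only the last `window` values are buffered; the full convolutional
    output is never stored (1D simplification, assumed for illustration).
    """
    buf = deque(maxlen=window)
    for index, value in enumerate(stream):
        buf.append(value)
        # Windows end at positions window-1, window-1+stride, ...
        if index >= window - 1 and (index - (window - 1)) % stride == 0:
            yield max(buf)


# Hypothetical outputs of one convolutional lane, consumed as a stream.
conv_outputs = iter([3, 1, 4, 1, 5, 9, 2, 6])
pooled = list(streaming_max_pool(conv_outputs, window=2, stride=2))
print(pooled)  # [3, 4, 9, 6]
```

Because the generator keeps only a window-sized buffer, memory use is independent of the length of the lane output, which mirrors the "without storage" property in a simplified way.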
[0033] In some aspects, each convolution lane may comprise a processing unit configured to compute the inner product of the received input stream and of the received subset of weights.
[0034] The input feature map may correspond to an image and may be represented by a 4-D tensor of dimensions N x H x W x C, with N being the number of images, H denoting the height of the tensor, W denoting the width of the tensor, and C denoting the number of channels of the tensor.
[0035] In some aspects, the accelerator may comprise a weight cache configured to store the sets of weights of the CNN layers, and a weight distributor configured to distribute the weights stored in the weight cache to the convolutional lanes of the filter cores.
[0036] The layersets are adapted to perform the equivalent computation of P convolutional layers of the CNN without writing the P - 1 intermediary results delivered by the intermediary layersets preceding the last layerset (i.e. the layersets are adapted to merge P layersets).
[0037] A convolutional filter may have a size equal to m x m, and the accelerator may be adapted to use P + m - 1 cores, each layerset comprising (P - k + m - 1) cores for the k-th layerset. In some embodiments, the accelerator may comprise an encoder configured to compress the input feature map prior to being serialized by the serialization device.
[0038] The accelerator may further comprise a decoder configured to decompress the input feature stream prior to entering a convolutional lane.
[0039] The embodiments of the disclosure thereby provide an improved hardware architecture for the acceleration of convolutional neural networks which may be implemented on semiconductors. They enable optimizing memory usage and computational efficiency while being particularly adapted to highly resource-constrained environments. They further provide low power, small form, and optimized CNN hardware implementations which are particularly suitable for edge applications of deep learning.
[0040] The embodiments of the disclosure further allow for easy integration into existing hardware systems. They also rely on a hardware architecture capable of supporting low-precision and quantization implementations of models to efficiently use the available resources and achieve high performance with low power and small size.
[0041] The architecture allows real-time processing and reduced latency by merging the processing of two or more convolutional layers without writing intermediate feature maps to the external memory.
[0042] The accelerator according to embodiments of the disclosure provides an optimized and efficient way to implement convolutional operations on semiconductors, allowing for the conversion of incoming data into matrices. It further alleviates the problems related to memory access and the need to reuse model parameters at run-time efficiently while greatly reducing the number of memory accesses, thereby highly improving performance and enabling computational optimization.
[0043] The accelerator according to embodiments of the disclosure enables reducing energy consumption and latency and, more specifically, reducing external memory accesses (the most energy-consuming operations) and reducing the bit-width for convolution without sacrificing accuracy.
[0044] BRIEF DESCRIPTION OF THE DRAWINGS
[0045] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate various embodiments of the invention and, together with the general description of the invention given above, and the detailed description of the embodiments given below, serve to explain the embodiments of the invention.
[0046] - Figure 1A represents an overview of the CNN accelerator, adapted to merge convolutional layers as an example, according to some embodiments;
- Figure 1B depicts the pipelined processing performed by the L layersets, according to an exemplary implementation;
[0047] - Figure 2 represents an exemplary implementation of the layersets, according to some embodiments;
[0048] - Figure 3 represents an exemplary implementation of the CNN accelerator, adapted to merge three convolutional layers, according to some embodiments;
[0049] - Figure 4 represents the structure of a filter core, according to some embodiments;
[0050] - Figure 5 illustrates the ‘look-ahead’ scheme in embodiments where three convolutional layers are merged into one iteration, according to an exemplary embodiment;
[0051] - Figure 6 represents the processing of a convolutional lane, according to an exemplary embodiment where each convolutional lane is configured for 8-bit feature map, and 1, 2, 4 bit weights;
[0052] - Figure 7 represents the processing of a convolutional lane according to an exemplary embodiment where each convolutional lane is configured for 8-bit feature map, and bit weights;
[0053] - Figure 8 depicts the input and outputs of each filter core;
[0054] - Figure 9 depicts the streaming operation corresponding to the presentation of vectors Fi (input feature stream) by a serializer to a filter core and the output feature stream obtained at the output of this filter core;
[0055] - Figure 10 represents another exemplary implementation of the layersets to merge three convolutional layers of the CNN;
[0056] - Figure 11 is a diagram representing an exemplary implementation of accelerator 1 according to the pipelined process illustrated by figure 10;
[0057] - Figure 12 depicts the method implemented by the accelerator to determine an output feature map in response to a received input feature map, according to some embodiments;
[0058] - Figure 13 represents a convolutional filtering system in which the serializer 3 may be generally used;
[0059] - Figure 14 depicts an implementation of the serializer, according to some embodiments;
- Figure 15 represents an exemplary implementation of the memory device of a serializer consisting of a single dual-port memory;
[0060] - Figure 16 represents a dual-memory assembly comprising one write port and two read ports;
- Figure 17 depicts a single convolutional patch extraction that may be performed by the serializer from an input feature map, using a single dual-port memory;
[0061] - Figure 18 illustrates multiple patch extraction, according to some embodiments;
[0062] - Figure 19 is a flow chart depicting the parallel read / write operations performed by the serializer, according to some embodiments;
[0063] - Figure 20 illustrates multiple patch extraction, according to embodiments where the input images are streamed in a column major order; and
[0064] - Figure 21 depicts the parallel read / write operations performed by the serializer in embodiments using column major operation, according to some embodiments.
[0065] DETAILED DESCRIPTION
[0066] Embodiments of the invention provide an improved CNN accelerator enabling energy consumption reduction.
[0067] Figure 1A represents an overview of the CNN accelerator 1 according to some embodiments. The CNN accelerator 1 implements a Convolutional Neural Network (CNN) to determine the output feature map in response to a received input feature map.
[0068] The CNN accelerator 1 is configured to receive the input feature map and process it to enable a parallel processing of the input feature map by the CNN, without requiring intermediary storage.
[0069] To facilitate the understanding of the embodiments of the disclosure, some conventional notions related to the operation of a Convolutional Neural Network (CNN) are first defined.
[0070] A CNN comprises a set of convolutional layers comprising a set of intermediary layers (e.g. hidden layers) and an output layer. The CNN layers are adapted to learn the patterns of the input data and transform them into a representation that is exploited by the CNN to determine predictions (for example classify an input image).
[0071] Each layer of the neural network comprises artificial neurons (also called neurons). The information received as input by a given neural network layer is processed and passed through the neurons of the layer. Each neuron of a layer is configured to apply a transformation to the received input. The successive transformations performed at the successive convolutional layers produce an output at the output layer.
[0072] A neuron in the neural network is characterized by a set of weights representing the strength/amplitude of the connections between the neuron inputs and the neuron (a weight thereby represents the strength/amplitude of the connection between two neurons). Each CNN layer is associated with an activation function introducing non-linearity. A CNN layer uses the set of weights associated with its neurons, biases used to shift the input to the activation function, and its activation function to determine the convolution layer output.
[0073] The weights and the biases (generally called the CNN model parameters) may be learnt during the training of the CNN.
[0074] During the training of the CNN, the CNN implements one or more epochs, each comprising a forward propagation phase and a backward propagation phase.
[0075] The forward propagation phase refers to the process where the input data is passed through the layers of the CNN, using the weights and activation functions to compute the output.
[0076] A convolutional layer applies a convolutional operation to received input data corresponding to the input feature map applied to the CNN to filter the information and produce a feature map. The feature map produced by the last layer of the CNN corresponds to the output feature map.
[0077] In a forward propagation phase, data flows through the CNN layers of the CNN until arriving to the output layer. During a forward propagation phase, at each CNN layer, the output data of the previous CNN layer is received as the current layer inputs, the weighted sum of the received layer inputs and biases is calculated for each neuron of the current CNN layer using the weights associated with the connections of the neuron, and the activation function associated with the current CNN layer is applied to the weighted sum, which provides the CNN layer output.
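As an illustration of the forward computation described above (weighted sum of the layer inputs plus bias, followed by the activation function), the following minimal sketch computes the output of one small fully connected layer. The ReLU activation, the function names and the numeric values are assumptions chosen for the example; the disclosure does not prescribe them.

```python
def relu(x: float) -> float:
    """Rectified linear unit, used here as an assumed activation function."""
    return max(0.0, x)


def layer_forward(inputs, weights, biases, activation=relu):
    """Compute one layer output: activation(weighted sum + bias) per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Two inputs feeding two neurons (hypothetical values).
layer_output = layer_forward(
    inputs=[1.0, 2.0],
    weights=[[0.5, -1.0], [1.0, 1.0]],
    biases=[0.0, -1.0],
)
print(layer_output)  # [0.0, 2.0]
```

The first neuron sees a weighted sum of -1.5, which the activation clamps to 0.0; the second sees 2.0 and passes it through unchanged.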
[0078] In the backward propagation phase, the neural network is configured to calculate the gradients of a loss function (also called cost function), the gradients corresponding to the gradients of the loss with respect to the current network parameters (weights and biases), and apply a gradient descent algorithm to update the weights and the biases of the neural network model so as to minimize the loss function.
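The parameter update performed during backward propagation can be sketched on the smallest possible model. The example below is an assumption-laden illustration, not the training procedure of the disclosure: it uses a single linear neuron, a squared-error loss, a fixed learning rate, and hypothetical names (`sgd_step`, `lr`).

```python
def sgd_step(w: float, b: float, x: float, target: float, lr: float):
    """One gradient descent step for prediction = w * x + b, loss = error ** 2."""
    error = (w * x + b) - target
    grad_w = 2.0 * error * x  # d(loss)/dw by the chain rule
    grad_b = 2.0 * error      # d(loss)/db
    return w - lr * grad_w, b - lr * grad_b


w, b = 0.0, 0.0
for _ in range(200):
    w, b = sgd_step(w, b, x=2.0, target=4.0, lr=0.05)

final_loss = (w * 2.0 + b - 4.0) ** 2
```

With these values each step halves the prediction error, so the loss shrinks toward zero; a real CNN applies the same update rule to every weight and bias, with gradients obtained by backpropagation through all layers.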
[0079] The trained CNN can then be used to make predictions in response to an input feature map, using the learnt model parameters, for example during an inference phase, or online if the CNN is an online CNN.
[0080] As shown in figure 1A, the CNN accelerator 1 according to the embodiments of the disclosure comprises one or more serializers 3 and N filter cores 22 configured to perform convolution operations in parallel in response to the receipt of a part of the input feature map. The serializer(s) may be connected to the N filter cores 22 using at least one main configurable interconnection unit CIU 4. For simplification purposes, the following description of some embodiments of the disclosure will be mainly made with reference to an accelerator 1 using one serializer 3.
[0081] According to some aspects, the N filter cores 22 of the accelerator 1 may be advantageously organized in L layersets 2, each comprising one or more cores 22, and each layerset being configured to perform the equivalent computation of a convolutional layer. As used herein, the equivalent computation of a convolutional layer of the CNN refers to the convolutional operation performed by this CNN layer.
[0082] The layersets 2 are interconnected in a pipelined fashion. Each layerset 2 therefore corresponds to a set of filter cores 22 and uses the outputs delivered by its filter cores to perform the equivalent computation of a CNN convolutional layer. Each layerset 2 corresponds to a pipeline stage. The layersets 2 may be interconnected in a pipelined fashion using auxiliary Configurable Interconnection Units, CIUs, (not shown in figure 1A), also called layerset CIUs.
[0083] In some embodiments, the L layersets 2 may be mapped to c hardware clusters 2 (in the example of figure 1A, c = 4) comprising at least two clusters. In some embodiments, one or more layersets 2 may be implemented inside a single hardware cluster. Alternatively, a hardware cluster may be used to implement two or more layersets.
[0084] A cluster may be therefore used to perform the equivalent computation of one or more convolutional layers of the CNN.
[0085] The following description of some embodiments will be made mainly with reference to an implementation of the L layersets in the form of c clusters for illustration purpose only, although the skilled person will readily understand that the disclosure is not limited to such implementation of the layersets.
[0086] The input feature map refers to the initial data structure that is processed by the CNN accelerator to generate multiple output feature maps.
[0087] The input feature map and the output feature map may be represented by a Q-dimensional tensor, Q being an integer number at least equal to 1. The Q-dimensional tensor representing the input feature map will be denoted I.
[0088] For example, the input feature map may correspond to at least one input image. In this case, the input feature map may be represented by a 4-dimensional (4-D) tensor I (Q = 4) of dimensions < N, H, W, C > (the tensor I is also denoted I < N, H, W, C >), where N is the number of images, H denotes the height of the input image, W denotes the width of the input image, and C denotes the number of features or channels. Alternatively, if the input feature map corresponds to only one input image, the input feature map may be represented by a 3-D tensor I (Q = 3) of dimensions < H, W, C > (also denoted by I < H, W, C >), with H denoting the height of the input image, W denoting the width of the input image, and C denoting the number of features or channels. For example, the input feature map may be an RGB (Red, Green, Blue) color image, with the parameter C being equal to 3 (C = 3).
[0089] Although the description of embodiments of the disclosure will be made essentially with reference to an input feature map of image type (case of an input feature map corresponding to one image) represented by a tensor < H, W, C >, the skilled person will readily understand that the embodiments of the disclosure are not limited to such type of input feature map and encompasses any input feature map on which convolutional filtering may be applied, such as for example an audio input feature map.
[0090] A serializer 3 (also called ‘serialization device’) is configured to serialize the input feature map represented by a Q-D tensor I into a set of serialized vectors, each corresponding to a convolutional patch comprising elements of the input feature map.
[0091] A convolutional patch thereby corresponds to a subset of the Q-D tensor I < H, W, C >, denoted CP < fh,fw, C >, corresponding to a region in I of width fw, height fh, and comprising C channels.
[0092] The input feature map may be received by the serializer 3 from an input memory 10, which may be for example a Double Data Rate (DDR) memory accessed through a Direct Memory Access (DMA) such as an AXI (AMBA Advanced eXtensible Interface) DMA.
[0093] The output feature map may be delivered by the CNN accelerator 1 to an output memory 30, which may be for example a DDR memory accessed through a Direct Memory Access (DMA), such as for example an AXI DMA.
[0094] The serializer 3 is configured to receive the input feature map I and serialize it (or ‘transform’ it) into a series of 1D vectors $F_i$ corresponding to convolutional patches, each of size $N_v$, where $N_v$ is equal to $f_h \times f_w \times C$.
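The serialization of an input feature map I < H, W, C > into 1D vectors of size fh x fw x C can be sketched in software as follows. This is an illustrative model only: the row-major traversal, the stride of 1, the absence of padding, and the name `serialize_patches` are assumptions, whereas the serializer of the disclosure is a hardware device.

```python
def serialize_patches(feature_map, fh, fw, stride=1):
    """Yield each convolutional patch CP<fh, fw, C> as a flat 1D vector.

    `feature_map` is a nested list indexed as [row][column][channel].
    Each yielded vector has fh * fw * C elements.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    for top in range(0, height - fh + 1, stride):
        for left in range(0, width - fw + 1, stride):
            vector = []
            for row in range(top, top + fh):
                for col in range(left, left + fw):
                    vector.extend(feature_map[row][col])  # all C channels
            yield vector


# 3 x 3 single-channel map holding the values 0..8 (hypothetical data).
feature_map = [[[r * 3 + c] for c in range(3)] for r in range(3)]
vectors = list(serialize_patches(feature_map, fh=2, fw=2))
print(vectors)  # [[0, 1, 3, 4], [1, 2, 4, 5], [3, 4, 6, 7], [4, 5, 7, 8]]
```

Using a generator means that each vector can be handed to the consumer as soon as it is formed, which loosely mirrors the streaming behaviour described for the serializer.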
[0097] The CNN accelerator 1 may comprise a serialized vector storage memory adapted to store the serialized 1D vectors generated by the serializer 3. In an embodiment, the serialized vector storage memory may be for example and without limitations, a Block RAM (BRAM) in an FPGA implementation of the accelerator 1, a BRAM being suitable to handle large volume of data, or SRAM macros for ASIC implementations.
[0098] The input feature maps may be loaded from the input memory 10 and fed into the serializer 3. The serialized 1D vectors may be broadcast to at least some of the layersets 2 that are adapted to operate in parallel. In figure 1A, the accelerator 1 comprises for example and without limitations 16 filter cores 22 organized in 4 layersets (L = 4). All of the cores may or may not be used in a particular configuration.
[0099] The serializer 3 may be adapted to convert incoming high dimensional data (corresponding to the input feature map) into a number of serialized 1D vectors suitable for the processing by the layersets 2, while allowing for continuous streaming of input data through the pipeline by prebuffering incoming stream elements while the dedicated processing resources are occupied.
[0100] In some embodiments, at least some of the layersets 2 may process the serialized vectors determined by the serializer 3 in parallel and the outputs of one or more layersets 2 may be combined to determine the output feature map.
[0101] As shown in figure 1A, the main configurable interconnection unit (CIU) 4 is configured to broadcast the serialized vectors to selected layersets 2 to enable parallel processing.
[0102] More specifically, the configurable interconnection unit (CIU) 4 may be configured to transmit a serialized vector, corresponding to a convolutional patch, to at least one layerset 2.
[0103] A number of layersets 2 of the accelerator 1 may perform in parallel the convolutional operations related to the different layers of the CNN, thereby merging these layers.
[0104] The accelerator 1 according to the embodiments of the disclosure enables a look-ahead computational optimization. The “look-ahead” scheme relies on the merging of two or more CNN convolutional layers.
[0105] Figure 1B depicts the pipelined processing performed by the L layersets 2 according to an exemplary implementation.
[0106] In the example of figure 1B, the N filter cores 22 are organized in L = 3 layersets denoted 2_1, 2_2, 2_3, each comprising one or more filter cores 22 and each being configured to perform the equivalent computation of a convolutional layer (convolutional layer L1 for layerset 2_1, convolutional layer L2 for layerset 2_2, convolutional layer L3 for layerset 2_3).
[0107] In this example, the first layerset 2_1 comprises 3 filter cores 22, the second layerset 2_2 comprises 2 filter cores 22 and the third layerset 2_3 comprises a single filter core 22.
[0108] For example, the layersets 2 may be physically implemented with clusters such that the three cores 22 of the first layerset 2_1 are mapped to a first cluster ‘cluster 1’, the 2 cores 22 of the layerset 2_2 are mapped to a second cluster ‘cluster 2’, and the final core 22 of the layerset 2_3 is mapped to a third cluster ‘cluster 3’. A given input feature map I may be received in the input memory 10 in block 100 and serialized by the serializer(s) 3 into serialized vectors, in block 103.
[0109] In the example of figure 1B, serialized vectors are transmitted by a serializer 3 to the layerset 2_1, through the main CIU 4, in block 104, each serialized vector corresponding to a convolutional patch, while each layerset 2_1, 2_2, and 2_3 may be configured to perform the convolutional operations related to a given layer of the CNN.
[0110] As shown in figure 1B, the three layersets 2_1, 2_2, and 2_3 are interconnected in a pipelined fashion so that when the input stream is being processed by layerset 2_1 (blocks 105, 106 and 107), the output of layerset 2_1 is being processed in parallel by the next layerset 2_2 (blocks 109, 110), and the output of layerset 2_2 is being processed in parallel by the next layerset 2_3 (block 112). The layersets may be interconnected by layerset interconnections at blocks 108, 111, for example using a layerset output CIU 25 at the output of each layerset 2.
[0111] The final layerset 2_3 delivers an output. The plurality of layerset outputs may then be combined to determine the output feature map that may be sent to the output memory 30, in block 114 (for example through DMA if the output memory is a DDR).
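The pipelined chaining of layersets, where each stage consumes the stream produced by the previous stage without the intermediate feature map being stored, can be modelled with chained generators. The sketch below is a software analogy under stated assumptions: each stage is reduced to a streaming 1D sliding dot product, the weights and data are hypothetical, and Python generators interleave sequentially rather than running in parallel as hardware stages would.

```python
from collections import deque


def layerset_stage(stream, weights):
    """Streaming stage: emit a sliding dot product as soon as enough inputs arrived."""
    window = deque(maxlen=len(weights))  # only the current window is buffered
    for value in stream:
        window.append(value)
        if len(window) == len(weights):
            yield sum(w * x for w, x in zip(weights, window))


def run_pipeline(input_stream, stage_weights):
    """Chain the stages lazily; no intermediate result list is ever built."""
    stream = iter(input_stream)
    for weights in stage_weights:
        stream = layerset_stage(stream, weights)
    return stream


def counted(values, log):
    """Pass values through while recording how many inputs were consumed."""
    for value in values:
        log.append(value)
        yield value


STAGE_WEIGHTS = [[1, 1], [1, 2], [1, -1]]  # three merged stages (assumed weights)
DATA = [1, 0, 2, 5, 1, 3, 0]

outputs = list(run_pipeline(DATA, STAGE_WEIGHTS))
print(outputs)  # [-11, -3, 5, 4]

consumed = []
pipeline = run_pipeline(counted(DATA, consumed), STAGE_WEIGHTS)
first_output = next(pipeline)
inputs_needed = len(consumed)
print(first_output, inputs_needed)  # -11 4
```

The last two lines show that the first output of the third stage is available after only four input elements have been read, long before the first stage has finished, which is the behaviour that makes writing intermediate results to memory unnecessary in this simplified model.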
[0112] Figure 2 represents an exemplary implementation of layerset 2 in the form of a cluster, according to some embodiments. In such embodiment, each layerset 2 of the accelerator 1 is implemented by a single cluster. Each layerset 2 (and therefore each cluster in this embodiment) comprises a number n of filter cores 22 (n being at least equal to one) adapted to process in parallel a data stream received by the layerset 2. In the example of figure 2, the number of filter cores 22 per layerset is n = 4.
[0113] The CNN accelerator 1 therefore comprises n x L filter cores 22 adapted to perform the convolution operations of the CNN while limiting the number of memory accesses.
[0114] A given layerset will also be denoted as layerset 2_i, using index i with i being comprised between 1 and L (the reference 2 will also be used to generally denote the layersets).
[0115] Each layerset 2_i may use a variable number
[0116] n_i
[0117] of filter cores for the processing of a given input feature map received in the input memory 10. Therefore, the number of filter cores 22 used by a layerset 2 in response to the receipt of a given input feature map I to be processed by the accelerator 1 may be less than or equal to n (i.e. is comprised between 1 and n).
[0118] In each layerset 2_i, the input feature stream received by the layerset 2_i may be sent to at least some of the filter cores 22 of the layerset.
[0119] Within each layerset 2_i, the data are distributed to a number of the filter cores 22. Each filter core 22 is configured to process the data in parallel across K convolutional lanes. As shown in figure 2, in some embodiments, each layerset 2 may comprise or be associated with a layerset input CIU 24 (also called ‘first layerset CIU’ or ‘internal broadcasting unit’). A layerset 2 may receive its input from its associated layerset input CIU 24. The layerset input CIU 24 associated with a layerset 2 is configured to broadcast an input stream received by the layerset 2, corresponding to a serialized vector determined by the serializer(s), to at least some of the filter cores 22 of the layerset 2. Each filter core 22 provides a filter output, in a continuous stream.
[0120] Data may be exchanged in the CNN accelerator 1 and in the different layersets 2 according to a data exchange protocol in the form of a continuous data stream, such as for example and without limitation the AXI protocol (AMBA Advanced eXtensible Interface).
[0121] A layerset 2 may further comprise or be associated with a layerset output CIU 25 (also called ‘second layerset CIU’ or ‘internal combining unit’) configured to combine the core filter outputs (up to n_i core filter outputs) provided by the filter cores 22 of the layerset 2 into a single output feature stream. The output feature stream forms the layerset output that is delivered at the output of the layerset 2.
[0122] In some embodiments, at least some of the layersets 2 may be interconnected by one or more associated layerset configurable interconnection units (CIU) 24 and 25, so that the output of at least one layerset 2_i (for example 2_1 in figure 1A) forms the input of a next layerset 2_{i+1} (for example 2_2 in figure 1A), without intermediary storage of the output of the layerset 2_i prior to its transmission to the next layerset 2_{i+1}, while the two interconnected layersets are operating in parallel in a pipelined fashion, as depicted in figure 1A.
[0123] Assuming that P interconnected layersets 2 are used for the processing of an input feature map I, each interconnected layerset 2_i from i = 1 to i = P - 1 may be configured to perform the equivalent computation of the i-th convolutional layer of the CNN, without writing the intermediate output of the i-th layerset 2_i (layerset output) in an external memory. The P-th layerset 2_P, corresponding to i = P, provides the output feature map of the P-th layer that may be stored in the external memory 30. This enables merging P convolutional layers. As used herein, merging P convolutional layers means performing the equivalent computation of P convolutional layers of the CNN without writing the P - 1 intermediary results (for i = 1 to P - 1) in an external memory.
[0124] For example, in the case of figure 1B, the first layerset 2_1 receives its input from the layerset input CIU 24 associated with the layerset 2_1, and the output streams delivered by the filter cores 22 of the layerset 2_1 are combined in the layerset output CIU 25 of the layerset 2_1. The second layerset 2_2 receives its input stream from the layerset output CIU 25 of the first layerset 2_1, and the third layerset 2_3 receives its input from the layerset output CIU 25 of the second layerset 2_2.
[0125] The layerset input CIU 24 associated with each layerset 2_1, 2_2, and 2_3 is configured to receive an input stream and broadcast it to the filter cores 22 of the corresponding layerset, while the outputs of the filter cores 22 of the corresponding layerset are combined by the associated layerset output CIUs 25 and transmitted to the next layerset.
[0126] In particular, the layerset input CIU 24 associated with the first layerset 2_1 is configured to receive a serialized vector determined by the serializer 3 and broadcast it to the used filter cores 22 of the first layerset 2_1 (the first layerset 2_1 using n_1 = 3 filter cores 22). The layerset output CIU 25 associated with the first layerset 2_1 is configured to combine the outputs of the three used filter cores 22 of the considered layerset 2_1, which provides the first layerset output, and transmit the first layerset output to the second layerset 2_2. The second layerset 2_2 uses n_2 = 2 filter cores 22. The layerset input CIU 24 of the second layerset 2_2 is configured to receive the first layerset output (received from the first layerset 2_1 through the layerset output CIU 25) and to broadcast it to the two filter cores 22 of the second layerset 2_2. The layerset output CIU 25 corresponding to the second layerset 2_2 is configured to combine the outputs of the two filter cores 22 of the second layerset 2_2, and transmit the result to the third layerset 2_3, which comprises only one filter core 22 (n_3 = 1). The output of the filter core of the third layerset 2_3 provides the third layerset output, corresponding to the output feature map of the accelerator 1, which may be stored in the output memory 30.
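The broadcast-and-combine data flow of this example can be mirrored in software. The following minimal Python sketch is illustrative only: the helper names `make_core`, `layerset` and `accelerator`, the small integer weights, and the modelling of each filter core as a single dot product are all assumptions made for the example, not the circuit of the disclosure.

```python
from typing import Callable, List, Sequence

Vector = List[int]

def make_core(weights: Sequence[int]) -> Callable[[Vector], int]:
    """Hypothetical filter core: one dot product per received vector."""
    def core(vec: Vector) -> int:
        return sum(w * x for w, x in zip(weights, vec))
    return core

def layerset(cores: Sequence[Callable[[Vector], int]], vec: Vector) -> Vector:
    # Role of the layerset input CIU 24: every used core receives the same vector.
    outputs = [core(vec) for core in cores]
    # Role of the layerset output CIU 25: core outputs are combined into one vector.
    return outputs

# n_1 = 3, n_2 = 2 and n_3 = 1 cores, as in the example above (weights are arbitrary).
layerset_1 = [make_core(w) for w in ([1, 0], [0, 1], [1, 1])]
layerset_2 = [make_core(w) for w in ([1, 1, 1], [1, 0, -1])]
layerset_3 = [make_core([2, 1])]

def accelerator(vec: Vector) -> Vector:
    for cores in (layerset_1, layerset_2, layerset_3):
        vec = layerset(cores, vec)  # handed to the next layerset, never stored externally
    return vec

print(accelerator([2, 3]))
```

Only the value returned by the last layerset would be written to the output memory; the two intermediate vectors exist only as hand-offs between stages.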
[0127] The first layerset 2_1 is configured to perform the equivalent computation of the first convolutional layer L_1
[0128] of the CNN without writing the intermediary output of the first layerset 2_1 in an external memory. The second layerset 2_2 is configured to perform the equivalent computation of the second convolutional layer L_2 of the CNN without writing the intermediary output of the second layerset 2_2 in an external memory. In this example, this enables merging P = 3 convolutional layers of the CNN (i.e. performing the equivalent computation of 3 convolutional layers of the CNN without writing the intermediary layerset outputs of the first layerset 2_1 and of the second layerset 2_2 in an external memory).
[0129] More generally, in response to a received input feature map, the CNN accelerator 1 may select P layersets 2 for the computation of the output feature map delivered by the CNN accelerator 1. The CNN accelerator 1 may then interconnect the P layersets 2_i (P being an integer at least equal to 2), such that the output of each layerset 2_i for i = 1 to P - 1 forms the input of the next layerset 2_{i+1} interconnected thereto, the output of the last interconnected layerset 2_P providing the output feature map that may be stored in external memory (for example through DMA, if the external memory 30 is a DDR memory). The i-th layerset 2_i may be configured to perform the equivalent computation of the i-th convolutional layer of the CNN, while the P - 1 intermediary outputs of the layersets 2_i for i = 1 to P - 1 are directly sent to the next interconnected layerset 2_{i+1} without intermediary storage in an external memory (thereby, P convolutional layers of the CNN are merged in this case).
[0130] In response to the receipt of the input feature map, the serializer 3 may initially send a first serialized vector to the first layerset (2_1), which is adapted to process the received serialized vector, determine its layerset output, and directly send the layerset output of this first layerset to the next layerset (2_2), without intermediary storage. In the next processing iterations, for each index i = 2 to P - 1, each layerset 2_i of the interconnected layersets may directly (i.e. without intermediary storage) receive as input the output stream delivered by the previous layerset 2_{i-1} and process it in parallel to the other layersets 2. The ‘parallel’ and ‘pipelined’ processing implemented by the layersets 2 means that the previous layersets among the P selected layersets, in the interconnected layerset chain, may continue to process in parallel other serialized vectors resulting from the serialization of the input feature stream, while the next layersets in the interconnected chain may be terminating the processing of a current serialized vector.
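The hand-off between interconnected layersets described above can be approximated with chained Python generators. This is a sketch under stated assumptions: `layerset_stage` and `run_pipeline` are hypothetical names, the `item + 1` operation is a placeholder for the per-layer convolution, and generators interleave the stages sequentially rather than running them truly in parallel as the hardware does.

```python
from typing import Iterable, Iterator, List, Tuple

def layerset_stage(name: str, stream: Iterable[int], log: List[str]) -> Iterator[int]:
    """Hypothetical layerset: consumes one item, emits one item, stores nothing."""
    for item in stream:
        log.append(f"{name} processes {item}")
        yield item + 1  # placeholder for the convolution performed by this layer

def run_pipeline(vectors: Iterable[int], p: int) -> Tuple[List[int], List[str]]:
    log: List[str] = []
    stream: Iterable[int] = vectors
    for i in range(1, p + 1):  # chain P layersets 2_1 ... 2_P
        stream = layerset_stage(f"layerset {i}", stream, log)
    outputs = list(stream)     # only the output of the P-th layerset is collected
    return outputs, log

outputs, log = run_pipeline([10, 20], p=3)
for line in log:
    print(line)
```

The log shows that the first layerset starts on the second vector as soon as the first vector has moved down the chain, and that no intermediate list of layer outputs is ever built.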
[0131] Figure 4 is a detailed view of the structure of a filter core 22, according to an exemplary hardware implementation.
[0132] The input feature stream received by a layerset 2, and the data stream transmitted by the main CIU 4, form a data stream having a given width, such as for example an AXI stream of that width.
[0133] The outputs of the filter core 22, and the output of the CIU 25 may also form a data stream having a given width, such as for example an AXI stream.
[0134] The accelerator 1 comprises a weight cache 5 configured to store weights (also called ‘coefficients’) of the CNN and distribute the stored weights to the different filter cores 22 of the layersets 2, the weights being used by each filter core 22 to perform CNN convolution operations.
[0135] The weight cache 5 may be configured to determine sets of weights to be distributed to each filter core 22.
[0136] In some embodiments, the weight cache 5 may be on-chip weight memory, which can be for example a RAM.
[0137] The weights may be previously learnt during a training phase, for example in a cloud.
[0138] The weight cache 5 may be configured to distribute the weights differently depending on the received serialized vectors. The weight stream transmitted by the weight cache 5, and the data stream transmitted by the layerset input CIU 24 of a layerset 2 to the layerset filter cores 22, may form a data stream having a given width, such as for example an AXI stream of that width.
[0139] Each filter core 22 may receive the input feature stream received by the associated layerset 2. Each filter core 22 corresponds to a convolutional filter forming a weight matrix of height f_h and width f_w comprising the weights applied to the filter core 22.
[0140] The convolutional filter therefore has dimensions f_h x f_w, where f_h is the filter height and f_w is the filter width.
[0141] Each filter core 22 comprises K convolutional lanes 2202 configured to process the input feature stream received by the filter core 22 in parallel, using weights among the weights transmitted to the filter core 22 by the weight cache 5. A filter core 22 may comprise a weight distributor 2201 to distribute the weights to the convolutional lanes 2202.
[0142] Each convolutional lane 2202 may be configured to simultaneously compute one row of the output feature map delivered by the accelerator 1 using the received input feature stream and a weight stream.
[0143] Each filter core 22 may comprise an elementary broadcasting block 2202 configured to broadcast the received input feature stream to the different convolutional lanes 2202.
[0144] Each convolutional lane 2202 may be configured to apply one filter, also called elementary filter, to the input data stream received by the filter core 22, which provides a convolutional lane output for each convolutional lane 2202. The elementary filter applied by a convolutional lane 2202 is defined by a subset of weights received by the convolutional lane 2202. In embodiments using layersets 2 mapped to c clusters, the architecture of the accelerator 1 supports up to c clusters, each with up to n filter cores 22, each core 22 comprising K convolutional lanes, allowing up to c x n x K filters to be applied in parallel (P x n x K if P clusters are interconnected, out of the c clusters, for the processing of the current input feature map I).
[0145] The weight distributor 2201 may be configured to read and distribute subsets of weights, each subset of weights corresponding to a convolutional lane 2202, from an on-chip memory. This allows for more optimized data reuse (expressed in terms of MAC / Data).
[0146] The filter weight data reuse is H x W MAC/data, and the input feature map data reuse is K MAC/data.
[0147] Index j will be used to distinguish the convolutional lanes, the j-th convolutional lane being thereby denoted 2202-j. Each convolutional lane 2202-j is configured to perform a convolutional operation using the weights received from the weight distributor 2201 and determine one row of the output feature map. The different convolutional lanes 2202 are configured to operate simultaneously.
[0148] Each convolutional lane 2202-j is configured to receive two streams comprising the input feature stream and the corresponding weight stream applied to the convolutional lane. Each convolutional lane 2202-j may be configured to receive the weights continuously. For example, for a full layer operation, each convolutional lane 2202-j may receive f_h x f_w x C weights, repeated H x W times.
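The reuse figures and weight counts stated above follow from simple arithmetic, which the short Python sketch below works through. The function name `lane_traffic`, the dictionary keys and the chosen layer dimensions are assumptions made for illustration only.

```python
def lane_traffic(fh: int, fw: int, c: int, h: int, w: int, k: int) -> dict:
    """Count weights and MACs for one filter core with k lanes (illustrative)."""
    weights_per_filter = fh * fw * c            # f_h x f_w x C weights per filter
    macs_per_lane = weights_per_filter * h * w  # the weight set is replayed H x W times
    return {
        "weights_per_filter": weights_per_filter,
        "weights_streamed_per_lane": macs_per_lane,
        # MACs performed per distinct weight value: H x W
        "weight_reuse": macs_per_lane // weights_per_filter,
        # each broadcast input element feeds one MAC in each of the k lanes
        "input_reuse": k,
    }

stats = lane_traffic(fh=3, fw=3, c=3, h=8, w=8, k=32)
print(stats)
```

For a 3x3x3 filter on an 8x8 output, each lane sees 27 distinct weights replayed 64 times, so each weight is reused in 64 MAC operations, while each input element is reused across the 32 lanes.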
[0149] The weight distributor 2201 may be configured to distribute the K filter weights received by the filter core 22 to K convolutional lanes 2202 from an on-chip weight memory (such as a RAM).
[0150] In some embodiments, each filter core 22 may optionally comprise a plurality of Max Pooling blocks 2204 associated with the convolutional lanes 2202. For example, the Max Pooling blocks 2204 may comprise a Max Pooling block associated with each respective convolutional lane 2202 (i.e. the number of convolutional lanes is equal to the number of Max Pooling blocks). A Max Pooling block 2204 may be arranged at the output of the associated convolutional lane 2202 and may be configured to receive the convolutional output determined by the associated convolutional lane 2202 and to apply a sliding window across the values of the convolutional lane output to select the maximal value within the window. Each maximum value becomes a single pixel in the new pooled output. The window is then slid across the convolutional lane output received by the Max Pooling block 2204 by a stride of a certain number of pixels, and the process is repeated until the entire convolutional lane output has been processed.
[0151] A Max Pooling block 2204 (the j-th max pooling block being also denoted 2204-j) is configured to process the outputs as they are generated by the associated convolutional lane, thereby optimizing performance by overlapping convolution and pooling operations.
[0152] The Max Pooling blocks 2204 can start processing the values of the convolutional lane outputs immediately when they are available, without requiring prior storage of the outputs of the convolutional lanes 2202. The sliding window has a given size and a given stride. For example, the sliding window can have a size 2x2 and a stride equal to 2 (i.e. the stride with which the window is moved is 2 pixels).
[0153] This allows for applying the Max Pooling processing on the received data stream immediately without the need to save results into memory. This may be implemented by merging several convolutional lanes 2202, one after another. For example, this may be implemented by dedicating 2 convolutional lanes per data stream, effectively allowing output feature map elements to be generated for the Max Pooling block 2204 to start processing them using a 2x2 window. In such cases, the max pooling blocks 2204 and the convolutional lanes 2202 may be merged.
[0154] This allows for a significant improvement in performance, as the Max Pooling blocks 2204 are not idle waiting for data to be routed through memory first, and the performance penalty of saving outputs into memory is minimized. Max Pooling reduces the size of the outputs delivered by the convolutional layers.
[0155] Each filter core 22 may further comprise a merging unit 2207 configured to merge the results returned by the max pooling blocks 2204-j through multiplexer blocks 2206-j (generally referred to as 2206, while the j-th multiplexer is denoted 2206-j). Each block 2206-j may be associated with a max pooling block 2204 and be configured to select if the associated max pooling block 2204 is used or not.
[0156] The merging unit 2207 may be configured to merge the different rows of the output feature map generated by the K convolutional lanes 2202, after the max pooling operation performed at the respective max pooling blocks 2204.
[0157] Each filter core 22 of a given layerset 2 is configured to deliver an output feature stream corresponding to the merged outputs of the max pooling blocks 2204. These output feature streams of the filter cores may then be combined by the layerset output CIU 25 of the layerset 2.
[0158] Each convolutional lane 2202-j of the filter core 22 thereby generates a row of the output feature map simultaneously, to allow the corresponding Max Pooling block 2204 to start processing the feature map elements as soon as they become available. Indeed, a Max Pooling operation performed by a Max Pooling block 2204 requires receiving two output rows. With the arrangement of the filter cores 22, in the embodiments of the disclosure using max pooling, the two output rows can come out simultaneously so that each Max Pooling block 2204 can start the max pooling operation immediately, as opposed to waiting for the second row.
[0159] It should be noted that although the description of embodiments of the disclosure will be made with reference to the use of Max Pooling blocks 2204 for illustration purposes, the disclosure is not limited to the use of Max Pooling blocks 2204 downstream of the convolutional lanes 2202. Therefore, in some embodiments, a filter core 22 may be implemented without using Max Pooling blocks 2204. In this case, prior storage of the outputs of the convolutional lanes 2202 before transmission to the merging unit 2207 is not required either, and each filter core 22 of a given layerset 2 is adapted to deliver an output feature stream corresponding to the merged outputs of the K convolutional lanes 2202. Further, although the description of some embodiments of the disclosure will be made with reference to a filter core 22 comprising as many max pooling blocks 2204 as the number of convolutional lanes 2202, the skilled person will readily understand that the embodiments of the disclosure where Max Pooling blocks 2204 are used are not limited to the use of as many max pooling blocks 2204 as the number of convolutional lanes 2202 in each filter core 22 (a filter core 22 may therefore comprise a number of max pooling blocks 2204 that is different from the number of convolutional lanes 2202).
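The benefit of receiving two output rows simultaneously can be seen in a small streaming model of a 2x2 max pooling window with a stride of 2. The Python sketch below is a software analogy, not the hardware block: the function name `stream_max_pool_2x2` and the representation of the two rows as a stream of (top, bottom) pairs are assumptions for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

def stream_max_pool_2x2(row_pairs: Iterable[Tuple[int, int]]) -> Iterator[int]:
    """Consume (row0[x], row1[x]) pairs as they arrive; emit one maximum per 2x2 window."""
    pending: Optional[int] = None  # maximum of the first column of the current window
    for top, bottom in row_pairs:
        column_max = max(top, bottom)
        if pending is None:
            pending = column_max            # first column of the window
        else:
            yield max(pending, column_max)  # second column completes the window
            pending = None                  # stride 2: the next window starts fresh

row0 = [1, 5, 2, 0]  # output row produced by one convolutional lane
row1 = [3, 4, 9, 7]  # output row produced by the neighbouring lane
print(list(stream_max_pool_2x2(zip(row0, row1))))
```

Only one value is held between steps, which illustrates why no prior storage of full rows is needed when both rows are streamed in together.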
[0160] The CNN accelerator 1 is therefore capable of applying up to c x n x K filters in one iteration in embodiments using c clusters mapped to the L layersets 2 (for example c x n x K = 4 x 4 x 32 = 512 in the example of figure 1A). This advantageously allows for implementing large convolutional layers within a low number of iterations. Each of the convolutional lanes 2202 of a filter core 22 comprised in a given layerset 2 may be tasked with applying one filter, effectively allowing L x n x K filters to be applied in parallel.
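The capacity figure above can be checked with a few lines of Python. The function names are hypothetical, and the assumption that a layer with more filters than the parallel capacity is processed in a whole number of passes (ceiling division) is an illustrative simplification rather than a statement of the disclosure.

```python
def max_parallel_filters(clusters: int, cores_per_cluster: int, lanes_per_core: int) -> int:
    """Upper bound c x n x K on the number of filters applied in one iteration."""
    return clusters * cores_per_cluster * lanes_per_core

def iterations_needed(num_filters: int, clusters: int, cores_per_cluster: int,
                      lanes_per_core: int) -> int:
    """Assumed number of passes for a layer with num_filters filters (ceiling division)."""
    capacity = max_parallel_filters(clusters, cores_per_cluster, lanes_per_core)
    return -(-num_filters // capacity)

print(max_parallel_filters(4, 4, 32))     # c = 4, n = 4, K = 32
print(iterations_needed(1024, 4, 4, 32))  # a hypothetical 1024-filter layer
```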
[0161] Each filter core 22 may comprise one or more FIFOs (not represented) for receiving the incoming data streams from each convolutional lane 2202-j, the one or more FIFOs being accessed by the corresponding Max Pooling block 2204-j. The use of FIFOs enables the Max Pooling blocks 2204 to start the scanning window process immediately. In one embodiment, each filter core may comprise for example two FIFOs.
[0162] In the following description, it will be considered that the number of FIFOs is equal to the number of rows required for the sliding window. However, additional FIFOs may be mapped per convolutional lane 2202 to pre-buffer incoming data for the MAC core to support the continuous streaming of data. The MAC core is configured to perform multiplication and accumulation operations.
[0163] Figure 5 depicts the ‘look-ahead’ scheme in embodiments where three convolutional layers are merged in one iteration, to obviate the need for intermediary storage in a memory. Figure 5 corresponds to the example represented in figure 1B.
[0164] In the look-ahead scheme of figure 5, two convolutional layers are merged, comprising a first convolutional layer (Layer-1) having a filter size equal to 2x2 and a second convolutional layer having a filter size equal to 2x2. The second layer requires rows 0 and 1 to apply the filter immediately without buffering results. In order to achieve this, the look-ahead scheme according to the embodiments of the disclosure is used on the previous layer by fetching the required data streams from memory and mapping an additional computational lane, so that the two lanes process incoming data in parallel and generate two rows of output feature elements for the second layer. The second layer applies the 2x2 filter on the available elements as they are streamed in from the first layer, without the need to save them into external memory and fetch them for computation.
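The look-ahead idea for two merged 2x2 layers can be sketched in Python as follows. This is a functional analogy under stated assumptions: `conv2x2_row` and `look_ahead_row` are hypothetical names, the convolution is a plain 2x2 cross-correlation without padding, and the two lanes are modelled as generators consumed in lockstep rather than as parallel hardware.

```python
from typing import Iterator, List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[int]]

def conv2x2_row(img: Matrix, w: Matrix, r: int) -> Iterator[int]:
    """One output row of a 2x2 convolution without padding, produced element by element."""
    for c in range(len(img[0]) - 1):
        yield (w[0][0] * img[r][c] + w[0][1] * img[r][c + 1]
               + w[1][0] * img[r + 1][c] + w[1][1] * img[r + 1][c + 1])

def look_ahead_row(img: Matrix, w1: Matrix, w2: Matrix, r: int) -> List[int]:
    """Row r of the second layer, computed from two first-layer lanes running side by side."""
    lane_a = conv2x2_row(img, w1, r)      # first-layer output row r
    lane_b = conv2x2_row(img, w1, r + 1)  # look-ahead lane: first-layer output row r + 1
    prev: Optional[Tuple[int, int]] = None
    out: List[int] = []
    for top, bottom in zip(lane_a, lane_b):
        if prev is not None:  # two columns of both rows are available: apply the 2x2 filter
            out.append(w2[0][0] * prev[0] + w2[0][1] * top
                       + w2[1][0] * prev[1] + w2[1][1] * bottom)
        prev = (top, bottom)  # only one column pair is kept, not the whole feature map
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w1 = [[1, 0], [0, 1]]
w2 = [[1, 1], [1, 1]]
print(look_ahead_row(img, w1, w2, 0))
```

The second layer never sees a stored first-layer feature map: it holds a single pair of values from the two lanes and emits an output as soon as the next pair arrives.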
[0165] In figure 5, the parameter t indicates the clock cycle at which the outputs are calculated.
[0166] In the example of figure 5, P = 2 convolutional layers are merged, meaning that the output of the first layerset 2_1, corresponding to the equivalent computation of the first convolutional layer ‘Layer 1’ of the CNN (in the first iteration), is not written in an external memory but transmitted directly to the second layerset 2_2. The second layerset 2_2 determines, in the second iteration, the equivalent computation of the second convolutional layer ‘Layer 2’ of the CNN, and outputs the final result of the two convolutional layers, which can then be written in the external memory 30. This is enabled by the serialization of the input feature map I. As shown, the serialization of the input feature map I is useful for the first layer, whose input comes from memory. For the following layers, the output rows are streamed in parallel.
[0167] In figure 5, the serialized convolutional patches (streams) are denoted by references 501, 502, 503, ..., 509.
[0168] In the example of figure 5, the streams 501, 502 and 503 are fed to the cores 101, 102 and 103 of the first layerset 2_1 respectively. The convolutional patches are denoted 501-t_i, 502-t_i, and 503-t_i (t_i = t_1 to t_3, with t_1 = 1, t_2 = 2, and t_3 = 3) and correspond to the serializer 3 output in figure 1B, which is a continuous stream of data. The data at time t_i is denoted as 501-t_i, 502-t_i, and 503-t_i.
[0169] 501-t_1, 501-t_2 and 501-t_3 form the sequence of vectors in the same data stream.
[0170] In contrast, in the prior art implementations, two convolutional layers are executed one after another in series, and the intermediate results need to be written back to a memory. This requires two iterations, with a first iteration taking 8 clock cycles and the second iteration taking 4 clock cycles.
[0171] The accelerator 1 according to the embodiments of the disclosure therefore results in an improvement in performance while providing a more flexible design space for latency-versus-resources trade-offs.
[0172] The accelerator 1 can be easily implemented and integrated while reducing off-chip memory access by merging layers. Further, it improves throughput and reduces latency, while supporting a continuous streaming interface.
[0173] Figure 6 represents the processing of a convolutional lane 2202, according to an exemplary embodiment where each convolutional lane 2202 is configured for an 8-bit feature map and 1, 2 or 4-bit weights, while figure 7 represents the processing of a convolutional lane 2202 according to an exemplary embodiment where each convolutional lane 2202 is configured for an 8-bit feature map and 8-bit weights. The same data path may be used for different weight widths, depending on the application.
[0174] Each convolutional lane 2202 is configured to receive an input feature map stream, denoted F[.], and a weight stream corresponding to the subset of weights received by the convolutional lane, denoted Wg[.], received from the weight distributor 2201.
[0175] The index i used in this example is independent from the index i used to refer to the layersets 2.
[0176] In the example of figures 6 and 7, the input stream F comprises F[i], ..., F[3], F[2], F[1] and F[0] (comprising 8 bits in the example of figure 6), while the weight stream WGHT comprises Wg[0], Wg[1], Wg[2], ..., Wg[i] (comprising 4 bits in the example of figure 7), with i ranging from 1 to N (N being equal to the total filter length in this diagram, i.e. N = f_h x f_w x C).
[0177] As shown in figures 6 and 7, each convolutional lane 2202 may comprise an inner product block 70 (also called ‘multiplier’) configured to perform the inner product of the input feature map stream F, received by the filter core 22, and the weight stream Wg received from the weight distributor 2201.
[0178] Each convolutional lane 2202 may further comprise an adder 72 configured to perform the addition of the result returned by the inner product block 71 and of the previous output of the adder 72.
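The multiply-then-accumulate behaviour of a convolutional lane can be modelled by a short Python generator. This is a behavioural sketch only: the function name `convolutional_lane` is hypothetical, and the assumption that the accumulator is emitted and cleared after every N = f_h x f_w x C products (one filter length) is made for illustration.

```python
from typing import Iterable, Iterator

def convolutional_lane(features: Iterable[int], weights: Iterable[int], n: int) -> Iterator[int]:
    """Yield one accumulated output for every n (feature, weight) pairs received."""
    acc = 0
    count = 0
    for f, w in zip(features, weights):
        acc += f * w      # product of F[i] and Wg[i], added to the previous adder output
        count += 1
        if count == n:    # assumed: one output per complete filter length
            yield acc
            acc, count = 0, 0

feature_stream = [1, 2, 3, 4, 5, 6]  # two serialized vectors of length 3
weight_stream = [1, 1, 1, 2, 2, 2]   # the matching weights, streamed continuously
print(list(convolutional_lane(feature_stream, weight_stream, n=3)))
```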
[0179] The input feature map stream F may be then redirected to a suitable convolutional lane 2202 according to a chosen bit precision. In some embodiments, a convolutional lane 2202 may support a number of bit precisions, such as 1 / 2 / 4, for the weights Wg and use an input feature map stream comprising a number of bits.
[0180] In the example of figure 6, a convolutional lane 2202 supports three bit precisions corresponding to 1, 2 or 4 bit precisions for the weights while it uses a number of bits equal to 8 bits for the input feature map stream F.
[0181] A bit precision equal to A for the weights indicates that they comprise A bits that can represent 2^A values.
[0182] In the example of figure 7, a convolutional lane 2202 supports a bit precision of 8 bits for the weights, while it uses a number of bits equal to 8 bits for the input feature map stream F.
[0183] Therefore, the CNN convolutional layers according to the embodiments of the disclosure can be advantageously merged for continuous processing.
[0184] The serializer 3 may be arranged after the input memory 10 and before the main configurable interconnection unit 4. This ensures that the serialized data is efficiently distributed to the layersets.
[0185] The distribution of the serialized 1D vectors determined by the serializer 3 is managed by the configurable interconnection units 4, 24 and 25 to ensure that each filter core 22 in an interconnected layerset 2 receives the appropriate data stream.
[0186] The architecture of the CNN accelerator 1 is pipelined to ensure continuous data flow and maximize throughput. It comprises several pipeline stages, including the loading of data from memory, the serialization performed by the serializer 3, the data broadcasting performed by the configurable interconnection units 4, 24 and 25, and the convolution operations performed by the filter cores 22.
[0187] The embodiments of the disclosure provide a hardware implementation of convolutional neural networks (implementable on semiconductors) that is particularly suitable for edge applications. The architecture of the accelerator 1 allows for an ultra-low-power yet high-performance deployment of CNNs on-chip, easy prototyping and implementation of deep learning models on hardware. It further provides an easy way to deploy quantized models with low precision weights and features without a significant loss in accuracy. The architecture of the accelerator 1 also allows for the efficient deployment of CNNs. Further, it makes it easy to customize and reconfigure the design to deploy any CNN models with the supported layers.
[0188] Advantageously, the CNN accelerator 1 according to the embodiments of the disclosure enables a reduction of external memory accesses by merging different layers without requiring writing intermediate data or results to an external memory (e.g. DRAM).
[0189] It further reduces latency through the serialization of the input feature map performed by the serializer 3 that can stream data from an external memory 10 and process the data on the fly.
[0190] The CNN accelerator 1 according to the embodiments of the disclosure also provides an energy consumption reduction. In particular, a customized bitwidth can be used for each application or dataset, and the same circuit can be configured for various weight bitwidths (for example 1, 2, 4 or 8 bits) and a feature bitwidth (for example an 8-bit feature operation) without performance penalty. Further, in embodiments using sparsity encoding / decoding, the sparsity can be leveraged in deep layers of the CNN to avoid any computation involving zero.
[0191] The energy cost of hardware operations varies significantly, so a balance must be struck between the accuracy of the model and the resource bill associated with it. The proposed accelerator supports low-precision fixed-point operations and parameters, allowing for the deployment of quantized models, with 1, 2, 4 and 8-bit fixed-point representations supported for the parameters and 8 bits for the input data.
[0192] In some embodiments, the CNN accelerator 1 may comprise an encoder (not represented) configured to compress the input feature map loaded from memory via the input memory 10 into an input buffer. The serializer 3 may then receive the compressed input feature map processed by the encoder to transform it into the serialized 1D vectors. In such embodiments, the CNN accelerator 1 may also comprise a decoder (not represented) configured to decompress the data broadcast to the convolutional lanes, before they enter the convolutional lanes.
[0193] In some embodiments, the encoder may be a sparsity encoder and the decoder may be a sparsity decoder. In such embodiments, the sparsity encoder may be configured to insert sparsity information into the compressed data stream to enable the filter cores 22 to process only non-zero elements, while skipping zeros. The sparsity decoder may extract the sparsity information during decoding so that the filter cores 22 may use the extracted sparsity information to process only non-zero elements while skipping the zero elements.
[0194] The compressed data may be decoded when it is fetched back from memory.
[0195] The CNN accelerator 1 may use control logic (not represented) configured to direct the data flow based on the compressed format from the sparsity encoder. The control logic may be implemented in the form of a finite state machine (FSM) configured to control reading of the compressed data and writing of output data.
[0196] The CNN accelerator may use a computation unit (not represented) configured to perform the convolution operations of the convolutional lanes 2202. The computation unit may be implemented using for example Digital Signal Processing (DSP) blocks available on the FPGA, if the CNN accelerator 1 is implemented on FPGA.
[0197] The CNN accelerator may use the Digital Signal Processing (DSP) blocks for the multiply-accumulate (MAC) operations of the MAC core. Each DSP block may handle for example a number of filters 22 (for example 2 filters), each performing a number of operations.
[0198] In embodiments using a sparsity encoder / decoder, the control logic may ensure that MAC operations are performed only for non-zero elements, while skipping flagged MAC operations for zero runs.
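Zero skipping can be illustrated with a toy encoding. The disclosure does not specify the compressed format, so the (index, value) pair encoding below, together with the names `sparsity_encode` and `sparse_mac`, is purely an assumption chosen to keep the sketch short.

```python
from typing import List, Sequence, Tuple

def sparsity_encode(stream: Sequence[int]) -> List[Tuple[int, int]]:
    """Hypothetical encoding: keep only (index, value) pairs of the non-zero elements."""
    return [(i, v) for i, v in enumerate(stream) if v != 0]

def sparse_mac(encoded: Sequence[Tuple[int, int]], weights: Sequence[int]) -> Tuple[int, int]:
    """Accumulate only over non-zero elements; also report how many MACs were performed."""
    acc = 0
    macs = 0
    for index, value in encoded:
        acc += value * weights[index]  # zero elements never reach the multiplier
        macs += 1
    return acc, macs

features = [0, 3, 0, 0, 2, 0]
weights = [1, 2, 3, 4, 5, 6]
result, macs = sparse_mac(sparsity_encode(features), weights)
print(result, macs)
```

In this example the dense inner product would need six MAC operations, whereas the sparse path performs two and yields the same result.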
[0199] Figure 8 depicts the inputs and outputs of each filter core 22.
[0200] The input feature stream applied to a filter core 22 is a data bus having a given width W_core, such as for example an 8-bit width if an 8-bit quantization is used. A filter core 22 implements K convolutional filters, which provide K convolutional filter outputs coming out of the filter core 22 in parallel (out_0, out_1, ..., out_{K-1}), these K convolutional filter outputs forming an output data bus of width W_core x K. This data bus of width W_core x K (for example 8K) may then be transformed back to a W_core-bit data bus (for example an 8-bit data bus) using a width converter before transmitting it to other cores 22.
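The role of the width converter can be sketched with integer bit operations in Python. The function names `pack_outputs` and `width_convert`, the placement of the first output in the least significant bits, and the truncation of each output to W_core bits are assumptions for illustration; the disclosure does not fix these details.

```python
from typing import List, Sequence

def pack_outputs(outputs: Sequence[int], w_core: int = 8) -> int:
    """Pack parallel outputs of w_core bits each into one wide bus word (first output in the low bits)."""
    mask = (1 << w_core) - 1
    word = 0
    for k, value in enumerate(outputs):
        word |= (value & mask) << (k * w_core)
    return word

def width_convert(word: int, k: int, w_core: int = 8) -> List[int]:
    """Turn a (w_core * k)-bit word back into k successive w_core-bit words."""
    mask = (1 << w_core) - 1
    return [(word >> (i * w_core)) & mask for i in range(k)]

wide = pack_outputs([0x12, 0x34, 0x56])  # three parallel 8-bit outputs
print(hex(wide))
print([hex(v) for v in width_convert(wide, k=3)])
```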
[0201] According to the embodiments of the disclosure, a series of 1D vectors Fi, each of size Nv, such as
[0202] $F_i = \langle f_{i,0}, f_{i,1}, \dots, f_{i,N_v-1} \rangle$
[0203] may be generated by the serializer 3 from the input feature map I (corresponding for example to an image or an audio input) and applied to the input of each filter core 22.
[0204] For example, considering an input feature map corresponding to an image of width W and height H, Nv = fh × fw × fc, where fh and fw are the sizes of the convolutional filters (e.g. fh = 3, fw = 3), fh being the filter height, fw the filter width, and fc the number of filter channels (corresponding to C for standard convolution). The total number of vectors in this case is equal to H × W.
[0205] For example, considering fh = 3, fw = 3, fc = 3, the size of each vector is equal to Nv = 27. The first vector consists of the following elements cHWC, the notation cHWC denoting channel C at spatial coordinate < H, W >:
[0206] < c000, c001, c002, c010, c011, c012, c020, c021, c022, c100, c101, c102, c110, c111, c112, c120, c121, c122, c200, c201, c202, c210, c211, c212, c220, c221, c222>
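The ordering of the elements listed above can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the patch is anchored at its top-left coordinate, and the function name `patch_element_labels` is hypothetical. The labels are only unambiguous for single-digit indices.

```python
def patch_element_labels(p, q, fh, fw, fc):
    """Labels c<h><w><c> of the patch anchored at <p, q>, channel index varying fastest."""
    return [
        f"c{p + dh}{q + dw}{ch}"
        for dh in range(fh)
        for dw in range(fw)
        for ch in range(fc)
    ]


# First vector for fh = fw = fc = 3: 27 elements, from c000 to c222.
first_vector = patch_element_labels(0, 0, fh=3, fw=3, fc=3)
```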
[0207] According to another example, considering an input feature map corresponding to an audio sample of size Na, there will be Na vectors Fi, each of size Nv, where Nv = fl × fc, fl being the length of the audio filter.
[0208] Both inputs and outputs of a filter core 22 are serial in time according to a cycle t.
[0209] Figure 9 depicts the streaming operation corresponding to the presentation of vectors Ft (input feature stream) by the serializer 3 to a filter core 22, and the output feature stream obtained at the output of this filter core 22 at each clock cycle t:
[0210] - At cycle t = 0, the first element f0,0 of the first vector F0 is presented at the input of the filter core 22;
[0211] - At cycle t = 1, the second element f0,1 of the first vector F0 is presented at the input of the filter core 22, and so on for cycle t = 2 to cycle t = Nv - 2;
[0212] - At cycle t = Nv - 1, the last element f0,Nv-1 of the first vector F0 is presented at the input of the filter core 22. At cycle t = Nv, the first element f1,0 of the second vector F1 is presented at the input of the filter core 22, while the filter core 22 generates and outputs the first element OUT0 = (OUT0,0, OUT0,1, ..., OUT0,K-1) of the K filter outputs such that:
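[0213] $\mathrm{OUT}_{0,0} = \sum_{i=0}^{N_v-1} f_{0,i} \times W_{0,i}$,
[0214] $\mathrm{OUT}_{0,1} = \sum_{i=0}^{N_v-1} f_{0,i} \times W_{1,i}$, and so on until
[0215] $\mathrm{OUT}_{0,K-1} = \sum_{i=0}^{N_v-1} f_{0,i} \times W_{K-1,i}$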
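[0216] At cycle t = 2 * Nv, the filter core 22 generates and outputs the second element OUT1 = (OUT1,0, OUT1,1, ..., OUT1,K-1) of the K filter outputs such that:
[0217] $\mathrm{OUT}_{1,0} = \sum_{i=0}^{N_v-1} f_{1,i} \times W_{0,i}$,
[0218] $\mathrm{OUT}_{1,1} = \sum_{i=0}^{N_v-1} f_{1,i} \times W_{1,i}$, and so on until
[0219] $\mathrm{OUT}_{1,K-1} = \sum_{i=0}^{N_v-1} f_{1,i} \times W_{K-1,i}$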
[0220] The filter core 22 proceeds similarly until cycle t = (p + 1) * Nv, where the filter core 22 generates and outputs the element OUTp = (OUTp,0, OUTp,1, ..., OUTp,K-1) of the K filter outputs such that:
[0221] $\mathrm{OUT}_{p,0} = \sum_{i=0}^{N_v-1} f_{p,i} \times W_{0,i}$,
[0222] $\mathrm{OUT}_{p,1} = \sum_{i=0}^{N_v-1} f_{p,i} \times W_{1,i}$, and so on until
[0225] $\mathrm{OUT}_{p,K-1} = \sum_{i=0}^{N_v-1} f_{p,i} \times W_{K-1,i}$
[0226] The parameter p is comprised between 0 and the maximum number of vectors Fi (for example H x W for an input feature map corresponding to an image).
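The streaming behaviour of a filter core can be modelled with a small cycle-counting sketch. It is a software illustration under simplifying assumptions (one element consumed per cycle, accumulators reset after each vector); the name `filter_core_stream` is hypothetical. Consistent with OUT0 being available at t = Nv and OUT1 at t = 2 * Nv, the outputs for vector index p are produced at cycle (p + 1) * Nv.

```python
def filter_core_stream(vectors, weights):
    """Simulate one filter core fed one element per clock cycle.

    vectors: list of 1D vectors F_p, each of length Nv.
    weights: K weight vectors W_j, each of length Nv.
    Yields (cycle, outputs): the K dot products of F_p, available once the
    last element of F_p has been consumed, i.e. at cycle (p + 1) * Nv.
    """
    acc = [0] * len(weights)
    t = 0
    for vec in vectors:
        for i, f in enumerate(vec):
            # The K multiply-accumulates run in parallel in hardware.
            for j, w in enumerate(weights):
                acc[j] += f * w[i]
            t += 1
        yield t, tuple(acc)
        acc = [0] * len(weights)


vectors = [[1, 2, 3], [4, 5, 6]]   # two vectors, Nv = 3
weights = [[1, 0, 0], [1, 1, 1]]   # K = 2 filters
events = list(filter_core_stream(vectors, weights))
```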
[0227] Figure 10 represents another exemplary implementation of the layersets 2 to merge P = 4 convolutional layers of the CNN.
[0228] In the example of figure 10, four interconnected layersets 2_i are used (L = 4): the first layerset 2_1 (comprising n1 = 4 filter cores 22) being adapted to perform the equivalent convolutional operation of the first CNN convolutional layer L1, the second layerset 2_2 (comprising n2 = 3 filter cores 22) being adapted to perform the equivalent convolutional operation of the second CNN convolutional layer L2, the third layerset 2_3 (comprising n3 = 2 filter cores 22) being adapted to perform the equivalent convolutional operation of the third CNN convolutional layer L3, and the fourth layerset 2_4 (comprising n4 = 1 filter core 22) being adapted to perform the equivalent convolutional operation of the fourth CNN convolutional layer L4.
[0229] Turning back to figure 10, the output of the first layerset 2_1 forms the input stream applied to the second layerset 2_2, the output of the second layerset 2_2 forms the input stream applied to the third layerset 2_3, and the output of the third layerset 2_3 forms the input stream applied to the fourth layerset 2_4. The output of the fourth layerset 2_4 constitutes the output feature map determined by the accelerator 1.
[0231] More generally, to merge P CNN convolutional layers with a filter size equal to m x m, N + m - 1 lines (i.e. rows) of the input feature stream I may be processed simultaneously, so that P + m - 1 cores may be used. To merge P CNN convolutional layers, P stages (i.e. P layersets) may be needed, with each layerset 2 comprising (P - k + m - 1) cores 22 for the k -th stage (i.e. layerset).
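As a quick numerical check of the per-stage core count (P - k + m - 1), the following sketch evaluates it for each stage. The function name `cores_per_layerset` is hypothetical. With P = 4 and m = 2 it gives 4, 3, 2 and 1 cores, that is, 10 filter cores in total, which matches the example of figure 10.

```python
def cores_per_layerset(p_layers, m):
    """Number of filter cores in the k-th layerset, k = 1..P: (P - k + m - 1)."""
    return [p_layers - k + m - 1 for k in range(1, p_layers + 1)]


cores = cores_per_layerset(p_layers=4, m=2)
total = sum(cores)
```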
[0232] Merging P CNN convolutional layers (i.e. computing the P convolutional layers on chip without writing the intermediate results to an external memory) saves significant power.
[0233] Figure 10 illustrates the Look-Ahead Convolution according to the embodiments of the disclosure, adapted to merge the 4 CNN convolutional layers L1, L2, L3 and L4 using the 4 layersets 2_1, 2_2, 2_3 and 2_4 (four stages), according to an example using convolutional layers with 2x2 filters and 10 filter cores 22.
[0234] In step 800, the input feature map I is accessed through the external memory 10, such as for example through DMA if the external memory is a DDR memory.
[0235] In step 801, the input feature map is serialized into a series of 1D vectors Fi, and the serialized vectors are presented to the 4 filter cores 22 of the first layerset 2_1 in a first stage, corresponding to steps 802, 803, 804 and 805, to perform in parallel the convolutional operation corresponding to the first convolutional layer L1. The outputs of the filter cores 22 of the first layerset 2_1 are combined by the output CIU 25 of the first layerset 2_1 (step 806) and directly sent (i.e. without intermediary storage) to the second layerset 2_2. The received output of the first layerset 2_1 is processed in parallel by the three filter cores 22 of the second layerset 2_2, in a second stage, corresponding to steps 807, 808 and 809, which perform the equivalent convolutional operation corresponding to the second CNN convolutional layer L2. The outputs of the filter cores 22 of the second layerset 2_2 are combined by the output CIU 25 of the second layerset 2_2 (step 810) and directly sent (i.e. without intermediary storage) to the third layerset 2_3. Similarly, the received output of the second layerset 2_2 is processed in parallel by the two filter cores 22 of the third layerset 2_3, in a third stage, corresponding to steps 811 and 812, which perform the equivalent convolutional operation corresponding to the third CNN convolutional layer L3. The outputs of the filter cores 22 of the third layerset 2_3 are combined by the output CIU 25 of the third layerset 2_3 (step 813) and directly sent (i.e. without intermediary storage) to the fourth layerset 2_4, which similarly performs, in a fourth stage, the equivalent convolutional operation corresponding to the fourth CNN convolutional layer L4 using one filter core 22 in step 814. The output of the fourth layerset 2_4 provides the output feature map determined by the accelerator.
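The principle of passing each stage's output directly to the next stage, without intermediate storage, can be illustrated with chained generators. This is a deliberately simplified 1D sketch, not the hardware design: each layerset is reduced to a single sliding dot product, and the names `layerset` and `merged_layers` are hypothetical.

```python
from collections import deque


def layerset(stream, weights):
    """Streaming 1D convolution stage: emits an output as soon as its window is full."""
    window = deque(maxlen=len(weights))
    for x in stream:
        window.append(x)
        if len(window) == len(weights):
            yield sum(w * v for w, v in zip(weights, window))


def merged_layers(stream, all_weights):
    """Chain the stages; only a small window per stage is ever held."""
    for weights in all_weights:
        stream = layerset(stream, weights)
    return stream


# Four merged stages, each with a 2-tap filter of ones (illustrative weights).
out = list(merged_layers(range(8), [[1, 1]] * 4))
```

Each stage starts producing values as soon as its first window is available, so the four stages overlap in time instead of running one after the other over a stored intermediate feature map.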
[0236] Figure 11 is a diagram representing an exemplary implementation of the accelerator 1 according to the pipelined process illustrated by figure 10. Figure 11 shows the interconnection between the four layersets 2_1, 2_2, 2_3 and 2_4 using the associated CIUs 24 and 25. The four layersets 2_1, 2_2, 2_3 and 2_4 may be implemented in three clusters denoted ‘cluster 1’, ‘cluster 2’ and ‘cluster 3’, each comprising n = 4 filter cores 22. The first layerset 2_1 is implemented in a single cluster ‘cluster 1’. The second layerset 2_2 is implemented in a single cluster ‘cluster 2’, which comprises one unused core. The third and fourth layersets 2_3 and 2_4 are implemented inside the third cluster ‘cluster 3’, which comprises one unused core.
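The mapping of layersets onto clusters described for figure 11 can be reproduced with a simple placement sketch. The placement policy below (fill the current cluster in pipeline order, open a new cluster when the next layerset does not fit) is an assumption made for illustration; the disclosure does not state how the allocation is derived. The function name `map_layersets_to_clusters` is hypothetical.

```python
def map_layersets_to_clusters(layerset_sizes, cores_per_cluster):
    """Place layersets, in pipeline order, into fixed-size clusters.

    Assumed policy (not from the source): a layerset goes into the current
    cluster if enough cores remain, otherwise a new cluster is opened.
    Returns the layerset indices per cluster and the unused cores per cluster.
    """
    clusters = [[]]
    free = [cores_per_cluster]
    for idx, size in enumerate(layerset_sizes, start=1):
        if size > cores_per_cluster:
            raise ValueError("layerset does not fit in a single cluster")
        if size > free[-1]:
            clusters.append([])
            free.append(cores_per_cluster)
        clusters[-1].append(idx)
        free[-1] -= size
    return clusters, free


clusters, unused = map_layersets_to_clusters([4, 3, 2, 1], cores_per_cluster=4)
```

For layersets of 4, 3, 2 and 1 cores and clusters of 4 cores, this yields three clusters holding layersets [1], [2] and [3, 4], with 0, 1 and 1 unused cores respectively.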
[0237] Figure 12 depicts the method implemented by the accelerator 1 to determine an output feature map in response to a received input feature map, according to some embodiments.
[0238] In step 900, the input feature map may be loaded from the external memory 10, for example via DMA, into an input buffer.
[0239] In step 901, sparsity encoding may be performed to compress the input feature map using a sparsity encoder block.
[0240] In step 902, the compressed input feature map is serialized into a set of 1D vectors.
[0241] In step 903, the serialized data generated in step 902 may be broadcast to the pipelined filter cores 22, using the configurable interconnection units 4, 24, 25. In each iteration, a set of filter cores 22 comprised in a layerset 2 may be used to perform the equivalent convolutional operation of a given CNN layer, while the output of this iteration is not stored and is directly sent to an interconnected layerset 2, operating in parallel, which is adapted to perform the equivalent convolutional operation of another CNN layer, which enables merging layers of the CNN.
[0242] More specifically, in each filter core 22 of a layerset 2:
[0243] In step 904, before entering the convolutional lanes 2202 in each filter core 22 of a layerset 2, the data stream may be decompressed by a sparsity decoder block.
[0244] In step 905, the convolutional lanes process the non-zero elements, while skipping zeros as the sparsity information directs.
[0245] In step 906, max pooling may be applied to the output of each convolutional lane 2202, at the associated MaxPool block 2204, for real-time processing without storing intermediate results.
[0246] In step 907, the outputs from the max pooling blocks 2204 may be merged, the merged output providing the filter core output. In step 908, the outputs from the filter cores 22 of the considered layerset 2 may be combined, which provides the layerset output.
[0247] In step 909, the outputs from the layersets 2 are combined and the resulting combined output sent back to memory via DMA 30, which provides the output feature map.
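The max pooling of step 906, applied on the fly without storing intermediate results, can be sketched as a streaming reduction. The example is simplified to a 1D, non-overlapping window, which is an assumption made for illustration; the name `streaming_max_pool` is hypothetical.

```python
def streaming_max_pool(stream, window):
    """Emit the maximum of each group of `window` consecutive values.

    Only a running maximum and a counter are kept, so the convolutional
    outputs themselves are never stored.
    """
    current, count = None, 0
    for value in stream:
        current = value if current is None else max(current, value)
        count += 1
        if count == window:
            yield current
            current, count = None, 0


pooled = list(streaming_max_pool([1, 5, 2, 4, 3, 9], window=2))
```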
[0248] The CNN accelerator 1 according to embodiments of the disclosure may be used in various edge deep learning applications, for example in preventive maintenance (e.g. to control battery replacement) or in precision agriculture. The embodiments of the disclosure may also be applied in other application fields such as, for example, industrial inspection, road infrastructure, video surveillance, domestic robotics, industrial robotics, automated checkout, infrastructure inspection, military applications, satellite imaging, Internet of Things, voice assistance, etc.
[0249] Although some aspects of the disclosure are herein described jointly, the skilled person will readily understand that some of these aspects of the disclosure may be used separately and independently, such as for example the serializer 3. Indeed, the skilled person will readily understand that the serializer 3 may have a separate interest for use in any convolutional filtering system not limited to the accelerator 1, as described in relation with figures 13 to 19. Similarly, the skilled person will readily understand that the accelerator 1 is not limited to the use of the implementation of the serializer 3 described in relation with figures 13 to 19, and may use any other suitable implementation of a serializer 3.
[0250] The serializer 3 may be configured to receive at its input a Q-Dimensional tensor I corresponding to the input feature map that is to be serialized by the serializer 3.
[0251] The following description of some embodiments will be made with reference to a 3-D tensor I having three dimensions denoted < H, W, C >, where H denotes the height of the input feature map, W the width of the input feature map and C the number of channels of each element (for example each pixel if the input feature map is an image) of the input feature map. The input feature map may be for example an image or an audio file.
[0252] An element I[i'][j'][k'] of the original 3D tensor I is defined by three indexes i', j', k'. In the example of an input feature map being an image, an element I[i'][j'][k'] is an image pixel.
[0253] The serializer 3 is configured to extract a number of convolutional patches from the input tensor I, and to output the extracted convolutional patches in the form of a series of one-dimensional (1D) output vectors Fi, in a serial fashion, each extracted 1D vector corresponding to a convolutional patch.
[0254] Figure 13 represents a convolutional filtering system 100 in which the serializer 3 may be generally used. The convolutional filtering system 100 may be for example the CNN accelerator 1. However, although the serializer 3 has particular advantages for use in a CNN based accelerator 1, the serializer 3 may be used more generally in any convolutional filtering system.
[0255] The convolutional filtering system 100 comprises a convolutional filtering device 90 configured to filter the input feature map, represented by the 3D tensor I, after the serialization of the input feature map I by the serializer 3 into a series of 1D vectors Fi.
[0256] The convolutional filtering device 90 may be for example constituted by a set of layersets 2, each comprising a number of filter cores, such as filter cores 22, adapted to apply convolutional filters.
[0257] More generally, the convolutional filtering device 90 is configured to implement convolutional filters, using dot product computing.
[0258] The convolutional patches extracted by the serializer 3 in the form of a series of 1D vectors Fi may be fed to the convolutional filtering device 90, which is adapted to perform dot product operations using the extracted convolutional patches to implement convolutional filters.
[0259] Each convolutional patch extracted by the serializer 3 is defined by a position < p,q > in the input tensor I and has a size defined by two size parameters fh and fw.
[0260] A convolutional patch at position < p,q >, of size fh and fw, is defined as:
[0261]
[0262] The serializer 3 may be advantageously configured to extract several convolutional patches (more than 2) in parallel, without blocking the input feature map write process. By obviating the need for any blocking, the serializer 3 avoids adding delays in the further processing. In prior art serializers, a blocking might occur when using ports of a dual port memory simultaneously, and such ports read / write to the same address resulting in a conflict. The serializer 3 is advantageously configured to avoid any such conflicts.
[0263] The serializer 3 may implement a streaming mode among two streaming modes depending on the input feature map streaming.
[0264] In a first streaming mode (serial streaming mode), the 3D input tensor I may be serially streamed to the serializer 3 in a row major order. This means that the input feature map is streamed according to a row major order by presenting one element of the input feature map I (for example one pixel for an image input feature map) at a time. The following table ‘Table 1’ illustrates the first streaming mode, according to an example using an input feature map being an image. In this example, an input feature map I element is serially presented at each instant time t:
[0265] Instant time | Pixel presented in the row major order
[0266] t = 0 | I[0][0][0]
[0267] t = 1 | I[0][0][1]
[0268] t = C - 1 | I[0][0][C - 1]
[0269] t = 2 * C - 1 | I[0][1][C - 1]
[0270] t = 3 * C - 1 | I[0][2][C - 1]
[0271] t = (n + 1) * C - 1 | I[0][n][C - 1], assuming n < W
[0272] t = (W + 1) * C - 1 | I[1][0][C - 1]
[0274] Table 1
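The row major, channel-fastest ordering illustrated by the first rows of Table 1 (t = 0, t = 1, t = C - 1, t = 2 * C - 1) corresponds to a simple index mapping, sketched below. The function names `stream_time` and `element_at` are hypothetical, and the mapping assumes that exactly one element is presented per instant, with no gaps between rows.

```python
def stream_time(i, j, k, width, channels):
    """Instant at which element I[i][j][k] is presented (row major, channel fastest)."""
    return (i * width + j) * channels + k


def element_at(t, width, channels):
    """Inverse mapping: indices (i, j, k) of the element presented at instant t."""
    pixel, k = divmod(t, channels)
    i, j = divmod(pixel, width)
    return i, j, k


W, C = 5, 3  # illustrative width and channel count
t_last_channel_first_pixel = stream_time(0, 0, C - 1, W, C)   # C - 1
t_last_channel_second_pixel = stream_time(0, 1, C - 1, W, C)  # 2 * C - 1
```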
[0275] In a second streaming mode (parallel streaming mode), the 3D input tensor I may be streamed to the serializer 3 in parallel, in a column major order. In this second streaming mode embodiment, K elements (for example K pixels) of the input feature map may be presented at the input of the serializer 3, according to a column major order, as illustrated by the following example where the input feature map is an image:
[0276] Instant time | Pixels presented in the column major order
[0277] t = 0 | I[0][0][0], I[0][1][0], ..., I[0][K - 1][0]
[0278] t = 1 | I[0][0][1], I[0][1][1], ..., I[0][K - 1][1]
[0279] ...
[0280] Table 2
[0281] Figure 14 depicts the structure of the serializer 3, according to some embodiments.
[0282] The serializer 3 may comprise a memory device 300 comprising at least three ports, one of the ports being a writing port and at least two other ports being reading ports.
[0283] The memory device 300 may comprise a plurality of memory banks, each memory bank comprising a plurality of addressable memory cells. Data words may be written to or read from the memory banks of the memory device 300.
[0284] The memory device 300 may be configured to store N data words in addressed locations, each word comprising B bits (bit width). A writing port of the memory device 300 is configured to write data streamed to the serializer 3 to the memory device 300 (one word may be written per clock cycle), while a reading port of the memory device 300, for example 302, may be used to read data from memory. The dual-port memory operates according to a clock signal oscillating between a high and a low state at a constant frequency. Therefore, read and write operations may occur in a clock cycle (i.e. the time interval between rising edges of a repetitive clock signal).
[0285] In some embodiments, the memory device 300 may be a dual-port memory assembly 300 comprising at least one dual port memory 30, each dual port memory comprising at least two ports.
[0286] Figure 15 represents an exemplary implementation of the memory device 300 of the serializer 3 consisting of a single dual-port memory 30.
[0287] The single dual port memory 30 comprises two input ports 301 and 302. One of the two ports (referred to as the writing port), for example port 301, is used to write data to the dual port memory 30 (one word per clock cycle), while the other port (referred to as the reading port), for example 302, may be used to read data from the memory.
[0288] In figure 15, DATAIN[V / -1;0] corresponds to a write operation at the writing port while DATAOUT[V / -1;0] corresponds to a read operation at the reading port. ADDR corresponds to the address at which the data is written or read.
[0289] In some embodiments, the memory device 300 may be implemented using at least a dual memory assembly comprising more than two input ports. In such embodiments, the dual-port memory assembly 300 may comprise at least two combined dual port memories 30.
[0290] For example, the dual-memory assembly 300 may comprise one write port and two read ports, as illustrated in figure 16.
[0291] More generally, the dual-port memory assembly 300 may be formed by combining R dual port memories 30, R denoting the number of dual port memories used to implement the serializer memory, the memory device 300 then having 1 to R write ports and 1 to R read ports. For example, in figure 16, the dual-memory assembly 300 comprises two combined dual port memories 30-1 and 30-2 corresponding to two write ports and two read ports.
[0292] Figure 17 depicts a single convolutional patch extraction that may be performed by the serializer 3 from the input feature map I < H, W, C > using a dual-port memory assembly 300 corresponding to a single dual-port memory. For example, to extract a convolutional patch of size parameters fh = 3 and fw = 3, the memory device 300 used by the serializer 3 may be filled with 3 rows of the input feature map I, for example from row l to l + 3. Each row of the input tensor I is of size W x C. The 3x3 convolutional patches at positions < l, 0 >, < l, 1 > up to < l, W - 1 > may be read from a second port of the memory device 300. To avoid any conflict, the serializer 3 may be configured to wait for rows K to K + 3 (K : K + 3) to be written before starting to read the convolutional patches. Similarly, the write port has to wait until the read port has finished reading convolutional patches. More generally, to extract a convolutional patch of size parameters fh and fw, the memory device 300 used by the serializer 3 may be filled with fw rows of the input feature map I, for example from row l to l + fw. Each row of the input tensor I is of size W x C. The fw x fh convolutional patches at positions < l, 0 >, < l, 1 > up to < l, W - 1 > may be read from a second port of the memory device 300. To avoid any conflict, the serializer 3 may be configured to wait for rows K to K + fw (K : K + fw) to be written before starting to read the convolutional patches. Similarly, the write port has to wait until the read port has finished reading convolutional patches.
[0293] The extracted patches may then be fed to the convolutional filtering device 90 implementing convolutional filters, as represented in figure 13. The serialized 1D vectors delivered by the serializer 3 are particularly suitable for enabling parallel processing by the convolutional filtering device 90 (for example by a plurality of processing layersets 2). In an application of the serializer 3 to a data filtering device implementing convolutional filters using a convolutional neural network, the convolutional filtering device 90 may receive the weights of the convolutional neural network to perform convolutional filtering using parallel processing.
[0294] Figure 18 illustrates multiple patch extraction, according to some embodiments. In such embodiments, the serializer 3 is adapted to extract at least two patches in parallel. In the specific example of figure 18, the serializer 3 is adapted to extract two patches in parallel, one starting at even coordinates patch(2x, y) and another starting at odd coordinates patch(2x + 1, y). In this example, the serializer 3 is implemented using a dual-port memory assembly 300 constituted of 6 combined dual port memory banks 30 so that the dual-port memory assembly 300 forms a memory comprising one write port and two read ports. The incoming feature map I < H, W, C > may be written through the single write port. The two read ports may be used to extract the convolutional patches simultaneously. To avoid conflicts, the dual-port memory assembly 300 may be configured so that the same memory bank is never read or written to at the same clock cycle.
[0295] The write / read operations occur in 3 phases as described in Table 3:
[0296] Operation | Phase 0 | Phase 1 | Phase 2
[0297] Write | 4,5 | 0,1 | 2,3
[0298] Read port A | 0,1,2 | 2,3,4 | 4,5,0
[0299] Read port B | 1,2,3 | 3,4,5 | 5,0,1
[0300] Table 3
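The schedule of Table 3 can be checked programmatically. The sketch below expresses the table as a modular rotation over the 6 banks and verifies that no bank is written while it is being read in the same phase. The closed-form rotation and the names `schedule` and `has_conflict` are assumptions introduced for illustration; only the table values come from the description.

```python
N_BANKS = 6


def schedule(phase):
    """Banks written and read in a given phase (the pattern repeats every 3 phases)."""
    base = 2 * (phase % 3)
    write = [(base + 4) % N_BANKS, (base + 5) % N_BANKS]
    read_a = [(base + d) % N_BANKS for d in range(3)]
    read_b = [(base + 1 + d) % N_BANKS for d in range(3)]
    return write, read_a, read_b


def has_conflict(phase):
    """True if a bank is written and read during the same phase."""
    write, read_a, read_b = schedule(phase)
    return bool(set(write) & (set(read_a) | set(read_b)))


table = [schedule(p) for p in range(3)]
conflicts = [has_conflict(p) for p in range(3)]
```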
[0301] Figure 19 depicts the parallel read / write operations performed by the serializer 3, according to some embodiments. As illustrated by figure 19, the serializer 3 may be configured to perform read and write operations in parallel in a pipelined fashion, without any conflicts (related to read and write to the same memory bank) during one or more phases. Each phase has a duration of W × C × fh × fw clock cycles. For an input feature map tensor I < H, W, C >, (H - 6) / 2 + 3 phases in total may be needed to generate all the necessary patches for a convolutional layer of the convolutional neural network used to implement convolutional filtering.
[0302] The serializer 3 is operable for performing write operations consisting in writing a number of rows from the input feature map tensor I to a number of banks of the memory device 300 using a writing port, while being operable for reading in parallel convolutional patches (extracting convolutional patches) from banks of the dual port memory assembly using at least one reading port (two in the example of figure 16), the extracted convolutional patches being fed to the data filtering device 90, which can then advantageously process them in parallel.
[0303] Figure 20 illustrates multiple patch extraction, according to embodiments where the input images are streamed in a column major order. In such embodiments, the serializer 3 is adapted to extract at least two patches in parallel. In the specific example of figure 20, the serializer 3 is adapted to extract H patches in parallel, H being the height of the image, starting at coordinates patch (0,y) with y varying from 0 to H - 1. In this example, the serializer 3 is implemented using a dual-port memory assembly 300 constituted of H combined dual port memory banks 30 so that the dual-port memory assembly 300 forms a memory comprising H write ports and H read ports. The column of incoming feature map I < H, W, C > may be written through the H write ports simultaneously in a column-major order. The H read ports may be used to extract the convolutional patches (i,y) simultaneously for j-th iteration (while index i is used herein to describe the convolutional patches extraction, it will be understood that it is not linked to the preceding use of the notation i when referring to the interconnected layersets 2).
[0304] To avoid conflicts, the dual-port memory assembly 300 may be configured so that the same memory bank is never read or written to at the same clock cycle. The write / read operations occur in 4 phases as described in Table 4:
[0305] Operation | Phase 0 | Phase 1 | Phase 2 | Phase 3
[0306] Write col (Rows 0..H-1) | 3 | 0 | 1 | 2
[0308] Port 0 read col (Rows 0,1,2) | 0,1,2 | 1,2,3 | 2,3,0 | 3,0,1
[0309] Port 1 read col (Rows 0,1,2) | 0,1,2 | 1,2,3 | 2,3,0 | 3,0,1
[0310] Port H-1 read col (Rows H-3, H-2, H- | 0,1,2 | 1,2,3 | 2,3,0 | 3,0,1
[0313] Table 4
[0314] Figure 21 depicts the parallel read / write operations performed by the serializer 3 in embodiments using column major operation. As illustrated by figure 21, the serializer 3 may be configured to perform read and write operations in parallel in a pipelined fashion, without any conflicts (related to read and write to the same memory bank) during one or more phases. Each phase has a duration of C × fh × fw clock cycles. For an input feature map tensor I < H, W, C >, W phases in total may be needed to generate all the necessary patches for a convolutional layer of the convolutional neural network used to implement convolutional filtering.
[0315] The serializer 3 is operable for performing write operations consisting in writing a number of rows from the input feature map tensor I to a number of banks of the memory device 300 using a writing port, while being operable for reading in parallel convolutional patches (extracting convolutional patches) from banks of the dual port memory assembly using at least one reading port (H in the example of figure 21), the extracted convolutional patches being fed to the data filtering device 90, which can then advantageously process them in parallel.
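The phase counts and phase durations quoted for the row major example of figure 19 and the column major example of figure 21 can be turned into a small calculator. The sketch below only restates those formulas; it assumes that H is even and at least 6 so that (H - 6) / 2 is an integer, and the function names are hypothetical. The resulting numbers are arithmetic illustrations, not measured performance.

```python
def row_major_cycles(h, w, c, fh, fw):
    """Row major example: ((H - 6) / 2 + 3) phases of W * C * fh * fw cycles each."""
    phases = (h - 6) // 2 + 3  # assumes H is even and H >= 6
    return phases, phases * (w * c * fh * fw)


def column_major_cycles(h, w, c, fh, fw):
    """Column major example: W phases of C * fh * fw cycles each."""
    phases = w
    return phases, phases * (c * fh * fw)


# Illustrative 32 x 32 x 3 input with 3 x 3 filters.
row_phases, row_total = row_major_cycles(32, 32, 3, 3, 3)
col_phases, col_total = column_major_cycles(32, 32, 3, 3, 3)
```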
[0316] Embodiments of the present disclosure can take the form of an embodiment containing hardware only or both hardware and software elements.
[0317] Furthermore, the methods described herein can be implemented by computer program instructions supplied to the processor of any type of computer to produce a machine with a processor that executes the instructions to implement the functions / acts specified herein. These computer program instructions may also be stored in a computer-readable medium that can direct a computer to function in a particular manner. To that end, the computer program instructions may be loaded onto a computer to cause the performance of a series of operational steps and thereby produce a computer implemented process such that the executed instructions provide processes for implementing the functions specified herein. In particular, the methods described herein may be implemented in a computer system.
[0318] It should be noted that the functions, acts, and / or operations specified in the flow charts, sequence diagrams, and / or block diagrams may be re-ordered, processed serially, and / or processed concurrently consistent with embodiments of the disclosure. Moreover, any of the flow charts, sequence diagrams, and / or block diagrams may include more or fewer blocks than those illustrated consistent with embodiments of the disclosure.
[0319] While embodiments of the disclosure have been illustrated by a description of various examples, and while these embodiments have been described in considerable detail, it is not the intent of the applicant to restrict or in any way limit the scope of the appended claims to such detail.
[0320] Additional advantages and modifications will readily appear to those skilled in the art. The disclosure in its broader aspects is therefore not limited to the specific details, representative methods, and illustrative examples shown and described.
Claims
1. An accelerator (1) implementing a convolutional neural network configured to determine an output feature map, in response to the receipt of an input feature map, the input feature map forming a Q-dimensional tensor, the convolutional neural network comprising one or more convolution layers, a convolution layer being associated with a set of weights, wherein the accelerator (1) comprises:
- at least one serializer (3) configured to serialize the input feature map into a series of 1D serialized vectors comprising elements of the input feature map,
- a plurality of layersets (2),
wherein said at least one serializer (3) is configured to broadcast the serialized vectors to at least one of the layersets, said layersets being interconnected in a pipelined fashion, each layerset corresponding to a pipeline stage, each layerset being configured to receive a continuous input stream, each layerset (2) comprising one or more filter cores (22), each filter core implementing K convolutional filters, each layerset (2) being configured to perform the equivalent computation of a CNN layer using the outputs from the layerset filter cores (22),
wherein in response to the receipt of an input stream from a previous pipeline stage by a given layerset (2), the input stream is broadcasted to the filter cores (22) of said given layerset (2), the filter cores (22) of said given layerset (2) being configured to operate in parallel, the outputs of said filter cores being combined to form the output of said given layerset, the output of a given interconnected layerset being directly transmitted to the next layerset, without storage of the given layerset output in an external memory,
wherein the last interconnected layerset is configured to provide said output feature map.
2. The accelerator of claim 1, wherein the accelerator (1) comprises a main configurable interconnection unit (4) configured to broadcast the serialized vectors to at least some of the layersets (2).
3. The accelerator of any preceding claim, wherein each layerset (2) is implemented using a single hardware cluster.
4. The accelerator of any of claims 1 and 2, wherein the accelerator (1) comprises at least two clusters to implement the layersets, at least one cluster implementing two or more layersets.
5. The accelerator of any preceding claim, wherein each filter core (22) comprises:
- K convolution lanes (2202) operating in parallel, each convolutional lane (2202) being configured to apply a convolutional filter, defined by a subset of weights, to the input feature stream received by the filter core, which provides a convolutional lane output,
wherein the convolutional lane output provided by each convolution lane (2202-i) is used to determine a row of the output feature map.
6. The accelerator of any preceding claim, wherein each filter core (22) comprises one or more max pooling blocks (2204), wherein said max pooling blocks (2204) are configured to apply a max pooling operation to the convolutional outputs delivered by the convolutional lanes, immediately, without storage of the convolutional output.
7. The accelerator of claim 6, wherein each convolution lane (2202-j) is associated with a max pooling block (2204) arranged at the output of the convolutional lane, each max pooling block (2204) being configured to apply a max pooling operation to the convolutional output delivered by the associated convolutional lane immediately, without storage of the convolutional output.
8. The accelerator of any of claims 6 and 7, wherein each max pooling block (2204-j) is configured to apply a max pooling sliding window of given dimensions.
9. The accelerator of any preceding claim, wherein each convolution lane (2202-j) comprises a processing unit configured to compute the inner product of the received input stream and of the received subset of weights.
10. The accelerator according to any preceding claim, wherein the input feature map corresponds to an image and is represented by a 4-D tensor of dimensions N x H x W x C, with N being the number of images, H denoting the height of said tensor, W denoting the width of said tensor, and C denoting the number of channels of said tensor.
11. The accelerator according to any preceding claim, wherein the accelerator comprises a weight cache (50) configured to store the sets of weights of the CNN layers, and a weight distributor (2201) configured to distribute the weights stored in the weight cache (50) to the convolutional lanes of the filter cores (22).
12. The accelerator of any preceding claim, wherein a convolutional filter has a size equal to m x m, and the accelerator is adapted to use P + m - 1 cores, each layerset (2) comprising (P - k + m - 1) cores (22) for the k-th layerset.
13. The accelerator of any preceding claim, wherein it comprises an encoder configured to compress the input feature map prior to being serialized by the serialization device.
14. The accelerator of claim 13, wherein it comprises a decoder configured to decompress the input feature stream prior to entering a convolutional lane (2202).