Method, data processing system and computer readable storage medium for selecting a numerical format

By converting RNNs into test neural networks and applying a number format selection algorithm, the inefficiency of RNNs in hardware implementations when handling infinitely long time series is addressed, the value representation format of RNNs is optimized, and computational performance is improved.

CN113887710B · Active · Publication Date: 2026-06-26 · IMAGINATION TECH LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
IMAGINATION TECH LTD
Filing Date
2021-07-02
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing recurrent neural networks (RNNs) are difficult to efficiently handle infinitely long time series inputs in hardware implementations, and existing methods cannot effectively select and optimize the value representation format of RNNs, resulting in low computational efficiency.

Method used

By receiving the representation of an RNN, converting it into a test neural network, and applying a number format selection algorithm to identify and combine two or more values ​​of the RNN in a common number format, including the mantissa and exponent, the hardware implementation of the RNN is optimized.

Benefits of technology

It improves the computational efficiency of RNNs in hardware and their ability to process infinitely long time series, optimizes the value representation format of RNNs, and enhances computational performance.

✦ Generated by Eureka AI based on patent content.

Figure CN113887710B_ABST

Abstract

The present application relates to a method of selecting a numerical format, a data processing system and a computer readable storage medium. A computer-implemented method of selecting a numerical format for representing two or more values of a recurrent neural network (RNN) for use in configuring a hardware implementation of the RNN, the method comprising: receiving a representation of the RNN; implementing the representation of the RNN as a test neural network for operating on a test input sequence, each step of the test neural network comprising an instance of the two or more values of the RNN; operating the test neural network for a plurality of steps on the test input sequence and collecting statistics to provide to a numerical format selection algorithm; and applying a numerical format selection algorithm to the statistics in order to derive a common numerical format for the plurality of instances of the two or more values of the RNN.

Description

Technical Field

[0001] This disclosure relates to a computer-implemented method and data processing system for selecting a number format for the values of a recurrent neural network (RNN).

Background Technology

[0002] A recurrent neural network (RNN) is an artificial neural network used to operate on a sequence of inputs, where states generated during the processing of inputs in the sequence are provided to process one or more subsequent inputs in the sequence. Therefore, the output of an RNN is influenced not only by the network inputs but also by the state representing the network context at previous points in the sequence. In this way, the operation of an RNN is influenced by the historical processing performed by the network, and the same input can produce different outputs depending on the previous inputs in the sequence provided to the RNN.

[0003] RNNs can be used in machine learning applications. In particular, RNNs can be applied to inputs representing time series, which may be of infinite length. For example, RNNs are used for speech recognition and synthesis, machine translation, handwriting recognition, and time series prediction.

Summary of the Invention

[0004] This summary is provided to introduce some concepts that are further described in the following detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

[0005] A computer-implemented method is provided of selecting a number format for representing two or more values of a recurrent neural network (RNN) for use in a hardware implementation of the RNN, the method comprising:

[0006] receiving a representation of the RNN;

[0007] implementing the representation of the RNN as a test neural network for operating on a test input sequence, each step of the test neural network comprising an instance of the two or more values of the RNN;

[0008] operating the test neural network for a plurality of steps on the test input sequence and collecting statistics to provide to a number format selection algorithm; and

[0009] applying the number format selection algorithm to the statistics so as to derive a common number format for the plurality of instances of the two or more values of the RNN.
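The steps of the method can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy scalar cell `rnn_step`, its weights, the test inputs, the 8-bit mantissa, and the simple maximal-range rule in `select_common_exponent`.

```python
import math

def rnn_step(x, h, w_x=0.8, w_h=0.5):
    """One step of a toy scalar RNN cell (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)

def run_test_network(test_inputs, h0=0.0):
    """Operate the test network one step per test input and record the
    magnitude of each instance of the values of interest (input, state)."""
    stats = {"input": [], "state": []}
    h = h0
    for x in test_inputs:
        h = rnn_step(x, h)
        stats["input"].append(abs(x))
        stats["state"].append(abs(h))
    return stats

def select_common_exponent(samples, mantissa_bits=8):
    """Assumed maximal-range rule: the smallest exponent e such that
    max|v| <= (2**(mantissa_bits - 1) - 1) * 2**e for a signed mantissa."""
    peak = max(samples)
    if peak == 0.0:
        return 0
    largest_mantissa = 2 ** (mantissa_bits - 1) - 1
    return math.ceil(math.log2(peak / largest_mantissa))

test_inputs = [0.1, -2.0, 0.7, 3.5, -0.4]
stats = run_test_network(test_inputs)
# One common exponent per value, shared by every instance across all steps.
formats = {name: select_common_exponent(vals) for name, vals in stats.items()}
```

Pooling the statistics from every step before selecting the format is what makes the resulting format common to all instances of a value, rather than specific to one step.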

[0010] Each step of the test neural network may be used to operate on a different test input of the sequence.

[0011] Applying the number format selection algorithm may comprise applying the number format selection algorithm to the statistics captured over all of the plurality of steps, the common number format being output by the number format selection algorithm.

[0012] The common number format may be a block-configurable number format defined by one or more configurable parameters.

[0013] The number format selection algorithm may be configured to identify a block-configurable number format of a predefined type of block-configurable number format.

[0014] The application of number format selection algorithms can include:

[0015] For each of the multiple steps, independently identify the numeric format of each instance of two or more values; and

[0016] combining the number formats of the plurality of instances of the two or more values so as to derive the common number format for the plurality of instances of the two or more values of the RNN.

[0017] The number format selection algorithm may be configured to identify, for each instance of the two or more values, a block-configurable number format defined by one or more configurable parameters.

[0018] The combining may comprise independently combining each of the one or more configurable parameters of the block-configurable number formats identified for each instance of the two or more values, so as to define the one or more configurable parameters of the common number format.

[0019] Independently combining each of the one or more configurable parameters of the block-configurable number formats may comprise determining a median, minimum, maximum, or mean of each of the one or more configurable parameters for use as the respective configurable parameter of the common number format.

[0020] The block-configurable number format may comprise a mantissa and an exponent, and the one or more configurable parameters may include one or more of the exponent value and the bit depth of the mantissa.

[0021] Combining the number formats of the plurality of instances of the two or more values may comprise determining a median, mean, minimum, or maximum of the number formats of the plurality of instances of the two or more values.
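The per-step alternative, in which a format is identified independently for each instance and the formats are then combined parameter by parameter, might look like the following sketch. The `BlockFormat` class, the `combine_formats` helper, the rounding of the median and mean to integers, and the example values are all hypothetical.

```python
import statistics
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockFormat:
    exponent: int       # exponent value shared by the block
    mantissa_bits: int  # bit depth of the mantissa

def combine_formats(per_step, how="max"):
    """Combine each configurable parameter independently across steps."""
    reducers = {
        "max": max,
        "min": min,
        "median": lambda v: round(statistics.median(v)),
        "mean": lambda v: round(statistics.fmean(v)),
    }
    reduce_fn = reducers[how]
    return BlockFormat(
        exponent=reduce_fn([f.exponent for f in per_step]),
        mantissa_bits=reduce_fn([f.mantissa_bits for f in per_step]),
    )

# Formats identified independently for the same value at three steps.
per_step = [BlockFormat(-5, 8), BlockFormat(-4, 8), BlockFormat(-6, 10)]
common = combine_formats(per_step, how="max")
```

Taking the maximum exponent favours range (no instance overflows), whereas the median or mean trades some range for precision; which reduction is appropriate depends on the value being represented.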

[0022] The operating of the test neural network may be performed with each instance of the two or more values of the RNN in a floating-point number format.

[0023] Applying the number format selection algorithm to the statistics may be performed concurrently with, or after, the collection of those statistics.

[0024] The RNN may comprise a plurality of values which include at least the two or more values, and the statistics may comprise one or more of: a mean of at least some of the plurality of values; a variance of at least some of the plurality of values; a minimum or maximum of at least some of the plurality of values; one or more histograms summarising at least some of the plurality of values; and gradients calculated with respect to an output of the RNN, or with respect to a measure of error based on the output of the RNN, at at least some of the plurality of values.
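A minimal sketch of collecting several of the listed statistics over the recorded instances of one value is given below. The `summarise` helper, the fixed-width binning scheme, and the sample data are assumptions for illustration; gradient statistics are omitted because they require a differentiable model of the network.

```python
import statistics

def summarise(values, bins=4):
    """Mean, variance, range and a fixed-width histogram of the values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
    histogram = [0] * bins
    for v in values:
        # The maximum value falls into the last bin rather than past it.
        idx = min(int((v - lo) / width), bins - 1)
        histogram[idx] += 1
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "min": lo,
        "max": hi,
        "histogram": histogram,
    }

# Instances of a state value captured over six steps of a test network.
state_instances = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25]
summary = summarise(state_instances)
```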

[0025] The plurality of steps may be a first predetermined plurality of steps.

[0026] Implementing the representation of the RNN as a test neural network may comprise transforming the representation of the RNN into a test neural network for operation over the first predetermined plurality of steps, the test neural network being equivalent to the RNN over the first predetermined plurality of steps.

[0027] The transforming may comprise unrolling the RNN over the first predetermined plurality of steps so as to form the test neural network.

[0028] The test neural network may be configured to operate on a predetermined plurality of test inputs equal in number to the first predetermined plurality of steps.

[0029] The test neural network can be a feedforward neural network.

[0030] The test neural network may have one or more state inputs, and the implementing may comprise initialising the state inputs of the test neural network according to a predefined set of initial state inputs.

[0031] The method may also include using a common number format as the number format for the corresponding two or more values ​​in the hardware implementation of the RNN.

[0032] The hardware implementation of the RNN may comprise the RNN implemented in hardware formed by:

[0033] transforming the representation of the RNN into a derivative neural network for operation on a predetermined plurality of inputs of an input sequence, the derivative neural network having one or more state inputs and one or more state outputs and being equivalent to the RNN over a second predetermined plurality of steps; and

[0034] iteratively applying the derivative neural network to the input sequence by:

[0035] implementing a sequence of instances of the derivative neural network in hardware; and

[0036] providing the one or more state outputs from each instance of the derivative neural network at the hardware as the one or more state inputs to the subsequent instance of the derivative neural network at the hardware, so as to operate the RNN over an input sequence that is longer than the predetermined plurality of inputs.

[0037] The common number format derived for each of the two or more values of the RNN may be used as the number format for all instances of the respective two or more values in the derivative neural network.

[0038] The first predetermined plurality of steps may differ from the second predetermined plurality of steps.

[0039] The first predetermined plurality of steps may comprise fewer steps than the second predetermined plurality of steps.

[0040] The RNN may comprise one or more cells, each arranged to receive a cell state input generated at a previous step, and transforming the RNN into the test neural network may further comprise, at each cell:

[0041] identifying non-causal operations, which are performed without dependence on the cell state input generated at the previous step; and

[0042] in the derivative neural network, grouping together at least some of the non-causal operations at the plurality of instances of the cell over at least some of the predetermined plurality of steps for processing in parallel at the hardware.

[0043] The cell may comprise causal operations, which are performed in dependence on the cell state input, and transforming the RNN may further comprise configuring the test neural network such that the result of the non-causal operations performed at a cell with respect to an input from the test input sequence is combined with the causal operations performed at the cell with respect to the same test input.
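The separation of causal and non-causal operations can be illustrated with a toy scalar cell. The sketch assumes, purely for illustration, that the cell weight splits into an input weight `W_X` and a state weight `W_H`; the input term then depends on no state and can be computed for every step up front, while the state term remains sequential.

```python
import math

W_X, W_H = 0.8, 0.5  # hypothetical input and state weights

def fused_step(x, h):
    """Reference cell: causal and non-causal work done together."""
    return math.tanh(W_X * x + W_H * h)

def split_run(inputs, h0=0.0):
    # Non-causal part: depends only on the inputs, so all steps can be
    # grouped (and, on suitable hardware, processed in parallel).
    non_causal = [W_X * x for x in inputs]
    # Causal part: needs the previous state, so it is evaluated in order
    # and combined with the non-causal result for the same input.
    states, h = [], h0
    for pre in non_causal:
        h = math.tanh(pre + W_H * h)
        states.append(h)
    return states

inputs = [0.3, -0.6, 0.9]
states = split_run(inputs)
```

Because the non-causal and causal groups are separate sets of values, each group can be given its own common number format.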

[0044] The two or more values may be used in the non-causal operations, and the RNN may comprise two or more other values used in the causal operations. The number format selection algorithm may be applied to the statistics so as to independently derive the common number format for the two or more values of the RNN and a second common number format for the two or more other values of the RNN.

[0045] Two or more values ​​include one or more of the following: input values; state values; weight values; and the output values ​​of the RNN.

[0046] The test input sequence may include exemplary input values ​​selected to represent a typical or expected range of input values ​​for an RNN.

[0047] The number format selection algorithm may be one or more of the following: backpropagation format selection, greedy line search and end-to-end format selection, orthogonal search format selection, maximal range (or "MinMax") format selection, outlier rejection format selection, error-based heuristic format selection (e.g., based on the sum of squared errors with or without outlier weighting), weighted outlier format selection, and gradient weighting format selection algorithms.

[0048] The input sequence can represent a time series.

[0049] A data processing system is provided for selecting one or more numerical formats for representing two or more values ​​of an RNN used in a hardware implementation of a recurrent neural network (RNN), the data processing system comprising:

[0050] a processor;

[0051] control logic configured to implement, at the processor, the representation of the RNN as a test neural network for operating on a test input sequence, each step of the test neural network comprising an instance of the two or more values of the RNN; and

[0052] a format selection unit configured to cause the processor to operate the test neural network for a plurality of steps on the test input sequence and to collect statistics to provide to a number format selection algorithm;

[0053] wherein the format selection unit is configured to apply the number format selection algorithm to the statistics so as to derive a common number format for the plurality of instances of the two or more values of the RNN.

[0054] The data processing system may also include a hardware accelerator for processing neural networks, wherein the control logic is further configured to cause the representation of the RNN to be implemented at the hardware accelerator using a common numerical format of two or more values ​​of the RNN.

[0055] The data processing system may also include:

[0056] a transformation unit configured to transform the representation of the RNN into a derivative neural network for operation on a predetermined plurality of inputs of an input sequence, the derivative neural network having one or more state inputs and one or more state outputs and being equivalent to the RNN over a predetermined plurality of steps; and

[0057] iteration logic configured to, after the operating of the test neural network at the processor, iteratively apply the derivative neural network to the input sequence by:

[0058] causing a sequence of instances of the derivative neural network to be implemented at the hardware accelerator; and

[0059] providing the one or more state outputs from each instance of the derivative neural network at the hardware accelerator as the one or more state inputs to the subsequent instance of the derivative neural network at the hardware accelerator, so that the hardware accelerator operates the RNN over an input sequence longer than the predetermined plurality of inputs.

[0060] The hardware accelerator can be the same device as the processor.

[0061] A graphics processing system may be provided, configured to perform any of the methods described herein. Computer program code for performing the methods described herein may be provided. A non-transitory computer-readable storage medium may be provided having stored thereon computer-readable instructions which, when executed at a computer system, cause the computer system to perform the methods described herein.

Description of the Drawings

[0062] The invention is described by way of example with reference to the accompanying drawings. In the drawings:

[0063] Figure 1 is an example of a recurrent neural network (RNN) comprising three stacked RNN cells.

[0064] Figure 2 is a schematic diagram of an RNN cell.

[0065] Figure 3 is a schematic diagram of a data processing system for implementing an RNN.

[0066] Figure 4 illustrates the dynamic RNN of Figure 1 unrolled over three time steps to form a static (i.e., feedforward) derivative neural network.

[0067] Figure 5 illustrates the unrolled RNN of Figure 4 iterated over an input sequence.

[0068] Figure 6 is a schematic diagram of an implementation of the RNN cell of Figure 2 in which causal and non-causal computations are performed separately.

[0069] Figure 7 is a schematic diagram of an implementation of the single RNN cell of Figure 6 over multiple time steps, in which causal and non-causal computations are performed separately in each time step and the non-causal computations are performed in parallel across multiple time steps.

[0070] Figure 8 is a flowchart illustrating a method for transforming an RNN into a form suitable for implementation on an accelerator capable of executing static graphs.

[0071] Figure 9 shows a computer system including a neural network accelerator configured to implement an RNN according to the principles described herein.

[0072] Figure 10 is a schematic diagram of an integrated circuit manufacturing system.

[0073] Figure 11 shows a method for performing number format selection for an RNN.

Detailed Description

[0074] The following description is given by way of example to enable those skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein, and various modifications to the disclosed embodiments will be readily apparent to those skilled in the art. The embodiments are described by way of example only.

[0075] Figure 1 shows an example of a recurrent neural network (RNN) 100 that will be used to illustrate the hardware implementation of an RNN according to the principles described herein. The network comprises three stacked RNN cells RNN1, RNN2, and RNN3 (102 to 104 in the figure). Each cell may include one or more network operations. Each RNN cell processes the state generated by that RNN cell with respect to a previous time step of the input sequence in a manner defined by the operations comprised in the cell and one or more network parameters (referred to herein as "weights"). An RNN cell is a subgraph (subnetwork) that can be used as a component in an RNN. It takes one or more input data tensors and one or more state input tensors (cell state inputs), and can generate one or more state output tensors and/or one or more output data tensors for that time step. Some of the output data tensors generated by the cell may be the same as the output state tensors generated by the cell.

[0076] The RNN is configured to operate on a time series x(t) 101, which can be, for example, a series of audio samples on which the RNN will perform speech recognition. The representation of the RNN in Figure 1 represents the RNN at a general time step t. At each time step t, the RNN provides an output o(t) 105. By operating the RNN on the input sequence x(t) at each time step, the RNN generates a corresponding output sequence o(t). The lengths of the input and output sequences can be infinite. Therefore, RNNs can be used to process time series where the length or content of the time series is unknown at the start of processing, for example, audio samples of real-time speech captured for the purpose of implementing voice control of a device. More generally, RNNs can operate on any input sequence, which may not be a time series. References to time series in this disclosure, such as "time step," should be understood to apply equally to any input sequence, including but not limited to time series. The operation of the RNN with respect to each input of the sequence represents a step of the RNN, each step being a single iteration of the RNN, i.e., a single application of the RNN in its original form.

[0077] It should be understood that although in the examples described herein, the RNN generates a single output corresponding to each input in the input sequence, the described method is equally applicable to RNNs with other configurations, including, for example: RNNs that generate a single output at the end of the input sequence (e.g., RNNs suitable for performing classification); RNNs that generate fewer outputs than the inputs received by the network; and RNNs that provide different outputs for the same input, such as RNNs with branches of two output sequences corresponding to the input sequence in a 1:1 ratio.

[0078] In practice, each RNN cell can include multiple operations, each of which is arranged to perform a different set of computations. For example, an RNN cell can include one or more matrix multiplication operations, convolution operations, activation operations, and concatenation operations. These operations are arranged in the RNN cell to perform operations on the input (which may come from the previous RNN cell in the network) and the state generated when the RNN cell was processed in the previous time step.

[0079] The first RNN cell 102 receives input data from the time series x(t) 101 at time step t and processes the input according to a predefined set of computations for that cell. The processing at the first cell is also performed based on the state h1(t-1) generated during the processing of the previous input x(t-1) at the first cell. In the figures, the state forwarded for use during processing in the next time step is shown as state h1(t) 106, which is delayed by 109 such that state h1(t) is provided to the first cell with input x(t+1).

[0080] The second and third RNN cells in Figure 1 operate in a similar manner to the first RNN cell, but the second RNN cell receives the output of the first RNN cell as its input, and the third RNN cell receives the output of the second RNN cell as its input. The output of the third RNN cell o(t) 105 is the output of the RNN. Each of the second and third RNN cells performs its own predefined set of computations on its corresponding input. As with the first RNN cell, the second and third RNN cells receive state inputs from the processing performed at one or more previous time steps. In the figure, the second RNN cell 103 outputs state h2(t) 107, which is provided as state input to the second RNN cell at time step t+1, and the third RNN cell 104 outputs state h3(t) 108, which is provided as state input to the third RNN cell at time step t+1. For the first time step at t=0, the RNN is typically initialized using predefined initial state values. The initial state values can be, for example, constants, learned initial states, or all zeros.

[0081] In the accompanying diagram, the output of the RNN cell at time step t is provided as state input to the RNN cell at time step t+1. However, in general, the state may include one or more tensors generated at the first RNN cell and / or the output of the first RNN cell. Generally, the state input to a cell may include state from one or more previous time steps; for example, the state may additionally or alternatively include state from processing time step t-2. In some networks, the state input to a cell may additionally or alternatively include state data generated at other RNN cells in the network; for example, the state data provided to the first RNN cell may include state data from a second RNN cell.

[0082] The RNN shown in Figure 1 is a simplified example. In general, an RNN can include one or more RNN cells and can perform one or more additional processing steps on the inputs to the RNN, on the outputs from the RNN, and/or between the RNN cells. For example, an RNN can also include one or more convolution operations, activation operations, and fully connected operations that process the inputs, outputs, or intermediate outputs between cells. The input x(t) and the states h_i(t) are tensors with any dimensions suitable for the application.

[0083] Figure 2 is a schematic diagram of a simple RNN cell 200. One or more of the RNN cells 102 to 104 in Figure 1 may have the structure of RNN cell 200. In the manner described with respect to Figure 1, RNN cell 200 receives input x(t) 212 (or, for a higher cell, the output of a lower cell) and state h(t-1) 210 from the operation of RNN cell 200 on the input of the previous time step x(t-1). The RNN cell itself includes multiple operations. The input and state are combined at concatenation operation 202 (e.g., concatenated along the channel dimension), which provides a tensor input to matrix multiplication operation 204. The matrix multiplication operation receives a weight tensor as a matrix W 218, which is multiplied with the concatenated tensor generated by concatenation operation 202. The output of matrix multiplication operation 204 is then processed by activation operation 206, which applies an activation function to the output of the matrix multiplication operation. The activation function can be any function suitable for the application of the RNN; for example, the activation function can be tanh, ReLU, or a sigmoid function.

[0084] The output of RNN cell 200 is provided as output o(t) 214 and also as state h(t) 216 for use by the RNN cell in the next time step. In other examples, the state may differ from the output of the RNN cell (e.g., the state may include intermediate tensors generated during operations performed at the RNN cell), and / or the state may include multiple tensors.

[0085] With the activation function tanh, the operation of RNN cell 200 on the input tensor x(t) can be expressed as:

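[0086] $h(t) = \tanh\big(W\,[x(t);\,h(t-1)]\big)$ (1)

where $[x(t);\,h(t-1)]$ denotes the concatenation of the input tensor and the state tensor produced by concatenation operation 202.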
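The operation of RNN cell 200 can be sketched in a few lines, following the order of operations described for Figure 2 (concatenation, matrix multiplication, activation). The function name `rnn_cell`, the list-based tensors, the tensor sizes, and the weight values are hypothetical, and the sketch assumes that the output and the state are the same tensor, as in the example cell.

```python
import math

def rnn_cell(x, h_prev, W):
    """x: input features, h_prev: state features, W: weight matrix with
    len(h_prev) rows and len(x) + len(h_prev) columns."""
    concatenated = x + h_prev                    # concatenation 202
    pre_activation = [                           # matrix multiplication 204
        sum(w * v for w, v in zip(row, concatenated)) for row in W
    ]
    h = [math.tanh(a) for a in pre_activation]   # activation 206 (tanh)
    return h, h  # output o(t) and state h(t) are the same tensor here

# One input feature and a two-element state, so W is 2 x 3.
W = [[0.5, -0.25, 0.1],
     [0.0, 0.75, -0.3]]
o, h = rnn_cell(x=[1.0], h_prev=[0.0, 0.0], W=W)
```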

[0087] RNN cell 200 is a simple example of an RNN cell. It should be understood that many different kinds of RNN cells exist that can be implemented according to the principles described herein. For example, as is known in the art, the RNN cell implemented as described herein can be an LSTM (Long Short-Term Memory) cell or a GRU (Gated Recurrent Unit) cell. Different types of RNN cells have different characteristics, and it should be understood that the choice of any particular type of RNN cell can be determined by the particular application to which the RNN is applied.

[0088] A data processing system 300 for implementing an RNN is shown in Figure 3. The data processing system includes an accelerator 302 for performing the tensor operations of a neural network. The accelerator may be referred to as a neural network accelerator (NNA). The accelerator includes multiple configurable resources that enable various types of feedforward neural networks, such as various convolutional neural networks and multilayer perceptrons, to be implemented at the accelerator.

[0089] The implementation of an RNN in hardware will be described with respect to the particular example of the data processing system shown in Figure 3, in which accelerator 302 includes multiple processing elements 314, each including a convolution engine. However, it should be understood that, unless otherwise stated, the principles described herein for implementing RNNs in hardware are generally applicable to any data processing system that includes an accelerator capable of performing the tensor operations of a neural network.

[0090] In Figure 3, the accelerator includes an input buffer 306, multiple convolution engines 308, multiple accumulators 310, an accumulation buffer 312, and an output buffer 316. Each convolution engine 308, together with its corresponding accumulator 310 and its share of the resources of the accumulation buffer 312, represents a hardware processing element 314. Three processing elements are shown in Figure 3, but in general any number of processing elements can be present. Each processing element receives a set of weights from coefficient buffer 330 and input values (e.g., input tensors) from input buffer 306. The coefficient buffer may be provided at the accelerator, for example, on the same semiconductor die and/or in the same integrated circuit package. By combining the weights and input tensors, the processing elements can be operated to perform the tensor operations of a neural network.

[0091] Generally, accelerator 302 may include any suitable tensor processing hardware. For example, in some examples, the accelerator may include a pooling unit (e.g., for implementing max pooling and average pooling operations), or an element-wise processing unit for performing mathematical operations on each element (e.g., adding two tensors). For simplicity, such units are not shown in Figure 3.

[0092] The processing elements of the accelerator are independent processing subsystems that can operate in parallel. Each processing element 314 includes a convolution engine 308 configured to perform a convolution operation between weights and input values. Each convolution engine 308 may include multiple multipliers, each configured to multiply a weight by its corresponding input data value to produce a multiplication output value. The multipliers may be followed by, for example, an adder tree arranged to compute the sum of the multiplication outputs. In some examples, these multiply-accumulate computations may be pipelined.

[0093] Typically, a significant amount of hardware operation must be performed at the accelerator to execute each tensor operation of the neural network. This is because the input and weight tensors are usually very large. Since multiple hardware passes of the convolution engine may be required to generate the complete output of a convolution operation (e.g., because the convolution engine may only receive and process a portion of the weight and input data values), the accelerator may include multiple accumulators 310. Each accumulator 310 receives the output of the convolution engine 308 and adds that output to the previous convolution engine output associated with the same operation. Depending on the accelerator implementation, the convolution engine may not process the same operation in consecutive cycles, so an accumulation buffer 312 may be provided to store the partially accumulated output of a given operation. Appropriate partial results can be provided to the accumulators by the accumulation buffer 312 in each cycle.
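The role of the accumulators can be illustrated by computing a dot product that is wider than one hardware pass can accept. This is a software analogy only: the function names, the lane count of four, and the data are assumptions for illustration and do not describe the actual hardware.

```python
def convolution_engine_pass(weights, inputs):
    """One hardware pass: multiply weight/input pairs and sum the products,
    standing in for the multipliers and the adder tree."""
    return sum(w * x for w, x in zip(weights, inputs))

def dot_in_passes(weights, inputs, lanes=4):
    """Process at most `lanes` weight/input pairs per pass and add each
    partial result to the running total held by the accumulator."""
    accumulator = 0.0
    for start in range(0, len(weights), lanes):
        partial = convolution_engine_pass(
            weights[start:start + lanes], inputs[start:start + lanes])
        accumulator += partial
    return accumulator

weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
inputs = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
result = dot_in_passes(weights, inputs, lanes=4)  # two passes: 10.0 + 22.0
```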

[0094] The accelerator may include an input buffer 306 arranged to store the input data required by the accelerator (e.g., the convolution engines) and a coefficient buffer 330 arranged to store the weights required by the accelerator (e.g., the convolution engines) for combination with the input data according to the operations of the neural network. The input buffer may hold some or all of the input data associated with one or more operations performed at the accelerator in a given cycle. The coefficient buffer may hold some or all of the weights associated with one or more operations processed at the accelerator in a given cycle. The various buffers of the accelerator shown in Figure 3 can be implemented in any suitable manner, for example, as any number of data stores local to the accelerator (e.g., on the same semiconductor die and/or located within the same integrated circuit package) or accessible to the accelerator via a data bus or other interconnect.

[0095] Memory 304 can be accelerator-accessible; for example, it can be system memory accessible to the accelerator via a data bus. On-chip memory 328 can be provided for storing weights and / or other data (such as input data, output data, etc.). The on-chip memory can be local to the accelerator, allowing data stored in the on-chip memory to be accessed by the accelerator without consuming the memory bandwidth of memory 304 (e.g., system memory accessible via a system bus). Data (e.g., weights, input data) can be periodically written from memory 304 to the on-chip memory. Coefficient buffer 330 at the accelerator can be configured to receive weight data from on-chip memory 328 to reduce bandwidth between the memory and the coefficient buffer. Input buffer 306 can be configured to receive input data from on-chip memory 328 to reduce bandwidth between the memory and the input buffer. Memory can be coupled to the input buffer and / or on-chip memory to provide input data to the accelerator.

[0096] Accumulation buffer 312 may be coupled to output buffer 316 to allow the output buffer to receive intermediate output data of the neural network operations performed at the accelerator, as well as output data of the final operation (i.e., the last operation of the network performed at the accelerator). Output buffer 316 may be coupled to on-chip memory 328 to provide intermediate and final output data to on-chip memory 328, for example, for use as state when implementing an RNN at the accelerator in the manner described below.

[0097] Typically, large amounts of data need to be transferred from memory to the processing elements. If this transfer cannot be performed efficiently, it can lead to high memory bandwidth requirements and high power consumption for providing input data and weights to the processing elements. This is especially true when the memory is "off-chip" memory, i.e., implemented on a different integrated circuit or semiconductor die than the processing elements. One such example is system memory accessible to the accelerator via a data bus. To reduce the memory bandwidth requirements when an accelerator executes a neural network, it is advantageous to provide on-chip memory at the accelerator, where at least some of the weights and/or input data required to implement the neural network can be stored. Such memory is "on-chip" (e.g., on-chip memory 328) when it is located on the same semiconductor die and/or in the same integrated circuit package.

[0098] Figure 3 shows various exemplary connections individually, but in some implementations, some or all of these connections may be provided by one or more shared data bus connections. It should also be understood that other connections may be provided as alternatives or in addition to the connections shown in Figure 3. For example, output buffer 316 may be coupled to memory 304 to provide output data directly to memory 304. Similarly, in some examples, not all of the connections shown in Figure 3 are necessary. For example, memory 304 need not be coupled to input buffer 306, which can obtain input data directly from an input data source, such as an audio subsystem configured to sample signals from a microphone that captures speech from a user of a device including the data processing system.

[0099] Implementing RNNs in hardware

[0100] Implementing RNNs in hardware on data processing systems suitable for executing feed-forward (non-recurrent) neural networks, such as the aforementioned accelerators, is generally not feasible, because such systems require neural networks that can be represented by a complete static graph. To allow the execution of RNNs on such hardware, the inventors propose expanding the RNN over a predetermined number of time steps to produce a static neural network with a fixed set of state inputs and a fixed set of state outputs. This method transforms the dynamic graph of the RNN into a static graph of a feed-forward neural network, suitable for implementation on an accelerator according to conventional implementation and optimization algorithms. By iterating the statically expanded RNN and providing the state output of one iteration of the expanded RNN as the state input for the next iteration, the RNN can be executed on an infinitely long input sequence.

[0101] Static neural networks are feed-forward neural networks and can be represented by static graphs. Dynamic neural networks include one or more feedback loops and cannot be represented by static graphs. The output of a dynamic neural network at a given step depends on the processing performed in one or more previous steps of the neural network. Therefore, a computational graph or neural network containing one or more feedback loops can be called a dynamic graph or neural network. Conversely, a computational graph or neural network that does not contain feedback loops can be called a static or feed-forward graph or neural network. The derivative neural network described herein is a feed-forward neural network.

[0102] For example, Figure 4 shows the three-cell RNN of Figure 1 expanded over three time steps t, t+1, and t+2 to form the expanded RNN 400. It can be seen from Figure 4 that the state output h1(t) of the first RNN cell 102 at time step t is provided as state input to the instance of the same first RNN cell 102 at the next time step t+1. Similarly, the state output h1(t+1) of the first RNN cell at time step t+1 is provided as state input to the instance of the same first RNN cell 102 at the next time step t+2. Likewise, the state output of each of the second RNN cell 103 and the third RNN cell 104 is provided as state input to the instance of the same second or third RNN cell at the next time step. The expanded RNN generates an output o(t) for each input x(t) in the input sequence.

[0103] The state outputs h_i(t+2) of the three RNN cells 102 to 104 at the final time step t+2 of the expanded RNN 400 are provided as the state outputs 404 of the expanded RNN. The expanded RNN 400 has three state inputs 402, which are the state inputs h_i(t-1) of the three RNN cells 102 to 104 at the first time step t. By using the state outputs 404 of a first instance of the expanded RNN 400 as the state inputs 402 of the next instance of the expanded RNN 400, the processing performed by the expanded RNN can be iterated over an infinitely long input sequence x(t).

[0104] All of the first RNN cells are identical, all of the second RNN cells are identical, and all of the third RNN cells are identical. It should therefore be understood that, mathematically, the expanded RNN 400 shown in Figure 4 is equivalent to operating the RNN shown in Figure 1 on a sequence of three inputs. Generally speaking, an RNN can be expanded over any number of time steps. The example shown in Figure 4 illustrates an RNN expanded over only three steps, but in real-world systems the number of time steps is often much larger; for example, an RNN can be expanded over 16 time steps.

[0105] Figure 5 illustrates the iteration of an expanded RNN 400, where the state output (network state output) of a first instance 502 of the expanded RNN is provided as state input (network state input) to a second instance 504 of the expanded RNN. The iterated expanded RNN operates on an input sequence 506, where each instance of the expanded RNN operates on three inputs from the input sequence; for example, the first instance might operate on inputs x(t) to x(t+2), and the second instance might operate on inputs x(t+3) to x(t+5). The set of inputs operated on by each instance of the expanded RNN can be referred to as a "partition" of the input sequence; therefore, each iteration of the expanded RNN operates on a corresponding partition of the input sequence, each partition comprising a predetermined number of time steps.
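The iteration shown in Figure 5 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation described in this document: the single scalar tanh cell, the weights `W_X` and `W_H`, and all function names are hypothetical. The sketch checks that iterating a three-step expanded network, with the state output of each instance passed to the next instance, gives the same outputs as running the recurrent cell one step at a time.

```python
import math

# Hypothetical scalar weights for a single tanh RNN cell (illustration only).
W_X, W_H = 0.5, -0.3
STEPS = 3  # predetermined number of time steps per partition


def cell(x, h_prev):
    """One RNN step; in this toy cell the new state is also the output."""
    return math.tanh(W_X * x + W_H * h_prev)


def run_recurrent(inputs, h0=0.0):
    """Reference: apply the cell one time step at a time (dynamic form)."""
    h, outputs = h0, []
    for x in inputs:
        h = cell(x, h)
        outputs.append(h)
    return outputs


def expanded_rnn(partition, state_in):
    """Static network: exactly three copies of the cell and no feedback loop."""
    x0, x1, x2 = partition
    h0 = cell(x0, state_in)
    h1 = cell(x1, h0)
    h2 = cell(x2, h1)
    return [h0, h1, h2], h2  # outputs of the partition, state output


def run_iterated(inputs, h0=0.0):
    """Iterate the expanded network over consecutive partitions."""
    state, outputs = h0, []
    for start in range(0, len(inputs), STEPS):
        outs, state = expanded_rnn(inputs[start:start + STEPS], state)
        outputs.extend(outs)
    return outputs


sequence = [0.1, -0.4, 0.7, 0.2, 0.0, -0.9]  # two partitions of three inputs
reference = run_recurrent(sequence)
iterated = run_iterated(sequence)
```

In this sketch the sequence length must be a multiple of `STEPS`; how a trailing partial partition would be handled is not addressed here.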

[0106] The iterated expanded RNN generates an output sequence 508. For simplicity, in Figure 5 the passing of state values between instances of RNN cells at consecutive time steps is illustrated by simple arrows. However, it should be understood that, for example, the state generated when the second cell instance 510 processes the input at time step t+1 is not available to the second cell instance 512 processing the input at time step t+2 until the relevant processing at the second cell instance 510 is completed.

[0107] The implementation of RNNs in hardware will now be described with reference to the data processing system shown in Figure 3. Generally, the principles described herein can be applied to implement RNNs at any accelerator capable of performing the tensor operations of a neural network. For example, an accelerator could be a graphics processing unit (GPU), a tensor accelerator, a digital signal processor (DSP), or a neural network accelerator (NNA). Accelerator 302 may not be able to execute code independently and may require management and configuration (e.g., by control logic 324) in order to do so.

[0108] To implement the RNN 338 on the accelerator 302, the transformation unit 326 is configured to expand the RNN over a predetermined number of time steps (a partition of the input sequence) so as to generate an expanded RNN in the manner described with respect to Figures 4 and 5. Instead of attempting to configure accelerator 302 to execute the recurrent form of the RNN, which would require the accelerator to implement a dynamic graph, the transformation unit provides an expanded RNN for implementation at accelerator 302. Since the expanded RNN can be represented by a static graph, it can be implemented at an accelerator on which the RNN would otherwise be impossible to execute in hardware. Therefore, the same accelerator can be used to implement recurrent or non-recurrent neural networks, thereby extending its utility. The method described herein includes executing the expanded RNN at the accelerator in order to execute the originally defined RNN.

[0109] Control logic 324 is configured to implement the neural network at the accelerator. The control logic configures the accelerator's processing element 314 to perform tensor operations of the neural network, for example, by setting appropriate accelerator parameters, defining appropriate data structures at memory 304 and on-chip memory 328, and providing references to these data structures along with instructions defining the tensor operations to be performed. The control logic can (e.g., via on-chip memory 328) cause the weights required for the tensor operations to be read into coefficient buffer 330 and provide the inputs to input buffer 306. Typically, a significant amount of hardware operation must be performed at the accelerator to execute each tensor operation of the neural network. This is because the input tensors and weight tensors are usually very large. Generally, processing element 314 will take more than one hardware pass to generate a complete output for the computation. The control logic can be configured to synchronize the provision of weights and input data to the accelerator's processing element such that, over many passes, the output of each computation is accumulated at accumulation buffer 312.

[0110] Using control logic to configure and manage the processing of a neural network at an accelerator is known in the art, and suitable control logic is typically provided to configure an accelerator for implementing neural networks. Control logic 324 may include one or more of the following: software (e.g., a driver) executing at a processor of the data processing system 300 (e.g., a CPU); firmware (e.g., at the accelerator 302 itself); a dedicated processor, such as one that may be implemented at the accelerator 302 or in a system-on-a-chip (SoC) coupled to the accelerator. In some examples, the control logic may include a driver running at a general-purpose processor of the data processing system and firmware running at the SoC of the accelerator 302. Typically, the accelerator will include registers on the device that configure aspects of the operations performed by the accelerator, and the control logic will set these registers to appropriately configure the accelerator to implement a given neural network.

[0111] The data processing system also includes a transformation unit 326 to transform the RNN into a static neural network for implementation at the accelerator. In some examples, the transformation unit 326 may be provided at the control logic, but other arrangements are possible; for example, the transformation unit may be a separate logic component embodied in software, hardware, or firmware at the data processing system. In some examples, the transformation unit is software configured to process the RNN before it is submitted to the control logic for implementation in the hardware at the accelerator.

[0112] The operation of the transformation unit 326 will now be described with reference to the flowchart 800 shown in Figure 8, which illustrates a method for implementing an RNN in hardware. At 801, the representation of the RNN 338 to be implemented in hardware is received at the transformation unit. The RNN can be represented in any suitable manner, such as a mathematical representation, or any other representation of the RNN on which the transformation unit is configured to operate. Several standards exist for high-level definitions of neural networks, any of which can serve as a suitable input to the algorithm. Deep learning framework APIs tend to be close to pure mathematical definitions, and there are several cross-framework "standards" that operate at a similar level (e.g., ONNX). Code prepared to execute on a particular accelerator will generally be closer to the hardware and include features specific to that hardware. There are also widely used intermediate representations, such as Relay, which are commonly used in deep neural network (DNN) compilers.

[0113] The transformation unit is configured to expand (802) the RNN over a predetermined number of steps. Any of the various methods known in the art for expanding (sometimes called unfolding or unrolling) RNNs can be used. For example, a mathematical approach to unfolding RNNs is described in Chapter 10 (see 10.1 in particular) of Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT Press, 2016), which is incorporated herein by reference in its entirety.

[0114] It is advantageous to choose the predetermined number of steps based on the specific characteristics of the accelerator, in order to optimize the execution of the expanded RNN on the accelerator while maintaining acceptable latency. For example, an instance of a statically expanded RNN implemented in the hardware at the accelerator typically requires all the inputs of the partition (i.e., a number of inputs equal to the number of time steps processed by the expanded RNN) to be available before execution of the instance begins. Therefore, while increasing the predetermined number of steps will generally improve execution efficiency, it will also increase the latency of the RNN. This can be significant in many applications, especially those performing real-time processing such as speech recognition.
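The latency side of this trade-off can be made concrete with a simple calculation. The sketch below is illustrative only and rests on two assumptions that are not taken from this document: inputs arrive at a fixed interval, and the hypothetical 10 ms interval stands for something like an audio frame period. It computes how long the first input of a partition waits before the instance can start, ignoring compute time on the accelerator.

```python
def buffering_delay_ms(steps_per_partition, input_interval_ms):
    """Time the first input of a partition waits for the rest to arrive.

    Assumes inputs arrive at a fixed interval and that execution of an
    instance starts only once all inputs of its partition are available.
    Compute time on the accelerator is ignored.
    """
    return (steps_per_partition - 1) * input_interval_ms


# Hypothetical 10 ms interval between inputs (e.g. audio frames).
delays = {steps: buffering_delay_ms(steps, 10.0) for steps in (1, 3, 16)}
```

Under these assumptions a partition of 16 time steps adds 150 ms of buffering delay for its first input, compared with 20 ms for a partition of three.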

[0115] Transformation unit 326 expands the RNN over a predetermined number of steps to derive (803) a static neural network, which represents a portion of the fully expanded RNN that is mathematically equivalent to the received representation of the RNN. The state input of the first time step of the derivative neural network is provided as the state input of the derivative neural network itself, and the state output of the last time step of the derivative neural network is provided as the state output of the derivative neural network itself. This enables the derivative neural network to be iterated in the manner shown in Figure 5, where the state output from a first instance of the derivative neural network is provided as the state input for the next instance of the derivative neural network.

[0116] Control logic 324 is configured to implement (804) the derivative neural network in the hardware at accelerator 302. As described above, this can be performed according to conventional methods for implementing neural networks on accelerators, such as by using a driver for the accelerator and firmware executed at the accelerator.

[0117] The data processing system also includes iterative logic 342, which is configured to iteratively apply (805) the derivative neural network to the input sequence and cause the state output from each instance of the derivative neural network (e.g., 404 in Figure 4) to be used as the state input for the next instance of the derivative neural network (e.g., 402 in Figure 4). The iterative logic can implement each instance of the derivative neural network at accelerator 302 by providing the control logic with the same derivative neural network to be implemented at the accelerator each time the current instance of the derivative neural network implemented at the accelerator completes its processing. Of course, successive instances of the derivative neural network will operate on successive inputs from the input sequence on which the RNN is to operate.

[0118] The state tensor can be passed between instances of the derivative neural network in any suitable manner. For example, iterative logic 342 can cause the state to be written to on-chip memory 328 to maintain the state tensor between iterative instances of the derivative neural network. Typically, on each new instance of the derivative neural network executed at the accelerator, the contents of the accelerator's buffer are overwritten. The state is configured to be maintained between instances of the derivative neural network, for example, by writing the state to a protected data repository accessible to the accelerator, such as on-chip memory 328. In other examples, the state can be written out to memory 304 and read back when a new instance of the derivative neural network is initialized at the accelerator.
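One way to picture the role of the iterative logic is the following plain-Python sketch. It is an illustration under assumptions rather than the described implementation: the class `IterationLogic`, its `push` method, and the toy `running_sum_network` are hypothetical names, and an ordinary attribute stands in for a data store such as on-chip memory 328 that survives between instances.

```python
class IterationLogic:
    """Feeds partitions to a static network and carries state between runs."""

    def __init__(self, network, steps, initial_state):
        self.network = network            # callable: (partition, state) -> (outputs, state)
        self.steps = steps                # time steps per partition
        self.state_store = initial_state  # stands in for a persistent state store
        self.pending = []                 # inputs waiting for a full partition

    def push(self, x):
        """Accept one input; run an instance once a full partition is buffered."""
        self.pending.append(x)
        if len(self.pending) < self.steps:
            return []
        partition, self.pending = self.pending, []
        # The state written by the previous instance is read back here and
        # the new state replaces it for the next instance.
        outputs, self.state_store = self.network(partition, self.state_store)
        return outputs


def running_sum_network(partition, state):
    """Toy static network: each output is the running total of the inputs."""
    outputs = []
    for x in partition:
        state = state + x
        outputs.append(state)
    return outputs, state


logic = IterationLogic(running_sum_network, steps=3, initial_state=0)
emitted = [logic.push(x) for x in [1, 2, 3, 4, 5, 6]]
```

Outputs appear only when a partition is complete, and the second partition continues from the total left by the first, which is the behaviour the persistent state store is meant to provide.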

[0119] By iterating a derivative neural network over an input sequence received for processing at an accelerator, the data processing system 300 can be configured to process an input sequence of infinite length: the static derivative neural network is applied repeatedly at the accelerator in the manner shown in Figure 5 so as to implement the full operation of the RNN 338 in hardware. In particular, by converting the RNN into a static derivative neural network for operating on fixed-length partitions of the input sequence, the method described herein enables the implementation of RNNs in hardware that cannot execute a neural network represented by a dynamic graph.

[0120] Causal / Noncausal Separation

[0121] Hardware used to perform neural network operations, such as neural network accelerators (NNAs), is typically optimized to perform large numbers of tensor computations in parallel. The parallel nature of hardware accelerators is particularly useful when running convolutional neural networks, where each convolutional layer can be processed in parallel, for example, across multiple processing elements. However, when recurrence is introduced, and the computation performed at a cell of the neural network at a given time step depends on the computation performed at that cell at previous time steps, existing methods can lead to poor performance. This is due to several factors, including the low utilization of the typical parallel architecture of accelerators used to perform neural networks, the poor suitability of existing optimization algorithms for executing RNNs in hardware, and the high memory bandwidth consumed due to the inefficiency of reading weights and input data into the accelerator at each time step.

[0122] The inventors have recognized that substantial performance improvements for RNNs implemented in hardware can be achieved by dividing the operations performed in one or more RNN cells (e.g., the RNN cells shown in Figure 4 or Figure 5) into a set of non-causal operations and a set of causal operations. Causal operations of an RNN cell are operations performed in dependence on the state received as state input to that cell. Non-causal operations of an RNN cell are operations that can be performed without depending on the state received as state input to that cell; that is, cell operations that can be performed once the input data for that cell is known. Therefore, non-causal operations can be performed simultaneously once the corresponding input data is available. Since the non-causal part does not need to adhere to a strict execution order, multiple time steps can be executed in parallel to make more efficient use of the hardware, resulting in benefits such as higher utilization and faster inference. In particular, in hardware such as accelerator 302 that includes multiple processing elements 314, separating out the non-causal computation allows multiple time steps to be executed in parallel across the processing elements.

[0123] It should be understood that when operations are separated as described above, causal operations may include one or more non-causal computations, for example, because it is advantageous to perform these non-causal computations together with causal operations. However, this set of non-causal operations cannot include any causal computations, because the non-causal operations will be executed in parallel in the hardware. Therefore, it should be understood that references to causal operations herein refer to a set of operations that includes all causal operations of the RNN cell, but may also include some non-causal operations of the RNN cell; and references to non-causal operations herein refer to a set of operations that includes at least some non-causal operations of the RNN cell for parallel execution, and no causal operations of the RNN cell.

[0124] Transformation unit 326 is configured to separate non-causal operations from causal operations and form a static neural network for implementation at accelerator 302. This static neural network represents the expanded RNN, but with the separated non-causal operations grouped together for parallel execution. An example of how to separate causal and non-causal operations in an RNN cell will now be described. The same approach can be used for each cell of the RNN. It should be understood that, depending on the specific operations of the RNN and the parallel processing elements available in the hardware, non-causal operations can be performed in parallel by means other than convolution.

[0125] Returning to Figure 2, it will be observed that the combination of weights 218 with the concatenated input 212 and state 210 is performed as matrix multiplication 204. Figure 2 shows a simple example of an RNN cell, but more complex RNN cells such as LSTM or GRU cells can also be represented as one or more matrix operations on a set of input and state tensors, along with various activation functions and other functions. Other types of cells can include other types of mathematical operations, which can also be divided into causal and non-causal parts according to the principles described herein. Different techniques can be used to separate these other types of operations. For example, a concatenation followed by an element-wise multiplication can be refactored as two element-wise multiplications followed by a concatenation. In some examples, an RNN that can be processed according to the principles described herein may already be defined such that causal and non-causal operations are performed separately (e.g., in different matrix multiplications), but previous methods do not accelerate the execution of the non-causal operations.
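The element-wise case can be checked numerically. The following plain-Python sketch uses hypothetical values and variable names; it shows that multiplying a concatenated tensor element-wise by a factor gives the same result as multiplying the input part and the state part by the matching slices of the factor and then concatenating the products, so that the input-only product no longer depends on the state.

```python
# Hypothetical input part, state part, and element-wise scale factors.
x_part = [1.0, 2.0]
h_part = [3.0, 4.0, 5.0]
scale = [0.5, -1.0, 2.0, 0.25, 0.0]

# Original form: concatenate, then one element-wise multiplication.
concat_then_mul = [a * s for a, s in zip(x_part + h_part, scale)]

# Refactored form: two element-wise multiplications, then concatenate.
n = len(x_part)
x_scaled = [a * s for a, s in zip(x_part, scale[:n])]  # depends on the input only
h_scaled = [a * s for a, s in zip(h_part, scale[n:])]  # depends on the state
mul_then_concat = x_scaled + h_scaled
```

Here `x_scaled` is the non-causal part and could be computed for every time step of a partition before any state is available.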

[0126] In examples where a cell comprises one or more matrix multiplications, each matrix multiplication $y = Wx$ of tensors x and W (where x and W are matrices) can be equivalently expressed as the sum of two matrix multiplications, $y = W'x' + W''x''$, where W' and W'' are subsets of the elements of W, and x' and x'' are subsets of the elements of x. Therefore, the RNN cell shown in Figure 2 can be equivalently expressed as the RNN cell 600 shown in Figure 6, in which the non-causal and causal parts of the cell computation are performed as separate matrix multiplications 612 and 614 respectively, each of which receives a corresponding subset of the weights W shown in Figure 2. Assuming both use the activation function tanh, the equivalence of RNN cells 200 and 600 can be written, using the notation of Figure 2 and Figure 6, as follows:

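[0127] $h(t) = \tanh\big(W\,[x(t);\,h(t-1)]\big)$, where $[x(t);\,h(t-1)]$ denotes the concatenation of the input and the state

[0128] $h(t) = \tanh\big(W_x\,x(t) + W_h\,h(t-1)\big)$ (2)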

[0129] The top line represents RNN cell 200 and the bottom line represents RNN cell 600. $W_x$ 610 denotes the elements of the weights W that are combined with (e.g., used to process) the input x(t), and $W_h$ 608 denotes the elements of the weights W that are combined with (e.g., used to process) the state h(t-1). The result of the non-causal computation performed at matrix multiplication 612 is combined at addition 606 with the result of the causal computation performed at matrix multiplication 614. For example, addition 606 may include an element-wise addition of the result of the non-causal computation and the result of the causal computation. The sum of the causal and non-causal computations is then operated on by activation function 206 to generate output o(t) 214 and output state h(t) 216.
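The equivalence of the two cell forms can be checked with a small numerical sketch. The sizes and weight values below are hypothetical, and the layout (W applied to the concatenation of input and state, with W_x and W_h taken as column blocks of W) is an assumption made for the illustration.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Hypothetical sizes: 2 input values and 3 state values.
x_t = [0.5, -1.0]
h_prev = [0.1, 0.2, -0.3]

# W acts on the concatenation of input and state (3 rows, 2 + 3 columns).
W = [
    [0.2, -0.1, 0.4, 0.0, 0.3],
    [0.0, 0.5, -0.2, 0.1, 0.1],
    [-0.3, 0.2, 0.0, 0.6, -0.4],
]
n_in = len(x_t)
W_x = [row[:n_in] for row in W]  # columns of W that multiply x(t)
W_h = [row[n_in:] for row in W]  # columns of W that multiply h(t-1)

# Cell 200: one matrix multiplication on the concatenated tensor.
h_cell_200 = [math.tanh(v) for v in matvec(W, x_t + h_prev)]

# Cell 600: non-causal and causal products computed separately, then added.
non_causal = matvec(W_x, x_t)
causal = matvec(W_h, h_prev)
h_cell_600 = [math.tanh(a + b) for a, b in zip(non_causal, causal)]

max_diff = max(abs(a - b) for a, b in zip(h_cell_200, h_cell_600))
```

The two results agree up to floating-point rounding, because the split form only changes the order in which the same products are summed.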

[0130] By separating the causal and non-causal parts of an RNN cell, the execution of causal and non-causal operations in a hardware implementation of an RNN comprising one or more such cells can be optimized individually. In particular, since non-causal operations do not need to wait for state values to be generated in the previous time step, it is possible to group together the non-causal operations to be performed on multiple inputs and execute these operations in parallel at multiple processing elements. This enables the utilization of the parallel processing capabilities of accelerators suitable for implementing neural networks in hardware. Specifically, accelerators for implementing neural networks typically include parallel processing elements suitable for efficiently performing convolution operations, such as the convolution engines 308 of accelerator 302. Without the parallel execution of non-causal computations, the constituent operations of the neural network cells are executed sequentially, typically making full use of only a single hardware instance (e.g., a single processing element 314), resulting in lower utilization and slower inference.

[0131] Figure 7 illustrates three of the split RNN cells of Figure 6 in an expanded RNN, in which the causal and non-causal operations at each time step are performed separately, with the non-causal operations being performed in parallel across the three time steps. The same approach can be used for each cell of the expanded RNN. In this way, some or all of the non-causal operations of the derivative neural network described above can be performed in parallel over the partition of the input sequence processed by an instance of the derivative neural network.

[0132] Transformation unit 326 is configured to separate non-causal operations from causal operations in order to derive a static neural network for implementation at accelerator 302. This is illustrated by the causal/non-causal separation branch in the flowchart of Figure 8. At step 802, according to the above principles, the transformation unit 326 expands the RNN over a predetermined number of steps. At 806, the transformation unit separates at least some non-causal operations from causal operations. The transformation unit groups (807) at least some of the non-causal operations of the expanded RNN so that these operations can be executed in parallel at the accelerator 302. The transformation unit can form one or more groups of non-causal operations of the expanded RNN; that is, the inputs to a given cell of the RNN at different time steps can be grouped together so that a set of non-causal operations is performed on these inputs in parallel.

[0133] Based on the principles described above with respect to step 803, the transformation unit forms (808) a static derivative neural network, but with the non-causal operations at one or more cells of the neural network grouped together for parallel execution. The derivative neural network is implemented (804) at accelerator 302 by control logic 324 in the manner described above. In this way, processing each partition of the input at the derivative neural network running at the accelerator includes performing the non-causal operations at one or more cells of the derivative neural network as one or more parallel operations.

[0134] Generally, the causal / non-causal separation at step 806 can be performed before or after the expansion step 802. For example, before expansion, the appropriate cells of the RNN can be replaced with decomposed cells that separate non-causal and causal operations.

[0135] One approach to parallelizing non-causal operations is to convert all matrix operations into convolutions for execution at the parallel convolution engines 308 of accelerator 302. Since the convolution engines are optimized for performing convolutions, this substantially improves the performance of running derivative neural networks in hardware, and thus improves the performance of the RNNs represented by those derivative neural networks. In data processing systems with processing elements optimized for parallel execution of computations other than convolutions, the operations of the cells of derivative neural networks can instead be converted into the computations for which those processing elements are optimized.

[0136] Figure 7 illustrates how non-causal operations can be performed in parallel at the convolution engines 308 of accelerator 302. In Figure 7, all input tensors 702 of the partition are concatenated at non-causal cell 710 to form a tensor X'. For example, consider an input tensor x(t) that includes 512 input values (e.g., audio samples from an input audio stream that represents a sequence of audio samples of speech) and a weight tensor $W_x$. Both the input tensor and the weight tensor are given additional spatial dimensions (e.g., height and width) so as to become a 4D input tensor x'(t) and a 4D weight tensor W'. In this example, the dimensions of W' represent the kernel height, kernel width, number of input channels, and number of output channels, respectively, and the dimensions of X' represent the batch size, data height, data width, and number of input channels, respectively. All available inputs of the partition can then be concatenated along the width dimension at concatenation unit 712 to obtain the tensor X', whose width is T, where T represents the total number of available time steps in the partition. In the present example, T=3, because the derivative neural network operates on partitions with three time steps (i.e., it is derived from an RNN expanded over three time steps).

[0137] The dimension for concatenation can be selected based on the specific characteristics of the hardware accelerator. For example, the convolution engines of some hardware accelerators can be configured such that concatenation along a particular dimension (e.g., "width" or "height") is advantageous in order to optimize the performance of the convolution operations performed on the concatenated result.

[0138] Adding additional spatial dimensions to the input tensor changes its shape but not the underlying values, because each new dimension has a size of 1. Changing the tensor's shape can be advantageous because convolution operations in neural networks typically expect data to be presented as 4D tensors. A simple example is a 2D tensor with dimensions (1, 3), which can be reshaped into a 4D tensor with dimensions (1, 1, 1, 3) containing the same three values.
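The reshaping can be illustrated with nested Python lists. The helper functions `shape` and `flatten` and the three values are hypothetical and serve only to show that the shape changes while the values do not.

```python
def shape(t):
    """Return the dimensions of a nested-list tensor."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def flatten(t):
    """Return the values of a nested-list tensor in order."""
    if not isinstance(t, list):
        return [t]
    return [v for item in t for v in flatten(item)]


x_2d = [[7.0, 8.0, 9.0]]  # dimensions (1, 3); hypothetical values
x_4d = [[x_2d]]           # two extra dimensions of size 1: (1, 1, 1, 3)

shape_2d = shape(x_2d)
shape_4d = shape(x_4d)
```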

[0139] The tensor X' can then be convolved with W' at convolution unit 714 to obtain the intermediate output for the non-causal computation, $Y' = W' * X'$, where $*$ denotes the convolution operation between the weights W' and the inputs X' of the partition, with a stride of 1 in both the height and width dimensions. The output Y' of the convolution contains one result per time step along the width dimension. The convolution is mathematically equivalent to performing the matrix multiplication of the input and the weights separately for each time step. Performing these computations as a convolution instead of as separate matrix multiplications allows multiple convolution engines 308 at accelerator 302 to be utilized in parallel. This reduces memory bandwidth, because the weights can be copied into the coefficient buffer at the start of the convolution rather than before the computation at each individual time step, and it significantly reduces latency due to the improved performance of the derivative neural network at the accelerator. For a more typical RNN configured to perform speech recognition on a time series of audio samples, where each partition comprises 16 time steps and the RNN consists of a stack of five RNN cells plus two preprocessing convolutional layers and a fully connected layer, the method reduces latency by a quarter.
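The equivalence between one convolution over the concatenated inputs and a separate product per time step can be sketched with a small hand-written convolution. Everything here is an assumption made for illustration: the function `conv2d_nhwc`, the tiny sizes (T = 3, two input channels, two output channels), and the weight layout in which `W_x` is stored as [input channel][output channel] to match the kernel dimension order described above.

```python
def conv2d_nhwc(x, w):
    """Stride-1, unpadded 2D convolution (no kernel flip, as in NN frameworks).

    x: [batch][height][width][in_ch]; w: [k_h][k_w][in_ch][out_ch].
    """
    kh, kw = len(w), len(w[0])
    c_in, c_out = len(w[0][0]), len(w[0][0][0])
    out = []
    for image in x:
        rows = []
        for i in range(len(image) - kh + 1):
            cols = []
            for j in range(len(image[0]) - kw + 1):
                cols.append([
                    sum(image[i + di][j + dj][c] * w[di][dj][c][o]
                        for di in range(kh) for dj in range(kw)
                        for c in range(c_in))
                    for o in range(c_out)
                ])
            rows.append(cols)
        out.append(rows)
    return out


# Hypothetical partition of T = 3 inputs with 2 channels each.
inputs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # x(t), x(t+1), x(t+2)
W_x = [[0.1, 0.2], [0.3, -0.4]]                 # [in_ch][out_ch]

X4 = [[inputs]]  # (1, 1, T, in_ch): time steps concatenated along width
W4 = [[W_x]]     # (1, 1, in_ch, out_ch): a 1x1 kernel

Y4 = conv2d_nhwc(X4, W4)  # one convolution for the whole partition

# Reference: one vector-matrix product per time step.
per_step = [[sum(x[c] * W_x[c][o] for c in range(2)) for o in range(2)]
            for x in inputs]
```

`Y4[0][0]` holds one output vector per time step along the width dimension and matches `per_step`; on real hardware the benefit comes from the convolution engines processing those width positions in parallel, which this sequential sketch does not model.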

[0140] Hardware accelerators can typically use a common set of weights (filters) in convolutional operations to process parallel input data streams. This is particularly useful when processing convolutional layers, such as those processing images, where the same filters are applied as sliding windows across the entire image. By spatially grouping the input data, it can be processed in a manner similar to the feature maps input to the convolutional operation, thus enabling parallel processing of the input data at the hardware accelerator. In other examples, non-causal operations can be performed in parallel as operations other than convolution.

[0141] Performing non-causal computations in parallel across multiple processing elements improves performance in three ways. First, it improves hardware utilization, because the computations can run on as many parallel streams as there are processing elements. Second, it reduces memory bandwidth consumption, because the processing elements performing the parallel computations can share the same weight coefficients (e.g., at coefficient buffer 330) instead of the same weight coefficients being read from memory for each input of a partition in order to perform the non-causal computation on that input. Minimizing bandwidth also reduces the number of cycles spent reading from and writing to memory, which improves the overall latency of the model. Third, it reduces the processing required in the sequence of causal computations, because the non-causal computations have been separated out and are not performed alongside the causal computations.

[0142] At the separation unit 716, the intermediate output... is separated into the outputs... for each of the three time steps, where... . Each is provided as an input 704 to the corresponding causal cell 604. The causal cells operate on two-dimensional tensor components rather than on the 4D tensor provided for the convolution operation.

[0143] Because the causal computation performed at each time step requires the state generated at the previous time step, the causal computations cannot be performed in parallel. A causal cell 604 is provided for each time step of the partition, so there are three causal cells in Figure 7. Each causal cell receives as input the corresponding tensor output from the non-causal cell 710 and the state 706 generated by the causal cell for the previous time step. Each causal cell 604 can have the same functional structure as the causal cell 604 shown in Figure 6, with each causal cell operating the same set of weights 608 on the received state 706, for example by matrix multiplication 614. The result of operating the weights 608 on the received state 706 is combined with the corresponding output from the non-causal cell, for example by addition 606. The combination of the causal and non-causal computations is then processed by activation function 206 to provide output 708, which in this example is also the state for the next causal cell. As described above, in other embodiments one or more state values (e.g., tensors or single values) can be generated for a time step, and these may or may not include the output for that time step.
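The split into non-causal and causal computations can be sketched for a simple RNN cell. This is an illustrative sketch only: the cell form h(t) = tanh(x(t)·W_x + h(t-1)·W_h), the choice of tanh as the activation, the zero initial state, and all names and sizes are assumptions rather than details confirmed by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_h = 3, 4, 5                        # three time steps, as in Figure 7
xs = rng.standard_normal((T, n_in))           # inputs of one partition
W_x = rng.standard_normal((n_in, n_h))        # input weights (non-causal part)
W_h = rng.standard_normal((n_h, n_h))         # state weights (causal part)
h0 = np.zeros(n_h)                            # assumed initial state

def rnn_reference(xs, h):
    """Conventional step-by-step evaluation of the cell."""
    outputs = []
    for x in xs:
        h = np.tanh(x @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs)

def rnn_split(xs, h):
    """Non-causal work done for all steps at once, causal work done in sequence."""
    non_causal = xs @ W_x                     # independent of the state
    outputs = []
    for nc in non_causal:                     # one causal cell per time step
        h = np.tanh(nc + h @ W_h)             # state multiply, add, activation
        outputs.append(h)
    return np.stack(outputs)

print(np.allclose(rnn_reference(xs, h0), rnn_split(xs, h0)))
```

Only the loop in `rnn_split` has to run sequentially; the single product `xs @ W_x` stands in for the work that could be spread across parallel processing elements.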

[0144] Returning to the data processing system shown in Figure 3, in order to separate the non-causal computations of the derivative neural network so that they can be executed in parallel, the transformation unit 326 can process each cell of the derivative neural network so as to separate those computations that do not depend on the state from a previous cell computation, enabling the non-causal computations to be executed in parallel at the processing elements of the accelerator. For example, in Figure 3, the transformation unit 326 can be configured to form a derivative neural network from the RNN representation 338 and then to further process the derivative neural network to separate the causal and non-causal computations in the manner described herein, with the non-causal computations being performed in parallel at the processing elements 314 of the accelerator 302.

[0145] The control logic 324 and/or the non-causal cell 710 and/or the causal cells 604 can be configured to convert inputs and weights into, and out of, a form suitable for parallel processing. For example, with respect to Figure 7, the non-causal cell 710 (e.g., its convolution unit 714) can add additional spatial dimensions to the inputs and weights in order to put these tensors into a form suitable for convolution. In some examples, the additional spatial dimensions need not be explicitly added to the inputs and weights and can instead be inferred during computation.

[0146] When deriving a neural network from an RNN representation by unrolling the RNN over a predetermined number of time steps and separating the causal and non-causal computations as described herein, it is advantageous to choose a predetermined number of steps that is an integer multiple of the number of processing elements at the accelerator. This helps to maximize the use of the processing elements during execution of the derivative neural network, because the parallel non-causal computations can be evenly distributed across the processing elements of the system, thereby maximizing performance.
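The benefit of choosing a step count that is an integer multiple of the number of processing elements can be seen with a simplified utilization model. The model below is an assumption made for illustration (one non-causal computation per time step, each processing element handling one computation per pass); it is not a description of the accelerator's actual scheduling.

```python
import math

def utilization(num_steps, num_elements):
    """Fraction of processing-element slots doing useful work (simplified model)."""
    passes = math.ceil(num_steps / num_elements)   # passes needed to cover all steps
    return num_steps / (passes * num_elements)

# Hypothetical accelerator with 8 processing elements.
for steps in (12, 16, 20):
    print(steps, utilization(steps, 8))
```

Under this model, only step counts that are multiples of the number of processing elements keep every element busy on every pass.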

[0147] Figure 9 shows a computer system in which the data processing system described herein can be implemented. The computer system includes a CPU 902, an accelerator 302 (labeled in the figures as a neural network accelerator, NNA), a system memory 304, and other devices 914, such as a display 916, a speaker 918, and a camera 922. Components of the computer system can communicate with each other via a data bus 920. At least some of the control logic 324 and/or iterative logic 342 and/or transformation unit 326 can be supported at the CPU 902.

[0148] Number format selection

[0149] The example RNN shown in Figure 1 defines the operations performed at each time step t on the element x(t) of the input sequence and the state variable h(t-1) in order to generate the state variable h(t) and the output o(t). The function defined by these operations is fixed over time: for the same values of the input and state variables, the output will be the same regardless of the time index. This desirable property is known as time invariance. For efficiency reasons, the network can be configured with number formats defined for blocks of values, as described below. These number formats should be the same over time in order to maintain time invariance, and this needs to be taken into account when choosing number formats so that the chosen formats are suitable for all time steps.

[0150] An RNN differs from a feedforward (static) neural network in that the same pattern is repeated over the input sequence (e.g., a time series). Furthermore, an RNN cell receives a state tensor generated at previous steps of the RNN, which is unknown at design time. To ensure consistent network behavior over time, each step of the unrolled RNN in a derivative neural network should operate in the same way given the same input, regardless of the length of the unrolled RNN (i.e., the number of steps in the unrolled RNN) or the position of the step in the sequence of steps of the unrolled RNN. Whether the behavior of the network is time-invariant is determined in part by the number formats of the data values involved in the operations performed by the RNN.

[0151] The values of an RNN can include elements from any tensor of the network, such as input values (e.g., elements of an input tensor representing a time series, or the output of a lower cell in a cell stack of the RNN); weight values (e.g., elements of a weight tensor representing network parameters); state values (e.g., elements of a state tensor generated at the previous time step of the RNN); and elements of intermediate tensors representing values between network operations. The values of an RNN may be referred to herein as network values. In a hardware implementation of an RNN, a suitable number format needs to be selected for all the values of the network. The number format of some values, or at least some parameters of the number format, can be predefined. The number format of some or all of the values of the network can be determined according to the number format selection method described herein.

[0152] Each iteration of an RNN involves an instance of every value of the network (e.g., every element of the RNN's tensors). Iterating the RNN N times therefore generates N instances of each of its network values. To ensure time invariance, all instances of a value in the network should have the same number format. An approach to choosing number formats will now be described for use when implementing an RNN in hardware, in particular in hardware according to the principles described above (e.g., in the data processing system shown in Figure 3).

[0153] As those skilled in the art know, for hardware to process a set of values, the values must be represented in a number format. Two types of number format are fixed-point formats and floating-point formats. A fixed-point format has a fixed number of digits after the radix point (e.g., a decimal point or binary point). In contrast, a floating-point format does not have a fixed radix point (i.e., the radix point can "float"); in other words, the radix point can be placed anywhere in the representation. While representing input data values and weights in a floating-point format can allow more accurate or precise output data to be produced, processing numbers in a floating-point format in hardware is complex and tends to increase the silicon area and complexity of the hardware compared to hardware that processes values in a fixed-point format. Hardware implementations can therefore be configured to process input data values and weights in a fixed-point format in order to reduce the number of bits required to represent the values of the network, and hence the silicon area, power consumption, and memory bandwidth of the hardware implementation.

[0154] A number format type defines the parameters that form a number format of that type and how those parameters are interpreted. For example, an exemplary number format type could specify that a number or value is represented by a mantissa of a given bit width and an exponent, and that the number is equal to the mantissa multiplied by 2 raised to the power of the exponent. As described in more detail below, some number format types can have configurable parameters, also known as quantization parameters, that can vary between number formats of that type. For example, in the exemplary number format type described above, the bit width of the mantissa and the exponent can be configurable. A first number format of this type could therefore use a bit width of 4 and an exponent of 6, and a second, different number format of this type could use a bit width of 8 and an exponent of -3.

[0155] The accuracy of a quantized RNN (i.e., a form of the RNN in which at least a portion of the network values are represented in a non-floating-point format) can be determined by comparing the output of that RNN in response to input data with a baseline or target output. The baseline or target output can be the output of an unquantized form of the RNN (i.e., a form of the RNN in which all network values are represented in a floating-point format, which may be referred to herein as a floating-point form of the RNN or a floating-point RNN) in response to the same input data, or the ground truth output for the input data. The further the output of the quantized RNN is from the baseline or target output, the less accurate the quantized RNN. The size of a quantized RNN can be determined by the number of bits used to represent its network values. Therefore, the lower the bit depth of the number formats used to represent the network values, the smaller the RNN.

[0156] While a single number format can be used to represent all the network values of an RNN (e.g., input data values, weights, biases, and output data values), this typically does not produce an RNN that is both small and accurate. This is because different operations in an RNN tend to operate on, and produce, values with different ranges. For example, one operation might have input data values between 0 and 6, while another operation might have input data values between 0 and 500. Using a single number format may therefore not allow either set of input data values to be represented efficiently or accurately. Accordingly, the network values of an RNN can be divided into two or more sets of network values, and a number format can be chosen for each set. Preferably, each set of network values comprises related or similar network values.
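The effect of the two ranges mentioned above (0 to 6 and 0 to 500) can be illustrated numerically. The sketch below is an illustration under assumptions: it uses a mantissa-and-exponent fixed-point format with an 8-bit signed mantissa, and a simple maximum-range rule for picking the exponent. The helper names `exponent_for` and `quantize` are hypothetical and are not taken from the patent.

```python
import math
import numpy as np

BIT_WIDTH = 8   # assumed signed mantissa width

def exponent_for(values):
    """Smallest exponent whose range covers the largest magnitude (assumed rule)."""
    largest_mantissa = 2 ** (BIT_WIDTH - 1) - 1
    return math.ceil(math.log2(float(np.abs(values).max()) / largest_mantissa))

def quantize(values, exponent):
    """Round to the nearest representable value mantissa * 2**exponent, saturating."""
    lo, hi = -(2 ** (BIT_WIDTH - 1)), 2 ** (BIT_WIDTH - 1) - 1
    return np.clip(np.round(values / 2.0 ** exponent), lo, hi) * 2.0 ** exponent

small = np.linspace(0.0, 6.0, 25)      # values of one operation
large = np.linspace(0.0, 500.0, 25)    # values of another operation

shared_exp = exponent_for(np.concatenate([small, large]))  # one format for both
own_exp = exponent_for(small)                              # format for `small` only

err_shared = float(np.abs(small - quantize(small, shared_exp)).max())
err_own = float(np.abs(small - quantize(small, own_exp)).max())
print(shared_exp, own_exp, err_shared, err_own)
```

With a shared format, the step size is dictated by the larger range, so the values of the small-range set are represented much more coarsely than with a format chosen for that set alone.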

[0157] Each set of network values can be all or a portion of a specific type of network value used in an operation. For example, each set of network values can be all or a portion of the input data values of the operation; all or a portion of the weights of the operation; all or a portion of the biases of the operation; or all or a portion of the output data values of the operation. Whether a set of network values for a cell comprises all or only a portion of a specific type of network value can depend on, for example, the hardware implementing the RNN and the application of the RNN. For instance, selecting number formats on a per-filter basis for a convolution weight tensor can improve output accuracy in some cases. As another example, some hardware available for implementing RNNs may support only a single number format for each type of network value per operation, while other hardware may support multiple number formats for each type of network value per operation.

[0158] Hardware used to implement an RNN, such as accelerator 302, can support one type of number format for network values. For example, the hardware can support number formats in which numbers are represented by a mantissa and an exponent. To allow different sets of network values to be represented using different number formats, the hardware implementing an RNN can use a type of number format that has one or more configurable parameters, where the parameters are shared among all the values in a set of two or more values. These types of number format may be referred to herein as block-configurable types of number format or set-configurable types of number format. Non-configurable formats, such as INT32 and floating-point formats, are therefore not block-configurable types of number format. Exemplary block-configurable types of number format are described below. The methods described herein can be performed to identify an appropriate block-configurable type of number format for two or more values of an RNN.

[0159] An exemplary block-configurable type of number format that can be used to represent the network values of an RNN is the Q-type format, which specifies a predetermined number of integer bits and fractional bits. A number can therefore be represented in a format written as Q, followed by the number of integer bits, a point, and the number of fractional bits (e.g., Q0.5), which requires a total number of bits equal to the sum of the integer and fractional bits plus one (the sign bit). Exemplary Q formats are shown in Table 1 below. The quantization parameters for the Q-type format are the number of integer bits and the number of fractional bits.

[0160] Table 1

[0161]

[0162]

[0163] However, a drawback of the Q format is that some of the bits used to represent a number can be considered redundant. In one example, the number range [-0.125, 0.125) is to be represented with 3 bits of precision. The Q format required for this exemplary range and precision is Q0.5. However, if the range of values is assumed to be known in advance, the first two bits of the number will never be used in determining the value represented in Q format. The first two bits of the representation do not contribute to the final number because they represent 0.5 and 0.25 respectively, and therefore fall outside the required range. They are, however, used to indicate the position of the third bit (i.e., 0.125 and beyond, owing to the relative bit positions). The Q format described above is therefore an inefficient fixed-point number format for hardware implementations of neural networks, because some bits may not convey useful information.
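The redundancy in the Q0.5 example can be checked by enumerating the codes for every value in the range. This sketch assumes a two's complement encoding with a leading sign bit; the helper `q_code` is hypothetical and introduced only for illustration.

```python
def q_code(value, int_bits, frac_bits):
    """Two's complement bit string of `value` in a Q format with a sign bit."""
    total_bits = 1 + int_bits + frac_bits
    code = round(value * 2 ** frac_bits)
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    code = max(lo, min(hi, code))                 # saturate to the format's range
    return format(code & (2 ** total_bits - 1), f"0{total_bits}b")

# Every value in [-0.125, 0.125) at a resolution of 2**-5 (3 bits of precision).
values = [k / 32 for k in range(-4, 4)]
codes = [q_code(v, int_bits=0, frac_bits=5) for v in values]

for v, c in zip(values, codes):
    print(f"{v:+.5f} -> {c}")

# The two bits after the sign bit (weights 0.5 and 0.25) always copy the sign bit.
redundant = all(c[1] == c[0] and c[2] == c[0] for c in codes)
print(redundant)
```

Under this encoding, two of the six stored bits carry no information for any value in the range, which is the inefficiency described above.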

[0164] Another exemplary block-configurable type of number format that can be used to represent the network parameters of an RNN is one in which a number format is defined by a fixed integer exponent and a mantissa of a given bit length, such that a value is equal to the mantissa multiplied by 2 raised to the power of the exponent. In some cases, the mantissa can be represented in two's complement format. In other cases, however, other signed or unsigned integer formats can be used. In these cases, the exponent and the number of mantissa bits need only be stored once for a set of two or more values represented in that number format. Different number formats of this type can have different mantissa bit lengths and/or different exponents. The quantization parameters for this type of number format therefore comprise the mantissa bit length (also referred to herein as bit width, bit depth, or bit length) and the exponent.
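A minimal sketch of this mantissa-and-exponent format follows. It assumes a two's complement mantissa, round-to-nearest conversion, and saturation at the ends of the range; the function names `quantize` and `dequantize` and the sample values are hypothetical. The chosen values lie in the range [-0.125, 0.125) discussed for the Q format, and are represented here with a 3-bit mantissa and a single shared exponent of -5.

```python
import numpy as np

def quantize(values, exponent, bit_width):
    """Return integer mantissas m such that m * 2**exponent approximates `values`."""
    lo, hi = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    mantissas = np.round(np.asarray(values) / 2.0 ** exponent)  # ties round to even
    return np.clip(mantissas, lo, hi).astype(np.int64)          # saturate

def dequantize(mantissas, exponent):
    """Recover the represented values from the mantissas and the shared exponent."""
    return mantissas * 2.0 ** exponent

values = np.array([-0.125, -0.03125, 0.0, 0.0625, 0.09375])
mantissas = quantize(values, exponent=-5, bit_width=3)
restored = dequantize(mantissas, exponent=-5)
print(mantissas, restored)
```

The exponent and bit width are stored once for the whole set, so each value costs only its 3 mantissa bits.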

[0165] Another exemplary block-configurable type of number format that can be used to represent the network parameters of an RNN is the 8-bit asymmetric fixed-point (Q8A) type format. In one example, this type of number format comprises a minimum representable number, a maximum representable number, a zero point, and an 8-bit number for each value in the set, which represents a linear interpolation factor between the minimum and maximum representable numbers. In other cases, a variant of this format can be used in which the number of bits used to store the interpolation factor is variable (e.g., the number of bits used to store the interpolation factor can be one of several possible integers). In this example, the Q8A type format, or a variant of the Q8A type format, can approximate a floating-point value as shown in equation (1), in which the number of bits used by the quantized representation is 8 for the Q8A format and the quantized zero point always maps exactly back to 0. The quantization parameters for this exemplary type of number format comprise the maximum representable number or value, the minimum representable number or value, the quantized zero point, and, optionally, the mantissa bit length (i.e., when the bit length is not fixed at 8).

[0166] (3)

[0167] In another example, the Q8A type format comprises a zero point that always maps exactly to 0, a scale factor, and an 8-bit number for each value in the set. In this example, a number format of this type approximates a floating-point value as shown in equation (2). As with the first exemplary Q8A type format, in other cases the number of bits for the integer or mantissa component can be variable. The quantization parameters for this exemplary type of number format comprise the zero point, the scale, and, optionally, the mantissa bit length.

[0168] (4)
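The two Q8A variants described above can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the conventional forms of asymmetric fixed-point quantization: linear interpolation between the minimum and maximum representable numbers for the first variant, and scale times (stored number minus zero point) for the second. All function names, parameter names, and parameter values are hypothetical.

```python
import numpy as np

def q8a_minmax_decode(d, r_min, r_max, bits=8):
    """Variant 1 (assumed form): interpolate between r_min and r_max."""
    return r_min + d * (r_max - r_min) / (2 ** bits - 1)

def q8a_minmax_encode(a, r_min, r_max, bits=8):
    levels = 2 ** bits - 1
    d = np.round((np.asarray(a) - r_min) / (r_max - r_min) * levels)
    return np.clip(d, 0, levels).astype(np.int64)

def q8a_scale_decode(d, zero_point, scale):
    """Variant 2 (assumed form): scale * (stored number - zero point)."""
    return scale * (np.asarray(d) - zero_point)

def q8a_scale_encode(a, zero_point, scale, bits=8):
    d = np.round(np.asarray(a) / scale) + zero_point
    return np.clip(d, 0, 2 ** bits - 1).astype(np.int64)

# Illustrative parameters describing the same grid in both variants.
scale, zero_point = 1 / 64, 64
r_min, r_max = -1.0, 191 / 64

print(q8a_scale_encode(0.5, zero_point, scale),
      q8a_minmax_encode(0.5, r_min, r_max),
      q8a_scale_decode(zero_point, zero_point, scale),
      q8a_minmax_decode(zero_point, r_min, r_max))
```

In both variants the stored zero point decodes exactly to 0, which is the property the text requires of the quantized zero point.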

[0169] Determining a number format of a specific block-configurable type can be described as identifying the one or more quantization parameters of a number format of that type. For example, determining a number format of the type defined by a mantissa and an exponent can comprise identifying the bit width of the mantissa and/or the exponent. The specific block-configurable type of number format can be predefined for a given network value.

[0170] To reduce the size and improve the efficiency of a hardware implementation of an RNN, the hardware implementation can be configured to process data values in a block-configurable number format. Generally, the fewer bits used to represent the network values of an RNN (e.g., its input data values, weights, and output data values), the more efficiently the RNN can be implemented in hardware. Typically, however, the fewer bits used to represent the network values of an RNN, the less accurate the RNN becomes. It is therefore desirable to identify number formats that balance the number of bits used to represent the network values against the accuracy of the RNN. Furthermore, since the ranges of the input, weight, and state data values can vary, a hardware implementation can process an RNN more efficiently when the block-configurable number format used to represent data values can vary for each set of values (e.g., each tensor of the network). For example, by using a block-configurable number format defined by an exponent of 2 and a mantissa length of 6 to represent one set of values in the network, and a block-configurable number format defined by an exponent of 4 and a mantissa length of 4 to represent another set of values in the network, the hardware implementation may be able to implement the RNN more efficiently and/or more accurately.

[0171] Methods for determining a number format of a block-configurable type for a set of two or more values of an RNN will now be described. The set of two or more values of an RNN can comprise a portion or all of one or more tensors. For example, the methods presented herein can be used to determine number formats for some or all of the values of a tensor, where different number formats are identified for different sets of two or more values (e.g., different tensors or portions of tensors). Different number format selection algorithms can be used to identify the number formats of different sets of two or more values.

[0172] The methods described herein can be used with any suitable number format selection algorithm, including, for example: backpropagation format selection, greedy line search and end-to-end format selection, orthogonal search format selection, maximum range (or "MinMax") format selection, outlier rejection format selection, error-based heuristic format selection (e.g., based on the sum of squared errors with or without outlier weighting), weighted outlier format selection, or gradient-weighted format selection algorithms. In particular, the methods described herein can be used with the specific format selection algorithms disclosed in UK patent applications with publication numbers 2568083, 2568084, 2568081 or UK patent application number 2009432.2, the full text of each of which is incorporated herein by reference.

[0173] To select number formats for the network values of an RNN, the RNN is executed on sample input data in order to provide statistics to a number format selection algorithm for each instance of the two or more values. Such statistics can be the network values themselves, the mean and/or variance of the network values, the minimum and/or maximum network values, histograms summarizing the network values, gradients computed with respect to the network output or an error metric based on the network output, and any other data required by the format selection algorithm that is used or generated by the neural network or by logic monitoring the neural network (e.g., format selection unit 344). In some examples, the RNN is executed using a floating-point format for the network values. For example, the RNN can be executed in software using floating-point formats for the input data, weights, states, and output data values of the network. 32-bit or 64-bit floating-point formats perform well because the number format should generally be as close to lossless as possible in order to obtain the best possible results, but a block-configurable number format with a large range and/or a large number of bits could also be used.
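Collecting such statistics from a floating-point execution can be sketched as follows. This is an illustrative sketch only: the simple tanh cell, the zero initial state, the random sample inputs, the tensor names, and the `record` helper are all assumptions, and a real format selection algorithm may need different statistics.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_h, steps = 4, 5, 4                    # illustrative sizes
W_x = rng.standard_normal((n_in, n_h)).astype(np.float32)
W_h = rng.standard_normal((n_h, n_h)).astype(np.float32)
sample_inputs = rng.standard_normal((steps, n_in)).astype(np.float32)

stats = {}

def record(name, tensor):
    """Accumulate running statistics over every instance of a named tensor."""
    s = stats.setdefault(
        name, {"min": np.inf, "max": -np.inf, "sum": 0.0, "count": 0})
    s["min"] = min(s["min"], float(tensor.min()))
    s["max"] = max(s["max"], float(tensor.max()))
    s["sum"] += float(tensor.sum())
    s["count"] += tensor.size

h = np.zeros(n_h, dtype=np.float32)           # assumed initial state
for x in sample_inputs:                       # 32-bit floating-point execution
    record("input", x)
    record("state_in", h)
    pre_activation = x @ W_x + h @ W_h
    record("pre_activation", pre_activation)
    h = np.tanh(pre_activation)
    record("state_out", h)

for name, s in stats.items():
    print(name, s["min"], s["max"], s["sum"] / s["count"])
```

Each entry pools all time-step instances of a tensor, so the statistics can later drive the choice of one format per tensor rather than one per time step.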

[0174] The RNN can be executed in any suitable manner in order to perform number format selection. For example, the RNN can be executed in software (e.g., using a deep learning framework such as TensorFlow, where the software supports the execution of dynamic graphs, or as a static graph representing a single time step that is run in sequence for each time step, with number formats for the network values being selected based on statistics collected at each run) or in hardware (e.g., at an accelerator such as accelerator 302).

[0175] In some examples, the RNN can be unrolled in the manner described above with respect to Figures 4 and 5 in order to form a test neural network, which is used to select appropriate number formats for its variables and parameters. When the RNN is unrolled, the same tensor appears as an instance of that tensor at each time step. To achieve time invariance and to make a derivative neural network based on the unrolled RNN equivalent to the original RNN, all instances of the same two or more values need to have the same format across the unrolled RNN. For example, in the unrolled graph shown in Figure 4, where block-configurable number formats are defined per tensor, all input tensors x(t) of the first RNN cell 102 have the same number format, and all state tensors h1(t) have the same number format. Different state tensors (e.g., h1 and h2) can have different number formats, and the inputs of different RNN cells (e.g., RNN cells 102 and 103) can have different number formats.

[0176] A method for performing number format selection for two or more values of an RNN is shown in Figure 11. The two or more values can comprise some or all of the elements of one or more tensors of the RNN. The method can be performed on receiving an RNN 1101 for implementation in hardware, for example at the accelerator 302 of the data processing system in Figure 3. The method can be performed in a design phase 1108, before the RNN is implemented in hardware according to the principles described herein or otherwise. The format selection identified in design phase 1108 of Figure 11 can be performed under the control of the format selection unit 344 shown in Figure 3. In some examples, the format selection unit 344 and the transformation unit 326 can be one and the same unit.

[0177] In a first step 1102, the RNN is implemented in hardware or software as a test neural network in order to enable statistics to be collected for the number format selection algorithm. The RNN can be implemented as a test neural network in any suitable manner. The RNN is operated on sample input data over multiple time steps in order to capture the statistics required by the number format selection algorithm. Good performance is typically achieved with only a small number of time steps. For example, for some applications, performing four time steps has been found to provide good number format selection. The RNN can be operated in any functionally correct manner that outputs the data required by the format selection method.

[0178] In some examples, the RNN can be implemented in software as a test neural network, for example in software comprising the format selection unit 344 running at a CPU (e.g., the CPU 902 of the computer system shown in Figure 9). For example, the network can be run in TensorFlow or PyTorch and can output the maximum absolute value of each set of two or more values for use by a MinMax format selection algorithm. The number format selection in design phase 1108 need not be performed at the same computing system at which the RNN will ultimately be implemented in hardware. In some examples, the RNN is implemented in hardware as a test neural network in order to select appropriate number formats, for example at the accelerator 302 in the data processing system 300. The hardware (and its associated logic, such as control logic 324) should be able to execute the network with sufficiently high precision (e.g., in 32-bit floating point) to avoid significant quantization errors, and should provide appropriate statistics. In some examples, the RNN can be implemented at the hardware accelerator 302 in order to select number formats for the network values according to the principles described herein. The RNN can be unrolled over a number of test steps so as to derive the test neural network in the manner described above with respect to step 803 of Figure 8. In some examples, this can be performed at transformation unit 326. The test neural network represents all or part of a fully unrolled RNN that is mathematically equivalent to the received representation of the RNN. The state input of the first time step of such a test neural network can be provided as a state input of the test neural network itself, and the state output from the last time step of the test neural network can be provided as a state output of the test neural network itself. This enables the test neural network to be iterated in the manner shown in Figure 5, with the state output from one instance of the test neural network being provided as the state input of the next instance. However, if the test neural network spans a sufficient number of time steps to identify number formats according to the selection algorithm used, iterating the test neural network is unnecessary, and acceptable number formats can be identified from the application of a single instance of the test neural network. In examples where the RNN is implemented as a derivative neural network according to the principles described herein, the number of steps of the test neural network may or may not be equal to the predetermined number of steps over which the RNN is unrolled to derive the derivative neural network. It is advantageous if the number of test steps over which the RNN is unrolled to form the test neural network is at least the number of time steps over which the RNN is to be operated in order to perform number format selection. This avoids the need to iterate the test neural network in the manner shown in Figure 5, and acceptable number formats can generally be identified from the application of a single instance of the test neural network.
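How a maximum absolute value gathered over all time-step instances might be turned into one common format can be sketched as follows. This is one plausible form of a maximum-range rule for a mantissa-and-exponent format, given as an assumption; it is not necessarily the MinMax algorithm of the referenced applications. The function name `minmax_exponent`, the 8-bit mantissa, and the sample state values are hypothetical.

```python
import math
import numpy as np

def minmax_exponent(max_abs, bit_width):
    """Smallest exponent e with max_abs <= (2**(bit_width - 1) - 1) * 2**e."""
    largest_mantissa = 2 ** (bit_width - 1) - 1
    if max_abs == 0.0:
        return 0                                  # arbitrary choice for an all-zero set
    exponent = math.ceil(math.log2(max_abs / largest_mantissa))
    # Guard against floating-point rounding in log2, in both directions.
    while largest_mantissa * 2.0 ** exponent < max_abs:
        exponent += 1
    while largest_mantissa * 2.0 ** (exponent - 1) >= max_abs:
        exponent -= 1
    return exponent

# Instances of the same state tensor at four test time steps (made-up values).
state_instances = [
    np.array([0.0, 0.0, 0.0]),                    # initial state
    np.array([0.5, -1.25, 0.75]),
    np.array([2.0, -6.0, 1.5]),
    np.array([3.0, -4.5, 0.25]),
]

# One statistic pooled over every instance, then one format shared by all of them.
max_abs = max(float(np.abs(t).max()) for t in state_instances)
common_exponent = minmax_exponent(max_abs, bit_width=8)
print(max_abs, common_exponent)
```

Pooling the statistic before choosing the exponent is what gives every time step the same format, which the text requires for time invariance.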

[0179] At 1103, the input state tensor of the implemented test neural network is initialized. This is necessary because there is no previous instance of the test neural network from which the first instance of the test neural network can receive the state tensor. The initial state tensor of the neural network is generally different from the typical state tensor of subsequent time steps. Since the first time step of the test neural network is an exception, it is generally not possible to choose a suitable number format based solely on the first time step. The initial state tensor is preferably the same initial state tensor used when implementing the RNN in hardware as, for example, a derivative neural network as described above. However, it is equally important that the number format works for both the first and subsequent time steps. Therefore, it is advantageous to perform number format selection on multiple test time steps, including the first time step. The initialization of the state tensor introduces transient effects in the first few time steps before the network enters its steady-state behavior. Initialization step 1103 is typically performed together with implementation step 1102 as part of the implementation of the test neural network.

[0180] To perform number format selection, the test neural network implementing the RNN is executed on suitable sample input data in order to capture appropriate statistics for the number format selection algorithm. The RNN is executed for a predetermined number of one or more time steps in order to generate, at each time step, the statistics required by the number format selection algorithm. Suitable sample input data can comprise exemplary data selected to represent a typical or expected range of inputs to the RNN to be implemented in hardware. In some examples, the sample input data can be input data from the actual source to which the RNN will be applied, such as an audio signal on which speech recognition is to be performed. Capturing statistics from neural networks is well known in the art, and it will be understood that the specific nature of the statistics will depend on the nature of the neural network, its application, and the requirements of the number format selection algorithm being used. Statistics (e.g., data values, maximum/minimum values, histogram data) generated at the RNN and/or at logic associated with the RNN (e.g., at format selection unit 344) can be captured in any suitable manner. For example, where the RNN is implemented in software running at the CPU 902 of Figure 9, the statistics can be stored at memory 304 for simultaneous or subsequent processing by the format selection unit 344 (which may also run at the CPU). In some examples, at least some of the statistics comprise intermediate data values generated at the RNN (e.g., between stacked RNN cells and/or between operations of an RNN cell).

[0181] At step 1105, a number format selection algorithm is applied to the statistics collected from the operation of the RNN. The number format selection algorithm can run concurrently with the RNN and / or can be performed subsequently on the captured statistics. The format selection of design phase 1108 can be performed at format selection unit 344. The number format selection algorithm can be any algorithm for identifying block-configurable number formats for sets of two or more network values. The specific choice of algorithm is typically determined by one or more of the following: the application to which the RNN is applied; the nature of the tensor to which the two or more values it operates on belong; and the time and / or amount of computational resources required to run the algorithm (more complex algorithms may give better results but may take many times longer to run).

[0182] In examples of this invention, the number format is selected from block-configurable types of number format, and the number of bits of the exponent can be fixed (e.g., 6 bits, signed). The exponent length therefore need not be stored with each data value and can instead be defined for groups of data values: for example, for each tensor of the RNN, for sets of two or more elements of each tensor, for each type of tensor (e.g., different exponent lengths for inputs and / or weights and / or outputs), for groups of tensors, or for all tensors of the RNN. The amount of data required to store the exponent and the mantissa length (e.g., the number of bits required to store the number format) can be fixed and negligible compared to the number of bits required to store the actual mantissas of the network values. The mantissa length is therefore the primary determinant of the number of bits required to represent the network values in a number format.
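
To illustrate why the per-group format overhead is negligible, the following sketch counts the bits needed for one tensor whose values share a single block-configurable format. The 6-bit exponent follows the example in the text; the 4-bit field for the mantissa length and the function name are hypothetical.

```python
def tensor_storage_bits(num_values, mantissa_bits,
                        exponent_bits=6, mantissa_len_field_bits=4):
    """Return (total_bits, format_overhead_bits) for a tensor sharing one number format."""
    format_overhead = exponent_bits + mantissa_len_field_bits  # stored once per tensor
    mantissa_storage = num_values * mantissa_bits              # stored once per value
    return mantissa_storage + format_overhead, format_overhead

total, overhead = tensor_storage_bits(num_values=1024, mantissa_bits=8)
overhead_fraction = overhead / total  # share of storage spent on the format itself
```

For a 1024-element tensor with 8-bit mantissas, the format parameters account for roughly a tenth of a percent of the storage, so the mantissa length dominates.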

[0183] A number format selection algorithm can determine the mantissa length (e.g., in bits) of a block-configurable type of number format. For example, where each block-configurable number format used by an RNN to represent data values comprises an exponent and a mantissa bit length, the mantissa bit length of the block-configurable number format used by the cell attributed the lowest portion of the quantization error can be reduced, or the mantissa bit length of the block-configurable number format used by the cell attributed the highest portion of the quantization error can be increased. The quantization error of a data value is the difference between the data value in the original floating-point number format (i.e., as used in the implementation of the RNN for number format selection purposes) and the data value in the block-configurable number format (i.e., as proposed for the hardware implementation of the RNN).

[0184] Several methods have been developed for identifying number formats for representing the network values of an RNN. A simple method for selecting the number format for representing a set of network parameters of an RNN (which may be referred to herein as the full-range method, minimum / maximum method, or MinMax method) may comprise selecting, for a given mantissa bit depth $n$ (or a given exponent $exp$), the smallest exponent $exp$ (or smallest mantissa bit depth $n$) that covers the expected range of the set of network values $x$ used in an operation. For example, for a given mantissa bit depth $n$, the exponent $exp$ can be selected according to equation (5) so that the number format covers the entire range of $x$, where $\lceil \cdot \rceil$ is the ceiling function:

[0185] $exp = \lceil \log_2(\max(|x|)) \rceil - n + 1$ (5)

[0186] However, such methods are sensitive to outliers. Specifically, where the set of network values $x$ contains outliers, precision is sacrificed in order to cover them. This can lead to large quantization errors (e.g., the error between the set of network values in a first number format (e.g., floating-point format) and the set of network values in the selected number format). As a result, the error in the computation and / or in the output data of the RNN caused by quantization can be greater than if the number format covered a smaller range but with greater precision.
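
The MinMax selection and its sensitivity to outliers can be sketched as below. The sketch assumes the ceiling-based rule exp = ceil(log2(max|x|)) - n + 1 for an n-bit mantissa; the function name and the sample values are illustrative only.

```python
import math

def minmax_exponent(values, n_bits):
    """Smallest exponent whose range covers max|x|, assuming
    exp = ceil(log2(max|x|)) - n_bits + 1."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        raise ValueError("range is undefined for an all-zero set")
    return math.ceil(math.log2(max_abs)) - n_bits + 1

typical = [0.3, -5.2, 1.1]
exp_typical = minmax_exponent(typical, n_bits=8)            # range set by 5.2
exp_outlier = minmax_exponent(typical + [100.0], n_bits=8)  # range set by the outlier

# One quantization level is 2**exp, so a larger exponent means coarser steps.
step_typical = 2.0 ** exp_typical
step_outlier = 2.0 ** exp_outlier
```

With the single outlier included, the quantization step grows from 0.0625 to 1.0, so every ordinary value in the set is represented sixteen times more coarsely.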

[0187] In other examples, a sum-of-squared-errors algorithm with outlier weighting can be used. This algorithm may be suitable where the relatively important values are typically those at the higher end of the range of values of a given set of two or more values. This is particularly true of weight tensors that are regularized by penalizing their magnitude, so that elements with higher values can be expected to have greater relative importance than those with lower values. In addition, clamping is a particularly destructive form of noise that can introduce a strong bias into the resulting set of two or more quantized values. Therefore, in some applications, it is advantageous to bias the error toward retaining large values while avoiding the extreme of preserving the full range at the expense of quantization error (e.g., as in the “MinMax” method). For example, the weighting function α(x) shown in equation (6) below, combined with a squared measure of error, can be used in a sum-of-squared-errors algorithm.

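[0188] $\alpha(x) = \begin{cases} 1 + \dfrac{\gamma |x|}{SAT}, & x > SAT - \delta_e \text{ or } x < -SAT \\ 1, & \text{otherwise} \end{cases}$ (6)

[0189] where $SAT$ is the saturation point, defined as $2^{exp+n-1}$; $exp$ is the exponent of the fixed-point number format; $n$ is the number of bits of the mantissa; $\delta_e$ is $2^{exp}$ (i.e., one quantization level); and $\gamma$ is the gradient, which is chosen empirically. For some neural networks, a gradient of 20 can work well.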
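
A minimal sketch of an outlier-weighted sum-of-squared-errors selection follows. It assumes a weighting that is 1 inside the representable range and grows linearly with |x| outside it, a saturation point of 2**(exp + n - 1), and a signed n-bit mantissa; the helper names, the rounding mode, and the candidate exponent list are all hypothetical choices for illustration.

```python
def quantize(x, exp, n):
    """Round x to the nearest multiple of 2**exp and clamp to an n-bit signed mantissa.
    Python's round() rounds halves to even; real hardware may round differently."""
    step = 2.0 ** exp
    lo, hi = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    mantissa = min(max(round(x / step), lo), hi)  # clamping models saturation
    return mantissa * step

def alpha(x, exp, n, gamma=20.0):
    """Assumed weighting: constant inside the range, linear in |x| outside it."""
    sat = 2.0 ** (exp + n - 1)   # saturation point
    delta_e = 2.0 ** exp         # one quantization level
    if x > sat - delta_e or x < -sat:
        return 1.0 + gamma * abs(x) / sat
    return 1.0

def weighted_sse(values, exp, n, gamma=20.0):
    """Weighted sum of squared quantization errors for one candidate exponent."""
    return sum(alpha(v, exp, n, gamma) * (v - quantize(v, exp, n)) ** 2
               for v in values)

def select_exponent(values, n, candidates, gamma=20.0):
    """Pick the candidate exponent with the lowest weighted error."""
    return min(candidates, key=lambda e: weighted_sse(values, e, n, gamma))
```

For example, `select_exponent(values, n=8, candidates=range(-8, 4))` searches a small range of exponents; values that would be clamped contribute a heavily weighted error, which steers the choice away from formats that saturate large values.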

[0190] A weighted outlier method is described in the applicant's UK patent application number 1718293.2, the entire contents of which are incorporated herein by reference. In the weighted outlier method, the number format for a set of network values is selected from multiple potential number formats based on a weighted sum of the quantization errors incurred when a particular number format is used, wherein a constant weight is applied to the quantization errors of network values falling within the representable range of the number format, and a linearly increasing weight is applied to the quantization errors of values falling outside the representable range.

[0191] Another method (which may be referred to as the backpropagation method) is described in the applicant's UK patent application number 1821150.8, the entire contents of which are incorporated herein by reference. In the backpropagation method, the quantization parameters that produce the optimal cost (e.g., a combination of RNN accuracy and RNN size (e.g., bit depth)) are selected by iteratively determining the gradient of the cost with respect to each quantization parameter using backpropagation and adjusting the quantization parameters until the cost converges. This method can produce good results (e.g., RNNs that are small in terms of bit depth and accurate), but it can take a long time to converge.

[0192] Generally, the selection of number formats can be conceived of as an optimization problem that can be performed over one, some, or all of the parameters of the number formats in an RNN. In some examples, multiple parameters of the number formats can be optimized simultaneously; in other examples, one or more parameters of the format selection method can be optimized in turn. In some examples, the bit depth of the network values can be predefined, with the format selection algorithm being applied to select appropriate exponents for the network values of the RNN. The bit depth can be fixed or, in some examples, can be a parameter to be optimized. In some examples, applying the number format selection algorithm at 1105 may include identifying appropriate bit depths for the RNN. To ensure that each time step of the test neural network is identical, instances of two or more values of the RNN at different time steps are constrained to have the same bit depth. For example, each instance of the state tensor h1(t) has the same bit depth, and each instance of the input tensor x(t) has the same bit depth.

[0193] As already described, at step 1104 the RNN is executed on the sample input data over a predefined number of time steps without any (or with minimal) quantization of its network values, in order to capture, at each time step, the statistics required by the format selection method. The format selection method is then applied at step 1105 to the statistics captured at each time step of the RNN so as to select optimal number formats for the network values of the RNN. The number format selection algorithm can be selected and / or configured to identify a number format of the block-configurable type for each network value for which a number format is to be determined. As explained above, a block-configurable number format identified by the algorithm will typically be represented as a set of one or more parameters defining a block-configurable number format of that type.

[0194] Number format selection can be performed for sets of two or more network values on statistics captured at one or more time steps. Number format selection can also be performed for sets of two or more network values on statistics captured over more than one sequence of time steps, for example by applying the RNN to a first sample input sequence and then to a second sample input sequence. The number format selection algorithm can be applied to all of the statistics captured over the multiple sequences of time steps so as to identify, in the manner described herein, a single common number format for a set of two or more network values; alternatively, the number format selection algorithm can be applied independently to the statistics captured over different sequences of time steps, with the number formats identified for each sequence then being combined according to the methods described herein so as to identify a single common number format for the set of two or more network values. This helps to ensure the generality of the common number format identified for each set of two or more network values.

[0195] In some examples, the format selection algorithm is applied independently (1105) to the statistics captured at each time step (or at a subset of the time steps at which statistics are captured) so as to identify a number format for each instance of a network value at each (or those) time steps; the number formats of those instances are then combined to produce a common number format for the network value across all time steps (1106). In other examples, the format selection algorithm is applied (e.g., simultaneously) to the statistics captured at all of the predefined time steps over which the RNN is executed (1105) so as to identify a common number format (1106) for the network value across all time steps of the RNN execution (i.e., for every instance of the network value). In such examples, the format selection algorithm identifies (1106) the common number format to be used for all instances of the corresponding network value when the RNN is implemented in hardware.

[0196] When the format selection algorithm is applied simultaneously to the statistics captured over all of the predefined number of time steps for which the RNN is executed, the output of the format selection algorithm can be a single common number format for the network value. For example, the statistics captured while running the RNN on the sample input data could include the maximum absolute value of a set of two or more values captured at each time step. The format selection algorithm could then comprise combining the maximum absolute values captured at each time step by taking the maximum of them, and performing a MinMax algorithm on that maximum so as to identify the parameters of a common block-configurable number format for that set of values.
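
The simultaneous approach described above can be sketched as follows: the per-step maxima are reduced to a single maximum, and a MinMax-style rule is applied once. The rule exp = ceil(log2(max)) - n + 1, the function name, and the sample maxima are assumptions for illustration.

```python
import math

def common_exponent_from_step_maxima(step_max_abs, n_bits):
    """Combine per-step max |value| statistics and apply a MinMax rule once."""
    overall_max = max(step_max_abs)  # maximum over all captured time steps
    if overall_max == 0:
        raise ValueError("range is undefined when every step maximum is zero")
    return math.ceil(math.log2(overall_max)) - n_bits + 1

# Max |value| of one tensor at four time steps; the first reflects a zero initial state.
per_step_max = [0.0, 0.8, 1.7, 1.2]
common_exp = common_exponent_from_step_maxima(per_step_max, n_bits=8)
```

Because the reduction is a maximum, the transient first step (here zero) does not narrow the chosen range, while the largest value seen at any step is still covered.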

[0197] As explained above, to ensure time invariance across the time steps of a hardware implementation of an RNN, each instance of a network value (i.e., the network value at each time step) should have the same number format. Where the format selection algorithm is performed multiple times on the statistics captured over the predefined number of time steps (e.g., the format selection algorithm is applied independently to the statistics captured at each time step or at a subset of the time steps), the format selection algorithm can identify more than one number format for each network value. In other words, different number formats can be identified for instances of the same set of values at different time steps. In this case, the resulting number formats are combined (1106) so as to identify a common number format for each network value of the RNN. When the RNN is implemented in hardware (e.g., as a derivative neural network according to the principles above), this common number format can be used for all instances of the corresponding network value in the RNN. For example, with reference to Figure 4, the input tensors are each an instance of the input tensor of the RNN at a particular time step, the first state tensors are each an instance of the first state tensor of the RNN at a particular time step, the second state tensors are each an instance of the second state tensor of the RNN at a particular time step, and so on. The number formats identified for a given network value can be combined in any manner suited to the particular number format. The number formats can be combined at format selection unit 344.

[0198] A block-configurable number format can be represented as a set of one or more parameters defining the number format; for example, a first integer value can represent the exponent, and a second integer value can represent the mantissa bit depth. Each parameter can be combined independently to form the common number format. For example, the integer parameters representing the number formats established for the instances of a tensor can be combined by identifying the median, minimum, maximum, or mean (e.g., the integer value closest to the mean) of the integer values, which can then be used as the corresponding parameter of the common number format. It has been found that using the median of the exponents of the number formats of each set of instances provides good accuracy.

[0199] Consider a specific example in which the number formats established for the instances of a network value are each defined by an integer parameter defining the exponent of a block-configurable number format. In this example, a number format can be identified independently by the format selection algorithm for the network value at each of four iterations of the RNN. If the identified number formats have exponents of 5, 6, 5, and 4, then the median exponent is 5, and the common number format can be identified as the number format with an exponent of 5.
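
The combination of per-instance exponents can be sketched as below, using the exponents from the example above. The function name is hypothetical, and taking the lower middle value when the count is even (so that the result is always an integer) is an assumption of this sketch.

```python
import statistics

def combine_exponents(exponents, how="median"):
    """Combine per-time-step integer exponents into one common exponent."""
    if how == "median":
        # median_low always returns one of the observed integers
        return statistics.median_low(exponents)
    if how == "min":
        return min(exponents)
    if how == "max":
        return max(exponents)
    if how == "mean":
        return round(statistics.mean(exponents))  # integer closest to the mean
    raise ValueError(f"unknown combination rule: {how}")

per_step_exponents = [5, 6, 5, 4]  # one exponent per iteration of the RNN
common_exponent = combine_exponents(per_step_exponents)
```

Here `common_exponent` is 5, matching the worked example; the other rules give 4 (`min`), 6 (`max`), and 5 (`mean`) for the same instances.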

[0200] Once a common number format has been established, it can be used in the hardware implementation of the RNN. For example, the common number format can be provided to the transformation unit 326 of the data processing system shown in Figure 3 for use in the derivative neural network. The same common number format identified for a network value of the RNN is used for all instances of that network value in the derivative neural network. For example, the common number format established for the input tensor x of the test neural network is used as the number format of all instances of the input tensor of the derivative neural network, and the common number format established for the first state tensor h1 of the test neural network is used as the number format of all instances of the first state tensor h1 of the derivative neural network.

[0201] A derivative neural network implemented using the method according to the invention can represent the RNN expanded over a number of steps that differs from the predefined number of steps used to perform number format selection. Where the RNN executed to generate statistics for the number format selection algorithm is implemented according to the above principles, the number of steps of the RNN represented by the test neural network can differ from the number of steps represented by the derivative neural network.

[0202] In addition to providing consistent behavior over time, the method described herein also makes the selected formats more robust, because information is gathered from multiple time steps of the RNN. For example, if a tensor behaves differently at a given time step than at previous time steps, resulting in a different number format at that time step, the method has the potential to generalize that format to all other time steps before and after the given time step in the unrolled graph. This means that those tensor values can be handled correctly if the anomalous behavior occurs at a different point in the sequence.

[0203] The method of the present invention for performing number format selection for RNNs can be applied to neural networks in which causal and non-causal components are separated according to the principles described above. Since causal and non-causal operations are performed separately, these operations can be performed on different tensors whose general number formats can be independently selected according to the method of the present invention. This allows different number formats to be used for causal and non-causal operations, thereby improving performance (e.g., inference speed) and / or enabling a given performance level to be achieved with lower memory and processing overhead.

[0204] To perform operations on combinations of values defined in different number formats, the number formats of one or more of the values being combined can be converted so that the combined values have the same number format. For example, with reference to Figure 7, if the output 704 of the non-causal cell has a first number format and the state input 706 has a second number format, the addition operation at the causal cell 604 can be configured to convert the output 704 and / or the state input 706 to the same (possibly third) number format. This conversion can be performed in hardware, such as at the accelerator 302 of the data processing system 300.

[0205] The skilled person will understand how to convert data values between number formats. For example, conversion from one number format with mantissa $m_1$ and exponent $e_1$ to another number format with mantissa $m_2$ and exponent $e_2$ and the same bit depth can be performed as follows, where the number formats differ in their exponents:

[0206] $m_2 = \begin{cases} m_1 \gg (e_2 - e_1), & e_2 > e_1 \\ m_1 \ll (e_1 - e_2), & e_1 > e_2 \end{cases}$ (7)

[0207] Such conversions are subject to saturation and quantization error, depending on whether $e_2$ is lower or higher than $e_1$.
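
A conversion of this kind can be sketched with integer shifts, assuming each value is mantissa × 2^exponent with a signed mantissa of fixed bit depth. The function name and the explicit clamp are illustrative choices; hardware may round rather than truncate.

```python
def convert_mantissa(m1, e1, e2, n_bits):
    """Re-express mantissa m1 (exponent e1) as a mantissa for exponent e2,
    keeping the same signed bit depth n_bits."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if e2 >= e1:
        # Coarser target format: low-order bits are dropped (quantization error).
        m2 = m1 >> (e2 - e1)  # arithmetic shift, rounds toward negative infinity
    else:
        # Finer target format: the mantissa grows and may exceed the bit depth.
        m2 = m1 << (e1 - e2)
    return max(lo, min(hi, m2))  # clamp models saturation

exact = convert_mantissa(100, e1=-4, e2=-2, n_bits=8)      # 6.25 stays 6.25
truncated = convert_mantissa(101, e1=-4, e2=-2, n_bits=8)  # 6.3125 becomes 6.25
saturated = convert_mantissa(100, e1=-4, e2=-6, n_bits=8)  # 400 does not fit in 8 bits
```

Moving to a higher exponent can lose low-order bits, while moving to a lower exponent can saturate, which is the trade-off noted in the text.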

[0208] Because the method of this invention allows different number formats to be selected for sets of two or more values of an RNN (e.g., tensors, portions of tensors, groups of tensors), the performance of the RNN in hardware can be optimized for any implementation, and in particular for implementations that use the principles described herein to form a derivative neural network based on the RNN expanded over a predetermined number of steps. Performing number format selection independently for different network values can give better results when the RNN is executed in hardware, by affording greater flexibility in format selection.

[0209] General Comments

[0210] The data processing system of Figure 3 is shown as comprising a number of functional blocks. This is merely illustrative and is not intended to define a strict division between the different logical elements of such an entity. Each functional block can be provided in any suitable manner. It should be understood that the intermediate values described herein as being formed by the computer system need not be physically generated by the computer system at any point, and may merely represent logical values that conveniently describe the processing performed by the computer system between its inputs and outputs.

[0211] The accelerators described herein are embodied in hardware; for example, an accelerator may include one or more integrated circuits. The data processing systems described herein may be configured to perform any of the methods described herein. Unless otherwise indicated, the functions, methods, techniques, or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry systems), or any combination thereof. The terms “module,” “function,” “component,” “element,” “cell,” “block,” and “logic” may be used herein to generally denote software, firmware, hardware, or any combination thereof. In the case of software, a module, function, component, element, cell, block, or logic represents program code that, when executed on a processor, performs a specified task. The software described herein may be executed by one or more processors that execute code that causes one or more processors to perform an algorithm / method embodied by the software. Examples of computer-readable storage media include random access memory (RAM), read-only memory (ROM), optical disk, flash memory, hard disk storage, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that may be accessible by a machine.

[0212] As used herein, the terms computer program code and computer-readable instructions refer to any type of executable code for a processor, including code expressed in one or more of machine language, interpreted language, scripting language, and compiled high-level language. Executable code includes binary code, machine code, bytecode, code defining integrated circuits (such as hardware description languages or netlists), and code expressed in programming languages such as C, Java, or OpenCL. Executable code can be, for example, any kind of software, firmware, script, module, or library that, when properly executed, processed, interpreted, compiled, or run in a virtual machine or other software environment, causes the processor to perform the tasks specified by the code.

[0213] A processor can be any kind of device, machine, or special-purpose circuit, or a combination thereof, that has processing capabilities to execute instructions. A processor can be any kind of general-purpose or special-purpose processor, such as a system-on-a-chip, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), and so on. A computing system may include one or more processors.

[0214] This invention is also intended to cover software (such as HDL (Hardware Description Language) software) that defines the configuration of hardware as described herein, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer-readable storage medium may be provided on which computer-readable program code in the form of an integrated circuit definition dataset is encoded, which, when processed in an integrated circuit manufacturing system, configures the system to manufacture a computer system configured to perform any of the methods described herein, or to manufacture a computer system as described herein. The integrated circuit definition dataset may, for example, be an integrated circuit description.

[0215] A method for manufacturing a computer system as described herein can be provided in an integrated circuit manufacturing system. An integrated circuit definition dataset can be provided, which, when processed in the integrated circuit manufacturing system, causes the method for manufacturing the computer system to be executed.

[0216] An integrated circuit definition dataset can be in the form of computer code, for example as a netlist, code for configuring a programmable chip, or a hardware description language defining an integrated circuit at any level, including register-transfer level (RTL) code, high-level circuit representations such as Verilog or VHDL, and low-level circuit representations such as OASIS (RTM) and GDSII. Higher-level representations that logically define an integrated circuit (such as RTL) can be processed at a computer system configured to generate a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements, so as to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case where software executes at a computer system so as to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for a computer system configured to generate a manufacturing definition of an integrated circuit to execute the code defining the integrated circuit so as to generate that manufacturing definition.

[0217] An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system, so as to configure the system to manufacture a computer system, will now be described with reference to Figure 10.

[0218] Figure 10 shows an example of an integrated circuit (IC) manufacturing system 1002 configured to manufacture a computer system as described in any of the examples herein. Specifically, the IC manufacturing system 1002 includes a layout processing system 1004 and an integrated circuit generation system 1006. The IC manufacturing system 1002 is configured to receive an IC definition dataset (e.g., defining a computer system as described in any of the examples herein), process the IC definition dataset, and generate an IC according to the IC definition dataset (e.g., an IC that embodies a computer system as described in any of the examples herein). The processing of the IC definition dataset configures the IC manufacturing system 1002 to manufacture an integrated circuit embodying a computer system as described in any of the examples herein.

[0219] The layout processing system 1004 is configured to receive and process an IC definition dataset to determine a circuit layout. Methods for determining a circuit layout based on an IC definition dataset are known in the art and may involve, for example, synthesizing RTL code to determine the gate-level representation of the circuit to be generated, for example, in relation to logic components (e.g., NAND, NOR, AND, OR, MUX, and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout can be determined based on the gate-level representation of the circuit. This can be done automatically or with user intervention to optimize the circuit layout. Once the layout processing system 1004 has determined the circuit layout, it can output the circuit layout definition to the IC generation system 1006. The circuit layout definition may be, for example, a circuit layout description.

[0220] As is known in the art, IC generation system 1006 generates ICs according to a circuit layout definition. For example, IC generation system 1006 may implement a semiconductor device manufacturing process for generating ICs, which may involve a multi-step sequence of photolithography and chemical processing steps, during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask, which can be used in the photolithography process to generate ICs according to the circuit definition. Alternatively, the circuit layout definition provided to IC generation system 1006 may be in the form of computer-readable code, which IC generation system 1006 can use to form a suitable mask for generating ICs.

[0221] The various processes performed by the IC manufacturing system 1002 can all be implemented in one location, for example, by one party. Alternatively, the IC manufacturing system 1002 can be a distributed system, allowing some processes to be performed in different locations and by different parties. For example, some of the following stages can be performed in different locations and / or by different parties: (i) synthesizing RTL code representing an IC definition dataset to form a gate-level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate-level representation; (iii) forming a mask based on the circuit layout; and (iv) using the mask to manufacture the integrated circuit.

[0222] In other examples, by processing an integrated circuit definition dataset at an integrated circuit manufacturing system, the system can be configured to manufacture a computer system without processing the IC definition dataset to determine the circuit layout. For example, an integrated circuit definition dataset can define the configuration of a reconfigurable processor such as an FPGA, and processing that dataset can configure the IC manufacturing system to (e.g., by loading the configuration data into the FPGA) generate a reconfigurable processor with that defined configuration.

[0223] In some implementations, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, can cause the integrated circuit manufacturing system to generate a device as described herein. For example, configuring an integrated circuit manufacturing system with an integrated circuit manufacturing definition dataset in the manner described above with reference to Figure 10 can cause a device as described herein to be manufactured.

[0224] In some examples, an integrated circuit definition dataset may include software that runs on hardware defined at the dataset, or in combination with hardware defined at the dataset. In the example shown in Figure 10, the IC generation system can be further configured by the integrated circuit definition dataset to, on manufacturing an integrated circuit, load firmware onto that integrated circuit in accordance with program code defined at the integrated circuit definition dataset, or otherwise provide program code with the integrated circuit for use with the integrated circuit.

[0225] Compared to known implementations, the implementation of the concepts set forth in this application in devices, apparatuses, modules, and / or systems (and in the methods implemented herein) can lead to performance improvements. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and / or reduced power consumption. During the manufacture of such devices, apparatuses, modules, and systems (e.g., in integrated circuits), trade-offs can be made between performance improvements and physical implementations, thereby improving manufacturing methods. For example, a trade-off can be made between performance improvements and layout area to match the performance of known implementations but using less silicon. This can be accomplished, for example, by reusing functional blocks serially or sharing functional blocks among elements of the device, apparatus, module, and / or system. Conversely, the concepts set forth in this application that lead to improvements in the physical implementation of devices, apparatuses, modules, and systems (such as reduced silicon area) can be traded off against performance improvements. This can be accomplished, for example, by manufacturing multiple instances of the module within a predefined area budget.

[0227] The applicant hereby discloses in isolation each individual feature described herein, and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problem disclosed herein. In view of the foregoing description, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A computer-implemented method of selecting a number format for representing two or more values used in configuring hardware, the hardware being adapted to execute a non-recurrent neural network so as to implement a recurrent neural network (RNN), the method comprising:
receiving a representation of the RNN;
implementing the representation of the RNN as a test neural network for operation on a test input sequence, wherein implementing the representation of the RNN as the test neural network comprises transforming the representation of the RNN into a test neural network for operation over a first predetermined plurality of steps by unrolling the RNN over the first predetermined plurality of steps, the test neural network being equivalent to the RNN over the first predetermined plurality of steps;
operating the test neural network on the test input sequence for the first predetermined plurality of steps, wherein, for each of the first predetermined plurality of steps, the operation of the test neural network comprises an instance of the two or more values of the RNN, and collecting statistics for provision to a number format selection algorithm;
applying the number format selection algorithm to the statistics so as to derive a common number format for the first predetermined plurality of instances of the two or more values of the RNN; and
using, in an implementation of the RNN at the hardware adapted to execute a non-recurrent neural network, the common number format as the number format of the corresponding two or more values for operation on an input sequence, wherein the RNN is implemented at the hardware as a derivative neural network, the derivative neural network representing the RNN unrolled over a second predetermined plurality of steps.
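
By way of non-limiting illustration only, the steps of claim 1 might be sketched as follows. The toy scalar cell `h_t = tanh(w_x * x_t + w_h * h_{t-1})`, the tracked value names, the function names, and the range-based selection rule (the smallest exponent for which the largest observed magnitude fits a signed mantissa of a fixed bit width) are all assumptions made for this sketch and are not taken from the specification.

```python
import math


def collect_statistics(test_inputs, w_x, w_h, h0=0.0):
    """Operate a toy RNN unrolled over len(test_inputs) steps.

    Each step holds one instance of each tracked value; the statistic
    kept here is the largest magnitude seen over all step instances.
    """
    stats = {"pre_activation": 0.0, "state": 0.0}
    h = h0
    for x in test_inputs:
        pre = w_x * x + w_h * h          # instance of "pre_activation"
        h = math.tanh(pre)               # instance of "state"
        stats["pre_activation"] = max(stats["pre_activation"], abs(pre))
        stats["state"] = max(stats["state"], abs(h))
    return stats


def select_exponent(max_abs, bits=8):
    """Assumed selection rule: smallest exponent e such that max_abs
    fits in a signed `bits`-bit mantissa scaled by 2**e."""
    if max_abs == 0.0:
        return 0
    max_mantissa = 2 ** (bits - 1) - 1
    return math.ceil(math.log2(max_abs / max_mantissa))


def derive_common_formats(stats, bits=8):
    """One (mantissa bit width, exponent) pair per value, shared by
    every step instance of that value."""
    return {name: (bits, select_exponent(m, bits)) for name, m in stats.items()}


test_inputs = [0.5, -1.0, 2.0, 0.25]     # illustrative test input sequence
stats = collect_statistics(test_inputs, w_x=1.5, w_h=0.5)
formats = derive_common_formats(stats, bits=8)
```

In this sketch the format derived for each value would then be reused for every instance of that value in the network deployed at the hardware, whatever the number of steps over which that network is unrolled.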

2. The method of claim 1, wherein each step of the test neural network is for operation on a different test input of the test input sequence, and wherein applying the number format selection algorithm comprises applying the number format selection algorithm to the statistics captured over all of the plurality of steps, the common number format being output by the number format selection algorithm.

3. The method of claim 1, wherein the common number format is a block-configurable number format defined by one or more configurable parameters, and wherein the number format selection algorithm is configured to identify a block-configurable number format of a predefined type.

4. The method of claim 1, wherein applying the number format selection algorithm comprises:
for each of the plurality of steps, independently identifying a number format for each instance of the two or more values; and
combining the number formats of the plurality of instances of the two or more values so as to derive the common number format for the plurality of instances of the two or more values of the RNN.

5. The method of claim 4, wherein the number format selection algorithm is configured to identify, for each instance of the two or more values, a block-configurable number format defined by one or more configurable parameters, and wherein the combining comprises independently combining each of the one or more configurable parameters of the block-configurable number formats identified for each instance of the two or more values so as to define the one or more configurable parameters of the common number format.
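
By way of non-limiting illustration only, the per-parameter combination of claims 4 and 5 might be sketched as follows. The `BlockFormat` type, its two parameters, and the choice of the maximum as the combining function for each parameter are assumptions made for this sketch; the claims do not fix a particular combining function, and a median or mean could be substituted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockFormat:
    """Hypothetical block-configurable format: value = mantissa * 2**exponent."""
    exponent: int
    mantissa_bits: int


def combine_formats(per_step_formats):
    """Combine formats identified independently at each step.

    Each configurable parameter is combined on its own, without
    reference to the other parameter. Taking the maximum is an
    assumed policy that favours range over precision.
    """
    if not per_step_formats:
        raise ValueError("at least one per-step format is required")
    return BlockFormat(
        exponent=max(f.exponent for f in per_step_formats),
        mantissa_bits=max(f.mantissa_bits for f in per_step_formats),
    )


# Formats identified for the instances of one value at three steps.
per_step = [BlockFormat(-7, 8), BlockFormat(-5, 8), BlockFormat(-6, 6)]
common = combine_formats(per_step)
```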

6. The method of claim 1, wherein the test neural network is configured to perform operations on a predefined plurality of test inputs, the number of which is equal to the number of the first predetermined plurality of steps.

7. The method of claim 1, wherein the implementation of the RNN at the hardware is formed by:
transforming the representation of the RNN into a derivative neural network for operation on a predetermined plurality of inputs of the input sequence, the derivative neural network having one or more state inputs and one or more state outputs, and being equivalent to the RNN over the second predetermined plurality of steps; and
iteratively applying the derivative neural network to the input sequence by:
implementing a sequence of instances of the derivative neural network at the hardware; and
providing the one or more state outputs from each instance of the derivative neural network at the hardware as the one or more state inputs to the subsequent instance of the derivative neural network at the hardware, so as to operate the RNN on an input sequence longer than the predetermined plurality of inputs.
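
By way of non-limiting illustration only, the iterative application of claim 7 might be sketched as follows. The toy scalar cell, its weights, the function names, and the requirement that the sequence length be a whole multiple of the number of steps per instance are assumptions made for this sketch.

```python
import math


def rnn_step(x, h, w_x=1.5, w_h=0.5):
    """One step of a toy scalar RNN cell (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)


def derivative_network(inputs, state_in):
    """Feed-forward network equivalent to the RNN unrolled over
    len(inputs) steps, with one state input and one state output."""
    h = state_in
    outputs = []
    for x in inputs:
        h = rnn_step(x, h)
        outputs.append(h)
    return outputs, h


def run_iteratively(sequence, steps_per_instance, state=0.0):
    """Apply successive instances of the derivative network, feeding the
    state output of each instance to the state input of the next."""
    if len(sequence) % steps_per_instance != 0:
        # Assumption of this sketch: no padding of a final partial chunk.
        raise ValueError("sequence length must be a multiple of steps_per_instance")
    outputs = []
    for start in range(0, len(sequence), steps_per_instance):
        chunk = sequence[start:start + steps_per_instance]
        chunk_outputs, state = derivative_network(chunk, state)
        outputs.extend(chunk_outputs)
    return outputs, state


sequence = [0.5, -1.0, 2.0, 0.25, 0.0, 1.0]
outputs, final_state = run_iteratively(sequence, steps_per_instance=2)
```

Because the state is carried between instances, the result in this sketch matches running the recurrent cell step by step over the whole sequence, although each instance covers only two steps.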

8. The method of claim 7, wherein the common number format formed for each of the two or more values of the RNN is used as the number format for all instances of the two or more values in the derivative neural network.

9. The method of claim 7, wherein the first predetermined plurality of steps includes fewer steps than the second predetermined plurality of steps.

10. The method of claim 7, wherein the RNN comprises one or more cells, each cell being arranged to receive a cell state input generated at a preceding step, and wherein transforming the RNN into the test neural network further comprises, at each cell:
identifying non-causal operations, being operations performed independently of the cell state input generated at the preceding step; and
in the derivative neural network, grouping together at least some of the non-causal operations at a plurality of instances of the cell over at least some of the predetermined plurality of steps for processing in parallel at the hardware.

11. The method of claim 10, wherein the cell comprises causal operations performed in dependence on the cell state input, and wherein transforming the RNN further comprises configuring the test neural network such that the result of a non-causal operation performed at the cell in respect of an input from the test input sequence is combined with the causal operations performed at the cell in respect of the same test input.
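
By way of non-limiting illustration only, the separation of non-causal and causal operations in claims 10 and 11 might be sketched as follows. The toy scalar cell, the weights `W_X` and `W_H`, and the function names are assumptions made for this sketch; the list comprehension stands in for operations that hardware could process in parallel.

```python
import math

W_X, W_H = 1.5, 0.5  # hypothetical input and state weights


def run_stepwise(inputs, h=0.0):
    """Reference: every operation is performed inside the step loop."""
    states = []
    for x in inputs:
        h = math.tanh(W_X * x + W_H * h)
        states.append(h)
    return states


def run_split(inputs, h=0.0):
    """Non-causal work is grouped across steps; causal work stays sequential."""
    # Non-causal: depends only on the inputs, not on any earlier state,
    # so the instances for all steps can be grouped and computed together.
    projected = [W_X * x for x in inputs]

    # Causal: depends on the state from the preceding step. Each step
    # combines the non-causal result for an input with the causal
    # operation for the same input.
    states = []
    for p in projected:
        h = math.tanh(p + W_H * h)
        states.append(h)
    return states


inputs = [0.5, -1.0, 2.0, 0.25]
split_states = run_split(inputs)
```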

12. The method of claim 10, wherein the two or more values are used in the non-causal operations, and the RNN comprises two or more other values used in the causal operations, and wherein applying the number format selection algorithm to the statistics is performed so as to independently derive the common number format for the two or more values of the RNN and a second common number format for the two or more other values of the RNN.

13. The method of claim 1, wherein the test input sequence includes exemplary input values selected to represent a typical or expected range of input values for the RNN.

14. A data processing system for selecting one or more number formats for representing two or more values used in configuring hardware, the hardware being adapted to execute a non-recurrent neural network so as to implement a recurrent neural network (RNN), the data processing system comprising:
a processor;
a transformation unit configured to receive a representation of the RNN and to transform the representation of the RNN into a test neural network for operation over a first predetermined plurality of steps, wherein the transformation unit is configured to unroll the RNN over the first predetermined plurality of steps so as to form the test neural network, the test neural network being equivalent to the RNN over the first predetermined plurality of steps;
control logic configured to implement, at the processor, the representation of the RNN as the test neural network for operation on a test input sequence;
a format selection unit configured to cause the processor to operate the test neural network on the test input sequence for the first predetermined plurality of steps and to collect statistics for provision to a number format selection algorithm, wherein, for each of the first predetermined plurality of steps, the operation of the test neural network on the test input sequence comprises an instance of the two or more values of the RNN, the format selection unit being further configured to apply the number format selection algorithm to the statistics so as to derive a common number format for the first predetermined plurality of instances of the two or more values of the RNN; and
a hardware accelerator for processing neural networks, the hardware accelerator being adapted to execute non-recurrent neural networks;
wherein the control logic is further configured to implement the representation of the RNN at the hardware accelerator using the common number format for the two or more values of the RNN for operation on an input sequence, the RNN being implemented at the hardware accelerator as a derivative neural network, the derivative neural network representing the RNN unrolled over a second predetermined plurality of steps.

15. The data processing system of claim 14, wherein the derivative neural network has one or more state inputs and one or more state outputs and is equivalent to the RNN over a predetermined plurality of steps, the system further comprising:
iteration logic configured to iteratively apply the derivative neural network to the input sequence, after the test neural network has been operated at the processor, by:
implementing a sequence of instances of the derivative neural network at the hardware accelerator; and
providing the one or more state outputs from each instance of the derivative neural network at the hardware accelerator as the one or more state inputs to the subsequent instance of the derivative neural network at the hardware accelerator, so as to cause the hardware accelerator to operate the RNN on an input sequence longer than the predetermined plurality of inputs.

16. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed at a computer system, cause the computer system to perform a computer-implemented method of selecting a number format for representing two or more values used in configuring hardware, the hardware being adapted to execute a non-recurrent neural network so as to implement a recurrent neural network (RNN), the method comprising:
receiving a representation of the RNN;
implementing the representation of the RNN as a test neural network for operation on a test input sequence, wherein implementing the representation of the RNN as the test neural network comprises transforming the representation of the RNN into a test neural network for operation over a first predetermined plurality of steps by unrolling the RNN over the first predetermined plurality of steps, the test neural network being equivalent to the RNN over the first predetermined plurality of steps;
operating the test neural network on the test input sequence for the first predetermined plurality of steps, wherein, for each of the first predetermined plurality of steps, the operation of the test neural network comprises an instance of the two or more values of the RNN, and collecting statistics for provision to a number format selection algorithm;
applying the number format selection algorithm to the statistics so as to derive a common number format for the first predetermined plurality of instances of the two or more values of the RNN; and
using, in an implementation of the RNN at the hardware adapted to execute a non-recurrent neural network, the common number format as the number format of the corresponding two or more values for operation on an input sequence, wherein the RNN is implemented at the hardware as a derivative neural network, the derivative neural network representing the RNN unrolled over a second predetermined plurality of steps.