Flexible programmable processing unit
The SIMDIM architecture addresses inefficiencies in digital data processing units by dynamically reconfiguring data paths with programmable switches, enhancing processing efficiency and reducing latency and energy consumption.
Patent Information
- Authority / Receiving Office: JP
- Patent Type: Applications
- Current Assignee / Owner: NOKIA SOLUTIONS & NETWORKS OY
- Filing Date: 2025-12-19
- Publication Date: 2026-07-02
AI Technical Summary
Existing digital data processing units face inefficiencies in area cost, energy consumption, and computational latency due to complex designs that often require a large number of unused processing elements and wide data paths, leading to high power consumption and longer processing times.
The SIMDIM architecture employs a configurable data path using programmable switches to interconnect processing elements, allowing dynamic reconfiguration of data paths based on control data, enabling efficient utilization of hardware resources and reducing computational latency.
The SIMDIM architecture achieves reduced computational latency and energy consumption by optimizing hardware usage, allowing for flexible and efficient processing of multiple data types within a single time slot, thereby improving processing efficiency and resource utilization.
Description
Technical Field
[0001] Various example embodiments generally relate to digital data processing units, processors, and related compilers.
Background Art
[0002] A digital data processing unit is hardware (e.g., software programmable circuitry) that receives input digital data, processes the input digital data according to instructions generated from a software program, and provides output digital data. The more efficiently this can be done, the lower both the area cost of the processing unit and the energy consumption during its useful life.
[0003] A number of different types of processing units can be used to process multiple data in parallel, including SIMD (Single Instruction, Multiple Data), SISD (Single Instruction, Single Data), MIMD (Multiple Instructions, Multiple Data), and MISD (Multiple Instructions, Single Data). Each of these types of processing units includes processing elements to which instructions can be assigned according to a given set of rules.
[0004] Algorithms vary widely in their requirements, degree of parallelization, and computational latency (e.g., in terms of the number of input data and the type of operations). Complex processing-unit designs can be used to map complex algorithms onto the processing unit, but this sizes the processing unit for the worst-case scenario, even when that scenario rarely occurs. The result is a large number of unused processing elements (i.e., dark silicon), high energy costs (wide data paths imply high power consumption), and more time to complete the processing.
Summary of the Invention
[0005] According to some aspects, the subject matter of the independent claims is provided. Some further aspects are defined in the dependent claims.
[0006] Embodiments, examples, and features described herein that are not within the scope of protection should be interpreted as useful examples for understanding the various embodiments or examples that are, if any, within the scope of protection.
[0007] According to a first embodiment, a digital data processing unit comprises an input for receiving input data, an output for providing output data, one or more processing elements, and one or more switches. At least one or each processing element has one or more inputs for receiving input data and one or more outputs for providing output data, and is configured to apply a mathematical operation to its input data within a time slot to generate output data. At least one or each switch has outputs and at least one input for receiving a respective input value, and is operated based on control data within a time slot to provide the received input value to one of its outputs, selected based on the control data. The one or more processing elements and the one or more switches are interconnected to form data paths between the inputs and the outputs of the processing unit. At least one or each of the data paths is configured to provide, at the corresponding output of the processing unit, a final mathematical result derived from the intermediate mathematical results generated by the one or more processing elements in that data path.
[0008] The one or more switches may operate within a time slot based on control data to provide a neutral value at at least one of the switch outputs, selected based on the control data. The neutral value provided by a switch in a data path may be a neutral value for the final mathematical result produced by that data path.
[0009] The neutral value provided by an output of a switch to an input of a processing element applying a mathematical operation may be equal to the identity element of that mathematical operation.
[0010] The input of a switch may be connected to an input of the processing unit or to an output of a processing element. The output of a switch may be connected to an output of the processing unit or to an input of a processing element.
[0011] A data path can have at least two processing elements.
[0012] The data path may include a first switch having an input connected to the input of the processing unit, a first processing element having an input connected to the output of the first switch, a second switch having an input connected to the output of the first processing element, and a second processing element having an input connected to the output of the second switch.
[0013] The digital data processing unit may include a bitwise OR gate belonging to a first data path and a second data path. The output of the bitwise OR gate may be connected to the output of the processing unit. The first input of the bitwise OR gate may be connected to a switch or processing element that provides the mathematical result of the first data path. The second input of the bitwise OR gate may be connected to a switch or processing element that provides the mathematical result of the second data path.
[0014] The input of the processing unit may have N first inputs for receiving data, where N ≥ 4. The one or more switches may include N first switches and N/2 second switches. The one or more processing elements may include N/2 first processing elements and N/4 second processing elements. Each of the N first switches may have (i) an input connected to a corresponding input among the N first inputs, (ii) a first output providing the input value received at the input of that first switch, and (iii) a second output providing a neutral value. Each of the N/2 first processing elements may have (i) an input connected to the output of a first corresponding switch among the N first switches, and (ii) another input connected to the output of a second corresponding switch among the N first switches. Each of the N/2 second switches may have (i) an input connected to the output of a corresponding processing element among the N/2 first processing elements, (ii) a first output providing the input value received at the input of that second switch, and (iii) a second output providing a neutral value. Each of the N/4 second processing elements may have (i) an input connected to the output of a third corresponding switch among the N/2 second switches, and (ii) another input connected to the output of a fourth corresponding switch among the N/2 second switches.
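The two-level structure of paragraph [0014] can be pictured as a fixed wiring table. The following Python sketch is illustrative only: the function name, the element labels (in_i, s1_i, p1_k, s2_k, p2_k), and the pairing rule (processing element k is fed by switches 2k and 2k+1) are assumptions made for the example, as is the restriction of N to multiples of 4.

```python
def two_level_wiring(n):
    """Return an illustrative wiring table for n inputs (n >= 4).

    Assumption: the k-th processing element of a level is fed by
    switches 2k and 2k+1 of the preceding switch level.
    """
    if n < 4 or n % 4 != 0:
        raise ValueError("n is assumed to be a multiple of 4 and at least 4")
    return {
        # first switch i is fed by processing-unit input i
        "first_switches": {f"s1_{i}": f"in_{i}" for i in range(n)},
        # first PE k is fed by first switches 2k and 2k+1
        "first_pes": {f"p1_{k}": (f"s1_{2 * k}", f"s1_{2 * k + 1}")
                      for k in range(n // 2)},
        # second switch k is fed by first PE k
        "second_switches": {f"s2_{k}": f"p1_{k}" for k in range(n // 2)},
        # second PE k is fed by second switches 2k and 2k+1
        "second_pes": {f"p2_{k}": (f"s2_{2 * k}", f"s2_{2 * k + 1}")
                       for k in range(n // 4)},
    }


wiring = two_level_wiring(8)
for level, links in wiring.items():
    print(level, len(links))
```

For N = 8 this yields 8 first switches, 4 first processing elements, 4 second switches, and 2 second processing elements, matching the N, N/2, N/2, and N/4 counts given above.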
[0015] According to a second embodiment, a processor comprises one or more digital data processing units according to the first embodiment.
[0016] According to a third embodiment, a compiler is configured to compile program code to generate instructions to be processed by a processor according to the second embodiment. The compiler may be configured to generate instructions for generating control data for one or more switches in one or more digital data processing units.
[0017] The compiler may be configured to generate instructions for configuring at least one of the switches in one or more digital data processing units to provide a respective neutral value.
[0018] Some example embodiments will now be described with reference to the accompanying drawings.
Brief Description of the Drawings
[0019]
[Figure 1A] A diagram illustrating an aspect of the behavior of a switch used in a processing unit according to an example.
[Figure 1B] A diagram illustrating an aspect of the behavior of a switch used in a processing unit according to an example.
[Figure 2] A diagram showing a processing unit according to an example.
[Figure 3] A diagram showing a processing unit according to an example.
[Figure 4A] A diagram showing various configurations of a processing circuit device according to an example.
[Figure 4B] A diagram showing various configurations of a processing circuit device according to an example.
[Figure 4C] A diagram showing various configurations of a processing circuit device according to an example.
[Figure 4D] A diagram showing various configurations of a processing circuit device according to an example.
[Figure 5] A diagram showing a framework for controlling a processing unit according to an example.
Modes for Carrying Out the Invention
[0020] Note that these drawings are intended to illustrate various aspects of the devices, methods, and structures used in the embodiments of the examples described herein. The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of similar or identical elements or features.
[0021] Detailed example embodiments are disclosed herein. The example embodiments are given by way of illustration only and are not limitations of the present disclosure. These example embodiments may be embodied in many alternative forms with various modifications and should not be construed as limited only to the embodiments described herein. Further, the figures and descriptions may be simplified to illustrate important elements and / or aspects for a clear understanding of the present invention, but, for clarity, many other elements well known in the art or that may not be important for an understanding of the present invention are excluded.
[0022] The architecture of a digital data processing unit is disclosed.
[0023] This architecture is referred to herein as the SIMDIM architecture, for "Single Instruction Multiple DIMensions". The SIMDIM architecture dynamically provides a configurable data path in which multi-dimensional vector operations can be mapped to a single instruction within a time slot. The SIMDIM architecture provides specialization and more flexibility. The SIMDIM architecture can be used in various types of processors.
[0024] A digital data processing unit (also simply referred to herein as a processing unit) according to the SIMDIM architecture includes one or more processing elements and one or more programmable switches.
[0025] The SIMDIM architecture is based on the use of one or more switches to interconnect one or more processing elements, thereby providing flexibility to define various data paths in a given digital data processing unit at runtime depending on the control data provided to the switches. This is implemented with a fixed hardware interconnect between the switches and the processing elements and dynamically adjustable control data values for the switches.
[0026] A processing element (PE) is configured to perform operations within a time slot according to the operation code (or opcode) provided to the PE under consideration, in order to apply mathematical operations to input data (e.g., from an input data stream) and generate output data (e.g., for an output data stream).
[0027] A time slot could, for example, correspond to a clock cycle of the processor clock used by the processing unit.
[0028] The addresses of input and output data can be defined within registers, for example, in a register file.
[0029] The mathematical operations performed by PE may be any type of basic operation, such as multiplication, addition, division, trigonometric operations, shifts, logical operations (e.g., XOR, AND, OR, NOR, etc.), or combinations thereof.
[0030] A PE can be configured (and optimized) to apply one or more of the types of mathematical operations described above. A PE can be configured to perform operations on one or more specific data types, such as integer values of various lengths, floating-point values of various lengths, binary values, etc. A PE can be configured (e.g., optimized) to perform operations only on specific data types.
[0031] In computing, an opcode (an abbreviation of operation code, also known as instruction machine code or instruction code) is the part of a machine code instruction that specifies the operation to be performed.
[0032] A switch has outputs (e.g., at least two outputs) and at least one input for receiving a respective input value. The switch operates based on control data within a time slot to provide each received input value to one of the switch's outputs, the output being selected based on the switch's control data.
[0033] The control data for a switch may be transmitted by a control signal provided to the switch input, or it may be stored in a register. For the sake of simplifying the diagram and facilitating understanding, the control data for the switch will be described as being provided by a control signal. However, in all embodiments, a register may be used instead of a control signal to provide the control data.
[0034] Processing elements and switches are interconnected to form data paths between the inputs and outputs of a processing unit. Each data path may have one or more inputs and one output. A data path includes one or more switches and one or more PEs, as well as connections between these elements. The connections between the elements of a data path (i.e., switches and PEs) connect the one or more inputs of the data path to the output of the data path.
[0035] The data path provides the final mathematical result at a corresponding output of the processing unit. The final mathematical result is based on the intermediate mathematical results generated by one or more processing elements in the data path under consideration.
[0036] Figures 1A and 1B illustrate the principle of the behavior of the switches used in the example processing unit 100.
[0037] Figures 1A and 1B show a simplified example of a switch 110 having one input 121 and two outputs 131 and 132.
[0038] The switch 110 operates based on control signals 140 for each time slot. The control signals 140 provide control data to the switch and determine the switch's behavior.
[0039] In this simplified example, which includes only one input 121 and two outputs 131 and 132, the control signal 140 can provide a binary control data value to select which of the two outputs 131 and 132 provides the input data received at input 121. The control data value is used to select, on an input-by-input basis, which output provides the input data received at the input under consideration.
[0040] This simplified example can be generalized to any number of inputs and outputs. There may be more inputs and more outputs. For example, a switch with NI=2 inputs and NO=3 outputs may be used, with control signals adapted to provide control data that identifies the input for each of the switch's outputs.
[0041] In the configuration shown in Figure 1A, the control signal 140 of switch 110 can transmit a binary value "1" so that the input value received by input 121 is provided to output 132. In contrast, in the configuration shown in Figure 1B, the control signal 140 of switch 110 can transmit a binary value "0" so that the input value received by input 121 is provided to the other output 131.
[0042] In some example embodiments, the switch 110 may be configured to provide neutral data having a neutral value 150 to at least one of the outputs of the switch 110 to which no input data is provided. The output to which the neutral value is provided is selected based on control data provided to the switch (e.g., by a control signal or register).
[0043] The neutral value 150 that will be provided at the switch output may be fixed or dynamically changed. The neutral value may be fixed and pre-configured in the switch hardware or dynamically configured at runtime (for example, using configuration data values provided to the switch). The neutral value may be received by the switch, for example, at a specific input 151 of the switch (for example, at an input 151 reserved for configuring the neutral value).
[0044] Neutral data with a neutral value 150 provided by switch 110 in a data path is a neutral value for the final mathematical result produced by this data path. When a neutral value passes through one or more parts of the data path, it "deactivates" those parts of the data path.
[0045] The neutral value provided by the switch output to the input of the PE applying the mathematical operation is equal to the identity element for the mathematical operation, and therefore the neutral value has no effect on the mathematical result produced by the PE applying the mathematical operation. For example:
- The identity element of addition is the value "0".
- The identity element of multiplication is the value "1".
- The identity element of the logical function "OR" is the value "0".
- etc.
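As an illustration of the identity-element rule, the following minimal Python sketch models a one-input, two-output switch and checks that a PE fed with one active value and one neutral value returns the active value unchanged. The function names, the dictionary layout, and the test values are assumptions made for the example; only the identity elements and the "1 selects the right output, 0 selects the left output" convention come from the text.

```python
import operator

# Identity elements listed in the text; the table layout is illustrative.
IDENTITY = {"add": 0, "mul": 1, "or": 0}
OPS = {"add": operator.add, "mul": operator.mul, "or": operator.or_}


def switch(value, control, neutral):
    """One-input, two-output switch: returns (left_output, right_output).

    control == 1 routes the input to the right output, control == 0 to the
    left output; the other output carries the neutral value.
    """
    return (neutral, value) if control == 1 else (value, neutral)


def check_neutral(op_name, active):
    """Feed a PE one active value and one neutral value; return the PE result."""
    neutral = IDENTITY[op_name]
    _, right_a = switch(active, 1, neutral)  # active data is routed to the PE
    _, right_b = switch(123, 0, neutral)     # this switch sends the neutral value to the PE
    return OPS[op_name](right_a, right_b)


for name in OPS:
    print(name, check_neutral(name, 42))  # each line ends with 42
```

In each case the second operand is the identity element of the operation, so the PE output equals the active input.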
[0046] To distinguish neutral data from other data used in the processing unit, the other data will be referred to as “active data” in this specification.
[0047] Similarly, the inputs of a switch that receive a neutral value are designated as "inactive inputs," while the other inputs are called "active inputs." Furthermore, each output of a switch that provides a neutral value is called an "inactive output," while the other outputs are called "active outputs."
[0048] Similarly, an "active PE" can be defined as a PE that receives at least one item of active data. Conversely, an "inactive PE" can be defined as a PE that receives only neutral values as input.
[0049] An "active data path" can be defined as a data path in which each processing element receives at least one item of active data. An "inactive data path" can be defined as a data path in which each processing element receives only neutral values as input, or a combination of inputs that produces a neutral value for the final mathematical result of the data path or for subsequent processing elements. Thus, active data paths provide useful results at the outputs of the processing unit, while the results of inactive data paths may be unused and discarded.
[0050] The fact that an item of data is active data does not prevent its value, as received or provided by the switch, from being equal to the neutral value configured for this switch in some cases, for example depending on the use case and on the values of the input data of the processing unit.
[0051] In all diagrams representing the structure of a processing unit, active data (e.g., inputs to the processing unit) is represented by black-filled boxes, while inactive data is represented by white-filled boxes. Similarly, active PEs are represented by black-filled shapes (e.g., circles representing adders, OR gates, etc.), while inactive PEs are represented by white-filled shapes (e.g., circles representing adders, OR gates, etc.). Furthermore, the output of an active data path is represented by a black-filled box, while the output of an inactive data path is represented by a white-filled box. Connections between switches and PEs, through which neutral values are transmitted, are represented by dotted lines, while other connections are represented by ordinary lines.
[0052] Returning to Figure 1A, the active data 160 received at input 121 of switch 110 is provided at output 132 of switch 110; therefore, input 121 is the active input and output 132 is the active output of switch 110. Since a neutral value is provided at output 131, output 131 is the inactive output of switch 110. Figure 1B shows a different configuration of the switch: the active data 160 received at input 121 is provided at output 131, so input 121 is the active input and output 131 is the active output of switch 110. Since a neutral value is provided at output 132, output 132 is the inactive output of switch 110 in this configuration.
[0053] In the subsequent Figures 2-3 and 4A-4D:
- The switches are labeled si, where i is the sequence number.
- The PEs are labeled pi, where i is the sequence number.
- The inputs of the processing unit are labeled vi, where i is the sequence number.
- The outputs of the processing unit are labeled oi, where i is the sequence number.
[0054] By convention, as is also used in Figures 1A-1B and subsequently in Figures 2-3 and 4A-4D, switches that provide the active input to the "right" output 132 (e.g., switches operated by a control data value of "1") are represented by a solid black shape, while switches that provide the active input to the "left" output 131 (e.g., switches operated by a control data value of "0") are represented by a solid white shape.
[0055] Furthermore, the switch's inactive input 151, if present, is assumed to be included and is no longer shown for simplification. Thus, the switch has one visible active input, one active output, and one inactive output.
[0056] Figure 2 shows an example of a processing unit 200.
[0057] This is an example of a processing unit having two inputs v0 and v1, a two-input adder p0, three switches s0 to s2, and four outputs o0 to o3. This example demonstrates how switches can be used when calculating the addition of the two inputs v1 and v0.
[0058] In this specific example, switch s0 (black) operates based on a control data value (e.g., "1"), and therefore the active input of switch s0 from v0 is provided to the right output connected to adder p0 (i.e., output 132 in the example in Figures 1A and 1B), and the neutral value "0" is provided to the left output connected to output o0 of the processing unit (i.e., output 131 in the example in Figures 1A and 1B). Thus, the right output is the active output of switch s0, and the left output is the inactive output of switch s0.
[0059] Switch s1 (white) operates based on a control data value (e.g., "0"). Therefore, the active input of switch s1 from v1 is provided to the left output (i.e., output 131 in the example in Figures 1A and 1B) connected to the output o3 of the processing unit, and the neutral value "0" is provided to the right output (i.e., output 132 in the example in Figures 1A and 1B) connected to the adder p0. Thus, the left output is the active output of switch s1, and the right output is the inactive output of switch s1.
[0060] In this configuration, adder p0 performs the addition of v0 and 0 to obtain an output value equal to v0. This also shows why zero is used as the neutral value: it yields the correct result when an active data path and an inactive data path meet at a PE such as adder p0.
[0061] The output of adder p0 is supplied to the input of switch s2. Switch s2 operates based on a control data value (e.g., "1"), and therefore the data input of switch s2 is supplied to output o1, and the neutral value "0" is supplied to output o2.
[0062] In this example, input data from v0 is provided to output o1 through adder p0, input data from v1 is provided to output o3 through switch s1, while outputs o0 and o2 receive zero from switches s0 and s2, respectively.
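The behavior of processing unit 200 described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than the patented implementation: the function names are hypothetical, the neutral value is fixed to 0, and switch s2 is modeled so that control value 1 routes the adder result to o1 and control value 0 routes it to o2, as the text states.

```python
def switch(value, control, neutral=0):
    """Return (left, right): control 1 routes the input right, 0 routes it left."""
    return (neutral, value) if control == 1 else (value, neutral)


def processing_unit_200(v0, v1, c0, c1, c2):
    """Return (o0, o1, o2, o3) for control values c0..c2 of switches s0..s2."""
    o0, s0_right = switch(v0, c0)  # s0: left output -> o0, right output -> adder p0
    o3, s1_right = switch(v1, c1)  # s1: left output -> o3, right output -> adder p0
    p0 = s0_right + s1_right       # adder p0
    o2, o1 = switch(p0, c2)        # s2: control 1 -> o1, control 0 -> o2
    return o0, o1, o2, o3


# Configuration described in the text: s0 = 1, s1 = 0, s2 = 1.
print(processing_unit_200(5, 7, 1, 0, 1))  # (0, 5, 0, 7)
# Setting s1 = 1 as well makes adder p0 compute v0 + v1.
print(processing_unit_200(5, 7, 1, 1, 1))  # (0, 12, 0, 0)
```

The first call matches the walkthrough: v0 reaches o1 through adder p0, v1 reaches o3 directly, and o0 and o2 carry zeros.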
[0063] Different switch configurations result in different combinations of inputs at the outputs. In the example in Figure 2, each output o0 to o3 has a single possible input, and therefore no conflict occurs with respect to the four outputs.
[0064] Figure 3 shows an example of a processing unit 300.
[0065] The example in Figure 3 is a variation of Figure 2 in which the number of outputs is reduced from four to two by means of two bitwise "OR" gates g0 and g1. Gates g0 and g1 are drawn in black because, when the switches are properly configured, at least one input of each gate carries active data.
[0066] Instead of using a multiplexer to select the active output between the two outputs o0 and o1 of Figure 2, an "OR" gate g0 is used in Figure 3 to combine the active output provided by switch s2 with the neutral value from switch s0. Therefore, only one output o0 is needed to obtain the active output. The same applies to the two outputs o2 and o3 and to gate g1.
[0067] A multiplexer may be replaced with a simple bitwise OR gate when it is guaranteed that at most one of the different data paths entering the output of the processing unit is sending active data, and the other data paths are sending only inactive data (e.g., zeros). This replacement of the multiplexer with a bitwise "OR" gate further simplifies the SIMDIM architecture.
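The equivalence between a multiplexer and a bitwise OR gate under this guarantee can be checked with a short Python sketch. The function names and values are illustrative, and the sketch assumes non-negative integers whose neutral value is the all-zero bit pattern.

```python
def mux(a, b, select):
    """Conventional two-input multiplexer: returns b if select is set, else a."""
    return b if select else a


def or_merge(a, b):
    """Bitwise OR merge of two data paths."""
    return a | b


active = 0b1011

# Path a is active, path b carries the neutral value 0.
assert or_merge(active, 0) == mux(active, 0, select=0)
# Path b is active, path a carries the neutral value 0.
assert or_merge(0, active) == mux(0, active, select=1)

# Counter-example: if both paths carry active data, the OR gate mixes their bits.
print(bin(or_merge(0b1010, 0b0101)))  # 0b1111, equal to neither input
```

The counter-example shows why the guarantee matters: the OR gate only behaves like a multiplexer when every path except one delivers zeros.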
[0068] Figures 4A to 4D show various configurations of a processing circuit device according to an example.
[0069] To create a broader example of a processing unit using the SIMDIM architecture, all the concepts explained with reference to Figures 1A through 3 are applied in Figures 4A through 4D.
[0070] Figures 4A to 4D show distinct configurations of an N-input SIMDIM adder tree. Switches and connections (paths) are used in these examples to facilitate maximum flexibility at runtime. For a single instruction, as described below, this SIMDIM adder tree can support a number of different computation algorithms that compute results in different dimensions.
[0071] The SIMDIM adder tree 400 has N=8 inputs and N=8 outputs. The SIMDIM adder tree 400 includes nine interconnected levels:
- First level (tree root level): eight inputs v0 to v7 that receive their respective input data.
- Second level: eight switches s0 to s7. Each of these switches receives input data from the corresponding input among the inputs v0 to v7 of the processing unit.
- Third level: four adders p0 to p3. Each of these adders receives as input the output data from the right outputs of two corresponding switches of the second level (for example, adder p2 receives the output data from the right outputs of switches s4 and s5).
- Fourth level: four switches s8 to s11. Each of these switches receives as input the output data from the corresponding adder among the adders p0 to p3 of the third level (for example, switch s9 receives the output data from adder p1).
- Fifth level: two adders p4 and p5. Each of these adders receives as input the output data from the right outputs of two corresponding switches of the fourth level (for example, adder p4 receives the output data from the right outputs of switches s8 and s9).
- Sixth level: two switches s12 and s13. Each of these switches receives as input the output data from the corresponding adder among the adders p4 and p5 of the fifth level (for example, switch s12 receives the output data from adder p4).
- Seventh level: one adder p6, which receives as input the output data from the right outputs of the two switches s12 and s13 of the sixth level.
- Eighth level: seven bitwise "OR" gates g0 to g6. Each of gates g0 to g2 and g4 to g6 receives as input the output data from the left outputs of two corresponding switches from different levels (for example, gate g0 receives the output data from the left outputs of switches s0 and s8), and gate g3 receives as input the output data from the left output of switch s3 and the output of adder p6.
- Ninth level (leaf level): eight outputs o0 to o7. Each of the outputs o0 to o6 receives output data from the corresponding gate among the gates g0 to g6, and output o7 receives output data from switch s7.
[0072] The SIMDIM adder tree 400 provides a data path for each output. The data paths are as follows:
- A single-input data path DP7 leads to output o7. This data path includes v7 and s7 and provides a mathematical result at output o7.
- A two-input data path DP6 leads to output o6. This data path includes v7, v6, s7, s6, p3, s11, and g6, and provides a mathematical result at output o6.
- A four-input data path DP5 leads to output o5. This data path includes v7 to v4, s7 to s4, p3, p2, s11, s10, p5, s13, and g5, and provides a mathematical result at output o5.
- A two-input data path DP4 leads to output o4. This data path includes v5, v4, s5, s4, p2, s10, and g4, and provides a mathematical result at output o4.
- An eight-input data path DP3 leads to output o3. This data path includes v7 to v0, s7 to s0, p3 to p0, s11 to s8, p4 and p5, s12 and s13, p6, and g3, and provides a mathematical result at output o3.
- A two-input data path DP2 leads to output o2. This data path includes v3, v2, s3, s2, p1, s9, and g2, and provides a mathematical result at output o2.
- A four-input data path DP1 leads to output o1. This data path includes v3 to v0, s3 to s0, p1, p0, s9, s8, p4, s12, and g1, and provides a mathematical result at output o1.
- A two-input data path DP0 leads to output o0. This data path includes v1, v0, s1, s0, p0, s8, and g0, and provides a mathematical result at output o0.
[0073] In this SIMDIM adder tree, there is no need to place a switch after the last adder p6; the output of adder p6 is wired directly to an input of gate g3.
[0074] By configuring switches on these data paths, neutral values may propagate along portions of these data paths, so that these portions of the data path become inactive while others remain active. One or more active portions of a data path can form an active data path if these active portions connect at least one input of a processing unit to at least one output of a processing unit.
[0075] Propagating neutral values along one or more parts of a data path thus provides the flexibility to activate (or deactivate) those parts of the data path.
[0076] Figure 4A shows a first configuration of a SIMDIM adder tree that uses the active data path DP3 to compute the sum of eight inputs, e.g., o3 = v7 + v6 + v5 + v4 + v3 + v2 + v1 + v0, in one instruction and within a single time slot, at a single active output o3.
[0077] In this example, the eight-input data path is used, including v7 to v0, s7 to s0, p3 to p0, s11 to s8, p4 and p5, s12 and s13, p6, and g3, which provides a mathematical result at output o3. In this data path, all adders are active elements. All switches are configured with the value "1", meaning that the active data received by each switch proceeds to the next adder, while the neutral value is provided at the switch's other (inactive) output.
[0078] At gate g3, the output of adder p6 undergoes a bitwise OR operation with the neutral value (zero) coming from switch s3, which is an inactive path in this configuration. Therefore, the result at output o3 is the correct sum of all eight inputs v0 to v7.
[0079] Each of the other outputs o0~o2 and o4~o7 is an inactive output and provides an output value equal to zero that can be discarded. The respective data paths DP0~DP2 and DP4~DP7 leading to these outputs are inactive data paths.
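The first configuration can be checked with a small software model of the adder tree. The sketch below is an illustration under stated assumptions: the function names are hypothetical, inputs are non-negative integers, the neutral value is 0, and the pairing of each gate gi with the left output of switch si (beyond the pairs stated explicitly for g0 and g3) is inferred from the data path descriptions.

```python
def switch(value, control):
    """Return (left, right); the unselected output carries the neutral value 0."""
    return (0, value) if control == 1 else (value, 0)


def simdim_adder_tree(v, ctrl):
    """Model of the 8-input tree: v[0..7] are inputs, ctrl[0..13] drive s0..s13."""
    left, right = {}, {}
    for i in range(8):                       # level 2: switches s0..s7
        left[i], right[i] = switch(v[i], ctrl[i])
    p = [right[2 * k] + right[2 * k + 1] for k in range(4)]  # level 3: p0..p3
    for k in range(4):                       # level 4: switches s8..s11
        left[8 + k], right[8 + k] = switch(p[k], ctrl[8 + k])
    p4 = right[8] + right[9]                 # level 5: adders p4 and p5
    p5 = right[10] + right[11]
    left[12], right[12] = switch(p4, ctrl[12])  # level 6: switches s12 and s13
    left[13], right[13] = switch(p5, ctrl[13])
    p6 = right[12] + right[13]               # level 7: adder p6
    return [                                 # levels 8 and 9: OR gates and outputs
        left[0] | left[8],    # g0 -> o0
        left[1] | left[12],   # g1 -> o1
        left[2] | left[9],    # g2 -> o2
        left[3] | p6,         # g3 -> o3
        left[4] | left[10],   # g4 -> o4
        left[5] | left[13],   # g5 -> o5
        left[6] | left[11],   # g6 -> o6
        left[7],              # o7 is fed directly by s7
    ]


v = [1, 2, 3, 4, 5, 6, 7, 8]
print(simdim_adder_tree(v, [1] * 14))  # [0, 0, 0, 36, 0, 0, 0, 0]
```

With all fourteen control values set to 1, only o3 carries a non-zero result, namely the sum of all eight inputs; every other output receives the neutral value.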
[0080] Figure 4B shows a second configuration of the SIMDIM adder tree that performs the following calculations simultaneously:
- The sum of four inputs at one active output o5, using the active data path DP5 (o5 = v7 + v6 + v5 + v4).
- The sum of the four other inputs at another active output o1, using the active data path DP1 (o1 = v3 + v2 + v1 + v0).
[0081] These sums can be calculated in a single instruction and within a single time slot.
[0082] In the example in Figure 4B, switches s0 to s11 are configured with the control data value "1", while switches s12 and s13 are configured with the control data value "0", and all others remain as in Figure 4A. This means that up to the sixth level, including switches s12 and s13, there is a normal adder tree, as in the example in Figure 4A.
[0083] In Figure 4B, switch s12 is configured with a control data value of "0," and therefore the active input data of switch s12 is provided to output o1 through gate g1, and the neutral value (zero) provided by s12 proceeds to adder p6.
[0084] Similarly, switch s13 is configured with a control data value of "0," and therefore the active input data of switch s13 is provided to output o5 through gate g5, and the neutral value (zero) from s13 is provided to adder p6.
[0085] This means that both inputs to adder p6 are neutral (zero), and therefore the result of this addition performed by adder p6 is always zero in this configuration. Thus, by configuring the control data value "0" on switches s12 and s13, the last part of the 8-input data path DP3 leading to output o3 (from adder p6 to output o3) is deactivated. Instead, all parts of data paths DP1 and DP5 leading to outputs o1 and o5 respectively are active and therefore become active data paths.
[0086] Apart from the inputs of OR gates g1 and g5 that are connected to the activated outputs (the left outputs in this case) of switches s12 and s13, which provide active data to g1 and g5 respectively, all inputs of OR gates g0 through g6 receive neutral values (zero).
[0087] The bitwise OR gates ensure that each output of the SIMDIM adder tree carries the correct result.
[0088] Figure 4C shows a third configuration of the same SIMDIM adder tree that allows the following to be computed simultaneously:
- Using the active data path DP6, the sum of two inputs at one active output: o6 = v7 + v6
- Using the active data path DP4, the sum of two inputs at one active output: o4 = v5 + v4
- Using the active data path DP2, the sum of two inputs at one active output: o2 = v3 + v2
- Using the active data path DP0, the sum of two inputs at one active output: o0 = v1 + v0
[0089] These sums can be calculated in a single instruction and within a single time slot.
[0090] In this example, switches s0 to s7 are configured with the control data value "1". Switches s8 to s11 are configured with the control data value "0", which pushes the inputs of switches s8 to s11 directly to outputs o0, o2, o4, and o6, and the neutral value (zero) to adders p4 and p5, making adders p4 and p5 inactive PEs in this configuration. As a result, the portions of the data paths DP1, DP3, and DP5 to which adders p4 and p5 belong are inactive from adders p4 and p5 to their respective outputs.
[0091] In this example, the configuration of switches s12 and s13 is not important because they receive neutral values (zero) from adders p4 and p5 respectively, which are inactive PEs. Switches s12 and s13 therefore provide neutral values (zero) on all of their outputs, which are then provided to adder p6 (which is also an inactive PE in this configuration) and to outputs o1 and o5 via gates g1 and g5, respectively. The inactive adder p6 similarly provides a neutral value (zero) to output o3. Thus, outputs o1, o3, o5, and o7 are inactive outputs that receive only neutral values (zero).
[0092] Figure 4D shows a fourth (asymmetric) configuration of a SIMDIM adder tree that simultaneously computes the following:
- Using the active data path DP6, the sum of two inputs at one active output: o6 = v7 + v6
- Using the active data path DP4, the sum of two inputs at one active output: o4 = v5 + v4
- Using the active data path DP3, a single input passed to one active output: o3 = v3
- Using the active data path DP1, the sum of three inputs at one active output: o1 = v2 + v1 + v0
[0093] These sums can be calculated in a single instruction and within a single time slot.
[0094] In this use case, switches s0-s2 and s4-s9 are configured with a control data value of "1". Switches s3 and s10-s12 are configured with a control data value of "0", and therefore, not only gates g0, g2, and g5 but also adders p5 and p6 are inactive PEs. As in Figure 4C, switch s13 may be configured with a control data value of "0" or "1" without affecting the mathematical result.
[0095] The example in Figure 4D illustrates the importance of passing a neutral value (zero) on a data path to "deactivate" a portion of the data path. Comparing Figure 4A and Figure 4D: in Figure 4D, the data path DP3 is almost completely deactivated, except for the portion that connects v3 to gate g3 via switch s3, whereas in Figure 4A it is precisely this portion, connecting v3 to gate g3 via the inactive output of switch s3, that is deactivated.
[0096] In the example in Figure 4D, there is no switch after adder p6, meaning the result can freely proceed through the gate to output o3. Since both inputs to adder p6 from s12 and s13 are zero, the output of adder p6 is also zero. Therefore, it is safe to perform a bitwise OR operation with these zeros and the value v3 to provide the correct result (v3) to output o3.
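The four configurations of Figures 4A to 4D can be checked with a small behavioural model. The following Python sketch is illustrative only and is not the hardware described in this document. It assumes that each switch has a forward output (towards the next adder) and a bypass output (towards an output gate), that control value "1" selects the forward output, and that the bypass output of each first-level switch s_i feeds output o_i. The text confirms the last point only for s3 and g3; for the other first-level switches it is an assumption. The names switch, simdim_adder_tree, and config are hypothetical.

```python
from typing import List, Sequence, Tuple

NEUTRAL = 0  # identity element of addition, as used for the adder tree in the text


def switch(value: int, ctrl: int) -> Tuple[int, int]:
    """Return (forward, bypass): ctrl=1 sends the value forward, ctrl=0 bypasses it."""
    return (value, NEUTRAL) if ctrl == 1 else (NEUTRAL, value)


def simdim_adder_tree(v: Sequence[int], ctrl: Sequence[int]) -> List[int]:
    """Model of the 8-input tree: v = [v0..v7], ctrl = [s0..s13], returns [o0..o7]."""
    assert len(v) == 8 and len(ctrl) == 14
    # Switches s0..s7 sit on the inputs v0..v7.
    fwd1, byp1 = zip(*(switch(v[i], ctrl[i]) for i in range(8)))
    # Adders p0..p3 each add one pair of forward outputs.
    p = [fwd1[2 * i] + fwd1[2 * i + 1] for i in range(4)]
    # Switches s8..s11 sit behind adders p0..p3.
    fwd2, byp2 = zip(*(switch(p[i], ctrl[8 + i]) for i in range(4)))
    p4 = fwd2[0] + fwd2[1]
    p5 = fwd2[2] + fwd2[3]
    # Switches s12 and s13 sit behind adders p4 and p5.
    f12, b12 = switch(p4, ctrl[12])
    f13, b13 = switch(p5, ctrl[13])
    p6 = f12 + f13
    # Higher-level source of each output: s8->o0, s12->o1, s9->o2, p6->o3,
    # s10->o4, s13->o5, s11->o6; o7 has no higher-level source (assumption).
    upper = [byp2[0], b12, byp2[1], p6, byp2[2], b13, byp2[3], NEUTRAL]
    # Bitwise OR gates merge the two candidate sources of each output.
    return [byp1[i] | upper[i] for i in range(8)]


def config(zeros: Sequence[int] = ()) -> List[int]:
    """All switches set to 1, except the listed switch indices, which are set to 0."""
    ctrl = [1] * 14
    for i in zeros:
        ctrl[i] = 0
    return ctrl


FIG_4A = config()
FIG_4B = config((12, 13))
FIG_4C = config((8, 9, 10, 11))
FIG_4D = config((3, 10, 11, 12))
```

With v0 to v7 set to 1 to 8, simdim_adder_tree(v, FIG_4D) returns [0, 6, 0, 4, 11, 0, 15, 0], i.e., o1 = v2 + v1 + v0, o3 = v3, o4 = v5 + v4, and o6 = v7 + v6, with zero on every inactive output.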
[0097] The examples in Figures 4A to 4D can be generalized to any number of inputs, any number of outputs, any number of levels, and any type of interconnection in a tree formed by interconnecting processing elements with switches. The number of data elements that can be combined along a data path depends on the number of processing elements in the data path and how the processing elements are connected.
[0098] The examples in Figures 4A to 4D can be generalized to any type of processing element. For example, some or all of the adders in the processing units in Figures 4A to 4D may be replaced with bitwise XOR gates applicable to bytes, words, or vectors of any length. The neutral value is configured according to the type of mathematical operation performed by the processing element to which the switch output provides the neutral value.
[0099] Where appropriate, values other than zero may be used for inactive paths. For instance, in the SIMDIM architecture discussed so far, the switch provides zero as the neutral value for an adder. Alternatively, "1" may be provided as the neutral value for a multiplier. In that case an AND gate is used at the output instead of an OR gate, and the inactive path should therefore provide all 1s so as not to affect the mathematical result at the output of the data path.
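The role of the neutral value can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not part of the described hardware: the table NEUTRAL, the function combine, and the 8-bit word width are all hypothetical. The entries for addition, multiplication, and AND follow the text; the entries for XOR and OR are the standard identity elements of those operations.

```python
from functools import reduce
import operator

WIDTH = 8                     # assumed word width, for illustration only
ALL_ONES = (1 << WIDTH) - 1   # all-ones pattern, neutral for bitwise AND

# Operation name -> (binary operation, neutral value / identity element)
NEUTRAL = {
    "add": (operator.add, 0),
    "xor": (operator.xor, 0),
    "or": (operator.or_, 0),
    "mul": (operator.mul, 1),
    "and": (operator.and_, ALL_ONES),
}


def combine(op_name, active, n_inactive):
    """Combine active values with n_inactive neutral values from inactive paths."""
    op, neutral = NEUTRAL[op_name]
    return reduce(op, list(active) + [neutral] * n_inactive)
```

For every operation in the table, combine(name, [5, 3], 4) gives the same result as combining 5 and 3 alone, which is the property an inactive path relies on.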
[0100] In some embodiments, switch control data may be grouped, if the use case allows. While each switch may actually be controlled independently, other patterns may be used in some designs. For example, a single control data value may be used for a group of switches operated by the same control signal (or the same register). For instance, there may be one control signal (or one register) per layer in the tree, meaning that switches s0-s7 can be controlled by the same control signal (or the same register), and similarly for switches s8-s11 and s12-s13, respectively. This may limit flexibility but can simplify the hardware and / or programmer's tasks.
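As a rough illustration of per-layer grouping, the following Python sketch expands one control value per layer into the fourteen per-switch values of the eight-input tree. The layer-to-switch mapping (s0-s7, s8-s11, s12-s13) follows the paragraph above; the names LAYERS and expand_layer_control are hypothetical.

```python
# Switch indices controlled by each layer signal, per the grouping in the text.
LAYERS = (range(0, 8), range(8, 12), range(12, 14))


def expand_layer_control(layer_bits):
    """Expand (layer0, layer1, layer2) control values into [s0..s13]."""
    assert len(layer_bits) == len(LAYERS)
    ctrl = [1] * 14
    for switches, bit in zip(LAYERS, layer_bits):
        for s in switches:
            ctrl[s] = bit
    return ctrl
```

Under this grouping, (1, 1, 1) yields the configuration of Figure 4A, (1, 1, 0) that of Figure 4B, and (1, 0, 1) that of Figure 4C. The asymmetric configuration of Figure 4D cannot be expressed, because it sets s3 differently from the other switches of its layer, which illustrates the loss of flexibility mentioned above.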
[0101] The examples in Figures 4A to 4D can be generalized to input and output data of any type (bits, bytes, words, vectors, etc.) and / or length. Each value of the input / output data can be encoded with one or more bits, one or more bytes as a vector, etc. The SIMDIM architecture supports element-wise operations (element i of input vector A and element i of input vector B) and bit-wise operations (bit i of input data A and bit i of input data B). For example, a SIMDIM architecture where all processing elements are bit-wise XOR can provide flexibility and very fast operations for aggregating some or all data elements from one or more vectors into data elements (e.g., bits).
[0102] A single instruction field may suffice to make use of the SIMDIM architecture in a processor. This can reduce the computation time from several clock cycles to a single clock cycle and results in much simpler programs. Bitwise operations are extremely fast, allowing many, though not all, levels of the SIMDIM architecture to be combined in the same clock cycle, thus greatly speeding up bit aggregation. The overhead of the SIMDIM architecture due to the use of switches (and optional XOR gates on the outputs) is not significant at the processor scale.
[0103] The SIMDIM architecture is also well-suited to wiring-based implementations, which can provide the much-needed flexibility.
[0104] The SIMDIM architecture efficiently utilizes wide and very wide data path resources (tens, hundreds, or thousands of inputs) for all kinds and dimensions of vector accumulation, matrix multiplication, matrix inversion, convolution, FIR filtering, bitwise operations, etc., with extremely low latency, while facilitating the mapping of algorithms onto the available hardware for a wider range of use cases.
[0105] Furthermore, input data may be aggregated into vectors of different dimensions (for example, two vectors may be mapped to inputs v0-v7, with the first vector mapped to inputs v0-v1 and the second vector to inputs v2-v7), and provided to symmetric or asymmetric processing units by the SIMDIM architecture. This may be configured by software at runtime, providing another level of flexibility. The SIMDIM architecture greatly improves flexibility, software simplicity, and the efficiency of broader data paths.
[0106] Although bitwise operations themselves are very fast, many processors require a sequence of operations like the one described in Figure 3 to apply a bitwise XOR (or any other bitwise operation) across all bits in a vector. Such a reduction may take O(log2(M)) instructions, whereas a shortened, efficient operation can take one or two clock cycles and a single instruction, as will be discussed later in this document.
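The logarithmic cost can be made concrete with a conventional software reduction. The Python sketch below folds an M-bit word onto itself with shift and XOR steps and counts the steps. It is a common technique shown for illustration only and is not necessarily the sequence of Figure 3. It assumes that M denotes the number of bits and is a power of two; the function name parity_fold is hypothetical.

```python
from typing import Tuple


def parity_fold(x: int, m: int) -> Tuple[int, int]:
    """XOR all m bits of x together; return (result bit, number of fold steps)."""
    assert m > 0 and m & (m - 1) == 0, "m is assumed to be a power of two"
    steps = 0
    width = m
    while width > 1:
        width //= 2
        # Fold the upper half onto the lower half, then keep the lower half.
        x = (x ^ (x >> width)) & ((1 << width) - 1)
        steps += 1
    return x & 1, steps
```

For an 8-bit word the loop runs three times, i.e., log2(8) dependent steps, whereas a tree of XOR processing elements would combine the same bits in a single pass through the data path.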
[0107] This approach, which involves adding switches to guide active data and neutral values through data paths in a tree, is applicable to more complex trees (or simpler trees) and data paths of any type or length.
[0108] These switches can be programmed via software on a per-clock-cycle basis if necessary. These switches may be configured to determine where and how they provide their input data to their outputs, i.e., whether to provide it to the left output 131 or the right output 132, as in the examples discussed herein.
[0109] This flexibility allows very wide data vectors of variable dimensions, or numerous smaller vectors, to be fed into the inputs of the SIMDIM architecture at runtime. The configuration of the processing unit can change at every clock cycle, allowing the SIMDIM architecture to reuse the processing unit's hardware resources over time in various ways.
[0110] When the SIMDIM architecture is applied to an adder tree, it allows, for example, many values to be accumulated in different combinations with lower latency. The use of switches, as presented here, also greatly simplifies the mapping of an algorithm to hardware resources, whereas in the state of the art this task can be significantly more complex, less flexible, and require different types and numbers of instructions.
[0111] The proposed SIMDIM architecture can mitigate the dark silicon problem (i.e., unused, wasted silicon) in wide and very wide data paths by flexibly and efficiently using available resources and by reducing the energy used by different computations.
[0112] The SIMDIM architecture is typically faster (owing to its width and parallelism), and can therefore enter a sleep / power-off mode or be used for other calculations while waiting for the data of the next primary use case.
[0113] Figure 5 shows a framework for controlling a processing unit according to an example.
[0114] The processor 550 may include one or more processing units 551 based on a SIMDIM architecture. For simplicity, only one processing unit is shown in Figure 5. The connections between the switches and the processing elements are implemented in the hardware of the processing unit 551. The neutral value may be pre-configured in the switch hardware or configured dynamically at runtime.
[0115] A compiler 520 adapted for a processor 550 including one or more processing units 551 according to a SIMDIM architecture may be configured to compile program code 510 to generate instructions 530 to be processed by the processor 550. The compiler 520 is configured to generate instructions 530 for generating control data 560 (which will be converted to, for example, control signals for the switches or stored in registers) for one or more switches in the processing unit 551. The control data 560 may be generated by the processor's controller 555. As described herein, the control data for the switches is used to select, on an input-by-input basis, the output to which the input data received at the input under consideration will be provided.
[0116] When neutral values are not pre-configured in the switch hardware, the compiler 520 may be configured to compile the program code 510 to generate instructions 530 for configuring the neutral values that will be provided by one or more switches in the processing unit 551.
[0117] Instructions relating to the neutral value to be used may be converted, for example, by the controller 555 to a configuration data value provided to the processing unit 551 (for example, provided to the switch by a configuration signal or a register).
[0118] The same signal (or the same register) may be used for a given switch to provide configuration data values for controlling the neutral value that will be used by the switch, and to provide control data values for selecting each output for each input of the switch.
[0119] The compiler may be configured to ensure that a processing element is not allocated to two or more instructions within the same time slot by checking for possible conflicts with other instructions that would simultaneously allocate the same processing element.
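One possible form of such a conflict check is sketched below in Python. It is a minimal illustration under stated assumptions and does not describe the compiler 520 itself: the allocation format (instruction name, time slot, list of processing elements) and the function name find_conflicts are hypothetical.

```python
def find_conflicts(allocations):
    """Return (slot, pe, first_instr, second_instr) for every doubly allocated PE.

    allocations: iterable of (instruction, time_slot, processing_elements).
    """
    owner = {}       # (time_slot, processing_element) -> owning instruction
    conflicts = []
    for instr, slot, pes in allocations:
        for pe in pes:
            key = (slot, pe)
            if key in owner and owner[key] != instr:
                # The same PE is requested by two instructions in one time slot.
                conflicts.append((slot, pe, owner[key], instr))
            else:
                owner[key] = instr
    return conflicts
```

For example, if instructions i0 and i1 both request adder p4 in time slot 0, the function reports a single conflict on p4, while a later use of p4 in time slot 1 is accepted.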
[0120] Each digital data processing unit disclosed herein may be implemented by processing circuitry. Each digital data processing unit disclosed herein may be included in a DSP (Digital Signal Processor), a microprocessor, or any other type of processor. The term “processor” should not be construed as referring exclusively to hardware capable of executing program instructions, and may include, without limitation, one or more processing circuits, whether programmable or not.
[0121] The term "circuitry" may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and / or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable) (i) a combination of analog and / or digital hardware circuits with software / firmware, and (ii) any portions of a hardware processor with software, software, and hardware circuits such as memory that work together with the software; and (c) hardware circuits and / or processors, such as a microprocessor or a portion of a microprocessor, that require software (e.g., firmware) for operation, although the software may not be present when it is not needed for operation.
[0122] As a further example, the term "circuitry" also covers an implementation of merely a hardware circuit or processor (or multiple processors), or of a portion of a hardware circuit or processor, together with its (or their) accompanying software and / or firmware.
[0123] While terms such as "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are used merely to distinguish one element from another. For example, without departing from the scope of this disclosure, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element. As used herein, the term "and / or," when used in a list of items, indicates that the list may include any and all combinations of one or more of the associated listed items.
[0124] The technical terms used herein are for the purpose of describing specific embodiments only and are not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and / or “including,” when used herein, specify the presence of the described features, integers, steps, actions, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, actions, elements, components, and / or groups thereof.
[0125] While the disclosure has been described with reference to specific embodiments, it should be understood that these embodiments are merely illustrative of the principles and uses of the disclosure. Therefore, it should be understood that numerous modifications can be made to the illustrative embodiments, and that other arrangements can be devised without departing from the spirit and scope of the disclosure as determined by the claims and any equivalents thereof.
[0126] List of common abbreviations
[0127] MIMD Multiple Instruction Multiple Data
[0128] MISD Multiple Instruction Single Data
[0129] PE processing element
[0130] SIMD Single Instruction Multiple Data
[0131] SISD Single Instruction Single Data
[Explanation of symbols]
[0132]
100 Processing unit
110 Switch
121 Input
131 Output
132 Output
140 Control signal
150 Neutral value
151 Input
160 Active data
200 Processing unit
300 Processing unit
400 SIMDIM adder tree
510 Program code, software program
520 Compiler
530 Instructions
550 Processor
551 Processing unit
555 Controller
560 Control data
Claims
1. A digital data processing unit comprising: an input for receiving input data; an output for providing output data; one or more processing elements, each processing element having one or more inputs for receiving input data and one or more outputs for providing output data, and each processing element being configured to apply a mathematical operation to the input data of the processing element within a time slot in order to generate output data; and one or more switches, each switch having outputs and at least one input for receiving a respective input value, each switch being operated within a time slot based on control data such that it provides the received input value to one of the outputs of the switch selected based on the control data; wherein the one or more processing elements and the one or more switches are interconnected to form data paths between the input and the output of the digital data processing unit, and wherein at least one of the data paths is configured to provide, at the corresponding output of the processing unit, a final mathematical result from the intermediate mathematical results generated by the one or more processing elements in the data path under consideration.
2. The digital data processing unit according to claim 1, wherein at least one switch is operated within a time slot based on control data such that it provides a neutral value to at least one of the outputs of the switch selected based on the control data, and wherein the neutral value provided by the switch in a data path is a neutral value with respect to the final mathematical result produced by that data path.
3. The digital data processing unit according to claim 2, wherein the neutral value provided by an output of a switch to an input of a processing element applying a mathematical operation is equal to the identity element of that mathematical operation.
4. The digital data processing unit according to any one of claims 1 to 3, wherein the input of a switch is connected to the input of the digital data processing unit or to the output of a processing element.
5. The digital data processing unit according to any one of claims 1 to 4, wherein the output of a switch is connected to the output of the digital data processing unit or to the input of a processing element.
6. The digital data processing unit according to any one of claims 1 to 5, wherein at least one of the data paths comprises at least two processing elements.
7. A digital data processing unit according to any one of claims 1 to 6, wherein at least one of the data paths comprises a first switch having an input connected to the input of the processing unit, a first processing element having an input connected to the output of the first switch, a second switch having an input connected to the output of the first processing element, and a second processing element having an input connected to the output of the second switch.
8. The digital data processing unit according to any one of claims 1 to 7, comprising a bitwise OR gate belonging to a first data path and a second data path, wherein the output of the bitwise OR gate is connected to the output of the processing unit, a first input of the bitwise OR gate is connected to a switch or processing element that provides the mathematical result of the first data path, and a second input of the bitwise OR gate is connected to a switch or processing element that provides the mathematical result of the second data path.
9. The digital data processing unit according to any one of claims 2 to 8, wherein: the input of the digital data processing unit comprises N first inputs for receiving data, where N ≥ 4; the one or more switches comprise N first switches and N / 2 second switches; the one or more processing elements comprise N / 2 first processing elements and N / 4 second processing elements; each of the N first switches has (i) an input connected to a corresponding input among the N first inputs, (ii) a first output providing the input value received at the input of that first switch, and (iii) a second output providing a neutral value; each of the N / 2 first processing elements has (i) an input connected to the output of a first corresponding switch among the N first switches, and (ii) another input connected to the output of a second corresponding switch among the N first switches; each of the N / 2 second switches has (i) an input connected to the output of the corresponding processing element among the N / 2 first processing elements, (ii) a first output providing the input value received at the input of that second switch, and (iii) a second output providing a neutral value; and each of the N / 4 second processing elements has (i) an input connected to the output of a third corresponding switch among the N / 2 second switches, and (ii) another input connected to the output of a fourth corresponding switch among the N / 2 second switches.
10. A processor comprising one or more digital data processing units according to any one of claims 1 to 9.
11. A compiler configured to compile program code and generate instructions to be processed by the processor described in claim 10, wherein the compiler is configured to generate instructions for generating control data for one or more switches in one or more digital data processing units.
12. The compiler according to claim 11, wherein the processor comprises one or more digital data processing units as described in claim 2, and the compiler is configured to generate instructions for configuring at least one of the switches in the one or more digital data processing units to provide the respective neutral values.