Configurable dataflow backend for subword parallel SIMT processor

The configurable dataflow backend in SIMT processors addresses inefficiencies in low-precision and mixed-precision arithmetic by enabling subword parallelism and operand reuse, improving throughput and reducing memory requirements.

WO2026132138A1 · PCT designated stage · Publication Date: 2026-06-25 · INTERUNIVERSITAIR MICRO ELECTRONICS CENT (IMEC VZW)

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
INTERUNIVERSITAIR MICRO ELECTRONICS CENT (IMEC VZW)
Filing Date
2025-12-17
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

SIMT processor architectures face inefficiencies in executing low-precision and mixed-precision arithmetic operations due to mismatched data sizes, limited instruction flexibility, and underutilization of hardware resources, leading to bottlenecks and increased clock cycles.

Method used

A configurable dataflow backend with multiple pipeline stages and functional units that support subword parallelism, allowing flexible execution of mixed-precision operations without dedicated instructions, and enabling operand reuse and precision management across stages.

Benefits of technology

This approach enhances throughput and reduces memory bandwidth requirements, providing efficient execution of low-precision computations with reduced overhead and simplified register file design.

Abstract

A Single Instruction Multiple Threads (SIMT) processor core comprising: (a) a configurable dataflow backend operative to support subword parallelism for execution of mixed-precision arithmetic operations, wherein the backend includes multiple dataflow network stages, each dataflow network stage including one or more functional units configurable to process multiple subwords within a register in parallel; (b) a configuration memory storing configuration data defining operational configurations of the dataflow network stages in the backend, wherein the configuration memory is accessible via a configuration index mechanism operative to select configuration data from the configuration memory to control the dataflow network stages in the backend, and wherein the configuration index mechanism is configured for enabling the dataflow backend to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

Description

[0001] CONFIGURABLE DATAFLOW BACKEND FOR SUBWORD PARALLEL SIMT PROCESSOR

Field of the invention

[0002] The present invention relates to the field of computer processors, and more specifically to Single Instruction Multiple Threads (SIMT) processor cores.

[0003] Background of the invention

[0004] Advancements in compute-intensive applications have heightened the demand for efficient execution of arithmetic operations. Single-Instruction-Multiple-Threads (SIMT) processor architectures are widely employed to achieve parallelism by executing the same instruction across multiple hardware threads. In these architectures, arithmetic operations typically involve reading source operands from a register file and storing the results back into it. The size of the logical word for operands and results is expected to match the width of the register; otherwise, the data is sign-extended to fit the register dimensions. For instance, in a 64-bit register, a 32-bit integer would be sign-extended to occupy the full 64 bits.
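The sign extension described above can be illustrated with a short Python sketch. The function name `sign_extend` is a hypothetical helper introduced here for illustration only; it models the bit pattern that ends up in the wider register, not any particular hardware.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Return the to_bits-wide two's-complement pattern of a from_bits-wide value."""
    value &= (1 << from_bits) - 1            # keep only the source bits
    sign_bit = 1 << (from_bits - 1)
    signed = (value ^ sign_bit) - sign_bit   # reinterpret the pattern as a signed integer
    return signed & ((1 << to_bits) - 1)     # re-encode in the wider register


# The 32-bit pattern for -2 fills the upper 32 bits of a 64-bit register with ones.
print(hex(sign_extend(0xFFFFFFFE, 32, 64)))  # 0xfffffffffffffffe
```

In this model, the upper half of the 64-bit register carries no information beyond the sign, which is the source of the underutilization discussed in the following paragraphs.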

[0005] With the rise of deep learning workloads, there has been a significant shift towards utilizing lower-precision data formats, such as 4-bit or 8-bit integers. These workloads are often resilient to quantization errors, allowing for reduced precision without substantially affecting performance. However, working with low-precision data in SIMT architectures presents several challenges.

[0006] One of the primary issues is the inefficiency in data fetching from memory. Low-precision data, when stored in memory, must often be loaded into registers that have a much larger word size. This mismatch leads to underutilization of the hardware capacity, as only a small portion of the register's capacity is used, resulting in inefficient memory bandwidth utilization. Likewise, each fetch delivers less useful data than it could, since more low-precision values could have been fetched if they were loaded more efficiently.
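As a rough illustration of this mismatch, the following Python sketch compares the number of registers needed to hold 8-bit values when each value occupies a full register versus when values are packed as subwords. The helper names `pack_subwords` and `registers_needed` are assumptions made for this example and do not appear in the application.

```python
from typing import Sequence


def pack_subwords(values: Sequence[int], subword_bits: int, register_bits: int) -> int:
    """Pack small integers into one register word, lowest lane first."""
    lanes = register_bits // subword_bits
    if len(values) > lanes:
        raise ValueError("too many values for one register")
    mask = (1 << subword_bits) - 1
    word = 0
    for lane, value in enumerate(values):
        word |= (value & mask) << (lane * subword_bits)  # two's-complement truncation
    return word


def registers_needed(num_values: int, subword_bits: int, register_bits: int, packed: bool) -> int:
    """Registers required for num_values, with or without subword packing."""
    per_register = register_bits // subword_bits if packed else 1
    return -(-num_values // per_register)  # ceiling division


print(hex(pack_subwords([1, 2, 3, -1], 8, 32)))     # 0xff030201
print(registers_needed(64, 8, 32, packed=False))    # 64
print(registers_needed(64, 8, 32, packed=True))     # 16
```

Under these assumptions, packing four 8-bit values per 32-bit register reduces the register count (and the number of register-wide loads) by a factor of four.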

[0007] Another challenge involves the diverse computational requirements of different deep learning algorithms. Low-precision computations require various arithmetic operations with varying degrees of granularity in low-precision or mixed-precision operations. In SIMT architectures that use a reduced instruction set computing (RISC) instruction set, the relatively small size of the instructions limits the ability to define dedicated instructions for each variation of these operations. This constraint makes it difficult to efficiently support the wide range of operations needed for different algorithms within the existing instruction set architecture.

[0008] Additionally, the frontend of SIMT cores, responsible for instruction fetching and decoding, may not be optimized for handling the complexities introduced by low-precision or mixed-precision computations. Since all threads share the same frontend, there can be bottlenecks when attempting to execute instructions that require specialized handling or increased flexibility in the backend execution units.

[0009] EP2725484 provides a processor for performing parallel processing within a register. An exemplary processor may include a processing element having a compute unit and a register file. The register file includes a register that is divisible into lanes for parallel processing. The processor further includes a mask register and a predicate register. The mask register and the predicate register respectively include a number of mask bits and predicate bits equal to a maximum number of divisible lanes of the register. A state of the mask bits and predicate bits is set to respectively achieve enabling / disabling of the lanes from executing an instruction and conditional performance of an operation defined by the instruction. Further, the processor is operable to perform a reduction operation across the lanes of the processing element and / or generate an address for each of the lanes of the processing element. EP2725484 provides a solution for a single core architecture; implementing the use of mask registers and predicate registers for SIMT based computation would generate extra overhead. The active status of each lane during an instruction execution is decided by the contents of the mask register. This can lead to an increase in the number of clock cycles required to carry out the instruction.

[0010] There is, however, still a need for further advancements in the field to address at least some of the above challenges.

[0011] Summary of the invention

[0012] It is an object of embodiments of the present invention to provide efficient execution of mixed-precision arithmetic operations in a SIMT processor core. This objective is accomplished by the aspects of the present invention.

[0013] In a first aspect, the present invention relates to a Single Instruction Multiple Threads (SIMT) processor core comprising: (a) a configurable dataflow backend operative to support subword parallelism for execution of mixed-precision arithmetic operations, wherein the backend includes multiple dataflow network stages, each dataflow network stage including one or more functional units configurable to process multiple subwords within a register in parallel; (b) a configuration memory storing configuration data defining operational configurations of the dataflow network stages in the backend, wherein the configuration memory is accessible via a configuration index mechanism operative to select configuration data from the configuration memory to control the dataflow network stages in the backend, wherein the configuration index mechanism is configured for enabling the dataflow backend to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation. This enables efficient execution of mixed-precision arithmetic operations.
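The role of the configuration index can be sketched in Python as follows. This is a minimal behavioural model under stated assumptions, not the claimed hardware: `CONFIG_MEMORY`, `dfexec`, the entry layout (operation name, subword width), the single-stage backend, and the unsigned wrap-around lane arithmetic are all hypothetical choices made for illustration.

```python
import operator

# Hypothetical set of operations a functional unit can be configured to perform.
OPS = {"add": operator.add, "mul": operator.mul, "min": min, "max": max}

# Hypothetical configuration memory: each entry is (operation, subword width in bits).
CONFIG_MEMORY = [
    ("add", 8),   # index 0: four 8-bit additions
    ("mul", 8),   # index 1: four 8-bit multiplications
    ("max", 16),  # index 2: two 16-bit maximum operations
]


def unpack(word: int, bits: int, register_bits: int = 32) -> list:
    """Split a register word into unsigned subwords, lowest lane first."""
    mask = (1 << bits) - 1
    return [(word >> shift) & mask for shift in range(0, register_bits, bits)]


def pack(lanes: list, bits: int) -> int:
    """Reassemble subwords into one register word, truncating each lane to `bits`."""
    mask = (1 << bits) - 1
    word = 0
    for index, lane in enumerate(lanes):
        word |= (lane & mask) << (index * bits)
    return word


def dfexec(config_index: int, a: int, b: int, register_bits: int = 32) -> int:
    """One generic instruction: the index, not the opcode, selects the behaviour."""
    op_name, bits = CONFIG_MEMORY[config_index]
    op = OPS[op_name]
    lanes_a = unpack(a, bits, register_bits)
    lanes_b = unpack(b, bits, register_bits)
    return pack([op(x, y) for x, y in zip(lanes_a, lanes_b)], bits)


print(hex(dfexec(0, 0x01020304, 0x10101010)))  # 0x11121314
print(hex(dfexec(2, 0x0001FFFF, 0x00020003)))  # 0x2ffff
```

In this sketch, adding a new variation only requires a new entry in `CONFIG_MEMORY`; the hypothetical `dfexec` instruction itself is unchanged.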

[0014] In embodiments, the dataflow backend may be pipelined, i.e. designed to execute multiple stages of a computation or data processing flow in parallel rather than sequentially. Each stage of the pipeline processes data and passes it on to the next stage as soon as its part of the task is complete. This allows for high throughput execution.
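The throughput benefit of pipelining can be estimated with a simple cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the function names are hypothetical and the numbers are illustrative only.

```python
def cycles_sequential(num_items: int, num_stages: int) -> int:
    """Each item passes through all stages before the next item starts."""
    return num_items * num_stages


def cycles_pipelined(num_items: int, num_stages: int) -> int:
    """A new item enters every cycle once the pipeline has been filled."""
    if num_items == 0:
        return 0
    return num_stages + num_items - 1


print(cycles_sequential(100, 3))  # 300
print(cycles_pipelined(100, 3))   # 102
```

Under these assumptions, a three-stage pipeline approaches one result per cycle for long input streams.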

[0015] In embodiments, the dataflow backend may include multiple pipeline stages. This enables complex dataflow graphs to be implemented. As an example only, the invention not being limited thereto, the dataflow backend may include at least three pipeline stages. This provides a good balance between complexity and performance.

[0016] In embodiments, each functional unit of the dataflow network stage may be configurable to perform at least one of a plurality of arithmetic operations, such as addition, multiplication, minimum, and maximum operations. This allows supporting common arithmetic operations. In embodiments, in a given cycle, each functional unit may be configured to perform a selected arithmetic operation or remain inactive (non-functional). This provides flexibility to map different dataflow graphs.

[0017] In embodiments, the functional units may be configurable via control bits in the configuration data. This allows compact encoding of the dataflow configuration.

[0018] In embodiments, the dataflow backend may be configured to perform at least one of rounding, truncation, fixed-point arithmetic, and normalization to manage precision across pipeline stages. This enables handling of different precision requirements.
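The difference between truncation and rounding when reducing precision between stages can be sketched as follows. The function `reduce_precision` and its `mode` argument are hypothetical; the example assumes unsigned values and does not model saturation.

```python
def reduce_precision(value: int, drop_bits: int, mode: str = "truncate") -> int:
    """Drop the `drop_bits` least significant bits of an unsigned value."""
    if drop_bits < 1:
        raise ValueError("drop_bits must be at least 1")
    if mode == "truncate":
        return value >> drop_bits
    if mode == "round":
        # Round half up: add half of the dropped range before shifting.
        return (value + (1 << (drop_bits - 1))) >> drop_bits
    raise ValueError("unknown mode: " + mode)


product = 200 * 2  # a 16-bit product of two 8-bit operands
print(reduce_precision(product, 8, "truncate"))  # 1
print(reduce_precision(product, 8, "round"))     # 2
```

Here the 16-bit product 400 corresponds to 1.5625 in units of 256, so truncation yields 1 while rounding yields 2.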

[0019] In embodiments, the functional units may be configurable to perform approximate computations by adjusting output precision. This can provide performance and energy efficiency benefits for error-tolerant applications.

[0020] In embodiments, operand reuse may be performed within at least one pipeline stage to reduce data fetch overhead. This improves performance by reducing memory bandwidth requirements.

[0021] In embodiments, the output precision of the dataflow backend may be less than or equal to the size of a register in the register file to enable writing outputs without requiring additional register file ports. This simplifies the register file design.

[0022] In embodiments, the configuration memory may be a memory-mapped input/output (MMIO) region, allowing the processor core to communicate with the configuration memory via standard memory access mechanisms. This allows easy access to the configuration memory.

[0023] In embodiments, the configuration data may include mask bits for selectively enabling or disabling functional units within each dataflow network stage during execution. This provides fine-grained control over the dataflow.
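The effect of mask bits on a single dataflow network stage can be sketched in Python. The function `run_stage` is hypothetical, and the behaviour of a disabled functional unit (forwarding its first input unchanged) is an assumption made for this example; the text does not specify what an inactive unit outputs.

```python
import operator
from typing import Callable, Sequence


def run_stage(inputs_a: Sequence[int], inputs_b: Sequence[int],
              op: Callable[[int, int], int], mask_bits: int) -> list:
    """Apply `op` in every functional unit whose mask bit is set."""
    outputs = []
    for unit, (a, b) in enumerate(zip(inputs_a, inputs_b)):
        enabled = (mask_bits >> unit) & 1
        # Assumption: a disabled unit passes its first input through unchanged.
        outputs.append(op(a, b) if enabled else a)
    return outputs


# Mask 0b0101 enables units 0 and 2 only.
print(run_stage([1, 2, 3, 4], [10, 10, 10, 10], operator.add, 0b0101))  # [11, 2, 13, 4]
```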

[0024] In embodiments, the configuration data may include control bits for directing data flow between successive dataflow network stages. This allows flexible dataflow patterns to be implemented.
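Routing between successive stages can be illustrated with a hypothetical three-stage multiply/add tree. The topology, the function name `multiply_add_tree`, and the representation of the control bits as a list of index pairs are assumptions for this sketch and are not taken from the figures.

```python
from typing import Sequence, Tuple


def multiply_add_tree(a_lanes: Sequence[int], b_lanes: Sequence[int],
                      routing: Sequence[Tuple[int, int]]) -> int:
    """Three stages: multiply per lane, routed pairwise add, final add."""
    # Stage 1: one multiplier per lane.
    products = [a * b for a, b in zip(a_lanes, b_lanes)]
    # Stage 2: each adder reads the two stage-1 outputs named by its routing entry.
    partial_sums = [products[i] + products[j] for i, j in routing]
    # Stage 3: a final adder combines the partial sums.
    return sum(partial_sums)


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(multiply_add_tree(a, b, routing=[(0, 1), (2, 3)]))  # 70, the dot product
print(multiply_add_tree(a, b, routing=[(0, 0), (1, 1)]))  # 34
```

Changing only the routing entries yields a different dataflow graph on the same functional units, which is the kind of flexibility the control bits are described as providing.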

[0025] In embodiments, the configuration data may be broadcasted to all threads within the SIMT processor core. This amortizes the configuration overhead across multiple threads.

[0026] In embodiments, the configuration index mechanism may comprise a status register storing a configuration index value. This allows a larger configuration space, and hence a larger configuration memory, to be addressed, providing a wider range of dataflow options and more flexibility.

[0027] In embodiments, the configuration index mechanism may be incorporated as part of an instruction in an instruction set architecture. This provides a compact way to specify the configuration. In embodiments, including the configuration index in an instruction may allow addressing a limited number of dataflow network stage configurations, e.g. due to instruction size constraints, for instance in case of a RISC based instruction set. This leads to a simpler design. The invention is not, however, limited to RISC based systems; also systems with more complicated instruction sets, such as CISC systems, for instance, could gain efficiency from the use of a set-up in accordance with embodiments of the present invention.

[0028] In embodiments, the configuration index mechanism may enable support for variable levels of subword parallelism. This allows the dataflow to be optimized for different data types.

[0029] In a second aspect, the present invention relates to a method of executing mixed-precision arithmetic operations in a Single Instruction Multiple Threads (SIMT) processor core, the method comprising: (a) utilizing a configurable dataflow backend to perform subword parallelism, wherein the backend includes multiple dataflow network stages, each dataflow network stage comprising one or more functional units configurable to process multiple subwords within a register in parallel; (b) accessing configuration data from a configuration memory that stores operational configurations for the dataflow network stages; (c) selecting the configuration data via a configuration index mechanism to control the dataflow network stages in the backend; wherein the configuration index mechanism enables support for multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

[0030] In embodiments, the method may further comprise configuring the functional units via control bits in the configuration data. This allows compact encoding of the dataflow configuration. In embodiments, the method may further comprise performing at least one of rounding, truncation, fixed-point arithmetic, and normalization to manage precision across pipeline stages. This enables handling of different precision requirements.

[0031] In embodiments, the method may further comprise configuring the functional units to perform approximate computations by adjusting output precision. This can provide performance and energy efficiency benefits for error-tolerant applications.

[0032] In embodiments, the method may further comprise performing operand reuse within at least one pipeline stage to reduce data fetch overhead. This improves performance by reducing memory bandwidth requirements.

[0033] In embodiments, output precision may be limited to the size of a register in a register file to enable writing outputs without requiring additional register file ports. This simplifies the register file design.

[0034] In embodiments, the method may comprise using mask bits in the configuration data for selectively enabling or disabling functional units within each dataflow network stage during execution. This provides fine-grained control over the dataflow.

[0035] In embodiments, the method may comprise using control bits in the configuration data for directing data flow between successive dataflow network stages. This allows flexible dataflow patterns to be implemented.

[0036] In embodiments, the method may further comprise broadcasting the configuration data to all threads within the SIMT processor core. This amortizes the configuration overhead across multiple threads.

[0037] In embodiments, the method may further comprise setting a status register with a configuration index value to select the configuration data. This allows a large configuration space to be addressed.

[0038] In a third aspect, the present invention relates to a non-transitory computer-readable storage medium storing instructions that, when executed by a Single Instruction Multiple Threads (SIMT) processor core comprising a configurable dataflow backend and a configuration memory, cause the processor to: (a) execute mixed-precision arithmetic operations using subword parallelism through the configurable dataflow backend, wherein the backend includes multiple dataflow network stages with functional units configurable to process multiple subwords within a register in parallel; (b) access configuration data defining operational configurations of the dataflow network stages from the configuration memory; (c) select the configuration data via a configuration index mechanism to control the dataflow network stages in the backend; wherein the configuration index mechanism enables the processor core to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation. Examples of non-transitory computer-readable storage media include hard drives, solid-state drives, memory cards, and magnetic tapes, the present invention not being limited thereto.

[0039] In embodiments, the instructions may further cause the processor to configure the dataflow backend as a pipelined dataflow network. This allows for high throughput execution.

[0040] In embodiments, the instructions may further cause the processor to configure multiple pipeline stages, for instance at least three pipeline stages, in the dataflow backend. This enables complex dataflow graphs to be implemented. Having a limited number of pipeline stages provides a good balance between complexity and performance.

[0041] In embodiments, the instructions may further cause the processor to configure each functional unit to perform at least one of a plurality of arithmetic operations, such as addition, multiplication, minimum, and maximum operations. This allows supporting common arithmetic operations.

[0042] In embodiments, the instructions may further cause the processor to configure, in a given cycle, each functional unit to perform a selected arithmetic operation or remain inactive (non-functional). This provides flexibility to map different dataflow graphs.

[0043] In embodiments, the instructions may further cause the processor to configure the functional units via control bits in the configuration data. This allows compact encoding of the dataflow configuration.

[0044] In embodiments, the instructions may further cause the processor to perform at least one of rounding, truncation, fixed-point arithmetic, and normalization to manage precision across pipeline stages. This enables handling of different precision requirements.

[0045] In embodiments, the instructions may further cause the processor to configure the functional units to perform approximate computations by adjusting output precision. This can provide performance and energy efficiency benefits for error-tolerant applications.

[0046] In embodiments, the instructions may further cause the processor to perform operand reuse within at least one pipeline stage to reduce data fetch overhead. This improves performance by reducing memory bandwidth requirements.

[0047] In embodiments, the instructions may further cause the processor to limit output precision to the size of a register in a register file to enable writing outputs without requiring additional register file ports. This simplifies the register file design.

[0048] In embodiments, the instructions may further cause the processor to broadcast the configuration data to all threads within the SIMT processor core. This amortizes the configuration overhead across multiple threads. In embodiments, the instructions may further cause the processor to set a status register with a configuration index value to select the configuration data. This allows a larger configuration space to be addressed.

[0049] It is an advantage of embodiments of the present invention that they support subword parallelism in SIMT cores while maintaining compatibility with RISC-based instruction sets. The invention is not limited to systems with RISC-based instruction sets; also systems with more complicated instruction sets, such as for instance CISC-based systems, can gain efficiency if implemented in accordance with embodiments of the present invention. It is a further advantage of embodiments of the present invention that the configurable backend allows for custom-defined instructions and backend configurations, enabling efficient execution of low-precision and mixed-precision arithmetic operations without the need for multiple dedicated instructions. It is an advantage of embodiments of the present invention that they provide increased throughput with reduced bandwidth requirements by utilizing dataflow-based subword-parallel operations. It is a further advantage of embodiments of the present invention that the overhead of storing configuration bits is amortized across multiple threads, enhancing efficiency in SIMT architectures. It is an advantage of embodiments of the present invention that they allow for higher reuse of operands within the dataflow network, thereby reducing data fetch overhead and increasing the throughput in terms of operations per byte fetched. It is a further advantage of embodiments of the present invention that they ensure output precision is less than or equal to the register size, enabling outputs to be written back to the register file without the need for additional ports. It is an advantage of embodiments of the present invention that they offer high-density computation solutions for narrow precision computing, which is particularly beneficial for accelerating deep learning workloads that are resilient to quantization. It is a further advantage of embodiments of the present invention that they lower memory requirements by allowing packed information to be directly loaded into register files, thereby improving efficiency when working with low-precision data formats.

[0050] Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims and with features of other dependent claims as appropriate and not merely as explicitly set out in the claims.

[0051] The above and other characteristics, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention. This description is given for the sake of example only, without limiting the scope of the invention. The reference figures quoted below refer to the attached drawings.

[0052] Brief description of the drawings

[0053] The invention will now be described further, by way of example, with reference to the accompanying drawings, in which:

[0054] FIG. 1 is a block diagram of a SIMT processor core according to embodiments of the present invention.

[0055] FIG. 2 is a dataflow diagram of a three-stage fused multiply add operation on packed 8-bit integers in a 32-bit architecture according to embodiments of the present invention.

[0056] FIG. 3 is a dataflow diagram utilizing a fused multiply add pipeline for fixed precision packed 8-bit operands across pipeline stages in a 32-bit architecture according to embodiments of the present invention.

[0057] FIG. 4 is a dataflow diagram utilizing a fused multiply add pipeline for mixed precision packed 8-bit and packed 16-bit operands in a 64-bit architecture according to embodiments of the present invention.

[0058] FIG. 5 is a dataflow diagram for math expressions utilizing a fused multiply add pipeline for packed 8-bit integers with 2x operand reuse in a 16-bit architecture according to embodiments of the present invention.

[0059] FIG. 6 is a schematic illustration of a method of executing mixed-precision arithmetic operations in a SIMT processor core, in accordance with embodiments of the present invention.

[0060] In the different figures, the same reference signs refer to the same or analogous elements. Any reference signs in the claims shall not be construed as limiting the scope.

[0061] Detailed description of illustrative embodiments

[0062] The present invention will be described with respect to particular embodiments and with reference to certain drawings but the invention is not limited thereto but only by the claims. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn on scale for illustrative purposes. The dimensions and the relative dimensions do not correspond to actual reductions to practice of the invention.

[0063] The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequence, either temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

[0064] It is to be noticed that the term "comprising", used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression "a device comprising means A and B" should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.

[0065] Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.

[0066] Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

[0067] It should be noted that the use of particular terminology when describing certain features or aspects of the invention should not be taken to imply that the terminology is being re-defined herein to be restricted to include any specific characteristics of the features or aspects of the invention with which that terminology is associated. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

[0068] DEFINITIONS

[0069] The following terms are provided solely to aid in the understanding of the invention.

[0070] As used herein, and unless otherwise specified, the term "Single Instruction Multiple Threads (SIMT) processor core" refers to a processor core architecture where a single instruction is executed simultaneously across multiple threads, each operating independently on its own data. Examples of specific embodiments include graphics processing units (GPUs) that execute the same instruction across numerous data threads (e.g. pixels or pixel sets) in parallel.

[0071] As used herein, and unless otherwise specified, the term "configurable dataflow backend" refers to a portion of the processor that can be programmed or configured to perform various data processing operations by defining the flow of data through a network of functional units. This backend's operational behaviour is determined by configuration data. Examples of specific embodiments include programmable pipelines in digital signal processors that adapt to different computational tasks based on configuration settings.

[0072] As used herein, and unless otherwise specified, the term "subword parallelism" refers to the ability to process multiple smaller data elements, called subwords, within a single larger data word or register simultaneously in parallel. Examples of specific embodiments include dividing a 32-bit register into four 8-bit subwords and performing simultaneous arithmetic operations on all four 8-bit values within a single instruction cycle.
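The example of four 8-bit subwords in a 32-bit register can be sketched with a well-known software technique, often called SIMD within a register (SWAR). This is a software analogy only and is not presented as the mechanism of the functional units; the name `swar_add_u8x4` is hypothetical. Each lane wraps around modulo 256, and no carry crosses a lane boundary.

```python
HIGH_BITS = 0x80808080  # most significant bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F   # the remaining seven bits of each lane


def swar_add_u8x4(a: int, b: int) -> int:
    """Add four packed 8-bit lanes of two 32-bit words in one pass."""
    # Adding only the low seven bits of each lane cannot carry into the next lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is then fixed up with a carry-free XOR.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF


# Lanes (high to low): 0x01+0x01, 0xFF+0x01 (wraps to 0x00), 0x7F+0x01, 0x80+0x01
print(hex(swar_add_u8x4(0x01FF7F80, 0x01010101)))  # 0x2008081
```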

[0073] As used herein, and unless otherwise specified, the term "mixed-precision arithmetic operations" refers to computational operations involving operands of different numerical precisions, such as combining 16-bit and 32-bit numbers in a single operation. Examples of specific embodiments include performing calculations where higher precision is required for specific operands while others can be processed with lower precision to optimize performance and resource usage.
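A mixed-precision operation can be sketched as a dot product between packed signed 8-bit operands and packed signed 16-bit operands with a 32-bit result. The operand widths, lane count, and helper names (`to_signed`, `unpack_signed`, `mixed_dot`) are assumptions chosen for illustration.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` of value as a two's-complement integer."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def unpack_signed(word: int, bits: int, count: int) -> list:
    """Extract `count` signed subwords of width `bits`, lowest lane first."""
    return [to_signed(word >> (lane * bits), bits) for lane in range(count)]


def mixed_dot(packed_i8: int, packed_i16: int, count: int = 4) -> int:
    """Dot product of int8 lanes with int16 lanes, wrapped to a 32-bit result."""
    a = unpack_signed(packed_i8, 8, count)     # four int8 values in a 32-bit word
    b = unpack_signed(packed_i16, 16, count)   # four int16 values in a 64-bit word
    accumulator = sum(x * y for x, y in zip(a, b))
    return to_signed(accumulator, 32)


# int8 lanes (low to high): 4, 3, 2, -1; int16 lanes: 256, 16, -1, 1
print(mixed_dot(0xFF020304, 0x0001FFFF00100100))  # 1069
```

With four lanes, the sum of int8 by int16 products always fits in 32 bits, so the final wrap in this sketch never changes the value.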

[0074] As used herein, and unless otherwise specified, the term "dataflow network stages" refers to sequential processing stages within the dataflow backend, where each stage comprises one or more functional units that perform specific operations on data as the data moves through the stages. Examples of specific embodiments include pipeline stages in a processor where data undergoes operations like addition, multiplication, or logical transformations in a staged manner. As used herein, and unless otherwise specified, the term "corresponding" in the context of configuration index mechanism stage pointing to a location in a configuration memory stage for controlling a dataflow network stage, refers to a part of the configuration index mechanism and a part of the configuration memory stage that cooperate to control a particular stage of the dataflow network. The three are then said to be corresponding.

[0075] As used herein, and unless otherwise specified, the term "functional units configurable to process multiple subwords within a register in parallel" refers to processing elements within the dataflow network stages that can be programmed to simultaneously perform operations on multiple smaller data elements (subwords) contained within a single register. Examples of specific embodiments include arithmetic logic units (ALUs) configured to execute parallel addition or multiplication on segmented data within a register.

[0076] As used herein, and unless otherwise specified, the term "configuration memory" refers to a memory component that stores configuration data which defines the operational settings and behaviours of the dataflow network stages in the backend. Examples of specific embodiments include dedicated configuration registers or a memory-mapped space that holds setup parameters for functional units.

[0077] As used herein, and unless otherwise specified, the term "configuration data defining operational configurations of the dataflow network stages" refers to data that specifies the functions, control signals, and routing paths of the functional units within each stage of the dataflow backend. Examples of specific embodiments include bit patterns that enable specific operations, select input sources, or determine the sequence of data processing.

[0078] As used herein, and unless otherwise specified, the term "configuration index mechanism operative to select configuration data from the configuration memory to control the dataflow network stages in the backend" refers to a mechanism, such as a register or instruction field, that holds an index or pointer used to retrieve specific configuration data from the configuration memory. Examples of specific embodiments include an index register that selects among multiple configuration profiles stored in memory to modify the backend's operation.

[0079] As used herein, and unless otherwise specified, the phrase "without requiring dedicated instructions for each variation" refers to the ability of the processor to support multiple variations of operations through configurable hardware settings, eliminating the need for unique instructions in the instruction set for each possible operation. Examples of specific embodiments include using a single instruction with different configuration settings to perform various types of arithmetic operations, such as different precisions or operand sizes. As used herein, and unless otherwise specified, the term "pipelined" refers to an arrangement where multiple processing stages are set up in a sequence, allowing overlapping execution of operations to increase processing throughput. Examples of specific embodiments include instruction pipelines in CPUs where different stages like fetch, decode, execute, and writeback operate concurrently on different instructions.

[0080] As used herein, and unless otherwise specified, the term "operand reuse" refers to the technique of using the same operand multiple times within one or more stages of processing to reduce the need for additional data fetches, thus improving efficiency. Examples of specific embodiments include retaining a frequently used value in a register or cache within the dataflow backend to avoid repeated memory accesses.

[0081] As used herein, and unless otherwise specified, the phrase "output precision is less than or equal to the size of a register in the register file to enable writing outputs without requiring additional register file ports" refers to designing the output data such that it fits within a single register, avoiding the need for extra hardware resources to store larger outputs. Examples of specific embodiments include ensuring that the combined results of parallel subword operations do not exceed the register's capacity.

[0082] As used herein, and unless otherwise specified, the term "memory-mapped input / output (MMIO) region" refers to a specific area of memory address space that is mapped to hardware registers or devices, allowing software to interact with hardware components via standard memory access operations. Examples of specific embodiments include configuration registers for peripheral devices accessed through specific memory addresses.

[0083] As used herein, and unless otherwise specified, the term "mask bits" refers to bits within the configuration data used to selectively enable or disable specific functional units or operations within each dataflow network stage during execution. Examples of specific embodiments include a set of bits where each bit corresponds to a functional unit, with a value of '1' enabling the unit and '0' disabling it.
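The one-bit-per-functional-unit example can be sketched as follows. The mapping of bit i to unit i, the name `run_stage`, and the use of `None` as the output of a disabled unit are assumptions made for illustration only.

```python
from typing import Callable, Optional


def run_stage(mask_bits: int,
              operands: list[tuple[int, int]],
              op: Callable[[int, int], int]) -> list[Optional[int]]:
    """Apply `op` on every unit whose mask bit is 1; disabled units yield None."""
    results: list[Optional[int]] = []
    for unit, (a, b) in enumerate(operands):
        enabled = (mask_bits >> unit) & 1  # assumed: bit i controls unit i
        results.append(op(a, b) if enabled else None)
    return results


# Mask 0b0101 enables units 0 and 2 and disables units 1 and 3.
out = run_stage(0b0101, [(1, 2), (3, 4), (5, 6), (7, 8)], lambda a, b: a * b)
```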

[0084] As used herein, and unless otherwise specified, the term "control bits for directing data flow between successive dataflow network stages" refers to bits within the configuration data that determine the routing and flow of data between the different stages of the dataflow backend. Examples of specific embodiments include bits that select whether the output of one stage is passed to the next stage or fed back to an earlier stage.

[0085] As used herein, and unless otherwise specified, the phrase "broadcasted to all threads within the SIMT processor core" refers to sending the same configuration data or control signals simultaneously to all executing threads in the processor core to ensure consistent operation across threads. Examples of specific embodiments include distributing a configuration change to all threads in a GPU so they execute the same operation with the same parameters.

[0086] As used herein, and unless otherwise specified, the term "status register" refers to a processor register that stores information regarding the state or configuration of the processor, such as flags or indexes that influence operation. Examples of specific embodiments include registers that hold flags for arithmetic conditions or that store the current configuration index for selecting operational settings.

[0087] As used herein, and unless otherwise specified, the term "instruction size constraints" refers to limitations on the amount of information that can be encoded within a single instruction due to its fixed size, which affects the number of operations or data that can be specified directly. Examples of specific embodiments include a 32-bit instruction format where only a limited number of bits are available for immediate values or indexes, restricting the range of directly addressable configurations.

[0088] As used herein, and unless otherwise specified, the phrase "variable levels of subword parallelism" refers to the capability of the processor to adjust the number and size of subwords being processed in parallel, allowing for different degrees of parallelism based on the requirements of the computation. Examples of specific embodiments include configuring a register to handle either four 16-bit operations or eight 8-bit operations simultaneously, depending on the desired precision and performance.
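The example of a register handled either as four 16-bit or as eight 8-bit subwords implies a 64-bit register. A minimal sketch of the two views is given below; the function name `split_subwords` and the least-significant-first lane order are assumptions.

```python
def split_subwords(register: int, subword_bits: int,
                   register_bits: int = 64) -> list[int]:
    """Return the unsigned subwords of `register`, least significant first."""
    if register_bits % subword_bits != 0:
        raise ValueError("subword width must divide the register width")
    mask = (1 << subword_bits) - 1
    count = register_bits // subword_bits
    return [(register >> (i * subword_bits)) & mask for i in range(count)]


reg = 0x0807060504030201
as_16bit = split_subwords(reg, 16)  # four 16-bit subwords
as_8bit = split_subwords(reg, 8)    # eight 8-bit subwords
```

The same register contents therefore feed either four or eight parallel operations, depending only on the selected subword width.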

[0089] As used herein, and unless otherwise specified, the term "non-transitory computer-readable storage medium" refers to any physical storage medium that can store instructions or data in a form accessible by a computer, excluding transitory signals or waves. Examples of specific embodiments include solid-state drives, hard disk drives, flash memory devices, optical discs, and magnetic storage media.

[0090] As used herein, and unless otherwise specified, the expression "instructions that, when executed by a Single Instruction Multiple Threads (SIMT) processor core comprising a configurable dataflow backend and a configuration memory, cause the processor to... (perform an action)" refers to software programs or code sequences designed to run on the specified hardware, causing it to perform the described operations or methods. Examples of specific embodiments include compiled code or firmware that configures the processor's dataflow backend and executes mixed-precision arithmetic operations as detailed in the claims.

[0091] The invention will now be described by a detailed description of several embodiments of the invention. It is clear that other embodiments of the invention can be configured according to the knowledge of persons skilled in the art without departing from the technical teaching of the invention, the invention being limited only by the terms of the appended claims.

[0092] In a first aspect, the present invention relates to a Single Instruction Multiple Threads (SIMT) processor core (100) comprising a SIMT frontend (11), for managing control flow and issuing instructions to a group of threads (a warp) in a way that all threads in the warp follow the same instruction sequence, even if they work on different pieces of data; and a SIMT backend (12), including hardware for performing arithmetic, logic or data operations on the data assigned to each thread.

[0093] In the context of the present invention, the backend (12) is a configurable dataflow system, meaning that it can adapt to perform mixed-precision arithmetic operations efficiently by using functional units that process data in parallel. The configurable dataflow backend (12) is operative to support subword parallelism for execution of mixed-precision arithmetic operations. The backend (12) includes multiple dataflow network stages, each dataflow network stage including one or more functional units (20) configurable to process multiple subwords within a register in parallel.

[0094] The dataflow backend (12) is configured using a configuration memory (13) storing configuration data defining operational configurations of the dataflow network stages in the backend (12). The configuration memory is accessible via a configuration index mechanism (14) operative to select configuration data from the configuration memory (13) to control the dataflow network stages in the backend (12). A configuration index for a particular stage comes as part of an instruction to index the configuration memory for that stage. The configuration index mechanism (14) enables the dataflow backend (12) to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

[0095] It is an advantage of embodiments of the present invention that efficient execution of mixed-precision arithmetic operations is enabled without the need for multiple dedicated instructions. Using a unified configuration memory, as opposed to the multiple distinct mask registers and predicate registers known from the prior art, also provides an efficiency gain.

[0096] As illustrated for the embodiment of FIG. 1, the dataflow backend (12) may be pipelined. This allows for high-throughput execution. The dataflow backend (12) may include multiple pipeline stages, enabling complex dataflow graphs to be implemented. Specifically, the dataflow backend (12) may include at least three pipeline stages (not illustrated), providing a good balance between complexity and performance. The pipelined dataflow backend (12) may be configured stagewise using a configuration memory (13) having a plurality of configuration memory stages (ConfigMem), and a configuration index mechanism (14) having a plurality of configuration index mechanism stages (Configindex). Each configuration index mechanism stage (Configindex) points to a location in a corresponding configuration memory stage (ConfigMem) that stores configuration bits for controlling a corresponding dataflow network stage (Stage 0, Stage 1) in the backend (12). The configuration index mechanism (14) may be incorporated either as part of an instruction or as the value of a status register.
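The stagewise selection described above can be sketched as a per-stage table lookup. The contents of `CONFIG_MEM`, the string-valued configuration words, and the function name `select_configuration` are hypothetical; actual configuration words would be bit patterns.

```python
# Hypothetical contents; one dictionary per configuration memory stage.
CONFIG_MEM = [
    {0: "mul", 1: "min"},  # ConfigMem0, controls Stage 0
    {0: "add", 1: "max"},  # ConfigMem1, controls Stage 1
]


def select_configuration(config_index: list[int]) -> list[str]:
    """Return the configuration word selected for each stage by its own index."""
    if len(config_index) != len(CONFIG_MEM):
        raise ValueError("one index is required per stage")
    return [CONFIG_MEM[stage][index] for stage, index in enumerate(config_index)]


multiply_then_add = select_configuration([0, 0])
min_then_max = select_configuration([1, 1])
```

Changing only the index values switches the backend between different dataflow graphs, which is how a single instruction can cover several variations.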

[0097] In embodiments, as illustrated in FIG. 1, the dataflow backend (12) is thus configured as a pipelined dataflow network with multiple pipeline stages. Each pipeline stage (Stage 0, Stage 1) may include functional units (20) configurable via control bits in configuration data to perform arithmetic operations on subwords within registers specific to individual threads. In embodiments, no data is shared between threads within any pipeline stage, allowing each thread to process its data independently. This implementation simplifies control logic and reduces hardware complexity associated with inter-thread communication, while enabling high-throughput execution of mixed-precision arithmetic operations through independent per-thread computation.

[0098] For example, referring to FIG. 2, a three-stage dataflow graph (200) for a fused multiply-add (FMA) operation on packed 8-bit integers in a 32-bit architecture is illustrated.

[0099] When considering the case of Fused Multiply Add (FMA) across n thread lanes of a SIMT core (100), it can be assumed, without loss of generality, that the operands A_i, B_i are 8-bit integers and the output C is a 32-bit integer, where n = 4 in case of a 32-bit architecture (width of register is 32 bits). FMA in the packed execution mode can be defined as follows:
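The defining expression does not appear in this text. A reconstruction that is consistent with the three stages described for FIG. 2 (an assumption, not the original formula) is the sum of the n subword products:

```latex
C = \sum_{i=0}^{n-1} A_i \cdot B_i , \qquad n = 4
```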

[0100] At the beginning of the dataflow, A_i and B_i are stored in packed format (4x 8 bits each) within a first operand register (210, indicated rs1 in the code below) and a second operand register (211, indicated rs2 in the code below), respectively. This implies that the number of ports in the register file does not need to be modified. The control program for this packed FMA execution could be as follows.

    set status_reg 01
    f_fmadd_a8_b8 rs0, rs1, rs2
    f_fmadd_a8_b8 rs3, rs3, rs4

This code configures a SIMT processor core (100) for a specific mode using the status register, followed by two FMA instructions operating on registers with mixed precision.

[0101] The first instruction, set status_reg 01, sets a status register to a value of 01. The status register controls or indicates specific functional unit modes or flags. In particular, setting 01 might enable a specific operational mode, such as for instance mixed-precision arithmetic; or configure precision or rounding modes; or activate specific hardware features in the dataflow backend (12). The second instruction, f_fmadd_a8_b8 rs0, rs1, rs2, is an FMA instruction, where _a8_b8 specifies mixed-precision arithmetic: a8 is the precision of the first operand stored in register rs1 and b8 is the precision of the second operand stored in register rs2. These could for instance represent 8-bit floating point formats. In the above code, rs0, rs1 and rs2 are the registers that hold the operands: rs1 and rs2 are multiplied, the result is added to rs0, and the final result is stored back in rs0.

[0102] The third instruction, f_fmadd_a8_b8 rs3, rs3, rs4, is similar to the second instruction, but operates on different registers: rs3 and rs4 are multiplied, the result is added to rs3, and the final result is stored back in rs3.

[0103] The proposed dataflow-based execution may be considered, in accordance with embodiments of the present invention, to have multiple pipeline stages (201, 202, 203), where the inputs of a subsequent stage s+1 arrive from a previous stage s. Corresponding to each stage (201, 202, 203), there is a configuration memory (see ConfigMem0, ConfigMem1 in FIG. 1), which stores the bits to control the dataflow in that stage, and to select the function of the functional unit (20). Each functional unit (20) can be referred to as a node in the dataflow network, as illustrated in FIG. 2. A functional unit (20) can support multiple arithmetic operations - addition, multiplication, min, max, etc. In a given cycle, a particular node, hence functional unit (20), is configured either to perform one of a plurality of arithmetic operations, which allows supporting common arithmetic operations, or to remain non-functional. This provides flexibility to map different dataflow graphs. The functional units (20) may be configurable via control bits in the configuration data, allowing compact encoding of the dataflow configuration. Assuming o_d outputs of a node, the number of control bits needed for configuration of one node is ceil(log2(o_d)).
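The control-bit count ceil(log2(o_d)) can be evaluated with exact integer arithmetic, as in the following minimal sketch. The function names are hypothetical, and the per-stage total assumes that all nodes of a stage have the same number of outputs, which the description does not state.

```python
def control_bits_per_node(num_outputs: int) -> int:
    """Return ceil(log2(num_outputs)) without floating-point rounding."""
    if num_outputs < 1:
        raise ValueError("a node needs at least one output")
    return (num_outputs - 1).bit_length()


def control_bits_per_stage(num_nodes: int, num_outputs: int) -> int:
    """Total control bits for a stage of identical nodes (an assumption)."""
    return num_nodes * control_bits_per_node(num_outputs)


bits_for_four_outputs = control_bits_per_node(4)
bits_for_five_outputs = control_bits_per_node(5)
```

A node with four outputs thus needs two control bits, while a fifth output raises the cost to three bits.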

[0104] In the case detailed in FIG. 2, the dataflow graph (200) has three stages (201, 202, 203), and for each stage, the configuration memory (not illustrated in FIG. 2) is preloaded. All functional units (20) are configured to perform either a multiplication or an addition operation.

[0105] In Stage 0 (201), the 8-bit inputs (A0, ..., A3; B0, ..., B3) from two source registers (210, 211) are multiplied, producing 16-bit results (A0*B0, ..., A3*B3).

[0106] In Stage 1 (202), the consecutive outputs (A0*B0, ..., A3*B3) from Stage 0 (201) are summed pairwise to produce 16-bit outputs (A0*B0+A1*B1; A2*B2+A3*B3).

[0107] In Stage 2 (203), the two outputs (A0*B0+A1*B1; A2*B2+A3*B3) from Stage 1 (202) are added to produce the final 32-bit output (A0*B0+A1*B1+A2*B2+A3*B3).

[0108] This demonstrates how the dataflow backend (12) can be configured to perform common arithmetic operations such as multiplication and addition across multiple pipeline stages (201, 202, 203).
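The three stages described for FIG. 2 can be modelled in software as follows. This is a behavioural sketch only: it assumes unsigned subwords, a least-significant-first lane order, and wraparound when a stage output exceeds its stated width, none of which is specified in the description. The names `unpack8` and `packed_fma` are hypothetical.

```python
def unpack8(register: int) -> list[int]:
    """Split a 32-bit register into four unsigned 8-bit subwords, LSB lane first."""
    return [(register >> (8 * i)) & 0xFF for i in range(4)]


def packed_fma(rs1: int, rs2: int) -> int:
    """Model the three-stage packed multiply-and-sum dataflow for one thread."""
    a, b = unpack8(rs1), unpack8(rs2)
    # Stage 0: four 8-bit x 8-bit multiplications, each kept in 16 bits.
    products = [(x * y) & 0xFFFF for x, y in zip(a, b)]
    # Stage 1: sum consecutive pairs into 16-bit outputs (assumed to wrap).
    pair_sums = [(products[0] + products[1]) & 0xFFFF,
                 (products[2] + products[3]) & 0xFFFF]
    # Stage 2: add the two partial sums into the final 32-bit output.
    return (pair_sums[0] + pair_sums[1]) & 0xFFFFFFFF


# rs1 holds (1, 2, 3, 4) and rs2 holds (5, 6, 7, 8), LSB lane first.
c = packed_fma(0x04030201, 0x08070605)  # 1*5 + 2*6 + 3*7 + 4*8
```

Because the final result fits in one 32-bit register, it can be written back through the existing register file port.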

[0109] In the example illustrated in FIG. 2, none of the functional units (20) was configured to remain inactive. Nevertheless, each stage (201, 202, 203) may have a mask bit to enable or disable the relevant functional units (20). Depending on the intended operation to be mapped to the dataflow, multiple functional units (20) may be turned off at any stage of the network.

[0110] The above example has fixed precision for operands involved in an operation within a pipeline stage and mixed-precision across the pipeline stages.

[0111] In alternative embodiments, the dataflow backend (12) may be configured to perform at least one of rounding, truncation, fixed-point arithmetic, and normalization to manage precision across pipeline stages, enabling handling of different precision requirements. As illustrated in FIG. 3, a dataflow graph (300) utilizing the FMA pipeline for fixed-precision operands (packed 8-bit) within a pipeline stage and across the different pipeline stages in a 32-bit architecture is shown.

[0112] In the case detailed in FIG. 3, the dataflow graph (300) has three stages (301, 302, 303), and for each stage, the configuration memory (not illustrated in FIG. 3) is preloaded. All functional units (20) are configured to perform either a multiplication or an addition operation. Rounding, truncation, fixed-point arithmetic, and normalization can be used to maintain fixed-point precision across the pipeline stages.

[0113] In Stage 0 (301), the 8-bit inputs (A0, ..., A3; B0, ..., B3) from two source registers (210, 211) are multiplied, producing 8-bit results (A0*B0, ..., A3*B3).

[0114] In Stage 1 (302), the consecutive outputs (A0*B0, ..., A3*B3) from Stage 0 (301) are summed pairwise to produce 8-bit outputs (A0*B0+A1*B1; A2*B2+A3*B3).

[0115] In Stage 2 (303), the two outputs (A0*B0+A1*B1; A2*B2+A3*B3) from Stage 1 (302) are added to produce the final 8-bit output (A0*B0+A1*B1+A2*B2+A3*B3).
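One way to keep every stage output at 8 bits is to treat the subwords as unsigned fixed-point fractions and truncate after each operation. The sketch below assumes a Q0.8 interpretation (value/256), truncation by right shift, and halving after each addition, so that the final output represents one quarter of the sum of products. These choices, and the name `packed_fma_8bit`, are assumptions made for illustration; the description names rounding, truncation, fixed-point arithmetic and normalization without fixing a scheme.

```python
def unpack8(register: int) -> list[int]:
    """Split a 32-bit register into four unsigned 8-bit subwords, LSB lane first."""
    return [(register >> (8 * i)) & 0xFF for i in range(4)]


def packed_fma_8bit(rs1: int, rs2: int) -> int:
    """Three-stage pipeline in which every stage output fits in 8 bits."""
    a, b = unpack8(rs1), unpack8(rs2)
    # Stage 0: Q0.8 x Q0.8 gives Q0.16; drop the low 8 bits to return to Q0.8.
    products = [(x * y) >> 8 for x, y in zip(a, b)]
    # Stage 1: pairwise sums, halved so that the 9-bit sum fits in 8 bits.
    pair_sums = [(products[0] + products[1]) >> 1,
                 (products[2] + products[3]) >> 1]
    # Stage 2: final sum, halved again; the output is (sum of products) / 4.
    return (pair_sums[0] + pair_sums[1]) >> 1


# rs1 holds (128, 64, 255, 32) and rs2 holds (128, 128, 255, 64), LSB lane first.
out = packed_fma_8bit(0x20FF4080, 0x40FF8080)
```

For these inputs the exact scaled result is 89.5 in units of 1/256, so the truncated output loses half a unit relative to full precision.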

[0116] The functional units (20) may also be configurable to perform approximate computations, thus adjusting output precision. This can provide performance and energy efficiency benefits for error-tolerant applications. Yet another embodiment is explained with respect to FIG. 4, illustrating a dataflow graph (400) utilizing the FMA pipeline for mixed-precision operands (packed 8-bit and packed 16-bit) in a 64-bit architecture. FIG. 4 highlights the dataflow graph (400) with mixed precision for operands both within and across pipeline stages (401, 402, 403). The choice of functional unit (20) and the precision of inputs and outputs are orthogonal to the proposed approach. For example, the inputs could equally well be floating-point numbers or integers. Depending on the choice of the output precision, the computation can be approximate (for example, the result of adding two 16b numbers stored as a 16b output), which can potentially lead to savings in energy and hardware implementation. The precision of the final output of the dataflow network is always less than or equal to the size of a register in the register file (e.g. 32b in this example). This allows writing back effectively to the thread-specific register file of the SIMT core, without additional ports.

[0117] The configuration index mechanism may enable support for variable levels of subword parallelism, allowing the dataflow to be optimized for different data types.

[0118] Operand reuse may be performed within at least one pipeline stage to reduce data fetch overhead, improving performance by reducing memory bandwidth requirements. Referring to FIG. 5, a dataflow graph (500) for other math expressions utilizing the FMA pipeline for packed 8-bit integers with a reuse of operands (2x in this example, for the first pipeline stage 501) in a 16-bit architecture is illustrated. This demonstrates subword parallelism and higher reuse of operands, which can reduce data fetch overhead and increase throughput in terms of operations per byte fetched. This approach is extendable to n-bit architectures.

[0119] In the case detailed in FIG. 5, the dataflow graph (500) has three stages (501, 502, 503), and for each stage, the configuration memory (not illustrated in FIG. 5) is preloaded. All functional units (20) are configured to perform either a multiplication or an addition operation. Rounding, truncation, fixed-point arithmetic, and normalization can be used to maintain fixed-point precision across the pipeline stages.

[0120] In Stage 0 (501), the 8-bit inputs (A0, A1; B0, B1) from two source registers (510, 511) are multiplied, producing 8-bit results (A0*B0, A0*B1, A1*B0, A1*B1). Herein, the 8-bit integers A0, A1 and B0, B1 are reused, once fetched.

[0121] In Stage 1 (502), the consecutive outputs (A0*B0, A0*B1, A1*B0, A1*B1) from Stage 0 (501) are summed to produce 16-bit outputs (A0*B0+A0*B1; A1*B0+A1*B1).

[0122] In Stage 2 (503), the two outputs (A0*B0+A0*B1; A1*B0+A1*B1) from Stage 1 (502) are added to produce the final 32-bit output (A0*B0+A0*B1+A1*B0+A1*B1). The output precision of the dataflow backend (12) may be less than or equal to the size of a register in the register file to enable writing outputs without requiring additional register file ports. This simplifies the register file design.
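The operand reuse of FIG. 5 can be sketched as follows. The model uses exact Python integers and deliberately ignores the per-stage output widths; the unsigned interpretation, the lane order, and the names `unpack8x2` and `reuse_pipeline` are assumptions.

```python
def unpack8x2(register: int) -> tuple[int, int]:
    """Split a 16-bit register into two unsigned 8-bit subwords, LSB lane first."""
    return register & 0xFF, (register >> 8) & 0xFF


def reuse_pipeline(rs1: int, rs2: int) -> int:
    """Model the three stages with 2x operand reuse in Stage 0."""
    a0, a1 = unpack8x2(rs1)
    b0, b1 = unpack8x2(rs2)
    # Stage 0: every fetched subword feeds two multipliers (2x reuse).
    products = [a0 * b0, a0 * b1, a1 * b0, a1 * b1]
    # Stage 1: sum consecutive pairs of products.
    pair_sums = [products[0] + products[1], products[2] + products[3]]
    # Stage 2: add the two partial sums.
    return pair_sums[0] + pair_sums[1]


# rs1 holds (A0, A1) = (2, 3) and rs2 holds (B0, B1) = (4, 5).
total = reuse_pipeline(0x0302, 0x0504)
```

Four multiplications are obtained from four fetched bytes, whereas the graph of FIG. 2 obtains four multiplications from eight fetched bytes; the result also equals (A0+A1)*(B0+B1).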

[0123] In embodiments of the present invention, the configuration memory (13) may be a memory-mapped input / output (MMIO) region, allowing easy access to the configuration memory (13). The configuration data may include mask bits for selectively enabling or disabling functional units (20) within each dataflow network stage during execution, providing fine-grained control over the dataflow. The configuration data may include control bits for directing data flow between successive dataflow network stages, allowing flexible dataflow patterns to be implemented.

[0124] In embodiments of the present invention, the configuration data may be broadcast to all threads within the SIMT processor core (100), amortizing the overhead of storing the configuration bits across multiple threads. The proposed approach of embodiments of the present invention offers a high-throughput, high-density computation solution for narrow-precision computing. It also lowers memory requirements, since packed information can be loaded directly into the register files.

[0125] In embodiments of the present invention, the configuration index mechanism (14) may comprise a status register storing a configuration index value, allowing a larger configuration space to be addressed. The configuration index mechanism (14) may be incorporated as part of an instruction in an instruction set architecture, providing a compact way to specify the configuration. Including the configuration index in an instruction may allow addressing a limited number of configurations due to instruction size constraints, leading to a simpler design. In alternative implementations, including the configuration index in a status register may allow addressing a larger configuration memory for a wider range of dataflow options, enabling more flexibility.
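The trade-off between the two locations of the configuration index can be illustrated with field widths. All numbers below are hypothetical: a 3-bit index in the top bits of a 32-bit instruction and a 16-bit index field in the status register. Neither layout is given in the description.

```python
INSTR_INDEX_LSB = 29    # hypothetical: index occupies bits 31..29 of the instruction
INSTR_INDEX_BITS = 3    # hypothetical instruction field width
STATUS_INDEX_BITS = 16  # hypothetical status register field width


def index_from_instruction(instruction_word: int) -> int:
    """Extract the configuration index encoded in the instruction itself."""
    return (instruction_word >> INSTR_INDEX_LSB) & ((1 << INSTR_INDEX_BITS) - 1)


def index_from_status_register(status_register: int) -> int:
    """Extract the configuration index held in the status register."""
    return status_register & ((1 << STATUS_INDEX_BITS) - 1)


instr_range = 1 << INSTR_INDEX_BITS    # configurations reachable per instruction
status_range = 1 << STATUS_INDEX_BITS  # configurations reachable via the register
```

With these assumed widths, an instruction can select among 8 configurations, whereas the status register can select among 65536, at the cost of an extra instruction to set it.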

[0126] In a second aspect, and as schematically illustrated in FIG. 6, the present invention relates to a method (600) of executing mixed-precision arithmetic operations in a Single Instruction Multiple Threads (SIMT) processor core (100). The method (600) comprises utilizing (601) a configurable dataflow backend (12) to perform subword parallelism, wherein the backend (12) includes multiple dataflow network stages (Stage 0, Stage 1), each dataflow network stage (Stage 0, Stage 1) comprising one or more functional units (20) configurable to process multiple subwords within a register in parallel. The method (600) further involves accessing (602) configuration data from a configuration memory (13) that stores operational configurations for the dataflow network stages (Stage 0, Stage 1) and selecting (603) the configuration data via a configuration index mechanism (14) to control the dataflow network stages (Stage 0, Stage 1) in the backend (12), wherein the configuration index mechanism (14) enables support for multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

[0127] The method (600) may further comprise configuring (604) the functional units (20) via control bits in the configuration data, allowing compact encoding of the dataflow configuration.

[0128] In particular embodiments, the method (600) may comprise configuring (604) one or more of the functional units (20) to perform at least one of rounding, truncation, fixed-point arithmetic, and normalization to manage precision across pipeline stages, enabling handling of different precision requirements. In particular embodiments, the method (600) may further comprise configuring (604) one or more of the functional units (20) to perform approximate computations by adjusting output precision, which can provide performance and energy efficiency benefits for error-tolerant applications.

[0129] The method (600) may further comprise performing operand reuse within at least one pipeline stage to reduce data fetch overhead, improving performance by reducing memory bandwidth requirements.

[0130] In a method (600) according to embodiments of the present invention, output precision may be limited to the size of a register in a register file to enable writing outputs without requiring additional register file ports, simplifying the register file design.

[0131] The method (600) may further comprise broadcasting the configuration data to all threads within the SIMT processor core (100), thus amortizing the configuration overhead across multiple threads.

[0132] The method (600) may further comprise setting a status register with a configuration index value to select the configuration data, allowing a larger configuration space to be addressed.

Claims

1.- A Single Instruction Multiple Threads (SIMT) processor core (100) comprising:
(a) a configurable dataflow backend (12) operative to support subword parallelism for execution of mixed-precision arithmetic operations, wherein the backend (12) includes multiple dataflow network stages (Stage 0, Stage 1), each dataflow network stage (Stage 0, Stage 1) including one or more functional units (20) configurable to process multiple subwords within a register in parallel; and
(b) a configuration memory (13) storing configuration data defining operational configurations of the dataflow network stages (Stage 0, Stage 1) in the backend (12), wherein the configuration memory (13) is accessible via a configuration index mechanism (14) operative to select configuration data from the configuration memory (13) to control the dataflow network stages (Stage 0, Stage 1) in the backend (12), wherein the configuration index mechanism (14) is configured for enabling the dataflow backend (12) to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

2.- The SIMT processor core according to claim 1, wherein the dataflow backend is pipelined.

3.- The SIMT processor core according to claim 2, wherein the dataflow backend includes multiple pipeline stages.

4.- The SIMT processor core according to any one of claims 1 to 3, wherein each functional unit is configurable to perform at least one of addition, multiplication, minimum, and maximum operations.

5.- The SIMT processor core according to any one of claims 1 to 4, wherein each functional unit is configurable to perform a selected arithmetic operation or remain inactive in any cycle.

6.- The SIMT processor core according to any one of claims 1 to 5, wherein the output precision of the dataflow backend is less than or equal to the size of a register in the register file to enable writing outputs without requiring additional register file ports.

7.- The SIMT processor core according to any one of claims 1 to 6, wherein the configuration index mechanism comprises a status register storing a configuration index value.

8.- The SIMT processor core according to any one of claims 1 to 7, wherein the configuration index mechanism is incorporated as part of an instruction in an instruction set architecture.

9.- A method (600) of executing mixed-precision arithmetic operations in a Single Instruction Multiple Threads (SIMT) processor core (100), the method comprising:
(a) utilizing (601) a configurable dataflow backend (12) to perform subword parallelism, wherein the backend (12) includes multiple dataflow network stages (Stage 0, Stage 1), each dataflow network stage (Stage 0, Stage 1) comprising one or more functional units (20) configurable to process multiple subwords within a register in parallel;
(b) accessing (602) configuration data from a configuration memory (13) that stores operational configurations for the dataflow network stages (Stage 0, Stage 1);
(c) selecting (603) the configuration data via a configuration index mechanism (14) to control the dataflow network stages (Stage 0, Stage 1) in the backend (12);
wherein the configuration index mechanism (14) enables support for multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.

10.- The method according to claim 9, further comprising performing operand reuse within at least one pipeline stage to reduce data fetch overhead.

11.- The method according to any one of claims 9 to 10, further comprising setting a status register with a configuration index value to select the configuration data.

12.- A non-transitory computer-readable storage medium storing instructions that, when executed by a Single Instruction Multiple Threads (SIMT) processor core comprising a configurable dataflow backend and a configuration memory, cause the processor core to:
(a) execute mixed-precision arithmetic operations using subword parallelism through the configurable dataflow backend, wherein the backend includes multiple dataflow network stages with functional units configurable to process multiple subwords within a register in parallel;
(b) access configuration data defining operational configurations of the dataflow network stages from the configuration memory;
(c) select the configuration data via a configuration index mechanism to control the dataflow network stages in the backend;
wherein the configuration index mechanism enables the processor core to support multiple variations of mixed-precision arithmetic operations without requiring dedicated instructions for each variation.