FP16-S7E8 mixed precision for deep learning and other algorithms

Mixed-precision vector multiply-accumulate instructions in the FP16-S7E8 format address the time-consuming tuning and less desirable hyperparameters associated with the IEEE-FP16 format in machine learning hardware accelerators. They enable efficient training and inference of neural networks, reduce silicon area and power consumption, and support the convergence of large neural networks.

CN115421686B · Active · Publication Date: 2026-06-16 · INTEL CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: INTEL CORP
Filing Date: 2019-08-05
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing machine learning hardware accelerators that use the IEEE-FP16 format suffer from time-consuming training and inference. Making IEEE-FP16 work requires expert knowledge and may yield less desirable hyperparameters, resulting in slower training. Furthermore, the IEEE-FP16 format cannot always represent the results of individual scalar products within a GEMM, which harms convergence.

Method used

The FP16-S7E8 format is adopted for mixed-precision vector multiply-accumulate (MPVMAC) instructions: multiplication uses the 16-bit format and accumulation uses 32-bit single precision, which avoids the defects of the IEEE-FP16 format. A SIMD processor can be used for parallel processing, and silicon area and power consumption are reduced.

Benefits of technology

It improves the performance and power efficiency of neural network processing, reduces silicon area and power consumption, achieves efficient processing in the training and inference phases, and supports the convergence of large neural networks such as AlexNet and ResNet-50.

✦ Generated by Eureka AI based on patent content.

Abstract

Disclosed embodiments relate to mixed-precision vector multiply-accumulate (MPVMAC). In one example, a processor includes fetch circuitry to fetch a packed instruction having fields to specify the locations of a source vector having N single-precision formatted elements and a packed vector having N neural half-precision (NHP) formatted elements; decode circuitry to decode the fetched packed instruction; and execution circuitry to respond to the decoded packed instruction by converting each element of the source vector to NHP format and writing each converted element to a corresponding packed vector element, wherein the processor is further to fetch, decode, and execute an MPVMAC instruction to multiply corresponding NHP formatted elements using a 16-bit multiplier and accumulate each product with the previous contents of a corresponding destination using a 32-bit accumulator.

Description

Technical Field

[0001] The field of the invention relates generally to computer processor architecture, and more specifically to FP16-S7E8 mixed precision for deep learning and other algorithms.

Background Technology

[0002] Many hardware accelerators used today for machine learning via neural networks primarily perform matrix multiplication during both training and inference. Hardware accelerators for machine learning strive to achieve optimal raw performance numbers and power-to-performance ratios.

[0003] Machine learning architectures (such as deep neural networks) have been applied to fields including computer vision, image recognition, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, and drug design.

[0004] Matrix multiplication is a key performance / power limiter used in many algorithms, including machine learning.

[0005] Attempts to accelerate instruction throughput and improve performance may use reduced precision, such as IEEE-FP16 (S10E5), a half-precision floating-point (FP) format with 10 significand bits (sometimes called the mantissa, coefficient, argument, or fraction) and 5 exponent bits, defined in the IEEE 754-2008 standard issued by the Institute of Electrical and Electronics Engineers (IEEE). However, because IEEE-FP16 (S10E5) allocates too many bits to the significand relative to the exponent, approaches based on it tend to be time-consuming, require expert knowledge, and may produce hyperparameters (i.e., properties that are fixed before training and do not change during or as a result of training) that are less desirable (i.e., slower to train) than those obtained with, for example, single precision.

Attached Figure Description

[0006] The invention is illustrated by way of example, not limitation, in the accompanying drawings, wherein similar reference numerals indicate similar elements, and in the drawings:

[0007] Figure 1 is a block diagram illustrating processing components for executing mixed-precision vector multiply-accumulate (MPVMAC) instructions according to an embodiment;

[0008] Figure 2 is a block diagram illustrating processing components for executing mixed-precision vector multiply-accumulate (MPVMAC) instructions according to an embodiment;

[0009] Figure 3 is a block flow diagram illustrating a processor executing mixed-precision vector multiply-accumulate (MPVMAC) instructions according to an embodiment;

[0010] Figure 4A is a block diagram illustrating floating-point formats according to some embodiments;

[0011] Figure 4B shows the increased dynamic range of the neural half-precision (FP16-S7E8) floating-point format compared to the standard half-precision floating-point format;

[0012] Figure 5A is a block diagram illustrating the execution of an instruction for converting a format from standard single precision to neural half precision according to some embodiments;

[0013] Figure 5B is a block flow diagram illustrating an embodiment of a processor executing an instruction for converting a format from standard single precision to neural half precision;

[0014] Figure 6A is a block diagram illustrating the execution of an instruction for converting a format from neural half precision to standard single precision according to some embodiments;

[0015] Figure 6B is a block flow diagram illustrating an embodiment of a processor executing an instruction for converting a format from neural half precision to standard single precision;

[0016] Figure 7A is a flowchart for conducting machine learning experiments using mixed-precision vector multiply-accumulate (MPVMAC) instructions according to some embodiments;

[0017] Figure 7B shows experimental results relating to the non-convergence of CIFAR-10 with a 5-bit exponent used in the multiplier and an FP32 accumulator;

[0018] Figure 7C shows experimental results relating to the convergence of CIFAR-10 with a 6-bit exponent, a parameter sweep over the number of mantissa bits, and FP32 accumulation;

[0019] Figure 7D shows experimental results relating to the convergence of AlexNet with multipliers implemented using the IEEE-FP16 (S10E5) and FP16-S7E8 representations and accumulators implemented using FP32;

[0020] Figure 7E shows a convergence plot for ResNet-50, comparing FP16-S7E8 with IEEE-FP16 / 32;

[0021] Figure 8 is a format of a mixed-precision vector multiply-accumulate (MPVMAC) instruction according to some embodiments;

[0022] Figures 9A-9B are block diagrams illustrating a generic vector friendly instruction format and its instruction templates according to some embodiments of the present invention;

[0023] Figure 9A is a block diagram illustrating a generic vector friendly instruction format and its class A instruction templates according to some embodiments of the present invention;

[0024] Figure 9B is a block diagram illustrating a generic vector friendly instruction format and its class B instruction templates according to some embodiments of the present invention;

[0025] Figure 10A is a block diagram illustrating an exemplary specific vector friendly instruction format according to some embodiments of the present invention;

[0026] Figure 10B is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the full opcode field according to one embodiment;

[0027] Figure 10C is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the register index field according to one embodiment;

[0028] Figure 10D is a block diagram illustrating the fields of the specific vector friendly instruction format that make up the augmentation operation field according to one embodiment;

[0029] Figure 11 is a block diagram of a register architecture according to one embodiment;

[0030] Figure 12A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue / execution pipeline according to some embodiments;

[0031] Figure 12B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue / execution architecture core to be included in a processor according to some embodiments;

[0032] Figures 13A-B show a block diagram of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and / or different types) in a chip;

[0033] Figure 13A is a block diagram of a single processor core according to some embodiments, together with its connection to the on-die interconnect network and its local subset of the Level 2 (L2) cache;

[0034] Figure 13B is an expanded view of a portion of the processor core of Figure 13A according to some embodiments;

[0035] Figure 14 is a block diagram of a processor according to some embodiments, which may have more than one core, may have an integrated memory controller, and may have integrated graphics;

[0036] Figures 15-18 are block diagrams of exemplary computer architectures;

[0037] Figure 15 shows a block diagram of a system according to some embodiments;

[0038] Figure 16 is a block diagram of a first more specific exemplary system according to some embodiments;

[0039] Figure 17 is a block diagram of a second more specific exemplary system according to some embodiments;

[0040] Figure 18 is a block diagram of a system-on-a-chip (SoC) according to some embodiments; and

[0041] Figure 19 is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set, according to some embodiments.

Detailed Implementation

[0042] Numerous specific details are set forth in the following description. However, it should be understood that some embodiments can be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

[0043] References in the specification to "one embodiment," "an embodiment," "an exemplary embodiment," etc., indicate that the described embodiment may include a feature, structure, or characteristic, but every embodiment need not necessarily include that feature, structure, or characteristic. Furthermore, such phrases do not necessarily refer to the same embodiment. Additionally, when a feature, structure, or characteristic is described with respect to an embodiment, it is submitted that it is within the knowledge of those skilled in the art to effect such a feature, structure, or characteristic with respect to other embodiments, if explicitly described.

[0044] The following description discloses various systems and methods for executing mixed-precision vector multiply-accumulate (MPVMAC) instructions using a 16-bit format that includes a sign bit, seven significand bits, and an eight-bit exponent. This format is referred to as the FP16-S7E8 format (for its significand and exponent bit widths) or the neural half-precision format (as it has been shown to improve the performance of neural networks used in machine learning contexts, during both the training and inference phases). According to some embodiments, executing the disclosed MPVMAC instructions involves using the reduced-precision FP16-S7E8 (neural half-precision) format during multiplication, followed by a 32-bit single-precision accumulation, which can use the FP32 / binary32 single-precision floating-point format defined in IEEE 754-2008.

[0045] Because of its lower precision, using the FP16-S7E8 format for multiplication is expected to yield benefits in memory bandwidth, power, and silicon area, with minimal negative impact on training and inference performance. Combining the FP16-S7E8 format with the execution of MPVMAC instructions is expected to improve the performance and power efficiency of neural network processing and similar deep learning workloads.

[0046] The disclosed embodiments avoid reliance on the IEEE-FP16 (S10E5) reduced-precision format, which could result in architectures that do not work with low-precision multiplication. The disclosed embodiments also avoid forcing the IEEE-FP16 (S10E5) format to work by maintaining a copy of each data element in binary32 format so that the algorithm can be rewound and restarted if it fails. That workaround requires significant additional memory and silicon area.

[0047] Therefore, the disclosed embodiments use a 16-bit format for multiplication. This format works well in neural network applications, and is therefore sometimes referred to as "neural" half-precision. Based on its 7 significand bits and 8-bit exponent, this format is also called FP16-S7E8 (or S7E8-FP16). The disclosed FP16-S7E8 format can represent numbers over nearly the same range as the FP32 format. The disclosed embodiments also include mixed-precision multiply-accumulate instructions that use this 16-bit FP16-S7E8 format for multiplication, but use 32-bit single precision for accumulation.
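The range comparison can be made concrete with a short calculation. The following is a minimal sketch, assuming that FP16-S7E8 follows the usual IEEE-style conventions (an exponent bias of 2^(E-1) - 1 and an all-ones exponent reserved for infinity and NaN); the description does not state these details, and the function name `normal_range` is hypothetical.

```python
def normal_range(exp_bits, frac_bits):
    """Smallest and largest positive normal values of an IEEE-style binary format.

    Assumption: bias = 2**(exp_bits - 1) - 1 and the all-ones exponent is reserved.
    """
    bias = (1 << (exp_bits - 1)) - 1
    min_normal = 2.0 ** (1 - bias)
    max_normal = (2.0 - 2.0 ** (-frac_bits)) * 2.0 ** bias
    return min_normal, max_normal


formats = [
    ("FP16-S7E8", 8, 7),
    ("IEEE-FP16 (S10E5)", 5, 10),
    ("FP32", 8, 23),
]
for name, exp_bits, frac_bits in formats:
    lo, hi = normal_range(exp_bits, frac_bits)
    print(f"{name:18s} min normal = {lo:.3e}   max normal = {hi:.3e}")
```

Under these assumptions the 8-bit exponent gives FP16-S7E8 the same smallest normal value as FP32 and a largest value within about half a percent of FP32's, whereas IEEE-FP16 tops out at 65504.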

[0048] The disclosed embodiments are expected to improve GEMM (General Matrix Multiplication) performance. In terms of silicon area, the disclosed embodiments are expected to be slightly cheaper than IEEE-FP16 (S10E5) and significantly cheaper than 32-bit FP32 (binary32). The area cost of a floating-point multiplier is dominated by the mantissa multiplier and is proportional to the square of the number of mantissa bits. Implementing GEMM using the disclosed FP16-S7E8 format is expected to require significantly less silicon area and consume less power than implementations utilizing FP32 multipliers.
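The square-law relationship can be turned into a rough back-of-the-envelope comparison. The sketch below is illustrative only: the unit of "area" is arbitrary, the function name is hypothetical, and counting the implicit leading bit as part of the multiplier width is an assumption that the description does not address (pass `implicit_bit=False` to count stored bits only).

```python
def relative_multiplier_area(frac_bits, implicit_bit=True):
    """Area estimate in arbitrary units, proportional to the square of the mantissa width."""
    width = frac_bits + (1 if implicit_bit else 0)
    return width * width


formats = {"FP16-S7E8": 7, "IEEE-FP16 (S10E5)": 10, "FP32": 23}
fp32_area = relative_multiplier_area(formats["FP32"])
for name, frac_bits in formats.items():
    area = relative_multiplier_area(frac_bits)
    print(f"{name:18s} {area:4d} units  ({area / fp32_area:.2f} of FP32)")
```

With the implicit bit counted, the estimate is 64 units for FP16-S7E8, 121 for IEEE-FP16, and 576 for FP32, which is consistent with the qualitative ordering stated above.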

[0049] The disclosed embodiments describe a new 16-bit FP representation (as shown in Figure 4A) to accommodate the training requirements of deep learning workloads. The disclosed format is used as part of mixed-precision fused multiply-accumulate operations, such as those shown in Figure 2.

[0050] Figure 1 This is a block diagram illustrating a processing component for executing mixed-precision vector multiply-accumulate (MPVMAC) instructions according to some embodiments. As shown, storage device 101 stores MPVMAC instructions 103 to be executed. As further described below, in some embodiments, computing system 100 is a SIMD processor for parallel processing of multiple elements of a packed data vector (e.g., a matrix).

[0051] In operation, the MPVMAC instruction 103 is fetched from the storage device 101 by the fetch circuit 105. The fetched MPVMAC instruction 107 is then decoded by the decoding circuit 109. The MPVMAC instruction format, which is further illustrated and described with respect to Figures 8, 9A-B, and 10A-D, has fields (not shown here) for specifying a first, a second, and a destination matrix. In some embodiments, the specified second matrix is a sparse matrix with a sparsity of less than one (sparsity being the proportion of non-zero elements, i.e., the second matrix has at least some zero-valued elements). Decoding circuit 109 decodes the fetched MPVMAC instruction 107 into one or more operations. In some embodiments, this decoding includes generating multiple micro-operations to be executed by execution circuitry (e.g., execution circuitry 117). Decoding circuit 109 also decodes instruction suffixes and prefixes (if used). Execution circuitry 117 is further described and illustrated below, at least with respect to Figures 2-6B, 12A-B, and 13A-B.

[0052] In some embodiments, register renaming, register allocation, and / or scheduling circuitry 113 provides functionality for one or more of the following: 1) renaming logical operand values to physical operand values (e.g., using a register alias table in some embodiments), 2) allocating status bits and flags to decoded instructions, and 3) scheduling decoded MPVMAC instructions 111 from the instruction pool for execution on execution circuitry 117 (e.g., using a reservation station in some embodiments).

[0053] Registers (register file) and / or memory 115 store data as operands of the decoded MPVMAC instruction 111 to be operated on by execution circuitry 117. Exemplary register types include write mask registers, packed data registers, general-purpose registers, and floating-point registers, as further described and illustrated below, at least with respect to Figure 11.

[0054] In some embodiments, the write-back circuit 119 commits the execution result of the decoded MPVMAC instruction 111. Execution circuit 117 and system 100 are further illustrated and described with respect to Figures 2-6B, 12A-B, and 13A-B.

[0055] Figure 2 is a block diagram illustrating processing components for executing mixed-precision vector multiply-accumulate (MPVMAC) instructions according to an embodiment. As shown, the computing system 200 executes instruction 202 to convert single-precision formatted vectors, source 1 (FP32) 206 and source 2 (FP32) 208, using FP32 to FP16-S7E8 converters 212 and 214, and stores the converted FP16-S7E8 formatted source 1 and source 2 vector elements into neural half-precision FP16-S7E8 formatted registers, source 1 (FP16-S7E8) 216 and source 2 (FP16-S7E8) 218. Then, the execution circuitry executes instruction 204 to multiply corresponding elements of the two FP16-S7E8 formatted source vectors using 16-bit multiplication circuitry 220. Each product generated by multiplication circuit 220 is then accumulated with the previous value of the corresponding element of destination (FP32) 224 using 32-bit accumulator (FP32) circuit 222.
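The data flow described for Figure 2 (convert, multiply in 16 bits, accumulate in FP32) can be emulated in software. The following is a minimal sketch, not the disclosed circuitry. It assumes that an FP16-S7E8 value is the upper 16 bits of the corresponding binary32 pattern, converts by truncation, keeps each product exact before accumulation, and rounds the running sum to FP32; all function names are hypothetical.

```python
import struct


def to_fp32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp32_to_s7e8(x):
    """FP32 -> 16-bit pattern by dropping the low 16 bits (assumed layout, truncation)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def s7e8_to_float(bits16):
    """16-bit pattern -> float by zero-filling the low 16 bits of a binary32 pattern."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def mpvmac(dest, src1_nhp, src2_nhp):
    """Return dest[i] + src1[i] * src2[i] for each lane, accumulated in FP32."""
    result = []
    for acc, a, b in zip(dest, src1_nhp, src2_nhp):
        # Two 8-bit significands multiply exactly in binary64.
        product = s7e8_to_float(a) * s7e8_to_float(b)
        result.append(to_fp32(acc + product))
    return result


src1 = [1.0, 2.5, -3.0, 0.1]
src2 = [4.0, 2.0, 0.5, 10.0]
dest = [0.0, 1.0, 1.0, 0.0]
packed1 = [fp32_to_s7e8(v) for v in src1]
packed2 = [fp32_to_s7e8(v) for v in src2]
print(mpvmac(dest, packed1, packed2))  # [4.0, 6.0, -0.5, 0.99609375]
```

The last lane shows the cost of the reduced precision: 0.1 becomes 0.099609375 after truncation to seven stored significand bits, so the product with 10.0 is 0.99609375 rather than 1.0.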

[0056] As shown, instruction 202 is a VCVTPS2PNH (Vector Convert Packed Single-Precision to Packed Neural Half-Precision) instruction that specifies a 256-bit memory location or vector register destination, a 512-bit memory location or vector register source vector, and an 8-bit immediate for specifying rounding behavior. The format of the VCVTPS2PNH instruction is further illustrated and described with respect to Figures 8, 9A-B, and 10A-D. Here, the VCVTPS2PNH instruction is invoked twice, each time specifying a 512-bit memory location as the source and a 256-bit vector register as the destination.

[0057] As shown, instruction 204 is a VDPPNHS (Vector Dot Product Packed Neural Half-Precision Multiply, Single-Precision Accumulate) instruction that specifies two 256-bit source vector registers or memory locations and a 512-bit destination vector register. It also specifies a mask k1, whose lowest-order 16 bits control whether each destination vector register element is written with a new value (when it is not masked). It also includes a {z} bit, which specifies whether masked destination elements are zeroed or left unchanged. The format of the VDPPNHS instruction is further illustrated and described with respect to Figures 8, 9A-B, and 10A-D. Here, instruction 204 specifies source 1 (FP16-S7E8) 216 and source 2 (FP16-S7E8) 218 as its source vectors and specifies destination (FP32) 224 as its destination.
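The effect of the k1 mask and the {z} control on the destination can be illustrated with a small model. This is a sketch under assumptions: it reads a set mask bit as "write the new value", and it treats the non-zeroing case as leaving the old destination element in place, following common write-mask conventions; the function name is hypothetical.

```python
def apply_write_mask(old_dest, new_values, k1, zeroing):
    """Bit i of k1 set: element i takes the new value; clear: zero it or keep the old one."""
    result = []
    for i, (old, new) in enumerate(zip(old_dest, new_values)):
        if (k1 >> i) & 1:
            result.append(new)
        elif zeroing:
            result.append(0.0)   # {z} set: masked element is zeroed
        else:
            result.append(old)   # {z} clear: masked element is preserved (assumed)
    return result


old = [1.0, 2.0, 3.0, 4.0]
new = [10.0, 20.0, 30.0, 40.0]
print(apply_write_mask(old, new, 0b0101, zeroing=False))  # [10.0, 2.0, 30.0, 4.0]
print(apply_write_mask(old, new, 0b0101, zeroing=True))   # [10.0, 0.0, 30.0, 0.0]
```

A 512-bit destination holds 16 FP32 elements, which is why only the lowest-order 16 bits of the mask are relevant; the example uses four lanes for brevity.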

[0058] The operation of the computing system 200 for executing MPVMAC instructions is further illustrated and described with respect to Figures 3, 5A-B, 12A-B, and 13A-B.

[0059] For simplicity, the circuitry used to execute the MPVMAC instruction is shown operating on a single data value. However, it should be understood that the source and destination shown are vectors. In some embodiments, the computing system 200 performs serial operations on the elements of a vector. In some embodiments, the computing system 200 performs parallel operations on multiple elements of a vector. In other embodiments, the computing system 200 operates on multiple elements of a matrix (slice), such as rows or columns of matrix (slice) elements. In some embodiments, the computing system 200 utilizes SIMD (Single Instruction Multiple Data) circuitry to perform parallel operations on multiple vector elements.

[0060] Figure 3 is a block flow diagram illustrating a processor executing a mixed-precision vector multiply-accumulate (MPVMAC) instruction according to an embodiment. As shown, a processor executing flow 300 executes instruction 302 twice to convert two single-precision (FP32) source vectors into neural half-precision (NHP) vectors, and then executes a mixed-precision vector multiply-accumulate (MPVMAC) instruction 304 to multiply each pair of corresponding 16-bit elements and then, using a 32-bit processing channel, accumulate each product with the previous value of the corresponding FP32 destination element.

[0061] At 306, the processor uses fetch circuitry to fetch a compression instruction having fields specifying the locations of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements. At 308, the processor uses decoding circuitry to decode the fetched compression instruction.

[0062] At 310, the processor responds to the decoded compression instruction by using execution circuitry to convert each element of the source vector into NHP format, round each converted element according to a rounding mode, and write each rounded element to the corresponding compressed vector element, wherein the NHP format includes seven significand bits and eight exponent bits, and wherein the source vector and the compressed vector are each in memory or in a vector register.

[0063] At 312, the processor uses the fetch, decode, and execution circuitry to fetch, decode, and execute a second compression instruction that specifies the locations of a second source vector having N single-precision formatted elements and a second compressed vector having N NHP-formatted elements.

[0064] At 314, the processor uses the fetch and decode circuitry to fetch and decode an MPVMAC instruction having fields specifying first and second source vectors having N NHP-formatted elements and a destination vector having N single-precision formatted elements, wherein the specified source vectors are the compressed vector and the second compressed vector.

[0065] At 316, the processor responds to the decoded MPVMAC instruction by, for each of the N elements, generating a 16-bit product of the compressed vector element and the second compressed vector element, and accumulating the generated 16-bit product with the previous contents of the corresponding element of the destination vector.

[0066] In some embodiments, at 318, the processor writes back the execution results and retires the MPVMAC instruction. Operation 318 is optional, as indicated by its dashed border, insofar as the write-back may occur at a different time or may not occur at all.

[0067] The operation of the processor for executing MPVMAC instructions is further illustrated and described with respect to Figures 2, 5A-B, 12A-B, and 13A-B.

[0068] Figure 4A shows the floating-point formats used in conjunction with some of the disclosed embodiments. As shown, the FP16-S7E8 format 402 is used in conjunction with various disclosed embodiments and consists of a sign bit, an 8-bit exponent, and 7 significand bits (sometimes referred to as the mantissa, coefficient, argument, or fraction). On the other hand, the disclosed embodiments avoid using the IEEE 754 half-precision format 404 (sometimes referred to as binary16, IEEE-FP16, or FP16-S10E5), which consists of a sign bit, a 5-bit exponent, and 10 significand bits. The disclosed embodiments avoid the IEEE binary16 format because training tends not to converge when the result of a single scalar product within a GEMM cannot be represented with the 5-bit exponent of IEEE half precision. Also shown is the IEEE 754 single-precision (FP32 / binary32) format 406. The disclosed embodiments use a 16-bit multiplication stage to generate the product of FP16-S7E8 operands, and then accumulate the product through a 32-bit channel to generate a 32-bit FP32 result.
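The 1 + 8 + 7 bit split of format 402 can be explored by pulling the three fields out of a 16-bit pattern. The sketch below assumes the fields are ordered as in binary32 (sign in the top bit, then exponent, then significand) and that a 16-bit pattern can be obtained by keeping the upper half of a binary32 value; neither detail is confirmed by the text, and the function names are hypothetical.

```python
import struct


def fp32_bits(x):
    """Return the 32-bit binary32 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_to_s7e8_truncate(x):
    """Keep the upper 16 bits of the binary32 pattern (assumed layout)."""
    return fp32_bits(x) >> 16


def split_s7e8(bits16):
    """Split a 16-bit pattern into (sign, 8-bit exponent, 7-bit significand) fields."""
    sign = (bits16 >> 15) & 0x1
    exponent = (bits16 >> 7) & 0xFF
    significand = bits16 & 0x7F
    return sign, exponent, significand


bits = fp32_to_s7e8_truncate(-1.5)
print(f"{bits:#06x}", split_s7e8(bits))  # 0xbfc0 (1, 127, 64)
```

For -1.5 the fields read: sign 1, biased exponent 127 (that is, 2^0), and significand field 64 out of 128 (that is, a fraction of 0.5), which reassembles to -(1 + 0.5) x 2^0.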

[0069] Figure 4B shows the proportion of these intermediate results (during CIFAR-10 training) that cannot be represented in IEEE half precision (FP16) but can be represented with a 6-bit exponent. Histogram 450 shows that the range 452 covered by a 6-bit exponent is significantly larger than the range 454 covered by IEEE-FP16 (S10E5). This is a significant advantage, because training with IEEE-FP16 (S10E5) is unsuccessful, whereas training with FP16-S7E8 is successful.
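Whether a given intermediate product falls inside the range covered by an exponent field of a given width can be checked directly. The following is a minimal sketch under the assumption of an IEEE-style bias and reserved all-ones exponent; it ignores subnormal numbers, and the sample operand values are hypothetical rather than taken from the experiments.

```python
import math


def fits_normal_range(value, exp_bits):
    """True if |value| lies in the normal range of a format with exp_bits exponent bits."""
    if value == 0.0:
        return True
    bias = (1 << (exp_bits - 1)) - 1
    # math.frexp gives value = m * 2**e with 0.5 <= |m| < 1, so the
    # IEEE-style exponent (significand in [1, 2)) is e - 1.
    exponent = math.frexp(abs(value))[1] - 1
    return (1 - bias) <= exponent <= bias


product = 1e-3 * 2e-4  # hypothetical small activation times small gradient
for exp_bits in (5, 6, 8):
    print(exp_bits, fits_normal_range(product, exp_bits))
```

For this product (about 2e-7, exponent -23) the 5-bit exponent range of [-14, 15] is too narrow, while the 6-bit range of [-30, 31] and the 8-bit range both cover it.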

[0070] Since a large number of these scalar products are reduced into each individual element of the matrix result, the precision of each individual scalar product result is less important, and the number of significand bits can therefore be safely reduced. As described in the Experimental Results section below, convergence of certain neural networks can be achieved by reformatting the multiplier inputs from IEEE FP32 to a 16-bit format (such as FP16-S7E8).

[0071] Figure 5A is a block diagram illustrating the execution of an instruction for converting a format from standard single precision to neural half precision according to some embodiments. As shown, computing system 500 executes instruction 502 to convert a single-precision formatted vector, source 1 (FP32) 504, using FP32 to FP16-S7E8 converter 510 and rounding circuit 512, to generate 16 neural half-precision (NHP) formatted values and store them in a compressed vector (FP16-S7E8) 514.

[0072] As shown, instruction 502 is a VCVTPS2PNH (Vector Convert Packed Single-Precision to Packed Neural Half-Precision) instruction that specifies a 256-bit memory location or vector register destination, a 512-bit memory location or vector register source vector, and an 8-bit immediate for specifying rounding behavior. The format of the VCVTPS2PNH instruction is further illustrated and described with respect to Figures 8, 9A-B, and 10A-D.

[0073] In operation, the computing system 500 uses execution circuit 508, including FP32 to FP16-S7E8 converter 510 and rounding circuit 512, to convert each element of source 1 (FP32) 504 and store each converted element into the corresponding element of the compressed vector (FP16-S7E8) 514. The operation of the computing system 500 for executing the conversion instruction is further illustrated and described with respect to Figures 5B, 12A-B, and 13A-B.
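The role of the rounding circuit can be illustrated by comparing two rounding behaviors on the 16 bits that the conversion discards. This is a sketch under assumptions: the text says only that an 8-bit immediate selects the rounding behavior and does not give its encoding, so the two modes below (round toward zero, and round to nearest with ties to even) and all function names are hypothetical, as is the upper-half-of-binary32 layout.

```python
import struct


def fp32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def s7e8_to_float(bits16):
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def convert_truncate(x):
    """Round toward zero: drop the low 16 bits of the binary32 pattern."""
    return fp32_bits(x) >> 16


def convert_nearest_even(x):
    """Round to nearest, ties to even, based on the 16 discarded bits."""
    bits = fp32_bits(x)
    if (bits & 0x7F800000) == 0x7F800000:
        # Infinity or NaN: keep the top half, and keep a NaN a NaN.
        return (bits >> 16) | (0x40 if bits & 0x007FFFFF else 0)
    lsb = (bits >> 16) & 1
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF


for mode in (convert_truncate, convert_nearest_even):
    packed = mode(0.1)
    print(mode.__name__, f"{packed:#06x}", s7e8_to_float(packed))
```

For 0.1, truncation yields 0x3dcc (0.099609375) and round-to-nearest yields 0x3dcd (0.10009765625), so the two modes differ by one unit in the last place of the 7-bit significand field.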

[0074] Figure 5B is a block flow diagram illustrating an embodiment of a processor executing an instruction for converting a format from standard single precision to neural half precision. As shown, flow 550 is performed by a processor to execute instruction 552, thereby converting a single-precision (FP32) source vector into a neural half-precision (NHP) vector.

[0075] At 556, the processor uses fetch circuitry to fetch a compression instruction having fields specifying the locations of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements. At 558, the processor uses decoding circuitry to decode the fetched compression instruction.

[0076] At 560, the processor responds to the decoded compression instruction by using execution circuitry to convert each element of the source vector into NHP format, round each converted element according to a rounding mode, and write each rounded element to the corresponding compressed vector element, wherein the NHP format includes seven significand bits and eight exponent bits, and wherein the specified source vector and compressed vector are each in memory or in a register.

[0077] In some embodiments, at 562, the processor writes back the execution results and retires the instruction. Operation 562 is optional, as indicated by its dashed border, insofar as the write-back may occur at a different time or may not occur at all.

[0078] Figure 6A is a block diagram illustrating the execution of an instruction for converting a format from neural half precision to standard single precision according to some embodiments. As shown, computing system 600 executes instruction 602 to convert a neural half-precision formatted vector, source 1 (FP16-S7E8) 604, using FP16-S7E8 to FP32 converter 610, to generate 16 single-precision (FP32) formatted values and store them in destination vector (FP32) 614.

[0079] As shown, instruction 602 is a VCVTPNH2PS (Vector Convert Packed Neural Half-Precision to Packed Single-Precision) instruction that specifies a 512-bit memory location or vector register destination and a 256-bit memory location or vector register source vector. The format of the VCVTPNH2PS instruction is further illustrated and described with respect to Figures 8, 9A-B, and 10A-D.

[0080] In operation, the computing system 600 uses execution circuitry 608, including FP16-S7E8 to FP32 converter 610, to convert each element of source 1 (FP16-S7E8) 604 and store each converted element into the corresponding element of destination vector (FP32) 614. The operation of the computing system 600 for executing the conversion instruction is further illustrated and described with respect to Figures 6B, 12A-B, and 13A-B.
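The widening direction is simpler than the narrowing one, which may explain why instruction 602 carries no rounding immediate. The sketch below assumes, as a hypothetical layout, that a 16-bit FP16-S7E8 pattern is the upper half of a binary32 pattern; under that assumption every FP16-S7E8 value is exactly representable in FP32 and the conversion only zero-fills the low 16 bits. The function names are hypothetical.

```python
import struct


def s7e8_to_fp32(bits16):
    """Widen a 16-bit pattern to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def convert_packed(src):
    """Element-wise conversion of a packed vector of 16-bit patterns."""
    return [s7e8_to_fp32(bits16) for bits16 in src]


packed = [0x3F80, 0xBFC0, 0x4120, 0x0000]
print(convert_packed(packed))  # [1.0, -1.5, 10.0, 0.0]
```

A 256-bit source holds 16 such 16-bit elements and expands into a 512-bit destination of 16 FP32 elements; the example uses four elements for brevity.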

[0081] Figure 6B is a block flow diagram illustrating an embodiment of a processor executing an instruction for converting a format from neural half precision to standard single precision. As shown, flow 650 is performed by a processor to execute instruction 652, thereby converting a neural half-precision (NHP) source vector into a single-precision (FP32) vector.

[0082] At 656, the processor uses fetch circuitry to fetch an expansion instruction having fields specifying the locations of a compressed source vector having N neural half-precision (NHP) formatted elements and a destination vector having N single-precision formatted elements. At 658, the processor uses decode circuitry to decode the fetched expansion instruction.

[0083] At 660, the processor uses execution circuitry to respond to the decoded expansion instruction by converting each element of the compressed source vector into single-precision format and writing each converted element to the corresponding destination vector element.

[0084] In some embodiments, at 662, the processor writes back the execution results and retires the instruction. Operation 662 is optional, as indicated by its dashed border, insofar as the write-back may occur at a different time or may not occur at all.

[0085] Experimental results

[0086] The disclosed embodiments using the FP16-S7E8 reduced-precision data format are expected to improve the performance and efficiency of a processor executing mixed-precision vector multiply-accumulate (MPVMAC) instructions as part of a machine learning workload, as demonstrated by the experimental results described below. The tests followed the methodology shown and described with respect to Figure 7A. Figure 7B shows that training on the CIFAR-10 image dataset fails to converge when using a 16-bit format with a 5-bit exponent (such as IEEE-FP16 (S10E5)). Figure 7C shows that convergence is achieved using the same setup but with a 6-bit exponent. Given that the experiments show convergence with a 6-bit exponent, convergence can conservatively be expected with an 8-bit exponent (e.g., the FP16-S7E8 format of the disclosed embodiments). Figures 7D and 7E present experimental results for the AlexNet and ResNet-50 neural networks, showing improved convergence when using the FP16-S7E8 format of the disclosed embodiments compared to the IEEE-FP16 (S10E5) format.

[0087] Experimental Methodology

[0088] Figure 7A is a flowchart illustrating the methodology applied in conducting the machine learning experiments. Using experimental method 700, experimental reports are generated by performing training runs of various neural networks with the CAFFE (Convolutional Architecture for Fast Feature Embedding) deep learning framework. CAFFE supports many different types of deep learning architectures geared toward image classification and image segmentation.

[0089] As shown, in 702, the neural network testbench is configured and compiled. In 704, a custom SGEMM is installed. (GEMM, General Matrix Multiplication, is a common function in deep learning and is part of BLAS, the Basic Linear Algebra Subprograms library. SGEMM and DGEMM are the single-precision and double-precision versions of GEMM, respectively.) During experimentation, a custom tool called the "BLAS Interceptor" intercepts SGEMM calls and replaces them with the custom SGEMM, which changes the precision of the multipliers and accumulators without any changes to the binary.

[0090] In 706, the binary is run using the GNU MPFR library for CIFAR-10 (CIFAR-10 is an established computer vision dataset for object recognition that contains tens of thousands of 32x32 color images) and using low-precision intrinsics with round-toward-zero for AlexNet and ResNet.

[0091] In 708, a customized report is generated.
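The effect of such a custom SGEMM can be approximated in software as follows. This is only a sketch under stated assumptions, not the BLAS Interceptor itself: the operands of each multiplication are truncated to a 1-8-7 (sign, exponent, mantissa) layout with round-toward-zero, the products are accumulated into an emulated FP32 accumulator, and the names `truncate_to_s7e8`, `to_fp32`, and `custom_sgemm` are hypothetical.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (double precision) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def truncate_to_s7e8(x: float) -> float:
    """ASSUMPTION: S7E8 is the upper 16 bits of FP32; drop the rest (round toward zero)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def custom_sgemm(a, b):
    """C = A x B with reduced-precision multiplier inputs and FP32 accumulation."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0  # emulated 32-bit accumulator
            for k in range(inner):
                # Each scalar product uses 16-bit (S7E8) operands.
                prod = truncate_to_s7e8(a[i][k]) * truncate_to_s7e8(b[k][j])
                # The running sum is kept in single precision.
                acc = to_fp32(acc + prod)
            c[i][j] = acc
    return c

result = custom_sgemm([[1.0, 2.0]], [[3.0], [4.0]])
```

Changing the mask in `truncate_to_s7e8` is one way to sweep the number of mantissa bits, in the spirit of the parameter scans reported below.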

[0092] Non-convergence of CIFAR-10 using a 5-bit exponent in the multiplier and an FP32 accumulator

[0093] Figure 7B shows experimental results related to the non-convergence of CIFAR-10 with a 5-bit exponent in the multiplier and an FP32 accumulator. Convergence graph 710 illustrates CIFAR-10 with an SGEMM implementation having a multiplier with a 5-bit exponent and an FP32 accumulator. The network does not converge with a 5-bit exponent in the multiplier, regardless of the number of mantissa bits: the loss remains constant (~2.3) for all mantissa widths.

[0094] As can be seen in the histogram of Figure 4B, the FP16-S10E5 format with a 5-bit exponent has a significantly narrower range than FP16-S7E8. Therefore, the IEEE-FP16 (S10E5) format, which has a 5-bit exponent and a 10-bit mantissa, cannot serve as a drop-in replacement: an SGEMM multiplier implemented with IEEE-FP16 (S10E5) semantics causes convergence failure and forces the user to perform hyperparameter tuning to compensate.

[0095] On the other hand, in some embodiments, multipliers implemented according to the disclosed FP16-S7E8 format can serve as drop-in replacements. In such embodiments, FP16-S7E8 multipliers are used as drop-in replacements, for example, by linking SGEMM software running in a machine learning context to a different library that instructs the hardware to perform 16-bit multiplication according to the FP16-S7E8 format and to accumulate the results according to the FP32 format. Such a drop-in replacement benefits the processor by consuming less power, improving performance through narrower multiplication, increasing instruction throughput, and relieving register and memory pressure by reducing the size of the data elements being transferred. Similarly, in some embodiments, FP16-S7E8 is used as a drop-in replacement for SGEMM multipliers implemented according to any other 16-bit or 32-bit format, for example, by linking the SGEMM software to a different function-call library that enables FP16-S7E8. In some embodiments, FP16-S7E8 libraries are linked as drop-in replacements for SGEMM multipliers without the involvement of the operating system or any other software running on the processor. In some embodiments, for example in response to a need to reduce power consumption or increase instruction throughput, the FP16-S7E8 library is dynamically linked as a drop-in replacement for the SGEMM multiplier.
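The difference in representable range between the two 16-bit formats can be worked out directly. The sketch below assumes that both formats follow IEEE-754 conventions (a bias of 2^(e-1) - 1, the all-ones exponent code reserved for infinities and NaNs, and an implicit leading 1 for normal numbers); the disclosure does not state these details for FP16-S7E8, so they are an assumption, and `format_range` is a hypothetical helper.

```python
def format_range(exp_bits: int, man_bits: int):
    """Return (smallest positive normal, largest finite) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias  # top exponent code assumed reserved
    min_exp = 1 - bias                    # exponent code 0 assumed reserved
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    smallest_normal = 2.0 ** min_exp
    return smallest_normal, largest

s10e5 = format_range(exp_bits=5, man_bits=10)  # IEEE-FP16
s7e8 = format_range(exp_bits=8, man_bits=7)    # FP16-S7E8 under the assumptions above
```

Under these assumptions S10E5 spans roughly 6.1e-5 to 65504 for normal numbers, whereas S7E8 spans roughly 1.2e-38 to 3.4e38, which is the same exponent range as FP32.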

[0096] Convergence of CIFAR-10 with 6-bit exponent multiplication and FP32 accumulator

[0097] Figure 7C shows experimental results related to the convergence of CIFAR-10 with a 6-bit exponent, a parameter scan over the number of mantissa bits, and FP32 accumulation. As shown in convergence graph 720, the number of exponent bits is extended to 6 and the same experiments as in Figure 7B are then conducted. In this case, the SGEMM implementation consists of a multiplier implemented with a 6-bit exponent (and a parameter scan over the number of mantissa bits) and an accumulator implemented with FP32 semantics.

[0098] As shown, with a 6-bit exponent in the multiplier, more than two mantissa bits, and accumulation performed in FP32, CIFAR-10 converges and matches the behavior of FP32. Convergence graph 720 of Figure 7C captures this convergence behavior.

[0099] The histogram of scalar products seen in Figure 4B again provides the basic argument: extending the exponent by 1 bit increases the range of numbers that the multiplier can represent, so that all products are successfully represented.

[0100] Therefore, extending the exponent while implementing the SGEMM multiplier with lower precision (and keeping the FP32 accumulator) allows a drop-in replacement on neural networks such as the CIFAR-10 network, using hyperparameters tuned for FP32.

[0101] Convergence of AlexNet and ResNet-50 with FP16-S7E8, IEEE-FP16 (S10E5), and FP32

[0102] The above experiments are extended to cover larger networks. The multipliers used in SGEMM, namely IEEE-FP16 S10E5 (10-bit mantissa, 5-bit exponent) and FP16-S7E8 (7-bit mantissa, 8-bit exponent), are implemented using intrinsics. The intrinsics implementation is an exact match for the MPFR implementation.

[0103] AlexNet

[0104] AlexNet was the first large neural network to win the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) and is widely used in computer vision tasks. Figure 7D shows experimental results related to the convergence of AlexNet with multipliers implemented using the IEEE-FP16 (S10E5) and FP16-S7E8 representations and accumulators implemented using FP32; convergence graph 730 shows AlexNet with SGEMM using lower-precision multiplication. Convergence graph 730 shows that the FP16-S7E8 representation produces better convergence results than IEEE-FP16 (S10E5), which requires significantly more iterations to achieve similar convergence. Although not shown, experimental results for AlexNet show that using FP16-S7E8 for the multipliers produces convergence performance comparable to that of using FP32 for the multipliers.

[0105] ResNet-50

[0106] Figure 7E shows convergence graph 740 of ResNet-50, comparing FP16-S7E8 against IEEE-FP16 and FP32. ResNet-50 is a large neural network implementation with 50 layers that was recently used in the ILSVRC challenge. The convergence behavior of ResNet-50 with low-precision SGEMM, as seen in Figure 7E, follows the behavior observed for AlexNet. SGEMM with a multiplier using FP16-S7E8 tracks the convergence pattern of FP32. IEEE-FP16 (S10E5) requires more iterations to converge. Although not shown, experimental results for ResNet-50 show that using FP16-S7E8 for the multiplier yields convergence performance comparable to using FP32.

[0107] Figure 8 shows the format of a mixed-precision vector multiply-accumulate (MPVMAC) instruction according to some embodiments. As shown, the MPVMAC instruction 800 includes fields for specifying an opcode 802 (here, a conversion or a multiply-accumulate), a destination vector 804, and a source vector (source 1) 806. When formatted in single precision (FP32), the specified source and destination vectors are 512-bit memory locations or vector registers; when formatted in neural half precision (NHP or FP16-S7E8), they are 256-bit memory locations or vector registers.

[0108] Some embodiments also include one or more additional fields for specifying a second source (source 2) 808, an 8-bit immediate value 810, and the number of elements N 812 in the source and destination vectors. These additional fields are optional, as indicated by their dashed boxes, insofar as they may be omitted from the instruction, in which case the corresponding behavior may instead be controlled via software-programmable control registers.

[0109] The format of the MPVMAC instruction 800 is further illustrated and described with respect to Figures 9A-B and Figures 10A-D.

[0110] Instruction set

[0111] An instruction set may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed, and / or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included) and / or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source 1 / destination and source 2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. A set of SIMD extensions referred to as the Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the Vector Extensions (VEX) coding scheme has been released and / or published (see, for example, the Intel® 64 and IA-32 Architectures Software Developer's Manual, September 2014; and the Intel® Advanced Vector Extensions Programming Reference, October 2014).

[0112] Exemplary instruction format

[0113] The embodiments of the instructions described herein can be implemented in different formats. Furthermore, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instructions can be executed on such systems, architectures, and pipelines, but are not limited to those detailed herein.

[0114] General vector-friendly instruction format

[0115] The vector-friendly instruction format is an instruction format suitable for use with vector instructions (e.g., certain fields exist specific to vector operations). Although embodiments are described in which both vector and scalar operations are supported through the vector-friendly instruction format, alternative embodiments use only vector operations in the vector-friendly format.

[0116] Figures 9A-9B are block diagrams illustrating a general vector-friendly instruction format and its instruction templates according to some embodiments of the present invention. Figure 9A is a block diagram illustrating the general vector-friendly instruction format and its category A instruction templates according to some embodiments of the present invention, while Figure 9B is a block diagram illustrating the general vector-friendly instruction format and its category B instruction templates according to some embodiments of the present invention. Specifically, category A and category B instruction templates are defined for the general vector-friendly instruction format 900, both of which include no-memory-access 905 instruction templates and memory-access 920 instruction templates. In the context of the vector-friendly instruction format, the term "general" means that the instruction format is not tied to any particular instruction set.

[0117] Embodiments of the invention will be described in which the vector-friendly instruction format supports the following: a 64-byte vector operand length (or size) with data element widths (or sizes) of 32 bits (4 bytes) or 64 bits (8 bytes) (and thus, a 64-byte vector consists of either 16 doubleword-size elements or, alternatively, 8 quadword-size elements); a 64-byte vector operand length (or size) with data element widths (or sizes) of 16 bits (2 bytes) or 8 bits (1 byte); a 32-byte vector operand length (or size) with data element widths (or sizes) of 32 bits (4 bytes), 64 bits (8 bytes), 16 bits (2 bytes), or 8 bits (1 byte); and a 16-byte vector operand length (or size) with data element widths (or sizes) of 32 bits (4 bytes), 64 bits (8 bytes), 16 bits (2 bytes), or 8 bits (1 byte). However, alternative embodiments may support more, fewer, and / or different vector operand sizes (e.g., 256-byte vector operands) with more, fewer, or different data element widths (e.g., 128-bit (16-byte) data element widths).

[0118] The category A instruction templates in Figure 9A include: 1) within the no-memory-access 905 instruction templates, a no-memory-access, full round control type operation 910 instruction template and a no-memory-access, data transformation type operation 915 instruction template; and 2) within the memory-access 920 instruction templates, a memory-access, temporal 925 instruction template and a memory-access, non-temporal 930 instruction template. The category B instruction templates in Figure 9B include: 1) within the no-memory-access 905 instruction templates, a no-memory-access, write mask control, partial round control type operation 912 instruction template and a no-memory-access, write mask control, VSIZE type operation 917 instruction template; and 2) within the memory-access 920 instruction templates, a memory-access, write mask control 927 instruction template.

[0119] The general vector-friendly instruction format 900 includes the following fields, listed below in the order shown in Figures 9A-9B.
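The relationship between vector operand length and data element width stated above is simple arithmetic, sketched here for convenience. The helper name `elements_per_vector` is hypothetical.

```python
def elements_per_vector(vector_bytes: int, element_bits: int) -> int:
    """Number of data elements that fit in a vector operand of the given size."""
    total_bits = vector_bytes * 8
    if total_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the vector length")
    return total_bits // element_bits

# A 64-byte vector holds 16 doubleword (32-bit) or 8 quadword (64-bit) elements.
doublewords = elements_per_vector(64, 32)
quadwords = elements_per_vector(64, 64)
# Sixteen 16-bit elements occupy a 32-byte (256-bit) vector.
halfwords_in_32_bytes = elements_per_vector(32, 16)
```

The last case matches a vector of N = 16 NHP elements, which needs half the storage of the same N elements in single precision.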

[0120] Format field 940 — A specific value in this field (the instruction format identifier value) uniquely identifies the vector-friendly instruction format and thus the occurrence of an instruction in the vector-friendly instruction format within an instruction stream. Therefore, this field is optional in the sense that it is not needed for instruction sets that only have a general vector-friendly instruction format.

[0121] Basic Operations Field 942 — Its content distinguishes different basic operations.

[0122] Register index field 944—its contents, either directly or generated from addresses, specify the locations of the source and destination operands (which are in registers or in memory). These include a sufficient number of bits for selecting N registers from a PxQ (e.g., 32x512, 16x128, 32x1024, 64x1024) register file. While in one embodiment N can be up to three source and one destination registers, alternative embodiments may support more or fewer source and destination registers (e.g., up to two sources, where one of these sources also acts as a destination; up to three sources, where one of these sources also acts as a destination; up to two sources and one destination).

[0123] Modifier field 946 - its contents distinguish occurrences of instructions in the general vector instruction format that specify memory access from those that do not; that is, it distinguishes between the no-memory-access 905 instruction templates and the memory-access 920 instruction templates. Memory access operations read and / or write to the memory hierarchy (in some cases specifying the source and / or destination addresses using values in registers), while non-memory-access operations do not (e.g., the source and destination are registers). Although in one embodiment this field also selects between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations.

[0124] The augmentation operation field 950—its contents distinguish which of several different operations, in addition to the basic operation, will be performed. This field is context-specific. In some embodiments, this field is divided into a category field 968, an α field 952, and a β field 954. The augmentation operation field 950 allows a general group of operations to be executed in a single instruction instead of two, three, or four instructions.

[0125] Scale field 960 - its contents allow the contents of the index field to be scaled for memory address generation (e.g., for address generation that uses 2^scale * index + base).

[0126] Displacement field 962A - its contents are used as part of memory address generation (e.g., for address generation that uses 2^scale * index + base + displacement).

[0127] Displacement factor field 962B (note that the juxtaposition of displacement field 962A directly over displacement factor field 962B indicates that one or the other is used) - its contents are used as part of address generation; it specifies a displacement factor that is to be scaled by the size of a memory access (N), where N is the number of bytes in the memory access (e.g., for address generation that uses 2^scale * index + base + scaled displacement). Redundant low-order bits are ignored, and hence the contents of the displacement factor field are multiplied by the total size of the memory operand (N) to generate the final displacement used in calculating the effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 974 (described later herein) and the data manipulation field 954C. The displacement field 962A and the displacement factor field 962B are optional in the sense that they are not used for the no-memory-access 905 instruction templates and / or in the sense that different embodiments may implement only one or neither of the two.
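The address generation described for the scale, displacement, and displacement factor fields can be sketched as follows. The function names and the example values are hypothetical and chosen only for illustration; the sketch assumes a byte-addressed flat address space.

```python
def scaled_displacement(disp_factor: int, access_bytes: int) -> int:
    """Displacement factor multiplied by N, the size in bytes of the memory access."""
    return disp_factor * access_bytes

def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """Compute 2^scale * index + base + displacement."""
    return (index << scale) + base + displacement

# Example: base 0x1000, index 3, scale 2 (multiply by 4), displacement factor 2,
# and a 64-byte memory access (N = 64), giving a scaled displacement of 128.
addr = effective_address(0x1000, 3, 2, scaled_displacement(2, 64))
```

Because the stored factor is multiplied by N, a small displacement field can reach a range N times larger than a plain byte displacement of the same width, at the cost of only addressing multiples of N.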

[0128] The data element width field 964—its contents distinguish which of the multiple data element widths should be used (in some embodiments for all instructions; in other embodiments for only some instructions). This field is optional in the sense that if only one data element width is supported and / or some aspect of the opcode is used to support the data element width, then this field is not required.

[0129] Write mask field 970 - its contents control, on a per-data-element-position basis, whether that data element position in the destination vector operand reflects the result of the base operation and the augmentation operation. Category A instruction templates support merging write masking, while category B instruction templates support both merging and zeroing write masking. When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, the old value of each element of the destination whose corresponding mask bit is 0 is preserved. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, the elements that are modified need not be consecutive. Thus, the write mask field 970 allows for partial vector operations, including loads, stores, arithmetic, logical, etc. While embodiments of the invention are described in which the contents of the write mask field 970 select one of a number of write mask registers that contains the write mask to be used (and thus the contents of the write mask field 970 indirectly identify the masking to be performed), alternative embodiments instead or additionally allow the contents of the write mask field 970 to directly specify the masking to be performed.

[0130] Immediate field 972 - its contents allow for the specification of an immediate. This field is optional in the sense that it is not present in implementations of the general vector-friendly format that do not support immediates, and it is not present in instructions that do not use an immediate.

[0131] Category field 968 - its contents distinguish between different categories of instructions. With reference to Figures 9A-B, the contents of this field select between category A and category B instructions. In Figures 9A-B, rounded-corner rectangles are used to indicate that a specific value is present in a field (e.g., category A 968A and category B 968B for the category field 968 in Figures 9A-B, respectively).
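The difference between merging and zeroing write masking can be illustrated with a short sketch. It assumes that bit i of the mask governs element i of the destination; the function name `apply_write_mask` is hypothetical.

```python
def apply_write_mask(result, old_dest, mask: int, zeroing: bool = False):
    """Combine an operation's result with the destination under a write mask.

    ASSUMPTION: bit i of `mask` controls element i.
    Mask bit 1: the element takes the new result.
    Mask bit 0: the element keeps its old value (merging) or becomes 0 (zeroing).
    """
    out = []
    for i, (new, old) in enumerate(zip(result, old_dest)):
        if (mask >> i) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out

result = [10, 20, 30, 40]
old_dest = [1, 2, 3, 4]
merged = apply_write_mask(result, old_dest, 0b0101)
zeroed = apply_write_mask(result, old_dest, 0b0101, zeroing=True)
```

With mask 0b0101, elements 0 and 2 are updated in both modes; elements 1 and 3 keep 2 and 4 when merging and become 0 when zeroing.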

[0132] Instruction template for category A

[0133] In the case of the no-memory-access 905 instruction templates of category A, the α field 952 is interpreted as an RS field 952A, whose contents distinguish which one of the different augmentation operation types is to be performed (e.g., round 952A.1 and data transformation 952A.2 are specified for the no-memory-access, round type operation 910 and the no-memory-access, data transformation type operation 915 instruction templates, respectively), while the β field 954 distinguishes which of the operations of the specified type is to be performed. In the no-memory-access 905 instruction templates, the scale field 960, the displacement field 962A, and the displacement factor field 962B are not present.

[0134] No-memory-access instruction template - full round control type operation

[0135] In the no-memory-access, full round control type operation 910 instruction template, the β field 954 is interpreted as a round control field 954A, whose contents provide static rounding. Although in the described embodiments of the invention the round control field 954A includes a suppress all floating-point exceptions (SAE) field 956 and a round operation control field 958, alternative embodiments may encode both of these concepts into the same field or have only one or the other of these concepts / fields (e.g., only the round operation control field 958).

[0136] SAE field 956—its content determines whether exception reporting is disabled; when the content of SAE field 956 indicates that suppression is enabled, a given instruction will not report any kind of floating-point exception flags and will not raise any floating-point exception handlers.

[0137] Round operation control field 958 - its contents distinguish which one of a group of rounding operations to perform (e.g., round up, round down, round toward zero, and round to nearest). Thus, the round operation control field 958 allows the rounding mode to be changed on a per-instruction basis. In some embodiments, where the processor includes a control register for specifying rounding modes, the contents of the round operation control field 958 override that register value.
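The four rounding operations named above, and the way a per-instruction rounding field takes precedence over a control register, can be sketched as follows. For readability the sketch rounds decimal values to integers with Python's `decimal` module rather than rounding binary mantissas, and it assumes that "round to nearest" means ties-to-even; the mode names and helper functions are hypothetical.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN, ROUND_HALF_EVEN

ROUNDING_MODES = {
    "round-up": ROUND_CEILING,          # toward +infinity
    "round-down": ROUND_FLOOR,          # toward -infinity
    "round-toward-zero": ROUND_DOWN,    # truncate
    "round-to-nearest": ROUND_HALF_EVEN,  # ASSUMPTION: ties go to even
}

def select_rounding(control_register_mode: str, instruction_mode=None) -> str:
    """A rounding mode carried by the instruction overrides the control register."""
    return instruction_mode if instruction_mode is not None else control_register_mode

def round_to_integer(value: str, mode: str) -> int:
    """Round a decimal string to an integer under the named mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUNDING_MODES[mode]))

# The control register says round-to-nearest; this instruction asks for truncation.
mode = select_rounding("round-to-nearest", "round-toward-zero")
rounded = round_to_integer("-2.5", mode)
```

The same value rounds differently under each mode: -2.5 becomes -2 when rounding up, toward zero, or to nearest (ties-to-even), and -3 when rounding down.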

[0138] No memory access instruction template—Data transformation type operation

[0139] In the no-memory-access, data transformation type operation 915 instruction template, the β field 954 is interpreted as a data transformation field 954B, whose contents distinguish which one of several data transformations is to be performed (e.g., no data transformation, swizzle, broadcast).

[0140] In the case of the memory-access 920 instruction templates of category A, the α field 952 is interpreted as an eviction hint field 952B, whose contents distinguish which one of the eviction hints is to be used (in Figure 9A, temporal 952B.1 and non-temporal 952B.2 are specified for the memory-access, temporal 925 instruction template and the memory-access, non-temporal 930 instruction template, respectively), while the β field 954 is interpreted as a data manipulation field 954C, whose contents distinguish which one of several data manipulation operations (also known as primitives) is to be performed (e.g., no manipulation; broadcast; up-conversion of a source; and down-conversion of a destination). The memory-access 920 instruction templates include the scale field 960 and optionally the displacement field 962A or the displacement factor field 962B.

[0141] Vector memory instructions perform vector loads from memory and vector stores to memory, with conversion support. As with regular vector instructions, vector memory instructions transfer data from / to memory in a data-element-wise fashion, with the elements that are actually transferred dictated by the contents of the vector mask that is selected as the write mask.

[0142] Memory access instruction template - temporal

[0143] Temporal data is data likely to be reused soon enough to benefit from caching. However, this is a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

[0144] Memory access instruction template - non-temporal

[0145] Non-temporal data is data unlikely to be reused soon enough to benefit from caching in the first-level cache and should be given priority for eviction. However, this is a hint, and different processors may implement it in different ways, including ignoring the hint entirely.

[0146] Category B instruction template

[0147] In the case of the category B instruction templates, the α field 952 is interpreted as a write mask control (Z) field 952C, whose contents distinguish whether the write masking controlled by the write mask field 970 should be merging or zeroing.

[0148] In the case of the no-memory-access 905 instruction templates of category B, part of the β field 954 is interpreted as an RL field 957A, whose contents distinguish which one of the different augmentation operation types is to be performed (e.g., round 957A.1 and vector length (VSIZE) 957A.2 are specified for the no-memory-access, write mask control, partial round control type operation 912 instruction template and the no-memory-access, write mask control, VSIZE type operation 917 instruction template, respectively), while the rest of the β field 954 distinguishes which of the operations of the specified type is to be performed. In the no-memory-access 905 instruction templates, the scale field 960, the displacement field 962A, and the displacement factor field 962B are not present.

[0149] In the no-memory-access, write mask control, partial round control type operation 912 instruction template, the rest of the β field 954 is interpreted as a round operation field 959A, and exception event reporting is disabled (a given instruction does not report any kind of floating-point exception flag and does not raise any floating-point exception handler).

[0150] Round operation control field 959A - just as with the round operation control field 958, its contents distinguish which one of a group of rounding operations to perform (e.g., round up, round down, round toward zero, and round to nearest). Thus, the round operation control field 959A allows the rounding mode to be changed on a per-instruction basis. In some embodiments, where the processor includes a control register for specifying rounding modes, the contents of the round operation control field 959A override that register value.

[0151] In the no-memory-access, write mask control, VSIZE type operation 917 instruction template, the rest of the β field 954 is interpreted as a vector length field 959B, whose contents distinguish which one of several data vector lengths is to be operated on (e.g., 128, 256, or 512 bytes).

[0152] In the case of the memory-access 920 instruction templates of category B, part of the β field 954 is interpreted as a broadcast field 957B, whose contents distinguish whether or not a broadcast-type data manipulation operation is to be performed, while the rest of the β field 954 is interpreted as the vector length field 959B. The memory-access 920 instruction templates include the scale field 960 and optionally the displacement field 962A or the displacement factor field 962B.

[0153] Regarding the general vector-friendly instruction format 900, the complete opcode field 974 is shown, including the format field 940, the basic operation field 942, and the data element width field 964. While one embodiment is shown where the complete opcode field 974 includes all of these fields, in embodiments that do not support all of these fields, the complete opcode field 974 includes fewer than all of them. The complete opcode field 974 provides the operation code (opcode).

[0154] The augmentation operation field 950, the data element width field 964, and the write mask field 970 allow these features to be specified on an instruction-by-instruction basis in a general vector-friendly instruction format.

[0155] The combination of the write mask field and the data element width field creates typed instructions, in that it allows the mask to be applied based on different data element widths.

[0156] The various instruction templates found within category A and category B are beneficial in different situations. In some embodiments of the invention, different processors or different cores within a processor may support only category A, only category B, or both categories. For example, a high-performance general-purpose out-of-order core intended for general-purpose computing may support only category B, a core intended primarily for graphics and / or scientific (throughput) computing may support only category A, and a core intended for both may support both categories (of course, a core that has some mix of templates and instructions from both categories, but not all templates and instructions from both categories, is within the scope of the invention). Likewise, a single processor may include multiple cores, all of which support the same category or in which different cores support different categories. For example, in a processor with separate graphics and general-purpose cores, one of the graphics cores intended primarily for graphics and / or scientific computing may support only category A, while one or more of the general-purpose cores may be high-performance general-purpose cores with out-of-order execution and register renaming, intended for general-purpose computing, that support only category B. Another processor that does not have a separate graphics core may include one or more general-purpose in-order or out-of-order cores that support both category A and category B. Of course, features from one category may also be implemented in the other category in different embodiments of the invention. Programs written in a high-level language would be translated (e.g., just-in-time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the category or categories supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all categories, together with control flow code that selects the routines to execute based on the instructions supported by the processor that is currently executing the code.

[0157] Exemplary Vector-Friendly Instruction Format

[0158] Figure 10A This is a block diagram illustrating an exemplary specific vector-friendly instruction format according to some embodiments of the present invention. Figure 10A This illustrates a specific vector-friendly instruction format 1000, which is specific in its specified positions, sizes, interpretations, and the order of fields, as well as the meaning of the values ​​of some of those fields. This specific vector-friendly instruction format 1000 can be used to extend the x86 instruction set, and therefore some fields are similar to or identical to those used in the existing x86 instruction set and its extensions (e.g., AVX). This format is consistent with the prefix-encoded fields, true opcode byte fields, MOD R / M fields, SIB fields, offset fields, and immediate numeric fields of the existing x86 instruction set with extensions. Figure 10A The fields mapped to in Figure 9 are shown.

[0159] It should be understood that, although embodiments of the invention are described with reference to the specific vector-friendly instruction format 1000 in the context of the general vector-friendly instruction format 900 for illustrative purposes, the invention is not limited to the specific vector-friendly instruction format 1000 except where stated otherwise. For example, the general vector-friendly instruction format 900 contemplates a variety of possible sizes for the various fields, while the specific vector-friendly instruction format 1000 is shown as having fields of specific sizes. By way of a particular example, although the data element width field 964 is shown as a one-bit field in the specific vector-friendly instruction format 1000, the invention is not so limited (that is, the general vector-friendly instruction format 900 contemplates other sizes for the data element width field 964).

[0160] The general vector-friendly instruction format 900 includes the following fields, listed below in the order shown in Figure 10A.

[0161] The EVEX prefix (bytes 0-3) 1002 - is encoded in four-byte form.

[0162] Format field 940 (EVEX byte 0, bits [7:0]) - The first byte (EVEX byte 0) is format field 940, and it contains 0x62 (a unique value used to distinguish the vector-friendly instruction format in some embodiments).

[0163] The second to fourth bytes (EVEX bytes 1-3) include multiple bit fields that provide specific capabilities.

[0164] The REX field 1005 (EVEX byte 1, bits [7-5]) consists of the EVEX.R bit field (EVEX byte 1, bit [7]-R), the EVEX.X bit field (EVEX byte 1, bit [6]-X), and the EVEX.B bit field (EVEX byte 1, bit [5]-B). The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields and are encoded in 1s complement form, i.e., ZMM0 is encoded as 1111B and ZMM15 is encoded as 0000B. Other fields of the instruction encode the lower three bits of the register indexes (rrr, xxx, and bbb), as is known in the art, so that Rrrr, Xxxx, and Bbbb can be formed by adding EVEX.R, EVEX.X, and EVEX.B.
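
The inverted (1s complement) encoding described above can be illustrated with a minimal Python sketch. The helper names `invert_bits` and `extend_index` are hypothetical and chosen only for this illustration; the sketch assumes a 4-bit register specifier, matching the ZMM0/ZMM15 example in the text.

```python
def invert_bits(value: int, width: int = 4) -> int:
    """Return the 1s complement of value within a field of the given width."""
    mask = (1 << width) - 1
    if not 0 <= value <= mask:
        raise ValueError("value does not fit in the field")
    return ~value & mask


def extend_index(inverted_high_bit: int, low3: int) -> int:
    """Form a 4-bit index such as Rrrr from an inverted EVEX bit and rrr.

    The EVEX bit is stored inverted, so it is flipped back before it is
    placed above the three low-order bits taken from another field.
    """
    return ((inverted_high_bit ^ 1) << 3) | (low3 & 0b111)


# Register 0 is stored as 1111b and register 15 as 0000b.
zmm0_encoded = invert_bits(0)
zmm15_encoded = invert_bits(15)
```

Applying `invert_bits` twice returns the original value, which is why the same helper serves for both encoding and decoding in this sketch.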

[0165] REX' field 1010 - This is the first part of the REX' field 1010 and is the EVEX.R' bit field (EVEX byte 1, bit [4]-R') that is used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. In some embodiments, this bit, along with others indicated below, is stored in bit-inverted format to distinguish it (in the well-known x86 32-bit mode) from the BOUND instruction, whose true opcode byte is 62 but which does not accept the value 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this bit and the other bits indicated below in inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R, and the other RRR from other fields.

[0166] Opcode mapping field 1015 (EVEX byte 1, bits [3:0]-mmmm) - its content encodes an implied leading opcode byte (0F, 0F 38, or 0F 3).

[0167] The data element width field 964 (EVEX byte 2, bit [7] — W) is represented by the symbol EVEX.W. EVEX.W is used to define the granularity (size) of the data type (32-bit data element or 64-bit data element).

[0168] EVEX.vvvv 1020 (EVEX byte 2, bits [6:3]-vvvv) - The role of EVEX.vvvv can include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1s complement) form, and is valid for instructions with two or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1s complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. Thus, the EVEX.vvvv field 1020 encodes the four low-order bits of the first source register specifier, stored in inverted (1s complement) form. Depending on the instruction, an additional, different EVEX bit field is used to extend the specifier size to 32 registers.

[0169] EVEX.U category field 968 (EVEX byte 2, bit [2]-U) - If EVEX.U=0, it indicates category A or EVEX.U0; if EVEX.U=1, it indicates category B or EVEX.U1.

[0170] The prefix encoding field 1025 (EVEX byte 2, bits [1:0]-pp) provides additional bits for the base operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the benefit of compacting the SIMD prefix (rather than requiring a byte to express the SIMD prefix, the EVEX prefix requires only 2 bits). In one embodiment, to support legacy SSE instructions that use a SIMD prefix (66H, F2H, F3H) in both the legacy format and the EVEX prefix format, these legacy SIMD prefixes are encoded into the SIMD prefix encoding field and, at runtime, are expanded into the legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can run both the legacy and EVEX formats of these legacy instructions without modification). Although newer instructions could use the content of the EVEX prefix encoding field directly as an opcode extension, some embodiments expand it in a similar fashion for consistency but allow different meanings to be specified by these legacy SIMD prefixes. An alternative embodiment may redesign the PLA to support the 2-bit SIMD prefix encodings and thus not require the expansion.

[0171] The α field 952 (EVEX byte 3, bit [7] — EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.writemask control, and EVEX.N; also indicated by α) — as previously described, this field is context-specific.

[0172] β field 954 (EVEX byte 3, bits [6:4]-SSS; also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, and EVEX.LLB; also illustrated with βββ) - as previously described, this field is context-specific.

[0173] REX' field 1010B - This is the remainder of the REX' field 1010 and is the EVEX.V' bit field (EVEX byte 3, bit [3]-V') that may be used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. This bit is stored in bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.

[0174] Write mask field 970 (EVEX byte 3, bits [2:0]-kkk) - its content specifies the index of a register in the write mask registers, as previously described. In some embodiments, the specific value EVEX.kkk=000 has a special behavior implying that no write mask is used for the particular instruction (this may be implemented in a variety of ways, including the use of a write mask hardwired to all ones or hardware that bypasses the masking hardware).
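
To make the byte layout of paragraphs [0162] through [0174] concrete, the following is a minimal Python sketch that splits a four-byte EVEX prefix into the fields named above. The function name `decode_evex_prefix` and the dictionary keys are hypothetical names used only for illustration. The bit positions follow the text; the sketch assumes the embodiment in which R, X, B, R', vvvv, and V' are stored inverted, and it does not validate reserved bits.

```python
def decode_evex_prefix(prefix: bytes) -> dict:
    """Split a 4-byte EVEX prefix into named bit fields (illustrative only)."""
    if len(prefix) != 4:
        raise ValueError("the EVEX prefix is four bytes long")
    b0, b1, b2, b3 = prefix
    if b0 != 0x62:
        raise ValueError("format field 940 must contain 0x62")

    return {
        # Byte 1: REX field, first part of REX', opcode mapping field.
        "R": ((b1 >> 7) & 1) ^ 1,        # stored inverted
        "X": ((b1 >> 6) & 1) ^ 1,        # stored inverted
        "B": ((b1 >> 5) & 1) ^ 1,        # stored inverted
        "R_prime": ((b1 >> 4) & 1) ^ 1,  # stored inverted
        "mmmm": b1 & 0x0F,
        # Byte 2: data element width, vvvv, category, prefix encoding.
        "W": (b2 >> 7) & 1,
        "vvvv": ((b2 >> 3) & 0x0F) ^ 0x0F,  # stored in 1s complement form
        "U": (b2 >> 2) & 1,
        "pp": b2 & 0x03,
        # Byte 3: alpha, beta, remainder of REX', write mask index.
        "alpha": (b3 >> 7) & 1,
        "beta": (b3 >> 4) & 0x07,
        "V_prime": ((b3 >> 3) & 1) ^ 1,  # stored inverted
        "kkk": b3 & 0x07,
    }


# Example prefix bytes chosen only to exercise the bit extraction.
fields = decode_evex_prefix(bytes([0x62, 0xF1, 0x7C, 0x48]))
```

For the example bytes, every inverted field decodes to 0, `mmmm` is 1, `U` is 1, and `beta` is 4.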

[0175] The true opcode field 1030 (byte 4) is also known as the opcode byte. The opcode portion is specified in this field.

[0176] The MOD R / M field 1040 (byte 5) includes the MOD field 1042, the Reg field 1044, and the R / M field 1046. As previously described, the content of the MOD field 1042 distinguishes between memory access and non-memory access operations. The role of the Reg field 1044 can be summarized into two scenarios: encoding a destination register operand or a source register operand, or being treated as an opcode extension and not used to encode any instruction operand. The role of the R / M field 1046 can include either encoding an instruction operand referencing a memory address, or encoding a destination register operand or a source register operand.

[0177] Scaling, Index, Base (SIB) Byte (Byte 6) — As previously described, the contents of scaling field 950 are used for memory address generation. SIB.xxx 1054 and SIB.bbb 1056 — The contents of these fields have previously been mentioned regarding register indices Xxxx and Bbbb.

[0178] Displacement field 962A (bytes 7-10) — When MOD field 1042 contains 10, bytes 7-10 are displacement field 962A, and it works the same as the legacy 32-bit displacement (disp32) and operates at the byte granularity.

[0179] Displacement factor field 962B (byte 7) - When MOD field 1042 contains 01, byte 7 is the displacement factor field 962B. This field is located in the same position as the legacy x86 instruction set 8-bit displacement (disp8), which operates at byte granularity. Because disp8 is sign-extended, it can only address offsets between -128 and 127 bytes; in terms of 64-byte cache lines, disp8 uses 8 bits that can be set to only four really useful values: -128, -64, 0, and 64. Because a greater range is often needed, disp32 is used; however, disp32 requires 4 bytes. In contrast to disp8 and disp32, the displacement factor field 962B is a reinterpretation of disp8; when using the displacement factor field 962B, the actual displacement is determined by the content of the displacement factor field multiplied by the size of the memory operand access (N). This type of displacement is referred to as disp8*N. This reduces the average instruction length (a single byte is used for the displacement, but with a much greater range). Such a compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 962B substitutes for the legacy x86 instruction set 8-bit displacement. Thus, the displacement factor field 962B is encoded in the same way as an x86 instruction set 8-bit displacement (so there is no change in the ModRM/SIB encoding rules), with the only exception that disp8 is overloaded to disp8*N. In other words, there is no change in the encoding rules or encoding lengths, only in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size of the memory operand to obtain a byte-wise address offset).
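
The disp8*N scaling can be sketched in a few lines of Python. The function name `compressed_displacement` is hypothetical; `n` stands for the memory operand access size N in bytes, which the sketch takes as an input rather than deriving it from the instruction.

```python
def compressed_displacement(disp8_byte: int, n: int) -> int:
    """Return the byte offset encoded by a disp8*N displacement factor.

    disp8_byte is the raw unsigned byte from the instruction stream;
    n is the size in bytes of the memory operand access.
    """
    if not 0 <= disp8_byte <= 0xFF:
        raise ValueError("displacement factor must fit in one byte")
    if n <= 0:
        raise ValueError("operand size must be positive")
    # Sign-extend the byte, then scale by the operand size.
    signed = disp8_byte - 0x100 if disp8_byte & 0x80 else disp8_byte
    return signed * n


# With a 64-byte operand, one byte reaches -8192 .. 8128 in 64-byte steps,
# compared with -128 .. 127 for an unscaled disp8.
lowest = compressed_displacement(0x80, 64)
highest = compressed_displacement(0x7F, 64)
```

With `n = 1` the function reduces to the ordinary sign-extended disp8, which reflects the statement that only the hardware interpretation changes and not the encoding.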

[0180] Immediate field 972 operates as previously described.

[0181] Full opcode field

[0182] Figure 10B is a block diagram illustrating the fields of the specific vector-friendly instruction format 1000 that make up the full opcode field 974 according to some embodiments. Specifically, the full opcode field 974 includes the format field 940, the base operation field 942, and the data element width (W) field 964. The base operation field 942 includes the prefix encoding field 1025, the opcode mapping field 1015, and the true opcode field 1030.

[0183] Register index field

[0184] Figure 10C is a block diagram illustrating the fields of the specific vector-friendly instruction format 1000 that make up the register index field 944 according to some embodiments. Specifically, the register index field 944 includes the REX field 1005, the REX' field 1010, the MODR/M.reg field 1044, the MODR/M.r/m field 1046, the VVVV field 1020, the xxx field 1054, and the bbb field 1056.

[0185] Augmentation operation field

[0186] Figure 10D is a block diagram illustrating the fields of the specific vector-friendly instruction format 1000 that make up the augmentation operation field 950 according to some embodiments. When the category (U) field 968 contains 0, it signifies EVEX.U0 (category A 968A); when it contains 1, it signifies EVEX.U1 (category B 968B). When U=0 and the MOD field 1042 contains 11 (signifying a no-memory-access operation), the α field 952 (EVEX byte 3, bit [7]-EH) is interpreted as the rs field 952A. When the rs field 952A contains 1 (round 952A.1), the β field 954 (EVEX byte 3, bits [6:4]-SSS) is interpreted as the rounding control field 954A. The rounding control field 954A includes a one-bit SAE field 956 and a two-bit rounding operation field 958. When the rs field 952A contains 0 (data transformation 952A.2), the β field 954 (EVEX byte 3, bits [6:4]-SSS) is interpreted as a three-bit data transformation field 954B. When U=0 and the MOD field 1042 contains 00, 01, or 10 (signifying a memory access operation), the α field 952 (EVEX byte 3, bit [7]-EH) is interpreted as the eviction hint (EH) field 952B and the β field 954 (EVEX byte 3, bits [6:4]-SSS) is interpreted as a three-bit data manipulation field 954C.

[0187] When U=1, the α field 952 (EVEX byte 3, bit [7]-EH) is interpreted as the write mask control (Z) field 952C. When U=1 and the MOD field 1042 contains 11 (signifying a no-memory-access operation), part of the β field 954 (EVEX byte 3, bit [4]-S0) is interpreted as the RL field 957A; when it contains 1 (round 957A.1), the rest of the β field 954 (EVEX byte 3, bits [6-5]-S2-1) is interpreted as the rounding operation field 959A, while when the RL field 957A contains 0 (VSIZE 957.A2), the rest of the β field 954 (EVEX byte 3, bits [6-5]-S2-1) is interpreted as the vector length field 959B (EVEX byte 3, bits [6-5]-L1-0). When U=1 and the MOD field 1042 contains 00, 01, or 10 (signifying a memory access operation), the β field 954 (EVEX byte 3, bits [6:4]-SSS) is interpreted as the vector length field 959B (EVEX byte 3, bits [6-5]-L1-0) and the broadcast field 957B (EVEX byte 3, bit [4]-B).
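
The context-dependent interpretation in paragraphs [0186] and [0187] amounts to a small decision tree, sketched below in Python. The function name `interpret_augmentation` and the dictionary keys are hypothetical; the keys simply reuse the field names and reference numerals from the text. The sketch does not split the rounding control field 954A into its SAE and rounding operation parts, because the bit positions of those sub-fields are not given in this section.

```python
def interpret_augmentation(u: int, mod: int, alpha: int, beta: int) -> dict:
    """Name the meaning of the alpha and beta fields for a given U and MOD."""
    memory_access = mod != 0b11  # MOD 00, 01, or 10 signifies memory access

    if u == 0:  # category A
        if memory_access:
            return {"eviction_hint_952B": alpha, "data_manipulation_954C": beta}
        if alpha == 1:  # rs field 952A selects rounding
            return {"rs_952A": "round", "round_control_954A": beta}
        return {"rs_952A": "data_transform", "data_transform_954B": beta}

    # category B: alpha is always the write mask control (Z) field
    fields = {"write_mask_control_952C": alpha}
    low_bit = beta & 1    # EVEX byte 3, bit [4]
    high_bits = beta >> 1  # EVEX byte 3, bits [6-5]
    if memory_access:
        fields["broadcast_957B"] = low_bit
        fields["vector_length_959B"] = high_bits
    elif low_bit == 1:  # RL field 957A selects rounding
        fields["rl_957A"] = "round"
        fields["round_operation_959A"] = high_bits
    else:  # RL field 957A selects vector size
        fields["rl_957A"] = "vsize"
        fields["vector_length_959B"] = high_bits
    return fields
```

For example, `interpret_augmentation(1, 0b11, 0, 0b101)` reports a rounding operation of 2, while the same β value with MOD 01 would instead be read as a broadcast bit plus a vector length.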

[0188] Exemplary Register Architecture

[0189] Figure 11 is a block diagram of a register architecture 1100 according to some embodiments. In the illustrated embodiment, there are 32 vector registers 1110, each 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower 128 bits of the lower 16 zmm registers (the lower 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector-friendly instruction format 1000 operates on this overlaid register file as shown in the following table.

[0190]

[0191] In other words, the vector length field 959B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length; instruction templates without the vector length field 959B operate on the maximum vector length. Further, in one embodiment, the category B instruction templates of the specific vector-friendly instruction format 1000 operate on packed or scalar single/double-precision floating-point data and packed or scalar integer data. Scalar operations are performed on the lowest-order data element position in a zmm/ymm/xmm register; the higher-order data element positions are either left the same as they were before the instruction or zeroed, depending on the embodiment.
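
The register overlay and the halving rule for vector lengths can be illustrated with a minimal Python sketch that models a 512-bit register as an integer. The names `selectable_lengths` and `overlay_view` are hypothetical. The sketch only lists the lengths; it does not assign them to particular encodings of the vector length field 959B, since that mapping is not given in this section.

```python
MAX_VECTOR_BITS = 512  # width of a zmm register in the illustrated embodiment


def selectable_lengths(max_bits: int = MAX_VECTOR_BITS, count: int = 3) -> list:
    """List a maximum length followed by lengths that each halve the previous."""
    return [max_bits >> i for i in range(count)]


def overlay_view(zmm_value: int, width_bits: int) -> int:
    """Return the low-order width_bits of a 512-bit register value.

    A 256-bit view models a ymm register and a 128-bit view models an xmm
    register overlaid on the same zmm register.
    """
    if width_bits not in selectable_lengths():
        raise ValueError("unsupported view width")
    return zmm_value & ((1 << width_bits) - 1)


zmm = (1 << 300) | (1 << 200) | 5
ymm = overlay_view(zmm, 256)  # keeps bit 200 and the low bits, drops bit 300
xmm = overlay_view(zmm, 128)  # keeps only the low bits
```

Because the narrower registers are views rather than separate storage, a write through the xmm or ymm name changes the low-order bits of the corresponding zmm register.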

[0192] Write mask register 1115 – In the illustrated embodiment, there are eight write mask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, write mask register 1115 is 16 bits in size. As previously described, in some embodiments, vector mask register k0 cannot be used as a write mask; when the encoding that normally indicates k0 is used for the write mask, it selects a hardwired write mask of 0xffff, effectively disabling write masking for that instruction.
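
The special handling of k0 can be sketched as follows in Python. The names `select_write_mask` and `masked_write` are hypothetical. The sketch assumes 16 data elements and a merging behavior in which unselected destination elements keep their previous values; other behaviors are possible and are not modeled here.

```python
HARDWIRED_MASK = 0xFFFF  # selected when the encoding that normally means k0 is used


def select_write_mask(kkk: int, k_regs: list) -> int:
    """Return the mask selected by a 3-bit kkk index over registers k0..k7."""
    if not 0 <= kkk <= 7:
        raise ValueError("kkk is a 3-bit field")
    return HARDWIRED_MASK if kkk == 0 else k_regs[kkk]


def masked_write(dest: list, result: list, mask: int) -> list:
    """Write result elements whose mask bit is 1; keep dest elsewhere (merging)."""
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]


k_regs = [0] * 8
k_regs[1] = 0b0101
partial = masked_write([0, 0, 0, 0], [1, 2, 3, 4], select_write_mask(1, k_regs))
unmasked = masked_write([9] * 16, list(range(16)), select_write_mask(0, k_regs))
```

Here `partial` updates only elements 0 and 2, while `unmasked` updates all 16 elements because index 0 selects the hardwired mask.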

[0193] General Purpose Register 1125 - In the illustrated embodiment, there are sixteen 64-bit general purpose registers used together with the existing x86 addressing modes for addressing memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

[0194] Scalar floating-point stack register file (x87 stack) 1145, on which the MMX packed integer flat register file 1150 is aliased - In the illustrated embodiment, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data and to hold operands for some operations performed between the MMX and XMM registers.

[0195] Alternative embodiments may use wider or narrower registers. Additionally, alternative embodiments may use more, fewer, or different register files and registers.

[0196] Exemplary core architectures, processors, and computer architectures

[0197] Processor cores can be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; and 3) a dedicated core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU comprising one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor comprising one or more dedicated cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as dedicated logic, such as integrated graphics and/or scientific (throughput) logic, or as dedicated cores); and 4) a system on a chip that may include, on the same die, the described CPU (sometimes referred to as the application core or application processor), the coprocessor described above, and additional functionality. An exemplary core architecture is described next, followed by descriptions of exemplary processors and computer architectures.

[0198] Exemplary core architecture

[0199] In-order and out-of-order core block diagram

[0200] Figure 12A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to some embodiments of the present invention. Figure 12B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to some embodiments of the present invention. The solid-lined boxes in Figures 12A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed-lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

[0201] In Figure 12A, a processor pipeline 1200 includes a fetch stage 1202, a length decode stage 1204, a decode stage 1206, an allocation stage 1208, a renaming stage 1210, a scheduling (also known as dispatch or issue) stage 1212, a register read/memory read stage 1214, an execute stage 1216, a write back/memory write stage 1218, an exception handling stage 1222, and a commit stage 1224.

[0202] Figure 12B shows a processor core 1290 that includes a front-end unit 1230 coupled to an execution engine unit 1250, both of which are coupled to a memory unit 1270. The core 1290 may be a Reduced Instruction Set Computing (RISC) core, a Complex Instruction Set Computing (CISC) core, a Very Long Instruction Word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1290 may be a dedicated core, such as, for example, a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, a graphics core, etc.

[0203] The front-end unit 1230 includes a branch prediction unit 1232 coupled to an instruction cache unit 1234, which is coupled to an instruction translation lookaside buffer (TLB) 1236, which is coupled to an instruction fetch unit 1238, which is coupled to a decode unit 1240. The decode unit 1240 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or otherwise reflect, or are derived from, the original instructions. The decode unit 1240 may be implemented using various mechanisms. Examples of suitable mechanisms include, but are not limited to, lookup tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. In one embodiment, the core 1290 includes a microcode ROM or another medium that stores microcode for certain macroinstructions (e.g., in the decode unit 1240 or otherwise within the front-end unit 1230). The decode unit 1240 is coupled to a rename/allocator unit 1252 in the execution engine unit 1250.

[0204] The execution engine unit 1250 includes the rename/allocator unit 1252 coupled to a retirement unit 1254 and a set of one or more scheduler units 1256. The scheduler units 1256 represent any number of different schedulers, including reservation stations, a central instruction window, etc. The scheduler units 1256 are coupled to the physical register file units 1258. Each of the physical register file units 1258 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file unit 1258 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file unit 1258 is overlapped by the retirement unit 1254 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using register maps and a pool of registers; etc.). The retirement unit 1254 and the physical register file unit 1258 are coupled to the execution cluster 1260. The execution cluster 1260 includes a set of one or more execution units 1262 and a set of one or more memory access units 1264. The execution units 1262 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 1256, physical register file unit 1258, and execution cluster 1260 are shown as possibly plural because some embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline, each of which has its own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, some embodiments are implemented in which only the execution cluster of that pipeline has the memory access unit 1264). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

[0205] The set of memory access units 1264 is coupled to the memory unit 1270, which includes a data TLB unit 1272 coupled to a data cache unit 1274, which is coupled to a Level 2 (L2) cache unit 1276. In one exemplary embodiment, the memory access units 1264 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1272 in the memory unit 1270. The instruction cache unit 1234 is further coupled to the Level 2 (L2) cache unit 1276 in the memory unit 1270. The L2 cache unit 1276 is coupled to one or more other levels of cache and eventually to main memory.

[0206] By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1200 as follows: 1) the instruction fetch unit 1238 performs the fetch stage 1202 and the length decode stage 1204; 2) the decode unit 1240 performs the decode stage 1206; 3) the rename/allocator unit 1252 performs the allocation stage 1208 and the renaming stage 1210; 4) the scheduler unit 1256 performs the scheduling stage 1212; 5) the physical register file unit 1258 and the memory unit 1270 perform the register read/memory read stage 1214, and the execution cluster 1260 performs the execute stage 1216; 6) the memory unit 1270 and the physical register file unit 1258 perform the write back/memory write stage 1218; 7) various units may be involved in the exception handling stage 1222; and 8) the retirement unit 1254 and the physical register file unit 1258 perform the commit stage 1224.
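
The stage-to-unit assignment in paragraph [0206] can be captured as a small lookup table, sketched here in Python. The names `PIPELINE_1200` and `stages_for` are hypothetical; the stage and unit labels reuse the reference numerals from the text, and the empty list for exception handling reflects that the text names no specific unit for that stage.

```python
# Ordered list of (stage, units that perform it), following paragraph [0206].
PIPELINE_1200 = [
    ("fetch 1202", ["instruction fetch unit 1238"]),
    ("length decode 1204", ["instruction fetch unit 1238"]),
    ("decode 1206", ["decode unit 1240"]),
    ("allocation 1208", ["rename/allocator unit 1252"]),
    ("renaming 1210", ["rename/allocator unit 1252"]),
    ("scheduling 1212", ["scheduler unit 1256"]),
    ("register read/memory read 1214",
     ["physical register file unit 1258", "memory unit 1270"]),
    ("execute 1216", ["execution cluster 1260"]),
    ("write back/memory write 1218",
     ["memory unit 1270", "physical register file unit 1258"]),
    ("exception handling 1222", []),  # "various units" in the text
    ("commit 1224",
     ["retirement unit 1254", "physical register file unit 1258"]),
]


def stages_for(unit: str) -> list:
    """Return, in pipeline order, the stages a given unit takes part in."""
    return [stage for stage, units in PIPELINE_1200 if unit in units]


memory_stages = stages_for("memory unit 1270")
```

Querying the table by unit shows, for instance, that the physical register file unit 1258 participates in the read, write back, and commit stages.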

[0207] The core 1290 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, CA; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, CA), including the instructions described herein. In one embodiment, the core 1290 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

[0208] It should be understood that the core may support multithreading (running two or more parallel sets of operations or threads) and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel® Hyper-Threading Technology).

[0209] Although register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1234/1274 and a shared L2 cache unit 1276, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

[0210] Specific Exemplary In-Order Core Architecture

[0211] Figures 13A-B illustrate a block diagram of a more specific exemplary in-order core architecture, in which the core is one of several logic blocks (including other cores of the same type and/or different types) in a chip. Depending on the application, the logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic.

[0212] Figure 13A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1302 and its local subset of the Level 2 (L2) cache 1304, according to some embodiments of the invention. In one embodiment, the instruction decoder 1300 supports the x86 instruction set with a packed data instruction set extension. The L1 cache 1306 allows low-latency accesses to cache memory by the scalar and vector units. Although in one embodiment (to simplify the design) the scalar unit 1308 and the vector unit 1310 use separate register sets (scalar registers 1312 and vector registers 1314, respectively), and data transferred between them is written to memory and then read back in from the Level 1 (L1) cache 1306, alternative embodiments of the invention may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

[0213] The local subset of the L2 cache 1304 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1304. Data read by a processor core is stored in its L2 cache subset 1304 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1304 and is flushed from other subsets if necessary. The ring network ensures coherency for shared data. The ring network is bidirectional to allow agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.

[0214] Figure 13B is an expanded view of part of the processor core in Figure 13A according to some embodiments of the present invention. Figure 13B includes the L1 data cache 1306A part of the L1 cache 1304, as well as more detail regarding the vector unit 1310 and the vector registers 1314. Specifically, the vector unit 1310 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1328) that runs one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports swizzling the register inputs with swizzle unit 1320, numeric conversion with numeric conversion units 1322A-B, and replication with replication unit 1324 on the memory input. Write mask registers 1326 allow predicating the resulting vector writes.

[0215] Processor with integrated memory controller and graphics

[0216] Figure 14 is a block diagram of a processor 1400 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to some embodiments of the present invention. The solid-lined boxes in Figure 14 illustrate a processor 1400 with a single core 1402A, a system agent 1410, and a set of one or more bus controller units 1416, while the optional addition of the dashed-lined boxes illustrates an alternative processor 1400 with multiple cores 1402A-N, a set of one or more integrated memory controller units 1414 in the system agent unit 1410, and dedicated logic 1408.

[0217] Thus, different implementations of the processor 1400 may include: 1) a CPU with the dedicated logic 1408 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores) and the cores 1402A-N being one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1402A-N being a large number of dedicated cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 1402A-N being a large number of general-purpose in-order cores. Thus, the processor 1400 may be a general-purpose processor, a coprocessor, or a dedicated processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, and so on. The processor may be implemented on one or more chips. The processor 1400 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

[0218] The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1406, and external memory (not shown) coupled to the set of integrated memory controller units 1414. The set of shared cache units 1406 may include one or more mid-level caches, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, a last-level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1412 interconnects the integrated graphics logic 1408 (which is an example of, and is also referred to herein as, dedicated logic), the set of shared cache units 1406, and the system agent unit 1410/integrated memory controller units 1414, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1406 and the cores 1402A-N.

[0219] In some embodiments, one or more of cores 1402A-N are capable of multi-threading. System agent unit 1410 includes those components that coordinate and operate cores 1402A-N. System agent unit 1410 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the logic and components required to regulate the power state of cores 1402A-N and integrated graphics logic 1408. The display unit drives one or more externally connected displays.

[0220] Cores 1402A-N may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of cores 1402A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

[0221] Exemplary computer architecture

[0222] Figures 15-18 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptop computers, desktop computers, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

[0223] Referring now to Figure 15, shown is a block diagram of a system 1500 according to an embodiment of the present invention. System 1500 may include one or more processors 1510, 1515 coupled to a controller hub 1520. In one embodiment, the controller hub 1520 includes a graphics memory controller hub (GMCH) 1590 and an input/output hub (IOH) 1550 (which may be on separate chips); the GMCH 1590 includes memory and graphics controllers to which memory 1540 and a coprocessor 1545 are coupled; the IOH 1550 couples input/output (I/O) devices 1560 to the GMCH 1590. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1540 and the coprocessor 1545 are coupled directly to the processor 1510, and the controller hub 1520 is in a single chip with the IOH 1550.

[0224] The optional nature of additional processors 1515 is denoted in Figure 15 with broken lines. Each processor 1510, 1515 may include one or more of the processing cores described herein and may be some version of processor 1400.

[0225] Memory 1540 may be, for example, dynamic random access memory (DRAM), phase-change memory (PCM), or a combination of the two. In at least one embodiment, controller hub 1520 communicates with processors 1510, 1515 via a multi-drop bus such as a front-side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1595.

[0226] In one embodiment, the coprocessor 1545 is a dedicated processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, and so on. In one embodiment, the controller hub 1520 may include an integrated graphics accelerator.

[0227] There can be a variety of differences between physical resources 1510 and 1515 across a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.

[0228] In one embodiment, processor 1510 executes instructions that control general-type data processing operations. Embedded within these instructions may be coprocessor instructions. Processor 1510 recognizes these coprocessor instructions as the type to be executed by an attached coprocessor 1545. Therefore, processor 1510 issues these coprocessor instructions (or control signals representing coprocessor instructions) to coprocessor 1545 on a coprocessor bus or other interconnect. Coprocessor 1545 accepts and executes the received coprocessor instructions.
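
The dispatch flow described above can be sketched in a few lines of Python. This is only an illustrative model: the tuple encoding of instructions, the `dispatch` function, and the list standing in for the coprocessor bus are hypothetical and do not come from the patent.

```python
# Illustrative model only: the tuple encoding, `dispatch`, and the list standing
# in for the coprocessor bus are hypothetical, not taken from the patent.

def dispatch(program, coprocessor_queue):
    """Execute general-type instructions on the host and forward
    coprocessor-type instructions to the attached coprocessor."""
    executed_on_host = []
    for kind, operation in program:
        if kind == "coprocessor":
            # The host recognizes the instruction type and issues it over
            # the interconnect instead of executing it itself.
            coprocessor_queue.append(operation)
        else:
            executed_on_host.append(operation)
    return executed_on_host


program = [
    ("general", "load r1"),
    ("coprocessor", "mpvmac v0, v1, v2"),
    ("general", "store r1"),
]
queue = []
host_ops = dispatch(program, queue)
```

In this sketch the coprocessor simply receives the forwarded operations in program order; how a real coprocessor accepts and executes them is outside what the model shows.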

[0229] Referring now to Figure 16, shown is a block diagram of a first more specific exemplary system 1600 according to an embodiment of the present invention. As shown in Figure 16, multiprocessor system 1600 is a point-to-point interconnect system and includes a first processor 1670 and a second processor 1680 coupled via a point-to-point interconnect 1650. Each of processors 1670 and 1680 may be some version of processor 1400. In some embodiments, processors 1670 and 1680 are respectively processors 1510 and 1515, and coprocessor 1638 is coprocessor 1545. In another embodiment, processors 1670 and 1680 are respectively processor 1510 and coprocessor 1545.

[0230] Processors 1670 and 1680 are shown including integrated memory controller (IMC) units 1672 and 1682, respectively. Processor 1670 also includes, as part of its bus controller units, point-to-point (PP) interfaces 1676 and 1678; similarly, second processor 1680 includes PP interfaces 1686 and 1688. Processors 1670 and 1680 may exchange information via point-to-point (PP) interface 1650 using PP interface circuits 1678 and 1688. As shown in Figure 16, IMCs 1672 and 1682 couple the processors to respective memories, namely memory 1632 and memory 1634, which may be portions of main memory locally attached to the respective processors.

[0231] Using point-to-point interface circuits 1676, 1694, 1686, and 1698, processors 1670 and 1680 can each exchange information with chipset 1690 via their respective PP interfaces 1652 and 1654. Chipset 1690 can optionally exchange information with coprocessor 1638 via high-performance interface 1639. In one embodiment, coprocessor 1638 is a dedicated processor, such as, for example, a high-throughput MIC processor, network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, etc.

[0232] A shared cache (not shown) may be included in either processor or outside of both processors, yet connected to the processors via a PP interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low-power mode.

[0233] Chipset 1690 may be coupled to a first bus 1616 via an interface 1696. In one embodiment, first bus 1616 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the invention is not so limited.

[0234] As shown in Figure 16, various I/O devices 1614 may be coupled to first bus 1616, along with a bus bridge 1618 that couples first bus 1616 to a second bus 1620. In one embodiment, one or more additional processors 1615, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, for example, graphics accelerators or digital signal processing (DSP) units), field-programmable gate arrays, or any other processor, are coupled to first bus 1616. In one embodiment, second bus 1620 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 1620, including, for example, a keyboard and/or mouse 1622, communication devices 1627, and a storage unit 1628, such as a hard disk drive or other mass storage device, which may include instructions/code and data 1630, in one embodiment. Further, an audio I/O 1624 may be coupled to second bus 1620. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 16, a system may implement a multi-drop bus or another such architecture.

[0235] Referring now to Figure 17, shown is a block diagram of a second more specific exemplary system 1700 according to an embodiment of the present invention. Like elements in Figures 16 and 17 bear like reference numerals, and certain aspects of Figure 16 have been omitted from Figure 17 to avoid obscuring other aspects of Figure 17.

[0236] Figure 17 illustrates that processors 1670 and 1680 may include integrated memory and I/O control logic ("CL") 1772 and 1782, respectively. Thus, CL 1772 and 1782 include integrated memory controller units and I/O control logic. Figure 17 illustrates that not only are memories 1632 and 1634 coupled to CL 1772 and 1782, but I/O devices 1714 are also coupled to control logic 1772 and 1782. Legacy I/O devices 1715 are coupled to chipset 1690.

[0237] Referring now to Figure 18, shown is a block diagram of a SoC 1800 according to an embodiment of the present invention. Elements similar to those in Figure 14 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In Figure 18, interconnect unit 1802 is coupled to: an application processor 1810, which includes a set of one or more cores 1402A-N (including cache units 1404A-N) and shared cache unit 1406; a system agent unit 1410; a bus controller unit 1416; an integrated memory controller unit 1414; a set of one or more coprocessors 1820, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1830; a direct memory access (DMA) unit 1832; and a display unit 1840 for coupling to one or more external displays. In one embodiment, coprocessor 1820 includes a dedicated processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, and so on.

[0238] Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementations. Embodiments of the invention may be implemented as program code or a computer program running on a programmable system, said programmable system including at least one processor, a storage system (including volatile and non-volatile memories and / or storage elements), at least one input device, and at least one output device.

[0239] Program code, such as code 1630 shown in Figure 16, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.

[0240] The program code can be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. If desired, the program code can also be implemented in assembly or machine language. In fact, the mechanisms described herein are not limited to any specific programming language. In any case, the language can be a compiled or interpreted language.

[0241] One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium, the instructions representing various logic within a processor and, when read by a machine, causing the machine to fabricate logic to perform the techniques described herein. Such representations (known as "IP cores") may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

[0242] Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disc read-only memories (CD-ROMs), rewritable compact discs (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase-change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

[0243] Therefore, embodiments of the present invention also include non-transitory, tangible machine-readable media containing instructions or design data, such as a hardware description language (HDL), that defines the architectures, circuits, devices, processors, and / or system features described herein. Such embodiments may also be referred to as program products.

[0244] Emulation (including binary translation, code morphing, etc.)

[0245] In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.

[0246] Figure 19 is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set, according to some embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 19 shows that a program in a high-level language 1902 may be compiled using an x86 compiler 1904 to generate x86 binary code 1906 that may be natively executed by a processor 1916 with at least one x86 instruction set core. The processor 1916 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core, or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1904 represents a compiler operable to generate x86 binary code 1906 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1916 with at least one x86 instruction set core. Similarly, Figure 19 shows that the program in the high-level language 1902 may be compiled using an alternative instruction set compiler 1908 to generate alternative instruction set binary code 1910 that may be natively executed by a processor 1914 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction converter 1912 is used to convert the x86 binary code 1906 into code that may be natively executed by the processor 1914 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1910, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Therefore, the instruction converter 1912 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1906.
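
As a rough illustration of converting each source instruction into one or more target instructions, the following Python sketch uses a static, table-driven rewrite. Every mnemonic, the `TRANSLATION_TABLE` rules, and the `translate` function are invented for this example; they are not real x86, MIPS, or ARM encodings, and a practical converter would also have to handle binary encodings, flags, and control flow.

```python
# Hypothetical toy instruction sets: every mnemonic below is invented for
# illustration and is not a real x86, MIPS, or ARM encoding.

TRANSLATION_TABLE = {
    # One source instruction may expand to one or more target instructions.
    "INC": lambda r: [f"ADDI {r}, {r}, 1"],
    "PUSH": lambda r: ["ADDI sp, sp, -4", f"STORE {r}, 0(sp)"],
    "MOV": lambda d, s: [f"OR {d}, {s}, zero"],
}


def translate(source_program):
    """Statically rewrite a list of source instructions into target instructions."""
    target = []
    for line in source_program:
        mnemonic, _, operand_text = line.partition(" ")
        operands = [o.strip() for o in operand_text.split(",") if o.strip()]
        if mnemonic not in TRANSLATION_TABLE:
            raise ValueError(f"no translation rule for {mnemonic!r}")
        target.extend(TRANSLATION_TABLE[mnemonic](*operands))
    return target


translated = translate(["MOV r1, r2", "INC r1", "PUSH r1"])
```

The translated sequence is longer than the source and differs from what a native compiler would emit, which mirrors the point that converted code accomplishes the general operation without matching natively compiled code.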

[0247] This application provides the following technical solutions:

[0248] Technical Solution 1. A processor, comprising:

[0249] A fetching circuit is used to fetch a compression instruction having a field for specifying the position of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements.

[0250] A decoding circuit, which is used to decode the fetched compression instruction;

[0251] An execution circuit, the execution circuit being configured to respond to the decoded compression instructions by:

[0252] Convert each element of the source vector into the NHP format;

[0253] Round each transformed element according to the rounding mode; and

[0254] Write each rounded element to its corresponding compressed vector element;

[0255] The NHP format described therein includes seven significand bits and eight exponent bits; and

[0256] The source vector and the compressed vector are each in memory or in a register.
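
The conversion, rounding, and write steps of Technical Solution 1 can be sketched as follows. This is a minimal model under stated assumptions: the text only fixes seven significand bits and eight exponent bits, so the sign bit and the choice to lay the 16-bit NHP value out as the upper half of a binary32 value are assumptions, as are the function names and the use of round-to-nearest-even as the rounding mode.

```python
import struct


def compress_to_nhp(x: float) -> int:
    """Convert one single-precision value to a 16-bit pattern.

    Assumed layout (not stated in the source): 1 sign bit, 8 exponent bits,
    7 significand bits, i.e. the upper 16 bits of the binary32 encoding.
    Rounding mode used here: round to nearest, ties to even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF and fraction != 0:
        # NaN: force a significand bit on so the result stays a NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add half of the discarded range, plus the kept LSB to break ties to even.
    lsb = (bits >> 16) & 1
    rounded = bits + 0x7FFF + lsb
    return (rounded >> 16) & 0xFFFF


def compress_vector(source):
    """Convert, round, and write each of the N source elements."""
    return [compress_to_nhp(v) for v in source]


packed = compress_vector([1.0, -2.0, 1.0 + 2**-8, 1.0 + 2**-7 + 2**-8])
```

The last two inputs are exact ties between two representable NHP values, so they show the ties-to-even behavior: one rounds down and the other rounds up.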

[0257] Technical Solution 2. The processor as described in Technical Solution 1,

[0258] The fetching, decoding, and execution circuitry is further configured to fetch, decode, and execute a second compression instruction, the second compression instruction specifying the positions of a second source vector having N elements formatted according to the single-precision format and a second compressed vector having N elements formatted according to the NHP format;

[0259] The aforementioned fetch and decode circuitry is further configured to fetch and decode a Mixed Precision Vector Multiply-Accumulate (MPVMAC) instruction, the MPVMAC instruction having fields for specifying first and second source vectors having N NHP-formatted elements and a destination vector having N single-precision-formatted elements; wherein the specified source vectors are the compressed vector and the second compressed vector; and

[0260] The execution circuitry is further configured to respond to the decoded MPVMAC instruction for each of the N elements by generating a 16-bit product of the compressed vector element and the second compressed vector element and accumulating the generated 16-bit product with the previous content of the corresponding element of the destination vector.
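
A behavioral sketch of the multiply-accumulate step may help. It assumes the same upper-half-of-binary32 layout for NHP values, uses hypothetical function names, and models arithmetic with Python floats: the product of two values with 8-bit significands is exact, and the sum is rounded once to binary32. It does not model any intermediate rounding of the product or the exact hardware datapath.

```python
import math
import struct


def nhp_to_float(h: int) -> float:
    """Decode a 16-bit NHP pattern (assumed: upper half of binary32)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def round_to_fp32(x: float) -> float:
    """Round a Python float to binary32, overflowing to infinity."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def mpvmac(destination, source1, source2):
    """For each of the N elements: destination[i] += source1[i] * source2[i].

    Sources hold NHP bit patterns; the destination holds single-precision values.
    """
    if not (len(destination) == len(source1) == len(source2)):
        raise ValueError("all three vectors must have N elements")
    result = []
    for accumulator, a, b in zip(destination, source1, source2):
        product = nhp_to_float(a) * nhp_to_float(b)  # exact in a Python float
        result.append(round_to_fp32(accumulator + product))
    return result


# 1.0 * 3.0 accumulated onto 1.0, and 2.0 * 0.5 accumulated onto 0.0
accumulated = mpvmac([1.0, 0.0], [0x3F80, 0x4000], [0x4040, 0x3F00])
```

Keeping the accumulator in single precision is what lets many small products be summed without the loss that a 16-bit accumulator would incur.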

[0261] Technical Solution 3. The processor as described in Technical Solution 2, wherein the MPVMAC instruction further has a field for specifying a write mask, the specified write mask comprising N bits, each bit identifying whether the corresponding element of the destination vector is unmasked and is to be written with the generated 16-bit product, or is masked and is to be zeroed or merged.
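
The effect of the write mask can be sketched as below. The bit polarity (a set bit means unmasked), the `zeroing` flag, and the function name are assumptions made for illustration; the text only says that masked elements are either zeroed or merged.

```python
def apply_write_mask(previous, computed, mask, zeroing=False):
    """Select, per element, between the computed result and the masked behavior.

    Assumption: bit i of `mask` set means element i is unmasked and written.
    Masked elements are zeroed when `zeroing` is True, otherwise merged
    (the previous destination content is kept).
    """
    output = []
    for i, (old, new) in enumerate(zip(previous, computed)):
        if (mask >> i) & 1:
            output.append(new)   # unmasked: write the new result
        elif zeroing:
            output.append(0.0)   # masked, zeroing: clear the element
        else:
            output.append(old)   # masked, merging: keep the old value
    return output


old_values = [1.0, 2.0, 3.0, 4.0]
new_values = [10.0, 20.0, 30.0, 40.0]
merged = apply_write_mask(old_values, new_values, 0b0101)
zeroed = apply_write_mask(old_values, new_values, 0b0101, zeroing=True)
```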

[0262] Technical Solution 4. The processor as described in Technical Solution 1,

[0263] The fetching circuit is further configured to fetch extended instructions having fields for specifying the location of the destination vector and the compressed vector, the destination vector having N elements formatted according to the single-precision format;

[0264] The processor further includes:

[0265] A decoding circuit, the decoding circuit being used to decode the fetched extended instructions; and

[0266] An execution circuit, the execution circuit being configured to respond to the decoded extended instructions by:

[0267] Convert each element of the compressed vector into the single-precision format; and

[0268] Write each transformed element to the corresponding destination vector element.
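
The reverse conversion is simpler, as this sketch suggests. It assumes, as an illustration only, that an NHP value occupies the upper 16 bits of a binary32 encoding, in which case every NHP value is exactly representable in single precision and no rounding is needed. The function names are hypothetical.

```python
import struct


def expand_from_nhp(h: int) -> float:
    """Convert one 16-bit NHP pattern to single precision.

    Assumed layout: NHP is the upper half of binary32, so expansion appends
    16 zero significand bits and is lossless.
    """
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def expand_vector(compressed):
    """Convert each compressed element and write it to the destination."""
    return [expand_from_nhp(h) for h in compressed]


expanded = expand_vector([0x3F80, 0xC000, 0x4040])
```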

[0269] Technical Solution 5. The processor as described in Technical Solution 1, wherein the single-precision format is the binary32 format standardized by the Institute of Electrical and Electronics Engineers as part of the IEEE 754-2008 standard.

[0270] Technical Solution 6. The processor as described in Technical Solution 5, wherein the rounding mode is specified by the IEEE 754 standard and is one of the following: round to nearest, ties to even; round to nearest, ties away from zero; round toward zero; round toward positive infinity; and round toward negative infinity, wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.
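
To make the five rounding modes concrete, the sketch below applies each of them when narrowing a binary32 value to 16 bits. The mode names passed as strings, the function name, and the assumed NHP layout (sign, eight exponent bits, seven significand bits in the upper half of binary32) are illustrative choices; how an immediate or a control and status register encodes the mode is not modeled.

```python
import struct


def compress_with_mode(x: float, mode: str = "nearest_even") -> int:
    """Narrow a single-precision value to 16 bits under a chosen rounding mode."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    magnitude = bits & 0x7FFFFFFF
    kept = magnitude >> 16        # exponent and 7 significand bits
    dropped = magnitude & 0xFFFF  # 16 bits that are rounded away
    if magnitude > 0x7F800000:
        # NaN: keep it a NaN and skip rounding.
        return (sign << 15) | kept | 0x0040
    if mode == "nearest_even":
        round_up = dropped > 0x8000 or (dropped == 0x8000 and (kept & 1) == 1)
    elif mode == "nearest_away":
        round_up = dropped >= 0x8000
    elif mode == "toward_zero":
        round_up = False
    elif mode == "toward_positive":
        round_up = dropped != 0 and sign == 0
    elif mode == "toward_negative":
        round_up = dropped != 0 and sign == 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    # Incrementing the magnitude also carries into the exponent when needed.
    return (sign << 15) | (kept + (1 if round_up else 0))


tie = 1.0 + 2**-8  # exactly halfway between two adjacent 16-bit values
results = {
    m: compress_with_mode(tie, m)
    for m in ("nearest_even", "nearest_away", "toward_zero",
              "toward_positive", "toward_negative")
}
```

Because the rounding acts on the sign-magnitude encoding, the directed modes round a negative input the opposite way from its positive counterpart.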

[0271] Technical Solution 7. The processor as described in Technical Solution 1, wherein the specified source vector and the compressed vector each occupy one or more rows of a matrix having M rows by N columns.

[0272] Technical Solution 8. The processor as described in Technical Solution 1, wherein the execution circuitry is further configured to perform rounding as needed during conversion, accumulation, and multiplication according to the rounding mode.

[0273] Technical Solution 9. The processor as described in Technical Solution 1, wherein the rounding mode is one of the following: round to nearest even, round toward negative infinity, round toward positive infinity, and round toward zero, and wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0274] Technical Solution 10. The processor as described in Technical Solution 1, wherein the execution circuitry is further configured to perform saturation on demand during accumulation and multiplication.
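
One possible reading of saturation is sketched below: a result whose magnitude would exceed the largest finite single-precision value is clamped to that value instead of becoming infinity. This interpretation, the function name, and the constant name are assumptions; the text does not define the saturation behavior.

```python
import math
import struct

# Largest finite binary32 value (bit pattern 0x7F7FFFFF).
FP32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


def saturating_accumulate(accumulator: float, product: float) -> float:
    """Add a product to a single-precision accumulator with saturation.

    Assumed behavior: clamp to +/-FP32_MAX rather than overflow to infinity.
    """
    total = accumulator + product
    if math.isnan(total):
        return total
    if abs(total) > FP32_MAX:
        return math.copysign(FP32_MAX, total)
    # In range: round the sum to binary32.
    return struct.unpack("<f", struct.pack("<f", total))[0]


clamped_high = saturating_accumulate(FP32_MAX, FP32_MAX)
clamped_low = saturating_accumulate(-FP32_MAX, -FP32_MAX)
in_range = saturating_accumulate(1.0, 2.0)
```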

[0275] Technical Solution 11. A method comprising:

[0276] A fetch circuit is used to fetch a compression instruction, which has a field for specifying the position of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements.

[0277] Use a decoding circuit to decode the fetched compression instruction;

[0278] The execution circuitry responds to the decoded compression instructions by performing the following operations:

[0279] Convert each element of the source vector into the NHP format;

[0280] Round each transformed element according to the rounding mode; and

[0281] Write each rounded element to its corresponding compressed vector element;

[0282] The NHP format described therein includes seven significand bits and eight exponent bits; and

[0283] The source vector and the compressed vector are each in memory or in a register.

[0284] Technical Solution 12. The method as described in Technical Solution 11, further comprising:

[0285] The fetch, decode, and execute circuitry is used to fetch, decode, and execute a second compression instruction, the second compression instruction specifying the positions of a second source vector having N elements formatted according to the single-precision format and a second compressed vector having N elements formatted according to the NHP format;

[0286] The aforementioned fetch and decode circuitry is used to fetch and decode a Mixed Precision Vector Multiply-Accumulate (MPVMAC) instruction, which has fields for specifying first and second source vectors having N NHP-formatted elements and a destination vector having N single-precision-formatted elements, wherein the specified source vectors are the compressed vector and the second compressed vector; and

[0287] The execution circuit responds to the decoded MPVMAC instruction for each of the N elements by generating a 16-bit product of the compressed vector element and the second compressed vector element and accumulating the generated 16-bit product with the previous content of the corresponding element of the destination vector.

[0288] Technical Solution 13. The method as described in Technical Solution 12, wherein the MPVMAC instruction further has a field for specifying a write mask, the specified write mask comprising N bits, each bit identifying whether the corresponding element of the destination vector is unmasked and is to be written with the generated 16-bit product, or is masked and is to be zeroed or merged.

[0289] Technical Solution 14. The method as described in Technical Solution 11 further includes:

[0290] The fetch circuit is used to fetch extended instructions, the extended instructions having fields for specifying the location of the destination vector and the compressed vector, the destination vector having N elements formatted according to the single-precision format;

[0291] Use a decoding circuit to decode the fetched extended instructions;

[0292] The execution circuitry responds to the decoded extended instructions by performing the following operations:

[0293] Convert each element of the compressed vector into the single-precision format; and

[0294] Write each transformed element to the corresponding destination vector element.

[0295] Technical Solution 15. The method as described in Technical Solution 11, wherein the single-precision format is the binary32 format standardized by the Institute of Electrical and Electronics Engineers as part of the IEEE 754-2008 standard.

[0296] Technical Solution 16. The method of Technical Solution 15, wherein the rounding mode is specified by the IEEE 754 standard and is one of the following: round to nearest, ties to even; round to nearest, ties away from zero; round toward zero; round toward positive infinity; and round toward negative infinity, wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0297] Technical Solution 17. The method as described in Technical Solution 11, wherein the specified source vector and the compressed vector each occupy one or more rows of a matrix having M rows by N columns.

[0298] Technical Solution 18. The method as described in Technical Solution 11, wherein the execution circuit is further configured to perform rounding as needed during conversion, accumulation, and multiplication according to the rounding mode.

[0299] Technical Solution 19. The method of Technical Solution 11, wherein the rounding mode is one of the following: round to nearest even, round toward negative infinity, round toward positive infinity, and round toward zero, and wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0300] Technical Solution 20. The method as described in Technical Solution 11, wherein the execution circuit is further configured to perform saturation on demand during accumulation and multiplication.

[0301] More examples

[0302] Example 1 provides an exemplary processor comprising: fetch circuitry for fetching a compression instruction having fields specifying the locations of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements; decoding circuitry for decoding the fetched compression instruction; and execution circuitry for responding to the decoded compression instruction by: converting each element of the source vector to the NHP format; rounding each converted element according to a rounding mode; and writing each rounded element to a corresponding compressed vector element; wherein the NHP format includes seven significand bits and eight exponent bits; and wherein the source vector and the compressed vector are each in memory or in a register.

[0303] Example 2 includes the subject matter of the exemplary processor as described in Example 1, wherein the fetch, decode, and execute circuitry is further configured to fetch, decode, and execute a second compression instruction specifying the locations of a second source vector having N elements formatted according to the single-precision format and a second compressed vector having N elements formatted according to the NHP format; wherein the fetch and decode circuitry is further configured to fetch and decode a mixed-precision vector multiply-accumulate (MPVMAC) instruction having fields for specifying first and second source vectors having N NHP-formatted elements and a destination vector having N single-precision-formatted elements; wherein the specified source vectors are the compressed vector and the second compressed vector; and wherein the execute circuitry is further configured to respond to the decoded MPVMAC instruction by, for each of the N elements, generating a 16-bit product of the compressed vector element and the second compressed vector element and accumulating the generated 16-bit product with the previous contents of the corresponding element of the destination vector.

[0304] Example 3 includes the subject of an exemplary processor as described in Example 1, wherein the fetch circuitry is further configured to fetch extended instructions having fields for specifying the location of a destination vector and the compressed vector, the destination vector having N elements formatted according to the single-precision format; decoding circuitry for decoding the fetched extended instructions; and execution circuitry for responding to the decoded extended instructions by: converting each element of the compressed vector into the single-precision format; and writing each converted element to the corresponding destination vector element.

[0305] Example 4 includes the subject matter of the exemplary processor as described in Example 2, wherein the MPVMAC instruction further has a field for specifying a write mask, the specified write mask comprising N bits, each bit identifying whether the corresponding element of the destination vector is unmasked and is to be written with the generated 16-bit product, or is masked and is to be zeroed or merged.

[0306] Example 5 includes the subject matter of the exemplary processor as described in any one of Examples 1-4, wherein the single-precision format is the binary32 format standardized by the Institute of Electrical and Electronics Engineers as part of the IEEE 754-2008 standard.

[0307] Example 6 includes the subject of an exemplary processor as described in any one of Examples 1-4, wherein the specified source vector and the compressed vector each occupy one or more rows of a matrix having M rows by N columns.

[0308] Example 7 includes the subject of an exemplary processor as described in any one of Examples 1-4, wherein the execution circuitry is further configured to perform rounding as needed during transformation, accumulation, and multiplication, according to a rounding mode.

[0309] Example 8 includes the subject matter of the exemplary processor as described in Example 1, wherein the rounding mode is one of the following: round to nearest even, round toward negative infinity, round toward positive infinity, and round toward zero, and wherein the rounding mode is specified either on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0310] Example 9 includes the subject of an exemplary processor as described in Example 5, wherein the rounding mode is specified by the IEEE 754 standard and is one of the following: rounding to the nearest number, taking an even number when two numbers are equally close; rounding to the nearest number, taking the number farther from zero when two numbers are equally close; rounding towards zero; rounding towards positive infinity; and rounding towards negative infinity, and wherein the rounding mode is specified on an instruction-by-instruction basis by an immediate value specified by the instruction, or on an embedded basis by software programmable control and a status register.
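To make the rounding modes listed in Examples 8 and 9 concrete, the following sketch narrows a binary32 value to a 16-bit pattern with one sign bit, eight exponent bits, and seven significand bits under each mode. It rests on several assumptions not confirmed by the text: the 16-bit format is taken to be the upper 16 bits of the binary32 encoding (same exponent bias), only finite inputs are handled, and the function name `compress` and the mode strings are hypothetical.

```python
import struct


def compress(x: float, mode: str = "nearest_even") -> int:
    """Narrow a finite binary32 value to an assumed 16-bit S1/E8/M7 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    magnitude = bits & 0x7FFFFFFF   # exponent and significand fields
    kept = magnitude >> 16          # the 15 magnitude bits that survive
    dropped = magnitude & 0xFFFF    # the 16 significand bits being discarded

    if mode == "nearest_even":
        round_up = dropped > 0x8000 or (dropped == 0x8000 and (kept & 1) == 1)
    elif mode == "nearest_away":
        round_up = dropped >= 0x8000
    elif mode == "toward_zero":
        round_up = False
    elif mode == "toward_positive":
        round_up = dropped != 0 and sign == 0
    elif mode == "toward_negative":
        round_up = dropped != 0 and sign == 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")

    # Incrementing the magnitude carries into the exponent field when the
    # significand overflows, which is the desired behavior for finite inputs.
    return (sign << 15) | (kept + int(round_up))


tie = 1.00390625  # 1 + 2**-8, exactly halfway between two 16-bit neighbors
codes = {m: compress(tie, m) for m in
         ("nearest_even", "nearest_away", "toward_zero",
          "toward_positive", "toward_negative")}
```

The input `1 + 2**-8` is chosen because it lies exactly halfway between two representable 16-bit values, so the two round-to-nearest variants give different answers.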

[0311] Example 10 includes the subject of an exemplary processor as described in any one of Examples 1-4, wherein the execution circuitry is further configured to perform saturation on demand during accumulation and multiplication.
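One possible reading of the saturation in Example 10 is that a result that would overflow is clamped to the largest finite single-precision magnitude instead of becoming infinity. The sketch below illustrates that reading only; the text does not define the saturation rule, and the names `to_f32`, `accumulate`, and the `saturate` flag are hypothetical.

```python
import math
import struct

# Largest finite binary32 value (bit pattern 0x7F7FFFFF).
F32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


def to_f32(x: float) -> float:
    """Round a Python float to binary32, mapping overflow to +/- infinity."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def accumulate(acc: float, product: float, saturate: bool = True) -> float:
    """Add product to acc in binary32, optionally clamping infinite results."""
    total = to_f32(acc + product)
    if saturate and math.isinf(total):
        return math.copysign(F32_MAX, total)  # assumed saturation rule
    return total


clamped = accumulate(F32_MAX, F32_MAX, saturate=True)
overflowed = accumulate(F32_MAX, F32_MAX, saturate=False)
```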

[0312] Example 11 provides an exemplary method comprising: using fetch circuitry to fetch a compression instruction having fields specifying the locations of a source vector having N single-precision formatted elements and a compressed vector having N neural half-precision (NHP) formatted elements; using decoding circuitry to decode the fetched compression instruction; using execution circuitry to respond to the decoded compression instruction by: converting each element of the source vector to the NHP format; rounding each converted element according to a rounding mode; and writing each rounded element to the corresponding compressed vector element; wherein the NHP format includes seven significand bits and eight exponent bits; and wherein the source vector and the compressed vector are each in memory or in a register.
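The compression step of Example 11 can be approximated in software as follows. This is a sketch under stated assumptions, not the disclosed hardware: it assumes the NHP encoding is the upper 16 bits of the binary32 encoding (one sign bit, eight exponent bits with the same bias, seven significand bits), it fixes the rounding mode to round to nearest even, and the names `compress_rne` and `compress_vector` as well as the NaN policy are hypothetical.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def compress_rne(x: float) -> int:
    """Convert one binary32 value to a 16-bit NHP pattern, ties to even."""
    bits = f32_bits(x)
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Assumed policy: force a significand bit so the result stays a NaN.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Adding 0x7FFF plus the lowest kept bit rounds to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def compress_vector(source):
    """Compress every element of an N-element single-precision vector."""
    return [compress_rne(v) for v in source]


packed = compress_vector([1.0, -2.0, 1.00390625, 1.01171875])
```

The last two inputs are exact ties (`1 + 2**-8` and `1 + 3 * 2**-8`); the first rounds down and the second rounds up, so both land on a pattern whose lowest bit is zero.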

[0313] Example 12 includes the subject matter of the exemplary method as described in Example 11, further comprising: using the fetch, decode, and execute circuitry to fetch, decode, and execute a second compression instruction specifying the locations of a second source vector having N elements formatted according to the single-precision format and a second compressed vector having N elements formatted according to the NHP format; using the fetch and decode circuitry to fetch and decode a vector multiplication instruction, namely a mixed-precision vector multiply-accumulate (MPVMAC) instruction, having fields for specifying first and second source vectors having N NHP-formatted elements and a destination vector having N single-precision-formatted elements, wherein the specified source vectors are the compressed vector and the second compressed vector; and using the execute circuitry to respond to the decoded vector multiplication instruction by, for each of the N elements, generating a 16-bit product of the compressed vector element and the second compressed vector element and accumulating the generated 16-bit product with the previous contents of the corresponding element of the destination vector.
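The multiply-accumulate of Example 12 can be modeled element by element as shown below. This is a rough software model under stated assumptions, not the disclosed datapath: NHP values are assumed to be the upper 16 bits of a binary32 encoding, the helper names `expand`, `to_f32`, and `mpvmac` are hypothetical, and the accumulation is done in Python's binary64 and then rounded to binary32, which may differ from a hardware accumulator in rare double-rounding cases.

```python
import struct


def expand(nhp: int) -> float:
    """Widen a 16-bit NHP pattern to a float by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (nhp & 0xFFFF) << 16))[0]


def to_f32(x: float) -> float:
    """Round a Python float to binary32 (raises OverflowError on overflow)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mpvmac(dest, src1, src2):
    """Return dest[i] + src1[i] * src2[i] for each of the N elements."""
    if not (len(dest) == len(src1) == len(src2)):
        raise ValueError("all three vectors must have N elements")
    result = []
    for acc, a, b in zip(dest, src1, src2):
        # Each operand has an 8-bit significand (7 stored bits plus the
        # implicit leading bit), so the product of finite inputs is exact here.
        product = expand(a) * expand(b)
        result.append(to_f32(acc + product))
    return result


# 0x3F80 = 1.0, 0x4000 = 2.0, 0x4040 = 3.0 in the assumed encoding.
dest = mpvmac([1.0, 0.5], [0x3F80, 0x4000], [0x4000, 0x4040])
print(dest)  # [3.0, 6.5]
```

Keeping the destination in single precision while the sources are 16 bits wide is the point of the mixed-precision scheme: the inputs are narrow, but small products are not lost as quickly in a long accumulation.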

[0314] Example 13 includes the subject matter of the exemplary method as described in Example 11, further comprising: using the fetching circuitry to fetch an extended instruction having a field for specifying the location of a destination vector and the compressed vector, the destination vector having N elements formatted according to the single-precision format; using the decoding circuitry to decode the fetched extended instruction; and using the execution circuitry to respond to the decoded extended instruction by: converting each element of the compressed vector into the single-precision format; and writing each converted element to the corresponding destination vector element.
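The expansion in Example 13 is the inverse direction of the compression step and is lossless, because every NHP value has an exact single-precision counterpart. The sketch below assumes, as an illustration only, that an NHP element occupies the upper 16 bits of the binary32 encoding; the names `expand` and `expand_vector` are hypothetical.

```python
import struct


def expand(nhp: int) -> float:
    """Convert a 16-bit NHP pattern to single precision (exact, no rounding)."""
    bits = (nhp & 0xFFFF) << 16          # lower 16 significand bits are zero
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def expand_vector(compressed):
    """Expand every element of an N-element compressed vector."""
    return [expand(e) for e in compressed]


values = expand_vector([0x3F80, 0xC000, 0x4040, 0x0000])
```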

[0315] Example 14 includes the subject of an exemplary method as described in Example 12, wherein the vector multiplication instruction further has a field for specifying a write mask, the specified write mask comprising N bits, each bit identifying whether the corresponding element of the destination vector is unmasked and is to be written with the generated 16-bit product, or is masked and is to be either zeroed or merged.

[0316] Example 15 includes the subject of an exemplary method as described in any one of Examples 11-14, wherein the single-precision format is the binary32 format standardized by the Institute of Electrical and Electronics Engineers as part of the IEEE 754-2008 standard.

[0317] Example 16 includes the subject of an exemplary method as described in any one of Examples 11-14, wherein the specified source vector and the compressed vector each occupy one or more rows of a matrix having M rows by N columns.

[0318] Example 17 includes the subject of an exemplary method as described in any one of Examples 11-14, wherein the execution circuitry is further configured to perform rounding as needed during conversion, accumulation, and multiplication according to a rounding mode.

[0319] Example 18 includes the subject of an exemplary method as described in Example 11, wherein the rounding mode is one of the following: round to nearest even, round toward negative infinity, round toward positive infinity, and round toward zero, and wherein the rounding mode is specified on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0320] Example 19 includes the subject of an exemplary method as described in Example 15, wherein the rounding mode is specified by the IEEE 754 standard and is one of the following: round to nearest, ties to even; round to nearest, ties away from zero; round toward zero; round toward positive infinity; and round toward negative infinity, and wherein the rounding mode is specified on a per-instruction basis by an immediate value specified by the instruction, or on an embedded basis by a software-programmable control and status register.

[0321] Example 20 includes the subject of an exemplary method as described in any one of Examples 11-14, wherein the execution circuitry is further configured to perform saturation on demand during accumulation and multiplication.

Claims

1. A chip, comprising:
a plurality of memory controllers;
a Level 2 (L2) cache memory coupled to the plurality of memory controllers;
a processor coupled to the plurality of memory controllers and coupled to the L2 cache memory, the processor having a plurality of cores, the plurality of cores including a core that, in response to an instruction specifying a first source vector comprising a plurality of 16-bit floating-point data elements, a second source vector comprising a plurality of 16-bit floating-point data elements, and a third source vector comprising a plurality of floating-point data elements, is to:
multiply the plurality of 16-bit floating-point data elements from the first source vector by corresponding 16-bit floating-point data elements of the plurality of 16-bit floating-point data elements from the second source vector to generate a plurality of corresponding products, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector each have a sign bit, eight exponent bits, and seven significand bits;
accumulate the plurality of products with corresponding floating-point data elements of the plurality of floating-point data elements from the third source vector to generate a plurality of corresponding accumulated floating-point data elements;
round one or more of the accumulated floating-point data elements according to a rounding mode;
saturate one or more of the accumulated floating-point data elements; and
store a plurality of result floating-point data elements in a destination;
an interconnect coupled to the processor; and
a bus controller coupled to the processor.

2. The chip according to claim 1, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector are each neural half-precision elements.

3. The chip according to claim 1, wherein the rounding mode is round to nearest even and is specified by the instruction.

4. The chip according to any one of claims 1 to 3, further comprising an instruction converter to convert the instruction into one or more instructions of a different instruction set executable by the core.

5. The chip according to any one of claims 1 to 3, wherein the plurality of cores includes a graphics core.

6. The chip according to any one of claims 1 to 3, wherein the plurality of cores are heterogeneous.

7. A method executed by a chip, the method comprising:
accessing memory through a plurality of memory controllers of the chip;
storing data in a Level 2 (L2) cache memory of the chip;
processing data with a plurality of cores of a processor of the chip, the plurality of cores including a core;
executing, by the core, an instruction specifying a first source vector comprising a plurality of 16-bit floating-point data elements, a second source vector comprising a plurality of 16-bit floating-point data elements, and a third source vector comprising a plurality of floating-point data elements, the executing comprising:
multiplying the plurality of 16-bit floating-point data elements from the first source vector by corresponding 16-bit floating-point data elements of the plurality of 16-bit floating-point data elements from the second source vector to generate a plurality of corresponding products, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector each have a sign bit, eight exponent bits, and seven significand bits;
accumulating the plurality of products with corresponding floating-point data elements of the plurality of floating-point data elements from the third source vector to generate a plurality of corresponding accumulated floating-point data elements;
rounding one or more of the accumulated floating-point data elements according to a rounding mode;
saturating one or more of the accumulated floating-point data elements; and
storing a plurality of result floating-point data elements in a destination;
transferring data from the processor to an interconnect of the chip; and
accessing a bus through a bus controller of the chip.

8. The method according to claim 7, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector are each neural half-precision elements.

9. The method of claim 7, further comprising determining the rounding mode according to the instruction, wherein the rounding mode is round to nearest even.

10. The method according to any one of claims 7 to 9, further comprising converting the instruction into one or more instructions of a different instruction set executable by the execution circuitry of the core.

11. The method according to any one of claims 7 to 9, wherein the plurality of cores includes a graphics core.

12. The method according to any one of claims 7 to 9, wherein the plurality of cores are heterogeneous.

13. A computer system, comprising:
system memory; and
a processor coupled to the system memory, the processor comprising:
a plurality of memory controllers;
a Level 2 (L2) cache memory coupled to the plurality of memory controllers;
a processor coupled to the plurality of memory controllers and coupled to the L2 cache memory, the processor having a plurality of cores, the plurality of cores including a core that, in response to an instruction specifying a first source vector comprising a plurality of 16-bit floating-point data elements, a second source vector comprising a plurality of 16-bit floating-point data elements, and a third source vector comprising a plurality of floating-point data elements, is to:
multiply the plurality of 16-bit floating-point data elements from the first source vector by corresponding 16-bit floating-point data elements of the plurality of 16-bit floating-point data elements from the second source vector to generate a plurality of corresponding products, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector each have a sign bit, eight exponent bits, and seven significand bits;
accumulate the plurality of products with corresponding floating-point data elements of the plurality of floating-point data elements from the third source vector to generate a plurality of corresponding accumulated floating-point data elements;
round one or more of the accumulated floating-point data elements according to a rounding mode;
saturate one or more of the accumulated floating-point data elements; and
store a plurality of result floating-point data elements in a destination;
an interconnect coupled to the processor; and
a bus controller coupled to the processor.

14. The computer system of claim 13, further comprising a mass storage device coupled to the processor, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector are each neural half-precision elements.

15. The computer system of claim 13, further comprising a mass storage device coupled to the processor, wherein the rounding mode is round to nearest even and is specified by the instruction.

16. The computer system according to any one of claims 13 to 15, further comprising:
a communication device coupled to the processor; and
an instruction converter to convert the instruction into one or more instructions of a different instruction set executable by the execution circuitry of the core.

17. The computer system according to any one of claims 13 to 15, further comprising a communication device coupled to the processor, wherein the plurality of cores includes a graphics core.

18. The computer system according to any one of claims 13 to 15, further comprising a communication device coupled to the processor, wherein the plurality of cores are heterogeneous.

19. At least one machine-readable storage medium, comprising instructions that, when executed by a processor, cause the processor to perform any of the methods according to claims 7 to 12.

20. An apparatus comprising components for performing any one of the methods according to claims 7 to 12.

21. An apparatus, the apparatus comprising:
a component for accessing memory through a plurality of memory controllers of a chip;
a component for storing data in a Level 2 (L2) cache memory of the chip;
a component for processing data via a plurality of cores of a processor of the chip, the plurality of cores including a core;
a component for executing, via the core, an instruction specifying a first source vector comprising a plurality of 16-bit floating-point data elements, a second source vector comprising a plurality of 16-bit floating-point data elements, and a third source vector comprising a plurality of floating-point data elements, the instruction being used for:
multiplying the plurality of 16-bit floating-point data elements from the first source vector by corresponding 16-bit floating-point data elements of the plurality of 16-bit floating-point data elements from the second source vector to generate a plurality of corresponding products, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector each have a sign bit, eight exponent bits, and seven significand bits;
accumulating the plurality of products with corresponding floating-point data elements of the plurality of floating-point data elements from the third source vector to generate a plurality of corresponding accumulated floating-point data elements;
rounding one or more of the accumulated floating-point data elements according to a rounding mode;
saturating one or more of the accumulated floating-point data elements; and
storing a plurality of result floating-point data elements in a destination;
a component for transferring data from the processor to an interconnect of the chip; and
a component for accessing a bus through a bus controller of the chip.

22. The apparatus according to claim 21, wherein the 16-bit floating-point data elements from the first source vector and the 16-bit floating-point data elements from the second source vector are each neural half-precision elements.

23. The apparatus of claim 21, further comprising a component for determining the rounding mode according to the instruction, wherein the rounding mode is round to nearest even.

24. The apparatus according to any one of claims 21 to 23, further comprising a component for converting the instruction into one or more instructions of a different instruction set executable by the execution circuitry of the core.

25. The apparatus according to any one of claims 21 to 23, wherein the plurality of cores includes a graphics core.

26. The apparatus according to any one of claims 21 to 23, wherein the plurality of cores are heterogeneous.