Processors, methods, systems and instructions for data element comparison

DE112016004351B4 · Active · Publication Date: 2026-07-02 · INTEL CORP

Patent Information

Authority / Receiving Office: DE · DE
Patent Type: Patents
Current Assignee / Owner: INTEL CORP
Filing Date: 2016-08-24
Publication Date: 2026-07-02

IPC: G06F9/30; G06F9/38; G06F12/08

CPC: G06F9/30032; G06F9/3013; G06F9/30021; G06F9/30036; G06F9/30192; G06F9/30038; G06F9/3001; G06F9/30101

AI Tagging

Technology Topics

Computer architecture · Data operations

AI Technical Summary

Technical Problem

Existing processors face challenges in efficiently performing operations on packed data elements that are not vertically aligned, particularly in the context of sparse matrices, leading to increased computation time in applications like machine learning due to the need to handle zero values and align non-zero values for vector operations.

Method used

The implementation of data element comparison instructions and processors that can execute these instructions, which include packed data registers and operation mask registers to efficiently identify and align non-zero values within packed data operands, allowing for parallel operations on sparse matrices.

Benefits of technology

This approach enhances the performance of sparse vector arithmetic operations by efficiently identifying and aligning non-zero values, reducing computation time and improving the efficiency of machine learning applications.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

A processor (310) comprising: a decode unit (314) to decode a data element comparison instruction (312), wherein the data element comparison instruction (312) specifies a first packed source data operand (322) containing at least four data elements, specifies a second packed source data operand (324) containing at least four data elements, and specifies one or more destination storage locations (326); and an execution unit (318) coupled to the decode unit (314), wherein the execution unit (318), in response to the data element comparison instruction (312), stores at least one result mask operand (328, 330) in the one or more destination storage locations (326), wherein the at least one result mask operand (328, 330) contains, for each corresponding data element in one of the first and second packed source data operands (322, 324), a different mask element at the same relative position, wherein each mask element indicates whether the corresponding data element in the one of the first and second packed source data operands (322, 324) is equal to any of the data elements in the other of the first and second packed source data operands (322, 324), wherein the execution unit (318), in response to the instruction (312), stores two result mask operands (328, 330) in the one or more destination storage locations (326), wherein the two result mask operands (328, 330) comprise: a first result mask operand (328), which contains, for each corresponding data element in the first packed source data operand (322), a different mask element at the same relative position, wherein each mask element of the first result mask operand (328) indicates whether the corresponding data element in the first packed source data operand (322) is equal to any of the data elements in the second packed source data operand (324); and a second result mask operand (330), which contains, for each corresponding data element in the second packed source data operand (324), a different mask element at the same relative position, wherein each mask element of the second result mask operand (330) indicates whether the corresponding data element in the second packed source data operand (324) is equal to any of the data elements in the first packed source data operand (322).

Description

BACKGROUND

Technical Field

[0001] The embodiments described herein relate generally to processors. In particular, the embodiments described herein relate to processors that process packed data operands.

Background Information

[0002] Many processors have Single Instruction, Multiple Data (SIMD) architectures. In the SIMD architectures, a packed data instruction, a vector instruction, or a SIMD instruction can operate on multiple data elements or multiple pairs of data elements simultaneously or in parallel. The processor may include parallel execution hardware responsive to the packed data instruction to perform multiple operations concurrently or in parallel.

[0003] Multiple data elements may be packed within a register or memory location as packed data or vector data. In packed data, the bits of the register or other storage location may be logically divided into a sequence of data elements. For example, a 256-bit wide packed data register can hold four 64-bit data elements, eight 32-bit data elements, sixteen 16-bit data elements, and so on. Each of the data elements may represent a separate individual piece of data (e.g., a pixel color, a complex number component, etc.) that may be operated on separately and/or independently of the others.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The invention can best be understood by reference to the following description and accompanying drawings, which are used to illustrate the embodiments. In the drawings:

Figure 1 is a block diagram of a portion of an exemplary sparse matrix.
Figure 2 illustrates a compressed sparse row (CSR) representation of a subset of the columns of rows 1 and 2 of the sparse matrix of Figure 1.
Figure 3 is a block diagram of one embodiment of a processor operable to execute one embodiment of a data element comparison instruction.
Figure 4 is a block diagram of one embodiment of a method for executing one embodiment of a data element comparison instruction.
Figure 5 is a block diagram of a first example embodiment of a data element comparison operation.
Figure 6 is a block diagram of a second example embodiment of a data element comparison operation.
Figure 7 is a block diagram of a third example embodiment of a data element comparison operation.
Figure 8 is a block diagram of a fourth example embodiment of a data element comparison operation.
Figure 9 is a block diagram of an exemplary masked data element merge operation.
Figure 10 is a block diagram of an exemplary embodiment of a suitable set of packed data operation mask registers.
Figure 11 is a block diagram of an exemplary embodiment of a suitable set of packed data registers.
Figures 12A-C are block diagrams illustrating a generic vector-friendly instruction format and its instruction templates according to embodiments of the invention.
Figures 13A-B are block diagrams illustrating an example specific vector-friendly instruction format and an opcode field according to embodiments of the invention.
Figures 14A-D are block diagrams illustrating an example specific vector-friendly instruction format and its fields according to embodiments of the invention.
Figure 15 is a block diagram of one embodiment of a register architecture.
Figure 16A is a block diagram illustrating one embodiment of an in-order pipeline and an out-of-order issue/execution pipeline with register renaming.
Figure 16B is a block diagram of one embodiment of a processor core including a front-end unit coupled to an execution engine unit, both of which are coupled to a memory unit.
Figure 17A is a block diagram of one embodiment of a single processor core, along with its connection to the on-die interconnect network and its local subset of the level 2 (L2) cache.
Figure 17B is a block diagram of one embodiment of an expanded view of a portion of the processor core of Figure 17A.
Figure 18 is a block diagram of one embodiment of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics.
Figure 19 is a block diagram of a first embodiment of a computer architecture.
Figure 20 is a block diagram of a second embodiment of a computer architecture.
Figure 21 is a block diagram of a third embodiment of a computer architecture.
Figure 22 is a block diagram of a fourth embodiment of a computer architecture.
Figure 23 is a block diagram of the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

[0005] Disclosed herein are data element comparison instructions, processors to execute the instructions, methods performed by the processors when processing or executing the instructions, and systems including one or more processors to process or execute the instructions. Numerous specific details (e.g., specific instruction operations, data formats, processor configurations, microarchitectural details, sequences of operations, etc.) are set forth in the following description. However, the embodiments may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the understanding of the description.

[0006] The data element comparison instructions disclosed herein are general-purpose instructions and are not limited to any known use. Rather, these instructions may be used for different purposes and/or in different ways based on the creativity of the programmer, the compiler, or the like. In some embodiments, these instructions may be used to process data associated with sparse matrices, although the scope of the invention is not so limited. In some embodiments, these instructions may be used to process data associated with a compressed sparse row (CSR) representation, although the scope of the invention is not so limited. To further illustrate certain concepts, specific uses of these instructions to process the indices of a CSR format, which can be used to represent the indices and values of a sparse matrix, will be described, although it should be recognized that this is just one possible use of these instructions. Representatively, this may be useful in data analytics, high-performance computing, machine learning, sparse linear algebra problems, and the like. In other embodiments, these instructions may be used to process other types of data besides sparse matrices and/or CSR format data. For example, these instructions may be used to process different types of data, such as multimedia data, graphics data, sound data, video data, pixels, text string data, character string data, financial data, or other types of integer data. Furthermore, such processing of the data may be used for various purposes, for example to identify duplicate data elements, select duplicate data elements, merge duplicate data elements, remove duplicate data elements, modify duplicate data elements, or for various other purposes.

[0007] Figure 1 is a block diagram of a portion of an exemplary sparse matrix 100. The matrix generally represents a two-dimensional data structure in which the data values are arranged in rows and columns. The data values may also be referred to herein simply as values or data elements. The illustrated example sparse matrix is shown with at least thirty-nine columns and at least two rows, and optionally more. Alternatively, other sparse matrices may have more rows and/or fewer or more columns. The values of the first row are shown as a* values, where the asterisk (*) represents the number of the column containing the value. Similarly, the values of the second row are shown as b* values, where the asterisk (*) represents the number of the column containing the value. For example, the value in row 1, column 7 is a7, the value in row 1, column 23 is a23, the value in row 2, column 15 is b15, and so on.

[0008] In many different applications it may be desirable to operate on two vectors, for example on two rows of the sparse matrix. This may be done, for example, in dot product calculations of sparse vectors. Such sparse vector dot product calculations are commonly used in, for example, machine learning applications. Examples of such machine learning applications are the kernelized support vector machine (SVM), the open-source libSVM, kernelized principal component analysis, and the like. A kernel commonly used in such applications is the squared distance computation pattern, also known as the L2 norm between two vectors. The squared distance function f (‖f‖) between two vectors α and β is represented by Equation 1:

$$\|\alpha - \beta\|^2 = \alpha^2 + \beta^2 - 2\,\alpha \cdot \beta$$

[0009] The inner product (·) between the two vectors α and β, which may be sparse vectors, is computed as a dot product, as shown in Equation 2:

$$\alpha \cdot \beta = \sum_{i} \alpha(i) * \beta(i), \qquad 0 \le i \le \min(\operatorname{length}(\alpha), \operatorname{length}(\beta))$$
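As a minimal sketch of Equations 1 and 2, the following Python computes the dot product over the shared length of two vectors and derives the squared distance from it. The function names are hypothetical, zero-based indexing is assumed, and the terms α² and β² of Equation 1 are read here as α · α and β · β, which is an interpretation rather than something the text states.

```python
def dot(alpha, beta):
    """Equation 2: sum of element-wise products over the shared length."""
    n = min(len(alpha), len(beta))
    return sum(alpha[i] * beta[i] for i in range(n))


def squared_distance(alpha, beta):
    """Equation 1, reading alpha^2 as dot(alpha, alpha) (an assumption)."""
    return dot(alpha, alpha) + dot(beta, beta) - 2 * dot(alpha, beta)


alpha = [1, 0, 3]
beta = [0, 2, 1]
print(dot(alpha, beta))               # 3
print(squared_distance(alpha, beta))  # 9
```

For equal-length vectors the result agrees with summing the squared element-wise differences directly, so the dot product is the only part of the kernel that has to pair up elements of the two vectors.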

[0010] Such sparse vector dot product calculations tend to add significantly to the overall computation time of machine learning and other applications. Accordingly, improving the performance of such sparse vector dot product calculations may help improve the performance of machine learning and other applications.

[0011] Referring again to Figure 1, the matrix 100 can be said to be sparse when a significant number or proportion of its values are zero. Such zero values often have special mathematical properties; for example, multiplication by zero produces a product of zero. When the values in different rows of the same column are multiplied, such zero values produce products of zero, whereas multiplying two non-zero values can produce a non-zero product. For example, multiplying the data elements in rows 1 and 2 of column 2 (i.e., a2 * 0) produces a product of zero, whereas multiplying the data elements in rows 1 and 2 of column 3 (i.e., a3 * b3) produces a non-zero product. Furthermore, in the specific case of a multiply-accumulate or dot product type calculation, such zero values generally do not contribute to the overall accumulation value or dot product.

[0012] Accordingly, in these and certain other uses, it may be desirable to ignore these zero values of the sparse matrix. In the sparse matrix of this particular example, there are only three pairs of values from rows 1 and 2 that occupy a common column and are both non-zero, as shown by reference numeral 102. Specifically, these are a3 and b3, a7 and b7, and a23 and b23. In some embodiments, it may be advantageous to efficiently identify and/or isolate such pairs of values. As explained further below, the data element comparison instructions disclosed herein are useful for this purpose, although they are not limited to just this purpose.
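To make the idea of "pairs of non-zero values in a common column" concrete, here is a small sketch on dense rows. The helper name and the numeric values are hypothetical; only the non-zero column positions (2, 3, 7, 9 for row 1 and 3, 5, 6, 7 for row 2, within the first nine columns) follow the example in the text.

```python
def shared_nonzero_columns(row_a, row_b, first_column=1):
    """Return the column numbers where both rows hold a non-zero value."""
    return [
        col
        for col, (a, b) in enumerate(zip(row_a, row_b), start=first_column)
        if a != 0 and b != 0
    ]


# Hypothetical values for columns 1..9 of rows 1 and 2.
row1 = [0, 2.0, 3.0, 0, 0, 0, 7.0, 0, 9.0]
row2 = [0, 0, 3.5, 0, 5.5, 6.5, 7.5, 0, 0]
print(shared_nonzero_columns(row1, row2))  # [3, 7]
```

Only these columns can contribute a non-zero product to a dot product of the two rows.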

[0013] Figure 2 illustrates a compressed sparse row (CSR) representation 204 of a subset of the columns of rows 1 and 2 of the sparse matrix of Figure 1. In the CSR representation or format, the values of the matrix and/or of a vector (e.g., a single row of the matrix) are represented by a 2-tuple, or pair, of an index and a corresponding value. In the case of the sparse matrix mentioned above, the index may represent, for example, the column number, while the value may represent the data value for a given row in that column. These <index:value> 2-tuples or pairs may generally be strung together in increasing index order for all non-zero data values in a row. The end of the sequence may be indicated by a sentinel value such as negative one (i.e., -1). The zero values may be omitted from, or "compressed out" of, the CSR representation. By way of example, the CSR representations for a subset of the columns of row 1 and for a subset of the columns of row 2 may be written as follows:

Row 1: <2:a2>, <3:a3>, <7:a7>, <9:a9>, <12:a12>, <13:a13>, ... <39:a39>
Row 2: <3:b3>, <5:b5>, <6:b6>, <7:b7>, <11:b11>, <15:b15>, ... <31:b31>
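The conversion from a dense row to <index:value> pairs can be sketched as follows. The function name, the one-based column numbering, and the numeric values are assumptions for illustration; the optional end-of-sequence sentinel mentioned above is left out for brevity.

```python
def to_csr_row(dense_row, first_column=1):
    """Keep only non-zero values, each paired with its column number.

    The pairs come out in increasing index order because the dense row is
    scanned from left to right.
    """
    return [
        (col, value)
        for col, value in enumerate(dense_row, start=first_column)
        if value != 0
    ]


# Hypothetical values for columns 1..9 of rows 1 and 2.
row1 = [0, 2.0, 3.0, 0, 0, 0, 7.0, 0, 9.0]
row2 = [0, 0, 3.5, 0, 5.5, 6.5, 7.5, 0, 0]
print(to_csr_row(row1))  # [(2, 2.0), (3, 3.0), (7, 7.0), (9, 9.0)]
print(to_csr_row(row2))  # [(3, 3.5), (5, 5.5), (6, 6.5), (7, 7.5)]
```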

[0014] As can readily be seen, such a CSR format omits the zero values (which, for example, cannot contribute to a dot product or other type of operation). However, a likely consequence of the CSR representation or format is that values that were in the same column of a matrix (or set of vectors), such as the data values a3 and b3, may not be in the same relative 2-tuple position and/or "aligned" once they are converted to the CSR representation, partly because generally different numbers of zeros, and/or zeros at different positions, are removed from the different vectors. This lack of alignment is shown at reference numeral 206. For example, in the matrix of Figure 1 the values a3 and b3 were both in column 3, where they were vertically aligned, whereas in the CSR representation of rows 1 and 2 the tuple <3:a3> is in the second position from the left in the list of tuples (e.g., because a3 is the second non-zero value in row 1), while the tuple <3:b3> of row 2 is in the first position from the left (e.g., because b3 is the first non-zero value in row 2). Similarly, the data elements a7 and b7, and a23 and b23, may also be in different relative positions in the CSR format.

[0015] A likely consequence of this is that when the data is processed in vector, packed data, or single instruction, multiple data (SIMD) processors, the values that were in the same column of the matrix are no longer in corresponding, vertically aligned data element positions of the packed data operands, vectors, or SIMD operands. In some embodiments, it may be desirable to operate on the values in the same column (e.g., in the case of vector multiplication). This can pose certain challenges to implementing operations on such values efficiently, because vector, packed data, or SIMD operations are often designed to operate on corresponding, vertically aligned data elements. For example, an instruction set may include a packed multiply instruction that multiplies a corresponding pair of least significant data elements of first and second packed source data operands, multiplies a corresponding pair of next-to-least significant data elements of the first and second packed source data operands, and so on. Conversely, the packed multiply instruction may not be operable to multiply data elements in non-corresponding or non-vertically aligned positions.
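The loss of vertical alignment can be illustrated with the first four column indices of each row from the CSR listing in paragraph [0013]. The helper names below are hypothetical, and Python lists stand in for packed data operands, with list position 0 playing the role of the least significant data element.

```python
def lanes_aligned(indices_a, indices_b):
    """For each position, report whether both operands hold the same column index."""
    return [a == b for a, b in zip(indices_a, indices_b)]


def lane_positions(indices_a, indices_b):
    """Map each shared column index to its position in each operand."""
    shared = sorted(set(indices_a) & set(indices_b))
    return {col: (indices_a.index(col), indices_b.index(col)) for col in shared}


row1_indices = [2, 3, 7, 9]
row2_indices = [3, 5, 6, 7]
print(lanes_aligned(row1_indices, row2_indices))   # [False, False, False, False]
print(lane_positions(row1_indices, row2_indices))  # {3: (1, 0), 7: (2, 3)}
```

No position holds the same column in both operands, so a position-by-position packed multiply would pair values from different columns, even though columns 3 and 7 are present in both rows.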

[0016] Figure 3 is a block diagram of one embodiment of a processor 310 that is operable to execute an embodiment of a data element comparison instruction 312. In some embodiments, the processor may be a general-purpose processor (e.g., a general-purpose microprocessor or central processing unit (CPU) of the type used in desktop, laptop, or other computers). Alternatively, the processor may be a special-purpose processor. Examples of suitable special-purpose processors include, but are not limited to, network processors, communications processors, cryptographic processors, graphics processors, coprocessors, embedded processors, digital signal processors (DSPs), and controllers (e.g., microcontrollers). The processor may have any of various complex instruction set computing (CISC) architectures, reduced instruction set computing (RISC) architectures, very long instruction word (VLIW) architectures, hybrid architectures, or other types of architectures, or may have a combination of different architectures (e.g., different cores may have different architectures).

[0017] During operation, the processor 310 may receive the data element comparison instruction 312. For example, the instruction may be received from memory over a bus or other interconnect. The instruction may represent a macroinstruction, assembly language instruction, machine code instruction, or other instruction or control signal of an instruction set of the processor. In some embodiments, the data element comparison instruction may explicitly specify (e.g., through one or more fields or a set of bits), or otherwise indicate (e.g., implicitly indicate), a first packed source data operand 322, may specify or otherwise indicate a second packed source data operand 324, and may specify or otherwise indicate at least one destination storage location 326 where a first result mask operand 328 and optionally a second result mask operand 330 are to be stored. In some embodiments, there may be at least four or at least eight data elements in each of the first and second packed source data operands. In some embodiments, the data elements may represent indices corresponding to a CSR representation, although the scope of the invention is not so limited. As an example, the instruction may have source and/or destination operand specification fields to specify registers, memory locations, or other storage locations for the operands. Alternatively, one or more of these operands may optionally be implicit to the instruction (e.g., implicit to an opcode of the instruction).

[0018] Referring again to Figure 3, in some embodiments the first packed source data operand 322 may optionally be stored in a first packed data register of a set of packed data registers 320, and the second packed source data operand 324 may optionally be stored in a second packed data register of the set of packed data registers 320. Alternatively, memory locations or other storage locations may optionally be used for one or more of these operands. Each of the packed data registers may represent an on-die storage location operable to store packed data, vector data, or single instruction, multiple data (SIMD) data. The packed data registers may represent architectural registers that are visible to software and/or a programmer, and/or are the registers indicated by instructions of the processor's instruction set to identify operands. These architectural registers are contrasted with other non-architectural registers in a given microarchitecture (e.g., temporary registers, reorder buffers, retirement registers, etc.). The packed data registers may be implemented in different ways in different microarchitectures and are not limited to any particular type of design. Examples of suitable types of packed data registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations thereof. Specific examples of suitable packed data registers include, but are not limited to, those shown and described for Figure 11.

[0019] Referring again to Figure 3, in some embodiments the processor may optionally include a set of packed data operation mask registers 332. Each of the packed data operation mask registers may represent an on-die storage location operable to store at least one packed data operation mask. The packed data operation mask registers may represent architectural registers that are visible to software and/or a programmer, and/or are the registers indicated by instructions of the processor's instruction set to identify operands. Examples of suitable types of packed data operation mask registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations thereof. Specific examples of suitable packed data operation mask registers include, but are not limited to, those shown and described for Figure 10 and the mask or k-mask registers described later in this application.

[0020] As further shown, in some embodiments the one or more destination storage locations 326 may optionally be one or more packed data operation mask registers in the set of packed data operation mask registers 332. In some embodiments, a first packed data operation mask register may optionally be used to store the first result mask operand 328, and a second, different packed data operation mask register may optionally be used to store the second result mask operand 330, as explained further below (e.g., in connection with Figure 5). In other embodiments, a single packed data operation mask register may optionally be used to store both the first result mask operand 328 and the second result mask operand 330, as explained further below (e.g., in connection with Figure 6). In still other embodiments, the first result mask operand 328 and the second result mask operand 330 may optionally be stored in a packed data register of the set of packed data registers 320, as explained further below (e.g., in connection with Figure 8). For example, the result mask operands may be stored in a different packed data register than those used to store the first and second packed source data operands. Alternatively, a packed data register used for either the first or the second packed source data operand may optionally be reused to store the first and second result mask operands. For example, the instruction may specify a packed data source/destination register that the processor implicitly understands is to be used both initially for a packed source data operand and subsequently to store the result mask operands.

[0021] Referring again to Figure 3, the processor includes a decode unit or decoder 314. The decode unit may receive and decode the data element comparison instruction. The decode unit may output one or more relatively lower-level instructions or control signals 316 (e.g., one or more microinstructions, micro-operations, microcode entry points, decoded instructions or control signals, etc.) that reflect, represent, and/or are derived from the relatively higher-level data element comparison instruction. In some embodiments, the decode unit may include one or more input structures (e.g., port(s), interconnect(s), an interface) to receive the data element comparison instruction, instruction recognition and decode logic coupled therewith to recognize and decode the data element comparison instruction, and one or more output structures (e.g., port(s), interconnect(s), an interface) coupled therewith to output the lower-level instruction(s) or control signal(s). The decode unit may be implemented using various different mechanisms including, but not limited to, microcode read-only memories (ROMs), look-up tables, hardware implementations, programmable logic arrays (PLAs), and other mechanisms suitable for implementing decode units.

[0022] In some embodiments, instead of the data element comparison instruction being provided directly to the decode unit, an instruction emulator, translator, morpher, interpreter, or other instruction conversion module may optionally be used. Various types of instruction conversion modules may be implemented in software, hardware, firmware, or a combination thereof. In some embodiments, the instruction conversion module may be located outside the processor, for example on a separate die and/or in a memory (e.g., as a static, dynamic, or runtime emulation module). By way of example, the instruction conversion module may receive the data element comparison instruction, which may be of a first instruction set, and may emulate, translate, morph, interpret, or otherwise convert it into one or more corresponding intermediate instructions or control signals, which may be of a second, different instruction set. The one or more intermediate instructions or control signals of the second instruction set may be provided to a decode unit (e.g., the decode unit 314), which may decode them into one or more lower-level instructions or control signals executable by native hardware of the processor (e.g., one or more execution units).

[0023] Referring again to Figure 3, the execution unit 318 is coupled with the decode unit 314, the packed data registers 320, and optionally the packed data operation mask registers 332 (e.g., when the result mask operands 328, 330 are to be stored therein). The execution unit may receive the one or more decoded or otherwise converted instructions or control signals 316 that represent and/or are derived from the data element comparison instruction. The execution unit may also receive the first packed source data operand 322 and the second packed source data operand 324. The execution unit may be operable, in response to and/or as a result of the data element comparison instruction (e.g., in response to one or more instructions or control signals decoded from the instruction), to store the first result mask operand 328 and the optional second result mask operand 330 in the one or more destination storage locations 326 indicated by the instruction. In some embodiments, at least one result mask operand (e.g., the first result mask operand 328) may contain, for each corresponding data element in one of the first and second packed source data operands (e.g., the first packed source data operand 322), a different mask element at the same relative position within the operands. In some embodiments, each mask element may indicate whether the corresponding data element in that one of the first and second packed source data operands (e.g., the first packed source data operand 322) is equal to any of the data elements in the other of the first and second packed source data operands (e.g., the second packed source data operand 324).

[0024] In some embodiments, the first result mask operand 328 may contain, for each corresponding data element in the first packed source data operand 322, a different mask element at the same relative position within the operands, and each mask element of the first result mask operand 328 may indicate whether the corresponding data element in the first packed source data operand 322 is equal to any of the data elements in the second packed source data operand 324. In some embodiments, the second result mask operand 330 may contain, for each corresponding data element in the second packed source data operand 324, a different mask element at the same relative position within the operands, and each mask element of the second result mask operand 330 may indicate whether the corresponding data element in the second packed source data operand 324 is equal to any of the data elements in the first packed source data operand 322. In some embodiments, each mask element may be a single mask bit. In some embodiments, the result may be any of those shown and described for Figures 5-8, although the scope of the invention is not so limited.
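The two result masks described above can be modeled in software as follows. This is a sketch of the described semantics only, not of any real instruction or intrinsic: the function name is hypothetical, lists stand in for packed operands, and it is assumed that each mask element is a single bit, that bit i corresponds to data element i (least significant first), and that a set bit means "equal to at least one element of the other operand".

```python
def compare_elements(src1, src2):
    """Return (mask1, mask2) as integers with one bit per data element.

    Bit i of mask1 is set when src1[i] equals any element of src2;
    bit i of mask2 is set when src2[i] equals any element of src1.
    The bit ordering and polarity are assumptions made for this sketch.
    """
    mask1 = 0
    for pos, element in enumerate(src1):
        if element in src2:
            mask1 |= 1 << pos
    mask2 = 0
    for pos, element in enumerate(src2):
        if element in src1:
            mask2 |= 1 << pos
    return mask1, mask2


# Column indices taken from the CSR example: row 1 and row 2.
m1, m2 = compare_elements([2, 3, 7, 9], [3, 5, 6, 7])
print(format(m1, "04b"), format(m2, "04b"))  # 0110 1001
```

Read from the least significant bit, the first mask marks positions 1 and 2 of the first operand (indices 3 and 7) and the second mask marks positions 0 and 3 of the second operand (again indices 3 and 7), which are exactly the columns the two rows share.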

[0025] The execution unit and/or the processor may include specific or particular logic (e.g., transistors, integrated circuitry, or other hardware, potentially combined with firmware (e.g., instructions stored in non-volatile memory) and/or software) that is operable to perform the data element comparison instruction and/or store the result in response to and/or as a result of the data element comparison instruction (e.g., in response to one or more instructions or control signals decoded from the data element comparison instruction). In some embodiments, the execution unit may include one or more input structures (e.g., port(s), interconnect(s), an interface) to receive the source operands, circuitry or logic coupled therewith to receive and process the source operands and generate the result operands, and one or more output structures (e.g., port(s), interconnect(s), an interface) coupled therewith to output the result operands. In some embodiments, the execution unit may optionally include comparison circuitry or logic coupled with the data elements of the source operands by a fully connected crossbar, in which each data element in the first packed source data operand may be compared with each data element in the second packed source data operand, so that an all-elements-to-all-elements comparison may be performed. For example, in some embodiments, if there are N integer elements in the first packed source data operand and M integer elements in the second packed source data operand, then N * M comparisons may be performed.
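The all-to-all comparison can be pictured as an N by M table of equality results. The sketch below is a software analogy with hypothetical names, not a description of the actual circuitry; it builds that table, counts the comparisons, and reduces the rows and columns to per-element match flags.

```python
def all_to_all(src1, src2):
    """Compare every element of src1 with every element of src2.

    Returns the number of comparisons performed and, for each operand,
    a list of flags saying whether that element matched anything in the
    other operand.
    """
    matrix = [[a == b for b in src2] for a in src1]
    comparisons = sum(len(row) for row in matrix)
    matches_src1 = [any(row) for row in matrix]
    matches_src2 = [
        any(matrix[i][j] for i in range(len(src1))) for j in range(len(src2))
    ]
    return comparisons, matches_src1, matches_src2


count, in_first, in_second = all_to_all([2, 3, 7, 9], [3, 5, 6, 7])
print(count)      # 16
print(in_first)   # [False, True, True, False]
print(in_second)  # [True, False, False, True]
```

With N = 4 and M = 4 elements the table holds 16 comparisons; reducing each row gives the flags for the first operand, and reducing each column gives the flags for the second.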

[0026] To avoid obscuring the description, a relatively simple processor 310 has been shown and described. However, the processor may optionally include other processor components. For example, various embodiments may include various combinations and configurations of the components shown and described for any of Figures 15-18. All of the components of the processor may be coupled together to allow them to operate as intended.

[0027] Figure 4 is a block diagram of an embodiment of a method 436 of executing an embodiment of a data element comparison instruction. In various embodiments, the method may be performed by a processor, an instruction processing apparatus, or another digital logic device. In some embodiments, the method 436 may be performed by and/or within the processor 310 of Figure 3. The components, features, and specific optional details described herein for the processor 310 also optionally apply to the method 436. Alternatively, the method 436 may be performed by and/or within a similar or different processor or apparatus. Moreover, the processor 310 may perform methods that are the same as, similar to, or different from the method 436.

[0028] The method includes receiving the data element comparison instruction at block 437. In various aspects, the instruction may be received at a processor or a portion thereof (e.g., an instruction fetch unit, a decode unit, a bus interface unit, etc.). In various aspects, the instruction may be received from an off-processor and/or off-die source (e.g., from memory, an interconnect, etc.) or from an on-processor and/or on-die source (e.g., from an instruction cache, an instruction queue, etc.). The data element comparison instruction may specify or otherwise indicate a first packed source data operand containing at least four data elements, or in some cases at least eight or more data elements, a second packed source data operand containing at least four data elements, or in some cases at least eight or more data elements, and one or more destination storage locations. In some embodiments, the data elements may represent indices corresponding to a CSR representation, although the scope of the invention is not so limited.

[0029] At block 438, at least one result mask operand may be stored in the one or more destination storage locations in response to and/or as a result of the data element comparison instruction. The at least one result mask operand may include a different mask element for each corresponding data element in one of the first and second packed source data operands, at the same relative position within the operands. Each mask element may indicate whether the corresponding data element in that one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. In some embodiments, at least two result mask operands are stored. In some embodiments, the two result mask operands may be stored in a single mask register. In other embodiments, the two result mask operands may be stored in two different mask registers. In still other embodiments, the two result mask operands may be stored in a packed data operand, for example by storing one bit of each of the first and second result mask operands in each data element of the packed data operand.

[0030] The illustrated method involves architectural operations (e.g., those visible from a software perspective). In other embodiments, the method may optionally include one or more microarchitectural operations. By way of example, the instruction may be fetched, decoded, and scheduled out of order, the source operands may be accessed, an execution unit may perform microarchitectural operations to implement the instruction, and so on. In some embodiments, the microarchitectural operations to implement the instruction may optionally include comparing each data element of the first packed source data operand with each data element of the second packed source data operand. In some embodiments, crossbar-based hardware comparison logic may be used to perform these comparisons.

[0031] In some embodiments, the method may optionally be performed during or as part of an algorithm to accelerate sparse vector-sparse vector arithmetic (e.g., a sparse vector-sparse vector dot product), although the scope of the invention is not so limited. In some embodiments, the result mask operands stored in response to the instruction may be used to consolidate or gather together the data elements that the result mask operands indicate as matching in the packed source data operands. For example, in some embodiments, a result mask operand may be specified as a source operand of, and used by, a masked data element merge instruction. In other embodiments, the result mask operand(s) may first be minimally processed, after which the resulting mask operand(s) may be specified as source operand(s) of, and used by, the masked data element merge instruction(s).

[0032] Figure 5 is a block diagram illustrating a first example embodiment of a data element comparison operation 540 that may be performed in response to a first example embodiment of a data element comparison instruction. The instruction may specify or otherwise indicate a first packed source data operand 522 and a second packed source data operand 524. These source operands may be stored in packed data registers, memory locations, or other storage locations as previously described.

[0033] In the illustrated embodiment, both the first and second packed source data operands are 512-bit operands comprising sixteen 32-bit data elements, although operands of other sizes, data elements of other sizes, and other numbers of data elements may optionally be used in other embodiments. Typically, the number of data elements in each packed source data operand is equal to the size in bits of the packed source data operand divided by the size in bits of a single data element. In various embodiments, the size of each packed source data operand may be 64 bits, 128 bits, 256 bits, 512 bits, or 1024 bits, although the scope of the invention is not so limited. In various embodiments, the size of each data element may be 8 bits, 16 bits, 32 bits, or 64 bits, although the scope of the invention is not so limited. Other packed data operand sizes and other data element sizes are also suitable. In various embodiments, there may be at least four, at least eight, at least sixteen, at least thirty-two, or more than thirty-two data elements (e.g., at least sixty-four data elements) in each of the packed source data operands. Often the number of data elements in the first and second packed source data operands is the same, although this is not required.

[0034] For further illustration, some illustrative examples of suitable alternative formats are mentioned, although the scope of the invention is not limited to only these formats. A first exemplary format is a 128-bit packed byte format containing sixteen 8-bit data elements. A second exemplary format is a 128-bit packed word format containing eight 16-bit data elements. A third exemplary format is a 256-bit packed byte format containing thirty-two 8-bit data elements. A fourth exemplary format is a 256-bit packed word format containing sixteen 16-bit data elements. A fifth exemplary format is a 256-bit packed double-word format containing eight 32-bit data elements. A sixth exemplary format is a 512-bit packed word format containing thirty-two 16-bit data elements. A seventh exemplary format is a 512-bit packed double-word format containing sixteen 32-bit data elements. An eighth exemplary format is a 512-bit packed quadword format containing eight 64-bit data elements.
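The relationship between operand width, element width, and element count stated above can be checked with a minimal sketch. The helper name `element_count` and the `FORMATS` list are illustrative assumptions and are not part of the described instruction set.

```python
def element_count(operand_bits: int, element_bits: int) -> int:
    """Number of data elements in a packed operand of the given widths."""
    if operand_bits % element_bits != 0:
        raise ValueError("operand width must be a multiple of element width")
    return operand_bits // element_bits


# The eight example formats listed above, as (operand bits, element bits).
FORMATS = [
    (128, 8), (128, 16),
    (256, 8), (256, 16), (256, 32),
    (512, 16), (512, 32), (512, 64),
]

COUNTS = [element_count(o, e) for o, e in FORMATS]
```

For the listed formats this gives 16, 8, 32, 16, 8, 32, 16 and 8 elements respectively, matching the counts named in the paragraph.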

[0035] As shown, in some embodiments, in response to the instruction and/or operation, a first result mask operand 528 may be generated and stored in a first mask register 532-1 specified by the instruction, and a second result mask operand 530 may be generated and stored in a second mask register 532-2 specified by the instruction. In some embodiments, the first and second packed source data operands 522, 524 may be input to an execution unit 518. The execution unit, in response to the instruction (e.g., as decoded into one or more instructions or control signals 516), may generate and store the result mask operands. In some embodiments, this may include the execution unit comparing each data element in the first packed source data operand with each data element in the second packed source data operand. For example, each of the sixteen data elements in the first packed source data operand may be compared with each of the sixteen data elements in the second packed source data operand, for a total of two hundred and fifty-six comparisons.

[0036] Each result mask operand may correspond to a different one of the packed source data operands. For example, in the illustrated embodiment, the first result mask operand may correspond to the first packed source data operand, while the second result mask operand may correspond to the second packed source data operand. In some embodiments, each result mask operand may have the same number of mask elements as the number of data elements in the corresponding packed source data operand. In the illustrated embodiment, each of the mask elements is a single bit. As shown, the first result mask operand may have sixteen 1-bit mask elements, each corresponding to a different one of the sixteen data elements of the first packed source data operand at the same relative position within the operands, and the second result mask operand may have sixteen 1-bit mask elements, each corresponding to a different one of the sixteen data elements of the second packed source data operand at the same relative position within the operands. For other numbers of data elements in other embodiments, if a first packed source data operand has N data elements and a second packed source data operand has M data elements, then N * M comparisons can be performed, an N-bit first result mask corresponding to the first packed source data operand can be stored, and an M-bit second result mask corresponding to the second packed source data operand can be stored.

[0037] In some embodiments, each mask element may have a value (in this case, a bit value) that indicates whether or not its corresponding source data element (e.g., in the same relative position) in its corresponding packed source data operand matches any of the source data elements in the other, non-corresponding packed source data operand. For example, each bit in the first result mask operand may have a bit value that indicates whether or not its corresponding data element (e.g., in the same relative position) in the first packed source data operand matches any of the data elements in the second packed source data operand, whereas each bit in the second result mask operand may have a bit value that indicates whether or not its corresponding data element (e.g., in the same relative position) in the second packed source data operand matches any of the data elements in the first packed source data operand. According to one possible convention, used in the illustrated embodiment, each mask bit set to binary one (i.e., 1) may indicate that its corresponding data element in its corresponding packed source data operand matches or equals at least one data element in the other, non-corresponding packed source data operand. In contrast, each mask bit cleared to binary zero (i.e., 0) may indicate that its corresponding data element in its corresponding packed source data operand does not match or equal any of the data elements in the other, non-corresponding packed source data operand. The opposite convention is also suitable for other embodiments.

[0038] In the particular example embodiment illustrated, the only data elements in the first packed source data operand that match or equal data elements in the second packed source data operand are those with the values 3, 7 and 23. Considering the first packed source data operand, the data element with the value 3 is in the second data element position from the left (the least significant end), the data element with the value 7 is in the third data element position from the left, and the data element with the value 23 is in the tenth data element position from the left. Correspondingly, in the first result mask operand, only the second, third, and tenth mask bits from the left (the least significant end) are set to binary one (i.e., 1) to indicate that the corresponding data elements in the first packed source data operand match at least one data element in the second packed source data operand, while all other bits are cleared to binary zero (i.e., 0) to indicate that the corresponding data elements in the first packed source data operand do not match any data element in the second packed source data operand.

[0039] Likewise, considering the second packed source data operand, the data element with the value 3 is in the first data element position from the left (the least significant end), the data element with the value 7 is in the fourth data element position from the left, and the data element with the value 23 is in the ninth data element position from the left. Correspondingly, in the second result mask operand, only the first, fourth, and ninth mask bits from the left (the least significant end) are set to binary one (i.e., 1) to indicate that the corresponding data elements in the second packed source data operand match at least one data element in the first packed source data operand, while all other bits are cleared to binary zero (i.e., 0) to indicate that the corresponding data elements in the second packed source data operand do not match any data element in the first packed source data operand.
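The two-mask result described for Figure 5 can be modelled in software with a minimal sketch. The function name `compare_all_to_all` is an assumption for illustration. The operand contents are also hypothetical: only the positions of the values 3, 7 and 23 follow the description, and the remaining filler values are chosen so that nothing else matches.

```python
def compare_all_to_all(src1, src2):
    """Compare every element of src1 with every element of src2.

    Returns (mask1, mask2) as integers. Bit i of mask1 is set when src1[i]
    equals at least one element of src2; bit j of mask2 is set when src2[j]
    equals at least one element of src1. Bit 0 is the least significant end.
    """
    mask1 = 0
    mask2 = 0
    for i, a in enumerate(src1):
        for j, b in enumerate(src2):
            if a == b:
                mask1 |= 1 << i
                mask2 |= 1 << j
    return mask1, mask2


# Hypothetical operands: 3, 7 and 23 sit at the 2nd, 3rd and 10th positions
# of SRC1 and at the 1st, 4th and 9th positions of SRC2; the rest is filler.
SRC1 = [2, 3, 7, 9, 11, 12, 15, 18, 20, 23, 25, 27, 30, 33, 35, 40]
SRC2 = [3, 4, 6, 7, 8, 10, 13, 14, 23, 24, 26, 28, 29, 31, 34, 36]

MASK1, MASK2 = compare_all_to_all(SRC1, SRC2)
```

With these inputs, `MASK1` has bits 1, 2 and 9 set and `MASK2` has bits 0, 3 and 8 set, which corresponds to the second, third and tenth positions and the first, fourth and ninth positions counted from the least significant end.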

[0040] In some embodiments, the first and second mask registers may represent registers of a set of architectural registers of a processor that are used by masked packed data instructions of an instruction set of the processor to perform operation masking, predication, or conditional control of packed data operations. For example, in some embodiments, the first and second mask registers may be registers in the set of packed data operation mask registers 332 of Figure 3. The masked packed data instructions may be operable to specify the mask registers as source operands (e.g., may have a field to specify the mask registers as source operands) that are used to mask, predicate, or conditionally control a packed data operation. In some embodiments, the masking, predication, or conditional control may be provided at per-data-element granularity, so that operations on different data elements or pairs of corresponding data elements can be masked, predicated, or conditionally controlled separately and/or independently of the others. For example, each mask bit may have a first value to allow the operation to be performed and the corresponding result data element to be stored in the destination, or a second, different value to not allow the operation to be performed and/or not allow the corresponding result data element to be stored in the destination. According to one possible convention, a mask bit cleared to binary zero (i.e., 0) may represent a masked-out operation, for which the corresponding operation is not to be performed and/or the corresponding result is not to be stored, whereas a mask bit set to binary one (i.e., 1) may represent an unmasked operation, for which the corresponding operation is to be performed and the corresponding result is to be stored. The opposite convention is also possible.
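Per-element masking of a packed operation can be illustrated with a minimal sketch. The function `masked_add` is hypothetical and models one possible reading of the convention above: where a mask bit is 1 the addition is performed and stored, and where it is 0 the destination element is left unchanged.

```python
def masked_add(dest, src1, src2, mask):
    """Element-wise add controlled by a per-element mask.

    Bit i of mask (bit 0 = least significant) controls element i.
    Masked-out elements keep the value already in dest (an assumption;
    a zeroing variant would write 0 instead).
    """
    out = list(dest)
    for i in range(len(dest)):
        if (mask >> i) & 1:
            out[i] = src1[i] + src2[i]
    return out


RESULT = masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101)
```

Here only elements 0 and 2 are unmasked, so `RESULT` is `[11, 0, 33, 0]`.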

[0041] In the embodiment illustrated in Figure 5, the first and second result mask operands are stored in different mask registers (e.g., different packed data operation mask registers). One possible advantage for some embodiments is that each result mask operand and/or mask register is directly suitable for use as a source packed data operation mask operand for a masked or predicated packed data instruction, such as a masked or predicated data element merge instruction (e.g., a VPCOMPRESS instruction), although the scope of the invention is not limited to such use. By way of illustration, two instances of the masked or predicated data element merge instruction may each use a different one of the first and second result mask operands as a source mask, predicate, or conditional control operand for a data element merge operation, substantially without additional processing of the result mask operands. The unmasked bits or mask elements of the result masks or mask registers may correspond to the matching indices of the CSR tuples that were compared, and the masked or predicated data element merge instruction may use these unmasked bits or mask elements to merge the corresponding values of these CSR tuples. Further details of how such masked or predicated data element merge instructions can be used in this way are discussed further below.

[0042] Figure 6 is a block diagram illustrating a second example embodiment of a data element comparison operation 640 that may be performed in response to a second example embodiment of a data element comparison instruction. The operation 640 has certain similarities to the operation 540 of Figure 5. To avoid obscuring the description, the different and/or additional characteristics of the operation 640 are primarily described, without repeating all of the optionally similar or common characteristics and details relative to the operation 540. However, it is to be appreciated that the previously described characteristics and details of the operation 540, including its variations and alternative embodiments, may also optionally apply to the operation 640, unless stated otherwise or otherwise clearly apparent.

[0043] In the embodiment of Figure 6, as before, the instruction may specify or otherwise indicate a first packed source data operand 622 and a second packed source data operand 624. The first and second packed source data operands may be input to an execution unit 618. The execution unit, in response to the instruction (e.g., as decoded into one or more instructions or control signals 616), may generate and store a first result mask operand 628 and a second result mask operand 630.

[0044] One difference of the embodiment of Figure 6 relative to the embodiment of Figure 5 is that the first and second result mask operands are stored in a single mask register 632, rather than each in a different mask register (e.g., the first mask register 532-1 and the second mask register 532-2). Specifically, the first result mask operand 628 is stored in the least significant 16 bits of the single mask register, while the second result mask operand 630 is stored in the next adjacent 16 bits of the single mask register. Alternatively, the positions of the first and second mask operands may optionally be swapped. In this case, the least significant portion of the single mask register (e.g., the least significant 16 bits) corresponds to one of the packed source data operands (here, the first packed source data operand), while a more significant portion of the single mask register (e.g., the next more significant 16 bits) corresponds to the other packed source data operand (here, the second packed source data operand). In the illustration, the mask register is shown with only 32 bits, although in other embodiments it may have fewer or more bits, such as 64 bits.

[0045] In some embodiments, the least significant, first result mask operand may be directly suitable for use as a source packed data operation mask operand for a masked packed data instruction, such as a masked or predicated data element merge instruction (e.g., a VPCOMPRESS instruction), although the scope of the invention is not limited to such use. Furthermore, a simple shift can be used to move bits [16:31] of the mask register to bits [0:15], which makes the second result mask operand likewise directly suitable for use as a source packed data operation mask operand for such a masked packed data instruction, although the scope of the invention is not limited to such use.
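The single-register layout of Figure 6 and the shift that recovers the second mask can be sketched as follows. The helper names `pack_masks`, `first_mask` and `second_mask` are assumptions for illustration, and the register is modelled as a plain Python integer.

```python
N = 16  # mask bits per source operand in the illustrated embodiment


def pack_masks(mask1: int, mask2: int, n: int = N) -> int:
    """Place mask1 in bits [0:n-1] and mask2 in bits [n:2n-1] of one register."""
    low = mask1 & ((1 << n) - 1)
    high = mask2 & ((1 << n) - 1)
    return low | (high << n)


def first_mask(reg: int, n: int = N) -> int:
    """The least significant n bits are usable as the first mask directly."""
    return reg & ((1 << n) - 1)


def second_mask(reg: int, n: int = N) -> int:
    """A right shift by n moves bits [n:2n-1] down to bits [0:n-1]."""
    return (reg >> n) & ((1 << n) - 1)


REG = pack_masks(0b1000000110, 0b0100001001)
```

In this model `first_mask(REG)` returns the first mask unchanged and `second_mask(REG)` returns the second mask after the 16-bit shift.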

[0046] Figure 7 is a block diagram illustrating a third example embodiment of a data element comparison operation 740 that may be performed in response to a third example embodiment of a data element comparison instruction. The operation 740 has certain similarities to the operation 540 of Figure 5. To avoid obscuring the description, the different and/or additional characteristics of the operation 740 are primarily described, without repeating all of the optionally similar or common characteristics and details relative to the operation 540. However, it is to be appreciated that the previously described characteristics and details of the operation 540, including its variations and alternative embodiments, may also optionally apply to the operation 740, unless stated otherwise or otherwise clearly apparent.

[0047] In the embodiment of Figure 7, as before, the instruction may specify or otherwise indicate a first packed source data operand 722 and a second packed source data operand 724. The first and second packed source data operands may be input to an execution unit 718. The execution unit, in response to the instruction (e.g., as decoded into one or more instructions or control signals 716), may generate and store a result.

[0048] One difference of the embodiment of Figure 7 relative to the embodiment of Figure 5 is that the execution unit 718 may generate and store only a single result mask operand 728. In some embodiments, the single result mask operand may be stored in a mask register (e.g., a packed data operation mask register). In some embodiments, the single result mask operand may correspond to one of the first and second packed source data operands (e.g., the first packed source data operand in the illustrated example). In some embodiments, the result mask operand 728 and/or the mask register 732 may be directly suitable for use as a source packed data operation mask operand for a masked packed data instruction, such as a masked or predicated data element merge instruction (e.g., a VPCOMPRESS instruction), although the scope of the invention is not limited to such use. Another instance of the instruction (with the same opcode) can be executed to generate the result mask operand for the other packed source data operand.

[0049] Figure 8 is a block diagram illustrating a fourth example embodiment of a data element comparison operation 840 that may be performed in response to a fourth example embodiment of a data element comparison instruction. The operation 840 has certain similarities to the operation 540 of Figure 5. To avoid obscuring the description, the different and/or additional characteristics of the operation 840 are primarily described, without repeating all of the optionally similar or common characteristics and details relative to the operation 540. However, it is to be appreciated that the previously described characteristics and details of the operation 540, including its variations and alternative embodiments, may also optionally apply to the operation 840, unless stated otherwise or otherwise clearly apparent.

[0050] In the embodiment of Figure 8, as before, the instruction may specify or otherwise indicate a first packed source data operand 822 and a second packed source data operand 824. The first and second packed source data operands may be input to an execution unit 818. The execution unit, in response to the instruction (e.g., as decoded into one or more instructions or control signals 816), may generate and store a first result mask operand 828 and a second result mask operand 830.

[0051] One difference of the embodiment of Figure 8 relative to the embodiment of Figure 5 is that the execution unit 818 may generate the first and second result mask operands 828, 830 and store them in a packed result data operand 820. For example, the packed result data operand may be stored in a packed data register, a memory location, or another storage location. In one embodiment, the packed result data operand or register may be a 512-bit operand or register, although the scope of the invention is not so limited. Another difference is that the mask bits of the first and second result mask operands may be located among other, non-mask bits. As shown, two bits in each result data element of the packed result data operand may be used as the mask bits. One of these two bits in each data element may be used for the first result mask operand, while the other may be used for the second result mask operand. For example, the two least significant bits of each data element may optionally be used, the two most significant bits of each data element may optionally be used, the least significant bit and the most significant bit may optionally be used, or any other set of bits may optionally be used. In the illustrated embodiment, the two least significant bits are used, with the less significant of the two used for the first mask operand and the more significant of the two used for the second mask operand, although this is not required.

[0052] The following pseudocode represents an example embodiment of a data element comparison instruction named VXBARCMPU:

    VXBARCMPU{Q|DQ} VDEST, SRC1, SRC2
    // The instruction generates 2 masks for the n indices in each of SRC1 and SRC2
    // VDEST, SRC1 and SRC2 are each a packed data register
    VDEST = 0;                 // initialize; VDEST holds the final 2-bit masks
    for i ← 1 to n             // n = 16 (Q) or 8 (DQ)
      for j ← 1 to n           // n = 16 (Q) or 8 (DQ)
        bool match = (SRC1.element[i] == SRC2.element[j]) ? 1 : 0    // n^2 comparisons
        VDEST.element[i].bit[0] = VDEST.element[i].bit[0] | match;   // bit0
        VDEST.element[j].bit[1] = VDEST.element[j].bit[1] | match;   // bit1

[0053] In this pseudocode, Q represents a 32-bit quadword, while DQ represents a 64-bit double quadword. The symbol "|" represents a logical OR. The term "match" represents the result of the comparison for equality, for example of integers.
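The VXBARCMPU pseudocode can be emulated in software with a minimal sketch. The Python function `vxbarcmpu` is a hypothetical model and not a real intrinsic: registers are modelled as lists of integers, indices are zero-based, and the element width is not modelled.

```python
def vxbarcmpu(src1, src2):
    """Software model of the VXBARCMPU pseudocode.

    Returns VDEST as a list of integers. In each result element, bit 0 is
    the mask bit for the same-position element of src1 and bit 1 is the
    mask bit for the same-position element of src2.
    """
    if len(src1) != len(src2):
        raise ValueError("the pseudocode assumes n elements in both sources")
    n = len(src1)
    vdest = [0] * n                     # VDEST = 0
    for i in range(n):
        for j in range(n):              # n^2 comparisons in total
            match = 1 if src1[i] == src2[j] else 0
            vdest[i] |= match           # bit 0: src1[i] matched something
            vdest[j] |= match << 1      # bit 1: src2[j] matched something
    return vdest


VDEST = vxbarcmpu([1, 2, 3, 4], [2, 7, 8, 1])
```

In this small four-element example, elements 0 and 1 of the first source and elements 0 and 3 of the second source have matches, so `VDEST` is `[3, 1, 0, 2]`.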

[0054] In the embodiments of Figures 5-8, each of the bits in a result mask operand provides a summary or cumulative indication of whether or not its corresponding source data element matches any of the source data elements in the other, non-corresponding operand. In addition, in the embodiments of Figures 5-8, each result mask operand has the same number of mask bits as the number of data elements in its corresponding source operand. As such, these mask bits are in a format well suited for use as a mask operand of a masked packed data instruction, such as a masked or predicated data element merge instruction (e.g., a masked VPCOMPRESS instruction).

[0055] An alternative possible approach would be to store one bit per comparison, so that the number of stored bits equals the number of comparisons performed. Each of these bits alone would not provide a summary or cumulative indication of whether or not its corresponding source data element matches any of the source data elements in the other, non-corresponding operand. Instead, each per-comparison bit would correspond to a single comparison performed between a different combination of a data element of the first packed source data operand and a data element of the second packed source data operand. In the case of two packed source data operands each having N data elements, N*N comparisons can be performed and N*N result mask bits can be stored using this alternative approach. For example, in the case of two sixteen-element operands, two hundred and fifty-six comparisons can be performed and one 256-bit result mask can be stored, instead of just two 16-bit result masks.

[0056] A potential disadvantage of such an alternative approach, however, is that the result mask operand tends to be in a less useful and/or less efficient format for certain types of subsequent operations. For example, no single per-comparison bit indicates, without further processing, whether or not a data element in one source has a matching data element in the other source. As such, these per-comparison result mask bits may not be well suited, without further processing, for use as a mask operand of a masked packed data instruction, such as a masked or predicated data element merge instruction (e.g., a masked VPCOMPRESS instruction). In addition, the extra bits provided for all comparison results may tend to consume more interconnect bandwidth, register space, power, and so on.
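The difference between the per-comparison format and the summary masks can be made concrete with a minimal sketch. The names `comparison_matrix` and `summarize` are assumptions for illustration. The sketch shows the further processing (an OR reduction over rows and over columns) needed to turn N*N per-comparison bits into two usable masks.

```python
def comparison_matrix(src1, src2):
    """Alternative format: one boolean per comparison (N*M entries)."""
    return [[a == b for b in src2] for a in src1]


def summarize(matrix):
    """Reduce per-comparison bits to the two summary masks.

    mask1 bit i is the OR over row i; mask2 bit j is the OR over column j.
    """
    mask1 = 0
    for i, row in enumerate(matrix):
        if any(row):
            mask1 |= 1 << i
    mask2 = 0
    columns = len(matrix[0]) if matrix else 0
    for j in range(columns):
        if any(row[j] for row in matrix):
            mask2 |= 1 << j
    return mask1, mask2


MATRIX = comparison_matrix([1, 2, 3, 4], [2, 7, 8, 1])
SUMMARY = summarize(MATRIX)
```

For two four-element operands the matrix holds sixteen entries, while the summary is just two 4-bit masks, here `0b0011` and `0b1001`.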

[0057] In contrast, each of the first and second result mask operands 528, 530 and/or each of the first and second mask registers 532-1, 532-2 may be directly usable as a source mask by a masked packed data instruction (e.g., a masked VPCOMPRESS instruction). Likewise, the first result mask operand 628 may be directly usable as a source mask by a masked packed data instruction (e.g., a masked VPCOMPRESS instruction), while the second result mask operand 630 can easily be made directly usable (e.g., by a simple 16-bit shift). Likewise, the result mask operand 728 and/or the mask register 732 may be directly usable as a source mask by a masked packed data instruction (e.g., a masked VPCOMPRESS instruction).

[0058] In any of the embodiments of Figures 3-8, certain comparisons may optionally be avoided if it is fixed for the instruction (e.g., fixed or implicit to an opcode of the instruction), or can otherwise be assured, that the data elements of the source operands are each in ascending order (as may be the case, for example, when working with the indices of data in CSR format, or with certain other types of data). For example, comparisons can be avoided when it can readily be determined that none of the elements in the packed source data operands would match. By way of example, logic may be included to test whether the least significant data element in the first packed source data operand is greater than the most significant data element in the second packed source data operand, or whether the most significant data element in the first packed source data operand is less than the least significant data element in the second packed source data operand, and, if either condition is true, to avoid comparing each data element of one source with each data element of the other source. This may help reduce power consumption in some cases, but is optional and not required.
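The range test described above can be sketched as follows, assuming both operands are non-empty and already sorted in ascending order. The names `ranges_disjoint` and `compare_with_early_out` are hypothetical, and element 0 is taken to be the least significant data element.

```python
def ranges_disjoint(src1, src2):
    """True when no element of src1 can equal any element of src2.

    Assumes both lists are non-empty and in ascending order, so the first
    entry is the smallest and the last entry is the largest.
    """
    return src1[0] > src2[-1] or src1[-1] < src2[0]


def compare_with_early_out(src1, src2):
    """Return (mask1, mask2), skipping all comparisons when ranges are disjoint."""
    if ranges_disjoint(src1, src2):
        return 0, 0
    mask1 = mask2 = 0
    for i, a in enumerate(src1):
        for j, b in enumerate(src2):
            if a == b:
                mask1 |= 1 << i
                mask2 |= 1 << j
    return mask1, mask2
```

When the test succeeds, both masks are known to be zero without performing any of the N * M element comparisons.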

[0059] Figure 9 is a block diagram of an example of a masked data element merge operation 996 that may be performed in response to a masked data element merge instruction. One example of such an instruction suitable for embodiments is the VPCOMPRESS instruction in x86, although use of this instruction is not required.

[0060] The masked data element merge instruction may specify a packed source data operand 997. In some embodiments, the packed source data operand may store data values corresponding to indices of a CSR format. For example, the packed source data operand may store the data values that correspond to the indices of one of the first packed source data operands 522, 622, 722 or 822. Referring again to the sparse matrix of Figure 1, the data value a3 corresponds to the index 3 of column 3, the data value a7 corresponds to the index 7 of column 7, and so on.

[0061] The masked data element merge instruction may also specify a source mask operand 928. In various embodiments, the source mask operand may be the first result mask operand 528, the first result mask operand 628, or the result mask operand 728. Alternatively, the packed result mask operand 820 may be minimally processed to create the source mask operand 928.

[0062] The packed source data operand 997 and the source mask operand 928 may be provided to an execution unit 918. The execution unit may be operable, in response to the instruction and/or operation, to store the packed result data operand 998. In some embodiments, the instruction/operation may cause the execution unit to store the active data elements of the packed source data operand 997, namely those whose mask bits at the same relative positions in the source mask operand 928 are set to binary one, contiguously in the least significant data element positions of the packed result data operand. Any remaining data elements of the packed result data operand may be cleared to zero. As shown, the three values a3, a7 and a23 of the packed source data operand, which are the only three active values with corresponding mask bits set, can be merged into the three least significant data element positions of the packed result data operand, with all higher-order result data elements set to zero. In this case, the VPCOMPRESS instruction uses zeroing masking, which sets the masked result data elements to zero.
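A minimal software model of this merge-with-zeroing behaviour is sketched below. It assumes the operands are Python lists with the least significant element at index 0; the function name `compress_zeroing` and the sample values are hypothetical and are not taken from figure 9.

```python
def compress_zeroing(source, mask):
    """Pack the active source elements into the low positions, zero the rest."""
    if len(source) != len(mask):
        raise ValueError("source and mask must have the same number of elements")
    # Keep only the elements whose mask bit is set, preserving their order.
    active = [value for value, bit in zip(source, mask) if bit]
    # Remaining (higher-order) result positions are cleared to zero.
    return active + [0] * (len(source) - len(active))


packed = compress_zeroing([3, 5, 7, 9, 11, 23, 29, 31], [1, 0, 1, 0, 0, 1, 0, 0])
```

In this sketch the three active elements end up contiguous in the three lowest positions, which is the alignment needed before a vertical SIMD operation on two such results.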

[0063] Further instances of a masked data element merge instruction may be executed similarly to merge the matching values b3, b7 and b23 into the three least significant data element positions of another packed result data operand. For example, the second result mask operand 530 may be used along with the corresponding values from the CSR representation of row 2 of the sparse matrix 100. With this approach, the matching data values of data represented in a CSR format can be isolated, merged and placed in vertical SIMD alignment at the same relative positions in the packed data operands. Such operations can be repeated until the ends of the vectors or rows of the sparse matrix are reached. This can help to enable efficient vertical SIMD processing of the matched data values and, in one aspect, can be used to improve the performance of sparse vector arithmetic operations.

[0064] Figure 10 is a block diagram of an exemplary embodiment of a suitable set of packed data operation mask registers 1032. In the illustrated embodiment, the set includes eight registers, labeled k0 through k7. Alternate embodiments may include either fewer than eight registers (e.g., two, four, six, etc.) or more than eight registers (e.g., sixteen, thirty-two, etc.). Each of these registers can be used to store a packed data operation mask. In the illustrated embodiment, each of the registers is 64 bits wide. In alternative embodiments, the registers can be either wider than 64 bits (e.g., 80 bits, 128 bits, etc.) or narrower than 64 bits (e.g., 8 bits, 16 bits, 32 bits, etc.). The registers can be implemented in a variety of ways and are not limited to any particular circuit or design type. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations thereof.

[0065] In some embodiments, the packed data operation mask registers 1032 may be a separate, dedicated set of architectural registers. In some embodiments, the instructions may encode or specify the packed data operation mask registers in different bits, or in one or more different fields of an instruction format, than those used to encode or specify other types of registers (e.g., packed data registers). By way of example, an instruction may use three bits (e.g., a 3-bit field) to encode or specify any one of the eight packed data operation mask registers k0 through k7. In alternate embodiments, either fewer or more bits may be used when there are fewer or more packed data operation mask registers, respectively. In one particular implementation, only the packed data operation mask registers k1 through k7 (but not k0) may be addressed as a predicate operand to predicate a masked packed data operation. The register k0 can be used as a regular source or destination, but it cannot be encoded as a predicate operand (e.g., if k0 is specified, it has a "no mask" encoding), although this is not required.
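The 3-bit selection and the special role of k0 can be illustrated with the following sketch. It assumes, as an interpretation that the text does not spell out, that the "no mask" encoding behaves as if every element were active; the function name `select_predicate_mask` and the list-based register file are hypothetical.

```python
def select_predicate_mask(field, mask_registers, element_count):
    """Resolve a 3-bit predicate field to the mask bits that apply."""
    if not 0 <= field <= 7:
        raise ValueError("a 3-bit field can only encode k0 through k7")
    all_active = (1 << element_count) - 1
    if field == 0:
        # Assumption: encoding k0 as a predicate means "no mask",
        # modelled here as all element positions being active.
        return all_active
    # k1..k7: use the stored mask, limited to the elements of the operation.
    return mask_registers[field] & all_active


registers = [0, 0b1010, 0, 0, 0, 0, 0, 0]  # hypothetical contents of k0..k7
```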

[0066] Figure 11 is a block diagram of an exemplary embodiment of a suitable set of packed data registers 1120. The packed data registers include thirty-two 512-bit packed data registers labeled ZMM0 through ZMM31. In the illustrated embodiment, the lower-order 256 bits of the lower sixteen registers, namely ZMM0-ZMM15, are aliased or overlaid on the respective 256-bit packed data registers labeled YMM0-YMM15, although this is not required. Likewise, in the illustrated embodiment, the lower-order 128 bits of the registers YMM0-YMM15 are aliased or overlaid on the respective 128-bit packed data registers labeled XMM0-XMM15, although this also is not required. The 512-bit registers ZMM0 through ZMM31 are operable to hold 512-bit packed data, 256-bit packed data, or 128-bit packed data. The 256-bit registers YMM0-YMM15 are operable to hold 256-bit packed data or 128-bit packed data. The 128-bit registers XMM0-XMM15 are operable to hold 128-bit packed data. In some embodiments, each of the registers may be used to store either packed floating point data or packed integer data. Different data element sizes are supported, including at least 8-bit byte data, 16-bit word data, 32-bit doubleword data, 32-bit single-precision floating point data, 64-bit quadword data, and 64-bit double-precision floating point data. In alternate embodiments, different numbers of registers and/or different sizes of registers may be used. In still other embodiments, the registers may or may not use aliasing of larger registers on smaller registers, and/or may or may not be used to store floating point data.
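The aliasing of the narrower registers onto the low-order bits of the wider ones can be pictured with plain integers. The sketch below is only a model under the assumption that a register is represented as a Python integer; the helper `low_bits` and the sample value are hypothetical.

```python
YMM_BITS = 256
XMM_BITS = 128


def low_bits(value, width):
    """Return the low-order `width` bits of a register value."""
    return value & ((1 << width) - 1)


# Hypothetical 512-bit register content with data in three different regions.
zmm0 = (0xAAAA << 300) | (0xBBBB << 200) | 0xCCCC

ymm0 = low_bits(zmm0, YMM_BITS)  # view of the low-order 256 bits of ZMM0
xmm0 = low_bits(ymm0, XMM_BITS)  # view of the low-order 128 bits of YMM0
```

Here the field placed at bit 300 is visible only through the 512-bit view, the field at bit 200 is visible through the 512-bit and 256-bit views, and the lowest field is visible through all three.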

[0067] An instruction set includes one or more instruction formats. A given instruction format defines various fields (number of bits, locations of bits) to specify, among other things, the operation to be performed (the opcode) and the operand(s) on which the operation is to be performed. Some instruction formats are further broken down by the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because fewer fields are included), and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. A set of SIMD extensions referred to as the Advanced Vector Extensions (AVX) (AVX1 and AVX2), which use the Vector Extensions (VEX) coding scheme, has been released and/or published (e.g., see Intel® 64 and IA-32 Architectures Software Developer's Manual, October 2011; and see Intel® Advanced Vector Extensions Programming Reference, June 2011).

Example Instruction Formats

[0068] Embodiments of the instruction(s) described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.

The VEX instruction format

[0069] VEX encoding allows instructions to have more than two operands and allows SIMD vector registers to be longer than 128 bits. The use of a VEX prefix provides a three-operand (or more) syntax. For example, previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand. The use of a VEX prefix enables operands to perform non-destructive operations such as A = B + C.

[0070] Figure 12A illustrates an example AVX instruction format that includes a VEX prefix 1202, a real opcode field 1230, a Mod R/M byte 1240, a SIB byte 1250, a displacement field 1262 and an IMM8 1272. Figure 12B illustrates which fields from figure 12A make up a full opcode field 1274 and a base operation field 1242. Figure 12C illustrates which fields from figure 12A make up a register index field 1244.

[0071] The VEX prefix (bytes 0-2) 1202 is encoded in a three-byte form. The first byte is the format field 1240 (VEX byte 0, bits [7:0]), which contains an explicit C4 byte value (the unique value used to distinguish the C4 instruction format). The second and third bytes (VEX bytes 1-2) include a number of bit fields that provide specific capabilities. Specifically, the REX field 1205 (VEX byte 1, bits [7-5]) consists of a VEX.R bit field (VEX byte 1, bit [7] - R), a VEX.X bit field (VEX byte 1, bit [6] - X) and a VEX.B bit field (VEX byte 1, bit [5] - B). Other fields of the instructions encode the lower three bits of the register indexes, as is known in the art (rrr, xxx and bbb), so that Rrrr, Xxxx and Bbbb may be formed by adding VEX.R, VEX.X and VEX.B. The opcode map field 1215 (VEX byte 1, bits [4:0] - mmmmm) includes content to encode an implied leading opcode byte. The W field 1264 (VEX byte 2, bit [7] - W) is represented by the notation VEX.W and provides different functions depending on the instruction. The role of VEX.vvvv 1220 (VEX byte 2, bits [6:3] - vvvv) may include the following: 1) VEX.vvvv encodes the first source register operand, specified in inverted (1's complement) form, and is valid for instructions with two or more source operands; 2) VEX.vvvv encodes the destination register operand, specified in 1's complement form, for certain vector shifts; or 3) VEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. If the VEX.L size field 1268 (VEX byte 2, bit [2] - L) = 0, it indicates a 128-bit vector; if VEX.L = 1, it indicates a 256-bit vector. The prefix encoding field 1225 (VEX byte 2, bits [1:0] - pp) provides additional bits for the base operation field.
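The bit layout described in this paragraph can be unpacked as in the sketch below. It follows only the positions stated above; the function name `decode_vex3` and the dictionary keys are hypothetical, the R, X and B bits are returned exactly as stored (the paragraph does not say whether they are stored inverted), and only vvvv is un-inverted because the text states that it is held in 1's complement form.

```python
def decode_vex3(prefix):
    """Split a three-byte C4 VEX prefix into the bit fields named in the text."""
    if len(prefix) != 3 or prefix[0] != 0xC4:
        raise ValueError("expected a three-byte prefix starting with C4")
    byte1, byte2 = prefix[1], prefix[2]
    vvvv_raw = (byte2 >> 3) & 0xF
    length_bit = (byte2 >> 2) & 1
    return {
        "R": (byte1 >> 7) & 1,               # VEX byte 1, bit [7], as stored
        "X": (byte1 >> 6) & 1,               # VEX byte 1, bit [6], as stored
        "B": (byte1 >> 5) & 1,               # VEX byte 1, bit [5], as stored
        "mmmmm": byte1 & 0x1F,               # VEX byte 1, bits [4:0]
        "W": (byte2 >> 7) & 1,               # VEX byte 2, bit [7]
        "vvvv_raw": vvvv_raw,                # VEX byte 2, bits [6:3], as stored
        "vvvv_register": (~vvvv_raw) & 0xF,  # undo the 1's complement
        "L": length_bit,                     # VEX byte 2, bit [2]
        "vector_bits": 256 if length_bit else 128,
        "pp": byte2 & 0x3,                   # VEX byte 2, bits [1:0]
    }


fields = decode_vex3(bytes([0xC4, 0xE1, 0xF5]))
```

For the hypothetical prefix `C4 E1 F5`, the stored vvvv value is 1110b, which un-inverts to register index 1, and the L bit selects a 256-bit vector.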

[0072] The real opcode field 1230 (the byte 3 ) is also known as the opcode byte. Part of the opcode is specified in this field.

[0073] The MOD R/M field 1240 (byte 4) includes a MOD field 1242 (bits [7-6]), a Reg field 1244 (bits [5-3]) and an R/M field 1246 (bits [2-0]). The role of the Reg field 1244 may include the following: encoding either the destination register operand or a source register operand (the rrr of Rrrr), or being treated as an opcode extension and not used to encode any instruction operand. The role of the R/M field 1246 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
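As a rough illustration of the three sub-fields, the sketch below splits a MOD R/M byte and then forms a 4-bit register index from an extension bit and the 3-bit rrr value. The helper names are hypothetical, and the extension bit is assumed to be already in its logical (non-inverted) form.

```python
def decode_modrm(byte):
    """Split a MOD R/M byte into its MOD, Reg and R/M sub-fields."""
    return {
        "mod": (byte >> 6) & 0x3,  # bits [7-6]
        "reg": (byte >> 3) & 0x7,  # bits [5-3]
        "rm": byte & 0x7,          # bits [2-0]
    }


def extend_register_index(extension_bit, low_three_bits):
    """Form a 4-bit index such as Rrrr from an extension bit and rrr."""
    return ((extension_bit & 1) << 3) | (low_three_bits & 0x7)


modrm = decode_modrm(0xC8)  # 0xC8 is the bit pattern 11 001 000
register = extend_register_index(1, modrm["reg"])
```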

[0074] Scale, Index, Base (SIB) - the content of the scale field 1250 (byte 5) includes SS 1252 (bits [7-6]), which is used for memory address generation. The contents of SIB.xxx 1254 (bits [5-3]) and SIB.bbb 1256 (bits [2-0]) have been referred to previously with regard to the register indexes Xxxx and Bbbb.

[0075] The displacement field 1262 and the immediate field (IMM8) 1272 contain address data.

The generic vector-friendly instruction format

[0076] A vector-friendly instruction format is an instruction format that is appropriate for vector instructions (e.g., there are certain fields specific to vector operations). While embodiments are described in which both vector and scalar operations are supported by the vector friendly instruction format, in alternative embodiments only vector operations use the vector friendly instruction format.

[0077] Figures 13A-13B are block diagrams illustrating a generic vector-friendly instruction format and its instruction templates according to embodiments of the invention. Figure 13A is a block diagram illustrating a generic vector-friendly instruction format and its class A instruction templates according to embodiments of the invention, while figure 13B is a block diagram illustrating the generic vector-friendly instruction format and its class B instruction templates according to embodiments of the invention. Specifically, class A and class B instruction templates are defined for a generic vector-friendly instruction format 1300, and both classes include no memory access 1305 instruction templates and memory access 1320 instruction templates. The term generic in the context of the vector-friendly instruction format refers to the instruction format not being tied to any specific instruction set.

[0078] While embodiments of the invention will be described in which the vector-friendly instruction format supports the following: a 64-byte vector operand length (or size) with 32-bit (4-byte) or 64-bit (8-byte) data element widths (or sizes) (and thus a 64-byte vector consists of either 16 doubleword-size elements or 8 quadword-size elements); a 64-byte vector operand length (or size) with 16-bit (2-byte) or 8-bit (1-byte) data element widths (or sizes); a 32-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte) or 8-bit (1-byte) data element widths (or sizes); and a 16-byte vector operand length (or size) with 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte) or 8-bit (1-byte) data element widths (or sizes); alternative embodiments may support more, fewer, and/or different vector operand sizes (e.g., 256-byte vector operands) with more, fewer, or different data element widths (e.g., 128-bit (16-byte) data element widths).
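The element counts that follow from these combinations are simple divisions, as the short sketch below shows. The function name `elements_per_vector` is hypothetical; the sizes used in the example are the ones listed in the paragraph.

```python
def elements_per_vector(vector_bytes, element_bits):
    """Number of data elements of a given width that fit in a vector operand."""
    element_bytes = element_bits // 8
    if element_bits % 8 or vector_bytes % element_bytes:
        raise ValueError("element width must divide the vector length evenly")
    return vector_bytes // element_bytes


# Element counts for every listed vector length and element width.
counts = {
    (length, width): elements_per_vector(length, width)
    for length in (64, 32, 16)
    for width in (8, 16, 32, 64)
}
```

For a 64-byte operand this reproduces the 16 doubleword-size and 8 quadword-size elements mentioned above.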

[0079] The class A instruction templates in figure 13A include the following: 1) within the no memory access 1305 instruction templates, a no memory access, full round control type operation 1310 instruction template and a no memory access, data transformation type operation 1315 instruction template are shown; and 2) within the memory access 1320 instruction templates, a memory access, temporal 1325 instruction template and a memory access, non-temporal 1330 instruction template are shown. The class B instruction templates in figure 13B include the following: 1) within the no memory access 1305 instruction templates, a no memory access, writemask control, partial round control type operation 1312 instruction template and a no memory access, writemask control, VSIZE type operation 1317 instruction template are shown; and 2) within the memory access 1320 instruction templates, a memory access, writemask control 1327 instruction template is shown.

[0080] The generic vector-friendly instruction format 1300 includes the following fields, listed below in the order illustrated in figures 13A-13B.

[0081] The format field 1340 - a specific value (an instruction format identifier value) in this field uniquely identifies the vector-friendly instruction format and consequently the occurrences of instructions in the vector-friendly instruction format in the instruction streams. As such, this field is optional in the sense that it is not required for an instruction set having only the generic vector-friendly instruction format.

[0082] The basic operation field 1342 - its content distinguishes the various basic operations.

[0083] The register index field 1344 - Its content specifies, directly or through address generation, the locations of the source and destination operands, whether they are in registers or in memory. These contain a sufficient number of bits to select N registers from a PxQ register file (e.g. 32x512, 16x128, 32x1024, 64x1024). While in one embodiment N can be up to three sources and one destination register, alternative embodiments can support more or fewer sources and destination registers (e.g., they can support up to two sources, with one of those sources also acting as the destination, they can support up to three sources, with one of those sources also acting as the target, and they can support up to two sources and one target).

[0084] The modifier field 1346 - its content distinguishes occurrences of instructions in the generic vector instruction format that specify memory access from those that do not; that is, between the no memory access 1305 instruction templates and the memory access 1320 instruction templates. Memory access operations read and/or write to the memory hierarchy (in some cases specifying the source and/or destination addresses using values in registers), while non-memory access operations do not (e.g., the source and destinations are registers). While in one embodiment this field also selects between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations.

[0085] The augmentation operation field 1350 - its content distinguishes which one of a variety of different operations is to be performed in addition to the base operation. This field is context specific. In one embodiment of the invention, this field is divided into a class field 1368, an alpha field 1352 and a beta field 1354. The augmentation operation field 1350 allows common groups of operations to be performed in a single instruction rather than 2, 3, or 4 instructions.

[0086] The scale field 1360 - its content allows for scaling of the content of the index field for memory address generation (e.g., for address generation that uses 2^scale * index + base).

[0087] The displacement field 1362A - its content is used as part of memory address generation (e.g., for address generation that uses 2^scale * index + base + displacement).

[0088] The displacement factor field 1362B (note that the juxtaposition of the displacement field 1362A directly over the displacement factor field 1362B indicates that one or the other is used) - its content is used as part of address generation; it specifies a displacement factor that is to be scaled by the size of a memory access (N), where N is the number of bytes in the memory access (e.g., for address generation that uses 2^scale * index + base + scaled displacement). Redundant low-order bits are ignored, and hence the content of the displacement factor field is multiplied by the total size (N) of the memory operand in order to generate the final displacement to be used in calculating an effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 1374 (described later herein) and the data manipulation field 1354C. The displacement field 1362A and the displacement factor field 1362B are optional in the sense that they are not used for the no memory access 1305 instruction templates and/or different embodiments may implement only one or neither of the two.
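The address arithmetic named in the scale, displacement and displacement factor fields can be written out as below. This is a sketch of the formulas only; the function names are hypothetical and the operand values are made up for illustration.

```python
def effective_address(base, index, scale, displacement=0):
    """Compute 2^scale * index + base + displacement."""
    return (index << scale) + base + displacement


def scaled_displacement(displacement_factor, access_bytes):
    """Scale a displacement factor by the memory access size N (in bytes)."""
    return displacement_factor * access_bytes


# Hypothetical values: base 0x1000, index 4, scale 3 (a multiplier of 8).
plain = effective_address(0x1000, 4, 3)
# A displacement factor of 2 with a 64-byte access gives a 128-byte offset.
with_factor = effective_address(0x1000, 4, 3, scaled_displacement(2, 64))
```

Encoding the factor rather than the full displacement lets a small field reach offsets that are multiples of the access size, which is why the redundant low-order bits can be dropped.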

[0089] The data element width field 1364 - its content distinguishes which one of a number of data element widths is to be used (in some embodiments for all instructions; in other embodiments for only some of the instructions). This field is optional in the sense that it is not needed if only one data element width is supported and/or data element widths are supported using some aspect of the opcodes.

[0090] The writemask field 1370 - its content controls, on a per data element position basis, whether that data element position in the destination vector operand reflects the result of the base operation and the augmentation operation. Class A instruction templates support merging writemasking, while class B instruction templates support both merging and zeroing writemasking. When merging, vector masks allow any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in another embodiment, the old value of each element of the destination whose corresponding mask bit is 0 is preserved. In contrast, when zeroing, vector masks allow any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one embodiment, an element of the destination is set to 0 when the corresponding mask bit has a 0 value. A subset of this functionality is the ability to control the vector length of the operation being performed (that is, the span of elements being modified, from the first to the last one); however, it is not necessary that the elements being modified be consecutive. Thus, the writemask field 1370 allows for partial vector operations, including loads, stores, arithmetic, logical operations, etc. While embodiments of the invention are described in which the content of the writemask field 1370 selects one of a number of writemask registers that contains the writemask to be used (and thus the content of the writemask field 1370 indirectly identifies the masking to be performed), alternative embodiments instead or additionally allow the content of the writemask field 1370 to directly specify the masking to be performed.
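The difference between merging and zeroing writemasking can be seen in a few lines. The sketch below models the destination, the operation result and the mask as Python lists of equal length; the function name `apply_writemask` and the sample values are hypothetical.

```python
def apply_writemask(old_destination, result, mask, zeroing=False):
    """Combine an operation result with a destination under a writemask."""
    output = []
    for old, new, bit in zip(old_destination, result, mask):
        if bit:
            output.append(new)   # mask bit 1: position takes the new result
        elif zeroing:
            output.append(0)     # zeroing: masked-off position is cleared
        else:
            output.append(old)   # merging: masked-off position keeps old value
    return output


merged = apply_writemask([9, 9, 9, 9], [1, 2, 3, 4], [1, 0, 1, 0])
zeroed = apply_writemask([9, 9, 9, 9], [1, 2, 3, 4], [1, 0, 1, 0], zeroing=True)
```

Note that the active positions in this example are not consecutive, which matches the statement that the modified elements need not be contiguous.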

[0091] The immediate field 1372 - its content allows for the specification of an immediate value. This field is optional in the sense that it is not present in an implementation of the generic vector-friendly format that does not support an immediate value, and it is not present in instructions that do not use an immediate value.

[0092] The class field 1368 - its content distinguishes between different classes of instructions. With reference to figures 13A-B, the content of this field selects between class A and class B instructions. In figures 13A-B, rounded-corner squares are used to indicate that a specific value is present in a field (e.g., class A 1368A and class B 1368B for the class field 1368 in figures 13A-B, respectively).

The Class A instruction templates

[0093] In the case of the no memory access 1305 instruction templates of class A, the alpha field 1352 is interpreted as an RS field 1352A, whose content distinguishes which one of the different augmentation operation types is to be performed (e.g., round 1352A.1 and data transformation 1352A.2 are specified for the no memory access, round type operation 1310 and the no memory access, data transformation type operation 1315 instruction templates, respectively), while the beta field 1354 distinguishes which of the operations of the specified type is to be performed. In the no memory access 1305 instruction templates, the scale field 1360, the displacement field 1362A and the displacement scale field 1362B are not present.

The instruction templates without memory access - the operation of the full round control type

[0094] In the no memory access, full round control type operation 1310 instruction template, the beta field 1354 is interpreted as a round control field 1354A, whose content(s) provide static rounding. While in the described embodiments of the invention the round control field 1354A includes a suppress all floating point exceptions (SAE) field 1356 and a round operation control field 1358, alternative embodiments may encode both of these concepts into the same field, or may have only one or the other of these concepts/fields (e.g., may have only the round operation control field 1358).

[0095] The SAE field 1356 - its content distinguishes whether or not exception event reporting is to be disabled; when the content of the SAE field 1356 indicates that suppression is enabled, a given instruction does not report any kind of floating point exception flag and does not raise any floating point exception handler.

[0096] The round operation control field 1358 - its content distinguishes which one of a group of rounding operations is to be performed (e.g., round up, round down, round toward zero and round to nearest). Thus, the round operation control field 1358 allows the rounding mode to be changed on a per-instruction basis. In one embodiment of the invention in which a processor includes a control register for specifying rounding modes, the content of the round operation control field 1350 overrides that register value.

The instruction templates without memory access - the operation of the data transformation type

[0097] In the no memory access, data transformation type operation 1315 instruction template, the beta field 1354 is interpreted as a data transformation field 1354B, whose content distinguishes which one of a number of data transformations is to be performed (e.g., no data transformation, swizzle, broadcast).

[0098] In the case of a memory access 1320 instruction template of class A, the alpha field 1352 is interpreted as an eviction hint field 1352B, whose content distinguishes which one of the eviction hints is to be used (in figure 13A, temporal 1352B.1 and non-temporal 1352B.2 are specified for the memory access, temporal 1325 instruction template and the memory access, non-temporal 1330 instruction template, respectively), while the beta field 1354 is interpreted as a data manipulation field 1354C, whose content distinguishes which one of a number of data manipulation operations (also known as primitives) is to be performed (e.g., no manipulation; broadcast; up conversion of a source; and down conversion of a destination). The memory access 1320 instruction templates include the scale field 1360 and optionally the displacement field 1362A or the displacement scale field 1362B.

[0099] Vector memory instructions perform vector loads from and vector stores to memory, with conversion support. As with regular vector instructions, vector memory instructions transfer data from/to memory in a data element-wise fashion, with the elements that are actually transferred dictated by the content of the vector mask that is selected as the writemask.

The instruction templates with memory access - temporal

[0100] Temporal data is data that is likely to be reused soon enough to benefit from caching. This is a hint, however, and different processors may implement it in different ways, including ignoring the hint entirely.

The instruction templates with memory access - non-temporal

[0101] Non-temporal data is data that is unlikely to be reused soon enough to benefit from caching in the first-level cache and should be given priority for eviction. This is a hint, however, and different processors may implement it in different ways, including ignoring the hint entirely.

The class B instruction templates

[0102] In the case of the class B instruction templates, the alpha field 1352 is interpreted as a writemask control (Z) field 1352C, whose content distinguishes whether the writemasking controlled by the writemask field 1370 should be a merging or a zeroing.

[0103] In the case of the no memory access 1305 instruction templates of class B, part of the beta field 1354 is interpreted as an RL field 1357A, whose content distinguishes which one of the different augmentation operation types is to be performed (e.g., round 1357A.1 and vector length (VSIZE) 1357A.2 are specified for the no memory access, writemask control, partial round control type operation 1312 instruction template and the no memory access, writemask control, VSIZE type operation 1317 instruction template, respectively), while the rest of the beta field 1354 distinguishes which of the operations of the specified type is to be performed. In the no memory access 1305 instruction templates, the scale field 1360, the displacement field 1362A and the displacement scale field 1362B are not present.

[0104] In the no memory access, writemask control, partial round control type operation 1310 instruction template, the rest of the beta field 1354 is interpreted as a round operation field 1359A, and exception event reporting is disabled (a given instruction does not report any kind of floating point exception flag and does not raise any floating point exception handler).

[0105] The round operation control field 1359A - just as with the round operation control field 1358, its content distinguishes which one of a group of rounding operations is to be performed (e.g., round up, round down, round toward zero and round to nearest). Thus, the round operation control field 1359A allows the rounding mode to be changed on a per-instruction basis. In one embodiment of the invention in which a processor includes a control register for specifying rounding modes, the content of the round operation control field 1350 overrides that register value.
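The override of a control-register rounding mode by a per-instruction field can be sketched as follows. This is a loose analogy only: it rounds to integers with Python's standard library, whereas the hardware rounds to the precision of the floating point format, and the mode names and the function `round_with_override` are hypothetical.

```python
import math

# Hypothetical mode names for the four rounding operations listed in the text.
ROUNDERS = {
    "up": math.ceil,
    "down": math.floor,
    "toward_zero": math.trunc,
    "nearest": round,
}


def round_with_override(value, control_register_mode, instruction_mode=None):
    """Round with the per-instruction mode if given, else the register mode."""
    if instruction_mode is None:
        mode = control_register_mode
    else:
        mode = instruction_mode  # the instruction field overrides the register
    return ROUNDERS[mode](value)


default_result = round_with_override(2.7, "nearest")
override_result = round_with_override(2.7, "nearest", "down")
```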

[0106] In the no memory access, writemask control, VSIZE type operation 1317 instruction template, the rest of the beta field 1354 is interpreted as a vector length field 1359B, whose content distinguishes which one of a number of data vector lengths is to be operated on (e.g., 128, 256 or 512 bytes).

[0107] In the case of a memory access 1320 instruction template of class B, part of the beta field 1354 is interpreted as a broadcast field 1357B, whose content distinguishes whether or not a broadcast-type data manipulation operation is to be performed, while the rest of the beta field 1354 is interpreted as the vector length field 1359B. The memory access 1320 instruction templates include the scale field 1360 and optionally the displacement field 1362A or the displacement scale field 1362B.

[0108] With regard to the generic vector-friendly instruction format 1300, a full opcode field 1374 is shown that includes the format field 1340, the basic operation field 1342 and the data element width field 1364. While one embodiment is shown in which the full opcode field 1374 includes all of these fields, the full opcode field 1374 includes fewer than all of these fields in embodiments that do not support all of them. The full opcode field 1374 provides the operation code (opcode).

[0109] The augmentation operation field 1350, the data element width field 1364 and the writemask field 1370 allow these features to be specified on a per-instruction basis in the generic vector-friendly instruction format.

[0110] The combination of the writemask field and the data element width field creates typed instructions because they allow the mask to be applied based on different data element widths.

[0111] The various instruction templates found within class A and class B are beneficial in different situations. In some embodiments of the invention, different processors or different cores within a processor may support only class A, only class B, or both classes. For instance, a high-performance general-purpose out-of-order core intended for general-purpose computing may support only class B, a core intended primarily for graphics and/or scientific (throughput) computing may support only class A, and a core intended for both may support both (of course, a core that has some mix of templates and instructions from both classes, but not all templates and instructions from both classes, is within the scope of the invention). Also, a single processor may include multiple cores, all of which support the same class or in which different cores support different classes. For instance, in a processor with separate graphics and general-purpose cores, one of the graphics cores intended primarily for graphics and/or scientific computing may support only class A, while one or more of the general-purpose cores may be high-performance general-purpose cores with out-of-order execution and register renaming, intended for general-purpose computing, that support only class B. Another processor that does not have a separate graphics core may include one or more general-purpose in-order or out-of-order cores that support both class A and class B. Of course, features from one class may also be implemented in the other class in different embodiments of the invention. Programs written in a high-level language would be put (e.g., just-in-time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the class(es) supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all classes and having control flow code that selects the routines to execute based on the instructions supported by the processor that is currently executing the code.

An example specific vector-friendly instruction format

[0112] Figure 14 is a block diagram illustrating an example specific vector-friendly instruction format according to embodiments of the invention. Figure 14 shows a specific vector-friendly instruction format 1400, which is specific in the sense that it specifies the location, size, interpretation, and order of the fields, as well as the values for some of those fields. The specific vector-friendly instruction format 1400 can be used to extend the x86 instruction set, and thus some of the fields are similar to or the same as those used in the existing x86 instruction set and its extensions (e.g., AVX). This format remains consistent with the prefix encoding field, real opcode byte field, MOD R/M field, SIB field, displacement field, and immediate fields of the existing x86 instruction set with extensions. The fields from Figure 13 into which the fields from Figure 14 map are illustrated.

[0113] It should be understood that although embodiments of the invention are described with reference to the specific vector-friendly instruction format 1400 in the context of the generic vector-friendly instruction format 1300 for purposes of illustration, the invention is not limited to the specific vector-friendly instruction format 1400 except where claimed. For example, the generic vector-friendly instruction format 1300 contemplates a variety of possible sizes for the various fields, while the specific vector-friendly instruction format 1400 is shown as having fields of specific sizes. As a specific example, while the data element width field 1364 is illustrated as a one-bit field in the specific vector-friendly instruction format 1400, the invention is not so limited (i.e., the generic vector-friendly instruction format 1300 contemplates other sizes of the data element width field 1364).

[0114] The generic vector-friendly instruction format 1300 contains the following fields, listed below in the order illustrated in Figure 14A. The EVEX prefix 1402 (bytes 0-3) - is encoded in a four-byte form.

[0115] The format field 1340 (EVEX byte 0, bits [7:0]) - the first byte (EVEX byte 0) is the format field 1340, and it contains 0x62 (the unique value used to distinguish the vector-friendly instruction format in one embodiment of the invention).

[0116] The second through fourth bytes (EVEX bytes 1-3) contain a number of bit fields that provide specific capabilities.

[0117] The REX field 1405 (EVEX byte 1, bits [7-5]) - consists of an EVEX.R bit field (EVEX byte 1, bit [7] - R), an EVEX.X bit field (EVEX byte 1, bit [6] - X) and an EVEX.B bit field (EVEX byte 1, bit [5] - B). The EVEX.R, EVEX.X and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields and are encoded using 1's complement form, i.e., ZMM0 is encoded as 1111B and ZMM15 is encoded as 0000B. Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx and bbb), so that Rrrr, Xxxx and Bbbb can be formed by adding EVEX.R, EVEX.X and EVEX.B.

[0118] The REX' field 1310 - this is the first part of the REX' field 1310 and is the EVEX.R' bit field (EVEX byte 1, bit [4] - R') that is used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. In one embodiment of the invention, this bit, along with others as indicated below, is stored in bit-inverted format to distinguish it (in the well-known x86 32-bit mode) from the BOUND instruction, whose real opcode byte is 62 but which does not accept the value 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this bit and the other bits indicated below in the inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R and the other RRR from other fields.

[0119] The opcode mapping field 1415 (EVEX byte 1, bits [3:0] - mmmm) - its content encodes an implied leading opcode byte (0F, 0F 38, or 0F 3A).
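
As a rough illustration of the byte 1 layout described in paragraphs [0117] to [0119], the following Python sketch splits EVEX byte 1 into its bit fields and rebuilds a 5-bit register index from the inverted EVEX.R' and EVEX.R bits plus the three low-order rrr bits. The function names `decode_evex_byte1` and `register_index` are hypothetical and exist only for this example; the sketch assumes the bit-inverted storage of the embodiment described above.

```python
def decode_evex_byte1(byte1: int) -> dict:
    """Split EVEX byte 1 into its raw (still inverted) bit fields."""
    return {
        "R": (byte1 >> 7) & 1,        # bit [7], stored inverted
        "X": (byte1 >> 6) & 1,        # bit [6], stored inverted
        "B": (byte1 >> 5) & 1,        # bit [5], stored inverted
        "R_prime": (byte1 >> 4) & 1,  # bit [4], stored inverted
        "mmmm": byte1 & 0x0F,         # bits [3:0], opcode mapping field
    }


def register_index(r_prime: int, r: int, rrr: int) -> int:
    """Form R'Rrrr: undo the inversion of R' and R, then append rrr."""
    return ((r_prime ^ 1) << 4) | ((r ^ 1) << 3) | (rrr & 0b111)


fields = decode_evex_byte1(0xF1)
print(fields)
print(register_index(fields["R_prime"], fields["R"], 0b000))
```

With both inverted bits set to 1 and rrr equal to 000 the index is 0, which matches the statement that a value of 1 in EVEX.R' selects the lower 16 registers.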

[0120] The data element width field 1364 (EVEX byte 2, bit [7] - W) - is represented by the notation EVEX.W. EVEX.W is used to define the granularity (size) of the data type (either 32-bit data elements or 64-bit data elements).

[0121] EVEX.vvvv 1420 (EVEX byte 2, bits [6:3] - vvvv) - the role of EVEX.vvvv may include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1's complement) form, and is valid for instructions with two or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1's complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. Thus, the EVEX.vvvv field 1420 encodes the four low-order bits of the first source register specifier, stored in inverted (1's complement) form. Depending on the instruction, an additional different EVEX bit field is used to extend the specifier size to 32 registers.

[0122] The EVEX.U 1368 class field (the EVEX byte 2, bit [2] - U) - If EVEX.U = 0, it indicates class A or EVEX.U0; if EVEX.U = 1, it indicates class B or EVEX.U1.

[0123] The prefix encoding field 1425 (EVEX byte 2, bits [1:0] - pp) - provides additional bits for the base operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the benefit of compacting the SIMD prefix (rather than requiring a byte to express the SIMD prefix, the EVEX prefix requires only 2 bits). In one embodiment, to support legacy SSE instructions that use a SIMD prefix (66H, F2H, F3H) in both the legacy format and the EVEX prefix format, these legacy SIMD prefixes are encoded into the SIMD prefix encoding field; at runtime they are expanded into the legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can execute both the legacy and the EVEX format of these legacy instructions without modification). Although newer instructions could use the content of the EVEX prefix encoding field directly as an opcode extension, certain embodiments expand it in a similar fashion for consistency while allowing different meanings to be specified by these legacy SIMD prefixes. An alternative embodiment may redesign the PLA to support the 2-bit SIMD prefix encodings and thus not require the expansion.
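
The byte 2 fields of paragraphs [0120] to [0123] can be sketched in the same way. The Python below is a minimal, non-authoritative illustration: `decode_evex_byte2` is a hypothetical helper, and the table mapping pp values to legacy SIMD prefixes is an assumption taken from the published x86 VEX/EVEX encoding, since the text above names the prefixes (66H, F2H, F3H) but does not give the mapping.

```python
# Assumption: pp-to-prefix mapping as in the published x86 VEX/EVEX encoding;
# the description above does not state which pp value selects which prefix.
LEGACY_SIMD_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_evex_byte2(byte2: int) -> dict:
    """Split EVEX byte 2 into W, vvvv, U and pp."""
    vvvv_raw = (byte2 >> 3) & 0xF  # bits [6:3], stored in 1's complement
    pp = byte2 & 0b11              # bits [1:0], compacted SIMD prefix
    return {
        "W": (byte2 >> 7) & 1,     # bit [7], data element width
        "vvvv": vvvv_raw ^ 0xF,    # undo the inversion: low 4 bits of the specifier
        "U": (byte2 >> 2) & 1,     # bit [2], class field (0 = class A, 1 = class B)
        "pp": pp,
        "legacy_prefix": LEGACY_SIMD_PREFIX[pp],  # expansion step before the PLA
    }


print(decode_evex_byte2(0xFD))
```

The `legacy_prefix` entry mirrors the runtime expansion described above, in which the 2-bit encoding is turned back into a one-byte legacy prefix before decoding.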

[0124] The alpha field 1352 (EVEX byte 3, bit [7] - EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.writemask control, and EVEX.N; also illustrated with α) - as previously described, this field is context specific.

[0125] The beta field 1354 (EVEX byte 3, bits [6:4] - SSS; also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, EVEX.LLB; also illustrated with βββ) - as previously described, this field is context specific.

[0126] The REX' field 1310 - this is the remainder of the REX' field and is the EVEX.V' bit field (EVEX byte 3, bit [3] - V'), which can be used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. This bit is stored in bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.

[0127] The writemask field 1370 (EVEX byte 3, bits [2:0] - kkk) - its content specifies the index of a register in the writemask registers, as previously described. In one embodiment of the invention, the specific value EVEX.kkk = 000 has special behavior implying that no writemask is used for the particular instruction (this can be implemented in a variety of ways, including using a writemask hardwired to all ones or hardware that bypasses the masking hardware).
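
To make the byte 3 layout of paragraphs [0124] to [0127] concrete, the following Python sketch extracts alpha, beta, V' and kkk and forms the 5-bit V'VVVV specifier. All function names are hypothetical; the sketch assumes that both V' and vvvv arrive in their stored, inverted form.

```python
def decode_evex_byte3(byte3: int) -> dict:
    """Split EVEX byte 3 into alpha, beta, V' and kkk (raw values)."""
    return {
        "alpha": (byte3 >> 7) & 1,     # bit [7], context specific
        "beta": (byte3 >> 4) & 0b111,  # bits [6:4], context specific
        "V_prime": (byte3 >> 3) & 1,   # bit [3], stored inverted
        "kkk": byte3 & 0b111,          # bits [2:0], writemask register index
    }


def source_register_index(v_prime_raw: int, vvvv_raw: int) -> int:
    """Form V'VVVV by undoing the inversion of V' and of the 4-bit vvvv."""
    return ((v_prime_raw ^ 1) << 4) | ((vvvv_raw ^ 0xF) & 0xF)


def uses_writemask(kkk: int) -> bool:
    """kkk = 000 implies that no writemask is applied."""
    return kkk != 0


fields = decode_evex_byte3(0xB5)
print(fields, uses_writemask(fields["kkk"]))
print(source_register_index(1, 0b1111))
```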

[0128] The real opcode field 1430 (the byte 4 ) is also known as the opcode byte. Part of the opcode is specified in this field.

[0129] The MOD R/M field 1440 (byte 5) contains the MOD field 1442, the Reg field 1444 and the R/M field 1446. As previously described, the content of the MOD field 1442 distinguishes between memory access and non-memory access operations. The role of the Reg field 1444 can be summarized in two situations: it either encodes the destination register operand or a source register operand, or it is treated as an opcode extension and is not used to encode any instruction operand. The role of the R/M field 1446 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
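
The three sub-fields of the MOD R/M byte can be separated as in the short Python sketch below. `decode_modrm` is a hypothetical helper; the bit positions (MOD in bits [7:6], Reg in bits [5:3], R/M in bits [2:0]) are an assumption based on the standard x86 ModRM layout, because the paragraph above names the fields without giving their positions.

```python
def decode_modrm(modrm: int) -> dict:
    """Split a MOD R/M byte into MOD, Reg and R/M (standard x86 layout assumed)."""
    mod = (modrm >> 6) & 0b11
    return {
        "mod": mod,                    # distinguishes memory from register forms
        "reg": (modrm >> 3) & 0b111,   # register operand or opcode extension
        "rm": modrm & 0b111,           # register operand or memory reference
        "memory_access": mod != 0b11,  # MOD = 11 means no memory access
    }


print(decode_modrm(0xC1))
print(decode_modrm(0x44))
```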

[0130] The scale, index, base (SIB) byte (byte 6) - as previously described, the content of the scale field 1350 is used for memory address generation. SIB.xxx 1454 and SIB.bbb 1456 - the contents of these fields have been referred to previously in connection with the register indexes Xxxx and Bbbb.

[0131] The displacement field 1362A (bytes 7-10) - when the MOD field 1442 contains 10, bytes 7-10 are the displacement field 1362A, which works the same as the legacy 32-bit displacement (disp32) and works at byte granularity.

[0132] The displacement factor field 1362B (byte 7) - when the MOD field 1442 contains 01, byte 7 is the displacement factor field 1362B. The location of this field is the same as that of the 8-bit displacement (disp8) of the legacy x86 instruction set, which works at byte granularity. Because disp8 is sign-extended, it can only address offsets between -128 and 127 bytes; in terms of 64-byte cache lines, disp8 uses 8 bits that can be set to only four really useful values: -128, -64, 0 and 64. Because a greater range is often needed, disp32 is used; however, disp32 requires 4 bytes. In contrast to disp8 and disp32, the displacement factor field 1362B is a reinterpretation of disp8; when the displacement factor field 1362B is used, the actual displacement is determined by the content of the displacement factor field multiplied by the size of the memory operand access (N). This type of displacement is referred to as disp8*N. This reduces the average instruction length (a single byte is used for the displacement, but with a much greater range). Such a compressed displacement is based on the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 1362B substitutes for the 8-bit displacement of the legacy x86 instruction set. Thus, the displacement factor field 1362B is encoded in the same way as an 8-bit displacement of the x86 instruction set (so there are no changes in the ModRM/SIB encoding rules), with the only exception that disp8 is overloaded to disp8*N. In other words, there are no changes in the encoding rules or encoding lengths, only in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size of the memory operand to obtain a byte-wise address offset).
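
The disp8*N scaling described in paragraphs [0131] and [0132] amounts to one multiplication. The Python sketch below is a simplified illustration under stated assumptions: `effective_displacement` is a hypothetical name, the displacement bytes are taken to be little-endian as in x86, and MOD values other than 01 and 10 are treated as carrying no displacement, which ignores special addressing forms.

```python
def effective_displacement(mod: int, disp_bytes: bytes, n: int) -> int:
    """Return the byte offset encoded by the displacement bytes.

    mod        -- content of the MOD field (two bits)
    disp_bytes -- displacement bytes that follow the SIB byte
    n          -- size in bytes of the memory operand access (N)
    """
    if mod == 0b01:
        # Displacement factor field 1362B: sign-extended disp8, scaled by N.
        disp8 = int.from_bytes(disp_bytes[:1], "little", signed=True)
        return disp8 * n
    if mod == 0b10:
        # Displacement field 1362A: plain disp32 at byte granularity.
        return int.from_bytes(disp_bytes[:4], "little", signed=True)
    return 0  # simplification: no displacement for the other MOD values


# With a 64-byte access, one displacement byte spans -8192 to 8128 bytes.
print(effective_displacement(0b01, b"\x80", 64), effective_displacement(0b01, b"\x7f", 64))
```

Under this reading, a single byte that would otherwise reach only -128 to 127 bytes covers 256 distinct multiples of N, which is the range gain the paragraph refers to.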

[0133] The immediate field 1372 operates as previously described.

The full opcode field

[0134] Figure 14B is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that make up the full opcode field 1374 according to an embodiment of the invention. Specifically, the full opcode field 1374 contains the format field 1340, the base operation field 1342 and the data element width (W) field 1364. The base operation field 1342 contains the prefix encoding field 1425, the opcode mapping field 1415 and the real opcode field 1430.

The register index field

[0135] Figure 14C is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that make up the register index field 1344 according to an embodiment of the invention. Specifically, the register index field 1344 contains the REX field 1405, the REX' field 1410, the MODR/M.reg field 1444, the MODR/M.r/m field 1446, the VVVV field 1420, the xxx field 1454 and the bbb field 1456.

The augmentation operation field

[0136] Figure 14D is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that make up the augmentation operation field 1350 according to an embodiment of the invention. When the class field 1368 (U) contains 0, it signifies EVEX.U0 (class A 1368A); when it contains 1, it signifies EVEX.U1 (class B 1368B). When U = 0 and the MOD field 1442 contains 11 (signifying a no-memory-access operation), the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the rs field 1352A. When the rs field 1352A contains a 1 (round 1352A.1), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the round control field 1354A. The round control field 1354A includes a one-bit SAE field 1356 and a two-bit round operation field 1358. When the rs field 1352A contains a 0 (data transform 1352A.2), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data transform field 1354B. When U = 0 and the MOD field 1442 contains 00, 01 or 10 (signifying a memory access operation), the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the eviction hint (EH) field 1352B and the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data manipulation field 1354C.
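
The class A (U = 0) case analysis in paragraph [0136] can be summarized as a small decision function. The Python sketch below is illustrative only: `interpret_class_a` is a hypothetical name, and the returned strings are English labels keyed by the reference numerals used in the text.

```python
def interpret_class_a(mod: int, alpha: int) -> tuple:
    """Return (role of the alpha field, role of the beta field) for U = 0."""
    if mod == 0b11:
        # No-memory-access operation: alpha acts as the rs field 1352A.
        if alpha == 1:
            return ("rs field 1352A = round 1352A.1", "round control field 1354A")
        return ("rs field 1352A = data transform 1352A.2", "data transform field 1354B")
    # MOD = 00, 01 or 10: memory access operation.
    return ("eviction hint field 1352B", "data manipulation field 1354C")


for mod, alpha in [(0b11, 1), (0b11, 0), (0b01, 0)]:
    print(format(mod, "02b"), alpha, interpret_class_a(mod, alpha))
```

The same three bits of the beta field therefore carry rounding control, a data transform or a data manipulation, depending only on the MOD field and the alpha bit.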

[0137] When U = 1, the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the writemask control (Z) field 1352C. When U = 1 and the MOD field 1442 contains 11 (signifying a no-memory-access operation), part of the beta field 1354 (EVEX byte 3, bit [4] - S0) is interpreted as the RL field 1357A; when it contains a 1 (round 1357A.1), the rest of the beta field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the round operation field 1359A, while when the RL field 1357A contains a 0 (VSIZE 1357A.2), the rest of the beta field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0). When U = 1 and the MOD field 1442 contains 00, 01 or 10 (signifying a memory access operation), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0) and the broadcast field 1357B (EVEX byte 3, bit [4] - B).

An example register architecture

[0138] Figure 15 is a block diagram of a register architecture 1500 according to an embodiment of the invention. In the illustrated embodiment, there are 32 vector registers 1510 that are 512 bits wide; these registers are referred to as zmm0 through zmm31. The lower-order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower-order 128 bits of the lower 16 zmm registers (the lower-order 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector-friendly instruction format 1400 operates on this overlaid register file as illustrated in the table below.

Adjustable vector length | Class | Operations | Registers
Instruction templates that do not include the vector length field 1359B | A (Figure 13A; U = 0) | 1310, 1315, 1325, 1330 | zmm registers (the vector length is 64 bytes)
Instruction templates that do not include the vector length field 1359B | B (Figure 13B; U = 1) | 1312 | zmm registers (the vector length is 64 bytes)
Instruction templates that include the vector length field 1359B | B (Figure 13B; U = 1) | 1317, 1327 | zmm, ymm, or xmm registers (the vector length is 64 bytes, 32 bytes, or 16 bytes) depending on the vector length field 1359B
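
The register overlay described in paragraph [0138] can be modeled by storing only the 512-bit zmm values and deriving the narrower views from them. The Python sketch below is a toy model under stated assumptions: the class `VectorRegisterFile` and its `read` method are hypothetical, registers are held as Python integers, and only reads are modeled because the treatment of the upper bits on a narrow write depends on the embodiment.

```python
VIEW_BITS = {"zmm": 512, "ymm": 256, "xmm": 128}


class VectorRegisterFile:
    """Toy model: 32 zmm registers, with ymm and xmm as low-order views."""

    def __init__(self) -> None:
        self.zmm = [0] * 32

    def read(self, name: str) -> int:
        kind, index = name[:3], int(name[3:])
        if kind not in VIEW_BITS or not 0 <= index < 32:
            raise ValueError(f"unknown register {name}")
        if kind != "zmm" and index > 15:
            # Only the lower 16 zmm registers have ymm and xmm overlays.
            raise ValueError(f"{name} is not overlaid on a zmm register")
        return self.zmm[index] & ((1 << VIEW_BITS[kind]) - 1)


rf = VectorRegisterFile()
rf.zmm[3] = (0xAB << 300) | (0xCD << 200) | 0xEF
print(hex(rf.read("xmm3")))
print(rf.read("ymm3") == ((0xCD << 200) | 0xEF))
```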

[0139] In other words, the vector length field 1359B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length; instruction templates without the vector length field 1359B operate on the maximum vector length. Furthermore, in one embodiment, the class B instruction templates of the specific vector-friendly instruction format 1400 operate on packed or scalar single/double-precision floating-point data and on packed or scalar integer data. Scalar operations are operations performed on the lowest-order data element position in a zmm/ymm/xmm register; the higher-order data element positions are either left the same as they were prior to the instruction or zeroed, depending on the embodiment.
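
The halving rule in paragraph [0139] can be written out directly. In the Python sketch below, `available_vector_lengths` and `element_count` are hypothetical helpers; the sketch deliberately does not assign field encodings to lengths, since the text does not state which value of the vector length field selects which length.

```python
def available_vector_lengths(max_bytes: int = 64, count: int = 3) -> list:
    """Maximum length followed by shorter lengths, each half of the previous."""
    return [max_bytes >> i for i in range(count)]


def element_count(vector_bytes: int, element_bits: int) -> int:
    """Number of packed data elements of the given width in one vector."""
    return (vector_bytes * 8) // element_bits


lengths = available_vector_lengths()
print(lengths)
# Element widths of 32 and 64 bits are the two sizes selectable through EVEX.W.
print([(n, element_count(n, 32), element_count(n, 64)) for n in lengths])
```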

[0140] The writemask registers 1515 - in the illustrated embodiment, there are 8 writemask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, the writemask registers 1515 are 16 bits in size. As previously described, in one embodiment of the invention, the vector mask register k0 cannot be used as a writemask; when the encoding that would normally indicate k0 is used for a writemask, it selects a hardwired writemask of 0xFFFF, effectively disabling writemasking for that instruction.
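
The k0 convention in paragraph [0140] can be illustrated with a small selection function. The Python sketch below makes two assumptions that go beyond this paragraph: vectors are modeled as lists of 16 elements so that the hardwired 0xFFFF mask covers every element, and masked-off elements keep their previous destination value (merging-style masking), which is only one possible behavior. `select_writemask` and `apply_writemask` are hypothetical names.

```python
HARDWIRED_MASK = 0xFFFF  # selected when the k0 encoding is used as a writemask


def select_writemask(kkk: int, k_regs: list) -> int:
    """Return the mask in effect for the 3-bit writemask field value kkk."""
    return HARDWIRED_MASK if kkk == 0 else k_regs[kkk]


def apply_writemask(dest: list, result: list, mask: int) -> list:
    """Merging-style masking (assumption): keep dest where the mask bit is 0."""
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]


k_regs = [0] * 8
k_regs[1] = 0b0101
old = [0] * 16
new = list(range(1, 17))
print(apply_writemask(old, new, select_writemask(1, k_regs)))
print(apply_writemask(old, new, select_writemask(0, k_regs)) == new)
```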

[0141] The general-purpose registers 1525 - in the illustrated embodiment, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referred to by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

[0142] The scalar floating-point stack register file (x87 stack) 1545, on which the MMX packed integer flat register file 1550 is aliased - in the illustrated embodiment, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used to perform operations on 64-bit packed integer data as well as to hold operands for some operations performed between the MMX and XMM registers.

[0143] Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.

Exemplary core architectures, processors and computer architectures

[0144] Processor cores can be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; 3) a special-purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general-purpose in-order cores intended for general-purpose computing and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special-purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case such a coprocessor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the coprocessor described above, and additional functionality. Example core architectures are described next, followed by descriptions of example processors and computer architectures.

Exemplary core architectures

A block diagram of an in-order and out-of-order core

[0145] Figure 16A is a block diagram illustrating both an example in-order pipeline and an example out-of-order issue/execution pipeline with register renaming, according to embodiments of the invention. Figure 16B is a block diagram illustrating both an example embodiment of an in-order architecture core and an example out-of-order issue/execution architecture core with register renaming to be included in a processor according to embodiments of the invention. The solid-lined boxes in Figures 16A-B illustrate the in-order pipeline and the in-order core, while the optional addition of the dashed-lined boxes illustrates the out-of-order issue/execution pipeline with register renaming and the out-of-order issue/execution core with register renaming. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect is described.

[0146] In Figure 16A, a processor pipeline 1600 includes a fetch stage 1602, a length decode stage 1604, a decode stage 1606, an allocation stage 1608, a renaming stage 1610, a scheduling stage (also referred to as a dispatch or issue stage) 1612, a register read/memory read stage 1614, an execute stage 1616, a write-back/memory-write stage 1618, an exception handling stage 1622 and a commit stage 1624.

[0147] Figure 16B shows a processor core 1690 that includes a front-end unit 1630 coupled to an execution engine unit 1650, both of which are coupled to a memory unit 1670. The core 1690 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1690 may be a special-purpose core, such as a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, a graphics core, or the like.

[0148] The front-end unit 1630 includes a branch prediction unit 1632 coupled to an instruction cache unit 1634, which is coupled to an instruction translation lookaside buffer (TLB) 1636, which is coupled to an instruction fetch unit 1638, which is coupled to a decode unit 1640. The decode unit 1640 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, micro-instructions, other instructions, or other control signals that are decoded from, otherwise reflect, or are derived from the original instructions. The decode unit 1640 can be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (microcode ROMs), etc. In one embodiment, the core 1690 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in the decode unit 1640 or otherwise within the front-end unit 1630). The decode unit 1640 is coupled to a rename/allocator unit 1652 in the execution engine unit 1650.

[0149] The execution engine unit 1650 includes the rename/allocator unit 1652, which is coupled to a retirement unit 1654 and to a set of one or more scheduler unit(s) 1656. The scheduler unit(s) 1656 represents any number of different schedulers, including reservation stations, a central instruction window, etc. The scheduler unit(s) 1656 is coupled to the physical register file(s) unit(s) 1658. Each of the physical register file(s) units 1658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1658 comprises a vector register unit, a writemask register unit, and a scalar register unit. These register units can provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file(s) unit(s) 1658 is overlapped by the retirement unit 1654 to illustrate the various ways in which register renaming and out-of-order execution can be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s) and retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1654 and the physical register file(s) unit(s) 1658 are coupled to the execution cluster(s) 1660. The execution cluster(s) 1660 includes a set of one or more execution units 1662 and a set of one or more memory access units 1664. The execution units 1662 can perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1656, the physical register file(s) unit(s) 1658 and the execution cluster(s) 1660 are shown as possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has the memory access unit(s) 1664). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

[0150] The set of memory access units 1664 is coupled to the memory unit 1670, which includes a data TLB unit 1672 coupled to a data cache unit 1674, which is coupled to a level 2 (L2) cache unit 1676. In an exemplary embodiment, the memory access units 1664 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1672 in the memory unit 1670. The instruction cache unit 1634 is also coupled to the level 2 (L2) cache unit 1676 in the memory unit 1670. The L2 cache unit 1676 is coupled to one or more other levels of cache and ultimately to a main memory.

[0151] By way of example, the example out-of-order issue/execution core architecture with register renaming may implement the pipeline 1600 as follows: 1) the instruction fetch unit 1638 performs the fetch and length decode stages 1602 and 1604; 2) the decode unit 1640 performs the decode stage 1606; 3) the rename/allocator unit 1652 performs the allocation stage 1608 and the renaming stage 1610; 4) the scheduler unit(s) 1656 performs the scheduling stage 1612; 5) the physical register file(s) unit(s) 1658 and the memory unit 1670 perform the register read/memory read stage 1614; the execution cluster 1660 performs the execute stage 1616; 6) the memory unit 1670 and the physical register file(s) unit(s) 1658 perform the write-back/memory-write stage 1618; 7) various units may be involved in the exception handling stage 1622; and 8) the retirement unit 1654 and the physical register file(s) unit(s) 1658 perform the commit stage 1624.
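
The stage-to-unit assignment listed in paragraph [0151] can be captured as a simple lookup table. The Python sketch below is only a restatement of that list as data: the stage and unit names are English labels keyed by the reference numerals in the text, `stages_for` is a hypothetical helper, and the exception handling stage is left without a fixed unit because the text says only that various units may be involved.

```python
REG_FILE = "physical register file(s) unit(s) 1658"
MEMORY = "memory unit 1670"

# (pipeline stage, units that perform it), in pipeline order.
PIPELINE_1600 = [
    ("fetch 1602", ["instruction fetch unit 1638"]),
    ("length decode 1604", ["instruction fetch unit 1638"]),
    ("decode 1606", ["decode unit 1640"]),
    ("allocation 1608", ["rename/allocator unit 1652"]),
    ("renaming 1610", ["rename/allocator unit 1652"]),
    ("scheduling 1612", ["scheduler unit(s) 1656"]),
    ("register read/memory read 1614", [REG_FILE, MEMORY]),
    ("execute 1616", ["execution cluster 1660"]),
    ("write back/memory write 1618", [MEMORY, REG_FILE]),
    ("exception handling 1622", []),  # various units may be involved
    ("commit 1624", ["retirement unit 1654", REG_FILE]),
]


def stages_for(unit: str) -> list:
    """List the pipeline stages in which the given unit takes part."""
    return [stage for stage, units in PIPELINE_1600 if unit in units]


print(stages_for(MEMORY))
print(stages_for(REG_FILE))
```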

[0152] The core 1690 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added in newer versions); the MIPS instruction set from MIPS Technologies of Sunnyvale, CA; the ARM instruction set (with optional additional extensions such as NEON) from ARM Holdings of Sunnyvale, CA), including the instruction(s) described herein. In one embodiment, the core 1690 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

[0153] It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads) and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel® Hyperthreading Technology).

[0154] While register renaming is described in the context of out-of-order execution, it should be understood that register renaming can also be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1634/1674 and a shared L2 cache unit 1676, alternative embodiments may have a single internal cache for both instructions and data, such as an internal level 1 (L1) cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

A specific example in-order core architecture

[0155] Figures 17A-B illustrate a block diagram of a more specific example in-order core architecture, where the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

[0156] Figure 17A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1702 and with its local subset of the level 2 (L2) cache 1704, according to embodiments of the invention. In one embodiment, an instruction decoder 1700 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1706 allows low-latency accesses to cache memory by the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 1708 and a vector unit 1710 use separate register sets (scalar registers 1712 and vector registers 1714, respectively) and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 1706, alternative embodiments of the invention may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

[0157] The local subset of the L2 cache 1704 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1704. Data read by a processor core is stored in its L2 cache subset 1704 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1704 and is flushed from other subsets as needed. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.

[0158] Figure 17B is an expanded view of part of the processor core in Figure 17A according to embodiments of the invention. Figure 17B includes an L1 data cache 1706A, which is part of the L1 cache 1706, as well as more detail regarding the vector unit 1710 and the vector registers 1714. Specifically, the vector unit 1710 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1728), which executes one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports swizzling the register inputs with the swizzle unit 1720, numeric conversion with the numeric convert units 1722A-B, and replication with the replication unit 1724 on the memory input. The write mask registers 1726 allow the resulting vector writes to be predicated.

Processor with an integrated memory controller and integrated graphics

[0159] Figure 18 is a block diagram of a processor 1800, which may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to embodiments of the invention. The solid-line boxes in Figure 18 illustrate a processor 1800 with a single core 1802A, a system agent 1810, and a set of one or more bus controller units 1816, while the optional addition of the dotted-line boxes illustrates an alternative processor 1800 with multiple cores 1802A-N, a set of one or more integrated memory controller unit(s) 1814 in the system agent unit 1810, and special purpose logic 1808.

[0160] Thus, different implementations of the processor 1800 may include: 1) a CPU in which the special purpose logic 1808 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1802A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor in which the cores 1802A-N are a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor in which the cores 1802A-N are a large number of general-purpose in-order cores. Thus, the processor 1800 may be a general-purpose processor, a coprocessor, or a special purpose processor such as a network or communications processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1800 may be part of one or more substrates and/or may be implemented on one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS.

[0161] The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1806, and external memory (not shown) coupled to the set of integrated memory controller units 1814. The set of shared cache units 1806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1812 interconnects the integrated graphics logic 1808, the set of shared cache units 1806, and the system agent unit 1810/integrated memory controller unit(s) 1814, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1806 and the cores 1802A-N.

[0162] In some embodiments, one or more of the cores 1802A-N are capable of multi-threading. The system agent 1810 includes those components that coordinate and operate the cores 1802A-N. The system agent unit 1810 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the logic and components needed to regulate the power state of the cores 1802A-N and the integrated graphics logic 1808. The display unit drives one or more externally connected displays.

[0163] The cores 1802A-N may be homogeneous or heterogeneous in terms of architectural instruction set; that is, two or more of the cores 1802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Exemplary computer architectures

[0164] Figures 19-21 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

[0165] Figure 19 shows a block diagram of a system 1900 in accordance with an embodiment of the present invention. The system 1900 may include one or more processors 1910, 1915, which are coupled to a controller hub 1920. In one embodiment, the controller hub 1920 includes a graphics memory controller hub (GMCH) 1990 and an input/output hub (IOH) 1950 (which may be on separate chips); the GMCH 1990 includes memory and graphics controllers to which a memory 1940 and a coprocessor 1945 are coupled; the IOH 1950 couples input/output (I/O) devices 1960 to the GMCH 1990. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1940 and the coprocessor 1945 are coupled directly to the processor 1910, and the controller hub 1920 is in a single chip with the IOH 1950.

[0166] The optional nature of the additional processors 1915 is indicated in Figure 19 with dashed lines. Each processor 1910, 1915 may include one or more of the processing cores described herein and may be some version of the processor 1800.

[0167] The memory 1940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1920 communicates with the processor(s) 1910, 1915 via a multi-drop bus, such as a front side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1995.

[0168] In one embodiment, the coprocessor 1945 is a special purpose processor, such as a high-throughput MIC processor, a network or communications processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1920 may include an integrated graphics accelerator.

[0169] There can be a variety of differences between the physical resources 1910, 1915 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power characteristics, and the like.

[0170] In one embodiment, the processor 1910 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within the instructions. The processor 1910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1945. Accordingly, the processor 1910 issues these coprocessor instructions (or control signals representing the coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1945. The coprocessor(s) 1945 accept and execute the received coprocessor instructions.

[0171] Figure 20 shows a block diagram of a first more specific exemplary system 2000 in accordance with an embodiment of the present invention. As shown in Figure 20, the multiprocessor system 2000 is a point-to-point interconnect system and includes a first processor 2070 and a second processor 2080 coupled via a point-to-point interconnect 2050. Each of the processors 2070 and 2080 may be some version of the processor 1800. In one embodiment of the invention, the processors 2070 and 2080 are the processors 1910 and 1915, respectively, while the coprocessor 2038 is the coprocessor 1945. In another embodiment, the processors 2070 and 2080 are the processor 1910 and the coprocessor 1945, respectively.

[0172] The processors 2070 and 2080 are shown including integrated memory controller (IMC) units 2072 and 2082, respectively. The processor 2070 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 2076 and 2078; similarly, the second processor 2080 includes P-P interfaces 2086 and 2088. The processors 2070, 2080 may exchange information via a point-to-point (P-P) interface 2050 using the P-P interface circuits 2078, 2088. As shown in Figure 20, the IMCs 2072 and 2082 couple the processors to respective memories, namely a memory 2032 and a memory 2034, which may be portions of main memory locally attached to the respective processors.

[0173] The processors 2070, 2080 may each exchange information with a chipset 2090 via individual P-P interfaces 2052, 2054 using the point-to-point interface circuits 2076, 2094, 2086, 2098. The chipset 2090 may optionally exchange information with the coprocessor 2038 via a high-performance interface 2039. In one embodiment, the coprocessor 2038 is a special purpose processor, such as a high-throughput MIC processor, a network or communications processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

[0174] A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via the P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low power mode.

[0175] The chipset 2090 may be coupled to a first bus 2016 via an interface 2096. In one embodiment, the first bus 2016 may be a peripheral component interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

[0176] As shown in Figure 20, various I/O devices 2014 may be coupled to the first bus 2016, along with a bus bridge 2018 that couples the first bus 2016 to a second bus 2020. In one embodiment, one or more additional processor(s) 2015, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 2016. In one embodiment, the second bus 2020 may be a low pin count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 2020, including, for example, a keyboard and/or mouse 2022, communication devices 2027, and a storage unit 2028, such as a disk drive or other mass storage device, which may include instructions/code and data 2030. Further, an audio I/O 2024 may be coupled to the second bus 2020. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 20, a system may implement a multi-drop bus or other such architecture.

[0177] Figure 21 shows a block diagram of a second more specific exemplary system 2100 in accordance with an embodiment of the present invention. Like elements in Figures 20 and 21 bear like reference numerals, and certain aspects of Figure 20 have been omitted from Figure 21 in order to avoid obscuring other aspects of Figure 21.

[0178] Figure 21 illustrates that the processors 2070, 2080 may include integrated memory and I/O control logic ("CL") 2072 and 2082, respectively. Thus, the CL 2072, 2082 include integrated memory controller units and I/O control logic. Figure 21 illustrates that not only are the memories 2032, 2034 coupled to the CL 2072, 2082, but also that I/O devices 2114 are coupled to the control logic 2072, 2082. Legacy I/O devices 2115 are coupled to the chipset 2090.

[0179] Figure 22 shows a block diagram of a SoC 2200 in accordance with an embodiment of the present invention. Similar elements in Figure 18 bear like reference numerals. Also, the dashed-line boxes are optional features on more advanced SoCs. In Figure 22, an interconnect unit(s) 2202 is coupled to: an application processor 2210, which includes a set of one or more cores 212A-N and shared cache unit(s) 1806; a system agent unit 1810; a bus controller unit(s) 1816; an integrated memory controller unit(s) 1814; a set of one or more coprocessors 2220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 2230; a direct memory access (DMA) unit 2232; and a display unit 2240 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 2220 include a special purpose processor, such as a network or communications processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.

[0180] Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and/or non-volatile memory and/or storage elements), at least one input device, and at least one output device.

[0181] Program code, such as the code 2030 illustrated in Figure 20, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor, such as a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

[0182] The program code can be implemented in a high-level procedural programming language or an object-oriented programming language to communicate with a processing system. The program code can also be implemented in an assembly or machine language if desired. Indeed, the mechanisms described here are not limited in scope to any particular programming language. In any case, the language can be a compiled or an interpreted language.

[0183] One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

[0184] Such machine-readable storage media may include, without limitation, non-transitory tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), rewritable compact disks (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memories (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

[0185] Accordingly, embodiments of the invention also include non-transitory tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, devices, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (including binary translation, code morphing, etc.)

[0186] In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on-processor, off-processor, or part on-processor and part off-processor.

[0187] Figure 23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 23 shows a program in a high-level language 2302 that may be compiled using an x86 compiler 2304 to generate x86 binary code 2306 that may be natively executed by a processor with at least one x86 instruction set core 2316. The processor with at least one x86 instruction set core 2316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 2304 represents a compiler operable to generate x86 binary code 2306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 2316.
Similarly, Figure 23 shows that the program in the high-level language 2302 may be compiled using an alternative instruction set compiler 2308 to generate alternative instruction set binary code 2310 that may be natively executed by a processor without at least one x86 instruction set core 2314 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction converter 2312 is used to convert the x86 binary code 2306 into code that may be natively executed by the processor without an x86 instruction set core 2314. This converted code is not likely to be the same as the alternative instruction set binary code 2310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 2312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 2306.

[0188] Components, features, and details described for any of Figures 1-2 and 5-11 may also optionally apply to any of Figures 3-4. Moreover, components, features, and details described for any of the devices may also optionally apply to any of the methods, which in embodiments may be performed by and/or with such a device. Any of the processors described herein may be included in any of the computer systems disclosed herein (e.g., Figures 19-23). In some embodiments, the computer system may include dynamic random access memory (DRAM). Alternatively, the computer system may include a type of volatile memory that does not need to be refreshed, or flash memory. The instructions disclosed herein may be executed on any of the processors shown herein, having any of the microarchitectures shown herein, in any of the systems shown herein. The instructions disclosed herein may have any of the features of the instruction formats shown herein (e.g., in Figures 12-14).

[0189] In the specification and claims, the terms "coupled" and/or "connected", along with their derivatives, may have been used. These terms are not intended as synonyms for each other. Rather, in embodiments, "connected" may be used to indicate that two or more elements are in direct physical and/or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical and/or electrical contact with each other. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other. For example, an execution unit may be coupled with a register and/or a decode unit through one or more intervening components. In the figures, arrows are used to show connections and couplings.

[0190] The terms "logic", "unit", "module" or "component" may have been used in the description and/or claims. Any of these terms can be used to refer to hardware, firmware, software, or various combinations thereof. In exemplary embodiments, any of these terms may refer to integrated circuitry, application specific integrated circuits, analog circuits, digital circuits, programmed logic devices, memory devices containing instructions, and the like, and various combinations thereof. In some embodiments, these may include at least some hardware (e.g., transistors, gates, other circuitry components, etc.).

[0191] The term "and/or" may have been used. As used herein, the term "and/or" means one or the other or both (e.g., A and/or B means A or B or both A and B).

[0192] In the above description, specific details are set forth in order to provide a thorough understanding of the embodiments. However, other embodiments may be practiced without some of these specific details. The scope of the invention should not be determined by the specific examples provided above, but only by the claims that follow. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form and/or without detail in order to avoid obscuring the understanding of the description. Where considered appropriate, reference numbers, or the suffixes of reference numbers, have been repeated between figures to indicate corresponding or analogous elements, which may optionally have similar or the same characteristics, unless otherwise specified or clearly obvious.

[0193] Certain operations may be performed by hardware components, or may be embodied in machine-readable or circuit-executable instructions that may be used to cause and/or result in a machine, circuit, or hardware component (e.g., a processor, a portion of a processor, a circuit, etc.) programmed with the instructions performing the operations. The operations may also optionally be performed by a combination of hardware and software. A processor, machine, circuit, or hardware may include specific or specialized circuitry or other logic (e.g., hardware, potentially combined with firmware and/or software) operable to execute and/or process the instruction and store a result in response to the instruction.

[0194] Some embodiments include an article of manufacture (e.g., a computer program product) that includes a machine-readable medium. The medium may include a mechanism that provides, for example stores, information in a form that is readable by the machine. The machine-readable medium may provide, or have stored thereon, an instruction or a sequence of instructions that, if and/or when executed by a machine, are operable to cause the machine to perform, and/or result in the machine performing, one or more operations, methods, or techniques disclosed herein.

[0195] In some embodiments, the machine-readable medium may include a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium may include, for example, a floppy disk, an optical storage medium, an optical disk, an optical data storage device, a CD-ROM, a magnetic disk, a magneto-optical disk, a read-only memory (ROM), a programmable ROM (PROM), an erasable and programmable ROM (EPROM), an electrically erasable and programmable ROM (EEPROM), a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a flash memory, a phase change memory, a phase change data storage material, a non-volatile memory, a non-volatile data storage device, a non-transitory memory, a non-transitory data storage device, or the like. The non-transitory machine-readable storage medium does not consist of a transitory propagated signal. In some embodiments, the storage medium may include a tangible medium that includes solid matter.

[0196] Examples of suitable machines include, but are not limited to, a general purpose processor, a special purpose processor, a digital logic circuit, an integrated circuit, or the like. Still other examples of suitable machines include a computer system or electronic device that includes a processor, digital logic circuitry, or an integrated circuit. Examples of such computer systems or electronic devices include desktop computers, laptop computers, notebook computers, tablet computers, netbooks, smartphones, cellular phones, servers, network devices (e.g., routers and switches), mobile internet devices (MIDs), media players, smart TVs, nettops, set-top boxes, and video game controllers, but are not limited to these.

[0197] Reference throughout this specification to "one embodiment", "an embodiment", "one or more embodiments", or "some embodiments", for example, indicates that a particular feature may be included in the practice of the invention but is not necessarily required to be. Similarly, in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the invention requires more features than are expressly set out in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Accordingly, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

EXEMPLARY EMBODIMENTS

[0198] The following examples relate to further embodiments. The specifics in the examples can be used throughout one or more embodiments.

[0199] Example 1 is a processor that includes a decode unit to decode a data element compare instruction. The data element compare instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations. The processor also includes an execution unit coupled to the decode unit. The execution unit stores at least one result mask operand in the one or more destination storage locations in response to the data element compare instruction. The at least one result mask operand includes a different mask element for each corresponding data element in one of the first and second packed source data operands at the same relative position. Each mask element indicates whether the corresponding data element in the one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands.
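The behavior described in Example 1 can be illustrated with a minimal software sketch. This is only a model of the described semantics, not an implementation of any real instruction; the function name `compare_any_equal` and the choice to represent the result mask as a Python integer with one bit per data element are assumptions made for illustration.

```python
def compare_any_equal(src1, src2):
    """Model of a single result mask (hypothetical helper).

    Bit i of the returned integer is 1 if src1[i] equals ANY element
    of src2, regardless of that element's position in src2.
    """
    if len(src1) < 4 or len(src2) < 4:
        raise ValueError("each packed operand needs at least four data elements")
    members = set(src2)          # every value present in the other operand
    mask = 0
    for i, value in enumerate(src1):
        if value in members:     # match at any position counts
            mask |= 1 << i       # mask element i corresponds to data element i
    return mask


# Two index vectors whose matching values are not vertically aligned.
indices_a = [1, 4, 7, 9]
indices_b = [2, 4, 9, 12]
print(format(compare_any_equal(indices_a, indices_b), "04b"))  # 1010
```

In the printed mask the least significant bit corresponds to data element 0, so bits 1 and 3 are set because the values 4 and 9 also occur somewhere in the second operand.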

[0200] Example 2 includes the processor of Example 1, wherein the execution unit stores two result mask operands in the one or more destination storage locations in response to the instruction. The two result mask operands include a first result mask operand that includes a different mask element for each corresponding data element in the first packed source data operand at the same relative position. Each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand. A second result mask operand includes a different mask element for each corresponding data element in the second packed source data operand at the same relative position. Each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand.
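As a hedged illustration of the two-mask variant in Example 2, the following sketch computes one mask per source operand. The name `compare_both_directions` and the integer bitmask representation are hypothetical and chosen only for this illustration.

```python
def compare_both_directions(src1, src2):
    """Model of two result masks (hypothetical helper).

    mask1 bit i is 1 if src1[i] equals any element of src2.
    mask2 bit i is 1 if src2[i] equals any element of src1.
    """
    set1, set2 = set(src1), set(src2)
    mask1 = sum(1 << i for i, v in enumerate(src1) if v in set2)
    mask2 = sum(1 << i for i, v in enumerate(src2) if v in set1)
    return mask1, mask2


# The shared values 4 and 9 sit at different positions in each operand,
# so the two masks differ.
m1, m2 = compare_both_directions([1, 4, 7, 9], [4, 2, 12, 9])
print(format(m1, "04b"), format(m2, "04b"))  # 1010 1001
```

Each mask marks the matching elements of its own operand at their own positions, which is what makes the pair usable for locating matches that are not vertically aligned.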

[0201] Example 3 includes the processor of Example 2, wherein the one or more destination storage locations comprise a first mask register and a second mask register, and wherein the execution unit stores the first result mask operand in the first mask register and stores the second result mask operand in the second mask register in response to the instruction.

[0202] Example 4 includes the processor of Example 2, wherein the one or more destination storage locations comprise a single mask register, and wherein the execution unit stores the first result mask operand and the second result mask operand in the single mask register in response to the instruction.

[0203] Example 5 includes the processor of Example 4, wherein the execution unit, in response to the instruction, stores the first result mask operand in a least significant portion of the single mask register and stores the second result mask operand in a portion of the single mask register that is more significant than the least significant portion.
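One possible layout for the single mask register of Example 5 is sketched below. The text only requires that the second mask occupy a more significant portion than the first; placing it immediately above the first mask is an assumption, as are the helper names `pack_masks` and `unpack_masks`.

```python
def pack_masks(mask1, mask2, num_elements):
    """Place mask1 in the low num_elements bits and mask2 directly above it.

    The adjacent placement is an assumed layout for illustration only.
    """
    field = (1 << num_elements) - 1
    return (mask1 & field) | ((mask2 & field) << num_elements)


def unpack_masks(packed, num_elements):
    """Recover (mask1, mask2) from the assumed packed layout."""
    field = (1 << num_elements) - 1
    return packed & field, (packed >> num_elements) & field


packed = pack_masks(0b1010, 0b1001, 4)
print(hex(packed))  # 0x9a
```

With four data elements per operand, the low four bits hold the first result mask and the next four bits hold the second, so both fit in one eight-bit field.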

[0204] Example 6 includes the processor of Example 1, wherein the execution unit stores both a first result mask operand and a second result mask operand in a packed data register in response to the instruction, and wherein each data element in the packed data register includes both a mask element of the first result mask operand and a mask element of the second result mask operand.
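The packed data register layout of Example 6 might be modeled as follows. Which bit of each data element holds which mask element is not stated in this passage, so the sketch assumes bit 0 carries the first mask's element and bit 1 carries the second's; the name `interleave_masks` is hypothetical.

```python
def interleave_masks(mask1, mask2, num_elements):
    """Build a packed result in which element i holds both mask elements.

    Assumed encoding per element: bit 0 = mask1 bit i, bit 1 = mask2 bit i.
    """
    result = []
    for i in range(num_elements):
        bit1 = (mask1 >> i) & 1
        bit2 = (mask2 >> i) & 1
        result.append(bit1 | (bit2 << 1))
    return result


print(interleave_masks(0b1010, 0b1001, 4))  # [2, 1, 0, 3]
```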

[0205] Example 7 includes the processor of Example 1, wherein the execution unit stores a single result mask operand in a single mask register in response to the instruction.

[0206] Example 8 includes the processor of Example 1, wherein the execution unit stores the at least one result mask operand in at least one mask register in response to the instruction, and wherein an instruction set of the processor includes masked packed data instructions operable to specify the at least one mask register as a storage location for a source mask operand that is to be used to mask a packed data operation.
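To illustrate how a stored result mask could later serve as a source mask operand, the sketch below models a masked packed addition. Merging behavior (masked-off elements keep the destination's prior value) is an assumption, and `masked_add` is a hypothetical name rather than an instruction from the text.

```python
def masked_add(dst, a, b, mask):
    """Model of a masked packed add with merging (assumed behavior).

    Element i of the result is a[i] + b[i] when mask bit i is 1;
    otherwise the prior destination value dst[i] is kept.
    """
    return [
        x + y if (mask >> i) & 1 else d
        for i, (d, x, y) in enumerate(zip(dst, a, b))
    ]


# Only positions 1 and 3 are enabled by the mask.
print(masked_add([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], 0b1010))
# [0, 22, 0, 44]
```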

[0207] Example 9 includes the processor of any of Examples 1 through 8, wherein the execution unit, in response to the instruction, stores a number of result mask bits in the at least one result mask operand no greater than a number of the data elements in the first and second packed source data operands.

[0208] Example 10 includes the processor of any of Examples 1 through 8, wherein the execution unit, in response to the instruction, stores the at least one result mask operand in which each mask element includes a single mask bit.

[0209] Example 11 includes the processor of any of Examples 1 through 8, wherein the decode unit decodes the instruction specifying the first packed source data operand containing at least eight data elements and the second packed source data operand containing at least eight data elements.

[0210] Example 12 includes the processor of any of Examples 1 through 8, wherein the decode unit decodes the instruction specifying the first packed source data operand containing at least 512 bits and the second packed source data operand containing at least 512 bits.

[0211] Example 13 is a method in a processor that includes receiving a data element comparison instruction. The data element comparison instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations. The method also includes storing at least one result mask operand in the one or more destination storage locations in response to the data element comparison instruction. The at least one result mask operand includes a different mask element for each corresponding data element in one of the first and second packed source data operands at the same relative position. Each mask element indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands.

[0212] Example 14 includes the method of Example 13, wherein the storing includes storing a first result mask operand in the one or more destination storage locations. The first result mask operand contains a different mask element for each corresponding data element in the first packed source data operand at the same relative position. Each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand. The storing also includes storing a second result mask operand in the one or more destination storage locations. The second result mask operand contains a different mask element for each corresponding data element in the second packed source data operand at the same relative position. Each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand.
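The semantics of the two result mask operands described in Examples 13 and 14 can be modeled in software as follows. This is a sketch under stated assumptions, not the processor implementation: the function name `compare_elements` is hypothetical, each mask element is modeled as a single bit, and bit i of each mask corresponds to element position i of the respective source operand.

```python
def compare_elements(src1, src2):
    """Return (mask1, mask2) as integers.

    Bit i of mask1 is set if src1[i] equals any element of src2.
    Bit i of mask2 is set if src2[i] equals any element of src1.
    """
    if len(src1) != len(src2):
        raise ValueError("operands must have the same number of elements")
    set1, set2 = set(src1), set(src2)
    mask1 = 0
    mask2 = 0
    for i, (a, b) in enumerate(zip(src1, src2)):
        if a in set2:
            mask1 |= 1 << i
        if b in set1:
            mask2 |= 1 << i
    return mask1, mask2


mask1, mask2 = compare_elements([3, 7, 9, 12], [1, 3, 12, 20])
print(f"{mask1:04b} {mask2:04b}")  # 1001 0110
```

In this example the values 3 and 12 occur in both operands, at positions 0 and 3 of the first operand and positions 1 and 2 of the second, which is why the two masks differ even though they mark the same matching values.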

[0213] Example 15 includes the method of Example 14, wherein storing the first result mask operand includes storing the first result mask operand in a first mask register, and wherein storing the second result mask operand includes storing the second result mask operand in a second mask register.

[0214] Example 16 includes the method of Example 14, wherein storing the first result mask operand and storing the second result mask operand includes storing both the first and second result mask operands in a single mask register.

[0215] Example 17 includes the method of Example 13, wherein storing the at least one result mask operand in the one or more destination storage locations includes storing both a first result mask operand and a second result mask operand in a packed result data operand.

[0216] Example 18 includes the method of Example 13, further including receiving a masked packed data instruction that specifies the at least one result mask operand as a predicate operand.

[0217] Example 19 is a system for processing instructions that includes an interconnect and a processor coupled to the interconnect. The processor receives a data element comparison instruction. The instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations. The processor stores at least one result mask operand in the one or more destination storage locations in response to the instruction. The at least one result mask operand includes a different mask bit for each corresponding data element in one of the first and second packed source data operands at the same relative position. Each mask bit indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. The system also includes a dynamic random access memory (DRAM) coupled to the interconnect. The DRAM optionally stores a sparse vector arithmetic algorithm. The sparse vector arithmetic algorithm optionally includes a masked data element merge instruction that specifies the at least one result mask operand as a source operand to mask a data element merge operation.
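One way a sparse vector arithmetic algorithm might consume such result masks is sketched below as a sparse dot product. This is an illustrative assumption, not the algorithm of the disclosure: the helper names are hypothetical, the masked operation is modeled as a simple compaction of selected elements, and each sparse vector is assumed to hold unique indices in ascending order so that compacted elements line up pairwise.

```python
def match_mask(elems, others):
    """Mask element i is True if elems[i] equals any element of others."""
    others_set = set(others)
    return [e in others_set for e in elems]


def masked_compact(values, mask):
    """Keep only the values whose mask element is set, preserving order."""
    return [v for v, m in zip(values, mask) if m]


def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (indices, values) pairs.

    Assumes each index list is sorted ascending and free of duplicates.
    """
    mask_a = match_mask(idx_a, idx_b)   # first result mask
    mask_b = match_mask(idx_b, idx_a)   # second result mask
    a = masked_compact(val_a, mask_a)   # matching values are now aligned
    b = masked_compact(val_b, mask_b)
    return sum(x * y for x, y in zip(a, b))


result = sparse_dot([3, 7, 9, 12], [1.0, 2.0, 3.0, 4.0],
                    [1, 3, 12, 20], [10.0, 20.0, 30.0, 40.0])
print(result)  # 140.0
```

Only indices 3 and 12 occur in both vectors, so the product reduces to 1.0 * 20.0 + 4.0 * 30.0; elements whose indices have no counterpart never enter the multiply.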

[0218] Example 20 includes the system of Example 19, wherein, in response to the instruction, the execution unit stores two result mask operands, each corresponding to a different one of the packed source data operands, the two result mask operands to be stored in at least one mask register.

[0219] Example 21 is an article of manufacture that includes a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium stores a data element comparison instruction. The instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations. The instruction, if executed by a machine, causes the machine to perform operations including storing a first result mask operand in the one or more destination storage locations. The first result mask operand contains a different mask bit for each corresponding data element in the first packed source data operand at the same relative position. Each mask bit indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand.

[0220] Example 22 includes the article of manufacture of Example 21, wherein the instruction, if executed by a machine, causes the machine to perform operations including storing a second result mask operand in the one or more destination storage locations. Optionally, the one or more destination storage locations include at least one mask register. Also optionally, the first and second result mask operands together have no more mask bits than a number of the data elements in the first and second packed source data operands.

[0221] Example 23 includes the processor of any of Examples 1 through 8, further including an optional branch prediction unit to predict branches and an optional instruction prefetch unit coupled to the branch prediction unit, the instruction prefetch unit to prefetch instructions including the data element comparison instruction. The processor may also optionally include an optional level 1 (L1) instruction cache coupled to the instruction prefetch unit, the L1 instruction cache to store instructions, an optional L1 data cache to store data, and an optional level 2 (L2) cache to store data and instructions. The processor may also optionally include an instruction fetch unit coupled to the decode unit, the L1 instruction cache, and the L2 cache, to fetch the data element comparison instruction, in some cases from one of the L1 instruction cache and the L2 cache, and to provide the data element comparison instruction to the decode unit. The processor may also optionally include a register renaming unit to rename registers, an optional scheduler to schedule one or more operations decoded from the data element comparison instruction for execution, and an optional store unit to store execution results of the data element comparison instruction.

[0222] Example 24 includes a system-on-chip that includes at least one interconnect, the processor of any of Examples 1 through 8 coupled to the at least one interconnect, an optional graphics processing unit (GPU) coupled to the at least one interconnect, an optional digital signal processor (DSP) coupled to the at least one interconnect, an optional display controller coupled to the at least one interconnect, an optional memory controller coupled to the at least one interconnect, an optional wireless modem coupled to the at least one interconnect, an optional image signal processor coupled to the at least one interconnect, an optional Universal Serial Bus (USB) 3.0 compatible controller coupled to the at least one interconnect, an optional Bluetooth 4.1 compatible controller coupled to the at least one interconnect, and an optional wireless transceiver controller coupled to the at least one interconnect.

[0223] Example 25 is a processor or other device to perform the method of any of Examples 13-18 or to be operable to perform the method of any of Examples 13-18.

[0224] Example 26 is a processor or other device that includes means for performing the method of any of Examples 13-18.

[0225] Example 27 is an article of manufacture that optionally includes a non-transitory machine-readable medium that optionally stores or otherwise provides an instruction that, if and/or when executed by a processor, computer system, electronic device, or other machine, is operable to cause the machine to perform the method of any one of Examples 13-18.

[0226] Example 28 is a processor or other device substantially as described herein.

[0227] Example 29 is a processor or other device operable to perform any method substantially as described herein.

[0228] Example 30 is a processor or other device to execute any data element comparison instruction substantially as described herein (e.g., having components to execute, or being operable to execute, any data element comparison instruction substantially as described herein).

[0229] Example 31 is a computer system or other electronic device that includes a processor having a decode unit to decode instructions of a first instruction set. The processor also includes one or more execution units. The electronic device also includes a storage device coupled to the processor. The storage device stores a first instruction, which may be any of the data element comparison instructions substantially as disclosed herein, and which is of a second instruction set. The storage device also stores instructions to translate the first instruction into one or more instructions of the first instruction set. The one or more instructions of the first instruction set, when executed by the processor, cause the processor to store any of the results of the first instruction disclosed herein.

Claims

[1] Processor comprising the following: a decoding unit for decoding a data element comparison instruction, wherein the data element comparison instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations; and an execution unit coupled to the decoding unit, wherein the execution unit, in response to the data element comparison instruction, stores at least one result mask operand at the one or more destination storage locations, wherein the at least one result mask operand contains, for each corresponding data element in one of the first and second packed source data operands, a different mask element at the same relative position, wherein each mask element indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands.
[2] Processor according to claim 1, wherein the execution unit, in response to the instruction, stores two result mask operands at the one or more destination storage locations, wherein the two result mask operands contain the following: a first result mask operand which contains a different mask element for each corresponding data element in the first packed source data operand at the same relative position, wherein each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand; and a second result mask operand which contains a different mask element for each corresponding data element in the second packed source data operand at the same relative position, wherein each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand.
[3] Processor according to claim 2, wherein the one or more destination storage locations comprise a first mask register and a second mask register and wherein the execution unit, in response to the instruction, stores the first result mask operand in the first mask register and stores the second result mask operand in the second mask register.
[4] Processor according to claim 2, wherein the one or more destination storage locations comprise a single mask register and wherein the execution unit, in response to the instruction, stores the first result mask operand and the second result mask operand in the single mask register.
[5] Processor according to claim 4, wherein the execution unit, in response to the instruction, stores the first result mask operand in a least significant portion of the single mask register and stores the second result mask operand in a portion of the single mask register that is more significant than the least significant portion.
[6] Processor according to claim 1, wherein the execution unit, in response to the instruction, stores both a first result mask operand and a second result mask operand in a packed data register, and wherein each data element in the packed data register comprises both a mask element of the first result mask operand and a mask element of the second result mask operand.
[7] Processor according to claim 1, wherein the execution unit stores a single result mask operand in a single mask register in response to the instruction.
[8] Processor according to claim 1, wherein the execution unit, in response to the instruction, stores the at least one result mask operand in at least one mask register, and wherein an instruction set of the processor contains masked packed data instructions that are operable to specify the at least one mask register as a storage location for a source mask operand to be used to mask an operation on packed data.
[9] Processor according to any one of claims 1 to 8, wherein the execution unit, in response to the instruction, stores a number of result mask bits in the at least one result mask operand which is not greater than a number of data elements in the first and second packed source data operands.
[10] Processor according to any one of claims 1 to 8, wherein the execution unit, in response to the instruction, stores the at least one result mask operand in which each mask element comprises a single mask bit.
[11] Processor according to any one of claims 1 to 8, wherein the decoding unit decodes the instruction specifying the first packed source data operand containing at least eight data elements and the second packed source data operand containing at least eight data elements.
[12] Processor according to any one of claims 1 to 8, wherein the decoding unit decodes the instruction specifying the first packed source data operand containing at least 512 bits and the second packed source data operand containing at least 512 bits.
[13] Method in a processor comprising the following: receiving a data element comparison instruction, wherein the data element comparison instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations; and storing at least one result mask operand at the one or more destination storage locations in response to the data element comparison instruction, wherein the at least one result mask operand contains, for each corresponding data element in one of the first and second packed source data operands, a different mask element at the same relative position, wherein each mask element indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands.
[14] Method according to claim 13, wherein the storing comprises: storing a first result mask operand at the one or more destination storage locations, wherein the first result mask operand contains, for each corresponding data element in the first packed source data operand, a different mask element at the same relative position, wherein each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand; and storing a second result mask operand at the one or more destination storage locations, wherein the second result mask operand contains a different mask element at the same relative position for each corresponding data element in the second packed source data operand, and wherein each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand.
[15] Method according to claim 14, wherein storing the first result mask operand comprises storing the first result mask operand in a first mask register and wherein storing the second result mask operand comprises storing the second result mask operand in a second mask register.
[16] Method according to claim 14, wherein storing the first result mask operand and storing the second result mask operand comprises storing both the first and the second result mask operand in a single mask register.
[17] Method according to claim 13, wherein storing the at least one result mask operand at the one or more destination storage locations comprises storing both a first result mask operand and a second result mask operand in a packed result data operand.
[18] Method according to claim 13, further comprising receiving a masked packed data instruction which specifies the at least one result mask operand as a predicate operand.
[19] System for processing instructions, which includes the following: an interconnect; a processor coupled to the interconnect, wherein the processor receives a data element comparison instruction, the instruction specifying a first packed source data operand containing at least four data elements, specifying a second packed source data operand containing at least four data elements, and specifying one or more destination storage locations, wherein in response to the instruction the processor stores at least one result mask operand at the one or more destination storage locations, wherein the at least one result mask operand contains, for each corresponding data element in one of the first and second packed source data operands, a different mask bit at the same relative position, wherein each mask bit indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands; and a dynamic random access memory (DRAM) coupled to the interconnect, wherein the DRAM stores a sparse vector arithmetic algorithm, wherein the sparse vector arithmetic algorithm contains a masked data element merge instruction that specifies the at least one result mask operand as a source operand to mask a data element merge operation.
[20] System according to claim 19, wherein the execution unit stores two result mask operands in response to the instruction, each corresponding to a different one of the packed source data operands, wherein the two result mask operands are to be stored in at least one mask register.
[21] Article of manufacture comprising a non-transitory machine-readable storage medium, wherein the non-transitory machine-readable storage medium stores a data element comparison instruction, wherein the instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations, wherein, if executed by a machine, the instruction causes the machine to perform operations which include the following: storing a first result mask operand at the one or more destination storage locations, wherein the first result mask operand contains a different mask bit for each corresponding data element in the first packed source data operand at the same relative position, each mask bit indicating whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand.
[22] Article of manufacture according to claim 21, wherein the instruction, if executed by a machine, causes the machine to perform operations which include storing a second result mask operand at the one or more destination storage locations, wherein the one or more destination storage locations comprise at least one mask register, and wherein the first and the second result mask operand together do not have more mask bits than a number of data elements in the first and second packed source data operands.
[23] Device comprising means for carrying out the method according to any one of claims 13 to 18.
[24] Article of manufacture comprising a machine-readable medium which stores an instruction which, when executed by a machine, is operable to cause the machine to carry out the method according to any one of claims 13 to 18.
[25] Electronic device comprising an interconnect, the processor according to any one of claims 1 to 8 coupled to the interconnect, and a dynamic random access memory (DRAM) coupled to the interconnect.